How can I minimize data transfers when working with Spark?

0 votes
Sep 19, 2018 in Apache Spark by shams
• 3,670 points
3,120 views

1 answer to this question.

0 votes

Minimizing data transfers and avoiding shuffling help you write Spark programs that run fast and reliably. The main ways to minimize data transfers when working with Apache Spark are:

  1. Use broadcast variables – broadcasting a small dataset to every executor makes joins between a small and a large RDD efficient, because the large RDD no longer has to be shuffled across the network.
  2. Use accumulators – accumulators aggregate values from the executors back to the driver while a job runs, so you can gather side statistics without an extra pass over the data.
  3. Avoid shuffle-triggering operations – steer clear of ByKey operations such as groupByKey, of repartition, and of any other wide transformation that triggers a shuffle; where a shuffle is unavoidable, prefer an operation like reduceByKey that combines values on each partition before anything crosses the network. A sketch of all three points follows this list.
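
Here is a minimal sketch in Scala illustrating the three points above. The SparkSession, the countryNames lookup map, and the events RDD are made-up names for illustration; adapt them to your own data:

import org.apache.spark.sql.SparkSession

// Hypothetical session; in spark-shell, `spark` and `sc` already exist.
val spark = SparkSession.builder.appName("MinimizeShuffles").getOrCreate()
val sc = spark.sparkContext

// Point 1: broadcast a small lookup table once to every executor.
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

// Point 2: an accumulator that counts records we could not resolve.
val unresolved = sc.longAccumulator("unresolved-codes")

// A large RDD of (countryCode, amount) pairs (tiny here for the example).
val events = sc.parallelize(Seq(("IN", 10.0), ("US", 25.0), ("FR", 5.0)))

// Map-side join: no shuffle, because the lookup lives on each executor.
val named = events.map { case (code, amount) =>
  countryNames.value.get(code) match {
    case Some(name) => (name, amount)
    case None       => unresolved.add(1); ("unknown", amount)
  }
}

// Point 3: reduceByKey combines values on each partition first, so far
// less data crosses the network than with groupByKey.
val totals = named.reduceByKey(_ + _)

totals.collect().foreach(println)
println(s"Unresolved codes: ${unresolved.value}")

One caveat: accumulator values are only guaranteed after an action runs, and updates made inside a transformation can be re-applied if a task is retried, so treat them as a monitoring aid rather than exact business logic.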

I hope this will help!!

answered Sep 19, 2018 by zombie
• 3,790 points

Related Questions In Apache Spark

0 votes
1 answer

How can we optimize and minimize memory when working with a Scala use case?

Hi, There is a term in Scala that is ...READ MORE

answered Jul 5, 2019 in Apache Spark by Gitika
• 65,770 points
866 views
0 votes
2 answers

In a Spark DataFrame, how can I flatten the struct?

// Collect data from input avro file ...READ MORE

answered Jul 4, 2019 in Apache Spark by Dhara dhruve
6,114 views
+1 vote
1 answer

How can I write a text file to HDFS, not from an RDD, in a Spark program?

Yes, you can go ahead and write ...READ MORE

answered May 29, 2018 in Apache Spark by Shubham
• 13,490 points
8,461 views
0 votes
1 answer

Spark: How can I create temp views in a user-defined database instead of the default database?

You can try the below code: df.registerTempTable("airports") sqlContext.sql(" create ...READ MORE

answered Jul 14, 2019 in Apache Spark by Ishan
4,491 views
0 votes
1 answer

Can I read a CSV represented as a string into Apache Spark?

You can use the following command. This ...READ MORE

answered May 3, 2018 in Apache Spark by kurt_cobain
• 9,350 points
2,513 views
0 votes
1 answer

How can I compare the elements of the RDD using MapReduce?

You have to use the comparison operator ...READ MORE

answered May 24, 2018 in Apache Spark by Shubham
• 13,490 points
3,444 views
0 votes
1 answer

When running Spark on YARN, do I need to install Spark on all nodes of the YARN cluster?

No, it is not necessary to install ...READ MORE

answered Jun 14, 2018 in Apache Spark by nitinrawat895
• 11,380 points
6,261 views
+1 vote
2 answers

How can I convert Spark Dataframe to Spark RDD?

Assuming your RDD[row] is called rdd, you ...READ MORE

answered Jul 9, 2018 in Apache Spark by zombie
• 3,790 points
20,648 views
0 votes
1 answer

How is an RDD in Spark different from distributed storage management? Can anyone help me with this?

Some of the key differences between an RDD and ...READ MORE

answered Jul 26, 2018 in Apache Spark by zombie
• 3,790 points
1,523 views