When people think about Apache Spark, the first things that come to mind are:
- It is a lightning-fast in-memory computing engine.
- It is up to 100 times faster than MapReduce.
But sometimes we find that a Spark application does not perform at the expected level. What could be the reason? More often than not, the answer is simple:
"You might not have tuned your Spark application properly."
Based on my experience tuning Spark applications for a live project, I would like to share some findings that helped me improve the performance of my Spark SQL jobs. The following configurations should be taken into consideration while tuning a Spark SQL job:
Auto broadcast join: Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. This can be set on the SQLContext as:
Configuration:
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "10485760")
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "10485760")
The default value is "10485760" bytes (10 MB). The value is specified in bytes and can be adjusted as needed; it is the maximum size of a table that will be broadcast, and every table smaller than this threshold is broadcast to the worker nodes.
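As a minimal sketch of the idea (assuming the Spark 1.x SQLContext API used in this post; the paths /data/orders and /data/countries, the table names, and the columns are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("BroadcastJoinExample")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// 10 MB threshold (the default); any table below this size is broadcast.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "10485760")

// Hypothetical parquet files: a large fact table and a small dimension table.
sqlContext.read.parquet("/data/orders").registerTempTable("orders")
sqlContext.read.parquet("/data/countries").registerTempTable("countries")

val joined = sqlContext.sql(
  "SELECT o.order_id, c.country_name " +
  "FROM orders o JOIN countries c ON o.country_code = c.country_code")

// The physical plan should show a BroadcastHashJoin for the small side.
joined.explain()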
In my observation, with this parameter set, all the tables created from parquet files are broadcast, but in-memory (intermediate) tables are not. To overcome this, there are two options:
1. Save the table into the Hive metastore using df.write().mode(mode).saveAsTable(tableName). Once the DataFrame is saved as a table, it will automatically be broadcast when it is accessed in a join.
2. Save the DataFrame as a parquet file, read that parquet back, and register it as a temporary table. When a join is performed on that table, it will automatically be broadcast.
The second option is preferable because it does not require a connection to the Hive metastore; a sketch of it follows below.
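Here is a minimal sketch of the second option, continuing the hypothetical example above (the path /tmp/small_lookup, the table names, and the columns are made up for illustration):

import org.apache.spark.sql.SaveMode

// A small intermediate result produced mid-pipeline.
val smallDf = sqlContext.sql(
  "SELECT country_code, COUNT(*) AS cnt FROM orders GROUP BY country_code")

// Round-trip through parquet so the table is file-backed and qualifies for
// the automatic broadcast, then register it as a temporary table.
smallDf.write.mode(SaveMode.Overwrite).parquet("/tmp/small_lookup")
sqlContext.read.parquet("/tmp/small_lookup").registerTempTable("small_lookup")

// This join should now use the broadcast strategy, without a Hive metastore.
sqlContext.sql(
  "SELECT o.*, s.cnt FROM orders o JOIN small_lookup s " +
  "ON o.country_code = s.country_code").explain()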
Note: If auto broadcast join is enabled, avoid caching those tables, as caching can change a BroadcastHashJoin into a SortMergeJoin.
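To check the effect described in this note, continuing the hypothetical example above, the small table can be cached and the physical plans compared with explain():

// Cache the small table and re-check the plan; if the broadcast no longer
// applies, explain() will show a SortMergeJoin instead of a BroadcastHashJoin.
sqlContext.cacheTable("small_lookup")

sqlContext.sql(
  "SELECT o.*, s.cnt FROM orders o JOIN small_lookup s " +
  "ON o.country_code = s.country_code").explain()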
Shuffle partitions: Configures the number of partitions to use when shuffling data for joins or aggregations.
Configuration:
sqlContext.setConf("spark.sql.shuffle.partitions", "200")
sqlContext.setConf("spark.sql.shuffle.partitions", "200")
The default value is 200. The number of partitions should vary with the size of the dataset: if the dataset is small, keep the number of partitions low; if the dataset is huge, increase it accordingly. This can be managed by passing arguments to the application.
Shuffle partitions should be configured per requirement. A single value should not be used for the whole application; the setting should be managed explicitly at each stage. For small data the value should be small, and for large data it should be larger, although an excessively large value is not useful either.
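As a rough sketch of managing the value per stage (assuming the sqlContext from above, a hypothetical args array from the application's main method, and made-up table and column names; the numbers are only illustrative):

import org.apache.spark.sql.SaveMode

// Take the partition count from the application arguments, falling back to
// the default of 200 when nothing is passed.
val shufflePartitions = if (args.nonEmpty) args(0) else "200"
sqlContext.setConf("spark.sql.shuffle.partitions", shufflePartitions)

// A shuffle-heavy aggregation over the large dataset uses that value.
val dailyTotals = sqlContext.sql(
  "SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date")
dailyTotals.write.mode(SaveMode.Overwrite).parquet("/tmp/daily_totals")

// Before a later, much smaller shuffle, lower the value instead of keeping
// one setting for the whole application.
sqlContext.setConf("spark.sql.shuffle.partitions", "50")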
Data Serialization: Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application.
• By default, Spark serializes objects using Java’s ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. Java serialization is flexible but often quite slow.
• Kryo is significantly faster and more compact than Java serialization (often as much as 10x).
Configuration:
sqlContext.setConf("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
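Kryo can also be enabled on the SparkConf before the context is created, which additionally allows registering application classes for more compact serialization. A minimal sketch, in which the Order case class is a hypothetical example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical application class registered with Kryo.
case class Order(orderId: Long, countryCode: String, amount: Double)

val conf = new SparkConf()
  .setAppName("KryoSerializationExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Order]))

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)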
These are a few of the configurations that can be used to tune the performance of an Apache Spark application. In my next post, I will share another set of configurations.
Thank you for reading my post. I hope it helps you tune your Spark application.
Stay connected for future posts.