groupBy() vs groupByKey()

December 21, 2016

In this post, I would like to cover two grouping operations on RDDs. Which operation should I use: groupBy() or groupByKey()? This is a common question for newcomers who come across a scenario where data has to be grouped by some key.

These operations are best suited to data that has a key-value structure. You should avoid them when your intention goes beyond just grouping: for grouping plus aggregation, consider the much more efficient reduceByKey() or combineByKey() instead. Before we get into the differences, let's first have a quick overview of each operation.
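To make the reduceByKey() suggestion concrete before the overview, here is a minimal sketch. It assumes a spark-shell session where `sc` is already defined, and the `sales` data is made up for illustration. Both approaches produce the same totals, but reduceByKey() can combine values within each partition before the shuffle.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) is predefined.
// `sales` is illustrative sample data, not taken from the post.
val sales = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)))

// Group first, then aggregate: every value is shuffled before it is summed.
val viaGroup = sales.groupByKey().mapValues(_.sum)

// Aggregate directly: values are combined per partition before the shuffle.
val viaReduce = sales.reduceByKey(_ + _)

val groupTotals = viaGroup.collect().toMap
val reduceTotals = viaReduce.collect().toMap
```

When the grouped values are only needed to compute a sum, count, or similar aggregate, the second form is usually the better fit.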

GroupBy()

groupBy() groups together data that has the same key. It is a transformation on an RDD, which means it is lazily evaluated. It is a wide operation that results in data shuffling, hence it is a costly one. It can be used on both pair and unpaired RDDs, but it is mostly used on unpaired ones. It lets the programmer explicitly specify the key to group by.

val baseRdd = sc.parallelize(Seq((1, 2), (1, 3), (2, 1), (2, 2), (2, 3)))
baseRdd.groupBy(_._1)
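Since groupBy() is mostly used on unpaired RDDs, a small sketch of that case may help. It assumes spark-shell with `sc` predefined, and the `words` data is hypothetical. The key is whatever the function passed to groupBy() returns.

```scala
// Assumes spark-shell, where `sc` is predefined; `words` is made-up sample data.
val words = sc.parallelize(Seq("apple", "avocado", "banana", "blueberry", "cherry"))

// The key is computed by the supplied function: here, the first letter of each word.
val byInitial = words.groupBy(_.head)   // RDD[(Char, Iterable[String])]

// Nothing runs until an action is called; collect() triggers the shuffle.
// The values are sorted only to make the result deterministic.
val initials = byInitial.mapValues(_.toList.sorted).collect().toMap
```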

GroupByKey()

groupByKey() also groups together data that has the same key, but it is meant only for pair RDDs. This means the programmer has no way to explicitly specify the key field as with groupBy(), so the operation is specialised for RDDs whose key is already defined. Like groupBy(), it is a lazily evaluated transformation and a wide, costly operation. An important thing to note is that groupByKey() results in hash-partitioned RDDs by default. We will come back to this aspect shortly.

val baseRdd = sc.parallelize(Seq((1, 2), (1, 3), (2, 1), (2, 2), (2, 3)))
baseRdd.groupByKey()
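The hash-partitioning point can be checked directly in spark-shell. The sketch below assumes `sc` is predefined and that no partitioner is passed or inherited; the name `keyed` is only for illustration.

```scala
import org.apache.spark.HashPartitioner

// Assumes spark-shell, where `sc` is predefined.
val keyed = sc.parallelize(Seq((1, 2), (1, 3), (2, 1), (2, 2), (2, 3)))

// Only the values are collected under each key.
val grouped = keyed.groupByKey()        // RDD[(Int, Iterable[Int])]

// With no partitioner supplied, the result is expected to report a HashPartitioner.
val isHashPartitioned = grouped.partitioner.exists(_.isInstanceOf[HashPartitioner])

// Sort the values only to make the collected result deterministic.
val values = grouped.mapValues(_.toList.sorted).collect().toMap
```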

Now Differences

I can hear what you are thinking: “Then what is the difference between them, apart from the syntax?” Well, the intention of this article is to show you just that, so from here on let's focus on the differences. Let me explain with an example.

val data = for {
  x <- 1 to 200000
  y <- 1 to 20
} yield (x, y)

val baseRDD = sc.parallelize(data)
baseRDD.groupBy(_._1).take(1)

When you execute the above job on a 2-core node, we get these performance metrics from the Spark UI.

Now let's modify the job to use groupByKey() and check the results.

val data = for {
  x <- 1 to 200000
  y <- 1 to 20
} yield (x, y)

val baseRDD = sc.parallelize(data)
baseRDD.groupByKey().take(1)

Conclusion

Got it? groupByKey() gives roughly a 4-second performance advantage over groupBy() in this run. Also look at the difference in shuffle read and write data. So groupByKey() is clearly more efficient than groupBy() here, and I attribute this significant performance gain primarily to its hash-partitioned implementation. So when your scenario is just grouping, go ahead with groupByKey().
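One more angle on the shuffle difference, offered as my understanding and not as measured evidence: groupBy(f) appears to behave like keying each element by f and then calling groupByKey(). On a pair RDD, groupBy(_._1) therefore carries the whole tuple as the grouped value, while groupByKey() carries only the second field. A small sketch under that assumption, with `sc` from spark-shell and made-up `sample` data:

```scala
// Assumes spark-shell, where `sc` is predefined; `sample` is illustrative data.
val sample = sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))

// groupBy(_._1): each grouped value is the full (key, value) tuple.
val viaGroupBy = sample.groupBy(_._1).mapValues(_.toList.sorted).collect().toMap

// The same result written by hand: key every element, then call groupByKey().
val viaManual = sample.map(t => (t._1, t)).groupByKey().mapValues(_.toList.sorted).collect().toMap

// groupByKey() on the original pairs: each grouped value is only the second field.
val viaGroupByKey = sample.groupByKey().mapValues(_.toList.sorted).collect().toMap
```

If this holds for your Spark version, the key is shuffled twice per record under groupBy(_._1), which would be consistent with the larger shuffle read and write sizes.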

