Map Operation

In this article, let's discuss the map operation in Spark. Before we move into the Spark aspect of map, let's see what map is in general. In functional programming, map is a higher-order function, that is, a function that does at least one of the following (a short illustration follows the list):

  1. Takes one or more functions as its input.
  2. Returns a function as its result. 
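map falls into the first category: it takes a function as its input and applies it to every element of a collection. As a quick sketch before we bring Spark into the picture (the names doubler and numbers below are made up purely for illustration), here is map on a plain Scala collection:

//map is a higher order function: it takes the function doubler as its input
val doubler = (x: Int) => x * 2
val numbers = List(1, 2, 3, 4, 5)
numbers.map(doubler) //returns List(2, 4, 6, 8, 10)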
With that basic understanding, let's jump into Spark's map operation. In Spark, map() is a transformation, which means it is lazily evaluated: nothing is computed until an action is called. It takes a function and applies that function to each element of the RDD, returning a new RDD. It is a narrow transformation, so no data shuffling takes place. It's also important to note that RDDs are immutable, so the input RDD is left unchanged. Let's see it with an example.

val additionFun = (input: Int) => input + input
val multiplFun = (input: Int) => input * input

val baseRDD = spark.sparkContext.parallelize(1 to 10)

//apply addition function on baseRDD
baseRDD.map { additionFun }.foreach(println)

println(“Multiplication function")
//apply multiplication operation on baseRDD
baseRDD.map { multiplFun }.foreach(println)
Result:
2
4
6
8
10
12
14
16
18
20
Multiplication function
1
4
9
16
25
36
49
64
81
100

The output shows that map() can perform two different operations on the same dataset, depending on the function it is given. This might be a little confusing to newcomers to the functional world: in functional programming, a function can take another function as input and can return a function as output.

Another important aspect of map() is its return type. In several of Spark's built-in operations, the return type must be the same as the input type (for example, reduce()). In our previous example, both our baseRDD and the result were of Integer type. But map() gives us the flexibility to have a return type that differs from the input type.

Let's tweak our previous code a little and see it in action:
val additionFun = (input: Int) => (input + input).toString
val baseRDD = spark.sparkContext.parallelize(1 to 10)

//apply addition function on baseRDD
baseRDD.map { additionFun }.foreach(println)
The output would be the same, but each value would now be a String. Many developers consider map() a general-purpose operation that is also well suited for ETL-style transformations. Note that additionFun and multiplFun act as closures: Spark serializes them and ships them to the executors, so the functions run in parallel on many executors across the cluster.
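
Since the function travels to the executors as a serialized closure, any local variable it references is captured and shipped along with it. Here is a small sketch of the idea (the variable name bonus and its value are made up for illustration):

//bonus is a driver-side variable captured by the closure passed to map()
val bonus = 100
val withBonus = baseRDD.map { value => value + bonus }
//each executor receives its own serialized copy of the closure, including bonus
withBonus.foreach(println)

One consequence worth keeping in mind is that whatever the closure captures must be serializable, otherwise Spark will fail when it tries to ship the function to the executors.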
