Skip to main content

Posts

Featured

zip()

zip(...) is a simple and interesting operation which can be considered as general purpose operation. Basically  it returns key-value pairs (ZippedPartitionsRDD2) with the first element in each RDD, second element in each RDD, etc.  It is an element-wise , narrow transformation operation. Following example demonstrates the above definitions. val rdd1 = spark . sparkContext . parallelize ( Seq ( 1 , 2 , 3 , 4 , 5 ) ) val rdd2 = spark . sparkContext . parallelize ( Seq ( 6 , 7 , 8 , 9 , 10 ) ) rdd1 . zip ( rdd2) . foreach ( println ( _ ) ) Result (1,6) (2,7) (4,9) (5,10) (3,8) DAG The above result shows that, first element of  rdd1  is zipped with first element of  rdd2  and so on.Also the DAG confirms that zip(...) is a narrow operation. If you look at the code, you could find both   rdd1  and  rdd2  has same number of record counts but interesting point here is, following error will be thrown if your rdds has different record counts. Caused by: org.apache.sp

Latest posts

filter()

mapPartitions()

Intersection()

Subtract()

Collect vs Map Operation

Union Operation

Reduce() Operation

Map Operation

Spark Execution

map(), flatMap() vs mapValues(),flatMapValues()