Different ways to word count in apache spark

yashwanth2804 profile image kambala yashwanth ・2 min read

Hi Big Data Devs,

When it comes to provide an example for a big-data framework, WordCount program is like a hello world programme.The main reason it gives a snapshot of Map-shuffle-reduce for the beginners.Here I am providing different ways to achieve it

ReduceByKey (Transformation)

Return type is same as input RDD type

JavaRDD<String> file = sc.textFile("/<Path_To_File>/README.md");

JavaRDD<String> words = file
.flatMap(f -> Arrays.asList(f.split(" ")).iterator())
.filter(f -> !f.isEmpty());

// grouping words with number 1
JavaPairRDD<String,Integer> wordMap_to_pair = words
.mapToPair(f -> new Tuple2<String,Integer>(f,1) );

//ReduceBykey,will merge at partition level and sends result to driver
JavaPairRDD<String,Integer> reducebyKey = wordMap_to_pair.reduceByKey((a,b) -> a+b);


foldByKey (Transformation)

It is similar to reeducebykey but takes the zeroValue.The user does not need to specify a combiner

JavaPairRDD<String, Integer> Foldbykey = wordMap_to_pair.foldByKey(0, ((acc, val) -> acc+val));


aggregateByKey (Transformation)

Return type need not be same as input RDD type

initalValue @primitive types. 0 in Addition and subtraction , 1 in multiplication /division , [] on list ,"" on string

SequenceFunction @operates within partition level ,

CombinerFunction @operates across partition level

JavaPairRDD<String,Integer> aggreagteBykey = wordMap_to_pair.aggregateByKey(
0 // initial value
, ( (a,v) -> a+v ) //seq,counting operation within partition
, ( (a,v) -> a+v )); // merge, counting operation across partition

System.out.println( aggreagteBykey.collect());

CombineByKey (Transformation)

The user can specify a combiner function and customize combining behavior unlike aggregate and fold bykeys.

Return type need not be same as input RDD type


createCombiner @A function accepts current values and returns new value

mergeValue @merge/combine values within partition level

mergeCombiners @merge/combine values across partition level

i -> i, //createCombiner
(a,v) -> (a+v) //mergeValue
,(a,b) -> a+b) //mergeCombiners

groupByKey (Transformation)

Returns a Tuple of key and collection of values.performs hash join across partitions

JavaPairRDD<String,Iterable<Integer>> groupByKey = wordMap_to_pair.groupByKey();

.mapValues( v -> Iterables.size(v)) // [1,1,1,1] -> [4]


Editor guide