Hi Big Data Devs,
When it comes to providing an example for a big-data framework, the WordCount program is the equivalent of hello world. The main reason is that it gives beginners a snapshot of map-shuffle-reduce. Here I am providing different ways to achieve it using Spark's Java pair-RDD operations.
reduceByKey (Transformation)
The value type of the result is the same as the value type of the input RDD.
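import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;
import com.google.common.collect.Iterables; // Guava, used in the groupByKey example
// sc is assumed to be an existing JavaSparkContext
JavaRDD<String> file = sc.textFile("/<Path_To_File>/README.md");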
JavaRDD<String> words = file
.flatMap(f -> Arrays.asList(f.split(" ")).iterator())
.filter(f -> !f.isEmpty());
// pair each word with an initial count of 1
JavaPairRDD<String,Integer> wordMap_to_pair = words
.mapToPair(f -> new Tuple2<String,Integer>(f,1) );
// reduceByKey merges the values for each key within each partition first,
// then merges the partial results across partitions after the shuffle
JavaPairRDD<String,Integer> reducebyKey = wordMap_to_pair.reduceByKey((a,b) -> a+b);
// collect() is the action that brings the result to the driver
System.out.println(reducebyKey.collect());
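To make the two merge phases concrete, here is a minimal plain-Java sketch with no Spark dependency. It only imitates the idea: the two lists stand in for partitions, and the class and method names (ReduceByKeySketch, reducePartition, mergePartials) are made up for illustration and are not part of any Spark API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceByKeySketch {
    // Phase 1: merge the counts for each word inside a single partition.
    static Map<String, Integer> reducePartition(List<String> partition) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : partition) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Phase 2: merge the per-partition partial results with the same function.
    static Map<String, Integer> mergePartials(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two hypothetical partitions of the words RDD.
        List<String> p0 = Arrays.asList("spark", "java", "spark");
        List<String> p1 = Arrays.asList("java", "spark");
        Map<String, Integer> counts =
                mergePartials(Arrays.asList(reducePartition(p0), reducePartition(p1)));
        System.out.println(counts);
    }
}
```

The same function (addition) is used in both phases, which is why reduceByKey needs only one lambda.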
foldByKey (Transformation)
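It is similar to reduceByKey but takes a zeroValue as the starting accumulator for each key. The user does not need to specify a separate combiner, because the same function is used within and across partitions. The zeroValue can be applied more than once per key (once in every partition where the key occurs), so it should be neutral for the operation, such as 0 for addition.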
JavaPairRDD<String, Integer> Foldbykey = wordMap_to_pair.foldByKey(0, ((acc, val) -> acc+val));
System.out.println(Foldbykey.collect());
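Why the zeroValue should be neutral is easier to see in a small simulation. The following is a plain-Java sketch, not Spark code: the nested lists stand in for partitions, and FoldByKeySketch and its methods are hypothetical names. It assumes the zeroValue seeds the accumulator once per key in each partition.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FoldByKeySketch {
    // Fold one partition: the first time a word is seen, start from zeroValue.
    static Map<String, Integer> foldPartition(List<String> partition, int zeroValue) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : partition) {
            Integer acc = partial.get(word);
            partial.put(word, (acc == null ? zeroValue : acc) + 1);
        }
        return partial;
    }

    // Fold every partition, then merge the partial results with the same addition.
    static Map<String, Integer> foldByKey(List<List<String>> partitions, int zeroValue) {
        Map<String, Integer> total = new HashMap<>();
        for (List<String> partition : partitions) {
            foldPartition(partition, zeroValue)
                    .forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("spark", "java", "spark"),
                Arrays.asList("java", "spark"));
        // Neutral zero: spark=3, java=2.
        System.out.println(foldByKey(partitions, 0));
        // Non-neutral zero: each key is inflated once per partition (spark=5, java=4).
        System.out.println(foldByKey(partitions, 1));
    }
}
```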
aggregateByKey (Transformation)
The value type of the result need not be the same as the value type of the input RDD.
Parameters:
initialValue @the starting value of the accumulator for each key, e.g. 0 for addition and subtraction and 1 for multiplication/division on primitive types, [] for a list, "" for a string
SequenceFunction @operates within a partition
CombinerFunction @operates across partitions
JavaPairRDD<String,Integer> aggregateByKey = wordMap_to_pair.aggregateByKey(
0 // initial value
, ( (a,v) -> a+v ) // seq: counting within a partition
, ( (a,b) -> a+b )); // comb: merging partial counts across partitions
System.out.println(aggregateByKey.collect());
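Since the word count above keeps Integer on both sides, it does not show the result type changing. Here is a plain-Java sketch (no Spark dependency) that aggregates the Integer values into a String of tally marks, starting from "". The helper aggregateByKey below is my own stand-in that imitates the three parameters; AggregateByKeySketch, pair and demo are hypothetical names.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

public class AggregateByKeySketch {
    // Imitates aggregateByKey: V is the input value type, U is the result type.
    static <V, U> Map<String, U> aggregateByKey(
            List<List<Map.Entry<String, V>>> partitions,
            U zeroValue,
            BiFunction<U, V, U> seqFunc,
            BinaryOperator<U> combFunc) {
        Map<String, U> total = new HashMap<>();
        for (List<Map.Entry<String, V>> partition : partitions) {
            // seqFunc folds each value into the accumulator within the partition.
            Map<String, U> partial = new HashMap<>();
            for (Map.Entry<String, V> pair : partition) {
                U acc = partial.getOrDefault(pair.getKey(), zeroValue);
                partial.put(pair.getKey(), seqFunc.apply(acc, pair.getValue()));
            }
            // combFunc merges the accumulators across partitions.
            partial.forEach((key, acc) -> total.merge(key, acc, combFunc));
        }
        return total;
    }

    // Builds a (word, 1) pair, like the mapToPair step in the post.
    static Map.Entry<String, Integer> pair(String word) {
        return new AbstractMap.SimpleEntry<>(word, 1);
    }

    static Map<String, String> demo() {
        List<List<Map.Entry<String, Integer>>> partitions = Arrays.asList(
                Arrays.asList(pair("spark"), pair("java"), pair("spark")),
                Arrays.asList(pair("java"), pair("spark")));
        // Every value is 1, so one tally mark is appended per occurrence.
        return aggregateByKey(partitions, "", (acc, v) -> acc + "|", (a, b) -> a + b);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Here the input values are Integer and the result values are String, which reduceByKey and foldByKey cannot express.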
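combineByKey (Transformation)
Unlike aggregateByKey and foldByKey, there is no zero value: the user supplies a function that builds the initial combiner from the first value of a key, so the combining behavior is fully customizable.
The value type of the result need not be the same as the value type of the input RDD.
Parameters:
createCombiner @a function that takes the first value seen for a key in a partition and returns the initial combiner (accumulator)
mergeValue @merges a value into the combiner within a partition
mergeCombiners @merges two combiners across partitions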
wordMap_to_pair.combineByKey(
i -> i, //createCombiner
(a,v) -> (a+v) //mergeValue
,(a,b) -> a+b) //mergeCombiners
.collect()
.forEach(System.out::println);
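The role of the three functions can be imitated in plain Java without Spark. This is only a sketch under the assumption that createCombiner runs for the first value of a key in each partition; CombineByKeySketch, pair and demo are hypothetical names, and the combiner type Long is chosen just to show that it can differ from the Integer input.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class CombineByKeySketch {
    // Imitates combineByKey: V is the input value type, C is the combiner type.
    static <V, C> Map<String, C> combineByKey(
            List<List<Map.Entry<String, V>>> partitions,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue,
            BinaryOperator<C> mergeCombiners) {
        Map<String, C> total = new HashMap<>();
        for (List<Map.Entry<String, V>> partition : partitions) {
            Map<String, C> partial = new HashMap<>();
            for (Map.Entry<String, V> pair : partition) {
                String key = pair.getKey();
                if (partial.containsKey(key)) {
                    // Key already seen in this partition: fold the value in.
                    partial.put(key, mergeValue.apply(partial.get(key), pair.getValue()));
                } else {
                    // First value for this key in this partition: build the combiner.
                    partial.put(key, createCombiner.apply(pair.getValue()));
                }
            }
            // Merge this partition's combiners into the overall result.
            partial.forEach((key, combiner) -> total.merge(key, combiner, mergeCombiners));
        }
        return total;
    }

    // Builds a (word, 1) pair, like the mapToPair step in the post.
    static Map.Entry<String, Integer> pair(String word) {
        return new AbstractMap.SimpleEntry<>(word, 1);
    }

    static Map<String, Long> demo() {
        List<List<Map.Entry<String, Integer>>> partitions = Arrays.asList(
                Arrays.asList(pair("spark"), pair("java"), pair("spark")),
                Arrays.asList(pair("java"), pair("spark")));
        return combineByKey(partitions,
                v -> v.longValue(),   // createCombiner: Integer -> Long
                (c, v) -> c + v,      // mergeValue
                Long::sum);           // mergeCombiners
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```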
groupByKey (Transformation)
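Returns pairs of a key and the collection of all values for that key. It hash-partitions the keys and shuffles every value across partitions without combining anything beforehand, so for a word count it moves more data than reduceByKey.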
JavaPairRDD<String,Iterable<Integer>> groupByKey = wordMap_to_pair.groupByKey();
groupByKey
.mapValues( v -> Iterables.size(v)) // [1,1,1,1] -> 4
.collect()
.forEach(System.out::println);
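To get a feel for the difference in shuffle volume between groupByKey and reduceByKey, here is a rough plain-Java estimate with no Spark dependency. It is a simplification: it assumes groupByKey ships every (word, 1) pair and reduceByKey ships one pre-merged record per distinct word per partition. ShuffleVolumeSketch and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class ShuffleVolumeSketch {
    // groupByKey: every pair leaves its partition unmerged.
    static int recordsShuffledByGroupByKey(List<List<String>> partitions) {
        int records = 0;
        for (List<String> partition : partitions) {
            records += partition.size();
        }
        return records;
    }

    // reduceByKey: each partition sends one partial count per distinct word.
    static int recordsShuffledByReduceByKey(List<List<String>> partitions) {
        int records = 0;
        for (List<String> partition : partitions) {
            records += new HashSet<>(partition).size();
        }
        return records;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "a", "a", "b"),
                Arrays.asList("a", "b", "b"));
        System.out.println(recordsShuffledByGroupByKey(partitions));  // prints 7
        System.out.println(recordsShuffledByReduceByKey(partitions)); // prints 4
    }
}
```

The gap grows with the number of repeated words per partition, which is why reduceByKey is usually the better fit for counting.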