DEV Community

kambala yashwanth
kambala yashwanth

Posted on

3

Different ways to word count in apache spark

Hi Big Data Devs,

When it comes to provide an example for a big-data framework, WordCount program is like a hello world programme.The main reason it gives a snapshot of Map-shuffle-reduce for the beginners.Here I am providing different ways to achieve it

ReduceByKey (Transformation)

Return type is same as input RDD type


JavaRDD<String> file = sc.textFile("/<Path_To_File>/README.md");

JavaRDD<String> words = file
.flatMap(f -> Arrays.asList(f.split(" ")).iterator())
.filter(f -> !f.isEmpty());

// grouping words with number 1
JavaPairRDD<String,Integer> wordMap_to_pair = words
.mapToPair(f -> new Tuple2<String,Integer>(f,1) );

//ReduceBykey,will merge at partition level and sends result to driver
JavaPairRDD<String,Integer> reducebyKey = wordMap_to_pair.reduceByKey((a,b) -> a+b);

System.out.println(reducebyKey.collect());

foldByKey (Transformation)


It is similar to reeducebykey but takes the zeroValue.The user does not need to specify a combiner


JavaPairRDD<String, Integer> Foldbykey = wordMap_to_pair.foldByKey(0, ((acc, val) -> acc+val));

System.out.println(Foldbykey.collect());

aggregateByKey (Transformation)


Return type need not be same as input RDD type
parameters

initalValue @primitive types. 0 in Addition and subtraction , 1 in multiplication /division , [] on list ,"" on string

SequenceFunction @operates within partition level ,

CombinerFunction @operates across partition level

JavaPairRDD<String,Integer> aggreagteBykey = wordMap_to_pair.aggregateByKey(
0 // initial value
, ( (a,v) -> a+v ) //seq,counting operation within partition
, ( (a,v) -> a+v )); // merge, counting operation across partition

System.out.println( aggreagteBykey.collect());

CombineByKey (Transformation)


The user can specify a combiner function and customize combining behavior unlike aggregate and fold bykeys.

Return type need not be same as input RDD type

parameters

createCombiner @A function accepts current values and returns new value

mergeValue @merge/combine values within partition level

mergeCombiners @merge/combine values across partition level


wordMap_to_pair.combineByKey(
i -> i, //createCombiner
(a,v) -> (a+v) //mergeValue
,(a,b) -> a+b) //mergeCombiners
.collect()
.forEach(System.out::println);

groupByKey (Transformation)


Returns a Tuple of key and collection of values.performs hash join across partitions


JavaPairRDD<String,Iterable<Integer>> groupByKey = wordMap_to_pair.groupByKey();

groupByKey
.mapValues( v -> Iterables.size(v)) // [1,1,1,1] -> [4]
.collect()
.forEach(System.out:println);

Sentry image

Hands-on debugging session: instrument, monitor, and fix

Join Lazar for a hands-on session where you’ll build it, break it, debug it, and fix it. You’ll set up Sentry, track errors, use Session Replay and Tracing, and leverage some good ol’ AI to find and fix issues fast.

RSVP here →

Top comments (0)

AWS Security LIVE!

Join us for AWS Security LIVE!

Discover the future of cloud security. Tune in live for trends, tips, and solutions from AWS and AWS Partners.

Learn More

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay