DEV Community

Shrijith Venkatramana
Shrijith Venkatramana

Posted on

3 2 2 2 2

MapReduce Basics (Part 1)

Hi there! I'm Shrijith Venkatrama, the founder of Hexmos. Right now, I’m building LiveAPI, a super-convenient tool that simplifies engineering workflows by generating awesome API docs from your code in minutes.

  • The MapReduce (MR) programming model was invented to solve large-scale data processing problems
  • In particular, Google, eBay and some specialized academic communities such as particle physics needed petabyte scale (10^15 bytes) data processing on a daily basis
  • MapReduce is a set of principles for solving large-scaale data processing problems
  • The two main principles in action are: (1) Divide and Conquer (2) Parallelization
  • Divide and Conquer is about - how to decompose a large problem into smaller ones
  • Parallelization is about - how to solve each smaller subproblem in parallel and finally integrate them to get a consolidated solution to the original large problem
  • Most traditional solutions/frameworks for parallelization the developer has to worry about many details: how to decompose the problem, how to distribute compute across cores, machines, how to distribute data efficiently, how to deal with errors/failures in the distributed system, how to synchronize different workers, etc. The older approaches have big cognitive burden and due to that room for errors in implementation.
  • MapReduce provides solutions for handling petabyte data efficiently. Instead of moving data where computation will happen, MR brings computation to data.
  • MapReduce has roots in functional programming. In particular, it is rooted in two functions: map and fold. Given a list of values, the map function is applied to each element to get a transformed value. Map is inherently parallelizable. The next thing is the fold function. A fold function takes two values: initial value (or prev value) and the next value (which is the result of map, usually).

Map & Fold

  • In summary - map is the transformation operation, while fold is the aggregation operation
  • The fold operation requires at least two data elements to be "brought together" in a distributed setup (hence more tricky sometimes)
  • In real-world scenarios, many times fold is not required for all elements; rather fold happens in "groups", leading to higher parallelization.
  • For commutative and associative operations, fold can be made much faster through local aggregation and sensible reordering
  • MR is practically implemented at Google (proprietary) and also open sourced in the Hadoop project.

Next Steps

In the next part of the article series, I will explore:

  1. Mappers and Reducers
  2. Partitioners and Combiners

Billboard image

Deploy and scale your apps on AWS and GCP with a world class developer experience

Coherence makes it easy to set up and maintain cloud infrastructure. Harness the extensibility, compliance and cost efficiency of the cloud.

Learn more

Top comments (0)

Image of Timescale

Timescale – the developer's data platform for modern apps, built on PostgreSQL

Timescale Cloud is PostgreSQL optimized for speed, scale, and performance. Over 3 million IoT, AI, crypto, and dev tool apps are powered by Timescale. Try it free today! No credit card required.

Try free

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay