Shrijith Venkatramana

MapReduce Basics (Part 2)

Hello, I'm Shrijith. I'm building git-lrc, an AI code reviewer that runs on every commit. It is free, unlimited, and source-available on GitHub. Star us to help other devs discover the project. Do give it a try and share your feedback to help improve the product.

In my previous post, I introduced the MapReduce programming model at a high level. Head over to part 1 for that context.

In this post, we dive deeper into the core components of the model: mappers, reducers, partitioners, and combiners.

Mappers & Reducers

The basic data structure in MapReduce is key-value pairs.

Keys can be primitive data types (ints, strings, etc.) or more complex data types (lists, tuples, etc.).

For example, when representing web pages, the key could be the URL (string), and the value could be the page contents (string).

The programmer must define two functions with the following signatures:

map: (k1, v1) -> [(k2, v2)]
reduce: (k2, [v2]) -> [(k3, v3)]

Here, the convention [...] denotes a list.
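To make the notation concrete, the same signatures can be written as Python type hints (the type variables `K1`, `V1`, etc. are illustrative stand-ins, not part of any MapReduce API):

```python
from typing import Iterable, Tuple, TypeVar

K1, V1 = TypeVar("K1"), TypeVar("V1")
K2, V2 = TypeVar("K2"), TypeVar("V2")
K3, V3 = TypeVar("K3"), TypeVar("V3")

# map: (k1, v1) -> [(k2, v2)]
def map_fn(key: K1, value: V1) -> Iterable[Tuple[K2, V2]]: ...

# reduce: (k2, [v2]) -> [(k3, v3)]
def reduce_fn(key: K2, values: Iterable[V2]) -> Iterable[Tuple[K3, V3]]: ...
```

Note that the reducer receives a single intermediate key together with the *list* of all values emitted under that key.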

The mapper is applied to an arbitrary number of input key/value pairs and produces an arbitrary number of intermediate key/value pairs.

The reducer then processes all the intermediate values associated with a given key to generate the output key/value pairs.

Between the map and reduce phases, a "group by" operation is performed on the intermediate keys.

Each reducer receives its subset of intermediate keys in sorted order, together with the list of all values associated with each key.

The output consists of r files in the distributed file system, where r is the number of reducers.
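The group-by step between map and reduce can be sketched in plain Python. The `shuffle` function below is purely illustrative (the real framework does this across the network), but it shows the transformation: a flat list of (key, value) pairs becomes a sorted list of (key, [values]) groups, one per reducer call.

```python
from collections import defaultdict

def shuffle(mapped_pairs):
    """Group intermediate (key, value) pairs by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    # Each reducer sees its keys in sorted order
    return sorted(groups.items())

pairs = [("b", 1), ("a", 2), ("a", 3), ("c", 1)]
print(shuffle(pairs))  # [('a', [2, 3]), ('b', [1]), ('c', [1])]
```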

Word Counter Example

The goal of the word counter is to take in a collection of documents, process each word, and compute the total count of each word in the entire collection.

Word counts are useful for building search indices, among other applications.

The mapper tokenizes each document and emits the pair (word, 1) for every word it encounters; the reducer receives all the counts emitted for a given word and sums them into the final total.
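A minimal sketch of that mapper and reducer in plain Python (Hadoop itself expects Java classes or a streaming script; this shows only the logic):

```python
def mapper(doc_id, text):
    """Emit (word, 1) for every word in the document."""
    for word in text.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Sum all partial counts for a word."""
    yield (word, sum(counts))

# Running the mapper over one document:
pairs = list(mapper("doc1", "the quick fox the"))
# pairs == [('the', 1), ('quick', 1), ('fox', 1), ('the', 1)]
# After the shuffle, the reducer sees ('the', [1, 1]) and emits ('the', 2).
```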

A Bit More Nuance: Partitioners and Combiners

Partitioner

After the mapping phase, we are left with a collection of key-value pairs.

These key-value pairs must be assigned to reducers in some manner.

Partitioning determines the target reducer for each key-value pair.

By default, Hadoop's HashPartitioner uses the following formula (the bitmask clears the sign bit, since hashCode() may return a negative value):

partition = (key.hashCode() & Integer.MAX_VALUE) mod numReducers

Note that in default partitioning, only the key is considered.

This usually results in a fairly balanced spread of keys. However, because only the key is hashed, a skewed key distribution, or keys whose value lists vary greatly in size, can overload some reducers.

Writing a custom partitioner with more sophisticated logic can help mitigate such imbalances.
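The default scheme can be sketched in Python, with `hash()` standing in for Java's `hashCode()` and the bitmask mirroring how Hadoop keeps the index non-negative:

```python
NUM_REDUCERS = 4

def partition(key, num_reducers=NUM_REDUCERS):
    """Assign a key to a reducer using the default hash scheme."""
    # Mask the sign bit so the index is never negative
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# Every pair with the same key lands on the same reducer,
# which is what lets the reducer see all values for that key.
assert partition("apple") == partition("apple")
assert 0 <= partition("banana") < NUM_REDUCERS
```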

Combiners (or "Aggregate Intermediate Results")

A combiner acts as a "mini-reducer" that runs on each mapper's output before it is sent over the network to the reducers.

By aggregating intermediate results locally, it reduces bandwidth usage and improves overall performance.

If the reduce function is both associative and commutative (like summation), it can often serve directly as the combiner, yielding substantial performance gains.

Combiners are optional components in the MapReduce paradigm.
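For word count, the reduce function (summation) is associative and commutative, so the same logic can double as the combiner. The `combine` helper below is an illustrative sketch of the local aggregation one mapper would perform before the shuffle:

```python
from collections import defaultdict

def combine(mapped_pairs):
    """Locally sum counts per key before sending them to the reducers."""
    totals = defaultdict(int)
    for word, count in mapped_pairs:
        totals[word] += count
    return list(totals.items())

# One mapper emitted four pairs; the combiner ships only three,
# cutting the data sent across the network.
local = [("the", 1), ("fox", 1), ("the", 1), ("ran", 1)]
print(combine(local))  # [('the', 2), ('fox', 1), ('ran', 1)]
```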

git-lrc

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributions are welcome! It's online, source-available, and ready for anyone to use.

⭐ Star it on GitHub: HexmosTech / git-lrc
