<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Leonardo Luís Dalcegio</title>
    <description>The latest articles on DEV Community by Leonardo Luís Dalcegio (@leodalcegio).</description>
    <link>https://dev.to/leodalcegio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F531928%2F04cadf36-9eb2-47b4-aa10-525af8d4d600.jpeg</url>
      <title>DEV Community: Leonardo Luís Dalcegio</title>
      <link>https://dev.to/leodalcegio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leodalcegio"/>
    <language>en</language>
    <item>
      <title>A Comprehensive Guide to MapReduce: Distributed Data Processing</title>
      <dc:creator>Leonardo Luís Dalcegio</dc:creator>
      <pubDate>Sat, 06 Apr 2024 20:55:09 +0000</pubDate>
      <link>https://dev.to/leodalcegio/a-comprehensive-guide-to-mapreduce-distributed-data-processing-3lj8</link>
      <guid>https://dev.to/leodalcegio/a-comprehensive-guide-to-mapreduce-distributed-data-processing-3lj8</guid>
      <description>&lt;p&gt;In this article, I will explore the MapReduce programming model introduced on Google's paper, &lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf"&gt;MapReduce: Simplified Data Processing on Large Clusters&lt;/a&gt;. I hope you will understand how it works, its importance and some of the trade offs that Google made while implementing the paper. I will also provide real world applications of MapReduce at the end of the article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation: the problem Google was facing
&lt;/h2&gt;

&lt;p&gt;In 2004, Google needed a way to process very large data sets, in theory, the whole web. &lt;strong&gt;Google's engineers faced the challenge of processing enormous datasets generated by their web crawlers, indexing systems, and various other applications&lt;/strong&gt;. Traditional data processing systems couldn't handle the scale or provide the fault tolerance required for such large-scale operations; imagine having to sort or classify the whole web, for example.&lt;/p&gt;

&lt;p&gt;At that time, distributed computing was already a thing, so one of the first things the engineers might have thought was, "let's distribute the workload across multiple computers". The problem is that many engineers were not familiar with implementing distributed computing algorithms at the scale required by Google's operations, so they needed an abstraction for doing so. This is when MapReduce was designed by &lt;a href="https://research.google/people/jeffrey-dean/"&gt;Jeffrey Dean&lt;/a&gt; and &lt;a href="https://research.google/people/sanjay-ghemawat/"&gt;Sanjay Ghemawat&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Before moving on, if you want to learn more about distributed computing/distributed systems, you can read &lt;a href="https://leodalcegio.dev/distributed-systems-101-based-on-understanding-distributed-systems"&gt;Distributed Systems 101 (based on Understanding Distributed Systems book)&lt;/a&gt; for an introduction to the topic.&lt;/p&gt;

&lt;h2&gt;
  
  
  How MapReduce works: solving the problem of processing the whole web
&lt;/h2&gt;

&lt;p&gt;MapReduce is a programming model for processing and generating large data sets. &lt;strong&gt;Users specify a Map function that processes a key/value pair to generate a set of key/value pairs, and a Reduce function that merges all values associated with the same key&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Take for example a word count, where you receive a text as an input and need to count the occurrences of each word in the text (you can see a representation of this in the image below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe3hx9gua1pnpsq74mns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe3hx9gua1pnpsq74mns.png" alt="MapReduce image example diagram" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MapReduce will...&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Split&lt;/strong&gt; the input into many chunks in order to run the operation on multiple machines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Map&lt;/strong&gt; the input data in order to generate a set of key/value pairs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the &lt;strong&gt;Map&lt;/strong&gt; phase, we have many machines each processing part of the input that was &lt;strong&gt;Split&lt;/strong&gt; before (remember that we have a large input, like the whole web, so it is inefficient to do it on a single machine)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Reduce&lt;/strong&gt; the key/value pairs by merging them based on their keys&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since we are performing a word count here, the Reduce function sums the values for each key. Each Reduce function writes its result to a file&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The final output consists of unique keys and their corresponding reduced values.&lt;/p&gt;
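&lt;p&gt;The whole flow above can be sketched in a few lines of Python. This is a toy, single-machine sketch (the function names are mine, not from the paper), but the Split, Map, shuffle and Reduce phases map one-to-one to the real thing:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical Map function: emits a (word, 1) pair for every word in its chunk
def map_words(chunk):
    return [(word.lower(), 1) for word in chunk.split()]

# Hypothetical Reduce function: merges all values associated with one key
def reduce_counts(key, values):
    return (key, sum(values))

# Split: divide the input into chunks (one per "machine" in a real cluster)
chunks = ["the quick brown fox", "the lazy dog and the fox"]

# Map: run map_words over every chunk
intermediate = []
for chunk in chunks:
    intermediate.extend(map_words(chunk))

# Shuffle: group the intermediate values by key
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Reduce: merge the values of each key and collect the final output
result = dict(reduce_counts(key, values) for key, values in grouped.items())
print(result["the"])  # 3
```

&lt;p&gt;In a real deployment each phase runs on many machines in parallel; the sketch only shows the data flow.&lt;/p&gt;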

&lt;h4&gt;
  
  
  The coordinator
&lt;/h4&gt;

&lt;p&gt;In order for MapReduce to function, there needs to be something coordinating all of this. We want to be able to just pass the input to it and let it run without worrying about its internal details. &lt;strong&gt;The component that receives this first input and starts the whole operation is called the coordinator&lt;/strong&gt;. The coordinator is responsible for passing the chunks to the machines that run the &lt;strong&gt;Map&lt;/strong&gt;, which means the coordinator is also responsible for &lt;strong&gt;Splitting&lt;/strong&gt; the data.&lt;/p&gt;

&lt;p&gt;When a machine finishes the Map phase, it notifies the coordinator and passes the location of the finished Map result file to it. The coordinator then calls the Reduce function on another machine with the location of the file to perform the reduce on. This implies that the machine responsible for executing the Reduce must access the machine that previously did the Map.&lt;/p&gt;

&lt;p&gt;As seen before, when a machine finishes the Reduce phase, it writes the result to a file, and the coordinator then returns these files to the MapReduce caller. It is important to notice that the files don't really need to travel through the network: although they are reduced, they could still be big enough to impact performance significantly. What can be done here, depending on the implementation of the coordinator, is to return the addresses where the reduced files are stored.&lt;/p&gt;

&lt;h3&gt;
  
  
  Network bandwidth is a problem (one of the biggest)
&lt;/h3&gt;

&lt;p&gt;One of the &lt;a href="https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing"&gt;fallacies of distributed computing&lt;/a&gt; is "Bandwidth is infinite"; in reality it isn't, so we want to &lt;strong&gt;minimize the amount of data traveling through the network&lt;/strong&gt;. Google does so by storing the inputs of the Map phase on the same machines that execute the Map.&lt;/p&gt;

&lt;p&gt;We can take a web crawler as an example: when the crawler finishes crawling a page, it can write a file on the same machine that will execute the Map phase on this file. The same applies to the Reduce results: the files that each reducer wrote aren't concatenated or even passed through the network until they really need to be read; only their addresses are shared until then.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fault tolerance importance in MapReduce
&lt;/h3&gt;

&lt;p&gt;Fault tolerance is a critical aspect of distributed computing, and MapReduce is no exception. Ensuring that the system can gracefully recover from failures is essential for handling large-scale data processing tasks reliably.&lt;/p&gt;

&lt;h4&gt;
  
  
  How is MapReduce fault tolerant?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;MapReduce ensures fault tolerance by running multiple instances of the same task across different machines&lt;/strong&gt;. By doing so, if one machine fails during execution, the task can be rerun on another machine without data loss. The coordinator plays a role here by continuously monitoring the machines executing the processes. In the event of a machine failure, the coordinator reallocates the task to another operational machine that has already completed its previously assigned task.&lt;/p&gt;

&lt;p&gt;Since the inputs are stored on the same machines that are responsible for the Mapping, copies of the same input must be stored on different machines, so that if one fails, the coordinator can use the others holding the same input to execute the task. In fact, according to the MapReduce paper, Google typically stores 3 copies on different machines.&lt;/p&gt;
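&lt;p&gt;The retry behavior can be illustrated with a toy sketch, assuming each replica of a task is represented as a callable that either returns its result or raises to simulate a machine crash (all names here are illustrative, not from the paper):&lt;/p&gt;

```python
# Toy retry loop: the coordinator tries the task on each machine that holds
# a copy of the input until one of them succeeds.
def run_with_retries(replica_tasks):
    for task in replica_tasks:
        try:
            return task()
        except RuntimeError:
            # This machine failed; fall through and retry on the next replica
            continue
    raise RuntimeError("all replicas failed")

def crashed():
    raise RuntimeError("machine down")

# The first two replicas crash; the third, holding the same input copy, succeeds
result = run_with_retries([crashed, crashed, lambda: "map output"])
print(result)  # map output
```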

&lt;p&gt;Idempotency is also very important in MapReduce systems. Both Map and Reduce tasks are designed to be idempotent, meaning they can be run multiple times without altering the final result. This ensures consistency and reliability, especially in scenarios like the one above, where some tasks need to be rerun due to failures.&lt;/p&gt;

&lt;h4&gt;
  
  
  The coordinator can't fail
&lt;/h4&gt;

&lt;p&gt;All that we talked about regarding fault tolerance here pertains to the machines that perform the Map and Reduce functions, but there is one more piece: the coordinator. Google itself decided not to make the coordinator fault tolerant, because &lt;strong&gt;it is far more likely that one of the 2000 machines running the "Map" will crash than that one single machine running the coordinator&lt;/strong&gt;, for example. Here we see a trade-off between complexity and likelihood of failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world applications of MapReduce
&lt;/h2&gt;

&lt;p&gt;There are many applications in which MapReduce can be useful; only a few of them are listed below, but MapReduce can be (and is) used in a wide range of real-world applications. It is also important to keep in mind that it is designed for processing large datasets that wouldn't be efficient to process on a single machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Counting URL Access Frequency:&lt;/strong&gt; in this scenario, the Map function processes logs of web page requests, extracting URLs and emitting key-value pairs like (url, 1). The Reduce function aggregates the counts for each URL, producing a final output of unique URLs and their corresponding access frequencies.&lt;/p&gt;
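&lt;p&gt;As a sketch (the log format and helper names below are made up for illustration), the Map and Reduce sides of this job could look like:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical Map function: emit (url, 1) for each request line in the log
def map_log_line(line):
    url = line.split()[1]
    return (url, 1)

log = ["GET /home", "GET /about", "GET /home"]

pairs = [map_log_line(line) for line in log]

# Reduce: sum the counts per URL
counts = Counter()
for url, n in pairs:
    counts[url] += n
print(counts["/home"])  # 2
```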

&lt;p&gt;&lt;strong&gt;Data Cleaning and Preprocessing&lt;/strong&gt;: MapReduce can be used to clean and preprocess large datasets before further analysis. For example, the Map function can perform tasks such as data validation, transformation, and filtering, while the Reduce function can aggregate the results and produce a cleaned dataset ready for analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log Analysis&lt;/strong&gt;: logs can be collected from servers, applications, and network devices to monitor system health, troubleshoot issues, and gain insights into user behavior. With MapReduce, those logs can be processed in parallel across multiple machines, the Map function can parse each log entry and emit key-value pairs based on the information extracted, while the Reduce function can perform analyses on this data, like counting the most common types of errors for example.&lt;/p&gt;




&lt;p&gt;This is the end of the article, but not the end of our journey through distributed systems; feel free to leave any comments or suggestions. Here are some next steps, such as papers and books, that you can explore to gain a deeper understanding of distributed processing of large datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf"&gt;MapReduce: Simplified Data Processing on Large Clusters&lt;/a&gt; (paper that introduces the MapReduce programming model)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf"&gt;Bigtable: A Distributed Storage System for Structured Data&lt;/a&gt; (paper about a distributed storage system designed to handle massive amounts of structured data across thousands of servers)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html"&gt;Building and operating a pretty big storage system called S3&lt;/a&gt; (article about AWS S3 written by a distinguished engineer at AWS)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Designing Data-Intensive Applications (book by Martin Kleppmann that covers the principles of building distributed applications)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Distributed Systems 101 (based on Understanding Distributed Systems)</title>
      <dc:creator>Leonardo Luís Dalcegio</dc:creator>
      <pubDate>Thu, 21 Mar 2024 22:09:13 +0000</pubDate>
      <link>https://dev.to/leodalcegio/distributed-systems-101-3db6</link>
      <guid>https://dev.to/leodalcegio/distributed-systems-101-3db6</guid>
      <description>&lt;p&gt;I recently finished reading "Understanding Distributed Systems", by Roberto Vitillo. In this article, I'll provide an overview of distributed systems, drawing from the knowledge and principles outlined in Vitillo's book and from my daily work as an SDE on Amazon.&lt;/p&gt;

&lt;p&gt;Many of the topics covered here deserve a whole article of their own; please keep in mind that this is an introduction to distributed systems, but I will link other articles and research at the end and throughout the text so you can dive deep into each subject.&lt;/p&gt;

&lt;p&gt;By the end, I hope you will have a brief overview of many important aspects of distributed applications. I will also link many references throughout the article and "next steps" for you to dig into, as it is impossible to cover everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do systems communicate with each other?
&lt;/h2&gt;

&lt;p&gt;Nothing functions in isolation, whether you're accessing a website or sending an email, &lt;strong&gt;almost every action on a computer involves communication with another system&lt;/strong&gt;. We will explore how this communication works as it is an important topic to understand before digging into distributed systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Physical communication between services
&lt;/h3&gt;

&lt;p&gt;Sometimes we forget about this, but &lt;strong&gt;there is &lt;em&gt;something&lt;/em&gt; (data) physically traveling from one computer to another&lt;/strong&gt;, whether over WiFi or through Ethernet cables, for example. For this data exchange to happen, rules and formats were agreed upon; those rules and formats are called protocols.&lt;/p&gt;

&lt;p&gt;Protocols act as the language of communication between systems, defining the rules and formats for data exchange. &lt;a href="https://developer.mozilla.org/pt-BR/docs/Web/HTTP"&gt;HTTP (Hypertext Transfer Protocol)&lt;/a&gt; governs communication over the web, allowing clients to request and receive resources from servers (Every time you open a webpage, you're using HTTP behind the scenes!). Similarly, &lt;a href="https://www.w3.org/History/1992/nfs_dxcern_mirror/rpc/doc/Introduction/WhatIs.html"&gt;RPC (Remote Procedure Call)&lt;/a&gt; enables programs on different computers to talk to each other by invoking procedures on remote machines.&lt;/p&gt;

&lt;p&gt;But before data can travel using protocols, it needs an addressing system. This is where IP (Internet Protocol) addresses come in.  An IP address is a unique identifier assigned to each device on a network, similar to a mailing address in the real world.  &lt;strong&gt;When you request information from a website, your device uses its own IP address to communicate with the website's server IP address&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Digital Cartography: mapping services
&lt;/h3&gt;

&lt;p&gt;Before moving on, I want to take a step back and talk about &lt;a href="https://aws.amazon.com/route53/what-is-dns/"&gt;DNS&lt;/a&gt;, which is extremely important for the communication of services. IP addresses are not very user-friendly, imagine having to remember a long string of numbers for every website you want to visit.&lt;/p&gt;

&lt;p&gt;DNS solves this problem by providing &lt;strong&gt;a way to map human-readable domain names (like "google.com" or "instagram.com") to IP addresses&lt;/strong&gt;. So, while protocols define the rules and formats for data exchange, and IP addresses provide the addressing system for devices on a network, DNS serves as the intermediary that maps human-readable domain names to machine-readable IP addresses, making it easier for us to navigate the vast landscape of the internet.&lt;/p&gt;
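&lt;p&gt;You can see this mapping from Python's standard library: &lt;code&gt;socket.gethostbyname&lt;/code&gt; asks the system resolver for the IP address behind a name, which for internet hosts means a DNS query behind the scenes:&lt;/p&gt;

```python
import socket

# The resolver maps a human-readable name to an IPv4 address
print(socket.gethostbyname("localhost"))  # 127.0.0.1 (resolved locally)

# For a real domain the same call performs an actual DNS lookup
# (uncomment if you have network access):
# print(socket.gethostbyname("google.com"))
```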

&lt;h2&gt;
  
  
  What happens when the communication rate increases?
&lt;/h2&gt;

&lt;p&gt;Think of the systems you use daily, like Instagram, Amazon, Uber, Airbnb, Steam: how do they guarantee resiliency? If Amazon's recommendation system goes offline, the rest of the website will probably still be working. How do they guarantee scalability? As all of those companies serve millions of customers, they need to be prepared for an insane amount of load. Do they have one big computer to process everything? Nope; imagine if this computer suddenly stopped working.&lt;/p&gt;

&lt;p&gt;Of course, they were not always this big; over time, their systems started facing challenges in scalability, throughput, latency and so on. And, as &lt;strong&gt;there is a limit to scaling memory, CPU, network and disk size&lt;/strong&gt; on a single machine, for example, there is a need to rely on another technique to scale the system. Distributed systems are one such technique, employing approaches like partitioning, replication, load balancing and caching.&lt;/p&gt;

&lt;p&gt;Instead of having one enormous server (which would be unfeasible), those companies have &lt;strong&gt;many computers connected and linked to each other over the network that appear as a single computer&lt;/strong&gt; to the user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here comes the need for distribution!
&lt;/h2&gt;

&lt;p&gt;You have a system that is getting millions of accesses, whose infrastructure is already scaled, whose performance is not as good as you would want, and you still need to support even more customers. What do you do? You distribute it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability: how do we scale?
&lt;/h3&gt;

&lt;p&gt;There are several challenges when scaling a system: you want to minimize the code changes, minimize the cost of doing so, and minimize single points of failure (i.e. parts which, if they stop, make the whole application stop). To solve these challenges when load and data volume increase, we distribute the application in pieces.&lt;/p&gt;

&lt;h4&gt;
  
  
  Services: the foundation
&lt;/h4&gt;

&lt;p&gt;A service within a distributed system typically represents a specific domain or business capability. Each service operates independently, and if one service goes down, the rest of the application can continue running (other services won't be able to communicate with the offline service, that's for sure, but we will talk more about it later).&lt;/p&gt;

&lt;h4&gt;
  
  
  Communication: you distributed the application into parts, how will they communicate with each other?
&lt;/h4&gt;

&lt;p&gt;In a distributed architecture, effective communication between services is essential for the overall functionality of the system. One common approach to enable communication between services is through messaging protocols such as HTTP, AMQP (Advanced Message Queuing Protocol), or gRPC (Google Remote Procedure Call). It is, literally, one service calling another.&lt;/p&gt;

&lt;h5&gt;
  
  
  Event-Driven Architecture
&lt;/h5&gt;

&lt;p&gt;Using events to communicate with other services is another widely used approach. In an event-driven architecture, &lt;strong&gt;services communicate through the exchange of events or messages&lt;/strong&gt;. When a significant event occurs within a service, such as the creation of a new user or the completion of a task, the service publishes a message describing it to a component whose only job is to store those messages and deliver them to the services that will actually process them.&lt;/p&gt;

&lt;p&gt;Services interested in these messages can &lt;strong&gt;subscribe to a so-called message broker&lt;/strong&gt; and react accordingly, enabling loosely coupled and scalable communication between services. You basically have an intermediate service that orchestrates the communication. See the image below for a representation of an event-driven architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmsfxd7ayuc43pbszwzu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmsfxd7ayuc43pbszwzu.png" alt="Event-driven architecture example diagram" width="661" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Horizontal Scaling: the foundation for scalability
&lt;/h4&gt;

&lt;p&gt;You divided your application into pieces, but how exactly will it be able to support more customers? Is dividing it enough? No! You need to scale the application horizontally, by adding more machines or instances to the network and distributing the workload across multiple nodes. Although it might seem easier to scale vertically, doing so has a limit, a cost and many disadvantages, such as single points of failure.&lt;/p&gt;

&lt;p&gt;However, &lt;strong&gt;horizontal scaling also comes with its own challenges, such as ensuring data consistency across distributed nodes and orchestrating load balancing efficiently&lt;/strong&gt;. As you might be wondering, proper design and implementation of the horizontal scaling is essential for achieving scalability in distributed systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  Load Balancing
&lt;/h4&gt;

&lt;p&gt;Imagine that you have a service named InventoryService running on 10 machines; if another service wants to call it, how does it know which of the machines to use in order to process this request? The load balancer of the InventoryService will decide for you!&lt;/p&gt;

&lt;p&gt;In a distributed system, load balancing plays a crucial role in distributing incoming requests among the machines efficiently. It ensures that the workload is evenly distributed across all available machines, preventing some instances from being overwhelmed while others remain underutilized.&lt;/p&gt;

&lt;p&gt;Taking the InventoryService as an example, the load balancer is another service, running on another machine, responsible for receiving the requests and deciding which of those 10 machines will end up processing each one. Oh wait! The load balancer can become a single point of failure: if it goes down, none of the requests will go through to any of the 10 machines! How do we resolve this? Well, that is a topic for another article that I will be writing in the future. For now, let's understand how exactly it "balances" the load evenly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Round Robin&lt;/strong&gt;: this method assigns each incoming request to the next available server in a sequential order. It ensures that all servers receive an equal number of requests over time, effectively distributing the load evenly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Least Connections&lt;/strong&gt;: this algorithm directs incoming requests to the server with the fewest active connections at the time of the request. It aims to distribute the workload based on the current load of each server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Weighted Round Robin&lt;/strong&gt;: mainly used in scenarios where not all servers are equal in terms of capacity or performance, it assigns a weight to each server. Servers with higher weights receive a proportionally larger number of requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IP Hashing&lt;/strong&gt;: determines which server will handle the request by using a hash function based on the client's IP address. Requests from the same client are consistently routed to the same server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Least Response Time&lt;/strong&gt;: load balancer will monitor the response times of each server and direct requests to the server with the fastest response time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic Load Balancing&lt;/strong&gt;: dynamically adjusts the load balancer strategy based on real-time metrics such as server health, current load, or network conditions. This ensures efficient resource utilization and can handle sudden spikes in traffic more effectively.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
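&lt;p&gt;Round Robin, the simplest of these strategies, can be sketched in a few lines (the machine names are made up):&lt;/p&gt;

```python
import itertools

# Toy round-robin balancer: cycles through the servers in sequential order
class RoundRobinBalancer:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["machine-1", "machine-2", "machine-3"])
picks = [lb.pick() for _ in range(4)]
print(picks)  # ['machine-1', 'machine-2', 'machine-3', 'machine-1']
```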

&lt;p&gt;Below is an image showing a service running on three machines; the load balancer receives the request and is responsible for deciding which machine will process it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbr5ru4onzf33f0qh2x4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbr5ru4onzf33f0qh2x4u.png" alt="Load balancer example diagram" width="501" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Consistency and Replication
&lt;/h4&gt;

&lt;p&gt;You now have a system running on multiple machines, as discussed in the horizontal scaling section, but what if this system stores data on the machines? An example of a system that does so is &lt;a href="https://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html"&gt;DynamoDB&lt;/a&gt;, a distributed database service. How do we guarantee that all the machines see the same data (i.e. data is consistent)? We use replicas, which are redundant copies of data in a system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replicating data involves maintaining copies of data across multiple replicas&lt;/strong&gt;. Updates to data are propagated to all replicas to ensure consistency. However, ensuring consistency introduces challenges such as synchronization and coordination of updates, because data is not static and changes over time. This brings the issue of defining the level of consistency we want. Below, some of the levels are explained.&lt;/p&gt;

&lt;h5&gt;
  
  
  Strong consistency
&lt;/h5&gt;

&lt;p&gt;In a service with strong consistency, every read receives the most recent write or an error. &lt;strong&gt;The system guarantees that all replicas will reflect the most recent write before any subsequent reads are allowed&lt;/strong&gt;. This ensures that all subsequent reads will return the updated value.&lt;/p&gt;

&lt;p&gt;Although this is the best consistency you can have in terms of reading the most recent data, it might hurt your performance, as for any write, all of the replicas must be updated before the data is available to be read. Strong consistency may also impose constraints on scalability: as the number of replicas increases, the complexity of maintaining strong consistency also grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: imagine a chat application where users can send and receive messages in real-time. In a system with strong consistency, when a user sends a message, all replicas of the chat data across different servers must be immediately updated with the new message. Subsequent reads by any user should then reflect the most recent message sent, ensuring that all users see the same conversation at any given time.&lt;/p&gt;
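&lt;p&gt;A toy sketch of the guarantee: the write below is only acknowledged after every replica has applied it, so a read from any replica returns the latest value (the class and its API are illustrative, not a real database client):&lt;/p&gt;

```python
# Toy strongly consistent store: a write blocks until ALL replicas apply it
class ReplicatedStore:
    def __init__(self, n_replicas=3):
        self.replicas = [dict() for _ in range(n_replicas)]

    def write(self, key, value):
        # The cost of strong consistency: every replica must be updated
        # before the write is acknowledged
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Any replica can serve the read; they are all identical
        return self.replicas[0][key]

store = ReplicatedStore()
store.write("msg", "hello")
print(store.read("msg"))  # hello
```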

&lt;h5&gt;
  
  
  Sequential consistency
&lt;/h5&gt;

&lt;p&gt;Sequential consistency is weaker than strong consistency but stronger than eventual consistency. It implies that the &lt;strong&gt;order of operations appears consistent to all processes&lt;/strong&gt;. However, unlike strong consistency, it does not require that this agreed order match the real-time order in which the operations actually occurred.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:  imagine a  banking system in which two customers, Alice and Bob, initiate withdrawals concurrently from their joint account. Although their withdrawal requests are processed by different servers, sequential consistency demands that all servers agree on the order in which these transactions occurred. If Alice's withdrawal is processed before Bob's, this order must be observed consistently across all servers. &lt;/p&gt;

&lt;p&gt;However, due to propagation delays, account balance updates might lag across replicas, leading to temporary discrepancies. Despite these delays, sequential consistency ensures that eventually, all replicas converge to a state where the order of transactions is consistent across the system, preserving the chronological sequence of events.&lt;/p&gt;

&lt;h5&gt;
  
  
  Eventual consistency
&lt;/h5&gt;

&lt;p&gt;This is the weakest form of consistency among the three. Eventual consistency allows for temporary inconsistencies among replicas, however, there is a guarantee that, &lt;strong&gt;given enough time, all replicas will eventually converge&lt;/strong&gt; to a consistent state.&lt;/p&gt;

&lt;p&gt;This approach assumes that achieving immediate consistency across distributed systems can be impractical or inefficient due to factors such as network latency and the volume of data being replicated. Eventual consistency prioritizes availability over immediate consistency.&lt;/p&gt;

&lt;p&gt;This model is commonly employed in distributed databases, like DynamoDB, content delivery networks, and other distributed systems where trade-offs between consistency, availability and performance needs to be carefully balanced to meet the requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: imagine a social media platform where users can post updates. After a user posts an update, different replicas across servers might not immediately reflect the new post. However, over time, as the updates propagate through the system, eventually all replicas will converge to a consistent state where every user sees the posted update.&lt;/p&gt;

&lt;p&gt;In practice, the user may open the platform, see the post, refresh the page, stop seeing it, then refresh the page again and see the post again. This can happen until all replicas have successfully received the post update and are consistent.&lt;/p&gt;




&lt;p&gt;Replication and consistency are important topics inside distributed systems (and also hard to understand on a first read). Replicating data provides many benefits, like keeping the data geographically close to the users to minimize latency, and increasing availability in case certain machines stop working.&lt;/p&gt;

&lt;p&gt;Here are some more references for you to dig into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf"&gt;Consistency Tradeoffs in Modern Distributed Database System Design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://citeseerx.ist.psu.edu/document?repid=rep1&amp;amp;type=pdf&amp;amp;doi=047a91395070abb5c75de446883aa18c52eb3274"&gt;Consistency in Partitioned Networks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crdt.tech/"&gt;Conflict-free Replicated Data Type (CRDT)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Partitioning
&lt;/h4&gt;

&lt;p&gt;While replication involves creating identical copies of data across multiple machines, with each machine storing a complete copy of the dataset, partitioning involves dividing the dataset into smaller subsets and distributing these subsets across different machines. Each machine stores only a portion of the overall dataset.&lt;/p&gt;

&lt;h5&gt;
  
  
  Why do we need partitioning?
&lt;/h5&gt;

&lt;p&gt;Partitioning enables systems to handle larger datasets and higher workloads by distributing the data across multiple machines. This allows for better resource utilization and accommodates growing demands without overwhelming individual machines.&lt;/p&gt;

&lt;p&gt;Partitioning also enhances fault tolerance by reducing the impact of individual machine failures. Even if one machine goes down, the system can continue to function using the data stored on the remaining machines. Additionally, partitioning often includes replication within partitions, ensuring that each subset of data is redundantly stored across multiple machines.&lt;/p&gt;
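&lt;p&gt;A common way to decide which machine owns which subset is hash partitioning. The sketch below is a hypothetical, minimal illustration (the partition count and user data are made up): each key is hashed to exactly one of N partitions, so lookups go straight to the machine that owns it.&lt;/p&gt;

```python
# Minimal sketch of hash partitioning: each key is assigned to one of
# N partitions by hashing, so every machine holds only a subset of the data.
import hashlib

N_PARTITIONS = 4

def partition_for(key: str) -> int:
    # A stable hash (not Python's randomized built-in hash()) keeps the
    # key -> partition mapping consistent across processes and restarts.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % N_PARTITIONS

# One dict per partition stands in for one machine's local storage.
partitions = {i: {} for i in range(N_PARTITIONS)}
for user_id in ["alice", "bob", "carol", "dave", "erin"]:
    partitions[partition_for(user_id)][user_id] = {"name": user_id}

def get(user_id: str):
    # A lookup only touches the single partition that owns the key.
    return partitions[partition_for(user_id)].get(user_id)
```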

&lt;h3&gt;
  
  
  Availability: how to stay up and running?
&lt;/h3&gt;

&lt;p&gt;Availability is understood as the percentage of time the system is up and operational, and it is typically expressed in number of nines: 1 nine = 90% availability, 2 nines = 99% availability, 3 nines = 99.9% availability, 4 nines = 99.99% availability, and so on.&lt;/p&gt;
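&lt;p&gt;It helps to translate those percentages into allowed downtime. A quick back-of-the-envelope calculation (ignoring leap years) shows how much each extra nine buys you:&lt;/p&gt;

```python
# Allowed downtime per year for each availability level ("number of nines").
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

# 2 nines (99%) allow roughly 87.6 hours of downtime per year,
# while 4 nines (99.99%) allow only about 52.6 minutes.
for pct in [90.0, 99.0, 99.9, 99.99]:
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")
```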

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/distributed-system-availability.html"&gt;Availability in distributed systems&lt;/a&gt; involves not only ensuring that individual components remain operational. It extends to the ability of the entire system to continue to providing seamless and uninterrupted service to users. There are many mechanisms used to achieve a high availability, such as fault tolerance ones, replication (that we also talked about previously) and much more.&lt;/p&gt;

&lt;h4&gt;
  
  
  Fault Tolerance
&lt;/h4&gt;

&lt;p&gt;100% availability is physically impossible, which is why we apply fault tolerance and recovery mechanisms: it is not a matter of "if", but "when" the system will fail. &lt;strong&gt;We can't prevent failures entirely, but we can design systems to be resilient and quickly recover from them&lt;/strong&gt;. Below, some of the techniques for doing so are explored.&lt;/p&gt;

&lt;h5&gt;
  
  
  Replication (again?)
&lt;/h5&gt;

&lt;p&gt;Yep, again! Replication helps achieve high availability by providing redundancy, ensuring that &lt;strong&gt;if one replica fails or becomes unavailable, others can continue to serve requests&lt;/strong&gt;. This improves the system's reliability and availability.&lt;/p&gt;

&lt;p&gt;The load balancer component we talked about before is also very important here, as it enhances availability by ensuring that replicas don't become overwhelmed with requests.&lt;/p&gt;
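&lt;p&gt;To make the interaction concrete, here is a hypothetical sketch (replica names and the health-flag mechanism are made up for illustration) of a round-robin load balancer that skips replicas marked unhealthy, so traffic keeps flowing even while one replica is down:&lt;/p&gt;

```python
# Round-robin load balancing over replicas, skipping unhealthy ones.
class LoadBalancer:
    def __init__(self, replicas):
        self.replicas = replicas          # replica name -> healthy flag
        self.order = list(replicas)
        self.next_idx = 0

    def route(self) -> str:
        # Try each replica at most once per request, starting where we left off.
        for _ in range(len(self.order)):
            replica = self.order[self.next_idx]
            self.next_idx = (self.next_idx + 1) % len(self.order)
            if self.replicas[replica]:
                return replica
        raise RuntimeError("no healthy replicas available")

lb = LoadBalancer({"replica-a": True, "replica-b": True, "replica-c": True})
first, second = lb.route(), lb.route()   # spreads load across replicas
lb.replicas["replica-b"] = False         # replica-b fails...
routed = [lb.route() for _ in range(4)]  # ...and stops receiving traffic
```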

&lt;h5&gt;
  
  
  Failure detection and Recovery
&lt;/h5&gt;

&lt;p&gt;There are many ways to detect failures; a very common one is having each service periodically send signals to indicate it is alive. These are called &lt;strong&gt;heartbeat signals&lt;/strong&gt;. Missing signals indicate a potential failure.&lt;/p&gt;

&lt;p&gt;Services should also monitor and track metrics like CPU usage and memory availability. This helps identify issues before they escalate into complete failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recovery&lt;/strong&gt;: there are several techniques that can be used for recovery, and depending on the scenario this step can be automated, with backup servers for example. If automatic recovery is not possible, the system should trigger an alert for manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: an e-commerce website uses a distributed system with multiple servers handling customer requests. A failure detection and recovery system continuously monitors these servers. If a server shows high CPU usage or stops sending heartbeats, the system detects the failure. It then isolates the failing server and initiates a failover, automatically switching traffic to a healthy server. This ensures that customer experience is minimally impacted, and purchases can continue uninterrupted.&lt;/p&gt;
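&lt;p&gt;The heartbeat-based detection described above can be sketched in a few lines. This is a hypothetical, simplified detector (server names and the 3-second timeout are arbitrary): a server is suspected dead once it has been silent for longer than the timeout.&lt;/p&gt;

```python
# Heartbeat-based failure detection: servers report in periodically;
# any server silent for longer than the timeout is suspected to have failed.
HEARTBEAT_TIMEOUT = 3.0  # seconds without a heartbeat before suspecting failure

class FailureDetector:
    def __init__(self):
        self.last_seen = {}  # server -> timestamp of its last heartbeat

    def heartbeat(self, server: str, now: float):
        self.last_seen[server] = now

    def failed_servers(self, now: float):
        return [s for s, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

detector = FailureDetector()
detector.heartbeat("web-1", now=0.0)
detector.heartbeat("web-2", now=0.0)
detector.heartbeat("web-1", now=2.0)   # web-1 keeps sending heartbeats
# At t=4.5, web-2 has been silent for 4.5s > 3s and is flagged as failed;
# a real system would now fail traffic over to the healthy servers.
suspected = detector.failed_servers(now=4.5)
```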

&lt;h5&gt;
  
  
  Graceful Degradation
&lt;/h5&gt;

&lt;p&gt;Graceful degradation is a strategy employed to &lt;strong&gt;maintain essential functionalities even when certain components experience failures or become unavailable&lt;/strong&gt;. Unlike failure detection and recovery mechanisms, graceful degradation emphasizes maintaining a baseline level of service quality even under adverse conditions. Also, it is important to note that both can and should be used together.&lt;/p&gt;

&lt;p&gt;As an example of graceful degradation helping provide high availability, imagine an online learning platform: if a video lecture experiences buffering issues due to high traffic, the platform can offer alternative delivery methods, like downloadable transcripts or audio-only versions, to ensure students can still access the learning material.&lt;/p&gt;
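&lt;p&gt;That fallback chain can be expressed as a simple pattern. The sketch below is hypothetical (the fetch functions and the simulated outage are made up): try the full experience first, then degrade to cheaper formats instead of failing entirely.&lt;/p&gt;

```python
# Graceful degradation: fall back through cheaper content formats
# instead of returning an error when the primary one is unavailable.
def fetch_video(lecture_id: str) -> str:
    raise TimeoutError("video servers overloaded")  # simulated outage

def fetch_transcript(lecture_id: str) -> str:
    return f"transcript for {lecture_id}"

def get_lecture(lecture_id: str) -> str:
    # Try the richest format first, then degrade to a baseline service.
    for fetch in (fetch_video, fetch_transcript):
        try:
            return fetch(lecture_id)
        except TimeoutError:
            continue
    return "lecture temporarily unavailable"

content = get_lecture("distributed-systems-101")
```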

&lt;h3&gt;
  
  
  Monitoring: ensuring visibility in distributed systems
&lt;/h3&gt;

&lt;p&gt;Maintaining visibility into performance, health, and behavior is essential for ensuring reliability and timely issue resolution. Monitoring plays an important role in achieving this objective, offering approaches to gain insights into system behavior and performance.&lt;/p&gt;

&lt;h4&gt;
  
  
  Metrics
&lt;/h4&gt;

&lt;p&gt;Metrics are quantitative measurements that capture different facets of system behavior and performance. In the context of distributed applications, the main metrics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: measuring the time it takes for requests to be processed or responses to be delivered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: tracking the rate at which requests are handled by the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Rates&lt;/strong&gt;: monitoring the frequency of errors or failures occurring within the application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Utilization&lt;/strong&gt;: assessing the usage of CPU, memory, disk space, and other resources.&lt;/li&gt;
&lt;/ul&gt;
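&lt;p&gt;These metrics are usually derived from raw request records. The sketch below is a hypothetical, minimal illustration (the request data and the 2-second window are made up) of how latency, error rate, and throughput fall out of a batch of records:&lt;/p&gt;

```python
# Deriving basic metrics from raw per-request records.
requests = [
    {"latency_ms": 12, "status": 200},
    {"latency_ms": 48, "status": 200},
    {"latency_ms": 350, "status": 500},
    {"latency_ms": 20, "status": 200},
]
window_seconds = 2  # the time window these requests were collected over

latencies = sorted(r["latency_ms"] for r in requests)
p50 = latencies[len(latencies) // 2]                     # upper-median latency
error_rate = sum(r["status"] >= 500 for r in requests) / len(requests)
throughput = len(requests) / window_seconds              # requests per second
```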

&lt;h4&gt;
  
  
  Dashboards
&lt;/h4&gt;

&lt;p&gt;Dashboards provide visual representations of metrics, allowing stakeholders to monitor the health and performance of distributed applications in real-time. Key characteristics of effective dashboards include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation&lt;/strong&gt;: aggregating metrics from multiple sources and components of the distributed system to provide a comprehensive view of its health.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;: utilizing charts, graphs, and other visualization techniques to present metrics in an intuitive and actionable manner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting&lt;/strong&gt;: integrating alerting mechanisms into dashboards to notify stakeholders when predefined thresholds or anomalies are detected.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Logs
&lt;/h4&gt;

&lt;p&gt;Logs play a crucial role in capturing detailed information about system events, errors, and transactions, helping in debugging and troubleshooting efforts. Key aspects of logs include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event Logging&lt;/strong&gt;: recording chronological records of system events, errors, and transactions, offering a timeline of system activity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Tracking&lt;/strong&gt;: identifying and logging errors or exceptions encountered within the application, assisting in diagnosing and resolving issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transaction Monitoring&lt;/strong&gt;: tracking the flow of transactions through the system, enabling the reconstruction of transaction paths for analysis and auditing purposes.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;A robust monitoring strategy for a distributed system incorporates the use of metrics, dashboards, and logs to ensure comprehensive visibility into system performance and behavior, facilitating proactive monitoring, troubleshooting, and optimization efforts.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's next?
&lt;/h3&gt;

&lt;p&gt;We have come to the end of this article, but there is much more to study on distributed systems. Here are some next steps, books and other articles for you to read. Also, feel free to leave any comments or suggestions!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding Distributed Systems (book by Roberto Vitillo)&lt;/li&gt;
&lt;li&gt;Designing Data-Intensive Applications (book by Martin Kleppmann)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.artima.com/weblogs/viewpost.jsp?thread=4247"&gt;Why Distributed Computing?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing"&gt;Fallacies of distributed computing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://s3.amazonaws.com/systemsandpapers/papers/hamilton.pdf"&gt;On Designing and Deploying Internet-Scale Services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://research.google/pubs/the-google-file-system/"&gt;The Google File System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lamport.azurewebsites.net/pubs/time-clocks.pdf"&gt;Time, Clocks, and the Ordering of Events in a Distributed System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf"&gt;Impossibility of Distributed Consensus with One Faulty
Process&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>systemdesign</category>
      <category>learning</category>
      <category>microservices</category>
    </item>
  </channel>
</rss>
