The amount of data produced in this world has exploded. Whether it’s from the Internet of Things (IoT), social media, app metrics, or device analytics, the amount of data that can be processed and analyzed at any given moment is staggering. Big data is on the rise, and data systems are tasked with handling it. But this raises the question: Are these systems up for the task?
Data systems are most useful when they correctly answer the right questions about all the data they contain. With big data, however, new data comes into the system every minute — and there’s a lot of it. The data system faces two fundamental challenges.
Challenge #1: The Latency Problem. I can get super-precise answers to your questions based on all the data streamed from the beginning of my existence until this very moment, but that will take me a while. I hope you don’t mind waiting.
Challenge #2: The Accuracy Problem. I’ve finally got my super-precise answers for you. Sadly, they took me so long to acquire, they’re no longer up-to-date and accurate.
Lambda Architecture is a big data paradigm, a way of structuring a data system to overcome the latency problem and the accuracy problem. Nathan Marz originated the idea, and he and James Warren described it in detail in their 2015 book Big Data. Lambda Architecture structures the system in three layers: a batch layer, a speed layer, and a serving layer. We will talk about these in detail shortly.
Lambda Architecture is horizontally scalable. This means that if your data set becomes too large or the data views you need are too numerous, all you need to do is add more machines. It also confines the most complex part of the system to the speed layer, where the outputs are temporary and can be discarded (every few hours) whenever refinement or correction is needed.
As mentioned above, Lambda Architecture is made up of three layers. The first layer, the batch layer, stores the entire data set and computes batch views. The stored data set is immutable and append-only: new data is continually streamed in and appended to the data set, but old data always remains unchanged. The batch views are queries or functions computed over the entire data set, and they can subsequently be queried for low-latency answers to questions about the whole data set. The drawback, however, is that computing these batch views takes a lot of time.
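To make this concrete, here is a minimal Python sketch of a batch layer. The pageview events and the view it computes are hypothetical, chosen only to illustrate the two defining traits: the master data set is append-only, and the batch view is recomputed from scratch over all of it.

```python
from collections import defaultdict

def compute_batch_view(master_dataset):
    """Recompute a batch view from scratch over the ENTIRE data set.

    Each event is a (user_id, page) tuple; this view answers
    "how many pageviews per page?" with a simple lookup later.
    """
    view = defaultdict(int)
    for _user, page in master_dataset:  # full scan: high latency, exact answer
        view[page] += 1
    return dict(view)

# The master data set is immutable and append-only: new events are
# appended, existing events are never modified or deleted.
master_dataset = [("u1", "/home"), ("u2", "/home"), ("u1", "/about")]
master_dataset.append(("u3", "/home"))  # append-only ingestion

print(compute_batch_view(master_dataset))  # {'/home': 3, '/about': 1}
```

In a real system the full scan would be a distributed job over a cluster rather than a Python loop, but the shape is the same: exact answers, paid for with latency proportional to the size of the whole data set.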
The second layer in Lambda Architecture is the serving layer. The serving layer loads in the batch views and, much like a traditional database, allows for read-only querying on those batch views, providing low-latency responses. As soon as the batch layer has a new set of batch views ready, the serving layer swaps out the now-obsolete set of batch views for the current set.
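A serving layer can be sketched as a read-only store whose contents are replaced wholesale each time the batch layer finishes a cycle. The class and view names below are hypothetical, not from any particular serving-layer product:

```python
class ServingLayer:
    """Read-only access to the latest batch views; swaps them wholesale."""

    def __init__(self):
        self._views = {}

    def load(self, new_views):
        # Replace the now-obsolete views as soon as the batch layer
        # delivers a freshly computed set.
        self._views = new_views

    def query(self, view_name, key):
        # Low-latency read: just a lookup, no computation.
        return self._views[view_name].get(key, 0)

serving = ServingLayer()
serving.load({"pageviews": {"/home": 3, "/about": 1}})
print(serving.query("pageviews", "/home"))   # 3
serving.load({"pageviews": {"/home": 10, "/about": 4}})  # next batch cycle
print(serving.query("pageviews", "/home"))   # 10
```

The key design point is that the serving layer never mutates a view in place; it only swaps one complete, consistent set of views for another.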
The third layer is the speed layer. The data that streams into the batch layer also streams into the speed layer. The difference is that while the batch layer keeps all of the data since the beginning of its time, the speed layer only cares about the data that has arrived since the last set of batch views completed. The speed layer makes up for the high latency of computing batch views by processing queries on the most recent data, the data that the batch views have yet to take into account.
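In the same hypothetical pageview setting, a speed layer would update its real-time view incrementally as each event arrives, then discard that view once a new set of batch views has absorbed those events. A minimal sketch:

```python
from collections import defaultdict

class SpeedLayer:
    """Maintains a real-time view of only the most recent events."""

    def __init__(self):
        self.realtime_view = defaultdict(int)

    def ingest(self, event):
        _user, page = event
        self.realtime_view[page] += 1  # incremental, low-latency update

    def reset(self):
        # Discard the real-time view once a freshly computed set of
        # batch views has taken these events into account.
        self.realtime_view.clear()

speed = SpeedLayer()
speed.ingest(("u4", "/home"))
speed.ingest(("u5", "/pricing"))
print(dict(speed.realtime_view))  # {'/home': 1, '/pricing': 1}
speed.reset()  # the batch views have caught up; start over
```

Note the contrast with the batch layer: the speed layer never rescans history, it only folds each new event into its view, which is why it stays fast and why its answers are only approximate.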
Consider the elderly man who lives alone in a giant mansion. Every room of his mansion has a clock, but except for the clock in the kitchen, all of the clocks are wrong. The man decides one day that he will set all of the clocks, using the kitchen clock as the correct time. But, his memory is poor, so he writes down the current time on the kitchen clock (9:04 AM) on a piece of paper. He begins his slow walk around the mansion setting all of his clocks to 9:04 AM. What happens? By the time he gets to the very last clock in the guest bedroom of the east wing, it is 9:51 AM. He sets this clock too, as he has done with all of the other clocks, to the time written on his paper — 9:04 AM. No wonder all of his clocks are wrong!
This is the problem we would experience if a data system only had a batch layer. The answers to the questions we’re asking would no longer be up-to-date because it took so long to get those answers.
Fortunately, the man remembers that he has an old runner’s stopwatch. The next day, he starts again in the kitchen at 9:04 AM. He writes down the time on a piece of paper. He starts his stopwatch—his speed layer—and begins walking around his mansion. Now, when he gets to the very last clock in the east wing, he sees the paper with 9:04 AM written, and he sees his stopwatch says “47 minutes and 16 seconds”. With a little bit of math, he knows to set this last clock to 9:51 AM.
In this analogy — which, admittedly, is not perfect — the man is the serving layer. He takes the batch view (the paper with “9:04 AM” written on it) around his mansion to answer “What time is it?” But, he also does the additional work of reconciling the batch view with the speed layer to get the most accurate answer possible.
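The reconciliation the man performs (paper plus stopwatch) is exactly what a serving layer does at query time: add the fresh speed-layer delta to the stale-but-exact batch view. A sketch, with hypothetical pageview counts standing in for clocks:

```python
def query(page, batch_view, realtime_view):
    """Reconcile the stale-but-exact batch view with the fresh
    speed-layer delta -- like adding the stopwatch to 9:04 AM."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

batch_view = {"/home": 3, "/about": 1}       # exact, but computed a while ago
realtime_view = {"/home": 1, "/pricing": 1}  # events since the batch run

print(query("/home", batch_view, realtime_view))     # 4
print(query("/pricing", batch_view, realtime_view))  # 1
```

Simple addition works here because counting is an easy function to merge; for more complex views, the merge logic is correspondingly more involved, which is part of why the analogy is admittedly imperfect.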
In Big Data, Marz and Warren’s seminal book on Lambda Architecture, the authors list eight desirable properties of a big data system and describe how Lambda Architecture satisfies each one:
Robustness and fault tolerance. Because the batch layer is designed to be append-only, containing the entire data set since the beginning of time, the system is human-fault tolerant. If there is any data corruption, then all of the data from the point of corruption forward can be deleted and replaced with correct data. Batch views can be swapped out for completely recomputed ones. The speed layer can be discarded. In the time it takes to generate a new set of batch views, the entire system can be reset and running again.
Scalability. Lambda Architecture is designed with layers built as distributed systems. By simply adding more machines, end users can easily horizontally scale those systems.
Generalization. Since Lambda Architecture is a general paradigm, adopters aren’t locked into a specific way of computing this or that batch view. Batch views and speed layer computations can be designed to meet the specific needs of the data system.
Extensibility. As new types of data enter the data system, new views will become necessary. Data systems are not locked into certain kinds or a certain number of batch views. New views can be coded and added to the system, with the only constraint being resources, which are easily scalable.
Ad hoc queries. If necessary, the batch layer can support ad hoc queries that were not available in the batch views. Assuming the high-latency for these ad hoc queries is permissible, then the batch layer’s usefulness is not restricted only to the batch views it generates.
Minimal maintenance. Lambda Architecture, in its typical incarnation, uses Apache Hadoop for the batch layer and ElephantDB for the serving layer. Both are fairly simple to maintain.
Debuggability. The inputs to the batch layer’s computation of batch views are always the same: the entire data set. In contrast to debugging views computed on a snapshot of a stream of data, the inputs and outputs for each layer in Lambda Architecture are not moving targets, vastly simplifying the debugging of computations and queries.
Low latency reads and updates. In Lambda Architecture, the last property of a big data system is fulfilled by the speed layer, which offers real-time queries of the latest data set.
While the advantages of Lambda Architecture seem numerous and straightforward, there are some disadvantages to keep in mind. First and foremost, cost will become a consideration. Although scaling itself is simple—just add more machines—the batch layer necessarily expands over time. Since all data is append-only and nothing in the batch layer is discarded, the cost of storing and processing it grows with time.
Others have noted the challenge of maintaining two separate sets of code to compute views for the batch layer and the speed layer. Both layers operate on the same set—or, in the case of the speed layer, subset—of data, and the questions asked of both layers are similar. However, because the two layers are built on completely different systems (for example, Hadoop or Snowflake for the batch layer, but Storm or Spark for the speed layer), code maintenance for two separate systems can be complicated.
In the field of machine learning, there’s no doubt that more data is better. For machine learning to apply algorithms or detect patterns, however, it needs to receive its data in a way that makes sense. Rather than receiving data from different directions without any semblance of structure, machine learning can benefit by processing data through a Lambda Architecture data system first. From there, machine learning algorithms can ask questions and begin to make sense of the data that enters the system.
While machine learning might be on the output side of a Lambda Architecture, IoT might very well be on the input side of the data system. Imagine a city of millions of automobiles, each one equipped with sensors to send data on weather, air quality, traffic, location information, driving habits, and so on. This is the massive stream of data that would be fed into the batch layer and speed layer of a Lambda Architecture. IoT devices are a perfect example of where the data in big data comes from.
We noted above that “the speed layer only cares about the data that has arrived since the completion of the last set of batch views.” While this is true, it’s important to make a clarification: that small subset of data is not stored; it is processed immediately as it streams in, and then it is discarded. The speed layer is also referred to as the “stream-processing layer.” Remember that the goal of the speed layer is to provide low-latency, real-time views of the most recent data, the data that the batch views have yet to take into account.
On this point, the original authors of the Lambda Architecture refer to “eventual accuracy,” noting that the batch layer strives for exact computation while the speed layer strives for approximate computation. The approximate computation of the speed layer will eventually be replaced by the next set of batch views, moving the system towards “eventual accuracy.”
Processing the stream in real-time in order to produce views that are constantly updated as new data streams in (on the order of milliseconds) is an incredibly complex task. Partnering a document-based database with an indexing and querying system is often recommended in these cases.
We noted above that a considerable disadvantage of Lambda Architecture is maintaining two separate code bases to handle similar processing since the batch layer and the speed layer are different distributed systems. Kappa Architecture seeks to address this concern by removing the batch layer altogether.
Instead, both the real-time views computed from recent data and the batch views computed from all data are produced within a single stream-processing layer. The entire data set—the append-only log of immutable data—streams through the system quickly in order to produce the views with the exact computations. Meanwhile, the original “speed layer” tasks from Lambda Architecture are retained in Kappa Architecture, still providing the low-latency views with approximate computations.
This difference means Kappa Architecture maintains a single system for generating views, which simplifies the code base considerably.
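The Kappa idea can be sketched with the same hypothetical pageview events: one processing function serves both roles. Replaying the entire append-only log through it recomputes the exact view (the old batch layer's job), while feeding it only new events keeps a live view current (the old speed layer's job):

```python
from collections import defaultdict

def process(events, view=None):
    """Kappa style: a single stream-processing code path.

    Replaying the full append-only log rebuilds the exact view;
    feeding in only new events keeps a live view up to date.
    """
    view = view if view is not None else defaultdict(int)
    for _user, page in events:
        view[page] += 1
    return view

log = [("u1", "/home"), ("u2", "/home"), ("u1", "/about")]

full_view = process(log)                # "batch" result: replay the whole log
process([("u3", "/home")], full_view)   # the same code handles live updates
print(dict(full_view))                  # {'/home': 3, '/about': 1}
```

Because both paths run the same function, there is only one code base to maintain, which is precisely the duplication problem Kappa Architecture sets out to remove.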
Coordinating and deploying the various tools needed to support a Lambda Architecture—especially when you’re in the starting up and experimenting stage—is accomplished easily with Docker.
Heroku serves well as a container-based cloud platform-as-a-service (PaaS), allowing you to deploy and scale your applications with ease. For the batch layer, you would likely deploy a docker container for Apache Hadoop. As the speed layer, you might consider deploying Apache Storm or Apache Spark. Lastly, for the serving layer, you could deploy docker containers for Apache Cassandra or MongoDB, coupled with indexing and querying by Elasticsearch.
Taking on the task of big data is not for the faint of heart. Scaling and system robustness are huge challenges, so paradigms like Lambda Architecture bring excellent guidance. As massive amounts of data stream into the data system, the batch layer provides high-latency accuracy, while the speed layer provides a low-latency approximation. Meanwhile, the serving layer responds to queries by reconciling these two views to provide the best possible response. Implementing a Lambda Architecture data system is not trivial. While perhaps complex and initially intimidating, the tools are available and ready to be deployed.