The data engineering field is undergoing a major transformation, driven by the shift to microservices and event-driven architecture. A decade or two ago, batch processing was the dominant application paradigm, but today businesses want speed, responsiveness and personalization, which has pushed them toward real-time data flows. At the core of this shift are event management systems, which let services publish, subscribe and react to events, and collaborate seamlessly. Just as central are event logs, which let businesses reconstruct sequences of events for transparency and replayability, and this explains the move to Apache Kafka. Kafka was originally built to ingest and process event streams, and since it was open sourced in 2011 it has become one of the most widely used platforms for real-time event streaming, powering data pipelines and service integration. As a distributed event store and stream processing platform, Kafka is flexible enough to serve as a queue, an event log and a stream processor, which makes it invaluable in data engineering.
Figure 1: Key Components of Apache Kafka
Source: https://www.datacamp.com/tutorial/apache-kafka-for-beginners-a-comprehensive-guide
Why Kafka?
You might ask: why Kafka? Kafka routes captured events to the services that need them promptly and reliably. Modern developers want scalability, which Kafka delivers by eliminating dependencies between services. It is not a simple message broker but an all-in-one platform that acts as a broker, an event store and a stream processor. Many different data sources can push streams of events into Kafka simultaneously, and Kafka guarantees durable storage and either sequential or parallel consumption of that data. The result is a system that handles high-volume, continuous, real-time data reliably.
Figure 2: Advantages of Kafka
Source: https://kafka.apache.org/
Some of the advantages of Kafka are shown in Figure 2:
- High throughput – Kafka's architecture is built for speed: partitioned topics, batched writes and append-only logs let the system move millions of messages at a time. A lightweight protocol for client-broker communication makes real-time streams achievable at large scale.
- Scalability – Kafka scales horizontally. Topics are split into partitions that can be spread across many brokers, so adding brokers and partitions lets a cluster absorb more producers, consumers and data without downtime.
- Low latency – Kafka can deliver messages at latencies as low as 2 milliseconds, which is invaluable. Decoupling producers from consumers keeps delivery fast even under heavy load.
- Durable and reliable – Kafka was created with fault tolerance in mind: partitions are replicated across brokers, so streaming continues even when a broker goes down.
- High availability – If one broker fails, leadership for its partitions automatically moves to other brokers, so clients keep producing and consuming data as if nothing broke down. This is very important when dealing with real-time data. A minimal producer sketch follows this list.
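The producer/consumer decoupling described above is easy to see in code. Below is a minimal sketch using the standard Kafka Java client; the broker address (localhost:9092), the topic name (clickstream) and the record key are placeholder assumptions, not values from any particular deployment.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClickEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address for this sketch.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all waits for the in-sync replicas, trading a little latency
        // for the durability guarantees described in the list above.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") determines the partition, so events for the
            // same user stay ordered within a single partition.
            producer.send(new ProducerRecord<>("clickstream", "user-42",
                    "{\"page\": \"/home\", \"ts\": 1700000000}"));
            producer.flush();
        }
    }
}
```

Any number of consumer groups can read the clickstream topic independently of this producer, which is exactly the decoupling that gives Kafka its scalability and availability properties.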
Data Engineering Applications
Kafka features heavily in data engineering; it sits at the core of moving, processing, storing and transforming data. Because it is so flexible, data engineers use Kafka to build scalable, fault-tolerant pipelines that stream data in real time. Additional applications of Kafka in data engineering are:
- Streaming real-time ETL – Traditionally, data engineers relied on batch ETL, where data was first collected, then transformed and loaded into a data warehouse, a process that took a lot of time. Kafka turns the pipeline into a stream: it captures real-time data from various sources, transforms it with Kafka Streams, and loads it through Kafka Connect, which pushes the transformed data into sinks. A short Kafka Streams sketch follows Figure 3.
Figure 3: Employing Apache Kafka, Apache Kafka Connect, and Apache Kafka Streams for the implementation of real-time ETL pipelines
Source: https://datacater.io/blog/2022-02-11/etl-pipeline-with-apache-kafka.html
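As a rough illustration of the streaming-ETL idea, here is a minimal Kafka Streams topology; the topic names (orders-raw, orders-clean), the broker address and the cleaning logic are assumptions made up for this sketch, not part of any referenced pipeline.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamingEtl {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-etl");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Extract: read raw events from the source topic.
        KStream<String, String> raw = builder.stream("orders-raw");
        // Transform: drop empty records and normalize the payload.
        raw.filter((key, value) -> value != null && !value.isEmpty())
           .mapValues(value -> value.trim().toLowerCase())
           // Load: write the cleaned stream to a topic that a Kafka Connect
           // sink connector can drain into a warehouse or data lake.
           .to("orders-clean");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```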
- Ingestion – Today, data lakes such as S3, HDFS and GCS are used to store raw and processed data of all sizes. Kafka fits in as the ingestion layer, ensuring streamed events arrive reliably. Through Kafka Connect, data engineers get sink connectors that feed data directly into S3, and schema management keeps the data consistent, so data flows into the lake continuously. A sample sink connector configuration follows Figure 4.
Figure 4: Data Ingestion using Kafka
Source: https://www.elastic.co/search-labs/blog/elasticsearch-apache-kafka-ingest-data
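To show what the ingestion layer looks like in practice, here is a minimal sketch of a Kafka Connect sink configuration in standalone .properties form. It assumes the Confluent S3 sink connector plugin is installed; the connector name, topic, bucket and region are placeholders.

```
name=clickstream-s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=clickstream
s3.bucket.name=my-data-lake
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
# Write an object to S3 after every 1000 records per partition.
flush.size=1000
```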
- Change data capture – Kafka lets data engineers capture database changes in real time. Tools like Debezium are central to this application: they publish inserts, updates and deletes as events to Kafka. A sample Debezium configuration follows Figure 5.
Figure 5: CDC Using Kafka
Source: https://medium.com/@darioajr/data-ingestion-with-apache-kafka-revolutionizing-data-integration-architectures-01faf91fbcb8
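For a concrete sense of CDC with Debezium, here is a minimal connector configuration sketch, assuming Debezium's PostgreSQL connector (2.x property names); the hostnames, credentials, database and table names are all placeholders.

```
name=inventory-cdc
connector.class=io.debezium.connector.postgresql.PostgresConnector
tasks.max=1
database.hostname=postgres.internal
database.port=5432
database.user=debezium
database.password=change-me
database.dbname=inventory
# topic.prefix names the change-event topics, e.g. inventory.public.orders
# for the table listed below.
topic.prefix=inventory
table.include.list=public.orders
plugin.name=pgoutput
```

Every insert, update and delete on public.orders then lands on a Kafka topic as an event, which downstream consumers can replay or stream into other systems.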
- Stream processing – Kafka processes real-time data with Kafka Streams, which lets engineers run joins and aggregations, or even detect anomalies, on data as it flows. Combined with other frameworks such as Flink, Kafka enables advanced event-driven analytics. A windowed aggregation sketch follows Figure 6.
Figure 6: Streaming with Kafka
Source: https://dzone.com/articles/real-world-examples-and-use-cases-for-apache-kafka
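As an illustration of the aggregation and anomaly-detection idea, here is a Kafka Streams sketch that counts events per key over one-minute windows. The payments topic, the keying by account ID and the threshold of 100 events per minute are assumptions made up for this example.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.TimeWindows;

public class PaymentSpikeDetector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-spike-detector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("payments")
               // Count payment events per account over tumbling 1-minute windows.
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .count()
               .toStream()
               // Flag accounts with an unusually high event rate as potential anomalies.
               .filter((windowedKey, count) -> count > 100)
               .map((windowedKey, count) -> KeyValue.pair(
                       windowedKey.key(), "suspicious: " + count + " payments in one minute"))
               .to("payment-anomalies");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```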
Use Cases
Companies like LinkedIn, Netflix and Uber handle large volumes of real-time information daily, which is why Kafka sits at the center of their operations. Kafka was originally created at LinkedIn to handle the ever-growing stream of page views, updates and clicks in real time. LinkedIn leverages Kafka's scalability to process millions of messages daily. The decoupling of producers and consumers is at the core of LinkedIn's setup, allowing data to be published and consumed independently, while Kafka's fault tolerance and high throughput keep the system reliable under heavy traffic.
Netflix also relies heavily on Kafka, notably for data-driven personalization and streaming. Netflix built its pipeline with Kafka as the base to handle the production and consumption of events: when you watch a movie, Kafka streams that data to personalization engines, which is why you always see "what to watch next" recommendations. Kafka also gives Netflix flexibility, as the company runs hybrid workloads that combine real-time stream processing with batch jobs.
At Uber, millions of events are processed within seconds: driver locations, ride requests, payments and routing information. Uber uses Kafka to tie these systems together. Kafka enables microservice-to-microservice communication, keeping the customer, rider and driver apps and the backend services in sync in real time. Uber also extended Kafka with tiered storage, which retains data for long periods by moving older data to cheaper storage while keeping recent data immediately available. Kafka is clearly core infrastructure at Uber, synchronizing rides, payments and logistics in real time.
Sources
- Apache Kafka for Beginners: A Comprehensive Guide | DataCamp. https://www.datacamp.com/tutorial/apache-kafka-for-beginners-a-comprehensive-guide
- What, why, when to use Apache Kafka, with an example · Start Data Engineering. https://www.startdataengineering.com/post/what-why-and-how-apache-kafka/
- Apache Kafka. https://kafka.apache.org/uses
- Apache Kafka Concepts, Fundamentals, and FAQs. https://developer.confluent.io/faq/apache-kafka/concepts/
- Real World Examples and Use Cases for Apache Kafka. https://dzone.com/articles/real-world-examples-and-use-cases-for-apache-kafka
- Data Ingestion with Apache Kafka: Revolutionizing Data Integration Architectures | by D@rio | Medium. https://medium.com/@darioajr/data-ingestion-with-apache-kafka-revolutionizing-data-integration-architectures-01faf91fbcb8
- How to ingest data to Elasticsearch through Kafka - Elasticsearch Labs. https://www.elastic.co/search-labs/blog/elasticsearch-apache-kafka-ingest-data
- Building Real-Time ETL Pipelines with Apache Kafka. https://datacater.io/blog/2022-02-11/etl-pipeline-with-apache-kafka.html