<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samuel Osoro</title>
    <description>The latest articles on DEV Community by Samuel Osoro (@sam_mabuka).</description>
    <link>https://dev.to/sam_mabuka</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3426579%2Fad1020cf-0cf5-4127-a789-3f95735f5059.jpg</url>
      <title>DEV Community: Samuel Osoro</title>
      <link>https://dev.to/sam_mabuka</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sam_mabuka"/>
    <language>en</language>
    <item>
      <title>Apache Kafka: A Deep Dive!</title>
      <dc:creator>Samuel Osoro</dc:creator>
      <pubDate>Tue, 09 Sep 2025 16:22:49 +0000</pubDate>
      <link>https://dev.to/sam_mabuka/apache-kafka-a-deep-dive-3ahd</link>
      <guid>https://dev.to/sam_mabuka/apache-kafka-a-deep-dive-3ahd</guid>
      <description>&lt;p&gt;Today, we are experiencing a lot of transformation in the data engineering field, especially with the shift to microservices and event-driven architecture. A decade, or two ago, batch processing was the primary application paradigm, but today, businesses are looking for speed, responsiveness and personalization, which has forced them to embrace real-time data flows. At the core of this paradigm shift lies event management systems, which allow businesses to publish, subscribe and react to events, including collaboration seamlessly.  We cannot forget the event logs, which are also at the center of event management system, where businesses can reconstruct events, to achieve transparency and replicability, and this explains the shift to Apache Kafka. Earlier, Apache Kafka was meant to ingest and process event streams, and since it was made open source in 2011, it has become one of the applications used in real-time event streaming, powering data pipelines, and integration of services. As a distributed event sore and stream processing, Apache Kafka is flexible to serve various roles such as a queue, vent log, and stream processes, making it invaluable in data engineering. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgh39bg3kcmjikxqmgve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgh39bg3kcmjikxqmgve.png" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;br&gt;
Figure 1: Key Components of Apache Kafka &lt;br&gt;
Source: &lt;a href="https://www.datacamp.com/tutorial/apache-kafka-for-beginners-a-comprehensive-guide" rel="noopener noreferrer"&gt;https://www.datacamp.com/tutorial/apache-kafka-for-beginners-a-comprehensive-guide&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Kafka?
&lt;/h2&gt;

&lt;p&gt;You’d probably ask: why Kafka? Kafka can route captured events to the services that need them promptly and reliably. Modern developers are looking for scalability, which Kafka offers by eliminating tight dependencies across services. It is not a simple message broker but an all-in-one platform, serving as a broker, an event store and a stream processor. Different sources of data can push streams of events into Kafka simultaneously, and Kafka guarantees durable storage and ordered or parallel consumption of that data. Ultimately, you get a system capable of handling high-volume, continuous, real-time data in a reliable manner. &lt;/p&gt;
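
&lt;p&gt;To make the publish/subscribe idea concrete, here is a minimal sketch in Python using the &lt;code&gt;kafka-python&lt;/code&gt; client. It assumes a broker on &lt;code&gt;localhost:9092&lt;/code&gt; and a hypothetical &lt;code&gt;orders&lt;/code&gt; topic; it is an illustration, not a production setup.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: publish and consume events with kafka-python (pip install kafka-python).
# Assumes a broker at localhost:9092 and an "orders" topic; adjust for your cluster.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# The producer publishes an event without knowing who will read it.
producer.send("orders", {"order_id": 42, "amount": 19.99})
producer.flush()

# A consumer in another service subscribes independently of the producer.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # {"order_id": 42, "amount": 19.99}
    break
&lt;/code&gt;&lt;/pre&gt;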

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzobhbij18zy6egxj0drm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzobhbij18zy6egxj0drm.png" alt=" " width="800" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 2: Advantages of Kafka&lt;br&gt;
Source: &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;https://kafka.apache.org/&lt;/a&gt;&lt;br&gt;
Some of the advantages of Kafka are shown in Figure 2:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; High throughput – Kafka’s architecture is built for speed: partitioned topics, batched writes, and append-only logs let the system move millions of messages at a go. A lightweight client-broker protocol makes real-time streams possible at large scale. &lt;/li&gt;
&lt;li&gt; Scalability – Kafka can scale horizontally. It achieves this by splitting topics into partitions that can be spread across many brokers, so adding brokers and consumers lets Kafka absorb more load without redesigning the system.&lt;/li&gt;
&lt;li&gt; Low latency – Kafka can deliver messages at latencies as low as 2 milliseconds, which is invaluable; decoupling producers from consumers helps keep delivery fast even when the load is heavy. &lt;/li&gt;
&lt;li&gt; Durable and reliable – Kafka was created with fault tolerance in mind: partitions are replicated across brokers, so streaming continues seamlessly even when a broker goes down. &lt;/li&gt;
&lt;li&gt; High availability – If one broker fails, leadership for its partitions automatically moves to other brokers, so clients can keep producing and consuming data as if nothing broke. This is very important when dealing with real-time data. &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Data Engineering Applications
&lt;/h2&gt;

&lt;p&gt;Kafka features heavily in data engineering. In fact, it sits at the core of moving, processing, storing and transforming data. Because it is flexible, data engineers use Kafka to create scalable, fault-tolerant pipelines that stream data in real time. The main applications of Kafka in data engineering are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Streaming Real-time ETL 
Traditionally, data engineers relied on batch ETL, where data was first collected, then transformed and loaded into a data warehouse at set intervals, which took a lot of time. Kafka turns the pipeline into a streaming one: it captures real-time data from various sources, transforms it with Kafka Streams, and loads it through Kafka Connect, which pushes the transformed data into sinks (a minimal sketch follows Figure 3). &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd343kk4emt5h5rihxsef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd343kk4emt5h5rihxsef.png" alt=" " width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Figure 3: Employing Apache Kafka, Apache Kafka Connect, and Apache Kafka Streams for the implementation of real-time ETL pipelines&lt;br&gt;
Source: &lt;a href="https://datacater.io/blog/2022-02-11/etl-pipeline-with-apache-kafka.html" rel="noopener noreferrer"&gt;https://datacater.io/blog/2022-02-11/etl-pipeline-with-apache-kafka.html&lt;/a&gt;&lt;/p&gt;
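
&lt;p&gt;As a rough illustration of the consume-transform-produce loop at the heart of streaming ETL: Kafka Streams itself is a Java library, so the Python sketch below (again using &lt;code&gt;kafka-python&lt;/code&gt;, with made-up topic names and fields) only mimics the idea.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: a tiny streaming ETL step in Python.
# Reads raw events, applies a transformation, and writes them to a "clean" topic.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw_orders",                        # hypothetical source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for record in consumer:
    event = record.value
    # Transform: skip incomplete events and normalise the amount to USD.
    if "amount" not in event:
        continue
    event["amount_usd"] = round(event["amount"] * event.get("fx_rate", 1.0), 2)
    producer.send("clean_orders", event)   # a sink connector could load this onward
&lt;/code&gt;&lt;/pre&gt;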

&lt;ol start="2"&gt;
&lt;li&gt; Ingestion
Today, data lakes, for example S3, HDFS and GCS among others, are used to store raw and processed data of various magnitudes. Kafka fits in as the ingestion layer, ensuring that the streamed events arrive reliably. Through Kafka Connect, data engineers have access to sink connectors that can feed data directly into S3 storage, and schema management further ensures data consistency. So, data engineers can feed data into the lakehouse continuously. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fppwzwtdug8bjy2y9p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fppwzwtdug8bjy2y9p2.png" alt=" " width="800" height="372"&gt;&lt;/a&gt;&lt;br&gt;
Figure 4: Data Ingestion using Kafka&lt;br&gt;
Source: &lt;a href="https://www.elastic.co/search-labs/blog/elasticsearch-apache-kafka-ingest-data" rel="noopener noreferrer"&gt;https://www.elastic.co/search-labs/blog/elasticsearch-apache-kafka-ingest-data&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt; Change Data Capture
With Kafka, data engineers can capture changes made in a database system in real time. Tools like Debezium are critical in this application, as they publish row-level inserts, updates and deletes as events to Kafka (a sketch follows Figure 5). &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3ww740xmy5pojk3yrdb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3ww740xmy5pojk3yrdb.png" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;br&gt;
Figure 5: CDC Using Kafka &lt;br&gt;
Source: &lt;a href="https://medium.com/@darioajr/data-ingestion-with-apache-kafka-revolutionizing-data-integration-architectures-01faf91fbcb8" rel="noopener noreferrer"&gt;https://medium.com/@darioajr/data-ingestion-with-apache-kafka-revolutionizing-data-integration-architectures-01faf91fbcb8&lt;/a&gt; &lt;/p&gt;
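
&lt;p&gt;To illustrate the idea, here is a small sketch that consumes Debezium-style change events and applies them to an in-memory copy of a table. The event shape (&lt;code&gt;op&lt;/code&gt;, &lt;code&gt;before&lt;/code&gt;, &lt;code&gt;after&lt;/code&gt;) loosely follows Debezium's envelope, but treat the details as assumptions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: apply CDC events (Debezium-style envelopes) to a local copy of a table.
# Assumes events shaped like {"op": "c" / "u" / "d", "before": {...}, "after": {...}}.
customers = {}   # keyed by primary key, a stand-in for a downstream table

def apply_change(event):
    op = event["op"]
    if op in ("c", "u", "r"):            # create, update, or snapshot read
        row = event["after"]
        customers[row["id"]] = row
    elif op == "d":                      # delete
        row = event["before"]
        customers.pop(row["id"], None)

apply_change({"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}})
apply_change({"op": "u", "before": {"id": 1, "name": "Ada"},
              "after": {"id": 1, "name": "Ada L."}})
print(customers)   # {1: {"id": 1, "name": "Ada L."}}
&lt;/code&gt;&lt;/pre&gt;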

&lt;ol start="4"&gt;
&lt;li&gt; Stream Processing
Kafka processes real-time data using Kafka Streams, which helps engineers perform joins and aggregations, or even detect anomalies, in real time. When used with other frameworks, for example Apache Flink, Kafka enables advanced event-driven analytics. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lhbhfg0it94b3wvarv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lhbhfg0it94b3wvarv2.png" alt=" " width="800" height="326"&gt;&lt;/a&gt; &lt;br&gt;
Figure 6: Streaming with Kafka&lt;br&gt;
Source: &lt;a href="https://dzone.com/articles/real-world-examples-and-use-cases-for-apache-kafka" rel="noopener noreferrer"&gt;https://dzone.com/articles/real-world-examples-and-use-cases-for-apache-kafka&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;p&gt;Companies like LinkedIn, Netflix and Uber handle huge volumes of real-time information daily, which is why Kafka is at the center of their systems. Kafka was originally created at LinkedIn to address the ever-growing amount of data from page views, updates, and clicks in real time. LinkedIn leverages Kafka’s scalability to process millions of messages daily. The decoupling of producers and consumers is at the core of LinkedIn’s architecture, allowing data to be published and consumed independently, while Kafka’s fault tolerance and high throughput keep the system reliable under heavy traffic. &lt;br&gt;
Netflix also relies heavily on Kafka. One of the areas where Netflix uses Kafka is data-driven personalization and streaming: Netflix built a pipeline with Kafka as its backbone to handle the production and consumption of events. When you watch a movie on Netflix, Kafka streams the viewing data to personalization engines, which is why you always see ‘what to watch next’ recommendations. Kafka also gives Netflix flexibility, in that the company runs hybrid workloads that mix real-time stream processing with batch jobs. &lt;br&gt;
At Uber, millions of events are processed within seconds: driver locations, ride requests, payments and routing information. Uber uses Kafka to tie these systems together. Kafka handles microservice-to-microservice communication, enabling real-time syncing between the rider and driver apps and the backend services. Uber also extended Kafka with tiered storage, so data is retained for a long time: older data moves to cheaper storage while recent data remains immediately available. Kafka is clearly core infrastructure at Uber, synchronizing rides, payments and logistics in real time. &lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Apache Kafka for Beginners: A Comprehensive Guide | DataCamp. &lt;a href="https://www.datacamp.com/tutorial/apache-kafka-for-beginners-a-comprehensive-guide" rel="noopener noreferrer"&gt;https://www.datacamp.com/tutorial/apache-kafka-for-beginners-a-comprehensive-guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;What, why, when to use Apache Kafka, with an example · Start Data Engineering. &lt;a href="https://www.startdataengineering.com/post/what-why-and-how-apache-kafka/" rel="noopener noreferrer"&gt;https://www.startdataengineering.com/post/what-why-and-how-apache-kafka/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Apache Kafka. &lt;a href="https://kafka.apache.org/uses" rel="noopener noreferrer"&gt;https://kafka.apache.org/uses&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Apache Kafka Concepts, Fundamentals, and FAQs. &lt;a href="https://developer.confluent.io/faq/apache-kafka/concepts/" rel="noopener noreferrer"&gt;https://developer.confluent.io/faq/apache-kafka/concepts/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Real World Examples and Use Cases for Apache Kafka. &lt;a href="https://dzone.com/articles/real-world-examples-and-use-cases-for-apache-kafka" rel="noopener noreferrer"&gt;https://dzone.com/articles/real-world-examples-and-use-cases-for-apache-kafka&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Data Ingestion with Apache Kafka: Revolutionizing Data Integration Architectures | by D@rio | Medium. &lt;a href="https://medium.com/@darioajr/data-ingestion-with-apache-kafka-revolutionizing-data-integration-architectures-01faf91fbcb8" rel="noopener noreferrer"&gt;https://medium.com/@darioajr/data-ingestion-with-apache-kafka-revolutionizing-data-integration-architectures-01faf91fbcb8&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;How to ingest data to Elasticsearch through Kafka - Elasticsearch Labs. &lt;a href="https://www.elastic.co/search-labs/blog/elasticsearch-apache-kafka-ingest-data" rel="noopener noreferrer"&gt;https://www.elastic.co/search-labs/blog/elasticsearch-apache-kafka-ingest-data&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Building Real-Time ETL Pipelines with Apache Kafka. &lt;a href="https://datacater.io/blog/2022-02-11/etl-pipeline-with-apache-kafka.html" rel="noopener noreferrer"&gt;https://datacater.io/blog/2022-02-11/etl-pipeline-with-apache-kafka.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Getting Started with Docker and Docker Compose: A Beginner’s Guide</title>
      <dc:creator>Samuel Osoro</dc:creator>
      <pubDate>Tue, 26 Aug 2025 15:40:59 +0000</pubDate>
      <link>https://dev.to/sam_mabuka/getting-started-with-docker-and-docker-compose-a-beginners-guide-48m7</link>
      <guid>https://dev.to/sam_mabuka/getting-started-with-docker-and-docker-compose-a-beginners-guide-48m7</guid>
      <description>&lt;p&gt;When students encounter tools like Apache Airflow in data engineering, the initial hurdle is rarely the concepts. It’s the setup. Installing dependencies, resolving conflicts, and making sure everything runs consistently across different computers can consume more time than actually learning the tool itself. This is where Docker—and its companion, Docker Compose—come in as game-changers for beginners.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Exactly Is Docker?
&lt;/h2&gt;

&lt;p&gt;At its core, Docker is a way of packaging software so it can run anywhere. Think of it as putting an entire mini-computer—complete with its operating system, libraries, and applications—inside a sealed box called a container. This container will behave the same way whether you run it on Windows, macOS, or Linux. For learners, this means you no longer need to worry about whether your laptop has the right version of Python or whether installing Airflow might break your existing projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Docker Matters for Beginners
&lt;/h2&gt;

&lt;p&gt;Without Docker, the process of installing a tool like Airflow can feel overwhelming. Different operating systems may require different installation steps, and small mistakes can cause big frustrations. With Docker, you don’t need to configure everything manually. Instead, you start a container that already knows how to run Airflow. In other words, Docker helps you focus on learning Airflow instead of fixing your computer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Docker Compose
&lt;/h2&gt;

&lt;p&gt;While Docker on its own is powerful, many modern applications are made up of several pieces working together. Airflow, for example, needs not only its core scheduler but also a web server, a database, and workers that handle tasks. Managing all of these by hand would be daunting.&lt;/p&gt;

&lt;p&gt;This is where Docker Compose comes in. Docker Compose acts like a project organizer. It lets you describe all the parts of your application—say, Airflow’s scheduler, database, and web server—in one simple file. With a single command, all these parts are launched together, already connected and ready to run. Instead of juggling multiple installations, you just “compose” them and let Docker handle the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example in Apache Airflow
&lt;/h2&gt;

&lt;p&gt;Apache Airflow is a workflow orchestration tool widely used in data engineering. Setting it up the traditional way often involves installing Python dependencies, configuring environment variables, and ensuring the right versions of databases and message brokers are available. For a beginner, this can feel like climbing a mountain before even writing a single workflow.&lt;/p&gt;

&lt;p&gt;With Docker and Docker Compose, that mountain becomes a short hill. You can run Airflow with all its components—scheduler, workers, database, and web interface—without manually installing each one. This allows you to start experimenting with designing workflows almost immediately. Instead of spending hours troubleshooting installations, you spend your time learning how Airflow schedules and runs tasks, which is the skill that truly matters.&lt;/p&gt;
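
&lt;p&gt;As a taste of what you can experiment with once the containers are up, here is a minimal Airflow DAG sketch in Python. It assumes a recent Airflow 2.x environment; the DAG id, schedule and task are purely illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: a minimal Airflow DAG with one Python task (assumes Airflow 2.x).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    print("Hello from Airflow running in Docker!")

with DAG(
    dag_id="hello_docker",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",       # run once a day
    catchup=False,           # do not backfill past runs automatically
) as dag:
    hello = PythonOperator(task_id="say_hello", python_callable=say_hello)
&lt;/code&gt;&lt;/pre&gt;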

&lt;h2&gt;
  
  
  Benefits
&lt;/h2&gt;

&lt;p&gt;For those just starting in data engineering, the biggest advantage of Docker and Compose is time. They reduce setup friction so you can quickly move to the fun part: building. By using containers, you also gain confidence that what works on your machine will work on someone else’s, whether that’s a teammate, an instructor, or a potential employer. This sense of consistency is a powerful ally when learning complex systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Docker and Docker Compose may sound like advanced tools at first, but they are, in fact, the beginner’s best friend. They remove barriers, simplify complex setups, and give students the freedom to focus on concepts rather than configuration. When applied to tools like Apache Airflow, Docker transforms what would be a painful installation process into a straightforward launchpad for exploration. For anyone stepping into data engineering, learning to use Docker is less about becoming an infrastructure expert and more about unlocking the ability to learn quickly and effectively.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>15 Data engineering Concepts</title>
      <dc:creator>Samuel Osoro</dc:creator>
      <pubDate>Mon, 11 Aug 2025 10:07:10 +0000</pubDate>
      <link>https://dev.to/sam_mabuka/15-data-engineering-concepts-faj</link>
      <guid>https://dev.to/sam_mabuka/15-data-engineering-concepts-faj</guid>
      <description>&lt;p&gt;Data engineering is the backbone of any modern data-driven organization. It involves designing and building systems that collect, process, and deliver data so it can be analyzed and turned into insights. Whether powering real-time dashboards, feeding machine learning models, or supporting business intelligence, data engineers work to ensure data is reliable, accessible, and timely. In this article, I will explore key concepts at the core of data engineering. Understanding these ideas will help you build robust data pipelines and scalable systems that meet the needs of today’s fast-moving digital world.&lt;/p&gt;

&lt;h3&gt;
  
  
  Batch vs Streaming Ingestion
&lt;/h3&gt;

&lt;p&gt;In data engineering, efficient data ingestion is an important consideration.  Batch ingestion involves collecting data in groups and processing it at set intervals—like once every hour or at the end of the day. This approach works well when real-time data isn’t essential and simplifies processing by handling large volumes at once. On the other hand, streaming ingestion processes data continuously as it arrives, allowing systems to respond to events instantly. Streaming is key for applications that need up-to-the-minute insights, such as fraud detection or live user analytics. Often, organizations use a combination of batch and streaming ingestion to balance performance, complexity, and timeliness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Change Data Capture
&lt;/h3&gt;

&lt;p&gt;Change Data Capture, or CDC, is a method to identify and capture changes made to data in a source system—like inserts, updates, or deletions—and propagate those changes downstream. Instead of reprocessing entire datasets, CDC enables incremental updates, which is much more efficient and reduces latency. This technique is especially valuable for keeping data warehouses or analytics systems synchronized with transactional databases in near real-time. By tracking only what has changed, CDC supports timely and accurate data flows without overwhelming your pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Idempotency
&lt;/h3&gt;

&lt;p&gt;In distributed data systems, operations can sometimes be retried due to failures or timeouts, which risks processing the same data multiple times. Idempotency ensures that performing the same operation repeatedly produces the same result as doing it once, preventing data duplication or corruption. Designing idempotent processes is essential for building reliable and fault-tolerant pipelines where retries and partial failures are common.&lt;/p&gt;
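
&lt;p&gt;A minimal sketch of one common pattern: remember which event ids have already been processed, so retries and duplicate deliveries have no extra effect. The in-memory set here is purely for illustration; real systems persist this state.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: idempotent event handling by de-duplicating on a stable event id.
processed_ids = set()    # in production this would live in a database or key-value store
account_balance = 0.0

def handle_deposit(event):
    global account_balance
    if event["event_id"] in processed_ids:
        return   # already applied; a retry changes nothing
    account_balance += event["amount"]
    processed_ids.add(event["event_id"])

deposit = {"event_id": "evt-123", "amount": 50.0}
handle_deposit(deposit)
handle_deposit(deposit)    # retried delivery of the same event
print(account_balance)     # 50.0, not 100.0
&lt;/code&gt;&lt;/pre&gt;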

&lt;h3&gt;
  
  
  OLTP v. OLAP
&lt;/h3&gt;

&lt;p&gt;Data systems generally fall into two categories: OLTP and OLAP. OLTP (Online Transaction Processing) focuses on handling a large number of short, atomic transactions, such as bank transfers or e-commerce purchases, where consistency and speed are critical. In contrast, OLAP (Online Analytical Processing) is designed for complex queries and data analysis over large datasets, supporting reporting, business intelligence, and decision-making. Understanding the difference helps data engineers choose the right storage, processing, and optimization strategies for each use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Columnar v. Row-based Storage
&lt;/h3&gt;

&lt;p&gt;The way data is stored greatly affects performance and efficiency. Row-based storage organizes data by rows, making it ideal for transactional workloads where entire records are read or written frequently. Conversely, columnar storage stores data by columns, which optimizes analytical queries that often scan only a few fields across many records. Columnar formats enable better compression and faster read times for aggregations, making them well-suited for data warehouses and OLAP systems. Choosing the right storage format depends on your workload patterns and query needs.&lt;/p&gt;
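
&lt;p&gt;A toy illustration of the difference: the same three records laid out row-wise and column-wise. An analytical query that sums one field only touches a single column in the columnar layout.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: the same data in a row-oriented and a column-oriented layout.
rows = [                                    # row-based: each record kept together
    {"id": 1, "country": "KE", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 25.0},
    {"id": 3, "country": "KE", "amount": 7.5},
]

columns = {                                 # columnar: each field stored contiguously
    "id": [1, 2, 3],
    "country": ["KE", "US", "KE"],
    "amount": [10.0, 25.0, 7.5],
}

# Analytical query: total amount. The columnar layout only reads one column.
total_from_rows = sum(r["amount"] for r in rows)
total_from_columns = sum(columns["amount"])
print(total_from_rows, total_from_columns)   # 42.5 42.5
&lt;/code&gt;&lt;/pre&gt;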

&lt;h3&gt;
  
  
  Partitioning
&lt;/h3&gt;

&lt;p&gt;Partitioning divides large datasets into smaller, manageable segments based on keys such as date, region, or customer ID. This organization improves query performance by allowing systems to scan only relevant partitions instead of the entire dataset. It also enables better parallelism during processing, reducing latency and resource usage. Effective partitioning is a key technique to scale data pipelines and optimize analytical workloads.&lt;/p&gt;
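
&lt;p&gt;A small sketch of date-based partitioning and partition pruning, using Hive-style directory paths as a stand-in for a real table layout (the bucket name is hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: date-partitioned paths and partition pruning.
from datetime import date, timedelta

BASE = "s3://my-bucket/events"          # hypothetical storage location

def partition_path(day):
    # Hive-style partition directory, e.g. .../event_date=2025-09-01/
    return f"{BASE}/event_date={day.isoformat()}/"

# A query over the last 3 days only needs to scan 3 partitions,
# not the whole dataset.
today = date(2025, 9, 9)
paths_to_scan = [partition_path(today - timedelta(days=i)) for i in range(3)]
print(paths_to_scan)
&lt;/code&gt;&lt;/pre&gt;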

&lt;h3&gt;
  
  
  ETL v. ELT
&lt;/h3&gt;

&lt;p&gt;ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two common approaches to processing data pipelines. In ETL, data is extracted from sources, transformed into the desired format outside the target system, and then loaded into the destination. This approach suits environments where transformations require specialized tools or where source systems have limited capacity. ELT reverses the last two steps: data is first loaded in its raw form into the target system, often a modern data warehouse, and then transformed internally using its processing power. ELT leverages scalable platforms and allows for more flexible and iterative transformations.&lt;/p&gt;

&lt;h3&gt;
  
  
  CAP Theorem
&lt;/h3&gt;

&lt;p&gt;The CAP theorem is a fundamental principle in distributed systems, stating that a system can only guarantee two out of three properties at the same time: Consistency, Availability, and Partition tolerance. Consistency ensures all nodes see the same data simultaneously. Availability means every request receives a response, even if some nodes fail. Partition tolerance means the system continues operating despite network failures between nodes. Data engineers must carefully balance these properties based on their application’s needs, often trading off strict consistency for higher availability or vice versa.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windowing in Streaming
&lt;/h3&gt;

&lt;p&gt;In streaming data systems, data flows continuously and unbounded, making it challenging to analyze events over time. Windowing solves this by grouping data into finite chunks based on time or event criteria. Common window types include tumbling windows (fixed, non-overlapping intervals), sliding windows (overlapping intervals), and session windows (based on periods of activity separated by inactivity). Windowing enables meaningful aggregations and analytics on streaming data, such as calculating metrics over the last five minutes or detecting user sessions.&lt;/p&gt;
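
&lt;p&gt;A minimal sketch of a tumbling window: each event carries a timestamp and is assigned to a fixed, non-overlapping five-minute bucket, and we count events per bucket (pure Python, just to show the idea):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: tumbling (fixed, non-overlapping) 5-minute windows over a stream of events.
from collections import defaultdict

WINDOW_SECONDS = 300
counts = defaultdict(int)

def window_start(ts):
    # Align the event timestamp (seconds since epoch) to its window boundary.
    return ts - (ts % WINDOW_SECONDS)

events = [
    {"user": "a", "ts": 1000},   # window starting at 900
    {"user": "b", "ts": 1250},   # window starting at 1200
    {"user": "a", "ts": 1400},   # same window as the previous event
]

for event in events:
    counts[window_start(event["ts"])] += 1

print(dict(counts))   # {900: 1, 1200: 2}
&lt;/code&gt;&lt;/pre&gt;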

&lt;h3&gt;
  
  
  DAGs and Workflow Orchestration
&lt;/h3&gt;

&lt;p&gt;Data pipelines often involve multiple interdependent tasks that need to run in a specific order. Directed Acyclic Graphs (DAGs) provide a way to model these workflows, where each node represents a task and edges define dependencies without cycles. Workflow orchestration tools like Apache Airflow use DAGs to schedule, manage, and monitor complex pipelines, ensuring tasks execute reliably and in the right sequence. This approach improves visibility, fault tolerance, and scalability in data engineering processes.&lt;/p&gt;
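
&lt;p&gt;To see what a set of tasks with dependencies and no cycles looks like in practice, here is a tiny sketch that sorts a hand-written DAG into a valid execution order. Orchestrators like Airflow do this scheduling for you; the task names here are invented.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: topologically order a tiny pipeline DAG (each task runs after its dependencies).
from graphlib import TopologicalSorter   # standard library in Python 3.9+

# Each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform"},
    "report": {"load", "quality_check"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)   # e.g. ['extract', 'transform', 'quality_check', 'load', 'report']
&lt;/code&gt;&lt;/pre&gt;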

&lt;h3&gt;
  
  
  Retry Logic and Dead Letter Queues
&lt;/h3&gt;

&lt;p&gt;Failures and errors are inevitable in distributed data systems. Retry logic allows systems to automatically attempt failed operations again, helping them to recover from temporary issues without manual intervention. However, when retries continue to fail, problematic data or messages need special handling to avoid blocking pipelines. Dead Letter Queues (DLQs) capture these failed records for later inspection and troubleshooting, ensuring data is not lost and enabling engineers to identify and fix underlying problems.&lt;/p&gt;
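
&lt;p&gt;A minimal sketch of retry logic with a dead letter queue: a handler is retried a few times with a simple backoff, and records that still fail are diverted to a DLQ (here just a list) instead of blocking the pipeline.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: retry a flaky operation, then divert persistent failures to a dead letter queue.
import time

MAX_RETRIES = 3
dead_letter_queue = []   # in production this would be a separate topic or queue

def process_with_retries(record, handler):
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            handler(record)
            return True
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
            time.sleep(0.1 * attempt)          # simple backoff between retries
    dead_letter_queue.append(record)           # give up and park the record for inspection
    return False

def flaky_handler(record):
    if record.get("corrupt"):
        raise ValueError("cannot parse record")

process_with_retries({"id": 1}, flaky_handler)                    # succeeds on the first try
process_with_retries({"id": 2, "corrupt": True}, flaky_handler)   # ends up in the DLQ
print(dead_letter_queue)   # [{'id': 2, 'corrupt': True}]
&lt;/code&gt;&lt;/pre&gt;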

&lt;h3&gt;
  
  
  Backfilling and Reprocessing
&lt;/h3&gt;

&lt;p&gt;Sometimes, data pipelines need to handle historical data that was missed or fix errors from previous runs. Backfilling involves loading and processing past data to fill gaps, ensuring datasets are complete. Reprocessing means rerunning transformations or computations on existing data to correct inaccuracies or apply updated logic. Both are essential for maintaining data quality and consistency in evolving systems.&lt;/p&gt;
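
&lt;p&gt;A small sketch of a backfill: re-running a daily job for a range of past dates to fill a gap. The &lt;code&gt;run_daily_job&lt;/code&gt; function is a placeholder for whatever the pipeline normally does.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: backfill a daily pipeline for a range of missed dates.
from datetime import date, timedelta

def run_daily_job(day):
    # Placeholder for the real pipeline run (extract, transform, load for that day).
    print(f"processing partition for {day.isoformat()}")

def backfill(start, end):
    # Re-run the job for every day from start to end, inclusive.
    total_days = (end - start).days + 1
    for offset in range(total_days):
        run_daily_job(start + timedelta(days=offset))

# Fill a three-day gap discovered in the warehouse.
backfill(date(2025, 8, 1), date(2025, 8, 3))
&lt;/code&gt;&lt;/pre&gt;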

&lt;h3&gt;
  
  
  Data Governance
&lt;/h3&gt;

&lt;p&gt;Data governance refers to policies, processes, and roles that ensure data is accurate, secure, and used responsibly. It encompasses data quality standards, access controls, compliance with regulations, and clear ownership. Strong governance builds trust in data assets and supports effective decision-making, especially as organizations face growing privacy and security requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time Travel and Data Versioning
&lt;/h3&gt;

&lt;p&gt;Modern data platforms often support time travel and data versioning, which allow users to query historical snapshots of data. This capability helps with auditing, debugging, and recovering from errors by enabling rollback to previous states. Data versioning tracks changes over time, ensuring reproducibility and transparency in data workflows—key for reliable analytics and compliance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributed Processing Concepts
&lt;/h3&gt;

&lt;p&gt;Handling large-scale data requires distributing computation and storage across multiple machines. Distributed processing frameworks like Apache Spark or Hadoop split tasks into smaller units that run in parallel, speeding up processing and improving fault tolerance. These systems coordinate resources, handle failures gracefully, and scale horizontally to meet growing data demands, forming the backbone of modern big data architectures.&lt;/p&gt;
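
&lt;p&gt;For a flavour of distributed processing, here is a short PySpark sketch that spreads a word count across whatever cores or executors are available. It assumes &lt;code&gt;pyspark&lt;/code&gt; is installed and running locally; the input lines are made up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: a tiny PySpark job; Spark splits the work into parallel tasks for us.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word_count_sketch").getOrCreate()

lines = spark.sparkContext.parallelize([
    "kafka streams events",
    "spark processes events in parallel",
])

counts = (
    lines.flatMap(lambda line: line.split())   # split lines into words
         .map(lambda word: (word, 1))          # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)      # sum counts per word across partitions
)
print(counts.collect())

spark.stop()
&lt;/code&gt;&lt;/pre&gt;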

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Data engineering is a complex but essential field that powers today’s data-driven decisions. By understanding concepts like ingestion methods, storage formats, distributed systems, and data governance, engineers can design pipelines that are reliable, scalable, and efficient. Mastering these fundamentals equips data professionals to build systems that turn raw data into valuable insights, driving business success in an increasingly digital world.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
