<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hilary Wambwa</title>
    <description>The latest articles on DEV Community by Hilary Wambwa (@hilary_wambwa_dd49d0404fa).</description>
    <link>https://dev.to/hilary_wambwa_dd49d0404fa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3377706%2F4e03174e-b8f5-48f3-96a2-5e99503febae.jpg</url>
      <title>DEV Community: Hilary Wambwa</title>
      <link>https://dev.to/hilary_wambwa_dd49d0404fa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hilary_wambwa_dd49d0404fa"/>
    <language>en</language>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices</title>
      <dc:creator>Hilary Wambwa</dc:creator>
      <pubDate>Thu, 11 Sep 2025 10:42:17 +0000</pubDate>
      <link>https://dev.to/hilary_wambwa_dd49d0404fa/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-2g6d</link>
      <guid>https://dev.to/hilary_wambwa_dd49d0404fa/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-2g6d</guid>
      <description>&lt;h3&gt;
  
  
  Kafka Architecture
&lt;/h3&gt;

&lt;p&gt;Kafka is a distributed streaming system that uses producers to publish data and consumers to subscribe to that data in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topics&lt;/strong&gt; are divided into &lt;strong&gt;partitions&lt;/strong&gt;. A &lt;strong&gt;broker&lt;/strong&gt; hosts several partitions, and a &lt;strong&gt;cluster&lt;/strong&gt; is a group of brokers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brokers&lt;/strong&gt; are servers that store and serve data. A cluster is made up of several brokers in communication. A broker stores topics in partitions.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;topic&lt;/strong&gt; is a logical stream of data. Think of it as a table in a structured database. Like data in a table, events are written to a topic.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;partition&lt;/strong&gt; is a unit of parallelism and scalability. Think of it as a subtopic. It is what makes Kafka distributed as partitions are distributed to brokers. A topic with 10 partitions can be processed by up to 10 consumers in parallel within a consumer group.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;offset&lt;/strong&gt; is the position of a record in a partition. A partition contains messages, and each message is assigned a unique, automatically increasing offset marking its position in the partition’s log.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ZooKeeper&lt;/strong&gt; is the historic external coordinator for Kafka. If Kafka is the roads and highways transporting data, ZooKeeper is the traffic command center that assigns partitions, configures topics, and monitors broker health. Since it is external, data engineers have to configure it separately, and if ZooKeeper fails, the entire Kafka cluster goes down with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka Raft (KRaft)&lt;/strong&gt; removes the ZooKeeper dependency and lets Kafka manage its own ‘traffic’: the brokers elect a ‘controller’ from a quorum of brokers to configure topics and assign partitions. If the controller fails, another broker takes over.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster Setup and Scaling
&lt;/h3&gt;

&lt;p&gt;Scaling in Kafka means accommodating extra load by adding more brokers. This happens in three steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A new broker registers with the cluster, updating the cluster metadata.&lt;/li&gt;
&lt;li&gt;Existing partitions are moved to the new broker to balance the load/redistribute.&lt;/li&gt;
&lt;li&gt;Consumers adjust to new partition assignments to maintain parallel processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Producers and Consumers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Producers&lt;/strong&gt;&lt;br&gt;
These produce, write or publish data into topics.&lt;/p&gt;

&lt;p&gt;Each message can have a key attached that determines its partition destination. Think of partitions as separate lines in a distribution center. If you use the same key (say, a user’s ID), all messages for that user go to the same partition, keeping them in order. For example, if you’re tracking orders for “User123,” the producer sends each order event to the “orders” topic with that key, Kafka picks the partition from the key, and all of that user’s order events land in one partition.&lt;/p&gt;
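&lt;p&gt;As a rough sketch of how key-based routing works (the real Java client hashes keys with murmur2; this hypothetical &lt;em&gt;partition_for&lt;/em&gt; helper uses MD5 only to stay deterministic and dependency-free):&lt;/p&gt;

```python
import hashlib

def partition_for(key, num_partitions):
    """Map a message key to a partition deterministically.

    The real Java client uses murmur2; MD5 here just keeps the
    sketch deterministic without extra dependencies.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always lands on the same partition, preserving order.
p1 = partition_for("User123", 10)
p2 = partition_for("User123", 10)
```

&lt;p&gt;Because the hash is deterministic, every event keyed by “User123” maps to the same partition, which is exactly what preserves per-user ordering.&lt;/p&gt;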

&lt;p&gt;&lt;strong&gt;Acknowledgment modes&lt;/strong&gt;&lt;br&gt;
The acks setting controls how the producer waits for brokers to acknowledge a message.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;acks=0&lt;/em&gt;: The producer sends a message and does not wait for an acknowledgement. Fast but risky.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;acks=1&lt;/em&gt;: The producer waits for the lead broker to acknowledge the message. However, if the lead broker crashes before copying the message to other brokers, data is lost.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;acks=all&lt;/em&gt;: The producer waits for all backup brokers (in-sync replicas) to acknowledge the message. Slow but reliable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For critical stuff like bank transactions, you’d go with acks=all. Netflix, for example, uses this for their event streaming to make sure no data gets lost, as they handle billions of events daily.&lt;/p&gt;
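&lt;p&gt;For a reliability-first producer, the relevant settings might look like this (illustrative values; acks, retries, and enable.idempotence are standard Kafka producer configs):&lt;/p&gt;

```properties
# producer.properties -- favour durability over raw speed
acks=all                    # wait for all in-sync replicas to confirm
retries=2147483647          # keep retrying transient broker errors
enable.idempotence=true     # retried sends are deduplicated by the broker
```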

&lt;p&gt;&lt;strong&gt;Consumers&lt;/strong&gt;&lt;br&gt;
These consume, read, or subscribe to data from topics.&lt;/p&gt;

&lt;p&gt;Consumer groups are like a team of workers. The consumers in one group are assigned to the partitions of a topic. If a consumer fails, Kafka reassigns its partitions to another consumer in the group.&lt;/p&gt;

&lt;p&gt;There are two ways for consumers to keep track of the last message processed, using offsets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic Commits: Offsets are saved every few seconds (set by auto.commit.interval.ms). Easy, but prone to skips and duplicates if anything crashes mid-process.&lt;/li&gt;
&lt;li&gt;Manual Commits: The consumer decides when to save its progress, giving more control.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Message Delivery &amp;amp; Processing
&lt;/h3&gt;

&lt;p&gt;Message Delivery Semantics are mechanisms about whether a message sent by a producer will be delivered to a consumer and under what conditions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;At-most-once:&lt;/em&gt; Message is delivered zero or once. Might get lost, but never duplicated.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;At-least-once:&lt;/em&gt; message is delivered one or more times. Won’t get lost, but duplicates are possible.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Exactly-once:&lt;/em&gt; message is delivered exactly one time. No losses, no duplicates.&lt;/li&gt;
&lt;/ul&gt;
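&lt;p&gt;A toy, in-memory sketch (not the real wire protocol) of why blind retries give at-least-once delivery, and how deduplicating on a unique ID recovers effectively-exactly-once processing:&lt;/p&gt;

```python
# Toy in-memory "broker log" illustrating delivery semantics.
log = []

def send(message, ack_lost=False):
    """Append to the log; if the ack is lost, the producer blindly retries."""
    log.append(message)
    if ack_lost:
        log.append(message)   # the retry writes a duplicate

send("payment-1")
send("payment-2", ack_lost=True)    # at-least-once: no loss, but a duplicate
# log == ["payment-1", "payment-2", "payment-2"]

# Deduplicating on a unique ID downstream gives effectively exactly-once:
seen = set()
processed = [m for m in log if not (m in seen or seen.add(m))]
# processed == ["payment-1", "payment-2"]
```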

&lt;p&gt;Messages are retained via three policies:&lt;br&gt;
&lt;strong&gt;Time-based (e.g., keep data for 7 days)&lt;/strong&gt;&lt;br&gt;
Messages are deleted after a specified time. Perfect when data is only relevant for a limited period. Example: a building management system that monitors temperature readings from IoT sensors. It only needs the last few minutes or hours of data to make decisions, and can discard messages once a decision is made.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Size-based (e.g., max 1 GB).&lt;/strong&gt; &lt;br&gt;
This limits the space a topic partition can use. If the limit is exceeded, Kafka starts deleting the oldest messages. Useful where control over storage costs is critical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log compaction (keep the latest value per key).&lt;/strong&gt;&lt;br&gt;
Only the most recent message for each key in a topic is kept. This maintains a single source of truth per key, ideal for tracking the latest state of an entity, e.g., user profile updates.&lt;/p&gt;
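&lt;p&gt;Log compaction’s end state is easy to picture in plain Python: later records for a key replace earlier ones, so only the newest value per key survives (a sketch of the outcome, not of Kafka’s background cleaner):&lt;/p&gt;

```python
# A compacted topic's end state: later records replace earlier ones per key.
log = [
    ("user123", "address=Nairobi"),
    ("user456", "address=Mombasa"),
    ("user123", "address=Kisumu"),   # newer update for the same key wins
]

compacted = dict(log)
# compacted == {"user123": "address=Kisumu", "user456": "address=Mombasa"}
```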

&lt;h3&gt;
  
  
  Back pressure &amp;amp; Flow Control
&lt;/h3&gt;

&lt;p&gt;Back pressure is when producers write messages faster than consumers can read them. This can lead to consumers crashing or performance slowing down.&lt;/p&gt;

&lt;p&gt;There are several ways to handle slow consumers;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Consumer group scalability:&lt;/em&gt; Adding more consumers to handle the available partitions thus distributing load and reducing back pressure.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Pause and resume:&lt;/em&gt; Consumers can pause and resume reading messages from specific partitions to clear a backlog without being overwhelmed.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Buffer management:&lt;/em&gt; Consumers use buffers configured to limit the amount of data read per request. Lowering the limit prevents a backlog of data in memory, reducing overload.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How can we monitor the speed of consumers?&lt;/strong&gt;&lt;br&gt;
Consumer lag measures how far behind a consumer is relative to the latest message in a partition:&lt;br&gt;
Lag = Latest Offset - Consumer Offset&lt;/p&gt;
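&lt;p&gt;The lag formula is just per-partition arithmetic; a small sketch (the offset dictionaries here are made-up inputs, not a Kafka API):&lt;/p&gt;

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag: latest offset minus the consumer's committed offset."""
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

latest = {0: 1500, 1: 980}      # head of the log per partition
committed = {0: 1500, 1: 730}   # how far the consumer has read
lag = consumer_lag(latest, committed)
# lag == {0: 0, 1: 250}: partition 1 is 250 messages behind
```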

&lt;p&gt;&lt;strong&gt;Monitoring tools&lt;/strong&gt;&lt;br&gt;
The kafka-consumer-groups.sh tool displays lag for each partition in a consumer group.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Command line&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group order-consumers --describe&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics APIs&lt;/strong&gt;&lt;br&gt;
Kafka exposes JMX metrics such as the consumer’s &lt;em&gt;records-lag-max&lt;/em&gt; for monitoring. Prometheus can scrape these metrics and Grafana can chart them in real-time dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third-Party Tools&lt;/strong&gt;&lt;br&gt;
Confluent Control Center or Burrow provide user-friendly interfaces for lag monitoring. According to a Confluent blog, monitoring lag is critical for ensuring SLAs in production systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Serialization &amp;amp; Deserialization
&lt;/h2&gt;

&lt;p&gt;In Kafka, messages are stored and transmitted as raw bytes. Serialization is the process of converting data, like a JSON object, into a byte stream that Kafka can handle. Deserialization takes that byte stream and turns it back into something meaningful, like a Python dictionary or a JSON object, for the consumer to process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON:&lt;/strong&gt; It’s human-readable, widely supported, and great for quick prototyping. Fine for getting started, but in production with numerous messages, its size and lack of schema management can become a hindrance.&lt;br&gt;
Example:&lt;br&gt;
&lt;em&gt;{&lt;br&gt;
  "order_id": "12345",&lt;br&gt;
  "amount": 99.99&lt;br&gt;
}&lt;/em&gt;&lt;/p&gt;
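&lt;p&gt;A minimal round trip with Python’s standard json module shows both halves (these hypothetical serialize/deserialize helpers mirror what a producer’s value_serializer and a consumer’s value_deserializer would do):&lt;/p&gt;

```python
import json

def serialize(value):
    """Producer side: turn a dict into the raw bytes Kafka stores."""
    return json.dumps(value).encode("utf-8")

def deserialize(raw):
    """Consumer side: turn the byte stream back into a dict."""
    return json.loads(raw.decode("utf-8"))

order = {"order_id": "12345", "amount": 99.99}
raw = serialize(order)
# raw == b'{"order_id": "12345", "amount": 99.99}'
assert deserialize(raw) == order
```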

&lt;p&gt;&lt;strong&gt;Avro:&lt;/strong&gt; a binary format designed for efficiency. Utilizes a schema used by producers and consumers to maintain consistency.&lt;br&gt;
The schema is stored separately in the Confluent Schema Registry. This way, one can add new fields to schema or make them optional without breaking consumers.&lt;br&gt;
Here’s an example Avro schema for an order:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yjiuqbnjzwfdn8l64a0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yjiuqbnjzwfdn8l64a0.png" alt=" " width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When serialized, this schema produces a compact binary representation, unlike JSON’s bulky text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Protobuf:&lt;/strong&gt; Like Avro, it’s a binary format, but it’s designed for cross-language compatibility (Java, Python, C++) and high performance; its payloads are often smaller than Avro’s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8sm2q0wh66dpsnma809.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8sm2q0wh66dpsnma809.png" alt=" " width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Kafka Concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Replication
&lt;/h3&gt;

&lt;p&gt;To prevent chaos when something goes wrong, Kafka creates multiple copies (replicas) of each partition and spreads them across different brokers.&lt;br&gt;
Leader &amp;amp; follower replicas. &lt;br&gt;
The &lt;strong&gt;leader replica&lt;/strong&gt; is the primary copy of a partition, handling all reads and writes. &lt;br&gt;
&lt;strong&gt;Follower replicas&lt;/strong&gt; are backup copies on other brokers. They don’t serve clients directly but keep an ‘eye’ on, and a copy of, the leader’s activity.&lt;br&gt;
&lt;strong&gt;ISR (In-Sync Replicas)&lt;/strong&gt; is the list of replicas that are fully up-to-date with the leader’s data. If a follower falls behind, it’s kicked out of the ISR until it catches up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuring replication:&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;replication.factor:&lt;/em&gt; How many total replicas (leader + followers) each partition should have. A replication.factor=3 means one leader and two followers.&lt;br&gt;
&lt;em&gt;min.insync.replicas:&lt;/em&gt; The minimum number of replicas that must be in sync for the leader to accept writes. Setting min.insync.replicas=2 ensures at least two replicas (including the leader) are up-to-date before acknowledging a write. &lt;/p&gt;
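&lt;p&gt;The two settings might be applied together when creating a topic (a sketch; the topic name and broker address are placeholders):&lt;/p&gt;

```shell
# One leader + two followers per partition; a write is only
# acknowledged once at least two replicas (leader included) have it.
kafka-topics.sh --create --topic orders \
  --bootstrap-server localhost:9092 \
  --partitions 3 \
  --replication-factor 3 \
  --config min.insync.replicas=2
```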

&lt;h3&gt;
  
  
  Fault tolerance
&lt;/h3&gt;

&lt;p&gt;A brief look at how Kafka handles failures:&lt;br&gt;
&lt;strong&gt;Normally:&lt;/strong&gt; The leader replica for a partition (say, on Broker 1) handles all reads and writes, while followers on Brokers 2 and 3 replicate the data in real time.&lt;br&gt;
&lt;strong&gt;Failure:&lt;/strong&gt; If Broker 1 crashes, Kafka’s controller (part of cluster management, handled by ZooKeeper or KRaft) detects the failure and promotes a follower to become the new leader.&lt;br&gt;
&lt;strong&gt;Recovery:&lt;/strong&gt; When Broker 1 comes back online, it rejoins as a follower, catching up on any missed messages before re-entering the ISR.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kafka Connect
&lt;/h3&gt;

&lt;p&gt;A scalable, easy-to-use framework built into Kafka to stream data in and out of it. It’s plug-and-play: only the connectors are configured, while Kafka Connect does the heavy lifting of integration. There are two types of connectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source connectors: pull data into Kafka&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7m0tmebf7k78k9x2203.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7m0tmebf7k78k9x2203.png" alt=" " width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sink connectors: push data out of Kafka&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77vggczwttebo6fb7r8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77vggczwttebo6fb7r8i.png" alt=" " width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each connector breaks the data flow into tasks, which run in parallel for scalability.&lt;/p&gt;

&lt;p&gt;An example is a CDC pipeline that pulls changes from a PostgreSQL database and writes them to a Cassandra database, using a Debezium Postgres source connector and a Cassandra sink connector as shown in the code snippets above. &lt;br&gt;
Similarly, consider a retail company using a MongoDB source connector to feed inventory updates into Kafka, then a sink connector to push alerts to a notification system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kafka Streams
&lt;/h3&gt;

&lt;p&gt;A Java library for building real-time data processing applications. It builds on Kafka’s scalability and fault tolerance, letting you process, analyze, and transform data on the fly without a separate processing system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stateless operations:&lt;/strong&gt; Each message (data record) is processed independently, with no context carried over.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Map:&lt;/em&gt; Transform data, like renaming.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Filter:&lt;/em&gt; Keep only certain records.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;FlatMap:&lt;/em&gt; Split one record into multiple.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stateful operations:&lt;/strong&gt; These operations require maintaining “state” (context) across multiple messages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Count: Track the number of orders per item.&lt;/li&gt;
&lt;li&gt;Aggregate: Sum up or average values.&lt;/li&gt;
&lt;li&gt;Join: Combine streams.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;State is managed using state stores, which are backed by Kafka topics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windowing
&lt;/h3&gt;

&lt;p&gt;This is grouping data into time-based “windows” for analysis, perfect for real-time trends.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Tumbling:&lt;/em&gt; Fixed, non-overlapping time intervals (e.g., every 5 minutes).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Hopping:&lt;/em&gt; Fixed intervals that overlap (e.g., 5-minute windows advancing every 1 minute). Like checking messages every minute but looking at the last 5 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Session:&lt;/em&gt; Dynamic windows based on user activity (e.g., group orders from the same customer if they’re within 10 minutes of each other).&lt;/li&gt;
&lt;/ul&gt;
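&lt;p&gt;Tumbling and hopping window assignment is simple modular arithmetic; a sketch with hypothetical helpers (timestamps in seconds, 5-minute windows):&lt;/p&gt;

```python
def tumbling_window(ts, size=300):
    """Fixed, non-overlapping windows: each event falls in exactly one."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size=300, advance=60):
    """Overlapping 5-minute windows, advancing every minute, that contain ts."""
    latest = (ts // advance) * advance       # last window start at or before ts
    starts = range(latest, latest - size, -advance)
    return [(s, s + size) for s in starts if s >= 0]

tumbling_window(250)    # (0, 300): same window as an event at t=10
tumbling_window(310)    # (300, 600): the next window
hopping_windows(630)    # five overlapping windows cover this event
```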

&lt;h3&gt;
  
  
  ksqlDB
&lt;/h3&gt;

&lt;p&gt;A SQL-like interface for streaming queries. It is like a streaming database built on Kafka that queries real-time data using simple SQL commands. It is built around three concepts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streams:&lt;/strong&gt; This is an immutable sequence of events. Take an example of a temperature monitoring system. Stream in this case is temperature updates sent by a sensor.&lt;/p&gt;

&lt;p&gt;This stream, stored in a Kafka topic called temperature_readings, keeps growing as new readings arrive.&lt;/p&gt;

&lt;p&gt;To define a temperature stream:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7n5hd94u72emr3gdlqh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7n5hd94u72emr3gdlqh.png" alt=" " width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tables:&lt;/strong&gt; These are a snapshot of what’s happening now and represent the current state of data.&lt;br&gt;
Think of a stream as a log of every temperature reading ever sent, while a table is like a dashboard showing the most recent reading for each sensor.&lt;br&gt;
Let’s create a table of latest temperatures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86cdofwfjqz9lky7vkqa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86cdofwfjqz9lky7vkqa.png" alt=" " width="800" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Queries:&lt;/strong&gt; To analyze tables and streams using SQL.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Push Queries:&lt;/em&gt; Continuously deliver results as new data arrives, like a live feed of temperature alerts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrnjsopbjpz3kekqozbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrnjsopbjpz3kekqozbc.png" alt=" " width="800" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pull Queries:&lt;/em&gt; Fetch the current state from a table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdc8wdxhgpibxrtjjsjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdc8wdxhgpibxrtjjsjx.png" alt=" " width="800" height="77"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Transactions &amp;amp; Idempotence
&lt;/h3&gt;

&lt;p&gt;Kafka’s transactions and idempotence features ensure every message is processed exactly once (no duplicates, no misses) by allowing data to be processed across multiple topics and partitions as a single, atomic unit. &lt;/p&gt;

&lt;p&gt;This in turn enables exactly-once semantics (EOS), a holy grail for applications where precision is non-negotiable, like financial systems. EOS means a message is delivered and processed exactly once, despite the occurrence of network failures or broker crashes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotence&lt;/strong&gt; ensures that even if a producer retries sending a message due to a broker failure, Kafka recognizes it as a duplicate and only records it once. This is achieved using a unique producer id and sequence numbers for each message.&lt;/p&gt;
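&lt;p&gt;A toy sketch of that broker-side bookkeeping (in-memory stand-ins, not Kafka’s actual implementation): each producer ID carries a sequence number, and the broker ignores any sequence it has already accepted:&lt;/p&gt;

```python
# Toy broker-side bookkeeping for an idempotent producer: every send
# carries (producer_id, sequence); a sequence seen before is ignored.
accepted = {}   # producer_id mapped to the highest sequence accepted
log = []

def append(producer_id, sequence, message):
    if sequence > accepted.get(producer_id, -1):
        accepted[producer_id] = sequence
        log.append(message)

append("p1", 0, "transfer-100")
append("p1", 1, "transfer-250")
append("p1", 1, "transfer-250")   # retry after a lost ack: not re-appended
# log == ["transfer-100", "transfer-250"]
```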

&lt;p&gt;&lt;em&gt;Use Case:&lt;/em&gt; A bank uses Kafka to process transactions. Without EOS, a transfer could be recorded twice, overcharging a customer. &lt;/p&gt;

&lt;h3&gt;
  
  
  Security in Kafka
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Authentication&lt;/strong&gt; in Kafka prevents unauthorized access to topics and messages.&lt;br&gt;
&lt;em&gt;SASL(Simple Authentication and Security Layer):&lt;/em&gt; Think of it as a versatile ID checker. It supports protocols like PLAIN (username/password), GSSAPI (Kerberos for enterprise environments), and SCRAM (secure password-based authentication). &lt;br&gt;
&lt;em&gt;Kerberos&lt;/em&gt; is common in large organizations like banks, where strict identity verification is a must.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;OAuth:&lt;/em&gt; This is like using a third-party app like Google to log in. It’s great for modern, cloud-native setups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authorization (Access Control Lists, ACLs)&lt;/strong&gt;&lt;br&gt;
Once users are verified, you need to control what they can do. Authorization in Kafka uses ACLs to define permissions. ACLs specify who can perform actions e.g., read, write, delete on topics or consumer groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encryption (TLS)&lt;/strong&gt;&lt;br&gt;
Encryption ensures messages stay confidential. Kafka uses Transport Layer Security (TLS) to encrypt data in transit between producers, consumers, and brokers, as well as between brokers. &lt;br&gt;
&lt;em&gt;Enabling TLS&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F898vibzoy0muypudi5j9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F898vibzoy0muypudi5j9.png" alt=" " width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Operations &amp;amp; Monitoring
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Consumer lag&lt;/strong&gt;&lt;br&gt;
This measures how far behind a consumer is relative to the latest message in a partition.&lt;br&gt;
Lag = Latest Offset - Consumer Offset&lt;br&gt;
Lag shows how quickly consumers can handle the incoming data volume. If lag keeps growing, a real-time pipeline might not be so “real-time” anymore.&lt;br&gt;
To monitor this, set alerts on lag thresholds (e.g., &amp;gt;1000 messages) using tools like Prometheus and Grafana.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Broker health &amp;amp; under-replicated partitions&lt;/strong&gt;&lt;br&gt;
Brokers are the heart of a Kafka cluster. Broker health metrics, especially under-replicated partitions (URPs), tell if brokers are keeping up with replication.&lt;/p&gt;

&lt;p&gt;Kafka uses replication to ensure fault tolerance. Each partition has a leader replica and follower replicas. If followers can’t keep up with the leader, those partitions are counted as under-replicated, signaling a potential data-loss risk if the leader fails. &lt;/p&gt;

&lt;p&gt;To monitor this, we check the UnderReplicatedPartitions metric via Kafka’s JMX metrics or tools like Prometheus. A non-zero value indicates trouble.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput &amp;amp; latency&lt;/strong&gt;&lt;br&gt;
Throughput is a measure of how many messages a cluster processes per second. Latency tracks how long it takes for a message to go from producer to consumer.&lt;/p&gt;

&lt;p&gt;Low throughput or high latency can choke a pipeline. For example, if streaming real-time analytics, slow throughput could delay driver updates, impacting user experience.&lt;/p&gt;

&lt;p&gt;Monitoring is via metrics like &lt;em&gt;BytesInPerSec&lt;/em&gt;, &lt;em&gt;BytesOutPerSec&lt;/em&gt;, and &lt;em&gt;MessagesInPerSec&lt;/em&gt; for throughput, and &lt;em&gt;RequestLatencyAvg&lt;/em&gt; for latency, using JMX. &lt;br&gt;
Latency can be reduced by optimizing network settings or reducing partition counts for low-traffic topics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scaling Kafka
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Partition count tuning&lt;/strong&gt;&lt;br&gt;
More partitions mean more parallelism, letting consumer groups process messages faster. A topic with 10 partitions can be consumed by up to 10 consumers in a group, speeding up throughput. However, too many partitions can increase overhead, like having too many queues confusing your baristas. &lt;/p&gt;

&lt;p&gt;A good rule of thumb, per the Kafka documentation, is to start with 1–2 partitions per broker and adjust based on throughput needs.&lt;br&gt;
&lt;strong&gt;NOTE:&lt;/strong&gt; A topic’s partition count can be increased but never decreased without recreating the topic, so it is wise to plan ahead. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding brokers&lt;/strong&gt;&lt;br&gt;
Adding brokers increases storage and throughput capacity; it is like opening a new counter to serve more customers.&lt;br&gt;
New brokers are added to the cluster and the cluster metadata is updated via the Kafka controller; existing partitions are then redistributed to balance the load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rebalancing partitions&lt;/strong&gt;&lt;br&gt;
After adding brokers or tuning partitions, rebalancing partitions is necessary to spread the load evenly, like reassigning baristas to quieter queues. &lt;/p&gt;

&lt;p&gt;The kafka-reassign-partitions.sh tool moves partitions between brokers, ensuring no single broker is overwhelmed while others sit idle.&lt;br&gt;
Tip: Rebalancing can be resource-intensive, so schedule it during low-traffic periods.&lt;br&gt;
Weaving it all together: tune partitions to match throughput needs, add brokers to handle more traffic, and rebalance to keep things smooth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Optimization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Batching and compression&lt;/strong&gt;&lt;br&gt;
Producers send messages to brokers, but sending each message individually is inefficient. Batching groups multiple messages into a single request, reducing network overhead and boosting throughput.&lt;br&gt;
To configure this, we use linger.ms, which sets how long a producer waits to accumulate messages before sending a batch. A small delay of 5 milliseconds can dramatically increase throughput by allowing more messages to be sent together. Another setting, batch.size, controls the maximum size of a batch (in bytes).&lt;/p&gt;
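&lt;p&gt;These settings might be combined as follows (illustrative values; linger.ms, batch.size, and compression.type are standard producer configs):&lt;/p&gt;

```properties
# producer.properties -- trade a few ms of latency for throughput
linger.ms=5                 # wait up to 5 ms for a batch to fill
batch.size=32768            # cap each batch at 32 KB
compression.type=lz4        # compress whole batches on the wire
```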

&lt;p&gt;&lt;strong&gt;Page cache usage&lt;/strong&gt;&lt;br&gt;
Kafka uses the operating system’s page cache, a portion of RAM used to cache disk data. When producers write messages to a partition, they’re stored on disk but also kept in the page cache. Consumers reading recent messages can often pull them directly from memory, avoiding slow disk I/O. This is especially powerful for Kafka’s log-based architecture, where messages are appended sequentially, making cache hits common.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disk and network considerations&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Disk considerations:&lt;/strong&gt;&lt;br&gt;
Kafka stores messages on disk in a log-based structure, making disk performance critical. To optimize this:&lt;br&gt;
&lt;em&gt;Use SSDs:&lt;/em&gt; Solid-state drives (SSDs) offer faster I/O than traditional HDDs, reducing write and read latency.&lt;br&gt;
&lt;em&gt;Separate log directories:&lt;/em&gt; Spread Kafka’s log directories (log.dirs) across multiple disks to parallelize I/O.&lt;br&gt;
&lt;em&gt;Avoid overloading disks:&lt;/em&gt; Monitor disk I/O using tools like iostat. If disk utilization nears 100%, add more disks or brokers to distribute load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network considerations:&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;High-Bandwidth NICs:&lt;/em&gt; Use network interface cards (NICs) with at least 1 Gbps (preferably 10 Gbps) to handle high message volumes. &lt;br&gt;
This ensures Kafka can handle large bursts of traffic without bottlenecks.&lt;br&gt;
&lt;em&gt;Enable compression:&lt;/em&gt; Compression reduces network load. For network-bound clusters, lz4 compression offers a good balance of speed and efficiency.&lt;br&gt;
&lt;em&gt;Monitor network latency:&lt;/em&gt; Use metrics like network-io-rate to spot bottlenecks. If latency spikes, consider upgrading network hardware or optimizing producer/consumer configurations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Uber's Real-Time Ad System&lt;/strong&gt;&lt;br&gt;
Imagine running Uber Eats ads where every click or impression is money on the line. Mess up, and you’re either losing revenue or overcharging advertisers. In 2021, Uber built a slick system using Apache Kafka, Flink, Pinot, and Hive to process ad events (clicks, impressions) in near real-time with exactly-once precision, no duplicates, no losses. It’s like a coffee shop where every order is tracked perfectly, even during a rush.&lt;/p&gt;

&lt;p&gt;Kafka acts as the reliable message queue, with topics like “Mobile Ad Events” split into partitions for parallel processing. Producers (the app) send events with acks=all for reliability, while Flink jobs (consumers) aggregate, attribute, and load data across two regions for failover. &lt;/p&gt;

&lt;p&gt;Exactly-once semantics: Flink’s “read_committed” mode and Kafka’s transactions, paired with unique record UUIDs, ensure no double-counting. Retention policies keep 3-day backups for recovery, and replication across brokers guarantees fault tolerance. Flink’s 1-minute windowing handles back pressure, while consumer lag monitoring keeps the pipeline smooth.&lt;/p&gt;

&lt;p&gt;This setup powers ad auctions, billing, and analytics, processing millions of events weekly with sub-2-minute latency. Uber’s use of Kafka’s partitioning, transactions, and scalability shows how theory fuels real-world wins, delivering fast, accurate insights for advertisers.&lt;/p&gt;

&lt;p&gt;Link to article : &lt;a href="https://www.uber.com/en-KE/blog/real-time-exactly-once-ad-event-processing/" rel="noopener noreferrer"&gt;https://www.uber.com/en-KE/blog/real-time-exactly-once-ad-event-processing/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pinterest&lt;/strong&gt;&lt;br&gt;
Pinterest handles massive data: 459 million users pinning images non-stop. To manage 40 million messages per second and 50 GB/s of traffic, Pinterest leans on Apache Kafka, running 50+ clusters with 3,000 brokers. &lt;/p&gt;

&lt;p&gt;Kafka’s topics and partitions are central, handling 3,000+ topics and 500K partitions for user events and database changelogs. Static “brokersets” ensure even partition distribution, boosting scalability. Producers like Singer (logging) and Maxwell (DB ingestion) publish compressed messages. Consumers, such as S3 Transporter for analytics and Flink/Kafka Streams for real-time spam detection or recommendations, use consumer groups to scale processing.&lt;/p&gt;

&lt;p&gt;Serialization uses compact formats to ease CPU strain during upgrades. For performance optimization, SSDs replace magnetic disks, slashing I/O latency, while batching and compression boost throughput.&lt;/p&gt;

&lt;p&gt;Pinterest powers real-time monetization and safety pipelines, processing petabytes for ML and metrics. Kafka’s partitioning, replication, and monitoring make this data chaos a seamless, scalable win.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv96vgjt91agc30w7ih3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv96vgjt91agc30w7ih3.png" alt=" " width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link to article: &lt;a href="https://www.confluent.io/blog/running-kafka-at-scale-at-pinterest/" rel="noopener noreferrer"&gt;https://www.confluent.io/blog/running-kafka-at-scale-at-pinterest/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>BRIEF INTRODUCTION TO DOCKER AND DOCKER COMPOSE</title>
      <dc:creator>Hilary Wambwa</dc:creator>
      <pubDate>Wed, 27 Aug 2025 10:18:03 +0000</pubDate>
      <link>https://dev.to/hilary_wambwa_dd49d0404fa/brief-introduction-to-docker-and-docker-compose-2hc2</link>
      <guid>https://dev.to/hilary_wambwa_dd49d0404fa/brief-introduction-to-docker-and-docker-compose-2hc2</guid>
      <description>&lt;h2&gt;
  
  
  DOCKER
&lt;/h2&gt;

&lt;p&gt;Docker is a tool to develop and run applications in containers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A container&lt;/strong&gt; provides an isolated environment for an application and its dependencies (libraries) to ensure it runs consistently across environments (operating systems/machines). It is itself a running instance of an image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An image&lt;/strong&gt; is a blueprint containing everything needed to run an application (code, libraries, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
Docker has two installation options.&lt;br&gt;
&lt;strong&gt;Docker desktop:&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://docs.docker.com/desktop/setup/install/windows-install/" rel="noopener noreferrer"&gt;https://docs.docker.com/desktop/setup/install/windows-install/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker engine via WSL Linux distribution:&lt;/strong&gt; &lt;a href="https://gist.github.com/dehsilvadeveloper/c3bdf0f4cdcc5c177e2fe9be671820c7" rel="noopener noreferrer"&gt;https://gist.github.com/dehsilvadeveloper/c3bdf0f4cdcc5c177e2fe9be671820c7&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s do it practically:&lt;br&gt;
In this example, we will have three files.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Requirements text file&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
This file contains the libraries needed to run our application (e.g. pandas).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Python file (main.py)&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
import pandas&lt;br&gt;
print("Trump")&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Dockerfile&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This is a script of instructions that guides how an image is created.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Specify an official Python base image. Instead of installing Python, we pull this image from the registry
FROM python:3.13-slim

# Set the working directory
WORKDIR /app

# Copy the requirements file and install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the contents of the project directory into our app directory
COPY . .

# Run the application
CMD ["python", "main.py"]
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;
  
  
  Docker Compose
&lt;/h2&gt;

&lt;p&gt;Docker Compose is a tool that makes it easy for data engineers/developers to define and run multi-container Docker applications.&lt;/p&gt;

&lt;p&gt;Imagine you're building a complex data pipeline with Docker that needs multiple databases, Kafka, and a CDC tool such as Debezium. Without Docker Compose, you'd be typing multiple commands to get each of these containers up and running: a docker run for your databases with all the parameters for ports, volumes, and networks, then back into the terminal for another lengthy docker run command, with even more parameters, for each remaining service your application needs.&lt;/p&gt;

&lt;p&gt;This process is error-prone: every extra command is another chance for a typo or mistake that sends you back to square one.&lt;/p&gt;

&lt;p&gt;Enter Docker Compose, where you define your multi-container setup in a single YAML file that outlines which images to use or build, the ports, the volumes, and how these containers should talk to one another.&lt;/p&gt;

&lt;p&gt;In a .yml file, which uses YAML syntax, a service definition tells Docker how to run a specific container based on an image, including configurations like ports, volumes, environment variables, and dependencies on other services. With a single docker-compose up command, Docker Compose brings your entire application to life; when you're done, docker-compose down tears it all down.&lt;/p&gt;
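A minimal compose file for the example above might look like this (service names, ports, and credentials are illustrative assumptions):

```yaml
# docker-compose.yml -- illustrative sketch
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example   # demo credentials only
    ports:
      - "5432:5432"
    volumes:
      - db_data:/var/lib/postgresql/data
  app:
    build: .            # builds from the Dockerfile in this directory
    depends_on:
      - db
volumes:
  db_data:
```

Here depends_on makes Compose start the database before the app, and the named volume keeps the database's data across restarts.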

&lt;p&gt;&lt;em&gt;Here are some commands for managing containers and images:&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;strong&gt;docker build -t app .&lt;/strong&gt;&lt;/em&gt;: Build an image from a Dockerfile.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;docker images&lt;/em&gt;&lt;/strong&gt;: List downloaded images.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;docker run&lt;/em&gt;&lt;/strong&gt;: Start a container from an image.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;docker ps&lt;/em&gt;&lt;/strong&gt;: List running containers.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;docker ps -a&lt;/em&gt;&lt;/strong&gt;: List all containers.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;docker stop&lt;/em&gt;&lt;/strong&gt;: Stop a container.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;docker rm&lt;/em&gt;&lt;/strong&gt;: Remove a container.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;docker rmi&lt;/em&gt;&lt;/strong&gt;: Remove an image.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;docker-compose up&lt;/em&gt;&lt;/strong&gt;: Start all services defined in the compose file.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;docker-compose down&lt;/em&gt;&lt;/strong&gt;: Stop and remove all services.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>15 Must-Know Data Engineering Tricks</title>
      <dc:creator>Hilary Wambwa</dc:creator>
      <pubDate>Sun, 10 Aug 2025 22:31:18 +0000</pubDate>
      <link>https://dev.to/hilary_wambwa_dd49d0404fa/15-must-know-data-engineering-tricks-18ec</link>
      <guid>https://dev.to/hilary_wambwa_dd49d0404fa/15-must-know-data-engineering-tricks-18ec</guid>
      <description>&lt;p&gt;Think of data engineering as the behind-the-scenes hero that makes sense of the massive amounts of data modern companies deal with every day.&lt;br&gt;
Here are 15 tricks that will help you navigate this field.&lt;/p&gt;

&lt;h2&gt;
  
  
  Batch vs Streaming ingestion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Batch Ingestion:&lt;/strong&gt; Fixed chunks of data are ingested into a system on a fixed schedule or manually. &lt;br&gt;
&lt;strong&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;processing daily sales data for a retail company from a transactional database to a warehouse.&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Streaming:&lt;/strong&gt; Data or events are ingested into a system in real time or near real time based on a trigger. &lt;br&gt;
&lt;strong&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;processing data from a temperature-monitoring IoT sensor in a greenhouse.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Change Data Capture (CDC)
&lt;/h2&gt;

&lt;p&gt;Style of data movement where every change (inserts, updates, deletes) is captured in real time to move data from one data source to target without reprocessing the entire datasets.&lt;br&gt;
Three ways to perform CDC:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log based: Every database transaction is recorded in a log file. Changes are picked up from the log and moved to the target. Efficient, with no impact on the source system.&lt;/li&gt;
&lt;li&gt;Query based: Querying the data in the source based on a timestamp to pick up the changes. Source must have a column tracking the last modification timestamp.&lt;/li&gt;
&lt;li&gt;Triggers: Database triggers are created to log/store the changes in a separate audit table which are then read and propagated to the target database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; &lt;em&gt;A healthcare institution stores patient records in a transactional database. This data is moved to an Amazon Redshift warehouse via log-based change data capture therefore storing patients’ historical data for tracking and compliance.&lt;/em&gt;&lt;/p&gt;
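Query-based CDC, for instance, can be sketched with a timestamp filter (table and column names are hypothetical; sqlite3 stands in for the source database):

```python
import sqlite3

# Query-based CDC sketch: pick up only rows modified since the last sync.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, "Alice", "2025-01-01"), (2, "Bob", "2025-03-15"), (3, "Carol", "2025-06-30")],
)

last_sync = "2025-02-01"  # watermark saved from the previous run
changes = conn.execute(
    "SELECT id, name FROM patients WHERE updated_at > ? ORDER BY id", (last_sync,)
).fetchall()
print(changes)  # -> [(2, 'Bob'), (3, 'Carol')]
```

After each run the pipeline advances the watermark, so unchanged rows are never re-read; this is why query-based CDC needs that last-modification column on the source.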

&lt;h2&gt;
  
  
  Idempotency
&lt;/h2&gt;

&lt;p&gt;Given the same input, a data pipeline/process should produce the same results regardless of how many times it runs or of failures such as network, server, or API errors.&lt;br&gt;
Without idempotency, scheduled retries could lead to duplicated data, incomplete data states, and costly computation and storage.&lt;br&gt;
How to practice idempotency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use unique IDs for each data operation/record.&lt;/li&gt;
&lt;li&gt;Verify the state of data to confirm if incoming data matches current data.&lt;/li&gt;
&lt;li&gt;Deduplication: Kafka’s exactly-once semantics ensure messages are processed only once by tracking offsets and transaction IDs.&lt;/li&gt;
&lt;li&gt;Use unique constraints/upsert operations in databases, e.g. INSERT ... ON CONFLICT DO NOTHING/UPDATE in SQL, to avoid duplicates.&lt;/li&gt;
&lt;/ul&gt;
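The upsert idea in the last bullet can be sketched with SQLite's ON CONFLICT clause (the table and values are illustrative):

```python
import sqlite3

# Idempotent load: re-running the same insert does not create duplicates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")

def load(order_id, amount):
    # Unique key + ON CONFLICT makes the operation safe to retry.
    conn.execute(
        "INSERT INTO orders VALUES (?, ?) ON CONFLICT(order_id) DO NOTHING",
        (order_id, amount),
    )

for _ in range(3):          # simulate a retried pipeline run
    load("ord-1", 49.99)

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # -> 1, despite three runs
```

The unique constraint is what carries the idempotency: however many times the retry fires, the database state converges to the same single row.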

&lt;h2&gt;
  
  
  OLTP vs OLAP
&lt;/h2&gt;

&lt;p&gt;OLTP is a database design optimized for high-volume, frequent transactional read-write operations (insert, update, delete).&lt;br&gt;
OLAP is designed to optimize complex queries and aggregations, often used in data warehouses to analyze large volumes of historical data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Columnar vs Row-based Storage
&lt;/h2&gt;

&lt;p&gt;To understand this, let’s consider a table of sales data: 1 million rows and the following columns (customer_id, date, amount, order_id). &lt;/p&gt;

&lt;p&gt;In columnar storage, each column’s data is stored separately, i.e. all 1 million customer_id values are stored together, separate from date, amount, and order_id. This is perfect for OLAP systems where aggregation is common, e.g. a SUM(amount) query only reads the amount column, skipping the rest. It minimizes disk I/O.&lt;/p&gt;

&lt;p&gt;In row-based storage, data is stored by row: all attributes of one record are stored together. This is perfect for OLTP, where read-write operations are commonly done on entire records.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitioning
&lt;/h2&gt;

&lt;p&gt;This is dividing data into small partitions using attributes such as date and geographical regions. It is useful in optimizing querying and processing such that in a large dataset, the system uses partitions to only query the relevant data instead of the whole dataset.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Horizontal partitioning: Partitioning a subset of rows based on attributes such as date.&lt;/li&gt;
&lt;li&gt;Vertical partitioning: Splitting a table into columns, most frequently queried columns are stored in one partition while those less frequently queried stored in another.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ETL vs ELT
&lt;/h2&gt;

&lt;p&gt;ETL (Extract, Transform, Load): Extract data from the source, transform it, and load it into the target. Common in structured data warehouses.&lt;br&gt;
ELT (Extract, Load, Transform): Extract data from the source, load it into the target, then transform it there. Common in cloud data lakes, as they leverage the cloud’s compute power.&lt;/p&gt;

&lt;h2&gt;
  
  
  CAP Theorem
&lt;/h2&gt;

&lt;p&gt;In a distributed system where data is stored across different nodes, when a communication failure between nodes occurs, the system must choose between consistency (all nodes show the same data) and availability (the system remains operational). Partition tolerance is a given, so the trade-off is between consistency and availability.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Consistency: Every read operation retrieves the most recent write. Crucial for data accuracy. Consider banking where if a user transfers Ksh.1000, all nodes update to reflect this immediately.&lt;br&gt;
Availability: System must remain operational to respond to requests. Responsive over accuracy. Social media platform retrieves posts even though some nodes are disconnected, sometimes showing outdated posts.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Windowing in Streaming
&lt;/h2&gt;

&lt;p&gt;When streaming, data comes in fast and continuously, yet processing and analyzing it requires bounded chunks to calculate averages, sums, etc. Windowing divides this stream into bounded chunks (windows). These windows are based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time: 5-minute bounds&lt;/li&gt;
&lt;li&gt;Count: Every 100 events&lt;/li&gt;
&lt;li&gt;Session: A user’s browsing/shopping session.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;A ride hailing company uses time-based windows to adjust price based on real time demand (ride duration or number of requests)&lt;/em&gt;&lt;/p&gt;
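A tumbling time window like the one in the example can be sketched in plain Python (timestamps in seconds and the 5-minute bucket size are illustrative):

```python
from collections import defaultdict

# Tumbling-window sketch: bucket events into fixed 5-minute windows
# and aggregate per window (here: count ride requests).
WINDOW = 300  # seconds

def window_counts(event_timestamps):
    counts = defaultdict(int)
    for ts in event_timestamps:
        window_start = ts - (ts % WINDOW)   # floor to the window boundary
        counts[window_start] += 1
    return dict(counts)

# Requests at t=10s, 70s, 310s: the first two share window 0, the third is window 300.
print(window_counts([10, 70, 310]))  # -> {0: 2, 300: 1}
```

Each window's count is a bounded aggregate the pricing logic can react to as soon as the window closes, instead of waiting for an unbounded stream to "finish".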

&lt;h2&gt;
  
  
  DAGs and Workflow Orchestration
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Directed Acyclic Graph (DAG)&lt;/strong&gt; uses nodes and directed edges to create an execution order for tasks while avoiding cycles, meaning tasks cannot loop back to earlier steps.&lt;br&gt;
&lt;strong&gt;Workflow orchestration&lt;/strong&gt; is the managing and scheduling of these tasks. Apache Airflow is an example of a workflow orchestration tool.&lt;br&gt;
A DAG written in Python might extract air quality data from an API, transform the data, and load it into a database. The Airflow scheduler provides the execution order while the web UI monitors the DAG. If a task fails, Airflow logs the error, retries the task, and displays or sends an alert.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry Logic &amp;amp; Dead Letter Queues
&lt;/h2&gt;

&lt;p&gt;Retry logic re-attempts a failed task, not immediately but after a configurable delay, with the wait time increasing exponentially on each attempt (exponential backoff), i.e. 1s, 2s, 4s, 8s. This avoids overwhelming the target system with retries and gives it time to recover. Messages or tasks that still fail after all retries are routed to a dead letter queue (DLQ), a holding area where they can be inspected and reprocessed later without blocking the rest of the pipeline.&lt;/p&gt;
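Exponential backoff with a dead-letter fallback can be sketched as a small helper (the flaky task and delay values are illustrative):

```python
import time

def run_with_retries(task, max_attempts=4, base_delay=1.0, dlq=None):
    """Retry with exponential backoff: waits base_delay * 2**attempt between tries.
    After max_attempts, the failure is parked in a dead letter queue."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts - 1:
                if dlq is not None:
                    dlq.append(str(exc))   # park the failure for later inspection
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Illustrative flaky task: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(run_with_retries(flaky, base_delay=0.01))  # -> ok (on the third attempt)
```

The growing delay is the key design choice: it spaces retries out so a struggling downstream system is not hammered while it recovers.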

&lt;h2&gt;
  
  
  Backfilling &amp;amp; Reprocessing
&lt;/h2&gt;

&lt;p&gt;Backfilling is input of historical data to a system to fill gaps or correct errors especially when introducing a new pipeline.&lt;br&gt;
Reprocessing on the other hand is re-running pipelines on existing data to fix errors, increase accuracy of the data or when introducing new transformations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;A marketing firm introduces a new analytic platform. It has to backfill years of campaign data for historical analysis.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Governance
&lt;/h2&gt;

&lt;p&gt;These are rules that define the management and protection of an organization’s data to ensure security, quality and compliance.&lt;br&gt;
It helps to define who owns the data, who accesses it and what is included in the metadata.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time Travel &amp;amp; Data Versioning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Time travel:&lt;/strong&gt; The ability to query data as it existed at a certain point in the past. Imagine debugging a pipeline and needing to see what your data looked like last week, before a bug corrupted it. Time travel lets you rewind the data to that specific time without altering it.&lt;br&gt;
&lt;strong&gt;Data Versioning:&lt;/strong&gt; Tracking changes to a dataset over time, similar to version control in software engineering (e.g., Git). Each change creates a new version of the data, allowing analysis of how it has evolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distributed Processing Concepts
&lt;/h2&gt;

&lt;p&gt;Distributed processing is carrying out complex tasks on large data using multiple connected nodes (servers/cloud instances), dividing work into smaller, parallel tasks. The opposite is centralized processing, where one machine carries out all the tasks.&lt;br&gt;
Distributed processing relies on several concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parallel processing: This is executing several tasks simultaneously.&lt;/li&gt;
&lt;li&gt;Partitioning: Dividing data into small chunks where each node processes a partition independently.&lt;/li&gt;
&lt;li&gt;Fault tolerance: Ability to continue operating despite system failure. In this case if a node fails while processing, another node is used to recompute the task.&lt;/li&gt;
&lt;li&gt;Load balancing: Distribution of tasks equally across nodes to avoid bottlenecks and delays.&lt;/li&gt;
&lt;li&gt;Distributed coordination: Tasks have to be managed for synchronization. A good example is Apache Zookeeper.&lt;/li&gt;
&lt;/ul&gt;
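Partitioning plus parallel processing can be sketched with Python's standard library (the data and partition size are illustrative; threads stand in for cluster nodes):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: partition the data, process partitions in parallel, combine the results.
def partition(data, size):
    return [data[i:i + size] for i in range(0, len(data), size)]

def process(chunk):
    return sum(chunk)   # stand-in for a heavier per-partition computation

data = list(range(1, 101))
chunks = partition(data, 25)          # 4 partitions of 25 rows each

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, chunks))  # each "node" handles one partition

print(sum(partials))  # -> 5050, same answer as centralized processing
```

In a real cluster the partitions would live on different machines, and fault tolerance would mean re-running only the failed partition on another node rather than the whole job.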

</description>
    </item>
    <item>
      <title>Understanding Data Warehousing for Retail Analytics: A Comprehensive Guide</title>
      <dc:creator>Hilary Wambwa</dc:creator>
      <pubDate>Sat, 26 Jul 2025 11:06:56 +0000</pubDate>
      <link>https://dev.to/hilary_wambwa_dd49d0404fa/understanding-data-warehousing-for-retail-analytics-a-comprehensive-guide-24e5</link>
      <guid>https://dev.to/hilary_wambwa_dd49d0404fa/understanding-data-warehousing-for-retail-analytics-a-comprehensive-guide-24e5</guid>
      <description>&lt;h2&gt;
  
  
  What is it?
&lt;/h2&gt;

&lt;p&gt;A data warehouse is a central store used for managing large volumes of historical and current data for an organization. Unlike operational and transactional databases, it is optimized for analysis and business intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are its components?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Database
&lt;/h3&gt;

&lt;p&gt;This is the core storage component of a data warehouse, built on a data model. Dimensional modelling is the preferred way to design the blueprint/data model for this database because it is both query-optimized and easy to grasp: a fact table for quantitative, measurable metrics, and dimension tables for descriptive attributes that add meaning to the facts. &lt;br&gt;
Two schema designs are used in this modelling:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Star schema:&lt;/em&gt;&lt;/strong&gt; simple and intuitive. It is denormalized, query-optimized, and compatible with reporting and BI tools, but storage-inefficient.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Snowflake schema:&lt;/em&gt;&lt;/strong&gt; extends the star schema by normalizing dimension tables. It is storage-efficient and maintains data integrity by reducing redundancy, but makes queries and ETL processes more complex due to multiple joins.&lt;/p&gt;
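A minimal star schema for retail sales might look like this (table and column names are illustrative; sqlite3 stands in for the warehouse engine), with the fact table referencing its dimensions:

```python
import sqlite3

# Star-schema sketch: one fact table surrounded by denormalized dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    amount       REAL    -- quantitative, measurable metric
);
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Alice', 'Nairobi')")
conn.execute("INSERT INTO dim_date VALUES (20250101, '2025-01-01', 'Jan')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 20250101, 250.0)")

# A typical BI query: one join per dimension, no snowflaked chains of joins.
total = conn.execute("""
    SELECT SUM(f.amount) FROM fact_sales f
    JOIN dim_customer c ON f.customer_key = c.customer_key
    WHERE c.city = 'Nairobi'
""").fetchone()[0]
print(total)  # -> 250.0
```

A snowflake variant would split dim_customer further (e.g. a separate city table), saving storage at the cost of an extra join in that query.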

&lt;h3&gt;
  
  
  2. Data Sources
&lt;/h3&gt;

&lt;p&gt;Since data warehouses are central, they fetch data from multiple sources:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Transactional databases:&lt;/em&gt;&lt;/strong&gt; handle real-time, small-scale, frequent read-write operations (OLTP). Normalized to reduce redundancy. Example: A retail MySQL DB managing daily sales. &lt;br&gt;
&lt;strong&gt;&lt;em&gt;Customer Relationship Management (CRM):&lt;/em&gt;&lt;/strong&gt; systems that store company interactions with customers and prospects. Example: A CRM containing customer profiles, sales leads, campaigns and purchase history.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Enterprise Resource Planning (ERP):&lt;/em&gt;&lt;/strong&gt; integrate and manage core business processes; Finance, HR, supply chain and inventory. Highly normalized for operational efficiency. Example: ERP table for inventory containing ItemID, StoreID, Quantity, Cost, Date_of_purchase.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;APIs:&lt;/em&gt;&lt;/strong&gt; basically, how systems share data. In this context, a retail company’s website real time/ near real time traffic data (page views, user demographics) can be pulled using Google Analytics API and stored in a warehouse for analysis.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Flat Files:&lt;/em&gt;&lt;/strong&gt; simple, non-relational files stored in formats such as CSV, JSON, XML. Stored in local systems/cloud before ingesting into data warehouse. Example: A CSV file of customer survey responses stored in S3 buckets then loaded into data warehouse to analyze customer satisfaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  How are these data sources integrated into the warehouse?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3. ETL/ELT Processes
&lt;/h3&gt;

&lt;p&gt;This is where ETL/ELT comes in handy.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Extraction:&lt;/em&gt;&lt;/strong&gt; Pulling data from sources mentioned above to ensure all relevant data is collected.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Transformation:&lt;/em&gt;&lt;/strong&gt; Cleaning and standardizing the data: removing duplicates, handling missing values, and ensuring data integrity, e.g. standardizing date formats. Different sources have different formats and conventions; an ERP may use Firstname, Lastname order while a CRM uses Lastname, Firstname order. &lt;br&gt;
To optimize performance and reduce query complexity, aggregation of the data is necessary e.g. aggregating daily transactional data to weekly sales.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Loading:&lt;/em&gt;&lt;/strong&gt; Organizing the data into an optimized structure i.e. in a star schema to improve query performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Take Note:&lt;/em&gt;&lt;/strong&gt; Stages are useful for storing flat files temporarily before loading them into tables. A good example is Snowflake’s S3-backed internal stage, which is managed by Snowflake. A stage may also be external, e.g. AWS S3, Azure Blob, or Google Cloud Storage, but will require configuration. In the case of retail: an AWS S3 stage storing customer survey responses in a CSV file.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Query and reporting tools
&lt;/h3&gt;

&lt;p&gt;These allow users to interact with stored data, generate insights, and build reports. They include business intelligence platforms such as Power BI, Tableau, and Looker, as well as SQL tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real World Applications
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Amazon employs Amazon Redshift, a cloud-based data warehouse, to handle vast amounts of data from its e-commerce platform, including clicks, impressions, website visitors, and purchase histories. This supports marketing analytics, tracks key performance indicators (KPIs) like conversion rates and churn, and enables reverse ETL to target audiences effectively.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Target Corporation, one of the largest retailers in the United States, uses a sophisticated data warehouse to power its analytics and decision-making. Their system, known as the “Guest Data Platform,” integrates data from various sources to create a unified view of each customer. This has enabled Target to:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Implement highly successful personalized marketing campaigns&lt;/li&gt;
&lt;li&gt;Optimize store layouts based on customer behavior analysis&lt;/li&gt;
&lt;li&gt;Improve inventory management, reducing stockouts and overstocks&lt;/li&gt;
&lt;li&gt;Enhance their online and mobile shopping experiences&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>datawarehouse</category>
      <category>retailanalytics</category>
    </item>
  </channel>
</rss>
