<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: susan waweru</title>
    <description>The latest articles on DEV Community by susan waweru (@susan_waweru_a4fd6b337288).</description>
    <link>https://dev.to/susan_waweru_a4fd6b337288</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3434329%2F2af3794a-4886-472a-98ac-dfc6ecef5ce8.png</url>
      <title>DEV Community: susan waweru</title>
      <link>https://dev.to/susan_waweru_a4fd6b337288</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/susan_waweru_a4fd6b337288"/>
    <language>en</language>
    <item>
      <title>Core Kafka Fundamentals for Data Engineering</title>
      <dc:creator>susan waweru</dc:creator>
      <pubDate>Fri, 12 Sep 2025 14:54:55 +0000</pubDate>
      <link>https://dev.to/susan_waweru_a4fd6b337288/core-kafka-fundamentals-for-data-engineering-41f1</link>
      <guid>https://dev.to/susan_waweru_a4fd6b337288/core-kafka-fundamentals-for-data-engineering-41f1</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Apache Kafka&lt;/em&gt;&lt;/strong&gt; is an open-source distributed event streaming platform.&lt;br&gt;
Was originally developed by LinkedIn but later open-sourced under the Apache Software Foundation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Event&lt;/em&gt;&lt;br&gt;
A record of something that happened in the system, e.g. a button click on a website or a row inserted into a database.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Streaming&lt;/em&gt;&lt;br&gt;
This is the continuous generation, delivery and processing of data in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event streaming&lt;/strong&gt; is the continuous capture, storage and processing of events as they happen, i.e.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;capture data in real time in the form of streams of events&lt;/li&gt;
&lt;li&gt;store these streams of events for later retrieval&lt;/li&gt;
&lt;li&gt;process, react to the event streams in real time&lt;/li&gt;
&lt;li&gt;route the event streams to destination technologies as needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Kafka Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o8dfovn3r0fftgyei16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6o8dfovn3r0fftgyei16.png" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;1. Brokers&lt;/strong&gt;&lt;br&gt;
Brokers are Kafka servers. They:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;store topics and partitions&lt;/li&gt;
&lt;li&gt;handle incoming and outgoing messages&lt;/li&gt;
&lt;li&gt;communicate with producers and consumers (clients)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka clusters usually have multiple brokers for scalability and fault tolerance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Topics&lt;/em&gt; are where producers write events to and where consumers read events from. Topics are logical, not physical; they’re split into &lt;em&gt;partitions&lt;/em&gt; for scalability.&lt;br&gt;
An event is written into exactly one partition. Events inside a partition are stored in an ordered, immutable log (append-only).&lt;/p&gt;
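&lt;p&gt;The append-only log and its offsets can be sketched in plain Python (an illustration, not the Kafka API):&lt;/p&gt;

```python
# A partition modeled as an append-only log: each appended record
# receives the next sequential offset, and reads replay from an offset.
class Partition:
    def __init__(self):
        self._log = []

    def append(self, record):
        self._log.append(record)
        return len(self._log) - 1  # offset assigned to the new record

    def read_from(self, offset):
        return self._log[offset:]

p = Partition()
offsets = [p.append(r) for r in ["evt-a", "evt-b", "evt-c"]]
# offsets are sequential: [0, 1, 2]; reading from 1 replays the tail
tail = p.read_from(1)
```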

&lt;p&gt;&lt;strong&gt;2. ZooKeeper vs KRaft mode&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;ZooKeeper&lt;/em&gt;&lt;br&gt;
A distributed coordination service.&lt;br&gt;
It helps manage metadata, configuration and synchronization for distributed systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keeps track of all Kafka brokers &lt;/li&gt;
&lt;li&gt;Elects the controller broker responsible for partition assignments&lt;/li&gt;
&lt;li&gt;Detects and manages broker failures or restarts and decides which broker leads which partition&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suppose you have 3 Kafka brokers and one fails: ZooKeeper notices the failure, elects a new leader for the affected partitions, and updates metadata so producers and consumers know where to connect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;cons&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a separate system to manage, which adds operational complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;KRaft Mode&lt;/em&gt;&lt;br&gt;
In KRaft mode, Kafka removes ZooKeeper and manages everything internally: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Brokers handle data; controllers manage metadata using the Raft protocol, replacing ZooKeeper’s coordination model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;pros&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simpler deployment and stronger metadata consistency&lt;/li&gt;
&lt;li&gt;easier scaling to thousands of brokers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Cluster setup and scaling&lt;/strong&gt;&lt;br&gt;
A cluster is a group of Kafka brokers working together.&lt;/p&gt;

&lt;p&gt;A cluster has:&lt;br&gt;
Multiple brokers (for load balancing and fault tolerance)&lt;br&gt;
A controller broker (which manages partition leadership)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling&lt;/strong&gt; is done by adding or removing brokers, which triggers partition rebalancing handled manually with tools like kafka-reassign-partitions.sh or automated with solutions such as Confluent Auto Data Balancer.&lt;/p&gt;

&lt;p&gt;Kafka &lt;strong&gt;scales out&lt;/strong&gt; by adding brokers and rebalancing partitions, and &lt;strong&gt;scales in&lt;/strong&gt; by removing brokers after reassigning their partitions. &lt;br&gt;
Partitioning enables parallelism, replication ensures fault tolerance, and monitoring metrics helps decide when to adjust cluster size.&lt;/p&gt;

&lt;p&gt;Setup and Scaling&lt;br&gt;
&lt;em&gt;&lt;strong&gt;step i)&lt;/strong&gt;&lt;/em&gt; deploy more than one broker instance to increase storage and throughput&lt;br&gt;
&lt;em&gt;&lt;strong&gt;step ii)&lt;/strong&gt;&lt;/em&gt; create topics and subdivide them into partitions. Partitions spread data across brokers for parallel processing and load balancing.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;step iii)&lt;/strong&gt;&lt;/em&gt; for fault tolerance, replicate each partition across multiple brokers to ensure data availability if a broker fails.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;step iv)&lt;/strong&gt;&lt;/em&gt; connect client applications, i.e. producers and consumers, to the cluster.&lt;br&gt;
Producers auto-discover brokers via bootstrap servers. Consumers scale by adding more instances in a consumer group, each taking a share of the partitions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Topics, Partitions, Offsets
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cxzfw2gi1irfdap0p52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cxzfw2gi1irfdap0p52.png" alt=" " width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topic&lt;/strong&gt;&lt;br&gt;
A topic is a logical category or feed name to which events (messages/records) are published.&lt;br&gt;
All events of a similar type go into the same topic.&lt;/p&gt;

&lt;p&gt;Producers write events to a topic (to one of a topic's partitions).&lt;br&gt;
Consumers read events from those partitions, usually in the order they were appended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition&lt;/strong&gt;&lt;br&gt;
A partition is a subdivision of a Kafka topic. It's the actual storage unit where events are stored in an ordered, immutable log. Each topic is split into partitions for scalability and parallelism. These partitions physically reside on Kafka brokers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offset&lt;/strong&gt;&lt;br&gt;
The position of a record in a partition: a monotonically increasing integer assigned to each record within that partition.&lt;/p&gt;

&lt;p&gt;The producer writes messages and Kafka assigns them sequential offsets per partition.&lt;br&gt;
&lt;em&gt;Example&lt;/em&gt;: In Partition 0, records may have offsets 0, 1, 2, 3…&lt;/p&gt;

&lt;p&gt;Consumers read messages using offsets as pointers: A consumer remembers “last processed offset”.&lt;br&gt;
Offsets act as bookmarks, if a consumer crashes and restarts, Kafka knows where it left off.&lt;/p&gt;




&lt;h2&gt;
  
  
  Producers
&lt;/h2&gt;

&lt;p&gt;Producers are client applications that publish (write) events to a Kafka topic. They can choose which partition an event goes to (using a key, or round-robin otherwise).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) Writing Data into Topics&lt;/strong&gt;&lt;br&gt;
A producer sends records (messages) to a topic.&lt;br&gt;
Kafka then decides which partition within that topic the record will go to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) Key-based Partitioning (Controlling Message Distribution)&lt;/strong&gt;&lt;br&gt;
Each record may include a key.&lt;br&gt;
Kafka uses the key to determine which partition the record should go to:&lt;br&gt;
If a key is provided, Kafka applies a hash function on the key and the record always goes to the same partition for the provided key.&lt;br&gt;
If no key is provided, Kafka distributes records in a round-robin fashion across partitions for load balancing.&lt;/p&gt;
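&lt;p&gt;A rough sketch of this rule (Kafka’s default partitioner actually hashes the key bytes with murmur2; &lt;code&gt;md5&lt;/code&gt; stands in for it here):&lt;/p&gt;

```python
import hashlib
import itertools

_round_robin = itertools.count()  # counter used for keyless records

def choose_partition(key, num_partitions):
    if key is not None:
        # hash the key so the same key always maps to the same partition
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions
    # no key: spread records round-robin for load balancing
    return next(_round_robin) % num_partitions

# the same key always lands on the same partition
same = choose_partition("user-42", 6) == choose_partition("user-42", 6)
```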

&lt;p&gt;&lt;strong&gt;c) Acknowledgment Modes (acks)&lt;/strong&gt;&lt;br&gt;
Producers can choose how much confirmation they want from Kafka before considering a write successful:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;acks=0 (Fire and Forget)&lt;/em&gt;&lt;br&gt;
Producer does not wait for any acknowledgment.&lt;br&gt;
Fastest, but messages may be lost if the broker fails.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;acks=1 (Leader Acknowledgment)&lt;/em&gt;&lt;br&gt;
Producer gets acknowledgment once the leader partition writes the record.&lt;br&gt;
Safer, but if the leader crashes before followers replicate, data may be lost.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;acks=all (or acks=-1) (All Replicas Acknowledge)&lt;/em&gt;&lt;br&gt;
Producer waits until the leader and all in-sync replicas acknowledge the write.&lt;br&gt;
Safest (strong durability guarantee), but slower due to waiting for replication&lt;/p&gt;
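&lt;p&gt;As a quick reference, the three modes map to a single producer property (shown here as plain Python dicts using the property name from the Java and confluent-kafka clients):&lt;/p&gt;

```python
# The acks producer property trades latency for durability.
fire_and_forget = {"acks": "0"}    # no confirmation: fastest, may lose messages
leader_ack      = {"acks": "1"}    # leader wrote it; lost if leader dies before replication
all_in_sync     = {"acks": "all"}  # leader + all in-sync replicas wrote it: most durable
```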




&lt;h2&gt;
  
  
  Consumers
&lt;/h2&gt;

&lt;p&gt;Consumers are client applications that subscribe to (read and process) events from a Kafka topic.&lt;br&gt;
They can read from one or more partitions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a) Read data from topics&lt;/strong&gt;&lt;br&gt;
A consumer subscribes to one or more topics. They pull data from Kafka at their own pace&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b) Consumer groups (scaling and parallel consumption)&lt;/strong&gt;&lt;br&gt;
Consumers belong to a consumer group, identified by a group id. Kafka ensures that each partition is consumed by exactly one consumer in a group.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Benefits&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Scalability&lt;/em&gt;: Multiple consumers in a group can read from different partitions in parallel.&lt;br&gt;
&lt;em&gt;Fault Tolerance&lt;/em&gt;: If one consumer fails, Kafka reassigns its partitions to other consumers in the group.&lt;/p&gt;
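&lt;p&gt;The "each partition is consumed by exactly one consumer in a group" rule can be sketched like this (a simplified round-robin assignor, not Kafka’s actual rebalancing protocol):&lt;/p&gt;

```python
def assign_partitions(partitions, consumers):
    # each partition is owned by exactly one consumer in the group;
    # the consumers split the topic's partitions among themselves
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

group = assign_partitions(partitions=[0, 1, 2, 3, 4, 5],
                          consumers=["c1", "c2", "c3"])
```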

&lt;p&gt;&lt;strong&gt;c) Offset management (automatic vs manual commits)&lt;/strong&gt;&lt;br&gt;
An offset is a number that marks a consumer’s position in a partition (like a bookmark).&lt;br&gt;
When a consumer reads messages, it must commit the offset to Kafka so it can resume from the right place if restarted.&lt;/p&gt;

&lt;p&gt;To commit offsets:&lt;br&gt;
&lt;strong&gt;1. Automatic Commits&lt;/strong&gt;&lt;br&gt;
Kafka handles committing offsets on behalf of the consumer at regular intervals. The consumer reads messages and the last consumed offset is committed automatically. It's controlled by setting &lt;code&gt;enable.auto.commit=true&lt;/code&gt; and &lt;code&gt;auto.commit.interval.ms&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pros&lt;/em&gt;: &lt;br&gt;
Simple, low overhead.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cons&lt;/em&gt;: &lt;br&gt;
Risk of duplication if the consumer crashes after processing but before the commit, or data loss if the commit happens before processing finishes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Manual Commits&lt;/strong&gt;&lt;br&gt;
The consumer explicitly commits offsets after processing is done: it reads messages and, after processing each batch/record, calls commit to save the offset. It's controlled by setting &lt;code&gt;enable.auto.commit=false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pros&lt;/em&gt;: &lt;br&gt;
More control; ensures at-least-once semantics (and, combined with transactions, exactly-once).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cons&lt;/em&gt;: &lt;br&gt;
More complex to implement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Message Delivery Semantics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a766ch1qbkisgzx1ril.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a766ch1qbkisgzx1ril.png" alt=" " width="545" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message delivery semantics&lt;/strong&gt; describe the guarantees a Kafka system provides when delivering messages to consumers, especially in the case of failures,&lt;br&gt;
i.e. how many times a consumer may receive a message when failures occur.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At-most-once (messages may be lost, never duplicated)&lt;/strong&gt;&lt;br&gt;
The consumer commits the offset before processing the message, so Kafka may consider the message 'done' even if the consumer hasn't finished processing it.&lt;/p&gt;

&lt;p&gt;If the consumer crashes before processing, Kafka starts from the next offset on restart, so the unprocessed message is skipped.&lt;/p&gt;

&lt;p&gt;Guarantee: Messages are delivered 0 or 1 times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At-least-once (messages are never lost, may be duplicated)&lt;/strong&gt;&lt;br&gt;
The consumer processes the message first, then commits the offset. This ensures a message is not acknowledged until it is fully handled.&lt;/p&gt;

&lt;p&gt;If the consumer crashes before committing the offset, even after it has processed the message, Kafka re-delivers that message on restart and the consumer processes it again, producing a duplicate.&lt;/p&gt;

&lt;p&gt;Guarantee: Messages are delivered 1 or more times. No message is lost, but duplicates may occur.&lt;/p&gt;
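&lt;p&gt;The difference between the two commit orders can be simulated in a toy model: a "crash" interrupts the first run, then the consumer restarts from its last committed offset:&lt;/p&gt;

```python
def deliver(log, commit_before_processing, crash_at):
    """Simulate one consumer run that crashes, then a restart."""
    processed, committed = [], 0
    for offset in range(len(log)):              # first run, interrupted by a crash
        if commit_before_processing:            # at-most-once ordering
            committed = offset + 1
            if offset == crash_at:
                break                           # crashed after commit, before processing
            processed.append(log[offset])
        else:                                   # at-least-once ordering
            processed.append(log[offset])
            if offset == crash_at:
                break                           # crashed after processing, before commit
            committed = offset + 1
    for offset in range(committed, len(log)):   # restart from the last committed offset
        processed.append(log[offset])
    return processed

log = ["m0", "m1", "m2", "m3"]
at_most_once = deliver(log, commit_before_processing=True, crash_at=2)    # "m2" is lost
at_least_once = deliver(log, commit_before_processing=False, crash_at=2)  # "m2" duplicated
```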

&lt;p&gt;&lt;strong&gt;Exactly-once (each message is delivered precisely once)&lt;/strong&gt;&lt;br&gt;
With &lt;code&gt;enable.idempotence=true&lt;/code&gt;, producers avoid duplicate writes when retrying after network or broker issues.&lt;/p&gt;

&lt;p&gt;Kafka supports transactional producers where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Messages sent by the producer + consumer offset commits are part of the same atomic transaction.
This ensures that either both the message and offset are committed, or neither is.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The producer writes results and commits consumer offsets in the same transaction.&lt;br&gt;
If a crash happens mid-way, Kafka ensures nothing is partially committed.&lt;br&gt;
On restart, the consumer retries safely without duplicates.&lt;/p&gt;

&lt;p&gt;Guarantee: Each message is processed once and only once, even with retries and failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  Retention Policies
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Retention policies&lt;/strong&gt; govern how long data is kept in Kafka topics.&lt;br&gt;
They help manage disk usage while ensuring consumers have access to the data they need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time-Based Retention&lt;/strong&gt;&lt;br&gt;
Messages are kept for a specified duration, after which they can be deleted.&lt;br&gt;
Configured using parameters like &lt;code&gt;log.retention.hours&lt;/code&gt;, &lt;code&gt;log.retention.minutes&lt;/code&gt;, or &lt;code&gt;log.retention.ms&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When a message’s age exceeds the configured retention period, it becomes eligible for deletion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Size-based retention&lt;/strong&gt;&lt;br&gt;
This policy limits how much disk space a topic or partition can use.&lt;br&gt;
Configured using the parameter &lt;code&gt;log.retention.bytes&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When the configured size threshold is reached, Kafka deletes the oldest log segments to make room for new data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log compaction&lt;/strong&gt;&lt;br&gt;
This policy focuses on retaining the latest value for each key within a topic.&lt;br&gt;
Configured using parameter &lt;code&gt;cleanup.policy=compact&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If multiple records exist with the same key, Kafka removes the older versions and keeps only the most recent one.&lt;br&gt;
Unlike time/size retention, compaction does not delete all old data; it only removes outdated records per key.&lt;/p&gt;
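&lt;p&gt;Compaction’s "latest value per key" behaviour in miniature (a simplification of what the log cleaner does to old segments):&lt;/p&gt;

```python
def compact(log):
    # keep only the most recent value seen for each key
    latest = {}
    for key, value in log:
        latest[key] = value  # later records overwrite earlier ones
    return latest

log = [("user1", "a@old.com"), ("user2", "b@x.com"), ("user1", "a@new.com")]
compacted = compact(log)
# user1's older record is gone; its latest value and user2's record remain
```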




&lt;h2&gt;
  
  
  Back pressure &amp;amp; Flow Control
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4a8nfbs3bwpti4upb5o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4a8nfbs3bwpti4upb5o.png" alt=" " width="761" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sometimes producers generate messages faster than consumers can process them. This can lead to consumer lag, memory buildup, or even crashes, so Kafka uses back pressure and flow control mechanisms to handle this situation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Back Pressure (Handling Slow Consumers)&lt;/strong&gt;&lt;br&gt;
With back pressure, the system signals that consumers (or brokers) are overloaded and cannot keep up with the data flow.&lt;br&gt;
If a consumer is slow, messages keep accumulating in the topic partitions.&lt;br&gt;
If lag exceeds retention limits, consumers may miss data permanently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow Control&lt;/strong&gt;&lt;br&gt;
Flow control ensures that producers and consumers operate at a balanced pace without overwhelming brokers or clients.&lt;br&gt;
If the buffer fills up because brokers or consumers are too slow, Kafka blocks the producer for up to &lt;code&gt;max.block.ms&lt;/code&gt; until space becomes available, and throws an exception if it doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumer Lag Monitoring&lt;/strong&gt;&lt;br&gt;
Consumer lag is the key metric for identifying slow consumers.&lt;br&gt;
Kafka exposes lag metrics via tools like the consumer group CLI and monitoring systems like Prometheus + Grafana.&lt;br&gt;
Large or growing lag indicates consumers are falling behind.&lt;/p&gt;
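&lt;p&gt;Lag itself is just arithmetic per partition (the offsets below are made-up numbers for illustration):&lt;/p&gt;

```python
# lag = latest produced offset (log end) - offset committed by the group
log_end   = {"p0": 1500, "p1": 1490, "p2": 980}
committed = {"p0": 1500, "p1": 1200, "p2": 970}

lag = {p: log_end[p] - committed[p] for p in log_end}
total_lag = sum(lag.values())  # a growing total means consumers are falling behind
```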




&lt;h2&gt;
  
  
  Serialization &amp;amp; Deserialization
&lt;/h2&gt;

&lt;p&gt;Kafka stores and transmits messages as byte arrays. &lt;br&gt;
Producers must serialize data into bytes before sending it to Kafka, and consumers must deserialize those bytes back into a usable data structure.&lt;/p&gt;
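&lt;p&gt;A minimal JSON round trip showing both sides (plain Python, no Kafka client required):&lt;/p&gt;

```python
import json

def serialize(event):
    # producer side: object -> bytes before sending to Kafka
    return json.dumps(event).encode("utf-8")

def deserialize(payload):
    # consumer side: bytes -> object after reading from Kafka
    return json.loads(payload.decode("utf-8"))

event = {"user_id": 42, "action": "click"}
payload = serialize(event)
restored = deserialize(payload)  # lossless round trip
```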

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Common Serialization Formats&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;JSON&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Pros&lt;/em&gt;: Human-readable, language-agnostic and easy debugging and integration.&lt;br&gt;
&lt;em&gt;Cons&lt;/em&gt;: Larger message size, no built-in schema &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Avro&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Pros&lt;/em&gt;: Compact, binary format and schema is stored separately&lt;br&gt;
&lt;em&gt;Cons&lt;/em&gt;: Requires schema management (via Schema Registry).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Protobuf (Protocol Buffers)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pros&lt;/em&gt;: Very compact and fast, Widely used in microservices &lt;br&gt;
&lt;em&gt;Cons&lt;/em&gt;: More complex tooling compared to JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confluent Schema Registry&lt;/strong&gt;&lt;br&gt;
One big challenge in Kafka is: how do producers and consumers agree on message structure over time?&lt;br&gt;
&lt;em&gt;Confluent Schema Registry&lt;/em&gt; solves this by providing a centralized repository for managing and validating the schemas of Kafka messages.&lt;br&gt;
Producers register schemas when publishing messages and consumers fetch schemas to deserialize data correctly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Replication &amp;amp; Fault Tolerance
&lt;/h2&gt;

&lt;p&gt;In Kafka, high availability and durability are achieved through replication, which ensures data is not lost if a broker fails, and fault tolerance, which allows the system to keep working even when components go down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replication&lt;/strong&gt;&lt;br&gt;
Remember a topic is divided into partitions and each partition can have replicas distributed across brokers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Replicas&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Leader replica&lt;/em&gt;&lt;br&gt;
Handles all reads/writes for a partition.&lt;br&gt;
&lt;em&gt;Follower replica&lt;/em&gt;&lt;br&gt;
Continuously fetches data from the leader to keep its log synchronized. This ensures that if the leader broker fails, a follower can take over without data loss.&lt;/p&gt;

&lt;p&gt;So the producer sends messages to the leader replica and follower replicas fetch data from the leader and stay in sync.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fault Tolerance&lt;/strong&gt;&lt;br&gt;
Achieved by ensuring data survives and remains available even after broker failures.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In-Sync Replicas (ISR):&lt;/em&gt;&lt;br&gt;
Replicas in the ISR are considered "in-sync" with the leader: they have successfully replicated all committed messages from the leader and are not significantly lagging.&lt;br&gt;
Only replicas within the ISR are eligible to be elected as the new leader if the current leader fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High Availability&lt;/strong&gt;&lt;br&gt;
If a broker hosting a leader replica fails, Kafka automatically elects a new leader from the remaining followers in the ISR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failover&lt;/strong&gt;&lt;br&gt;
Switching to a new leader when the current leader becomes unavailable. This ensures continuous operation of the Kafka cluster.&lt;/p&gt;

&lt;p&gt;Consumers and producers automatically detect the new leader via metadata updates.&lt;br&gt;
If a follower replica falls behind, it is temporarily removed from ISR.&lt;br&gt;
Once it catches up, it rejoins ISR.&lt;/p&gt;
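&lt;p&gt;The failover rule (a new leader is chosen only from the ISR) in miniature; broker ids are illustrative:&lt;/p&gt;

```python
def elect_leader(leader, isr, failed_broker):
    # leader unaffected by this failure: nothing to do
    if leader != failed_broker:
        return leader
    # promote the first surviving in-sync follower
    candidates = [b for b in isr if b != failed_broker]
    if not candidates:
        raise RuntimeError("no in-sync replica available for election")
    return candidates[0]

new_leader = elect_leader(leader=1, isr=[1, 2, 3], failed_broker=1)
```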




&lt;h2&gt;
  
  
  Kafka Connect
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kafka Connect&lt;/strong&gt; is a framework for scalably and reliably streaming data between Apache Kafka and other systems (databases, cloud storage, search indexes, file systems)&lt;br&gt;
It eliminates the need for writing custom integration code by providing a standardized way to move large datasets in and out of Kafka with minimal latency.&lt;/p&gt;

&lt;p&gt;Kafka Connect includes two types of connectors:&lt;br&gt;
&lt;strong&gt;Source Connector&lt;/strong&gt;&lt;br&gt;
Reads data from external systems into Kafka.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Stream database changes (CDC) from PostgreSQL, MySQL, etc.&lt;br&gt;
Collect metrics from application servers and publish them to Kafka topics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sink Connector&lt;/strong&gt;&lt;br&gt;
Writes data from Kafka into external systems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Index records into Elasticsearch for search.&lt;br&gt;
Load messages into HDFS, S3, or Hadoop for offline analytics.&lt;br&gt;
Insert Kafka records into relational databases or cloud warehouses.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;connector instance&lt;/strong&gt; is a job that copies data between Kafka and another system.&lt;br&gt;
A &lt;strong&gt;connector plugin&lt;/strong&gt; contains the implementation (classes, configs) that define how the integration works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Scalable&lt;/em&gt;: Runs as a distributed service across multiple worker nodes.&lt;br&gt;
&lt;em&gt;Reliable&lt;/em&gt;: Handles faults, retries, and exactly-once semantics&lt;/p&gt;




&lt;h2&gt;
  
  
  Kafka Streams
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kafka Streams&lt;/strong&gt; is a library that lets you process and analyze data as it flows through Kafka in real-time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Stateless vs Stateful Operations&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Stateless operations&lt;/strong&gt; don't need to remember anything from before. They process each record independently without retaining any information from previous records.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Map: transforms the value of a record&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stateful operations&lt;/strong&gt; on the other hand need to remember past data to work. They require maintaining and updating state based on past records to produce results.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Join: to combine records from two streams&lt;br&gt;
Aggregate: computes a single aggregated value for records grouped by key&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windowing Concepts&lt;/strong&gt;&lt;br&gt;
Windowing defines how streaming records are grouped by time so you can run stateful operations like aggregations or joins.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Types of Windows&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Tumbling Windows&lt;/strong&gt;&lt;br&gt;
Fixed-size, non-overlapping intervals.&lt;br&gt;
Each record goes into exactly one window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hopping Windows&lt;/strong&gt;&lt;br&gt;
Fixed-size, but overlapping intervals that “hop” forward in smaller steps.&lt;br&gt;
A record can belong to multiple windows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session Windows&lt;/strong&gt;&lt;br&gt;
Based on activity periods, not fixed time.&lt;br&gt;
A session continues while events keep coming in, and ends after a defined inactivity gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sliding Windows&lt;/strong&gt;&lt;br&gt;
Windows move continuously with event timestamps.&lt;br&gt;
More advanced, often used with the lower-level Processor API.&lt;/p&gt;
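&lt;p&gt;For the simplest case, tumbling windows, assigning a record to its window is a single modulo step:&lt;/p&gt;

```python
def tumbling_window_start(timestamp_ms, window_size_ms):
    # fixed-size, non-overlapping: every timestamp maps to exactly one window
    return timestamp_ms - (timestamp_ms % window_size_ms)

# with 5-second windows, events at 3s and 4s share a window; 7s starts a new one
w_3s = tumbling_window_start(3_000, 5_000)
w_4s = tumbling_window_start(4_000, 5_000)
w_7s = tumbling_window_start(7_000, 5_000)
```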




&lt;h2&gt;
  
  
  ksqlDB
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqwgrha7c0c24v55kp5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqwgrha7c0c24v55kp5w.png" alt=" " width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ksqlDB&lt;/strong&gt; provides a SQL-like interface to Apache Kafka, enabling real-time stream processing and analytics using familiar SQL syntax.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features and benefits&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;SQL-like Interface&lt;/em&gt;: Write queries using familiar SQL syntax just like working with a relational database&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Real-Time Stream Processing&lt;/em&gt;: Processes events the moment they arrive in Kafka topics&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Simplified Development&lt;/em&gt;: Eliminates the need for complex Java/Scala code by providing a higher-level SQL abstraction over Kafka Streams.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built on Kafka Streams&lt;/em&gt;: Inherits scalability, performance, and fault tolerance from the Kafka Streams library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases&lt;/strong&gt; include Real-time analytics &amp;amp; dashboards, Data transformation and enrichment, Fraud and anomaly detection, IoT event processing, Event-driven microservices&lt;/p&gt;




&lt;h2&gt;
  
  
  Transactions &amp;amp; Idempotence
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Exactly-Once Semantics&lt;/strong&gt; (EOS) guarantees that a message is delivered and processed by the consuming application exactly one time. This ensures that even if a system component fails, the message will not be lost, and its effect on the target system will not be duplicated. &lt;/p&gt;

&lt;p&gt;EOS is achieved by combining Kafka transactions and the use of idempotency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka transactions&lt;/strong&gt; enable atomic processing of multiple write operations by grouping them into a single, indivisible unit that either succeeds entirely or fails entirely, ensuring data consistency in the event of failures. Atomicity means that all operations within the transaction are committed together, or none of them are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency&lt;/strong&gt; refers to the ability to perform an operation multiple times without causing unintended side effects or changes to the system's state beyond the initial application. It guarantees that messages sent by a producer are written to the Kafka log exactly once, even if the producer retries sending due to network issues or broker failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;&lt;br&gt;
With &lt;code&gt;enable.idempotence=true&lt;/code&gt;, the broker assigns a Producer ID (PID) and sequence numbers to each message. This lets the broker detect and drop duplicates caused by retries.&lt;br&gt;
&lt;em&gt;Guarantee&lt;/em&gt;: Messages are written once, in order, within a single producer session.&lt;br&gt;
&lt;em&gt;Limitation&lt;/em&gt;: Doesn’t prevent duplicates across producer restarts or across multiple partitions.&lt;/p&gt;
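&lt;p&gt;The PID + sequence-number check can be modeled in a few lines (a toy broker, not Kafka’s actual implementation):&lt;/p&gt;

```python
class BrokerLog:
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer id -> highest sequence number accepted

    def append(self, pid, seq, message):
        # a retry re-sends an already-seen (pid, seq): drop it as a duplicate
        if self.last_seq.get(pid, -1) >= seq:
            return False
        self.last_seq[pid] = seq
        self.log.append(message)
        return True

broker = BrokerLog()
first = broker.append(pid=7, seq=0, message="order-1")
retry = broker.append(pid=7, seq=0, message="order-1")  # duplicate, dropped
nxt = broker.append(pid=7, seq=1, message="order-2")
```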




&lt;h2&gt;
  
  
  Security in Kafka
&lt;/h2&gt;

&lt;p&gt;Apache Kafka’s security model is built on several key components that work together to protect data and ensure data confidentiality, integrity, and controlled access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication&lt;/strong&gt;:&lt;br&gt;
Verifies the identity of clients (producers/consumers) and brokers attempting to connect to the Kafka cluster.&lt;br&gt;
Authentication mechanisms:&lt;br&gt;
SASL (Simple Authentication and Security Layer) is a framework for authentication used in Kafka to verify the identity of clients and brokers.&lt;/p&gt;

&lt;p&gt;Examples of SASL mechanisms in Kafka:&lt;br&gt;
&lt;em&gt;PLAIN&lt;/em&gt;: username + password authentication&lt;br&gt;
&lt;em&gt;GSSAPI (Kerberos)&lt;/em&gt;: provides strong, centralized authentication using a Kerberos Key Distribution Center (KDC) that issues tickets to clients and services&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authorization&lt;/strong&gt;:&lt;br&gt;
Authorization in Kafka is controlled by Access Control Lists (ACLs). Kafka checks authorization to determine what operations the client is allowed to perform. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Operations include&lt;/em&gt;:&lt;br&gt;
Read/Write: On topics.&lt;br&gt;
Describe/Create: On topics or consumer groups.&lt;br&gt;
Alter/Delete: On topics or consumer groups.&lt;br&gt;
Cluster-level operations: Such as managing brokers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encryption&lt;/strong&gt;&lt;br&gt;
Encryption ensures confidentiality and integrity of data in transit. It prevents unauthorized parties from reading or tampering with messages while they move: between clients (producers/consumers) and brokers and between brokers (inter-broker communication).&lt;/p&gt;

&lt;p&gt;Kafka uses TLS (SSL) to encrypt connections. When TLS is enabled, data is encrypted before being sent over the network, and the receiver decrypts it using trusted certificates.&lt;/p&gt;
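&lt;p&gt;As an illustrative sketch, a client properties file enabling both TLS and SASL/PLAIN might look like the following (the property names are Kafka’s standard client security settings; the paths and credentials are placeholders):&lt;/p&gt;

```properties
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="app-user" \
  password="app-secret";
ssl.truststore.location=/etc/kafka/client.truststore.jks
ssl.truststore.password=truststore-secret
```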




&lt;h2&gt;
  
  
  Operations &amp;amp; Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Metrics to Monitor&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To effectively monitor Kafka, you should track consumer lag, under-replicated partitions (URPs), and throughput/latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumer Lag&lt;/strong&gt;&lt;br&gt;
Shows the difference between the last message produced to a partition and the last message consumed by a consumer group.&lt;br&gt;
High lag means consumers are falling behind, which can cause delays in latency-sensitive applications.&lt;br&gt;
Large values in &lt;code&gt;records-lag-max&lt;/code&gt; indicate a significant delay in consumption.&lt;/p&gt;
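&lt;p&gt;Conceptually, lag is just the log-end offset minus the group’s committed offset, per partition. A minimal sketch (the offset numbers are made up; real values come from Kafka’s admin/consumer APIs):&lt;/p&gt;

```python
# Consumer lag per partition: log-end offset minus the group's committed offset.
def consumer_lag(log_end_offsets, committed_offsets):
    """Return {partition: lag} for a consumer group."""
    return {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }

lags = consumer_lag({0: 1200, 1: 980}, {0: 1150, 1: 980})
print(lags)  # {0: 50, 1: 0} -> partition 0 is 50 records behind
```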

&lt;p&gt;&lt;strong&gt;Broker Health &amp;amp; Under-Replicated Partitions (URP)&lt;/strong&gt;&lt;br&gt;
URPs occur when a partition’s follower replicas are not fully in sync with the leader.&lt;br&gt;
URPs signal a risk of data loss if a broker fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput and Latency&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;br&gt;
Measures rate of message processing, usually in records per second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;br&gt;
Measures time from when a message arrives until it is processed (end-to-end delay).&lt;/p&gt;

&lt;p&gt;High throughput combined with low latency indicates an efficient system.&lt;/p&gt;

&lt;p&gt;High latency signals potential bottlenecks (e.g., network congestion, slow consumers, broker overload).&lt;/p&gt;
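&lt;p&gt;The two metrics can be computed from simple counts and timestamps. A toy sketch (the numbers are invented for illustration):&lt;/p&gt;

```python
def throughput(records_processed, seconds):
    """Records processed per second."""
    return records_processed / seconds

def avg_latency(arrival_times, completion_times):
    """Average end-to-end delay per record, in seconds."""
    delays = [done - arrived for arrived, done in zip(arrival_times, completion_times)]
    return sum(delays) / len(delays)

throughput(50_000, 10)               # 5000.0 records per second
avg_latency([0.0, 1.0], [0.2, 1.3])  # about 0.25 s average end-to-end delay
```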




&lt;h2&gt;
  
  
  Scaling Kafka
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scaling Kafka&lt;/strong&gt; is adjusting the Kafka cluster’s capacity so it can handle more data, higher throughput, or more clients without performance degradation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition Count Tuning&lt;/strong&gt;&lt;br&gt;
Increasing the number of partitions in a topic boosts parallelism, allowing more consumers in a consumer group to process data concurrently.&lt;/p&gt;

&lt;p&gt;More partitions improve throughput; however, they also add overhead in terms of open file handles, memory, and controller load.&lt;/p&gt;
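&lt;p&gt;A simplified model of why parallelism is capped by partition count (this is a toy round-robin assignment, not Kafka’s actual partition assignor):&lt;/p&gt;

```python
# Round-robin-style assignment of topic partitions to consumers in a group.
# Parallelism is capped by the partition count: extra consumers sit idle.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

assign([0, 1, 2, 3], ["c1", "c2"])   # {'c1': [0, 2], 'c2': [1, 3]}
assign([0, 1], ["c1", "c2", "c3"])   # 'c3' gets no partitions
```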

&lt;p&gt;&lt;strong&gt;Adding Brokers&lt;/strong&gt;&lt;br&gt;
Adding brokers to the cluster distributes data and workload across more servers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Benefits&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increases overall storage capacity&lt;/li&gt;
&lt;li&gt;Improves throughput by balancing load&lt;/li&gt;
&lt;li&gt;Enhances fault tolerance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rebalancing Partitions&lt;/strong&gt;&lt;br&gt;
Rebalancing redistributes partitions across brokers to ensure even load distribution.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Triggers&lt;/em&gt;&lt;br&gt;
New brokers are added&lt;br&gt;
Brokers are removed or fail&lt;br&gt;
Uneven data distribution is detected&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Optimization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Performance Optimization&lt;/strong&gt; focuses on making message production, storage, and consumption faster and more reliable by reducing bottlenecks across the producer, broker, and consumer pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batching and Compression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Batching&lt;/em&gt;:&lt;br&gt;
Kafka producers send messages in batches instead of individually, which means fewer network round trips.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Compression&lt;/em&gt;:&lt;br&gt;
Compressing batches reduces network bandwidth and disk usage.&lt;/p&gt;
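&lt;p&gt;A quick way to see the effect, using Python’s standard library on a hypothetical batch of JSON events (Kafka supports gzip among other codecs; the record shape here is made up):&lt;/p&gt;

```python
import gzip
import json

# Build a batch of similar JSON records; repetitive field names compress very well.
records = [{"user_id": i, "event": "page_view", "page": "/home"} for i in range(500)]
batch = "\n".join(json.dumps(r) for r in records).encode()

compressed = gzip.compress(batch)
# The compressed batch is a small fraction of the raw size, saving network
# bandwidth and disk space for every batch the producer sends.
ratio = len(compressed) / len(batch)
```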

&lt;p&gt;&lt;strong&gt;Page Cache Usage&lt;/strong&gt;&lt;br&gt;
Kafka relies on the OS page cache for fast disk reads/writes.&lt;br&gt;
Keeping enough RAM ensures frequently accessed logs stay cached, avoiding expensive disk I/O.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Benefit&lt;/em&gt;: Minimizes random disk I/O, improving read and write performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disk &amp;amp; Network Considerations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disk&lt;/em&gt;:&lt;br&gt;
Use fast disks (SSD preferred over HDD).&lt;br&gt;
Separate Kafka log directories from OS and application logs to avoid I/O contention.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Network&lt;/em&gt;:&lt;br&gt;
Kafka is network-intensive, so provision high-bandwidth, low-latency connections (10 Gbps+ for large clusters), tuned socket buffers, and balanced partition leaders.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Apache Kafka is a powerful distributed event streaming platform designed for real-time data processing at scale. Its scalability, durability, and rich ecosystem make it a backbone for event-driven architectures and data pipelines. As organizations continue to generate and consume data at massive scale, Kafka provides the foundation to ensure that information flows reliably, securely, and with low latency.&lt;/p&gt;




</description>
      <category>beginners</category>
      <category>dataengineering</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Getting Started with Docker and Docker Compose: A Beginner’s Guide</title>
      <dc:creator>susan waweru</dc:creator>
      <pubDate>Wed, 27 Aug 2025 11:47:06 +0000</pubDate>
      <link>https://dev.to/susan_waweru_a4fd6b337288/getting-started-with-docker-and-docker-compose-a-beginners-guide-31e4</link>
      <guid>https://dev.to/susan_waweru_a4fd6b337288/getting-started-with-docker-and-docker-compose-a-beginners-guide-31e4</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;DOCKER&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Docker is a platform that packages applications and their dependencies into lightweight containers, ensuring they run consistently across different environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HOW IT WORKS&lt;/strong&gt;&lt;br&gt;
Instead of the familiar problem of “it works on my machine,” Docker guarantees “it works everywhere.” With Docker, developers don’t need to manually install dependencies or replicate complex setups. As long as Docker is installed, they can run containers that already include the code, versions, dependencies, and everything else an application requires to run.&lt;br&gt;
Docker manages these containers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pzi62k5cnnx19p2xfjs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pzi62k5cnnx19p2xfjs.png" alt=" " width="800" height="751"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Containers vs Virtual Machines (VM)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each VM runs its own OS along with the application and its dependencies, making VMs larger and slower to start.&lt;br&gt;
Containers share the host OS kernel while isolating processes. Each container holds only the application and its dependencies, not a full OS, so it starts much faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IMAGE&lt;/strong&gt;&lt;br&gt;
A Docker image is a blueprint or template for creating containers. It packages everything an application needs to run.&lt;/p&gt;

&lt;p&gt;It has the following stored inside it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime environment (e.g., a specific Python or Java version)&lt;/li&gt;
&lt;li&gt;Required libraries and dependencies&lt;/li&gt;
&lt;li&gt;The application code&lt;/li&gt;
&lt;li&gt;Base OS (e.g., Ubuntu, Alpine, Debian)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker images are read-only and immutable: once built, they do not change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CONTAINER&lt;/strong&gt;&lt;br&gt;
A container is a runnable instance of an image: when you run an image, Docker creates a container from it. Each container is an isolated process that executes the application exactly as defined in the image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DOCKER FILE&lt;/strong&gt;&lt;br&gt;
A recipe/instruction file used to build a Docker image.&lt;/p&gt;

&lt;p&gt;Contains steps like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What base image to use&lt;/li&gt;
&lt;li&gt;What libraries to install&lt;/li&gt;
&lt;li&gt;How to copy code into the image&lt;/li&gt;
&lt;li&gt;What command to run when the container starts&lt;/li&gt;
&lt;/ul&gt;
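&lt;p&gt;As a minimal sketch, a Dockerfile for a hypothetical Python app (the file names &lt;code&gt;app.py&lt;/code&gt; and &lt;code&gt;requirements.txt&lt;/code&gt; are placeholders) could look like:&lt;/p&gt;

```dockerfile
# What base image to use
FROM python:3.11-slim

WORKDIR /app

# What libraries to install
COPY requirements.txt .
RUN pip install -r requirements.txt

# How to copy code into the image
COPY . .

# What command to run when the container starts
CMD ["python", "app.py"]
```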

&lt;p&gt;&lt;strong&gt;DOCKER HUB&lt;/strong&gt;&lt;br&gt;
A registry/repository where Docker images are stored and shared.&lt;br&gt;
It’s like GitHub, but for images. Official images (Python, Postgres, Nginx, Ubuntu, etc.) live here. You can also push your custom images for sharing/deployment.&lt;/p&gt;

&lt;p&gt;To install Docker on WSL 2 with Ubuntu 22.04 follow these steps:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Step 1: Update system&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update &amp;amp;&amp;amp; sudo apt upgrade -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 2: Install required dependencies&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 3: Add Docker GPG key&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 4: Add Docker Repository&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "deb [signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list &amp;gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 5: Install Docker Engine&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 6: Add your user to the Docker Group&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo usermod -aG docker $USER
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Step 7: Finally restart WSL on CMD/ Powershell.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wsl --shutdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;then reopen Ubuntu and verify the installation&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;You should see something like below&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker version 28.3.3, build 980b856
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;DOCKER COMPOSE&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Docker Compose is a tool that allows you to define, configure, and run multi-container Docker applications using a single YAML file (docker-compose.yml).&lt;/p&gt;

&lt;p&gt;Instead of starting each container manually with long docker run commands, Compose lets you describe the whole application stack in one place and bring it up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;HOW IT WORKS&lt;/strong&gt;&lt;br&gt;
With Docker Compose, you use a special configuration file written in YAML (called docker-compose.yml or compose.yaml) to describe your application. In this file, you list all the services your app needs like a web server, database, or cache and how they should work together.&lt;br&gt;
Once the file is ready, you can start everything with a single command using the Compose CLI&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The format of the Compose file follows a standard set of rules called the Compose Specification, which ensures your multi-container applications are defined in a consistent way.&lt;/p&gt;
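&lt;p&gt;A minimal example of such a file, assuming a hypothetical web app built from a local Dockerfile plus a Postgres database:&lt;/p&gt;

```yaml
services:
  web:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
    volumes:
      - db-data:/var/lib/postgresql/data

volumes:
  db-data:
```

&lt;p&gt;Running &lt;code&gt;docker compose up&lt;/code&gt; in the same directory starts both services together.&lt;/p&gt;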

&lt;p&gt;&lt;strong&gt;WHY USE COMPOSE&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It simplifies multi-container app setup: Defines and manages multi-container apps in one YAML file&lt;/li&gt;
&lt;li&gt;Efficient collaboration: Shareable YAML files support smooth collaboration between developers and operations&lt;/li&gt;
&lt;li&gt;Makes environments reproducible with a single command.&lt;/li&gt;
&lt;li&gt;Great for local development, CI/CD pipelines, and testing microservices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To install Docker Compose, follow the steps at the link below&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://docs.docker.com/compose/install/" rel="noopener noreferrer"&gt;https://docs.docker.com/compose/install/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;COMMANDS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To start all the services defined in your compose.yaml file run
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;To stop the services run
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;To list all the services and their current status run
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>Data Engineering Concepts</title>
      <dc:creator>susan waweru</dc:creator>
      <pubDate>Thu, 14 Aug 2025 14:43:13 +0000</pubDate>
      <link>https://dev.to/susan_waweru_a4fd6b337288/concepts-of-data-engineering-54l6</link>
      <guid>https://dev.to/susan_waweru_a4fd6b337288/concepts-of-data-engineering-54l6</guid>
      <description>&lt;p&gt;Data engineering is the discipline of designing, building, and maintaining the systems and workflows that make data accessible, reliable, and ready for analysis. It is the behind-the-scenes backbone of modern analytics, machine learning, and business intelligence.&lt;/p&gt;

&lt;p&gt;In this article, we will explore some foundational concepts in data engineering&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Batch vs Streaming Ingestion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These are methods of getting data into a system (ingestion) for processing, analytics, or storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch Ingestion&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;What it is:&lt;/em&gt; &lt;br&gt;
Involves collecting data over a period of time, then processing it all at once in a batch.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;When used:&lt;/em&gt;&lt;br&gt;
Batch ingestion is used when immediate results are not needed and when large volumes of data must be processed at once.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data is stored temporarily (over a period of time)&lt;/li&gt;
&lt;li&gt;At a set time, a job is scheduled to run (to read and process the data in chunks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example use case&lt;/em&gt;&lt;br&gt;
Data warehouse: loading data into a data warehouse from operational systems daily&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming Ingestion&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;What it is:&lt;/em&gt;&lt;br&gt;
Involves collecting data in real time as it arrives.&lt;br&gt;
Data is continuously ingested and processed without waiting for a batch window.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;When used:&lt;/em&gt;&lt;br&gt;
Used when one needs real-time insights and fast reactions, eg, in fraud detection, real-time alerts&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data flows continuously&lt;/li&gt;
&lt;li&gt;data is processed in small increments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example use case&lt;/em&gt;&lt;br&gt;
Stock trading - the trade data in financial markets is ingested in real time&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch vs Streaming Ingestion&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Batch&lt;/strong&gt;&lt;br&gt;
Processing of data: Scheduled&lt;br&gt;
Data delay: Over time (min -hrs)&lt;br&gt;
Data Volume: Large chunks of data at once&lt;br&gt;
Use case: Reports, analytics, ETL jobs&lt;br&gt;
Examples: Payroll, web logs&lt;br&gt;
Tools: SSIS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming&lt;/strong&gt;&lt;br&gt;
Processing of data: Real-time&lt;br&gt;
Data delay: Real time (seconds to milliseconds)&lt;br&gt;
Data volume: Small data events continuously&lt;br&gt;
Use case: Real-time alerts, monitoring, dashboard&lt;br&gt;
Examples: Stock prices, IoT sensor data&lt;br&gt;
Tools: Apache Kafka, Spark Streaming&lt;/p&gt;
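&lt;p&gt;The contrast can be sketched in a few lines of Python (a toy model, not a real ingestion framework):&lt;/p&gt;

```python
def batch_ingest(accumulated_events):
    """Runs on a schedule over everything collected during the window."""
    return [e.upper() for e in accumulated_events]

def stream_ingest(event_source):
    """Handles each event individually, as soon as it arrives."""
    for event in event_source:
        yield event.upper()

batch_ingest(["a", "b", "c"])           # one pass over the whole chunk
list(stream_ingest(iter(["a", "b"])))   # each element processed on arrival
```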

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ms6qm6821esovupfzhk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ms6qm6821esovupfzhk.png" alt=" " width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Change Data Capture (CDC)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Just as the name suggests, CDC is a technique used to track and capture changes in a database so those changes can be propagated to another system or used for downstream processing.&lt;br&gt;
The changes can be inserts, updates, deletes, etc. in a database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDC&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Detect the changes&lt;br&gt;
Data sources (usually relational databases) are monitored for changes&lt;br&gt;
Rows that have been inserted, updated, or deleted are identified&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Capture the changes&lt;br&gt;
The data changes are logged in a table or a log&lt;br&gt;
Metadata (operation type, timestamp) is included with the captured changes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deliver changes&lt;br&gt;
Push or pull the changes to a target system (eg data warehouse, an analytics tool, streaming pipeline)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Process changes&lt;br&gt;
Downstream systems (eg dashboards, alerts, machine learning models) update/ react based on the new data&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
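&lt;p&gt;The steps above can be sketched as a small change-event applier (the event shape with &lt;code&gt;op&lt;/code&gt;, &lt;code&gt;key&lt;/code&gt;, &lt;code&gt;row&lt;/code&gt;, and &lt;code&gt;ts&lt;/code&gt; fields is a made-up example, not a specific CDC tool’s format):&lt;/p&gt;

```python
def apply_changes(target, events):
    """Replay CDC events (in order) against a target key-value 'table'."""
    for ev in events:
        if ev["op"] in ("insert", "update"):
            target[ev["key"]] = ev["row"]
        elif ev["op"] == "delete":
            target.pop(ev["key"], None)
    return target

events = [
    {"op": "insert", "key": 1, "row": {"name": "Ann"}, "ts": "10:00"},
    {"op": "update", "key": 1, "row": {"name": "Anne"}, "ts": "10:05"},
    {"op": "delete", "key": 1, "ts": "10:09"},
]
apply_changes({}, events)       # {} - inserted, updated, then deleted
apply_changes({}, events[:2])   # {1: {'name': 'Anne'}}
```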

&lt;p&gt;&lt;strong&gt;&lt;em&gt;How CDC is implemented&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Database triggers&lt;br&gt;
Timestamps&lt;br&gt;
Transaction logs&lt;br&gt;
Third-party tools/ services&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why use CDC&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Efficiency: avoids full table scans to find out what has changed&lt;br&gt;
Real time replication: keeps systems in sync&lt;br&gt;
Audit trails: records what changed, when and by whom&lt;br&gt;
Data pipelines: feeds changes into data lakes, warehouses&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Use cases&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data warehousing: &lt;br&gt;
Keeps data warehouse in sync with operational/ transactional dbs without reloading entire tables&lt;br&gt;
Source Systems log changes via CDC -&amp;gt; ETL/ELT pipelines extract only the changed records (new, updated, or deleted) -&amp;gt; loaded into the data warehouse&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Microservices communication&lt;br&gt;
A monolith or microservice writes data to a shared db -&amp;gt; a CDC tool captures the changes and publishes them -&amp;gt; other services consume the change events and react&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ETL optimization&lt;br&gt;
Reduces processing time and system load in ETL pipelines by extracting only what changed instead of doing full table scans&lt;br&gt;
ETL jobs query CDC change tables or logs instead of full source tables -&amp;gt; extract only rows with a lastModified value or CDC log entry since the last run&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Idempotency&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Idempotency means an operation can be performed multiple times without changing the result beyond the initial application, i.e., doing it once or doing it multiple times gives the same outcome.&lt;br&gt;
In short, an idempotent operation can be repeated safely: repeating it has the same effect as doing it once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;How it works&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Say you’re calling an API or running a DB operation.&lt;br&gt;
Without idempotency, repeating the same operation multiple times could result in duplicate records or errors.&lt;br&gt;
With idempotency, repeating the same operation does not affect the result; it behaves as if done only once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;What Idempotency involves&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Unique Identifiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;track each operation performed using a key&lt;/li&gt;
&lt;li&gt;server checks if the request with the same key has already been processed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;State checks: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;systems check the current state before performing an action, e.g., only update if the status is still “Pending”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Idempotent methods: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In REST APIs, GET, PUT, and DELETE are idempotent; POST is not (but can be made idempotent using a unique key)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upserts: in databases you can prevent duplicates by using logic like “update if exists, else insert”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Where Idempotency is used mostly&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs (POST/PUT): prevent duplicate submissions&lt;/li&gt;
&lt;li&gt;Databases: prevent duplicate rows or repeated updates&lt;/li&gt;
&lt;li&gt;ETL/batch jobs: rerunning jobs shouldn’t corrupt data&lt;/li&gt;
&lt;li&gt;Distributed systems: network retries must not reapply the same action&lt;/li&gt;
&lt;li&gt;Messaging systems: reprocessing messages should not repeat side effects&lt;/li&gt;
&lt;li&gt;Payments: the backend uses the same payment_id and returns the existing result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Use Cases&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
1. Payments: if the user clicks “Pay” but the network fails, without idempotency the user may be charged twice; with idempotency the backend reuses the same payment_id and returns the existing result&lt;br&gt;
2. API calls: if the same request is sent again with the same key, the server ignores it or returns the first response&lt;br&gt;
3. Database upserts&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Importance&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents duplicate transactions&lt;/li&gt;
&lt;li&gt;Improves system reliability&lt;/li&gt;
&lt;li&gt;Enables safe retries in unreliable networks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. OLTP vs OLAP&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Online Transaction Processing (OLTP)&lt;/strong&gt;&lt;br&gt;
Systems optimized for managing day-to-day transactional operations. They are designed for speed, consistency and concurrency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use Case&lt;/em&gt;&lt;br&gt;
E-commerce systems (shopping cart, orders)&lt;br&gt;
Banking systems&lt;br&gt;
Ticket booking platforms&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
A customer places an order in an online store; the OLTP system records the purchase, updates inventory, and creates an invoice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online Analytical Processing (OLAP)&lt;/strong&gt;&lt;br&gt;
Designed for analyzing large volumes of historical data, enabling business intelligence, reporting and decision-making&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use Case&lt;/em&gt;&lt;br&gt;
Sales trends&lt;br&gt;
Executive dashboards&lt;br&gt;
Profitability analysis by region or product&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Analysts want to know “what were the total sales per region in Year1 vs Year2?”&lt;br&gt;
This query runs on the OLAP system, which processes and aggregates the data quickly&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OLTP vs OLAP differences&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;OLTP&lt;/strong&gt;&lt;br&gt;
Purpose: Handles real-time transactions&lt;br&gt;
Data type: Operational, i.e., current data&lt;br&gt;
Operations: Insert, Update, Delete, and Read&lt;br&gt;
Users: Mainly frontline employees and systems&lt;br&gt;
Speed focus: High speed, low latency&lt;br&gt;
Database: Normalized (3NF), fewer indexes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OLAP&lt;/strong&gt;&lt;br&gt;
Purpose: Supports complex data analysis and reporting&lt;br&gt;
Data type: Historical (aggregated, summarized) data&lt;br&gt;
Operations: Read-heavy operations, i.e., SELECTs&lt;br&gt;
Users: Mainly analysts and decision-makers&lt;br&gt;
Speed focus: Fast query performance on large datasets&lt;br&gt;
Database: Denormalized (star/snowflake schema)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3yzbeafilnvznr1gej9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3yzbeafilnvznr1gej9.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Columnar vs Row-based Storage&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These are ways databases organize data&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row-based storage:&lt;/strong&gt;&lt;br&gt;
Stores data row by row.&lt;br&gt;
All columns for a single row are stored together.&lt;br&gt;
Efficient for OLTP, where whole records are frequently accessed or modified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Columnar storage:&lt;/strong&gt;&lt;br&gt;
Stores data column by column.&lt;br&gt;
All values for a single column are stored together.&lt;br&gt;
Columnar storage excels at analytical (OLAP) queries that focus on specific columns across many rows/over large datasets.&lt;/p&gt;

&lt;p&gt;For example, consider the query below:&lt;br&gt;
&lt;code&gt;SELECT AVG(AGE) FROM STUDENTS;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Columnar storage works better here because it only reads the AGE column. Row-based storage, on the other hand, would have to read every row and skip over the non-AGE columns, which is inefficient.&lt;/p&gt;
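&lt;p&gt;A toy illustration of the same data in both layouts (the STUDENTS rows are made up):&lt;/p&gt;

```python
# The same STUDENTS data in row vs columnar layout.
rows = [{"id": 1, "name": "A", "age": 20}, {"id": 2, "name": "B", "age": 30}]
columns = {"id": [1, 2], "name": ["A", "B"], "age": [20, 30]}

# Row-based AVG(AGE): touch every row, skipping the non-AGE fields.
avg_row = sum(r["age"] for r in rows) / len(rows)

# Columnar AVG(AGE): read just the AGE column.
avg_col = sum(columns["age"]) / len(columns["age"])
# Both give 25.0, but columnar reads only the one column it needs.
```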

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp5bn2xm9vmjm2lkblct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp5bn2xm9vmjm2lkblct.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Partitioning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Partitioning involves dividing a large dataset into smaller subsets called partitions. The partitions are stored and managed independently.&lt;/p&gt;

&lt;p&gt;Data partitioning is done mostly for performance and scalability.&lt;br&gt;
For instance, if you have a large dataset but only need to query a certain period, say last month, then without partitioning the engine would have to scan the entire dataset, consuming a lot of time; with partitioning, the engine skips all the irrelevant partitions and scans only those that match the query.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Improved performance&lt;/em&gt;&lt;br&gt;
ie faster query execution and data retrieval&lt;br&gt;
Because data is divided into smaller partitions, when querying, the query can target only the relevant subset hence reducing the amount of data that needs to be scanned and processed.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Enhanced scalability&lt;/em&gt;&lt;br&gt;
Because partitions are distributed across multiple nodes/servers, more server resources can be added to handle increasing data volumes and demand.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Increased availability&lt;/em&gt;&lt;br&gt;
Because there are several partitions, if one partition is unavailable the other partitions can still be accessed, ensuring continued data availability.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Drawback&lt;/em&gt;&lt;br&gt;
One drawback of partitioning is extra management overhead when there are too many small partitions.&lt;/p&gt;
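&lt;p&gt;Partition pruning can be sketched with data partitioned by month (a toy in-memory model; real engines prune file or table partitions the same way):&lt;/p&gt;

```python
# Events stored per month; a query for one month scans only that partition.
partitions = {
    "2025-06": [{"order_id": 1}, {"order_id": 2}],
    "2025-07": [{"order_id": 3}],
    "2025-08": [{"order_id": 4}, {"order_id": 5}],
}

def query_month(month):
    """Scan only the requested partition; the rest are never touched."""
    return partitions.get(month, [])

query_month("2025-08")  # only the 2025-08 partition is read
```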

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhojbtdftnsai6jq5mk8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhojbtdftnsai6jq5mk8.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. ETL vs ELT&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Both are data integration methods, i.e., ways of moving and processing data.&lt;br&gt;
They differ in that ETL (Extract, Transform, Load) transforms data before loading it into the target system, whereas ELT (Extract, Load, Transform) first loads data into the target system and then transforms it within the target.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ETL Steps&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is extracted from its source&lt;/li&gt;
&lt;li&gt;It is then transformed in a staging area&lt;/li&gt;
&lt;li&gt;The transformed data is then loaded into a data warehouse or data lake&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ETL is better suited to complex transformations or cases where data quality and consistency are paramount.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ELT Steps&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is extracted from its source&lt;/li&gt;
&lt;li&gt;It is loaded directly into a data warehouse or data lake without transformation&lt;/li&gt;
&lt;li&gt;Transformation is applied within the target&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ELT is more suitable for cloud-based data warehouses and data lakes.&lt;br&gt;
It is efficient for handling large volumes of data and is well suited to analytics workloads.&lt;/p&gt;
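&lt;p&gt;The difference can be sketched with plain lists standing in for the staging area and the target (a toy model, not a real pipeline tool):&lt;/p&gt;

```python
def clean(s):
    """The transformation step: trim whitespace and normalize case."""
    return s.strip().lower()

raw = [" alice ", "BOB", " carol"]

# ETL: transform in staging first, then load only the transformed data.
etl_target = [clean(r) for r in raw]

# ELT: load the raw data as-is, keep it, and transform inside the target.
elt_target_raw = list(raw)
elt_target_transformed = [clean(r) for r in elt_target_raw]

etl_target       # ['alice', 'bob', 'carol']
elt_target_raw   # the raw copy is preserved in the target
```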

&lt;p&gt;&lt;em&gt;Comparison&lt;/em&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;&lt;strong&gt;ETL&lt;/strong&gt;&lt;/th&gt;&lt;th&gt;&lt;strong&gt;ELT&lt;/strong&gt;&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Transform step&lt;/td&gt;&lt;td&gt;Before loading&lt;/td&gt;&lt;td&gt;After loading&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Target type&lt;/td&gt;&lt;td&gt;Legacy/on-prem warehouses&lt;/td&gt;&lt;td&gt;Cloud warehouses/lakes&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Raw data stored?&lt;/td&gt;&lt;td&gt;No (only transformed)&lt;/td&gt;&lt;td&gt;Yes (raw + transformed)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Speed of loading&lt;/td&gt;&lt;td&gt;Slower (pre-transform)&lt;/td&gt;&lt;td&gt;Faster (load first)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Flexibility&lt;/td&gt;&lt;td&gt;Lower&lt;/td&gt;&lt;td&gt;Higher&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Compute location&lt;/td&gt;&lt;td&gt;ETL tool/staging server&lt;/td&gt;&lt;td&gt;Data warehouse/lake&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
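&lt;p&gt;The difference in ordering can be sketched in a few lines of Python. This is a toy illustration rather than a real pipeline; the record shape and the transform rule (uppercasing names) are invented for the example.&lt;/p&gt;

```python
# Toy ETL vs ELT: same steps, different order.

def extract():
    # Pretend these rows came from a source system
    return [{"name": "alice", "amount": "10"}, {"name": "bob", "amount": "5"}]

def transform(rows):
    # Clean/standardize: uppercase names, cast amounts to int
    return [{"name": r["name"].upper(), "amount": int(r["amount"])} for r in rows]

warehouse = {}  # stands in for the target data warehouse/lake

def etl():
    # Transform happens BEFORE loading: only clean data is stored
    warehouse["etl_table"] = transform(extract())

def elt():
    # Load raw data first, then transform INSIDE the target
    warehouse["raw_table"] = extract()
    warehouse["elt_table"] = transform(warehouse["raw_table"])

etl()
elt()
print(warehouse["etl_table"] == warehouse["elt_table"])  # True - same result, different path
```

&lt;p&gt;Note that only the ELT path keeps the raw rows around in the target, which is what makes later re-transformation cheap.&lt;/p&gt;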

&lt;h2&gt;
  
  
  &lt;strong&gt;8. CAP Theorem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The CAP theorem states that in a distributed data system, it's impossible to simultaneously achieve all three of Consistency, Availability and Partition Tolerance. A distributed system must choose at most two of the three properties to prioritize&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency (C)&lt;/strong&gt;&lt;br&gt;
Every read receives the most recent write or an error. After a successful write, any subsequent read should return that updated value. The clients therefore see a single, up-to-date view of the data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
In a banking system, if you deposit money at ATM A and then immediately check the balance at ATM B, the deposit should be reflected in the new balance&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Availability (A)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system never refuses to answer&lt;/li&gt;
&lt;li&gt;Every request receives a valid (non-error) response, but it might not be the latest data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
In the banking system, if there’s a network glitch between the data centers when checking balance after depositing, it shows the previous balance instead of the current one&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition Tolerance (P)&lt;/strong&gt;&lt;br&gt;
The system continues to operate despite network failures or delays between nodes, i.e. even if messages between parts of the system are lost or delayed&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
A distributed database runs across 3 data centers and communication to one of them breaks; partition tolerance means the remaining data centers can still serve reads/writes despite being cut off from the third.&lt;/p&gt;

&lt;p&gt;If part of the network goes down the system still runs using the available nodes/ reachable nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Key&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
In real-world distributed systems, Partition Tolerance is non-negotiable because networks can and will fail. The real trade-off in CAP is therefore Consistency vs Availability during a partition:&lt;br&gt;
CP systems: prefer correctness&lt;br&gt;
AP systems: prefer uptime (may return stale data)&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;9. Windowing in Streaming&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Windowing in stream processing allows data to be processed in small, manageable chunks over a specified period. It's a technique used to handle large datasets, real-time data processing, and in-memory analytics.&lt;br&gt;
The aim of windowing is to let stream processing applications break down continuous data streams into manageable chunks for processing and analysis.&lt;/p&gt;

&lt;p&gt;Data streams (e.g. sensor readings) never end, so to calculate metrics like count, sum or average over time, you can't wait for the stream to finish because it won't. Windowing lets you process data in slices of time or count, producing intermediate results&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Windows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Tumbling Window&lt;/strong&gt;&lt;br&gt;
Divides data streams into fixed-size, non-overlapping time intervals&lt;br&gt;
Each event belongs to exactly one window&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use Case example&lt;/em&gt;&lt;br&gt;
Website traffic monitoring: number of visits per minute&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Sliding Window&lt;/strong&gt;&lt;br&gt;
Creates overlapping intervals, allowing an event to be included in multiple windows&lt;br&gt;
Defined by window length and slide interval.&lt;br&gt;
They capture overlapping patterns and trends in the data stream&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use Case example&lt;/em&gt;&lt;br&gt;
Network monitoring: track packet loss or error rate over overlapping periods&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Session Window&lt;/strong&gt;&lt;br&gt;
Groups events that occur within a specific timeframe, separated by periods of inactivity. &lt;br&gt;
The window size is not fixed but determined by the gaps between events.&lt;br&gt;
Window closes after a period of inactivity (gap)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Use Case example&lt;/em&gt;&lt;br&gt;
E-commerce analytics: track user journey from first click to checkout&lt;/p&gt;
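&lt;p&gt;As a rough sketch, a tumbling window can be implemented by bucketing event timestamps into fixed intervals; the sample events and the 60-second window size below are made up for illustration.&lt;/p&gt;

```python
from collections import defaultdict

WINDOW = 60  # tumbling window size in seconds

# (timestamp_seconds, event) pairs - an invented sample stream of page visits
events = [(3, "visit"), (42, "visit"), (61, "visit"), (119, "visit"), (125, "visit")]

counts = defaultdict(int)
for ts, _ in events:
    # Integer division maps each event to exactly one non-overlapping window
    window_start = (ts // WINDOW) * WINDOW
    counts[window_start] += 1

print(dict(counts))  # {0: 2, 60: 2, 120: 1} - visits per minute
```

&lt;p&gt;A sliding window would instead assign each event to every window whose interval covers its timestamp, and a session window would start a new bucket whenever the gap since the previous event exceeds the inactivity threshold.&lt;/p&gt;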

&lt;h2&gt;
  
  
  &lt;strong&gt;10. DAGs and Workflow Orchestration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DAGs (Directed Acyclic Graphs)&lt;/strong&gt; &lt;br&gt;
A DAG is a graph structure used to represent tasks and their dependencies in a workflow.&lt;br&gt;
DAGs provide a structured way to represent and manage complex processes.&lt;br&gt;
Directed: each edge has a direction, A -&amp;gt; B meaning B depends on A&lt;br&gt;
Acyclic: no loops or cycles&lt;br&gt;
Graph: made up of nodes (tasks) and edges (dependencies)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;DAGS:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ensure tasks execute in the right order&lt;/li&gt;
&lt;li&gt;make it clear which tasks can run in parallel and which must wait&lt;/li&gt;
&lt;li&gt;prevent infinite loops in workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Extract -&amp;gt; Transform -&amp;gt; Load&lt;br&gt;
Extract runs first, Transform waits for Extract, Load runs after Transform&lt;/p&gt;
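&lt;p&gt;This ordering guarantee can be illustrated with Python's standard-library topological sorter; the three-task graph below mirrors the ETL example above and is the only detail invented here.&lt;/p&gt;

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Each entry reads: task depends on {upstream tasks}
dag = {
    "extract": set(),            # no dependencies - runs first
    "transform": {"extract"},    # waits for extract
    "load": {"transform"},       # waits for transform
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load']
```

&lt;p&gt;Orchestrators like Airflow perform essentially this sort (plus scheduling, retries and monitoring) to decide which tasks are ready to run.&lt;/p&gt;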

&lt;p&gt;&lt;strong&gt;Workflow Orchestration&lt;/strong&gt;&lt;br&gt;
Process of defining, scheduling and monitoring workflows. Ensures every task happens at the right time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A workflow orchestrator:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;defines workflows&lt;/li&gt;
&lt;li&gt;schedules workflows&lt;/li&gt;
&lt;li&gt;manages dependencies&lt;/li&gt;
&lt;li&gt;handles failures and retries&lt;/li&gt;
&lt;li&gt;monitors runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjty290z00rty5yccwfti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjty290z00rty5yccwfti.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;11. Retry Logic &amp;amp; Dead Letter Queues&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Essential components in building robust and resilient distributed systems&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retry Logic&lt;/strong&gt;&lt;br&gt;
The ability of a system to automatically reattempt an operation when it fails, instead of instantly giving up. &lt;br&gt;
It's crucial for handling transient errors: temporary failures that are likely to resolve themselves with time.&lt;br&gt;
Failures are often temporary (e.g. a network glitch or a busy server), so retrying can save you from losing messages&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How it works&lt;/em&gt;&lt;br&gt;
When an operation fails, the system attempts to re-execute it after a short delay. This process can be repeated a number of times. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediate Retries – try again instantly&lt;/li&gt;
&lt;li&gt;Fixed Interval – wait a fixed time before retrying&lt;/li&gt;
&lt;li&gt;Exponential Backoff – increase the wait time exponentially after each failure&lt;/li&gt;
&lt;li&gt;Jitter – add randomness to wait times to avoid "retry storms" when many clients retry at once&lt;/li&gt;
&lt;/ul&gt;
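&lt;p&gt;These strategies can be combined into one small generic helper. This is a sketch not tied to any particular library; the attempt count, base delay and the simulated flaky operation are arbitrary.&lt;/p&gt;

```python
import random
import time

def retry(operation, max_attempts=3, base_delay=0.01):
    """Retry an operation with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted - let the caller (or a DLQ) handle it
            delay = base_delay * (2 ** attempt)     # exponential backoff
            delay += random.uniform(0, base_delay)  # jitter
            time.sleep(delay)

calls = {"n": 0}

def flaky():
    # Simulated transient error: fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] in (1, 2):
        raise ConnectionError("transient glitch")
    return "ok"

result = retry(flaky)
print(result, calls["n"])  # ok 3
```

&lt;p&gt;The jitter term matters more than it looks: without it, many clients that failed at the same moment would all retry at the same moment too.&lt;/p&gt;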

&lt;p&gt;&lt;em&gt;Benefits&lt;/em&gt;&lt;br&gt;
Improves system resilience by making applications more tolerant of temporary disruptions, reducing the need for manual intervention and preventing messages from being prematurely moved to a DLQ.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dead Letter Queues (DLQs)&lt;/strong&gt;&lt;br&gt;
A special “quarantine” queue where failed messages are sent after they’ve been retried the allowed number of times but still fail&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How it works&lt;/em&gt;&lt;br&gt;
When a message fails to be processed after the configured retries, or if an unrecoverable error occurs the message is moved from the main queue to the DLQ&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prevents problematic messages from blocking the main queue, ensuring the flow of other messages&lt;/li&gt;
&lt;li&gt;provides a centralized location to inspect failed messages and analyze the root cause of failures&lt;/li&gt;
&lt;/ul&gt;
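&lt;p&gt;The retry-then-quarantine flow can be sketched with two plain in-memory queues; the message format, the "poison" failure condition and the retry limit are invented for the example.&lt;/p&gt;

```python
from collections import deque

MAX_RETRIES = 2
main_queue = deque([{"id": 1, "body": "good"}, {"id": 2, "body": "poison"}])
dead_letter_queue = []

def process(msg):
    # Simulated consumer: one message always fails
    if msg["body"] == "poison":
        raise ValueError("cannot process")

while main_queue:
    msg = main_queue.popleft()
    try:
        process(msg)
    except ValueError:
        retries = msg.get("retries", 0)
        if retries == MAX_RETRIES:
            dead_letter_queue.append(msg)  # quarantine after retries exhausted
        else:
            msg["retries"] = retries + 1
            main_queue.append(msg)         # put back on the main queue to retry

print(dead_letter_queue)  # [{'id': 2, 'body': 'poison', 'retries': 2}]
```

&lt;p&gt;The healthy message flows through untouched, while the poison message ends up in the DLQ with its retry count attached for later inspection.&lt;/p&gt;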

&lt;h2&gt;
  
  
  &lt;strong&gt;12. Backfilling &amp;amp; Reprocessing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Techniques used to ensure data completeness and accuracy in data pipelines&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backfilling&lt;/strong&gt;&lt;br&gt;
Running data pipelines for past time periods to fill in missing or incomplete data&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Purpose&lt;/em&gt;&lt;br&gt;
Data consistency: Ensures that all historical data is consistent and complete, avoiding inconsistencies or missing data points&lt;br&gt;
Error correction: Corrects errors or inconsistencies in historical data that may have been introduced by bugs, system failures, or incorrect data&lt;br&gt;
Regulatory compliance: Helps meet regulatory requirements by ensuring that historical data is complete and accurate&lt;br&gt;
Accurate analytics: Provides a complete and accurate historical dataset for analysis and reporting, preventing skewed results&lt;/p&gt;

&lt;p&gt;Goal: Fill missing historical data&lt;br&gt;
Trigger: Data gap or late arrival&lt;br&gt;
Data State: No data exists for that time period&lt;br&gt;
Example: Adding data for Jan 1–14 that was never loaded&lt;/p&gt;
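&lt;p&gt;In practice, backfilling often amounts to re-running the pipeline once per missing partition. The date range matches the example above, but the run_pipeline stub is a placeholder for a real pipeline run.&lt;/p&gt;

```python
from datetime import date, timedelta

processed = []

def run_pipeline(day):
    # Stand-in for one real pipeline run over a single daily partition
    processed.append(day.isoformat())

def backfill(start, end):
    # Re-run the pipeline for every day in the historical gap, inclusive
    for offset in range((end - start).days + 1):
        run_pipeline(start + timedelta(days=offset))

backfill(date(2025, 1, 1), date(2025, 1, 14))
print(len(processed), processed[0], processed[-1])  # 14 2025-01-01 2025-01-14
```

&lt;p&gt;Reprocessing uses the same loop; the difference is that the target partitions already hold (incorrect) data and get overwritten.&lt;/p&gt;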

&lt;p&gt;&lt;strong&gt;Reprocessing&lt;/strong&gt;&lt;br&gt;
Running your data pipelines again for data that has already been processed, usually to correct or update results&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Purpose&lt;/em&gt;&lt;br&gt;
Error correction: Corrects errors or inconsistencies in historical data that may have been introduced by bugs, system failures, or incorrect data.&lt;br&gt;
Code updates: Allows for the application of new code or logic to existing data.&lt;br&gt;
Data quality improvements: Ensures that the data is in the desired state by re-running the processing logic.&lt;/p&gt;

&lt;p&gt;Goal: Correct or update existing processed data&lt;br&gt;
Trigger: Logic change, bug fix, or source update&lt;br&gt;
Data State: Data already exists but is wrong or outdated&lt;br&gt;
Example: Correcting sales data from Jan 1–14 that was miscalculated&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;13. Data Governance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Refers to the policies, procedures, and practices that ensure data is managed, secured, and used effectively throughout its lifecycle. Basically the “rules” for how data is collected, stored, transformed, and accessed—so that your data remains accurate, consistent, secure, and compliant&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Components&lt;/strong&gt;&lt;br&gt;
Data Quality - Implement validation checks in ETL/ELT pipelines&lt;br&gt;
Metadata Management - Store information about data sources, transformations, and lineage&lt;br&gt;
Data Lineage - Track how data moves from source → transformations → destination. Helps in debugging and compliance audits.&lt;br&gt;
Access Control &amp;amp; Security - Use role-based access control (RBAC), encryption, masking, and tokenization to protect sensitive data&lt;/p&gt;
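&lt;p&gt;The first component, data-quality validation in a pipeline, can be sketched as a simple rule check per record; the schema and the rules themselves are invented for illustration.&lt;/p&gt;

```python
def validate(row):
    """Return a list of rule violations for one record (empty list = valid)."""
    errors = []
    if not row.get("customer_id"):
        errors.append("missing customer_id")
    amount = row.get("amount")
    # A negative number differs from its absolute value - flags bad amounts
    if not isinstance(amount, (int, float)) or amount != abs(amount):
        errors.append("amount must be a non-negative number")
    return errors

rows = [
    {"customer_id": "c1", "amount": 19.99},
    {"customer_id": "", "amount": -5},
]

valid = [r for r in rows if not validate(r)]
rejected = [(r, validate(r)) for r in rows if validate(r)]

print(len(valid), len(rejected))  # 1 1
```

&lt;p&gt;Rejected rows would typically be routed to a quarantine table together with their violation list, which doubles as an audit trail.&lt;/p&gt;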

&lt;p&gt;&lt;em&gt;Data Ingestion&lt;/em&gt;&lt;br&gt;
Governance rules decide what data can be ingested, from which sources, and at what frequency&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data Transformation&lt;/em&gt;&lt;br&gt;
Apply quality checks, enrichment rules, and ensure changes are logged for lineage tracking&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data Storage&lt;/em&gt;&lt;br&gt;
Follow partitioning, retention, and archival rules&lt;br&gt;
Apply encryption and role-based access at the storage layer&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data Access&lt;/em&gt;&lt;br&gt;
Use a data catalog so users can discover data and understand its meaning before querying&lt;br&gt;
Enforce least-privilege access&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Monitoring &amp;amp; Auditing&lt;/em&gt;&lt;br&gt;
Automated alerts when governance rules are violated&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ap8157jtc8jgk2f29er.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ap8157jtc8jgk2f29er.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;14. Time Travel and Data Versioning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Techniques that let you query, restore, or compare data from a specific point in the past — almost like a “rewind” button for your datasets&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time Travel&lt;/strong&gt;&lt;br&gt;
Refers to the ability to access historical versions of data at previous points in time.&lt;br&gt;
Time travel enables low-cost comparisons between previous versions of data, helps analyze performance over time, lets organizations audit data changes (often required for compliance purposes), and helps reproduce the results of machine learning models.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How it works&lt;/em&gt;&lt;br&gt;
Systems like Delta Lake store changes as immutable snapshots with transaction logs&lt;br&gt;
Each write (insert/update/delete) produces a new version, but older files are retained until cleanup (compaction or vacuum)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Versioning&lt;/strong&gt;&lt;br&gt;
Data versioning is the storage of different versions of data that were created or changed at specific points in time&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Benefits&lt;/em&gt;&lt;br&gt;
Safe experimentation (branch &amp;amp; merge like Git)&lt;br&gt;
Reproducibility for ML models and analytics&lt;br&gt;
Ability to restore to a stable version after a bad pipeline run&lt;/p&gt;
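&lt;p&gt;The snapshot idea can be sketched with a tiny in-memory version store. Real systems such as Delta Lake implement this with transaction logs and immutable files; the class below is a made-up illustration of the concept, not their API.&lt;/p&gt;

```python
class VersionedTable:
    """Minimal sketch: every write appends an immutable snapshot."""

    def __init__(self):
        self.versions = []  # list of snapshots; index = version number

    def write(self, rows):
        # Copy the rows so older versions can never be mutated
        self.versions.append(list(rows))

    def read(self, version=None):
        # Default reads the latest; pass a version number to "time travel"
        return self.versions[-1 if version is None else version]

t = VersionedTable()
t.write([{"balance": 100}])  # version 0
t.write([{"balance": 250}])  # version 1

print(t.read())   # [{'balance': 250}] - latest
print(t.read(0))  # [{'balance': 100}] - time travel to version 0
```

&lt;p&gt;Restoring after a bad pipeline run is then just writing an old snapshot back as the newest version.&lt;/p&gt;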

&lt;h2&gt;
  
  
  &lt;strong&gt;15. Distributed Processing Concepts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The concept of dividing a large task into smaller parts that are processed concurrently across multiple interconnected computers or nodes, rather than on a single system. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instead of a single powerful computer handling everything, multiple computers (nodes) work together. &lt;/li&gt;
&lt;li&gt;Tasks are broken down and executed simultaneously on different machines, significantly speeding up processing time. &lt;/li&gt;
&lt;li&gt;Nodes communicate and coordinate their actions through a network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach enhances performance, scalability, and fault tolerance, making it ideal for handling large-scale data processing and complex computations&lt;/p&gt;
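&lt;p&gt;The divide-and-combine pattern can be shown on a single machine with a worker pool; real distributed frameworks like Spark apply the same split/process/combine idea across many nodes. The chunk size and the sum task below are arbitrary.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    # Each "node" processes only its own slice of the data
    return sum(chunk)

data = list(range(1, 101))
# Split the job into 4 chunks of 25 numbers each
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(worker, chunks))  # process chunks concurrently

total = sum(partials)  # combine the partial results
print(total)  # 5050
```

&lt;p&gt;If one worker fails, only its chunk needs to be re-run, which is the same property that gives distributed frameworks their fault tolerance.&lt;/p&gt;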

&lt;p&gt;&lt;strong&gt;Importance&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;- Improved performance&lt;/em&gt;&lt;br&gt;
Parallel processing can dramatically reduce the time it takes to complete a task. &lt;br&gt;
&lt;em&gt;- Scalability&lt;/em&gt;&lt;br&gt;
Systems can easily be scaled by adding more nodes as needed to handle increasing workloads. &lt;br&gt;
&lt;em&gt;- Fault tolerance&lt;/em&gt;&lt;br&gt;
If one node fails, the others can often continue working, preventing complete system failure. &lt;br&gt;
&lt;em&gt;- Cost-effectiveness&lt;/em&gt;&lt;br&gt;
Using multiple commodity computers can be more affordable than relying on a single, powerful (and expensive) machine.&lt;br&gt;
&lt;em&gt;- Concurrency&lt;/em&gt;&lt;br&gt;
Nodes execute tasks at the same time.&lt;br&gt;
&lt;em&gt;- Independent failure&lt;/em&gt;&lt;br&gt;
Nodes can fail without bringing down the entire system.&lt;br&gt;
&lt;em&gt;- Resource sharing&lt;/em&gt;&lt;br&gt;
Nodes can share resources like processing power, storage, and data&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Cloud computing&lt;/em&gt;&lt;br&gt;
Services like AWS, Google Cloud utilize distributed processing to offer on-demand computing resources. &lt;br&gt;
&lt;em&gt;Big data processing&lt;/em&gt;&lt;br&gt;
Frameworks like Apache Hadoop and Spark leverage distributed processing for analyzing massive datasets. &lt;br&gt;
&lt;em&gt;Database systems&lt;/em&gt;&lt;br&gt;
Many database systems are designed with distributed processing to handle large amounts of data and high traffic&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek07xg1z13gggcz189te.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek07xg1z13gggcz189te.webp" alt=" " width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
