What is Change Data Capture?
Change Data Capture (CDC) is an approach that detects, captures, and forwards only the modified data from a source system into downstream systems such as data warehouses, dashboards, or streaming applications.
Core principles of CDC include:
- Capture: Detect changes in source data while minimally impacting source system performance.
- Incremental Updates: Transmit only changed data to reduce overhead.
- Real-time or Near Real-time Processing: Maintain fresh data in targets.
- Idempotency: Ensure changes applied multiple times do not corrupt data.
- Log-based Tracking: Leverage database transaction logs for accurate and scalable data capture.
CDC Implementation Methods
A) Log-based CDC
The most robust approach: it reads database transaction logs (e.g., PostgreSQL WAL, MySQL binlogs) directly to stream change events with minimal latency and high scalability.
Advantages: Low system overhead and near‑real‑time performance make it ideal for high-volume environments.
Disadvantages: It requires privileged access to transaction logs and depends on proper log retention settings.
Example: Logical replication in PostgreSQL
-- Enable logical replication (takes effect after a server restart)
ALTER SYSTEM SET wal_level = logical;
-- Create a logical replication slot to capture changes
-- (test_decoding emits human-readable output for inspection from psql;
--  replication clients such as Debezium use the pgoutput plugin instead)
SELECT pg_create_logical_replication_slot('cdc_slot', 'test_decoding');
-- Fetch and consume recent changes from the WAL
SELECT * FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL);
B) Trigger-based CDC
Uses database triggers to capture changes. Capture is immediate because the trigger fires in the same transaction as the write, but the extra writes add overhead to the source database.
Advantages: Straightforward to implement on databases that support triggers and ensures immediate change capture.
Disadvantages: It can add extra load to the database and may complicate schema changes if not managed carefully.
-- Create an audit table to store changes
CREATE TABLE customers_audit (
audit_id SERIAL PRIMARY KEY,
operation_type TEXT,
customer_id INT,
customer_name TEXT,
modified_at TIMESTAMP DEFAULT now()
);
-- Create a function to insert change records
CREATE OR REPLACE FUNCTION capture_customer_changes()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'INSERT' THEN
INSERT INTO customers_audit (operation_type, customer_id, customer_name)
VALUES ('INSERT', NEW.id, NEW.name);
ELSIF TG_OP = 'UPDATE' THEN
INSERT INTO customers_audit (operation_type, customer_id, customer_name)
VALUES ('UPDATE', NEW.id, NEW.name);
ELSIF TG_OP = 'DELETE' THEN
INSERT INTO customers_audit (operation_type, customer_id, customer_name)
VALUES ('DELETE', OLD.id, OLD.name);
END IF;
RETURN NULL; -- No need to modify original table data
END;
$$ LANGUAGE plpgsql;
-- Attach the trigger to the customers table
CREATE TRIGGER customer_changes_trigger
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION capture_customer_changes();
C) Polling-based/Query-based CDC
Periodically queries the source database to check for changes based on a timestamp or version column.
Example: a products table with a version_number column that increments on each update.
Advantages: Simple to implement when log access or triggers are unavailable.
Disadvantages: It can delay the capture of changes and increase load if polling is too frequent.
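A minimal polling sketch, assuming the hypothetical products table above has a monotonically increasing version_number column and that the pipeline stores the highest version it has already processed (the :last_seen_version placeholder below):
-- Assumed layout of the polled table
CREATE TABLE products (
    id             SERIAL PRIMARY KEY,
    name           TEXT,
    price          NUMERIC(10, 2),
    version_number BIGINT NOT NULL DEFAULT 1
);
-- Each polling cycle fetches only rows changed since the last run;
-- :last_seen_version is supplied by the pipeline (e.g., from a checkpoint table)
SELECT id, name, price, version_number
FROM products
WHERE version_number > :last_seen_version
ORDER BY version_number;
Note that plain polling cannot observe deleted rows unless deletes are modeled as soft deletes (e.g., a deleted_at column).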
D) Timestamp-based CDC
Relies on a dedicated column that records the last modified time for each record. By comparing these timestamps, the system identifies records that have changed since the previous check.
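A sketch of the timestamp comparison, reusing the hypothetical products table and assuming a last_modified column maintained by the application or a trigger, plus a checkpoint (:last_run_at) saved from the previous run:
-- Fetch rows modified since the last successful check
SELECT id, name, price, last_modified
FROM products
WHERE last_modified > :last_run_at
ORDER BY last_modified;
-- After a successful run, persist the timestamp captured at the start of the query
-- as the checkpoint for the next comparison
Because this relies on clocks, late-committing transactions can be missed; overlapping windows or pairing timestamps with a version column mitigates that.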
Key CDC Tools and Technologies
Debezium
An open-source, log-based CDC platform that captures row-level changes from various databases, including PostgreSQL, MySQL, SQL Server, and MongoDB, and publishes them as change event streams, typically into Apache Kafka. A sample Kafka Connect registration payload for the Debezium PostgreSQL connector:
{
"name": "inventory-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "localhost",
"database.port": "5432",
"database.user": "debezium",
"database.password": "dbz",
"database.dbname": "inventory",
"plugin.name": "pgoutput",
"slot.name": "debezium_slot",
"publication.name": "dbz_publication",
"database.server.name": "dbserver1",
"include.schema.changes": "true"
}
}
Debezium supports incremental and blocking snapshots, accommodates schema changes, provides fault tolerance with offset tracking, and supports signal tables for ad hoc snapshots.
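The signal tables mentioned above are how ad hoc incremental snapshots are requested. A sketch, assuming a signaling table has been created in the source database and registered via the connector's signal.data.collection property (the table name, column sizes, and target table here are illustrative):
-- Signal table layout Debezium expects: an id, a signal type, and a JSON payload
CREATE TABLE debezium_signal (
    id   VARCHAR(64) PRIMARY KEY,
    type VARCHAR(32) NOT NULL,
    data VARCHAR(2048)
);
-- Ask the connector to run an ad hoc incremental snapshot of one table
INSERT INTO debezium_signal (id, type, data)
VALUES ('adhoc-1', 'execute-snapshot',
        '{"data-collections": ["public.customers"], "type": "incremental"}');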
Apache Kafka & Kafka Connect
Kafka serves as a durable, scalable event streaming platform ideal for transporting CDC events. Kafka Connect offers extensible connectors to ingest CDC events from sources (like Debezium connectors) and deliver them downstream. For illustration, a producer can also publish a CDC-style event to a Kafka topic directly:
from kafka import KafkaProducer
import json
# Initialize the Kafka producer with bootstrap servers and a JSON serializer for values.
producer = KafkaProducer(
bootstrap_servers='localhost:9092',
value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
# Define a CDC event that includes details of the operation.
cdc_event = {
"table": "orders",
"operation": "update",
"data": {"order_id": 123, "status": "shipped"}
}
# Send the CDC event to the 'cdc-topic' and flush to ensure transmission.
producer.send('cdc-topic', cdc_event)
producer.flush()
print("CDC event sent successfully!")
CDC events are published as Kafka topics, enabling downstream consumers to perform real-time analytics, caching, and replication tasks [Confluent CDC Blog].
Two main CDC connector types in Kafka Connect:
- Source connectors: Capture and stream change events into Kafka.
- Sink connectors: Consume CDC events from Kafka and write them to other data stores.
Confluent Cloud CDC Connectors
Confluent Cloud provides managed CDC connectors, including the Oracle CDC Source Connector, which captures changes from Oracle redo logs and publishes them to Kafka topics with built-in fault tolerance and support for security ACLs, offset management, and topic partitioning [Confluent Docs: Oracle CDC].
AWS Database Migration Service (DMS)
Uses log-based CDC to continuously replicate data from on-premises systems to the AWS cloud with minimal downtime.
Talend and Informatica
Talend and Informatica are comprehensive ETL platforms offering built-in CDC functionality to capture and process data changes, reducing manual configuration. They are especially advantageous in complex data transformation scenarios, where an integrated solution can simplify operations.
Database-native CDC solutions
Several relational databases offer native CDC features, reducing the need for external tools:
- PostgreSQL logical replication: Captures changes in WAL and streams them to subscribers (see the sketch after this list).
- SQL Server change data capture (CDC): Uses transaction logs to track changes automatically.
- MySQL binary log (binlog) replication: Logs changes for replication purposes.
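For the PostgreSQL option referenced above, native logical replication is configured entirely in SQL. A minimal sketch, assuming wal_level = logical on the publisher, a customers table that exists on both sides, and connection details adjusted to your environment:
-- On the source (publisher) database
CREATE PUBLICATION cdc_pub FOR TABLE customers;
-- On the target (subscriber) database; this also creates a replication slot on the publisher
CREATE SUBSCRIPTION cdc_sub
    CONNECTION 'host=source-db dbname=inventory user=replicator password=secret'
    PUBLICATION cdc_pub;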
Real-World CDC Implementation Strategies
1. Initial Snapshot
Any CDC pipeline begins with capturing a consistent snapshot of the source database so downstream systems start with an accurate baseline.
Debezium takes the snapshot with ordinary SELECT queries inside a transaction at a suitable isolation level, so the baseline is internally consistent.
The snapshot runs once at startup or ad hoc, capturing the existing state before switching to streaming.
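Conceptually, the snapshot step reads the captured tables inside one consistent transaction and records the log position where streaming should begin. A simplified PostgreSQL sketch (not Debezium's exact internal procedure):
-- All reads in this transaction see the same database state
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Note the current WAL position; streaming later resumes from here
SELECT pg_current_wal_lsn();
-- Read the baseline state of each captured table
SELECT * FROM customers;
COMMIT;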
2. Streaming Changes
Subsequent changes (INSERT, UPDATE, DELETE) are streamed as events sourced directly from database logs.
Kafka provides durable messaging and ordering guarantees.
Event consumers rebuild or maintain up-to-date representations of source data efficiently.
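On the consuming side, a common pattern is to keep a replica table current by applying each change event idempotently. A SQL sketch, assuming a hypothetical customers_replica table keyed on id and an event already parsed into an operation type and column values:
-- INSERT/UPDATE events: upsert on the primary key, so replayed events are harmless
INSERT INTO customers_replica (id, name, updated_at)
VALUES (123, 'Alice', now())
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name,
    updated_at = EXCLUDED.updated_at;
-- DELETE events: deleting an already-deleted row is a no-op
DELETE FROM customers_replica WHERE id = 123;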
3. Denormalization Patterns
CDC typically mirrors highly normalized source schemas, which can be hard to consume for analytics. Denormalization approaches include:
- No denormalization: Simple replication with downstream joins.
- Materialized Views: Create database views that join/enrich data before CDC capture.
- Outbox Pattern: Application writes change events to an immutable outbox table from which CDC is performed (sketched after this list).
- Stream Processing: Use Kafka Streams or ksqlDB to enrich and denormalize events downstream.
- Denormalization at Destination: Perform transformations in data warehouse or lake.
Choosing where denormalization occurs depends on latency, complexity, and architectural preferences.
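For the outbox pattern above, the application commits its business change and the corresponding event in one transaction, and CDC captures only the outbox table. A minimal sketch with illustrative table and column names (gen_random_uuid() is built in from PostgreSQL 13; earlier versions need the pgcrypto extension):
-- Outbox table captured by CDC instead of the normalized business tables
CREATE TABLE outbox (
    id             UUID PRIMARY KEY,
    aggregate_type TEXT NOT NULL,
    aggregate_id   TEXT NOT NULL,
    event_type     TEXT NOT NULL,
    payload        JSONB NOT NULL,
    created_at     TIMESTAMP NOT NULL DEFAULT now()
);
-- Business change and outbox event are committed atomically
BEGIN;
UPDATE orders SET status = 'shipped' WHERE id = 123;
INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
VALUES (gen_random_uuid(), 'order', '123', 'OrderShipped',
        '{"order_id": 123, "status": "shipped"}');
COMMIT;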
CDC Challenges and Solutions
-- Schema Evolution
Source database schema changes (adding or dropping columns, changing data types) can break CDC pipelines.
Solution:
Use a schema registry and versioning; Debezium supports some schema change handling; prefer backward-compatible schema updates; incremental snapshots can handle some changes gracefully.
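A backward-compatible change typically adds a nullable column (or one with a default) and avoids renaming or dropping columns that downstream consumers still read. A minimal sketch using the customers table from earlier (the new column name is illustrative):
-- Backward compatible: existing consumers keep working, new ones can read the column
ALTER TABLE customers ADD COLUMN loyalty_tier TEXT;
-- Breaking without coordination: renames and drops invalidate downstream schemas
-- ALTER TABLE customers RENAME COLUMN name TO full_name;
-- ALTER TABLE customers DROP COLUMN name;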
-- Event Ordering
Changes arriving out of order can cause incorrect data state.
Solution:
Use Kafka’s partitioning and ordering guarantees; Debezium buffers snapshot and streaming events to resolve collisions; design idempotent consumers.
-- Late Data
Data changes can arrive late due to stream interruptions or replication lag.
Solution:
Employ windowing and watermark strategies in stream processing; support replay via Kafka's retained log storage and offset management.
-- Fault Tolerance
Network or system failures can interrupt pipeline operation.
Solution:
Debezium offset tracking for resume; Kafka’s durability; idempotent writes at the sink; signal tables for controlled snapshot restarts.
Sample Kafka Connect JDBC sink connector for writing Debezium CDC data to a data warehouse; the ExtractNewRecordState transform flattens Debezium's change-event envelope so the sink can upsert plain rows
{
"name": "dw-sink-connector",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics": "dbserver1.inventory.customers",
"connection.url": "jdbc:postgresql://datawarehouse:5432/dw",
"connection.user": "dw_user",
"connection.password": "password",
"auto.create": "true",
"insert.mode": "upsert",
"pk.mode": "record_key",
"pk.fields": "id"
}
}
Aside: Kafka source connectors read data from an external system and write it to Kafka topics, while Kafka sink connectors read data from Kafka topics and write it to an external system.
References:
- Debezium Documentation - PostgreSQL Connector
- Confluent Blog on CDC Patterns and Implementation
- Confluent Oracle CDC Source Connector Documentation
- Additional CDC Concepts and Tools Overview
- Apache Kafka CDC Implementation Guide