Mwirigi Eric

Change Data Capture (CDC) in Data Engineering: Concepts, Tools and Real-world Implementation Strategies

What is Change Data Capture?

According to Confluent, Change Data Capture is a process of tracking all changes occurring in data sources (such as databases, data warehouses, etc.), and capturing them in destination systems.

Why Does CDC Matter in Data Engineering?

  • CDC eliminates bulk load updates by enabling incremental loading or real-time streaming of data changes to target destinations.

  • It offers log-based efficiency by capturing changes directly from transaction logs, reducing resource usage on the source system and preserving its performance.

  • It facilitates real-time data movement, enabling zero-downtime database migrations and ensuring that up-to-date data is available for real-time analytics and reporting.

  • CDC enables seamless synchronization of data across systems, which is crucial for time-sensitive decisions in high-velocity data environments.

How Does CDC Work?

Now that we have a grasp of what CDC is and why it matters, let's look at how it actually works.

CDC can be initiated in two ways: push and pull. In the push approach, the source system sends (pushes) data changes to one or more target systems, whereas in the pull approach the target system regularly polls the source system and "pulls" any identified changes.

In either case, CDC works by identifying and recording change events in data sources (such as databases) and transferring them in real time or near real time to a target system such as a data warehouse or a streaming platform like Kafka.

CDC Implementation Methods/Patterns

CDC implementations can be accomplished through the following methods:

  • Log-based CDC: In this method, a CDC application processes the changes recorded in the database transaction logs, and shares the updates with other systems. This method is suitable for real-time data synchronization since it offers low latency and high accuracy.

  • Trigger-based CDC: In this method, triggers fire when specific modifications occur in a database, and the changed data is stored in a change (shadow) table. This method is simple to implement; however, it burdens the source system, since the trigger fires on every transaction against the source table (see the sketch after this list).

  • Time-based CDC: Changes are identified by polling a timestamp column in the source database, and the updates are delivered to the target system(s). Time-based CDC is the easiest to implement; however, it puts additional load on the source system when the timestamp column is polled frequently.
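
To make the trigger-based pattern concrete, here is a minimal PostgreSQL sketch (the table, function, and column names are hypothetical, and it assumes PostgreSQL 11+): a trigger copies every insert, update, and delete on a source table into a shadow table that a downstream process can read from.

    -- Hypothetical shadow table holding captured changes
    CREATE TABLE customers_audit (
        audit_id   BIGSERIAL   PRIMARY KEY,
        operation  CHAR(1)     NOT NULL,                 -- 'I', 'U' or 'D'
        changed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
        row_data   JSONB       NOT NULL                  -- full row image as JSON
    );

    -- Trigger function that writes each changed row into the shadow table
    CREATE OR REPLACE FUNCTION capture_customer_change() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO customers_audit (operation, row_data) VALUES ('D', to_jsonb(OLD));
            RETURN OLD;
        ELSE
            INSERT INTO customers_audit (operation, row_data) VALUES (LEFT(TG_OP, 1), to_jsonb(NEW));
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    -- Fire the function on every row-level change to the source table
    CREATE TRIGGER customers_cdc
    AFTER INSERT OR UPDATE OR DELETE ON customers
    FOR EACH ROW EXECUTE FUNCTION capture_customer_change();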

CDC Tools

A CDC tool automates the tracking and replication of changes across systems by detecting data modifications (inserts, updates, and deletes) and replicating them to a target database or system.

There are several tools that can be adopted for CDC implementation, such as Rivery, Hevo Data, Debezium, and Oracle, each with different capabilities. This article focuses on Debezium and how it can be used for CDC integrations.

1. Understanding Debezium

From the official documentation, Debezium is a set of distributed services to capture changes in databases so that applications can see those changes and respond to them. It records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.

Debezium is built on top of Apache Kafka and provides a set of Kafka Connect-compatible connectors, which record the history of data changes in a Database Management System (DBMS) by detecting the changes as they occur and streaming a record of each change event into a Kafka topic.

2. CDC Architecture with Kafka and Debezium

Debezium is commonly deployed by means of Apache Kafka Connect, which is a framework and runtime for implementing and operating:

  • Source connectors (e.g., Debezium) that send records to Kafka.

  • Sink connectors that propagate records from Kafka topics to other systems.

The CDC Architecture is illustrated in the following figure:

CDC architecture

From the figure above, when a new record is added to a database, the Debezium source connector detects and records the change and pushes it to a Kafka topic. The sink connector then streams this record to a target system such as a data warehouse, where it can be consumed by an application or service.

3. How CDC is Implemented Using Debezium and Apache Kafka

When Debezium captures changes from a database, it follows this workflow:

  • Connection Establishment: Debezium connects to the database and positions itself in the transaction log.

  • Initial Snapshot: For new connectors, Debezium typically performs an initial snapshot of the database to capture its current state before processing incremental changes.

  • Change Capture: As database transactions occur, Debezium reads the transaction log and converts the changes into events.

  • Event Publishing: Change events are published to Kafka topics, by default one topic per table (see the topic-listing sketch after this list).

  • Schema Management: If used with Schema Registry, event schemas are registered and validated.

  • Consumption: Applications or sink connectors consume the change events from Kafka topics.
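
Once the connector is running, the topics it has created (by default, one per captured table) can be listed with Kafka's CLI. This sketch assumes a local Kafka installation and the server name server1 used later in this article:

    bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
    # expected to include a per-table topic such as server1.public.customers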

4. How to Implement PostgreSQL CDC Using the Debezium PostgreSQL Connector

Considered Architecture:

  • PostgreSQL database with logical replication enabled.

  • Apache Kafka for message streaming.

  • Kafka connect with the Debezium PostgreSQL connector.

Step 1: Enable Logical Replication in PostgreSQL:

  • In the PostgreSQL configuration file (postgresql.conf), update the following settings, then restart PostgreSQL to apply the changes.

    wal_level = logical
    max_wal_senders = 1
    max_replication_slots = 1
    
  • Thereafter, grant necessary replication permissions to the database user as follows:

    psql -U postgres -c "ALTER USER myuser WITH REPLICATION;"
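
  • To confirm that logical replication is enabled after the restart, the current WAL level can be checked; it should report logical:

    psql -U postgres -c "SHOW wal_level;"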

Step 2: Install Debezium PostgreSQL Connector:
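
  • One common approach, sketched here with an assumed connector version and plugin directory, is to download the Debezium PostgreSQL connector plug-in archive from Maven Central and extract it into a directory listed in Kafka Connect's plugin.path:

    # Version number and paths below are assumptions -- adjust to your environment
    wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/1.9.7.Final/debezium-connector-postgres-1.9.7.Final-plugin.tar.gz
    tar -xzf debezium-connector-postgres-1.9.7.Final-plugin.tar.gz -C /opt/kafka/plugins/
    # Make sure the Kafka Connect worker config contains plugin.path=/opt/kafka/plugins,
    # then restart the Connect worker so it discovers the new connector.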

Step 3: Configuring the Connector:

  • Create a JSON configuration file for the Debezium PostgreSQL connector, specifying connection details, replication settings, and the databases and tables to monitor, as follows:
    {
      "name": "postgres-connector",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "localhost",
        "database.port": "5432",
        "database.user": "myuser",
        "database.password": "mypassword",
        "database.dbname": "mydb",
        "database.server.name": "server1",
        "table.include.list": "public.customers",
        "plugin.name": "pgoutput"
      }
    }

Step 4: Register/Deploy the Connector to Start Monitoring the Database:

  • Use Kafka Connect’s REST API running on port 8083 to deploy the connector configuration, as follows:
curl -X POST -H "Content-Type: application/json" --data @connector-config.json http://localhost:8083/connectors
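
  • To verify that the connector was registered and is running, query Kafka Connect's REST API using the connector name from the JSON configuration:

    curl -s http://localhost:8083/connectors/postgres-connector/status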

Step 5: Generating and Observing Change Events:

  • Create a table 'customers' in the database:
    psql -U myuser -d mydb -c "CREATE TABLE customers (id SERIAL PRIMARY KEY, name VARCHAR(255), email VARCHAR(255));"
  • Insert a customer record into the table:

    psql -U myuser -d mydb -c "INSERT INTO customers(name,email) VALUES ('Tyler', 'tyler@example.com');"
    
  • View the change event using Kafka's console consumer by running the following command:

    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic server1.public.customers --from-beginning
    

    The change event structure will appear as follows:

{
  "schema": { ... },
  "payload": {
    "before": null,
    "after": { "id": 1, "name": "Tyler", "email": "tyler@example.com" },
    "op": "c",              // 'c' indicates a create (insert)
    "ts_ms": 1620000000001
  }
}

The event includes both "before" and "after" states. For an insert, "before" is null since the row didn't exist previously. The "op" field indicates the operation type, and "ts_ms" provides a timestamp.

Updates and deletions of rows can be executed and their events viewed in a similar manner, as shown below.
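
For example, the following commands produce an update event ("op":"u") and a delete event ("op":"d"). For the delete, "after" is null; depending on the table's REPLICA IDENTITY setting, "before" may contain only the primary key.

    psql -U myuser -d mydb -c "UPDATE customers SET email = 'tyler.new@example.com' WHERE id = 1;"
    psql -U myuser -d mydb -c "DELETE FROM customers WHERE id = 1;"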

Challenges and Solutions

While the adoption of CDC plays a crucial role in an organization's data management process, it comes with several challenges that necessitate careful consideration during implementation. Some of these challenges include:

1. Schema Evolution:

  • Over time, databases undergo schema changes such as column additions, deletions, and renames, which can disrupt CDC workflows if not handled properly.

  • In production, this challenge can be mitigated by adopting a schema registry, which validates schemas for compatibility whenever changes occur (a sketch follows below).
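
As a sketch, assuming a Confluent Schema Registry running at localhost:8081, the connector's "config" block can be extended with Avro converters so that event schemas are registered and compatibility-checked automatically:

    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://localhost:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://localhost:8081"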

2. Late Data:

  • In CDC, late data is data that arrives after its batch or time window has already closed. To handle this, several measures can be adopted:
    • Adding source-based commit timestamps to messages;
    • Defining a cut-off timestamp;
    • Using a holding table for delayed messages (see the sketch after this list).
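
A minimal sketch of the holding-table idea in PostgreSQL (table and column names are hypothetical): late events are parked with their source commit timestamp and merged into the target only once they fall inside an accepted window.

    -- Hypothetical holding table for late-arriving change events
    CREATE TABLE late_events (
        event_id    BIGSERIAL   PRIMARY KEY,
        source_ts   TIMESTAMPTZ NOT NULL,                 -- commit timestamp from the source
        received_ts TIMESTAMPTZ NOT NULL DEFAULT now(),   -- when the pipeline received the event
        payload     JSONB       NOT NULL                  -- the raw change event
    );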

3. Fault Tolerance:

  • In CDC, fault tolerance can be ensured by adopting tools that support retries and error handling for failed events. Additionally, enabling persistence in message brokers like Kafka provides durability, so no events are lost during system outages.

4. Event Ordering:

  • Without proper event-ordering mechanisms, data inconsistencies and inaccuracies are bound to occur in target systems, since events are split across partitions and processed in a distributed manner. To address this, the following measures can be adopted (an idempotent-merge sketch follows the list):
    • Using a partitioned messaging system like Kafka, which guarantees per-partition ordering.
    • Implementing merge logic in the target system that is both sorted and idempotent.
    • Processing data in partition-aware batches in the target system.
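
A minimal sketch of an idempotent merge in a PostgreSQL target (the table name is hypothetical): applying the same change event more than once leaves the row in the same final state.

    INSERT INTO customers_target (id, name, email)
    VALUES (1, 'Tyler', 'tyler@example.com')
    ON CONFLICT (id) DO UPDATE
        SET name  = EXCLUDED.name,
            email = EXCLUDED.email;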
