Implementing Change Data Capture for Reliable Bi-Directional Data Sync

#programming #api #productivity

Before you can sync data between two systems, you need a reliable way to know what changed. Change Data Capture (CDC) is the pattern for detecting changes as they happen rather than scanning entire tables on every sync cycle. Without CDC, the sync layer must do expensive full-table comparisons or risk missing changes entirely.

There are two main approaches: database-level CDC that reads from the write-ahead log, and application-level CDC that uses timestamps and polling.

Approach 1: Database-Level CDC (PostgreSQL WAL)

PostgreSQL logical replication allows consumers to subscribe to a stream of row-level changes directly from the database write-ahead log (WAL). Every insert, update, and delete is captured with before and after values.

Setup:

-- postgresql.conf must have wal_level = logical
CREATE PUBLICATION sync_pub FOR TABLE customers, orders, products;
SELECT pg_create_logical_replication_slot('sync_slot', 'pgoutput');

Python consumer using psycopg2:

import psycopg2
import psycopg2.extras
import json

def create_repl_conn(dsn: str):
    return psycopg2.connect(
        dsn,
        connection_factory=psycopg2.extras.LogicalReplicationConnection
    )

def consume_changes(dsn: str, slot: str, pub: str):
    conn = create_repl_conn(dsn)
    cur = conn.cursor()
    cur.start_replication(
        slot_name=slot, decode=True,
        options={"proto_version": "1", "publication_names": pub}
    )
    def handle(msg):
        payload = json.loads(msg.payload)
        action = payload.get("action")  # I=insert, U=update, D=delete
        if action in ("I", "U"):
            record = {c["name"]: c["value"] for c in payload.get("columns", [])}
            # emit {action: "upsert", table: ..., record: ...}
        elif action == "D":
            identity = {c["name"]: c["value"] for c in payload.get("identity", [])}
            # emit {action: "delete", table: ..., identity: ...}
        msg.cursor.send_feedback(flush_lsn=msg.data_start)
    cur.consume_stream(handle)

WAL-based CDC captures every change including hard deletes. The main operational concern is replication slot lag: if the consumer falls behind, the slot prevents WAL segments from being recycled, potentially filling disk. Monitor pg_replication_slots for lag_bytes and alert at 500 MB.

Alternatively, Debezium provides a managed CDC layer on top of PostgreSQL that handles slot management, schema evolution, and failure recovery.

Approach 2: Application-Level CDC (Timestamp Polling)

Timestamp polling queries for records where updated_at > last_processed_timestamp. This is simpler to set up and works with any database, but cannot detect hard deletes.

Schema requirement:

ALTER TABLE customers ADD COLUMN updated_at TIMESTAMP DEFAULT NOW();
ALTER TABLE customers ADD COLUMN deleted_at TIMESTAMP;  -- soft deletes
CREATE INDEX idx_customers_updated_at ON customers(updated_at);

Poller with persistent high-watermark:

from datetime import datetime, timedelta
from sqlalchemy import create_engine, text
import redis as redis_lib

class TimestampCDCPoller:
    def __init__(self, engine, table: str, redis_client):
        self.engine = engine
        self.table = table
        self.redis = redis_client
        self.wm_key = f"sync:watermark:{table}"

    def get_watermark(self) -> datetime:
        val = self.redis.get(self.wm_key)
        return (datetime.fromisoformat(val.decode()) if val
                else datetime.utcnow() - timedelta(hours=1))

    def set_watermark(self, ts: datetime):
        self.redis.set(self.wm_key, ts.isoformat())

    def poll(self):
        # 10-second overlap buffer for rows arriving slightly out of order
        since = self.get_watermark() - timedelta(seconds=10)
        with self.engine.connect() as conn:
            rows = conn.execute(
                text(f"SELECT * FROM {self.table} "
                     "WHERE updated_at > :since ORDER BY updated_at ASC"),
                {"since": since}
            ).fetchall()
        if rows:
            self.set_watermark(max(r.updated_at for r in rows))
        return [dict(r._mapping) for r in rows]

The 10-second overlap means some rows are processed twice -- which is why idempotent consumers are essential. Store the watermark in Redis (persistent across restarts) rather than in memory. An in-memory watermark is lost on consumer restart, causing either a full re-scan or missed changes.

Filtering Sync-Originated Changes

Both CDC approaches will detect changes made by your sync layer itself, causing sync loops. Tag the sync connection:

with engine.connect() as conn:
    conn.execute(text("SET application_name = 'data_sync'"))
    conn.execute(
        text("UPDATE customers SET name = :name WHERE id = :id"),
        {"name": new_name, "id": record_id}
    )
    conn.commit()

In your WAL consumer, skip events where application_name = 'data_sync'. For timestamp polling, add a synced_at column and skip rows where updated_at - synced_at < INTERVAL '2 seconds'.

Choosing the Right Approach

Use WAL-based CDC when: you need capture of hard deletes, you have high change volume (over 1,000 rows per minute), or you need sub-second sync latency. Use Debezium for production systems with high reliability requirements.

Use timestamp polling when: you cannot modify database replication configuration (common in managed database services like AWS RDS), you need a simpler operational setup, or sync latency tolerance is 30 seconds or more.

For the full guide including conflict resolution and dead-letter queue setup, see How to Build a Bi-Directional Data Sync Between Business Applications. For production implementation support, the 137Foundry data integration services team has run both CDC patterns across multiple client systems.

Monitoring CDC Consumers in Production

Regardless of which CDC approach you use, the operational monitoring for the consumer follows the same pattern:

Consumer lag. The gap between the most recent change event captured and the most recent change event processed. For WAL-based CDC, this is visible as replication slot lag in pg_replication_slots. For timestamp polling, it is the difference between NOW() and the current high-watermark.

Consumer heartbeat. For consumers running as background processes, a heartbeat check verifies that the consumer is still running and processing events. A consumer that has crashed but has not been restarted is invisible without an explicit heartbeat check.

Error rate. The fraction of processed events that result in errors (schema mismatch, downstream API failures, constraint violations). A rising error rate indicates something changed in the upstream system or downstream dependencies.

DLQ depth. For conflicts and permanent failures routed to a dead-letter queue, the DLQ depth should remain low (typically under 10 entries in a healthy system). A growing DLQ means problems are accumulating without resolution.

These four metrics -- lag, heartbeat, error rate, DLQ depth -- are the minimum viable monitoring for any CDC-based sync system. Without them, the sync can fail silently for hours before anyone notices.

For the complete guide on building reliable bi-directional sync including CDC implementation and operational monitoring, see 137Foundry's data integration resources and the data integration services overview.

Why This Matters for Production Reliability

The failure modes of bi-directional sync are almost always discovered in production, not in testing. Test environments rarely replicate the exact conditions that cause clock skew conflicts -- clock synchronization on development machines is generally better than on production infrastructure. Test environments rarely replicate the specific bulk operation patterns that create consistency gaps. And test environments rarely run long enough to reveal the slow drift that accumulates when a field authority map is not updated after a schema change.

This is not an argument against testing -- it is an argument for investing in observability alongside testing. The monitoring patterns in this guide (sync lag, DLQ depth, conflict rate, record count parity) give you visibility into problems that tests will not catch before they affect users.

For teams building a bi-directional sync for the first time, the practical recommendation is: build the operational baseline (DLQ, monitoring, idempotency, loop prevention) before the first production deployment, not after the first production incident. The upfront cost is modest. The incident prevention value is significant.

For technical implementation guidance, see 137Foundry and the data integration resources. For production architecture review and implementation support, the 137Foundry services team works with teams across the integration lifecycle.