DEV Community

quarktimes
quarktimes

Posted on

I Fixed a 5s Database Bottleneck with CDC Dual-Writes

I Fixed a 5s Database Bottleneck with CDC Dual-Writes

We recently hit a critical bottleneck. While running a schema change on a billion-row order table during peak traffic, our P99 latency spiked to 5 seconds, triggering circuit breakers.

The culprit? MySQL's Online DDL. Even with the INPLACE algorithm, it briefly locks the table metadata to update dictionary files, blocking all incoming writes.

Here is how we solved this using a CDC (Change Data Capture) dual-write strategy and atomic table swapping, bringing P99 latency down to 200ms and achieving zero-downtime schema migrations.

The Architecture

The core idea is simple: instead of locking the live table, we create a shadow table and sync data asynchronously.

graph TD
    A[Client Request] --> B[Old Table]
    B --> C[Return Data]
    D[CDC Binlog Sync] --> E[New Table]
    F[Atomic Swap RENAME] -->|Swap Pointer| B
    F -->|Swap Pointer| E
    G[Validator Checksum] -->|Pass| F
    D -.->|Sync Data| E
Enter fullscreen mode Exit fullscreen mode

The Root Cause

We discovered that the issue wasn't just the DDL itself, but how it interacted with MDL (Metadata Locks).

  1. Phenomenon: Business requests couldn't acquire MDL read locks and were blocked, draining the connection pool.
  2. Mechanism: Even INPLACE DDL requires an exclusive lock momentarily at the start and end to update FRM files.
  3. Solution: CDC dual-write moves the lock conflict from "Request vs DDL" to "Async Task vs DDL".

Solution 1: Fixing DDL Safety Checks

Our initial safety logic was flawed. It incorrectly flagged ALGORITHM=INPLACE as unsafe. We corrected this to explicitly allow INPLACE and INSTANT algorithms while banning explicit locks.

# Before (Error Logic)
def is_safe_ddl(sql):
    if 'ALGORITHM=INPLACE' in sql:
        return False  # Logic error: INPLACE is standard for Online DDL
    return True

# After (Fixed Logic)
def is_safe_ddl(sql):
    # Allow INPLACE, but forbid explicit locking syntax
    if 'LOCK=SHARED' in sql or 'LOCK=EXCLUSIVE' in sql:
        return False
    # Allow ALGORITHM=INPLACE or INSTANT
    return True
Enter fullscreen mode Exit fullscreen mode

Solution 2: Atomic Table Swapping

We leveraged the atomic nature of MySQL's RENAME TABLE to switch traffic instantly. This operation only requires a brief exclusive lock, which is negligible compared to the original 5-second block.

-- Atomic swap operation
RENAME TABLE
    orders TO orders_old,
    orders_shadow TO orders;
-- Clean up the old table after swap
DROP TABLE orders_old;
Enter fullscreen mode Exit fullscreen mode

Architecture Decisions

We evaluated a few alternatives before settling on this approach:

Decision Alternative Rationale
CDC Dual-Writes gh-ost gh-ost relies on simulating a replica and adds trigger overhead. Our existing CDC pipeline is more reusable and controllable.
Atomic RENAME App-layer Dual-Write App-layer logic is complex and prone to data inconsistency. DB-level atomicity is guaranteed by the engine and offers better P99 latency.

Production Takeaways

After rolling this out:

  1. Performance: P99 latency dropped from 5s to 200ms.
  2. Efficiency: Full data sync took 45 minutes; consistency validation took only 3 minutes.
  3. Validation: Relying solely on Binlog sync isn't enough. You must use checksum to perform a full differential comparison between the old and new tables to ensure zero data loss.

Understanding locking behavior is critical. By using shadow tables, we decoupled the lock conflict into the async link, keeping the main business completely unaffected.


Originally posted on my tech blog.

Top comments (0)