MaterializedMySQL is dead. Master the 2026 industry-standard CDC pipeline using Debezium and Redpanda on bare metal.
MySQL is an outstanding transactional database, but it severely struggles with heavy analytical queries. Moving these workloads to ClickHouse is the definitive solution. However, if you read older migration guides from popular database vendors, they will almost universally instruct you to use the MaterializedMySQL engine.
Do not execute those commands. The ClickHouse team officially deprecated and removed the MaterializedMySQL engine in version 24.12. It was highly experimental and fundamentally flawed at scale. The true enterprise standard for achieving zero-downtime replication is Change Data Capture, commonly referred to as CDC.
Migration Blueprint
- Phase 1: The MaterializedMySQL Trap
- Phase 2: Network Latency and SaaS Economics
- Phase 3: Advanced Schema Mapping and Snapshot
- Phase 4: The 2026 CDC Streaming Pipeline
- Phase 5: The Missing Ingestion Layer
- Phase 6: Tombstones, The FINAL Trap, and Storage Tax
- Phase 7: Fault Tolerance and Cutover
Phase 1: The MaterializedMySQL Trap
As mentioned, relying on the built-in MaterializedMySQL engine is a trap. It failed to handle complex schema migrations and crashed under heavy replication loads. Modern Data Engineering requires a decoupled, resilient pipeline that reads the MySQL Binary Logs (Binlogs) asynchronously. This is where CDC steps in.
Phase 2: Network Latency and SaaS Economics
Many modern tutorials suggest using fully managed SaaS platforms like Confluent Cloud or ClickPipes to handle your CDC streaming. While these tools are convenient, they introduce a massive financial trap.
When you sync terabytes of operational data across different cloud regions, public providers will charge you astronomical network egress fees. Furthermore, change data capture is highly sensitive to network latency.
If your primary MySQL database is located in North America, hosting your open-source Redpanda and ClickHouse architecture on dedicated bare metal servers ensures sub-millisecond communication. This localized bare metal approach eliminates replication lag during peak transactional hours while completely avoiding per-gigabyte cloud billing shocks.
Phase 3: Advanced Schema Mapping and Snapshot
Before activating the live stream, we must copy the historical data. The biggest mistake engineers make here is assuming basic data types map perfectly. In production environments, you must handle null values, financial decimals, and timezones meticulously.
You must manually create the destination table first, mapping MySQL data types to ClickHouse's advanced types. Once created, use the native mysql() table function to pull the data at maximum speed.
-- Creating a production-ready ClickHouse schema.
-- ReplacingMergeTree(updated_at) keeps the latest version of each row, which
-- the live CDC stream in Phase 5 and the delete handling in Phase 6 rely on.
CREATE TABLE orders_analytics_final (
    order_id UInt64,
    customer_name Nullable(String),          -- MySQL NULLs need an explicit Nullable wrapper
    amount Decimal(10, 2),                   -- financial precision, never Float
    status Enum8('PENDING' = 1, 'PAID' = 2), -- strict enumerations
    created_at DateTime('UTC'),              -- timezone awareness
    is_deleted UInt8 DEFAULT 0,              -- soft-delete flag for CDC tombstones (Phase 6)
    updated_at DateTime DEFAULT now()        -- version column for ReplacingMergeTree
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY order_id;
-- Execute the high-speed initial data copy. Use an explicit column list:
-- SELECT * breaks as soon as the source and destination column sets diverge.
INSERT INTO orders_analytics_final (order_id, customer_name, amount, status, created_at)
SELECT order_id, customer_name, amount, status, created_at
FROM mysql('10.0.0.5:3306', 'prod_db', 'orders', 'user', 'pass');
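Before switching on the live stream, it is worth confirming the snapshot landed completely. A quick sanity check (reusing the same placeholder connection parameters) compares row counts on both sides:

```sql
-- Compare row counts between destination and source.
-- Adjust the table name if your destination table is named differently.
SELECT
    (SELECT count() FROM orders_analytics_final) AS clickhouse_rows,
    (SELECT count()
     FROM mysql('10.0.0.5:3306', 'prod_db', 'orders', 'user', 'pass')) AS mysql_rows;
```

If the two numbers diverge, re-run the copy before starting the connector; reconciling a partial snapshot after the stream is live is far more painful.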
Phase 4: The 2026 CDC Streaming Pipeline
To capture live transactions, we use Debezium to read the MySQL binary logs. Debezium will push these changes to an event streaming message broker.
The Kafka vs. Redpanda Reality: Apache Kafka is the battle-tested enterprise standard with a massive ecosystem, and you can absolutely use it. However, running the JVM can be resource-heavy. On bare metal NVMe servers we often recommend Redpanda, a C++ drop-in alternative: simpler operations, no ZooKeeper dependency, and lower latency. Both work perfectly for this pipeline.
Example Debezium connector configuration. Note that Debezium 2.x renamed the database.history.* options to schema.history.internal.* and introduced the mandatory topic.prefix. The unwrap transform flattens the Debezium change envelope into plain rows and, in rewrite mode, adds a __deleted field; the route transform strips the topic prefix so events land on the prod_db.orders topic:
{
  "name": "mysql-clickhouse-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "10.0.0.5",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "CHANGE_ME",
    "database.server.id": "184054",
    "topic.prefix": "cdc",
    "database.include.list": "prod_db",
    "table.include.list": "prod_db.orders",
    "decimal.handling.mode": "string",
    "schema.history.internal.kafka.bootstrap.servers": "broker_host:9092",
    "schema.history.internal.kafka.topic": "schema-changes.orders",
    "transforms": "unwrap,route",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.delete.handling.mode": "rewrite",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
    "transforms.route.replacement": "$2.$3"
  }
}
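The connector JSON is registered through the Kafka Connect REST API. A minimal deployment sketch, assuming Connect listens on its default port 8083 on a host called connect_host and the JSON above is saved as mysql-connector.json (both names are placeholders):

```shell
# Register the Debezium connector via the Kafka Connect REST API
curl -s -X POST -H "Content-Type: application/json" \
     --data @mysql-connector.json \
     http://connect_host:8083/connectors

# Verify that the connector and its task report RUNNING
curl -s http://connect_host:8083/connectors/mysql-clickhouse-connector/status
```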
Phase 5: The Missing Ingestion Layer
Many tutorials skip a critical step: How does data actually flow from the Kafka topic into the ClickHouse storage table? You need an ingestion layer. ClickHouse provides a native Kafka Engine that reads the message stream, and a Materialized View that routes those messages into your final analytical table.
-- 1. Create the Kafka engine consumer.
-- Field names assume the Debezium ExtractNewRecordState transform with
-- delete.handling.mode = rewrite (see Phase 4), which flattens the change
-- envelope and adds a `__deleted` string field ('true' / 'false').
CREATE TABLE orders_kafka_queue (
    order_id UInt64,
    amount Decimal(10, 2),
    status String,
    `__deleted` String
) ENGINE = Kafka()
SETTINGS kafka_broker_list = 'broker_host:9092',
         kafka_topic_list = 'prod_db.orders',
         kafka_group_name = 'clickhouse_consumer',
         kafka_format = 'JSONEachRow';
-- 2. Route data into the final ReplacingMergeTree table
CREATE MATERIALIZED VIEW orders_mv TO orders_analytics_final AS
SELECT
    order_id,
    amount,
    status,
    `__deleted` = 'true' AS is_deleted,  -- rewrite mode emits 'true'/'false'
    now() AS updated_at
FROM orders_kafka_queue;
Phase 6: Tombstones, The FINAL Trap, and Storage Tax
ClickHouse storage is effectively append-only: updates and deletes are not applied in place. When Debezium captures a deleted row in MySQL, it emits a delete event (followed by an optional tombstone record for log compaction). To process this, we use the ReplacingMergeTree engine with a soft-delete flag. However, this introduces two massive production challenges.
- The Storage Tax: ReplacingMergeTree does not delete superseded rows immediately. It waits for a background merge that runs on its own schedule, causing storage amplification in the meantime. To manage this, schedule an OPTIMIZE TABLE orders_analytics_final FINAL command during off-peak night hours to force a cleanup.
- The FINAL Trap: Many blogs tell you to use the FINAL keyword in your SELECT queries to get the latest row. Do not do this. It causes massive CPU spikes because every query performs the merge on the fly. Instead, use the argMax aggregate function to fetch the latest state efficiently.
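The off-peak cleanup from the Storage Tax note can be driven by a plain cron entry. The host, schedule, and connection details below are assumptions, not part of the original setup:

```shell
# Crontab entry: force a cleanup merge every night at 03:00
0 3 * * * clickhouse-client --host 127.0.0.1 --query "OPTIMIZE TABLE orders_analytics_final FINAL"
```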
-- The Enterprise way to query updated records without the FINAL keyword
SELECT
order_id,
argMax(amount, updated_at) AS latest_amount,
argMax(status, updated_at) AS latest_status
FROM orders_analytics_final
GROUP BY order_id
HAVING argMax(is_deleted, updated_at) = 0;
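The argMax pattern can be hard to visualize. Here is the same logic sketched in plain Python for intuition: the row with the highest version (updated_at) wins per key, and keys whose latest version is a delete are dropped. The event shape (order_id, amount, status, is_deleted, updated_at) is an assumption matching the tables above:

```python
# Simulate ReplacingMergeTree + argMax deduplication in plain Python.
# Each event is a tuple: (order_id, amount, status, is_deleted, updated_at).
def latest_state(events):
    latest = {}
    for order_id, amount, status, is_deleted, updated_at in events:
        # Keep only the event with the highest version per key; this mirrors
        # argMax(col, updated_at) combined with GROUP BY order_id.
        if order_id not in latest or updated_at > latest[order_id][3]:
            latest[order_id] = (amount, status, is_deleted, updated_at)
    # Mirrors HAVING argMax(is_deleted, updated_at) = 0:
    # drop keys whose most recent version is a delete.
    return {
        oid: {"amount": amount, "status": status}
        for oid, (amount, status, is_deleted, _) in latest.items()
        if not is_deleted
    }

events = [
    (1, 10.0, "PENDING", 0, 100),
    (1, 10.0, "PAID",    0, 200),  # update: order 1 is now PAID
    (2, 55.0, "PENDING", 0, 150),
    (2, 55.0, "PENDING", 1, 300),  # delete: order 2's latest version is a tombstone
]
print(latest_state(events))  # order 2 disappears; order 1 keeps its latest status
```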
Phase 7: Fault Tolerance and Cutover
Before routing live traffic, make the pipeline fault-tolerant. Configure error handling on the Kafka Connect side so schema-mismatch records are tolerated and logged rather than crashing the connector (a full dead letter queue topic is only honored by sink connectors). Ensure your ClickHouse tables use ReplicatedReplacingMergeTree with at least two replicas spread across different bare metal nodes.
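These Kafka Connect error-handling properties apply to source connectors like Debezium and are merged into the connector's "config" block; a minimal sketch:

```json
{
  "errors.tolerance": "all",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true"
}
```

With errors.tolerance set to "all", a single malformed record is logged and skipped instead of stopping the whole replication stream.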
Once verified, update your application code to route all heavy aggregations, dashboard requests, and report generation queries to ClickHouse. Your MySQL database is now relieved of analytical strain, allowing it to focus purely on rapid transactional writes.
MySQL Migration FAQ
Why is the MaterializedMySQL engine throwing syntax errors?
The MaterializedMySQL engine was highly experimental, and the ClickHouse development team officially deprecated and removed it in version 24.12. You must now use a Change Data Capture pipeline like Debezium for replication.
How does ClickHouse handle MySQL DELETE operations?
ClickHouse is a columnar analytical database that does not delete rows instantly. When Debezium captures a delete operation, it emits a delete event (and an optional tombstone record). You must route this into a ReplacingMergeTree table and filter on the deleted flag in your queries.
Should I use the FINAL keyword to query updated rows in ClickHouse?
No. Using the FINAL keyword on large tables causes massive CPU overhead because it forces ClickHouse to resolve all intermediate row states in real-time. It is much faster to use aggregate functions like argMax or filter by a deleted column flag.
Why is Redpanda recommended over Apache Kafka for bare metal?
Redpanda is a modern C++ drop-in replacement for Apache Kafka. It eliminates the Java Virtual Machine dependency and has no ZooKeeper requirement (recent Kafka versions also drop ZooKeeper via KRaft mode), making it lighter and simpler to deploy on bare metal NVMe servers.