
nishaant dixit

Posted on • Originally published at sivaro.in

The Kafka ClickHouse Streaming Architecture Guide: Real-Time Data Pipelines That Actually Work

I built my first Kafka-to-ClickHouse pipeline in 2019. It broke three times before breakfast. The problem wasn't the tools. It was my understanding of how they actually fit together.

Most engineers think streaming data from Kafka into ClickHouse is simple. Push a topic, write a query, done. They're wrong. The real challenge isn't moving data. It's moving data reliably without losing events, crashing your cluster, or spending your entire engineering budget on infrastructure nobody understands.

So what is Kafka ClickHouse streaming architecture? It's the practice of ingesting real-time event streams from Apache Kafka into ClickHouse for ultra-fast analytical queries. Kafka handles the message queue. ClickHouse handles columnar storage and sub-second aggregation. Together, they power observability platforms, real-time dashboards, fraud detection systems, and production AI pipelines.

According to ClickHouse's official documentation, the Kafka engine table in ClickHouse directly consumes messages from Kafka topics without requiring middleware. That sounds elegant. In practice, you need to think carefully about exactly-once semantics, schema evolution, and backpressure.

I've spent six years building systems that process over 200,000 events per second through this exact stack. Here's what I learned the hard way.

The Kafka ClickHouse streaming architecture has three layers. Each one can fail independently.

Layer 1: Kafka as the event buffer. Kafka holds your raw events in topics. Retention policies determine how long data survives. For streaming architectures, you typically set retention to a few days. This gives you replay capability without infinite storage costs.

Layer 2: The ingestion path. This is where most people screw up. You have three options: ClickHouse Kafka table engine, Kafka Connect with the ClickHouse sink connector, or a custom consumer writing via the ClickHouse HTTP interface. According to Tinybird's production Kafka connector guide, their connector handles schema mapping, retries, and backpressure automatically. That's valuable because manual handling is error-prone.

Layer 3: ClickHouse as the analytical store. Data lands in MergeTree tables. These are optimized for columnar compression and fast aggregation queries. The key insight: ClickHouse isn't a transactional database. It's an analytical one. You batch inserts for performance.

Here's a concrete example using the Kafka table engine:

CREATE TABLE kafka_queue (
    event_id String,
    user_id UInt64,
    event_type String,
    timestamp DateTime,
    properties String
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'broker1:9092,broker2:9092',
    kafka_topic_list = 'user_events',
    kafka_group_name = 'clickhouse_consumer',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 4;

This creates a virtual table that streams Kafka messages directly into ClickHouse memory. But here's the catch: this table has no persistent storage. You must use a materialized view to move data into a real MergeTree table.

The hard truth about this architecture? Latency isn't your biggest problem. Data integrity is. I've seen pipelines silently drop 2% of events because of schema mismatches. That's catastrophic for billing systems.

Why go through the complexity of Kafka ClickHouse streaming? Three reasons dominate production deployments.

1. Real-time analytical queries at scale. ClickHouse queries against streaming data return in milliseconds, not seconds. According to Altinity's streaming guide, properly configured pipelines can handle 100,000 events per second with query latency under 50 milliseconds. Your dashboards actually update in real-time, not "eventually real-time."

2. Cost-effective storage. ClickHouse compresses data 5-10x compared to row-oriented databases. For high-volume event streams, this means storing 30 days of granular data for the same cost as 3 days in PostgreSQL. Every engineer I've worked with underestimates this until they see the first cloud bill.
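You don't have to take compression ratios on faith — ClickHouse exposes per-part sizes in system.parts, so you can measure the ratio on your own data. A quick check (the table name user_events is illustrative):

```sql
-- Compare on-disk vs. uncompressed size for one table.
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active AND table = 'user_events'
GROUP BY table;
```

Run this after a few days of real traffic — synthetic test data usually compresses far better than production events.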

3. Operational simplicity. Unlike streaming databases that require dedicated clusters, ClickHouse integrates with existing Kafka infrastructure. You don't need to manage another stateful system. The BigData Boutique KFC architecture blueprint (Kafka, Flink, ClickHouse) demonstrates how adding Flink between Kafka and ClickHouse enables complex stream processing without sacrificing simplicity.

In my experience, the teams that benefit most are those running observability platforms or real-time analytics products. One client reduced their time-to-insight from 45 seconds to under 2 seconds by switching from batch ETL to streaming ingestion. Their operations team stopped getting paged for stale dashboards.

But there's a trade-off. Streaming ingestion through Kafka adds operational complexity. You now need to manage Kafka cluster health, consumer lag monitoring, and ClickHouse insert batching. For small datasets (under 10 GB/day), batch loading via S3 might be simpler. For anything larger, streaming pays off.

Let me walk through the actual implementation. Skip the theory. Here's what you need to make this work in production.

Step 1: Design your schemas. ClickHouse is schema-on-write. Your upstream Kafka messages must match the target table schema. Use Nullable types sparingly — they prevent ClickHouse from using certain optimizations. According to Instaclustr's streaming guide, schema drift is the leading cause of pipeline failures. Implement schema registry at the Kafka level.

Step 2: Create your target MergeTree table. This is where your data lives permanently:

CREATE TABLE user_events (
    event_id String,
    user_id UInt64,
    event_type String,
    timestamp DateTime,
    properties String,
    ingested_at DateTime DEFAULT now()
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, event_id)
TTL timestamp + INTERVAL 90 DAY DELETE;

Notice the PARTITION BY and ORDER BY. ClickHouse stores data in sorted order. Choose your ORDER BY columns based on your most frequent query filters. TTL automatically removes old data — critical for compliance.
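To see why the sort key matters, here's the kind of dashboard query this layout serves — the timestamp predicate lets ClickHouse skip whole partitions and sorted ranges (a sketch against the table above):

```sql
-- Events per type in the last hour; filtering on the leading
-- ORDER BY column prunes partitions and data granules.
SELECT event_type, count() AS events
FROM user_events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY event_type
ORDER BY events DESC;
```

If your dashboards filter on something else first (say, event_type), put that column earlier in the ORDER BY instead.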

Step 3: Create the materialized view. This links the Kafka queue to persistent storage:

CREATE MATERIALIZED VIEW user_events_mv TO user_events AS
SELECT
    event_id,
    user_id,
    event_type,
    timestamp,
    properties
FROM kafka_queue;

Every message ingested from Kafka automatically flows into user_events. No manual inserts needed.

Step 4: Handle failures gracefully. Kafka consumer offsets are committed only after ClickHouse confirms the insert. If ClickHouse goes down during ingestion, Kafka will rebalance consumers and replay uncommitted messages. This gives you at-least-once semantics by default. For exactly-once, you need idempotent inserts — either deduplication at the application layer or using ClickHouse's ReplacingMergeTree:

CREATE TABLE user_events_dedup (
    event_id String,
    user_id UInt64,
    event_type String,
    timestamp DateTime,
    _version UInt64
) ENGINE = ReplacingMergeTree(_version)
ORDER BY (event_id);
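One caveat worth making explicit: ReplacingMergeTree deduplicates only when background merges run, so duplicates can linger for a while. If a query must see collapsed rows now, add FINAL — at a performance cost, so use it selectively:

```sql
-- Without FINAL, not-yet-merged duplicates may still appear.
SELECT count() AS unique_events
FROM user_events_dedup FINAL;
```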

The biggest mistake I see: Engineers set kafka_num_consumers too high, causing excessive memory pressure. ClickHouse spawns a thread per consumer. With 50 consumers and high throughput, you'll OOM your ClickHouse node. Start with 2-4 consumers and scale up as you monitor memory.

Over years of breaking things, I've developed rules that prevent the most common failures.

Monitor consumer lag religiously. Kafka exposes consumer group offsets. ClickHouse exposes per-table insert rates. Correlate them. If lag grows consistently, your ClickHouse cluster can't keep up. The ClickHouse real-time event streaming guide recommends alerting when lag exceeds 10,000 messages.

Batch inserts on the ClickHouse side. Never insert individual events. ClickHouse hates single-row inserts. Configure your pipeline to batch at least 10,000 rows per insert or flush every 1 second, whichever comes first. The Tinybird stream-to-ClickHouse tutorial shows how their connector automatically manages batch sizing and backpressure.
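If you're inserting from your own consumer rather than the Kafka engine, recent ClickHouse versions can do the buffering for you via asynchronous inserts. A sketch — verify the setting names against your server version:

```sql
-- The server buffers small inserts and flushes them as one block,
-- so clients don't have to accumulate 10,000-row batches themselves.
INSERT INTO user_events (event_id, user_id, event_type, timestamp, properties)
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('evt-001', 42, 'click', now(), '{}');
```

With wait_for_async_insert = 1 the client blocks until the buffered block is flushed, which keeps at-least-once guarantees intact.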

Use compression. Enable both Kafka message compression and ClickHouse table compression. According to the beginner's guide on Dev.to, enabling LZ4 compression in both systems reduces storage costs by 40-60% with minimal CPU overhead.

Partition your ClickHouse tables by time. Using toYYYYMM(timestamp) as the partition key aligns well with Kafka retention policies. It allows ClickHouse to drop entire partitions during data cleanup instead of performing row-by-row deletes.
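Dropping a month of data then becomes a near-instant metadata operation rather than a mass delete:

```sql
-- Removes the entire January 2024 partition in one step.
ALTER TABLE user_events DROP PARTITION 202401;
```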

Test with production traffic patterns. I've seen pipelines handle 50,000 events/sec in staging but collapse at 10,000/sec in production. The difference? Data shape variance. Real-world events have slightly different schemas, null fields, and outlier timestamps. Generate test data that mirrors actual patterns.

The biggest debate in Kafka ClickHouse streaming: which ingestion method to use?

Option A: ClickHouse Kafka Table Engine (KTE). Built directly into ClickHouse. No external dependencies. You write SQL to define the consumer. The advantage is simplicity — everything lives in one database. The disadvantage is limited configuration. You can't easily restart individual consumers or implement custom error handling. Use this for simple pipelines where speed matters more than control.

Option B: Kafka Connect with ClickHouse Sink. An external connector runs as a Kafka Connect worker. This gives you schema registry integration, dead letter queues for failed messages, and customizable transforms. The Medium guide by Awais Sattar demonstrates how Kafka Connect enables complex transformations like filtering and enrichment before data reaches ClickHouse. Use this for production systems handling sensitive data or requiring complex validation.

Option C: Custom Consumer. You write a microservice that consumes from Kafka and inserts into ClickHouse via the HTTP interface or native protocol. Maximum control. Maximum complexity. You handle offsets, batching, retries, and schema evolution yourself. The Reddit discussion on event-driven ML architectures highlights how custom consumers enable advanced features like exactly-once semantics for ML feature stores. Use this only when you need capabilities neither KTE nor Kafka Connect provide.

The decision matrix is straightforward:

  • Speed of setup → KTE (hours, not days)
  • Reliability and observability → Kafka Connect
  • Maximum control → Custom consumer

In my experience, 80% of use cases are served well by Kafka Connect. The remaining 20% — typically involving custom deserialization or complex deduplication — justify a custom consumer.

Every Kafka ClickHouse pipeline will face these issues. How you handle them determines whether your team trusts the data.

Challenge 1: Schema evolution. Kafka messages change format. New fields appear. Old fields get renamed. ClickHouse rejects rows that don't match. Solution: use Nullable columns for new fields, and set input_format_allow_errors_ratio = 0.1 in ClickHouse settings. This allows up to 10% of rows to have minor errors without failing the entire batch. Log rejected rows to a dead letter queue for analysis.
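Adding a new upstream field is then a cheap, backward-compatible change on the ClickHouse side (the column name here is illustrative):

```sql
-- Existing rows read as NULL; new rows can populate the field.
ALTER TABLE user_events
    ADD COLUMN IF NOT EXISTS device_type Nullable(String);
```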

Challenge 2: Data duplication. Network issues cause Kafka to replay messages. ClickHouse inserts duplicate rows. Solution: use ReplacingMergeTree keyed on event_id, or deduplicate upstream before insert — ClickHouse has no unique constraints to enforce this for you. According to GlassFlow's Kafka-ClickHouse integration, their platform handles deduplication at the stream processing layer before data reaches ClickHouse.

Challenge 3: Backpressure. Kafka produces events faster than ClickHouse can consume them. Consumer lag grows. ClickHouse memory usage spikes. Solution: monitor consumer lag via ClickHouse's system.kafka_consumers table and Kafka's consumer group offsets. Implement a circuit breaker that pauses Kafka consumption when the ClickHouse insert queue exceeds a threshold. Set kafka_max_block_size to limit batch sizes.

Challenge 4: Hot spots. Data with the same timestamp or user ID concentrates in one region of the sort order — or, in a sharded cluster, on one shard — and query and merge performance degrades. Solution: add a hash component like cityHash64(user_id) to your ORDER BY clause, and use the same expression as the sharding key for Distributed tables, so writes spread evenly.
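As a sketch, a variant of the earlier table could spread same-moment writes like this (schema abbreviated):

```sql
-- cityHash64(user_id) breaks up runs of identical timestamps in the
-- sort order; the same expression also works as a Distributed sharding key.
CREATE TABLE user_events_spread (
    event_id String,
    user_id UInt64,
    event_type String,
    timestamp DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (toStartOfHour(timestamp), cityHash64(user_id), event_id);
```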

Q: What's the difference between ClickHouse Kafka engine and Kafka Connect?

A: The Kafka engine runs inside ClickHouse as a SQL-level consumer. Kafka Connect runs as a separate Java process. Kafka Connect offers better error handling, schema registry integration, and is preferred for production systems.

Q: Can ClickHouse consume from multiple Kafka topics simultaneously?

A: Yes. The kafka_topic_list setting accepts comma-separated topics. Messages from every listed topic are written into the same table. For topics with different schemas, create separate Kafka engine tables pointing at different target MergeTree tables.

Q: How do I handle exactly-once semantics?

A: Use ReplacingMergeTree with a unique event ID as the ordering key, or implement deduplication at the application layer. ClickHouse's Kafka engine provides at-least-once by default. Exactly-once requires idempotent inserts.

Q: What's the maximum throughput for Kafka to ClickHouse streaming?

A: With proper configuration, ClickHouse can ingest 100,000-500,000 events per second on a single node. Higher throughput requires sharding across multiple ClickHouse nodes. According to ClickHouse's benchmarks, compression and batch sizing are the biggest throughput limiters.

Q: Can I run transformations on data before it reaches ClickHouse?

A: Yes. Use Kafka Streams, Flink, or Kafka Connect's Single Message Transforms (SMTs) to filter, enrich, or convert data before insertion. According to the KFC architecture blueprint, running Flink between Kafka and ClickHouse enables complex aggregations and stateful processing.

Q: How do I monitor consumer lag?

A: Query ClickHouse's system.kafka_consumers table for per-partition lag. Alternatively, use Kafka's kafka-consumer-groups CLI tool. Set up alerts when lag exceeds 10,000 messages or 10 seconds of data.
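A starting point for that query — the exact columns in system.kafka_consumers vary across ClickHouse versions, so check yours:

```sql
-- Snapshot of each consumer's topic/partition assignments and offsets.
SELECT database, table, consumer_id,
       assignments.topic, assignments.partition_id, assignments.current_offset
FROM system.kafka_consumers;
```

Compare assignments.current_offset against the broker's end offsets (from kafka-consumer-groups) to compute actual lag.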

Q: Should I compress data before sending to Kafka?

A: Yes. Enable compression.type=snappy or lz4 on your Kafka producer. ClickHouse natively handles compressed messages. Compression reduces network bandwidth by 30-60% with negligible CPU overhead.

Kafka ClickHouse streaming architecture solves a specific problem: getting high-volume event data into an analytical database fast enough for real-time queries. The three paths — Kafka table engine, Kafka Connect, or custom consumer — each have trade-offs. Start simple. Scale as you understand your traffic patterns.

Here's what I'd tell my younger self: invest in schema management and monitoring before you deploy. Bad schema design kills pipelines. Good observability saves weekends. Start with Kafka Connect unless you have a compelling reason not to.

For your next step, deploy a test pipeline with the Kafka table engine. Send 10,000 events. Query them. See how fast ClickHouse responds. Then gradually increase volume until you find the breaking point. That's where real learning begins.

Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec. Connect on LinkedIn: linkedin.com/in/nishaant-veer-dixit

  1. ClickHouse Kafka Integration Documentation
  2. Building a Real-Time Data Platform: Kafka, Kafka Connect, ClickHouse
  3. Event-Driven Architecture for Online Machine Learning
  4. Streaming Data from Kafka to ClickHouse with Kafka Table Engine
  5. The KFC Architecture Blueprint: Kafka, Flink, and ClickHouse
  6. How We Built a Production Kafka Connector for ClickHouse
  7. Real-time Event Streaming with ClickHouse and Kafka Connect
  8. Kafka to ClickHouse Made Simple with GlassFlow and Altinity Cloud
  9. How to Stream Kafka Topics to ClickHouse in Real-Time
  10. Kafka ClickHouse: Real-Time Data Pipeline for Beginners

Originally published at https://sivaro.in/articles/the-kafka-clickhouse-streaming-architecture-guide-real.
