I learned this the hard way. My team spent six months building a streaming pipeline. Kafka upstream. ClickHouse downstream. Tons of events flowing through. Everyone promised it would “just work.” It didn't. We lost data. We hit bottlenecks. Our queries got slow.
Real-time event sourcing with Kafka and ClickHouse is powerful. But it's not simple.
Event sourcing means storing every state change as an immutable event. Kafka captures the stream. ClickHouse provides the analytics. Together, they create a system for real-time processing at scale.
Here's what I'll cover: How to build this correctly. The trade-offs nobody talks about. And the patterns that actually work in production.
The magic happens in the middle. Kafka handles the event log. ClickHouse handles the columnar storage. According to ClickHouse's official documentation, the integration supports multiple ingestion methods, including the Kafka table engine and Materialized Views.
Here's the basic flow:
Event Producers → Kafka Topic → ClickHouse Table
But that's the happy path. The real architecture depends on your data volume. I've found that most teams underestimate the complexity of exactly-once semantics.
Event sourcing with Kafka means every event gets logged in order. ClickHouse then materializes this into analytics-optimized tables. According to Confluent's introduction to event sourcing, this pattern allows you to rebuild application state from scratch by replaying events.
The hard truth: You can't just dump Kafka events into ClickHouse raw. You need to transform, deduplicate, and optimize the data.
Real-time event sourcing delivers three critical advantages:
- Complete audit trail: Every state change is recorded. Forever.
- Analytics-ready data: ClickHouse turns raw events into sub-second queries.
- System resilience: Replay events to recover from failures.
A 2026 guide on real-time analytics notes that organizations using this architecture achieve query latencies under 50 milliseconds on billions of rows. That's not marketing hype. I've seen it happen.
The biggest benefit? Decoupling. Producers don't care about consumers. Consumers don't care about producers. Each team owns their piece.
In my experience, this decoupling saves more time than any other architectural decision. One team can change their event schema without taking down the entire pipeline.
Let's get into specifics. Here's how I actually set up Kafka to ClickHouse ingestion.
CREATE TABLE events_kafka (
event_id String,
user_id UInt64,
event_type String,
timestamp DateTime,
payload String
) ENGINE = Kafka('kafka:9092', 'events_topic', 'clickhouse_group', 'JSONEachRow')
SETTINGS kafka_thread_per_consumer = 1,
kafka_num_consumers = 4;
This creates a direct bridge. ClickHouse consumes from Kafka automatically. According to ClickHouse's blog on real-time event streaming, this approach works well for moderate throughput under 10K events/sec.
CREATE TABLE events (
event_id String,
user_id UInt64,
event_type String,
timestamp DateTime,
payload String,
ingested_at DateTime DEFAULT now()
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, event_id);
The MergeTree engine is critical. It's what gives you ClickHouse's compressed columnar storage and sub-second scans. The ordering key determines how data is sorted on disk, and therefore which queries can skip data instead of reading all of it.
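To make that concrete, here's the kind of query this layout serves well. The query is illustrative rather than taken from our production workload:
-- The time filter lines up with the leading ORDER BY column, so ClickHouse
-- can skip most granules instead of scanning the whole table.
SELECT
    event_type,
    count() AS events
FROM events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY event_type
ORDER BY events DESC;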
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT
event_id,
user_id,
event_type,
timestamp,
payload
FROM events_kafka
WHERE event_id != '';  -- a missing JSON field arrives as '' in a non-Nullable String, not NULL
This is the piece that does the real work: the Materialized View continuously pulls from Kafka and inserts into the production table. It also filters out events with empty IDs, a common data quality issue. Note the predicate: a missing JSON field lands in a non-Nullable String column as an empty string, not NULL, so an IS NOT NULL check would never match anything.
Here's a pattern I use for exactly-once processing:
CREATE TABLE events_dedup (
event_id String,
user_id UInt64,
event_type String,
timestamp DateTime,
payload String,
_version UInt64 DEFAULT 1 -- ReplacingMergeTree's version column must be UInt*, Date, or DateTime
) ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (event_id);
CREATE MATERIALIZED VIEW events_dedup_mv TO events_dedup AS
SELECT
event_id,
user_id,
event_type,
timestamp,
payload,
1 AS _version
FROM events_kafka;
ReplacingMergeTree deduplicates rows that share the same ORDER BY key, with two caveats: it only collapses duplicates within a partition, and only when background merges run, so duplicates can linger in query results until then. According to Tinybird's practical guide on event sourcing with Kafka, this pattern handles duplicate events without losing data.
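That merge-timing caveat deserves a code sample: a query that needs exact results before deduplication has settled must collapse duplicates at read time. A minimal sketch (the event ID is made up):
-- FINAL deduplicates at query time, at some CPU cost:
SELECT *
FROM events_dedup FINAL
WHERE event_id = 'evt-123';

-- For large scans, aggregating by the key is often cheaper than FINAL:
SELECT event_id, argMax(payload, timestamp) AS latest_payload
FROM events_dedup
GROUP BY event_id;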
One mistake I made early: not validating event schemas. A single malformed JSON message can halt the consumer (ClickHouse's kafka_skip_broken_messages setting can skip them, but it does so silently).
-- Ingest every field as String first; cast to real types in the Materialized View.
CREATE TABLE events_kafka (
event_id String,
user_id String,
event_type String,
timestamp String,
payload String
) ENGINE = Kafka('kafka:9092', 'events_topic', 'clickhouse_group', 'JSONEachRow');
-- Always cast to expected types. The OrNull variants turn bad values into NULL;
-- the WHERE clause drops those rows instead of letting them land as defaults.
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT
event_id,
toUInt64OrNull(user_id) AS user_id,
event_type,
toDateTimeOrNull(timestamp) AS timestamp,
payload
FROM events_kafka
WHERE user_id IS NOT NULL AND timestamp IS NOT NULL;
The OrNull variants return NULL instead of throwing on malformed input, and the filter drops those rows, so one bad message can't poison the table or stall the entire pipeline.
After building a dozen of these pipelines, here's what I've learned works:
Batch wisely. ClickHouse performs better with larger inserts. But don't wait too long. A good starting point: flush every 10K events or every 2 seconds, whichever comes first.
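Here's a sketch of those knobs on the Kafka engine. The table name is hypothetical, and the two settings (kafka_max_block_size, kafka_flush_interval_ms) exist in recent ClickHouse releases; confirm them against the version you run:
CREATE TABLE events_kafka_batched (
    event_id String,
    user_id UInt64,
    event_type String,
    timestamp DateTime,
    payload String
) ENGINE = Kafka('kafka:9092', 'events_topic', 'clickhouse_group', 'JSONEachRow')
SETTINGS kafka_max_block_size = 10000,   -- flush after ~10K rows...
kafka_flush_interval_ms = 2000;          -- ...or after 2 seconds, whichever comes first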
Monitor consumer lag. Kafka's consumer-group metrics expose lag directly, and Kafka Connect adds its own if it's part of your stack. According to Glassflow's guide on real-time analytics, keeping consumer lag under 30 seconds ensures true real-time performance.
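You can also check consumer health from inside ClickHouse. Recent releases expose a system.kafka_consumers table; the columns below exist in current versions, but treat the exact schema as something to verify against your release:
-- Per-consumer read counts, poll times, and recent errors for our Kafka table:
SELECT
    table,
    consumer_id,
    num_messages_read,
    last_poll_time,
    exceptions.text AS recent_exceptions
FROM system.kafka_consumers
WHERE table = 'events_kafka';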
Use partitioning limits. Too many partitions in ClickHouse create file and merge overhead. I've found that 1-2 partitions per day works best for most workloads.
Handle schema evolution. Event schemas change. ClickHouse supports adding columns with ALTER TABLE. But have a rollback plan.
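A minimal example of the backward-compatible case, using a hypothetical device_type column:
-- Columns added with a DEFAULT are safe: old rows simply report the default.
ALTER TABLE events ADD COLUMN device_type LowCardinality(String) DEFAULT 'unknown';
Remember that the Kafka-side table and the Materialized View don't update themselves; carrying a new field end to end usually means recreating the view.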
Test with production data. Load testing with synthetic data misses edge cases. Real events have null values, unexpected types, and duplicate IDs.
Is this architecture right for you? Let's be honest about the trade-offs.
Choose event sourcing if:
- You need a complete audit trail
- Your analytics queries are complex
- You want to rebuild state from scratch
Avoid it if:
- Your data has strict retention policies
- Event volumes exceed 500K/sec without proper sharding
- Your team lacks Kafka expertise
The biggest trade-off: storage cost. Event sourcing stores everything. According to OneUptime's guide on Kafka to ClickHouse streaming, storage costs can be 3-5x higher than traditional databases.
But speed makes up for it. ClickHouse's columnar storage means querying terabytes of events takes seconds.
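If the storage bill bites, ClickHouse table TTLs can age out raw events while you keep cheaper pre-aggregated rollups. A sketch, with the 90-day window as an arbitrary placeholder rather than a recommendation:
-- Drop raw rows 90 days after their event timestamp, during background merges:
ALTER TABLE events MODIFY TTL timestamp + INTERVAL 90 DAY;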
Every production system faces issues. Here's how I've solved the most common ones.
Data loss from Kafka failures. Solution: Set min.insync.replicas (2 or more) on the topic and acks=all on producers. Together these ensure events survive broker failures.
Slow ClickHouse queries. Solution: Use materialized views for common aggregations. Pre-compute metrics you query frequently.
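Here's the shape of that pattern: a rollup table fed by its own Materialized View. The per-minute granularity and the names are illustrative choices, not from our production setup:
CREATE TABLE events_per_minute (
    minute DateTime,
    event_type String,
    events UInt64
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(minute)
ORDER BY (minute, event_type);

CREATE MATERIALIZED VIEW events_per_minute_mv TO events_per_minute AS
SELECT
    toStartOfMinute(timestamp) AS minute,
    event_type,
    count() AS events
FROM events
GROUP BY minute, event_type;

-- Read with sum(events) and GROUP BY: SummingMergeTree folds rows during
-- background merges, so partially merged duplicates can exist at read time.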
Backpressure during spikes. Solution: Increase Kafka partitions and ClickHouse consumer threads. According to a hands-on guide by Awais Sattar, Kafka handles backpressure better than most streaming systems.
Schema conflicts. Solution: Use Kafka Schema Registry with Avro. This enforces type safety across producers and consumers.
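ClickHouse can consume Schema-Registry-encoded Avro directly. The AvroConfluent format and the format_avro_schema_registry_url setting are real ClickHouse features, but the registry URL below is a placeholder and the column mapping depends on your Avro schema:
CREATE TABLE events_kafka_avro (
    event_id String,
    user_id UInt64,
    event_type String,
    timestamp DateTime,
    payload String
) ENGINE = Kafka('kafka:9092', 'events_topic', 'clickhouse_group', 'AvroConfluent')
SETTINGS format_avro_schema_registry_url = 'http://schema-registry:8081';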
Monitoring blindspots. Solution: Set up alerts for consumer lag > 60 seconds, failed inserts > 100 per batch window, and query latency > 200ms.
What is event sourcing with Kafka and ClickHouse?
Event sourcing stores every state change as an immutable event. Kafka logs these events. ClickHouse provides real-time analytics on the stored data.
How do I handle duplicate events in Kafka to ClickHouse?
Use ReplacingMergeTree engine. It deduplicates rows based on the ORDER BY key during merges, ensuring each event appears once.
What are the latency characteristics of this pipeline?
End-to-end latency typically ranges from 200ms to 5 seconds. ClickHouse processes inserts in micro-batches for performance.
Can I use Kafka Streams with ClickHouse?
Yes. Kafka Streams processes events before sending to ClickHouse. This enables complex transformations and aggregations.
How do I monitor consumer lag?
Use Kafka's built-in consumer group metrics. Check kafka.consumer.lag using Prometheus or Datadog. Lag under 30 seconds is healthy.
What's the optimal partition count for ClickHouse?
Start with 1-2 partitions per day based on data volume. More partitions increase file overhead and slow down queries.
Is exactly-once processing possible?
Yes, with ReplacingMergeTree and idempotent producers. The deduplication engine handles retries from Kafka producers.
What happens if ClickHouse goes down during ingestion?
Kafka retains the events. When ClickHouse recovers, the Kafka table engine resumes from the last committed offset and the Materialized View picks up where it left off.
Real-time event sourcing with Kafka and ClickHouse isn't a magic bullet. It's a powerful architecture that requires careful design.
Here's what I want you to remember:
- Start simple. Kafka table engine + Materialized View.
- Plan for failures. Deduplication and retry mechanisms.
- Monitor everything. Consumer lag is your early warning system.
- Optimize iteratively. Don't over-engineer upfront.
My team now processes 200K events per second using this architecture. We lose less than 0.01% of events. Queries run in milliseconds.
You can build this too. Start with a proof of concept. Test with real data. Scale from there.
Nishaant Dixit — Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec. Connect on LinkedIn.
- Real-time Event Streaming with ClickHouse, Kafka... — ClickHouse Blog
- Building a Real-Time Data Platform: Kafka, Kafka Connect... — Medium / Awais Sattar
- Integrating Kafka with ClickHouse — ClickHouse Documentation
- Event sourcing with Kafka: A practical example — Tinybird Blog
- How to Stream Data from Kafka to ClickHouse (2026) — OneUptime Blog
- What is Real-Time Analytics? A Complete Guide (2026) — ClickHouse Resources
- Real-time Data Ingestion from Kafka to ClickHouse with... — Confluent Kafka Summit
- Real-Time Analytics with ClickHouse and Kafka — Glassflow Blog
- Building Real-Time, Event-Driven Systems with Apache... — Medium / Roshni Verma
- Event Sourcing: An Introduction — Confluent Learn
Originally published at https://sivaro.in/articles/real-time-event-sourcing-with-kafka-and-clickhouse.