
ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Architecture Teardown: A 1PB Analytics Stack with ClickHouse 24.10 and Kafka 3.7

Modern analytics workloads demand petabyte-scale storage, low-latency querying, and high-throughput ingestion. This teardown walks through a production-grade 1PB analytics stack built on ClickHouse 24.10 (the most recent stable release as of Q4 2024; the current LTS line is 24.8) and Kafka 3.7, detailing design choices, tradeoffs, and operational best practices.

Core Stack Overview

The stack is built for a global ad-tech platform processing 12TB of raw event data daily, with 3-year retention for compliance and trend analysis. Key components:

  • Kafka 3.7: Primary ingestion bus, handling 4.2M events/sec peak throughput
  • ClickHouse 24.10: OLAP engine for real-time and historical queries
  • ZooKeeper 3.9: Coordination for Kafka and ClickHouse replication
  • MinIO: S3-compatible object storage for ClickHouse tiered storage
  • Prometheus + Grafana: Full-stack observability

Ingestion Pipeline: Kafka 3.7 at Scale

Kafka 3.7 runs in KRaft (Kafka Raft) mode, production-ready since Kafka 3.3, which eliminates the ZooKeeper dependency for Kafka-specific coordination (we retain ZooKeeper only for ClickHouse here). The cluster runs 48 brokers across 3 AWS us-east-1 AZs, with:

  • 16x r6gd.4xlarge brokers per AZ (16 vCPU, 128GB RAM, local NVMe SSD storage)
  • Topic configuration: 256 partitions per topic, 3x replication factor, 7-day retention for unprocessed data
  • Producers use idempotent writes and 1MB batch sizes to maximize throughput
  • Schema Registry (Confluent 7.5) enforces Avro schemas for all event types, reducing malformed data by 99.8%

A custom Kafka Connect cluster runs 12 Debezium connectors for CDC from transactional databases, and 8 S3 sink connectors for raw data backups. ClickHouse consumes directly from Kafka via its native Kafka table engine, with 64 parallel consumers per ClickHouse shard.
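
As a minimal sketch of that consumption path, a Kafka engine table plus a materialized view that drains it into the long-lived events table defined in the storage section below. All table, topic, broker, and column names here are illustrative, not the production ones:

```sql
-- Hypothetical consumer table; brokers, topic, and schema are illustrative.
CREATE TABLE events_queue
(
    event_time DateTime,
    region     LowCardinality(String),
    user_id    UInt64,
    payload    String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'broker-1:9092,broker-2:9092,broker-3:9092',
    kafka_topic_list  = 'ad_events',
    kafka_group_name  = 'clickhouse_ingest',
    kafka_format      = 'AvroConfluent',
    format_avro_schema_registry_url = 'http://schema-registry:8081',
    kafka_num_consumers = 8;  -- per-table consumers; cluster-wide parallelism comes from the shards

-- The materialized view moves rows from the consumer table into the
-- long-lived MergeTree table (defined in the storage section) as they arrive.
CREATE MATERIALIZED VIEW events_ingest_mv TO events AS
SELECT
    event_time,
    toDate(event_time) AS event_date,
    region,
    user_id,
    payload
FROM events_queue;
```

The Kafka engine commits offsets only after the materialized view's insert succeeds, which is what keeps restarts safe under at-least-once delivery.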

Storage Layer: ClickHouse 24.10 Configuration

ClickHouse 24.10 brings tiered-storage improvements, atomic inserts for replicated tables, and faster Parquet ingestion, all of which matter at 1PB scale. The cluster consists of 96 shards with 2 replicas per shard, deployed on i4i.4xlarge instances (16 vCPU, 128GB RAM, 3.75TB NVMe SSD) across 3 AZs.

Storage strategy:

  • Hot tier: NVMe SSD, 30-day retention for frequently queried data (last 30 days of events). Uses MergeTree engine with zstd compression (3:1 ratio for event data).
  • Cold tier: MinIO object storage, 3-year retention for historical data. ClickHouse 24.10’s tiered storage automatically migrates data older than 30 days to MinIO, with transparent query access across tiers.
  • Replication: ClickHouse's built-in ReplicatedMergeTree replication, which is asynchronous by default, with insert_quorum enabled for critical datasets (the table definition is sketched below).
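
A sketch of how that strategy lands in a table definition, assuming a tiered storage policy declared in the server config with a hot volume on local NVMe and a cold volume backed by an S3 disk pointing at MinIO; the cluster name and ZooKeeper path are placeholders:

```sql
-- Assumes a 'tiered' storage policy in the server config: volume 'hot' on
-- local NVMe, volume 'cold' on an S3 disk pointed at MinIO.
CREATE TABLE events ON CLUSTER analytics
(
    event_time DateTime,
    event_date Date,
    region     LowCardinality(String),
    user_id    UInt64,
    payload    String CODEC(ZSTD(3))        -- zstd compression per the hot-tier notes
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY event_date                     -- daily partitions
ORDER BY (event_date, region, user_id)      -- aligned with the dominant query patterns
TTL event_date + INTERVAL 30 DAY TO VOLUME 'cold',  -- hot -> MinIO after 30 days
    event_date + INTERVAL 3 YEAR DELETE             -- 3-year retention cap
SETTINGS storage_policy = 'tiered';
```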

Key 24.10 features leveraged:

  • Dynamic merging: Adjusts merge threads based on CPU load, reducing merge-induced latency spikes by 40%
  • Projections: Pre-computed aggregations stored alongside the base table, cutting query time by 60% for the top 10 most frequent workloads (see the sketch after this list)
  • Improved Kafka engine: Exactly-once delivery for Kafka consumers (offsets stored in Keeper, still flagged experimental), eliminating duplicate event processing
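
As a sketch of the projection feature applied here, a hypothetical per-region daily rollup on the events table from the storage section:

```sql
-- Illustrative projection: queries aggregating by (event_date, region)
-- are answered from the pre-computed rollup instead of scanning base rows.
ALTER TABLE events ADD PROJECTION daily_region_rollup
(
    SELECT event_date, region, count() AS events
    GROUP BY event_date, region
);

-- Build the projection for parts that existed before it was added;
-- newly inserted parts pick it up automatically.
ALTER TABLE events MATERIALIZE PROJECTION daily_region_rollup;
```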

Query Optimization for Petabyte Scale

With 1PB of data, naive queries can scan terabytes of data. Optimization strategies:

  • Partitioning: All tables are partitioned by event_date (daily partitions), with region added to the partition key for high-cardinality datasets
  • Primary keys: Compound primary keys ordered by (event_date, region, user_id) to align with 90% of query patterns
  • Materialized views: 42 materialized views pre-aggregate hourly, daily, and weekly metrics for dashboards, reducing scan volume by 98%
  • Query caching: ClickHouse’s built-in query cache enabled for read-only workloads, with a 15-minute TTL matching dashboard refresh rates (see the sketch after this list)
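
A sketch of the last two items, with illustrative names; query_cache_ttl is expressed in seconds, so 900 matches the 15-minute dashboard refresh:

```sql
-- Hypothetical hourly rollup feeding the dashboards. SummingMergeTree
-- collapses rows with the same (hour, region) key by summing 'events'.
CREATE MATERIALIZED VIEW events_hourly_mv
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, region)
AS SELECT
    toStartOfHour(event_time) AS hour,
    region,
    count() AS events
FROM events
GROUP BY hour, region;

-- Dashboard query opting into the server-side query cache.
SELECT hour, region, sum(events) AS events
FROM events_hourly_mv
WHERE hour >= now() - INTERVAL 1 DAY
GROUP BY hour, region
SETTINGS use_query_cache = 1, query_cache_ttl = 900;
```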

Performance metrics:

  • P95 query latency: 120ms for hot tier data, 450ms for cold tier data
  • Max scan rate: 120GB/sec across the cluster for full table scans

Scaling to 1PB: Lessons Learned

  • Kafka scaling: Avoid over-partitioning – we capped partitions per topic at 256, as more partitions increased broker memory overhead without throughput gains. KRaft mode reduced failover time from 45 seconds (ZooKeeper mode) to 8 seconds.
  • ClickHouse scaling: Scale out rather than up. Adding shards (96 in total) delivered better cost-performance than moving to 192-vCPU instances, and tiered storage cut hot-storage costs by 72% compared to an all-SSD layout.
  • Data lifecycle: Automated TTL policies delete data older than 3 years, with pre-delete validation to avoid accidental data loss (sketched below). Monthly storage growth is ~85TB compressed, requiring quarterly shard additions.
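
A sketch of what that pre-delete validation can look like in practice; the TTL itself already lives in the table DDL shown earlier:

```sql
-- Pre-delete validation: count exactly what the 3-year TTL will drop.
SELECT count() AS rows_past_retention
FROM events
WHERE event_date < today() - INTERVAL 3 YEAR;

-- If the retention window ever needs tuning, MODIFY TTL rewrites it in place.
ALTER TABLE events MODIFY TTL
    event_date + INTERVAL 30 DAY TO VOLUME 'cold',
    event_date + INTERVAL 3 YEAR DELETE;
```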

Monitoring & Maintenance

Full-stack observability is critical at 1PB scale:

  • Kafka metrics: Track under-replicated partitions, broker request latency, and topic retention via Prometheus JMX exporter
  • ClickHouse metrics: Monitor merge queue depth, disk I/O, query latency, and replication lag via ClickHouse’s built-in Prometheus endpoint
  • Alerting: PagerDuty alerts fire on replication lag > 5 minutes, broker failure, or P95 query latency > 1 second (an ad-hoc lag check is sketched after this list)
  • Maintenance: Rolling updates for ClickHouse and Kafka, with blue-green deployments for schema changes
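
Alongside the Prometheus pipeline, a quick ad-hoc query against ClickHouse's system tables that mirrors the replication-lag alert (absolute_delay is in seconds):

```sql
-- Replicas lagging more than 5 minutes, mirroring the PagerDuty rule.
SELECT database, table, replica_name, absolute_delay
FROM system.replicas
WHERE absolute_delay > 300
ORDER BY absolute_delay DESC;
```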

Conclusion

This 1PB stack demonstrates that ClickHouse 24.10 and Kafka 3.7 can deliver low-latency analytics at petabyte scale with careful tuning. Key takeaways: leverage tiered storage for cost efficiency, use KRaft mode for Kafka resilience, and align partitioning/primary keys with query patterns. The stack supports 14,000 daily queries from 200+ analysts, with 99.99% uptime over the past 12 months.
