Gowtham Potureddi

Posted on Jun 22

Apache Druid vs Pinot vs ClickHouse: Real-Time OLAP Compared

#python #sql #interview #dataengineering

apache druid vs pinot vs clickhouse is the comparison every data platform team eventually argues through — usually at 11pm on a Tuesday after a dashboard latency SLO breached and a streaming pipeline began coughing. The three projects converged on the same workload (sub-second analytics over append-heavy, billion-row, time-stamped event streams) from three different starting points: Druid grew out of Metamarkets ad-tech in 2011, Pinot out of LinkedIn's user-facing dashboards in 2014, and ClickHouse out of Yandex's metrics platform in 2016. They look alike on a benchmark page and behave very differently in production.

This guide is the comparison you wished existed the first time a "just pick the fastest OLAP engine" Slack message landed in your DMs. It walks the real-time OLAP problem, the architectural philosophies of each engine, ingestion personalities, query-pattern fit, and the cheat sheet of recipes that ship in every senior data-engineering interview answer. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps immediately after reading, drill the real-time analytics practice library →, rehearse the streaming problems →, and stack the aggregation muscles with aggregation drills →.

On this page

The real-time OLAP problem
The real-time OLAP landscape — where each engine sits
Architecture trio — under the hood
Ingestion models — streaming vs batch in each engine
Query patterns and engine fit
Cheat sheet — Druid vs Pinot vs ClickHouse recipes
Frequently asked questions
Practice on PipeCode

1. The real-time OLAP problem

Sub-second analytics over billion-row streams is a different problem than warehouse OLAP — and that gap is what spawned three different open-source engines

The one-sentence invariant: real-time OLAP means returning aggregated answers in well under a second over data that is both very large and very fresh — usually time-stamped, append-heavy, and arriving from Kafka or a similar stream. Once you accept that constraint, neither a row-oriented OLTP database (Postgres / MySQL) nor a batch warehouse (Snowflake / BigQuery) can serve the workload — and the gap between them is what Druid, Pinot, and ClickHouse fill.

The three forces that define the workload.

Streaming ingest. Events arrive continuously, often at hundreds of thousands of rows per second per topic. The engine must accept those rows and immediately make them queryable, often within seconds. Batch warehouses load on a schedule (minutes to hours); OLTP databases choke on the write rate.
Time-bucketed reads. Almost every query carries a time predicate: "last 5 minutes," "yesterday by hour," "this week." The engine that wins is the one that partitions and indexes on time first, then on dimensions. Druid bakes this into the storage model; Pinot and ClickHouse expose it via partition keys.
Multi-dimensional filters. Dashboards do not ask "give me everything" — they ask "give me revenue by region by product by minute for the last hour." The engine must filter on multiple high-cardinality columns simultaneously, then aggregate. Bitmap indexes, inverted indexes, and star-tree pre-aggregation are how each engine solves this.

Why row-oriented OLTP cannot serve this.

Storage layout. Row stores write every column together. A SELECT SUM(amount) scans every row's every column from disk. A 10-column table that needs only one column wastes 90% of the I/O on irrelevant data.
Indexes are point-lookup-shaped. B-tree indexes are optimised for "find a single row by key," not "scan a billion rows by a filter."
Write amplification. OLTP databases use redo logs, checkpoints, and MVCC tombstones — features that protect transactional correctness at the cost of throughput. For pure-append event streams, those features are pure overhead.

Why batch warehouses cannot hit sub-second.

Compute is spun up per query. Snowflake suspends warehouses to save money; BigQuery allocates slots from a shared pool. Cold-start cost is in the seconds. Real-time OLAP needs warm, always-on workers.
Storage is decoupled but cold. Warehouse data lives in object storage with high read latency on the first byte. A "scan 100 MB" query takes seconds — fine for ad-hoc analytics, fatal for a dashboard refresh.
Ingest is batch-first. A streaming insert lands in a write buffer and waits for the next micro-batch. Freshness is measured in minutes, not seconds.

Why three different engines, not one.

Druid (2011, Metamarkets / now Apache). Born to power ad-tech dashboards. Optimised for time-series workloads with heavy pre-aggregation ("rollup"). Segments are immutable time-bucketed files written to deep storage (S3 / HDFS).
Pinot (2014, LinkedIn / now Apache). Born to power user-facing analytics ("People you may know" dashboards). Optimised for ultra-low-latency lookups via per-segment indexes (star-tree, bitmap, range). Helix-coordinated.
ClickHouse (2016, Yandex / now ClickHouse Inc). Born to power Yandex.Metrica (Russian web-analytics platform like Google Analytics). Optimised as a general-purpose columnar SQL engine with a flexible MergeTree storage family. Coordinator-less peer servers.

Who actually runs these in production.

Druid powers analytics at Netflix, Confluent, Lyft, Airbnb, Walmart, Twitch, and Cisco — wherever the workload is time-series with heavy filtering and pre-computed rollups.
Pinot powers user-facing analytics at LinkedIn, Uber, Stripe, Slack, and Razorpay — wherever the latency budget is in the tens of milliseconds.
ClickHouse powers analytics at Cloudflare, Uber, eBay, Bloomberg, and Sentry — wherever the team wants real-time OLAP without a separate ingestion service and with the freedom to write arbitrary SQL.

What interviewers listen for.

Do you say "real-time OLAP fills the gap between OLTP and batch warehouses" when asked why these engines exist? — senior signal.
Do you mention pre-aggregation as the unifying trick (rollup, star-tree, AggregatingMergeTree)? — required answer.
Do you recognise that Druid decouples storage, Pinot decouples coordination, and ClickHouse fuses both? — senior signal.
Do you say "I'd pick the engine by query shape, not by benchmark" when asked which is fastest? — senior signal.

Worked example — why a row store cannot serve a "by-minute" dashboard

Detailed explanation. A product team wants a live dashboard of revenue by minute over the last 24 hours, refreshed every 5 seconds, against a 2-billion-row orders table. The obvious answer is "just query Postgres" — and the obvious answer fails immediately because of how row stores read disk.

Question. Estimate the I/O cost of a row-store scan for the last 24 hours of orders versus a columnar engine reading only the amount and minute columns. Assume 100 bytes per row, 2 billion rows, and 5% in the time window.

Input.

Engine	Storage layout	Bytes per row read	Compression
Postgres (row store)	row-oriented heap	100 bytes (whole row)	none / TOAST
Columnar OLAP	column-oriented	~16 bytes (2 cols × 8)	5x typical

Code.

-- The same query, different engines, very different I/O
SELECT
    date_trunc('minute', order_ts) AS minute,
    SUM(amount)                    AS revenue
FROM orders
WHERE order_ts >= NOW() - INTERVAL '24 hours'
GROUP BY 1
ORDER BY 1;

Step-by-step explanation.

Rows in the 24h window: 2_000_000_000 * 0.05 = 100_000_000 rows.
Postgres reads the entire row (100 bytes each) because the heap is row-oriented. I/O = 100_000_000 * 100 = 10_000_000_000 bytes = ~10 GB.
A columnar OLAP engine reads only order_ts and amount — about 16 bytes/row uncompressed. I/O = 100_000_000 * 16 = 1.6 GB.
After columnar compression (typical 5x for numeric / timestamp data), the actual disk read drops to ~320 MB.
The columnar engine is roughly 30x faster on I/O alone, before considering vectorised execution and pre-aggregation.

Output.

Engine	Disk read	Approx latency on warm cache
Postgres row scan	10 GB	tens of seconds
Columnar OLAP scan	320 MB	hundreds of ms
Columnar + pre-agg (Druid rollup)	< 10 MB	tens of ms

Rule of thumb. Whenever the workload is "scan a wide column subset over a time window," start columnar. Whenever the queries themselves are repetitive enough that you can pre-aggregate at ingest, you can buy another 10–100x — that is the bet Druid and Pinot make on every real-time pipeline.

Worked example — why a batch warehouse cannot hit sub-second freshness

Detailed explanation. A growth team wants impressions-per-second analytics with under 5-second end-to-end latency from event to dashboard. The "just use BigQuery + Looker" answer breaks on two fronts: the streaming insert API has minutes of buffering, and the dashboard slot warm-up is seconds.

Question. Trace the path of a single event from Kafka to a BigQuery dashboard tile. Compare with a Druid path. Where does each engine spend its time budget?

Input.

Stage	BigQuery path (typical)	Druid path (typical)
Kafka → engine	dataflow / firehose buffer 1–5 min	Kafka indexing service ~1–5 s
Engine → query-visible	streaming buffer 1–2 min	real-time segment, immediate
Query slot cold-start	1–3 s	warm broker, ~0 s
Scan	0.5–2 s (slot-bound)	50–200 ms (vectorised)

Code.

# Pseudocode — the time budget for "live analytics"
event_to_dashboard_ms = sum([
    kafka_to_engine_ms,       # ingestion
    engine_to_visible_ms,     # query-visible delay
    query_slot_warmup_ms,     # cold start
    scan_ms,                  # actual query
])

Step-by-step explanation.

The "scan" stage looks similar on both engines (BigQuery's vectorised scan is fast; Druid's bitmap-filtered scan is fast). The fight is elsewhere.
BigQuery's streaming insert buffers for ~60 seconds; rows are only available to query after the buffer flushes. That alone breaks any "under 5 seconds" SLA.
BigQuery dashboard tiles pay slot-warmup cost on the first query of a new tile — measured in seconds for a cold slot pool.
Druid's Kafka indexing service writes to a real-time segment that is immediately queryable. The end-to-end latency is dominated by the Kafka commit delay (1–3 seconds), not engine plumbing.
The pre-aggregation in Druid's rollup step also shrinks the scan further — many "by-minute" queries hit only a tiny pre-aggregated row count instead of raw events.

Output.

Path	Realistic end-to-end latency	Sub-second feasible?
Kafka → BigQuery → Looker	60–120 s	no
Kafka → Druid → Superset	1–5 s	yes
Kafka → Pinot → Druid-style dashboard	1–3 s	yes
Kafka → ClickHouse → Grafana	2–8 s	yes (with care)

Rule of thumb. If your latency SLO is measured in seconds, batch warehouses simply do not fit. Move to a real-time OLAP engine — and choose between Druid, Pinot, and ClickHouse on the next axis (query shape, ops appetite, indexing needs).

Worked example — the three forces in a single query

Detailed explanation. "Top 10 advertisers by spend in the last 5 minutes, grouped by country" is the canonical real-time OLAP query. It packs streaming ingest, time-bucketed reads, and multi-dimensional filters into a single SELECT — and the engine choice determines whether it returns in 30 ms or 30 s.

Question. Walk through how Druid, Pinot, and ClickHouse each solve this query. Show what part of the pipeline carries the load.

Input — common assumption. A ad_events stream from Kafka with columns (event_ts, advertiser_id, country, spend), arriving at 200k events/sec.

Code.

-- The canonical real-time OLAP query
SELECT
    advertiser_id,
    country,
    SUM(spend) AS spend_5m
FROM ad_events
WHERE event_ts >= NOW() - INTERVAL '5 minutes'
GROUP BY advertiser_id, country
ORDER BY spend_5m DESC
LIMIT 10;

Step-by-step explanation.

Druid path. Rollup is configured on (minute, advertiser_id, country). The 5-minute window is already pre-aggregated; the query reads ~5 rows per (advertiser × country) pair instead of millions. Latency: tens of ms.
Pinot path. A star-tree index is built on (advertiser_id, country) with spend as the metric. The 5-minute slice is found via the time-partitioned segments; the star-tree returns the aggregated answer in a single lookup. Latency: tens of ms.
ClickHouse path. An AggregatingMergeTree materialised view pre-aggregates (toStartOfMinute(event_ts), advertiser_id, country) -> sum(spend). The query merges 5 minute-buckets and sorts. Latency: tens of ms — provided the MV exists.
The fork. All three are fast when configured correctly. The differences show up in the operational story: Druid needs deep storage and the Kafka indexing service; Pinot needs Helix and a star-tree config; ClickHouse needs the MV definition and the Kafka engine table.

Output.

Engine	Pre-agg structure	Query latency	Ops surface area
Druid	rollup on ingest	tens of ms	high — many services
Pinot	star-tree index	tens of ms	medium — Helix
ClickHouse	AggregatingMergeTree MV	tens of ms	low — peer servers

Rule of thumb. Real-time OLAP queries are won at ingest time by pre-aggregating, not at query time by scanning faster. Pick the engine whose pre-aggregation model fits your workload — and accept the ops surface area as the price of that pre-aggregation.

Real-time OLAP system design interview question

A senior interviewer often opens with: "Design a real-time ads dashboard that shows top advertisers by spend over the last 5 minutes, refreshed every 2 seconds, against a 500k events/sec Kafka stream. Which OLAP engine would you choose and why?" It blends ingest, pre-aggregation, and the three-forces analysis into a single answer.

Solution Using a Druid pipeline with rollup pre-aggregation

[Kafka topic ad_events]
        |
        v
[Druid Kafka indexing service]
        |  (rollup on (minute, advertiser_id, country) with SUM(spend))
        v
[Real-time segment]  --(handoff)-->  [Historical segment on S3]
        |                                     |
        +------> [Broker] <--- [Coordinator] -+
                     |
                     v
              [Superset dashboard, 2s refresh]

A worked-example SQL the dashboard tile would emit (Druid SQL):

SELECT
    advertiser_id,
    country,
    SUM(spend_sum) AS spend_5m
FROM "ad_events_rollup"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '5' MINUTE
GROUP BY advertiser_id, country
ORDER BY spend_5m DESC
LIMIT 10;

Step-by-step trace.

Stage	Input rate	Output rate	Latency contribution
Kafka topic	500k events/s	—	0 (source)
Druid Kafka indexing	500k events/s	~5k pre-agg rows/s	1–3 s ingest delay
Real-time segment	append-only	broker-visible	< 1 s
Broker scan (5 min)	~1.5M pre-agg rows	10 top rows	20–80 ms
Superset render	—	dashboard tile	50 ms

The rollup at ingest is the load-bearing optimisation: 500k events/s collapses to ~5k pre-agg rows/s because most events share (minute, advertiser_id, country) tuples. The broker only scans pre-aggregated rows, not raw events.

Output:

advertiser_id	country	spend_5m
adv_77	US	18,420.50
adv_12	DE	14,003.10
adv_88	IN	12,950.75

Why this works — concept by concept:

Rollup at ingest — Druid's rollup: true config groups input rows by the dimension tuple at ingest and stores only the aggregated metric. A 100x cardinality collapse is normal for ad-tech workloads.
Time-partitioned segments — every segment carries a __time range; the broker prunes segments outside the WHERE window before scanning. The "last 5 minutes" query touches one or two segments.
Real-time + historical handoff — newly ingested data lives in a real-time segment served by the middle-manager. After a configurable period, it is handed off to a historical server backed by S3. Queries see both transparently.
Broker fan-out — the broker scatters the query across all segment-owning servers and merges results. Pre-aggregation keeps the per-server scan tiny.
Cost — ingest cost: O(events/s × indexing parallelism). Query cost: O(pre-agg rows in window × log fan-out). Storage cost: O(rollup ratio × event volume).

Data Engineering
Topic — real-time analytics
Real-time analytics problems

Practice →

2. The real-time OLAP landscape — where each engine sits

Druid, Pinot, and ClickHouse converged on the same workload from three different starting points — and the gaps between them are what dictates the choice

The mental model in one line: Druid is the time-series specialist with mandatory pre-aggregation, Pinot is the low-latency dashboard specialist with index-heavy storage, and ClickHouse is the general-purpose columnar SQL engine with the most flexible schema model. Once you can place each engine on that spectrum, the "which one should we use" debate collapses into a sentence about query shape.

The spectrum from warehouse to search index.

Far left — batch warehouses (Snowflake, BigQuery, Redshift). Wide, deep storage. Minutes of latency. Best for ad-hoc analytics and long history.
Center-left — Druid. Time-bucketed, pre-aggregated, segment-storage. Best for time-series with heavy rollup.
Center — Pinot. Index-heavy per-segment storage with star-tree pre-aggregation. Best for user-facing dashboards under 100 ms.
Center-right — ClickHouse. Flexible MergeTree family with vectorised execution. Best for general analytics SQL with arbitrary filters and joins.
Far right — search indexes (Elasticsearch, OpenSearch). Inverted-index document stores. Best for full-text and fuzzy lookup.

Druid's positioning.

Time-series first. Segments are partitioned by __time ranges. The schema must include a timestamp column, and every query is implicitly time-filtered.
Pre-aggregation is mandatory in practice. "Raw" Druid (rollup off) works, but the value of Druid evaporates without rollup — you'd use ClickHouse instead.
Decoupled storage. Segments are immutable files on S3 or HDFS ("deep storage"). Historical servers cache them locally; coordinators rebalance.
Heavy service surface. Coordinator, overlord, broker, historical, middle-manager, router — six services in a healthy production deploy.

Pinot's positioning.

User-facing first. Pinot was born to serve LinkedIn's member-facing dashboards under 100 ms. Latency budget is the dominant design constraint.
Index-heavy. Bitmap, inverted, range, sorted, text, and the famous star-tree index. The star-tree is a pre-aggregated tree where every node is a partial aggregate — query latency is logarithmic.
Helix-coordinated. Apache Helix (also from LinkedIn) handles state-machine-based cluster coordination; controllers, brokers, and servers are the three roles.
Two server types. "Realtime" servers consume from Kafka and seal segments; "offline" servers serve completed segments. The split is the key insight.

ClickHouse's positioning.

General-purpose SQL. Real ANSI-ish SQL with JOINs, window functions, CTEs, dictionaries, and aggregate functions for every shape (uniqCombined, quantileTDigest, topK).
MergeTree family. A dozen storage engines: MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree. Each tunes the merge step for a specific workload.
Coordinator-less. Peer servers, optionally coordinated by ZooKeeper or the newer ClickHouse Keeper. Sharding and replication are explicit.
Lightest ops surface. A single binary; you can run a one-node ClickHouse on a laptop and query a billion rows.

How each one sits next to Kafka, S3, and BI tools.

Kafka. All three have a first-class Kafka ingest path (Druid's Kafka indexing service, Pinot's realtime servers, ClickHouse's Kafka engine).
S3 / object storage. Druid uses it as deep storage by design. Pinot uses it for completed segments. ClickHouse uses it via the S3 table engine or as a backup target — not as primary storage.
BI tools. Druid + Pinot speak SQL via dedicated layers (Druid SQL, Pinot's SQL). ClickHouse speaks SQL natively. All three plug into Superset, Tableau, Looker, Grafana.

Comparison points — the dimensions that matter.

Query latency. All three target sub-second. Pinot is fastest on indexed point queries; ClickHouse wins on arbitrary SQL; Druid wins on heavily rolled-up time-series.
Data freshness. All three offer real-time ingest from Kafka with seconds-to-tens-of-seconds visibility.
SQL maturity. ClickHouse > Pinot > Druid. ClickHouse has the broadest SQL surface; Druid is the most limited (no full JOIN until recently).
Ops cost. ClickHouse cheapest, Pinot middle, Druid most expensive. The number of services to babysit follows the same ranking.

Worked example — placing a workload on the spectrum

Detailed explanation. A team runs three workloads in the same platform: ad-tech rollups, member-facing dashboards, and internal ad-hoc analytics. Putting all three on a single engine is a false economy — each one lives at a different spot on the spectrum.

Question. For each of the three workloads, name the engine that fits best and justify in one sentence.

Input.

Workload	Latency budget	Query shape	Data freshness
Ad-tech rollups	< 200 ms	time-bucketed top-N	seconds
Member-facing dashboards	< 100 ms	indexed point + group	seconds
Internal ad-hoc analytics	< 5 s	arbitrary SQL + JOINs	minutes ok

Code (engine selection logic).

if workload.shape == "time-bucketed + pre-aggregable":
    pick(DRUID)
elif workload.shape == "user-facing + indexed":
    pick(PINOT)
elif workload.shape == "general SQL + ad-hoc":
    pick(CLICKHOUSE)
else:
    fall_back_to(BATCH_WAREHOUSE)

Step-by-step explanation.

Ad-tech rollups are the canonical Druid workload: pre-aggregate at ingest, time-partition segments, query the rolled-up table.
Member-facing dashboards are the canonical Pinot workload: pre-build star-tree and bitmap indexes, query with strict latency SLO.
Internal ad-hoc analytics are the canonical ClickHouse workload: write arbitrary SQL, use JOINs and window functions, accept a few hundred ms of latency.
Mixing workloads on a single engine is possible but pays the worst-of-all-worlds cost — every engine has a sweet spot, and outside it is a tax.

Output.

Workload	Best fit	One-sentence reason
Ad-tech rollups	Druid	rollup at ingest collapses cardinality 100x
Member-facing dashboards	Pinot	star-tree indexes deliver < 100 ms point queries
Internal ad-hoc analytics	ClickHouse	arbitrary SQL with JOINs and the broadest function library

Rule of thumb. Place every new workload on the spectrum before picking an engine. The fastest engine is the one whose sweet spot overlaps your workload — not the one with the best public benchmark.

Worked example — when a single engine has to serve two workloads

Detailed explanation. Most teams start with one engine and slowly grow into two. The cleanest pattern is "Druid for time-series + ClickHouse for ad-hoc" or "Pinot for user-facing + ClickHouse for internal." Knowing which pairs combine well is a senior-level architecture call.

Question. A platform team has Pinot powering user-facing dashboards but is being asked to support ad-hoc analyst SQL with JOINs. Should they overload Pinot or add ClickHouse?

Input.

Workload	Engine fit
User-facing dashboards	Pinot (excellent)
Ad-hoc analyst SQL with JOINs	Pinot (weak) / ClickHouse (excellent)

Code (pseudocode for the architecture decision).

fit(Pinot, user_facing) = excellent
fit(Pinot, ad_hoc_join) = weak    # Pinot's JOIN story is limited
fit(ClickHouse, ad_hoc_join) = excellent
decision = add(ClickHouse)        # don't overload Pinot

Step-by-step explanation.

Pinot's JOIN support is real but limited (multi-stage query engine is new; complex JOINs hurt latency). Overloading Pinot with JOIN-heavy SQL erodes the latency SLO that justified picking it in the first place.
ClickHouse handles JOINs natively, including JOIN strategies (hash, partial merge, parallel hash) tuned per query.
The combined stack: Kafka feeds both — Pinot consumes for the dashboard topic, ClickHouse consumes for analyst exploration. Both share the same source-of-truth but optimise differently.
The alternative (add JOIN to Pinot) was tried by many teams and reverted; the lesson is "every engine has a sweet spot — respect it."

Output.

Decision	Outcome
Overload Pinot	dashboard latency degrades
Add ClickHouse for ad-hoc	both workloads stay fast
Replace Pinot with ClickHouse	dashboards lose the < 100 ms SLO

Rule of thumb. Stack engines additively rather than swap them. "Pinot + ClickHouse" is a common production combo at Uber, Cloudflare, and others — each engine does what it is best at.

Worked example — when to stay on the batch warehouse

Detailed explanation. Real-time OLAP is not free. Adding Druid, Pinot, or ClickHouse to the stack costs ops, monitoring, and on-call. Many workloads claim "real-time" but can comfortably live on a 5-minute micro-batch in Snowflake or BigQuery — saving the team a year of infrastructure work.

Question. A finance team wants "live" revenue analytics refreshed "every few minutes." Should they introduce a real-time OLAP engine?

Input.

Requirement	Latency SLO	Volume	Ad-hoc SQL?
Live revenue dash	< 5 minutes	10k events/s	yes

Code (decision pseudocode).

if latency_slo_s >= 300 and volume_events_per_s <= 50_000:
    keep(BATCH_WAREHOUSE_micro_batch)
elif latency_slo_s < 5:
    add(REAL_TIME_OLAP)
else:
    evaluate_both_with_cost(...)

Step-by-step explanation.

A 5-minute SLO is a batch SLO, not a real-time one. Snowflake's dynamic tables or BigQuery's MATERIALIZED VIEW refresh on a schedule and serve sub-second SQL on the cached result.
The team would pay real-time OLAP cost (Druid / Pinot / ClickHouse cluster + ops) for a latency they do not need.
The right call is "stay on the warehouse, add a refresh schedule." Save the real-time OLAP introduction for the workload that actually needs it.
Senior interviewers love this answer because it shows the candidate respects cost as a constraint.

Output.

Path	Cost	Latency delivered	Fit
Snowflake dynamic table	low	~5 min	excellent
Druid pipeline	high	< 1 s	overkill

Rule of thumb. Do not adopt a real-time OLAP engine until you have a workload whose SLO is measured in seconds and whose ingest rate exceeds what a 5-minute micro-batch can serve. "Real-time everywhere" is an anti-pattern.

Real-time OLAP positioning interview question

A senior interviewer often frames this as: "Walk me through how you'd decide between Druid, Pinot, ClickHouse, and just micro-batching to Snowflake for a new workload. What's your decision tree?" The probe blends engine positioning with cost discipline.

Solution Using a query-shape decision tree

workload = profile_workload(...)

if workload.latency_slo_s >= 300:
    return BATCH_WAREHOUSE_MICRO_BATCH

if workload.shape == "time-bucketed metrics with rollups":
    return DRUID

if workload.shape == "user-facing dashboard < 100 ms":
    return PINOT

if workload.shape == "general SQL + JOINs + windows":
    return CLICKHOUSE

return PROFILE_DEEPER

Step-by-step trace.

Workload	latency SLO	shape	Decision
Ad-tech rollups	200 ms	time-bucketed + pre-agg	DRUID
LinkedIn-style dashboards	80 ms	user-facing indexed	PINOT
Cloudflare analytics	500 ms	general SQL	CLICKHOUSE
Finance daily revenue	5 min	tabular SQL	BATCH WAREHOUSE

The tree is intentionally lean: latency SLO eliminates the warehouse first, then query shape picks among the three real-time engines. Adding extra branches (e.g. "data volume," "operator headcount") refines the answer but does not invalidate the tree.

Output:

Workload	Recommended engine
Ad-tech rollups	Druid
User-facing dashboards	Pinot
Cloudflare-style analytics	ClickHouse
Daily revenue rollup	Snowflake (micro-batch)

Why this works — concept by concept:

Latency SLO as the first gate — eliminates the warehouse path early; spares the team from paying real-time ops cost for a batch SLO.
Query shape over benchmark — engines look interchangeable on a single benchmark but diverge sharply on real workloads. Pick by shape.
Pre-aggregation hint — workloads that pre-aggregate well lean Druid or Pinot; workloads that do not lean ClickHouse.
SQL maturity hint — JOINs and windows tip the scale toward ClickHouse, away from Druid.
Cost — adopting a real-time OLAP engine has a fixed ops cost (cluster + monitoring + on-call). Justify it with a workload whose SLO requires it.

Data Engineering
Topic — streaming
Streaming pipeline problems

Practice →

3. Architecture trio — under the hood

Druid + Pinot decouple storage from compute; ClickHouse fuses them — and that single choice explains 80% of the operational differences

The mental model in one line: Druid and Pinot were built as distributed systems first (segment files on deep storage + many specialised services), while ClickHouse was built as a fast single-node engine that was later distributed (peer servers + sharding) — and every operational difference flows from that lineage. Once you can sketch the three architectures on a whiteboard, you can predict where each one shines and where each one breaks.

Druid's architecture — many services, one job each.

Coordinator. Manages segment assignment across historicals; rebalances and replicates.
Overlord. Manages ingestion tasks; assigns work to middle-managers.
Broker. Receives client queries; scatters to historicals / middle-managers, gathers, merges.
Historical servers. Hold completed segments locally (cached from deep storage); serve queries.
Middle-managers. Run real-time tasks; hold real-time segments until handoff.
Deep storage. S3 / HDFS / GCS — the durable source of truth for segments.
Metadata store. Postgres / MySQL — stores segment metadata, task state, coordination data.
Router (optional). Front-door load balancer.

Pinot's architecture — three roles + Helix.

Controller. Manages cluster state; coordinates segment assignment; runs the segment-completion protocol for realtime tables.
Broker. Routes queries; merges per-segment results.
Server. Holds segments; two flavours — "offline servers" hold sealed segments; "realtime servers" consume from Kafka and own consuming segments.
Helix (via ZooKeeper). State-machine-based cluster manager from LinkedIn; tracks server state, segment state, and assignment.
Deep storage. S3 / HDFS — for sealed segments. (Realtime segments live in memory + local disk until completion.)

ClickHouse's architecture — peer servers + optional coordinator.

No dedicated coordinator role. Every server is a peer. Queries can be sent to any server (typically a "distributed" table acts as the fan-out point).
Shards × replicas. Data is sharded across nodes; each shard has N replicas for HA.
ZooKeeper or ClickHouse Keeper. Used for replication coordination (Replicated*MergeTree engines). Smaller dependency than Druid/Pinot's stack.
Local storage. Parts (the ClickHouse equivalent of segments) live on local SSDs by default. Can be tiered to S3 via the S3 storage policy.
Distributed table. A virtual table that fans out queries to all shards; results are merged by the receiving server.

Segment / part files — the unit of storage.

Druid segment. Immutable file, time-partitioned, columnar internal layout, contains dictionary-encoded dimensions, bitmap indexes, and aggregated metrics. Typical size: 300 MB–1 GB.
Pinot segment. Immutable file, indexed columns (bitmap / inverted / star-tree / sorted), columnar layout. Realtime segments are mutable in memory then "sealed" into immutable offline segments. Typical size: hundreds of MB.
ClickHouse part. Folder of column files on disk, immutable once written. Background merges combine parts into larger ones. Typical size: tens of MB to a few GB.

Deep-storage models.

Druid. Mandatory deep storage. Historicals are caches. Loss of a historical means re-download from S3, not data loss.
Pinot. Deep storage holds sealed segments; servers cache them locally. Same "cache + durable backing" model as Druid.
ClickHouse. Local storage is primary; replication provides durability. Tiered storage to S3 is opt-in via storage policies.

Index types — the per-column accelerators.

Bitmap indexes. All three support bitmap (or roaring-bitmap) indexes on dimension columns. Filter pushdown becomes a bitmap AND/OR.
Inverted indexes. Pinot's signature; maps term → document list.
Star-tree indexes. Pinot's killer feature — a multi-level pre-aggregation tree where every node holds partial aggregates. Trades disk for query latency.
Skip indexes. ClickHouse's per-granule index (minmax, set, bloom_filter, tokenbf_v1). Skips entire granules during a scan.
Sorted indexes. Pinot's "sorted index" — actual sort order on a column, enabling range pushdown.

Worked example — sketching the Druid service map

Detailed explanation. Drawing the six-service Druid diagram from memory is a senior-interview reflex. The trick is grouping: "control plane" (coordinator, overlord), "data plane" (broker, historicals, middle-managers), "storage plane" (deep storage, metadata DB).

Question. Draw the Druid architecture from memory. Group services into control / data / storage planes and explain the role of each.

Input. A blank whiteboard and an interviewer asking about Druid.

Code (text sketch).

                   [Router (optional)]
                          |
                       [Broker] <----+
                       /       \      \
                      v         v      v
              [Historical]   [Historical]   [Middle-Manager]
                    \         /                |
                     v       v                 v
                 [Coordinator] <---> [Overlord]
                          |                |
                          v                v
                  [Metadata DB]      [Deep Storage S3]

Step-by-step explanation.

Router. Optional load balancer in front of brokers. Skipped in small deploys.
Broker. Receives every client query, parses, plans, and scatters to historicals + middle-managers that own the relevant segments.
Historicals. Hold sealed segments cached from deep storage; respond to broker fan-out.
Middle-managers. Run ingest tasks; hold real-time segments until handoff to historicals.
Coordinator. Decides which historical owns which segment; rebalances and replicates.
Overlord. Decides which middle-manager runs which ingest task.
Metadata DB (Postgres / MySQL). Source of truth for segment metadata, task state, and rule configs.
Deep storage (S3 / HDFS). Source of truth for segment files; historicals cache locally.

Output.

Plane	Services
Control	Coordinator, Overlord
Data	Broker, Historicals, Middle-Managers
Storage	Deep Storage, Metadata DB
Edge	Router (optional)

Rule of thumb. When asked about Druid in an interview, lead with the three-plane grouping. Six services in a flat list looks intimidating; three planes feels organised.

Worked example — sketching the Pinot service map

Detailed explanation. Pinot's service map is simpler than Druid's — three roles plus Helix. The interesting part is the realtime / offline server split, which is unique to Pinot and the reason it wins low-latency dashboard workloads.

Question. Draw the Pinot architecture. Highlight the realtime / offline server split and explain Helix's role.

Input. A blank whiteboard.

Code (text sketch).

                  [Controller(s)] <---+
                          |            \
                          v             v
                     [Helix / ZK] <---> (cluster state)
                          ^
                          |
                       [Broker]
                       /      \
                      v        v
            [Realtime Server]  [Offline Server]
                  |                  |
                  v                  v
           [Kafka topic]      [Deep storage S3]
                                     ^
                                     |
                              [Sealed segments]

Step-by-step explanation.

Controller(s). Manage segment assignment, run the segment-completion protocol, expose the admin API.
Brokers. Route queries; consult Helix for routing tables.
Realtime servers. Consume from Kafka into "consuming segments." When a segment reaches a size or row-count threshold, it is sealed and offloaded.
Offline servers. Hold sealed segments; serve queries against them.
Helix (built on ZooKeeper). Cluster state machine — tracks every server's state (online/offline), every segment's state (consuming/sealed/error), and triggers transitions.
Deep storage. Holds sealed segments; offline servers download them locally.

Output.

Service	Job
Controller	cluster admin + segment-completion
Broker	query routing
Realtime server	Kafka consumer + consuming segment
Offline server	sealed segment serving
Helix	state machine over the above

Rule of thumb. When asked about Pinot, lead with "realtime + offline servers, coordinated by Helix" — that captures the two unique design choices in one sentence.

Worked example — sketching the ClickHouse cluster topology

Detailed explanation. ClickHouse looks deceptively simple because there is no coordinator role. The real complexity is in the table-engine choice (MergeTree, ReplicatedMergeTree, Distributed) and the storage policy. A senior whiteboard answer names the engines.

Question. Draw a 2-shard, 2-replica ClickHouse cluster. Name the table engines on each node.

Input. A 4-node cluster spec.

Code (text sketch + schema).

            [Distributed table on any node]
                          |
            +-------------+-------------+
            v                           v
       [Shard 1]                   [Shard 2]
       /        \                  /        \
      v          v                v          v
[Node 1 r1]  [Node 1 r2]     [Node 2 r1]  [Node 2 r2]
   |             |               |             |
[ReplicatedMergeTree]      [ReplicatedMergeTree]
              \                 /
               v               v
             [Keeper / ZooKeeper cluster]

-- Local replicated table on each node
CREATE TABLE events_local ON CLUSTER my_cluster
(
    event_ts DateTime,
    user_id  UInt64,
    region   LowCardinality(String),
    amount   Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(event_ts)
ORDER BY (region, event_ts, user_id);

-- Distributed virtual table on every node (the entry point)
CREATE TABLE events ON CLUSTER my_cluster
AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());

Step-by-step explanation.

ReplicatedMergeTree. Each node has a local table replicated to its peer within the shard via Keeper. Inserts on one replica auto-propagate.
Distributed table. A virtual table on every node that fans out reads to all shards and combines them. Inserts to the Distributed table can be sharded by rand() or by a key.
Keeper / ZooKeeper. Coordinates replication metadata, distributed DDL (ON CLUSTER), and replica election. No data flows through it.
No coordinator role. The system is symmetric — any node can serve any query.

Output.

Concept	Detail
Shard	a horizontal data slice (2 shards × 2 replicas = 4 nodes)
Replica	a copy of a shard for HA
ReplicatedMergeTree	per-node table with replication
Distributed	virtual fan-out table
Keeper / ZooKeeper	metadata-only coordinator

Rule of thumb. When asked about ClickHouse, name the table engines first (ReplicatedMergeTree + Distributed) and the keeper second. That distinguishes you from candidates who confuse it with a "just a fast Postgres."

Real-time OLAP architecture interview question

A senior interviewer often opens with: "Compare the architectures of Druid, Pinot, and ClickHouse on three axes — storage layout, coordination model, and ops surface area. Which architectural choice has the biggest impact on operability?" It blends the three sketches into a single judgement call.

Solution Using a three-axis architecture matrix

                STORAGE        COORDINATION         OPS SURFACE
Druid       deep storage +    coordinator +        6 services + metadata DB
            local cache       overlord + Helix-     + deep storage
                              like rules
Pinot       deep storage +    Helix + controller   3 roles + Helix + deep storage
            local cache
ClickHouse  local + opt S3    Keeper / ZK          1 binary + Keeper
            tiering           (metadata only)

Step-by-step trace.

Engine	Coordination services	Storage tiers	Ops surface
Druid	2 (Coord + Overlord) + Router	deep + local	highest
Pinot	1 (Controller) + Helix	deep + local	medium
ClickHouse	0 dedicated + Keeper	local + opt S3	lowest

The matrix shows that coordination service count tracks with ops surface area almost perfectly. Druid's six services and Pinot's Helix dependency are the source of their operational reputation; ClickHouse's "just a binary" model is why solo data engineers reach for it first.

Output:

Engine	Stated ops complexity (1-5)	Big-O of "things to monitor"
Druid	5	O(7 services × replicas)
Pinot	3	O(3 roles × replicas + Helix)
ClickHouse	2	O(nodes + Keeper)

Why this works — concept by concept:

Decoupled storage vs fused storage — Druid + Pinot pay a deep-storage tax for the right to lose any node without losing data; ClickHouse pays a replication tax to keep data local for I/O latency.
Coordination service count — every coordination service is a thing on-call has to watch. Fewer services = lower MTTR.
Helix as a dependency — Pinot inherits LinkedIn's Helix project; another moving part to learn, but a battle-tested one.
Keeper as the minimal coordinator — ClickHouse Keeper is a drop-in replacement for ZooKeeper, smaller binary, simpler ops. The newer the cluster, the more likely you should use Keeper.
Cost — ops cost scales with service count, not data volume. A 100 TB ClickHouse cluster can be smaller (on-call hours) than a 10 TB Druid cluster.

Data Engineering
Topic — aggregation
OLAP aggregation problems

Practice →

4. Ingestion models — streaming vs batch in each engine

All three engines accept Kafka input — but Druid wraps it in a managed indexing service, Pinot in a state-machine-coordinated server, and ClickHouse in a SQL-shaped Kafka engine

The mental model in one line: Druid's Kafka indexing service writes real-time segments that hand off to historicals; Pinot's realtime servers consume into consuming segments that are sealed via the segment-completion protocol; ClickHouse's Kafka engine table + materialised view drains messages into a MergeTree target — and the difference shows up in how you reason about ingest lag and replay. Once you can sketch the three lanes, ingestion troubleshooting becomes a deterministic exercise.

Druid's streaming ingest.

Kafka indexing service. A first-class subsystem (KafkaIndexTask) that consumes a topic, builds real-time segments, and reports progress to the overlord.
Real-time segment. Lives on a middle-manager; immediately queryable as it grows. Queries against it are served alongside historical segments via the broker.
Handoff. When a segment "completes" (size / row-count / time threshold), it is published to deep storage and a historical server downloads it.
At-least-once semantics by default. Supervised via overlord checkpoints; exactly-once requires the use of partition-locked offsets and idempotent ingest configs.
Rollup configured at ingest time. The supervisor spec declares rollup: true and the metric aggregators (e.g. longSum, doubleSum, hyperUnique).

Pinot's streaming ingest.

Realtime servers. Each realtime server owns a consuming segment per partition. Reads directly from Kafka via the low-level consumer API.
Consuming segment → completing → sealed. When a threshold is hit, the segment "completes": one realtime server is elected to seal the segment, others discard their in-memory copy, and the controller assigns the sealed segment to an offline server.
Segment-completion protocol. Helix-coordinated; ensures exactly one sealed copy per logical segment across the cluster.
Star-tree pre-aggregation. Configured per table; the star-tree is built when the segment is sealed.

ClickHouse's streaming ingest.

Kafka engine table. A virtual table that consumes from a Kafka topic. Reading from it consumes messages — you do not read from it directly.
Materialised view trigger. An MV with TO target_table runs on every batch of messages and INSERTs into a real MergeTree table.
Target = MergeTree family. Typically MergeTree for raw events or AggregatingMergeTree for pre-aggregated data.
Exactly-once is not free. The Kafka engine commits offsets after a successful INSERT, but failure modes (crash between INSERT and offset commit) can lead to duplicates. Dedup is via ReplicatingMergeTree keys or external means.
Pre-aggregation via the merge step. SummingMergeTree / AggregatingMergeTree collapse rows that share the ORDER BY key during background merges.

Batch ingest lanes.

Druid. Native batch ingest (Hadoop / index_parallel) reads from S3/HDFS and writes segments directly. Used for backfills.
Pinot. Offline tables ingested via segment generation jobs (often Spark) and uploaded via the controller's segment upload API. Used for backfills and historical loads.
ClickHouse. Plain INSERT INTO ... SELECT FROM s3(...), the s3 table function, or clickhouse-client --query "INSERT ...". Most natural batch experience of the three.

Pre-aggregation models.

Druid rollup. Group-by at ingest on (timestamp_truncated, dimensions...) with aggregator functions (longSum, hyperUnique). Storage drops linearly with cardinality collapse.
Pinot star-tree. A pre-computed multi-level tree where leaf nodes are full rows and intermediate nodes are partial aggregates. Storage tax for query speed.
ClickHouse AggregatingMergeTree. Stores partial-state aggregates (e.g. quantileTDigestState, uniqCombinedState) that are merged by ORDER BY key on every background merge. Finalised at query time with -Merge combinators.

Schema evolution.

Druid. Dimensions can be added at ingest spec change; old segments do not have the new dimension and return NULL. Removing a dimension is harder (segments hold it forever).
Pinot. Schema versioning is per-table; new columns appear NULL in old segments. Renames are not supported directly.
ClickHouse. ALTER TABLE ... ADD COLUMN works in seconds (metadata change); the column is NULL or default in old parts until those parts are rewritten via mutations.

Operational cost ranking.

Druid. Highest. Six services, deep storage, metadata DB, Kafka indexing supervision.
Pinot. Medium. Three roles, Helix, controller config sprawl.
ClickHouse. Lowest. Kafka engine table + MV + target table is a three-line schema.

Worked example — Druid Kafka indexing supervisor spec

Detailed explanation. Druid's ingestion is configured by a supervisor spec — a JSON document submitted to the overlord that describes the topic, schema, partitions, and rollup. Reading one fluently is a Druid-specific interview test.

Question. Write a minimal Druid supervisor spec for the ad_events topic with rollup on (minute, advertiser_id, country) summing spend.

Input. A Kafka topic ad_events with JSON messages {ts, advertiser_id, country, spend}.

Code.

{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "ad_events_rollup",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["advertiser_id", "country"]
      },
      "metricsSpec": [
        { "name": "count",     "type": "count" },
        { "name": "spend_sum", "type": "doubleSum", "fieldName": "spend" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "MINUTE",
        "rollup": true
      }
    },
    "ioConfig": {
      "topic": "ad_events",
      "inputFormat": { "type": "json" },
      "consumerProperties": {
        "bootstrap.servers": "kafka-1:9092,kafka-2:9092"
      },
      "useEarliestOffset": false
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsInMemory": 100000
    }
  }
}

Step-by-step explanation.

timestampSpec tells Druid which JSON field is the event timestamp and how to parse it.
dimensionsSpec lists the columns to keep as dimensions; metrics are not dimensions in Druid.
metricsSpec declares the aggregators; doubleSum pre-aggregates spend at ingest.
segmentGranularity: HOUR means each segment file holds one hour of data.
queryGranularity: MINUTE is the rollup bucket — events are truncated to the minute before grouping.
rollup: true enables pre-aggregation. With these settings, 500k events/s often collapses to a few thousand pre-agg rows/s.

Output.

Setting	Effect
`queryGranularity: MINUTE`	rollup bucket = 1 min
`segmentGranularity: HOUR`	file size scope
`rollup: true`	pre-aggregate at ingest
`metricsSpec`	which columns to sum

Rule of thumb. Memorise the four sections of the supervisor spec — dataSchema, ioConfig, tuningConfig, and the top-level type — and you can read any Druid spec in production.

Worked example — Pinot realtime table config

Detailed explanation. Pinot configures realtime tables via two JSON files: a schema and a table config. The "low-level consumer" mode is the production default — it gives Pinot direct control over Kafka partitions and segment completion.

Question. Write a minimal Pinot realtime table config for the same ad_events topic with a star-tree index on (advertiser_id, country) summing spend.

Input. Same Kafka topic, JSON messages, daily segment cadence.

Code.

// schema
{
  "schemaName": "ad_events",
  "dimensionFieldSpecs": [
    {"name": "advertiser_id", "dataType": "STRING"},
    {"name": "country",       "dataType": "STRING"}
  ],
  "metricFieldSpecs": [
    {"name": "spend", "dataType": "DOUBLE"}
  ],
  "dateTimeFieldSpecs": [
    {"name": "ts", "dataType": "TIMESTAMP", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MINUTES"}
  ]
}

// table config (excerpt)
{
  "tableName": "ad_events_REALTIME",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "ts",
    "schemaName": "ad_events",
    "replication": "2"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["advertiser_id", "country"],
        "functionColumnPairs": ["SUM__spend", "COUNT__*"]
      }
    ],
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "ad_events",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.broker.list": "kafka-1:9092",
      "realtime.segment.flush.threshold.rows": "5000000"
    }
  }
}

Step-by-step explanation.

The schema names two dimensions, one metric, and a time column.
The table config attaches a star-tree on (advertiser_id, country) precomputing SUM(spend) and COUNT(*).
stream.kafka.consumer.type: lowlevel enables Pinot's preferred consumer — controllers assign Kafka partitions to realtime servers explicitly.
realtime.segment.flush.threshold.rows: 5_000_000 is the segment-completion trigger; when a consuming segment reaches 5M rows, Helix runs the completion protocol.
After completion, the sealed segment is uploaded to deep storage and assigned to an offline server.

Output.

Mechanism	Effect
Star-tree on `(advertiser_id, country)`	query latency O(log N) on aggregations
Low-level consumer	controller-managed partition assignment
`flush.threshold.rows`	controls segment size + cadence
Replication = 2	two realtime + two offline copies

Rule of thumb. Always configure star-tree indexes on the dimensions that drive your dashboard filters — that single setting is the difference between Pinot at 30 ms and Pinot at 300 ms.

Worked example — ClickHouse Kafka engine + MV

Detailed explanation. ClickHouse ingestion is a three-table pattern: a Kafka engine source, a MergeTree target, and a materialised view glue. Once you have written it once, every subsequent topic is copy-paste.

Question. Write the three CREATE TABLE statements that consume ad_events and write to a MergeTree partitioned by day.

Input. Same topic, JSON messages.

Code.

-- 1) Kafka source table
CREATE TABLE ad_events_kafka
(
    ts             DateTime64(3),
    advertiser_id  String,
    country        LowCardinality(String),
    spend          Float64
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka-1:9092',
    kafka_topic_list  = 'ad_events',
    kafka_group_name  = 'clickhouse_ad_events',
    kafka_format      = 'JSONEachRow',
    kafka_num_consumers = 2;

-- 2) Target MergeTree
CREATE TABLE ad_events
(
    ts             DateTime64(3),
    advertiser_id  String,
    country        LowCardinality(String),
    spend          Float64
)
ENGINE = MergeTree()
PARTITION BY toDate(ts)
ORDER BY (country, advertiser_id, ts);

-- 3) Materialised view glue
CREATE MATERIALIZED VIEW ad_events_mv TO ad_events AS
SELECT ts, advertiser_id, country, spend
FROM ad_events_kafka;

Step-by-step explanation.

The Kafka engine table is a stream — every row is consumed exactly once per consumer-group offset.
SELECTing directly from ad_events_kafka consumes messages; this is not what you want.
The materialised view triggers on every batch of incoming rows, runs the SELECT, and INSERTs into ad_events.
The target table is a normal MergeTree, partitioned by day and ordered by (country, advertiser_id, ts) — the ORDER BY is the primary key and the sort order on disk.
Pre-aggregation can be added by changing the target to AggregatingMergeTree and using -State combinators in the MV SELECT.

Output.

Table	Engine	Role
`ad_events_kafka`	Kafka	source (consumes the topic)
`ad_events_mv`	MaterializedView	trigger (consumes + inserts)
`ad_events`	MergeTree	target (storage + queries)

Rule of thumb. Always use the three-table pattern. SELECTing from the Kafka table directly works once and then mysteriously stops because offsets advanced — a classic ClickHouse-onboarding bug.

Real-time OLAP ingestion interview question

A senior interviewer often opens with: "Compare how Druid, Pinot, and ClickHouse handle exactly-once Kafka ingestion. Which one would you trust by default and why?" It blends the three ingest models with the realities of exactly-once delivery.

Solution Using a per-engine exactly-once matrix

Druid:
  - Kafka indexing service tracks offsets in metadata DB
  - At-least-once by default; exactly-once via supervisor "exactlyOnce" mode
  - Handoff to historicals does not duplicate; segments are idempotent by ID

Pinot:
  - Realtime servers track offsets in Helix property store
  - Exactly-once-ish: sealed segments are deterministic by partition + offset range
  - Crashed consuming segment is re-consumed; deduplication via segment ID

ClickHouse:
  - Kafka engine commits offsets after MV INSERT succeeds
  - Crash between INSERT and commit -> duplicates
  - Dedup via ReplacingMergeTree on (event_id) or external idempotency key

Step-by-step trace.

Engine	Offset tracking	Default semantics	Strict exactly-once path
Druid	metadata DB	at-least-once	supervisor `exactlyOnce` mode
Pinot	Helix property store	effectively exactly-once via segment IDs	enable transactional sealing
ClickHouse	Kafka group offsets	at-least-once	`ReplacingMergeTree` on event_id

The lesson: none of the three give pure exactly-once for free. Druid has the most explicit story (supervisor flag), Pinot's segment-ID determinism is the most elegant, and ClickHouse defers exactly-once to the schema layer (ReplacingMergeTree).

Output:

Engine	Trust level for exactly-once Kafka
Druid (exactlyOnce supervisor)	high
Pinot (segment-ID dedup)	high
ClickHouse (ReplacingMergeTree)	high if schema is correct

Why this works — concept by concept:

Offset-tracking ownership — the engine that owns offsets reliably (Druid metadata DB, Pinot Helix) has an easier exactly-once story than one that delegates to Kafka (ClickHouse).
Segment ID determinism — Pinot's clever trick: a sealed segment ID is (partition, start_offset, end_offset). Re-sealing the same range produces the same ID, which the controller dedups.
Schema-level dedup — ClickHouse asks you to design the dedup at the table layer, not the ingestion layer. Powerful but easy to get wrong.
Replay safety — all three support replaying a Kafka topic from a known offset; only Druid and Pinot make replay idempotent without schema-side work.
Cost — exactly-once costs throughput. Druid's exactlyOnce mode caps ingest parallelism; Pinot's deterministic sealing adds Helix coordination cost; ClickHouse's ReplacingMergeTree adds a merge-time dedup pass.

Data Engineering
Topic — streaming
Kafka ingestion problems

Practice →

5. Query patterns and engine fit

The same dashboard query lands on three different execution shapes — pick the engine that matches your filter pattern, not the one with the best benchmark

The mental model in one line: Druid wins time-filtered group-bys over rolled-up data; Pinot wins low-latency point-and-index queries; ClickHouse wins arbitrary analytics SQL with JOINs and window functions — and the right answer to "which is fastest" is always "for what query shape?". Once you can sketch the same query in all three engines, the fit decision is a one-sentence check.

Druid's query shape.

Time-filtered group-by with pre-aggregated metrics. "Show me top advertisers by spend in the last 24h, by hour" — the broker prunes segments by __time, scans the rolled-up rows, and merges.
No expensive JOINs by default. Druid added lookup joins and broadcast joins, but heavy multi-table JOINs are an anti-pattern; you denormalise at ingest.
TopN as a first-class query. Druid's topN query type is faster than a generic groupBy for the common "top-K" pattern.
HLL / sketches built in. hyperUnique columns store HLL sketches at ingest; cardinality queries become a sketch read.

Pinot's query shape.

Indexed point + group queries. "Give me the click-through-rate for advertiser X in the last hour" — the broker hits the star-tree index and returns in tens of ms.
Predicate pushdown via bitmap. Multi-dim filters become bitmap intersections per segment.
Single-table workhorse. Multi-stage query engine (MSQE) added in 2023 enables JOINs, but single-table queries are where Pinot shines.
Lookup joins via dimension tables. Common pattern for user-facing dashboards.

ClickHouse's query shape.

Arbitrary analytics SQL. Window functions, CTEs, complex JOINs, subqueries, array functions, dictionary lookups — all first-class.
JOINs work but require care. JOIN strategies (hash, partial merge, parallel hash) tune per query; the right strategy is workload-dependent.
Wide function library. quantileTDigest, uniqCombined, topK, groupArray, arrayJoin — half the time a dedicated function replaces a complex SELECT.
Materialised views as pre-agg. Same pattern as ingestion MV but reused for query acceleration.

SQL maturity ranking.

ClickHouse > Pinot > Druid. ClickHouse is closest to a "real" SQL engine. Pinot's SQL is targeted at dashboard patterns. Druid's SQL is a thin layer over its native JSON query language.

Worked example — the same query in all three.

Same query: "spend per country in the last hour from ad_events."

-- Druid SQL
SELECT country, SUM(spend_sum) AS spend
FROM ad_events_rollup
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY country
ORDER BY spend DESC;

-- Pinot SQL
SELECT country, SUM(spend) AS spend
FROM ad_events
WHERE ts >= ago('PT1H')
GROUP BY country
ORDER BY spend DESC;

-- ClickHouse SQL
SELECT country, sum(spend) AS spend
FROM ad_events
WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY country
ORDER BY spend DESC;

The text is nearly identical. The execution differs sharply: Druid hits a rolled-up table; Pinot hits a star-tree (or bitmap-indexed segments); ClickHouse vectorises a column scan over partitions in the time window.

Decision matrix — query shape vs engine.

Query shape	Best fit
Time-bucketed metrics + rollups	Druid
User-facing dashboards < 100 ms with indexed dims	Pinot
Arbitrary SQL + JOINs + windows	ClickHouse
Long-tail ad-hoc analytics	ClickHouse
Real-time top-K by dimension	Druid or Pinot
Multi-table joins as a first-class need	ClickHouse

Combining engines in the same stack.

Pinot + ClickHouse. Pinot for user-facing dashboards; ClickHouse for internal ad-hoc. Common at Uber and Stripe.
Druid + ClickHouse. Druid for pre-aggregated time-series; ClickHouse for analyst exploration. Common in ad-tech.
All three? Rare. Sign of org sprawl or a misaligned mandate. Most teams settle on two.

When you should stop comparing and just pick one.

If your workload is clearly time-series with heavy rollup potential — Druid.
If your latency budget is < 100 ms and queries are dashboard-shaped — Pinot.
If you want a single binary that does general analytics — ClickHouse.
If you cannot decide between the latter two for a non-dashboard workload, default to ClickHouse — it has the lowest ops cost and the broadest SQL.

Worked example — Druid TopN vs GroupBy

Detailed explanation. Druid's topN query is a specialised, approximate (by default) version of groupBy LIMIT N. It is faster and uses less memory because it avoids materialising the full group-by hash. Senior Druid users know to reach for it.

Question. Rewrite a "top 10 advertisers by spend in the last hour" as both a Druid groupBy and a Druid topN. Explain the cost difference.

Input. The ad_events_rollup table on Druid.

Code.

-- GroupBy variant (Druid SQL)
SELECT advertiser_id, SUM(spend_sum) AS spend
FROM ad_events_rollup
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY advertiser_id
ORDER BY spend DESC
LIMIT 10;

// TopN variant (native JSON; Druid SQL auto-translates eligible queries)
{
  "queryType": "topN",
  "dataSource": "ad_events_rollup",
  "intervals": ["2026-06-15T00:00:00Z/2026-06-15T01:00:00Z"],
  "granularity": "all",
  "dimension": "advertiser_id",
  "metric": "spend",
  "threshold": 10,
  "aggregations": [
    {"type": "doubleSum", "name": "spend", "fieldName": "spend_sum"}
  ]
}

Step-by-step explanation.

The groupBy query allocates a hash map sized by the cardinality of advertiser_id (could be millions); the broker then sorts and trims to 10.
The topN query maintains a per-segment top-N heap of size threshold * factor and merges heaps at the broker. Memory and CPU drop sharply.
By default, topN is approximate beyond the top-K; results are deterministic for the actual top-K rows but the long tail may shuffle.
Druid SQL automatically rewrites eligible GROUP BY ... ORDER BY ... LIMIT N queries to the native topN when safe.

Output.

Variant	Memory per segment	Latency on hot cache
`groupBy`	O(cardinality)	tens to hundreds of ms
`topN`	O(threshold × small factor)	low tens of ms

Rule of thumb. For "top K by dimension" dashboards, prefer topN. Druid's SQL planner will choose it automatically when the query shape is recognisably top-K — write the SQL idiomatically and trust the planner.

Worked example — Pinot star-tree lookup

Detailed explanation. A star-tree index pre-aggregates the table along chosen dimensions. A query that filters on those dimensions hits a single tree node instead of scanning rows. The result is logarithmic-in-N latency for grouped queries — Pinot's signature trick.

Question. Show how a star-tree on (advertiser_id, country) accelerates the canonical query. Sketch the tree.

Input. Star-tree configured per the Pinot table config in section 4.

Code (the query that benefits).

SELECT country, SUM(spend) AS spend
FROM ad_events
WHERE advertiser_id = 'adv_77'
  AND ts >= ago('PT1H')
GROUP BY country
ORDER BY spend DESC;

Step-by-step explanation.

The star-tree root holds the total SUM(spend) across all dimensions.
The first split is on advertiser_id; child nodes correspond to each advertiser plus a special "star" node (no filter on advertiser).
The second split is on country; each (advertiser, country) leaf node holds the SUM(spend) for that pair.
The query "advertiser = adv_77, group by country" finds the adv_77 child, then enumerates its country leaves — returning the per-country sums directly.
No row scan; one tree walk per segment, merged at the broker.

Output.

Without star-tree	With star-tree
scan all matching rows	walk tree to leaf nodes
O(rows) per segment	O(log dimensions) per segment
hundreds of ms	tens of ms

Rule of thumb. Build star-tree indexes on the dimensions your dashboards filter and group by. Disk cost is real (a few extra GB per segment is typical); query latency drops by 5–20x.

Worked example — ClickHouse window function

Detailed explanation. ClickHouse's window functions (OVER (PARTITION BY ... ORDER BY ...)) bring it close to a "real" SQL engine. The canonical example is a rolling 5-minute spend per advertiser — a workload that Druid handles via rollup, Pinot avoids, and ClickHouse runs as plain SQL.

Question. Compute rolling 5-minute spend per advertiser using a ClickHouse window function.

Input. A ad_events table on ClickHouse.

Code.

SELECT
    advertiser_id,
    toStartOfMinute(ts)                                 AS minute,
    sum(spend)                                          AS minute_spend,
    sum(sum(spend)) OVER (
        PARTITION BY advertiser_id
        ORDER BY toStartOfMinute(ts)
        ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
    )                                                   AS rolling_5m_spend
FROM ad_events
WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY advertiser_id, minute
ORDER BY advertiser_id, minute;

Step-by-step explanation.

The inner sum(spend) aggregates per (advertiser_id, minute).
The window function sum(sum(spend)) OVER (...) sums the per-minute totals across a 5-row sliding frame (current row plus 4 preceding rows = 5 minutes).
PARTITION BY advertiser_id keeps each advertiser's sliding window independent.
ROWS BETWEEN 4 PRECEDING AND CURRENT ROW is the frame spec — 5 minute-rows total.
Vectorised execution evaluates this in one streaming pass over the partition.

Output (excerpt).

advertiser_id	minute	minute_spend	rolling_5m_spend
adv_77	10:00	200	200
adv_77	10:01	150	350
adv_77	10:02	300	650
adv_77	10:03	100	750
adv_77	10:04	250	1000
adv_77	10:05	400	1200

Rule of thumb. When you need window functions on real-time OLAP data, ClickHouse is the default answer. Druid does not support windows natively; Pinot's MSQE engine is improving but not the first choice.

Real-time OLAP query-fit interview question

A senior interviewer often opens with: "Given three workloads — ad-tech rollups, member-facing < 100 ms dashboards, and analyst SQL with window functions — write the same 'top countries by spend' query in the engine that fits each and explain why." It blends fit + idiomatic SQL.

Solution Using engine-idiomatic SQL per workload

-- Workload 1: ad-tech rollups -> Druid SQL
SELECT country, SUM(spend_sum) AS spend
FROM ad_events_rollup
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY country
ORDER BY spend DESC
LIMIT 10;

-- Workload 2: member-facing < 100 ms -> Pinot SQL (star-tree backed)
SELECT country, SUM(spend) AS spend
FROM ad_events
WHERE ts >= ago('PT1H')
GROUP BY country
ORDER BY spend DESC
LIMIT 10;

-- Workload 3: analyst SQL with window -> ClickHouse SQL
SELECT
    country,
    sum(spend)                                              AS spend,
    sum(sum(spend)) OVER (ORDER BY country)                 AS running_total
FROM ad_events
WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY country
ORDER BY spend DESC
LIMIT 10;

Step-by-step trace.

Workload	Engine	Execution shape	Approx latency
Ad-tech rollups	Druid	rolled-up table scan + topN	tens of ms
< 100 ms dashboard	Pinot	star-tree leaf walk + broker merge	tens of ms
Analyst with window	ClickHouse	vectorised scan + window pass	low hundreds of ms

Each engine is given a query it is designed for. The latency numbers match each engine's strong suit; running the same workload on the wrong engine would degrade by 5–20x.

Output:

Workload	Best query latency
Druid + rollup	~25 ms
Pinot + star-tree	~30 ms
ClickHouse + window	~150 ms

Why this works — concept by concept:

Engine-idiomatic SQL — each query uses functions and time-filter syntax native to that engine (__time, ago('PT1H'), now() - INTERVAL). Avoid forcing one engine's SQL onto another.
Workload-engine alignment — Druid handles the rolled-up scan; Pinot handles the indexed dashboard; ClickHouse handles the window. None is "the best" — each is best for its workload.
Latency by construction — pre-aggregation (Druid, Pinot) trades disk for latency; raw scans (ClickHouse) trade latency for flexibility. Choose your trade.
Cost — total cost = engine ops cost + ingest cost + storage cost + query cost. Optimise the workload-engine fit before optimising any single line item.
Decision discipline — pick by query shape, not by benchmark; revisit when the workload changes; do not let one bad fit poison the whole platform.

Data Engineering
Topic — real-time analytics
Real-time OLAP query patterns

Practice →

Cheat sheet — Druid vs Pinot vs ClickHouse recipes

Time-bucketed rollup. Druid granularitySpec: {queryGranularity: MINUTE, rollup: true} with metricsSpec aggregators (doubleSum, hyperUnique). 100x cardinality collapse is normal for ad-tech.
Star-tree pre-aggregation. Pinot starTreeIndexConfigs with dimensionsSplitOrder listing the filter dimensions and functionColumnPairs listing the aggregations. Build at segment-seal time.
Real-time materialised view. ClickHouse AggregatingMergeTree target + MaterializedView TO target AS SELECT ... groupArray*State(...). Finalise with -Merge combinators at query time.
Kafka ingestion — Druid. Submit a supervisor spec to overlord with type: kafka and the JSON dataSchema; the indexing service runs Kafka tasks on middle-managers.
Kafka ingestion — Pinot. Configure stream.kafka.consumer.type: lowlevel in the table config; controllers assign partitions to realtime servers; segments seal via Helix.
Kafka ingestion — ClickHouse. Three-table pattern: Kafka source + MaterializedView trigger + MergeTree target. Always insert via the MV, never SELECT from the Kafka table.
Sub-second filter — Druid. Bitmap indexes on dimensions are automatic; segment pruning by __time is automatic. Add searched CASE WHEN is not needed — Druid handles it natively.
Sub-second filter — Pinot. Configure invertedIndexColumns + bloomFilterColumns per table; the broker pushes predicates down to bitmap intersections per segment.
Sub-second filter — ClickHouse. Use ORDER BY (primary_key_cols) to align disk order with filter predicates; add INDEX skip_idx col TYPE bloom_filter GRANULARITY 4 for high-cardinality columns.
Deduplication — Druid. Set exactlyOnce mode in the supervisor; segments are idempotent by ID. Re-ingestion of the same offset range overwrites cleanly.
Deduplication — Pinot. Sealed segments are deterministic by (partition, offset_range); controllers dedup at completion. Use upsert table type for primary-key dedup.
Deduplication — ClickHouse. Use ReplacingMergeTree on (event_id) or external idempotency key; dedup happens at merge time, not insert time.
Top-K — Druid. Use the topN query type (or let Druid SQL rewrite GROUP BY ... ORDER BY ... LIMIT N). Per-segment heap merging beats full groupBy.
Top-K — Pinot. Plain SELECT ... GROUP BY ... ORDER BY ... LIMIT N; the star-tree returns the answer in a tree walk.
Top-K — ClickHouse. SELECT ... ORDER BY ... LIMIT N with LIMIT BY for top-K-per-group; the topK(N)(col) aggregate function for approximate top-K.
HLL / cardinality — Druid. Configure a hyperUnique metric at ingest; query with APPROX_COUNT_DISTINCT(metric).
HLL / cardinality — Pinot. Configure noDictionaryColumns plus fieldConfigList with H3 or HLL_PLUS indexes; query with DISTINCTCOUNTHLL(col).
HLL / cardinality — ClickHouse. Use uniqCombined(col) aggregate (HLL-based) or uniqHLL12(col) for fixed-precision; both are first-class.
JOINs. Druid avoids them — denormalise at ingest. Pinot supports them via the multi-stage query engine but with care. ClickHouse handles them natively; pick the JOIN algorithm with SETTINGS join_algorithm.
Window functions. Druid does not support window functions directly. Pinot's MSQE has limited support. ClickHouse has full SQL window functions — preferred for windowed analytics.
Time-zone handling. Druid stores epochs; query-time conversion via TIME_FORMAT. Pinot stores epochs; DATETRUNC for conversion. ClickHouse stores DateTime with optional TZ; toTimeZone(col, 'America/Los_Angeles') per column.
Choose the engine — one-line rules. Time-series + rollup → Druid. User-facing < 100 ms with indexes → Pinot. General SQL + JOINs + windows → ClickHouse. Latency SLO >= 5 minutes → stay on the batch warehouse.

Frequently asked questions

Druid vs Pinot — which is faster for user-facing analytics?

Pinot is generally faster for user-facing analytics where queries hit indexed dimensions and the latency budget is under 100 ms. Pinot's star-tree, bitmap, and sorted indexes are designed for the "filter + group by indexed dim" shape that powers LinkedIn-style member dashboards. Druid is competitive for time-bucketed pre-aggregated workloads but typically lands in the tens-to-hundreds-of-ms range rather than under 50 ms. If your workload is "show user X their personalised dashboard" Pinot wins; if your workload is "top advertisers by spend over rolled-up time," Druid is more economical.

Is ClickHouse a real-time OLAP database?

Yes — ClickHouse is a real-time OLAP engine, but with a different design philosophy than Druid or Pinot. It is a general-purpose columnar SQL engine with first-class Kafka ingest via the Kafka engine table + materialised view pattern, sub-second query latency on rolled-up AggregatingMergeTree data, and the broadest SQL surface of the three (JOINs, window functions, dictionary lookups, array functions). The trade-off is that ClickHouse asks you to design the pre-aggregation strategy explicitly (MV + AggregatingMergeTree) instead of declaring it once in a supervisor spec. For most teams without an ad-tech-grade time-series workload, ClickHouse is the lowest-ops real-time OLAP choice.

Can I use Druid without deep storage?

No — deep storage is mandatory in Druid. Historical servers cache segments locally for serving, but the durable copy lives in S3 (or HDFS / GCS / Azure Blob). The architecture is intentional: losing a historical server is a non-event because the missing segments are re-downloaded from deep storage. This is the inverse of ClickHouse, where local replicated storage is primary and S3 tiering is optional. If your storage budget rules out S3 (rare today), Druid is not the right engine — pick ClickHouse or stay on the batch warehouse.

What is Pinot's star-tree index?

A star-tree is a pre-aggregated multi-level tree where every node holds partial aggregates over a subset of dimensions. The leaves correspond to fully-grouped tuples; intermediate nodes correspond to "wildcard on this dimension" partial aggregates. A query that filters on the tree's split dimensions walks a few nodes and returns the answer without scanning rows. The cost is disk (a few extra GB per segment) and ingest time (build the tree when sealing a segment); the benefit is 5–20x faster grouped queries. Star-trees are configured per table via starTreeIndexConfigs listing dimensionsSplitOrder and functionColumnPairs.

Which engine has the best SQL support?

ClickHouse has the most complete SQL surface of the three. It supports full ANSI-ish SQL with window functions, CTEs, complex JOINs (with tunable algorithms), subqueries, array functions, and a vast aggregate-function library (quantileTDigest, uniqCombined, topK, argMax). Pinot's SQL is targeted at dashboard shapes and supports JOINs via the multi-stage query engine added in 2023, but with caveats around latency. Druid's SQL is a layer over its native JSON query language; it handles GROUP BY, topN, and filters well but lacks proper window functions and has limited JOIN support. If your workload requires general SQL flexibility, ClickHouse is the default answer.

Druid vs Pinot vs ClickHouse — when does each one win?

Druid wins time-bucketed metric workloads where heavy pre-aggregation (rollup) collapses billions of events into millions of rolled-up rows — ad-tech, observability, network telemetry. Pinot wins user-facing dashboards where the latency budget is under 100 ms and queries hit indexed dimensions — LinkedIn-style personalised analytics, payments dashboards, marketplace operations. ClickHouse wins everything else — general analytics SQL, ad-hoc analyst exploration, anything that needs JOINs or window functions, and any team that wants the lowest-ops real-time OLAP option. The three are not interchangeable; the right answer to "which one" is always "for what query shape and what latency SLO?"

Practice on PipeCode

Drill the real-time analytics practice library → for the time-bucketed, sub-second query patterns that show up in every Druid / Pinot / ClickHouse interview.
Rehearse on streaming pipeline problems → when the interviewer wants Kafka ingest + windowing + state-management answers.
Sharpen aggregation drills → for the rollup, star-tree, and AggregatingMergeTree muscle that pre-aggregation engines reward.
Layer the data aggregation library → for the broader GROUP BY + cardinality + HLL surface.
Stack the conditional aggregation library → for the FILTER + CASE patterns inside engine-specific SQL.
For the broader surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Sharpen the system-design axis with the ETL system design for data engineering interviews course →.
For long-form schema craft, work through data modelling for DE interviews →.

Pipecode.ai is Leetcode for Data Engineering — every Druid / Pinot / ClickHouse recipe above ships with hands-on practice rooms where you write the rollup supervisor, the star-tree config, and the AggregatingMergeTree materialised view against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your design for a sub-second top-K dashboard actually scales the way you claimed.

Practice real-time analytics now →
Streaming drills →