apache pulsar vs kafka is the messaging question that every senior data engineer has answered twice in the last five years — once when their first multi-region pipeline shipped, and again when retention requirements outgrew local disk. Kafka set the partition-log template in 2011; Pulsar reframed it in 2016 with a split-storage model, native geo-replication, and a built-in functions runtime. Both engines are now mature, both have tiered storage, both have transactions and exactly-once — and yet the right pick for a brand-new pipeline in 2026 still depends on whether you need topic-per-tenant scale, active/active multi-region by default, or simple single-region throughput at maximum economy.
This guide walks the four deltas that actually drive the decision: how the brokers and storage layer differ (Kafka's coupled broker-plus-log vs Pulsar's stateless brokers with bookkeeper segment storage), how each engine handles cross-region replication (Pulsar's pulsar geo replication baked into the topic property vs Kafka's MirrorMaker 2 deployment), how tiered cold storage is implemented (Pulsar's BookKeeper offloader vs Kafka KIP-405's pulsar tiered storage equivalent), and where lightweight stream processing lives (pulsar functions inside the broker vs Kafka Streams + Kafka Connect outside it). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the streaming practice library →, rehearse on event-processing problems →, and stack the systems muscles with real-time analytics drills →.
On this page
- Why two log platforms exist — Kafka 2011 vs Pulsar 2016
- Architecture — brokers, partitions, segments, BookKeeper
- Geo-replication — Pulsar native vs Kafka MirrorMaker 2
- Tiered storage — BookKeeper offload vs KIP-405
- Pulsar Functions + multi-tenancy + topic-per-tenant model
- Cheat sheet — Pulsar vs Kafka recipes
- Frequently asked questions
- Practice on PipeCode
1. Why two log platforms exist — Kafka 2011 vs Pulsar 2016 design choices
Two engines, two eras, two architectural pivots — and a market that grew big enough for both
The one-sentence invariant: Kafka was designed in 2011 to be the cheapest, fastest, simplest distributed commit log; Pulsar was designed in 2016 to be the most operationally flexible multi-tenant messaging fabric, and the two architectural pivots — split storage and native geo-replication — still drive every trade-off you make in 2026. Once you internalise "Kafka picked simplicity, Pulsar picked flexibility," every probe about apache pulsar vs kafka reduces to a workload question, not a benchmark one.
The Kafka origin story.
- 2011 — LinkedIn. Jay Kreps, Neha Narkhede, and Jun Rao build Kafka as a unified log to replace a forest of point-to-point queues. The mental model is one big distributed commit log — every event lands on a partition, every consumer reads sequentially, and the log is the source of truth.
- The 2011 design pivots. (a) Storage and compute live in the same broker — every Kafka broker owns a set of partition log files on its local disk; (b) replication is intra-cluster — leader/follower per partition, controlled by ZooKeeper; (c) the consumer position lives on the consumer (offset commits), not the broker.
-
What that bought Kafka. Disk-sequential I/O, page-cache friendliness, zero-copy reads via
sendfile, and a simple ops model — one process per machine, one log per partition. The cost: rebalancing a hot partition means physically copying gigabytes of log data, and ZooKeeper coordination becomes the bottleneck at very large topic counts.
The Pulsar origin story.
- 2016 — Yahoo (open-sourced). Yahoo's messaging team builds Pulsar to serve hundreds of internal tenants on the same fabric — Mail, Finance, Search, Sports — and to replicate across data centres natively. The mental model is a broker is a stateless proxy over a separately scaled storage layer.
- The 2016 design pivots. (a) Brokers are stateless — they own no on-disk data; (b) storage lives in BookKeeper, a separate distributed log designed at Yahoo from 2011, where each ledger (segment) is striped across an ensemble of bookies; (c) geo-replication is built into the topic as a property — set a replication cluster list and writes flow to every region asynchronously; (d) multi-tenancy is first-class — tenants → namespaces → topics form a path that hosts ACLs, quotas, and retention.
- What that bought Pulsar. Rebalancing a broker is seconds because brokers carry no state; one cluster can host hundreds of thousands of topics because metadata is in ZooKeeper-or-etcd and storage is striped, not co-resident; geo-replication is a configuration flag, not a separate deployment. The cost: more moving parts (broker + bookie + metadata), a steeper ops learning curve, and a longer historical track record needed to feel safe on the most demanding throughput workloads.
The 2018–2026 convergence.
- Kafka added geo-replication. MirrorMaker 1 (2014) was simple but unreliable on offsets; MirrorMaker 2 (2019, KIP-382) ships with Kafka Connect, supports active/active replication, and translates consumer offsets between clusters.
- Kafka removed ZooKeeper. KRaft (KIP-500, GA in Kafka 3.3, default in 3.7+) replaces ZooKeeper with an internal Raft quorum, shrinking the operational surface to one process per node.
- Kafka added tiered storage. KIP-405 (proposed 2018, Apache release in 3.6 / 2023) lets brokers offload cold segments to S3 / GCS / Azure Blob, mirroring what Pulsar shipped in 2.1 (2018).
- Pulsar added transactions and exactly-once. Pulsar 2.7 (2021) shipped transactions on top of the segment model; Pulsar Functions matured into a per-topic stream processor with windows and retries.
- Pulsar added etcd as a metadata store. Pulsar 3.x reduces the ZooKeeper dependency by allowing etcd as the metadata backend.
Where each engine still wins in 2026.
- Kafka wins on: the largest single-region throughput numbers, the deepest ecosystem (Confluent, Aiven, MSK, Strimzi, Redpanda), the broadest tooling (kSQL, Kafka Streams, Debezium), and the simplest "one process per node" mental model after KRaft.
-
Pulsar wins on: multi-region active/active by default, very-high-topic-count workloads (
multi tenancy kafka pulsaris the canonical search), light per-topic stream processing without an external app, and rebalance/recovery speed under broker churn. - The honest 2026 picture. Neither engine is "better" at the workload-agnostic level. The right answer depends on whether your top-priority constraint is raw single-region throughput economics (lean Kafka), multi-tenant topic explosion + geo (lean Pulsar), or team familiarity with one ecosystem already (always wins).
What interviewers listen for on pulsar vs kafka performance.
- Do you frame Pulsar's
bookkeeperas a separate scalable storage tier, not "just another broker"? — senior signal. - Do you mention that Kafka brokers store partition logs locally and Pulsar brokers store nothing? — required answer.
- Do you flag that MirrorMaker 2 is a separate Kafka Connect job and Pulsar geo-replication is a topic property? — senior signal.
- Do you separate "raw throughput" (Kafka still wins) from "ops at 100k topics" (Pulsar wins)? — senior signal.
Worked example — the storage-coupling delta in one diagram
Detailed explanation. Imagine two clusters, each with three brokers, each serving the same workload of 100 GB / hour into one topic with 12 partitions. In Kafka, the partition log files physically sit on the broker that leads each partition — when broker 1 dies, ZooKeeper / KRaft picks a new leader, and the follower replicas (which were already in-sync) become the new leader. In Pulsar, the broker that owns the topic is a stateless proxy — it forwards every write to an ensemble of bookies (say 3 bookies, write-quorum 2, ack-quorum 2) and remembers nothing locally. When the broker dies, the cluster simply reassigns topic ownership to a healthy broker — the new broker can start serving instantly because the data is already in the bookies.
Question. In an "all three brokers running" baseline, write the producer-side path for one message in both engines. Then trace what happens when the leader broker for one topic dies. Compare the recovery time and what data has to move.
Input (baseline cluster).
| component | Kafka | Pulsar |
|---|---|---|
| broker count | 3 | 3 |
| storage layer | local disk (replicated) | 3 BookKeeper bookies |
| partitions / topic | 12 | 12 |
| replication factor | RF=3 | Qw=3, Qa=2 |
| metadata | ZooKeeper / KRaft | ZooKeeper / etcd |
Code (pseudocode for the write path).
# Kafka — producer to broker
producer.send(topic="events", key=k, value=v)
-> Kafka client hashes k -> partition P
-> client looks up partition leader (broker-2 owns P)
-> sends to broker-2
-> broker-2 appends to its local log file for P
-> broker-2 replicates to in-sync followers (broker-1, broker-3)
-> broker-2 returns acks after min.insync.replicas met
-> producer receives ack
# Pulsar — producer to broker -> bookies
producer.send(topic="events", key=k, value=v)
-> Pulsar client hashes topic name -> broker (broker-2 owns the topic)
-> sends to broker-2 over binary protobuf
-> broker-2 wraps message in an entry, sends to bookie ensemble (3 bookies)
-> bookies write entry to journal + ledger
-> Qw=3 means all 3 bookies must receive; Qa=2 means 2 acks before broker confirms
-> broker-2 returns ack to producer
-> broker-2 itself stores no message data
Step-by-step explanation.
- Kafka write. The data lands on the leader broker's local log; replication is broker-to-broker. The broker is the storage system.
- Pulsar write. The data lands in BookKeeper bookies via the broker. The broker is purely a routing and protocol layer.
- Kafka broker-2 dies. ZooKeeper / KRaft elects a new leader from the in-sync follower set (broker-1 or broker-3 takes over leadership of P). No data movement is required if the follower was in-sync. If the follower lagged, it must catch up before it can serve reads. Time to recover: seconds (election) + minutes (potential follower catch-up).
- Pulsar broker-2 dies. The metadata store notices the broker is gone, and the load-manager reassigns the broker-2 topics to broker-1 or broker-3. No data movement occurs because the data lives in bookies, not on brokers. The new broker connects to the same bookies and continues serving. Time to recover: seconds — and there is no notion of "out-of-sync follower" because storage is not a per-broker concept.
- The deeper consequence. In Kafka, scaling out a hot partition means physically moving its log to a new broker (a partition reassignment can take hours on a busy 100 GB partition). In Pulsar, scaling out a hot topic means changing topic-to-broker ownership in the metadata store — instant, and the bookies were already striped across the cluster.
Output (recovery profile).
| event | Kafka recovery | Pulsar recovery |
|---|---|---|
| broker death (in-sync followers) | ~5–10 s leader election | ~5–10 s ownership reassignment |
| broker death (lagging followers) | minutes — follower catch-up | unchanged — bookies hold data |
| hot-partition rebalance | minutes-to-hours (data movement) | seconds (ownership flip) |
| add new broker | minutes-to-hours (partition copy) | seconds (load manager rebalances) |
Rule of thumb. If your workload churns brokers often (autoscaling on Kubernetes, frequent rolling restarts, opportunistic capacity), the Pulsar split-storage model pays back the extra moving parts. If your workload is "ten brokers running steady state for two years," the Kafka coupled model has fewer surfaces and a simpler ops story.
SQL interview question on event-time deduplication for replayed streams
A senior interviewer might frame this as: "After a Pulsar-to-Kafka or Kafka-to-Kafka replay, the downstream warehouse table may contain duplicates. Write a SQL deduplication query that keeps the latest event per event_id using ingestion time as the tie-breaker." This blends streaming semantics with SQL — a common follow-up after an architecture discussion.
Solution Using a deduplication window function
-- Keep the latest event per event_id; ties broken by ingestion_ts DESC.
WITH ranked AS (
SELECT
event_id,
tenant_id,
payload,
event_ts,
ingestion_ts,
ROW_NUMBER() OVER (
PARTITION BY event_id
ORDER BY ingestion_ts DESC, event_ts DESC
) AS rn
FROM stage_events
WHERE ingestion_ts >= CURRENT_DATE - INTERVAL '7 days'
)
SELECT event_id, tenant_id, payload, event_ts, ingestion_ts
FROM ranked
WHERE rn = 1;
Step-by-step trace.
| event_id | ingestion_ts | event_ts | row_number | kept? |
|---|---|---|---|---|
| evt-1 | 2026-06-10 09:00 | 2026-06-10 08:59 | 2 | no |
| evt-1 | 2026-06-10 09:05 | 2026-06-10 08:59 | 1 | yes |
| evt-2 | 2026-06-10 09:01 | 2026-06-10 09:01 | 1 | yes |
| evt-3 | 2026-06-10 09:02 | 2026-06-10 09:02 | 2 | no |
| evt-3 | 2026-06-10 09:07 | 2026-06-10 09:02 | 1 | yes (replay wins) |
Under PARTITION BY event_id ORDER BY ingestion_ts DESC, the later ingestion_ts gets rn=1. So the replayed row (09:07) is rn=1 and the original (09:02) is rn=2. The final result keeps the most-recently ingested copy per event_id — exactly the semantic you want when a stream is replayed and may carry corrections.
Output:
| event_id | kept ingestion_ts | source |
|---|---|---|
| evt-1 | 2026-06-10 09:05 | latest |
| evt-2 | 2026-06-10 09:01 | only copy |
| evt-3 | 2026-06-10 09:07 | replayed wins |
Why this works — concept by concept:
-
PARTITION BY event_id — the dedup key. Every row sharing an
event_identers the same row-number sequence, independent of all other events. -
ORDER BY ingestion_ts DESC — picks the most recent copy by broker-ack time, which is the right tie-breaker after a replay (the new copy carries the corrected payload). For "first-write-wins" semantics, use
ASC. -
ROW_NUMBER() vs RANK() vs DENSE_RANK() — ROW_NUMBER assigns a unique sequence per partition, so the
WHERE rn = 1filter always keeps exactly one row. RANK and DENSE_RANK can produce ties and would keep multiple rows for the same event_id, defeating the dedup. - WHERE ingestion_ts >= now - 7 days — a sliding window prunes the scan; without it the dedup runs over the entire history every batch, which is O(table).
-
Cost — O(n log n) for the window sort per partition, where n is the rows per event_id (usually small). With a proper index on
(event_id, ingestion_ts DESC), the scan is near-linear in the date window.
SQL
Topic — streaming
Streaming problems (SQL)
2. Architecture — Kafka brokers + partitioned log vs Pulsar brokers + BookKeeper segments + metadata store
The deepest delta in apache pulsar vs kafka is where the bytes live — and what that means for every operational gesture you make on the cluster
The mental model in one line: Kafka brokers are fat — they hold compute, the partition log, and replication; Pulsar brokers are thin — they hold only compute, and the log lives in a separate BookKeeper layer keyed by segments (ledgers) instead of by partition. Once you can draw the two architectures side by side, every other trade-off — geo-replication, tiered storage, multi-tenancy — follows from that single split.
The Kafka broker in five bullets.
- One process. A Kafka broker is a single JVM that handles client protocol, partition leadership, log writes, log compaction, replication, and (in legacy mode) talks to ZooKeeper or (in modern mode) participates in the KRaft quorum.
-
Local log files. Each partition is a directory on local disk containing rolled
.logsegment files plus index files. Reads use the page cache +sendfilefor zero-copy. -
Replication is broker-to-broker. Leader broker writes to local log, follower brokers fetch and replicate.
min.insync.replicascontrols durability. - Scaling = data movement. Adding a broker or rebalancing partitions physically copies log data over the network. A 500 GB partition takes hours.
- Metadata in ZooKeeper or KRaft. Pre-3.3, ZooKeeper coordinates leader election and metadata. KRaft (3.3+, default in 3.7) replaces ZooKeeper with an internal Raft quorum of controller nodes.
The Pulsar broker in five bullets.
- Stateless proxy. A Pulsar broker is a JVM that handles client protocol, topic ownership, and forwards writes to BookKeeper. It owns no on-disk message data.
- Topic ownership. Each topic is owned by exactly one broker at a time; the load manager rebalances ownership when brokers come and go.
- BookKeeper does the durability. Writes flow broker → bookies. A ledger is a write-once segment striped across an ensemble of bookies (e.g. 3 bookies, write-quorum 3, ack-quorum 2).
- Scaling = ownership flip. Adding a broker or moving a hot topic is a metadata change — milliseconds — because the data does not move.
- Metadata in ZooKeeper or etcd. Pulsar uses ZooKeeper for cluster metadata historically; Pulsar 3.x supports etcd as the metadata backend.
BookKeeper in five bullets.
- A separate log service. BookKeeper is an Apache top-level project — it predates Pulsar (Yahoo, 2011). Pulsar happens to be its biggest user.
- Ledger is the unit. A ledger is a write-once, append-only segment. When a ledger closes, it is immutable.
- Ensemble + quorum. Each ledger is striped across an ensemble of bookies. Writes go to Qw bookies and the broker considers the entry committed once Qa of them ack.
- Bookies are storage nodes. A bookie holds a journal (for write durability) and ledger directories (for read efficiency). Hard separation of write path and read path inside the bookie.
- Striping = parallelism. Different ledgers can live on different bookies; one topic's ledgers can be spread across the whole cluster, giving high parallel write throughput.
Three concrete architectural consequences.
- Rebalance speed. Pulsar broker rebalance is a metadata operation; Kafka partition rebalance is a data copy. For a cluster that churns capacity, this matters every week.
- Topic count ceiling. Pulsar's metadata model and stateless brokers scale topic counts further than classic Kafka. KRaft has narrowed the gap but Pulsar still leads at the 100k+ topic axis.
- Operational surface. Kafka after KRaft is fewer moving parts (broker = everything). Pulsar is more moving parts (broker + bookie + metadata) but each part scales independently.
Common interview probes on pulsar vs kafka performance architecture.
- "Why does Pulsar have BookKeeper at all?" — to decouple the durability layer from the compute layer, which makes brokers stateless and storage independently scalable.
- "What is a ledger?" — a write-once segment in BookKeeper. Multiple ledgers per topic. Closed ledgers are immutable, which makes offload to S3 safe.
- "How does Kafka handle stateless brokers?" — it doesn't (in the classic sense). KRaft removed ZooKeeper, but partition data still lives on the broker. The path to "stateless Kafka" is tiered storage (KIP-405), which keeps the local log shorter so rebalance is faster.
- "What is the write quorum in Pulsar?" —
Qwis the number of bookies the broker writes to per entry;Qais the number of acks the broker waits for before confirming the write. TypicalQw=3, Qa=2.
Worked example — write-path latency budget in both engines
Detailed explanation. Both engines target single-digit-millisecond producer-to-broker p99 latency on a healthy cluster. The path differs: Kafka is one hop (producer → leader broker → disk + replicate), Pulsar is two hops (producer → broker → bookies → disk). The extra hop is one of the things that makes Pulsar architecturally more flexible but also adds a small fixed latency cost.
Question. Lay out a per-message write budget for both engines on a healthy cluster. Identify which hops are fixed cost and which are dependent on min.insync.replicas / Qa.
Input (assumptions).
| factor | value |
|---|---|
| network hop | 0.2 ms |
| disk write (journal) | 0.5 ms |
| RF / Qw, Qa | RF=3 min.isr=2 ; Qw=3 Qa=2 |
| client → broker RTT | 0.4 ms |
Code (latency math, pseudocode).
# Kafka — producer p99 budget
client -> leader broker : 0.4 ms (one RTT, half is included in ack)
leader broker append to log : 0.5 ms (disk + page cache)
leader -> follower replicate : 0.4 ms (one RTT, parallel to leader write)
follower ack to leader : already counted
leader confirms when min.isr=2 : leader + 1 follower ack
TOTAL : ~1.5-3 ms p99 healthy
# Pulsar — producer p99 budget
client -> broker : 0.4 ms
broker -> bookies (Qw=3) : 0.4 ms RTT (parallel to all 3)
bookies journal write : 0.5 ms each (parallel)
Qa=2 acks back to broker : 0.4 ms
broker -> client : 0.4 ms
TOTAL : ~2-4 ms p99 healthy
Step-by-step explanation.
-
Kafka write path. The client sends the message to the partition leader broker. The leader appends to its local log file and replicates to in-sync followers in parallel. Once
min.insync.replicasfollowers ack, the leader returns success. One network hop client-to-leader, plus one parallel replication hop. - Pulsar write path. The client sends the message to the topic-owning broker (which it discovered via a lookup the first time). The broker forwards the entry to the BookKeeper ensemble of bookies. Each bookie writes to its journal and acks. Once Qa bookies ack, the broker returns success to the client. Two network hops in the critical path: client-to-broker and broker-to-bookies.
- Why the extra hop is small. Bookies are in the same data centre as brokers; the RTT is a fraction of a millisecond. The journal write at the bookie is the dominant cost — same as the log append on a Kafka broker.
- Why the architectural cost pays back. That extra hop purchased statelessness, which buys instant rebalance and easier scaling. For most workloads, an extra 0.5–1 ms p99 is fine.
- The tunable knobs. Lowering Qa from 3 to 2 trades durability for latency (one less ack to wait for). Lowering RF from 3 to 2 in Kafka does the same. Be explicit about durability requirements before you tune.
Output (write-path latency).
| engine | hops in critical path | typical p99 (healthy) | dominant cost |
|---|---|---|---|
| Kafka | client → leader; leader → 1 follower (parallel) | 1.5–3 ms | leader log append + replication ack |
| Pulsar | client → broker; broker → bookies | 2–4 ms | bookie journal write + Qa ack |
Rule of thumb. Kafka p99 is typically a tick lower because there is one fewer network hop in the critical path. Pulsar p99 is rarely the bottleneck for analytics or business-event workloads — it matters only for sub-2 ms transactional bus use cases.
Worked example — partition reassignment in Kafka vs topic ownership flip in Pulsar
Detailed explanation. An on-call engineer is adding a new broker to a hot cluster. In Kafka, this is a partition reassignment — log data has to move. In Pulsar, this is a load-manager rebalance — ownership of some topics flips to the new broker, but no data moves. The operational story is wildly different.
Question. Walk through both procedures step by step. Estimate the time and the impact on producers / consumers.
Input.
| dimension | value |
|---|---|
| existing cluster | 6 brokers, 600 partitions / 6,000 topics |
| new capacity | +2 brokers |
| hot partition log size | 500 GB |
| inter-broker bandwidth | 1 Gbps (~125 MB/s) |
Code (Kafka reassignment, kafka-reassign-partitions.sh).
# 1) generate a proposed reassignment plan
kafka-reassign-partitions.sh \
--bootstrap-server broker-1:9092 \
--topics-to-move-json-file topics.json \
--broker-list "1,2,3,4,5,6,7,8" \
--generate
# 2) execute (throttled)
kafka-reassign-partitions.sh \
--bootstrap-server broker-1:9092 \
--reassignment-json-file plan.json \
--execute \
--throttle 50000000 # 50 MB/s
# 3) monitor — can take hours on large partitions
kafka-reassign-partitions.sh \
--bootstrap-server broker-1:9092 \
--reassignment-json-file plan.json \
--verify
Pulsar load-manager rebalance.
# Brokers self-rebalance via the modular load manager; you usually do nothing.
# Manual flip if needed:
pulsar-admin namespaces unload public/default
# Or for one bundle:
pulsar-admin namespaces unload public/default --bundle 0x00000000_0x40000000
# Inspect ownership:
pulsar-admin brokers list saas-co-cluster
pulsar-admin broker-stats topics
Step-by-step explanation.
- Kafka reassignment, step 1. Generate a plan that moves some partition replicas to the new brokers. The plan is a JSON file mapping partition → broker list.
- Kafka reassignment, step 2. Execute the plan with a bandwidth throttle. The leader broker starts shipping log data to the new broker; the new broker becomes a follower and catches up.
- Kafka reassignment, step 3. Monitor until every partition in the plan is fully replicated to the new broker set. For a 500 GB partition at 50 MB/s throttled, that is ~10,000 seconds = ~2.7 hours per partition. In practice teams run multiple partitions in parallel but the total wall-clock can be hours.
- Pulsar rebalance, step 1. The modular load manager continuously samples broker CPU, memory, throughput, and topic count. When a new broker joins, the load manager reassigns ownership of some topics — a metadata update.
- Pulsar rebalance, step 2. Clients receive a "topic moved" hint from the old broker; they reconnect to the new broker. No data movement; the new broker connects to the same bookies.
- The 100x difference. Kafka rebalance is bound by network bandwidth (data movement). Pulsar rebalance is bound by metadata-write latency. For a 500 GB topic, the difference is hours vs. seconds.
Output.
| step | Kafka time | Pulsar time |
|---|---|---|
| add broker | seconds (join cluster) | seconds (join cluster) |
| rebalance plan | minutes | none (automatic) |
| data movement | hours per hot partition | none |
| total to "hot" new broker | hours–days | seconds–minutes |
Rule of thumb. If your team rebalances Kafka every few weeks and the operation is well-tooled, the data-movement cost is acceptable. If your cluster scales aggressively (autoscaling on Kubernetes, frequent capacity changes, multiple migrations a quarter), Pulsar's stateless rebalance pays back the architectural complexity by week 4.
SQL interview question on per-broker load skew detection
The probe usually sounds like: "Given a per-broker minute-level metric table, write a query that flags brokers running more than 1.5x the cluster median CPU for 15 consecutive minutes — that is the classic 'hot broker' signal that triggers a rebalance."
Solution Using percentile_disc + a running streak
WITH per_minute AS (
SELECT
broker_id,
ts,
cpu_pct,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY cpu_pct)
OVER (PARTITION BY ts) AS cluster_median_cpu
FROM broker_metrics
WHERE ts >= NOW() - INTERVAL '2 hours'
),
flagged AS (
SELECT
broker_id,
ts,
cpu_pct,
cluster_median_cpu,
CASE WHEN cpu_pct > 1.5 * cluster_median_cpu THEN 1 ELSE 0 END AS hot
FROM per_minute
),
runs AS (
SELECT
broker_id,
ts,
hot,
SUM(CASE WHEN hot = 0 THEN 1 ELSE 0 END)
OVER (PARTITION BY broker_id ORDER BY ts) AS cool_grp
FROM flagged
)
SELECT broker_id, MIN(ts) AS streak_start, COUNT(*) AS streak_len
FROM runs
WHERE hot = 1
GROUP BY broker_id, cool_grp
HAVING COUNT(*) >= 15
ORDER BY streak_len DESC;
Step-by-step trace.
| broker_id | ts | cpu_pct | median | hot | cool_grp | streak_len |
|---|---|---|---|---|---|---|
| b-4 | 10:00 | 78 | 40 | 1 | 0 | 1 |
| b-4 | 10:01 | 82 | 41 | 1 | 0 | 2 |
| b-4 | … | … | … | 1 | 0 | … |
| b-4 | 10:14 | 80 | 42 | 1 | 0 | 15 |
| b-4 | 10:15 | 35 | 40 | 0 | 1 | (resets) |
The cool_grp running-sum trick groups consecutive hot minutes into one streak per broker. The HAVING clause keeps only streaks of 15+.
Output:
| broker_id | streak_start | streak_len |
|---|---|---|
| b-4 | 10:00 | 15 |
Why this works — concept by concept:
- PERCENTILE_DISC OVER PARTITION BY ts — computes the cluster median per minute so the threshold is relative to the current cluster, not a static number. A 70% CPU broker is "cold" when the cluster median is 60% and "hot" when the cluster median is 30%.
- Hot flag as 0/1 — turns the threshold check into an integer indicator. Cheap, indexable, and composable with running sums.
-
SUM OVER (PARTITION BY broker_id ORDER BY ts) running-grouping — every time a row is cold (
hot=0), the running sum bumps; consecutive hot rows share the samecool_grpvalue. That is the canonical "consecutive-streak" trick from Blog106 and Blog121. - GROUP BY (broker_id, cool_grp) — each streak collapses into one row carrying its length.
-
Cost — three window passes over a 2-hour window of broker metrics; with an index on
(broker_id, ts)the scan is O(rows-in-window).
SQL
Topic — window functions
Window function problems (SQL)
3. Geo-replication — Pulsar native multi-region clusters vs Kafka MirrorMaker 2 + offset translation
pulsar geo replication is a property of the topic — Kafka MirrorMaker 2 is a separate Kafka Connect job
The mental model in one line: Pulsar treats geo-replication as a per-topic configuration (set the replication-cluster list and writes flow to every region asynchronously); Kafka treats geo-replication as an out-of-process stream-processing job (MirrorMaker 2) that reads from a source cluster and writes to a destination cluster while translating consumer offsets. Once you can name those two patterns, every probe on pulsar geo replication reduces to "which side carries the operational load — the broker or a separate worker fleet?"
Pulsar geo-replication in detail.
-
Per-topic property. Set the replication-cluster list at the namespace level:
pulsar-admin namespaces set-clusters tenant/ns --clusters us-east,eu-west,ap-south. Every topic under that namespace is automatically replicated to the listed clusters. - Asynchronous by default. Pulsar replicates messages asynchronously between clusters — a producer in us-east acks the moment the local cluster persists; the message then flows to eu-west and ap-south in the background.
- Active/active out of the box. Producers can publish to the topic in any cluster; consumers can subscribe in any cluster. The same topic name in all regions; the same subscription name.
- Replicator subscription. Internally, each cluster maintains a special "replicator" subscription that reads local-cluster writes and ships them to remote clusters. The subscription handles backpressure and retries.
- Sync replication is opt-in. For workloads that demand cross-region durability before ack, Pulsar 2.7+ supports synchronous geo-replication per-namespace — at the cost of inter-region RTT in the write path.
Kafka MirrorMaker 2 in detail.
- A Kafka Connect job. MirrorMaker 2 (KIP-382, shipped in Kafka 2.4) runs as Kafka Connect tasks. You deploy a worker fleet separate from the brokers.
-
Source and destination cluster. MM2 reads from a source cluster and writes to a destination cluster. Topic names are prefixed with the source cluster alias by default (e.g.
prod-us.eventsin dr-us). -
Offset translation. MM2 maintains a
checkpointtopic mapping (source partition, source offset) → (target partition, target offset). A consumer that fails over from prod-us to dr-us can read the checkpoint and resume from the correct position. - Active/active is configurable. A bidirectional MM2 deployment (us → eu and eu → us) achieves active/active, but you must adopt a naming convention to avoid replication loops (use the cluster alias prefix).
- Operational cost. MM2 is a separate fleet: you patch it, monitor it, alert on its lag, and scale it independently from the brokers. Comparable cost to running a Kafka Streams app at the same throughput.
Three trade-offs that drive the decision.
- Topology overhead. Pulsar geo-replication = topic property. Kafka MM2 = separate Kafka Connect fleet. The Pulsar side has fewer moving parts at runtime.
-
Topic naming. Pulsar uses the same topic name in every region — "events" is "events" everywhere. Kafka MM2 uses a source-cluster-prefixed topic name to avoid loops (e.g.
dr-us.events). Consumers must know whether to read the local or the remote name. - Offset semantics. Pulsar's subscription cursor is per-cluster (each cluster's broker tracks its own subscriber position); Kafka's offsets are per-cluster too, but MM2 must explicitly translate them via the checkpoint topic for clean failover.
Common interview probes on pulsar geo replication.
- "How does Pulsar handle a cross-region replication loop?" — each message carries a
replicated_fromheader; a cluster does not re-replicate a message it received from elsewhere. - "Is Pulsar geo-replication synchronous?" — no, by default it is asynchronous. Sync replication is per-namespace opt-in (2.7+).
- "Why is MirrorMaker 2 a separate fleet?" — because Kafka's architecture has no concept of cross-cluster replication inside the broker. MM2 is the bridge built outside it.
- "What is offset translation in MM2?" — a mapping between source-cluster offsets and target-cluster offsets, kept in a
checkpointtopic so consumers can resume on the target after a failover.
Worked example — same-data, two-region topic in Pulsar
Detailed explanation. A team needs the same "events" topic in us-east-1 and eu-west-1. Producers in either region publish; consumers in either region subscribe; failover should be transparent. In Pulsar, this is two commands.
Question. Set up a geo-replicated events topic in two Pulsar clusters. Show the admin commands, the producer code, and how a consumer in either region sees the same stream.
Input.
| dimension | value |
|---|---|
| tenant | acme |
| namespace | acme/prod |
| topic | events |
| clusters | us-east, eu-west |
| replication | async (default) |
Code (Pulsar admin + producer/consumer).
# 1) Configure clusters on the metadata side (one-time)
pulsar-admin clusters create us-east --url http://us-east-broker:8080
pulsar-admin clusters create eu-west --url http://eu-west-broker:8080
# 2) Allow the tenant to use both clusters
pulsar-admin tenants create acme --allowed-clusters us-east,eu-west
# 3) Set the replication-cluster list on the namespace
pulsar-admin namespaces create acme/prod --clusters us-east,eu-west
pulsar-admin namespaces set-clusters acme/prod --clusters us-east,eu-west
// Producer in us-east-1
PulsarClient client = PulsarClient.builder()
.serviceUrl("pulsar://us-east-broker:6650")
.build();
Producer<byte[]> producer = client.newProducer()
.topic("persistent://acme/prod/events")
.create();
producer.send("hello-from-us".getBytes());
// Consumer in eu-west-1 — same topic name
PulsarClient euClient = PulsarClient.builder()
.serviceUrl("pulsar://eu-west-broker:6650")
.build();
Consumer<byte[]> consumer = euClient.newConsumer()
.topic("persistent://acme/prod/events")
.subscriptionName("eu-app-subscription")
.subscribe();
Message<byte[]> msg = consumer.receive();
System.out.println(new String(msg.getValue())); // "hello-from-us"
consumer.acknowledge(msg);
Step-by-step explanation.
- Cluster wiring. The two Pulsar clusters know about each other via the metadata store (cluster URL registration). One-time setup.
- Tenant scope. The tenant "acme" is allowed on both clusters. Namespaces inherit this list.
-
Replication-cluster list. Setting the list on the namespace means every topic under
acme/prodis automatically replicated to both clusters. No per-topic config needed. - Producer in us-east. The producer connects to the us-east broker and publishes. The message is persisted locally (BookKeeper in us-east) and acked.
- Replicator subscription. A background "replicator" subscription on the us-east cluster reads new messages and ships them to eu-west. Latency depends on inter-region RTT.
- Consumer in eu-west. The consumer connects to the eu-west broker on the same topic name. The eu-west broker has the message (replicated) and delivers it.
- Subscription scope. The subscription "eu-app-subscription" is local to eu-west — the us-east cluster does not see it. Each cluster tracks its own subscriber positions.
Output.
| component | location | sees the message |
|---|---|---|
| producer | us-east | acked locally |
| us-east BookKeeper | us-east | persisted |
| replicator subscription | us-east → eu-west | shipped async |
| eu-west BookKeeper | eu-west | persisted |
| eu-west consumer | eu-west | receives + acks |
Rule of thumb. Pulsar geo-replication is a namespace decision. Once you set the cluster list, every topic is automatically replicated — there is no per-topic plumbing. For workloads that need different replication topologies per topic, use different namespaces.
Worked example — same-data, two-region topic in Kafka MirrorMaker 2
Detailed explanation. The Kafka equivalent of the Pulsar setup above requires deploying a MirrorMaker 2 cluster and configuring a replication flow. Producers still publish to one cluster (the source); consumers in the other region read the prefixed mirror topic.
Question. Set up a Kafka geo-replication topology with MM2. Show the connector config, the producer code, and what consumers in the target region need to know.
Input.
| dimension | value |
|---|---|
| source cluster | prod-us (3 brokers) |
| target cluster | dr-eu (3 brokers) |
| topic | events (in prod-us) |
| MM2 fleet | 3 Connect workers in eu-west |
| direction | active/passive (us → eu) |
Code (mm2.properties + producer/consumer).
# mm2.properties — MirrorMaker 2 Connect config
clusters = prod-us, dr-eu
prod-us.bootstrap.servers = us-broker-1:9092,us-broker-2:9092,us-broker-3:9092
dr-eu.bootstrap.servers = eu-broker-1:9092,eu-broker-2:9092,eu-broker-3:9092
# Replication flow: prod-us -> dr-eu
prod-us->dr-eu.enabled = true
prod-us->dr-eu.topics = events
# Offset translation (the checkpoint topic)
prod-us->dr-eu.emit.checkpoints.enabled = true
prod-us->dr-eu.sync.group.offsets.enabled = true
prod-us->dr-eu.refresh.topics.interval.seconds = 30
# Naming convention — source-prefixed topic in destination
replication.policy.separator = .
# Result: 'events' in prod-us appears as 'prod-us.events' in dr-eu
// Producer in us-east-1
KafkaProducer<String, String> p = new KafkaProducer<>(producerProps);
p.send(new ProducerRecord<>("events", key, value)); // writes to prod-us
// Consumer in eu-west-1 — must read the prefixed topic
KafkaConsumer<String, String> c = new KafkaConsumer<>(consumerProps);
c.subscribe(List.of("prod-us.events"));
// On failover, the app re-subscribes to "prod-us.events" in dr-eu
// MM2 has been writing the local-region copy + translating offsets
Step-by-step explanation.
-
MM2 config. Two clusters listed; bootstrap servers for each; the replication flow
prod-us->dr-euis enabled for theeventstopic. -
Naming convention. The destination topic in dr-eu is
prod-us.events(source-cluster prefix). This prevents replication loops if you also enabledr-eu->prod-us. - Producer in us-east. Standard Kafka producer to prod-us. No knowledge of MM2.
-
MM2 in flight. MM2 workers in eu-west read from
prod-us.events(well, the sourceeventstopic in prod-us via the bootstrap servers) and write toprod-us.eventsin dr-eu. They also write checkpoint records. -
Consumer in eu-west. The consumer subscribes to
prod-us.events— the prefixed name. If the app failed over from prod-us to dr-eu, it would re-read the checkpoint to find the correct offset to resume from. - Compared with Pulsar. The producer and consumer code look similar. The difference is the naming convention (prefix) and the operational overhead (an extra fleet of MM2 workers to deploy, scale, and alert on).
Output.
| component | location | sees the message |
|---|---|---|
| producer | prod-us | acks locally |
prod-us topic events
|
prod-us | persisted |
| MM2 worker fleet | eu-west | reads + writes |
dr-eu topic prod-us.events
|
dr-eu | persisted with mapped offsets |
| eu-west consumer (failover) | dr-eu | subscribes to prefixed topic |
Rule of thumb. Kafka MM2 is an excellent tool for one-direction or bidirectional replication, but it carries the cost of a separate fleet. If your workload only needs occasional cross-region replication (DR replica), MM2 is reasonable. If your workload is active/active by design, Pulsar's per-topic native replication is operationally lighter.
Worked example — failover and offset translation in MM2
Detailed explanation. The interview moment: a consumer in prod-us has been reading at offset 1,000,000 on the events topic. The us region goes dark. The consumer must fail over to dr-eu and pick up reading the equivalent message on prod-us.events. The MM2 checkpoint topic is the key.
Question. Walk through the offset-translation steps. What does the consumer read first after failover?
Input.
| dimension | value |
|---|---|
| source offset (prod-us.events) | 1,000,000 |
| target offset (prod-us.events in dr-eu) | 1,000,000 (approximately) |
| checkpoint record | (consumer-group=app, partition=0, src=1000000, tgt=999987) |
Code (consumer-side reconciliation, simplified).
// On failover detected
String checkpointTopic = "prod-us.checkpoints.internal";
// MM2 has been writing records like:
// key = (consumer-group, partition)
// value = ConsumerCheckpoint { upstreamOffset=1_000_000, downstreamOffset=999_987 }
ConsumerCheckpoint cp = lookupCheckpointFor("app", 0);
long resumeOffset = cp.downstreamOffset(); // 999,987
KafkaConsumer<String, String> c = new KafkaConsumer<>(drProps);
c.assign(List.of(new TopicPartition("prod-us.events", 0)));
c.seek(new TopicPartition("prod-us.events", 0), resumeOffset);
// Now read forward — the next poll() returns the message that was at
// offset 1,000,001 in prod-us, which is at 999,988 in dr-eu.
Step-by-step explanation.
-
The checkpoint topic. MM2 emits a record to
prod-us.checkpoints.internalperiodically (default every 60s) summarising "consumer-groupappwas at upstream offset X; in the target cluster that corresponds to downstream offset Y." - The consumer failover. The application code (or a sidecar) reads the most recent checkpoint for the group + partition. The downstream offset becomes the resume point.
-
Why offsets do not match 1:1. The target cluster's
prod-us.eventstopic was built by MM2 reading from source. The first message MM2 ever wrote into the target had offset 0 in the target, regardless of its source offset. The mapping is therefore not a constant — it can shift if MM2 was restarted, lagged, or skipped messages on a partition rebalance. - The "approximately" caveat. Offset translation in MM2 is best-effort. The consumer may re-process or skip a small number of messages around the failover boundary. Idempotent processing downstream is the standard mitigation.
- Compared with Pulsar. Pulsar subscriptions are per-cluster. The eu-west consumer has its own subscription cursor in eu-west — no translation table to maintain. Failover is just "switch clients to the eu-west broker, keep using the same subscription name."
Output.
| step | action |
|---|---|
| 1 | failover detected |
| 2 | lookup checkpoint (group, partition) |
| 3 | seek downstream offset 999,987 |
| 4 | resume; reprocess a few messages around the boundary |
| 5 | downstream dedup absorbs the small overlap |
Rule of thumb. MM2 failover is workable but requires consumers to be idempotent. Pulsar failover is workable because subscriptions are local — but the application must accept that the eu-west cursor has its own progress, independent of us-east.
SQL interview question on cross-region replication lag
A senior interviewer might frame this as: "Given a per-message audit table that records (message_id, source_region, source_ts, target_region, target_ts), write a SQL query that returns the 95th-percentile replication lag per (source, target) pair over the last hour. This is the standard 'is geo-replication healthy?' metric."
Solution Using a percentile aggregate over the lag distribution
SELECT
source_region,
target_region,
COUNT(*) AS messages_replicated,
PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (target_ts - source_ts)))
AS p50_lag_sec,
PERCENTILE_DISC(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (target_ts - source_ts)))
AS p95_lag_sec,
PERCENTILE_DISC(0.99) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (target_ts - source_ts)))
AS p99_lag_sec
FROM replication_audit
WHERE source_ts >= NOW() - INTERVAL '1 hour'
AND target_ts IS NOT NULL
GROUP BY source_region, target_region
ORDER BY p95_lag_sec DESC;
Step-by-step trace.
| message_id | source_region | source_ts | target_region | target_ts | lag_sec |
|---|---|---|---|---|---|
| m-1 | us-east | 10:00:00.0 | eu-west | 10:00:00.45 | 0.45 |
| m-2 | us-east | 10:00:00.0 | eu-west | 10:00:01.20 | 1.20 |
| m-3 | us-east | 10:00:01.0 | eu-west | 10:00:01.40 | 0.40 |
| m-4 | us-east | 10:00:02.0 | ap-south | 10:00:03.10 | 1.10 |
| m-5 | us-east | 10:00:02.0 | ap-south | 10:00:04.50 | 2.50 |
After the percentile aggregate per (source, target) pair, you get the lag distribution. The p95 value answers "what is the worst-case the application sees 95% of the time?"
Output:
| source_region | target_region | messages | p50_lag_sec | p95_lag_sec | p99_lag_sec |
|---|---|---|---|---|---|
| us-east | eu-west | 1_200_000 | 0.50 | 1.30 | 2.10 |
| us-east | ap-south | 1_200_000 | 1.20 | 2.80 | 4.50 |
Why this works — concept by concept:
- PERCENTILE_DISC vs PERCENTILE_CONT — DISC picks an actual value from the input (no interpolation). CONT linearly interpolates. For lag SLOs ("our SLO is 'p95 below 2 seconds'"), DISC is more honest — it returns a value that actually happened.
- WITHIN GROUP (ORDER BY ...) — the ordered-set-aggregate clause. It tells the percentile function which expression to rank by.
-
EXTRACT(EPOCH FROM ts_diff) — converts a
INTERVALinto seconds for arithmetic. Without it, the percentile would be computed on intervals, which is engine-dependent. - GROUP BY (source_region, target_region) — each region pair gets its own distribution. A failing us-east→ap-south pair is invisible if you aggregate globally.
-
Cost — O(n log n) per group due to the percentile sort. With an index on
(source_region, target_region, source_ts)the scan is local to the hour window.
SQL
Topic — real-time analytics
Real-time analytics problems (SQL)
4. Tiered storage — BookKeeper offload to S3 vs Kafka tiered storage (KIP-405)
Both engines now offload cold log segments to object storage — Pulsar shipped it in 2018, Kafka caught up with KIP-405 in 2023
The mental model in one line: A closed log segment is immutable; immutable data belongs in cheap object storage; the broker can fetch it back transparently when a consumer wants to read history. Once you internalise the three shelves (hot SSD / warm BookKeeper or local segments / cold S3), the entire pulsar tiered storage and KIP-405 conversation becomes a cost-and-latency exercise.
Pulsar tiered storage in detail.
- BookKeeper offloader. Pulsar 2.1 (2018) shipped a per-namespace offloader that moves closed ledgers from BookKeeper bookies to object storage. Closed ledgers are immutable, so the offload is a one-shot copy.
-
Drivers.
tiered-storage-jcloudsupports AWS S3, GCS, Azure Blob;tiered-storage-filesystemsupports HDFS. The offloader runs inside the broker process and is configured per-namespace. - Transparent reads. Once a ledger is offloaded, the BookKeeper copy can be deleted. When a consumer requests a message from that ledger, the broker reads from the object store directly and streams to the consumer.
-
Trigger. Configurable: by namespace size threshold, by ledger age, or manually via
pulsar-admin topics offload. The "default" pattern is "offload everything older than 1 day" for a hot-tail-cold-history workload. - Cost shape. Object storage is ~10x cheaper per GB than SSD. Pulsar's offloader makes "infinite retention" actually affordable.
Kafka KIP-405 in detail.
- KIP-405. Proposed 2018, adopted in Apache Kafka 3.6 (October 2023). Confluent shipped a proprietary version (Confluent Tiered Storage) in 2020; Aiven and Strimzi shipped variants soon after.
- Local hot tier, remote cold tier. Each partition has an active segment on local disk (hot) plus rolled segments. Rolled segments are uploaded to the remote tier (S3 / GCS / Azure Blob) and the local copy is deleted when the local retention size is exceeded.
- Transparent reads. When a consumer requests an offset that is in the remote tier, the broker fetches from the object store and streams to the consumer. Latency is higher than local reads.
-
Configuration.
remote.log.storage.system.enable=true,remote.log.storage.manager.class.path=...,log.retention.ms(local),log.local.retention.ms(local cap), plus broker-specific cloud credentials. - Same cost shape. Cold tier ~$0.023/GB-month for S3 standard; warm tier ~$0.05/GB-month for EBS gp3; hot tier ~$0.20/GB-month for io2.
Three trade-offs.
- Maturity. Pulsar's offloader has been production since 2018 (more battle-testing). Apache Kafka tiered storage has been GA since late 2023 — the operational track record is shorter but the design is sound.
- Read latency from cold. Both engines hit the same physics: a cold-tier read pulls a segment object from S3, which is hundreds of milliseconds to first-byte. Latency-sensitive consumers should stay in the hot tier.
- Cost / GB-month. Hot ~$0.20, warm ~$0.05, cold ~$0.023. A 1 TB topic with 30-day local + 365-day cold retention costs about $20 cold + $200 hot, instead of $7,000 if you ran a 365-day all-hot retention.
Common interview probes on pulsar tiered storage.
- "Why is Pulsar's offload safe?" — because BookKeeper ledgers are immutable once closed; the offload is a copy, not a move, until the bookie copy is reclaimed.
- "Can a consumer read from cold storage transparently?" — yes, in both engines. The broker proxies the cold read; the consumer protocol is unchanged.
- "What is the typical hot-tail retention?" — 1–7 days local, 30–365+ days cold. Depends on the read-history pattern.
- "How does cost compare?" — cold tier is ~10x cheaper per GB-month than hot. Tiered storage is the single biggest cost lever in modern messaging.
Worked example — Pulsar BookKeeper offload to S3
Detailed explanation. A namespace's topics accumulate 100 GB / day of events. The team wants 7 days of fast local access + 365 days of replayable history. Without tiered storage, that is 365 days × 100 GB = 36.5 TB of BookKeeper SSD — expensive. With offload, the warm tier holds 7 days (700 GB SSD), and the cold tier (S3) holds the rest (35.8 TB) at one-tenth the price.
Question. Configure the BookKeeper offloader on a namespace. Show the admin commands, the threshold, and how a consumer reads a one-month-old message.
Input.
| dimension | value |
|---|---|
| namespace | acme/prod |
| daily write | 100 GB |
| local (warm) retention | 7 days |
| cold retention | 365 days |
| object store | S3 (us-east-1) |
Code (broker.conf + pulsar-admin commands).
# broker.conf — enable the S3 offloader (one-time per broker)
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=acme-pulsar-coldstore
s3ManagedLedgerOffloadRegion=us-east-1
managedLedgerOffloadThresholdInBytes=107374182400 # 100 GiB per topic
# Per-namespace offload policy: trigger offload when 700 GiB accumulates
pulsar-admin namespaces set-offload-threshold \
--size 751619276800 acme/prod # 700 GiB
pulsar-admin namespaces set-offload-deletion-lag \
--lag 4h acme/prod # delete BK copy 4h after S3 upload
# Force an offload right now (for testing)
pulsar-admin topics offload \
-s 1G \
persistent://acme/prod/events
# Inspect the offload state
pulsar-admin topics offload-status persistent://acme/prod/events
Step-by-step explanation.
- Broker config. Tell every broker how to talk to S3: bucket, region, driver. Credentials come from the EC2 instance profile or the standard AWS SDK chain.
- Per-namespace threshold. Configure a size threshold (700 GiB) — when the namespace's local size exceeds this, the broker offloads the oldest closed ledger.
- Deletion lag. After a successful S3 upload, the broker waits 4 hours before deleting the BookKeeper copy. The lag is a safety net — if the S3 upload was corrupted, the BookKeeper copy still serves reads.
- Read path. A consumer asks for a one-month-old message. The broker checks its managed-ledger metadata: the ledger holding that message has been offloaded. The broker fetches the segment object from S3, decodes it, and streams the message to the consumer.
- Latency. The first read of a cold segment pays the S3 first-byte latency (~100–500 ms). Subsequent reads from the same segment are served from the broker's read cache.
Output.
| age of message | served from | typical latency |
|---|---|---|
| 0–7 days | BookKeeper (SSD) | <5 ms |
| 7+ days | S3 (cold) | 100–500 ms first byte; then streaming |
| force-offloaded | S3 | as above |
Rule of thumb. Tune the local-retention size to match your "fast replay" window — the time the application would re-read in a normal operation. Anything older is replay-on-rare-occasion, and the cold-tier latency is fine.
Worked example — Kafka KIP-405 tiered storage
Detailed explanation. The Kafka equivalent ships in Apache 3.6+. The configuration moves from "broker holds everything" to "broker holds local-retention; remote tier holds the rest." The consumer code does not change.
Question. Configure KIP-405 tiered storage on a broker. Show the local vs remote retention split and the consumer experience.
Input.
| dimension | value |
|---|---|
| Kafka version | 3.6+ |
| topic | events |
| local retention | 7 days |
| remote retention | 365 days |
| remote store | S3 |
Code (broker config + topic config).
# server.properties — enable tiered storage (one-time per broker)
remote.log.storage.system.enable=true
remote.log.storage.manager.class.path=/opt/kafka/libs/tiered/*
remote.log.storage.manager.class.name=org.apache.kafka.server.log.remote.storage.RemoteLogManager
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager
# Cloud config (driver-specific)
rsm.config.storage.s3.bucket.name=acme-kafka-coldstore
rsm.config.storage.s3.region=us-east-1
# Topic-level configuration: enable remote storage + retention split
kafka-configs.sh --bootstrap-server broker-1:9092 \
--entity-type topics --entity-name events --alter \
--add-config remote.storage.enable=true,\
local.retention.ms=604800000,\
retention.ms=31536000000
# local.retention.ms = 7 days (in local broker tier)
# retention.ms = 365 days (total, local + remote)
Step-by-step explanation.
- Broker config. Enable tiered storage; point to the remote log manager implementation and the cloud bucket. Restart brokers in a rolling fashion.
-
Topic config.
remote.storage.enable=trueactivates tiering for the topic.local.retention.mscaps how much stays on the broker;retention.msis the total horizon. - What happens at write time. Producers publish as usual. Brokers append to active segments. When a segment is rolled (size or time), a background task uploads it to S3 and registers metadata.
- What happens at consume time, hot data. Consumer asks for a recent offset. Broker reads from local log. Indistinguishable from non-tiered Kafka.
- What happens at consume time, cold data. Consumer asks for a 30-day-old offset. Broker looks up the segment in remote metadata, fetches from S3, streams to the consumer. Latency higher than local but the consumer protocol is identical.
- Compared with Pulsar. Same conceptual model, same cost shape, same trade-offs. Implementation maturity is the main delta in 2026 (Pulsar has 5 more years of production experience).
Output.
| age of message | served from | latency |
|---|---|---|
| 0–7 days | local broker disk | <5 ms |
| 7+ days | S3 (remote tier) | 100–500 ms first byte; then streaming |
Rule of thumb. Both engines now have tiered storage. The right choice is determined by which engine you are already running, not by the tiering capability — both make "infinite retention" economically feasible.
Worked example — the cost curve for 1 PB of historical events
Detailed explanation. Take a one-petabyte topic, growing 1 GB / hour. Compute the monthly cost of running all-hot, hot-tail + warm-recent, and hot-tail + cold-history. The cold tier is the entire game.
Question. Estimate the monthly storage cost in three retention strategies. Compare both engines (they have the same cost shape).
Input — pricing assumptions.
| tier | $ / GB-month (rough 2026) |
|---|---|
| hot (broker SSD / io2 / NVMe) | $0.20 |
| warm (BookKeeper SSD / gp3) | $0.05 |
| cold (S3 standard) | $0.023 |
Code (math, not SQL).
data set: 1 PB = 1,048,576 GB
# Strategy A: all hot SSD
1_048_576 * 0.20 = $209,715 / month
# Strategy B: 7d hot + 30d warm + 365d cold
hot = 7 * 24 GB = 168 GB * 0.20 = $33.6
warm = 23 * 24 GB = 552 GB * 0.05 = $27.6
cold = remainder ~1,048,000 GB (minus a small recent slice)
* 0.023 ~ $24,104
TOTAL ~ $24,165 / month
# Strategy C: 1d hot + 365d cold (BookKeeper offload / KIP-405)
hot = 1 * 24 GB = 24 GB * 0.20 = $4.8
cold = ~1,048,552 GB * 0.023 = $24,116
TOTAL ~ $24,121 / month
Step-by-step explanation.
- All-hot strategy. $210k/month is a sticker shock that drives most teams to tiering. This is the cost without offload — Kafka pre-3.6 or Pulsar pre-2.1.
- Three-tier strategy. Pushing 23 days of recent data to a warm SSD tier and the rest to cold S3 lowers the cost by ~90%. This is the Pulsar 2.1+ pattern with BookKeeper as the warm tier.
- Hot-tail-cold-history strategy. Pushing all but 1 day to cold drops the cost almost identically — the warm tier was a small fraction of the total anyway.
- Real-world adjustment. Add S3 PUT/GET request charges (~$0.005 / 1k PUT; ~$0.0004 / 1k GET) and inter-AZ traffic for replication if applicable. The relative ranking does not change; the absolute number shifts by 5–15%.
Output.
| strategy | monthly cost | savings vs all-hot |
|---|---|---|
| all hot | $209,715 | baseline |
| 7d hot / 30d warm / 365d cold | ~$24,165 | 89% saved |
| 1d hot / 365d cold | ~$24,121 | 89% saved |
Rule of thumb. Once any topic exceeds a few weeks of retention, tiered storage saves an order of magnitude on bills. Both Pulsar and Kafka offer it; the cost shape is the same; the decision is "which engine is your team already running?"
SQL interview question on cold-tier hit rate
A senior interviewer might probe: "Given a per-fetch audit table that records (consumer_id, fetched_offset, tier_served, latency_ms), write a SQL query that returns the cold-tier hit rate per consumer over the last 24 hours — and flag consumers reading >50% from cold." This is the production health metric for any tiered-storage rollout.
Solution Using a conditional aggregation per consumer
WITH per_consumer AS (
SELECT
consumer_id,
COUNT(*) AS total_fetches,
SUM(CASE WHEN tier_served = 'cold' THEN 1 ELSE 0 END) AS cold_fetches,
AVG(CASE WHEN tier_served = 'cold' THEN latency_ms END) AS cold_avg_ms,
AVG(CASE WHEN tier_served = 'hot' THEN latency_ms END) AS hot_avg_ms
FROM fetch_audit
WHERE fetch_ts >= NOW() - INTERVAL '24 hours'
GROUP BY consumer_id
)
SELECT
consumer_id,
total_fetches,
cold_fetches,
ROUND(100.0 * cold_fetches / NULLIF(total_fetches, 0), 2) AS cold_pct,
ROUND(cold_avg_ms, 1) AS cold_avg_ms,
ROUND(hot_avg_ms, 1) AS hot_avg_ms,
CASE WHEN 1.0 * cold_fetches / NULLIF(total_fetches, 0) > 0.50
THEN 'investigate'
ELSE 'ok' END AS flag
FROM per_consumer
ORDER BY cold_pct DESC;
Step-by-step trace.
| consumer_id | total | cold | hot_avg | cold_avg | cold_pct |
|---|---|---|---|---|---|
| etl-historical | 12,000 | 11,800 | 3.0 ms | 240 ms | 98.3% |
| ml-feature | 50,000 | 1,000 | 2.5 ms | 220 ms | 2.0% |
| dashboard-live | 500,000 | 0 | 2.0 ms | NULL | 0.0% |
The first consumer is a historical backfill that legitimately reads cold; the second is a feature pipeline mostly on hot; the third is purely hot. The flag fires for the first.
Output:
| consumer_id | cold_pct | flag |
|---|---|---|
| etl-historical | 98.3 | investigate |
| ml-feature | 2.0 | ok |
| dashboard-live | 0.0 | ok |
Why this works — concept by concept:
-
Conditional aggregation —
SUM(CASE WHEN tier='cold' THEN 1 ELSE 0 END)is the portable way to count rows that satisfy a predicate inside a single pass. Some dialects supportCOUNT(*) FILTER (WHERE ...)as a cleaner equivalent. - NULLIF(total, 0) — turns the "divide by zero" runtime error into NULL when a consumer fetched nothing in the window. Standard safe-division idiom (Blog126 ref).
- AVG(CASE WHEN ... THEN x END) — using NULL for the non-matching rows means AVG ignores them, giving per-tier mean latency in one pass.
- The 'investigate' threshold — 50% is a starting heuristic; teams often tune to "consumer p95 latency exceeds 1s and cold_pct > 30%" before paging.
-
Cost — one pass over the 24h fetch_audit; with
(consumer_id, fetch_ts)index the scan is local. Aggregation is O(n) per group.
SQL
Topic — aggregation
Aggregation problems (SQL)
5. Pulsar Functions + multi-tenancy + topic-per-tenant model — picking by workload
pulsar functions is light per-topic stream processing baked into the broker — Kafka achieves the same with Streams + Connect outside the broker
The mental model in one line: Pulsar Functions are lightweight, per-topic transform tasks that run inside the Pulsar runtime; Kafka achieves the same workloads with the Streams library (inside your app) plus Kafka Connect (in a separate worker fleet). Combined with the topic-per-tenant model, this drives the most workload-shaped part of the apache pulsar vs kafka decision.
Pulsar Functions in detail.
-
Per-topic transform. A Function is a small Java/Python/Go program that reads from one or more input topics and writes to one or more output topics. The shape is
(input) -> (output)— exactly the lambda function shape. - Built-in runtime. Functions run inside the Pulsar runtime: the Function Worker process (separate from the broker process but in the same cluster) hosts Function instances. There is also a "Kubernetes runtime" that runs each Function as a pod.
-
Deploy via admin.
pulsar-admin functions create --tenant acme --namespace prod --name dedup --jar dedup.jar --classname Dedup --inputs raw-events --output clean-events. One command, one Function. - Window support. Tumbling windows and sliding windows are supported natively (since Pulsar 2.6). Stateful operations are supported via a key-value state store API.
- Dead-letter topics + retry topics. Each Function can declare a DLQ topic and a retry topic — failed messages flow to DLQ instead of poisoning the pipeline.
Kafka Streams + Connect in detail.
- Kafka Streams. A library, not a runtime. You embed it in your application JAR. It uses Kafka topics for internal state (changelog topics) and offset tracking.
- Kafka Connect. A separate worker fleet for source and sink connectors. JDBC source → Kafka topic, Kafka topic → S3 sink, etc. Connectors are reusable modules.
- Why the split. Streams is for in-process stream transforms inside your app; Connect is for moving data between Kafka and external systems. They cover the same surface area as Pulsar Functions + Pulsar IO connectors but as two separate deployment models.
- Maturity. Kafka Streams is one of the most battle-tested stream-processing libraries in the world. Stateful joins, windowed aggregations, exactly-once with EOS-v2 — all mature.
- Connect ecosystem. Hundreds of pre-built connectors (Debezium for CDC, S3 sink, BigQuery sink, MongoDB source). Pulsar IO has a smaller catalog but covers the major sinks/sources.
Multi-tenancy in detail.
-
Pulsar hierarchy.
tenant → namespace → topic. Tenants are usually a customer or a business unit; namespaces group related topics; topics are the unit of data. - Per-level policies. Namespace-level retention, backlog quota, dispatch rate, replication-cluster list, schema validation. Tenant-level admin roles.
-
Kafka equivalent. Topic-level ACLs and quotas; per-client-id quotas. There is no
tenant → namespacetree; the operator must encode the hierarchy as a topic-naming convention (e.g.acme.prod.events). -
Why it matters. For
multi tenancy kafka pulsarworkloads with hundreds of customer-isolated streams, Pulsar's native model is operationally lighter. Kafka can do it; it just requires more application-layer plumbing.
Three workload shapes and the canonical choice.
- Light per-topic transform (filter, mask, enrich). Pulsar Function in three lines vs. Kafka Streams app deployment. Pulsar wins on operational simplicity.
- Stateful windowed aggregation, exactly-once. Kafka Streams is the most mature option. Pulsar Functions support windows but the ecosystem is thinner. Lean Kafka for hard stateful workloads, or pair Pulsar with Flink.
- External system integration (DB sink, S3, ElasticSearch). Kafka Connect's connector catalog is the broadest in the industry. Pulsar IO has the major sinks; for niche systems you may need a custom connector.
Common interview probes on pulsar functions.
- "Are Pulsar Functions stateful?" — yes, optionally. They have a built-in key-value state store; pure stateless Functions are also fine.
- "How does a Pulsar Function compare to a Kafka Streams app?" — same workload shape, different deployment model. Pulsar Function runs in the cluster; Streams runs in your app.
- "What is the topic-per-tenant pattern?" — one topic per tenant under a tenant-scoped namespace. Native to Pulsar; possible but heavier in Kafka.
- "When would I still pick Kafka for a multi-tenant workload?" — when the tenant axis is small (<100), the throughput-per-tenant is high, and the team already runs Kafka. The operational cost of switching outweighs the architectural fit.
Worked example — a Pulsar Function that masks PII in three languages
Detailed explanation. A team needs to mask email addresses in a stream before downstream consumers see them. In Pulsar, this is a 20-line Function. In Kafka Streams, it is a small app — more files, more deploy steps, but conceptually equivalent.
Question. Write the masking Function in Python and Java. Deploy it. Compare with the Kafka Streams equivalent.
Input event.
{ "user_id": 42, "email": "alice@example.com", "action": "login" }
Code (Python Function).
# mask.py
from pulsar import Function
import re
EMAIL = re.compile(r"([^@]+)@([^.]+)\.(.+)")
class MaskEmail(Function):
def process(self, input, context):
import json
obj = json.loads(input)
if "email" in obj and obj["email"]:
obj["email"] = EMAIL.sub(r"***@\2.\3", obj["email"])
return json.dumps(obj)
# Deploy the Function
pulsar-admin functions create \
--tenant acme --namespace prod \
--name mask-email \
--py mask.py \
--classname mask.MaskEmail \
--inputs persistent://acme/prod/raw-events \
--output persistent://acme/prod/clean-events
Code (Java equivalent).
// MaskEmail.java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.JsonNode;
public class MaskEmail implements Function<String, String> {
private static final ObjectMapper M = new ObjectMapper();
@Override
public String process(String input, Context ctx) throws Exception {
JsonNode n = M.readTree(input);
String email = n.path("email").asText("");
String masked = email.replaceFirst("^[^@]+", "***");
((com.fasterxml.jackson.databind.node.ObjectNode) n).put("email", masked);
return M.writeValueAsString(n);
}
}
Kafka Streams equivalent (sketch).
StreamsBuilder b = new StreamsBuilder();
KStream<String, String> raw = b.stream("raw-events");
KStream<String, String> clean = raw.mapValues(MaskEmail::mask);
clean.to("clean-events");
KafkaStreams app = new KafkaStreams(b.build(), props);
app.start();
Step-by-step explanation.
-
The Pulsar Function deploy is one command.
pulsar-admin functions createregisters the code in the cluster, schedules instances, and starts forwarding messages. No app to build, no pod to deploy. -
The Pulsar Function lifecycle is cluster-managed. Upgrade with
functions update. Scale with--parallelism N. Inspect withfunctions stats. Same admin pattern as topics. - The Kafka Streams app is your application. You build a JAR, deploy it as a pod or service, manage its lifecycle through your normal CI/CD. The internal state (changelog topics) is in Kafka but the process is yours.
- For light transforms, Pulsar wins on overhead. No CI/CD pipeline needed; the cluster hosts the code.
- For complex stateful joins, the maturity gap reverses — Kafka Streams has a richer DSL, more operators, and a longer track record.
Output (operational comparison).
| dimension | Pulsar Function | Kafka Streams app |
|---|---|---|
| deploy command | functions create |
build + push + deploy |
| process owner | cluster | your team |
| scale knob | --parallelism N |
replica count |
| state store | built-in KV | RocksDB + changelog topic |
| best fit | light transforms, low ops | stateful, mature DSL |
Rule of thumb. Pulsar Functions are a 90% solution for the most common per-topic transforms (filter, mask, enrich, route). For stateful joins and windowed aggregations at scale, Kafka Streams or a pure stream-processing engine (Flink) remains the senior choice.
Worked example — the topic-per-tenant SaaS pattern in Pulsar
Detailed explanation. A SaaS platform creates one Pulsar tenant per customer. Each customer has a prod namespace and a staging namespace, each with a handful of topics. The number of customers grows from 100 to 10,000 over two years. The operational model scales gracefully.
Question. Walk through the topology, the per-tenant retention policy, and the per-tenant ACL. Show how a new tenant is provisioned.
Input.
| dimension | value |
|---|---|
| platform | acme SaaS |
| customers | 10,000 |
| topics per customer | 4 (events, metrics, errors, audit) |
| retention | 30 days (default) / 90 days (premium) |
| isolation | per-customer ACL |
Code (provisioning, pulsar-admin).
# Provision a new tenant (one-off per customer signup)
pulsar-admin tenants create cust-123 \
--allowed-clusters us-east,eu-west \
--admin-roles cust-123-admin
# Create namespaces
pulsar-admin namespaces create cust-123/prod \
--clusters us-east,eu-west
pulsar-admin namespaces create cust-123/staging \
--clusters us-east
# Per-tier retention
pulsar-admin namespaces set-retention cust-123/prod \
--size -1 --time 30d
# Upgrade to premium: 90d retention
pulsar-admin namespaces set-retention cust-123/prod \
--size -1 --time 90d
# Per-customer rate limit
pulsar-admin namespaces set-publish-rate cust-123/prod \
--msg-publish-rate 5000 --byte-publish-rate 10485760
# Topics are auto-created on first publish (auto-topic-creation enabled)
# or created explicitly:
pulsar-admin topics create-partitioned-topic \
persistent://cust-123/prod/events --partitions 4
Step-by-step explanation.
- Tenant. Each customer is its own tenant. Admin roles, allowed clusters, and policies cascade down.
- Namespace. Two per customer: prod and staging. Namespaces hold per-environment policies (retention, quotas, replication clusters).
- Topics. Four per environment: events, metrics, errors, audit. Auto-created on first publish if enabled, or pre-created.
- Retention. Set at the namespace level. Upgrading a customer to premium changes one config; no schema migration.
- Rate limits. Per-namespace publish and dispatch rates. Per-customer quota with zero application logic.
- Scaling math. 10,000 customers × 2 environments × 4 topics × 4 partitions = 320,000 partitions. Comfortably inside Pulsar's operating envelope.
Output (topology summary).
| level | count | per-level policy |
|---|---|---|
| tenant | 10,000 | admin roles, clusters |
| namespace | 20,000 | retention, ACL, quota, replication |
| topic | 80,000 | schema, compaction |
| partition | 320,000 | striped across BookKeeper |
Rule of thumb. If your SaaS platform serves more than a few hundred isolated tenants and each tenant needs per-tenant policy, the Pulsar tenant/namespace tree is doing significant work for you. The same model in Kafka would require a heavy application-layer abstraction over plain topic ACLs.
Worked example — when to still pick Kafka in 2026
Detailed explanation. Pulsar's architectural advantages are real, but Kafka still wins on a meaningful slice of workloads. Listing the cases explicitly avoids the "we should use Pulsar everywhere" bias.
Question. Identify three workload shapes where Kafka remains the senior choice. Argue each from first principles.
Input — three scenarios.
Scenario A: a single very-high-throughput event bus
- 5 GB/s sustained on one topic
- modest topic count (10-50)
- team has 5 years of Kafka operations
Scenario B: a Confluent / Aiven / MSK customer
- already paying for managed Kafka
- has Kafka Connect connectors for Snowflake, S3, BigQuery
- migration cost > architectural benefit
Scenario C: a stateful exactly-once stream-processing workload
- complex windowed joins, exactly-once semantics
- Kafka Streams DSL fits the team's mental model
- Pulsar + Flink is the alternative, but extra fleet
Step-by-step explanation.
- Scenario A. Kafka's coupled compute+storage model is extremely efficient for a few hot topics with high throughput. The lack of an extra hop (broker → bookies) gives a small but real latency advantage. The team's existing Kafka expertise is a multiplier.
- Scenario B. Vendor lock-in cost. Confluent and Aiven and MSK have rich managed-service stories. Switching to Pulsar means moving off a managed Kafka SaaS, redeploying connectors, retraining the team. The operational cost dwarfs the architectural fit.
- Scenario C. Kafka Streams has the deepest stateful-processing DSL. Pulsar Functions support windows but not at the same maturity. For exactly-once windowed joins on a large state, Streams + EOS-v2 is the senior choice (or Flink, paired with either engine).
- The common thread. Kafka wins when (a) the workload's primary constraint is single-region throughput or stateful processing maturity, and (b) the team already runs Kafka. Pulsar wins when the primary constraint is multi-tenancy, multi-region, or topic-count.
Output (decision summary).
| workload primary constraint | senior choice |
|---|---|
| single-region max throughput | Kafka |
| existing managed Kafka SaaS | Kafka (migration cost) |
| stateful exactly-once joins | Kafka Streams (or Flink + either) |
| multi-region active/active | Pulsar |
| topic-per-tenant SaaS | Pulsar |
| light per-topic transforms | Pulsar Functions |
| frequent broker churn (autoscaling) | Pulsar |
Rule of thumb. "What is your team running today?" is usually the strongest signal. Architectural elegance does not outweigh five years of operational muscle.
SQL interview question on a per-tenant throughput SLO
A senior interviewer might frame this as: "Given a per-tenant throughput audit table, write a SQL query that returns each tenant's hourly message rate and flags tenants exceeding their configured rate limit. This is the daily multi-tenancy health check."
Solution Using a per-tenant hourly aggregate with a join to the rate-limit config
SELECT
t.tenant_id,
DATE_TRUNC('hour', a.ts) AS hour_bucket,
COUNT(*) AS messages_in_hour,
t.msg_per_sec_limit,
ROUND(COUNT(*) / 3600.0, 2) AS observed_msg_per_sec,
CASE
WHEN COUNT(*) / 3600.0 > t.msg_per_sec_limit THEN 'exceeded'
WHEN COUNT(*) / 3600.0 > 0.8 * t.msg_per_sec_limit THEN 'warning'
ELSE 'ok'
END AS slo_status
FROM tenant_audit a
JOIN tenants t
ON t.tenant_id = a.tenant_id
WHERE a.ts >= NOW() - INTERVAL '24 hours'
GROUP BY t.tenant_id, DATE_TRUNC('hour', a.ts), t.msg_per_sec_limit
HAVING COUNT(*) / 3600.0 > 0.8 * t.msg_per_sec_limit
ORDER BY observed_msg_per_sec DESC;
Step-by-step trace.
| tenant_id | hour | msgs | limit | observed | status |
|---|---|---|---|---|---|
| cust-42 | 14:00 | 18,500,000 | 5,000 | 5138.9 | exceeded |
| cust-77 | 14:00 | 14,800,000 | 5,000 | 4111.1 | warning |
| cust-7 | 14:00 | 3,000,000 | 5,000 | 833.3 | ok (filtered out) |
The HAVING clause keeps only tenants that crossed 80% of their limit, surfacing only the SLO-relevant rows.
Output:
| tenant_id | hour | observed | limit | status |
|---|---|---|---|---|
| cust-42 | 14:00 | 5138.9 | 5000 | exceeded |
| cust-77 | 14:00 | 4111.1 | 5000 | warning |
Why this works — concept by concept:
-
DATE_TRUNC('hour', ts) — buckets timestamps to hour boundaries. Standard idiom for hourly aggregation in Postgres / Snowflake / BigQuery (
TIMESTAMP_TRUNCin BigQuery). - JOIN to tenants config — the SLO threshold lives in a separate config table; joining brings the per-tenant limit into the same row as the observed rate.
- HAVING for SLO filter — push the SLO filter into HAVING so the query returns only actionable rows. Tenants at 5% of their limit are noise.
- Three-tier status — 'exceeded' / 'warning' / 'ok' lets the same query feed both paging alerts and quarterly review dashboards.
-
Cost — one pass over 24h
tenant_auditwith(tenant_id, ts)index; the join is hash-join to a smalltenantstable. Aggregate is O(rows-in-window).
SQL
Topic — event-processing
Event-processing problems (SQL)
Cheat sheet — Pulsar vs Kafka recipes
-
Multi-region active/active default. Pulsar: set the namespace replication-cluster list (
pulsar-admin namespaces set-clusters tenant/ns --clusters us,eu,ap). Kafka: deploy MirrorMaker 2 with bidirectional flows and the cluster-alias prefix convention. Pulsar wins on operational lightness; Kafka wins on tooling maturity. -
Infinite retention without infinite SSD. Pulsar: enable the BookKeeper offloader (
managedLedgerOffloadDriver=aws-s3+ namespace threshold). Kafka: enable KIP-405 tiered storage (remote.log.storage.system.enable=true+ topicremote.storage.enable=true). Both engines now offer the same cost shape. -
100k+ topics, per-tenant policy. Pulsar:
tenant/namespace/topictree, per-namespace retention/quota/ACL. Kafka: per-topic ACL and per-client-id quota; encode tenants as a topic-naming convention. Pulsar wins on the high-topic-count axis. -
Light per-topic stream transform. Pulsar:
pulsar-admin functions create --jar transform.jar --inputs raw --output clean. Kafka: build a Kafka Streams app, deploy as a pod, monitor lifecycle. Pulsar Functions win on overhead. - Stateful exactly-once windowed aggregation. Kafka: Kafka Streams with EOS-v2 (state stores backed by changelog topics). Pulsar: Functions with built-in KV state + transactions (mature since 2.7), or pair Pulsar with Flink. Kafka Streams has the deeper DSL today.
- External system integration (DB, S3, ES, BigQuery). Kafka: Kafka Connect with the open-source connector catalog (Debezium, Confluent Hub, Aiven connectors). Pulsar: Pulsar IO with a smaller but growing catalog. Kafka Connect's ecosystem is broader.
-
Strict ordering per key. Both engines: hash the key to a single partition; consumers see ordered messages within a partition. Pulsar adds the
key_sharedsubscription type — multiple consumers share a topic but each key sticks to one consumer. - High-throughput single topic. Kafka still leads on the raw maximum throughput per partition. Pulsar reaches parity on non-persistent topics (no BookKeeper write).
-
Strong durability semantics. Both: quorum writes. Kafka:
min.insync.replicas=2, acks=all. Pulsar:Qw=3, Qa=2. Same conceptual model, different knob names. -
Strict exactly-once. Kafka: transactions + idempotent producer +
read_committedconsumer. Pulsar: transactions (since 2.7) + idempotent producer; consumer-side dedup via the transaction coordinator. Both engines now support exactly-once for the common patterns. - Stateless broker rebalance. Pulsar: broker ownership flips in seconds. Kafka: partition reassignment moves bytes; minutes to hours per hot partition. Pulsar wins when capacity churns.
- Lightweight client. Both: many SDKs (Java, Go, Python, C++, Node.js, Rust). Pulsar adds first-class WebSocket protocol and a built-in HTTP/REST proxy — useful for browser-side producers.
- Cluster topology. Kafka after KRaft: one process type (broker, with controller role on a quorum subset). Pulsar: three process types (broker, bookie, metadata). Kafka is operationally simpler at one-region scale; Pulsar offers more independent scaling axes.
Frequently asked questions
Is Pulsar faster than Kafka?
It depends on the metric. For raw single-topic throughput, Kafka still leads — one fewer hop in the write path means a slight latency edge, and the coupled broker+log model is extremely well-optimised. For rebalance and recovery time on a churning cluster, Pulsar is dramatically faster — broker ownership flips in seconds because no data movement is required. For multi-region active/active, Pulsar's native geo-replication ships data with less operational overhead than Kafka MirrorMaker 2. In production, "faster" usually means "matches our workload shape" — and that is a workload question, not a benchmark one.
Does Pulsar require ZooKeeper like Kafka used to?
Historically, yes. Pulsar 2.x used ZooKeeper as the metadata store for both broker coordination and BookKeeper ledger metadata. Pulsar 3.x added support for etcd as the metadata backend, reducing the ZooKeeper dependency. Kafka's KRaft (Kafka Raft) protocol replaced ZooKeeper inside the Kafka broker process; modern Kafka 3.7+ deployments are ZooKeeper-free by default. Both engines have therefore moved away from ZooKeeper, though Pulsar still has more deployment topologies that include it.
Can I run Pulsar Functions instead of Kafka Streams?
Yes, for a large class of workloads. Pulsar Functions are a great fit for light per-topic transforms (filter, mask, enrich, route) — they deploy in one admin command, run inside the cluster, and require no separate app. For stateful exactly-once windowed aggregations, Kafka Streams currently has the deeper DSL and longer track record; Pulsar Functions support windows and state but the ecosystem is thinner. Many teams pair Pulsar with Apache Flink for the heaviest stream-processing workloads, which is also a popular pattern with Kafka.
How does Pulsar geo-replication compare to MirrorMaker 2?
Pulsar geo-replication is a per-topic property — set a replication-cluster list at the namespace level and writes flow to every listed region asynchronously, with the same topic name in every region. Kafka MirrorMaker 2 is a separate Kafka Connect deployment that reads from a source cluster and writes to a destination cluster, with a source-cluster prefix in the destination topic name (prod-us.events) and an explicit offset-translation table. Pulsar wins on operational simplicity and on the active/active default; Kafka MirrorMaker 2 wins on flexibility and on ecosystem maturity (it has been in production at large scale since 2019).
What is BookKeeper and why does Pulsar use it?
Apache BookKeeper is a distributed write-ahead log service designed at Yahoo around 2011. It pre-dates Pulsar and is independently used by HDFS NameNode high availability, Apache DistributedLog, and others. Pulsar uses BookKeeper because it gives Pulsar a separately scalable storage tier — Pulsar brokers can be stateless because the durable storage lives in BookKeeper bookies. The unit of storage is a ledger (an append-only segment) striped across an ensemble of bookies with configurable write-quorum (Qw) and ack-quorum (Qa). This is the architectural pivot that distinguishes Pulsar from Kafka's coupled-storage model.
When should I pick Pulsar over Kafka in 2026?
Pick Pulsar when multi-region active/active is a primary requirement (Pulsar's native geo-replication is operationally lighter than MirrorMaker 2), when your workload has 100+ tenants with per-tenant policy (the tenant/namespace tree maps perfectly to SaaS), when your cluster churns brokers often (stateless rebalance is seconds, not hours), or when you want light per-topic stream processing inside the cluster (Pulsar Functions). Pick Kafka when your team already runs it, when single-region maximum throughput is the primary constraint, when stateful exactly-once joins are the main workload (Kafka Streams), or when you depend on the breadth of Kafka Connect's connector catalog. Both engines are mature, and "what the team operates today" is usually the strongest tie-breaker.
Practice on PipeCode
- Drill the streaming practice library → for windowed aggregation, dedup, and event-time problems.
- Rehearse on event-processing problems → when the interview frames it as "consume from Kafka or Pulsar and aggregate."
- Sharpen real-time analytics drills → for the dashboard-and-SLO patterns that ride on top of these brokers.
- Layer the event-modeling library → for the "what shape should this event be?" probes.
- Stack the window-functions library → for the dedup, running-aggregate, and gap-and-island patterns most replay/audit questions reach for.
- Layer the aggregation library → for the per-tenant and per-region SLO style queries.
- Stack the joins library → for cross-cluster reconciliation and audit-vs-config patterns.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the dialect axis with the SQL for data engineering interviews course →.
- For batch + streaming Spark, work through the Apache Spark internals course →.
- For end-to-end pipeline design, work through ETL system design for data engineering interviews →.
Pipecode.ai is Leetcode for Data Engineering — every Pulsar-vs-Kafka recipe above ships with hands-on practice rooms where you write the dedup query, the per-tenant SLO check, and the cross-region reconciliation against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your fix to a replay-induced duplicate behaves the same on Pulsar as on Kafka.





Top comments (0)