DEV Community

Cover image for Apache Pulsar vs Kafka for Data Engineering: Geo-Replication, Tiered Storage & Functions
Gowtham Potureddi
Gowtham Potureddi

Posted on

Apache Pulsar vs Kafka for Data Engineering: Geo-Replication, Tiered Storage & Functions

apache pulsar vs kafka is the messaging question that every senior data engineer has answered twice in the last five years — once when their first multi-region pipeline shipped, and again when retention requirements outgrew local disk. Kafka set the partition-log template in 2011; Pulsar reframed it in 2016 with a split-storage model, native geo-replication, and a built-in functions runtime. Both engines are now mature, both have tiered storage, both have transactions and exactly-once — and yet the right pick for a brand-new pipeline in 2026 still depends on whether you need topic-per-tenant scale, active/active multi-region by default, or simple single-region throughput at maximum economy.

This guide walks the four deltas that actually drive the decision: how the brokers and storage layer differ (Kafka's coupled broker-plus-log vs Pulsar's stateless brokers with bookkeeper segment storage), how each engine handles cross-region replication (Pulsar's pulsar geo replication baked into the topic property vs Kafka's MirrorMaker 2 deployment), how tiered cold storage is implemented (Pulsar's BookKeeper offloader vs Kafka KIP-405's pulsar tiered storage equivalent), and where lightweight stream processing lives (pulsar functions inside the broker vs Kafka Streams + Kafka Connect outside it). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for Apache Pulsar vs Kafka for data engineering — bold white headline 'Pulsar vs Kafka' with subtitle 'Geo-Replication · Tiered Storage · Functions' and a split-frame showing Kafka partition log on the left vs Pulsar BookKeeper segments on the right on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the streaming practice library →, rehearse on event-processing problems →, and stack the systems muscles with real-time analytics drills →.


On this page


1. Why two log platforms exist — Kafka 2011 vs Pulsar 2016 design choices

Two engines, two eras, two architectural pivots — and a market that grew big enough for both

The one-sentence invariant: Kafka was designed in 2011 to be the cheapest, fastest, simplest distributed commit log; Pulsar was designed in 2016 to be the most operationally flexible multi-tenant messaging fabric, and the two architectural pivots — split storage and native geo-replication — still drive every trade-off you make in 2026. Once you internalise "Kafka picked simplicity, Pulsar picked flexibility," every probe about apache pulsar vs kafka reduces to a workload question, not a benchmark one.

The Kafka origin story.

  • 2011 — LinkedIn. Jay Kreps, Neha Narkhede, and Jun Rao build Kafka as a unified log to replace a forest of point-to-point queues. The mental model is one big distributed commit log — every event lands on a partition, every consumer reads sequentially, and the log is the source of truth.
  • The 2011 design pivots. (a) Storage and compute live in the same broker — every Kafka broker owns a set of partition log files on its local disk; (b) replication is intra-cluster — leader/follower per partition, controlled by ZooKeeper; (c) the consumer position lives on the consumer (offset commits), not the broker.
  • What that bought Kafka. Disk-sequential I/O, page-cache friendliness, zero-copy reads via sendfile, and a simple ops model — one process per machine, one log per partition. The cost: rebalancing a hot partition means physically copying gigabytes of log data, and ZooKeeper coordination becomes the bottleneck at very large topic counts.

The Pulsar origin story.

  • 2016 — Yahoo (open-sourced). Yahoo's messaging team builds Pulsar to serve hundreds of internal tenants on the same fabric — Mail, Finance, Search, Sports — and to replicate across data centres natively. The mental model is a broker is a stateless proxy over a separately scaled storage layer.
  • The 2016 design pivots. (a) Brokers are stateless — they own no on-disk data; (b) storage lives in BookKeeper, a separate distributed log designed at Yahoo from 2011, where each ledger (segment) is striped across an ensemble of bookies; (c) geo-replication is built into the topic as a property — set a replication cluster list and writes flow to every region asynchronously; (d) multi-tenancy is first-class — tenants → namespaces → topics form a path that hosts ACLs, quotas, and retention.
  • What that bought Pulsar. Rebalancing a broker is seconds because brokers carry no state; one cluster can host hundreds of thousands of topics because metadata is in ZooKeeper-or-etcd and storage is striped, not co-resident; geo-replication is a configuration flag, not a separate deployment. The cost: more moving parts (broker + bookie + metadata), a steeper ops learning curve, and a longer historical track record needed to feel safe on the most demanding throughput workloads.

The 2018–2026 convergence.

  • Kafka added geo-replication. MirrorMaker 1 (2014) was simple but unreliable on offsets; MirrorMaker 2 (2019, KIP-382) ships with Kafka Connect, supports active/active replication, and translates consumer offsets between clusters.
  • Kafka removed ZooKeeper. KRaft (KIP-500, GA in Kafka 3.3, default in 3.7+) replaces ZooKeeper with an internal Raft quorum, shrinking the operational surface to one process per node.
  • Kafka added tiered storage. KIP-405 (proposed 2018, Apache release in 3.6 / 2023) lets brokers offload cold segments to S3 / GCS / Azure Blob, mirroring what Pulsar shipped in 2.1 (2018).
  • Pulsar added transactions and exactly-once. Pulsar 2.7 (2021) shipped transactions on top of the segment model; Pulsar Functions matured into a per-topic stream processor with windows and retries.
  • Pulsar added etcd as a metadata store. Pulsar 3.x reduces the ZooKeeper dependency by allowing etcd as the metadata backend.

Where each engine still wins in 2026.

  • Kafka wins on: the largest single-region throughput numbers, the deepest ecosystem (Confluent, Aiven, MSK, Strimzi, Redpanda), the broadest tooling (kSQL, Kafka Streams, Debezium), and the simplest "one process per node" mental model after KRaft.
  • Pulsar wins on: multi-region active/active by default, very-high-topic-count workloads (multi tenancy kafka pulsar is the canonical search), light per-topic stream processing without an external app, and rebalance/recovery speed under broker churn.
  • The honest 2026 picture. Neither engine is "better" at the workload-agnostic level. The right answer depends on whether your top-priority constraint is raw single-region throughput economics (lean Kafka), multi-tenant topic explosion + geo (lean Pulsar), or team familiarity with one ecosystem already (always wins).

What interviewers listen for on pulsar vs kafka performance.

  • Do you frame Pulsar's bookkeeper as a separate scalable storage tier, not "just another broker"? — senior signal.
  • Do you mention that Kafka brokers store partition logs locally and Pulsar brokers store nothing? — required answer.
  • Do you flag that MirrorMaker 2 is a separate Kafka Connect job and Pulsar geo-replication is a topic property? — senior signal.
  • Do you separate "raw throughput" (Kafka still wins) from "ops at 100k topics" (Pulsar wins)? — senior signal.

Worked example — the storage-coupling delta in one diagram

Detailed explanation. Imagine two clusters, each with three brokers, each serving the same workload of 100 GB / hour into one topic with 12 partitions. In Kafka, the partition log files physically sit on the broker that leads each partition — when broker 1 dies, ZooKeeper / KRaft picks a new leader, and the follower replicas (which were already in-sync) become the new leader. In Pulsar, the broker that owns the topic is a stateless proxy — it forwards every write to an ensemble of bookies (say 3 bookies, write-quorum 2, ack-quorum 2) and remembers nothing locally. When the broker dies, the cluster simply reassigns topic ownership to a healthy broker — the new broker can start serving instantly because the data is already in the bookies.

Question. In an "all three brokers running" baseline, write the producer-side path for one message in both engines. Then trace what happens when the leader broker for one topic dies. Compare the recovery time and what data has to move.

Input (baseline cluster).

component Kafka Pulsar
broker count 3 3
storage layer local disk (replicated) 3 BookKeeper bookies
partitions / topic 12 12
replication factor RF=3 Qw=3, Qa=2
metadata ZooKeeper / KRaft ZooKeeper / etcd

Code (pseudocode for the write path).

# Kafka — producer to broker
producer.send(topic="events", key=k, value=v)
  -> Kafka client hashes k -> partition P
  -> client looks up partition leader (broker-2 owns P)
  -> sends to broker-2
  -> broker-2 appends to its local log file for P
  -> broker-2 replicates to in-sync followers (broker-1, broker-3)
  -> broker-2 returns acks after min.insync.replicas met
  -> producer receives ack

# Pulsar — producer to broker -> bookies
producer.send(topic="events", key=k, value=v)
  -> Pulsar client hashes topic name -> broker (broker-2 owns the topic)
  -> sends to broker-2 over binary protobuf
  -> broker-2 wraps message in an entry, sends to bookie ensemble (3 bookies)
  -> bookies write entry to journal + ledger
  -> Qw=3 means all 3 bookies must receive; Qa=2 means 2 acks before broker confirms
  -> broker-2 returns ack to producer
  -> broker-2 itself stores no message data
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Kafka write. The data lands on the leader broker's local log; replication is broker-to-broker. The broker is the storage system.
  2. Pulsar write. The data lands in BookKeeper bookies via the broker. The broker is purely a routing and protocol layer.
  3. Kafka broker-2 dies. ZooKeeper / KRaft elects a new leader from the in-sync follower set (broker-1 or broker-3 takes over leadership of P). No data movement is required if the follower was in-sync. If the follower lagged, it must catch up before it can serve reads. Time to recover: seconds (election) + minutes (potential follower catch-up).
  4. Pulsar broker-2 dies. The metadata store notices the broker is gone, and the load-manager reassigns the broker-2 topics to broker-1 or broker-3. No data movement occurs because the data lives in bookies, not on brokers. The new broker connects to the same bookies and continues serving. Time to recover: seconds — and there is no notion of "out-of-sync follower" because storage is not a per-broker concept.
  5. The deeper consequence. In Kafka, scaling out a hot partition means physically moving its log to a new broker (a partition reassignment can take hours on a busy 100 GB partition). In Pulsar, scaling out a hot topic means changing topic-to-broker ownership in the metadata store — instant, and the bookies were already striped across the cluster.

Output (recovery profile).

event Kafka recovery Pulsar recovery
broker death (in-sync followers) ~5–10 s leader election ~5–10 s ownership reassignment
broker death (lagging followers) minutes — follower catch-up unchanged — bookies hold data
hot-partition rebalance minutes-to-hours (data movement) seconds (ownership flip)
add new broker minutes-to-hours (partition copy) seconds (load manager rebalances)

Rule of thumb. If your workload churns brokers often (autoscaling on Kubernetes, frequent rolling restarts, opportunistic capacity), the Pulsar split-storage model pays back the extra moving parts. If your workload is "ten brokers running steady state for two years," the Kafka coupled model has fewer surfaces and a simpler ops story.

SQL interview question on event-time deduplication for replayed streams

A senior interviewer might frame this as: "After a Pulsar-to-Kafka or Kafka-to-Kafka replay, the downstream warehouse table may contain duplicates. Write a SQL deduplication query that keeps the latest event per event_id using ingestion time as the tie-breaker." This blends streaming semantics with SQL — a common follow-up after an architecture discussion.

Solution Using a deduplication window function

-- Keep the latest event per event_id; ties broken by ingestion_ts DESC.
WITH ranked AS (
    SELECT
        event_id,
        tenant_id,
        payload,
        event_ts,
        ingestion_ts,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY ingestion_ts DESC, event_ts DESC
        ) AS rn
    FROM stage_events
    WHERE ingestion_ts >= CURRENT_DATE - INTERVAL '7 days'
)
SELECT event_id, tenant_id, payload, event_ts, ingestion_ts
FROM ranked
WHERE rn = 1;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

event_id ingestion_ts event_ts row_number kept?
evt-1 2026-06-10 09:00 2026-06-10 08:59 2 no
evt-1 2026-06-10 09:05 2026-06-10 08:59 1 yes
evt-2 2026-06-10 09:01 2026-06-10 09:01 1 yes
evt-3 2026-06-10 09:02 2026-06-10 09:02 2 no
evt-3 2026-06-10 09:07 2026-06-10 09:02 1 yes (replay wins)

Under PARTITION BY event_id ORDER BY ingestion_ts DESC, the later ingestion_ts gets rn=1. So the replayed row (09:07) is rn=1 and the original (09:02) is rn=2. The final result keeps the most-recently ingested copy per event_id — exactly the semantic you want when a stream is replayed and may carry corrections.

Output:

event_id kept ingestion_ts source
evt-1 2026-06-10 09:05 latest
evt-2 2026-06-10 09:01 only copy
evt-3 2026-06-10 09:07 replayed wins

Why this works — concept by concept:

  • PARTITION BY event_id — the dedup key. Every row sharing an event_id enters the same row-number sequence, independent of all other events.
  • ORDER BY ingestion_ts DESC — picks the most recent copy by broker-ack time, which is the right tie-breaker after a replay (the new copy carries the corrected payload). For "first-write-wins" semantics, use ASC.
  • ROW_NUMBER() vs RANK() vs DENSE_RANK() — ROW_NUMBER assigns a unique sequence per partition, so the WHERE rn = 1 filter always keeps exactly one row. RANK and DENSE_RANK can produce ties and would keep multiple rows for the same event_id, defeating the dedup.
  • WHERE ingestion_ts >= now - 7 days — a sliding window prunes the scan; without it the dedup runs over the entire history every batch, which is O(table).
  • Cost — O(n log n) for the window sort per partition, where n is the rows per event_id (usually small). With a proper index on (event_id, ingestion_ts DESC), the scan is near-linear in the date window.

SQL
Topic — streaming
Streaming problems (SQL)

Practice →


2. Architecture — Kafka brokers + partitioned log vs Pulsar brokers + BookKeeper segments + metadata store

The deepest delta in apache pulsar vs kafka is where the bytes live — and what that means for every operational gesture you make on the cluster

The mental model in one line: Kafka brokers are fat — they hold compute, the partition log, and replication; Pulsar brokers are thin — they hold only compute, and the log lives in a separate BookKeeper layer keyed by segments (ledgers) instead of by partition. Once you can draw the two architectures side by side, every other trade-off — geo-replication, tiered storage, multi-tenancy — follows from that single split.

Visual architecture diagram comparing Kafka and Pulsar — left side Kafka broker holding partition log files directly on local disk with ZooKeeper/KRaft metadata; right side Pulsar stateless broker forwarding writes to a BookKeeper ensemble of bookies storing segments, with a separate metadata store (ZooKeeper or etcd); on a light PipeCode card.

The Kafka broker in five bullets.

  • One process. A Kafka broker is a single JVM that handles client protocol, partition leadership, log writes, log compaction, replication, and (in legacy mode) talks to ZooKeeper or (in modern mode) participates in the KRaft quorum.
  • Local log files. Each partition is a directory on local disk containing rolled .log segment files plus index files. Reads use the page cache + sendfile for zero-copy.
  • Replication is broker-to-broker. Leader broker writes to local log, follower brokers fetch and replicate. min.insync.replicas controls durability.
  • Scaling = data movement. Adding a broker or rebalancing partitions physically copies log data over the network. A 500 GB partition takes hours.
  • Metadata in ZooKeeper or KRaft. Pre-3.3, ZooKeeper coordinates leader election and metadata. KRaft (3.3+, default in 3.7) replaces ZooKeeper with an internal Raft quorum of controller nodes.

The Pulsar broker in five bullets.

  • Stateless proxy. A Pulsar broker is a JVM that handles client protocol, topic ownership, and forwards writes to BookKeeper. It owns no on-disk message data.
  • Topic ownership. Each topic is owned by exactly one broker at a time; the load manager rebalances ownership when brokers come and go.
  • BookKeeper does the durability. Writes flow broker → bookies. A ledger is a write-once segment striped across an ensemble of bookies (e.g. 3 bookies, write-quorum 3, ack-quorum 2).
  • Scaling = ownership flip. Adding a broker or moving a hot topic is a metadata change — milliseconds — because the data does not move.
  • Metadata in ZooKeeper or etcd. Pulsar uses ZooKeeper for cluster metadata historically; Pulsar 3.x supports etcd as the metadata backend.

BookKeeper in five bullets.

  • A separate log service. BookKeeper is an Apache top-level project — it predates Pulsar (Yahoo, 2011). Pulsar happens to be its biggest user.
  • Ledger is the unit. A ledger is a write-once, append-only segment. When a ledger closes, it is immutable.
  • Ensemble + quorum. Each ledger is striped across an ensemble of bookies. Writes go to Qw bookies and the broker considers the entry committed once Qa of them ack.
  • Bookies are storage nodes. A bookie holds a journal (for write durability) and ledger directories (for read efficiency). Hard separation of write path and read path inside the bookie.
  • Striping = parallelism. Different ledgers can live on different bookies; one topic's ledgers can be spread across the whole cluster, giving high parallel write throughput.

Three concrete architectural consequences.

  • Rebalance speed. Pulsar broker rebalance is a metadata operation; Kafka partition rebalance is a data copy. For a cluster that churns capacity, this matters every week.
  • Topic count ceiling. Pulsar's metadata model and stateless brokers scale topic counts further than classic Kafka. KRaft has narrowed the gap but Pulsar still leads at the 100k+ topic axis.
  • Operational surface. Kafka after KRaft is fewer moving parts (broker = everything). Pulsar is more moving parts (broker + bookie + metadata) but each part scales independently.

Common interview probes on pulsar vs kafka performance architecture.

  • "Why does Pulsar have BookKeeper at all?" — to decouple the durability layer from the compute layer, which makes brokers stateless and storage independently scalable.
  • "What is a ledger?" — a write-once segment in BookKeeper. Multiple ledgers per topic. Closed ledgers are immutable, which makes offload to S3 safe.
  • "How does Kafka handle stateless brokers?" — it doesn't (in the classic sense). KRaft removed ZooKeeper, but partition data still lives on the broker. The path to "stateless Kafka" is tiered storage (KIP-405), which keeps the local log shorter so rebalance is faster.
  • "What is the write quorum in Pulsar?" — Qw is the number of bookies the broker writes to per entry; Qa is the number of acks the broker waits for before confirming the write. Typical Qw=3, Qa=2.

Worked example — write-path latency budget in both engines

Detailed explanation. Both engines target single-digit-millisecond producer-to-broker p99 latency on a healthy cluster. The path differs: Kafka is one hop (producer → leader broker → disk + replicate), Pulsar is two hops (producer → broker → bookies → disk). The extra hop is one of the things that makes Pulsar architecturally more flexible but also adds a small fixed latency cost.

Question. Lay out a per-message write budget for both engines on a healthy cluster. Identify which hops are fixed cost and which are dependent on min.insync.replicas / Qa.

Input (assumptions).

factor value
network hop 0.2 ms
disk write (journal) 0.5 ms
RF / Qw, Qa RF=3 min.isr=2 ; Qw=3 Qa=2
client → broker RTT 0.4 ms

Code (latency math, pseudocode).

# Kafka — producer p99 budget
client -> leader broker         : 0.4 ms (one RTT, half is included in ack)
leader broker append to log     : 0.5 ms (disk + page cache)
leader -> follower replicate    : 0.4 ms (one RTT, parallel to leader write)
follower ack to leader          : already counted
leader confirms when min.isr=2  : leader + 1 follower ack
TOTAL                           : ~1.5-3 ms p99 healthy

# Pulsar — producer p99 budget
client -> broker                : 0.4 ms
broker -> bookies (Qw=3)        : 0.4 ms RTT (parallel to all 3)
bookies journal write           : 0.5 ms each (parallel)
Qa=2 acks back to broker        : 0.4 ms
broker -> client                : 0.4 ms
TOTAL                           : ~2-4 ms p99 healthy
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Kafka write path. The client sends the message to the partition leader broker. The leader appends to its local log file and replicates to in-sync followers in parallel. Once min.insync.replicas followers ack, the leader returns success. One network hop client-to-leader, plus one parallel replication hop.
  2. Pulsar write path. The client sends the message to the topic-owning broker (which it discovered via a lookup the first time). The broker forwards the entry to the BookKeeper ensemble of bookies. Each bookie writes to its journal and acks. Once Qa bookies ack, the broker returns success to the client. Two network hops in the critical path: client-to-broker and broker-to-bookies.
  3. Why the extra hop is small. Bookies are in the same data centre as brokers; the RTT is a fraction of a millisecond. The journal write at the bookie is the dominant cost — same as the log append on a Kafka broker.
  4. Why the architectural cost pays back. That extra hop purchased statelessness, which buys instant rebalance and easier scaling. For most workloads, an extra 0.5–1 ms p99 is fine.
  5. The tunable knobs. Lowering Qa from 3 to 2 trades durability for latency (one less ack to wait for). Lowering RF from 3 to 2 in Kafka does the same. Be explicit about durability requirements before you tune.

Output (write-path latency).

engine hops in critical path typical p99 (healthy) dominant cost
Kafka client → leader; leader → 1 follower (parallel) 1.5–3 ms leader log append + replication ack
Pulsar client → broker; broker → bookies 2–4 ms bookie journal write + Qa ack

Rule of thumb. Kafka p99 is typically a tick lower because there is one fewer network hop in the critical path. Pulsar p99 is rarely the bottleneck for analytics or business-event workloads — it matters only for sub-2 ms transactional bus use cases.

Worked example — partition reassignment in Kafka vs topic ownership flip in Pulsar

Detailed explanation. An on-call engineer is adding a new broker to a hot cluster. In Kafka, this is a partition reassignment — log data has to move. In Pulsar, this is a load-manager rebalance — ownership of some topics flips to the new broker, but no data moves. The operational story is wildly different.

Question. Walk through both procedures step by step. Estimate the time and the impact on producers / consumers.

Input.

dimension value
existing cluster 6 brokers, 600 partitions / 6,000 topics
new capacity +2 brokers
hot partition log size 500 GB
inter-broker bandwidth 1 Gbps (~125 MB/s)

Code (Kafka reassignment, kafka-reassign-partitions.sh).

# 1) generate a proposed reassignment plan
kafka-reassign-partitions.sh \
  --bootstrap-server broker-1:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3,4,5,6,7,8" \
  --generate

# 2) execute (throttled)
kafka-reassign-partitions.sh \
  --bootstrap-server broker-1:9092 \
  --reassignment-json-file plan.json \
  --execute \
  --throttle 50000000   # 50 MB/s

# 3) monitor — can take hours on large partitions
kafka-reassign-partitions.sh \
  --bootstrap-server broker-1:9092 \
  --reassignment-json-file plan.json \
  --verify
Enter fullscreen mode Exit fullscreen mode

Pulsar load-manager rebalance.

# Brokers self-rebalance via the modular load manager; you usually do nothing.
# Manual flip if needed:
pulsar-admin namespaces unload public/default
# Or for one bundle:
pulsar-admin namespaces unload public/default --bundle 0x00000000_0x40000000

# Inspect ownership:
pulsar-admin brokers list saas-co-cluster
pulsar-admin broker-stats topics
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Kafka reassignment, step 1. Generate a plan that moves some partition replicas to the new brokers. The plan is a JSON file mapping partition → broker list.
  2. Kafka reassignment, step 2. Execute the plan with a bandwidth throttle. The leader broker starts shipping log data to the new broker; the new broker becomes a follower and catches up.
  3. Kafka reassignment, step 3. Monitor until every partition in the plan is fully replicated to the new broker set. For a 500 GB partition at 50 MB/s throttled, that is ~10,000 seconds = ~2.7 hours per partition. In practice teams run multiple partitions in parallel but the total wall-clock can be hours.
  4. Pulsar rebalance, step 1. The modular load manager continuously samples broker CPU, memory, throughput, and topic count. When a new broker joins, the load manager reassigns ownership of some topics — a metadata update.
  5. Pulsar rebalance, step 2. Clients receive a "topic moved" hint from the old broker; they reconnect to the new broker. No data movement; the new broker connects to the same bookies.
  6. The 100x difference. Kafka rebalance is bound by network bandwidth (data movement). Pulsar rebalance is bound by metadata-write latency. For a 500 GB topic, the difference is hours vs. seconds.

Output.

step Kafka time Pulsar time
add broker seconds (join cluster) seconds (join cluster)
rebalance plan minutes none (automatic)
data movement hours per hot partition none
total to "hot" new broker hours–days seconds–minutes

Rule of thumb. If your team rebalances Kafka every few weeks and the operation is well-tooled, the data-movement cost is acceptable. If your cluster scales aggressively (autoscaling on Kubernetes, frequent capacity changes, multiple migrations a quarter), Pulsar's stateless rebalance pays back the architectural complexity by week 4.

SQL interview question on per-broker load skew detection

The probe usually sounds like: "Given a per-broker minute-level metric table, write a query that flags brokers running more than 1.5x the cluster median CPU for 15 consecutive minutes — that is the classic 'hot broker' signal that triggers a rebalance."

Solution Using percentile_disc + a running streak

WITH per_minute AS (
    SELECT
        broker_id,
        ts,
        cpu_pct,
        PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY cpu_pct)
            OVER (PARTITION BY ts) AS cluster_median_cpu
    FROM broker_metrics
    WHERE ts >= NOW() - INTERVAL '2 hours'
),
flagged AS (
    SELECT
        broker_id,
        ts,
        cpu_pct,
        cluster_median_cpu,
        CASE WHEN cpu_pct > 1.5 * cluster_median_cpu THEN 1 ELSE 0 END AS hot
    FROM per_minute
),
runs AS (
    SELECT
        broker_id,
        ts,
        hot,
        SUM(CASE WHEN hot = 0 THEN 1 ELSE 0 END)
            OVER (PARTITION BY broker_id ORDER BY ts) AS cool_grp
    FROM flagged
)
SELECT broker_id, MIN(ts) AS streak_start, COUNT(*) AS streak_len
FROM runs
WHERE hot = 1
GROUP BY broker_id, cool_grp
HAVING COUNT(*) >= 15
ORDER BY streak_len DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

broker_id ts cpu_pct median hot cool_grp streak_len
b-4 10:00 78 40 1 0 1
b-4 10:01 82 41 1 0 2
b-4 1 0
b-4 10:14 80 42 1 0 15
b-4 10:15 35 40 0 1 (resets)

The cool_grp running-sum trick groups consecutive hot minutes into one streak per broker. The HAVING clause keeps only streaks of 15+.

Output:

broker_id streak_start streak_len
b-4 10:00 15

Why this works — concept by concept:

  • PERCENTILE_DISC OVER PARTITION BY ts — computes the cluster median per minute so the threshold is relative to the current cluster, not a static number. A 70% CPU broker is "cold" when the cluster median is 60% and "hot" when the cluster median is 30%.
  • Hot flag as 0/1 — turns the threshold check into an integer indicator. Cheap, indexable, and composable with running sums.
  • SUM OVER (PARTITION BY broker_id ORDER BY ts) running-grouping — every time a row is cold (hot=0), the running sum bumps; consecutive hot rows share the same cool_grp value. That is the canonical "consecutive-streak" trick from Blog106 and Blog121.
  • GROUP BY (broker_id, cool_grp) — each streak collapses into one row carrying its length.
  • Cost — three window passes over a 2-hour window of broker metrics; with an index on (broker_id, ts) the scan is O(rows-in-window).

SQL
Topic — window functions
Window function problems (SQL)

Practice →


3. Geo-replication — Pulsar native multi-region clusters vs Kafka MirrorMaker 2 + offset translation

pulsar geo replication is a property of the topic — Kafka MirrorMaker 2 is a separate Kafka Connect job

The mental model in one line: Pulsar treats geo-replication as a per-topic configuration (set the replication-cluster list and writes flow to every region asynchronously); Kafka treats geo-replication as an out-of-process stream-processing job (MirrorMaker 2) that reads from a source cluster and writes to a destination cluster while translating consumer offsets. Once you can name those two patterns, every probe on pulsar geo replication reduces to "which side carries the operational load — the broker or a separate worker fleet?"

Visual diagram of geo-replication — left a Pulsar topology with three regional clusters (us-east, eu-west, ap-south) all writing into and reading from a shared geo-replicated topic via built-in cluster replication; right a Kafka topology with two source clusters and a MirrorMaker 2 hop in the middle translating offsets into the destination cluster; on a light PipeCode card.

Pulsar geo-replication in detail.

  • Per-topic property. Set the replication-cluster list at the namespace level: pulsar-admin namespaces set-clusters tenant/ns --clusters us-east,eu-west,ap-south. Every topic under that namespace is automatically replicated to the listed clusters.
  • Asynchronous by default. Pulsar replicates messages asynchronously between clusters — a producer in us-east acks the moment the local cluster persists; the message then flows to eu-west and ap-south in the background.
  • Active/active out of the box. Producers can publish to the topic in any cluster; consumers can subscribe in any cluster. The same topic name in all regions; the same subscription name.
  • Replicator subscription. Internally, each cluster maintains a special "replicator" subscription that reads local-cluster writes and ships them to remote clusters. The subscription handles backpressure and retries.
  • Sync replication is opt-in. For workloads that demand cross-region durability before ack, Pulsar 2.7+ supports synchronous geo-replication per-namespace — at the cost of inter-region RTT in the write path.

Kafka MirrorMaker 2 in detail.

  • A Kafka Connect job. MirrorMaker 2 (KIP-382, shipped in Kafka 2.4) runs as Kafka Connect tasks. You deploy a worker fleet separate from the brokers.
  • Source and destination cluster. MM2 reads from a source cluster and writes to a destination cluster. Topic names are prefixed with the source cluster alias by default (e.g. prod-us.events in dr-us).
  • Offset translation. MM2 maintains a checkpoint topic mapping (source partition, source offset) → (target partition, target offset). A consumer that fails over from prod-us to dr-us can read the checkpoint and resume from the correct position.
  • Active/active is configurable. A bidirectional MM2 deployment (us → eu and eu → us) achieves active/active, but you must adopt a naming convention to avoid replication loops (use the cluster alias prefix).
  • Operational cost. MM2 is a separate fleet: you patch it, monitor it, alert on its lag, and scale it independently from the brokers. Comparable cost to running a Kafka Streams app at the same throughput.

Three trade-offs that drive the decision.

  • Topology overhead. Pulsar geo-replication = topic property. Kafka MM2 = separate Kafka Connect fleet. The Pulsar side has fewer moving parts at runtime.
  • Topic naming. Pulsar uses the same topic name in every region — "events" is "events" everywhere. Kafka MM2 uses a source-cluster-prefixed topic name to avoid loops (e.g. dr-us.events). Consumers must know whether to read the local or the remote name.
  • Offset semantics. Pulsar's subscription cursor is per-cluster (each cluster's broker tracks its own subscriber position); Kafka's offsets are per-cluster too, but MM2 must explicitly translate them via the checkpoint topic for clean failover.

Common interview probes on pulsar geo replication.

  • "How does Pulsar handle a cross-region replication loop?" — each message carries a replicated_from header; a cluster does not re-replicate a message it received from elsewhere.
  • "Is Pulsar geo-replication synchronous?" — no, by default it is asynchronous. Sync replication is per-namespace opt-in (2.7+).
  • "Why is MirrorMaker 2 a separate fleet?" — because Kafka's architecture has no concept of cross-cluster replication inside the broker. MM2 is the bridge built outside it.
  • "What is offset translation in MM2?" — a mapping between source-cluster offsets and target-cluster offsets, kept in a checkpoint topic so consumers can resume on the target after a failover.

Worked example — same-data, two-region topic in Pulsar

Detailed explanation. A team needs the same "events" topic in us-east-1 and eu-west-1. Producers in either region publish; consumers in either region subscribe; failover should be transparent. In Pulsar, this is two commands.

Question. Set up a geo-replicated events topic in two Pulsar clusters. Show the admin commands, the producer code, and how a consumer in either region sees the same stream.

Input.

dimension value
tenant acme
namespace acme/prod
topic events
clusters us-east, eu-west
replication async (default)

Code (Pulsar admin + producer/consumer).

# 1) Configure clusters on the metadata side (one-time)
pulsar-admin clusters create us-east --url http://us-east-broker:8080
pulsar-admin clusters create eu-west --url http://eu-west-broker:8080

# 2) Allow the tenant to use both clusters
pulsar-admin tenants create acme --allowed-clusters us-east,eu-west

# 3) Set the replication-cluster list on the namespace
pulsar-admin namespaces create acme/prod --clusters us-east,eu-west
pulsar-admin namespaces set-clusters acme/prod --clusters us-east,eu-west
Enter fullscreen mode Exit fullscreen mode
// Producer in us-east-1
PulsarClient client = PulsarClient.builder()
    .serviceUrl("pulsar://us-east-broker:6650")
    .build();
Producer<byte[]> producer = client.newProducer()
    .topic("persistent://acme/prod/events")
    .create();
producer.send("hello-from-us".getBytes());

// Consumer in eu-west-1 — same topic name
PulsarClient euClient = PulsarClient.builder()
    .serviceUrl("pulsar://eu-west-broker:6650")
    .build();
Consumer<byte[]> consumer = euClient.newConsumer()
    .topic("persistent://acme/prod/events")
    .subscriptionName("eu-app-subscription")
    .subscribe();

Message<byte[]> msg = consumer.receive();
System.out.println(new String(msg.getValue())); // "hello-from-us"
consumer.acknowledge(msg);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Cluster wiring. The two Pulsar clusters know about each other via the metadata store (cluster URL registration). One-time setup.
  2. Tenant scope. The tenant "acme" is allowed on both clusters. Namespaces inherit this list.
  3. Replication-cluster list. Setting the list on the namespace means every topic under acme/prod is automatically replicated to both clusters. No per-topic config needed.
  4. Producer in us-east. The producer connects to the us-east broker and publishes. The message is persisted locally (BookKeeper in us-east) and acked.
  5. Replicator subscription. A background "replicator" subscription on the us-east cluster reads new messages and ships them to eu-west. Latency depends on inter-region RTT.
  6. Consumer in eu-west. The consumer connects to the eu-west broker on the same topic name. The eu-west broker has the message (replicated) and delivers it.
  7. Subscription scope. The subscription "eu-app-subscription" is local to eu-west — the us-east cluster does not see it. Each cluster tracks its own subscriber positions.

Output.

component location sees the message
producer us-east acked locally
us-east BookKeeper us-east persisted
replicator subscription us-east → eu-west shipped async
eu-west BookKeeper eu-west persisted
eu-west consumer eu-west receives + acks

Rule of thumb. Pulsar geo-replication is a namespace decision. Once you set the cluster list, every topic is automatically replicated — there is no per-topic plumbing. For workloads that need different replication topologies per topic, use different namespaces.

Worked example — same-data, two-region topic in Kafka MirrorMaker 2

Detailed explanation. The Kafka equivalent of the Pulsar setup above requires deploying a MirrorMaker 2 cluster and configuring a replication flow. Producers still publish to one cluster (the source); consumers in the other region read the prefixed mirror topic.

Question. Set up a Kafka geo-replication topology with MM2. Show the connector config, the producer code, and what consumers in the target region need to know.

Input.

dimension value
source cluster prod-us (3 brokers)
target cluster dr-eu (3 brokers)
topic events (in prod-us)
MM2 fleet 3 Connect workers in eu-west
direction active/passive (us → eu)

Code (mm2.properties + producer/consumer).

# mm2.properties — MirrorMaker 2 Connect config
clusters = prod-us, dr-eu
prod-us.bootstrap.servers = us-broker-1:9092,us-broker-2:9092,us-broker-3:9092
dr-eu.bootstrap.servers   = eu-broker-1:9092,eu-broker-2:9092,eu-broker-3:9092

# Replication flow: prod-us -> dr-eu
prod-us->dr-eu.enabled = true
prod-us->dr-eu.topics = events

# Offset translation (the checkpoint topic)
prod-us->dr-eu.emit.checkpoints.enabled = true
prod-us->dr-eu.sync.group.offsets.enabled = true
prod-us->dr-eu.refresh.topics.interval.seconds = 30

# Naming convention — source-prefixed topic in destination
replication.policy.separator = .
# Result: 'events' in prod-us appears as 'prod-us.events' in dr-eu
Enter fullscreen mode Exit fullscreen mode
// Producer in us-east-1
KafkaProducer<String, String> p = new KafkaProducer<>(producerProps);
p.send(new ProducerRecord<>("events", key, value)); // writes to prod-us

// Consumer in eu-west-1 — must read the prefixed topic
KafkaConsumer<String, String> c = new KafkaConsumer<>(consumerProps);
c.subscribe(List.of("prod-us.events"));
// On failover, the app re-subscribes to "prod-us.events" in dr-eu
// MM2 has been writing the local-region copy + translating offsets
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. MM2 config. Two clusters listed; bootstrap servers for each; the replication flow prod-us->dr-eu is enabled for the events topic.
  2. Naming convention. The destination topic in dr-eu is prod-us.events (source-cluster prefix). This prevents replication loops if you also enable dr-eu->prod-us.
  3. Producer in us-east. Standard Kafka producer to prod-us. No knowledge of MM2.
  4. MM2 in flight. MM2 workers in eu-west read from prod-us.events (well, the source events topic in prod-us via the bootstrap servers) and write to prod-us.events in dr-eu. They also write checkpoint records.
  5. Consumer in eu-west. The consumer subscribes to prod-us.events — the prefixed name. If the app failed over from prod-us to dr-eu, it would re-read the checkpoint to find the correct offset to resume from.
  6. Compared with Pulsar. The producer and consumer code look similar. The difference is the naming convention (prefix) and the operational overhead (an extra fleet of MM2 workers to deploy, scale, and alert on).

Output.

component location sees the message
producer prod-us acks locally
prod-us topic events prod-us persisted
MM2 worker fleet eu-west reads + writes
dr-eu topic prod-us.events dr-eu persisted with mapped offsets
eu-west consumer (failover) dr-eu subscribes to prefixed topic

Rule of thumb. Kafka MM2 is an excellent tool for one-direction or bidirectional replication, but it carries the cost of a separate fleet. If your workload only needs occasional cross-region replication (DR replica), MM2 is reasonable. If your workload is active/active by design, Pulsar's per-topic native replication is operationally lighter.

Worked example — failover and offset translation in MM2

Detailed explanation. The interview moment: a consumer in prod-us has been reading at offset 1,000,000 on the events topic. The us region goes dark. The consumer must fail over to dr-eu and pick up reading the equivalent message on prod-us.events. The MM2 checkpoint topic is the key.

Question. Walk through the offset-translation steps. What does the consumer read first after failover?

Input.

dimension value
source offset (prod-us.events) 1,000,000
target offset (prod-us.events in dr-eu) 1,000,000 (approximately)
checkpoint record (consumer-group=app, partition=0, src=1000000, tgt=999987)

Code (consumer-side reconciliation, simplified).

// On failover detected
String checkpointTopic = "prod-us.checkpoints.internal";
// MM2 has been writing records like:
// key = (consumer-group, partition)
// value = ConsumerCheckpoint { upstreamOffset=1_000_000, downstreamOffset=999_987 }

ConsumerCheckpoint cp = lookupCheckpointFor("app", 0);
long resumeOffset = cp.downstreamOffset(); // 999,987

KafkaConsumer<String, String> c = new KafkaConsumer<>(drProps);
c.assign(List.of(new TopicPartition("prod-us.events", 0)));
c.seek(new TopicPartition("prod-us.events", 0), resumeOffset);
// Now read forward — the next poll() returns the message that was at
// offset 1,000,001 in prod-us, which is at 999,988 in dr-eu.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The checkpoint topic. MM2 emits a record to prod-us.checkpoints.internal periodically (default every 60s) summarising "consumer-group app was at upstream offset X; in the target cluster that corresponds to downstream offset Y."
  2. The consumer failover. The application code (or a sidecar) reads the most recent checkpoint for the group + partition. The downstream offset becomes the resume point.
  3. Why offsets do not match 1:1. The target cluster's prod-us.events topic was built by MM2 reading from source. The first message MM2 ever wrote into the target had offset 0 in the target, regardless of its source offset. The mapping is therefore not a constant — it can shift if MM2 was restarted, lagged, or skipped messages on a partition rebalance.
  4. The "approximately" caveat. Offset translation in MM2 is best-effort. The consumer may re-process or skip a small number of messages around the failover boundary. Idempotent processing downstream is the standard mitigation.
  5. Compared with Pulsar. Pulsar subscriptions are per-cluster. The eu-west consumer has its own subscription cursor in eu-west — no translation table to maintain. Failover is just "switch clients to the eu-west broker, keep using the same subscription name."

Output.

step action
1 failover detected
2 lookup checkpoint (group, partition)
3 seek downstream offset 999,987
4 resume; reprocess a few messages around the boundary
5 downstream dedup absorbs the small overlap

Rule of thumb. MM2 failover is workable but requires consumers to be idempotent. Pulsar failover is workable because subscriptions are local — but the application must accept that the eu-west cursor has its own progress, independent of us-east.

SQL interview question on cross-region replication lag

A senior interviewer might frame this as: "Given a per-message audit table that records (message_id, source_region, source_ts, target_region, target_ts), write a SQL query that returns the 95th-percentile replication lag per (source, target) pair over the last hour. This is the standard 'is geo-replication healthy?' metric."

Solution Using a percentile aggregate over the lag distribution

SELECT
    source_region,
    target_region,
    COUNT(*) AS messages_replicated,
    PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (target_ts - source_ts)))
        AS p50_lag_sec,
    PERCENTILE_DISC(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (target_ts - source_ts)))
        AS p95_lag_sec,
    PERCENTILE_DISC(0.99) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (target_ts - source_ts)))
        AS p99_lag_sec
FROM replication_audit
WHERE source_ts >= NOW() - INTERVAL '1 hour'
  AND target_ts IS NOT NULL
GROUP BY source_region, target_region
ORDER BY p95_lag_sec DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

message_id source_region source_ts target_region target_ts lag_sec
m-1 us-east 10:00:00.0 eu-west 10:00:00.45 0.45
m-2 us-east 10:00:00.0 eu-west 10:00:01.20 1.20
m-3 us-east 10:00:01.0 eu-west 10:00:01.40 0.40
m-4 us-east 10:00:02.0 ap-south 10:00:03.10 1.10
m-5 us-east 10:00:02.0 ap-south 10:00:04.50 2.50

After the percentile aggregate per (source, target) pair, you get the lag distribution. The p95 value answers "what is the worst-case the application sees 95% of the time?"

Output:

source_region target_region messages p50_lag_sec p95_lag_sec p99_lag_sec
us-east eu-west 1_200_000 0.50 1.30 2.10
us-east ap-south 1_200_000 1.20 2.80 4.50

Why this works — concept by concept:

  • PERCENTILE_DISC vs PERCENTILE_CONT — DISC picks an actual value from the input (no interpolation). CONT linearly interpolates. For lag SLOs ("our SLO is 'p95 below 2 seconds'"), DISC is more honest — it returns a value that actually happened.
  • WITHIN GROUP (ORDER BY ...) — the ordered-set-aggregate clause. It tells the percentile function which expression to rank by.
  • EXTRACT(EPOCH FROM ts_diff) — converts a INTERVAL into seconds for arithmetic. Without it, the percentile would be computed on intervals, which is engine-dependent.
  • GROUP BY (source_region, target_region) — each region pair gets its own distribution. A failing us-east→ap-south pair is invisible if you aggregate globally.
  • Cost — O(n log n) per group due to the percentile sort. With an index on (source_region, target_region, source_ts) the scan is local to the hour window.

SQL
Topic — real-time analytics
Real-time analytics problems (SQL)

Practice →


4. Tiered storage — BookKeeper offload to S3 vs Kafka tiered storage (KIP-405)

Both engines now offload cold log segments to object storage — Pulsar shipped it in 2018, Kafka caught up with KIP-405 in 2023

The mental model in one line: A closed log segment is immutable; immutable data belongs in cheap object storage; the broker can fetch it back transparently when a consumer wants to read history. Once you internalise the three shelves (hot SSD / warm BookKeeper or local segments / cold S3), the entire pulsar tiered storage and KIP-405 conversation becomes a cost-and-latency exercise.

Visual diagram of tiered storage shelves — three stacked horizontal shelves labelled hot (broker/local SSD), warm (BookKeeper / local segments), cold (S3 / GCS / Azure Blob); two columns showing how Kafka (KIP-405) and Pulsar (BookKeeper offloader) place data on each shelf; a cost chip on the right; on a light PipeCode card.

Pulsar tiered storage in detail.

  • BookKeeper offloader. Pulsar 2.1 (2018) shipped a per-namespace offloader that moves closed ledgers from BookKeeper bookies to object storage. Closed ledgers are immutable, so the offload is a one-shot copy.
  • Drivers. tiered-storage-jcloud supports AWS S3, GCS, Azure Blob; tiered-storage-filesystem supports HDFS. The offloader runs inside the broker process and is configured per-namespace.
  • Transparent reads. Once a ledger is offloaded, the BookKeeper copy can be deleted. When a consumer requests a message from that ledger, the broker reads from the object store directly and streams to the consumer.
  • Trigger. Configurable: by namespace size threshold, by ledger age, or manually via pulsar-admin topics offload. The "default" pattern is "offload everything older than 1 day" for a hot-tail-cold-history workload.
  • Cost shape. Object storage is ~10x cheaper per GB than SSD. Pulsar's offloader makes "infinite retention" actually affordable.

Kafka KIP-405 in detail.

  • KIP-405. Proposed 2018, adopted in Apache Kafka 3.6 (October 2023). Confluent shipped a proprietary version (Confluent Tiered Storage) in 2020; Aiven and Strimzi shipped variants soon after.
  • Local hot tier, remote cold tier. Each partition has an active segment on local disk (hot) plus rolled segments. Rolled segments are uploaded to the remote tier (S3 / GCS / Azure Blob) and the local copy is deleted when the local retention size is exceeded.
  • Transparent reads. When a consumer requests an offset that is in the remote tier, the broker fetches from the object store and streams to the consumer. Latency is higher than local reads.
  • Configuration. remote.log.storage.system.enable=true, remote.log.storage.manager.class.path=..., log.retention.ms (local), log.local.retention.ms (local cap), plus broker-specific cloud credentials.
  • Same cost shape. Cold tier ~$0.023/GB-month for S3 standard; warm tier ~$0.05/GB-month for EBS gp3; hot tier ~$0.20/GB-month for io2.

Three trade-offs.

  • Maturity. Pulsar's offloader has been production since 2018 (more battle-testing). Apache Kafka tiered storage has been GA since late 2023 — the operational track record is shorter but the design is sound.
  • Read latency from cold. Both engines hit the same physics: a cold-tier read pulls a segment object from S3, which is hundreds of milliseconds to first-byte. Latency-sensitive consumers should stay in the hot tier.
  • Cost / GB-month. Hot ~$0.20, warm ~$0.05, cold ~$0.023. A 1 TB topic with 30-day local + 365-day cold retention costs about $20 cold + $200 hot, instead of $7,000 if you ran a 365-day all-hot retention.

Common interview probes on pulsar tiered storage.

  • "Why is Pulsar's offload safe?" — because BookKeeper ledgers are immutable once closed; the offload is a copy, not a move, until the bookie copy is reclaimed.
  • "Can a consumer read from cold storage transparently?" — yes, in both engines. The broker proxies the cold read; the consumer protocol is unchanged.
  • "What is the typical hot-tail retention?" — 1–7 days local, 30–365+ days cold. Depends on the read-history pattern.
  • "How does cost compare?" — cold tier is ~10x cheaper per GB-month than hot. Tiered storage is the single biggest cost lever in modern messaging.

Worked example — Pulsar BookKeeper offload to S3

Detailed explanation. A namespace's topics accumulate 100 GB / day of events. The team wants 7 days of fast local access + 365 days of replayable history. Without tiered storage, that is 365 days × 100 GB = 36.5 TB of BookKeeper SSD — expensive. With offload, the warm tier holds 7 days (700 GB SSD), and the cold tier (S3) holds the rest (35.8 TB) at one-tenth the price.

Question. Configure the BookKeeper offloader on a namespace. Show the admin commands, the threshold, and how a consumer reads a one-month-old message.

Input.

dimension value
namespace acme/prod
daily write 100 GB
local (warm) retention 7 days
cold retention 365 days
object store S3 (us-east-1)

Code (broker.conf + pulsar-admin commands).

# broker.conf — enable the S3 offloader (one-time per broker)
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=acme-pulsar-coldstore
s3ManagedLedgerOffloadRegion=us-east-1
managedLedgerOffloadThresholdInBytes=107374182400   # 100 GiB per topic
Enter fullscreen mode Exit fullscreen mode
# Per-namespace offload policy: trigger offload when 700 GiB accumulates
pulsar-admin namespaces set-offload-threshold \
  --size 751619276800 acme/prod   # 700 GiB
pulsar-admin namespaces set-offload-deletion-lag \
  --lag 4h acme/prod                # delete BK copy 4h after S3 upload

# Force an offload right now (for testing)
pulsar-admin topics offload \
  -s 1G \
  persistent://acme/prod/events

# Inspect the offload state
pulsar-admin topics offload-status persistent://acme/prod/events
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Broker config. Tell every broker how to talk to S3: bucket, region, driver. Credentials come from the EC2 instance profile or the standard AWS SDK chain.
  2. Per-namespace threshold. Configure a size threshold (700 GiB) — when the namespace's local size exceeds this, the broker offloads the oldest closed ledger.
  3. Deletion lag. After a successful S3 upload, the broker waits 4 hours before deleting the BookKeeper copy. The lag is a safety net — if the S3 upload was corrupted, the BookKeeper copy still serves reads.
  4. Read path. A consumer asks for a one-month-old message. The broker checks its managed-ledger metadata: the ledger holding that message has been offloaded. The broker fetches the segment object from S3, decodes it, and streams the message to the consumer.
  5. Latency. The first read of a cold segment pays the S3 first-byte latency (~100–500 ms). Subsequent reads from the same segment are served from the broker's read cache.

Output.

age of message served from typical latency
0–7 days BookKeeper (SSD) <5 ms
7+ days S3 (cold) 100–500 ms first byte; then streaming
force-offloaded S3 as above

Rule of thumb. Tune the local-retention size to match your "fast replay" window — the time the application would re-read in a normal operation. Anything older is replay-on-rare-occasion, and the cold-tier latency is fine.

Worked example — Kafka KIP-405 tiered storage

Detailed explanation. The Kafka equivalent ships in Apache 3.6+. The configuration moves from "broker holds everything" to "broker holds local-retention; remote tier holds the rest." The consumer code does not change.

Question. Configure KIP-405 tiered storage on a broker. Show the local vs remote retention split and the consumer experience.

Input.

dimension value
Kafka version 3.6+
topic events
local retention 7 days
remote retention 365 days
remote store S3

Code (broker config + topic config).

# server.properties — enable tiered storage (one-time per broker)
remote.log.storage.system.enable=true
remote.log.storage.manager.class.path=/opt/kafka/libs/tiered/*
remote.log.storage.manager.class.name=org.apache.kafka.server.log.remote.storage.RemoteLogManager
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager

# Cloud config (driver-specific)
rsm.config.storage.s3.bucket.name=acme-kafka-coldstore
rsm.config.storage.s3.region=us-east-1
Enter fullscreen mode Exit fullscreen mode
# Topic-level configuration: enable remote storage + retention split
kafka-configs.sh --bootstrap-server broker-1:9092 \
  --entity-type topics --entity-name events --alter \
  --add-config remote.storage.enable=true,\
local.retention.ms=604800000,\
retention.ms=31536000000
# local.retention.ms = 7 days (in local broker tier)
# retention.ms       = 365 days (total, local + remote)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Broker config. Enable tiered storage; point to the remote log manager implementation and the cloud bucket. Restart brokers in a rolling fashion.
  2. Topic config. remote.storage.enable=true activates tiering for the topic. local.retention.ms caps how much stays on the broker; retention.ms is the total horizon.
  3. What happens at write time. Producers publish as usual. Brokers append to active segments. When a segment is rolled (size or time), a background task uploads it to S3 and registers metadata.
  4. What happens at consume time, hot data. Consumer asks for a recent offset. Broker reads from local log. Indistinguishable from non-tiered Kafka.
  5. What happens at consume time, cold data. Consumer asks for a 30-day-old offset. Broker looks up the segment in remote metadata, fetches from S3, streams to the consumer. Latency higher than local but the consumer protocol is identical.
  6. Compared with Pulsar. Same conceptual model, same cost shape, same trade-offs. Implementation maturity is the main delta in 2026 (Pulsar has 5 more years of production experience).

Output.

age of message served from latency
0–7 days local broker disk <5 ms
7+ days S3 (remote tier) 100–500 ms first byte; then streaming

Rule of thumb. Both engines now have tiered storage. The right choice is determined by which engine you are already running, not by the tiering capability — both make "infinite retention" economically feasible.

Worked example — the cost curve for 1 PB of historical events

Detailed explanation. Take a one-petabyte topic, growing 1 GB / hour. Compute the monthly cost of running all-hot, hot-tail + warm-recent, and hot-tail + cold-history. The cold tier is the entire game.

Question. Estimate the monthly storage cost in three retention strategies. Compare both engines (they have the same cost shape).

Input — pricing assumptions.

tier $ / GB-month (rough 2026)
hot (broker SSD / io2 / NVMe) $0.20
warm (BookKeeper SSD / gp3) $0.05
cold (S3 standard) $0.023

Code (math, not SQL).

data set: 1 PB = 1,048,576 GB

# Strategy A: all hot SSD
1_048_576 * 0.20 = $209,715 / month

# Strategy B: 7d hot + 30d warm + 365d cold
hot  = 7  * 24 GB = 168 GB                * 0.20 = $33.6
warm = 23 * 24 GB = 552 GB                * 0.05 = $27.6
cold = remainder ~1,048,000 GB (minus a small recent slice)
                                          * 0.023 ~ $24,104
TOTAL ~ $24,165 / month

# Strategy C: 1d hot + 365d cold (BookKeeper offload / KIP-405)
hot  = 1 * 24 GB = 24 GB    * 0.20 = $4.8
cold = ~1,048,552 GB        * 0.023 = $24,116
TOTAL ~ $24,121 / month
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. All-hot strategy. $210k/month is a sticker shock that drives most teams to tiering. This is the cost without offload — Kafka pre-3.6 or Pulsar pre-2.1.
  2. Three-tier strategy. Pushing 23 days of recent data to a warm SSD tier and the rest to cold S3 lowers the cost by ~90%. This is the Pulsar 2.1+ pattern with BookKeeper as the warm tier.
  3. Hot-tail-cold-history strategy. Pushing all but 1 day to cold drops the cost almost identically — the warm tier was a small fraction of the total anyway.
  4. Real-world adjustment. Add S3 PUT/GET request charges (~$0.005 / 1k PUT; ~$0.0004 / 1k GET) and inter-AZ traffic for replication if applicable. The relative ranking does not change; the absolute number shifts by 5–15%.

Output.

strategy monthly cost savings vs all-hot
all hot $209,715 baseline
7d hot / 30d warm / 365d cold ~$24,165 89% saved
1d hot / 365d cold ~$24,121 89% saved

Rule of thumb. Once any topic exceeds a few weeks of retention, tiered storage saves an order of magnitude on bills. Both Pulsar and Kafka offer it; the cost shape is the same; the decision is "which engine is your team already running?"

SQL interview question on cold-tier hit rate

A senior interviewer might probe: "Given a per-fetch audit table that records (consumer_id, fetched_offset, tier_served, latency_ms), write a SQL query that returns the cold-tier hit rate per consumer over the last 24 hours — and flag consumers reading >50% from cold." This is the production health metric for any tiered-storage rollout.

Solution Using a conditional aggregation per consumer

WITH per_consumer AS (
    SELECT
        consumer_id,
        COUNT(*) AS total_fetches,
        SUM(CASE WHEN tier_served = 'cold' THEN 1 ELSE 0 END) AS cold_fetches,
        AVG(CASE WHEN tier_served = 'cold' THEN latency_ms END) AS cold_avg_ms,
        AVG(CASE WHEN tier_served = 'hot'  THEN latency_ms END) AS hot_avg_ms
    FROM fetch_audit
    WHERE fetch_ts >= NOW() - INTERVAL '24 hours'
    GROUP BY consumer_id
)
SELECT
    consumer_id,
    total_fetches,
    cold_fetches,
    ROUND(100.0 * cold_fetches / NULLIF(total_fetches, 0), 2) AS cold_pct,
    ROUND(cold_avg_ms, 1) AS cold_avg_ms,
    ROUND(hot_avg_ms, 1)  AS hot_avg_ms,
    CASE WHEN 1.0 * cold_fetches / NULLIF(total_fetches, 0) > 0.50
         THEN 'investigate'
         ELSE 'ok' END AS flag
FROM per_consumer
ORDER BY cold_pct DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

consumer_id total cold hot_avg cold_avg cold_pct
etl-historical 12,000 11,800 3.0 ms 240 ms 98.3%
ml-feature 50,000 1,000 2.5 ms 220 ms 2.0%
dashboard-live 500,000 0 2.0 ms NULL 0.0%

The first consumer is a historical backfill that legitimately reads cold; the second is a feature pipeline mostly on hot; the third is purely hot. The flag fires for the first.

Output:

consumer_id cold_pct flag
etl-historical 98.3 investigate
ml-feature 2.0 ok
dashboard-live 0.0 ok

Why this works — concept by concept:

  • Conditional aggregationSUM(CASE WHEN tier='cold' THEN 1 ELSE 0 END) is the portable way to count rows that satisfy a predicate inside a single pass. Some dialects support COUNT(*) FILTER (WHERE ...) as a cleaner equivalent.
  • NULLIF(total, 0) — turns the "divide by zero" runtime error into NULL when a consumer fetched nothing in the window. Standard safe-division idiom (Blog126 ref).
  • AVG(CASE WHEN ... THEN x END) — using NULL for the non-matching rows means AVG ignores them, giving per-tier mean latency in one pass.
  • The 'investigate' threshold — 50% is a starting heuristic; teams often tune to "consumer p95 latency exceeds 1s and cold_pct > 30%" before paging.
  • Cost — one pass over the 24h fetch_audit; with (consumer_id, fetch_ts) index the scan is local. Aggregation is O(n) per group.

SQL
Topic — aggregation
Aggregation problems (SQL)

Practice →


5. Pulsar Functions + multi-tenancy + topic-per-tenant model — picking by workload

pulsar functions is light per-topic stream processing baked into the broker — Kafka achieves the same with Streams + Connect outside the broker

The mental model in one line: Pulsar Functions are lightweight, per-topic transform tasks that run inside the Pulsar runtime; Kafka achieves the same workloads with the Streams library (inside your app) plus Kafka Connect (in a separate worker fleet). Combined with the topic-per-tenant model, this drives the most workload-shaped part of the apache pulsar vs kafka decision.

Visual diagram of Pulsar Functions runtime — left a Pulsar broker hosting a Function instance reading from input-topic and writing to output-topic in-process; right a Kafka topology with a separate Streams app pod and a separate Kafka Connect worker fleet doing the same input/output transform; a multi-tenancy chip showing tenants → namespaces → topics on the Pulsar side; on a light PipeCode card.

Pulsar Functions in detail.

  • Per-topic transform. A Function is a small Java/Python/Go program that reads from one or more input topics and writes to one or more output topics. The shape is (input) -> (output) — exactly the lambda function shape.
  • Built-in runtime. Functions run inside the Pulsar runtime: the Function Worker process (separate from the broker process but in the same cluster) hosts Function instances. There is also a "Kubernetes runtime" that runs each Function as a pod.
  • Deploy via admin. pulsar-admin functions create --tenant acme --namespace prod --name dedup --jar dedup.jar --classname Dedup --inputs raw-events --output clean-events. One command, one Function.
  • Window support. Tumbling windows and sliding windows are supported natively (since Pulsar 2.6). Stateful operations are supported via a key-value state store API.
  • Dead-letter topics + retry topics. Each Function can declare a DLQ topic and a retry topic — failed messages flow to DLQ instead of poisoning the pipeline.

Kafka Streams + Connect in detail.

  • Kafka Streams. A library, not a runtime. You embed it in your application JAR. It uses Kafka topics for internal state (changelog topics) and offset tracking.
  • Kafka Connect. A separate worker fleet for source and sink connectors. JDBC source → Kafka topic, Kafka topic → S3 sink, etc. Connectors are reusable modules.
  • Why the split. Streams is for in-process stream transforms inside your app; Connect is for moving data between Kafka and external systems. They cover the same surface area as Pulsar Functions + Pulsar IO connectors but as two separate deployment models.
  • Maturity. Kafka Streams is one of the most battle-tested stream-processing libraries in the world. Stateful joins, windowed aggregations, exactly-once with EOS-v2 — all mature.
  • Connect ecosystem. Hundreds of pre-built connectors (Debezium for CDC, S3 sink, BigQuery sink, MongoDB source). Pulsar IO has a smaller catalog but covers the major sinks/sources.

Multi-tenancy in detail.

  • Pulsar hierarchy. tenant → namespace → topic. Tenants are usually a customer or a business unit; namespaces group related topics; topics are the unit of data.
  • Per-level policies. Namespace-level retention, backlog quota, dispatch rate, replication-cluster list, schema validation. Tenant-level admin roles.
  • Kafka equivalent. Topic-level ACLs and quotas; per-client-id quotas. There is no tenant → namespace tree; the operator must encode the hierarchy as a topic-naming convention (e.g. acme.prod.events).
  • Why it matters. For multi tenancy kafka pulsar workloads with hundreds of customer-isolated streams, Pulsar's native model is operationally lighter. Kafka can do it; it just requires more application-layer plumbing.

Three workload shapes and the canonical choice.

  • Light per-topic transform (filter, mask, enrich). Pulsar Function in three lines vs. Kafka Streams app deployment. Pulsar wins on operational simplicity.
  • Stateful windowed aggregation, exactly-once. Kafka Streams is the most mature option. Pulsar Functions support windows but the ecosystem is thinner. Lean Kafka for hard stateful workloads, or pair Pulsar with Flink.
  • External system integration (DB sink, S3, ElasticSearch). Kafka Connect's connector catalog is the broadest in the industry. Pulsar IO has the major sinks; for niche systems you may need a custom connector.

Common interview probes on pulsar functions.

  • "Are Pulsar Functions stateful?" — yes, optionally. They have a built-in key-value state store; pure stateless Functions are also fine.
  • "How does a Pulsar Function compare to a Kafka Streams app?" — same workload shape, different deployment model. Pulsar Function runs in the cluster; Streams runs in your app.
  • "What is the topic-per-tenant pattern?" — one topic per tenant under a tenant-scoped namespace. Native to Pulsar; possible but heavier in Kafka.
  • "When would I still pick Kafka for a multi-tenant workload?" — when the tenant axis is small (<100), the throughput-per-tenant is high, and the team already runs Kafka. The operational cost of switching outweighs the architectural fit.

Worked example — a Pulsar Function that masks PII in three languages

Detailed explanation. A team needs to mask email addresses in a stream before downstream consumers see them. In Pulsar, this is a 20-line Function. In Kafka Streams, it is a small app — more files, more deploy steps, but conceptually equivalent.

Question. Write the masking Function in Python and Java. Deploy it. Compare with the Kafka Streams equivalent.

Input event.

{ "user_id": 42, "email": "alice@example.com", "action": "login" }
Enter fullscreen mode Exit fullscreen mode

Code (Python Function).

# mask.py
from pulsar import Function
import re

EMAIL = re.compile(r"([^@]+)@([^.]+)\.(.+)")

class MaskEmail(Function):
    def process(self, input, context):
        import json
        obj = json.loads(input)
        if "email" in obj and obj["email"]:
            obj["email"] = EMAIL.sub(r"***@\2.\3", obj["email"])
        return json.dumps(obj)
Enter fullscreen mode Exit fullscreen mode
# Deploy the Function
pulsar-admin functions create \
  --tenant acme --namespace prod \
  --name mask-email \
  --py mask.py \
  --classname mask.MaskEmail \
  --inputs persistent://acme/prod/raw-events \
  --output persistent://acme/prod/clean-events
Enter fullscreen mode Exit fullscreen mode

Code (Java equivalent).

// MaskEmail.java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.JsonNode;

public class MaskEmail implements Function<String, String> {
    private static final ObjectMapper M = new ObjectMapper();
    @Override
    public String process(String input, Context ctx) throws Exception {
        JsonNode n = M.readTree(input);
        String email = n.path("email").asText("");
        String masked = email.replaceFirst("^[^@]+", "***");
        ((com.fasterxml.jackson.databind.node.ObjectNode) n).put("email", masked);
        return M.writeValueAsString(n);
    }
}
Enter fullscreen mode Exit fullscreen mode

Kafka Streams equivalent (sketch).

StreamsBuilder b = new StreamsBuilder();
KStream<String, String> raw = b.stream("raw-events");
KStream<String, String> clean = raw.mapValues(MaskEmail::mask);
clean.to("clean-events");
KafkaStreams app = new KafkaStreams(b.build(), props);
app.start();
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The Pulsar Function deploy is one command. pulsar-admin functions create registers the code in the cluster, schedules instances, and starts forwarding messages. No app to build, no pod to deploy.
  2. The Pulsar Function lifecycle is cluster-managed. Upgrade with functions update. Scale with --parallelism N. Inspect with functions stats. Same admin pattern as topics.
  3. The Kafka Streams app is your application. You build a JAR, deploy it as a pod or service, manage its lifecycle through your normal CI/CD. The internal state (changelog topics) is in Kafka but the process is yours.
  4. For light transforms, Pulsar wins on overhead. No CI/CD pipeline needed; the cluster hosts the code.
  5. For complex stateful joins, the maturity gap reverses — Kafka Streams has a richer DSL, more operators, and a longer track record.

Output (operational comparison).

dimension Pulsar Function Kafka Streams app
deploy command functions create build + push + deploy
process owner cluster your team
scale knob --parallelism N replica count
state store built-in KV RocksDB + changelog topic
best fit light transforms, low ops stateful, mature DSL

Rule of thumb. Pulsar Functions are a 90% solution for the most common per-topic transforms (filter, mask, enrich, route). For stateful joins and windowed aggregations at scale, Kafka Streams or a pure stream-processing engine (Flink) remains the senior choice.

Worked example — the topic-per-tenant SaaS pattern in Pulsar

Detailed explanation. A SaaS platform creates one Pulsar tenant per customer. Each customer has a prod namespace and a staging namespace, each with a handful of topics. The number of customers grows from 100 to 10,000 over two years. The operational model scales gracefully.

Question. Walk through the topology, the per-tenant retention policy, and the per-tenant ACL. Show how a new tenant is provisioned.

Input.

dimension value
platform acme SaaS
customers 10,000
topics per customer 4 (events, metrics, errors, audit)
retention 30 days (default) / 90 days (premium)
isolation per-customer ACL

Code (provisioning, pulsar-admin).

# Provision a new tenant (one-off per customer signup)
pulsar-admin tenants create cust-123 \
  --allowed-clusters us-east,eu-west \
  --admin-roles cust-123-admin

# Create namespaces
pulsar-admin namespaces create cust-123/prod \
  --clusters us-east,eu-west
pulsar-admin namespaces create cust-123/staging \
  --clusters us-east

# Per-tier retention
pulsar-admin namespaces set-retention cust-123/prod \
  --size -1 --time 30d
# Upgrade to premium: 90d retention
pulsar-admin namespaces set-retention cust-123/prod \
  --size -1 --time 90d

# Per-customer rate limit
pulsar-admin namespaces set-publish-rate cust-123/prod \
  --msg-publish-rate 5000 --byte-publish-rate 10485760

# Topics are auto-created on first publish (auto-topic-creation enabled)
# or created explicitly:
pulsar-admin topics create-partitioned-topic \
  persistent://cust-123/prod/events --partitions 4
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Tenant. Each customer is its own tenant. Admin roles, allowed clusters, and policies cascade down.
  2. Namespace. Two per customer: prod and staging. Namespaces hold per-environment policies (retention, quotas, replication clusters).
  3. Topics. Four per environment: events, metrics, errors, audit. Auto-created on first publish if enabled, or pre-created.
  4. Retention. Set at the namespace level. Upgrading a customer to premium changes one config; no schema migration.
  5. Rate limits. Per-namespace publish and dispatch rates. Per-customer quota with zero application logic.
  6. Scaling math. 10,000 customers × 2 environments × 4 topics × 4 partitions = 320,000 partitions. Comfortably inside Pulsar's operating envelope.

Output (topology summary).

level count per-level policy
tenant 10,000 admin roles, clusters
namespace 20,000 retention, ACL, quota, replication
topic 80,000 schema, compaction
partition 320,000 striped across BookKeeper

Rule of thumb. If your SaaS platform serves more than a few hundred isolated tenants and each tenant needs per-tenant policy, the Pulsar tenant/namespace tree is doing significant work for you. The same model in Kafka would require a heavy application-layer abstraction over plain topic ACLs.

Worked example — when to still pick Kafka in 2026

Detailed explanation. Pulsar's architectural advantages are real, but Kafka still wins on a meaningful slice of workloads. Listing the cases explicitly avoids the "we should use Pulsar everywhere" bias.

Question. Identify three workload shapes where Kafka remains the senior choice. Argue each from first principles.

Input — three scenarios.

Scenario A: a single very-high-throughput event bus
- 5 GB/s sustained on one topic
- modest topic count (10-50)
- team has 5 years of Kafka operations

Scenario B: a Confluent / Aiven / MSK customer
- already paying for managed Kafka
- has Kafka Connect connectors for Snowflake, S3, BigQuery
- migration cost > architectural benefit

Scenario C: a stateful exactly-once stream-processing workload
- complex windowed joins, exactly-once semantics
- Kafka Streams DSL fits the team's mental model
- Pulsar + Flink is the alternative, but extra fleet
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Scenario A. Kafka's coupled compute+storage model is extremely efficient for a few hot topics with high throughput. The lack of an extra hop (broker → bookies) gives a small but real latency advantage. The team's existing Kafka expertise is a multiplier.
  2. Scenario B. Vendor lock-in cost. Confluent and Aiven and MSK have rich managed-service stories. Switching to Pulsar means moving off a managed Kafka SaaS, redeploying connectors, retraining the team. The operational cost dwarfs the architectural fit.
  3. Scenario C. Kafka Streams has the deepest stateful-processing DSL. Pulsar Functions support windows but not at the same maturity. For exactly-once windowed joins on a large state, Streams + EOS-v2 is the senior choice (or Flink, paired with either engine).
  4. The common thread. Kafka wins when (a) the workload's primary constraint is single-region throughput or stateful processing maturity, and (b) the team already runs Kafka. Pulsar wins when the primary constraint is multi-tenancy, multi-region, or topic-count.

Output (decision summary).

workload primary constraint senior choice
single-region max throughput Kafka
existing managed Kafka SaaS Kafka (migration cost)
stateful exactly-once joins Kafka Streams (or Flink + either)
multi-region active/active Pulsar
topic-per-tenant SaaS Pulsar
light per-topic transforms Pulsar Functions
frequent broker churn (autoscaling) Pulsar

Rule of thumb. "What is your team running today?" is usually the strongest signal. Architectural elegance does not outweigh five years of operational muscle.

SQL interview question on a per-tenant throughput SLO

A senior interviewer might frame this as: "Given a per-tenant throughput audit table, write a SQL query that returns each tenant's hourly message rate and flags tenants exceeding their configured rate limit. This is the daily multi-tenancy health check."

Solution Using a per-tenant hourly aggregate with a join to the rate-limit config

SELECT
    t.tenant_id,
    DATE_TRUNC('hour', a.ts) AS hour_bucket,
    COUNT(*) AS messages_in_hour,
    t.msg_per_sec_limit,
    ROUND(COUNT(*) / 3600.0, 2) AS observed_msg_per_sec,
    CASE
        WHEN COUNT(*) / 3600.0 > t.msg_per_sec_limit THEN 'exceeded'
        WHEN COUNT(*) / 3600.0 > 0.8 * t.msg_per_sec_limit THEN 'warning'
        ELSE 'ok'
    END AS slo_status
FROM tenant_audit a
JOIN tenants t
  ON t.tenant_id = a.tenant_id
WHERE a.ts >= NOW() - INTERVAL '24 hours'
GROUP BY t.tenant_id, DATE_TRUNC('hour', a.ts), t.msg_per_sec_limit
HAVING COUNT(*) / 3600.0 > 0.8 * t.msg_per_sec_limit
ORDER BY observed_msg_per_sec DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

tenant_id hour msgs limit observed status
cust-42 14:00 18,500,000 5,000 5138.9 exceeded
cust-77 14:00 14,800,000 5,000 4111.1 warning
cust-7 14:00 3,000,000 5,000 833.3 ok (filtered out)

The HAVING clause keeps only tenants that crossed 80% of their limit, surfacing only the SLO-relevant rows.

Output:

tenant_id hour observed limit status
cust-42 14:00 5138.9 5000 exceeded
cust-77 14:00 4111.1 5000 warning

Why this works — concept by concept:

  • DATE_TRUNC('hour', ts) — buckets timestamps to hour boundaries. Standard idiom for hourly aggregation in Postgres / Snowflake / BigQuery (TIMESTAMP_TRUNC in BigQuery).
  • JOIN to tenants config — the SLO threshold lives in a separate config table; joining brings the per-tenant limit into the same row as the observed rate.
  • HAVING for SLO filter — push the SLO filter into HAVING so the query returns only actionable rows. Tenants at 5% of their limit are noise.
  • Three-tier status — 'exceeded' / 'warning' / 'ok' lets the same query feed both paging alerts and quarterly review dashboards.
  • Cost — one pass over 24h tenant_audit with (tenant_id, ts) index; the join is hash-join to a small tenants table. Aggregate is O(rows-in-window).

SQL
Topic — event-processing
Event-processing problems (SQL)

Practice →


Cheat sheet — Pulsar vs Kafka recipes

  • Multi-region active/active default. Pulsar: set the namespace replication-cluster list (pulsar-admin namespaces set-clusters tenant/ns --clusters us,eu,ap). Kafka: deploy MirrorMaker 2 with bidirectional flows and the cluster-alias prefix convention. Pulsar wins on operational lightness; Kafka wins on tooling maturity.
  • Infinite retention without infinite SSD. Pulsar: enable the BookKeeper offloader (managedLedgerOffloadDriver=aws-s3 + namespace threshold). Kafka: enable KIP-405 tiered storage (remote.log.storage.system.enable=true + topic remote.storage.enable=true). Both engines now offer the same cost shape.
  • 100k+ topics, per-tenant policy. Pulsar: tenant/namespace/topic tree, per-namespace retention/quota/ACL. Kafka: per-topic ACL and per-client-id quota; encode tenants as a topic-naming convention. Pulsar wins on the high-topic-count axis.
  • Light per-topic stream transform. Pulsar: pulsar-admin functions create --jar transform.jar --inputs raw --output clean. Kafka: build a Kafka Streams app, deploy as a pod, monitor lifecycle. Pulsar Functions win on overhead.
  • Stateful exactly-once windowed aggregation. Kafka: Kafka Streams with EOS-v2 (state stores backed by changelog topics). Pulsar: Functions with built-in KV state + transactions (mature since 2.7), or pair Pulsar with Flink. Kafka Streams has the deeper DSL today.
  • External system integration (DB, S3, ES, BigQuery). Kafka: Kafka Connect with the open-source connector catalog (Debezium, Confluent Hub, Aiven connectors). Pulsar: Pulsar IO with a smaller but growing catalog. Kafka Connect's ecosystem is broader.
  • Strict ordering per key. Both engines: hash the key to a single partition; consumers see ordered messages within a partition. Pulsar adds the key_shared subscription type — multiple consumers share a topic but each key sticks to one consumer.
  • High-throughput single topic. Kafka still leads on the raw maximum throughput per partition. Pulsar reaches parity on non-persistent topics (no BookKeeper write).
  • Strong durability semantics. Both: quorum writes. Kafka: min.insync.replicas=2, acks=all. Pulsar: Qw=3, Qa=2. Same conceptual model, different knob names.
  • Strict exactly-once. Kafka: transactions + idempotent producer + read_committed consumer. Pulsar: transactions (since 2.7) + idempotent producer; consumer-side dedup via the transaction coordinator. Both engines now support exactly-once for the common patterns.
  • Stateless broker rebalance. Pulsar: broker ownership flips in seconds. Kafka: partition reassignment moves bytes; minutes to hours per hot partition. Pulsar wins when capacity churns.
  • Lightweight client. Both: many SDKs (Java, Go, Python, C++, Node.js, Rust). Pulsar adds first-class WebSocket protocol and a built-in HTTP/REST proxy — useful for browser-side producers.
  • Cluster topology. Kafka after KRaft: one process type (broker, with controller role on a quorum subset). Pulsar: three process types (broker, bookie, metadata). Kafka is operationally simpler at one-region scale; Pulsar offers more independent scaling axes.

Frequently asked questions

Is Pulsar faster than Kafka?

It depends on the metric. For raw single-topic throughput, Kafka still leads — one fewer hop in the write path means a slight latency edge, and the coupled broker+log model is extremely well-optimised. For rebalance and recovery time on a churning cluster, Pulsar is dramatically faster — broker ownership flips in seconds because no data movement is required. For multi-region active/active, Pulsar's native geo-replication ships data with less operational overhead than Kafka MirrorMaker 2. In production, "faster" usually means "matches our workload shape" — and that is a workload question, not a benchmark one.

Does Pulsar require ZooKeeper like Kafka used to?

Historically, yes. Pulsar 2.x used ZooKeeper as the metadata store for both broker coordination and BookKeeper ledger metadata. Pulsar 3.x added support for etcd as the metadata backend, reducing the ZooKeeper dependency. Kafka's KRaft (Kafka Raft) protocol replaced ZooKeeper inside the Kafka broker process; modern Kafka 3.7+ deployments are ZooKeeper-free by default. Both engines have therefore moved away from ZooKeeper, though Pulsar still has more deployment topologies that include it.

Can I run Pulsar Functions instead of Kafka Streams?

Yes, for a large class of workloads. Pulsar Functions are a great fit for light per-topic transforms (filter, mask, enrich, route) — they deploy in one admin command, run inside the cluster, and require no separate app. For stateful exactly-once windowed aggregations, Kafka Streams currently has the deeper DSL and longer track record; Pulsar Functions support windows and state but the ecosystem is thinner. Many teams pair Pulsar with Apache Flink for the heaviest stream-processing workloads, which is also a popular pattern with Kafka.

How does Pulsar geo-replication compare to MirrorMaker 2?

Pulsar geo-replication is a per-topic property — set a replication-cluster list at the namespace level and writes flow to every listed region asynchronously, with the same topic name in every region. Kafka MirrorMaker 2 is a separate Kafka Connect deployment that reads from a source cluster and writes to a destination cluster, with a source-cluster prefix in the destination topic name (prod-us.events) and an explicit offset-translation table. Pulsar wins on operational simplicity and on the active/active default; Kafka MirrorMaker 2 wins on flexibility and on ecosystem maturity (it has been in production at large scale since 2019).

What is BookKeeper and why does Pulsar use it?

Apache BookKeeper is a distributed write-ahead log service designed at Yahoo around 2011. It pre-dates Pulsar and is independently used by HDFS NameNode high availability, Apache DistributedLog, and others. Pulsar uses BookKeeper because it gives Pulsar a separately scalable storage tier — Pulsar brokers can be stateless because the durable storage lives in BookKeeper bookies. The unit of storage is a ledger (an append-only segment) striped across an ensemble of bookies with configurable write-quorum (Qw) and ack-quorum (Qa). This is the architectural pivot that distinguishes Pulsar from Kafka's coupled-storage model.

When should I pick Pulsar over Kafka in 2026?

Pick Pulsar when multi-region active/active is a primary requirement (Pulsar's native geo-replication is operationally lighter than MirrorMaker 2), when your workload has 100+ tenants with per-tenant policy (the tenant/namespace tree maps perfectly to SaaS), when your cluster churns brokers often (stateless rebalance is seconds, not hours), or when you want light per-topic stream processing inside the cluster (Pulsar Functions). Pick Kafka when your team already runs it, when single-region maximum throughput is the primary constraint, when stateful exactly-once joins are the main workload (Kafka Streams), or when you depend on the breadth of Kafka Connect's connector catalog. Both engines are mature, and "what the team operates today" is usually the strongest tie-breaker.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every Pulsar-vs-Kafka recipe above ships with hands-on practice rooms where you write the dedup query, the per-tenant SLO check, and the cross-region reconciliation against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your fix to a replay-induced duplicate behaves the same on Pulsar as on Kafka.

Practice streaming now →
Event-processing drills →

Top comments (0)