RAKESH THERANI

Posted on Jun 15

When a 1-Millisecond Disk Sync Turns Into 40: A ClickHouse Keeper fsync Chain Reaction

A production story about how running ClickHouse Keeper on a Kubernetes local-path disk — one shared with other I/O-heavy services on the node — turned a ~1-millisecond operation into a ~40-millisecond one — and how that single latency difference cascaded into recurring "replication lag" alerts that three wrong diagnoses couldn't explain.

TL;DR

Symptom: Recurring Replication Lag (queue > 100) warnings on our production ClickHouse cluster (lat-ams), always on the same two nodes. The identical workload on dev (lat-fra2) never alerted.
Root cause: ClickHouse Keeper's raft log uses an fsync on every committed write. Two of our three keepers ran on Kubernetes local-path storage whose underlying disk was shared with other I/O-heavy workloads on the node. Kubernetes isolates CPU and memory between pods, but not disk I/O. That competing disk traffic starved the keeper's fsync, pushing keeper latency from ~1 ms (uncontended) to 28–41 ms (contended).
The cascade: Unequal keeper latency → the node on the fast keeper commits inserts faster → it fragments its share of the data into ~2.3× more tiny parts → the slow-keeper nodes must fetch that flood → their replication fetch queues pile up past 100 → alert.
The subtlety: The skew is relative, not absolute. ClickHouse replication is peer-coupled — every part written on any node must be fetched by the other two. A node only "lags" when it can't keep pace with its peers' combined output. Uniformity, not raw speed, is what keeps a cluster balanced.
The fix: Point all three ClickHouse servers at the single uncontended keeper (zookeeper_load_balancing: in_order with that keeper listed first), and pin raft leadership to it. Uniform keeper latency → no node lags its peers → skew gone. Measured: per-node keeper-latency spread collapsed from ~31 ms to ~6 ms.
The durable fix: Give Keeper a disk nothing else is hammering. It does not need a dedicated machine — Keeper is latency-bound, not throughput-bound (a few cores, <1 GB RAM, <300 MB disk). It needs disk isolation, achievable with anti-affinity scheduling or a small dedicated volume.

1. The setup

Our analytics pipeline looks like this:

Solana/EVM engines ──> NATS JetStream ──> ClickHouse (NATS engine) ──> MergeTree tables
                       (competing consumers)   (3 replicas, ReplicatedMergeTree)

3 ClickHouse servers running ReplicatedMergeTree, replicated through a 3-node ClickHouse Keeper ensemble (raft-based, ZooKeeper-compatible).
NATS JetStream delivers events. ClickHouse's built-in NATS engine pulls them as a competing-consumer group — whichever replica asks for the next batch first gets it.
Real-time SLA: data must land within seconds. That means small, frequent flushes — and therefore lots of small "parts."

Keeper is the coordination brain. Every replicated insert, every merge, every part registration goes through it. In raft, the leader must fsync each committed log entry to disk and get a quorum of followers to do the same before the write is acknowledged. That fsync is the heartbeat of the whole cluster.

2. The symptom

Production fired this, over and over, always on nodes 0-0 and 0-2:

ClickHouse Health Check: Replication Lag
NODE                   TABLE                    DELAY_SEC  QUEUE  INSERTS  MERGES
chi-server-apex-0-2-0  datastreams.wallet_balance   2       134     119      15
chi-server-apex-0-0-0  datastreams.market_ohlcv_long 8      132     122      10

Meanwhile dev — same code, same schema, same ClickHouse version — never alerted once. That contrast is the whole mystery. Whatever was different had to be environmental, not logical.

Critically, the cluster was healthy the entire time: data was fresh (lag ≤ a few seconds), readonly = 0, nothing was stuck. The queue depth looked alarming (it spiked to 1300+) but it was a standing fetch backlog, churning in seconds — not a stall. This was a cosmetic alert hiding a real, subtle imbalance.

3. Three wrong turns (and why they were wrong)

This problem was genuinely hard. We chased three plausible culprits before the real one. Documenting the dead ends matters as much as the answer.

Wrong turn #1: "Keeper is just slow"

The first instinct was raw keeper latency. But the numbers refuted a naive reading: prod keeper latency (17/38/45 ms) was actually better than dev's (60–73 ms) — yet prod alerted and dev didn't. Absolute latency wasn't the driver. (Hold that thought — the relative spread was.)

Wrong turn #2: "It's the phantom-fetch bug"

The slow nodes showed their fetch pool pinned at 128/128 ("128 fetches already executing") while system.replicated_fetches showed ~0 actually transferring. That's the textbook signature of a known ClickHouse bug (#22438 — leaked fetch-pool slots after a keeper session expiry).

Two checks killed it:

SYSTEM STOP FETCHES; SELECT sleep(3); SYSTEM START FETCHES did nothing — there was no leaked counter to clear.
The DownloadPart rate told the truth: the "stuck" node had downloaded 77,000 parts in 5 minutes. A phantom downloads ~0. The actually_fetching = 0 reading was a snapshot artifact — these parts are so small they're fetched in milliseconds, so a point-in-time query almost always catches zero in flight.
The cited bugs (#22438, #25836) were closed in 2021 — fixed five years before our version. (Always check the issue state via the API before citing it. We didn't, at first.)
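The snapshot artifact can be made concrete with Little's law (average number in flight = completion rate × time in flight). The sketch below uses the 77,000 downloads in 5 minutes quoted above; the 2 ms per-fetch duration is an assumption standing in for "fetched in milliseconds", not a measured value.

```python
# Little's law: average number in flight = arrival rate x time each item spends in flight.
PARTS_DOWNLOADED = 77_000       # DownloadPart count quoted above
WINDOW_SECONDS = 5 * 60         # the 5-minute window quoted above
ASSUMED_FETCH_SECONDS = 0.002   # ASSUMPTION: ~2 ms per tiny-part fetch, not measured


def average_in_flight(completed: int, window_s: float, duration_s: float) -> float:
    """Expected number of fetches a point-in-time query would catch in flight."""
    rate_per_s = completed / window_s
    return rate_per_s * duration_s


fetches_per_second = PARTS_DOWNLOADED / WINDOW_SECONDS
in_flight = average_in_flight(PARTS_DOWNLOADED, WINDOW_SECONDS, ASSUMED_FETCH_SECONDS)
print(f"{fetches_per_second:.0f} fetches/s, ~{in_flight:.2f} in flight on average")
```

Under that assumption a node completing hundreds of fetches per second still averages well under one fetch in flight, so a single point-in-time read of system.replicated_fetches will usually show zero or close to it.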

Wrong turn #3: "Reduce the ZK transaction rate"

We measured prod driving Keeper at ~21,000 transactions/second, peaking at 27,000 — about 7× Altinity's ~3,000/s guideline. A real finding! But it's a workload characteristic (all those writes funnel to the leader regardless of which keeper a client connects to). High ZK volume made everything more sensitive, but it wasn't the cause of the per-node skew. It became a separate long-term efficiency item, not the fix.

The lesson from all three: a snapshot can lie (phantom), a real metric can be a red herring (ZK rate), and the obvious metric can point the wrong way (absolute latency). The truth came only from comparing dev and prod side by side and asking what is actually different.

4. The fsync primer: why Keeper is latency-bound

To understand the root cause you have to understand what Keeper does on every write.

Keeper (like ZooKeeper) is a strongly-consistent coordination store. When ClickHouse registers a new part, the keeper leader:

Appends the operation to its raft transaction log.
Calls fsync() to force that log entry durably to disk — because if the leader crashes, that entry must survive.
Replicates the entry to followers, who also fsync it.
Acknowledges the write once a quorum (2 of 3) has durably committed.

The working set is tiny — our entire keeper state was ~280 MB on disk (raft log + snapshots), ~290 MB in memory, ~845k znodes, ~640 MB RAM. The writes are small, frequent, fsync'd log appends.

This makes Keeper fundamentally latency-bound, not throughput-bound. It does not want IOPS or capacity or bandwidth. It wants one thing: a low, predictable fsync latency. On a quiet local SSD that's ~1 ms. The entire cluster's write-commit speed is gated by it.

And here is the trap: fsync latency is exquisitely sensitive to what else is hitting the same physical disk. A single heavy writer sharing the device can stretch a 1 ms fsync into 40 ms.
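To get a feel for this on your own disk, here is a minimal fsync latency probe. It is a sketch, not the probe we ran in production: the function name, the 512-byte record size, and the number of writes are all illustrative choices. It mimics the access pattern described above (a small append followed by an fsync) and reports per-write latency.

```python
import os
import statistics
import tempfile
import time


def fsync_latencies_ms(directory: str, writes: int = 100, record_size: int = 512) -> list[float]:
    """Append small records to a scratch file, fsync after each, return per-write latency in ms."""
    record = b"\0" * record_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)  # scratch file on the filesystem under test
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, record)
            os.fsync(fd)  # block until the kernel reports the data is on stable storage
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


# Point this at a directory on the disk you want to test (for example the keeper data dir).
samples = fsync_latencies_ms(tempfile.gettempdir())
print(f"median {statistics.median(samples):.2f} ms, max {max(samples):.2f} ms over {len(samples)} fsyncs")
```

Run it once on a quiet disk and once while another process writes heavily to the same device, then compare the median and the max. Note that the system temp directory may be a RAM-backed tmpfs, where fsync is nearly free, so pass a directory on the real device for a meaningful number.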

5. The discovery: a contended disk

We measured each keeper's internal latency, and the pattern jumped out — keeper speed tracked how busy its underlying disk was with other workloads:

Keeper	Disk shared with other heavy I/O?	Internal latency
keeper-0-1	✅ no — it had the disk to itself (idle node)	1 ms (low pending)
keeper-0-0	⚠️ yes (busy node)	28 ms (71 pending)
keeper-0-2	⚠️ yes (busiest node)	41 ms (125 pending)

The correlation is unambiguous: the keeper with its disk to itself ran at 1 ms; the two sharing their disk with other heavy I/O ran 28–41 ms (28–41×). The zk_outstanding_requests counter told the same story — ~30 on the isolated keeper vs 71/125 on the contended ones (requests backed up waiting on fsync).

The same measurements, mapped to where each keeper runs:

Keeper	Disk situation	Internal latency
keeper-0-1	alone on worker-09 (`md1` 3% full, no ClickHouse)	1 ms
keeper-0-0	shares the worker-06 disk with other services	28 ms (28×)
keeper-0-2	shares the worker-08 disk with other services (heaviest writer)	41 ms (41×)

Keeper latency scales directly with how much I/O from other services is hammering its shared disk.

We confirmed the mechanism three independent ways:

df inside the keeper pods: on the contended nodes, the keeper's data dir (/var/lib/clickhouse-keeper) sat on the same physical disk array (/dev/md1) as ~2 TB of other application data. Same spindle.
A direct fsync throughput probe: the isolated node hit 918 MB/s; the contended nodes 535 and 430 MB/s. The shared disks were measurably slower even at idle.
Keeper raft-commit ProfileEvents: KeeperLatency per commit was 4.6 µs on the idle keeper vs 102.6 µs (22×) on the busy one.

The Kubernetes blind spot

The natural objection: "but these are Kubernetes pods with resource requests — the keeper has guaranteed CPU and memory." Correct, and that's exactly why this is sneaky.

Kubernetes requests/limits isolate CPU and memory. They do not isolate disk I/O. A local-path PersistentVolume is just a directory on the node's physical disk, with no IOPS or bandwidth guarantee. So even with CPU fully reserved, every other workload writing to that disk competes with the keeper's raft fsync on the same spindle, and fsync loses. CPU allocation cannot fix a shared-disk bottleneck.

The fast keeper was fast for one reason only: nothing else was hammering its disk.

Per-node CPU was elevated on the busy nodes but Kubernetes-guaranteed — not the bottleneck (disk I/O is). It evens out after the fix.

6. The chain reaction

Root cause in one paragraph. The skew is relative, not absolute. Replication is peer-coupled: every part written on any node must be fetched by the other two, so each node must keep pace with its peers' combined output. The 3 keepers run at unequal speed — keeper-0-1 has its disk to itself (uncontended → ~1 ms), while keeper-0-0 and keeper-0-2 sit on Kubernetes local-path disks shared with other heavy I/O workloads (→ 28–41 ms). The earlier nearest_hostname setting spread the ClickHouse nodes 1-1-1 across these unequal keepers → the fast-keeper node out-produces the slow-keeper nodes → their fetch queues pile up → alert. (Kubernetes isolates CPU/memory but not disk I/O — which is why the shared disk bites under prod's load but not on idle dev.)

Here's how a millisecond-scale disk difference becomes a paging alert. This is the heart of the story.

Two keepers sit on a disk shared with other heavy I/O workloads
        │
        ▼
Their raft fsync is starved → keeper latency 28–41 ms (vs 1 ms uncontended)
        │
        ▼
The ClickHouse node connected to the FAST keeper commits its inserts fastest
        │
        ▼
Committing faster → it flushes its NATS buffer more often → into MANY SMALL parts
        │
        ▼
Replication is PEER-COUPLED: every part on any node must be fetched by the other two
        │
        ▼
The fast node floods tiny parts → the slow-keeper nodes must fetch them all
        │
        ▼
Their fetch (GET_PART) queues pile up past 100 → "Replication Lag" alert
        │
        ▼
... on exactly the two nodes whose keeper sits on a shared disk. Every time.

The data that proves it

The most clarifying measurement was per-node write distribution over 15 minutes:

Node	NewParts	Rows ingested	Rows / part	% of parts
0-1 (fast keeper)	252,024	4.74 M	19	49.5%
0-0 (slow keeper)	146,588	4.68 M	32	28.8%
0-2 (slow keeper)	110,114	4.67 M	42	21.6%

Tiny-part merges per node — skewed on the left (one node dominates, and the dominance flips between nodes), then all three converge after the fix.

Look closely. The rows are essentially equal — 4.74M / 4.68M / 4.67M. NATS consumption is balanced; the engines deliver the same data to every node. What's skewed is part count: the fast-keeper node shatters its equal share of data into 2.3× more parts (19 rows/part vs 42).

Healthy MergeTree parts hold 10,000–100,000 rows. These were 19–42 rows per part — a tiny-part flood. The fast-keeper node fragments hardest because it commits fastest: each quick commit flushes a smaller buffer.

And tiny parts are precisely what hammers Keeper (every part = several ZK operations × 3 replicas) and what fills the fetch queues (every tiny part = a GET_PART on the other two nodes). The 21,000 ZK txns/sec and the queue>100 alerts are the same phenomenon viewed from two angles.

Inter-node replication traffic — the fetch flood the slow-keeper nodes must absorb. It evens out after the fix.

Why "relative, not absolute" is the key insight

This is the line that finally made it click. Replication in ClickHouse is peer-coupled: there is no central writer. Each replica must independently fetch every part the others produce. So a node is healthy only as long as it can keep pace with its peers' combined output.

When all keepers are equal, every node produces and fetches at the same rate — equilibrium. When one keeper is faster, its node out-produces the others, and the slower-keeper nodes fall behind on fetches even though nothing is broken. The slowest absolute latency in the world is fine as long as it's uniform. It's the spread that kills you.
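The peer-coupling argument can be checked with simple arithmetic on the 15-minute NewParts counts from the table above. In this sketch (the function name is ours, for illustration) each node's fetch burden is simply everything its peers produced.

```python
# NewParts per node over 15 minutes, taken from the write-distribution table above.
new_parts = {"0-1": 252_024, "0-0": 146_588, "0-2": 110_114}


def fetch_burden(parts_by_node: dict[str, int]) -> dict[str, int]:
    """Parts each node must fetch = total produced by the cluster minus its own output."""
    total = sum(parts_by_node.values())
    return {node: total - own for node, own in parts_by_node.items()}


burden = fetch_burden(new_parts)
for node, to_fetch in burden.items():
    print(f"node {node}: produced {new_parts[node]:>7,}  must fetch {to_fetch:>7,}")
```

The fast-keeper node (0-1) has the lightest fetch burden because its peers produce the fewest parts, while the two slow-keeper nodes each have to fetch roughly 1.4× to 1.55× as many. The node that produces the most is also the one asked to fetch the least, which is why the backlog lands on the other two.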

7. Why dev never broke

Dev runs the same code on shared-disk keepers too. In fact, all 3 dev keepers sit on shared local-path disks, versus 2 of 3 on prod. So why was dev balanced?

Two reasons, neither of which is good keeper placement:

Lower load. Dev's nodes sat near-idle (0–1% CPU). With little ClickHouse I/O, the shared disk wasn't contended, so all three keepers ran at similar speed.
Uniform connection mapping. By luck of the default load-balancing, all three dev ClickHouse servers connected to the same single keeper. Identical keeper experience for every node → no node could out-produce another → no skew.

Dev was the accidental proof of the fix: all-clients-on-one-keeper → uniform → balanced. Not because that keeper was fast — because every node shared the same keeper, fast or slow.

8. The fix

Step 1 — Uniformity (the skew fix)

We set, in the ClickHouse config:

zookeeper:
  zookeeper_load_balancing: in_order
  nodes:
    - chk-keeper-apex-0-1   # listed FIRST → the uncontended, fast keeper
    - chk-keeper-apex-0-0
    - chk-keeper-apex-0-2

in_order makes every ClickHouse server try the keepers in the listed order, so each one connects to the first keeper that is reachable. By moving keeper-0-1 (the one alone on worker-09) to the front, all three nodes now share one fast, uncontended keeper, replicating dev's proven, balanced model.

This is safe: the load-balancing setting only steers the ClickHouse client's connection preference. It does not touch the keeper raft ensemble, quorum, or leadership. It applies via a rolling ClickHouse restart (quorum-safe, ~15 min catch-up, 0 readonly), and is fully reversible.

Step 2 — Leadership pin (a bonus, not strictly required)

Connecting all clients to keeper-0-1 makes reads uniform. But raft writes always go to whichever node is the leader — and if the leader is a shared-disk keeper, every write still commits on a slow fsync. So we also requested leadership for keeper-0-1 at runtime:

echo rqld | nc -w2 <keeper-0-1-host> 2181   # "request leadership"

Now the leader's fsync lands on the fast, dedicated disk too, with no follower→leader forward hop for the clients.

Important nuance: Step 2 is an optimization, not the core fix. Uniformity alone (Step 1) removes the skew, because all three nodes get the same keeper experience regardless of which one is leader. The leader-pin additionally lowers the absolute write-commit latency. The catch: there is no per-server raft-priority knob in our Keeper operator, so the leader-pin isn't durable — a keeper restart or election can move leadership back to a shared-disk node, and it must be re-applied. That fragility is itself an argument for the durable fix below.
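Because the pin has to be re-applied after an election, it is a natural candidate for a small watchdog. The following is a hedged sketch rather than something we run: the helper names are invented for illustration, the host is whatever resolves to keeper-0-1 in your cluster, and it assumes the mntr and rqld four-letter commands are enabled on port 2181 as in the command above.

```python
import socket


def four_letter_word(host: str, command: str, port: int = 2181, timeout: float = 2.0) -> str:
    """Send a four-letter command to a Keeper node and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(reply: str) -> dict[str, str]:
    """Parse mntr output (one whitespace-separated key/value pair per line) into a dict."""
    stats = {}
    for line in reply.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            stats[parts[0]] = parts[1].strip()
    return stats


def needs_leadership_request(stats: dict[str, str]) -> bool:
    """True when the node reports any state other than leader."""
    return stats.get("zk_server_state") != "leader"


def ensure_leader(host: str) -> bool:
    """Send rqld to `host` if it is not currently the leader. Returns True if a request was sent."""
    if needs_leadership_request(parse_mntr(four_letter_word(host, "mntr"))):
        four_letter_word(host, "rqld")
        return True
    return False
```

Calling ensure_leader("<keeper-0-1-host>") from a cron job or sidecar would re-request leadership whenever it drifts. A request is not a guarantee, so the next run should check again rather than assume the election went the intended way.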

What we did not do, and why

Bigger insert batches / longer NATS flush intervals — would genuinely reduce the tiny-part flood and the ZK rate, but our real-time SLA forbids holding data longer before flushing.
Bumping keeper CPU/memory — already guaranteed by Kubernetes; not the bottleneck.
Raising the alert threshold — masks the symptom, doesn't fix the imbalance.

9. The results

We have the receipts. Here is the per-node keeper-latency spread (max − min across the three nodes — the direct measure of skew) over 30 hours, spanning all three configurations:

Era	Keeper-latency spread	What was happening
`random` / default	~31–33 ms	one lucky node @6 ms, two @37 ms — wild skew
`nearest_hostname` (1-1-1 spread)	~20–30 ms	deterministic, but split across unequal keepers → relocated the skew
`in_order` + leader-pin (all on keeper-0-1)	~6 ms	uniform — every node on the same fast keeper
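The spread metric itself is trivial to compute, which is part of its appeal. A small sketch (function name is illustrative) using per-node numbers already quoted in this post: the random-era "one node @6 ms, two @37 ms" and the earlier prod snapshot of 17/38/45 ms.

```python
import statistics


def latency_spread_ms(per_node_latency_ms: list[float]) -> float:
    """Skew measure used in the table above: max minus min across the nodes."""
    return max(per_node_latency_ms) - min(per_node_latency_ms)


random_era = [6.0, 37.0, 37.0]      # "one lucky node @6 ms, two @37 ms"
prod_snapshot = [17.0, 38.0, 45.0]  # prod keeper latency quoted in wrong turn #1

for label, values in (("random era", random_era), ("prod snapshot", prod_snapshot)):
    print(f"{label}: mean {statistics.mean(values):.1f} ms, spread {latency_spread_ms(values):.0f} ms")
```

The mean looks unremarkable in both cases. Only the spread shows that one node is living in a different latency regime from its peers.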

Alongside: 0 readonly throughout (only tiny restart blips), data fresh to the second, queues balanced and bounded. The recurring queue > 100 skew alert — the thing we set out to kill — was resolved.

The proof at the mechanism level

The spread number is the input. The real confirmation is downstream: did the tiny-part fragmentation — the thing that actually fed the fetch queues — even out? It did. Measured per-node over 15 minutes, before vs after:

	`nearest_hostname` (before)	`in_order` + leader-pin (after)
NewParts / node	252k / 146k / 110k	122k / 119k / 119k
rows per part	19 / 32 / 42 (skewed)	34 / 35 / 35 (uniform) ✅
rows ingested / node	even (~4.7 M)	even (~4.1 M)
fetch queue / node	886 / 71 / 1299 (skewed)	47 / 53 / 51 (balanced, low) ✅

Merge duration — the contended node spikes to 5–9 s, then all three collapse to ~0 ms after the fix.

ZooKeeper operations by type, per node — heavily skewed before, evenly split after.

This is the smoking gun, reversed. Pre-fix, the fast-keeper node fragmented its equal share of data into 19 rows/part while the slow-keeper nodes sat at 42 — and that imbalance was the alert. Post-fix, all three nodes fragment identically (34–35 rows/part) because they share one keeper, so no node out-produces its peers, and the fetch queues collapse from a skewed 886/71/1299 to a flat 47/53/51. The chain reaction is broken at its source.

A full clean audit afterward confirmed it across the board: 0 readonly, lag ≤1s, ingest freshness 0s on all key tables, replication churning with nothing stuck (oldest queue entry 0s, max retries ≤2), no partition over 100 parts, 0 detached parts, and no replication errors (only benign UNEXPECTED_PACKET_FROM_CLIENT client-protocol noise).

Still on the watch-list (no longer alert-causing):

ZK transaction rate ~16k/s avg, ~20k peak — still ~5× the ~3k/s guideline. This is the tiny-part volume (a workload characteristic), not the per-node skew. It stopped causing alerts once fragmentation became uniform, but it's the reason the durable fix (disk isolation, and eventually larger parts) still matters. (Note: the volume itself also dropped — cluster-total ZK fell from ~22.7k/s pre-fix to ~16k/s, because even fragmentation produces fewer parts overall.)
Absolute CH-side keeper latency stayed roughly flat — a steady ~36–42 ms two-hour average across all three nodes (individual 5-minute snapshots bounce higher, ~50–84 ms). Here's the honest footnote: the absolute latency did not drop to the ~5 ms the single lucky node used to enjoy. Raft commit needs a quorum fsync, and the two follower keepers are still on shared disks — they gate commit latency. So the uniform-connection fix eliminates the skew (which is what caused the alerts) and evens every node onto the same number; lowering that absolute number requires the durable disk-isolation fix in §10.
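The quorum point in the last item can be illustrated with a deliberately simplified model. It assumes the leader and followers persist an entry in parallel, ignores network time and batching, and uses the per-keeper internal latencies from §5 (1, 28 and 41 ms) as stand-ins for fsync time. None of that is exact; it only shows the shape of the problem.

```python
def quorum_commit_ms(fsync_ms: list[float]) -> float:
    """Simplified model: an entry commits once a majority of nodes have persisted it,
    so commit latency is the quorum-th fastest fsync in the ensemble."""
    quorum = len(fsync_ms) // 2 + 1
    return sorted(fsync_ms)[quorum - 1]


current = [1.0, 28.0, 41.0]        # one isolated keeper, two on shared disks
one_more_isolated = [1.0, 1.0, 41.0]  # hypothetical: a second keeper gets a quiet disk
all_isolated = [1.0, 1.0, 1.0]        # hypothetical: the durable fix applied everywhere

for label, ensemble in (("current", current),
                        ("one more isolated", one_more_isolated),
                        ("all isolated", all_isolated)):
    print(f"{label}: commit gated at ~{quorum_commit_ms(ensemble):.0f} ms")
```

In this model, pinning the leader to the fast keeper cannot bring commit latency below the faster of the two shared-disk followers, because a majority still needs one of them. Isolating just one more keeper's disk would already move the gate to the fast tier.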

10. The durable fix: isolate the disk, not the machine

The interim software fix (uniformity + leader-pin) is real but fragile — it depends on keeper-0-1 staying leader, and on no future restart scattering the client connections. The permanent answer is to remove the contention itself.

The instinct is "give each keeper a dedicated machine." But that's overkill, and we didn't have spare hardware. The key realization, which we sent to our infra team:

Keepers are latency-bound, not throughput-bound. The whole working set is tiny — ~280 MB on disk, ~290 MB in memory, ~845k znodes, ~640 MB RAM. The writes are small fsync'd raft-log appends, so it's a low-bandwidth workload where the thing that matters is fsync latency (~1–2 ms), not IOPS or capacity.

Our 28–41 ms problem isn't keepers wanting more I/O — it's their fsync getting starved by other workloads sharing the same disk on the node. Proof: the idle keeper (alone on its node) runs ~1 ms average, while the one sharing a disk with the heaviest CH writer runs ~33–41 ms average, with raft-commit latency spikes into the hundreds of milliseconds (peak ~840 ms measured) versus ~500 ms on the idle keeper.

So they do not need dedicated machines — a keeper is ~2 cores / <1 GB RAM / <300 MB disk; a whole dedicated worker is overkill. What it needs is a disk ClickHouse isn't hammering. It's a disk-isolation problem, not a node one.

Concretely, two viable approaches:

A dedicated low-latency volume for the raft log. ClickHouse Keeper splits storage into log_storage_path (raft log — latency-critical, fsync on every commit) and snapshot_storage_path (snapshots — can live on slower/shared storage). Point log_storage_path at a small dedicated SSD/NVMe device, separate from ClickHouse's data array. This is the canonical ClickHouse Keeper deployment rule.
Anti-affinity scheduling. Add scheduling rules so a keeper pod never shares a node (and disk) with other I/O-heavy workloads. Cheap, no new hardware, no ClickHouse restart.

Either gets all keepers to ~1 ms — uniform and fast — making the load-balancing trick and the leader-pin unnecessary.

11. How to diagnose this yourself

If you suspect keeper fsync contention, here's the battery (ClickHouse 26.x). All run via clusterAllReplicas.

Which keeper each node uses (is it balanced?):

SELECT hostName() AS ch_node, host AS keeper, session_uptime_elapsed_seconds AS up_s
FROM clusterAllReplicas('{cluster}', system.zookeeper_connection) ORDER BY ch_node;

Per-node keeper latency (CH-side, 5-min window):

SELECT hostName() AS host,
  -- metric_log ProfileEvent_* columns hold per-interval deltas, so sum them over the window
  round(sum(ProfileEvent_ZooKeeperWaitMicroseconds)
      / nullIf(sum(ProfileEvent_ZooKeeperTransactions), 0) / 1000, 1) AS keeper_ms
FROM clusterAllReplicas('{cluster}', system.metric_log)
WHERE event_time > now() - INTERVAL 5 MINUTE GROUP BY host ORDER BY keeper_ms DESC;

ZK transaction rate (note: metric_log ProfileEvents are per-second deltas — average the per-second sums, do not max - min):

SELECT round(avg(ps), 0) AS zk_tx_per_sec_avg, max(ps) AS peak FROM (
  SELECT event_time, sum(ProfileEvent_ZooKeeperTransactions) AS ps
  FROM clusterAllReplicas('{cluster}', system.metric_log)
  WHERE event_time > now() - INTERVAL 5 MINUTE GROUP BY event_time);
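The delta-versus-counter distinction is easy to get wrong, so here is a toy illustration. The four sample values are made up; the point is only what each aggregation means when the column already holds per-second increments.

```python
from itertools import accumulate

# HYPOTHETICAL per-second deltas, shaped like a ProfileEvent_* column in system.metric_log.
per_second_deltas = [100, 120, 90, 110]

# Correct for deltas: sum for the window total, mean for the per-second rate.
window_total = sum(per_second_deltas)
rate_per_second = window_total / len(per_second_deltas)

# Wrong for deltas: max - min only measures how much the rate wobbled.
max_minus_min = max(per_second_deltas) - min(per_second_deltas)

# max - min is only meaningful on a cumulative counter, reconstructed here for comparison.
cumulative = list(accumulate(per_second_deltas))
growth_on_counter = cumulative[-1] - cumulative[0]

print(window_total, rate_per_second, max_minus_min, growth_on_counter)
```

On the delta series, max minus min reports 30 where the real window total is 420. The same expression applied to the cumulative counter gives a sensible growth figure, which is why the mistake is so tempting.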

The smoking gun — write/fragmentation skew (rows even, but parts skewed?):

SELECT hostName() AS host, count() AS newparts,
       formatReadableQuantity(sum(rows)) AS rows_ingested,
       round(sum(rows) / count(), 0) AS rows_per_part
FROM clusterAllReplicas('{cluster}', system.part_log)
WHERE event_type = 'NewPart' AND event_time > now() - INTERVAL 15 MINUTE
GROUP BY host ORDER BY newparts DESC;

If rows_ingested is even across nodes but rows_per_part is skewed, you have a fragmentation imbalance — look at keeper latency next.
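That rule of thumb can be turned into a small check you feed with the query's output. This is a sketch under stated assumptions: the function name and both tolerance thresholds are arbitrary illustrative choices, and the "after" row counts use the approximate ~4.1 M per node from §9 rather than exact values.

```python
import statistics


def relative_spread(values: list[float]) -> float:
    """(max - min) / mean: a unitless measure of how uneven the values are."""
    return (max(values) - min(values)) / statistics.mean(values)


def fragmentation_skew(rows: dict[str, float], parts: dict[str, float],
                       rows_tolerance: float = 0.05, rpp_tolerance: float = 0.25) -> bool:
    """True when rows ingested are even across nodes but rows per part are not.
    Both tolerances are HYPOTHETICAL defaults, not tuned values."""
    rows_per_part = [rows[node] / parts[node] for node in parts]
    rows_even = relative_spread(list(rows.values())) <= rows_tolerance
    parts_skewed = relative_spread(rows_per_part) > rpp_tolerance
    return rows_even and parts_skewed


# Before the fix: figures from the 15-minute write-distribution table in section 6.
before_rows = {"0-1": 4.74e6, "0-0": 4.68e6, "0-2": 4.67e6}
before_parts = {"0-1": 252_024, "0-0": 146_588, "0-2": 110_114}

# After the fix: part counts from section 9; rows are APPROXIMATE (~4.1 M per node).
after_rows = {"0-1": 4.1e6, "0-0": 4.1e6, "0-2": 4.1e6}
after_parts = {"0-1": 122_000, "0-0": 119_000, "0-2": 119_000}

print("before:", fragmentation_skew(before_rows, before_parts))
print("after: ", fragmentation_skew(after_rows, after_parts))
```

With these inputs the check flags the pre-fix distribution and clears the post-fix one. In practice you would pick tolerances that suit your own workload's normal variation.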

Keeper internal latency (the mntr 4-letter word, per keeper pod):

echo mntr | nc -w2 <keeper-host> 2181 | grep -E 'zk_avg_latency|zk_outstanding_requests|zk_server_state'

Confirm the disk is shared (inside the keeper pod):

df -h /var/lib/clickhouse-keeper   # is it the same device as /var/lib/clickhouse?

Lessons Learnt

Kubernetes does not isolate disk I/O. CPU and memory requests lull you into thinking pods are isolated. A local-path volume shares the raw device. For any fsync-bound service (Keeper, ZooKeeper, etcd, Kafka, Postgres WAL), disk isolation is a first-class scheduling requirement, not an afterthought.
Coordination services are latency-bound, not throughput-bound. Don't size Keeper like a database. It wants a quiet, low-latency disk far more than a big or high-throughput one. A small dedicated volume beats a huge shared array.
"Relative, not absolute" — measure the spread, not the average. The single most misleading metric here was absolute keeper latency (prod's was better than dev's). The skew lived in the difference between nodes. For any peer-coupled system, the variance is the signal.
A snapshot can lie. system.replicated_fetches reading 0 sent us down a phantom-fetch rabbit hole. Operations that finish in milliseconds are nearly invisible to point-in-time queries, so use rates (DownloadPart over a window), not instantaneous counts.
Verify before you cite. We confidently quoted two GitHub bugs as the cause; both had been closed for five years. One gh api call would have saved an hour. Check the issue state.
Uniformity can beat optimization. The clean dev cluster wasn't clean because its keepers were fast — it was clean because every client used the same keeper. Sometimes "make everything equally mediocre" is a better, simpler fix than "make one thing fast."
Document the dead ends. Three wrong diagnoses preceded the right one. Writing them down — and why each was wrong — is what turns a painful afternoon into an asset the next engineer can stand on.

Cluster: 3-node ClickHouse ReplicatedMergeTree + 3-node ClickHouse Keeper, NATS JetStream ingestion, ClickHouse 26.2, on Kubernetes (RKE2) with local-path storage.

DEV Community