RAKESH THERANI

The 32-bit Hidden Countdown in ClickHouse Keeper: How an XID Overflow Gave Us Weekly Read-Only Bursts

A production debugging story: tracing recurring 2–5-second read-only storms on a ClickHouse cluster down to a single 32-bit integer — and the one-line fix.


TL;DR

Our production ClickHouse cluster started throwing read-only bursts on its replicated tables every few days. Each burst fired 4–5 Slack alerts and self-recovered in seconds. The cause wasn't network, disk, or load. It was a 32-bit transaction-ID counter (the Keeper "xid") overflowing at our request rate of ~6,500 requests/sec/pod. When it wrapped past 2.1 billion, ClickHouse force-killed its Keeper session, every replicated table on that pod went read-only for 2–5 seconds, and the disruption cascaded to the other replicas. The fix is one setting, use_xid_64, which widens the counter to 64 bits and pushes the overflow horizon from every few days to effectively never.


The symptom: a storm that fixes itself

The first thing we saw was a pattern, not a single failure. Every few days, the production cluster would emit a cluster of alerts:

  • TABLE_IS_READ_ONLY across many replicated tables — simultaneously.
  • KEEPER_EXCEPTION: Connection loss.
  • Replication queue stalls, then a clean automatic recovery within minutes.

Each event was short (~2–5 seconds of read-only per pod) and self-healing, which is exactly what makes this class of bug insidious: nothing stays broken long enough to catch in the act, and "it recovered on its own" tempts you to close the alert and move on. But it kept coming back — 9 events in 30 days on prod. Reads were never affected. Writes, during that 2–5s window, were a different story (more below).

A self-healing failure on a roughly fixed cadence is a fingerprint. It almost always means a counter or threshold is filling up, hitting a ceiling, resetting, and starting over.


What an "xid" is, and why it has a ceiling

ClickHouse keeps all its replication metadata — parts, leadership, the replication log, schemas — in ClickHouse Keeper, a Raft-based, ZooKeeper-protocol-compatible coordination service.

That protocol multiplexes many requests over one long-lived TCP session. To match each response to its request, every request carries a monotonically increasing xid (transaction ID). The client sends xid=1, 2, 3…; the server echoes it back. The xid only ever goes up, for the entire life of the session.

In the classic ZooKeeper protocol, the xid is a signed 32-bit integer — maximum 2,147,483,647. You can already see where this is going.


Root cause, confirmed at the source-code level

This wasn't a guess. It's right there in ClickHouse 26.2's src/Common/ZooKeeper/ZooKeeperImpl.cpp, in pushRequest():

info.request->xid = next_xid.fetch_add(1);                       // int64 atomic
if (!use_xid_64)
    info.request->xid = static_cast<int32_t>(info.request->xid); // ← truncates to 32-bit

if (info.request->xid < 0)
    throw Exception::fromMessage(Error::ZSESSIONEXPIRED, "XID overflow");

The counter itself is a 64-bit atomic. But when use_xid_64 = false (the default in older Altinity operator builds), every xid is cast down to int32_t. At 2.1 billion it wraps negative, the explicit xid < 0 check fires, and ClickHouse throws "XID overflow" and tears down the session as if it expired.
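The wrap is easy to reproduce outside ClickHouse. The sketch below emulates the cast and the sign check in Python. It illustrates the arithmetic only and is not the ClickHouse implementation; the names to_int32 and push_request_xid are hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647


def to_int32(value: int) -> int:
    """Emulate static_cast<int32_t>: keep the low 32 bits, two's complement."""
    return ((value + 2**31) % 2**32) - 2**31


def push_request_xid(next_xid: int, use_xid_64: bool) -> int:
    """Mirror the truncate-then-check logic from the snippet above."""
    xid = next_xid if use_xid_64 else to_int32(next_xid)
    if xid < 0:
        # ClickHouse raises ZSESSIONEXPIRED here; OverflowError is a stand-in.
        raise OverflowError("XID overflow")
    return xid


print(push_request_xid(INT32_MAX, use_xid_64=False))     # last valid 32-bit xid
print(to_int32(INT32_MAX + 1))                           # wraps negative
print(push_request_xid(INT32_MAX + 1, use_xid_64=True))  # fine with 64-bit xids
```

The request that fails is the one immediately after xid 2,147,483,647: with use_xid_64 off it truncates to a negative value and trips the check.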

Here's the actual error from prod system.text_log (2026-05-20 05:33:04 UTC):

void DB::StorageReplicatedMergeTree::mergeSelectingTask():
  Code: 999. Coordination::Exception: XID overflow. (KEEPER_EXCEPTION)

Stack trace:
  3. Coordination::ZooKeeper::pushRequest
  4. Coordination::ZooKeeper::get
  6. zkutil::ZooKeeper::tryGetWatch
  8. ReplicatedMergeTreeQueue::pullLogsToQueue
 11. StorageReplicatedMergeTree::mergeSelectingTask

And the immediate downstream cascade:

Code: 242. Table is in readonly mode (replica path:
  /clickhouse/tables/datastreams/<table>/0/replicas/...). (TABLE_IS_READ_ONLY)

Code: 999. Coordination::Exception: Coordination error: Connection loss,
  path .../is_active. (KEEPER_EXCEPTION)

The cascade mechanic is worth understanding: when the session dies, the pod's ephemeral /is_active znode operations fail; the other two replicas notice via expired znode ops (bounded by the 10s operation_timeout_ms) and briefly destabilize too. One pod's overflow becomes a short cluster-wide wobble.

The Keeper side stayed calm — which is the trap

On the Keeper pod, the same instant logged only:

<Information> KeeperTCPHandler: Got exception processing session #136:
  Code: 210. DB::NetException: I/O error: Broken pipe, while writing to
  socket (...:2181 -> ...:39890). (NETWORK_ERROR)

Note the level: <Information>, not Error or Warning. From Keeper's point of view, a client just disconnected — totally normal. Keeper is healthy throughout. This is why the bug is so easy to misattribute to "a network blip": the only Keeper-side trace looks like a routine TCP close. If you go looking for the cause in Keeper's error logs, you'll find nothing.


Why it's a hidden countdown: the math

Time-to-overflow is purely a function of request rate:

seconds_to_overflow ≈ 2,147,483,647 / keeper_requests_per_second_per_session

We measured the real rate via the ZooKeeperTransactions counter:

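| Cluster | Keeper req/sec/pod | Overflow cadence | Events in 30 days |
|---------|--------------------|------------------|-------------------|
| Prod    | ~6,500             | every few days   | 9                 |
| Dev     | ~3,500             | ~2× longer       | 1                 |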

2.15e9 / 6500 ≈ 330,000 sec ≈ 3.8 days. That matches the observed cadence of every few days. It also explains the dev/prod split: prod does roughly double the Keeper traffic, so it overflows roughly twice as often.
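A minimal sketch of this arithmetic in Python. It assumes the per-second rate comes from two samples of a cumulative counter taken a known number of seconds apart; the function names are illustrative, and the rates are the rounded figures from the table.

```python
INT32_MAX = 2_147_483_647
SECONDS_PER_DAY = 86_400


def request_rate(count_start: int, count_end: int, elapsed_seconds: float) -> float:
    """Average requests/sec between two samples of a cumulative counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (count_end - count_start) / elapsed_seconds


def days_to_overflow(requests_per_second: float, xid_max: int = INT32_MAX) -> float:
    """Days until a fresh session's xid reaches xid_max at a steady rate."""
    return xid_max / requests_per_second / SECONDS_PER_DAY


for cluster, rate in (("prod", 6_500), ("dev", 3_500)):
    print(f"{cluster}: ~{days_to_overflow(rate):.1f} days per session")
# prod: ~3.8 days per session
# dev: ~7.1 days per session
```

The estimate assumes a steady rate over the life of the session, so treat it as a cadence check rather than a precise prediction.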

It also explains the "it fixed itself after a restart" illusion. A restart opens a fresh session and resets the xid to zero — so the clock restarts, and the storm reliably returns a few days later. Every routine deploy and pod reschedule quietly resets the countdown and hides it.


Impact: small but non-zero write loss

Reads were never affected. But the write story matters, and it deserves an honest accounting:

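| Dimension            | Measurement                                     |
|----------------------|-------------------------------------------------|
| Read-only window     | ~2–5 sec per pod, +2–5 sec cascade to peers     |
| Reads                | Unaffected                                      |
| Replication recovery | Automatic, within minutes                       |
| Writes               | Small but non-zero loss per event               |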

The write loss is a consequence of the ingestion architecture: prod ingests from Core NATS (no JetStream durable redelivery yet). During the 2–5s read-only window, in-flight NATS messages whose materialized-view INSERT fails are simply dropped — there's no redelivery to catch them. Per event the loss is bounded (within normal minute-to-minute variance, ~hundreds of rows per pipeline), but across ~9–19 events/month it's cumulatively non-zero. (A separate migration to JetStream durable consumers closes this gap independently.)


The fix: use_xid_64 = true

The ZooKeeper protocol was extended to support a 64-bit xid. ClickHouse exposes it as a single setting on both the Keeper and the ClickHouse-server side.

Keeper config:

settings:
  keeper_server/coordination_settings/use_xid_64: 'true'   # ← ADD

ClickHouse server config (the block is historically named <zookeeper> but points at Keeper):

settings:
  zookeeper/use_xid_64: 'true'   # ← ADD

A 64-bit signed counter maxes at 9.2 × 10^18.
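For scale, here is the same division with the 64-bit ceiling, as a rough Python sketch. It uses the ~6,500 req/sec prod rate measured earlier and assumes a 365.25-day year for the conversion.

```python
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.25 * 86_400  # assumption: average Julian year

PROD_RATE = 6_500  # requests/sec/pod, the measured prod figure


def years_to_overflow(xid_max: int, requests_per_second: float) -> float:
    """Years until a fresh session's xid reaches xid_max at a steady rate."""
    return xid_max / requests_per_second / SECONDS_PER_YEAR


years_64 = years_to_overflow(INT64_MAX, PROD_RATE)
print(f"64-bit horizon: ~{years_64 / 1e6:.0f} million years")
print(f"headroom vs 32-bit: ~{INT64_MAX / INT32_MAX:.2e}x")
```

At the prod rate that works out to roughly 45 million years per session, so the counter stops being an operational concern.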

Requirement: ClickHouse Keeper 25.3+ (we're on 26.2.7.17 ✅). It's a vendor-blessed fix — the Altinity operator enables use_xid_64 by default from v0.27.0. We were on operator 0.25.0, which still ships the broken default, so we overrode it directly in our gitops overlay rather than waiting on the operator upgrade.

Two operational gotchas:

  1. Not hot-reloadable — read at startup, so enabling it needs a rolling restart of Keeper and ClickHouse.
  2. Must agree across the fleet — it's negotiated; you don't want a mixed 32/64 state, so roll it everywhere.

Rollout: dev first, prod off-peak, zero downtime

  1. Dev first. Update Keeper config → GitOps sync → rolling-restart Keeper pods one at a time (Raft quorum tolerates one down). Then the same for ClickHouse servers.
  2. Bake. Watch dev system.text_log for XID overflow for 2–4 weeks. Expect zero.
  3. Prod after dev validation, off-peak, with SRE sign-off, same node-by-node sequence (leader last).
  4. Rollback is trivial: set use_xid_64=false, restart.
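The node-by-node sequencing can be sketched as a tiny helper that puts the current Raft leader at the end of the restart order. This is only an illustration in Python: the pod names are hypothetical, and it assumes you already know which node is the leader.

```python
def rolling_restart_order(nodes: list[str], leader: str) -> list[str]:
    """Return nodes in restart order: followers first, current leader last."""
    if leader not in nodes:
        raise ValueError(f"leader {leader!r} is not in the node list")
    followers = [node for node in nodes if node != leader]
    return followers + [leader]


# Hypothetical three-node Keeper ensemble.
keeper_pods = ["keeper-0", "keeper-1", "keeper-2"]
for pod in rolling_restart_order(keeper_pods, leader="keeper-1"):
    # Restart one pod at a time and wait for it to rejoin before continuing,
    # so quorum is never short by more than one node.
    print(f"restart {pod}")
```

Restarting the leader last means the ensemble goes through at most one leader election during the rollout.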

Verification query, run weekly on both clusters — should always return 0:

SELECT count() FROM system.text_log
WHERE message ILIKE '%XID overflow%'
  AND event_time > now() - INTERVAL 7 DAY;

External evidence chain

This is a vendor-acknowledged issue with a vendor-blessed fix, not an exotic edge case:

  • Altinity KB: "XID is a transaction counter in Zookeeper, if you do too many transactions the counter reaches maxint32, and to restart the counter Zookeeper closes all the connections… a worst case we saw was once per 3 weeks."
  • Altinity Operator 0.27.0 release notes: "Enabled async_replication and use_xid_64 in Keeper default configuration. Requires Keeper 25.3 or above."
  • ClickHouse docs — the <zookeeper> section documents use_xid_64: "Enables 64-bit transaction IDs… Default: false."
  • Matching GitHub issues: #44415 (XID becomes negative), #61978 (XID overflow → Session expired), #65424 (Table readonly after XID overflow).

Lessons for anyone running ClickHouse + Keeper at scale

  • Self-healing + fixed cadence = a counter overflow. When a failure recovers on its own and returns on a clock, look for something filling up and resetting — not a transient external cause.
  • Match time-to-failure against a real rate. We confirmed the diagnosis by computing INT32_MAX / measured_req_per_sec and watching it match the observed cadence. That arithmetic turned a hypothesis into a root cause.
  • Loud vs. quiet failures need different instincts. A real network outage logs Cannot resolve host, socket errors, refusals. XID overflow logs a clean <Information>-level disconnect on the Keeper side and a single "XID overflow" on the client. If you grep Keeper for errors, you'll wrongly conclude "Keeper is fine, must be the network."
  • "It recovered" is not "it's resolved." Each restart reset the timer and bought a few quiet days — which is exactly how a bug like this survives for months.
  • Know your ingestion's durability. The read-only window was harmless for reads; it cost writes only because prod was on non-durable Core NATS. The blast radius of a brief coordination hiccup is set by what's buffering upstream.

If you run a busy ClickHouse cluster on Keeper and have never set use_xid_64, check your versions and put a rolling restart on the calendar. It's one line of config standing between you and a recurring, self-hiding, cluster-wide read-only storm.

