DEV Community: sqlrush

Fencing a node that doesn't know it's dead: pgrac build log #2

sqlrush — Thu, 18 Jun 2026 12:55:13 +0000

pgrac is an open attempt to rebuild Oracle RAC's core machinery (shared-everything storage, multiple active nodes all writing one database, a cluster-wide change number) on top of PostgreSQL 16. Build log #1 laid out the four problems that fight back. This one is about the problem that turns a node failure into silent data corruption, and the first, deliberately modest, layer pgrac ships against it.

The failure mode

In a shared-nothing cluster an evicted node is mostly harmless: it owns its own disks, so the cluster routes around it. In a shared-everything cluster the same event is dangerous, because every node writes the same storage. Picture the classic split: node 2 misses heartbeats, the cluster declares it dead and remasters its work elsewhere, but node 2 is not actually dead. It is frozen on a long GC pause, or its interconnect NIC flaked, and it is about to wake up and finish the write it started. Now two nodes believe they own the same blocks, and shared storage will accept both writes. That is not a crash. It is corruption you find three days later.

Oracle RAC's answer is I/O fencing: before remastering a dead node's resources, you make certain it can no longer touch the storage, with STONITH, SCSI-3 persistent reservations, or a hardware watchdog. The node is fenced at a layer below its own software, because the whole point is that you cannot trust the dead node's software to behave.

That hardware layer is real work, and it is not what pgrac built first. What it built first is the layer above it: an in-process cooperative write-fence, now default-ON. The rest of this post is precise about what that does and does not buy you, because "we have fencing" is the kind of claim that is worth less than nothing if it is overstated.

A fence needs an authority everyone can agree on

You cannot fence on local opinion, because the whole problem is that the dead node disagrees about being dead. Authority has to live on durable, shared, quorum-backed storage.

pgrac writes a small fence marker into the cluster's voting disks, the same disks used for membership. The marker is a compact tuple: a monotonic fence_epoch, a fence_generation tie-break, a fenced_dead_bitmap naming which nodes are declared dead, and the issuer's id, CRC-protected inside the voting slot. Reconfiguration bumps the epoch and submits the new marker. It is written to a quorum-majority of voting disks, and the reconfiguration proceeds only if a majority acknowledges. If the majority write fails, it fails closed: no event is published and no remastering starts. A minority of disks cannot amplify itself into an authority, because each disk preserves only its own marker, so a single stale disk cannot out-vote the quorum.

Every node runs a poller (qvotec) that reads all voting disks every couple of seconds, keeps only CRC-valid markers, orders them by (fence_epoch DESC, fence_generation DESC), and recognizes an authority only when it sees the same marker on a majority of disks. That majority requirement is the split-brain guard. It is what stops two partitions from each minting their own truth.
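A minimal sketch of that selection rule, not the code in cluster_qvotec.c. The struct layout and the names FenceMarker and pick_authority are hypothetical. The post does not say whether the poller falls back to an older marker when the newest one sits on a minority of disks; this sketch assumes the stricter reading, where only the top-ordered marker is eligible and no authority is recognized otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory copy of one voting disk's marker slot. */
typedef struct {
    bool     crc_ok;              /* slot passed its CRC check */
    uint64_t fence_epoch;
    uint32_t fence_generation;
    uint64_t fenced_dead_bitmap;
    uint32_t issuer_id;
} FenceMarker;

static bool marker_same(const FenceMarker *a, const FenceMarker *b)
{
    return a->fence_epoch == b->fence_epoch
        && a->fence_generation == b->fence_generation
        && a->fenced_dead_bitmap == b->fenced_dead_bitmap
        && a->issuer_id == b->issuer_id;
}

/* True if a orders ahead of b under (fence_epoch DESC, fence_generation DESC). */
static bool marker_newer(const FenceMarker *a, const FenceMarker *b)
{
    if (a->fence_epoch != b->fence_epoch)
        return a->fence_epoch > b->fence_epoch;
    return a->fence_generation > b->fence_generation;
}

/*
 * Returns the index of the top-ordered CRC-valid marker if an identical
 * marker sits on a strict majority of ALL disks, else -1 (no authority).
 */
static int pick_authority(const FenceMarker *disks, size_t n_disks)
{
    assert(disks != NULL && n_disks > 0);

    int top = -1;
    for (size_t i = 0; i < n_disks; i++) {
        if (!disks[i].crc_ok)
            continue;                       /* corrupt slots never vote */
        if (top < 0 || marker_newer(&disks[i], &disks[top]))
            top = (int) i;
    }
    if (top < 0)
        return -1;

    size_t votes = 0;
    for (size_t i = 0; i < n_disks; i++)
        if (disks[i].crc_ok && marker_same(&disks[i], &disks[top]))
            votes++;

    /* Majority is counted against all disks, not just the readable ones. */
    return (votes * 2 > n_disks) ? top : -1;
}
```

Counting the majority against the total disk count rather than the readable count is what keeps a partition that can see only one disk from promoting that disk's marker.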

The hot path: check a token before touching storage

Consulting voting disks on every write would be absurd, so the poller distills the durable marker into a small local token in shared memory: the epoch this node is authorized for, a lease expiry, a self_fenced bit, and a copy of the dead bitmap. The write path consults only that token, through a pure, lock-free, allocation-free judge that is safe to call inside a critical section. A write is allowed only when:

enforcement_on
  && region_attached
  && epoch == authorized_epoch     // exact ==, not >=
  && now < lease_expire
  && !self_fenced

The == rather than >= is deliberate: a node holding a stale epoch is as dangerous as one explicitly fenced, so anything other than the exact current epoch is refused. Six storage entry points (create, unlink, extend, zero-extend, write, truncate) call this judge before writing shared storage. A fenced or stale node fails closed: ERROR 53R51 on the normal path, PANIC if it is caught in a critical section where it cannot safely back out. The lease makes silence safe: if a node is cut off from the voting disks and cannot refresh its token, the lease expires and the node fences itself. The safe default, when you have lost contact with the authority, is to stop writing.
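The condition above, transcribed into C as a sketch. FenceToken, its field types, and the my_epoch parameter are assumptions for illustration; the real judge reads a token from shared memory, and the no-op path for single-node instances is assumed to be decided before this function is reached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the local token the poller maintains. */
typedef struct {
    bool     enforcement_on;
    bool     region_attached;
    uint64_t authorized_epoch;    /* epoch this node is authorized for */
    uint64_t lease_expire_us;     /* absolute expiry, microseconds */
    bool     self_fenced;
} FenceToken;

/*
 * Pure function: no locks, no allocation, no I/O. Returns true only when
 * every condition holds; any doubt resolves to "do not write".
 */
static bool fence_write_allowed(const FenceToken *tok,
                                uint64_t my_epoch, uint64_t now_us)
{
    assert(tok != NULL);
    return tok->enforcement_on
        && tok->region_attached
        && my_epoch == tok->authorized_epoch   /* exact match, never >= */
        && now_us < tok->lease_expire_us       /* an expired lease fences */
        && !tok->self_fenced;
}
```

A caller on the normal path would turn a false result into the 53R51 error; inside a critical section the same result has to escalate, because there is no safe way to back out.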

Why default-ON was hard

The hard part was not fencing a dead node. It was not fencing a healthy one. Fail-closed-on-silence is correct, but it has a corollary: a healthy cluster, idle with nobody fenced, has no fence marker on the voting disks. No marker means the poller never refreshes the token, the lease expires, and every node fences itself, so the whole healthy cluster cannot write. The safety mechanism starves the happy path. That is why enforcement could not just be flipped on in the first cut.

The fix is a steady-state baseline marker: even with nobody fenced, the cluster continuously maintains a current, majority-agreed "all alive at epoch N" marker, so the token lease always renews and a healthy cluster writes normally, while a real fence still supersedes it the instant it is issued. Getting the priority right between baseline and real-fence markers is itself a split-brain hazard, since a stale baseline author must never overwrite a higher-epoch fence. That is what two hardening passes after the initial flip were about: a baseline author may not regress the epoch, and a node must engage its bring-up latch before it ever publishes a self-fencing token. Only after that does cluster.write_fence_enforcement default to on for a multi-node shared-FS cluster. A single-node instance, or one with no voting disks, auto-degrades to a no-op and opens cleanly. You do not pay for cluster fencing when there is no cluster.

What is demonstrated, and what is not

Demonstrated, in CI, on shared-FS clusters:

On a 2-node and a 3-node cluster with enforcement default-ON, when a node is declared dead and self-fences, all six storage entry points reject the write with 53R51, verified end-to-end (t/271, t/272), with a pre-injection baseline write proving the test is not falsely green.
The reconfiguration, marker-write, and fence path is exercised by a synthetic dead-node injection that runs the mechanism to completion, and a counter delta proves the code path actually executed rather than being dead code. This is the mechanism-completion gate for recovery acceptance.

Deliberately not claimed yet:

This is cooperative, not hardware, fencing. It fences a node that is still running the gate. It does nothing about a node whose software is wedged in a way that bypasses the gate. That is what STONITH, SCSI-3 PR, and cloud soft-fence are for, and that external fence plane is future work, not in this code.
Faithful-crash auto-recovery is not proven in CI. The synthetic injection proves the mechanism; a real SIGKILL of a node followed by deterministic apply-through recovery is SKIP-with-limitation in the single-machine test harness, and labeled as such. That is not true N-node auto-recovery until it runs against a real multi-machine backend.
No true 4-node CI demo. The 4-node recovery demo is a manual script that takes an external cluster connection string; with no such backend wired up it SKIPs with a reason rather than faking a pass.
Being fenced is terminal in this stage. A fenced node comes back up in a non-serving mode: it refuses shared writes and waits for offline or cold-admin recovery. Online rejoin, a fenced node automatically returning to active service, is Stage 5 and later.
Steady 2-node OLTP throughput and cross-node Cache Fusion block transfer remain Stage 5 problems. Nothing here changes that.

The honest one-liner: pgrac now has a default-on, quorum-backed, fail-closed cooperative write-fence, demonstrated to stop a self-fenced node from writing shared storage on 2- and 3-node clusters. It is the first of several fencing layers, with the hardware layer and faithful-crash recovery still ahead. That is a real step, and it is one layer of a problem that has several. I would rather undersell it than have a RAC veteran catch me overselling.

Code is in the open: cluster_write_fence.{c,h}, the storage gates in cluster_smgr.c, the durable marker and poller in cluster_qvotec.c, and the tests under src/test (t/271, t/272). https://github.com/sqlrush/pgrac · status board: https://pgrac.dev/features

If you have built I/O fencing for a shared-storage system and think the cooperative-first ordering is wrong, I want to hear it.

Rebuilding Oracle RAC's core machinery on PostgreSQL — the four problems that fight back

sqlrush — Tue, 16 Jun 2026 00:41:34 +0000

pgrac is an attempt to build many of Oracle RAC's core capabilities — shared-everything storage, multiple active nodes all writing one database, Cache Fusion, a cluster-wide change number — directly on PostgreSQL 16, in the open.

To be precise about the gap it aims at: as far as I can tell, there is no other open-source, multi-active, Cache-Fusion cluster on Postgres. PolarDB and Aurora are shared-storage but single-writer: one node writes, the rest read. BDR, Citus, Patroni and friends are shared-nothing. If you know of a real open-source, every-node-writes Postgres on shared storage, I genuinely want to see it.

This isn't a "look what I shipped" post. Most of this is hard and in progress. It's a "here is what the machinery actually is, and why each piece fights you" post. Four problems sit at the core. None is optional, and each one fails silently if you get it wrong — which is the worst way for a database to fail.

Problem 1 — A clock every node can agree on

Single-node Postgres orders changes with two things: transaction ids and the WAL LSN. In a shared-everything cluster, both stop working as a global order.

A 32-bit TransactionId is only meaningful inside one instance — node A's xid 5000 and node B's xid 5000 are different transactions, in the same numeric space. And LSNs from two independent WAL streams are simply incomparable; pgrac's own recovery code says so in as many words. So when a reader on node A meets a row last written by node B, neither xid nor LSN tells it "did B's write happen before my snapshot?"

RAC's answer is the SCN — a System Change Number, a single monotonic clock the whole cluster shares. The naive way to build one is a central allocator: one node (or service) hands out the next number. That is fatal on the hot path. Every commit would need a synchronous round-trip to the allocator before it could become durable, turning a node-local WAL insert into a network-bound operation, and making the allocator both a throughput ceiling and a single point of failure.

So pgrac builds the SCN as a distributed Lamport clock. Two operations, both lock-free:

advance (on commit) is a single atomic fetch_add on a per-node counter — about 5–50 ns, contends on nothing.
observe (on receiving anything from another node) is current = max(current, remote) + 1, done as a compare-and-swap retry loop.

The clock then synchronizes for free: every message on the interconnect — heartbeat, lock grant, block transfer — carries the sender's current SCN in its envelope header, and the receiver bumps its own clock before the message is even dispatched. No dedicated clock protocol exists; it's Lamport's "piggyback timestamps on all messages," literally.
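The two operations can be sketched with C11 atomics. This is an illustration of the technique, not the body of cluster_scn_advance or cluster_scn_observe; ScnClock and the function names here are hypothetical, and the counter is treated as a plain 64-bit value with no node_id packing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node clock. */
typedef struct {
    _Atomic uint64_t counter;
} ScnClock;

/* advance: one atomic fetch-add, returns the newly assigned value. */
static uint64_t scn_advance(ScnClock *c)
{
    assert(c != NULL);
    return atomic_fetch_add(&c->counter, 1) + 1;
}

/* observe: current = max(current, remote) + 1, as a CAS retry loop. */
static uint64_t scn_observe(ScnClock *c, uint64_t remote)
{
    assert(c != NULL);
    uint64_t cur = atomic_load(&c->counter);
    for (;;) {
        uint64_t next = (cur > remote ? cur : remote) + 1;
        /* On failure the CAS reloads cur with the latest value; retry. */
        if (atomic_compare_exchange_weak(&c->counter, &cur, next))
            return next;
    }
}
```

The observe loop never moves the clock backward: whichever of the local and remote values is larger wins, so a late message carrying an old SCN still only pushes the counter forward by one.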

The subtle part is comparison. An SCN packs [8-bit node_id | 56-bit local counter] into a uint64. If you compare two SCNs with a raw <, the node_id in the high bits dominates and the ordering is garbage. Visibility comparisons must look at only the local-counter bits and ignore node_id entirely — while a different comparison, the one used to give ITL slots a globally-unique order, keeps node_id as a tiebreak. Getting this backwards is a silent-wrong-answer bug, so there's literally a CI check that greps for raw </==/> on SCN values and fails the build.
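To make the hazard concrete, here is a sketch of the packing and the two comparisons. The bit layout follows the description above; the helper names are hypothetical and are not claimed to match pgrac's.

```c
#include <assert.h>
#include <stdint.h>

#define SCN_COUNTER_BITS 56
#define SCN_COUNTER_MASK ((UINT64_C(1) << SCN_COUNTER_BITS) - 1)

/* [8-bit node_id | 56-bit local counter] in one uint64. */
static uint64_t scn_pack(uint8_t node_id, uint64_t counter)
{
    assert(counter <= SCN_COUNTER_MASK);
    return ((uint64_t) node_id << SCN_COUNTER_BITS) | counter;
}

static uint64_t scn_counter(uint64_t scn) { return scn & SCN_COUNTER_MASK; }
static uint8_t  scn_node(uint64_t scn)    { return (uint8_t) (scn >> SCN_COUNTER_BITS); }

/* Visibility order: counter bits only, node_id ignored. Returns -1, 0, 1. */
static int scn_cmp_visibility(uint64_t a, uint64_t b)
{
    uint64_t ca = scn_counter(a), cb = scn_counter(b);
    return (ca > cb) - (ca < cb);
}

/* Globally unique order: counter first, node_id only as a tiebreak. */
static int scn_cmp_total(uint64_t a, uint64_t b)
{
    int c = scn_cmp_visibility(a, b);
    if (c != 0)
        return c;
    return (scn_node(a) > scn_node(b)) - (scn_node(a) < scn_node(b));
}
```

With node 2 at counter 100 and node 1 at counter 200, a raw comparison says the node 2 value is larger because of its high bits, while the visibility comparison correctly orders it first.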

Finally the SCN has to live in durable structures, not just memory: an 8-byte pd_block_scn went into the page header (page layout version bumped 4 → 5), and an 8-byte xl_scn into every WAL record. Recovery replays in SCN order; consistent reads compare against it. The clock is the foundation the other three problems stand on.

→ Code: cluster_scn.c — cluster_scn_advance is the fetch-add, cluster_scn_observe the CAS-retry observe.

Problem 2 — MVCC across nodes, without taxing the node that has no peers

This is the one I find genuinely beautiful and genuinely scary.

Postgres decides tuple visibility with xmin/xmax plus the commit log (CLOG). Here's the trap: a tuple on shared storage might have been written by another instance. If node A resolves that tuple's xmin through node A's local CLOG, it will happily find some transaction with that number — a completely unrelated local one — and return a confident, wrong answer. Silent cross-instance corruption. The 32-bit xid space overlapping across nodes means you can never, ever resolve a remote xid through local visibility machinery.

But the obvious fix — route every tuple through some cluster-wide visibility service — would tax the 99% case (a node doing ordinary local OLTP) to serve the 1% case (a tuple another node touched). That's unacceptable; nobody will run a cluster that halves their single-node throughput.

pgrac's answer is dual-track visibility, steered by a single byte. Each heap tuple header carries a new t_itl_slot_idx (one byte; 255 = unallocated). On every visibility check, a classifier reads the tuple's ITL (interested-transaction-list) slot and decides:

own-instance evidence → fall straight through to native Postgres xmin/xmax + CLOG. Untouched. Zero cluster overhead.
remote evidence → resolve through the cluster path: ITL slot → transaction-table → SCN → undo.
stale/ambiguous (the slot was recycled to a different remote owner) → fail closed.

The payoff: a node with no live peers does one boolean comparison per tuple and runs vanilla Postgres. The whole cluster apparatus is bypassed unless a tuple is physically stamped by another node. You only pay for coherence on the rows that actually need it.
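A rough sketch of the three-way classification, under loud assumptions: the ItlSlot fields, the reused flag, and treating an unallocated index as native are all hypothetical. The post does not describe how pgrac detects a recycled slot, so that detection is reduced here to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ITL_SLOT_UNALLOCATED 255   /* from the post: 255 = unallocated */

typedef enum { TRACK_NATIVE, TRACK_CLUSTER, TRACK_FAIL_CLOSED } VisTrack;

/* Hypothetical on-page ITL slot. */
typedef struct {
    bool    in_use;
    uint8_t owner_node;
    bool    reused;    /* assumption: set when the slot was recycled to a new owner */
} ItlSlot;

static VisTrack classify_tuple(uint8_t t_itl_slot_idx,
                               const ItlSlot *slots, size_t n_slots,
                               uint8_t my_node)
{
    assert(slots != NULL);

    /* Assumed here: no slot means no remote stamp, so take the native path. */
    if (t_itl_slot_idx == ITL_SLOT_UNALLOCATED)
        return TRACK_NATIVE;
    if (t_itl_slot_idx >= n_slots)
        return TRACK_FAIL_CLOSED;               /* index points nowhere */

    const ItlSlot *s = &slots[t_itl_slot_idx];
    if (!s->in_use || s->reused)
        return TRACK_FAIL_CLOSED;               /* stale or ambiguous evidence */

    return (s->owner_node == my_node) ? TRACK_NATIVE : TRACK_CLUSTER;
}
```

The property worth keeping from any real version is the shape of the return type: there is no default branch that quietly picks a track when the evidence is unclear.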

The remote path is where Oracle's ghosts show up. A cross-node tuple's visibility is decided by resolving its commit status and commit SCN through a transaction table, then comparing commit_scn against the snapshot's read_scn. If the current row is newer than your snapshot, pgrac reconstructs the older version Oracle-style: it walks the undo chain backward, newest-first, inverse-applying changes until it reaches a version at or before your read_scn. That's Consistent Read, rebuilt on Postgres.
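The backward walk can be sketched as follows. This is a simplification, not cluster_cr.c: real undo stores changes that are inverse-applied to the page, whereas this hypothetical UndoVersion chain stores whole versions, and SCNs are plain counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical version chain: newest first, each link points to an older version. */
typedef struct UndoVersion {
    uint64_t commit_scn;               /* counter-only SCN, for simplicity */
    int      value;                    /* stand-in for the row image */
    const struct UndoVersion *older;
} UndoVersion;

/*
 * Walk newest-first and return the first version committed at or before
 * read_scn. NULL means the row did not exist for this snapshot.
 */
static const UndoVersion *cr_version_for(const UndoVersion *newest,
                                         uint64_t read_scn)
{
    for (const UndoVersion *v = newest; v != NULL; v = v->older) {
        /* The chain must be ordered newest-first for the walk to be valid. */
        assert(v->older == NULL || v->older->commit_scn <= v->commit_scn);
        if (v->commit_scn <= read_scn)
            return v;
    }
    return NULL;
}
```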

Two design decisions are worth dwelling on, because they're what separate "demo" from "correct":

Visibility is three-valued, not two. A cross-node resolution can come back VISIBLE, INVISIBLE, or UNKNOWN — and UNKNOWN is never silently collapsed into INVISIBLE. If a remote commit SCN hasn't propagated yet, or an overlay entry is missing, the read raises a specific error (53R97, "TT status unknown") and lets the caller retry or abort. A wrong-but-silent "not visible" is exactly the bug this whole layer exists to prevent, so the code refuses to guess.
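A sketch of what three-valued means in code. The enum names and the TtStatus states are hypothetical; the only detail taken from the post is that UNKNOWN must surface as an error (53R97) rather than be folded into INVISIBLE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { VIS_VISIBLE, VIS_INVISIBLE, VIS_UNKNOWN } VisResult;

/* Hypothetical transaction-table states for a remote transaction. */
typedef enum { TT_COMMITTED, TT_ABORTED, TT_IN_PROGRESS, TT_STATUS_UNKNOWN } TtStatus;

static VisResult resolve_remote(TtStatus st, uint64_t commit_scn, uint64_t read_scn)
{
    switch (st) {
    case TT_COMMITTED:
        return (commit_scn <= read_scn) ? VIS_VISIBLE : VIS_INVISIBLE;
    case TT_ABORTED:
    case TT_IN_PROGRESS:
        return VIS_INVISIBLE;
    case TT_STATUS_UNKNOWN:
    default:
        return VIS_UNKNOWN;            /* never guessed into INVISIBLE */
    }
}

/*
 * Returns 0 and sets *visible for a definite answer; returns -1 when the
 * caller must raise the "TT status unknown" error (53R97 in the post).
 */
static int read_check(VisResult r, bool *visible)
{
    assert(visible != NULL);
    if (r == VIS_UNKNOWN)
        return -1;
    *visible = (r == VIS_VISIBLE);
    return 0;
}
```

Keeping the error return separate from the boolean makes it awkward to write a caller that treats "I do not know" as "no".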

The ABA problem is real and handled. The on-page ITL is only an 8-slot cache; a remote peer can reuse a slot for a later transaction — even a wrapped xid that numerically equals an old one. Resolving that against the peer's durable transaction table uses a 16-bit wrap generation counter, so a same-valued-but-wrapped xid is a sound zero-match (fail closed) instead of a dangerous false match. The durable outcome is only trusted when the SLRU commit state and the independent transaction-table slot agree on both commit SCN and wrap generation.
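The agreement rule in the last sentence can be sketched like this. DurableView and durable_outcome_trusted are hypothetical names, and the real structures certainly carry more state; the point is only that a mismatch on either field yields "no match", never a best guess.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one durable source (SLRU commit state or TT slot). */
typedef struct {
    bool     valid;
    uint64_t commit_scn;
    uint16_t wrap_gen;      /* 16-bit wrap generation, as in the post */
} DurableView;

/*
 * Trust the outcome only when both independent sources are present, both
 * carry the wrap generation the reader expects, and both report the same
 * commit SCN. Anything else is a zero-match and the caller fails closed.
 */
static bool durable_outcome_trusted(const DurableView *slru,
                                    const DurableView *tt,
                                    uint16_t expected_wrap,
                                    uint64_t *commit_scn_out)
{
    assert(slru != NULL && tt != NULL && commit_scn_out != NULL);
    if (!slru->valid || !tt->valid)
        return false;
    if (slru->wrap_gen != expected_wrap || tt->wrap_gen != expected_wrap)
        return false;       /* same xid value, different wrap: not the same xact */
    if (slru->commit_scn != tt->commit_scn)
        return false;
    *commit_scn_out = tt->commit_scn;
    return true;
}
```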

Honest status: the dual-track resolver, the ITL write path, and decide-by-SCN are wired into every HeapTupleSatisfies* function and exercised by 2-node tests. What's still maturing is live cross-node undo reads (versus the crash-recovery materialized path) and breadth of multi-node behavioral coverage.

→ Code: cluster_visibility_resolve.c (the dual-track classifier) and cluster_cr.c (consistent-read construction from undo).

Problem 3 — Moving a dirty page between two memories without losing it

Cache Fusion is the headline RAC feature: when node B needs a block node A holds dirty, ship it straight from A's buffer cache into B's over the interconnect, instead of forcing A to write it to disk so B can read it back. That "disk ping" is two synchronous storage round-trips on the hot path and serializes every writer hand-off behind I/O — the exact thing Cache Fusion exists to kill.

The protocol is three-way. The block has a master — and the mastering is deterministic and coordinator-free: every node computes master = hash(block_tag) mod (declared nodes) independently and gets the same answer, so there's no directory server to ask. A requester asks the master; the master forwards to the current holder; the holder ships the page directly back to the requester.
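A sketch of coordinator-free mastering. The BlockTag fields and the choice of FNV-1a are assumptions for illustration; the post does not say which hash pgrac uses. What matters is that the hash is byte-order independent and that the modulus is the declared node count, which every node knows, not the live count, which nodes may briefly disagree on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block identity; field names are illustrative. */
typedef struct {
    uint32_t tablespace, database, relation, fork, block;
} BlockTag;

/* Fold one 32-bit field into an FNV-1a hash, low byte first. */
static uint64_t fnv1a_u32(uint64_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (uint8_t) (v >> (8 * i));
        h *= UINT64_C(1099511628211);           /* FNV-1a 64-bit prime */
    }
    return h;
}

/* master = hash(block_tag) mod declared_nodes, computed locally on any node. */
static uint32_t master_for(const BlockTag *tag, uint32_t declared_nodes)
{
    assert(tag != NULL && declared_nodes > 0);
    uint64_t h = UINT64_C(14695981039346656037); /* FNV-1a 64-bit offset basis */
    h = fnv1a_u32(h, tag->tablespace);
    h = fnv1a_u32(h, tag->database);
    h = fnv1a_u32(h, tag->relation);
    h = fnv1a_u32(h, tag->fork);
    h = fnv1a_u32(h, tag->block);
    return (uint32_t) (h % declared_nodes);
}
```

Hashing field by field rather than over the raw struct bytes avoids depending on padding or endianness, which would otherwise let two nodes compute different masters for the same block.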

The correctness-critical step is one line of intent: flush the WAL up to this page's LSN before the bytes leave the node. Here's why it's mandatory and not just careful. pgrac ships the dirty in-memory image — that page never went to shared storage. The receiver installs it and may build more redo on top. If the holder then crashes and the WAL describing how that page reached its current LSN was never durable, recovery cannot reconstruct the page the receiver already built on. That's a cross-node lost write, the kind that surfaces as corruption days later. Flushing WAL-up-to-page_lsn before shipping makes the redo chain behind the image durable first.
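The ordering constraint, reduced to a sketch. WalState, Page, and ship_dirty_page are hypothetical stand-ins; the flush is modeled as advancing a counter and the "wire" is just an output buffer. The one thing the sketch is meant to show is that the copy happens strictly after the flush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical WAL state on the holder. */
typedef struct {
    uint64_t flushed_lsn;   /* WAL is durable up to here */
    int      flush_calls;   /* how many real flushes were needed */
} WalState;

typedef struct {
    uint64_t page_lsn;      /* LSN of the last change applied to this image */
    uint32_t payload;       /* stand-in for the page contents */
} Page;

/* Flush only if the requested LSN is not already durable. */
static void wal_flush_upto(WalState *w, uint64_t lsn)
{
    if (lsn > w->flushed_lsn) {
        w->flushed_lsn = lsn;
        w->flush_calls++;
    }
}

/* Make the redo chain behind the image durable, then let the bytes leave. */
static void ship_dirty_page(WalState *w, const Page *page, Page *wire)
{
    assert(w != NULL && page != NULL && wire != NULL);
    wal_flush_upto(w, page->page_lsn);
    assert(w->flushed_lsn >= page->page_lsn);   /* invariant before shipping */
    *wire = *page;
}
```

When the page's LSN is already behind the flushed position, the ship costs no extra I/O, which is the common case for a block that has been quiet for a moment.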

The other subtle bit is the ownership hand-off. The master flips "who owns this block exclusively" to the new writer only after the new writer reports it has durably installed the image — not when the request is granted. That ordering closes the window where two nodes could both believe they hold the block exclusive and both write it.
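As a sketch, the master's bookkeeping for that hand-off is a two-step state change. MasterEntry and both function names are hypothetical, and allowing only one pending hand-off per block is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_NODE (-1)

/* Hypothetical per-block entry on the master. */
typedef struct {
    int x_owner;         /* node recorded as exclusive owner */
    int pending_owner;   /* granted, but install not yet confirmed */
} MasterEntry;

/* Granting a request records intent only; ownership does not move yet. */
static bool master_grant(MasterEntry *e, int requester)
{
    assert(e != NULL);
    if (e->pending_owner != NO_NODE)
        return false;                    /* one hand-off in flight at a time */
    e->pending_owner = requester;
    return true;
}

/* Ownership flips only when the pending node reports a durable install. */
static bool master_on_install_ack(MasterEntry *e, int node)
{
    assert(e != NULL);
    if (node == NO_NODE || node != e->pending_owner)
        return false;                    /* ack from the wrong node: ignore */
    e->x_owner = node;
    e->pending_owner = NO_NODE;
    return true;
}
```

Between the grant and the ack the old owner is still the only recorded exclusive holder, so there is no instant at which the master's view names two writers.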

Honest status, and I want to be exact here because it's easy to overclaim: the data plane — memory-to-memory ship, WAL-flush-before-ship, CRC, LSN re-stamping, the starvation guard, the invalidate-and-ack broadcast — is all built and wired. First-touch acquisition and read-sharing run on the real lock path. But transferring a block away from a still-live remote writer (the X-to-X case) is currently bounded fail-closed — the master returns FEATURE_NOT_SUPPORTED rather than do a transfer it can't yet prove correct under concurrent recovery. The hard case is implemented but gated off until the warm-recovery substrate underneath it lands. That's the honest line: the machinery is there; the last mile of live writer transfer is deliberately closed, not faked.

→ Code: cluster_gcs_block.c (the 3-way request / forward / invalidate handlers) and cluster_pcm_lock.c (the acquire path).

Problem 4 — Not corrupting everything when the network splits

Multi-active is where split-brain stops being a slide and becomes the thing that eats your data. If the interconnect partitions and both sides keep accepting writes to shared storage, you get two divergent truths on one set of blocks. There is no clean merge from there.

pgrac's primary defense is a fail-closed gate at the commit boundary. Inside CommitTransaction, after the transaction has done its work but before the WAL commit record is flushed, a writable transaction checks whether this node still holds quorum. If not, the transaction is aborted — error 53R40, quorum lost.

The placement is the entire point. A commit becomes durable, visible, and irrevocable only at the instant its WAL commit record hits disk. The gate sits one step upstream of that flush. So a node that loses quorum mid-transaction can never turn that work into a durable commit; uncertainty resolves to abort, never to a divergent durable write. A partitioned minority that still thinks it's primary simply cannot make its writes stick.

Quorum itself is a disk-based vote with a lease: a majority of voting disks, and a lease that the quorum process must keep renewing. If that process hangs, the lease expires on its own and the gate fail-closes — no broadcast required, no liveness assumption. On top of that, a cooperative write-fence (epoch + lease + a quorum-majority marker on the voting disks) rejects shared-storage writes from a node whose epoch is stale or whose lease has aged out, with a PANIC if it's caught mid-critical-section where a half-write can't be rolled back.
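The commit gate and the lease can be sketched together. QuorumState and both functions are hypothetical, and the flush is modeled as a counter; this is not the code in CommitTransaction. It assumes, as in stock PostgreSQL, that a read-only transaction writes no commit record and so has nothing to gate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local view of quorum, refreshed by the quorum process. */
typedef struct {
    int      disks_reachable;
    int      disks_total;
    uint64_t lease_expire_us;   /* stops being valid on its own if not renewed */
} QuorumState;

static bool holds_quorum(const QuorumState *q, uint64_t now_us)
{
    assert(q != NULL && q->disks_total > 0);
    return q->disks_reachable * 2 > q->disks_total   /* strict majority */
        && now_us < q->lease_expire_us;              /* and a live lease */
}

typedef enum { COMMIT_OK, COMMIT_ABORTED_53R40 } CommitOutcome;

/*
 * The gate runs after the transaction's work and before the commit record
 * flush, so a lost quorum can only ever produce an abort.
 */
static CommitOutcome commit_transaction(const QuorumState *q, uint64_t now_us,
                                        bool wrote_data,
                                        int *commit_records_flushed)
{
    assert(commit_records_flushed != NULL);
    if (!wrote_data)
        return COMMIT_OK;                /* nothing durable to publish */
    if (!holds_quorum(q, now_us))
        return COMMIT_ABORTED_53R40;     /* quorum lost: abort, never flush */
    (*commit_records_flushed)++;         /* stand-in for the WAL flush */
    return COMMIT_OK;
}
```

Because the lease check compares against a stored expiry, a hung quorum process needs no help to fail the gate: the clock moving past lease_expire_us is enough.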

Now the honest part, because a split-brain section that oversells is worse than useless. pgrac's fencing is cooperative and internal — there is no external STONITH, no IPMI, no SCSI-3 reservation, no "shoot the other node in the head" yet. The safety rests on each node fail-closing itself. That covers a cooperating node that loses quorum; it does not cover a node hung at the kernel level mid-write that ignores its own gates — the exact case STONITH exists for, and which is on the roadmap, not in the tree. And one more limitation I'll state plainly because it's load-bearing: today the quorum signal is driven by voting-disk I/O health, not by peer-liveness, so a clean network partition between two healthy nodes that can both still reach the disks does not, by itself, trip the fence. These are documented limits, not marketing softening. Getting split-brain fully right is years of work; what's here is a sound cooperative core with the failure modes written down.

→ Code: the commit gate lives in xact.c (CommitTransaction); quorum + lease in cluster_qvotec.c; the cooperative write-fence in cluster_fence.c.

Where it actually stands

Running today, on real code paths: the SCN clock, dual-track cross-node MVCC, Cache Fusion's data plane (first-touch + read-sharing), cluster catalog invalidation, the cluster-aware storage manager, and the substrate (interconnect, heartbeat, multi-node bootstrap). In progress: live-holder Cache Fusion transfer, full cross-node GES locking, crash/instance recovery, and external fencing. The sanity anchor underneath all of it: the --disable-cluster build is binary-identical to upstream PostgreSQL 16.13 and passes the full 219-test regression suite — the non-cluster path is trustworthy, the cluster path is young. Per-feature status, honestly maintained, lives at pgrac.dev/features.

If you've operated RAC or any shared-storage cluster and want to be useful: clone it, read the design, and tell me where it's wrong. "This breaks under X" is the most valuable thing you can send me right now, and there are scoped good first issues if you want to get your hands in.

Repo: https://github.com/sqlrush/pgrac · architecture and the full feature map: https://pgrac.dev

The next build log goes one level deeper into the interconnect and how membership survives a node dropping mid-transaction.