Rebuilding Oracle RAC's core machinery on PostgreSQL — the four problems that fight back

#postgres #database #distributedsystems #oracle

pgrac is an attempt to build many of Oracle RAC's core capabilities — shared-everything storage, multiple active nodes all writing one database, Cache Fusion, a cluster-wide change number — directly on PostgreSQL 16, in the open.

Let me be precise about the gap it aims at. As far as I can tell, there is no other open-source, multi-active, Cache Fusion cluster on Postgres. PolarDB and Aurora use shared storage but are single-writer: one node writes, the rest read. BDR, Citus, Patroni and friends are shared-nothing. If you know of a real open-source, every-node-writes Postgres on shared storage, I genuinely want to see it.

This isn't a "look what I shipped" post. Most of this is hard and in progress. It's a "here is what the machinery actually is, and why each piece fights you" post. Four problems sit at the core. None is optional, and each one fails silently if you get it wrong — which is the worst way for a database to fail.

Problem 1 — A clock every node can agree on

Single-node Postgres orders changes with two things: transaction ids and the WAL LSN. In a shared-everything cluster, both stop working as a global order.

A 32-bit TransactionId is only meaningful inside one instance — node A's xid 5000 and node B's xid 5000 are different transactions, in the same numeric space. And LSNs from two independent WAL streams are simply incomparable; pgrac's own recovery code says so in as many words. So when a reader on node A meets a row last written by node B, neither xid nor LSN tells it "did B's write happen before my snapshot?"

RAC's answer is the SCN — a System Change Number, a single monotonic clock the whole cluster shares. The naive way to build one is a central allocator: one node (or service) hands out the next number. That is fatal on the hot path. Every commit would need a synchronous round-trip to the allocator before it could become durable, turning a node-local WAL insert into a network-bound operation, and making the allocator both a throughput ceiling and a single point of failure.

So pgrac builds the SCN as a distributed Lamport clock. Two operations, both lock-free:

advance (on commit) is a single atomic fetch_add on a per-node counter — about 5–50 ns, contends on nothing.
observe (on receiving anything from another node) is current = max(current, remote) + 1, done as a compare-and-swap retry loop.

The clock then synchronizes for free: every message on the interconnect — heartbeat, lock grant, block transfer — carries the sender's current SCN in its envelope header, and the receiver bumps its own clock before the message is even dispatched. No dedicated clock protocol exists; it's Lamport's "piggyback timestamps on all messages," literally.
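To make the two operations concrete, here is a minimal sketch using C11 atomics. The type names, function names, and envelope layout are illustrative assumptions, not pgrac's actual definitions (those live in cluster_scn.c); only the fetch-add and the max-plus-one CAS loop are taken from the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node clock; not pgrac's real struct. */
typedef struct {
    _Atomic uint64_t counter;   /* local SCN counter */
} scn_clock;

/* advance: a single atomic fetch-add; returns the newly allocated value. */
static uint64_t scn_advance(scn_clock *c)
{
    return atomic_fetch_add(&c->counter, 1) + 1;
}

/* observe: current = max(current, remote) + 1, as a CAS retry loop. */
static uint64_t scn_observe(scn_clock *c, uint64_t remote)
{
    uint64_t cur = atomic_load(&c->counter);
    uint64_t next;

    do {
        next = (cur > remote ? cur : remote) + 1;
        /* On failure, cur is refreshed with the latest value and we retry. */
    } while (!atomic_compare_exchange_weak(&c->counter, &cur, next));

    return next;
}

/* Assumed envelope shape: every message carries the sender's current SCN. */
typedef struct {
    uint64_t sender_scn;
    uint32_t msg_type;
} msg_envelope;

/* Bump the local clock before the message reaches its handler. */
static uint64_t on_message_received(scn_clock *c, const msg_envelope *env)
{
    assert(env != NULL);
    return scn_observe(c, env->sender_scn);
}
```

Because observe always moves the clock strictly past the remote value, anything a node does after receiving a message is ordered after whatever the sender had done when it sent it.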

The subtle part is comparison. An SCN packs [8-bit node_id | 56-bit local counter] into a uint64. If you compare two SCNs with a raw <, the node_id in the high bits dominates and the ordering is garbage. Visibility comparisons must look at only the local-counter bits and ignore node_id entirely — while a different comparison, the one used to give ITL slots a globally-unique order, keeps node_id as a tiebreak. Getting this backwards is a silent-wrong-answer bug, so there's literally a CI check that greps for raw </==/> on SCN values and fails the build.
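A small sketch shows why the raw comparison goes wrong and what the two safe comparisons might look like. The helper names are hypothetical, and I am assuming "visible" means commit counter less than or equal to read counter; only the 8/56 bit split comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCN_COUNTER_BITS 56
#define SCN_COUNTER_MASK ((UINT64_C(1) << SCN_COUNTER_BITS) - 1)

typedef uint64_t scn_t;   /* [8-bit node_id | 56-bit local counter] */

static scn_t scn_pack(uint8_t node_id, uint64_t counter)
{
    assert(counter <= SCN_COUNTER_MASK);
    return ((uint64_t) node_id << SCN_COUNTER_BITS) | counter;
}

static uint64_t scn_counter(scn_t s) { return s & SCN_COUNTER_MASK; }
static uint8_t  scn_node(scn_t s)    { return (uint8_t) (s >> SCN_COUNTER_BITS); }

/* Visibility order: counter bits only, node_id ignored entirely. */
static bool scn_visible_le(scn_t commit_scn, scn_t read_scn)
{
    return scn_counter(commit_scn) <= scn_counter(read_scn);
}

/* Globally unique order (e.g. for ITL slots): counter first,
 * node_id only as a tiebreak. Returns <0, 0 or >0. */
static int scn_total_cmp(scn_t a, scn_t b)
{
    uint64_t ca = scn_counter(a), cb = scn_counter(b);

    if (ca != cb)
        return ca < cb ? -1 : 1;
    return (int) scn_node(a) - (int) scn_node(b);
}
```

With node 2 at counter 100 and node 1 at counter 200, a raw `a > b` says the first is later, purely because of the node_id bits, while `scn_visible_le` correctly orders counter 100 before counter 200.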

Finally the SCN has to live in durable structures, not just memory: an 8-byte pd_block_scn went into the page header (page layout version bumped 4 → 5), and an 8-byte xl_scn into every WAL record. Recovery replays in SCN order; consistent reads compare against it. The clock is the foundation the other three problems stand on.

→ Code: cluster_scn.c — cluster_scn_advance is the fetch-add, cluster_scn_observe the CAS-retry observe.

Problem 2 — MVCC across nodes, without taxing the node that has no peers

This is the one I find genuinely beautiful and genuinely scary.

Postgres decides tuple visibility with xmin/xmax plus the commit log (CLOG). Here's the trap: a tuple on shared storage might have been written by another instance. If node A resolves that tuple's xmin through node A's local CLOG, it will happily find some transaction with that number — a completely unrelated local one — and return a confident, wrong answer. Silent cross-instance corruption. The 32-bit xid space overlapping across nodes means you can never, ever resolve a remote xid through local visibility machinery.

But the obvious fix — route every tuple through some cluster-wide visibility service — would tax the 99% case (a node doing ordinary local OLTP) to serve the 1% case (a tuple another node touched). That's unacceptable; nobody will run a cluster that halves their single-node throughput.

pgrac's answer is dual-track visibility, steered by a single byte. Each heap tuple header carries a new t_itl_slot_idx (one byte; 255 = unallocated). On every visibility check, a classifier reads the tuple's ITL (interested-transaction-list) slot and decides:

own-instance evidence → fall straight through to native Postgres xmin/xmax + CLOG. Untouched. Zero cluster overhead.
remote evidence → resolve through the cluster path: ITL slot → transaction-table → SCN → undo.
stale/ambiguous (the slot was recycled to a different remote owner) → fail closed.

The payoff: a node with no live peers does one boolean comparison per tuple and runs vanilla Postgres. The whole cluster apparatus is bypassed unless a tuple is physically stamped by another node. You only pay for coherence on the rows that actually need it.
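The three-way decision can be sketched roughly as below. The slot layout, the staleness test (slot xid no longer matches the tuple's xid), and the rule that an unallocated slot means "purely local" are my assumptions for illustration; the real classifier is in cluster_visibility_resolve.c and may use different evidence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ITL_SLOT_UNALLOCATED 255  /* from the post: 255 = unallocated */
#define ITL_SLOTS_PER_PAGE   8    /* from the post: 8-slot on-page cache */

typedef enum { TRACK_NATIVE, TRACK_CLUSTER, TRACK_FAIL_CLOSED } vis_track;

/* Hypothetical slot layout, for illustration only. */
typedef struct {
    bool     in_use;
    uint8_t  owner_node;  /* instance that stamped the slot */
    uint32_t xid;         /* transaction the slot currently describes */
} itl_slot;

static vis_track classify_tuple(uint8_t slot_idx, uint32_t tuple_xid,
                                const itl_slot *slots, uint8_t my_node)
{
    assert(slots != NULL);

    /* Assumption: a tuple never stamped by the cluster path is local. */
    if (slot_idx == ITL_SLOT_UNALLOCATED)
        return TRACK_NATIVE;

    if (slot_idx >= ITL_SLOTS_PER_PAGE)
        return TRACK_FAIL_CLOSED;

    const itl_slot *s = &slots[slot_idx];

    /* Assumed staleness test: the slot was recycled for another xid. */
    if (!s->in_use || s->xid != tuple_xid)
        return TRACK_FAIL_CLOSED;

    /* Own-instance evidence falls through to native xmin/xmax + CLOG. */
    return s->owner_node == my_node ? TRACK_NATIVE : TRACK_CLUSTER;
}
```

The point of the shape is that the common case exits on the first comparison, and anything the classifier cannot positively identify goes to the fail-closed branch rather than to either visibility path.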

The remote path is where Oracle's ghosts show up. A cross-node tuple's visibility is decided by resolving its commit status and commit SCN through a transaction table, then comparing commit_scn against the snapshot's read_scn. If the current row is newer than your snapshot, pgrac reconstructs the older version Oracle-style: it walks the undo chain backward, newest-first, inverse-applying changes until it reaches a version at or before your read_scn. That's Consistent Read, rebuilt on Postgres.
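The backward walk can be illustrated with a toy model in which a row is a single integer and each undo record holds the before-image plus the SCN of the version it restores. These structures are hypothetical simplifications; real undo records describe page-level changes, and SCNs here are plain counters with no node_id bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical undo record: how to step one version back. */
typedef struct undo_rec {
    uint64_t prev_commit_scn;     /* SCN of the version this restores */
    int      prev_value;          /* the before-image */
    const struct undo_rec *next;  /* next-older undo record, or NULL */
} undo_rec;

/* Working copy of the row that consistent read rewinds in place. */
typedef struct {
    uint64_t commit_scn;
    int      value;
} row_image;

/* Inverse-apply undo, newest first, until the image is at or before
 * read_scn. Returns false if the chain runs out first, in which case
 * the caller must not use the image. */
static bool cr_build(row_image *img, const undo_rec *undo, uint64_t read_scn)
{
    assert(img != NULL);

    while (img->commit_scn > read_scn) {
        if (undo == NULL)
            return false;
        img->commit_scn = undo->prev_commit_scn;
        img->value = undo->prev_value;
        undo = undo->next;
    }
    return true;
}
```

A row at SCN 30 with undo back to SCN 20 and SCN 10 yields the SCN 20 image for a snapshot at 25, the untouched current row for a snapshot at 30, and a failure for a snapshot at 5.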

Two design decisions are worth dwelling on, because they're what separate "demo" from "correct":

Visibility is three-valued, not two. A cross-node resolution can come back VISIBLE, INVISIBLE, or UNKNOWN — and UNKNOWN is never silently collapsed into INVISIBLE. If a remote commit SCN hasn't propagated yet, or an overlay entry is missing, the read raises a specific error (53R97, "TT status unknown") and lets the caller retry or abort. A wrong-but-silent "not visible" is exactly the bug this whole layer exists to prevent, so the code refuses to guess.
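In code, the discipline amounts to never converting the third value into a boolean. The sketch below uses hypothetical types and a simplified view of the transaction-table answer; only the three result values and the 53R97 code come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { VIS_VISIBLE, VIS_INVISIBLE, VIS_UNKNOWN } vis_result;

/* Hypothetical summary of what the remote transaction table told us. */
typedef enum { TT_COMMITTED, TT_ABORTED, TT_MISSING } tt_status;

typedef struct {
    tt_status status;
    bool      scn_known;   /* false if the commit SCN has not arrived yet */
    uint64_t  commit_scn;  /* local-counter bits only */
} tt_entry;

#define ERR_TT_STATUS_UNKNOWN "53R97"  /* error code quoted in the post */

static vis_result resolve_remote(const tt_entry *e, uint64_t read_scn)
{
    assert(e != NULL);

    if (e->status == TT_ABORTED)
        return VIS_INVISIBLE;
    if (e->status == TT_COMMITTED && e->scn_known)
        return e->commit_scn <= read_scn ? VIS_VISIBLE : VIS_INVISIBLE;

    /* Missing entry or unpropagated SCN: refuse to guess. */
    return VIS_UNKNOWN;
}

/* Returns NULL and sets *visible on a definite answer; returns an
 * error code on UNKNOWN and leaves *visible untouched. */
static const char *check_visible(const tt_entry *e, uint64_t read_scn,
                                 bool *visible)
{
    vis_result r = resolve_remote(e, read_scn);

    if (r == VIS_UNKNOWN)
        return ERR_TT_STATUS_UNKNOWN;
    *visible = (r == VIS_VISIBLE);
    return NULL;
}
```

The caller-facing function has no way to return "probably not visible": it either writes a definite boolean or hands back an error for the caller to retry or abort on.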

The ABA problem is real and handled. The on-page ITL is only an 8-slot cache; a remote peer can reuse a slot for a later transaction — even a wrapped xid that numerically equals an old one. Resolving that against the peer's durable transaction table uses a 16-bit wrap generation counter, so a same-valued-but-wrapped xid is a sound zero-match (fail closed) instead of a dangerous false match. The durable outcome is only trusted when the SLRU commit state and the independent transaction-table slot agree on both commit SCN and wrap generation.
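As a rough sketch of that agreement check: the struct layouts and names below are hypothetical, and the real records certainly carry more state. What the sketch tries to capture is the rule stated above, that an xid match alone is never enough and that two independent durable sources must agree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What the on-page ITL slot claims (hypothetical layout). */
typedef struct {
    uint32_t xid;
    uint16_t wrap_gen;    /* 16-bit wrap generation counter */
} xid_ref;

/* Peer's durable transaction-table slot (hypothetical layout). */
typedef struct {
    bool     valid;
    uint32_t xid;
    uint16_t wrap_gen;
    uint64_t commit_scn;
} tt_slot;

/* Commit state recorded in the SLRU (hypothetical layout). */
typedef struct {
    bool     committed;
    uint16_t wrap_gen;
    uint64_t commit_scn;
} slru_state;

/* Trust the durable outcome only if the reference matches on xid AND
 * wrap generation, and the two durable sources agree with each other.
 * Anything else is a zero-match: *commit_scn_out is left untouched. */
static bool durable_outcome_trusted(const xid_ref *ref, const tt_slot *tt,
                                    const slru_state *slru,
                                    uint64_t *commit_scn_out)
{
    assert(ref && tt && slru && commit_scn_out);

    if (!tt->valid || !slru->committed)
        return false;
    if (tt->xid != ref->xid || tt->wrap_gen != ref->wrap_gen)
        return false;   /* same-valued but wrapped xid lands here */
    if (slru->wrap_gen != tt->wrap_gen || slru->commit_scn != tt->commit_scn)
        return false;   /* the two durable sources disagree */

    *commit_scn_out = tt->commit_scn;
    return true;
}
```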

Honest status: the dual-track resolver, the ITL write path, and decide-by-SCN are wired into every HeapTupleSatisfies* function and exercised by 2-node tests. What's still maturing is live cross-node undo reads (versus the crash-recovery materialized path) and breadth of multi-node behavioral coverage.

→ Code: cluster_visibility_resolve.c (the dual-track classifier) and cluster_cr.c (consistent-read construction from undo).

Problem 3 — Moving a dirty page between two memories without losing it

Cache Fusion is the headline RAC feature: when node B needs a block node A holds dirty, ship it straight from A's buffer cache into B's over the interconnect, instead of forcing A to write it to disk so B can read it back. That "disk ping" is two synchronous storage round-trips on the hot path and serializes every writer hand-off behind I/O — the exact thing Cache Fusion exists to kill.

The protocol is three-way. The block has a master — and the mastering is deterministic and coordinator-free: every node computes master = hash(block_tag) mod (declared nodes) independently and gets the same answer, so there's no directory server to ask. A requester asks the master; the master forwards to the current holder; the holder ships the page directly back to the requester.
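A sketch of coordinator-free mastering follows. The post does not say which hash pgrac uses, so I am substituting 64-bit FNV-1a purely for illustration, and the block_tag fields are a guess modeled loosely on PostgreSQL's buffer tag. Bytes are hashed in a fixed order so that nodes of different endianness would still agree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block identity, loosely modeled on a buffer tag. */
typedef struct {
    uint32_t tablespace;
    uint32_t database;
    uint32_t relation;
    uint32_t fork;
    uint32_t block_no;
} block_tag;

/* FNV-1a over one 32-bit field, least significant byte first. */
static uint64_t fnv1a_u32(uint64_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (uint8_t) (v >> (8 * i));
        h *= UINT64_C(1099511628211);      /* FNV-1a 64-bit prime */
    }
    return h;
}

/* master = hash(block_tag) mod declared_nodes, computed locally. */
static uint32_t block_master(const block_tag *t, uint32_t declared_nodes)
{
    assert(t != NULL && declared_nodes > 0);

    uint64_t h = UINT64_C(14695981039346656037);  /* FNV offset basis */
    h = fnv1a_u32(h, t->tablespace);
    h = fnv1a_u32(h, t->database);
    h = fnv1a_u32(h, t->relation);
    h = fnv1a_u32(h, t->fork);
    h = fnv1a_u32(h, t->block_no);

    return (uint32_t) (h % declared_nodes);
}
```

Note that the modulus is the declared node count, not the live one: the answer depends only on the tag and on static configuration, which is what lets every node compute it without asking anyone.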

The correctness-critical step is one line of intent: flush the WAL up to this page's LSN before the bytes leave the node. Here's why it's mandatory and not just careful. pgrac ships the dirty in-memory image — that page never went to shared storage. The receiver installs it and may build more redo on top. If the holder then crashes and the WAL describing how that page reached its current LSN was never durable, recovery cannot reconstruct the page the receiver already built on. That's a cross-node lost write, the kind that surfaces as corruption days later. Flushing WAL-up-to-page_lsn before shipping makes the redo chain behind the image durable first.
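The ordering can be sketched with a simulated WAL position instead of real I/O. Everything here is a hypothetical stand-in: in stock PostgreSQL the corresponding durability step before writing a dirty buffer is XLogFlush on the page's LSN, and I am assuming pgrac's ship path does something analogous.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated WAL positions on the holder node. */
typedef struct {
    uint64_t inserted_lsn;  /* WAL generated so far (in memory) */
    uint64_t flushed_lsn;   /* WAL known durable */
} wal_state;

typedef struct {
    uint64_t page_lsn;      /* LSN of the last change to this page */
    bool     dirty;
} page_buf;

/* Stand-in for the real flush; just advances the durable pointer. */
static void wal_flush_upto(wal_state *w, uint64_t lsn)
{
    assert(lsn <= w->inserted_lsn);
    if (w->flushed_lsn < lsn)
        w->flushed_lsn = lsn;
}

/* Invariant: a page image may leave the node only if all redo that
 * produced it is already durable. */
static bool ship_allowed(const wal_state *w, const page_buf *p)
{
    return w->flushed_lsn >= p->page_lsn;
}

/* Flush first, then send. *sent stands in for the interconnect send. */
static bool ship_dirty_page(wal_state *w, const page_buf *p, int *sent)
{
    assert(w && p && sent);

    wal_flush_upto(w, p->page_lsn);
    if (!ship_allowed(w, p))
        return false;
    (*sent)++;
    return true;
}
```

Keeping the invariant as its own predicate makes it cheap to assert at the send site, so a future code path that skips the flush fails loudly instead of shipping an image whose redo could vanish in a crash.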

The other subtle bit is the ownership hand-off. The master flips "who owns this block exclusively" to the new writer only after the new writer reports it has durably installed the image — not when the request is granted. That ordering closes the window where two nodes could both believe they hold the block exclusive and both write it.
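The hand-off ordering can be modeled as a tiny state machine on the master. The states, fields, and function names are hypothetical; the only behavior taken from the text is that granting a request leaves the exclusive owner unchanged, and the flip happens on the install acknowledgement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { XFER_IDLE, XFER_GRANTED } xfer_state;

/* Hypothetical per-block record kept by the master. */
typedef struct {
    uint8_t    exclusive_owner;  /* node that currently holds X */
    uint8_t    pending_owner;    /* meaningful only in XFER_GRANTED */
    xfer_state state;
} master_entry;

/* Grant: remember who is coming, but do NOT change ownership yet. */
static void master_grant(master_entry *m, uint8_t requester)
{
    assert(m != NULL && m->state == XFER_IDLE);
    m->pending_owner = requester;
    m->state = XFER_GRANTED;
}

/* Install ack: only now does exclusive ownership move. Acks from the
 * wrong node, or with no transfer in flight, are ignored. */
static bool master_on_install_ack(master_entry *m, uint8_t from)
{
    assert(m != NULL);

    if (m->state != XFER_GRANTED || from != m->pending_owner)
        return false;
    m->exclusive_owner = from;
    m->state = XFER_IDLE;
    return true;
}
```

Between grant and ack the master still names the old holder as owner, so at no point does its record show two exclusive holders.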

Honest status, and I want to be exact here because it's easy to overclaim: the data plane — memory-to-memory ship, WAL-flush-before-ship, CRC, LSN re-stamping, the starvation guard, the invalidate-and-ack broadcast — is all built and wired. First-touch acquisition and read-sharing run on the real lock path. But transferring a block away from a still-live remote writer (the X-to-X case) is currently bounded fail-closed — the master returns FEATURE_NOT_SUPPORTED rather than do a transfer it can't yet prove correct under concurrent recovery. The hard case is implemented but gated off until the warm-recovery substrate underneath it lands. That's the honest line: the machinery is there; the last mile of live writer transfer is deliberately closed, not faked.

→ Code: cluster_gcs_block.c (the 3-way request / forward / invalidate handlers) and cluster_pcm_lock.c (the acquire path).

Problem 4 — Not corrupting everything when the network splits

Multi-active is where split-brain stops being a slide and becomes the thing that eats your data. If the interconnect partitions and both sides keep accepting writes to shared storage, you get two divergent truths on one set of blocks. There is no clean merge from there.

pgrac's primary defense is a fail-closed gate at the commit boundary. Inside CommitTransaction, after the transaction has done its work but before the WAL commit record is flushed, a writable transaction checks whether this node still holds quorum. If not, the transaction is aborted — error 53R40, quorum lost.

The placement is the entire point. A commit becomes durable, visible, and irrevocable only at the instant its WAL commit record hits disk. The gate sits one step upstream of that flush. So a node that loses quorum mid-transaction can never turn that work into a durable commit; uncertainty resolves to abort, never to a divergent durable write. A partitioned minority that still thinks it's primary simply cannot make its writes stick.
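Reduced to its skeleton, the gate is a check placed immediately before the flush. The types and function below are a hypothetical model, not the code in xact.c; the 53R40 code and the "writable transactions only" rule are from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { COMMIT_OK, COMMIT_ABORT_QUORUM_LOST } commit_result;

#define ERR_QUORUM_LOST "53R40"   /* error code quoted in the post */

/* Hypothetical view of this node's cluster state. */
typedef struct {
    bool has_quorum;
} node_state;

/* Hypothetical transaction summary at commit time. */
typedef struct {
    bool wrote_data;             /* writable transaction? */
    bool commit_record_flushed;  /* set once the commit is durable */
} xact;

static commit_result commit_transaction(xact *x, const node_state *n)
{
    assert(x != NULL && n != NULL);

    /* The gate: after the work is done, before anything is durable. */
    if (x->wrote_data && !n->has_quorum)
        return COMMIT_ABORT_QUORUM_LOST;   /* would raise 53R40 */

    /* Stand-in for flushing the WAL commit record. */
    if (x->wrote_data)
        x->commit_record_flushed = true;

    return COMMIT_OK;
}
```

On the abort path the flush line is never reached, which is the whole guarantee: there is no ordering of events in which a node without quorum ends up with a durable commit record.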

Quorum itself is a disk-based vote with a lease: a majority of voting disks, and a lease that the quorum process must keep renewing. If that process hangs, the lease expires on its own and the gate fail-closes — no broadcast required, no liveness assumption. On top of that, a cooperative write-fence (epoch + lease + a quorum-majority marker on the voting disks) rejects shared-storage writes from a node whose epoch is stale or whose lease has aged out, with a PANIC if it's caught mid-critical-section where a half-write can't be rolled back.
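The three checks described here (majority, lease age, epoch) can be sketched as pure functions. The names, the millisecond clock, and the exact comparison rules are assumptions for illustration; the real logic is in cluster_qvotec.c and cluster_fence.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { FENCE_ALLOW, FENCE_REJECT, FENCE_PANIC } fence_verdict;

/* Hypothetical per-node fencing state. */
typedef struct {
    uint64_t epoch;            /* epoch this node last joined under */
    uint64_t lease_expiry_ms;  /* lease is valid strictly before this */
} node_fence;

/* Strict majority of voting disks. */
static bool has_majority(int reachable_disks, int total_disks)
{
    assert(total_disks > 0);
    return reachable_disks * 2 > total_disks;
}

/* The lease expires on its own if nobody renews it; no message needed. */
static bool lease_valid(const node_fence *n, uint64_t now_ms)
{
    return now_ms < n->lease_expiry_ms;
}

/* Write-fence: reject shared-storage writes from a stale epoch or an
 * aged-out lease. Inside a critical section there is no clean way to
 * back out, so the verdict escalates to PANIC. */
static fence_verdict fence_check_write(const node_fence *n,
                                       uint64_t cluster_epoch,
                                       uint64_t now_ms,
                                       bool in_critical_section)
{
    assert(n != NULL);

    if (n->epoch == cluster_epoch && lease_valid(n, now_ms))
        return FENCE_ALLOW;
    return in_critical_section ? FENCE_PANIC : FENCE_REJECT;
}
```

The lease check needs only the local clock and the stored expiry, which is what lets a hung quorum process fail the gate closed without any other node having to notice.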

Now the honest part, because a split-brain section that oversells is worse than useless. pgrac's fencing is cooperative and internal — there is no external STONITH, no IPMI, no SCSI-3 reservation, no "shoot the other node in the head" yet. The safety rests on each node fail-closing itself. That covers a cooperating node that loses quorum; it does not cover a node hung at the kernel level mid-write that ignores its own gates — the exact case STONITH exists for, and which is on the roadmap, not in the tree. And one more limitation I'll state plainly because it's load-bearing: today the quorum signal is driven by voting-disk I/O health, not by peer-liveness, so a clean network partition between two healthy nodes that can both still reach the disks does not, by itself, trip the fence. These are documented limits, not marketing softening. Getting split-brain fully right is years of work; what's here is a sound cooperative core with the failure modes written down.

→ Code: the commit gate lives in xact.c (CommitTransaction); quorum + lease in cluster_qvotec.c; the cooperative write-fence in cluster_fence.c.

Where it actually stands

Running today, on real code paths: the SCN clock, dual-track cross-node MVCC, Cache Fusion's data plane (first-touch + read-sharing), cluster catalog invalidation, the cluster-aware storage manager, and the substrate (interconnect, heartbeat, multi-node bootstrap). In progress: live-holder Cache Fusion transfer, full cross-node GES locking, crash/instance recovery, and external fencing. The sanity anchor underneath all of it: the --disable-cluster build is binary-identical to upstream PostgreSQL 16.13 and passes the full 219-test regression suite — the non-cluster path is trustworthy, the cluster path is young. Per-feature status, honestly maintained, lives at pgrac.dev/features.

If you've operated RAC or any shared-storage cluster and want to be useful: clone it, read the design, and tell me where it's wrong. "This breaks under X" is the most valuable thing you can send me right now, and there are scoped good first issues if you want to get your hands in.

Repo: https://github.com/sqlrush/pgrac · architecture and the full feature map: https://pgrac.dev

The next build log goes one level deeper into the interconnect and how membership survives a node dropping mid-transaction.