Links
- Repo: https://github.com/feichai0017/NoKV
- Interactive demo: https://demo.eric-sgc.cafe/
When people first look at a distributed KV system, one of the most natural assumptions is:
“There should be one control-plane service that owns the cluster metadata.”
That intuition is understandable. If you’ve looked at systems like TiKV, the first mental model you often get is something like a PD-style component: routing, timestamps, heartbeats, scheduling decisions, cluster topology, all gathered around one control-plane authority.
We started with a similar intuition.
But as NoKV grew, that model started to feel too coarse.
The problem was not whether we wanted a control plane. We absolutely did. The problem was whether the durable truth of the distributed system should live inside the same process that answers requests, serves views, and reacts to runtime events.
Our answer became: no.
That is why NoKV ended up with a deliberate split between:
- `meta/root`: the rooted truth kernel
- `coordinator`: the control-plane service and rebuildable runtime view
Once that split became explicit, a lot of other design choices started to become cleaner.
The core idea
In NoKV, the “brain” of the distributed system is not the coordinator.
The durable metadata truth lives in meta/root, which is implemented as a typed, append-only committed log plus compact applied state. Coordinator lease changes, allocator fences, region lifecycle, pending peer/range changes: these are not “just some in-memory fields inside the control plane”. They are rooted, replicated, and auditable metadata truth.
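To make "typed, append-only committed log plus compact applied state" concrete, here is a minimal Go sketch. The names (`Event`, `AppliedState`, the event kinds) are hypothetical illustrations, not NoKV's actual `meta/root` types; the point is only the shape: truth is an ordered event log, and the compact state is whatever you get by folding it.

```go
package main

import "fmt"

// EventKind enumerates typed rooted events (hypothetical names).
type EventKind int

const (
	CoordinatorLeaseGranted EventKind = iota
	RegionCreated
	AllocatorFenceAdvanced
)

// Event is one entry in the append-only committed log.
type Event struct {
	Index uint64 // position in the committed log
	Kind  EventKind
	Data  string // payload, simplified to a string here
}

// AppliedState is the compact state rebuilt by folding events in order.
type AppliedState struct {
	LastIndex   uint64
	LeaseHolder string
	Regions     map[string]bool
}

// Apply folds one committed event into the compact state.
func (s *AppliedState) Apply(e Event) {
	switch e.Kind {
	case CoordinatorLeaseGranted:
		s.LeaseHolder = e.Data
	case RegionCreated:
		s.Regions[e.Data] = true
	}
	s.LastIndex = e.Index
}

func main() {
	log := []Event{
		{Index: 1, Kind: CoordinatorLeaseGranted, Data: "coord-a"},
		{Index: 2, Kind: RegionCreated, Data: "r1"},
		{Index: 3, Kind: CoordinatorLeaseGranted, Data: "coord-b"},
	}
	state := &AppliedState{Regions: map[string]bool{}}
	for _, e := range log {
		state.Apply(e)
	}
	fmt.Println(state.LeaseHolder, state.LastIndex) // coord-b 3
}
```

Because every consumer folds the same log, any process can reconstruct the same applied state, which is what makes the layers above rebuildable.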
The coordinator sits above that truth.
It is a service + view, not the ultimate owner of metadata persistence.
That distinction matters a lot.
Because the moment you let the control plane also be the sole durable metadata owner, you start coupling together several concerns that actually have very different failure and evolution properties:
- serving RPCs
- maintaining routing views
- lease competition
- allocator windows
- scheduling logic
- metadata durability
- metadata replication
We wanted those boundaries to be explicit instead of implicit.
From a PD-like intuition to a rooted-truth design
A useful way to explain the evolution is this:
- the initial intuition was closer to a TiKV / PD-style control-plane concentration
- the final direction is closer to a FoundationDB-style role separation, combined with a Delos-like rooted-truth design
Not in the sense of copying another system’s exact implementation, but in the sense of adopting a cleaner architectural boundary:
- the log is the truth
- services above it are consumers, views, and operators
- restart should rebuild from truth, not recover from hidden local authority
That is the key shift.
In other words, we did not want the coordinator to become a giant “metadata brain process” that owns everything and then needs more and more local state to stay alive. We wanted it to be horizontally deployable and operationally replaceable.
So in NoKV:
- `meta/root` owns durable rooted truth
- `coordinator` consumes rooted truth and builds a runtime cluster view
- `raftstore` executes data-plane work and region-level replication
This is also why the repository documentation describes the system as having three planes:
- truth plane
- control plane
- execution plane
Why this split is useful in practice
This is not just a conceptual refinement. It has concrete engineering payoffs.
1. Coordinator becomes much lighter on restart
A coordinator restart is no longer “recover local metadata authority”.
It becomes:
- reconnect to rooted truth
- rebuild the in-memory view
- resume lease competition if appropriate
- continue serving
That makes the coordinator much easier to reason about operationally. The only thing that differentiates active and standby coordinators is the rooted lease state, not some private local metadata store.
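The restart sequence above can be sketched in a few lines of Go. Everything here is a hypothetical shape, not NoKV's actual coordinator API; what matters is that recovery touches only rooted truth, never a private local store.

```go
package main

import "fmt"

// TruthClient is a hypothetical client for the rooted truth service.
type TruthClient struct {
	events      []string
	leaseHolder string
}

func (c *TruthClient) CommittedLog() []string { return c.events }

// TryAcquireLease resumes lease competition; here it simply reports
// whether this instance holds (or can take) the rooted lease.
func (c *TruthClient) TryAcquireLease(self string) bool {
	if c.leaseHolder == "" {
		c.leaseHolder = self
	}
	return c.leaseHolder == self
}

// View is the coordinator's rebuildable in-memory cluster view.
type View struct{ entries []string }

// Restart is the whole recovery path: reconnect to rooted truth,
// rebuild the view from the committed log, resume lease competition,
// then serve in whichever role results.
func Restart(c *TruthClient, self string) (*View, bool) {
	v := &View{}
	for _, e := range c.CommittedLog() { // rebuild from truth
		v.entries = append(v.entries, e)
	}
	active := c.TryAcquireLease(self) // resume lease competition
	return v, active                  // serve as active or standby
}

func main() {
	truth := &TruthClient{
		events:      []string{"lease:grant:coord-a", "region:create:r1"},
		leaseHolder: "coord-a",
	}
	// A restarted standby rebuilds the same view as the active holder.
	view, active := Restart(truth, "coord-b")
	fmt.Println(len(view.entries), active) // 2 false
}
```

Note that `Restart` has no branch for "recover my own durable state": a fresh process and a restarted process run the identical path.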
2. Durable truth stops being mixed with runtime convenience
Routing caches, heartbeat-derived state, scheduling hints, and local runtime maps are useful, but they are not the same thing as authoritative metadata truth.
The split forces us to say that explicitly.
That reduces a whole category of ambiguity around:
- which state is “just a view”
- which state must survive as the source of truth
3. Control-plane horizontal scaling becomes more realistic
If the coordinator is “everything”, then horizontal scaling is awkward, because every extra coordinator replica either:
- becomes a passive hot standby with hidden state coupling, or
- requires reimplementing distributed truth inside the coordinator layer itself
But if the durable metadata truth already lives below, then multiple coordinator processes become much simpler:
- all consume the same rooted truth
- all rebuild the same kind of view
- lease determines who is currently active for singleton duties
- standby instances are not fake; they are real, warm consumers of the same truth
That is a much cleaner story for scaling and failover.
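A minimal Go sketch of that story, with hypothetical names (NoKV's real lease and coordinator types will differ): every coordinator consumes the same rooted lease state, and the only thing gated by it is singleton duties.

```go
package main

import "fmt"

// Lease is the rooted lease record; whoever holds it performs
// singleton control-plane duties. Hypothetical shape.
type Lease struct {
	Holder string
	Epoch  uint64
}

// Coordinator instances all share the same rooted truth,
// simplified here to a shared pointer.
type Coordinator struct {
	ID    string
	lease *Lease
}

// IsActive: the only difference between active and standby is the
// rooted lease state, not any private local metadata.
func (c *Coordinator) IsActive() bool { return c.lease.Holder == c.ID }

// Schedule performs a singleton duty only when holding the lease;
// standbys are warm consumers of the same truth, not fake replicas.
func (c *Coordinator) Schedule() string {
	if !c.IsActive() {
		return c.ID + ": standby, skipping singleton duty"
	}
	return c.ID + ": running scheduling pass"
}

func main() {
	lease := &Lease{Holder: "coord-a", Epoch: 7}
	a := &Coordinator{ID: "coord-a", lease: lease}
	b := &Coordinator{ID: "coord-b", lease: lease}
	fmt.Println(a.Schedule()) // coord-a: running scheduling pass
	fmt.Println(b.Schedule()) // coord-b: standby, skipping singleton duty
	lease.Holder = "coord-b"  // failover committed in rooted truth
	fmt.Println(b.Schedule()) // coord-b: running scheduling pass
}
```

Failover is then nothing more than the lease record changing in rooted truth; no coordinator-local state has to migrate.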
4. Authority handoff becomes auditable
Lease grant, seal, closure, handoff: these become committed rooted events rather than side effects lost inside a single service process.
That matters both for correctness and for understanding the system later.
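Because those transitions are log entries, they can be replayed and checked after the fact. A small Go sketch of that idea, again with hypothetical event names rather than NoKV's actual schema: replaying the lease events both reconstructs the current holder and verifies that every grant was properly fenced by a seal.

```go
package main

import "fmt"

// LeaseEvent is a hypothetical committed rooted event; because
// grants and seals are log entries rather than in-process side
// effects, the full authority history can be replayed and audited.
type LeaseEvent struct {
	Index  uint64
	Kind   string // "grant" or "seal"
	Holder string
}

// ReplayLease folds the committed lease events, rejecting
// non-monotonic indices and unfenced transitions: a new grant is
// only valid once the previous lease has been sealed.
func ReplayLease(log []LeaseEvent) (string, error) {
	holder, sealed := "", true
	var last uint64
	for _, e := range log {
		if e.Index <= last {
			return "", fmt.Errorf("non-monotonic index %d", e.Index)
		}
		last = e.Index
		switch e.Kind {
		case "grant":
			if !sealed {
				return "", fmt.Errorf("grant at #%d without prior seal", e.Index)
			}
			holder, sealed = e.Holder, false
		case "seal":
			sealed = true
		}
	}
	return holder, nil
}

func main() {
	log := []LeaseEvent{
		{Index: 10, Kind: "grant", Holder: "coord-a"},
		{Index: 42, Kind: "seal", Holder: "coord-a"},
		{Index: 43, Kind: "grant", Holder: "coord-b"},
	}
	holder, err := ReplayLease(log)
	fmt.Println(holder, err) // coord-b <nil>
}
```

The same replay that rebuilds state doubles as the audit: a handoff that never happened in the log simply cannot be claimed, and a malformed history is detected rather than silently absorbed.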
Why we built a dashboard for this
Once you split the system this way, a static architecture diagram is no longer enough.
Because the interesting part is not just “there are three kinds of nodes”.
The interesting part is:
- who is currently the `meta-root` Raft leader
- which coordinator currently holds the lease
- how region leaders are distributed across stores
- how failover changes the live control path
- what stays durable truth, and what is only a rebuildable view
That is why we built a live dashboard.
The dashboard is not only there to make the demo prettier. It is there because this architecture is much easier to understand when you can observe it from several angles at once:
- truth plane: rooted truth ownership and replication
- control plane: lease holder, routing view, coordinator role
- execution plane: per-region leadership and store-level state
It turns the system from “a diagram in a README” into something you can actually inspect while it is running.
That is especially useful for a project like NoKV, because one of our goals is not just to build a storage system, but to build a maintainable and extensible distributed storage research platform.
If the architecture cannot be made visible, it is much harder to evolve it rigorously.
If you want to read the code, start here
If you want to understand this split from the source code instead of only from this post, these are the best entry points:
`meta/root/`
The rooted truth kernel:
- typed events
- compact state
- storage backend
- remote service/client
`coordinator/`
The control-plane service:
- routing
- heartbeats
- lease handling
- allocator serving
- rebuildable cluster view
`raftstore/`
The execution plane:
- multi-Raft region lifecycle
- replicated command execution
The shortest doc path is:
`README.md` → `docs/architecture.md` → `docs/rooted_truth.md` → `docs/coordinator.md`
Those four together give the cleanest route from:
“What is this repo?”
to:
“Why are these package boundaries the way they are?”
Closing thought
A lot of distributed systems talk about separation of concerns, but in practice still let the control plane quietly accumulate too much hidden authority.
What we wanted in NoKV was a cleaner line:
- the durable metadata truth should live in its own rooted substrate
- the coordinator should be a service layer on top of that truth
- the execution plane should stay separate from both
That separation made the architecture easier to explain, easier to visualize, and, I think, easier to extend.
And that is exactly why the dashboard exists: not as decoration, but as a way to make those boundaries visible.