
Andrea Cadamuro

Posted on • Originally published at github.com

How a 1-in-3 BFT bug led me to wall-clock-bucketed DAG rounds

About a year ago, the consensus runtime I'd been building started doing something annoying.

The setup was straightforward: a Tendermint-style chained BFT with five masternodes finalising blocks proposed by a rotating set of lightnodes, partitioned into committees of 5-10 nodes each (we call them "groups"). The design was textbook. The implementation worked fine on a single machine, fine on two machines in the same datacenter, fine on three machines across two regions.

Then we put it on a real testbed — four VMs across three geographic regions (US-East, EU-Central, EU-North), 26 masternodes, 115 lightnodes — and started pushing realistic load through it. About 10³ transactions per second, distributed across four RPC endpoints, sustained.

And about every third group-formation transition, the BFT certificate would stall. Two honest masternodes would compute slightly different values for the canonical state digests we use to bind each certificate — ranked_hash_stable, tenure_start_height, group_members_hash — and refuse to sign each other's certs. No fork, no malicious behavior. Just a quorum that couldn't assemble.

It took weeks to trace. This article is about what I found, the small change that fixed it, and the two follow-on design decisions it pushed me toward.

What Savitri does, briefly

Savitri Network is an L1 blockchain I've been building in Rust. Two validator roles: masternodes (small fixed set, run BFT for finality, think Tendermint validators) and lightnodes (larger floating set, do the actual block production). Roles are separated because finality and production scale differently — having all nodes vote on every block is wasteful, having one node produce every block is the throughput bottleneck.

Groups are deterministic partitions of the lightnode set into committees of 5-10. Multiple groups run block production in parallel; that's how the chain scales horizontally. Masternodes recompute the partition every ~100 blocks (TENURE_BLOCKS) so adversaries can't reliably concentrate sybils into the same group.
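To make the determinism concrete, here is a minimal sketch of seed-driven partitioning: every masternode that agrees on the seed and the lightnode list computes identical groups. The function names are mine, and I'm using std's `DefaultHasher` purely for illustration — it isn't stable across Rust toolchains, so a real deployment would use a fixed hash function instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministically assign each lightnode to one of `n_groups` groups.
/// Same seed + same node list => same partition on every masternode.
fn assign_groups(node_ids: &[&str], seed: u64, n_groups: usize) -> Vec<usize> {
    node_ids
        .iter()
        .map(|id| {
            let mut h = DefaultHasher::new();
            seed.hash(&mut h);
            id.hash(&mut h);
            (h.finish() as usize) % n_groups
        })
        .collect()
}
```

Re-deriving the partition every tenure just means feeding in a new seed, which is what prevents sybils from camping in one group.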

PoU is the consensus scoring scheme — Proof of Unity. The "unity" part is that we collapse five behavioural signals into a single score per lightnode, used to determine who's eligible to be elected proposer. More on this later.

The V0.2 work I'll describe is layered on top: a DAG-BFT runtime called Lattice, derived from the Bullshark / Narwhal family, that's shipping in observation-only mode while we validate it against the legacy V0.1 BFT path. The point of this article isn't to sell Savitri — it's to walk through one specific bug and the engineering choices it pushed me toward.

The bug: round derivation under asymmetric load

The DAG-BFT papers (DAG-Rider, Narwhal, Bullshark) all derive a cell's round from observed DAG depth: when you produce a new cell, you set round_i = 1 + max(parents.round) over the certified parents you've observed locally.

This converges under symmetric load. Under asymmetric load — one half of the network temporarily lagging — it doesn't.
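The textbook rule is easy to sketch, and the sketch makes the failure mode visible. Types and names here are hypothetical, not Savitri's actual ones:

```rust
// Textbook DAG-BFT round derivation: a cell's round is one past the
// deepest certified parent this node has observed locally.
struct Cell {
    round: u64,
}

fn next_round(observed_parents: &[Cell]) -> u64 {
    1 + observed_parents.iter().map(|c| c.round).max().unwrap_or(0)
}
```

Two honest nodes with different local views derive different rounds: a lagging node that has only seen parents up to round 6 emits round 7, while an up-to-date peer emits round 9, and attestations can't accumulate across the gap.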

What happened in our V0.1 (which used a related but simpler mechanism: per-tick local sampling of validator latency) was a more boring version of the same problem. Each masternode was building its PoU ranking from locally observed RTTs. Validators in different regions saw different latencies, computed different rankings, and from those rankings derived slightly different group compositions. The Phase 1 fix at the time published a "canonical latency table" over intra-group gossip — but that fix only worked if every validator's per-tick sampling window aligned. Under asymmetric load, it didn't.

The fundamental issue, in both V0.1 and the textbook DAG-BFT design, is that the consensus round is derived from observer-local state. Anything observer-local will diverge if the network is asymmetric, and once round diverges, the BFT quorum can't assemble (because attestations on cells of round R only accumulate from peers who've themselves reached round R).

The fix: anchor rounds to wall-clock buckets

The substitute I shipped is embarrassingly simple:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

pub const LATTICE_ROUND_DURATION_SECS: u64 = 1;

/// Wall-clock-bucketed round: identical on every NTP-synchronised node.
pub fn current_lattice_round() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() / LATTICE_ROUND_DURATION_SECS)
        .unwrap_or(0)
}
```

LATTICE_ROUND_DURATION_SECS = 1. Every NTP-synchronised node computes the same round at the same physical instant. No randomness beacon, no consensus on round index, no DAG-depth observation. Clock arithmetic.

On the same 4-VM cluster that previously diverged in something like 1-in-3 group transitions, the new round mechanism produced 0 mismatches across 277 group transitions under 30 minutes of sustained load. The class of bug is structurally gone.

The obvious objection — and I want to address it because it's the first thing every reviewer asked — is that I've introduced an NTP dependence the textbook design doesn't have. That's true. The paper's §5.6 documents eight specific disadvantages, including BGP hijacks of NTP servers, MITM on unauthenticated NTP traffic, and leap-second handling.

The mitigation strategy is a layered TimeOracle: NTS (Network Time Security, RFC 8915, authenticated NTP) as the primary external source; an HTTPS-timestamp fallback from CDN endpoints when NTS is unavailable; and an in-protocol peer-time consensus. For the last layer, each CellAttestation includes the signer's signed local timestamp, the aggregator computes a rolling median across peer attestations, and a validator whose local clock diverges from the peer median by more than one round self-degrades to observer-only mode (it still publishes cells but doesn't attest, so it can't poison the quorum).
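The peer-median check is the part that fits in a few lines. This is a sketch under my own naming, with timestamps as plain seconds rather than the actual signed attestation fields:

```rust
const LATTICE_ROUND_DURATION_SECS: u64 = 1;

/// Median of the peers' signed local timestamps (None if no peers yet).
fn peer_median(mut peer_ts: Vec<u64>) -> Option<u64> {
    if peer_ts.is_empty() {
        return None;
    }
    peer_ts.sort_unstable();
    Some(peer_ts[peer_ts.len() / 2])
}

/// True when this validator should stop attesting (observer-only mode):
/// its local clock is more than one round away from the peer median.
fn should_self_degrade(local_ts: u64, peer_ts: Vec<u64>) -> bool {
    match peer_median(peer_ts) {
        Some(m) => local_ts.abs_diff(m) > LATTICE_ROUND_DURATION_SECS,
        None => false,
    }
}
```

A node that self-degrades keeps publishing cells, so it can rejoin as soon as its clock is back within tolerance.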

To corrupt this, an adversary would need to compromise f+1 peers plus the NTS providers — significantly harder than MITM-ing a single plain-NTP query. The whole TimeOracle work is broken into 7 GitHub issues; the first one is the layered scaffolding, no upstream dependencies, ~3 days of work. Genuinely happy to take contributions there.

What this taught me about reputation

The wall-clock fix solved the immediate divergence. But it also made me reconsider how we were weighting the proposer election in the first place.

In Bullshark and the broader Algorand lineage, the cycle anchor (or "pivot") is chosen by either a public-coin randomness beacon or a VRF over stake. In both, the weighting input is capital — how much money you've locked up.

Our setup wanted something different. The validator population we aspire to — residential broadband nodes, mobile validators, edge devices — is heterogeneous in quality, not in capital. A node with 6 months of clean availability is a more reliable proposer than a freshly-funded whale with no history, but in a capital-weighted system the whale wins.

So PoU is a 5-component behavioural score, mechanically measured by the masternodes:

  • Availability (25%) — heartbeat presence over a 10-minute rolling window
  • Latency (20%) — median observed RTT, normalised
  • Integrity (20%) — rate of protocol violations (bad signatures, dangling parents, equivocation)
  • Reputation (20%) — slow EMA of past integrity (punishes persistent bad actors, allows slow recovery)
  • Participation (15%) — fraction of rounds the validator actually contributed to

These five components are combined into a single score in [0, 1000], smoothed by an EMA with α = 0.97 at the masternode tier. The cycle pivot is then chosen by a blake3-seeded Fisher-Yates shuffle weighted by that score.
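The combination itself is just a weighted sum plus smoothing. A sketch with the weights from the list above — field names are illustrative, components assumed pre-normalised to [0, 1], and I'm assuming α = 0.97 weights the running average (i.e. heavy smoothing toward history):

```rust
struct PouComponents {
    availability: f64,  // each component normalised to [0, 1]
    latency: f64,
    integrity: f64,
    reputation: f64,
    participation: f64,
}

/// Weighted sum of the five behavioural signals, scaled to [0, 1000].
fn pou_score(c: &PouComponents) -> f64 {
    1000.0
        * (0.25 * c.availability
            + 0.20 * c.latency
            + 0.20 * c.integrity
            + 0.20 * c.reputation
            + 0.15 * c.participation)
}

/// EMA smoothing at the masternode tier: 97% history, 3% new observation.
fn smooth(prev_ema: f64, new_score: f64) -> f64 {
    0.97 * prev_ema + 0.03 * new_score
}
```

The high α is what makes reputation slow to earn and slow to lose: one good (or bad) tick barely moves the score.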

The two-tier model is: stake handles validator admission (you need a bond to register, like any PoS chain), PoU handles weighting among admitted validators. Stake says "can you play?", PoU says "should we listen to you right now?".

I haven't found another production L1 that ships mechanically-measured multi-attribute reputation as the consensus weight. EOS-family DPoS uses voted reputation (subjective, political). BFT-SMaRt supports a one-dimensional scalar. If you know of prior art I missed, I'd genuinely want to read it.

The other thing I learned: ship behind a gate

The third decision was engineering, not algorithm.

I didn't want a coordinated hard fork from V0.1 to V0.2. Ethereum DAO fork (2016), Bitcoin SegWit2x (2017), Cardano Allegra (2020) — every high-profile case taught the same lesson. An upgrade that requires a simultaneous switch across an unbounded validator set is operationally fragile.

So V0.2 ships behind an environment variable:

```rust
pub const CONSENSUS_VERSION_ENV: &str = "SAVITRI_CONSENSUS_VERSION";

#[inline]
pub fn is_authoritative_mode() -> bool {
    std::env::var(CONSENSUS_VERSION_ENV)
        .map(|v| v.eq_ignore_ascii_case("v2"))
        .unwrap_or(false)
}
```

Default (env var unset): V0.2 runs in observation-only mode. Produces cells, attests, certifies, identifies cycle commits, logs all the DIAG metrics — but does not push to chain storage. V0.1 BFT keeps finalising. Both runtimes coexist on production traffic.

When the env var flips to "v2", V0.2 commits become authoritative and V0.1 is short-circuited. The cluster-wide cutover is parameterised by SAVITRI_V2_ACTIVATION_EPOCH=N so all validators transition atomically at the same epoch — no straggler problem, no split brain.
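The epoch-parameterised cutover reduces to one comparison. A sketch, not the actual Savitri function — I've pulled the env lookup out into a parameter so the logic is testable, and the fallback behaviour (unset or malformed value means observation-only) is my assumption:

```rust
pub const ACTIVATION_EPOCH_ENV: &str = "SAVITRI_V2_ACTIVATION_EPOCH";

/// V0.2 becomes authoritative only once the chain reaches the configured
/// activation epoch, so every validator flips at the same epoch boundary.
fn v2_authoritative_at(current_epoch: u64, env_value: Option<&str>) -> bool {
    match env_value.and_then(|v| v.parse::<u64>().ok()) {
        Some(activation) => current_epoch >= activation,
        None => false, // unset or malformed: stay in observation-only mode
    }
}
```

Because the condition is on chain state (the epoch) rather than on wall-clock deploy time, a validator that restarts late still flips at exactly the same height as everyone else.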

The pre-activation gate is empirical: a counter lattice_commit_matches_v1 compares the cycle that would be committed under V0.2 against the V0.1 BlockCertificate at the same height, over a window of at least 10⁵ blocks. Zero divergences → safe to flip.
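The shape of that counter is roughly this — an illustrative sketch, with digest comparison standing in for whatever the real lattice_commit_matches_v1 metric compares:

```rust
/// Compare the cycle V0.2 would commit against the V0.1 certificate
/// at the same height; flipping is safe only after a full window with
/// zero divergences.
struct CommitComparator {
    matches: u64,
    divergences: u64,
}

impl CommitComparator {
    fn record(&mut self, v2_digest: &[u8; 32], v1_digest: &[u8; 32]) {
        if v2_digest == v1_digest {
            self.matches += 1;
        } else {
            self.divergences += 1;
        }
    }

    /// Safe to flip: at least `window` observations, all of them matches.
    fn safe_to_activate(&self, window: u64) -> bool {
        self.divergences == 0 && self.matches >= window
    }
}
```

Note the gate is one-way strict: a single divergence anywhere in the window keeps the flag down, no matter how many matches follow.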

I think this pattern is generalisable. Any chain doing a consensus upgrade could adopt the same shape: ship gated, default observation-only, parameterise the flag-day, condition on an empirical pre-criterion.

What I'm honestly not claiming

A couple of things to head off the obvious questions.

This is not a Bullshark replacement. It's a Bullshark implementation with three documented deviations, empirically tested on a modest testbed (4 VMs, 6 MN, 15 LN). The paper §8.5 lists five concrete limitations of the current evaluation, including the cluster scale, the short evaluation duration, and the absence of cycle-commit empirical validation.

It's not in authoritative mode in production either. Lattice runs observation-only by default and stays that way until: (a) the security hardening lands — PoU floor admission gate, equivocation slashing, cross-shard watchdog committee, VRF-based group assignment for the malicious-MN-plus-sybil scenario; (b) the empirical pre-criterion is met over 10⁵ blocks; (c) a proper cluster-scale benchmark (50+ nodes, ≥3 regions) is published. None of those are done.

A non-obvious liveness bug I documented in §6.5: at small group sizes (n=2, where the BFT quorum equals the group size), the cell author has to explicitly attest its own cell — otherwise only the peer's attestation arrives, quorum is never met, and every cell rots in pending. Five-line fix once I figured it out. The bug is structurally invisible in the original Bullshark/Narwhal papers because their illustrative group sizes are always ≥4. If you're implementing Bullshark at small n, this gotcha is yours to inherit.
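The arithmetic behind the gotcha fits in a few lines. A sketch with hypothetical names, using the standard n − ⌊(n−1)/3⌋ BFT quorum (which equals the group size at n = 2):

```rust
/// Standard BFT quorum: n - floor((n - 1) / 3).
/// For n = 2 this is 2 — the whole group.
fn quorum(n: usize) -> usize {
    n - (n - 1) / 3
}

/// A cell certifies only if total attestations (peers, plus the author's
/// own if it explicitly self-attests) reach quorum.
fn can_certify(n: usize, peer_attestations: usize, author_self_attests: bool) -> bool {
    let total = peer_attestations + usize::from(author_self_attests);
    total >= quorum(n)
}
```

At n = 2 with no self-attestation, the single peer attestation tops out at 1 of a required 2 — which is exactly the "every cell rots in pending" symptom.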

What I'd take from this if I were you

Three things, generalisable beyond my project.

Anchor your protocol round to something observers can converge on, even if it introduces a dependence. I added NTP as a requirement. That's a real trade-off, but the mitigation — layered time sources plus in-protocol peer-time consensus — is engineering-tractable. The class of bugs from observer-local round derivation is much harder to mitigate after the fact than NTP attacks are.

Separate admission from weighting in your validator design. Capital is fine for "are you in the game", but it's a poor proxy for "are you reliable right now". The mental model that helped me was: stake answers a yes/no question; reputation answers a continuous one. They shouldn't be the same number.

Ship the new thing behind a runtime gate, not behind a fork. Even on the wrong side of the gate, the new runtime gets exposed to real production traffic, real adversarial conditions, real edge cases. You discover things observation-only that you'd otherwise discover the hard way after cutover.

Where to find the code

Apache 2.0 at https://github.com/Savitri-Network/savitri-network. The Lattice modules are in savitri-consensus/src/lattice/ with inline citations to the original Bullshark / Narwhal / Algorand papers.

A 44-page preprint covering everything above plus the honest §9 limitations section is in docs/ of the testnet repo (both English and Italian). It's v0.1, single-author for now (acknowledgements section is an open call for co-authors).

The TimeOracle work for NTP-resilient validators on residential / mobile / IoT hardware is broken into 7 GitHub issues under the Phase 2.6 milestone. The first sub-task is a 3-day scaffolding task, no upstream dependencies. If you've shipped consensus in production and want to take a swing at it, that's the easiest entry point. I'd love the help.

I'll be in the comments if there are specific things you want to dig into. Particularly interested in feedback on the wall-clock-bucket substitution from anyone who's shipped DAG-BFT in production with non-uniform validator latency, and on the multi-attribute reputation weighting from anyone who knows of prior production deployments I might have missed.

Thanks for reading.
