When Automatic Failover Is More Dangerous Than No Failover

#iot #distributedsystems #architecture #opensource

Here's a counterintuitive thing I ran into building redundancy for DoSync, an open protocol that lets AI agents act on physical devices: the obvious failover design can hurt you more than having no failover at all.
Let me show you why.

The setup

DoSync runs a hub — the process that turns a semantic intent ("there's an emergency") into coordinated device actions, and writes every action to an audit log. That log is a SHA-256 hash chain: each entry includes the hash of the previous one, so any edit to history breaks the chain and is detectable. It's the part of the system that lets you answer "what happened, and when?" with confidence.
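To make the hash-chain idea concrete, here is a minimal sketch in Python. It is not DoSync's actual log format: the entry layout, the all-zero genesis hash, and the names `entry_hash`, `append`, and `verify` are assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Add an entry that commits to everything written before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "prev": prev_hash,
        "payload": payload,
        "hash": entry_hash(prev_hash, payload),
    })


def verify(chain: list) -> bool:
    """Recompute every hash from the start; any edit to history shows up."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"device": "siren", "action": "on"})
append(log, {"device": "front_door", "action": "unlock"})
intact = verify(log)            # chain checks out as written

log[0]["payload"]["action"] = "off"  # quietly rewrite history
tampered_ok = verify(log)       # the recomputed hash no longer matches
```

The property that matters for the rest of this post is that every entry depends on exactly one predecessor. That works only as long as there is a single writer.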
One hub is a single point of failure. So the natural move is to add a standby that takes over when the primary dies. The naive design writes itself: the standby pings the primary every few seconds, and after N missed heartbeats it promotes itself to primary.
It works perfectly in a demo. It's dangerous in a house.
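For reference, the naive design fits in a few lines. This is a sketch of the pattern, not code from DoSync: `NaiveStandby`, `ping_primary`, and the value of `MAX_MISSED` are hypothetical, and the ping is injected as a function so the example runs without a network.

```python
from typing import Callable

MAX_MISSED = 3  # assumed value for N; the right number depends on your network


class NaiveStandby:
    def __init__(self, ping_primary: Callable[[], bool]):
        self.ping_primary = ping_primary  # returns True if the primary answered
        self.missed = 0
        self.role = "standby"

    def tick(self) -> None:
        """Run once per heartbeat interval."""
        if self.role == "primary":
            return
        if self.ping_primary():
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= MAX_MISSED:
            # No check of *why* the ping failed: just take over.
            self.role = "primary"


# A ping that always fails looks the same whether the primary crashed
# or the link between the two hubs went away.
standby = NaiveStandby(ping_primary=lambda: False)
for _ in range(MAX_MISSED):
    standby.tick()
```

After three failed pings the standby is a primary, and nothing in the code asked whether the old primary is still running.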

The trap: split-brain

Picture two hubs on your LAN. The network partitions — not the primary crashing, just the link between the two going away. The primary is alive and well, still serving its devices. But the standby can't see it.
So the standby promotes itself. Now you have two primaries, both convinced they're in charge, both writing to the audit log. The hash chain that made that log tamper-evident diverges into two incompatible histories. The one guarantee the whole system is built on — you can always reconstruct what happened — is gone.
A missed heartbeat doesn't tell you "the primary is dead." It tells you "I can't reach the primary." Those are very different statements, and the naive design treats them as the same.
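Here is roughly what that divergence looks like at the hash level. It is a toy sketch: `next_hash` and the action strings are made up for illustration and stand in for whatever the real log entries contain.

```python
import hashlib


def next_hash(prev_hash: str, action: str) -> str:
    """Toy chain step: each hash commits to the previous hash plus the new action."""
    return hashlib.sha256((prev_hash + action).encode("utf-8")).hexdigest()


# The last entry both hubs agree on before the partition.
shared_head = next_hash("0" * 64, "siren:on")

# During the partition, each "primary" appends its own entry to that same head.
head_on_primary = next_hash(shared_head, "front_door:unlock")
head_on_standby = next_hash(shared_head, "hall_light:on")
diverged = head_on_primary != head_on_standby

# Anything the standby writes next chains to the standby's head,
# so it can never be appended to the primary's history as-is.
standby_next = next_hash(head_on_standby, "siren:off")
extends_primary = standby_next == next_hash(head_on_primary, "siren:off")
```

Both branches are internally valid chains, which is the problem: neither one can be proven to be the "real" history from the hashes alone.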

The fix is mostly about honesty

The real solution to split-brain is a quorum: an odd number of nodes, at least three, where a majority has to agree before anyone acts, à la Raft. But for a home or a small building, that's overkill, and you usually don't have a third node anyway.
So I went with something simpler and more honest: assisted failover. Two changes.
First, before the standby concludes anything, it runs a second probe against an independent target on the LAN (the gateway):

primary unreachable + gateway reachable  → primary is probably down
primary unreachable + gateway also down  → *I* am probably the problem

That one extra check separates two failure modes the naive design conflates. It is not a substitute for quorum, and it doesn't help in every partition: if both hubs can still see the gateway but not each other, the probe succeeds and the standby still reads the primary as "probably down" when it isn't. What it cheaply catches is the most common home case: the standby itself losing its network. When that happens, the standby enters an UNCERTAIN state and refuses to act.
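The two-probe decision is small enough to write as a pure function. This is a sketch of the table above, with hypothetical names (`Verdict`, `classify`); how the probes themselves are performed is left out on purpose.

```python
from enum import Enum


class Verdict(Enum):
    PRIMARY_OK = "primary_ok"
    PRIMARY_PROBABLY_DOWN = "primary_probably_down"
    UNCERTAIN = "uncertain"


def classify(primary_reachable: bool, gateway_reachable: bool) -> Verdict:
    """Combine the heartbeat result with the independent gateway probe."""
    if primary_reachable:
        return Verdict.PRIMARY_OK
    if gateway_reachable:
        # My own network looks fine, so the primary is the likelier culprit.
        return Verdict.PRIMARY_PROBABLY_DOWN
    # I can't reach anything: the problem is probably on my side.
    return Verdict.UNCERTAIN


standby_lost_network = classify(primary_reachable=False, gateway_reachable=False)
primary_went_quiet = classify(primary_reachable=False, gateway_reachable=True)
```

Note that `PRIMARY_PROBABLY_DOWN` is still only a guess. The function narrows down the cause; it does not confirm it.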
Second — even when it does think the primary is down, it doesn't promote itself. It proposes promotion to a human. A person can glance at the situation and see whether the primary is actually alive. A 5-second heartbeat timeout cannot.
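A state machine for "propose, don't promote" might look like the sketch below. It illustrates the idea and is not DoSync's implementation: the state names other than UNCERTAIN, the `notify` callback, and the method names are all assumptions.

```python
from enum import Enum, auto
from typing import Callable


class State(Enum):
    STANDBY = auto()
    UNCERTAIN = auto()
    PROMOTION_PROPOSED = auto()
    PRIMARY = auto()


class AssistedStandby:
    def __init__(self, notify: Callable[[str], None]):
        self.state = State.STANDBY
        self.notify = notify  # hypothetical hook: push message, email, dashboard

    def on_probe(self, primary_reachable: bool, gateway_reachable: bool) -> None:
        """Feed in the result of one heartbeat plus one gateway probe."""
        if self.state is State.PRIMARY:
            return
        if primary_reachable:
            self.state = State.STANDBY
        elif not gateway_reachable:
            self.state = State.UNCERTAIN  # probably my own network: do nothing
        elif self.state is not State.PROMOTION_PROPOSED:
            self.state = State.PROMOTION_PROPOSED
            self.notify("Primary unreachable, gateway reachable. Promote standby?")

    def on_human_decision(self, approve: bool) -> None:
        """The only path to PRIMARY goes through a person."""
        if self.state is not State.PROMOTION_PROPOSED:
            return
        self.state = State.PRIMARY if approve else State.STANDBY


messages = []
hub = AssistedStandby(notify=messages.append)
hub.on_probe(primary_reachable=False, gateway_reachable=True)
state_after_probe = hub.state
hub.on_probe(primary_reachable=False, gateway_reachable=True)  # no duplicate alert
hub.on_human_decision(approve=True)
```

The design choice worth noticing is that no code path reaches `PRIMARY` from a probe result alone, and an approval that arrives while the hub is `UNCERTAIN` is ignored.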
(To be clear, this whole multi-hub layer is opt-in. A single hub runs exactly as before — redundancy is for people who want it, not a requirement.)

Testing it on real hardware

I ran this on two real machines — a Raspberry Pi (primary) and a laptop (standby) — because this is exactly the kind of behavior that only shows up with real network conditions, not in unit tests.
Killing the primary process: standby detected it, gateway probe still succeeded, proposed promotion. Good.
Pulling the standby's network while the primary stayed up: standby saw both targets go dark, went UNCERTAIN, and stayed quiet. In that test, the only difference between "propose promotion" and "stay quiet" was the gateway probe — one bit of extra information deciding between a safe outcome and a corrupted log.

The takeaway

Availability features have a failure mode of their own. An automatic action taken on bad information can be worse than no action plus a clear signal to a human. For anything writing to physical state — locks, alarms, logs you can't afford to corrupt — "fail loudly and ask" is often a better default than "fail over silently," at least until you have real consensus to back automatic promotion.

DoSync is open source (Apache 2.0). If you want to poke holes in this design — and I'd genuinely like that — it's at dosync.dev. The failover logic is a small, dependency-free state machine; criticism welcome.
