Our service discovery caught its own failure and switched itself off

#auth #distributedsystems #clustering #leaderelection

We had a three-replica cluster that kept disagreeing with itself. Background jobs ran two and three times over. The answer wasn't in the logs; it was in our own code. The peer-discovery routine had a catch block, and the comment inside it said, more or less, "multicast failed, discovery disabled." Our service discovery was catching its own failure and quietly turning itself off, and it had been doing exactly that in production from day one.

This is the story of why a clustering protocol that's correct on a server in a rack is the wrong tool the moment you put it on a managed platform, and why the fix was to delete it rather than repair it.

What the cluster is actually for

An auth server runs as more than one replica for availability. The replicas are mostly independent: any of them can validate a token or check a password. But a few jobs must run once, not once per replica. The data-retention sweep that deletes expired records. The pass that fires customer webhooks. Anything that reaches out and has a side effect. For those you need two things the cluster has to agree on: who the members are, and which one of them is the leader that runs the singleton work.

Our original answer was the classic one: gossip. Each node chatters with its peers, membership is an emergent property of who's reachable, and the group elects a leader from the agreed set. It's a beautiful design. It also assumes the nodes can find each other, and the way the old code found them was multicast: shout on the local network segment and see who answers.

Why it broke, silently

Managed Kubernetes networks generally do not carry multicast, and neither do most cloud virtual networks unless you build that support yourself. The shout went out and nothing came back, every time. So discovery "failed," the catch block fired, and each node concluded it was alone in the world.

At one replica that's harmless. At three it's split-brain by construction: three nodes, each certain it's the only member, each electing itself leader, each running the singleton work. The retention sweep ran on all three. The webhook pass fired every customer's webhook three times. None of it threw an error, because from each lonely node's point of view everything was fine. The bug wasn't a crash; it was three programs each behaving perfectly correctly in a world they had wrong.
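
To make the failure concrete, here is a minimal sketch of the shape of the bug, not our actual code. The function names, the DiscoveryError type, and the "lowest ID wins" election rule are all assumptions made up for illustration; the only part taken from the real system is the catch block that swallows the failure and carries on alone.

```python
from typing import List, Set


class DiscoveryError(Exception):
    """Raised when the multicast probe gets no usable answer (hypothetical)."""


def multicast_probe(self_id: str, all_nodes: List[str], network_carries_multicast: bool) -> List[str]:
    # Stand-in for the real probe: on a network without multicast,
    # the shout goes out and nothing comes back.
    if not network_carries_multicast:
        raise DiscoveryError("no replies to multicast probe")
    return [n for n in all_nodes if n != self_id]


def discover_members(self_id: str, all_nodes: List[str], network_carries_multicast: bool) -> List[str]:
    try:
        peers = multicast_probe(self_id, all_nodes, network_carries_multicast)
    except DiscoveryError:
        # "multicast failed, discovery disabled": the node silently
        # falls back to a cluster of one.
        peers = []
    return sorted([self_id, *peers])


def elect_leader(members: List[str]) -> str:
    # Assumed election rule for this sketch: lowest ID wins.
    return min(members)


def self_elected_leaders(all_nodes: List[str], network_carries_multicast: bool) -> Set[str]:
    """Return every node that believes it is the leader."""
    leaders = set()
    for node in all_nodes:
        members = discover_members(node, all_nodes, network_carries_multicast)
        if elect_leader(members) == node:
            leaders.add(node)
    return leaders


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]
    print(sorted(self_elected_leaders(nodes, True)))   # ['node-a']
    print(sorted(self_elected_leaders(nodes, False)))  # ['node-a', 'node-b', 'node-c']
```

With multicast available, one node wins. Without it, every node wins its own private election, and nothing in the code path raises.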

The fix was to stop discovering peers at all

Here's the reframe that made the whole thing collapse to a few lines: we were trying to rebuild, with a chatty protocol, a fact the platform can already store for us with strong consistency. We run on a cloud that hands us a strongly consistent store. Leader election doesn't need consensus among peers if there's already one place everyone can agree on.

So leadership became a blob lease. One blob, one lease. Whoever holds the lease is the leader. The lease has a timeout, so a leader that dies stops renewing and the lease frees itself for the next taker. There is no peer discovery, no quorum math, no multicast, and no catch block waiting to disable it. Coordination and events ride a small table that acts as a bus between the nodes.
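
The lease logic is small enough to sketch. What follows is an in-memory stand-in, not a storage SDK call: LeaseStore, try_acquire, and the 15-second duration are names and numbers invented for this example, and the clock is passed in so the behavior is easy to follow. In a real blob store the acquire-or-renew step is performed atomically by the storage service; this sketch only models the rule it enforces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseStore:
    """In-memory stand-in for one leasable blob (hypothetical, for illustration)."""

    def __init__(self, duration: float) -> None:
        self.duration = duration
        self.lease: Optional[Lease] = None

    def try_acquire(self, node_id: str, now: float) -> bool:
        """Take the lease if it is free or expired, or renew it if we already hold it."""
        current = self.lease
        held_by_other = (
            current is not None
            and current.expires_at > now
            and current.holder != node_id
        )
        if held_by_other:
            return False
        self.lease = Lease(holder=node_id, expires_at=now + self.duration)
        return True

    def leader(self, now: float) -> Optional[str]:
        """Return the current lease holder, or None if the lease is free or expired."""
        current = self.lease
        if current is None or current.expires_at <= now:
            return None
        return current.holder


def should_run_singleton_work(store: LeaseStore, node_id: str, now: float) -> bool:
    # A node runs the retention sweep or webhook pass only while it holds the lease.
    return store.try_acquire(node_id, now)
```

Each replica calls should_run_singleton_work on a timer shorter than the lease duration. The holder keeps renewing; everyone else gets False. If the holder dies, its lease expires and the next caller takes over, with no node ever needing to know who its peers are.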

The property I didn't expect to love: you can open Storage Explorer and see the cluster's mind. Who holds the lease right now. What's queued on the bus. Membership stopped being something you infer from a protocol's behavior and became a row you can read. When the thing that decides who runs your destructive jobs is a value you can look at, debugging stops being archaeology.

The part where we deleted a clever thing on purpose

While we were in there, we deleted a distributed rate-limiter built on a conflict-free replicated data type, a CRDT that merged per-node counters into a global count without coordination. It was genuinely elegant. It was also solving a problem we'd moved. The global rate limit belongs at the edge, where requests arrive, not reconstructed inside the cluster with a data structure that needs a paragraph to explain. Out it came.
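
For readers who haven't met one, here is a minimal sketch of the simplest counter CRDT, a grow-only counter. It is an illustration of the idea of merging per-node counters, not the limiter we deleted; the GCounter class and its method names are made up for this example.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: each node increments its own slot, merge takes per-node maxima."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        # A node only ever writes to its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Per-node max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repeats.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        # The global count is the sum over all nodes' slots.
        return sum(self.counts.values())
```

Two replicas that each count their own requests and then merge in either order end up with the same total. That convergence is the elegant part. It is also machinery we no longer need once the limit is enforced at the edge.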

Deleting working code feels like a loss until you count what you stop maintaining. We removed the gossip layer, the multicast assumption, the self-disabling catch block, and the CRDT, and replaced all of it with a lease and a table. The line count went down and the number of states the system can be in went down with it.

The lesson, factored out

Gossip and multicast membership are the right tools on bare metal or a flat network you control. They are the wrong tools on a platform whose network won't carry the very mechanism they depend on, and the failure mode isn't a loud crash; it's a quiet fallback that looks like it's working. Before you port a distributed-systems pattern onto managed infrastructure, ask what it assumes about the network, and ask what the platform already gives you for free. Ours already gave us a strongly consistent store. A lease in that store was a simpler, more correct leader election than the protocol we'd been carrying, and it's one we can watch.

If you're running auth, the parts that hold your keys and decide who runs your destructive jobs should be the most boring and the most inspectable things you own. We made ours boring. It was an upgrade.

And if you would rather not run a clustering protocol at all, that is the pitch: Authagonal keeps this boring so you never have to debug your own split-brain.