Idempotency and reconciliation, learned the hard way

#go #distributedsystems #backend #architecture

Two bugs, both mine, with symptoms that looked unrelated and the same cause underneath, code that only works when everything goes as planned.

The machine that shouldn't exist

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!    @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
The fingerprint for the ED25519 key sent by the remote host is
SHA256:8pLZ0r3nKqVxe2Tj9wMb1c...
Host key verification failed.

Anyone who works over SSH a lot knows this warning, and most of the time it's harmless. The server was reinstalled, the key changed, you clear a line out of known_hosts and move on. This time it wasn't so easy to resolve.

The message showed up on a system that had been neither reinstalled nor changed in any way. The SSH login a few minutes earlier had worked fine. So the warning had to be taken seriously, because the fingerprint for that address really didn't match the machine that was supposed to be there. You connect over the IP address, and it points unambiguously to exactly one VM. So my first reaction, against better judgment, was that this simply couldn't be.

But that's exactly what it was. A little digging turned up a second VM answering on the same address. Two different servers were responding under one IP, sometimes one, sometimes the other. That switch is exactly what the SSH warning above reports.

Digging in, the second VM turned out to be an artifact that was never cleaned up. In theory it should have been gone weeks ago. I could trace that the VM had been right in the middle of being deleted when, weeks earlier, I ran a deployment and had to kill a few containers abruptly. That shouldn't normally happen, but it does, and not only through human action. Systems fail, networks drop out. A deletion job should never be left in a corrupt state, whatever the cause.

Deleting a VM consists of several sub-steps that span quite different systems, among them the database, billing, DNS, and the hypervisor. Simplified, the flow looks roughly like this.

func (s *Service) DeleteVM(ctx context.Context, vmID string) error {
    vm, err := s.db.GetVM(ctx, vmID)
    if err != nil {
        return err
    }

    if err := s.ipam.Release(ctx, vm.IP); err != nil {
        return err
    }

    if err := s.db.DeleteVM(ctx, vm.ID); err != nil {
        return err
    }

    return s.hypervisor.Destroy(ctx, vm.HostID, vm.InstanceID)
}

You could say everything about it is wrong. First the shared resource, the IP, is released while the machine it's assigned to is still running. Then the system deletes the database entry that records which machine on which hypervisor belongs to this IP. If a process dies at exactly this moment, after the database delete and before the delete on the hypervisor, an orphaned VM is left behind.

The delete action does have a retry mechanism, but it gets stuck too. Asked about the machine, the system can only answer "doesn't exist," because the entry that linked database and hypervisor is already gone.

The clone that got in its own way

The second story begins in the logs, where the same line kept stacking up. "snapshot 'clone-…' already exists." A VM needed to be cloned several times, just several copies of one source VM, nothing exotic.

Technically it works like this. For each clone, a temporary snapshot of the source VM is created, the copy is made from it, and finally the snapshot is deleted. The snapshot's name is derived deterministically from the job ID. Simplified.

func (s *Service) cloneFromSource(ctx context.Context, a Action) error {
    snapName := "clone-" + a.ID

    if err := s.hypervisor.CreateSnapshot(ctx, a.SourceID, snapName); err != nil {
        return err
    }
    defer func() {
        if err := s.hypervisor.DeleteSnapshot(ctx, a.SourceID, snapName); err != nil {
            s.log.Warn("snapshot cleanup failed", "name", snapName, "err", err)
        }
    }()

    if err := s.hypervisor.Clone(ctx, a.SourceID, snapName, a.TargetID); err != nil {
        return err
    }
    return s.provision(ctx, a.TargetID)
}

At first glance that looks reasonable. The defer runs in almost every error case, but only almost. And it is the only place that clears the snapshot away again, so that cleanup can fail in two ways.

The defer doesn't run at all.
It runs, but the DeleteSnapshot call inside it fails.

And both are common in practice. Off the top of my head, three scenarios lead into one branch or the other.

If several clone operations start against the same source VM, the hypervisor refuses. It allows only one operation at a time and locks the source VM while a clone is running, so the jobs pile up behind that lock.
If a worker that is cloning a VM exits "ungracefully," the cloning is cut off mid-process. A defer then no longer runs.
A transient error keeps the cleanup call from getting through. That sort of thing just happens.

Which scenario struck first here, I can't say for sure. The result is the same every time. A snapshot is left behind on the source.

And from that first leftover on, every new attempt of the same job derived the same name.

Attempt 1: create snapshot ✓ → clone ✗ (source locked) → cleanup ✗ (source locked)
Attempt 2: create snapshot ✗ "already exists"
Attempt 3: create snapshot ✗ "already exists"
…

At some point the system gave up. The snapshot stayed put on the source.

Why distributed systems are so hard to get right

Deletion thought through too naively, and cloning that blocks itself. Both flows consist of several steps across system boundaries, and both were built only for the case where every sub-step succeeds reliably. Dissect the code like this and the problems behind it seem almost obvious. So how does it happen anyway?

A good part of the reason is in us. We think in sequences, first this step, then the next, all on a single timeline. On a single machine that picture works quite well. If a program crashes, the operating system reclaims the memory and the database rolls back its open transaction, and afterwards everything is consistent again, as if nothing had happened.

A distributed system doesn't give you that safety. A command like "delete this VM" touches several systems at once, the database, the IP pool, and the hypervisor. No transaction ties them together. If the command breaks off after the second of three steps, the first step stays in effect and the third never happens. Across system boundaries there is also no rewind button that restores the consistent state from before. And even if you rebuild the rollback by hand, it is subject to the same limits as the original command. No one guarantees it runs cleanly. That is not intuitive to us at first.

This partial failure stays invisible until it hits. Jim Waldo and his colleagues called this the real difference back in 1994, in "A Note on Distributed Computing". One component fails, the rest keeps running, and no one knows reliably how far the failed part got. The path where everything succeeds is simply the easiest to picture, which is why the half-finished states in between tend to slip through. But every step that can fail on its own multiplies the number of possible states, and that space grows faster than anyone can hold in their head. The first of the famous fallacies of distributed computing already says it, "the network is reliable," and it isn't. And that applies just as much to the call that's supposed to clean up at the end.

There's a handful of proven patterns against all this. None of them is new, and three are enough for our two stories.

Idempotency. An operation is idempotent if its second, third, nth call does nothing more than the first. That makes a retry harmless, even when the previous attempt broke off mid-work. The next run just copes with whatever the last attempt left behind. Amazon made this a house rule in "Making retries safe with idempotent APIs".

Reconciliation. Instead of relying on every single event arriving and being processed, a loop regularly compares the desired state with the actual state and corrects the difference, no matter how it came about. Kubernetes, for example, is built around exactly this principle. The technical term is level-triggered rather than edge-triggered. In plain terms, you work from the current state itself instead of trying to catch every individual signal.

Coordination. Where several actors touch the same resource, one place has to decide who may act and when. The obvious solution is a dedicated lock, but it brings a whole class of new bugs with it. In "How to do distributed locking", Martin Kleppmann took apart what makes such locks fail, from paused processes that keep writing after their lock has expired to missing fencing tokens and clocks you can't trust. Often it is more robust to avoid a second system altogether and coordinate through the database you already trust. It orders competing writes reliably, so its own state can take over the role of the lock.

The machine first, then the address

The fix turns the order around and makes every step idempotent.

func (s *Service) DeleteVM(ctx context.Context, vmID string) error {
    vm, err := s.db.GetVM(ctx, vmID)
    if errors.Is(err, ErrNotFound) {
        return nil
    }
    if err != nil {
        return err
    }

    if err := s.hypervisor.Destroy(ctx, vm.HostID, vm.InstanceID); err != nil && !errors.Is(err, hypervisor.ErrNotFound) {
        return err
    }

    if err := s.ipam.Release(ctx, vm.IP); err != nil && !errors.Is(err, ipam.ErrAlreadyFree) {
        return err
    }

    return s.db.DeleteVM(ctx, vm.ID)
}

Now the system only releases the IP once the machine it was assigned to is really gone. If the process breaks off somewhere, the entry stays, and with it the information about what's still to be done. A retry runs through cleanly, because what's already done counts as success, not as an error.

That still leaves one problem. How many such artifacts exist already, and how do you catch the ones the fixed code still misses? The answer is reconciliation. I built a check that compares the desired state in the database against the actual state on the hypervisors and reports every discrepancy. Some get fixed right away, some need a look by hand. That way orphaned machines surface reliably instead of running on unnoticed.

Delete before create

The first bug in the cloning was that creation wasn't idempotent. That's exactly what changed. If a temporary snapshot is left over from an earlier attempt, every new attempt now cleans it up before creating its own.

func (s *Service) prepareSnapshot(ctx context.Context, a Action) error {
    snapName := "clone-" + a.ID

    existing, err := s.hypervisor.GetSnapshots(ctx, a.SourceID)
    if err != nil {
        return err
    }
    for _, snap := range existing {
        if !strings.HasPrefix(snap.Name, "clone-") {
            continue
        }
        if snap.Name == snapName {
            if err := s.hypervisor.DeleteSnapshot(ctx, a.SourceID, snap.Name); err != nil {
                return err
            }
            continue
        }
        s.cleanupOrphan(ctx, a.SourceID, snap.Name)
    }

    return s.hypervisor.CreateSnapshot(ctx, a.SourceID, snapName)
}

The second bug was that the hypervisor's lock mechanism was ignored. You can't clone from the same source at the same time, so the system has to order the jobs itself and work through them one after another, without any extra locking. The order comes from the job table itself. Per source, the worker that atomically claims the oldest active clone is the one that runs, and the rest wait until it's done. The job's status doubles as its lock.

What remains

None of these problems needs millions of jobs in the queue. The challenges of distributed systems hit early, often with the first handful of workers.

No single fix covers all of this. The patterns help, but they don't guarantee complete safety, distributed systems are too complex for that. What's left is discipline. With every multi-step operation, assume the worst case. The process dies right after the step where it can't afford to stop. Two jobs hit the same resource at once. The cleanup call never arrives.

I recently started writing about building and running infrastructure. If that is your kind of thing, follow me here on DEV or subscribe to new posts at marvinstrauch.de/newsletter.