DEV Community

Patrick Hughes
Patrick Hughes

Posted on • Originally published at bmdpat.com

A self-healing system can't heal an empty queue

A self-healing system can't heal an empty queue

My blog pipeline went red two mornings in a row. The self-healing step ran both times and fixed nothing. The bug was not in the healer. The bug was that I asked it to heal the wrong kind of failure.

Short version: automated recovery only works when the failure is "the machine broke." When the real failure is "the machine has no work," retrying does nothing, because there is nothing to retry against. A monitoring check that flips red has to tell those two cases apart, because the recovery for each is the opposite of the other.

What actually broke

I run a small fleet of scheduled tasks that publish one blog post a day. The chain is simple. A draft lands in a queue. A reviewer checks it at 08:30. A publisher ships the approved one at 09:30. A separate check runs after that and asks one question: did a post go live today?

If the answer is no, the check flips red and calls a healer. The healer's job is to get a post live. For weeks it worked. Then two mornings in a row it ran, reported failure, and left the dashboard red.

I went looking for the bug in the healer. There wasn't one. The healer did exactly what it was told. It looked for an approved draft to publish, found none, and correctly reported that it could not publish nothing.

The queue was empty. No draft had been written. The publish step had nothing to ship, and no amount of retrying an empty queue produces a post.

Two failures that look identical on a dashboard

Here is the trap. On the dashboard, both of these render as the same red box: "no post went live today."

  • The machine broke. A draft existed, but the publish step crashed, hit a bad credential, or got a 500 from the API. Recovery: retry. Fix the bug, run it again, the post ships.
  • The machine had no work. No draft existed. Recovery: retrying does nothing forever. You have to manufacture the missing input first.

These are opposite problems. The first says the input was fine and the action failed. The second says the action was fine and the input was missing. A healer built for the first is useless against the second, and most self-healing automation only handles the first. It assumes the work exists and the step failed. When the work itself is missing, it spins.

How I split the healer

The fix was to stop treating "no post today" as one failure. It is two.

When the check goes red, it now asks a second question before doing anything: is there content waiting? If a draft exists and the publish failed, that is a broken-machine problem, and retry is the right move. If no draft exists, retry is pointless. The recovery for an empty queue is not to republish. It is to generate the missing draft.

So the empty-queue healer is a content generator, not a republisher. It writes a fresh post, runs it through the same quality gate every other post clears, and only then publishes. This post is the output of exactly that path.

Doesn't auto-generating content just make filler?

That is the real risk, and it is worth naming. The moment your recovery for "no work" is "manufacture work," you have built a machine that can publish garbage to make a red light turn green.

The guard is the quality gate. The generator can write a draft, but it cannot publish one that has not passed the same independent review a queued draft passes. If the generated draft fails review, the pipeline stays red and asks me for a decision. Red is honest. A green light earned by shipping filler is worse than an honest red.

That is the line. A healer may manufacture the missing input, but it may never lower the bar that input has to clear. Generate freely, publish only what passes.

Why this matters past blogging

Any check that flips red on a missing outcome has this fork in it. Did the nightly report not run because the job crashed, or because there was nothing to report? Did the deploy not happen because it failed, or because there was no new commit? Retry fixes the first. It is a no-op on the second.

Before you wire up automated recovery, write down which failure you are recovering from. If you only handle "the step crashed," your healer will sit there retrying an empty queue and calling it an outage.

Where the cost angle comes in

A healer that retries the wrong thing is annoying when the step is a publish call. It gets expensive when the step is a paid model call. A retry loop pointed at a failing API spends real money on every attempt, and an agent that retries on a schedule can do that all night while you sleep.

That is what I built AgentGuard for: a budget, token, and rate limiter you wrap around an agent so a runaway or retry loop stops at a hard ceiling instead of grinding until the invoice teaches you. Free to install: pip install agentguard47.

When your monitor goes red tomorrow, ask the second question before you retry: is this broken, or is it empty? The answer decides whether retrying helps or just burns time and money.

Top comments (0)