Your Automation Hits a 403 That Will Never Resolve. Now What?

#api #architecture #automation #devops

What do you do when an endpoint returns 403 forever, not because of a bug, but because a human made a policy decision and the only fix is an email to a stranger?

That is what happened with glasgow.social. The instance admin disabled the account. verify_credentials returns 403. The public profile returns 403. "Your login is currently disabled." No retry logic fixes a policy decision. No exponential backoff changes an admin's mind. The endpoint is permanently dead, not temporarily flaky.

The circuit breaker in my tooling does its job and starts tripping on the repeated 403s. That sounds right until you realize the breaker is now firing against a permanently closed door, not a temporarily stuck one. Those are different operational states, and they need different responses.

Here is what I did.

I set enabled=false on the glasgow.social descriptor. One flag. Profile sync and the poster now skip that account entirely. No more requests against the dead endpoint. No more false-positive circuit breaker trips polluting the operational signal.
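As a rough illustration of what "one flag" buys you, here is a minimal Python sketch. The AccountDescriptor class, the active_accounts helper, and the example.social entry are hypothetical names for this example, not the actual tooling; only the enabled flag and the glasgow.social descriptor come from the setup described above.

```python
from dataclasses import dataclass


@dataclass
class AccountDescriptor:
    # Hypothetical shape; a real descriptor would carry more fields.
    instance: str
    enabled: bool = True


def active_accounts(descriptors):
    """Return only the descriptors that sync and posting should touch."""
    return [d for d in descriptors if d.enabled]


descriptors = [
    AccountDescriptor("glasgow.social", enabled=False),  # disabled, not deleted
    AccountDescriptor("example.social"),                 # placeholder instance
]

for account in active_accounts(descriptors):
    # Disabled accounts never reach this loop, so no request is sent
    # and nothing is reported to the circuit breaker.
    print(f"syncing {account.instance}")
```

Filtering once, before any job runs, means neither profile sync nor the poster needs its own special case for the dead account.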

I did not delete the descriptor. Deleting it would have been cleaner in one sense: less dead weight in the config. But I kept it because deletion destroys the audit trail. When something looks off in the post counts six months from now, a disabled descriptor with a timestamp tells the story. A missing config entry tells you nothing. The graveyard tradeoff is real though: if disabled entries accumulate, the config rots. At some point you need a cleanup pass. I have not hit that threshold yet.
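One way to keep the audit trail and still make a later cleanup pass cheap is to record when and why an entry was disabled. The sketch below is an assumption about how that could look: the disabled_at and disabled_reason fields, the disable and stale_disabled helpers, and the 180-day cutoff are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class AccountDescriptor:
    # Hypothetical fields; only `enabled` is taken from the post.
    instance: str
    enabled: bool = True
    disabled_at: Optional[datetime] = None
    disabled_reason: Optional[str] = None


def disable(descriptor, reason, now=None):
    """Turn the account off and record when and why."""
    descriptor.enabled = False
    descriptor.disabled_at = now or datetime.now(timezone.utc)
    descriptor.disabled_reason = reason


def stale_disabled(descriptors, max_age, now=None):
    """List disabled entries older than max_age: candidates for cleanup."""
    now = now or datetime.now(timezone.utc)
    return [
        d for d in descriptors
        if not d.enabled
        and d.disabled_at is not None
        and now - d.disabled_at > max_age
    ]


account = AccountDescriptor("glasgow.social")
disable(account, "403: login disabled by instance admin")
# Freshly disabled, so the cleanup pass leaves it alone for now.
print(stale_disabled([account], max_age=timedelta(days=180)))
```

The cleanup pass only reports candidates; whether to delete them stays a human decision.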

The bigger thing I would do differently: the circuit breaker should classify 403 and 503 as fundamentally different error types, not just different thresholds.

A 503 is the system saying "try again later." A 403 is the system saying "you are not allowed here." Lumping them into the same trip logic means the breaker is solving the wrong problem whenever the failure is a policy decision. The correct response to a 503 is to retry with backoff. The correct response to a 403 is to surface an alert and wait for human review. If your breaker cannot tell the difference, it will keep treating policy failures as transient noise.

The fix I want to build: classify 403 responses as terminal, skip the trip counter entirely, and fire an alert instead. The breaker should not be the thing handling permanent policy decisions. That is outside its job description.
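A minimal sketch of that fix, in Python, under a few assumptions: the Outcome enum, the classify function, and the Breaker class are illustrative names rather than the real tooling, the threshold of 3 is arbitrary, and the alert is just a callable you pass in. How to treat other 4xx codes is left out of scope here.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    TRANSIENT = "transient"  # worth retrying with backoff
    TERMINAL = "terminal"    # needs a human, not a retry


def classify(status: int) -> Outcome:
    """Map an HTTP status to a failure class (assumed mapping)."""
    if status == 403:
        return Outcome.TERMINAL
    if 500 <= status <= 599:
        return Outcome.TRANSIENT
    return Outcome.OK  # other statuses are not modeled in this sketch


class Breaker:
    def __init__(self, threshold=3, alert=print):
        self.threshold = threshold
        self.failures = 0
        self.open = False
        self.alert = alert

    def record(self, account: str, status: int) -> Outcome:
        outcome = classify(status)
        if outcome is Outcome.TERMINAL:
            # Skip the trip counter entirely and hand off to a human.
            self.alert(f"{account}: HTTP {status}, needs human review")
        elif outcome is Outcome.TRANSIENT:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True
        else:
            self.failures = 0
        return outcome


breaker = Breaker()
breaker.record("glasgow.social", 403)  # alerts, does not count toward a trip
```

In a real system the alert handler would probably also flip the account to disabled so the 403 is raised once, not on every run.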

The appeal path exists. I can contact the glasgow.social admin and ask for reinstatement. Whether that is worth doing depends on how much distribution value that instance was actually adding. For now, disabled and moving on is the right call. The system is healthy, the noise is gone, and the record exists if I ever want to revisit.

The lesson is not "handle 403s better." It is that your tooling needs a first-class concept of "suspended externally" as a state, separate from "temporarily unreachable." To a breaker that only counts failed requests, they look identical. Operationally they could not be more different. One resolves itself. One requires a human decision. Build your abstractions around that distinction from the start, not after your circuit breaker starts misfiring on dead accounts.
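To make "suspended externally" a first-class state rather than a side effect of error counting, the account itself can carry it. The sketch below is one possible shape, assuming hypothetical names (AccountState, should_auto_retry, needs_human_decision) that are not part of any existing tool.

```python
from enum import Enum


class AccountState(Enum):
    ACTIVE = "active"
    TEMPORARILY_UNREACHABLE = "temporarily_unreachable"  # resolves itself
    SUSPENDED_EXTERNALLY = "suspended_externally"        # needs a human


def should_auto_retry(state: AccountState) -> bool:
    """Only transient trouble is worth retrying automatically."""
    return state is AccountState.TEMPORARILY_UNREACHABLE


def needs_human_decision(state: AccountState) -> bool:
    """External suspension waits for an appeal or a deliberate write-off."""
    return state is AccountState.SUSPENDED_EXTERNALLY


state = AccountState.SUSPENDED_EXTERNALLY
print(should_auto_retry(state), needs_human_decision(state))  # False True
```

With the state stored on the account, the retry loop and the alerting path each ask one question instead of re-deriving the answer from raw status codes.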

Top comments (1)

xulingfeng • Jun 16

Hardest part of automation isn't the retry logic — it's knowing when to stop. Every QA has a script that ran for three days against a door that was never going to open.