April 30, 2026. 14:32 UTC. That's when DeepSeek killed one of my keys.
I didn't find out until almost two hours later. By then my agents had pounded a corpse 800,000 times.
Let me back up. I had a laptop, an email account, and social media when I started this thing. I learned what a 401 status code meant a few weeks ago, which is relevant because today I learned what it means when your code sees 800,000 of them in a row and shrugs.
The key rotation wasn't even malicious. DeepSeek does what DeepSeek does. Sometimes a key gets cycled, sometimes a billing flag trips, sometimes the planets align and you get the 401 of doom. The point is: the endpoint stopped answering with anything useful at 14:32 UTC. And my agents, bless their dumb little hearts, kept asking.
Here is the part that made me actually slam my desk.
The systemd journal filled up. Filled. Up. Because every single retry logged a full stack trace, including the request body and a chunk of the response headers. The disk-pressure alert that should have caught it never fired, because I had silenced that alarm two weeks ago after a 3am false positive that I was too tired to debug properly. Past me: current me hates you.
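One way to keep a retry storm from writing the same line 800,000 times is to drop repeats at the logging layer. What follows is a minimal sketch, not what was running on the box: `RepeatSuppressor`, its `interval` parameter, and the injectable `clock` are all hypothetical names. It keys on the message template rather than the formatted text, so "auth failed for %s" counts as one message no matter which key is in the arguments.

```python
import logging
import time


class RepeatSuppressor(logging.Filter):
    """Let each distinct message template through at most once per interval.

    Hypothetical helper: the name, `interval`, and `clock` are assumptions
    made for this sketch.
    """

    def __init__(self, interval=60.0, clock=time.monotonic):
        super().__init__()
        self.interval = interval
        self.clock = clock
        self.last_seen = {}  # (logger name, level, template) -> last emit time

    def filter(self, record):
        # record.msg is the unformatted template, so differing args still match
        key = (record.name, record.levelno, str(record.msg))
        now = self.clock()
        last = self.last_seen.get(key)
        if last is not None and now - last < self.interval:
            return False  # seen recently: drop this record
        self.last_seen[key] = now
        return True
```

Attaching it would look like `logging.getLogger("agent").addFilter(RepeatSuppressor())`, where the logger name is a placeholder. A filter on a logger only sees records logged on that exact logger, so attaching it to the handler instead covers everything that handler emits.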
I noticed at 16:11 UTC. Not because of metrics. Not because of alerts. Because one of my Telegram pings to myself said an agent had failed its hourly heartbeat. I opened the dashboard, saw nothing on the charts (because the charts were fed from the same journal, which had no disk left to write to), and went to the box directly.
Disk at 100%. Auth failures everywhere. Roughly 800k requests on a dead endpoint.
I rotated the key. I gzipped logs. I cried for about forty seconds. Then I sat down and wrote the thing I should have written eight weeks ago.
A circuit breaker. Stupid simple. Python. Keyed on consecutive 401s in a 30-second window. Five in a row and the breaker trips. Trip means: stop calling that endpoint for that key, page me, and fall through to the next provider in the pool.
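```python
import time
from collections import deque


class AuthBreaker:
    """Trips after `threshold` consecutive 401s inside `window` seconds."""

    def __init__(self, threshold=5, window=30):
        self.hits = deque()  # timestamps of recent 401s
        self.threshold = threshold
        self.window = window
        self.open = False  # once True, stays True until reset by hand

    def record_401(self):
        now = time.monotonic()  # monotonic: immune to wall-clock jumps
        self.hits.append(now)
        # Drop 401s that have aged out of the window
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()
        if len(self.hits) >= self.threshold:
            self.open = True
        return self.open

    def record_success(self):
        # A good response breaks the streak, so "in a row" really means in a row
        self.hits.clear()
```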
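The "trip means" part (stop calling, page, fall through) lives outside the breaker itself. Here is a rough sketch of how that wiring could look, under loud assumptions: `call_with_fallback`, `send`, and `page` are hypothetical names, `send(name)` is assumed to return an HTTP status code, and the breakers only need an `open` flag and a `record_401()` method like the class above.

```python
from typing import Callable, Dict, Iterable, Optional, Protocol


class Breaker(Protocol):
    """Anything shaped like AuthBreaker: an `open` flag plus record_401()."""

    open: bool

    def record_401(self) -> bool: ...


def call_with_fallback(
    providers: Iterable[str],
    breakers: Dict[str, Breaker],
    send: Callable[[str], int],
    page: Callable[[str], None],
) -> Optional[str]:
    """Try providers in order, skipping any whose breaker is open.

    `send` and `page` are hypothetical hooks: one performs the request and
    returns the status code, the other raises an alert. Returns the name of
    the provider that answered, or None if nobody did.
    """
    for name in providers:
        breaker = breakers[name]
        if breaker.open:
            continue  # tripped earlier: do not touch this endpoint at all
        status = send(name)
        if status == 401:
            # Count the failure; page only on the call that trips the breaker
            if breaker.record_401():
                page(f"auth breaker tripped for {name}")
            continue  # fall through to the next provider
        return name
    return None
```

Because an open breaker is skipped before `send` is called, the page fires once per trip instead of once per request, and the dead endpoint stops getting traffic entirely.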
Not pretty. Senior devs will spot ten amateur moves. No jitter, no half-open state, no proper backoff curve. I'll add those when I'm not bleeding from the ears. For now it does the one thing I needed at 14:32 UTC and didn't have: notice that the same exception is coming back forever and stop the bus.
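For the curious, the missing half-open state could look roughly like this. It is a sketch under assumptions, not the version that shipped: `HalfOpenAuthBreaker`, the `cooldown` parameter, and the injectable `clock` are hypothetical additions. After a trip it blocks everything for `cooldown` seconds, then lets a probe through; a failed probe restarts the cooldown and a success closes the breaker.

```python
import time
from collections import deque


class HalfOpenAuthBreaker:
    """Hypothetical variant: trips like the simple breaker, then probes."""

    def __init__(self, threshold=5, window=30.0, cooldown=300.0,
                 clock=time.monotonic):
        self.hits = deque()  # timestamps of recent 401s
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown  # assumed value; tune to taste
        self.clock = clock
        self.opened_at = None  # None means the breaker is closed

    def allow_request(self):
        """Closed: always allow. Open: allow probes once the cooldown passes."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown

    def record_401(self):
        now = self.clock()
        if self.opened_at is not None:
            self.opened_at = now  # failed probe: restart the cooldown
            return True
        self.hits.append(now)
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()
        if len(self.hits) >= self.threshold:
            self.opened_at = now
        return self.opened_at is not None

    def record_success(self):
        """Any success closes the breaker and clears the failure history."""
        self.hits.clear()
        self.opened_at = None
```

This is single-threaded thinking: once the cooldown has passed, every caller is allowed through until one of them records a result, so concurrent agents would need a lock or a "probe in flight" flag on top.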
I'm still learning the difference between an error you retry and an error you respect. 5xx, you retry. Timeouts, you retry. 401? 401 is the API telling you it doesn't know who you are, and asking it the same question 800,000 times is not going to change its mind.
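That retry-versus-respect split fits in one small function. A minimal sketch, with assumptions flagged: `Action` and `classify` are hypothetical names, and lumping every other non-2xx, non-5xx status in with 401 is a conservative default chosen for the sketch, not a rule from any spec (plenty of people would retry a 429, for example).

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    OK = "ok"
    RETRY = "retry"      # transient: back off and try again
    RESPECT = "respect"  # the server means it: stop and escalate


def classify(status: Optional[int] = None, timed_out: bool = False) -> Action:
    """Decide what to do with a response. Hypothetical helper for illustration."""
    if timed_out:
        return Action.RETRY
    if status is None:
        raise ValueError("need a status code unless the call timed out")
    if 200 <= status < 300:
        return Action.OK
    if 500 <= status < 600:
        return Action.RETRY
    # 401 lands here, along with (by assumption) every other status
    return Action.RESPECT
```

The useful property is that a retry loop asks `classify` before sleeping and trying again, so a 401 never enters the loop in the first place.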
The other lesson, the one I'm sitting with tonight: never silence an alert because it woke you up. Fix the alert. Or write a better one. Or, fine, snooze it for a day, but put it on the calendar to come back. Silencing it forever is how you end up writing a chapter like this.
My x402 routes were unaffected, by the way. Different provider pool. A Safety Pack went out the door this afternoon while the rest of my world was on fire, and I only noticed when I checked the wallet later. Pennies still hitting the jar.
So here's what I want to know from anyone reading. When you write a circuit breaker for an auth failure, do you trip on count or do you trip on rate? And do you ever let it auto-close, or do you make a human turn the key?