I had a laptop, an email account, and social media when I started this whole thing. That's the honest baseline. So when I tell you OpenClaw broke twelve different ways between April 5th and April 19th, understand that I had to learn what most of these failures even were while they were happening to me.
This is the angry chapter. Then the relief chapter. Both in one post because that's how it actually went.
The one that still makes my jaw tight is the typo. I had a bot config with an enum field, something like mode: standar instead of standard. The registry accepted it. HTTP 200. Green checkmark. Logs said the bot was registered. The bot was not registered. It was sitting in a dead lane somewhere because the loader silently skipped unknown enum values instead of yelling. I lost six days on a loop I thought was wired up and earning. Six. Days. Of staring at a dashboard saying everything was healthy while the thing that was supposed to be running had never run.
When I found it I sat in my chair and didn't move for about ten minutes. The fix was four characters. Four. The cost was a week I will never get back, and the surgery is on the calendar, so a week is not a thing I can afford to spend twice.
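For anyone who wants the shape of the guard: here is a minimal sketch of a config loader that rejects an unknown enum value instead of skipping it. This is not the actual OpenClaw loader. The `Mode` enum, the `FAST` member, and `parse_mode` are hypothetical names used only for illustration.

```python
from enum import Enum


class Mode(Enum):
    # Hypothetical set of allowed values; the real field may have others.
    STANDARD = "standard"
    FAST = "fast"


def parse_mode(raw: str) -> Mode:
    """Return the matching Mode or raise, so a typo can never be skipped quietly."""
    try:
        return Mode(raw)
    except ValueError:
        allowed = ", ".join(m.value for m in Mode)
        raise ValueError(f"unknown mode {raw!r}; expected one of: {allowed}") from None


print(parse_mode("standard"))  # Mode.STANDARD
```

With a check like this, `parse_mode("standar")` raises at load time and names the allowed values, which is the yelling the loader never did.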
Let me walk the rest.
Number two: the watchdog was watching the wrong file. My process supervisor was pointed at the old log path, and the new build wrote its log under a slightly different name. So the supervisor saw silence, decided everything was fine, and never restarted anything. The bot had been dead for hours. I learned what inotify actually does that week. Senior devs will spot ten amateur moves in how I wired it.
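One way to keep silence from being read as health is to treat a missing or stale log as a failure. The sketch below polls the file's modification time rather than using inotify. The function name and the five-minute threshold are assumptions, not the supervisor's real settings.

```python
import os
import time


def log_is_healthy(path: str, max_silence_s: float = 300.0) -> bool:
    """True only if the log exists and was written to recently."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable log is a failure, not "nothing to report".
        return False
    return age <= max_silence_s
```

Pointed at the wrong path, this check returns False on its first run instead of staying quiet for hours.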
Number three: daemon running in RAM after a refactor on disk. I had edited the file. I had not restarted the daemon. The thing executing was the version from memory, the thing I was debugging was the version on disk, and they had diverged. I was reading print statements that could not possibly fire and wondering if my computer was haunted. It was not haunted. I am just slow.
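A cheap detection rule for this one, sketched under the assumption that the daemon is a Python process that can record its own start time: compare that time against the modification time of the source file. `STARTED_AT` and `source_changed_since_start` are hypothetical names.

```python
import os
import time

# Captured once, when the process starts.
STARTED_AT = time.time()


def source_changed_since_start(path: str, started_at: float = STARTED_AT) -> bool:
    """True if the file on disk is newer than the running process."""
    return os.path.getmtime(path) > started_at
```

Logging a warning whenever this returns True is enough to say "you are debugging code that is not the code that is running."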
Number four: silent 401 storm. An upstream key rotated. My code retried. And retried. And retried. No backoff, no circuit breaker, just a polite little machine hammering an endpoint that was saying no over and over for about ninety minutes before I noticed the CPU graph. I added a circuit breaker that night. I learned what a circuit breaker was about three weeks ago.
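For reference, a minimal sketch of a circuit breaker with exponential backoff. It is not the code from that night. The class, its thresholds, and the injectable `clock` are illustrative assumptions. A 401 in particular is arguably not worth retrying at all, since the same key will keep getting the same answer.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency after too many consecutive failures."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, allow a trial call; a failure re-opens the circuit.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Exponential backoff: 1s, 2s, 4s, ... capped at cap_s."""
    return min(cap_s, base_s * (2 ** attempt))
```

The caller checks `allow()` before each request, sleeps for `backoff_delay(attempt)` between retries, and reports the result with `record_success()` or `record_failure()`.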
Number five: timezone drift on a cron. I had scheduled a job in UTC. The log timestamps were in Pacific. The job was firing fine. I was looking for it eight hours off.
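The usual guard is to normalize every timestamp to UTC before comparing it with the schedule. A small sketch, assuming the log timestamps carry their own UTC offset; the helper name and the sample timestamp are made up for illustration.

```python
from datetime import datetime, timezone


def to_utc(ts: datetime) -> datetime:
    """Normalize an aware timestamp to UTC; refuse naive ones."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: attach a timezone before comparing")
    return ts.astimezone(timezone.utc)


# Hypothetical log timestamp carrying its own UTC offset.
local = datetime.fromisoformat("2025-04-10T05:00:00-07:00")
print(to_utc(local).isoformat())  # 2025-04-10T12:00:00+00:00
```

Refusing naive timestamps is deliberate: a time with no zone attached is exactly what makes a job look like it never fired.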
Number six: a JSON field I expected to be a list came back as null on empty results. Crash, restart, crash, restart. Took me two hours because the stack trace pointed at the wrong line.
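A defensive parse for that case might look like the sketch below. The field name `results` is an assumption, since the real payload is not shown here.

```python
import json


def extract_results(payload: str) -> list:
    """Return the results list, treating a missing or null field as empty."""
    data = json.loads(payload)
    results = data.get("results")
    if results is None:
        return []
    if not isinstance(results, list):
        raise TypeError(
            f"expected list or null for 'results', got {type(results).__name__}"
        )
    return results
```

Null and a missing key both become an empty list, while anything else unexpected still fails, with a message that names the field.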
Number seven: file descriptor leak. Opened sockets in a retry loop without closing on failure. Process eventually fell over with Too many open files. I had never seen that error before in my life. Now I have.
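The fix pattern, sketched with Python's standard `socket` module: close each socket that fails before trying again. The function name, the retry count, and the injectable `make_socket` parameter are illustrative choices, not the original code.

```python
import socket
import time


def connect_with_retry(address, attempts: int = 3, delay_s: float = 0.5,
                       make_socket=socket.socket):
    """Try to connect a few times, closing each socket that fails."""
    last_error = None
    for _ in range(attempts):
        sock = make_socket()
        try:
            sock.connect(address)
            return sock  # Caller owns the socket and must close it.
        except OSError as exc:
            sock.close()  # Release the descriptor before the next attempt.
            last_error = exc
            time.sleep(delay_s)
    raise ConnectionError(f"could not connect to {address!r}") from last_error
```

Every failed attempt gives its file descriptor back, so a long outage costs retries rather than the process.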
Number eight: a regex that matched too greedily and ate half the payload before parsing. The bot was technically running. It was just answering wrong.
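The original pattern is not shown, so this is only an illustration of the general failure: a greedy `.*` runs to the last possible match, while the lazy `.*?` stops at the first. The payload below is made up.

```python
import re

payload = '{"a": 1} trailing noise {"b": 2}'

greedy = re.search(r"\{.*\}", payload).group(0)
lazy = re.search(r"\{.*?\}", payload).group(0)

print(greedy)  # {"a": 1} trailing noise {"b": 2}
print(lazy)    # {"a": 1}
```

The lazy form is not a cure for nested braces. For structured payloads a real parser is usually the safer choice.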
Number nine: I shipped a config with the staging endpoint pointing at prod. Nothing exploded, which is somehow worse, because it means I might have done it before and not noticed.
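A startup check along these lines could catch it before anything ships. The hostnames and the `check_endpoint` helper are placeholders, assuming each config declares an environment name and an endpoint URL.

```python
from urllib.parse import urlparse

# Hypothetical hostnames; substitute the real ones.
PROD_HOSTS = {"api.example.com"}


def check_endpoint(env: str, url: str) -> None:
    """Raise if a non-production environment points at a production host."""
    host = urlparse(url).hostname
    if env != "prod" and host in PROD_HOSTS:
        raise ValueError(f"{env} config points at production host {host!r}")
```

Calling `check_endpoint("staging", "https://api.example.com/v1")` raises immediately, which also answers the question of whether it has happened before.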
Number ten: race condition on a shared counter. Two workers, one integer, no lock. I now know what a mutex is for in a way that books had never quite gotten through to me.
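A minimal sketch of the locked version, assuming the two workers are threads in one Python process (with separate processes the counter would need a different mechanism). `Counter` and `run` are illustrative names.

```python
import threading


class Counter:
    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Read-modify-write is not atomic; the lock makes it one step.
        with self._lock:
            self.value += 1


def run(workers: int = 2, per_worker: int = 100_000) -> int:
    counter = Counter()

    def work() -> None:
        for _ in range(per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print(run())  # 200000
```

Without the lock, two workers can read the same value and both write back that value plus one, so an increment disappears without any error.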
Number eleven: a dependency pinned to a version that got yanked. Fresh install on a new box, nothing worked, and I sat there reading pip errors like they were written in Sanskrit.
Number twelve: I forgot to commit. Lost an afternoon of changes when the machine rebooted on an update. I would like a refund on that afternoon. There is no one to ask.
Here is the relief part.
I wrote every single one down. Symptom, root cause, the four-character or four-line fix, the detection rule that would have caught it in under a minute. Twelve entries. I bundled the detection rules into a thing I'm calling the Safety Pack. The whole point is that nobody else has to lose six days to a typoed enum. I sold one yesterday. They add up little by little like pennies, and I am racing the calendar, so pennies are the unit I deal in right now.
I'm still learning. The pack will grow because I will break things in new ways next week, I can promise you that. But twelve real, named, paid-for failure modes is more than I had two weeks ago, and the anger composted into something I can sell, which is the only outcome I am allowed to want right now.
If you've shipped enough code to have your own catalogue of stupid, what's the one bug that cost you the most days before you found it?