Aliaksei Zelianouski

Posted on Jun 27 • Originally published at azelianouski.dev

The second agent I won't automate

#ai #agents #monitoring #claudecode

A couple of weeks ago I wrote about the loop that watches my production while I sleep - a claude -p heartbeat that scrapes my logs, budgets, and game database every 20 minutes and pings me on Telegram when something's off. I ended that one on a throwaway line: once you know about the problems, Claude Code can usually fix them itself.

That's true. It can. I just don't let it.

The monitoring is really two agents, not one. The first is the loop. Its job is triage: collect the errors, check the app state, decide how bad each one is, and fold the noise into a digest so I'm not woken up over a transient blip. That's Marlow, and it's fully autonomous.

The second agent is the one that actually troubleshoots - stitches the logs to the user data to the action traces to the source code, finds the root cause, writes the fix, and patches the database if a game got stuck mid-play. That one is Simona, my customized Claude Code, and I drive it by hand. Every time.

Here's why.

A normal-looking bad day

Yesterday the loop sent me three digest entries over two hours, watching the error logs for my AI Werewolf game:

17:21Z: 37 new error lines, all one known noise class - char M's actions failing through talkToAll in a 24-minute burst. One game stuck in a broadcast-retry loop, not app-wide breakage. Downgraded urgent -> digest.

17:51Z: 9 Game action failed: D errors, plus 6 warnings: Ignoring invalid/duplicate GM-selected bots: [DeepSeekFlash]. A GM picked an invalid bot name. No breakage.

18:21Z: 50 new error lines, the same game-action-failure family - char T's vote actions failing in a 12-minute burst. Plus 5 more of those DeepSeekFlash warnings.

This looked scary. I'd recently discovered that I had misconfigured JSON output for the DeepSeek models: I was relying on a prompt instruction instead of the API's dedicated structured-output feature. While fixing that, I found a bug in the DeepSeek Flash Reasoning setup. And yet here was the monitoring, flagging this exact model again.

This is why I don't want self-fixing. I need to understand what is going on. No matter how smart my coding AI is, it won't check the latest DeepSeek API to see if there are improvements in structured output. It won't unify the code for JSON parsing across all models unless I ask it to.

The loop did its job. It recognized the game-action-failures as a known noise class, confirmed nothing was app-wide, and refused to wake me. That's the boring escalation logic working as designed. It also flagged the bot-name warnings, correctly, as a separate harmless thing - the game master typed a bot name the engine didn't recognize.
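A minimal sketch of that escalation logic, not the actual Marlow code. The noise-class names, the `Verdict` type, and the "more than one affected game means app-wide" threshold are all assumptions made for illustration; only the two log messages come from the digest above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical noise classes, keyed by a label, matched against log text
# quoted in the digest entries above.
KNOWN_NOISE = {
    "game-action-failure": re.compile(r"Game action failed"),
    "invalid-gm-bot": re.compile(r"Ignoring invalid/duplicate GM-selected bots"),
}


@dataclass
class Verdict:
    level: str   # "urgent" or "digest"
    reason: str


def classify(line: str) -> Optional[str]:
    """Return the label of the first known noise class matching the line."""
    for name, pattern in KNOWN_NOISE.items():
        if pattern.search(line):
            return name
    return None


def triage(entries: list, max_games: int = 1) -> Verdict:
    """entries is a list of (game_id, log_line) pairs from one scrape."""
    unknown = [line for _, line in entries if classify(line) is None]
    if unknown:
        # Anything unrecognized wakes the human up.
        return Verdict("urgent", f"{len(unknown)} unrecognized error line(s)")
    games = {game_id for game_id, _ in entries}
    if len(games) > max_games:
        # Known noise, but spread too widely to call it one stuck game.
        return Verdict("urgent", f"known noise across {len(games)} games")
    return Verdict("digest", f"{len(entries)} line(s), all known noise")
```

With this shape, a burst of 50 known failures from one stuck game stays in the digest, while a single unfamiliar line escalates. Where to draw the app-wide line is a judgment call the sketch only hints at.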

So... it wasn't actually the JSON parsing. It was poor model reasoning, or hallucination, over player names. The model returned a non-existent name where it had to be precise, and the game logic correctly rejected the action. But why? I inject all the player names into the command - a short addition to the last message I send to an LLM. This works great - models never fail to pick the exact name from the list. So what is going on?
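Roughly, the command pattern looks like the sketch below. The function names and the exact wording of the instruction are hypothetical, not the game's real code; the point is only that the roster is appended to the last message and the reply is checked against it.

```python
def build_command(instruction: str, player_names: list) -> str:
    """Append the exact roster to the final message sent to the model."""
    roster = ", ".join(player_names)
    return (
        f"{instruction}\n"
        f"Choose exactly one name from this list, spelled as shown: {roster}"
    )


def resolve_target(reply: str, player_names: list) -> str:
    """Accept the model's reply only if it is an exact roster name."""
    candidate = reply.strip()
    if candidate not in player_names:
        # This is the game logic "correctly failing" on a made-up name.
        raise ValueError(f"Unknown player name: {candidate!r}")
    return candidate
```

The validation half was doing its job in the logs above: it rejected a name that did not exist. The interesting question is whether the first half ran at all.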

Me in the loop

Apparently, I didn't inject those names. I was sure I did, but no - not in this specific request. That's a huge miss. It's quite hard to cover prompt-engineering logic with unit tests, so this logic wasn't covered. Plus I hadn't looked into this code for a long time - thanks to vibe-coding. I used to write all the code myself, but about 6 months ago Claude Opus 4.8 stopped making bugs, and I gave up. It's too convenient when it works.

So, that was it - a real bug in the code, a very tricky one. The model did its best to extract the player names from the entire day's conversation history, and this mostly worked. But this approach suffers from hallucinations in a long conversation - which is why I came up with those commands in the first place.
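Prompt quality is hard to unit-test, but the presence of the roster is not. Here is one way such a regression check could look, assuming every request type is built by a function registered in a single table. The builders, the registry, and the deliberately broken `night_action` entry are invented for this example.

```python
from typing import Callable

PLAYERS = ["Alice", "Bob", "Carol"]


def build_vote_request(players: list) -> str:
    return "Vote to eliminate one of: " + ", ".join(players)


def build_talk_request(players: list) -> str:
    return "Say something to the table. Players: " + ", ".join(players)


def build_night_action_request(players: list) -> str:
    # Deliberately buggy for the example: the roster is never injected,
    # so the model has to recover names from conversation history.
    return "Pick your target for tonight."


# Hypothetical registry of every prompt builder that expects a name back.
REQUEST_BUILDERS: dict = {
    "vote": build_vote_request,
    "talk": build_talk_request,
    "night_action": build_night_action_request,
}


def builders_missing_roster(builders: dict, players: list) -> list:
    """Return the names of builders whose prompt omits any player name."""
    missing = []
    for name, build in builders.items():
        prompt = build(players)
        if not all(player in prompt for player in players):
            missing.append(name)
    return missing
```

A test that asserts `builders_missing_roster(REQUEST_BUILDERS, PLAYERS)` is empty would fail on `night_action` here. It says nothing about whether the prompt is any good, only that the one fact I was sure about is actually true.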

No way a self-fix loop spots this. It would just keep bolting on inefficient patches and never find the real cause. I think it's important for me to take part in debugging. It keeps me aware of the architecture. And it's really not that hard - I spent 10 minutes on this issue and Simona shipped the fix with a bunch of new tests.

The dream of automation

Right now, a lot of people try to exclude engineers from the loop. If you tell your boss it's possible to not only detect issues but quick-fix them autonomously, that's gonna be your next priority task. You still review the final code change, so it's fine. It's covered with tests - double fine. Well... without diving deep into the problems, I start forgetting how the whole system works. My understanding of the logic detaches from reality. That's the cost of pushing automation too hard. Of reading about AI and not practicing it in the field.

Top comments (3)

avp9-nexus • Jul 20

The bug story is the part I'd frame: "I was sure I did, but no." The monitoring couldn't
catch that, and neither could your certainty — only the hand-driven session did.

That's the same line I ended up drawing in my system, for a different asset. My agents bid
autonomously; the one act that moves money carries a human. Your Simona is the same gate
protecting something else — not the wallet, your mental model. "My understanding of the
logic detaches from reality" is the risk almost nobody prices in when they pitch
self-fixing loops.

My version of your rule, since I direct agents without writing the code myself: no claim
about system state enters a document without a deterministic tool output from the same day.
Not because the system drifts — because my picture of it does. Your player-names bug is
exactly that failure class: confidence, with no same-day check.

"It's too convenient when it works" is the most honest sentence in the piece.

Aliaksei Zelianouski • Jul 21

The no-claims rule holds until the system gets complex. Past some size you can't re-verify everything - you run on assumptions or you don't run at all. So my conclusion is different: if you need to know what's under the hood, don't vibe-code it. Co-work instead: small pieces, code review, and do the design yourself. Although I don't follow that on my personal project. No time to do things right.

avp9-nexus • Jul 21

You're right that full re-verification doesn't scale. The rule never asks for it — it's
claim-gated, not coverage-gated. I run on assumptions like everyone; the system is full of
things I've never checked. The rule only fires when an assumption tries to become a
sentence: at that moment I either produce the check or tag the sentence as unverified.
Ignorance is allowed. Ignorance dressed as knowledge isn't. "I was sure I did" is exactly
that costume.

Which is also why it survives on a solo budget where "do things right" doesn't: your rule
costs at build-time, and build-time is your whole life. Mine costs at claim-time, and
claims are rare. Not virtue — economics.

The side effect turned out to be the interesting part: since every check costs, the rule
pressures the claims themselves to shrink until an oracle can reach them. I stopped writing
"it works" and started writing "this event count over this block range is 1." Complexity
didn't break the rule; the rule broke my statements into oracle-sized pieces.

For comprehension — your asset — it does nothing, agreed. Different asset, different gate.
Same split as before: yours guards the mental model, mine guards the record.