DEV Community

Rafael Costa

Five Corrections: What an AI Agent Didn't Know About My Production Database

And why "just write a better prompt" is the wrong lesson

The AI agent had just pulled 30 days of CloudWatch metrics, parsed them correctly, built a table, and pivoted the entire recommendation based on what it found.
I typed four words: "that's shared cluster data."
The deadlock counts, the connection peaks, the daily patterns — all real numbers from a real database cluster. All useless for the decision we were making. Our application shares that cluster with other services. The agent had processed the data flawlessly and drawn a conclusion from someone else's workload.
The data was flawless. The conclusion was someone else's. That gap has a structure worth looking at.

The setup

Multi-tenant Django application on Aurora MySQL. Low user count, background workers for bulk operations, primary-replica topology. No transaction isolation level configured anywhere — MySQL's default REPEATABLE READ running unchallenged. Sporadic deadlocks in background tasks, stale reads on the replica.
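What "nothing configured" means in practice can be sketched with a small helper (hypothetical, not from the codebase in question). Django's MySQL backend accepts an `isolation_level` key in `OPTIONS`; absent that, or a `SET ... ISOLATION LEVEL` inside `init_command`, every session silently inherits MySQL's REPEATABLE READ default:

```python
MYSQL_DEFAULT = "REPEATABLE READ"

def configured_isolation(db_settings: dict) -> str:
    """Return the isolation level a Django MySQL connection will use.

    django.db.backends.mysql honours an 'isolation_level' key in
    OPTIONS; failing that, an init_command can pin it; failing both,
    the session runs at MySQL's default.
    """
    options = db_settings.get("OPTIONS", {})
    if "isolation_level" in options:
        return options["isolation_level"].upper()
    init = options.get("init_command", "")
    if "ISOLATION LEVEL" in init.upper():
        return init.upper().split("ISOLATION LEVEL", 1)[1].strip().rstrip(";")
    return MYSQL_DEFAULT

# The situation the investigation started from: no OPTIONS at all,
# so the default applies without anyone having chosen it.
assert configured_isolation({"ENGINE": "django.db.backends.mysql"}) == MYSQL_DEFAULT
```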
I pointed an AI agent at the codebase and asked it to devise a plan for choosing the right isolation level. Not "switch to READ COMMITTED" — decide what it should be, with evidence.

What the agent did well

Within minutes it had explored the entire codebase, mapped six categories of concurrent access patterns, and evaluated all four MySQL isolation levels against each. Found that no isolation level was configured anywhere — not in Django settings, not in middleware, not in init_command. Identified the specific background tasks doing bulk write cycles, the views with missing transaction boundaries, the delete-then-recreate services structurally vulnerable to race conditions.
This would have taken me a full day of methodical grep-and-read. The agent produced a concurrency map that didn't exist before the session started.
Then it started getting things wrong.

Three corrections, same shape

Over the course of the investigation, I interrupted the agent five times. Three of those reveal the same structural pattern. The other two are different in kind — I'll get to those.
"That's shared cluster data." CloudWatch showed hundreds of peak connections and sporadic deadlocks across 30 days. The agent built a careful analysis from these numbers, but those numbers belonged to the Aurora cluster, not to our application — low user count, handful of background workers, a fraction of those connections. The deadlocks could be entirely from other tenants on the same cluster.
Could a better prompt have prevented this? Sure. "The Aurora cluster is shared; CloudWatch metrics are cluster-wide." I knew that before the session. It's promptable.
But I expected the agent to catch it. It had the user count, the worker count, the CLAUDE.md. There was enough signal to at least question whether those connection peaks belonged to us. It didn't question anything — processed the numbers as ours and kept going.
"Did you check production only, or were staging and dev mixed?" The agent queried SigNoz for database metrics — millions of reads, tens of thousands of writes, clean latency percentiles. SigNoz had separate service entries per environment, and the agent hadn't verified which ones it was aggregating. The CLAUDE.md, the project docs — the namespace separation was right there.
"What about the slow queue workers?" The agent pulled error data for the main API service and the primary background worker. Missed the slow-queue workers entirely — the ones handling the heaviest bulk operations, the exact workloads where deadlocks would actually manifest. I know which queues carry what because I designed the routing. The agent queried the obvious service names and stopped.
The pattern. Each time, the agent had directional context: the codebase, the project setup, the CLAUDE.md scoping the investigation. The shared cluster, the environment boundaries, the queue routing were all inferrable from material it could see.
Fair objection: "So the context was there and the agent missed it. That's a tooling problem." Not a bad objection — better agents will get better at this.
But notice what's actually happening. The agent treats observability data as self-describing, and observability data inherits the topology of the infrastructure that produces it — structure that doesn't announce itself in the query results. CloudWatch doesn't label which connections belong to which tenant. SigNoz doesn't flag cross-environment aggregation. The data just arrives, looking authoritative.
The total context space includes cluster layout, environment naming conventions, service routing, monitoring configuration, team history with past incidents. The hard part isn't supplying that context. It's knowing which piece applies at the exact moment of inference — and that's a judgment problem wearing a context mask.
Better agent protocols will close some of this gap: verification steps before ingesting observability data, environment boundary checks, worker enumeration. Adversarial loops and self-correction prompts — "before drawing conclusions from this data, verify its scope" — would probably have caught all three. But someone has to write that checklist, and it's the engineer who already knows where the agent will drift. You're pre-encoding judgment into process. The human steering doesn't disappear; it moves earlier in the pipeline.
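One sketch of what "pre-encoding judgment into process" could look like: a scope check run before trusting a metrics pull. Everything here (the field names, the service lists) is a hypothetical illustration, not an existing agent protocol.

```python
from dataclasses import dataclass

@dataclass
class MetricScope:
    source: str           # e.g. "cloudwatch", "signoz"
    services: set[str]    # service entries the query actually aggregated
    environment: str      # environment the data was pulled from
    cluster_shared: bool  # does the metric cover tenants beyond ours?

def scope_problems(scope: MetricScope, *, expected_services: set[str],
                   expected_env: str) -> list[str]:
    """Return the corrections a human would otherwise have to make."""
    problems = []
    if scope.cluster_shared:
        problems.append("cluster-wide data: numbers include other tenants")
    if scope.environment != expected_env:
        problems.append(f"wrong environment: {scope.environment}")
    missing = expected_services - scope.services
    if missing:
        problems.append(f"workers not queried: {sorted(missing)}")
    return problems
```

Run against the SigNoz pull from the story, a check like this would have flagged the missing slow-queue workers before any conclusions were drawn. But note who had to write the `expected_services` set: the engineer who designed the queue routing.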
That objection only holds for these three. The next two resist it entirely.

Two different species

Premature convergence. The agent's first plan evaluated REPEATABLE READ versus READ COMMITTED and recommended switching. MySQL has four isolation levels. The plan hadn't touched READ UNCOMMITTED or SERIALIZABLE, hadn't considered per-connection versus per-transaction strategies, hadn't looked at Aurora-specific features. It optimized before it explored. Not all of those gaps matter equally — per-transaction strategy is the one that actually bites — but the point is the agent never even mapped the decision space before narrowing it. Probably because the REPEATABLE READ vs. READ COMMITTED binary dominates the MySQL docs and Stack Overflow, the training data funnels toward that framing: strong early evidence triggers convergence, and decision-space awareness is a judgment skill, not an instruction-following one.
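The per-connection versus per-transaction distinction the plan skipped comes down to one keyword of MySQL syntax. A minimal sketch (the helper is hypothetical; the generated SQL is standard MySQL):

```python
def isolation_statement(level: str, *, whole_session: bool) -> str:
    """Build a MySQL isolation-level statement.

    With SESSION, the level applies to every subsequent transaction on
    the connection; without it, only to the next transaction, after
    which the connection reverts to its session-level setting.
    """
    scope = "SESSION " if whole_session else ""
    return f"SET {scope}TRANSACTION ISOLATION LEVEL {level}"

assert isolation_statement("READ COMMITTED", whole_session=True) == \
    "SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED"
```

The per-transaction form is what makes a mixed strategy possible at all: a connection-wide default with targeted exceptions, rather than one level for everything.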
Adversarial peer review. After the agent recommended READ COMMITTED, I dropped in a document I'd prepared separately: a fact-check showing this is a defensible but contested position. Dimitri Kravtchuk, Oracle's MySQL performance architect, has shown that READ COMMITTED creates per-statement ReadView overhead causing trx_sys mutex contention at scale — up to 2x performance degradation for short transactions. No prompt can pre-load a rebuttal to a conclusion that hasn't been reached yet. That's the human acting as adversarial reviewer, not correcting the trajectory but stress-testing the destination.

Why this generalizes

The isolation level was the vehicle. The pattern applies anywhere the codebase is an incomplete map of the system — capacity planning, migration strategies, incident response — anywhere the agent can analyze what's visible but can't weigh what's implicit.
Providing context and having the agent apply it at the right inferential moment are different problems. The second requires holding the system in your head, which is exactly the kind of knowledge an agent is supposed to help you leverage, not replace.

What actually happened

The agent shipped a solid architecture decision record, a team-facing report, a concurrency architecture document, half a dozen issues for the gaps it found, and a PR with the implementation. READ COMMITTED via init_command, with an escape hatch for the two operations that genuinely need snapshot isolation. Days of senior engineering work compressed into hours.
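The shape of what shipped, sketched under assumptions: `init_command` and the `READ COMMITTED` default are stated in the text, but the escape-hatch helper and its name are hypothetical illustration, not the actual PR.

```python
from contextlib import contextmanager

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "OPTIONS": {
            # Every new connection starts at READ COMMITTED.
            "init_command": "SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED",
        },
    }
}

@contextmanager
def snapshot_isolation(connection):
    """Escape hatch for operations that genuinely need snapshot isolation.

    Without the SESSION keyword, SET TRANSACTION applies to the next
    transaction only; afterwards the connection reverts to the
    session-level READ COMMITTED set by init_command.
    """
    with connection.cursor() as cursor:
        cursor.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    yield
```

Wrapping one of the two snapshot-dependent operations in `with snapshot_isolation(connection):` just before opening its transaction keeps the rest of the application on READ COMMITTED.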
But the final recommendation was correct because a human who understood the system — the infrastructure, the observability configuration, the workload routing, the external literature — kept redirecting the analysis every time it drifted.
Who in the room holds the topology?
That's the mental model problem applied to infrastructure. The agent is a force multiplier — but a multiplier needs something to multiply.
The faulty inferences, despite context that should have been more than enough, are mental-model shadows: business rules, operational structure, and infrastructure constraints are all projections of the same multidimensional creature, one that protects you when someone in the room holds the model and devours you when no one does.
