I’ve been running field tests against live open-source issues to study a specific question:
When AI coding agents make repairs, where do they lose the actual truth of the repo?
Not “does the AI write code?”
Not “can it pass a test?”
More specifically:
Can the agent identify the boundary that actually broke before it starts changing files?
So far, the strongest failure classes I’m seeing are things like:
- provider/config contract drift
- retrieval/source-context truth loss
- dev/build behavior divergence
- build artifact metadata losing source truth
- generated output being treated as stronger evidence than it really is
One example: in an AI app, search results came back successfully, but the retrieved truth never survived into usable model context because a response-shape boundary collapsed.
Another example: in a build tool, the same plugin hook filter behaved differently in dev and build.
These are not just “bugs.” They are places where one part of the system says one thing, and another part silently treats it differently.
That is the area I’m trying to classify.
I’m building Scarab Diagnostic Suite around this idea: a repo-local diagnostic layer that identifies truth boundaries before an AI coding agent starts repairing.
I’m posting the field tests as a numbered series.
The question I’d love to hear from other developers:
When you’ve used AI coding agents, where have you seen them lose repo truth?
Examples:
- changing code to satisfy a test instead of preserving the actual contract
- fixing the symptom but not the boundary
- introducing hidden state drift
- treating generated files as source truth
- touching too much surface area
- missing dev/build differences
- assuming config shape that is not guaranteed
- creating “green” results that did not actually measure anything
I’m especially interested in concrete failure shapes, not just general AI complaints.
Where does the agent go wrong?
Where should a diagnostic system force it to stop and prove the boundary first?
Top comments (0)