AI coding tools are not reliable out of the box. Anyone who has used them on a real production project knows this. They work great on greenfield code. They fall apart on a codebase with history.
I learned this the hard way.
The Problem
We have a fintech project that has been running for nearly three years. Two React frontends, an admin panel and a client-facing app, sitting on top of a FastAPI backend with a complex relational database. Multiple tables, heavy joins, financial and personal user data that has to be handled carefully.
Adding new features should be simple. Most tasks are straightforward: new frontend page, new API route, new integration with an external financial data provider. But it was taking longer than it should. The database structure was only fully understood by one person. Every new query, every new table join, needed that person in the loop.
We decided to bring AI into the workflow to move faster. That is when things got interesting.
AI Just Completely Ignored Our Users Table
I asked the AI to create a contacts table. It did. And it created it with columns for first name, last name, email, and phone number. All fields we already had in our users table. Instead of linking the two tables with a foreign key, it just dumped everything into a new table and duplicated the data. I had to go in and fix the schema manually.
This was not the AI being stupid. It had no idea our users table existed. It made a reasonable decision with incomplete information, and that is exactly the kind of mistake that is hard to catch until it is already in your schema.
That incident changed how I thought about the problem. Instead of asking "how do I get AI to write better code?" I started asking: "What does the AI need to know to make good decisions?"
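For comparison, here is a minimal sketch of the linked schema the contacts task called for, using Python's built-in sqlite3 module so it runs anywhere. The column names and the `relationship` field are assumptions for illustration, not the project's real schema. The point is only that `contacts` references `users` instead of copying name, email, and phone.

```python
import sqlite3


def build_schema():
    """Create an in-memory database where contacts links to users by foreign key."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE users (
            id         INTEGER PRIMARY KEY,
            first_name TEXT NOT NULL,
            last_name  TEXT NOT NULL,
            email      TEXT NOT NULL UNIQUE,
            phone      TEXT
        );

        -- contacts holds only what is new; identity fields stay in users.
        CREATE TABLE contacts (
            id           INTEGER PRIMARY KEY,
            user_id      INTEGER NOT NULL REFERENCES users(id),
            relationship TEXT  -- hypothetical contact-specific column
        );
        """
    )
    return conn


conn = build_schema()
conn.execute(
    "INSERT INTO users (first_name, last_name, email, phone) VALUES (?, ?, ?, ?)",
    ("Ada", "Lovelace", "ada@example.com", "555-0100"),
)
conn.execute(
    "INSERT INTO contacts (user_id, relationship) VALUES (?, ?)",
    (1, "emergency"),
)

# Name and email come from users through the join, so there is one copy of each.
row = conn.execute(
    """
    SELECT u.first_name, u.last_name, u.email, c.relationship
    FROM contacts AS c
    JOIN users AS u ON u.id = c.user_id
    """
).fetchone()
```

With the foreign key enforced, a contact that points at a missing user is rejected with `sqlite3.IntegrityError` rather than silently stored.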
The Fix: Make the Codebase Legible to AI
I took inspiration from Matt Pocock's work on structured AI workflows and adapted it for our team. The core idea is simple: AI is only as good as the context you give it. So we made the context explicit and authoritative.
Here's what we set up:
1. Architectural Decision Records (ADRs)
We created a docs/adr/ directory of ADR files: documents that record why we made specific architectural decisions, not just what they are. Each ADR answers a specific question the AI might face:
- How do we create a new table?
- How do we link tables together?
- How do we structure a new API route?
- Where does each type of file live in the codebase?
The contacts table mistake became ADR-001: When creating a new table, check existing tables first. If relevant fields already exist, use a foreign key. Never duplicate user data. Always ask before creating new columns that could belong in an existing table.
Now when the AI encounters a new table task, it reads that rule first.
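The "check existing tables first" rule can also be backed by a small mechanical check. The sketch below is one possible approach, not something the original setup is known to include: `find_duplicate_columns` is a hypothetical helper, and it uses SQLite introspection as a stand-in for whatever database the project actually runs on.

```python
import sqlite3


def find_duplicate_columns(conn, proposed_columns, ignore=("id",)):
    """Map each proposed column name to the existing tables that already have it.

    Names in `ignore` (such as generic primary keys) are skipped to avoid noise.
    """
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    overlaps = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        existing = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
        for column in proposed_columns:
            if column in existing and column not in ignore:
                overlaps.setdefault(column, []).append(table)
    return overlaps


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, email TEXT, phone TEXT)"
)

overlaps = find_duplicate_columns(conn, ["id", "first_name", "email", "relationship"])
print(overlaps)  # {'first_name': ['users'], 'email': ['users']}
```

A non-empty result is a cue to stop and ask whether a foreign key would do the job, which is the question ADR-001 is meant to force.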
2. A Context File and Glossary
Our codebase had terms that meant specific things to us, words that an AI trained on general code would misinterpret. We wrote a context.md that explained what each term means in our codebase specifically, how different terms relate to each other, and which concepts that sound similar are actually different in our system.
We also wrote a plot.md, a high-level map of what the project is, what it does, and how the pieces connect.
Both files have one rule at the top: the docs directory is authoritative. These rules are not suggestions. Follow them in order. Do not skip steps.
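How these files reach the AI depends on the tool, and the article does not specify it. As a rough sketch, assuming a docs/ folder with context.md, plot.md, and an ADR subfolder (called `adr` here), a hypothetical `build_context` helper could assemble them into a single preamble with the authority rule first and the ADRs in numeric order:

```python
import tempfile
from pathlib import Path

AUTHORITY_RULE = (
    "The docs directory is authoritative. These rules are not suggestions. "
    "Follow them in order. Do not skip steps."
)


def build_context(docs_dir, adr_dirname="adr"):
    """Concatenate context.md, plot.md, and the ADRs (sorted by filename)."""
    docs = Path(docs_dir)
    ordered = [docs / "context.md", docs / "plot.md"]
    adr_dir = docs / adr_dirname
    if adr_dir.is_dir():
        # Zero-padded prefixes (001-, 002-, ...) make a plain sort numeric.
        ordered += sorted(adr_dir.glob("*.md"))

    sections = [AUTHORITY_RULE]
    for path in ordered:
        if path.is_file():
            name = path.relative_to(docs).as_posix()
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections)


# Usage with throwaway files; the contents are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "docs"
    (docs / "adr").mkdir(parents=True)
    (docs / "context.md").write_text("Glossary goes here.", encoding="utf-8")
    (docs / "plot.md").write_text("Project map goes here.", encoding="utf-8")
    (docs / "adr" / "002-api-structure.md").write_text("Route rules.", encoding="utf-8")
    (docs / "adr" / "001-table-creation.md").write_text(
        "Check existing tables first.", encoding="utf-8"
    )
    preamble = build_context(docs)
```

Keeping the order fixed means the AI sees the same rules in the same sequence every session, which is part of what makes its behavior predictable.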
3. Test Cases for Every API, No Exceptions
Every new API route now ships with test cases. Not optional. Not "when we have time."
This turned out to matter more than I expected, not just for code quality but for keeping AI reliable over long sessions.
Here is what happened once: the AI made a small change to a shared utility function. The kind of change that looks harmless in isolation. But that function was used in twelve places and the change broke eight of them. The test suite caught it immediately. The AI saw the failures, traced them back to the shared function, and instead of just reverting, it created a new version of the function that handled both the old behavior and the new requirement. It fixed its own mistake without me touching it.
Without those tests, that bug would have shipped.
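The article does not say which utility function was involved, so the following is an invented example of the same pattern: a shared helper gains a new optional parameter, and tests pin both the old behavior and the new requirement. `format_amount` and its currency suffix are assumptions for illustration only.

```python
import unittest


def format_amount(cents, currency=None):
    """Format an integer number of cents as a decimal string.

    Callers that omit `currency` get exactly the old output; the new
    requirement is covered by the optional suffix.
    """
    sign = "-" if cents < 0 else ""
    units, remainder = divmod(abs(cents), 100)
    text = f"{sign}{units}.{remainder:02d}"
    return text if currency is None else f"{text} {currency}"


class FormatAmountTests(unittest.TestCase):
    def test_existing_callers_unchanged(self):
        # These pin the behavior the existing call sites rely on.
        self.assertEqual(format_amount(1234), "12.34")
        self.assertEqual(format_amount(-5), "-0.05")

    def test_new_currency_requirement(self):
        self.assertEqual(format_amount(1234, currency="EUR"), "12.34 EUR")
```

Saved in a test module, this runs with `python -m unittest`. The first test is the one that would have failed when the shared function changed, which is the signal that let the AI trace the breakage back to its own edit.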
What Changed for the Team
When I showed this to my team, a few people were skeptical. AI tools had let them down before. They had seen hallucinations, broken code, confident wrong answers.
After the session, three teammates asked me to set up their projects the same way.
The shift was not about trusting AI more. It was about understanding that AI needs structure to be reliable, the same way a new developer needs onboarding, documentation, and code review. You do not blame a new hire for not knowing your codebase. You document it.
We did the same for the AI.
The Setup, If You Want to Try It
docs/
  context.md    # What this project is, what terms mean, how pieces connect
  plot.md       # High-level map of the codebase
  adr/
    001-table-creation.md
    002-api-structure.md
    003-query-patterns.md
    ...
A few things that matter:
- Be specific in your ADRs. "Check existing tables before creating new ones" beats "follow good database practices."
- Make the docs authoritative. Tell the AI these rules come first, always.
- Add a new ADR every time the AI makes a mistake. Turn failures into rules.
- Write test cases for every new API. Run them before every commit. When the AI breaks something, add a test for that case so it can never happen silently again.
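One way to make "run them before every commit" automatic is a Git pre-commit hook. The sketch below assumes pytest is the test runner, which the article does not state, and `run_checks` is a hypothetical name. Git aborts the commit whenever the hook exits with a non-zero status.

```python
import subprocess
import sys


def run_checks(command=None):
    """Run the test command and return its exit code (0 means everything passed)."""
    if command is None:
        # Assumption: the project uses pytest; swap in the real test command.
        command = [sys.executable, "-m", "pytest", "-q"]
    return subprocess.run(command).returncode
```

To use it as a hook, the script would end with `sys.exit(run_checks())`, start with a Python shebang line, and be saved as an executable file at `.git/hooks/pre-commit`.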
One Last Thing
This system does not make AI perfect. It makes AI predictable. There is a difference.
The AI still makes mistakes. But now the mistakes are smaller, easier to catch, and when they happen, they teach us something we can encode into the next ADR.
The goal was never a codebase where AI does everything right. It was a codebase where AI does things consistently, and where the team can move faster because of it.
That is what we built.
I am a fullstack developer and tech lead working on backend systems in Python, Go, and Node.js.
Top comments (1)
This is the part most teams underestimate: AI doesn’t hallucinate randomly — it hallucinates when the system has no authoritative source of truth.
What you effectively built here is a lightweight context + constraint layer (ADRs, glossary, tests) that turns the codebase from “implicit knowledge” into “retrievable rules”. That’s what actually reduces wrong assumptions like duplicated tables or unsafe schema changes.
The key shift is important: you didn’t make AI smarter — you made it less allowed to guess.