I tried using AI on a 15k+ line codebase. It failed badly. Wrong changes, broken logic, random imports — classic. Then I changed how I used it, and it started saving me weeks of work.
The Problem
AI doesn’t understand large codebases. You can’t just paste a repo and say:
“refactor this”
It will:
- miss dependencies
- break existing flows
- hallucinate logic
- touch things you didn't ask for
I learned this the hard way.
The Setup
This is not a small project: 15k+ lines spread across multiple screens, services, and storage layers. Not something AI can "just understand".
What Actually Worked
- Stop dumping the whole codebase
Instead of:
“Here’s my project, fix X”
Do:
- give only the relevant files
- explain the relationships between them manually
AI is not context-aware at scale. You have to simulate context.
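One way to simulate context is a small helper that assembles only the relevant files plus a hand-written note about how they relate. Everything below (file names, the relationship note) is a hypothetical sketch, not a specific tool:

```python
from pathlib import Path

# Hypothetical file list and relationship note -- adapt to your project.
RELEVANT_FILES = ["services/auth.py", "storage/session_store.py"]
RELATIONSHIP_NOTE = "auth.py calls session_store.save() after login."

def build_prompt(task: str, root: str = ".") -> str:
    """Assemble a prompt containing only the files relevant to the task."""
    parts = [f"Task: {task}", f"Context: {RELATIONSHIP_NOTE}"]
    for rel in RELEVANT_FILES:
        source = (Path(root) / rel).read_text()
        parts.append(f"--- {rel} ---\n{source}")
    return "\n\n".join(parts)
```

The point isn't the script itself. It's that *you* decide what the model sees, instead of hoping it figures out the dependency graph on its own.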
- Give strict instructions (like you would to a junior dev)
Bad:
“Improve this”
Good:
“Modify this function only. Do not change API shape. Do not touch unrelated files.”
The difference is massive.
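Those constraints can even live in a reusable template so you never type them lazily. The placeholder names here are hypothetical:

```python
# Hypothetical template encoding the "strict instructions" above.
STRICT_PROMPT = (
    "Modify only the function `{function_name}` in {file_path}.\n"
    "Do not change the API shape.\n"
    "Do not touch unrelated files.\n"
    "Task: {task}"
)

prompt = STRICT_PROMPT.format(
    function_name="save_session",
    file_path="storage/session_store.py",
    task="retry once on write failure",
)
```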
- One change at a time
Don’t do:
“Refactor this entire flow”
Do:
- break the work into small steps
- verify each change
- then move forward
AI works best iteratively, not in one shot.
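That loop can be mechanized: apply one change, run the tests, keep it only if they pass. A rough sketch, assuming git and pytest are available (the function names are mine, purely illustrative):

```python
import subprocess

def apply_change_safely(apply_change) -> bool:
    """Apply one AI-suggested change, verify it, revert if it fails."""
    apply_change()  # e.g. write the AI's patch to disk
    result = subprocess.run(["pytest", "-q"], capture_output=True)
    if result.returncode != 0:
        # Tests failed: throw the change away before moving on.
        subprocess.run(["git", "checkout", "--", "."])
        return False
    # Tests passed: checkpoint it so a later failure can't take it down too.
    subprocess.run(["git", "commit", "-am", "checkpoint: verified change"])
    return True
```

One verified commit per change means any later mess is one `git revert` away, not an archaeology project.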
- Expect it to fail, so always keep a backup
It will:
- generate wrong logic
- miss edge cases
- introduce bugs
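A toy illustration (hypothetical, not from the actual project) of the kind of edge case it will happily ship:

```python
def average_ai(values):
    # The AI's "refactor": looks clean, but crashes on an empty list.
    return sum(values) / len(values)

def average_reviewed(values):
    # The senior-review pass catches the edge case.
    if not values:
        return 0.0
    return sum(values) / len(values)
```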
This is where you review like a senior — not trust blindly.
The Real Insight
AI didn’t replace my work. It compressed months into days, but only because I guided it properly. If you treat AI like magic, it breaks. If you treat it like a junior dev, it becomes powerful.
Final Thought
AI is not bad at large codebases. Most people are just bad at using it correctly. If you're working on bigger projects, stop prompting lazily. That’s the real bottleneck.

Top comments (4)
This resonates hard. I run a programmatic SEO site with 8,000+ stock ticker pages across 12 languages, and AI is central to my content generation pipeline. Your point about "simulating context" is the key insight most people miss — you can't just throw a massive codebase at an LLM and expect coherent output.
What I've found working at this scale is that the "one change at a time" principle extends beyond code into content too. When I use a local LLM to generate financial analysis for thousands of pages, I have to feed it very specific data per ticker — financials, sector context, peer comparisons — rather than asking it to "write analysis for all tech stocks." The hallucination problem you describe with code logic is even worse with financial data, where a wrong number can completely mislead readers.
Curious — have you experimented with building any kind of automated validation layer on top of AI outputs? That's been my biggest challenge: scaling the "review like a senior" step when you have thousands of outputs to verify.
Yeah, makes sense — in your case the risk is way higher since wrong data doesn’t break anything, it just silently becomes misinformation.
I haven’t built a fully automated validation layer either. What I’m doing right now is more of a hybrid approach.
I usually design the architecture, flow, and logic first, and then guide the AI to implement specific parts. That alone cuts down a lot of errors since the AI isn’t making system-level decisions on its own.
When something breaks, I try to trace the likely source and fix it step by step. And if it turns into a dead end or things get too messy, I just revert to a previous Git commit and restart that part clean.
For your case, one thing that might help is keeping data and generation strictly separate: don't let the AI generate actual numbers at all, just have it explain values fetched from a DB or API. Forcing more structured outputs instead of free-form text could also make validation easier.
Even simple checks on important values or claims before publishing might go a long way. And instead of reviewing everything, focusing only on higher-risk outputs could make it manageable.
Full automation still feels risky right now. It’s more about reducing what needs manual review rather than removing it completely.
Curious how your pipeline is structured right now — is it mostly free-form generation or more controlled?
Definitely more controlled than free-form. The LLM gets a structured prompt per ticker — here's the company data, here are the financials, here's the sector context — and it generates analysis within a specific template. So the numbers themselves come from the database, the LLM just writes the narrative around them.
The tricky part is exactly what you said though — even with structured inputs, the LLM sometimes interpolates wrong conclusions from the data. Like it'll see a dividend yield of 0.0042 and write "41% dividend yield" because it misread the decimal. Those are the high-risk outputs I'm starting to filter for with simple regex checks before publishing.
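For what it's worth, a check like that can be a few lines. This sketch (the ceiling is a made-up threshold) flags any percentage above what's plausible for a dividend yield:

```python
import re

def suspicious_percentages(text, ceiling=15.0):
    """Flag percentages in generated text above a plausibility ceiling."""
    return [m for m in re.findall(r"(\d+(?:\.\d+)?)\s*%", text)
            if float(m) > ceiling]
```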
Your suggestion about not letting the AI generate numbers at all is where I'm heading next — have it reference data by placeholder, then inject the real values post-generation. Appreciate the thoughtful breakdown.
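Something like this is what I have in mind (field name hypothetical): the model writes a placeholder instead of a number, and the real value is injected after generation, so it can never misread a decimal.

```python
def inject_values(template, data):
    """Fill placeholders in generated text with real values from the DB."""
    return template.format(**data)

narrative = inject_values(
    "The stock offers a {dividend_yield:.2%} dividend yield.",
    {"dividend_yield": 0.0042},
)
# -> "The stock offers a 0.42% dividend yield."
```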
Glad the discussion helped in some way. Good luck with the project — sounds like you're heading in the right direction.