All tests were run on an 8-year-old MacBook Air.
I regularly feed 100+ files into LLMs as part of my development workflow. One day I thought: why not just dump everything in and let it find all the bugs?
Spoiler: it didn't work. Here's exactly what happened — and what actually does work.
The Setup
I tested Claude Opus 4.7 and Gemini 3.5 Flash, both with thinking mode enabled. The task: find bugs across a multi-file Rust (Tauri v2) codebase. I only tested Rust, but I suspect the same thing happens regardless of language.
What Actually Happened
The models didn't crash or throw errors. They confidently returned answers. The problem was that those answers were wrong, and if you don't know the codebase well enough, you would never catch it.
This is the part nobody talks about: AI output is only as trustworthy as your ability to verify it. If you hand 100 files to an LLM and it says "no bugs found," how would you know it's lying?
Claude Opus 4.7:
- Up to ~5 files: accuracy was excellent. Genuinely impressive.
- 10+ files: started getting shaky. Missing things it should catch.
- 15 files (confirmed on the project's src-tauri directory): hallucinations increased significantly.
Gemini 3.5 Flash:
- Single file up to ~300 lines: usable, but only just. Handles simple logic fine when the prompt is clear.
- Multiple files: fell apart quickly. Results vary significantly depending on prompt quality.
Both models, even with thinking mode on, returned "no bugs found" on code that definitely had bugs.
Why This Happens: Lost in the Middle
This is a known phenomenon called "Lost in the Middle". LLMs don't use a long context evenly the way a human reads a document from top to bottom. Information in the middle of a large context window tends to be picked up less reliably than content at the beginning or end.
So when you dump 15 files into the context, the model technically "sees" all of it, but in effect it overlooks large chunks in the middle.
Important distinction: thinking mode increases reasoning depth, but it doesn't fix the underlying information retrieval problem. A larger context window does not mean the model can process everything in it accurately.
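If you want to check the position effect on your own code rather than take it on faith, one option is a small probe: put a file with a known, deliberately planted bug at the start, middle, and end of the same context and see whether the answer mentions it each time. The sketch below is only an illustration. `ask_model` is a hypothetical callable (prompt in, answer out) that you would wire to whatever client you use, and `fake_model` is a stand-in that reads only the first and last 20 characters so the script runs without any API.

```python
from typing import Callable, Dict, List


def build_probe_contexts(filler_files: List[str], seeded_file: str) -> Dict[str, str]:
    """Build three contexts that differ only in where the seeded file sits."""
    middle = len(filler_files) // 2
    orders = {
        "start": [seeded_file] + filler_files,
        "middle": filler_files[:middle] + [seeded_file] + filler_files[middle:],
        "end": filler_files + [seeded_file],
    }
    return {position: "\n\n".join(files) for position, files in orders.items()}


def run_probe(
    contexts: Dict[str, str],
    ask_model: Callable[[str], str],
    marker: str,
) -> Dict[str, bool]:
    """Record, per position, whether the model's answer mentions the marker."""
    return {
        position: marker in ask_model(context)
        for position, context in contexts.items()
    }


def fake_model(context: str) -> str:
    """Hypothetical stand-in for a real model call.

    It echoes only the first and last 20 characters of the context,
    mimicking a reader that skips the middle.
    """
    return context[:20] + context[-20:]


contexts = build_probe_contexts(
    filler_files=["fn a() {}", "fn b() {}", "fn c() {}", "fn d() {}"],
    seeded_file="fn seeded_bug() {}",
)
print(run_probe(contexts, fake_model, marker="seeded_bug"))
```

With the stand-in, only the "middle" position comes back `False`. A real model will be far less clear-cut, so run the probe several times with realistic file sizes before drawing conclusions.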
The Fix: One File at a Time
The solution I landed on from experience: feed files one at a time.
# Instead of this:
"Here are 100 files, find all bugs"
# Do this:
"Here is file X. Find bugs in this file only."
→ repeat for each file
→ aggregate results manually via copy-paste
More tedious? Yes. More accurate? Dramatically.
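If the copy-paste gets old, the same loop can be scripted. This is a minimal sketch, not the workflow I actually ran: `ask_model` is a hypothetical callable (prompt string in, answer string out) that you would connect to your own model client, and the prompt wording is just the one from above.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

PROMPT_TEMPLATE = "Here is file {name}. Find bugs in this file only.\n\n{source}"


def load_sources(paths: Iterable[Path]) -> Dict[str, str]:
    """Read each file into a {path: source text} mapping."""
    return {str(path): path.read_text(encoding="utf-8") for path in paths}


def review_sources(
    sources: Dict[str, str],
    ask_model: Callable[[str], str],
) -> Dict[str, str]:
    """Send one request per file and collect the answers keyed by path."""
    answers = {}
    for name, source in sources.items():
        # One file per prompt: the model never sees the other files.
        prompt = PROMPT_TEMPLATE.format(name=name, source=source)
        answers[name] = ask_model(prompt)
    return answers


def format_report(answers: Dict[str, str]) -> str:
    """Join the per-file answers into one report (the aggregation step)."""
    return "\n\n".join(
        f"{name}:\n{answer.strip()}" for name, answer in answers.items()
    )
```

Usage would look something like `format_report(review_sources(load_sources(Path("src-tauri/src").rglob("*.rs")), ask_model))`, where the directory is an assumption based on the usual Tauri layout. You still have to read and verify every answer yourself.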
For Opus specifically, staying at 5 files or fewer per request kept the quality high. For Gemini, one file at a time, under 300 lines, was the safe zone.
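Those limits are easy to turn into a quick planning step before you send anything. The sketch below assumes you already have a line count per file. The two constants come from my observations above, not from any documented model limit, and the function names are made up for illustration.

```python
from typing import Dict, List

MAX_FILES_PER_REQUEST = 5    # where Opus stayed accurate in my tests
MAX_LINES_SINGLE_FILE = 300  # rough single-file ceiling I saw for Gemini


def plan_batches(
    line_counts: Dict[str, int],
    max_files: int = MAX_FILES_PER_REQUEST,
) -> List[List[str]]:
    """Group file names into batches of at most max_files each."""
    names = sorted(line_counts)
    return [names[i:i + max_files] for i in range(0, len(names), max_files)]


def oversized_files(
    line_counts: Dict[str, int],
    max_lines: int = MAX_LINES_SINGLE_FILE,
) -> List[str]:
    """List files longer than max_lines, which may need extra care."""
    return sorted(name for name, count in line_counts.items() if count > max_lines)


# Hypothetical line counts, for illustration only.
counts = {"main.rs": 120, "commands.rs": 450, "state.rs": 80}
print(plan_batches(counts, max_files=2))
print(oversized_files(counts))
```

Treat both numbers as starting points and adjust them to what you see with your own code and prompts.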
Practical Takeaways
| Model | Sweet spot | Starts breaking down |
|---|---|---|
| Claude Opus 4.7 | 1–5 files | 10+ files, confirmed at 15 |
| Gemini 3.5 Flash | 1 file, <300 lines (prompt quality matters) | Multiple files |
- Don't trust "no bugs found" on large inputs — the model may have simply stopped paying attention
- Thinking mode helps with reasoning, not with reading everything accurately
- Context window size is a headline number. In my tests, the effective context was much smaller.
- When in doubt, split the task smaller
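"Split the task smaller" can also apply inside a single long file. The sketch below is one possible approach, not something I benchmarked: it cuts a source string into chunks of at most 300 lines, preferring to cut at a blank line once a chunk is at least half full. The function name and the half-full rule are assumptions made for illustration.

```python
from typing import List


def split_into_chunks(source: str, max_lines: int = 300) -> List[str]:
    """Split source text into chunks of at most max_lines lines each."""
    chunks: List[str] = []
    current: List[str] = []
    for line in source.splitlines():
        current.append(line)
        # Soft break: a blank line once the chunk is at least half full.
        at_soft_break = not line.strip() and len(current) >= max_lines // 2
        # Hard break: the chunk has reached the line limit.
        if at_soft_break or len(current) >= max_lines:
            chunks.append("\n".join(current))
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks
```

A line-based cut knows nothing about Rust syntax, so it can slice a function in half and hide exactly the bug you are looking for. Splitting by file or by function is safer whenever that is practical.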
Closing
Large context windows are impressive on paper. In practice, the effective range where you can trust the output is much smaller than advertised.
The fix isn't a better model — it's a better workflow. One file at a time, aggregated manually, beats dumping everything in and hoping for the best.
Progress updates on X: @hiyoyok