I spent three months building "ReviewBot," an autonomous agent that reviews pull requests.
My goal was simple. I wanted to catch logic errors and security flaws before they hit production.
I thought using the latest LLMs from early 2026 would make this easy.
I was wrong.
The project failed to launch publicly. It currently sits as an internal tool for my team of four developers.
Here is exactly where I went wrong. I hope these lessons save you time and money.
Mistake 1: Ignoring Context Window Costs
In January 2026, context windows are massive. Most models handle 1 million tokens easily.
I assumed this meant I could dump entire repositories into the prompt.
Why not? More context means better answers, right?
Wrong.
I built a naive indexer that sent every file in our src/ directory to the model for each PR.
Our average repository size is 45,000 lines of code.
Each review cost me $4.20 in API fees.
We process about 20 PRs a day. That is $84 daily. Or $2,520 a month.
For a side project with zero revenue, this was unsustainable.
I looked at the logs in February. 90% of the tokens I sent came from irrelevant CSS files and vendor libraries.
The model got confused by the noise. It started hallucinating imports that didn't exist because it was trying to connect unrelated dots.
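For reference, the naive indexer was essentially this (a minimal sketch, not my actual code; llm is the same model wrapper used throughout this post):

import os

def naive_review(pr_diff, repo_root="src"):
    # The mistake: ship the entire source tree with every review request
    context = []
    for dirpath, _, filenames in os.walk(repo_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                context.append(f.read())
    # Roughly 150k tokens of mostly CSS and vendor code went along for the ride
    return llm.generate(f"Repository:\n{''.join(context)}\n\nReview this diff: {pr_diff}")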
The Fix
I switched to a hybrid retrieval approach.
I only send files that changed in the PR, plus their direct dependencies.
I use a local vector store to find related files if the change touches a core module.
This dropped my average token usage per request from 150k to 12k.
Costs fell to $0.35 per review.
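In sketch form, the retrieval step now looks like this (resolve_imports, read_file, and vector_store are hypothetical helpers standing in for my dependency parser and local embeddings index):

def build_context(changed_files):
    context_files = set(changed_files)
    # Always include direct dependencies of the changed files
    for path in changed_files:
        context_files.update(resolve_imports(path))
    # If a core module is touched, pull in semantically related files too
    if any(path.startswith("src/core/") for path in changed_files):
        for path in changed_files:
            for hit in vector_store.search(read_file(path), top_k=5):
                context_files.add(hit.path)
    return "\n\n".join(read_file(p) for p in sorted(context_files))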
Mistake 2: Trusting the Model's Self-Correction
Early in development, I noticed the agent would sometimes suggest broken code.
It would miss a semicolon or use a deprecated API method.
I thought, "No problem. I'll just ask it to check its own work."
I added a second step to the pipeline: Self-Reflection.
The code looked like this:
# 'llm' is a thin wrapper around the model API client
def review_code(pr_diff):
    initial_review = llm.generate(f"Review this code: {pr_diff}")
    # The mistake: assuming the LLM catches its own errors
    correction = llm.generate(
        f"Check your previous review for errors. "
        f"Did you miss any edge cases? "
        f"Original review: {initial_review}"
    )
    return correction
This doubled my latency. Each review took 45 seconds instead of 20.
Worse, it didn't fix the accuracy issues.
The model often doubled down on its mistakes. If it missed a null pointer exception initially, it usually missed it in the reflection step too.
LLMs are bad at verifying their own logical output without external tools.
By March, I realized I was paying for extra tokens to get the same wrong answer slower.
The Fix
I replaced self-reflection with static analysis tools.
Now, the agent runs eslint, pylint, and SonarQube locally first.
It only sends the output of those tools to the LLM, along with the code.
The prompt changed to: "Explain these linting errors and suggest fixes."
This shifted the burden from probabilistic guessing to deterministic checking.
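Here is the shape of the new step, showing just the pylint leg (eslint and SonarQube run the same way through their own CLIs; this is a sketch, not the full tool config):

import subprocess

def lint(path):
    # Deterministic check first; the model never has to guess at syntax
    result = subprocess.run(
        ["pylint", "--output-format=text", path],
        capture_output=True, text=True,
    )
    return result.stdout

def review_with_linters(pr_diff, changed_files):
    lint_report = "\n".join(lint(p) for p in changed_files)
    return llm.generate(
        f"Explain these linting errors and suggest fixes.\n"
        f"Linter output:\n{lint_report}\n\nCode:\n{pr_diff}"
    )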
Accuracy went up by 40%. Latency dropped by half.
Mistake 3: Over-Engineering the Agent Loop
I read too many blog posts about "Agentic Workflows."
I wanted ReviewBot to be smart. I wanted it to browse documentation, search Stack Overflow, and run tests.
I built a complex state machine using LangGraph.
It had five nodes (wiring sketched after the list):
- Parse Diff
- Search Docs
- Generate Review
- Run Tests
- Finalize
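Roughly, the wiring looked like this (reconstructed from memory with stub nodes, so treat it as a sketch of the shape, not the original code):

from typing import TypedDict
from langgraph.graph import StateGraph, END

class ReviewState(TypedDict):
    diff: str
    docs: str
    review: str
    test_output: str

def parse_diff(state): ...
def search_docs(state): ...      # failed often on rate limits
def generate_review(state): ...
def run_tests(state): ...        # hung on timeouts
def finalize(state): ...

graph = StateGraph(ReviewState)
for name, node in [("parse_diff", parse_diff), ("search_docs", search_docs),
                   ("generate_review", generate_review), ("run_tests", run_tests),
                   ("finalize", finalize)]:
    graph.add_node(name, node)
graph.set_entry_point("parse_diff")
graph.add_edge("parse_diff", "search_docs")
graph.add_edge("search_docs", "generate_review")
graph.add_edge("generate_review", "run_tests")
graph.add_edge("run_tests", "finalize")
graph.add_edge("finalize", END)
app = graph.compile()

Notice the edges form a straight line anyway. The graph machinery added failure modes without adding any real branching.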
It looked impressive on my architecture diagram.
In practice, it was fragile.
If the documentation search failed (which happened often due to rate limits), the whole graph crashed.
If the test runner timed out, the agent hung indefinitely.
I spent weeks debugging race conditions in my own orchestration layer.
Users don't care about your agent's internal state. They care about getting a comment on their PR within 2 minutes.
My complex agent averaged 3.5 minutes per review.
Developers started ignoring the bot because it was too slow.
The Fix
I deleted 80% of the code.
I moved to a linear, synchronous pipeline (sketched below):
- Get diff.
- Run linters.
- Send to LLM with strict system instructions.
- Post comment.
No loops. No external searches. No dynamic tool calling.
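The whole pipeline now fits in one function (a sketch; get_diff, run_linters, post_comment, and SYSTEM_INSTRUCTIONS are illustrative wrappers around git, the linter CLIs, the GitHub API, and my prompt):

def handle_pr(pr_number):
    # Step 1: fetch the diff for this PR
    diff = get_diff(pr_number)
    # Step 2: deterministic checks before any model call
    lint_report = run_linters(diff.changed_files)
    # Step 3: one model call, strict system prompt, no tool calls
    review = llm.generate(
        f"{SYSTEM_INSTRUCTIONS}\n\nLinter output:\n{lint_report}\n\nDiff:\n{diff.text}"
    )
    # Step 4: post the comment and stop; nothing to loop back into
    post_comment(pr_number, review)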
It's boring. It's dumb. But it works every time.
Speed is a feature. Reliability is a feature. Complexity is a bug.
Mistake 4: Poor Feedback Loops
I launched the beta in April.
I asked my team to use it. They did.
But I didn't track what they did with the suggestions.
Did they accept the advice? Did they reject it? Did they edit the comment?
I assumed that if the bot posted a comment, it was helpful.
In May, I finally added analytics.
The
💡 Further Reading: I experiment with AI automation and open-source tools. Find more guides at Pi Stack.