I spent three months building "ReviewBot," an automated code reviewer for our internal monorepo. The goal was simple. Catch bugs before they hit production. Reduce the cognitive load on senior engineers during pull requests.
It sounded great on paper. In practice, it was a disaster for the first six weeks.
We are in 2026. Large Language Models are cheap and fast. You might think integrating one into a CI/CD pipeline is trivial. It is not. The complexity shifts from model selection to context management and latency handling.
Here are the five specific mistakes I made. I hope you can skip the pain I went through.
Mistake 1: Ignoring Context Window Costs
My first version sent the full contents of every touched file to the model. I thought more context meant better accuracy. I was wrong. It just meant higher bills and slower responses.
We use a mid-tier proprietary model that charges per token. Our average pull request involves touching 15 files. Each file has about 300 lines of code. Sending all that data added up quickly.
By week two, our API bill jumped from $50 to $420. That is unsustainable for an internal tool.
I realized we only needed the diff and the immediate surrounding functions. The rest is noise. The model does not need to know the contents of utils/dateFormatter.js when you are changing components/LoginButton.tsx.
I switched to a tree-sitter-based parser. It extracts only the relevant abstract syntax tree (AST) nodes: the changed lines plus the functions that enclose them. This reduced our token usage by roughly 85%.
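To make the idea concrete, here is a rough sketch of trimming context down to the neighborhood of each change. It is not the tree-sitter implementation: it uses a fixed window of lines as a stand-in for "the enclosing function", and the `extractContext` name, the `Snippet` shape, and the `padding` parameter are all assumptions for illustration.

```typescript
// Hypothetical helper: a line-window stand-in for AST-based extraction.
interface Snippet {
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  text: string;
}

function extractContext(
  source: string,
  changedLines: number[],
  padding = 3
): Snippet[] {
  const lines = source.split('\n');
  const sorted = [...changedLines].sort((a, b) => a - b);
  const ranges: Array<[number, number]> = [];

  for (const line of sorted) {
    // Ignore line numbers that fall outside the file.
    if (line < 1 || line > lines.length) continue;

    const start = Math.max(1, line - padding);
    const end = Math.min(lines.length, line + padding);
    const last = ranges[ranges.length - 1];

    if (last && start <= last[1] + 1) {
      // Overlapping or adjacent window: extend the previous range.
      last[1] = Math.max(last[1], end);
    } else {
      ranges.push([start, end]);
    }
  }

  return ranges.map(([startLine, endLine]) => ({
    startLine,
    endLine,
    text: lines.slice(startLine - 1, endLine).join('\n'),
  }));
}
```

A parser-backed version would replace the fixed window with the start and end of the syntax node that contains each changed line, which is what keeps the snippet meaningful when a function is longer than the window.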
| Metric | V1 (Full File) | V2 (AST Extract) |
|---|---|---|
| Avg Tokens per PR | 12,500 | 1,800 |
| Cost per PR | $0.04 | $0.005 |
| Latency (p95) | 4.2s | 1.1s |
The cost drop was immediate. The speed improvement was even better. Developers stopped complaining about waiting for checks to pass.
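For anyone who wants to check the percentages, the arithmetic is a one-liner. The snippet below uses only the token and cost figures from the table; `reductionPercent` is just an illustrative helper name.

```typescript
// Percentage reduction between a "before" and an "after" measurement.
function reductionPercent(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

const tokenReduction = reductionPercent(12_500, 1_800);
const costReduction = reductionPercent(0.04, 0.005);

console.log(tokenReduction.toFixed(1)); // 85.6
console.log(costReduction.toFixed(1));  // 87.5
```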
Mistake 2: Treating LLM Output as Deterministic
I wrote my initial assertion tests assuming the AI would always return valid JSON. I asked it to output a structured report with severity levels and suggested fixes.
For the first 50 runs, it worked perfectly. Then, on a Tuesday morning, it broke the entire pipeline.
The model decided to be chatty. It added a preamble like "Here is your code review:" before the JSON block. My parser crashed. The CI check failed. Developers were blocked from merging critical hotfixes.
I had to manually restart the pipeline for three different teams. It was embarrassing.
LLMs are probabilistic. They do not guarantee format consistency unless you force it. I stopped relying on raw text parsing. Instead, I implemented a retry mechanism with strict schema validation using Zod.
```typescript
import { z } from 'zod';

// Shape the model's review must match before the pipeline accepts it.
const ReviewSchema = z.object({
  issues: z.array(
    z.object({
      line: z.number(),
      severity: z.enum(['low', 'medium', 'high']),
      message: z.string(),
      suggestion: z.string().optional(),
    })
  ),
  summary: z.string(),
});

async function parseReview(rawOutput: string) {
  try {
    // Strip markdown code fences if present (`{3} matches three backticks)
    const cleanJson = rawOutput.replace(/`{3}json\n?|\n?`{3}/g, '');
    const parsed = JSON.parse(cleanJson);
    // Throws a ZodError if the parsed object does not match the schema
    return ReviewSchema.parse(parsed);
  } catch (error) {
    console.error('Failed to parse AI output', error);
    throw new Error('Invalid review format');
  }
}
```
This change did not fix the model's tendency to ramble. It just handled the failure gracefully. We now log the bad output for fine-tuning later. The pipeline keeps moving.
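The parser above covers validation, but not the retry step or the chatty preamble. Here is a minimal sketch of how those two pieces might fit together. Everything in it is an assumption for illustration: `callModel` is a hypothetical function that returns the model's raw text, `extractJsonObject` and `reviewWithRetry` are made-up names, and the validator is passed in so something like `ReviewSchema.parse` could be plugged in.

```typescript
// Pull the outermost {...} span out of a possibly chatty response,
// e.g. 'Here is your code review: {"summary": "ok"}'.
function extractJsonObject(raw: string): string {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model output');
  }
  return raw.slice(start, end + 1);
}

// Hypothetical retry wrapper: `callModel` and `validate` come from the caller.
async function reviewWithRetry<T>(
  callModel: () => Promise<string>,
  validate: (value: unknown) => T, // e.g. (v) => ReviewSchema.parse(v)
  maxAttempts = 3
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel();
    try {
      return validate(JSON.parse(extractJsonObject(raw)));
    } catch (error) {
      // Keep the bad output around so it can be inspected later.
      lastError = error;
      console.error(`Attempt ${attempt} produced invalid output`, raw);
    }
  }
  throw new Error(
    `Invalid review format after ${maxAttempts} attempts: ${String(lastError)}`
  );
}
```

The brace-matching here is deliberately naive: trailing prose that contains a `}` would still break it. When every attempt fails, the caller can decide to post a neutral "review unavailable" status instead of failing the CI check.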
Mistake 3: Over-Engineering the Agent Loop
I got excited about agentic workflows. I built a system where the bot could "think" step-by-step. It would analyze the code, search documentation, then propose a fix.
This added 15 seconds to every comment. In a fast-moving team, 15 seconds feels like an hour.
Developers do not want a philosopher. They want a linter with opinions. Most of the time, the issues are simple. Unused variables. Type mismatches. Missing error handling.
These do not require a multi-step reasoning loop. They require pattern matching.
I stripped out the agent logic for 90% of cases. Now, the system uses a lightweight model for basic syntax and style checks. It only triggers the heavy reasoning model if it detects complex logic changes or security vulnerabilities.
This hybrid approach cut our average response time from 18 seconds to 3 seconds. The quality of feedback did not drop. In fact, it improved because the bot was less likely to hallucinate complex solutions for simple problems.
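The routing decision can be sketched as a cheap heuristic that runs before any model call. This is an assumption about how such a router could look, not ReviewBot's actual rules: the `ModelTier` names, the `SECURITY_PATTERNS` list, and the 200-line threshold (used here as a crude proxy for "complex logic changes") are all hypothetical.

```typescript
type ModelTier = 'light' | 'heavy';

interface DiffSummary {
  path: string;
  addedLines: string[];
}

// Hypothetical signals that a change deserves deeper reasoning.
const SECURITY_PATTERNS: RegExp[] = [
  /\beval\s*\(/,
  /password|secret|token/i,
  /child_process|exec\s*\(/,
];

// Arbitrary cutoff: total added lines across the PR.
const LARGE_CHANGE_THRESHOLD = 200;

function chooseModelTier(diffs: DiffSummary[]): ModelTier {
  let addedCount = 0;
  let touchesSecurity = false;

  for (const diff of diffs) {
    addedCount += diff.addedLines.length;
    for (const line of diff.addedLines) {
      if (SECURITY_PATTERNS.some((pattern) => pattern.test(line))) {
        touchesSecurity = true;
      }
    }
  }

  const isLarge = addedCount > LARGE_CHANGE_THRESHOLD;
  return touchesSecurity || isLarge ? 'heavy' : 'light';
}
```

Because the check is plain string matching, it adds almost nothing to the response time, and the patterns and threshold can be tuned without touching the model calls themselves.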
Mistake 4: Neglecting User Feedback Loops
I assumed developers would love the tool. I did not build a way for them to tell me when it was wrong.
For the first month, I had no idea if the suggestions were useful. I only saw usage metrics. I saw that people were reading the comments. I did not know if they were acting on them.
Then I noticed a pattern. Senior engineers were ignoring 80% of the high-severity warnings. Why? Because the bot was flagging intentional architectural decisions as errors.
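A feedback loop can start very small: record whether each bot comment was accepted or dismissed, then look at the dismissal rate per severity. The sketch below is illustrative only. The `CommentFeedback` shape and `dismissalRateBySeverity` are hypothetical names, and how the outcome gets captured (a reaction, a button, a reply) is left open.

```typescript
type Severity = 'low' | 'medium' | 'high';

interface CommentFeedback {
  commentId: string;
  severity: Severity;
  outcome: 'accepted' | 'dismissed';
}

// Fraction of comments dismissed at each severity level (0 when there is no data).
function dismissalRateBySeverity(
  feedback: CommentFeedback[]
): Record<Severity, number> {
  const totals: Record<Severity, number> = { low: 0, medium: 0, high: 0 };
  const dismissed: Record<Severity, number> = { low: 0, medium: 0, high: 0 };

  for (const item of feedback) {
    totals[item.severity] += 1;
    if (item.outcome === 'dismissed') dismissed[item.severity] += 1;
  }

  const rate = (s: Severity): number =>
    totals[s] === 0 ? 0 : dismissed[s] / totals[s];

  return { low: rate('low'), medium: rate('medium'), high: rate('high') };
}
```

A dismissal rate near 0.8 on `high` warnings, like the pattern described above, would show up in this report long before anyone has to notice it by hand.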