In SRE, an error budget is the amount of unreliability a service is allowed to accrue before the team stops shipping features and works on reliability instead. It turns "how reliable should we be?" from a vague aspiration into a concrete number.
You can apply the same idea to AI-generated code.
## The Problem
Most developers have a binary relationship with AI output: either they trust it completely (ship without review) or they don't trust it at all (rewrite everything manually). Neither is productive.
What you need is a threshold — a clear line that says "this much failure is acceptable, and beyond this we change our process."
## The Error Budget for Prompts
Define three numbers for any AI-assisted workflow:
```markdown
## Error Budget: Code Generation

- **Acceptable failure rate:** 20% of generated functions need manual fixes
- **Warning threshold:** 30% need fixes → add more constraints to prompt
- **Red line:** 50% need fixes → stop using AI for this task, write manually

Tracking period: 1 week (rolling)
```
Then actually track it.
## How I Track It
Simple CSV. One row per AI-generated code block I review:
```
date,task,model,accepted_as_is,minor_fix,major_rewrite,rejected
2026-03-28,auth middleware,gpt-4o,1,0,0,0
2026-03-28,db migration,gpt-4o,0,1,0,0
2026-03-29,api endpoint,claude-3.5,1,0,0,0
2026-03-29,regex parser,gpt-4o,0,0,1,0
2026-03-29,unit tests,claude-3.5,0,1,0,0
```
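Appending rows by hand is exactly the kind of chore that gets skipped. A tiny helper keeps the friction near zero — this is a sketch, not part of the original workflow; the `log_ai` function and the `AI_TRACKING_FILE` variable are names I'm assuming:

```bash
#!/bin/bash
# log-ai.sh — append one review outcome to the tracking CSV (hypothetical helper)
# Usage: log_ai <task> <model> <outcome>
#   where outcome is one of: accepted | minor | major | rejected
log_ai() {
    local task="$1" model="$2" outcome="$3"
    local file="${AI_TRACKING_FILE:-ai-tracking.csv}"
    # Write the header once, on first use
    [ -f "$file" ] || echo "date,task,model,accepted_as_is,minor_fix,major_rewrite,rejected" > "$file"
    local cols
    case "$outcome" in
        accepted) cols="1,0,0,0" ;;
        minor)    cols="0,1,0,0" ;;
        major)    cols="0,0,1,0" ;;
        rejected) cols="0,0,0,1" ;;
        *) echo "unknown outcome: $outcome" >&2; return 1 ;;
    esac
    echo "$(date +%F),$task,$model,$cols" >> "$file"
}
```

Then logging a review is one line: `log_ai "auth middleware" gpt-4o accepted`.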
Weekly summary script:
```bash
#!/bin/bash
# error-budget.sh — Weekly AI code quality summary
FILE="${1:-ai-tracking.csv}"

TOTAL=$(tail -n +2 "$FILE" | wc -l)
# Guard against an empty log: bc errors on division by zero
if [ "$TOTAL" -eq 0 ]; then
    echo "No generations logged in $FILE yet"
    exit 1
fi

ACCEPTED=$(awk -F',' '{sum+=$4} END {print sum+0}' <(tail -n +2 "$FILE"))
MINOR=$(awk -F',' '{sum+=$5} END {print sum+0}' <(tail -n +2 "$FILE"))
MAJOR=$(awk -F',' '{sum+=$6} END {print sum+0}' <(tail -n +2 "$FILE"))
REJECTED=$(awk -F',' '{sum+=$7} END {print sum+0}' <(tail -n +2 "$FILE"))
FAIL_RATE=$(echo "scale=1; ($MAJOR + $REJECTED) * 100 / $TOTAL" | bc)

echo "=== AI Error Budget Report ==="
echo "Total generations: $TOTAL"
echo "Accepted as-is: $ACCEPTED ($(echo "scale=1; $ACCEPTED*100/$TOTAL" | bc)%)"
echo "Minor fixes: $MINOR ($(echo "scale=1; $MINOR*100/$TOTAL" | bc)%)"
echo "Major rewrites: $MAJOR ($(echo "scale=1; $MAJOR*100/$TOTAL" | bc)%)"
echo "Rejected: $REJECTED ($(echo "scale=1; $REJECTED*100/$TOTAL" | bc)%)"
echo ""
echo "Failure rate: ${FAIL_RATE}% (budget: 20%)"

if (( $(echo "$FAIL_RATE > 50" | bc -l) )); then
    echo "🔴 RED LINE — Stop using AI for these tasks"
elif (( $(echo "$FAIL_RATE > 30" | bc -l) )); then
    echo "🟡 WARNING — Tighten prompt constraints"
else
    echo "🟢 WITHIN BUDGET"
fi
```
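The same report can also be produced in a single pass with awk, which avoids re-reading the file for every metric. A sketch under the same column layout — the `summarize` function name is mine:

```bash
#!/bin/bash
# error-budget-awk.sh — one-pass variant of the weekly summary (sketch)
summarize() {
    local file="${1:-ai-tracking.csv}"
    tail -n +2 "$file" | awk -F',' '
        { total++; acc += $4; minor += $5; major += $6; rej += $7 }
        END {
            if (total == 0) { print "No data"; exit 1 }
            printf "Total generations: %d\n", total
            printf "Accepted as-is: %d (%.1f%%)\n", acc, acc * 100 / total
            printf "Minor fixes: %d (%.1f%%)\n", minor, minor * 100 / total
            printf "Major rewrites: %d (%.1f%%)\n", major, major * 100 / total
            printf "Rejected: %d (%.1f%%)\n", rej, rej * 100 / total
            fail = (major + rej) * 100 / total
            printf "Failure rate: %.1f%% (budget: 20%%)\n", fail
            if (fail > 50)      print "RED LINE"
            else if (fail > 30) print "WARNING"
            else                print "WITHIN BUDGET"
        }'
}
```

On the five sample rows above this reports a 20.0% failure rate — one major rewrite out of five generations, within budget.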
## What the Data Tells You
After two weeks of tracking, patterns emerge:
**Per-task insights:**
- Unit tests: 90% accepted as-is → great AI use case
- Regex: 60% rejected → write these manually
- API endpoints: 75% need minor fixes → add response type constraints to prompt

**Per-model insights:**
- Model A: better at boilerplate, worse at business logic
- Model B: better at tests, worse at frontend

**Per-prompt insights:**
- Prompts with explicit type signatures: 85% acceptance
- Prompts without examples: 40% acceptance
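The per-task breakdown comes straight out of the same CSV: group by column 2 and divide accepted rows by total rows. A sketch — the `per_task` function name is mine:

```bash
#!/bin/bash
# per-task.sh — acceptance rate grouped by task (sketch; same CSV layout)
per_task() {
    local file="${1:-ai-tracking.csv}"
    tail -n +2 "$file" | awk -F',' '
        { total[$2]++; acc[$2] += $4 }   # column 2 is the task name
        END {
            for (t in total)
                printf "%s: %.0f%% accepted as-is (%d/%d)\n", t, acc[t] * 100 / total[t], acc[t], total[t]
        }' | sort
}
```

Running `per_task ai-tracking.csv` prints one line per task; the same two-line awk pattern works for per-model grouping by switching `$2` to `$3`.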
## Adjusting the Budget
Your error budget should tighten over time as you improve your prompts:

- Month 1: 30% failure rate acceptable (you're learning)
- Month 3: 20% (prompts are tuned)
- Month 6: 10% (if you're not here, something is wrong)
If you consistently hit your budget, raise the bar. If you consistently miss it, either improve the prompt or accept that AI isn't the right tool for that specific task.
## The Key Insight
An error budget forces you to be honest about AI reliability in your specific context. Not what the benchmarks say. Not what Twitter says. What your data says about your prompts for your codebase.
That honesty is worth more than any prompt engineering trick.
Start tracking today. One CSV file, one weekly bash script, one honest look at the numbers.