In SRE, an error budget is the amount of unreliability a service is allowed to accrue before the team stops shipping features and works on reliability instead. It turns "how reliable should we be?" from a vague aspiration into a concrete number.
You can apply the same idea to AI-generated code.
## The Problem
Most developers have a binary relationship with AI output: either they trust it completely (ship without review) or they don't trust it at all (rewrite everything manually). Neither is productive.
What you need is a threshold — a clear line that says "this much failure is acceptable, and beyond this we change our process."
## The Error Budget for Prompts
Define three numbers for any AI-assisted workflow:
```markdown
## Error Budget: Code Generation

- **Acceptable failure rate:** 20% of generated functions need manual fixes
- **Warning threshold:** 30% need fixes → add more constraints to prompt
- **Red line:** 50% need fixes → stop using AI for this task, write manually

Tracking period: 1 week (rolling)
```
Then actually track it.
## How I Track It
Simple CSV. One row per AI-generated code block I review:
```
date,task,model,accepted_as_is,minor_fix,major_rewrite,rejected
2026-03-28,auth middleware,gpt-4o,1,0,0,0
2026-03-28,db migration,gpt-4o,0,1,0,0
2026-03-29,api endpoint,claude-3.5,1,0,0,0
2026-03-29,regex parser,gpt-4o,0,0,1,0
2026-03-29,unit tests,claude-3.5,0,1,0,0
```
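Appending rows by hand is exactly the kind of chore that gets skipped. A tiny helper keeps the friction near zero — this is a sketch, not part of the original workflow; the `log_ai` function and the `AI_TRACKING_FILE` variable are names I'm assuming:

```bash
#!/bin/bash
# log-ai.sh — append one review outcome to the tracking CSV (hypothetical helper)
# Usage: log_ai <task> <model> <outcome>
#   where outcome is one of: accepted | minor | major | rejected
log_ai() {
    local task="$1" model="$2" outcome="$3"
    local file="${AI_TRACKING_FILE:-ai-tracking.csv}"
    # Write the header once, on first use
    [ -f "$file" ] || echo "date,task,model,accepted_as_is,minor_fix,major_rewrite,rejected" > "$file"
    local cols
    case "$outcome" in
        accepted) cols="1,0,0,0" ;;
        minor)    cols="0,1,0,0" ;;
        major)    cols="0,0,1,0" ;;
        rejected) cols="0,0,0,1" ;;
        *) echo "unknown outcome: $outcome" >&2; return 1 ;;
    esac
    echo "$(date +%F),$task,$model,$cols" >> "$file"
}
```

Then logging a review is one line: `log_ai "auth middleware" gpt-4o accepted`.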
Weekly summary script:
```bash
#!/bin/bash
# error-budget.sh — Weekly AI code quality summary
FILE="${1:-ai-tracking.csv}"

TOTAL=$(tail -n +2 "$FILE" | wc -l)
# Guard against an empty log: bc errors on division by zero
if [ "$TOTAL" -eq 0 ]; then
    echo "No generations logged in $FILE yet"
    exit 1
fi

ACCEPTED=$(awk -F',' '{sum+=$4} END {print sum+0}' <(tail -n +2 "$FILE"))
MINOR=$(awk -F',' '{sum+=$5} END {print sum+0}' <(tail -n +2 "$FILE"))
MAJOR=$(awk -F',' '{sum+=$6} END {print sum+0}' <(tail -n +2 "$FILE"))
REJECTED=$(awk -F',' '{sum+=$7} END {print sum+0}' <(tail -n +2 "$FILE"))
FAIL_RATE=$(echo "scale=1; ($MAJOR + $REJECTED) * 100 / $TOTAL" | bc)

echo "=== AI Error Budget Report ==="
echo "Total generations: $TOTAL"
echo "Accepted as-is: $ACCEPTED ($(echo "scale=1; $ACCEPTED*100/$TOTAL" | bc)%)"
echo "Minor fixes: $MINOR ($(echo "scale=1; $MINOR*100/$TOTAL" | bc)%)"
echo "Major rewrites: $MAJOR ($(echo "scale=1; $MAJOR*100/$TOTAL" | bc)%)"
echo "Rejected: $REJECTED ($(echo "scale=1; $REJECTED*100/$TOTAL" | bc)%)"
echo ""
echo "Failure rate: ${FAIL_RATE}% (budget: 20%)"

if (( $(echo "$FAIL_RATE > 50" | bc -l) )); then
    echo "🔴 RED LINE — Stop using AI for these tasks"
elif (( $(echo "$FAIL_RATE > 30" | bc -l) )); then
    echo "🟡 WARNING — Tighten prompt constraints"
else
    echo "🟢 WITHIN BUDGET"
fi
```
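The same report can also be produced in a single pass with awk, which avoids re-reading the file for every metric. A sketch under the same column layout — the `summarize` function name is mine:

```bash
#!/bin/bash
# error-budget-awk.sh — one-pass variant of the weekly summary (sketch)
summarize() {
    local file="${1:-ai-tracking.csv}"
    tail -n +2 "$file" | awk -F',' '
        { total++; acc += $4; minor += $5; major += $6; rej += $7 }
        END {
            if (total == 0) { print "No data"; exit 1 }
            printf "Total generations: %d\n", total
            printf "Accepted as-is: %d (%.1f%%)\n", acc, acc * 100 / total
            printf "Minor fixes: %d (%.1f%%)\n", minor, minor * 100 / total
            printf "Major rewrites: %d (%.1f%%)\n", major, major * 100 / total
            printf "Rejected: %d (%.1f%%)\n", rej, rej * 100 / total
            fail = (major + rej) * 100 / total
            printf "Failure rate: %.1f%% (budget: 20%%)\n", fail
            if (fail > 50)      print "RED LINE"
            else if (fail > 30) print "WARNING"
            else                print "WITHIN BUDGET"
        }'
}
```

On the five sample rows above this reports a 20.0% failure rate — one major rewrite out of five generations, within budget.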
## What the Data Tells You
After two weeks of tracking, patterns emerge:
**Per-task insights:**
- Unit tests: 90% accepted as-is → great AI use case
- Regex: 60% rejected → write these manually
- API endpoints: 75% need minor fixes → add response type constraints to prompt

**Per-model insights:**
- Model A: better at boilerplate, worse at business logic
- Model B: better at tests, worse at frontend

**Per-prompt insights:**
- Prompts with explicit type signatures: 85% acceptance
- Prompts without examples: 40% acceptance
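The per-task breakdown comes straight out of the same CSV: group by column 2 and divide accepted rows by total rows. A sketch — the `per_task` function name is mine:

```bash
#!/bin/bash
# per-task.sh — acceptance rate grouped by task (sketch; same CSV layout)
per_task() {
    local file="${1:-ai-tracking.csv}"
    tail -n +2 "$file" | awk -F',' '
        { total[$2]++; acc[$2] += $4 }   # column 2 is the task name
        END {
            for (t in total)
                printf "%s: %.0f%% accepted as-is (%d/%d)\n", t, acc[t] * 100 / total[t], acc[t], total[t]
        }' | sort
}
```

Running `per_task ai-tracking.csv` prints one line per task; the same two-line awk pattern works for per-model grouping by switching `$2` to `$3`.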
## Adjusting the Budget
Your error budget should tighten over time as you improve your prompts:

- Month 1: 30% failure rate acceptable (you're learning)
- Month 3: 20% (prompts are tuned)
- Month 6: 10% (if you're not here, something is wrong)
If you consistently hit your budget, raise the bar. If you consistently miss it, either improve the prompt or accept that AI isn't the right tool for that specific task.
## The Key Insight
An error budget forces you to be honest about AI reliability in your specific context. Not what the benchmarks say. Not what Twitter says. What your data says about your prompts for your codebase.
That honesty is worth more than any prompt engineering trick.
Start tracking today. One CSV file, one weekly bash script, one honest look at the numbers.