A model is over-editing if its output is functionally correct but structurally diverges from the original code more than the minimal fix requires. Left unconstrained, extended reasoning gives models more room to 'improve' code that doesn't need improving.
GPT-5.4 averages a normalized Levenshtein distance of 0.395 per edit; Claude Opus 4.6 averages 0.060. That is roughly 6.5x more divergence for the same class of fix, averaged across the benchmark, and this post treats it as a rough proxy for 6.5x more output tokens. Pass@1 correctness is similar (0.723–0.912 across models), so the over-editing is paid waste, not paid capability.
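For readers who have not worked with the metric, here is a minimal sketch of a normalized Levenshtein distance. It assumes character-level edits and normalization by the longer string's length; the benchmark behind the numbers above may tokenize or normalize differently, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(original: str, edited: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(original), len(edited))
    if longest == 0:
        return 0.0
    return levenshtein(original, edited) / longest
```

A score of 0.0 means the edit left the code untouched, and 1.0 means nothing of the original survived.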
What does 6.5x look like on a bill? A 50-engineer org doing 800 agent edits per engineer per month generates 40k edits/mo. At an average of 500 output tokens per minimal fix and $15/M output tokens (Opus 4.7), that is 20M tokens, or $300/mo. At 3,250 output tokens per over-edited fix (6.5 × 500), it is 130M tokens, or $1,950/mo. The delta is $1,650/mo per 40k edits: pure output-token waste with no correctness upside. Scale to your actual traffic.
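To scale the estimate to your own traffic, the arithmetic above can be written as a small sketch. The inputs are the figures from this post; the function and variable names are illustrative, and you should swap in your own volumes and prices.

```python
def monthly_output_cost(edits_per_month: int,
                        tokens_per_edit: int,
                        usd_per_million_tokens: float) -> float:
    """Output-token spend per month, in USD."""
    return edits_per_month * tokens_per_edit * usd_per_million_tokens / 1_000_000


# Figures from the paragraph above.
engineers = 50
edits_per_engineer = 800
edits = engineers * edits_per_engineer  # 40,000 edits per month

minimal_cost = monthly_output_cost(edits, 500, 15.0)
over_edited_cost = monthly_output_cost(edits, 3_250, 15.0)
waste = over_edited_cost - minimal_cost

print(f"minimal: ${minimal_cost:,.0f}/mo")
print(f"over-edited: ${over_edited_cost:,.0f}/mo")
print(f"waste: ${waste:,.0f}/mo")
```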
Why 'just pay for more reasoning' isn't the answer: reasoning models got worse, not better, at minimal editing when given more reasoning budget. You can't fix over-editing by paying more; you fix it by measuring the ratio and routing around it.
The metric CFOs actually need is over-edit ratio per agent: over_edit_ratio = output_tokens / minimum_required_tokens_to_achieve_green_tests. The infrastructure to compute it: log the full diff of every agent edit, run patch-min on the diff offline, and take the ratio of the logged diff's size to the minimized diff's size as your over-edit score.
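The per-agent aggregation can be sketched as follows. This assumes you already log, for each edit, the output tokens the agent produced and the token count of the minimized patch from your offline step. The `EditRecord` fields and the function name are hypothetical, not part of any existing tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EditRecord:
    agent: str
    output_tokens: int   # tokens the agent actually emitted for this edit
    minimal_tokens: int  # tokens in the minimized patch (assumed to come from the offline step)


def over_edit_ratio_by_agent(records: list[EditRecord]) -> dict[str, float]:
    """Token-weighted over-edit ratio per agent: sum(output) / sum(minimal)."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for record in records:
        if record.minimal_tokens <= 0:
            continue  # skip edits with no usable minimal baseline
        totals[record.agent][0] += record.output_tokens
        totals[record.agent][1] += record.minimal_tokens
    return {agent: out / minimal for agent, (out, minimal) in totals.items()}


records = [
    EditRecord("refactor-bot", 3_250, 500),
    EditRecord("refactor-bot", 500, 500),
    EditRecord("fix-bot", 500, 500),
]
ratios = over_edit_ratio_by_agent(records)
```

Summing tokens before dividing weights the ratio by spend, so a few very large over-edits are not hidden by many small clean ones. Averaging the per-edit ratios instead is an equally defensible choice if you care more about how often an agent over-edits than about how much it costs.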
Instrument over-edit ratio this quarter and treat it as a first-class SLO per agent. On the normalized edit-distance scale quoted above (0.395 vs. 0.060), budget for an average below 0.2, and route high-stakes "minimal" tasks to models whose published over-edit score is below 0.1.
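A minimal sketch of that SLO check and routing rule, assuming both thresholds sit on the same normalized edit-distance scale as the 0.395 and 0.060 figures earlier in this post. The constant names, function names, and the shape of the `published` mapping are illustrative.

```python
from statistics import mean

SLO_BUDGET = 0.2       # target: average over-edit score below this
ROUTING_CEILING = 0.1  # models below this are eligible for "minimal" tasks


def breaches_slo(scores: list[float], budget: float = SLO_BUDGET) -> bool:
    """True when the average score is at or above the budget."""
    return bool(scores) and mean(scores) >= budget


def eligible_models(published_scores: dict[str, float],
                    ceiling: float = ROUTING_CEILING) -> list[str]:
    """Names of models whose published score is under the routing ceiling."""
    return sorted(name for name, score in published_scores.items() if score < ceiling)


# The two averages quoted earlier in this post.
published = {"GPT-5.4": 0.395, "Claude Opus 4.6": 0.060}
minimal_task_models = eligible_models(published)
```

With those two scores, only Claude Opus 4.6 clears the 0.1 ceiling for high-stakes minimal edits.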
Attribution is the prerequisite for every other cost signal you'll want this year. LLMeter ships per-customer and per-agent attribution today, and over-edit ratio is the first quality-flavored metric for which its attribution layer is the right home.