Can Bölük's *The Harness Problem* showed hashline-style edits (line-number anchored, like `4#WB`) outperforming traditional replace-mode edits (`old_string`/`new_string` matching) for coding agents.
I've been experimenting with building my own harness (tau), and wanted to verify this result and see if I should consider using hashline as the default edit strategy there.
So I built edit-bench to test this myself across multiple languages and models.
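For readers unfamiliar with the two formats, here's a minimal sketch of what each edit shape does when applied. These are my own toy functions for illustration, not tau's actual implementation:

```python
def apply_replace(source: str, old_string: str, new_string: str) -> str:
    """Replace-mode edit: the model reproduces a unique snippet verbatim."""
    if source.count(old_string) != 1:
        raise ValueError("old_string must match exactly once")
    return source.replace(old_string, new_string)


def apply_hashline(source: str, line_no: int, new_line: str) -> str:
    """Hashline-style edit: the model targets a 1-indexed line number."""
    lines = source.split("\n")
    lines[line_no - 1] = new_line
    return "\n".join(lines)


src = "a = 1\nb = True\nc = 3"
print(apply_replace(src, "b = True", "b = False"))  # swaps the middle line
print(apply_hashline(src, 2, "b = False"))          # same result by line number
```

The failure modes differ: replace mode fails when the model can't reproduce `old_string` exactly; hashline mode fails when the line number is stale or the replacement line is malformed.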
## Setup
edit-bench generates mutation-based tests from existing codebases.
You point a script at a directory, and it generates mutations like deleting a statement, flipping a boolean, swapping args, etc.
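A boolean flip, for example, can be done cleanly on the AST rather than with string surgery. This is a hypothetical sketch of one such mutation operator, not edit-bench's actual code (requires Python 3.9+ for `ast.unparse`):

```python
import ast


class FlipBoolean(ast.NodeTransformer):
    """Flip the first boolean literal encountered (True <-> False)."""

    def __init__(self):
        self.done = False

    def visit_Constant(self, node):
        if not self.done and isinstance(node.value, bool):
            self.done = True
            return ast.copy_location(ast.Constant(value=not node.value), node)
        return node


def mutate_flip_boolean(source: str) -> str:
    """Return the source with its first boolean literal flipped."""
    tree = ast.parse(source)
    mutated = FlipBoolean().visit(tree)
    return ast.unparse(mutated)


print(mutate_flip_boolean("def ok():\n    return True"))
```

The task for the model is then to look at the mutated file and produce an edit that restores the original behavior.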
- Languages: Python (from hive), TypeScript (from oh-my-pi), Rust (from irradiate)
- Models: gpt-4.1-mini, google/gemini-3-flash-preview, qwen/qwen3.5-397b-a17b
- Edit modes: replace (`old_string`/`new_string`) vs hashline (line-number anchored)
- 20 tasks per language, single-attempt oneshot runs
- I also recently added fuzzy matching to tau (trim cascade: trim_end → trim_both → unicode normalization) and wanted to see if this helps
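The cascade idea is to retry the `old_string` lookup under progressively looser normalizations. Here's a minimal sketch of that logic, assuming the stage names above; tau's real implementation has to map matches back to original offsets, which this omits:

```python
import unicodedata


def _normalize(text: str, mode: str) -> str:
    """Normalize each line according to the given cascade stage."""
    lines = text.split("\n")
    if mode == "trim_end":
        lines = [line.rstrip() for line in lines]
    elif mode == "trim_both":
        lines = [line.strip() for line in lines]
    elif mode == "nfc":
        lines = [unicodedata.normalize("NFC", line.strip()) for line in lines]
    return "\n".join(lines)


def find_with_cascade(haystack: str, needle: str):
    """Return the first cascade stage at which needle matches, or None."""
    for mode in ("exact", "trim_end", "trim_both", "nfc"):
        h = haystack if mode == "exact" else _normalize(haystack, mode)
        n = needle if mode == "exact" else _normalize(needle, mode)
        if n in h:
            return mode
    return None


print(find_with_cascade("foo  \nbar", "foo\nbar"))  # matches at trim_end
```

Instrumenting which stage fires per edit is what produced the trace data below.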
## Results
Replace mode:
| Model | Python | TypeScript | Rust |
|---|---|---|---|
| gemini-3-flash | 95% | 80% | 95% |
| qwen3.5-397b | 90% | 85% | 85% |
| gpt-4.1-mini | 65% | 75% | 45% |
Hashline mode (from earlier runs):
| Model | Python | TypeScript | Rust |
|---|---|---|---|
| gemini-3-flash | 70% | 85% | 90% |
| qwen3.5-397b | 85% | 85% | 90% |
| gpt-4.1-mini | 50% | 70% | 55% |
Hashline hurts Python noticeably, and seems roughly neutral on TypeScript and Rust.
The language-dependence is interesting — Python's significant whitespace might make line-anchored edits more error-prone.
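A contrived sketch of the failure mode I suspect: a line-anchored edit that lands on the right line but loses its leading indentation. Python rejects this outright, while a brace language would typically still parse:

```python
# Original function: the model wants to change line 2's condition.
src = "def f(x):\n    if x:\n        return 1\n    return 0"

# Line-anchored replacement, but the model emits the new line
# without its four leading spaces.
lines = src.split("\n")
lines[1] = "if not x:"  # correct logic, wrong indentation level
broken = "\n".join(lines)

try:
    compile(broken, "<mutated>", "exec")
except IndentationError as exc:
    print("edit produced invalid Python:", exc.msg)
```

Replace mode sidesteps this because the model must reproduce the surrounding whitespace to match `old_string` in the first place.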
## Does Fuzzy Matching Help?
Apparently not.
I added trace collection to see if tau's fuzzy trim cascade ever fires during replace-mode runs. Across 114 successful edits and 20 failed edits (3 models × 3 languages), fuzzy matching triggered zero times.
Of the 20 failed edits:
- 1 had trailing whitespace (theoretically fixable)
- ~8 included line numbers in `old_string` (a model bug)
- ~11 had completely hallucinated content
When models get old_string right, they get whitespace right too.
When they get it wrong, they get it very wrong — trim cascading doesn't help.
## Takeaways
Hashline vs replace is not a clear winner either way. The effect is language-dependent and model-dependent. Python penalizes hashline; TypeScript is neutral; Rust is a toss-up.
Can's results are hard to generalize. The react-edit-benchmark is JavaScript-only and uses an LSP for validation feedback. Our setup (no LSP, multiple languages) shows a different picture. The LSP feedback loop in particular likely confounds the comparison: giving the model type errors to retry against is a meaningful boost that interacts with edit format.
Fuzzy matching is a non-problem for current models. LLMs either reproduce source text exactly or hallucinate something completely different. The whitespace near-miss case that fuzzy matching targets basically doesn't happen in practice.
For current-gen models in contemporary harnesses, edit format is not the bottleneck. The gap between models (gemini-3-flash at 90%+ vs gpt-4.1-mini at 55-65%) dwarfs the gap between edit formats. Invest in model selection and prompt engineering before worrying about edit format.
Obligatory disclaimer: small n, not statistically rigorous, treat accordingly.
All data: nwyin/edit-bench, issues #13 and #14.