Can Bölük's *The Harness Problem* showed hashline-style edits (line-number anchored, like `4#WB`) outperforming traditional replace-mode edits (`old_string`/`new_string` matching) for coding agents.
I've been experimenting with building my own harness (tau), and wanted to verify this result and see if I should consider using hashline as the default edit strategy there.
So I built edit-bench to test this myself across multiple languages and models.
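For readers unfamiliar with the two formats, here's a minimal sketch of what each edit shape does when applied. These are my own toy functions for illustration, not tau's actual implementation:

```python
def apply_replace(source: str, old_string: str, new_string: str) -> str:
    """Replace-mode edit: the model reproduces a unique snippet verbatim."""
    if source.count(old_string) != 1:
        raise ValueError("old_string must match exactly once")
    return source.replace(old_string, new_string)


def apply_hashline(source: str, line_no: int, new_line: str) -> str:
    """Hashline-style edit: the model targets a 1-indexed line number."""
    lines = source.split("\n")
    lines[line_no - 1] = new_line
    return "\n".join(lines)


src = "a = 1\nb = True\nc = 3"
print(apply_replace(src, "b = True", "b = False"))  # swaps the middle line
print(apply_hashline(src, 2, "b = False"))          # same result by line number
```

The failure modes differ: replace mode fails when the model can't reproduce `old_string` exactly; hashline mode fails when the line number is stale or the replacement line is malformed.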
## Setup
edit-bench generates mutation-based tests from existing codebases.
You point a script at a directory, and it generates mutations like deleting a statement, flipping a boolean, swapping args, etc.
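A boolean flip, for example, can be done cleanly on the AST rather than with string surgery. This is a hypothetical sketch of one such mutation operator, not edit-bench's actual code (requires Python 3.9+ for `ast.unparse`):

```python
import ast


class FlipBoolean(ast.NodeTransformer):
    """Flip the first boolean literal encountered (True <-> False)."""

    def __init__(self):
        self.done = False

    def visit_Constant(self, node):
        if not self.done and isinstance(node.value, bool):
            self.done = True
            return ast.copy_location(ast.Constant(value=not node.value), node)
        return node


def mutate_flip_boolean(source: str) -> str:
    """Return the source with its first boolean literal flipped."""
    tree = ast.parse(source)
    mutated = FlipBoolean().visit(tree)
    return ast.unparse(mutated)


print(mutate_flip_boolean("def ok():\n    return True"))
```

The task for the model is then to look at the mutated file and produce an edit that restores the original behavior.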
- Languages: Python (from hive), TypeScript (from oh-my-pi), Rust (from irradiate)
- Models: gpt-4.1-mini, google/gemini-3-flash-preview, qwen/qwen3.5-397b-a17b
- Edit modes: replace (`old_string`/`new_string`) vs hashline (line-number anchored)
- 20 tasks per language, single-attempt oneshot runs
- I also recently added fuzzy matching to tau (trim cascade: trim_end → trim_both → unicode normalization) and wanted to see if this helps
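The cascade idea is to retry the `old_string` lookup under progressively looser normalizations. Here's a minimal sketch of that logic, assuming the stage names above; tau's real implementation has to map matches back to original offsets, which this omits:

```python
import unicodedata


def _normalize(text: str, mode: str) -> str:
    """Normalize each line according to the given cascade stage."""
    lines = text.split("\n")
    if mode == "trim_end":
        lines = [line.rstrip() for line in lines]
    elif mode == "trim_both":
        lines = [line.strip() for line in lines]
    elif mode == "nfc":
        lines = [unicodedata.normalize("NFC", line.strip()) for line in lines]
    return "\n".join(lines)


def find_with_cascade(haystack: str, needle: str):
    """Return the first cascade stage at which needle matches, or None."""
    for mode in ("exact", "trim_end", "trim_both", "nfc"):
        h = haystack if mode == "exact" else _normalize(haystack, mode)
        n = needle if mode == "exact" else _normalize(needle, mode)
        if n in h:
            return mode
    return None


print(find_with_cascade("foo  \nbar", "foo\nbar"))  # matches at trim_end
```

Instrumenting which stage fires per edit is what produced the trace data below.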
## Results
Replace mode:
| Model | Python | TypeScript | Rust |
|---|---|---|---|
| gemini-3-flash | 95% | 80% | 95% |
| qwen3.5-397b | 90% | 85% | 85% |
| gpt-4.1-mini | 65% | 75% | 45% |
Hashline mode (from earlier runs):
| Model | Python | TypeScript | Rust |
|---|---|---|---|
| gemini-3-flash | 70% | 85% | 90% |
| qwen3.5-397b | 85% | 85% | 90% |
| gpt-4.1-mini | 50% | 70% | 55% |
Hashline hurts Python noticeably, and seems roughly neutral on TypeScript and Rust.
The language-dependence is interesting — Python's significant whitespace might make line-anchored edits more error-prone.
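A contrived sketch of the failure mode I suspect: a line-anchored edit that lands on the right line but loses its leading indentation. Python rejects this outright, while a brace language would typically still parse:

```python
# Original function: the model wants to change line 2's condition.
src = "def f(x):\n    if x:\n        return 1\n    return 0"

# Line-anchored replacement, but the model emits the new line
# without its four leading spaces.
lines = src.split("\n")
lines[1] = "if not x:"  # correct logic, wrong indentation level
broken = "\n".join(lines)

try:
    compile(broken, "<mutated>", "exec")
except IndentationError as exc:
    print("edit produced invalid Python:", exc.msg)
```

Replace mode sidesteps this because the model must reproduce the surrounding whitespace to match `old_string` in the first place.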
## Does Fuzzy Matching Help?
Apparently not.
I added trace collection to see if tau's fuzzy trim cascade ever fires during replace-mode runs. Across 114 successful edits and 20 failed edits (3 models × 3 languages), fuzzy matching triggered zero times.
Of the 20 failed edits:
- 1 had trailing whitespace (theoretically fixable)
- ~8 included line numbers in `old_string` (a model bug)
- ~11 had completely hallucinated content
When models get old_string right, they get whitespace right too.
When they get it wrong, they get it very wrong — trim cascading doesn't help.
## Takeaways
Hashline vs replace is not a clear winner either way. The effect is language-dependent and model-dependent. Python penalizes hashline; TypeScript is neutral; Rust is a toss-up.
Can's results are hard to generalize. The react-edit-benchmark is JavaScript-only and uses an LSP for validation feedback. Our setup (no LSP, multiple languages) shows a different picture. The LSP feedback loop in particular likely confounds the comparison: giving the model type errors to retry against is a meaningful boost that interacts with edit format.
Fuzzy matching is a non-problem for current models. LLMs either reproduce source text exactly or hallucinate something completely different. The whitespace near-miss case that fuzzy matching targets basically doesn't happen in practice.
For current-gen models in contemporary harnesses, edit format is not the bottleneck. The gap between models (gemini-3-flash at 90%+ vs gpt-4.1-mini at 55-65%) dwarfs the gap between edit formats. Invest in model selection and prompt engineering before worrying about edit format.
Obligatory disclaimer: small n, not statistically rigorous, treat accordingly.
All data: nwyin/edit-bench, issues #13 and #14.