DEV Community

nwyin

Posted on • Originally published at nwyin.com

Hashline vs Replace: Does the Edit Format Matter?

Can Bölük's The Harness Problem showed hashline-style edits (line-number anchored, like 4#WB) outperforming traditional replace-mode edits (old_string/new_string matching) for coding agents.
I've been experimenting with building my own harness (tau) and wanted to verify this result before deciding whether hashline should be tau's default edit strategy.
So I built edit-bench to test this myself across multiple languages and models.
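For readers who haven't seen the two formats side by side, here's a minimal sketch of how each kind of edit gets applied. The payload shapes and the two-character hash tag are my own illustration of the idea, not any tool's actual wire format.

```python
import hashlib

source = "def add(a, b):\n    return a - b\n"

# Replace mode: the model emits old_string/new_string; the harness requires
# a unique literal match and substitutes it.
def apply_replace(text: str, old: str, new: str) -> str:
    assert text.count(old) == 1, "old_string must match exactly once"
    return text.replace(old, new)

# Hashline mode: the model anchors the edit to a line number plus a short
# content hash (e.g. "2#5a"), so a stale line number fails loudly instead
# of silently editing the wrong line.
def line_tag(line: str) -> str:
    return hashlib.sha1(line.encode()).hexdigest()[:2]

def apply_hashline(text: str, anchor: str, new_line: str) -> str:
    num, tag = anchor.split("#")
    lines = text.splitlines(keepends=True)
    idx = int(num) - 1
    assert line_tag(lines[idx]) == tag, "hash mismatch: buffer drifted"
    lines[idx] = new_line
    return "".join(lines)

fixed_a = apply_replace(source, "return a - b", "return a + b")
anchor = "2#" + line_tag("    return a - b\n")
fixed_b = apply_hashline(source, anchor, "    return a + b\n")
assert fixed_a == fixed_b
```

The interesting design difference is where each format can fail: replace mode fails when the model can't reproduce the source text verbatim, hashline fails when the model's view of line numbers goes stale.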

Setup

edit-bench generates mutation-based tests from existing codebases.
You point a script at a directory, and it generates mutations like deleting a statement, flipping a boolean, swapping args, etc.
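The mutation kinds above can be sketched in a few lines each (function names are illustrative, not edit-bench's actual API): each mutation takes valid source and breaks it in a small, recoverable way, and the model's task is to produce an edit that restores the original.

```python
import random
import re

def flip_boolean(src: str) -> str:
    # Swap the first True/False literal found.
    if "True" in src:
        return src.replace("True", "False", 1)
    return src.replace("False", "True", 1)

def delete_statement(src: str) -> str:
    # Drop one random non-blank line.
    lines = src.splitlines()
    candidates = [i for i, l in enumerate(lines) if l.strip()]
    del lines[random.choice(candidates)]
    return "\n".join(lines)

def swap_args(src: str) -> str:
    # Swap the first two-argument call's arguments, e.g. f(a, b) -> f(b, a).
    return re.sub(r"\((\w+), (\w+)\)", r"(\2, \1)", src, count=1)

mutated = swap_args("result = divide(total, count)")
# The benchmark then asks the model to edit `mutated` back to the original.
```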

  • Languages: Python (from hive), TypeScript (from oh-my-pi), Rust (from irradiate)
  • Models: gpt-4.1-mini, google/gemini-3-flash-preview, qwen/qwen3.5-397b-a17b
  • Edit modes: replace (old_string/new_string) vs hashline (line-number anchored)
  • 20 tasks per language, single-attempt runs (no retries)
  • I also recently added fuzzy matching to tau (trim cascade: trim_end → trim_both → unicode normalization) and wanted to see if this helps
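The cascade itself is simple; a minimal sketch (my naming, not tau's actual code) looks like this. A real implementation would normalize the file side too and map matches back to original offsets; this version only loosens the needle.

```python
import unicodedata

def _variants(needle: str):
    """Yield progressively looser forms of the match string."""
    yield needle                                              # exact
    yield "\n".join(l.rstrip() for l in needle.split("\n"))   # trim_end
    yield "\n".join(l.strip() for l in needle.split("\n"))    # trim_both
    yield unicodedata.normalize("NFKC", needle)               # unicode norm

def fuzzy_find(haystack: str, needle: str):
    """Return (offset, variant_index) of the first unique match, else None."""
    for i, variant in enumerate(_variants(needle)):
        if haystack.count(variant) == 1:
            return haystack.index(variant), i
    return None
```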

Results

Replace mode:

| Model | Python | TypeScript | Rust |
| --- | --- | --- | --- |
| gemini-3-flash | 95% | 80% | 95% |
| qwen3.5-397b | 90% | 85% | 85% |
| gpt-4.1-mini | 65% | 75% | 45% |

Hashline mode (from earlier runs):

| Model | Python | TypeScript | Rust |
| --- | --- | --- | --- |
| gemini-3-flash | 70% | 85% | 90% |
| qwen3.5-397b | 85% | 85% | 90% |
| gpt-4.1-mini | 50% | 70% | 55% |

Hashline hurts Python noticeably, and seems roughly neutral on TypeScript and Rust.
The language-dependence is interesting — Python's significant whitespace might make line-anchored edits more error-prone.

Does Fuzzy Matching Help?

Apparently not.

I added trace collection to see if tau's fuzzy trim cascade ever fires during replace-mode runs. Across 114 successful edits and 20 failed edits (3 models × 3 languages), fuzzy matching triggered zero times.

Of the 20 failed edits:

  • 1 had trailing whitespace (theoretically fixable)
  • ~8 included line numbers in old_string (model bug)
  • ~11 had completely hallucinated content
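For concreteness, the line-number failure mode looks roughly like this (a reconstructed example, not an actual trace): the model copies its numbered view of the file into old_string, so the literal match can never succeed.

```python
file_text = 'def greet(name):\n    return f"hi {name}"\n'

# The model's context showed the file with line-number prefixes, and those
# prefixes leaked into old_string (hypothetical payload, for illustration).
bad_old_string = '1  def greet(name):\n2      return f"hi {name}"'

# No trim cascade can recover this: the mismatch is in the middle of each
# line, not at the edges.
assert bad_old_string not in file_text
assert bad_old_string.strip() not in file_text
```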

When models get old_string right, they get whitespace right too.
When they get it wrong, they get it very wrong — trim cascading doesn't help.


Takeaways

  1. Hashline vs replace is not a clear winner either way. The effect is language-dependent and model-dependent. Python penalizes hashline; TypeScript is neutral; Rust is a toss-up.

  2. Can's results are hard to generalize. The react-edit-benchmark is JavaScript-only and uses an LSP for validation feedback, while our setup (no LSP, multiple languages) shows a different picture. The LSP feedback loop in particular is a likely confounder: letting the model retry against type errors is a meaningful boost, and one that interacts with edit format.

  3. Fuzzy matching is a non-problem for current models. LLMs either reproduce source text exactly or hallucinate something completely different. The whitespace near-miss case that fuzzy matching targets basically doesn't happen in practice.

  4. For current-gen models in contemporary harnesses, edit format is not the bottleneck. The gap between models (gemini-3-flash mostly at 80-95% vs gpt-4.1-mini at 45-75%) dwarfs the gap between edit formats. Invest in model selection and prompt engineering before worrying about edit format.

Obligatory disclaimer: small n, not statistically rigorous, treat accordingly.

All data: nwyin/edit-bench, issues #13 and #14.
