DEV Community

Gabriel

When a refactor bot renames things unevenly: inconsistent variable naming across files

We used an LLM-assisted refactoring workflow to rename a public API across a medium-sized TypeScript repo. The initial run felt promising: the model suggested a consistent, cleaner name for an exported function and produced diffs for many files. I was comfortable accepting most changes after a quick pass, and we treated the model’s output as a first draft to be reviewed. For general coordination and task history we kept links to our tooling dashboard handy, since we were using an AI-assisted interface like crompt.ai to orchestrate suggestions.

But after merging, CI started failing with confusing symptoms: runtime errors complaining about undefined exports, tests that passed locally but failed in the build, and flaky type errors. The root cause turned out to be an inconsistent renaming pattern: the model renamed identifiers in some files but left others untouched, or renamed local variables differently from exported names. On the surface the diffs looked coherent; only after running the full test suite and a module graph analyzer did the mismatch become obvious.

How it surfaced during development

The failures showed up in several ways. Unit tests that exercised modules directly passed, because the tests imported the old names via relative paths still present in test fixtures; integration tests that imported the refactored API through the build artifact failed. A typical example was an export renamed from fetchUserData to getUser in implementation files, while index barrels or tests still referenced fetchUserData. The code compiled in some dev environments because local caches and IDE TypeScript servers resolved old names differently.
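A minimal sketch of the failure mode and a pre-merge guard against it. The file names and contents here are hypothetical, not our real repo layout: the implementation exports the new name, but a barrel and a test fixture still reference the old one. A crude word-boundary scan for the renamed-away identifier catches the drift before the bundler does:

```typescript
// Hypothetical repro of the uneven rename:
//   src/user.ts was updated by the model to export getUser,
//   but the barrel and a test fixture still say fetchUserData.

// Pre-merge check: scan every source file for the identifier we
// renamed away from and report any file that still uses it.
function findStaleReferences(
  files: Record<string, string>, // path -> file contents
  oldName: string
): string[] {
  const pattern = new RegExp(`\\b${oldName}\\b`); // word boundary avoids substring hits
  return Object.entries(files)
    .filter(([, src]) => pattern.test(src))
    .map(([path]) => path);
}

const repo: Record<string, string> = {
  "src/user.ts": "export function getUser() { return {}; }",
  "src/index.ts": 'export { fetchUserData } from "./user";',
  "tests/user.test.ts": 'import { fetchUserData } from "../src";',
};

console.log(findStaleReferences(repo, "fetchUserData"));
// -> ["src/index.ts", "tests/user.test.ts"]
```

Running this as a CI step after any model-assisted rename would have flagged both the stale barrel and the stale fixture immediately.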

We tried to reproduce and repair the mismatch interactively, asking the multi-turn assistant for clarification, but it tended to regenerate suggestions scoped to the file in view. Multi-file, cross-reference updates require stateful coordination across turns; the chat interface helped coordinate edits, but it could not guarantee repo-wide consistency. The dialog format makes local iteration easy; it does not replace a deterministic refactor pass across the entire repository.

Why the inconsistency was subtle

This failure is subtle because the model optimizes for plausible local edits rather than whole-repo invariants. Inside a single file, probability mass concentrates on locally coherent renames and idiomatic restructuring. But when the task spans dozens of files, the model’s context window and prompt framing mean it sees sections piecemeal and re-samples names each time. Small sampling differences lead to divergent names that still read as valid code.

Detecting the divergence by eye is hard because the changes look intentional: the new names are meaningful, tests may still pass in isolation, and modern editors tuck barrel files and generated modules out of view. Catching it requires cross-file verification: dependency-graph checks, a full type-check in CI, or a plain search for the old identifiers. We augmented reviews with a reproducible graph walk and a dedicated verification pass that cross-references every export against the imports that consume it before merging.
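That verification pass can be sketched as follows. This is a simplified, regex-based version for illustration (the function name and file contents are hypothetical); a real pass should walk the module graph with the TypeScript compiler API, since a regex ignores default exports, re-export chains, and per-module resolution:

```typescript
// Cross-file consistency sketch: collect every exported name, then flag
// any named import that nothing in the repo exports. Intentionally naive:
// no default exports, no re-exports, no per-module resolution.
function findUnresolvedImports(files: Record<string, string>): string[] {
  const exported = new Set<string>();
  const exportDecl = /export\s+(?:function|const|class)\s+(\w+)/g;
  for (const src of Object.values(files)) {
    let m: RegExpExecArray | null;
    while ((m = exportDecl.exec(src)) !== null) exported.add(m[1]);
  }

  const problems: string[] = [];
  const importDecl = /import\s*\{([^}]+)\}/g;
  for (const [path, src] of Object.entries(files)) {
    let m: RegExpExecArray | null;
    while ((m = importDecl.exec(src)) !== null) {
      for (const name of m[1].split(",").map((n) => n.trim())) {
        if (name && !exported.has(name)) {
          problems.push(`${path}: imports '${name}' but no file exports it`);
        }
      }
    }
  }
  return problems;
}

// After the uneven rename, the implementation exports getUser while a
// test still imports fetchUserData:
const snapshot: Record<string, string> = {
  "src/user.ts": "export function getUser() { return {}; }",
  "tests/user.test.ts": 'import { getUser, fetchUserData } from "../src/user";',
};

console.log(findUnresolvedImports(snapshot));
// -> ["tests/user.test.ts: imports 'fetchUserData' but no file exports it"]
```

Even this naive check turns the "looks coherent in the diff" problem into a hard failure that can gate the merge.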

How small model behaviors compounded and practical mitigations

Small behaviors compounded: per-file-only consistency, sampling variance, and reliance on local context produced a repository that looked healthy but was wrong. The result was brittle builds and developer confusion. The fix was procedural rather than magical: enforce global rename rules, run automated cross-file refactors (editor- or language-server-driven), and confine LLM suggestions to single-file cleanups unless the tool performs deterministic whole-repo transforms.
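The "deterministic whole-repo transform" point can be made concrete with a sketch (function name and file contents hypothetical). One rule is applied identically to every file, so there is no per-file sampling to diverge. Note the caveat in the comments: a production refactor of a public API should go through the language server (tsserver, or a wrapper like ts-morph), which is scope-aware, whereas a word-boundary regex cannot distinguish a local shadow from the exported symbol:

```typescript
// Deterministic whole-repo rename sketch: one rule, every file, same pass.
// CAVEAT: real refactors should use the language server (tsserver/ts-morph);
// a regex cannot tell a shadowing local variable from the exported symbol.
function renameEverywhere(
  files: Record<string, string>,
  oldName: string,
  newName: string
): Record<string, string> {
  const pattern = new RegExp(`\\b${oldName}\\b`, "g");
  return Object.fromEntries(
    Object.entries(files).map(([path, src]) => [
      path,
      src.replace(pattern, newName),
    ])
  );
}

// Every reference changes in the same pass, barrels and tests included.
const before: Record<string, string> = {
  "src/user.ts": "export function fetchUserData() { return {}; }",
  "src/index.ts": 'export { fetchUserData } from "./user";',
};
const after = renameEverywhere(before, "fetchUserData", "getUser");

console.log(Object.values(after).some((s) => s.includes("fetchUserData")));
// -> false
```

The contrast with the LLM workflow is the point: this transform cannot rename one file and skip another, because there is no per-file decision to make.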

In practice we added a checklist: run a full type-check and a module-graph diff before accepting refactors, prefer deterministic refactor tools for public APIs, and treat model outputs as drafts that require automated verification. These measures reduced similar regressions and let us keep the genuine productivity gains without inheriting subtle naming bugs scattered across files.
