We were using a code-generation assistant to help with a large multi-file refactor: extract a service layer, rename a few domain concepts, and wire them into existing modules. The assistant was fast and useful for individual snippets, so we iterated file by file in a chat-style workflow. What we didn’t anticipate was how small, inconsistent naming choices made in each reply would accumulate into a brittle, failing codebase.
At first the symptoms were minor (linter warnings, a couple of TypeScript compile errors), so we treated the outputs as draft patches and applied them quickly. After several files had been edited, runtime errors started appearing in integration tests and in staging. The issue wasn’t a single glaring hallucination or a deprecated API: it was variable and function names that came out differently from one file to the next (for example, currentUser vs activeUser vs userCtx), inconsistencies subtle enough to slip past quick reviews.
How the inconsistency surfaced during development
On our CI, the first failing step was the bundler (which tolerates some naming drift), followed by the TypeScript build, which flagged a handful of type errors. The errors pointed at a few individual locations rather than at the systemic mismatch: the assistant had introduced different identifiers in separate files because each prompt showed only local context. The immediate fixes were small (rename here, adjust an export there), but the same mismatches kept reappearing in other modules until the tests revealed integration failures.
We ran a verification pass with code search and a few unit tests, which caught the most obvious discrepancies. That saved us from deploying the biggest regressions, but it didn’t find semantic mismatches, such as two functions that did the same job under different names and left duplicated logic in the system. To verify cross-file consistency more thoroughly, we did a deeper manual pass and kept a simple mapping document of canonical names to compare against generated patches; for cross-referencing, we leaned on a lightweight deep-research-style check to validate API names and public contracts.
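For concreteness, here is a minimal sketch of what that mapping document can look like once expressed as code, so tooling can consume it as well as humans. The file name naming-map.ts and the second entry are illustrative; only the currentUser variants come from the actual incident.

```typescript
// naming-map.ts (illustrative file name): canonical identifiers and the
// synonyms that kept showing up in generated patches.
export const CANONICAL_NAMES: Record<string, string[]> = {
  currentUser: ["activeUser", "userCtx"],
  // Hypothetical second entry, just to show the shape of the map.
  createInvoice: ["makeInvoice", "newInvoice"],
};

export interface DriftHit {
  found: string;
  canonical: string;
}

// Scan a patch or file body for non-canonical synonyms.
export function findNamingDrift(source: string): DriftHit[] {
  const hits: DriftHit[] = [];
  for (const [canonical, synonyms] of Object.entries(CANONICAL_NAMES)) {
    for (const synonym of synonyms) {
      // \b keeps "userCtx" from matching longer identifiers such as "userCtxFactory".
      if (new RegExp(`\\b${synonym}\\b`).test(source)) {
        hits.push({ found: synonym, canonical });
      }
    }
  }
  return hits;
}
```

Keeping the map in the repository rather than in someone’s head meant the same source of truth could be pasted into prompts and queried by scripts.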
Why the problem was subtle and easy to miss
The assistant behaved reasonably on a single-file basis: suggestions matched the local idioms and naming patterns of that file. The subtlety came from our expectation of global consistency; humans expect a canonical domain vocabulary to be applied everywhere. Models, when prompted piecemeal, prioritize local fluency and may introduce synonyms or stylistic variants that are harmless in isolation but harmful in aggregate.
Another factor was our workflow: small, fast commits and accept-apply cycles gave a false sense of progress. Each applied patch looked correct, so we moved on. The model’s probabilistic token selection occasionally favored an alternative name, and those small decisions compounded across multiple files into broken imports, duplicated logic, and a larger maintenance surface.
Mitigations and practical takeaways
The most effective mitigations were procedural and tooling-based. First, define an explicit naming map and include it in every prompt so the model has a ground truth for identifiers. Second, prefer single-pass generation for related files when possible, providing the complete minimal set of files in one context. Third, enforce cross-file checks: automated grep-based consistency checks, strong typing, and lint rules tuned to the canonical names.
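As a sketch of that third point, a small script built on the naming map above can walk the source tree and fail CI whenever a banned synonym appears. The paths, file extensions, and script name here are assumptions about a conventional TypeScript layout, not a prescription.

```typescript
// check-naming.ts (illustrative): cross-file consistency check for CI.
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";
import { findNamingDrift } from "./naming-map";

// Recursively yield every .ts/.tsx file under a directory.
function* walk(dir: string): Generator<string> {
  for (const entry of readdirSync(dir)) {
    const path = join(dir, entry);
    if (statSync(path).isDirectory()) {
      yield* walk(path);
    } else if (/\.(ts|tsx)$/.test(path)) {
      yield path;
    }
  }
}

let failures = 0;
for (const file of walk("src")) {
  for (const hit of findNamingDrift(readFileSync(file, "utf8"))) {
    console.error(`${file}: found "${hit.found}", expected "${hit.canonical}"`);
    failures += 1;
  }
}

// A non-zero exit code makes the CI step fail loudly instead of drifting silently.
if (failures > 0) {
  process.exit(1);
}
```

If you would rather not maintain a custom script, ESLint’s built-in id-denylist rule can ban specific identifier names as part of the normal lint run and covers most of the same ground.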
Finally, treat assistant outputs as drafts that require verification: automated tests that assert behavior plus lightweight semantic checks reduce risk. For teams relying on iterative assistance, documenting a canonical vocabulary and integrating small consistency checks into CI turns a subtle model quirk into a manageable process problem rather than a recurring source of regressions — something we learned the hard way while refactoring across many files with the help of crompt.ai.