We were using a multi-turn code assistant to help with a cross-cutting rename: changing a frequently used utility function and its argument names across several packages. The assistant in the chat interface made polite, file-by-file suggestions and even produced tidy diffs for individual modules. That multi-file surface area is where the problem began.
The model renamed identifiers in some files but left older names in others, producing a mix of calculateTotal and compute_total across the codebase. At first glance the diffs looked reasonable: each change compiled and unit tests passed locally. The failure mode was subtle and systemic rather than a single, obvious error.
## How it surfaced during development
The bug showed up during an integration run on CI when a rarely exercised code path failed with a reference error. Locally we had run unit tests that mocked the utility entry point, so those tests continued to pass. The failing trace referenced a function name that only existed in one of the microservices after the partial rename. We reproduced the problem by grepping the repo and found three different identifiers serving the same role.
Example of the pattern we saw in different files:
```js
// file: service-a/util.js
export function calculateTotal(items) { /* ... */ }

// file: service-b/helpers.js
export function compute_total(items) { /* ... */ }
```
Each file change by the assistant was internally consistent, and the assistant explained why it preferred a particular naming style in that file. The mismatch happened because suggestions were produced per-file with only local context and some token-level preference for different naming conventions.
## Why it was subtle and easy to miss
There are a few reasons this slipped past us. First, the model operates on the visible context: when asked to refactor a file it optimizes the snippets it sees, not the whole repository. Second, sampling and prompt phrasing make it prefer different idioms across runs — camelCase in one suggestion, snake_case in another — which is noise humans can miss during quick reviews. Third, our test coverage and mocks hid the mismatch by exercising only the updated surface area.
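To illustrate the third point: a unit test that mocks the utility entry point keeps passing no matter which spelling the rest of the repository uses at runtime. Here is a minimal sketch, assuming a Jest setup; the `order.js`/`buildInvoice` names are hypothetical and not from our actual suite:

```js
// service-b/order.test.js -- hypothetical test, for illustration only.
import { buildInvoice } from './order';
import { calculateTotal } from './helpers';

// Jest hoists this call above the imports and swaps out the whole module,
// so the test never touches the real export or notices its real name.
jest.mock('./helpers', () => ({
  calculateTotal: jest.fn(() => 42),
}));

test('builds an invoice using the mocked total', () => {
  const invoice = buildInvoice([{ price: 21 }, { price: 21 }]);

  // Both assertions pass even if another service still calls compute_total.
  expect(calculateTotal).toHaveBeenCalled();
  expect(invoice.total).toBe(42);
});
```

The mock pins the name inside the test file, so a partial rename elsewhere in the repo never surfaces here.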
Statically typed languages or global rename tools would have caught this earlier; in a dynamically typed codebase the runtime is the arbiter. The assistant's outputs read like confident edits, which made reviewers assume end-to-end consistency rather than verify symbol usage across packages.
## How small model behaviors compounded and practical mitigations
Small behaviors added up: per-file context, temperature-driven naming choices, and helpful but purely local explanations from the assistant created the illusion of a complete refactor. Once deployed, the inconsistent names triggered brittle integrations and a painful rollback. To reduce the risk, we added a project-wide symbol report and mandatory cross-file grep checks to the refactor flow, and we now run a repository-wide code action that enforces a single naming convention before assistant suggestions are accepted.
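The symbol report does not need heavy tooling. Below is a minimal sketch of the idea, with an assumed script path and the two spellings hard-coded; it walks the repository and prints every file that still exports one of the known names, which is roughly what our grep check boils down to:

```js
// scripts/symbol-report.js -- a minimal sketch, not our production tooling.
// Walks the repo and lists files that export a known spelling of the
// utility, so a reviewer can spot a partial rename at a glance.
const fs = require('fs');
const path = require('path');

const VARIANTS = ['calculateTotal', 'compute_total']; // spellings under review
const exportPattern = new RegExp(
  `export\\s+function\\s+(${VARIANTS.join('|')})\\b`
);

function walk(dir, hits = []) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory() && entry.name !== 'node_modules') {
      walk(full, hits);
    } else if (entry.isFile() && full.endsWith('.js')) {
      const match = fs.readFileSync(full, 'utf8').match(exportPattern);
      if (match) hits.push({ file: full, name: match[1] });
    }
  }
  return hits;
}

for (const { file, name } of walk(process.cwd())) {
  console.log(`${name}\t${file}`);
}
```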
For verification, pairing the assistant with a repository-level search and a dedicated verification pass helped; we used an external deep-research-style check to cross-reference symbols. Finally, we added a post-refactor CI job that fails when multiple identifiers implement the same interface, and we linked the practice into our general AI workflow documentation on crompt.ai.
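The CI gate is the same idea with an exit code. A rough sketch under the same assumptions: it reuses the report's tab-separated output and fails the build when more than one spelling is still exported anywhere in the repository:

```js
// scripts/check-symbol-consistency.js -- illustrative CI gate, assumed paths.
const { execSync } = require('child_process');

// symbol-report.js (sketched above) prints one "name<TAB>file" line per export.
const output = execSync('node scripts/symbol-report.js', { encoding: 'utf8' });
const names = new Set(
  output
    .trim()
    .split('\n')
    .filter(Boolean)
    .map((line) => line.split('\t')[0])
);

if (names.size > 1) {
  console.error(
    `Refactor incomplete: ${names.size} spellings found: ${[...names].join(', ')}`
  );
  process.exit(1);
}
console.log('Symbol naming is consistent.');
```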