Context drift looks like changing styles, not noise
I once left a model-assisted debugging thread open for a full day while switching between reproducing an issue and asking for refactors. At the start I had strict constraints: TypeScript, no any, async/await only. By afternoon the model had started suggesting callbacks and looser types. The conversation still contained my original constraints, but the recent turns carried more weight. The model’s output didn’t look wrong at first; it matched a pattern that suited the most recent short messages, not the original contract I had in mind.
That shift is subtle. You don’t get an error that says constraints expired. You get code that compiles locally but violates style checks or runtime assumptions later in the pipeline.
Hidden assumptions pile up with every turn
There are always background decisions the model makes for you. Which major version of a package to target, which default JSON shape to use, whether to open a new DB transaction — it fills in these blanks without telling you. In one run the assistant assumed a different ORM API than the one in our repo and returned migration code that created a nullable column I had forbidden. The migration passed my quick glance, but it failed in production when a downstream job relied on non-null semantics.
Those are not dramatic hallucinations, just quiet defaults. They feel like helpful completion until some piece of the stack interprets them differently and the error shows up hours later. Tool chains make this worse: a failing CI plugin or a timeout from a test runner can leave a partial response, and the model will happily continue on a best guess.
Small hallucinations compound into real outages
My most embarrassing incident was iterative. Early in a thread I asked for a query example. The model invented a column name that almost matched ours. I accepted it, used that snippet to write an integration test, and then accepted another generated fix when the test failed. Those tiny mismatches moved from a harmless typo to multiple failing jobs in CI and eventually to a broken cron that logged exceptions every hour.
It is the accumulation that hurts. One made-up column. One assumed default. One missing assertion. When you chain these outputs into scripts, deployments, or batch jobs, the errors multiply. They are predictable because the model is just doing pattern completion and will keep filling whatever gaps it senses in the immediate context.
Logging, checkpoints, and forced resets helped me more than better prompts
I started treating the model like a flaky subsystem. I log every prompt and every tool response. If a tool call fails or returns partial JSON, I stop and fail fast. We added checkpoints to the conversation: after three edits we reset the thread and restate the constraints. That sounds annoying, but it reduced context drift. We also break prompts into smaller, verifiable steps instead of one long freeform session.
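The fail-fast part is small. Here is a minimal sketch of it, assuming tool calls hand back raw JSON strings; the ToolResult shape, the logTurn helper, and the audit log path are my own conventions, not part of any SDK.

```typescript
import { appendFileSync } from "node:fs";

interface ToolResult {
  status: string;
  data: unknown;
}

function logTurn(kind: "prompt" | "tool", payload: string): void {
  // Every prompt and tool response goes into a local append-only log.
  appendFileSync("model-audit.log", `${new Date().toISOString()} [${kind}] ${payload}\n`);
}

function parseToolResponse(toolName: string, rawOutput: string): ToolResult {
  logTurn("tool", `${toolName}: ${rawOutput}`);

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawOutput);
  } catch {
    // Truncated or partial JSON: stop here instead of letting the model guess.
    throw new Error(`tool ${toolName} returned unparseable output; see model-audit.log`);
  }

  const candidate = parsed as Partial<ToolResult>;
  if (typeof candidate.status !== "string" || candidate.data === undefined) {
    throw new Error(`tool ${toolName} returned incomplete JSON; failing fast`);
  }
  return candidate as ToolResult;
}
```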
For verification I built a short pipeline: unit tests and linters run on model output before any human edits. If the model suggests infra changes, we run terraform plan and diff the output automatically. When I need to compare model variants or see how different prompts behave, I spin up parallel conversations in a shared workspace so I can compare outputs side by side instead of trusting a single thread. That comparison habit is what made me reach for a multi-model chat approach rather than a longer single-thread session, and I recommend keeping results in a place you can revisit, like a collaborative chat tool I use for experiments.
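The gate itself is nothing clever. A sketch of it, assuming an npm project with an eslint config, a test script, and terraform on the PATH; the exact commands will differ per repo.

```typescript
import { spawnSync } from "node:child_process";

function run(command: string, args: string[]): void {
  const result = spawnSync(command, args, { stdio: "inherit" });
  if (result.status !== 0) {
    // Any failing step keeps the model's output away from human review.
    throw new Error(`${command} ${args.join(" ")} exited with status ${result.status}`);
  }
}

// Lint and test the generated change before anyone touches it by hand.
run("npx", ["eslint", "."]);
run("npm", ["test"]);

// For infra suggestions, save the plan so it can be diffed against the last known-good one.
run("terraform", ["plan", "-out=model-change.tfplan"]);
```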
Ask for assumptions, then verify them
I now force the model to enumerate its assumptions before I accept code. Ask it to list the package versions it assumed, the DB shape it used, the environment variables it expects. Then try to break the output with a simple test that targets those assumptions. When the model calls external tools, I treat the tool response as authoritative and validate it; if a response is missing fields, I make the workflow fail and log the raw output for later inspection.
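A sketch of how I structure that step, assuming the model can be told to reply with its assumptions as JSON; the Assumptions shape and the prompt wording are my own convention, not any model API.

```typescript
interface Assumptions {
  packageVersions: Record<string, string>; // e.g. { "typeorm": "0.3.x" }
  dbColumns: string[];                     // columns the generated code relies on
  envVars: string[];                       // environment variables it expects
}

const ASSUMPTIONS_PROMPT =
  "Before writing code, reply only with JSON listing the packageVersions, dbColumns, and envVars you are assuming.";

function checkAssumptions(a: Assumptions, actualColumns: Set<string>): string[] {
  const problems: string[] = [];

  for (const col of a.dbColumns) {
    if (!actualColumns.has(col)) {
      problems.push(`assumed column does not exist: ${col}`);
    }
  }
  for (const name of a.envVars) {
    if (process.env[name] === undefined) {
      problems.push(`missing environment variable: ${name}`);
    }
  }
  // A non-empty list means: do not accept the generated code yet.
  return problems;
}
```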
If you want a practical nudge: put assertions and sanity checks at the edges of any generated change and use a structured research step when you need sources. I often follow a quick verification pass with a focused look at references and diffs using a separate sourcing flow so I do not conflate synthesis and verification in one long chat session.
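As a concrete example of an edge check: this is a sketch of the assertion I wish had been sitting next to that nullable-column migration, assuming Postgres and the node-postgres ("pg") client; the table and column names are placeholders, not our real schema.

```typescript
import { Client } from "pg";

async function assertNotNullable(table: string, column: string): Promise<void> {
  const client = new Client(); // connection settings come from the usual PG* env vars
  await client.connect();
  try {
    const res = await client.query(
      `SELECT is_nullable FROM information_schema.columns
       WHERE table_name = $1 AND column_name = $2`,
      [table, column]
    );
    if (res.rows.length === 0 || res.rows[0].is_nullable !== "NO") {
      // The kind of check that would have caught the forbidden nullable column before it shipped.
      throw new Error(`${table}.${column} must exist and be NOT NULL after this change`);
    }
  } finally {
    await client.end();
  }
}
```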