Alan West

Posted on May 25

Why LLM Coding Agents Drift on Long Backend Tasks (and How to Fix It)

#ai #llm #backend #debugging

Last month I spent three days debugging a Django service where the AI agent had written... mostly correct code. The endpoints worked. The tests passed. But somewhere around the fourth file, it had quietly dropped a database transaction wrapper around a multi-step write. By file seven, it had forgotten that one of the models required tenant scoping.

This is constraint decay. And once you start watching for it, you see it everywhere.

What constraint decay actually is

When you hand an LLM agent a backend task, you give it a pile of constraints. Some are explicit (use this ORM, scope by tenant_id, wrap writes in transactions). Some are implicit (auth middleware applies to all routes, errors map to specific status codes). Early in the task, those constraints are fresh in context and the agent honors them.

As the task drags on, something predictable happens. The agent generates more code. That generated code pushes the original constraints further back in the context. By the time it's writing the eighth function, the original instructions are competing with thousands of tokens of its own output for attention weight. Constraints fade. Output drifts.

I should say upfront: I haven't read every paper on this in detail, and recent work like the Constraint Decay preprint on arXiv is still being discussed. But the phenomenon itself is reproducible at home. Build a long enough agent loop with enough constraints and you'll watch it happen on your own machine.

Root cause: it's not memory, it's signal-to-noise

The first instinct when you see drift is "well, just put it in the context window." Modern models have huge context windows. But window size isn't really the issue.

The issue is that attention is a softmax over the entire context. When your system prompt is 200 tokens and the surrounding generated code is 8000 tokens of similar-looking function names, types, and patterns, the relative weight on the constraint shrinks. The constraint is present. It's just not salient.
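A toy calculation makes the dilution concrete. This is a minimal sketch, not a model of any real transformer: it assumes a single softmax, one constraint token with a fixed logit advantage, and `n_other` competing tokens that all score zero. The function name and the numbers are illustrative only.

```python
import math


def constraint_weight(advantage: float, n_other: int) -> float:
    """Softmax weight on one constraint token that scores `advantage`
    logits above n_other competing tokens (all assumed to score 0)."""
    top = math.exp(advantage)
    # Each competing token contributes exp(0) == 1 to the denominator.
    return top / (top + n_other)


# Same constraint, same advantage; only the amount of surrounding text grows.
for n_other in (200, 2_000, 8_000):
    print(n_other, round(constraint_weight(5.0, n_other), 4))
```

Under these assumptions the weight on the constraint falls from roughly 0.43 to roughly 0.02 as the surrounding context grows from 200 to 8000 tokens, even though the constraint itself never changes.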

You can verify this with a quick experiment. Give an agent a constraint like "every database write must go through audit_log()." Have it write five files. By file four, direct writes will often sneak in. Re-prompting with just the original constraint typically restores compliance right away. The constraint never left the context; the model just stopped weighting it.

Step-by-step fix

Here's the pattern I've landed on after maybe a dozen agent-driven projects this year. It's not perfect. It does cut drift significantly.

1. Externalize constraints as checked artifacts

Don't rely on the agent remembering. Make the constraint a thing you can mechanically verify.

# constraints.py — source of truth for cross-cutting rules
INVARIANTS = {
    "tenant_scoping": {
        # any query on a multi-tenant model must include tenant_id
        "applies_to": ["User", "Invoice", "Subscription"],
        "check": "every .filter() / .get() includes tenant_id",
    },
    "audit_log": {
        # mutations to sensitive tables must be logged
        "applies_to": ["billing", "auth"],
        "check": "calls audit_log() before commit",
    },
}

Then write a small AST-walking linter that checks these. Now the constraint has an enforcer with teeth, one that doesn't decay.

2. Chunk the work, refresh between chunks

Long single-shot generation is where decay is worst. Break the task into chunks, and between chunks, replay the relevant constraints.

# decompose_task, build_prompt, summarize, and agent are project-specific helpers not shown here
def run_agent_task(task, constraints):
    chunks = decompose_task(task)
    results = []
    for chunk in chunks:
        # Constraints go near the top, fresh, every chunk
        prompt = build_prompt(
            constraints=constraints,
            prior_summary=summarize(results),  # summary, not full output
            current_chunk=chunk,
        )
        result = agent.run(prompt)
        results.append(result)
    return results

The key move is summarize(results) instead of dumping all prior code. A summary preserves the architectural decisions without crowding the constraint with thousands of code tokens.
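One possible shape for the prompt builder, as a sketch only. It assumes the constraints are a dict shaped like INVARIANTS in constraints.py (each entry has "applies_to" and "check"); the section labels and wording are made up for illustration.

```python
def build_prompt(constraints: dict, prior_summary: str, current_chunk: str) -> str:
    """Assemble a per-chunk prompt with the constraints placed first."""
    rules = "\n".join(
        f"- {name}: {spec['check']} (applies to: {', '.join(spec['applies_to'])})"
        for name, spec in constraints.items()
    )
    return (
        f"CONSTRAINTS (apply to all code you write):\n{rules}\n\n"
        # A short summary stands in for the full prior output.
        f"SUMMARY OF WORK SO FAR:\n{prior_summary or '(nothing yet)'}\n\n"
        f"CURRENT TASK:\n{current_chunk}"
    )
```

Because the constraint block is rebuilt from the same dict on every call, each chunk sees the rules in identical wording, close to the task, with only a compact summary in between.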

3. Use a separate constraint-check pass

After every chunk, run a separate, narrow LLM call whose only job is to check the new code against the constraints. Single responsibility, fresh context.

def check_chunk(generated_code, constraints):
    prompt = (
        "Check this code against the constraints below. "
        "For each constraint, answer PASS or FAIL with one line of evidence.\n\n"
        f"CONSTRAINTS:\n{format_constraints(constraints)}\n\n"
        f"CODE:\n{generated_code}"
    )
    return narrow_model.run(prompt)  # smaller, cheaper model is fine

This is much more reliable than asking the main agent to self-check, because the checker isn't carrying the cognitive load of generation. Its context is short and its attention is undivided.
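The checker's reply still has to be turned into something a script can act on. Here is a minimal parsing sketch. It assumes the checker answers one line per constraint in the form `name: PASS - evidence` or `name: FAIL - evidence`; that output format, and the name `parse_verdicts`, are assumptions, so adjust the pattern to whatever your checker prompt actually produces.

```python
import re

# Assumed line format: "<constraint_name>: PASS|FAIL - <evidence>"
VERDICT = re.compile(
    r"^\s*(?P<name>[\w-]+)\s*:\s*(?P<verdict>PASS|FAIL)\b\s*[-:]?\s*(?P<evidence>.*)$"
)


def parse_verdicts(report: str, expected: set[str]) -> dict[str, str]:
    """Map each failing constraint name to its evidence.

    A constraint the checker did not mention is treated as failed, so a
    truncated or malformed report never passes silently.
    """
    seen: set[str] = set()
    failures: dict[str, str] = {}
    for line in report.splitlines():
        m = VERDICT.match(line)
        if not m or m["name"] not in expected:
            continue
        seen.add(m["name"])
        if m["verdict"] == "FAIL":
            failures[m["name"]] = m["evidence"].strip()
    for name in expected - seen:
        failures[name] = "no verdict in checker output"
    return failures
```

An empty result means every expected constraint was explicitly marked PASS; anything else is a reason to reject the chunk.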

4. Make violations fail loud

When the checker finds a violation, don't try to "patch" the offending file. Roll back the chunk and regenerate with the violated constraint pinned at the very top, sometimes repeated. Repetition is ugly but it works. Models tend to give more weight to a constraint that appears more than once.

def regenerate_with_emphasis(chunk, violations):
    emphasized = "\n\n".join(
        f"CRITICAL CONSTRAINT (do not violate): {v}" for v in violations
    )
    # Yes, we repeat. Yes, it helps.
    return agent.run(emphasized + "\n\n" + chunk.prompt + "\n\n" + emphasized)
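The roll-back-and-regenerate loop around that call can be sketched as follows. The `generate` and `check` callables are placeholders for your own agent call and constraint checker, and the attempt cap of three is an arbitrary assumption.

```python
from typing import Callable


def generate_until_clean(
    generate: Callable[[list[str]], str],
    check: Callable[[str], list[str]],
    max_attempts: int = 3,
) -> str:
    """Regenerate a chunk until the checker reports no violations.

    generate(pinned) returns code, given the violations to pin in the prompt;
    check(code) returns the list of violated constraints (empty means clean).
    """
    pinned: list[str] = []
    for _ in range(max_attempts):
        code = generate(pinned)
        violations = check(code)
        if not violations:
            return code
        # Discard the attempt entirely; carry forward only what was violated.
        pinned = sorted(set(pinned) | set(violations))
    # Fail loud instead of returning code that is known to violate a constraint.
    raise RuntimeError(f"still violating after {max_attempts} attempts: {pinned}")
```

Raising after the cap keeps a stubborn violation from slipping through: the failure lands in front of a human instead of in the codebase.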

5. Keep human review on the constraint surface, not the code

This is the part people skip. You don't need to review every line the agent writes. You need to review the constraint set and the checker. If those two are correct, drift is bounded.

I have a habit now of starting every agent project by writing the constraints file first. Before any code. It feels weird because you're writing rules against code that doesn't exist yet, but it forces you to articulate the invariants up front, while you're still thinking clearly.

Prevention: design for bounded drift

A few patterns that keep the problem small in the first place:

Prefer narrow tasks. "Add a new endpoint" is bounded. "Build the whole admin panel" is not. Decay scales with task length, so shorter tasks decay less.
Use typed interfaces aggressively. When the agent has to satisfy a type signature, the type acts as a local, always-visible constraint. Tools like mypy or TypeScript catch a surprising amount of drift for free.
Lean on existing scaffolding. If your codebase has a BaseRepository that already enforces tenant scoping, any repository the agent writes on top of it gets the constraint through inheritance. The framework remembers what the agent forgets.
Don't trust passing tests. An agent that wrote both the code and the tests has aligned them to each other. Run the actual app. Hit the endpoints with curl. Check the database directly.
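To illustrate the scaffolding point, here is a small in-memory stand-in for a tenant-scoped BaseRepository. It is a sketch under stated assumptions: the `Invoice` dataclass, the list-backed storage, and the `over` method are hypothetical, and a real Django codebase would more likely enforce this in a custom manager or queryset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    id: int
    tenant_id: int
    total: float


class BaseRepository:
    """Reads go through _scoped(), so code built on get/filter stays tenant-filtered."""

    def __init__(self, rows: list, tenant_id: int):
        self._rows = rows
        self._tenant_id = tenant_id

    def _scoped(self):
        # The single place where tenant filtering happens.
        return (r for r in self._rows if r.tenant_id == self._tenant_id)

    def get(self, row_id: int):
        for row in self._scoped():
            if row.id == row_id:
                return row
        raise LookupError(f"no row {row_id} for tenant {self._tenant_id}")

    def filter(self, **conditions):
        return [
            r for r in self._scoped()
            if all(getattr(r, k) == v for k, v in conditions.items())
        ]


class InvoiceRepository(BaseRepository):
    # The kind of method an agent might add; it never mentions tenant_id.
    def over(self, amount: float):
        return [r for r in self._scoped() if r.total > amount]
```

The subclass method contains no tenant logic at all, yet it cannot return another tenant's rows as long as it builds on `_scoped()`, `get`, or `filter`.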

The honest summary: LLM agents are fantastic at the first hour of a task and progressively worse at the fourth. If you architect your workflow around that reality — short chunks, external constraints, mechanical checking — the drift becomes manageable. If you treat the agent like a junior dev who remembers everything you said, you'll be debugging silent constraint violations for days.

Three days, in my case. I've structured my projects differently since.

DEV Community