Why My AI Agent Kept Getting Things Wrong (And What Actually Fixed It)
At first, it worked.
I gave the AI a clear prompt. It responded well. Structured, relevant, even a bit impressive.
Then I tried again.
Same prompt. Slightly different output.
Then again — and something felt off.
Not completely wrong… just inconsistent.
That’s when it became a problem.
Because I wasn’t building a demo. I was building a product.
The Problem: “Almost Right” Is Not Good Enough
When you’re working with LLMs in isolation, variability is fine. Even interesting.
When you’re building something people rely on — it isn’t.
I started seeing patterns:
- Outputs drifting in structure
- Key instructions being ignored
- Tone and formatting changing between runs
- Occasionally… things were just made up
Nothing catastrophic. Just unreliable.
And that’s worse.
Because you can’t trust it.
The Context: This Wasn’t Just a Chatbot
One important detail — this wasn’t an internal tool or a sandbox experiment.
This was a user-facing AI agent, interacting with both:
- logged-in users (with context, data, and history)
- prospective users (with no context at all)
Which meant I effectively needed two behaviours:
- one that could operate with structured internal data and constraints
- one that could explain, guide, and respond more openly without access to that context
Trying to handle both with the same prompt quickly broke down.
The agent would:
- assume context that didn’t exist
- overreach when it should stay generic
- or lose structure when switching between modes
That’s when it became clear the issue wasn’t just prompting — it was context control and behavioural separation.
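One way to picture that separation: pick the behaviour mode per request, based on whether user context exists at all. This is a minimal sketch, assuming a simple request handler; the names (`UserContext`, `build_system_prompt`) and rule strings are illustrative, not the actual product code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserContext:
    # user_id is None for prospective users with no account data
    user_id: Optional[str] = None
    history: Optional[list] = None

AUTHENTICATED_RULES = (
    "Use only the supplied account data and history. "
    "Never assume facts outside the provided context."
)

ANONYMOUS_RULES = (
    "No account data is available. "
    "Stay generic: explain and guide, never reference user-specific details."
)

def build_system_prompt(ctx: UserContext) -> str:
    """Select exactly one behaviour mode, instead of one prompt serving both."""
    if ctx.user_id is not None:
        return AUTHENTICATED_RULES
    return ANONYMOUS_RULES
```

The point is that the two modes never share a prompt, so the agent can't assume context that doesn't exist.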
Why This Happens (and Why It’s Not a Bug)
It took a bit of stepping back to realise:
The model wasn’t failing — I was asking it to behave like something it isn’t.
LLMs are:
- Stateless (unless you force context)
- Probabilistic (not deterministic)
- Context-sensitive (and context degrades fast)
What I was treating as “rules” were really just:
Suggestions with good intentions
Even system prompts didn’t fully solve it.
They help — but they don’t enforce behaviour.
What I Tried First (and Why It Didn’t Work)
Like most people, I went through the usual iterations:
- Making prompts longer
- Repeating instructions
- Adding “IMPORTANT:” everywhere
- Trying to be hyper-specific
It improved things slightly… but not enough.
The problem wasn’t clarity.
The problem was control.
The Shift: From Prompts to Systems
The breakthrough came when I stopped thinking in terms of prompts and started thinking in terms of structure.
Instead of:
“Tell the model what to do”
I moved to:
“Define how the model is allowed to behave”
That’s a completely different mindset.
What I Built: A Structured Instruction Layer
I ended up creating what I originally called an “instruction bible”.
In reality, it’s closer to a structured instruction system layered on top of the model.
1. Persistent rules (not buried in prompts)
Instead of mixing everything into one prompt, I separated:
- Role definition
- Behaviour rules
- Output constraints
Example:
```json
{
  "role": "compliance_ai",
  "rules": [
    "Do not invent regulations",
    "Flag uncertainty explicitly",
    "Prioritise clarity over completeness"
  ],
  "output_format": "structured_sections"
}
```
This becomes the source of truth, not just part of the conversation.
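In practice that means the rule definition lives in its own file and gets rendered into the system message on every request, rather than being hand-written into prompts. A minimal sketch, assuming the JSON shape above; `render_system_message` is an illustrative helper, not a library API.

```python
import json

# Load the rule definition (in a real setup, from a versioned config file).
RULESET = json.loads("""
{
  "role": "compliance_ai",
  "rules": [
    "Do not invent regulations",
    "Flag uncertainty explicitly",
    "Prioritise clarity over completeness"
  ],
  "output_format": "structured_sections"
}
""")

def render_system_message(ruleset: dict) -> str:
    """Render the persistent rules into a system message, identically every time."""
    lines = [f"You are acting as: {ruleset['role']}"]
    lines += [f"- {rule}" for rule in ruleset["rules"]]
    lines.append(f"Required output format: {ruleset['output_format']}")
    return "\n".join(lines)
```

Because the message is generated, not retyped, every conversation starts from the same source of truth.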
2. Modular instructions
Different tasks = different instruction sets.
Instead of one giant prompt, I used:
- Generation mode
- Review mode
- Analysis mode
Each with its own constraints.
This reduced cross-contamination between behaviours.
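The mode separation can be as simple as a lookup table with an explicit failure for unknown modes, so a task can never silently run under the wrong rules. The mode names match the list above; the rule strings are illustrative placeholders.

```python
# One instruction set per task mode, selected explicitly.
MODES = {
    "generation": {
        "rules": ["Produce new content only", "Follow the section template"],
    },
    "review": {
        "rules": ["Comment on the existing text only", "Never rewrite content"],
    },
    "analysis": {
        "rules": ["Summarise findings as bullet points", "Cite the source section"],
    },
}

def instructions_for(mode: str) -> dict:
    """Return the instruction set for a mode; fail loudly on an unknown mode."""
    if mode not in MODES:
        raise ValueError(f"Unknown mode: {mode!r}")
    return MODES[mode]
```

Failing loudly matters here: a typo in the mode name should be an error, not a fallback to some default behaviour.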
3. Controlled outputs
I stopped accepting “natural” responses.
Everything had to follow a structure.
For example:
- Sections must exist
- Headings must match
- Lists must be formatted consistently
If the output didn’t comply, it was rejected or reprocessed.
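A compliance check like that can be a plain function that gates every response before it reaches the user. This is a minimal sketch: the heading names are hypothetical, and a real validator would also check list formatting.

```python
# Illustrative required structure; real headings depend on the product.
REQUIRED_HEADINGS = ["## Summary", "## Details", "## Risks"]

def is_compliant(output: str) -> bool:
    """Accept only outputs containing every required heading, in order."""
    last_pos = -1
    for heading in REQUIRED_HEADINGS:
        pos = output.find(heading)
        if pos <= last_pos:  # missing (-1) or out of order
            return False
        last_pos = pos
    return True
```

Anything that fails the check gets rejected or sent back for regeneration; the model never gets to define what "acceptable" means.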
4. Reduced ambiguity
I removed anything vague.
No:
- “be helpful”
- “be clear”
- “be concise”
Instead:
- Define structure
- Define constraints
- Define boundaries
The model performs much better when it has less room to interpret.
What Changed
Once this layer was in place, the difference was immediate.
- Outputs became consistent
- Structure stabilised
- Hallucination dropped significantly
- Reuse became possible
Most importantly:
I could actually trust the output in a product setting
Not perfect — but predictable.
The Bigger Realisation
The real lesson wasn’t about prompts.
It was this:
Prompt engineering doesn’t scale. Systems do.
You can get good results with clever prompts.
But if you want:
- reliability
- repeatability
- product-grade output
You need structure.
Where This Fits in the Bigger Picture
This lines up with a broader shift happening right now:
- From chatbots → agents
- From prompts → orchestration
- From “AI responses” → controlled systems
We’re moving away from:
“Ask the model something”
Toward:
“Design how the model operates”
Final Thought
LLMs are powerful — but they’re not plug-and-play components.
If you want to build something real with them, you have to accept:
- You’re not just writing prompts
- You’re designing behaviour
And once you start treating it that way, everything changes.
If you’re building with AI and hitting similar issues, I’d be interested to hear how you’re handling it — especially where things break.