Most AI agents don’t reliably follow directions, and that’s one of the biggest reasons they never make it from POC to production.
This is how deploying agents usually plays out: you write clear instructions in your prompt, test against every scenario you can think of, and ship it. Then the agent skips steps, drifts from your guidelines, or invents behavior you didn't anticipate. So you add more detail, more constraints, more explicit directions.
The prompt is huge by now, but you're sure you've captured all the rules. You deploy again. Same problem. Eventually, you hit a wall and give up.
I ran into this firsthand trying to create a simple AI assistant to help me write. I gave it samples of my writing style, told it to write like me, and it did start off okay. But after a few turns it drifted back into generic AI-speak. I'm talking em dashes everywhere, staccato sentences for dramatic effect, and that weird "It's not about X, it's about Y" framing that sounds profound but actually says nothing. By the end of a long session, the output usually sounds nothing like me.
This example makes the problem obvious because you can read the output and immediately tell something’s off. But the same thing happens in more serious scenarios, like compliance checks, customer support flows, or multi-step workflows where the stakes are higher.
What's actually happening is that as conversations get longer, the model pays less attention to earlier instructions.
Prompt engineering helps, but it can only take you so far. What you need is a feedback loop that catches drift and corrects it before the response ever reaches the user.
The Agent Buddy System
Instead of trying to make one agent behave perfectly, the solution was to introduce a second agent to the system. One does the work, and the other checks it. That’s what I’ve been calling the agent buddy system.
The main agent handles the task: writing, reasoning, calling tools, whatever it needs to do. The buddy sits alongside it, watching the output. If the agent skips a step, tries to misuse a tool, or drifts from the defined rules, the buddy steps in and helps get things back on track.
The idea is simple: don’t rely on the model to always follow instructions. Assume it will drift, and build something that corrects it when it does.
This is essentially using an LLM as a judge. The evaluator model inspects the output from the worker model and decides whether it meets the criteria. If it does, the response goes through. If not, it sends guidance and the agent can try again.
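Stripped of any framework, that loop looks something like this. Here `call_worker` and `call_judge` are stand-ins for real LLM calls, and the APPROVE/REJECT protocol is just one way to structure the judge's verdict; this is a sketch of the pattern, not any SDK's API.

```python
# Framework-agnostic sketch of the worker/judge feedback loop.
# call_worker and call_judge stand in for real LLM calls; the names
# and the APPROVE/REJECT protocol are illustrative, not from any SDK.

def review_loop(task, call_worker, call_judge, max_retries=3):
    """Run the worker, let the judge veto, and retry with targeted feedback."""
    draft, feedback = None, None
    for _ in range(max_retries):
        draft = call_worker(task, feedback)
        verdict = call_judge(draft)  # e.g. "APPROVE" or "REJECT: <reason>"
        if verdict.startswith("APPROVE"):
            return draft
        feedback = verdict.partition("REJECT:")[2].strip()
    return draft  # retries exhausted: let the last attempt through

# Toy stand-ins to show the control flow:
def call_worker(task, feedback):
    if feedback is None:
        return "It's not about X, it's about Y."
    return "Plain, concrete sentence."

def call_judge(draft):
    return "REJECT: pseudo-profound reframing" if "not about" in draft else "APPROVE"

print(review_loop("write an intro", call_worker, call_judge))
```

The retry cap matters: without it, a worker and judge that disagree forever will loop indefinitely.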
It turns out that two models keeping each other honest is safer than one model that just does whatever it wants.
You can build this pattern yourself, but I used the Strands Agents SDK because it already supports this kind of feedback loop through a feature called steering.
Steering lets you inject just-in-time guidance into the agent’s execution instead of front-loading everything into a massive prompt and hoping for the best.
Under the hood, Strands steering works through hooks in the agent’s lifecycle. You can intercept tool calls before they execute to run custom validations, or evaluate the model’s response after it’s generated to check things like tone, format, or adherence to the prompt.
The steering agent intercepts the call and returns one of three actions: Proceed (accept), Guide (reject with feedback for retry), or Interrupt (escalate to a human).
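In plain Python, those three outcomes are easy to model. The class names below mirror the article's, but these are sketch dataclasses and a hypothetical dispatcher, not the SDK's real types:

```python
from dataclasses import dataclass

# Minimal model of the three steering outcomes. The names mirror the
# article; these dataclasses are a sketch, not the Strands SDK's classes.
@dataclass
class Proceed:
    reason: str

@dataclass
class Guide:
    reason: str

@dataclass
class Interrupt:
    reason: str

def handle(action, send, retry, escalate):
    """Dispatch a steering decision to the matching follow-up callback."""
    if isinstance(action, Proceed):
        return send(action.reason)
    if isinstance(action, Guide):
        return retry(action.reason)
    return escalate(action.reason)
```

The point of the three-way split is that "reject" is not one thing: a fixable violation becomes a retry with feedback, while something the model shouldn't decide on its own goes to a human.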
Building a Writing Buddy
To fix my AI writing problem, I built a steering handler that checks every response against a style guide with examples of my actual writing. If the output doesn’t sound like me, the handler catches it and asks for a rewrite before I ever see it.
In Strands, this means creating a SteeringHandler and attaching it to your agent as a plugin.
For my use case, I only needed to evaluate the final output, so I used steer_after_model() to inspect each response and decide whether to accept it or send it back with feedback.
Here’s my VoiceSteeringHandler:
```python
class VoiceSteeringHandler(SteeringHandler):
    """Evaluates writing output against a style guide using an LLM judge.

    Intercepts model responses via steer_after_model and uses a separate
    steering agent to check for style violations. If a violation is found,
    it guides the agent to rewrite with targeted feedback.
    """

    def __init__(self, style_guide: str, max_retries: int = 3):
        super().__init__(context_providers=[])
        self.style_guide = style_guide
        self.max_retries = max_retries
        self.retry_count = 0

    async def steer_after_model(
        self, *, agent: "Agent", message: Message, stop_reason: StopReason, **kwargs: Any
    ):
        """Evaluate model output against the style guide."""
        print("\n[STEERING] Evaluating model output...")
        text = " ".join(
            block.get("text", "") for block in message.get("content", [])
        )

        if self.retry_count >= self.max_retries:
            self.retry_count = 0
            return Proceed(reason="Max retries reached, accepting output")

        # Use a separate steering agent as an LLM judge
        steering_agent = Agent(
            system_prompt=f"""You evaluate writing against a style guide.
Catch clear violations, not nitpicks.

STYLE GUIDE:
{self.style_guide}

REJECT for: banned words/phrases from the style guide, em dashes,
"It's not X. It's Y." reframing, obvious marketing tone, or meta-commentary.
APPROVE if: tone is developer-to-developer with no banned words/phrases/patterns.
When in doubt, APPROVE.

Respond with APPROVE or REJECT: [quote the violation].""",
            model=agent.model,
            callback_handler=None,
        )
        result = str(steering_agent(f"Evaluate this text:\n\n{text}"))

        if "REJECT:" in result.upper():
            self.retry_count += 1
            feedback = result.split("REJECT:", 1)[-1].strip()
            return Guide(
                reason=f"Fix this issue: {feedback[:300]}. "
                "Only fix the cited issue. Output only the content, nothing else."
            )

        self.retry_count = 0
        return Proceed(reason="Output approved by steering agent")
```
Then to attach it to your main agent, you use a plugin like this:

```python
model = BedrockModel(
    model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",
    region_name="us-east-1",
)
return Agent(
    model=model,
    system_prompt=f"""You are a writing assistant that writes in a specific voice.
Follow every rule in the style guide below. Output only the requested writing.
Never add meta-commentary or questions like "Would you like me to adjust?"

STYLE GUIDE:
{style_guide}""",
    plugins=VoiceSteeringHandler(style_guide=style_guide),
)
```
When the steering agent sees the output doesn't match, the handler returns Guide with specific feedback. The agent discards its response and tries again, knowing exactly what went wrong. After max_retries attempts, it lets the response through rather than looping forever.
The evaluator prompt checks for voice match against your examples, but also flags AI vocabulary (words like "crucial," "delve," "tapestry"), structural patterns (em dashes, pseudo-profound reframing), and other tells that make text sound machine-generated. You give it paragraphs from your actual writing, and it asks "does this new text sound like these examples?" It's essentially a style linter powered by an LLM.
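Some of those tells are mechanical enough to catch without spending a judge call at all. A deterministic pre-scan along these lines (the word list and regexes are illustrative, not a complete style guide) can short-circuit the obvious cases and leave only the genuine judgment calls to the LLM:

```python
import re

# Illustrative deterministic pre-checks for common AI "tells".
# The banned-word list and patterns are examples, not a full style guide.
BANNED_WORDS = {"crucial", "delve", "tapestry", "leverage"}
REFRAME = re.compile(
    r"it'?s not (about )?\w+[.,;]? it'?s (about )?\w+", re.IGNORECASE
)

def lint_ai_tells(text: str) -> list[str]:
    """Return the violations found; an empty list means the text passes."""
    violations = []
    words = {w.strip(".,!?").lower() for w in text.split()}
    for banned in sorted(BANNED_WORDS & words):
        violations.append(f"banned word: {banned}")
    if "\u2014" in text:  # em dash
        violations.append("em dash")
    if REFRAME.search(text):
        violations.append("'It's not X, it's Y' reframing")
    return violations
```

Anything a regex can catch is cheaper, faster, and more repeatable than a model; the LLM judge earns its cost on the fuzzy question "does this sound like me?"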
That’s a judgment call, and this is where steering really shines. Instead of trying to build complicated, deterministic evaluation logic, you let a model make that call and provide targeted feedback.
Does It Work?
Yes. Here’s what I saw in my own testing before getting into larger-scale results.
I ran a small evaluation: 5 multi-turn writing sessions where a simulated user iteratively refines a piece, repeated 5 times each using Claude Sonnet 4.5. That's the kind of back-and-forth that happens in real writing workflows, and it's where drift becomes noticeable. The baseline voice adherence averaged 25% by the end of the sessions, but the steered version held at 100%.
For single-turn prompts with more capable models, both performed about the same for a small evaluation dataset, because larger models are already pretty good at following style guides on their own. The difference shows up in the longer sessions where drift compounds, or when weaker models are used.
That's a modest eval set, so take the exact numbers directionally rather than as gospel. But the pattern consistently showed unsteered sessions degraded noticeably after a few turns, while steered sessions stayed on voice throughout.
The more compelling evidence comes from Clare Liguori, Senior Principal Software Engineer at AWS, who ran a similar evaluation at a much larger scale. She tested five approaches to guiding agent behavior on a library book renewal agent across 3,000 runs.
- Simple prompt instructions reached 82.5% accuracy, meaning roughly one in five interactions failed
- Agent SOPs hit 99.8%, but at 3x the token cost
- Graph-based workflows reached 80.8%, often failing outside predefined paths
- Steering hit 100% across 600 runs while using 66% fewer input tokens than SOPs and 47% fewer output tokens than workflows
The most common failure without steering was skipping the book status check before renewing (43% of failures), followed by missing the confirmation message (40%). These are exactly the kinds of steps models deprioritize as context grows.
Things To Consider
This pattern works well, but there are a few things you should consider.
Latency
Each steering intervention adds another model call. If the handler returns Guide, the agent has to regenerate with feedback, which can mean two or three round trips for a single response. Once you add in tool calls, latency becomes a real factor.
That’s fine for background tasks or workflows where accuracy matters more than speed. But it’s the wrong tradeoff for real-time applications where users expect quick responses and the stakes are low.
Token costs
Tokens do add up, but the picture is more nuanced than you might expect.
Steering uses more tokens than simple prompt instructions because you’re sending feedback back to the agent when it strays. But compared to approaches that actually achieve high accuracy, like SOPs, steering is often more efficient.
Try the single-prompt approach first, and reach for steering only when that isn't enough.
Steering prompt quality
The quality of your steering prompts directly impacts performance.
If your handler gives vague feedback, the agent can get stuck retrying without improving. Set retry limits, make your Guide feedback specific, and if the same correction keeps firing, fix the prompt instead of increasing retries.
And remember, you're using a model to judge another model. That means they can share the same blind spots. If both the worker and the evaluator miss the same kind of mistake, steering won't catch it.
Try using two different models, and for high-stakes use cases, pair this with deterministic checks where you can.
When not to use steering
Steering assumes you have a clear definition of "correct." That works for style guides, compliance rules, and structured workflows. It doesn't work as well for creative tasks where you actually want the model to surprise you because steering will pull it back toward whatever your evaluator thinks is right. And if your criteria can be expressed as deterministic checks (regex, schema validation, rule engines), maybe skip steering. It's slower, costs more, and adds uncertainty where you don't need it.
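As a concrete contrast: if the rule is "the response must be JSON with these fields", a plain validation function needs no judge at all. The field names below are made up for illustration, loosely echoing the book renewal example:

```python
import json

# Hypothetical schema for a book-renewal response; field names are
# invented for illustration, not from any real agent.
REQUIRED_FIELDS = {"book_id", "status", "confirmation"}

def valid_renewal_response(raw: str) -> bool:
    """Deterministic check: parseable JSON containing every required field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()
```

Zero extra tokens, zero extra latency, and the same answer every time. Save the LLM judge for criteria you can't write down as code.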
Beyond Writing Assistants
Reliable agents come from the systems you build around them.
Steering applies anywhere an agent needs consistent behavior over time. Customer service agents maintaining tone across dozens of interactions, code review bots enforcing your team's conventions, or compliance workflows where skipping a step has real consequences.
The pattern is the same: evaluate the output, provide guidance, retry if needed. You just swap the evaluator criteria.
Clare Liguori’s post walks through her full evaluation of the library book renewal agent. The steering documentation covers the full API.
Some agents need a buddy to keep them on track. Steering gives you that.
