
Michael O

Originally published at xeroaiagency.com

How to Write an AI Agent Prompt That Actually Works (Not Just Once)

Most founders who complain that AI "isn't reliable" are writing prompts the same way they write text messages. Casual. Vague. Context-free. Then they wonder why the output is inconsistent.

The problem isn't the model. It's the prompt architecture.

When I started running Xero on AI agents, the single biggest leverage point wasn't picking the right model or buying better tools. It was learning how to write prompts that produce the same quality of output on the 50th run as they did on the first. This post covers exactly that.

What is the difference between a regular prompt and an agent prompt?

A regular prompt is a one-shot instruction that ends when you get your result. An agent prompt is a persistent instruction set that fires on a schedule without you in the loop. The design requirements are completely different, which is why most prompts written for chat interfaces fail in automation.

Writing agent prompts means thinking about:

  • What the agent needs to know before it can start (context)
  • What success looks like specifically (output format)
  • What should happen when something goes wrong (error handling)
  • What it should never do (constraints)

Most tutorials only cover the first one. That's why most agent prompts drift or break after a few runs.
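
To make "fires on a schedule without you in the loop" concrete, here's a minimal Python sketch of the structural difference. run_agent is a placeholder for whatever model call your framework makes, and the daily interval is arbitrary:

import time

AGENT_PROMPT = """(identity, task rules, output format, and edge case
handling all go here, because no one is present to fill the gaps)"""

def run_agent(prompt: str) -> str:
    # Placeholder: call your model or automation framework here.
    raise NotImplementedError

while True:
    try:
        output = run_agent(AGENT_PROMPT)
        print(output)  # in a real setup, deliver to Telegram, email, etc.
    except Exception as exc:
        print(f"agent run failed: {exc}")  # log it; silence is a failure mode
    time.sleep(24 * 60 * 60)  # fire once a day, with no one watching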

Why should you start with role and context rather than the task?

Leading with the task before context is the most common structural mistake. The model doesn't know who it is, what business it serves, or what success looks like for this workflow. A clear role and context block gives the agent a frame for interpreting every subsequent instruction, dramatically reducing judgment calls you never intended to delegate.

Bad:

Summarize today's news about AI startups and post it to my Telegram.

Better:

You are the content curator for Xero, a solo-founder AI agency. 
Your job is to find and summarize 3 AI startup stories that are relevant 
to non-technical founders building with AI. Avoid stories about model 
releases from major labs unless they have direct tooling implications. 
Summarize each story in 2 sentences max. Then format the output as a 
Telegram message with no markdown headers, just plain text with line breaks.

The second version tells the agent who it is, what its filter criteria are, what to exclude, how to format output, and what platform it's writing for. That's not over-engineering. That's what a real employee needs before they can do the job without calling you every five minutes.

What is the four-part structure that keeps agent prompts consistent over time?

After running hundreds of agent automations, I've landed on a four-part structure that holds up. Each section handles a different failure mode: identity prevents context drift, task rules prevent output variation, format specs prevent format inconsistency, and edge case handling prevents the silent failures you only catch two weeks later, when something has been broken the whole time. A skeleton you can copy follows the list.

1. Identity + purpose (2-4 sentences)
Who is this agent, what business context does it operate in, what is its single job in this automation.

2. Task + rules (the bulk)
What to do, in what order, with what constraints. Specific beats general. "Find 3 posts from the last 24 hours" beats "find recent posts." Give the agent a checklist it can follow mentally.

3. Output format (explicit, not implied)
What the output should look like, word counts if relevant, what to include and exclude, what platform it's going to. If the output feeds another tool, describe the expected format for that tool.

4. Edge case handling (often skipped)
What happens when the source has no results? When an API fails? When the content doesn't meet quality criteria? Agents that don't have an answer to these questions either hallucinate through the problem or silently fail. Neither is good.

What quality gates do most people skip when building agent prompts?

Most people skip the constraints layer entirely. Hard constraints beat soft ones, self-evaluation steps catch errors before output ships, and explicit completion definitions close loops that vague instructions leave open. These three additions account for most of the reliability gap between agents that run clean for 90 days and agents that drift within a week.

Hard constraints beat soft ones. "Don't include anything from major tech publications" is better than "prefer indie sources." Give the agent a rule it can apply as a binary check, not a judgment call.

Self-evaluation beats hoping. Add a step where the agent reviews its own output before finalizing it. Something like: "Before you output the final message, check: does it meet all the criteria above? If not, revise it." This one instruction dramatically reduces drift.

Specify what done looks like. "The task is complete when X has been sent to Y and confirmed received" is different from "send X to Y." The first version closes the loop. The second leaves room for the agent to think it finished when it only partially ran.
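
The same layer can be enforced outside the model. A hedged Python sketch, assuming a plain-text Telegram message as the output; the blocked sources, checks, and function names are all illustrative:

BLOCKED_SOURCES = ("techcrunch.com", "theverge.com")  # hard rule, binary check

def passes_quality_gate(message: str) -> bool:
    # Hard constraints as binary checks, not judgment calls.
    text = message.lower()
    if any(source in text for source in BLOCKED_SOURCES):
        return False
    if "#" in message:  # plain text only: no markdown headers
        return False
    return True

def finalize(message: str, send) -> bool:
    # "Done" means sent and confirmed received, not just attempted.
    if not passes_quality_gate(message):
        return False
    confirmed = send(message)  # send() should return a delivery confirmation
    return bool(confirmed)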

For more on building quality gates into workflows, the posts on how to build an AI agent decision framework and on AI agent guardrails are both worth reading alongside this one.

Do temperature and model choice actually matter as much as people think?

No. Model choice accounts for maybe 20% of output quality for recurring agent tasks. The prompt accounts for the rest. A tight, specific prompt running on a mid-tier model will outperform a vague prompt on a premium model almost every time. The fixation on model selection is usually a sign that the prompt architecture hasn't been solved yet.

For recurring agent tasks, start with the fastest capable model available, then only upgrade if the output genuinely requires more nuance. Most daily automation tasks don't need a top-tier model. They need a better prompt.

Temperature: lower (0.3 to 0.5) for anything that needs to be consistent and formatted, like reports, summaries, and scheduled posts. Higher for brainstorming, drafts, or creative copy. Cranking temperature doesn't make agents "smarter." It makes them less predictable. For most scheduled automation, that's the opposite of what you want.
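
As one concrete example, here's how that split looks with the OpenAI Python SDK. The same temperature parameter exists in most model APIs, and the model name below is illustrative, not a recommendation:

from openai import OpenAI

client = OpenAI()

def run_scheduled_summary(prompt: str) -> str:
    # Low temperature: consistent, formatted output for recurring jobs.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # start with a fast, capable model; upgrade only if needed
        temperature=0.3,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def run_brainstorm(prompt: str) -> str:
    # Higher temperature: more variation for drafts and creative copy.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.9,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content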

Anthropic's model documentation and OpenAI's prompt engineering guide both reinforce this: clarity of instruction matters more than model tier for structured tasks.

What is the memory problem that makes agent prompts drift over time?

Context window growth is the hidden culprit behind most prompt drift. As an agent runs across multiple steps, earlier instructions carry less weight than recent content. Prompts that felt solid at first produce inconsistent outputs not because you changed anything, but because the model reads your instructions from a different position in a much longer window.

A few structural choices help:

Put your most important constraints at both the beginning and end of the prompt. Models pay more attention to the start and end of long inputs. If there's a rule the agent absolutely cannot violate, say it twice.
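
In an automation framework this can be mechanical: assemble the prompt so the non-negotiable rules bookend everything else. A small sketch, with illustrative rule text:

HARD_RULES = ("Never include stories from major-lab model releases. "
              "Output plain text only, no markdown.")

def build_prompt(identity: str, task: str, output_format: str) -> str:
    # State the critical rules first, then restate them last,
    # where attention over long inputs is strongest.
    return "\n\n".join([
        HARD_RULES,
        identity,
        task,
        output_format,
        "Before finalizing, re-check: " + HARD_RULES,
    ])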

Keep prompts shorter than you think they need to be. Every unnecessary sentence dilutes the weight of the important ones. Aim for prompts that an experienced human could follow from memory after reading once.

For longer-running agents that accumulate context over multiple steps, the architecture question of how memory works is separate from prompt design. The post on how to give an AI agent persistent memory covers the structural side of that problem.

What should you do when your agent prompt keeps producing inconsistent output?

If you've had three or more runs produce inconsistent output, the problem is almost always one of four things: ambiguous criteria the agent is resolving on its own, missing context it needs but wasn't given, an overloaded task trying to do too much in one prompt, or a missing output spec that lets the format drift between runs.

Ambiguous criteria. The agent is making judgment calls you didn't define. Go back and define them explicitly.

Missing context. The agent doesn't know something it needs to do the job well. Add it to the identity section.

Overloaded task. One prompt is trying to do too many things. Split it into two agents with a handoff between them.

No output spec. The agent is choosing its own format and it keeps changing. Lock down the format explicitly.

Keeping a short log of failed agent runs with notes on what went wrong pays dividends fast. After a few months you see patterns. Almost every failure traces back to one of those four things.
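
That log doesn't need to be elaborate. A minimal sketch, one JSONL line per failed run; the file name and cause labels are arbitrary:

import datetime
import json

# The four recurring causes, as labels you can count later.
CAUSES = ("ambiguous_criteria", "missing_context", "overloaded_task", "no_output_spec")

def log_failed_run(agent: str, cause: str, note: str) -> None:
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent": agent,
        "cause": cause,  # one of CAUSES
        "note": note,
    }
    with open("failed_runs.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

log_failed_run("news-curator", "ambiguous_criteria", "summarized a 3-day-old story as 'today'")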

How does running agent prompts in an automation framework differ from running them in chat?

In chat, you're present to course-correct. In automation, no one's watching. Every failure mode needs a defined response because there's no human catching edge cases in real time. Silence is a failure mode. Partial output is a failure mode. Output that looks right but isn't is the worst one, because you won't catch it without explicit checks.

The prompts I write for Xero's scheduled agents are meaningfully different from the prompts I use in chat. More rigid, more explicit about error handling, less reliant on the model's judgment. If you're moving from chat-based AI use to actual agent automation, recalibrate for that difference before anything goes live.
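
A sketch of that recalibration in code: every failure mode gets a defined response before a run counts as done. run_agent, looks_valid, deliver, and alert are placeholders for whatever your stack provides:

def run_scheduled(prompt: str, run_agent, looks_valid, deliver, alert) -> None:
    try:
        output = run_agent(prompt)
    except Exception as exc:
        alert(f"run crashed: {exc}")       # API failure: defined response
        return
    if not output or not output.strip():
        alert("run produced no output")    # silence is a failure mode
        return
    if not looks_valid(output):
        alert("output failed validation")  # looks right but isn't: the worst one
        return
    deliver(output)                        # only now is the run complete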

If you want to see what a full agent stack looks like in practice, the AI agent stack for solo founders post breaks down how the pieces fit together across a real solo business.

Where to go from here

If you're building your first real agent automation and want a structured starting point, the $7 Solo Founder AI Guide covers prompt architecture, the task breakdown framework I use at Xero, and the exact setup I run for recurring agent jobs. It's the fastest shortcut I have for skipping the 6-month learning curve.

Or if you want help building a specific agent for your business, book a Build Lab session and we'll scope it, prompt it, and test it together in 90 minutes.

The model isn't the bottleneck. The prompt is. Fix the prompt first.


Want to build your own AI co-founder?

I'm building Xero in public — an AI system that runs distribution, content, and ops while I work a full-time job.
