Juan Torchia

Posted on May 29 • Originally published at juanchi.dev

System prompts for production agents: the format that survived 3 redesigns

#english #typescript #produccion #anthropic

The right way to make an agent do less is to write more in the system prompt. I know that sounds backwards. Let me explain.

The first instinct when an agent goes off the rails is to cut its permissions at the application logic level. Validations, guardrails, filters over the response. But the problem usually lives earlier: the model doesn't have a clear contract for what's expected of it. And without a contract, it optimizes for appearing helpful. Not for being correct.

My thesis: a well-structured system prompt isn't documentation for the model — it's a contract. The most important sections aren't the ones describing the role — anyone can write those — but the ones that define explicit limits and the expected output format. Without those two, the agent fills in the gaps with whatever it thinks you want to hear.

This isn't a universal conclusion. It's the pattern I found after redesigning the same format multiple times, backed by the official Anthropic prompt engineering guide.

The concrete problem this format solves

There's a pattern I keep seeing in teams that are just starting with agents: the system prompt is a prose paragraph describing the model's character. "You are an expert assistant in X. Respond clearly and concisely."

That works for demos. In production, that prompt hits cases the author never imagined. The model has no instructions for what to do when the question is out of scope, when it's missing the data it needs to respond correctly, or when whatever consumes the response expects a specific format.

The typical result is one of two extremes: the agent invents information to seem complete, or gives responses so generic they're useless. Both are the same design error: the prompt has no edge-case contracts.

Anthropic's guide is explicit on this: separating the system prompt from the human turn has an architectural purpose. The system prompt establishes identity, capabilities, and constraints. It's not a place to mix instructions and dynamic context with no structure.

The four sections that survived three redesigns

The format I use now has fixed sections, marked with uppercase headings so the model identifies them without ambiguity. The structure ended up like this:

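// Build the system prompt with fixed sections.
// Dynamic context is injected only into CONTEXT, never into ROLE or LIMITS.

// Shape inferred from how ctx is used below; adjust to your own types.
interface AgentContext {
  processingDate: string;       // e.g. an ISO 8601 date string
  schema: unknown;              // serialized with JSON.stringify
  supportedLanguages: string[];
}

function buildSystemPrompt(ctx: AgentContext): string {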
  return `
ROLE
You are a document processing agent. You analyze structured text and extract entities defined in the schema.

LIMITS
- Do not infer information that isn't present in the source document.
- If a required field doesn't appear in the text, return null for that field. Do not invent a plausible value.
- Do not answer questions outside the scope of entity extraction.
- If the document is in an unsupported language, return a structured error — do not attempt to translate.

CONTEXT
Processing date: ${ctx.processingDate}
Expected entity schema: ${JSON.stringify(ctx.schema, null, 2)}
Supported languages: ${ctx.supportedLanguages.join(', ')}

OUTPUT FORMAT
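Return exclusively valid JSON with this structure. No additional text, no markdown, no explanations.
"entities" is an array of objects matching the schema in CONTEXT, "confidence" is a number between 0 and 1, and "errors" is an array of strings (empty if there are no errors):
{
  "entities": [],
  "confidence": 0.0,
  "errors": []
}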
`.trim();
}

ROLE: describes what the agent does, not what its personality is like. That distinction matters: the role is functional, not aesthetic.

LIMITS: this is the section I rewrote the most times. The first design didn't have it. The second had it mixed into the role. The third separated it but used vague language ("don't exaggerate", "be conservative"). The current version uses specific conditions and specific actions. If X, then Y. No room for interpretation.

CONTEXT: the only dynamic section. Everything that changes per request, per user, or per system state goes here. The rule that took me a while to really internalize: if something changes between calls, it doesn't go in ROLE or LIMITS. Once dynamic values leak into the fixed sections, those sections are no longer identical across requests, and you can't attribute a change in behavior to the context alone.

OUTPUT FORMAT: the section that saves the most work in whatever code consumes the response. The more specific it is, the less defensive parsing you need downstream.
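A strict output contract still deserves a check on the consuming side. The following is a minimal sketch of that check, assuming the three-field structure from the OUTPUT FORMAT section above; the names `AgentOutput`, `ParseResult`, and `parseAgentOutput` are hypothetical and not part of any library.

```typescript
// Hypothetical result shape mirroring the OUTPUT FORMAT section.
interface AgentOutput {
  entities: unknown[];
  confidence: number;
  errors: string[];
}

type ParseResult =
  | { status: "ok"; value: AgentOutput }
  | { status: "invalid"; reason: string };

// Validates the raw model response against the contract instead of trusting it.
function parseAgentOutput(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { status: "invalid", reason: "response is not valid JSON" };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { status: "invalid", reason: "top-level value is not an object" };
  }
  const obj = data as Record<string, unknown>;

  const entities = obj.entities;
  if (!Array.isArray(entities)) {
    return { status: "invalid", reason: "entities must be an array" };
  }
  const confidence = obj.confidence;
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    return { status: "invalid", reason: "confidence must be a number between 0 and 1" };
  }
  const errors = obj.errors;
  if (!Array.isArray(errors) || !errors.every((e) => typeof e === "string")) {
    return { status: "invalid", reason: "errors must be an array of strings" };
  }
  return { status: "ok", value: { entities, confidence, errors: errors as string[] } };
}
```

With a precise format section, a check like this should rarely fail. When it does, the `reason` string says which part of the contract was broken, which is more useful than a stack trace from deep inside the consumer.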

Where people go wrong: the hidden cost of unstructured dynamic context

The most common mistake I see in prompts shared in public repos is injecting dynamic context directly into the body of the role, with no separation. Something like this:

// ❌ Problematic pattern: context mixed with fixed instructions
const systemPrompt = `
You are a support assistant. Today is ${new Date().toISOString()}.
The user's name is ${user.name} and they have the ${user.plan} plan.
Help them with their questions. Be friendly and concise.
The response schema is: { message: string }.
`;

The problem isn't that it includes dynamic context — that's fine. The problem is it mixes date, user data, behavioral instructions, and output format in the same block with no structure. When the model receives that, it has no clear signals for what's a limit, what's context, and what's a format instruction.

In practice this shows up in two ways: the model ignores the output format when the user context is unusual, or it applies restrictions that were meant for a specific context to all contexts. Both end up as bugs that are hard to reproduce because they depend on the dynamic content of the request.

Anthropic's guide recommends using XML tags to separate sections when content might be ambiguous. Uppercase headings are a variant of the same principle: giving the model unambiguous structural signals.
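For comparison, here is a minimal sketch of the same four sections wrapped in XML tags instead of uppercase headings. The tag names (`role`, `limits`, `context`, `output_format`) and the `PromptSections` shape are illustrative assumptions, not a convention prescribed by the guide.

```typescript
// Hypothetical section shape; tag names are an illustrative choice.
interface PromptSections {
  role: string;
  limits: string[];
  context: Record<string, string>;
  outputFormat: string;
}

// Wraps a section body in an opening and closing tag on their own lines.
function wrapInTag(tag: string, body: string): string {
  return `<${tag}>\n${body}\n</${tag}>`;
}

function buildXmlSystemPrompt(sections: PromptSections): string {
  const limits = sections.limits.map((limit) => `- ${limit}`).join("\n");
  const context = Object.keys(sections.context)
    .map((key) => `${key}: ${sections.context[key]}`)
    .join("\n");
  return [
    wrapInTag("role", sections.role),
    wrapInTag("limits", limits),
    wrapInTag("context", context),
    wrapInTag("output_format", sections.outputFormat),
  ].join("\n\n");
}
```

This sketch does no escaping. If a context value could itself contain something like `</context>`, it would need to be sanitized before being wrapped.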

When dynamic context helps and when it confuses

This is the distinction that took me the longest to articulate precisely:

Dynamic context that helps: factual data the model needs to operate correctly in that specific request. Current date, data schema, environment configuration, relevant system state. Information that can't come from anywhere else.

Dynamic context that confuses: behavioral instructions that change per user or per session. If the agent's limits vary by user type, don't inject them into the prompt as free text. Model them as explicit conditional sections with clear logic, or consider separate agents.

// ✓ Factual dynamic context: correct
const factualContext = `
CONTEXT
Date: ${processingDate}
Active schema: ${JSON.stringify(schema)}
`;

// ❌ Instructions that change per user: problematic
const perUserContext = `
CONTEXT
${user.isPremium ? 'You can answer advanced questions.' : 'You only answer basic questions.'}
`;

// ✓ Alternative for variable behavior: explicit conditional section
const limits = user.isPremium
  ? 'LIMITS\nScope: advanced analysis. No length restriction.'
  : 'LIMITS\nScope: basic questions only. Maximum 3 reasoning steps.';

The model handles factual dynamic context well. It handles conditional instructions written in prose worse — especially when they accumulate across multiple versions of the prompt.
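When there are more than two variants, one way to keep conditional limits from accumulating as prose is a lookup table of complete LIMITS sections. This is a sketch under assumptions: the plan names and the `LIMITS_BY_PLAN` table are hypothetical, and the limit texts are taken from the example above.

```typescript
// Hypothetical plan names; each entry is a complete, reviewable set of limits.
type Plan = "free" | "premium";

const LIMITS_BY_PLAN: Record<Plan, string[]> = {
  free: ["Scope: basic questions only.", "Maximum 3 reasoning steps."],
  premium: ["Scope: advanced analysis.", "No length restriction."],
};

// The branching happens in code; the model only ever sees one variant.
function buildLimitsSection(plan: Plan): string {
  const lines = LIMITS_BY_PLAN[plan].map((limit) => `- ${limit}`);
  return ["LIMITS"].concat(lines).join("\n");
}
```

Each variant can be read, diffed, and tested on its own, and adding a third plan is a new table entry rather than another nested conditional inside the prompt text.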

Checklist before deploying a system prompt

Before putting a system prompt into production, run it through these questions. They're not exhaustive, but they cover the most common problems:

Is there an explicit limits section separate from the role? If limits are mixed into the role description, separate them.
Is the output format specified with a concrete example? Describing it in prose isn't enough when the format is structured (JSON, XML, a list with specific formatting).
Is all dynamic content in the CONTEXT section? If there are ${variables} outside that section, check whether they should be there.
Do the limits use specific conditions? "Don't make up data" is vague. "If the field doesn't appear in the source text, return null" is a contract.
Do you know what the agent should do when the question is out of scope? If there's no explicit instruction for that case, the model will invent a reasonable-sounding response.
Is the prompt over 800 tokens? Not automatically bad, but it's a signal to check for redundancy between sections or unnecessary context.

This checklist doesn't replace testing the prompt with edge cases. But it cuts the obvious problems before you get to that stage.
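Some of these questions are mechanical enough to automate. Below is a rough sketch of a lint pass over an unrendered prompt template. It assumes the template is stored as a plain string with `${name}` placeholders and uppercase headings on their own lines. The 4-characters-per-token estimate is a crude heuristic, not a real tokenizer, and `lintPromptTemplate` is a hypothetical helper.

```typescript
interface LintFinding {
  check: string;
  message: string;
}

const SECTION_HEADINGS = ["ROLE", "LIMITS", "CONTEXT", "OUTPUT FORMAT"];

// Rough heuristic (an assumption, not a tokenizer): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function lintPromptTemplate(template: string, tokenBudget: number = 800): LintFinding[] {
  const findings: LintFinding[] = [];
  const lines = template.split("\n").map((line) => line.trim());

  // LIMITS and OUTPUT FORMAT must exist as their own sections.
  ["LIMITS", "OUTPUT FORMAT"].forEach((heading) => {
    if (lines.indexOf(heading) === -1) {
      findings.push({ check: "missing-section", message: `no ${heading} section` });
    }
  });

  // Placeholders are only allowed inside CONTEXT.
  let current = "(no section)";
  lines.forEach((line, index) => {
    if (SECTION_HEADINGS.indexOf(line) !== -1) {
      current = line;
    } else if (/\$\{[^}]+\}/.test(line) && current !== "CONTEXT") {
      findings.push({
        check: "dynamic-outside-context",
        message: `line ${index + 1}: placeholder in ${current}`,
      });
    }
  });

  // Size is a signal to review, not an automatic failure.
  const tokens = estimateTokens(template);
  if (tokens > tokenBudget) {
    findings.push({ check: "over-budget", message: `about ${tokens} tokens` });
  }
  return findings;
}
```

The questions about vague limits and out-of-scope behavior still need a human reader; a regex can't tell a contract from a platitude.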

The limits of this approach

Here's where I have to be honest about what this format actually solves and what it doesn't.

What it solves well: structural clarity, less ambiguity in edge-case behavior, better consistency in output format. Those benefits are observable without sophisticated metrics.

What it doesn't solve: a model that's wrong for the task, insufficient context to answer correctly, or contradictory instructions that require complex reasoning. Better prompt structure doesn't compensate for those problems.

What I can't claim without an experiment: that this format improves accuracy metrics by a specific percentage, that it works the same across all models, or that four sections is superior to other structures in cases I haven't considered. For that you need logs, evaluations with predefined test cases, and a measurement criterion set up beforehand.

The Anthropic guide is the best starting point for validating structure. What I add is the judgment for when to separate dynamic context and the insistence on limits with specific conditions — not vague prose.

FAQ

How many sections should a system prompt for agents have?
There's no magic number. The four-section format (ROLE, LIMITS, CONTEXT, OUTPUT FORMAT) is the minimum I've found useful for agents with non-trivial behavior. Simpler agents can get away with less. What you don't want to cut are LIMITS and OUTPUT FORMAT — those two sections have the highest impact on consistency.

Where does user information go in a system prompt?
In the CONTEXT section, as factual data. Not in ROLE or LIMITS. If the user information changes the agent's behavior (not just its context), ask yourself whether that should be an explicit conditional section or a different agent with its own system prompt.

Does it make sense to use XML instead of uppercase headings?
Yes, especially if the content of a section might contain characters that confuse parsing. Anthropic recommends it for ambiguous content. Uppercase headings are more readable for human review. Pick whichever is more consistent with the rest of your codebase.

How do I test that the system prompt is working correctly?
With edge cases defined before writing the prompt, not after. The most important ones: out-of-scope question, required data missing from the input, unusual input format. If the agent has no explicit instructions for those cases, it will invent a response. The Anthropic Prompt Engineering Guide has examples of structured evaluation.
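Here is a minimal sketch of those three edge cases written down as data before the prompt exists. Everything in it is an assumption for illustration: the `Agent` type is a synchronous stand-in for a real (async) model call, the inputs are made up, and the expected behavior for each case is one possible contract, not the only valid one.

```typescript
// Stand-in for the real model call; a production version would be async.
type Agent = (input: string) => string;

interface EdgeCase {
  name: string;
  input: string;
  // Returns true when the raw response honors the contract for this case.
  honorsContract: (raw: string) => boolean;
}

function parseOrUndefined(raw: string): any {
  try {
    return JSON.parse(raw);
  } catch {
    return undefined;
  }
}

const edgeCases: EdgeCase[] = [
  {
    name: "out-of-scope question",
    input: "What is the weather in Buenos Aires?",
    // Assumed contract: no entities, and at least one entry in errors.
    honorsContract: (raw) => {
      const out = parseOrUndefined(raw);
      return !!out && Array.isArray(out.entities) && out.entities.length === 0
        && Array.isArray(out.errors) && out.errors.length > 0;
    },
  },
  {
    name: "required field missing",
    input: "Invoice from ACME Corp. No date is printed on it.",
    // Assumed contract: the hypothetical "date" field is null, not a guess.
    honorsContract: (raw) => {
      const out = parseOrUndefined(raw);
      if (!out || !Array.isArray(out.entities) || out.entities.length === 0) return false;
      const first = out.entities[0];
      return typeof first === "object" && first !== null && first.date === null;
    },
  },
  {
    name: "unusual input format",
    input: "ACME;;2024-01-01;;TOTAL=??",
    // Minimum bar: the response is still parseable JSON.
    honorsContract: (raw) => parseOrUndefined(raw) !== undefined,
  },
];

// Returns the names of the cases whose responses broke the contract.
function runEdgeCases(agent: Agent, cases: EdgeCase[]): string[] {
  return cases
    .filter((edgeCase) => !edgeCase.honorsContract(agent(edgeCase.input)))
    .map((edgeCase) => edgeCase.name);
}
```

Writing the `honorsContract` functions first tends to expose the gaps in LIMITS: if you can't say what a passing response looks like for a case, the prompt doesn't say it either.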

Can the system prompt change at runtime?
Technically yes, but it's a source of bugs that are genuinely hard to debug. What changes at runtime is the CONTEXT section. ROLE and LIMITS should be stable across requests from the same agent. If you need radically different behavior, that's a signal you need two agents — not one prompt with complex conditionals.
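One way to make that stability checkable is to build the prompt from constant pieces plus a rendered CONTEXT, and to compare prompts with the CONTEXT removed. The sketch below assumes uppercase headings on their own lines; `RequestContext`, `buildPrompt`, and `withoutContext` are hypothetical names, and the section texts are shortened placeholders.

```typescript
// Hypothetical per-request data; everything else in the prompt is constant.
interface RequestContext {
  processingDate: string;
  supportedLanguages: string[];
}

const FIXED_HEAD = [
  "ROLE",
  "You are a document processing agent.",
  "",
  "LIMITS",
  "- If a required field doesn't appear in the text, return null for that field.",
].join("\n");

const FIXED_TAIL = ["OUTPUT FORMAT", "Return exclusively valid JSON."].join("\n");

function renderContext(ctx: RequestContext): string {
  return [
    "CONTEXT",
    `Processing date: ${ctx.processingDate}`,
    `Supported languages: ${ctx.supportedLanguages.join(", ")}`,
  ].join("\n");
}

function buildPrompt(ctx: RequestContext): string {
  return [FIXED_HEAD, renderContext(ctx), FIXED_TAIL].join("\n\n");
}

// Drops the CONTEXT section so two prompts can be compared on their fixed parts.
function withoutContext(prompt: string): string {
  const start = prompt.indexOf("CONTEXT\n");
  const end = prompt.indexOf("OUTPUT FORMAT\n");
  if (start === -1 || end === -1 || end < start) return prompt;
  return prompt.slice(0, start) + prompt.slice(end);
}
```

If `withoutContext` ever differs between two logged prompts from the same agent, something other than context changed at runtime, and that is the first place to look.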

What do I do if the model ignores the specified output format?
First, verify the format example is in the right section and is unambiguous. Second, try a response prefill (in models that support it) to anchor the start of the output. Third, if the problem persists, check whether the dynamic context is introducing ambiguity that's overriding the format instructions.
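A sketch of the prefill idea, assuming a request shaped like Anthropic's Messages API (a `system` string plus a `messages` array whose last entry is a partial assistant turn). The `model` field is left out on purpose, `max_tokens` is an arbitrary placeholder, and the helper names are hypothetical. Check the current API documentation before relying on this, since prefill support varies by model.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Assumed subset of a Messages-style request body; the model name is omitted.
interface PrefilledRequest {
  system: string;
  max_tokens: number;
  messages: Message[];
}

// Opening brace of the JSON object the OUTPUT FORMAT section asks for.
const PREFILL = "{";

function buildPrefilledRequest(systemPrompt: string, userInput: string): PrefilledRequest {
  return {
    system: systemPrompt,
    max_tokens: 1024, // placeholder value
    messages: [
      { role: "user", content: userInput },
      // The model continues from this partial assistant turn.
      { role: "assistant", content: PREFILL },
    ],
  };
}

// The completion does not repeat the prefill, so it has to be added back
// before parsing.
function joinPrefill(completion: string): string {
  return PREFILL + completion;
}
```

Forgetting the join step is the classic mistake here: the raw completion starts after the `{`, so parsing it directly fails even when the model did everything right.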

Closing: a contract you can actually reason about

What changed how I think about system prompts wasn't reading about prompt engineering — it was facing unexpected behavior from an agent and having no way to reason about which instruction caused it, because everything was in one unstructured block of prose.

Structuring the prompt into sections isn't bureaucracy. It's what lets you say "the model behaved differently because the dynamic context changed" instead of "I don't know, the model is unpredictable." The difference between those two sentences is the difference between a debuggable system and one that operates on vibes.

My practical recommendation: if you have an agent in production with a prose system prompt, don't rewrite it from scratch. Start by pulling the limits into their own section and moving all dynamic content into an explicit CONTEXT section. Those two changes alone already reduce the surface area for ambiguity.

The concrete next step: take your current prompt, paste it into a doc, and ask yourself: what is the most important limit this agent has, and where is it written down explicitly? If you can't point to a specific sentence, that's the first gap.
