System vs User Prompt in 2026: What Actually Belongs Where

#ai #llm #prompt #tutorial

Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You inherit a classification endpoint. It works. Then the bill arrives and your cache hit rate is sitting at zero, even though the prompt is 90% the same on every call. You check the code. The system field is empty, and the whole prompt — the rules, the few-shot examples, the user's text, the current timestamp — is jammed into one user message.

That single design choice is why nothing caches. It's also why the model ignores half your instructions when the input gets weird. The two roles aren't a formality. They tell the model what's a standing rule and what's this-request data, and they tell the API what it can reuse.

The split, stated plainly

Two questions decide where a piece of text goes:

Does it change between requests?
Is it an instruction, or is it data?

Stable instructions go in the system prompt. The role you play, the rules that hold on every call, the output format, the tone. None of it depends on what the user just typed.

Per-request data goes in the user message. The document to summarize, the text to classify, the question to answer. This is the part that's different every time.

That's the whole rule. The trouble starts when you blur it — when a behavior rule leaks into the user turn, or the user's input gets pasted into the system prompt.

Why mixing them hurts caching

Prompt caching is a prefix match. The API hashes your prompt from the start up to a cache breakpoint, and any byte change before that point invalidates everything after it. On Anthropic's API the render order is fixed: tools first, then system, then messages. The exact slots differ across providers, but the principle holds everywhere.

So the stable content has to physically come first. If your system prompt is frozen across requests, the API caches it once and reads it back at a fraction of the cost on every later call. On Anthropic's API, cache reads run about a tenth of the base input price.

Now watch what one mistake does. Here's a system prompt with a timestamp baked into the header:

system = f"""You are a support triage assistant.
Current time: {datetime.now().isoformat()}
Rules:
- Classify each ticket as bug, billing, or other.
- Never invent a category outside those three.
"""

That datetime.now() call changes every request. The first line of the prefix is different every time, so the cache is invalidated from byte one. Nothing downstream caches. You pay full price for the rules, the format spec, all of it, on every single call.

The fix is to pull the volatile piece out of the prefix and push it into the user turn, where it belongs:

system = """You are a support triage assistant.
Rules:
- Classify each ticket as bug, billing, or other.
- Never invent a category outside those three.
"""

user = f"Current time: {now}\n\nTicket:\n{ticket_text}"

The system prompt is now byte-identical on every call. It caches. The timestamp still reaches the model, but it sits after the cached prefix, so it invalidates nothing before it.

Same class of bug shows up with per-user data in the system prompt:

# Breaks cross-user caching — the prefix is unique per user
system = f"You are a coding assistant for {user.name} ({user.id})."

Every user gets a distinct prefix, so the cache never shares across them. Move the identity into the user turn, keep the system prompt generic, and one cached prefix serves everyone.

The tell is in the response. On Anthropic's API, check cache_read_input_tokens across two requests with what you think is the same prefix. If it's zero, something in the prefix is moving. The usual suspects: a timestamp, a UUID, a JSON blob serialized without sorted keys, or a tool list that changes per request.

Why mixing them hurts steerability

Caching is the measurable cost. Steerability is the one that bites you at 2 a.m. when a customer reports the model "ignoring instructions."

When everything lives in one user message, the model has to infer structure on every call. It reads a wall of text and guesses which sentence is a standing rule, which is background, and which is the actual task. It gets that guess right most of the time and wrong on the edge cases — which is exactly where you needed it to hold.

Here's a request-review prompt with everything flattened into the user turn:

user = f"""You are a senior engineer reviewing PRs. Be
careful about idempotency on payment paths. Here's the diff:
{diff}
Tell me if there are bugs. Don't be pedantic. Also flag
missing tests. Return JSON with severity and issues.
"""

The idempotency rule is wedged between the role and the diff. Is it a behavior rule that holds on every PR, or a hint about this one? The model can't tell, because you didn't tell it. The "don't be pedantic" constraint lands at the end, where it gets the most attention but reads as an afterthought. The diff sits in the middle, so the model scans back and forth to separate code from instructions.

Split it by role and the ambiguity is gone:

system = """You are a senior engineer reviewing PRs.

Always:
- Flag idempotency concerns when payment paths change.
- Flag missing tests as a separate issue.
- Avoid pedantic style nits.

Return a JSON object: {severity, issues, merge_decision}.
"""

user = f"Review this diff:\n\n{diff}"

Now the rules are bulleted under a system prompt, so the model treats them as standing behavior, not as a hint about this diff. The diff is alone in the user turn, so there's nothing to disentangle. The format spec lives in the system prompt because it doesn't change between PRs.

The behavior shift is real. A rule stated as a system-level "always" fires more reliably than the same sentence buried mid-paragraph in a user message. And a constraint like "avoid pedantic nits" in the system prompt suppresses the behavior across the board, instead of reading as a one-off aside.

The grey areas

Most text sorts cleanly. A few cases need a second look.

Few-shot examples. These are stable across requests, so they go in the system prompt — or in their own block right after it. They're reference material the model reads, not data it acts on. Keep them out of the user turn, where they'd change the cacheable boundary every time the user's input changes.

Retrieved context (RAG). This is the genuinely awkward one. Retrieved chunks change per request, so by the "does it change" test they're user-turn data. But they're also bulky, and you sometimes want them cached when the same documents come back across a session. Default to the user turn. Cache them with their own breakpoint only when you can confirm the same chunks recur — otherwise you pay the cache-write premium for nothing.

Conversation history. History accumulates in the messages array, after the system prompt. Don't try to summarize prior turns back into the system prompt to "save space" — that rewrites the prefix and invalidates the cache for the whole conversation. Let the history grow where it is and put your cache breakpoint on the last turn.

Dynamic instructions mid-session. Say a mode toggles partway through — the operator enables a terse mode, or a flag flips. The instinct is to edit the system prompt. Don't. Editing the system prompt changes the prefix ahead of every cached turn, so the whole conversation re-processes uncached. Append the instruction as a later message instead. Some APIs now support a dedicated system-role message you can place mid-conversation for exactly this; where that isn't available, a marked instruction in the next user turn does the job. Either way the cached history stays intact.

A checklist you can run today

Open your prompt-building code and trace what flows into system, tools, and messages. Then sort every input:

Never changes? It goes in the system prompt, before any cache breakpoint.
Changes per request? It goes in the user turn, after the prefix.
A timestamp, UUID, or random ID in the system prompt? Move it to the user turn or delete it.
A tool list that varies per user or per request? Tools render first, so a varying tool set invalidates everything. Freeze it.

Then verify. Send the same logical request twice and read cache_read_input_tokens on the second. A non-zero number means your prefix is stable and the split is working. Zero means something in the prefix is still moving, and you have one more input to track down.

The split between system and user isn't a style preference. It's the line between a prompt that caches and steers and one that does neither.

The system-versus-user split is one of those decisions that's invisible until it costs you — a flat cache bill, a model that drifts on edge cases. The Prompt Engineering Pocket Guide goes deeper on prompt structure, caching-aware design, and the per-block choices you're probably making on instinct. Written for engineers who maintain prompts in production.