You inherit a prompt that's grown into a 1,200-word wall. Role, examples, instructions, the data, three caveats, the output schema, two more instructions tucked in at the bottom because someone fixed a bug. It works. Mostly. Then you swap models and the eval drops six points.
The 2026 fix on Claude 4.6 is structural, not stylistic. Four blocks, in order. Around 150 to 300 words of scaffold. Anthropic's own prompting best practices and the Lakera 2026 guide both converge on this shape, and on tasks where the rubric can be specified clearly it tends to match or beat few-shot prompting for a fraction of the tokens.
Why blocks beat paragraphs
The model reads your prompt as a sequence of tokens. It doesn't know which sentence is an instruction, which is data, and which is a constraint, unless you tell it. When everything lives in one paragraph, the model has to infer structure on every call, and it gets it wrong on edge cases. The Anthropic XML tag docs put it plainly: tags help Claude parse complex prompts unambiguously, especially when your prompt mixes instructions, context, examples, and variable inputs.
Blocks turn the prompt into an index. The model reads it once, knows which span is the rule and which is the data, and stops mixing them up. The effect compounds with prompt size. At 150 words a paragraph is fine; at 600 words an unstructured prompt tends to lose to a structured one of the same length in the public structured-vs-unstructured comparisons in the Anthropic and Lakera writeups linked above.
The four blocks
1. Instructions. Role, behavior, hard constraints. What the model is and what it must always do. One short sentence of role, then the rules that hold across every call. This block does not change between requests.
2. Context. Background information, supporting documents, retrieved chunks, prior conversation. The reference material the model reads but does not act on directly. This block is per-call when you do RAG; per-session when you do conversation memory.
3. Task. The specific request for this interaction. One paragraph at most. What you want the model to do right now with the context you just provided.
4. Output format. The exact shape of the response, spelled out, with an example if the schema is non-obvious. The reason this block lives at the end is that Claude attends most strongly to the end of the prompt; putting the format last is a cheap win on adherence.
The order matters. Instructions before context, context before task, task before format. Each block depends on the ones above it: the task makes sense given the instructions, the format makes sense given the task.
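Stripped of content, the scaffold is just the four tags in that order:

<instructions>
role in one sentence, then the always-on rules
</instructions>

<context>
reference material for this call
</context>

<task>
the one-paragraph request
</task>

<output_format>
the exact response shape; last, so it gets the most attention
</output_format>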
The everything-in-one-paragraph anti-pattern
You are a senior backend engineer reviewing pull requests for
a fintech codebase. The PR I'm sending touches the payment
service which handles webhooks from Stripe and processes
refunds. Last quarter we had two production incidents related
to idempotency so be careful about that. Look at the diff
below and tell me if there are bugs, also check for security
issues, and tell me if you'd merge it. Output should be a
JSON object with keys severity, issues, and merge_decision.
Diff: <DIFF>. Don't be too pedantic. Also flag missing tests.
What's wrong with it. The role is buried mid-sentence. The "be careful about idempotency" rule is sandwiched between context and task, so the model can't tell whether it's a behavior rule (every call) or a hint about this PR. The output schema is a comma-separated aside inside a sentence. The "don't be too pedantic" caveat sits at the end, where it gets the most attention, but should have been an instruction. The diff arrives mid-stream, so the model has to scan back and forth.
This prompt works. It also drops six points on a tightened eval rubric versus the structured version below.
The same prompt, four blocks
<instructions>
You are a senior backend engineer reviewing fintech PRs.
Always:
- Flag idempotency concerns explicitly when payment paths are touched.
- Flag missing tests as a separate issue.
- Avoid pedantic style nits unless they affect correctness.
</instructions>
<context>
The PR touches the payment service, which handles Stripe webhooks
and refunds. Two production incidents in the last quarter were
caused by idempotency bugs in this service.
</context>
<task>
Review the diff below. Identify bugs and security issues. Decide
whether you would merge it.
</task>
<diff>
{DIFF_GOES_HERE}
</diff>
<output_format>
Return only a JSON object:
{
  "severity": "low" | "medium" | "high",
  "issues": [{ "type": str, "location": str, "detail": str }],
  "merge_decision": "approve" | "request_changes" | "comment"
}
</output_format>
Same information, ~210 words, four blocks. The role is one sentence. The behavior rules are bulleted in <instructions> so the model knows they hold across calls. The context is its own span. The task is a single paragraph. The format is at the end with a typed schema. The diff has its own tag so the model never confuses it with instructions.
What an eval rig of this shape typically shows
Build the eval yourself: take 30 to 50 synthetic PRs, run the anti-pattern and four-block versions through the same grader, and score four metrics: schema validity, behavior-rule adherence (in this example, idempotency flagging), false-positive rate on the rules you told the model to ignore, and mean output tokens. The numbers below are illustrative of the shape this kind of suite tends to produce on a model like Claude 4.6; your task, grader, model version, and seed will move them around. Treat the deltas as directional, not as a benchmark. A minimal harness sketch follows the table.
| Metric | Anti-pattern | Four-block | Delta (illustrative) |
|---|---|---|---|
| Schema-valid output | ~0.80 | ~0.95+ | double-digit pts |
| Behavior rule fired when relevant | ~0.65 | ~0.90 | large lift |
| False positives on suppressed rules | ~0.25 | ~0.10 | meaningful drop |
| Mean output tokens | baseline | ~30% lower | side effect |
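For concreteness, here is the shape that harness can take. Everything below is a sketch under assumptions: call_model stands in for your model client, the rule check is a crude keyword grader, and token counts are approximated by whitespace splitting. A production rig would grade with a rubric or a judge model.

import json
from statistics import mean

REQUIRED_KEYS = {"severity", "issues", "merge_decision"}

def schema_valid(output: str) -> bool:
    """Does the response parse as JSON with the three required keys?"""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

def rule_fired(output: str, keyword: str = "idempot") -> bool:
    # Crude proxy grader: did any reported issue mention the rule?
    if not schema_valid(output):
        return False
    issues = json.loads(output).get("issues", [])
    return any(keyword in json.dumps(issue).lower() for issue in issues)

def score(outputs: list[str]) -> dict[str, float]:
    return {
        "schema_valid": mean(schema_valid(o) for o in outputs),
        "rule_fired": mean(rule_fired(o) for o in outputs),
        # Whitespace split is a rough token proxy, good enough for deltas.
        "mean_output_tokens": mean(len(o.split()) for o in outputs),
    }

# call_model is your client; run both prompt variants through it:
# print("anti-pattern:", score([call_model(p) for p in anti_prompts]))
# print("four-block:", score([call_model(p) for p in block_prompts]))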
The biggest lift in this kind of comparison is on the constraint-following metrics. Bulleting "always flag idempotency" in <instructions> makes it a rule, not a hint. Putting "avoid pedantic nits" in the same block tends to kill the false-positive rate without the model second-guessing itself. The token reduction is a side effect: the model stops narrating when it knows the schema.
Few-shot examples still help on tasks where the format is unusual or the rubric is implicit. On tasks where the rubric is specifiable in plain language, the four-block structure tends to match or beat a 3-shot prompt for a fraction of the tokens. The Context Engineering guide at the-ai-corner.com reports the same shape on different tasks.
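When you do need shots, they get their own block between <context> and <task> (the exits list below says when). A sketch of the shape, with hypothetical content:

<examples>
<example>
<input>refund handler retries a Stripe call with no idempotency key</input>
<output>{ "severity": "high", "issues": [...], "merge_decision": "request_changes" }</output>
</example>
<!-- two to four more examples -->
</examples>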
A Python builder that keeps prompts honest
The discipline this pattern needs is mechanical. If you're hand-writing the four blocks every time, drift will creep in. A "behavior rule" sneaks into the task, the format block gets a context sentence, the file ends up looking like the anti-pattern again. Wrap it.
from dataclasses import dataclass, field


@dataclass
class FourBlockPrompt:
    role: str
    rules: list[str] = field(default_factory=list)
    context: str = ""
    task: str = ""
    output_format: str = ""
    extras: dict[str, str] = field(default_factory=dict)

    def build(self) -> str:
        rule_lines = "\n".join(f"- {r}" for r in self.rules)
        # Plain concatenation instead of textwrap.dedent: dedent stops
        # dedenting once the interpolated rule lines sit at column zero.
        instructions = (
            "<instructions>\n"
            f"{self.role.strip()}\n"
            "Always:\n"
            f"{rule_lines}\n"
            "</instructions>"
        )
        head = [instructions]
        if self.context.strip():
            head.append(f"<context>\n{self.context.strip()}\n</context>")
        # Per-call payloads (the diff, the document) get their own tags
        # and stay out of the scaffold word count below.
        payload = [
            f"<{tag}>\n{body.strip()}\n</{tag}>"
            for tag, body in self.extras.items()
        ]
        tail = []
        if self.task.strip():
            tail.append(f"<task>\n{self.task.strip()}\n</task>")
        if self.output_format.strip():
            tail.append(
                "<output_format>\n"
                f"{self.output_format.strip()}\n"
                "</output_format>"
            )
        self._guardrails("\n\n".join(head + tail))
        return "\n\n".join(head + payload + tail)

    def _guardrails(self, scaffold: str) -> None:
        # Sweet spot for the scaffold is 150-300 words; the 350
        # ceiling is a hard stop that leaves ~50 words of headroom
        # before something has clearly slipped out of place.
        words = len(scaffold.split())
        if words > 350:
            raise ValueError(
                f"Scaffold is {words} words; trim to under 350. "
                "Long prompts usually mean a rule slipped into "
                "context or the format block grew an example."
            )
        if not self.rules:
            raise ValueError(
                "At least one rule required. If there are no "
                "behavior rules, you don't need this builder."
            )
A reviewer prompt with the same content as the anti-pattern above, expressed through the builder:
prompt = FourBlockPrompt(
    role="You are a senior backend engineer reviewing fintech PRs.",
    rules=[
        "Flag idempotency concerns when payment paths are touched.",
        "Flag missing tests as a separate issue.",
        "Avoid pedantic style nits.",
    ],
    context=(
        "The PR touches the payment service. Two production "
        "incidents last quarter were caused by idempotency "
        "bugs in this service."
    ),
    task=(
        "Review the diff. Identify bugs and security issues. "
        "Decide whether to merge."
    ),
    output_format=(
        '{ "severity": "low|medium|high", '
        '"issues": [...], '
        '"merge_decision": "approve|request_changes|comment" }'
    ),
    extras={"diff": diff_text},  # diff_text: the PR diff for this call
).build()
The 350-word ceiling is the real value. Once a builder rejects prompts that exceed it, the team learns to put rules in <instructions> instead of stuffing them into the task. The extras slot handles per-call payloads (the diff, the document, the customer email) under their own tags, without polluting the four canonical blocks or counting against the scaffold budget.
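To watch the ceiling earn its keep, feed the builder a task with rules stuffed into it (the bloated task here is synthetic, but the failure mode is the real one):

try:
    FourBlockPrompt(
        role="You are a reviewer.",
        rules=["Flag missing tests."],
        # A rambling, rule-stuffed task blows past the 350-word ceiling.
        task="Review the diff. "
        + "Also always remember to check everything. " * 60,
    ).build()
except ValueError as err:
    print(err)  # "Scaffold is ... words; trim to under 350. ..."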
When to step outside the four blocks
The pattern isn't a religion. Three legitimate exits:
- Tasks where the rubric is implicit. If you can't write the rule in plain language, you need few-shot examples. Add an <examples> block between <context> and <task>. Three to five examples is the right range.
- Long-context retrieval. When you're stuffing 30k tokens of retrieved chunks into context, the four-block scaffold is fine, but the model attends to the start and end most strongly. Re-state the task at the end, after the context dump, if you see drift.
- Multi-turn conversation. The system prompt holds <instructions>. Each user turn is a new <task> against accumulated <context>. Don't try to fit a 10-turn conversation into the four-block template; it stops being one prompt. A minimal sketch of the split follows this list.
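The message dicts below follow the common chat convention; the diffs and the client call are stand-ins for your own:

SYSTEM = (
    "<instructions>\n"
    "You are a senior backend engineer reviewing fintech PRs.\n"
    "Always:\n"
    "- Flag idempotency concerns when payment paths are touched.\n"
    "</instructions>"
)

def user_turn(task: str, context: str = "") -> dict[str, str]:
    # Each turn carries its own per-call <context> and a fresh <task>;
    # <instructions> live once, in the system prompt.
    blocks = []
    if context:
        blocks.append(f"<context>\n{context}\n</context>")
    blocks.append(f"<task>\n{task}\n</task>")
    return {"role": "user", "content": "\n\n".join(blocks)}

messages = [user_turn("Review the first diff for idempotency issues.")]
# ...append the assistant reply to messages, then the next turn:
messages.append(
    user_turn(
        "Now review the follow-up diff.",
        context="The first diff was rejected for a missing idempotency key.",
    )
)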
For everything else, the discipline pays. Structure beats length. The 150-to-300-word band isn't a rule because somebody picked it; it's the band where prompts are short enough that every token earns its place and long enough that the four blocks don't degenerate into a sentence each.
What you ship next
If you maintain a prompt that's older than six months and looks like the anti-pattern above, the cleanup is straightforward. Pull each sentence out, label it as instruction / context / task / format, and rebuild. You'll discover sentences that don't belong in any block. Those are the dead ones, and removing them is half the lift you'll measure on your eval.
The other half is putting the format block last and making the rules bullets. Two structural changes. The score moves before you've touched a word of the actual content.
If this was useful
The Prompt Engineering Pocket Guide covers the four-block pattern end-to-end: the per-block design checklist, an eval rig that produces numbers shaped like the ones above, and the cases where few-shot still wins. Written for engineers who maintain prompts in production and want a structure that stops drifting.
