- Book: Prompt Engineering Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
The prompt that worked on Claude 4.6 broke on GPT-5.5 the day
you swapped providers. Same task. Same JSON schema. The first
call returned a refusal wrapped in three paragraphs of meta
about why it was refusing. The second call returned valid
JSON but ignored the escalation rule. By Friday afternoon,
commit a91f3c2 had a five-line system prompt and a fallback
that pretended none of this was happening.
You can keep rewriting the prompt every time procurement
picks a new vendor. Or you can give the prompt a shape that
doesn't move when the model under it moves. Below is the
skeleton that holds its shape after the same task has been run
across Claude, GPT, and Gemini over a couple of quarters.
Skeletons over templates
Templates lock the wording. Skeletons lock the slots. The
slots are the parts every modern provider's documented
contract assumes you'll fill in: a role, a refusal stance,
tool guidance, an output shape, anchors, and a graceful exit
when nothing else fits.
Anthropic puts the system text in a top-level system field
on the Messages API.
OpenAI accepts it as a developer (or system) message in
the Responses API.
Google takes it as a structured systemInstruction content
on generateContent.
Three transports, one job: tell the model who it is and what
it must do before any user turn arrives.
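For concreteness, here is the same skeleton riding all three
transports. A minimal Python sketch of the request bodies;
the model ids are placeholders, and SKELETON stands in for
the six sections defined in the next section:

SKELETON = "[ROLE + SCOPE] ... [FALLBACK] ..."  # the six sections below

# Anthropic Messages API: system is a top-level field.
anthropic_body = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "system": SKELETON,
    "messages": [{"role": "user", "content": "Email: ..."}],
}

# OpenAI Responses API: the skeleton rides as a developer message.
openai_body = {
    "model": "gpt-5",  # placeholder model id
    "input": [
        {"role": "developer", "content": SKELETON},
        {"role": "user", "content": "Email: ..."},
    ],
}

# Google generateContent: systemInstruction is a Content
# object with a parts array, not a bare string.
google_body = {
    "systemInstruction": {"parts": [{"text": SKELETON}]},
    "contents": [{"role": "user", "parts": [{"text": "Email: ..."}]}],
}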
The skeleton below is what goes in that slot. Keep it
short on purpose. Long system prompts drift across providers
faster than short ones do.
The six sections
[ROLE + SCOPE]
You are <role>. You handle <in-scope tasks>.
You do not handle <out-of-scope tasks>.
[REFUSAL + ESCALATION]
If a request is out of scope, respond with:
{ "status": "out_of_scope", "reason": "<one line>" }
If you are uncertain, escalate with:
{ "status": "needs_human", "reason": "<one line>" }
[TOOL GUIDANCE]
You may call: <tool_a>, <tool_b>.
Call <tool_a> only when <precondition>.
Never call a tool to confirm a fact already in context.
[OUTPUT CONTRACT]
Always return a single JSON object with keys:
status, data, reason. No prose, no markdown fences.
[FEW-SHOT ANCHORS]
Example 1: input -> output (in-scope, success)
Example 2: input -> output (out-of-scope, refusal)
[FALLBACK]
If you cannot satisfy the contract, return:
{ "status": "fallback", "reason": "<one line>" }
Six labeled sections. The names are bracketed and uppercase
because, in a small test set, all three providers treated
them as section breaks, and because XML tags work on Claude
but read as literal content on the other two unless you wrap
them carefully.
What each section is doing
Role and scope. Two sentences. The role gives the model
an identity to attend to. The scope tells it where to draw
the line. The negative half of the scope is the part most
teams skip; it is the part that prevents "I'd be happy to
help with that adjacent thing you didn't ask for".
Refusal and escalation. A typed refusal beats a polite
prose refusal every time. When the refusal is a JSON shape,
your downstream code can branch on status. When it is a
paragraph, your code parses English. The escalation slot is
the safety valve: a structured way for the model to say "I
am not the right unit for this".
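The difference shows up immediately in the calling code. A
small sketch of the two branches; the refusal phrases in the
regex are illustrative, which is exactly the problem with
the prose version:

import json
import re

reply_text = "I'm sorry, but I can't help with that request."
parsed_reply = json.loads(
    '{"status": "out_of_scope", "reason": "not a support request"}'
)

# Prose refusal: downstream code pattern-matches English,
# and the phrasing shifts with every model swap.
REFUSAL_HINTS = re.compile(r"(?i)\b(i can't|i cannot|unable to|i'm sorry)\b")
print(bool(REFUSAL_HINTS.search(reply_text)))    # brittle

# Typed refusal: one comparison, stable across providers.
print(parsed_reply["status"] == "out_of_scope")  # True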
Tool guidance. Names of the tools, plus the condition
under which each one is called. The "never call to confirm"
line is the cheapest way to stop the loop where the model
calls a search tool to verify something the system prompt
already told it.
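The precondition is worth repeating in the tool definition
itself, so the model sees it both in the system text and at
the call site. A sketch in the Anthropic tools shape; the
tool name and precondition are invented for illustration:

tools = [{
    "name": "lookup_invoice",  # hypothetical tool
    "description": (
        "Look up an invoice by ID. Call ONLY when the email "
        "references an invoice number not already in context."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"invoice_id": {"type": "string"}},
        "required": ["invoice_id"],
    },
}]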
Output contract. One sentence about shape. The "no prose,
no markdown fences" clause matters more on GPT and Gemini
than on Claude. Both have a stronger default tendency to
wrap JSON in code fences if you let them.
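Even with the clause in place, a defensive strip on the
parse path is cheap insurance. A minimal sketch:

import json
import re

def parse_contract(raw: str) -> dict:
    # Tolerate a fenced reply even though the contract
    # forbids it: drop a leading ```json (or bare ```)
    # and a trailing ```.
    text = raw.strip()
    text = re.sub(r"^```(?:json)?\s*", "", text)
    text = re.sub(r"\s*```$", "", text)
    return json.loads(text)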
Few-shot anchors. Two examples are enough. One success
case, one refusal. The refusal anchor carries more weight than
the success one; without it, the model interprets the
refusal contract as advisory.
Fallback. The exit. When the contract cannot be met, the
model returns a structured surrender instead of inventing a
shape. This is the section that turns "weird response, retry
with a different prompt" into "log it, route it, move on".
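In code, that is a four-way branch on status rather than a
retry loop. A sketch; the review-queue hand-off is left as a
comment because it is deployment-specific:

import logging

log = logging.getLogger("triage")

def route(resp: dict):
    status = resp.get("status", "fallback")
    if status == "ok":
        return resp["data"]
    if status == "needs_human":
        log.warning("escalated: %s", resp["reason"])  # push to a review queue here
        return None
    # out_of_scope and fallback both end the attempt:
    # log it, move on.
    log.info("%s: %s", status, resp["reason"])
    return None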
A tested example task
Task: classify an inbound support email as one of billing,
technical, account, or other, and extract the customer
ID if present. A system prompt that runs cleanly on Claude
4.6, GPT-5.5, and Gemini 2.5 Pro:
[ROLE + SCOPE]
You are an email triage classifier for a SaaS product.
You classify inbound support emails into one category and
extract the customer ID when one is present in the body.
You do not draft replies, summarize threads, or guess at
sentiment.
[REFUSAL + ESCALATION]
If the email is not a support request, return:
{ "status": "out_of_scope", "category": null,
"customer_id": null,
"reason": "not a support request" }
If the email contains a legal threat or a security
disclosure, return:
{ "status": "needs_human", "category": null,
"customer_id": null,
"reason": "requires human review" }
Split here so the block stays readable. The second half:
[TOOL GUIDANCE]
No tools available for this task. Do not invent tool calls.
[OUTPUT CONTRACT]
Return one JSON object with keys: status, category,
customer_id, reason. No prose. No markdown fences.
status is one of: ok, out_of_scope, needs_human, fallback.
category is one of: billing, technical, account, other, null.
customer_id is a string or null.
[FEW-SHOT ANCHORS]
Input: "Card was declined for invoice 88421. CID: A-7781."
Output: { "status": "ok", "category": "billing",
"customer_id": "A-7781", "reason": "card declined" }
Input: "We will be filing suit unless this is resolved."
Output: { "status": "needs_human", "category": null,
"customer_id": null, "reason": "requires human review" }
[FALLBACK]
If you cannot determine a category, return:
{ "status": "fallback", "category": null,
"customer_id": null, "reason": "ambiguous content" }
Run this against an illustrative batch of inbound emails on
each provider with the same temperature and the same
user-turn template, and all three models will produce
parseable JSON on every call. The disagreements show up on
edge cases. One model classifies a password-reset complaint
as account, another as technical, and those edge cases
are exactly what the contract is meant to surface.
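A sketch of the comparison loop that surfaces those
disagreements; call_provider is a hypothetical helper that
sends the same system prompt (SKELETON as before) and user
turn through one vendor's transport and returns the parsed
JSON:

PROVIDERS = ["anthropic", "openai", "google"]

def compare(email_body: str) -> dict:
    results = {p: call_provider(p, system=SKELETON, user=email_body)
               for p in PROVIDERS}
    categories = {r["category"] for r in results.values()}
    if len(categories) > 1:
        # A disagreement is a finding, not a failure:
        # log the edge case.
        log_edge_case(email_body, results)  # hypothetical logger
    return results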
What each provider needs done differently
Three caveats, drawn from each provider's documented
contract.
Claude. Wrap the sections in XML tags if you want the
strongest section-boundary signal. [ROLE + SCOPE] works,
but <role_and_scope>...</role_and_scope> works better in
practice. Anthropic's own prompt engineering docs
recommend XML tagging for any prompt with multiple distinct
parts.
GPT. Use the developer role on the Responses API where
available. The OpenAI guidance
distinguishes developer instructions (your rules) from user
inputs (the data); on reasoning models, the developer role
is the one whose constraints the model trusts most. Also
re-state the "no markdown fences" clause inside the output
contract section; the default tendency to fence JSON is
stronger on GPT.
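In SDK terms, that means the skeleton goes in a developer
message rather than being concatenated into the user turn.
A sketch with the official openai Python SDK; the model id
is a placeholder, and SKELETON and email_body carry over
from the earlier sketches:

from openai import OpenAI

client = OpenAI()
resp = client.responses.create(
    model="gpt-5",  # placeholder model id
    input=[
        {"role": "developer", "content": SKELETON},
        {"role": "user", "content": f"Email:\n{email_body}"},
    ],
)
print(resp.output_text)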
Gemini. Per the official docs, the systemInstruction field
on generateContent is a structured Content object with a
parts array, not a plain string. SDK shape mistakes are the
most common reason a Gemini system prompt looks like it is
being ignored: it is being dropped at the transport layer.
Test that the field landed before you blame the prompt.
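The wrong and right request bodies, side by side; SKELETON
and email_body are carried over from the earlier sketches:

# Wrong: a bare string; depending on the layer that builds
# the request, it can be rejected or silently dropped.
bad_body = {
    "systemInstruction": SKELETON,
    "contents": [{"role": "user", "parts": [{"text": email_body}]}],
}

# Right: a Content object with a parts array.
good_body = {
    "systemInstruction": {"parts": [{"text": SKELETON}]},
    "contents": [{"role": "user", "parts": [{"text": email_body}]}],
}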
The skeleton itself is identical across all three. What
changes is the wrapper: XML on Claude, the developer role
on GPT, the Content shape on Gemini.
When the skeleton fails
It fails on tasks where the rubric is implicit and only
emerges from many examples. A creative-writing assistant, a
tone-matching reply drafter, a code-review bot with a house
style. Those tasks need few-shot density the skeleton
doesn't give you, and the structured-refusal contract gets
in the way of the open-ended output the task wants.
It also fails when you try to stuff per-call data into the
system prompt. The skeleton holds the rules. Per-call
context (the email body, the diff, the document) belongs
in the user turn, where every provider's cache and rate
accounting expects it. Put a customer record into the
system prompt and the cache hit rate falls through the
floor — every provider's cache keys off the system prefix.
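A sketch of the split that keeps the cache warm, in the
Anthropic Messages shape; cache_control is Anthropic's
documented prompt-caching marker, and the other providers
key their caches off the same static-prefix idea:

body = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    # Static rules: byte-identical on every call, so the
    # prefix caches.
    "system": [{"type": "text", "text": SKELETON,
                "cache_control": {"type": "ephemeral"}}],
    # Per-call data: only the user turn changes between
    # requests.
    "messages": [{"role": "user", "content": f"Email:\n{email_body}"}],
}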
When the next model ships and the wording you used for
Claude 4.6 stops landing on whatever replaces it, the slots
are still the slots, and the rewrite is a fifteen-minute
job instead of a two-day one. That is the whole point of
trading sentences for shape: the next swap is already
half-done before procurement even tells you it is coming.
If this was useful
The Prompt Engineering Pocket Guide
goes deeper on the skeleton: a per-section design checklist,
the eval rig for testing portability across providers, and
the cases where you should ditch the skeleton and write
something model-specific instead. Written for engineers who
maintain prompts in production and have stopped trusting
any pattern that hasn't survived a vendor swap.
