An experiment comparing system prompts, user prompts, and tool descriptions across Claude and Qwen
Originally published on Medium: https://medium.com/@rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2d5ffcdd9369
There’s a lot of advice on how to write good prompts:
- Use chain-of-thought
- Add examples
- Be specific
But I hadn’t seen much real-world evidence on a different question:
When you give an LLM agent an instruction, does it matter which slot you put it in?
I am not talking about wording or tone. I mean the structural slot: system message, user message, or tool description.
These aren’t just different positions in a string. They’re different fields in the API payload, and models are trained to treat them differently.
I wanted to know:
Does the slot actually affect whether the model follows the instruction?
So I built an experiment to find out.
GitHub Repository: https://github.com/rajkundalia/prompt-placement-anatomy
The Experiment
The design is deliberately boring.
The agent’s job is to count TODO markers across five markdown files using two filesystem tools:
list_filesread_file
The instruction under test is:
End your final answer with the marker
[DONE].
That instruction gets placed in exactly one of three slots per run:
- System message — typically reserved for persona and behavioral rules
- User message — where the task itself lives
-
Tool description — metadata attached to the tool schema, appended to the
read_filetool description
The task stays identical across all three variants.
The only variable is where the [DONE] instruction lives.
Why No submit_answer Tool?
One design decision worth calling out:
There is no submit_answer or final_response tool.
The agent terminates by returning ordinary text with no further tool calls.
Compliance is checked on that free-text response using a case-insensitive search for [DONE] in the last 80 characters.
This was intentional.
I wanted to measure whether the model follows a formatting instruction in its natural output, not whether it can populate a structured tool argument correctly.
Those are different skills.
Each placement is run multiple times.
Metrics collected:
- Compliance rate (did it append
[DONE]?) - Completion rate (did it finish within the 15-turn cap?)
- Turns to completion
- Total token usage
Compliance is the headline metric.
The other metrics help explain agent behavior but are not the primary outcome.
Why No Frameworks?
The agent loop is a Python while loop:
- Send a message
- Check for tool calls
- Execute tools
- Append results
- Repeat
If the model produces text with no tool calls, the run is done.
I avoided frameworks deliberately.
Frameworks add their own:
- system messages
- tool schema modifications
- hidden instructions
If I’m measuring placement effects, I need to know exactly what’s in each slot and nothing else.
The entire implementation is about 300 lines of Python and fully visible in agent_loop.py.
Results: Qwen 2.5-Coder 3B (Ollama)
50 runs per placement
| Placement | Compliance Rate | Completion Rate |
|---|---|---|
| System | 8% | 100% |
| User | 64% | 100% |
| Tool Description | 2% | 100% |
The model produced a final answer every time.
100% completion across the board.
But whether it remembered to append [DONE] depended almost entirely on where the instruction lived.
User message placement was dramatically more effective than both alternatives.
The gap between user (64%) and system (8%) is large enough that the Wilson 95% confidence intervals do not overlap, suggesting a real difference rather than sampling noise.
Tool description placement was effectively useless at 2%.
The system message wasn’t much better at 8%.
For this model, on this task, only the user message slot reliably delivered instructions.
Results: Claude Sonnet 4.6 (Anthropic API)
20 runs per placement
| Placement | Compliance Rate | Mean Turns |
|---|---|---|
| System | 100% | 3 |
| User | 100% | 3 |
| Tool Description | 100% | 3 |
Completely placement-insensitive.
100% compliance across all three slots.
The model followed the [DONE] instruction regardless of where it lived.
It also used the tools correctly every time:
- List files
- Read files
- Produce answer
No chart generated; a flat line at 100% carries no information.
Results: Claude Haiku 4.5 (Anthropic API)
50 runs per placement
| Placement | Compliance Rate | Mean Turns |
|---|---|---|
| System | 100% | 3 |
| User | 100% | 3 |
| Tool Description | 100% | 3 |
Identical.
This is Anthropic’s smallest and cheapest model, yet it showed the same placement robustness as Sonnet.
Even Haiku exhibited zero placement sensitivity.
No chart generated; a flat line at 100% carries no information.
If you're wondering what "turns" are, Anthropic's Agent SDK documentation explains the agent loop nicely:
https://code.claude.com/docs/en/agent-sdk/agent-loop#the-loop-at-a-glance
The Summary
| Model | Type | System | User | Tool Desc |
|---|---|---|---|---|
| qwen2.5-coder:3b | Small local (Ollama) | 8% | 64% | 2% |
| claude-haiku-4.5 | Small frontier (Anthropic) | 100% | 100% | 100% |
| claude-sonnet-4.6 | Large frontier (Anthropic) | 100% | 100% | 100% |
The biggest difference wasn’t between system, user, and tool slots.
It was between model classes.
Both Anthropic models followed the instruction regardless of placement.
The 3B-parameter open-weight model did not.
For that model, the user message was the only placement that produced meaningful compliance.
Based on these results, placement sensitivity was a major factor for the 3B open-weight model and effectively a non-factor for the two frontier models tested.
What This Means in Practice
Many teams choose small local models for:
- Cost
- Latency
- Privacy
If you’re one of them, instruction placement isn’t a matter of style.
It’s a matter of reliability.
In this experiment, placing a critical instruction in the system message or tool description was almost as ineffective as omitting it entirely.
The user message was the only slot that consistently delivered meaningful compliance.
If you're building with frontier models, placement didn't matter under these conditions.
Caveats
- The prompts were short (~300 tokens for Ollama, ~6,000 tokens for Claude including tool calls).
- Task accuracy was not measured.
- The counting task is a distractor designed to force multi-turn tool use.
- The exact percentages apply only to
qwen2.5-coder:3bon this task. - Different models, quantizations, and tasks may produce different results.
What may generalize more broadly is the ranking:
On similar small open-weight models, the user message may continue to be the most effective placement, even if the size of the advantage changes.
Despite those caveats, the central result is hard to ignore:
For the 3B model, the same instruction produced dramatically different behavior depending solely on where it was placed.
What's Next: Instruction Conflict (Part 2)
This experiment measures placement strength in isolation:
- One instruction
- One slot
- No competing signals
The natural follow-up is instruction conflict.
Imagine:
System prompt
Append [DONE]
User message
Append [FINISHED]
Tool description
Append [COMPLETE]
Then observe which marker appears in the final answer.
This reveals the priority ordering of slots, not just whether they're read.
Questions worth exploring:
- Does the system prompt win over the user message?
- Do frontier models follow a hierarchy?
- Does a small model notice the conflict at all?
- Does it simply follow whichever slot it was already attending to?
Related Reading
Why 95 Reviews Beats 20 Reviews — Even When Both Score 95%
The statistical foundation behind the Wilson confidence intervals used in this experiment.


Top comments (0)