Raj Kundalia

Posted on Jun 13

Where You Put the Instruction Matters More Than What It Says

#ai #promptengineering

An experiment comparing system prompts, user prompts, and tool descriptions across Claude and Qwen

Originally published on Medium: https://medium.com/@rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2d5ffcdd9369

There’s a lot of advice on how to write good prompts:

Use chain-of-thought
Add examples
Be specific

But I hadn’t seen much real-world evidence on a different question:

When you give an LLM agent an instruction, does it matter which slot you put it in?

I am not talking about wording or tone. I mean the structural slot: system message, user message, or tool description.

These aren’t just different positions in a string. They’re different fields in the API payload, and models are trained to treat them differently.

I wanted to know:

Does the slot actually affect whether the model follows the instruction?

So I built an experiment to find out.

GitHub Repository: https://github.com/rajkundalia/prompt-placement-anatomy

The Experiment

The design is deliberately boring.

The agent’s job is to count TODO markers across five markdown files using two filesystem tools:

list_files
read_file

The instruction under test is:

End your final answer with the marker [DONE].

That instruction gets placed in exactly one of three slots per run:

System message — typically reserved for persona and behavioral rules
User message — where the task itself lives
Tool description — metadata attached to the tool schema, appended to the read_file tool description

The task stays identical across all three variants.

The only variable is where the [DONE] instruction lives.

Why No `submit_answer` Tool?

One design decision worth calling out:

There is no submit_answer or final_response tool.

The agent terminates by returning ordinary text with no further tool calls.

Compliance is checked on that free-text response using a case-insensitive search for [DONE] in the last 80 characters.

This was intentional.

I wanted to measure whether the model follows a formatting instruction in its natural output, not whether it can populate a structured tool argument correctly.

Those are different skills.

Each placement is run multiple times.

Metrics collected:

Compliance rate (did it append [DONE]?)
Completion rate (did it finish within the 15-turn cap?)
Turns to completion
Total token usage

Compliance is the headline metric.

The other metrics help explain agent behavior but are not the primary outcome.

Why No Frameworks?

The agent loop is a Python while loop:

Send a message
Check for tool calls
Execute tools
Append results
Repeat

If the model produces text with no tool calls, the run is done.

I avoided frameworks deliberately.

Frameworks add their own:

system messages
tool schema modifications
hidden instructions

If I’m measuring placement effects, I need to know exactly what’s in each slot and nothing else.

The entire implementation is about 300 lines of Python and fully visible in agent_loop.py.

Results: Qwen 2.5-Coder 3B (Ollama)

50 runs per placement

Placement	Compliance Rate	Completion Rate
System	8%	100%
User	64%	100%
Tool Description	2%	100%

The model produced a final answer every time.

100% completion across the board.

But whether it remembered to append [DONE] depended almost entirely on where the instruction lived.

User message placement was dramatically more effective than both alternatives.

The gap between user (64%) and system (8%) is large enough that the Wilson 95% confidence intervals do not overlap, suggesting a real difference rather than sampling noise.

Tool description placement was effectively useless at 2%.

The system message wasn’t much better at 8%.

For this model, on this task, only the user message slot reliably delivered instructions.

Results: Claude Sonnet 4.6 (Anthropic API)

20 runs per placement

Placement	Compliance Rate	Mean Turns
System	100%	3
User	100%	3
Tool Description	100%	3

Completely placement-insensitive.

100% compliance across all three slots.

The model followed the [DONE] instruction regardless of where it lived.

It also used the tools correctly every time:

List files
Read files
Produce answer

No chart generated; a flat line at 100% carries no information.

Results: Claude Haiku 4.5 (Anthropic API)

50 runs per placement

Placement	Compliance Rate	Mean Turns
System	100%	3
User	100%	3
Tool Description	100%	3

Identical.

This is Anthropic’s smallest and cheapest model, yet it showed the same placement robustness as Sonnet.

Even Haiku exhibited zero placement sensitivity.

No chart generated; a flat line at 100% carries no information.

If you're wondering what "turns" are, Anthropic's Agent SDK documentation explains the agent loop nicely:

https://code.claude.com/docs/en/agent-sdk/agent-loop#the-loop-at-a-glance

The Summary

Model	Type	System	User	Tool Desc
qwen2.5-coder:3b	Small local (Ollama)	8%	64%	2%
claude-haiku-4.5	Small frontier (Anthropic)	100%	100%	100%
claude-sonnet-4.6	Large frontier (Anthropic)	100%	100%	100%

The biggest difference wasn’t between system, user, and tool slots.

It was between model classes.

Both Anthropic models followed the instruction regardless of placement.

The 3B-parameter open-weight model did not.

For that model, the user message was the only placement that produced meaningful compliance.

Based on these results, placement sensitivity was a major factor for the 3B open-weight model and effectively a non-factor for the two frontier models tested.

What This Means in Practice

Many teams choose small local models for:

Cost
Latency
Privacy

If you’re one of them, instruction placement isn’t a matter of style.

It’s a matter of reliability.

In this experiment, placing a critical instruction in the system message or tool description was almost as ineffective as omitting it entirely.

The user message was the only slot that consistently delivered meaningful compliance.

If you're building with frontier models, placement didn't matter under these conditions.

Caveats

The prompts were short (~300 tokens for Ollama, ~6,000 tokens for Claude including tool calls).
Task accuracy was not measured.
The counting task is a distractor designed to force multi-turn tool use.
The exact percentages apply only to qwen2.5-coder:3b on this task.
Different models, quantizations, and tasks may produce different results.

What may generalize more broadly is the ranking:

On similar small open-weight models, the user message may continue to be the most effective placement, even if the size of the advantage changes.

Despite those caveats, the central result is hard to ignore:

For the 3B model, the same instruction produced dramatically different behavior depending solely on where it was placed.

What's Next: Instruction Conflict (Part 2)

This experiment measures placement strength in isolation:

One instruction
One slot
No competing signals

The natural follow-up is instruction conflict.

Imagine:

System prompt

Append [DONE]

User message

Append [FINISHED]

Tool description

Append [COMPLETE]

Then observe which marker appears in the final answer.

This reveals the priority ordering of slots, not just whether they're read.

Questions worth exploring:

Does the system prompt win over the user message?
Do frontier models follow a hierarchy?
Does a small model notice the conflict at all?
Does it simply follow whichever slot it was already attending to?

DEV Community

Where You Put the Instruction Matters More Than What It Says

The Experiment

Why No `submit_answer` Tool?

Why No Frameworks?

Results: Qwen 2.5-Coder 3B (Ollama)

Results: Claude Sonnet 4.6 (Anthropic API)

Results: Claude Haiku 4.5 (Anthropic API)

The Summary

What This Means in Practice

Caveats

What's Next: Instruction Conflict (Part 2)

Related Reading

Connect

Top comments (0)

The Experiment

Why No submit_answer Tool?

Why No Frameworks?

Results: Qwen 2.5-Coder 3B (Ollama)

Results: Claude Sonnet 4.6 (Anthropic API)

Results: Claude Haiku 4.5 (Anthropic API)

The Summary

What This Means in Practice

Caveats

What's Next: Instruction Conflict (Part 2)

Related Reading

Connect

Why No `submit_answer` Tool?