DEV Community

Raj Kundalia
Raj Kundalia

Posted on

Where You Put the Instruction Matters More Than What It Says

An experiment comparing system prompts, user prompts, and tool descriptions across Claude and Qwen

Originally published on Medium: https://medium.com/@rajkundalia/where-you-put-the-instruction-matters-more-than-what-it-says-2d5ffcdd9369

There’s a lot of advice on how to write good prompts:

  • Use chain-of-thought
  • Add examples
  • Be specific

But I hadn’t seen much real-world evidence on a different question:

When you give an LLM agent an instruction, does it matter which slot you put it in?

I am not talking about wording or tone. I mean the structural slot: system message, user message, or tool description.

These aren’t just different positions in a string. They’re different fields in the API payload, and models are trained to treat them differently.

I wanted to know:

Does the slot actually affect whether the model follows the instruction?

So I built an experiment to find out.

GitHub Repository: https://github.com/rajkundalia/prompt-placement-anatomy

Prompt Placement Anatomy

The Experiment

The design is deliberately boring.

The agent’s job is to count TODO markers across five markdown files using two filesystem tools:

  • list_files
  • read_file

The instruction under test is:

End your final answer with the marker [DONE].

That instruction gets placed in exactly one of three slots per run:

  1. System message — typically reserved for persona and behavioral rules
  2. User message — where the task itself lives
  3. Tool description — metadata attached to the tool schema, appended to the read_file tool description

The task stays identical across all three variants.

The only variable is where the [DONE] instruction lives.

Why No submit_answer Tool?

One design decision worth calling out:

There is no submit_answer or final_response tool.

The agent terminates by returning ordinary text with no further tool calls.

Compliance is checked on that free-text response using a case-insensitive search for [DONE] in the last 80 characters.

This was intentional.

I wanted to measure whether the model follows a formatting instruction in its natural output, not whether it can populate a structured tool argument correctly.

Those are different skills.

Each placement is run multiple times.

Metrics collected:

  • Compliance rate (did it append [DONE]?)
  • Completion rate (did it finish within the 15-turn cap?)
  • Turns to completion
  • Total token usage

Compliance is the headline metric.

The other metrics help explain agent behavior but are not the primary outcome.

Why No Frameworks?

The agent loop is a Python while loop:

  1. Send a message
  2. Check for tool calls
  3. Execute tools
  4. Append results
  5. Repeat

If the model produces text with no tool calls, the run is done.

I avoided frameworks deliberately.

Frameworks add their own:

  • system messages
  • tool schema modifications
  • hidden instructions

If I’m measuring placement effects, I need to know exactly what’s in each slot and nothing else.

The entire implementation is about 300 lines of Python and fully visible in agent_loop.py.


Results: Qwen 2.5-Coder 3B (Ollama)

50 runs per placement

Placement Compliance Rate Completion Rate
System 8% 100%
User 64% 100%
Tool Description 2% 100%

The model produced a final answer every time.

100% completion across the board.

But whether it remembered to append [DONE] depended almost entirely on where the instruction lived.

User message placement was dramatically more effective than both alternatives.

The gap between user (64%) and system (8%) is large enough that the Wilson 95% confidence intervals do not overlap, suggesting a real difference rather than sampling noise.

Tool description placement was effectively useless at 2%.

The system message wasn’t much better at 8%.

For this model, on this task, only the user message slot reliably delivered instructions.


Results: Claude Sonnet 4.6 (Anthropic API)

20 runs per placement

Placement Compliance Rate Mean Turns
System 100% 3
User 100% 3
Tool Description 100% 3

Completely placement-insensitive.

100% compliance across all three slots.

The model followed the [DONE] instruction regardless of where it lived.

It also used the tools correctly every time:

  1. List files
  2. Read files
  3. Produce answer

No chart generated; a flat line at 100% carries no information.


Results: Claude Haiku 4.5 (Anthropic API)

50 runs per placement

Placement Compliance Rate Mean Turns
System 100% 3
User 100% 3
Tool Description 100% 3

Identical.

This is Anthropic’s smallest and cheapest model, yet it showed the same placement robustness as Sonnet.

Even Haiku exhibited zero placement sensitivity.

No chart generated; a flat line at 100% carries no information.

If you're wondering what "turns" are, Anthropic's Agent SDK documentation explains the agent loop nicely:

https://code.claude.com/docs/en/agent-sdk/agent-loop#the-loop-at-a-glance


The Summary

Model Type System User Tool Desc
qwen2.5-coder:3b Small local (Ollama) 8% 64% 2%
claude-haiku-4.5 Small frontier (Anthropic) 100% 100% 100%
claude-sonnet-4.6 Large frontier (Anthropic) 100% 100% 100%

The biggest difference wasn’t between system, user, and tool slots.

It was between model classes.

Both Anthropic models followed the instruction regardless of placement.

The 3B-parameter open-weight model did not.

For that model, the user message was the only placement that produced meaningful compliance.

Based on these results, placement sensitivity was a major factor for the 3B open-weight model and effectively a non-factor for the two frontier models tested.

Results Summary


What This Means in Practice

Many teams choose small local models for:

  • Cost
  • Latency
  • Privacy

If you’re one of them, instruction placement isn’t a matter of style.

It’s a matter of reliability.

In this experiment, placing a critical instruction in the system message or tool description was almost as ineffective as omitting it entirely.

The user message was the only slot that consistently delivered meaningful compliance.

If you're building with frontier models, placement didn't matter under these conditions.


Caveats

  • The prompts were short (~300 tokens for Ollama, ~6,000 tokens for Claude including tool calls).
  • Task accuracy was not measured.
  • The counting task is a distractor designed to force multi-turn tool use.
  • The exact percentages apply only to qwen2.5-coder:3b on this task.
  • Different models, quantizations, and tasks may produce different results.

What may generalize more broadly is the ranking:

On similar small open-weight models, the user message may continue to be the most effective placement, even if the size of the advantage changes.

Despite those caveats, the central result is hard to ignore:

For the 3B model, the same instruction produced dramatically different behavior depending solely on where it was placed.


What's Next: Instruction Conflict (Part 2)

This experiment measures placement strength in isolation:

  • One instruction
  • One slot
  • No competing signals

The natural follow-up is instruction conflict.

Imagine:

System prompt

Append [DONE]
Enter fullscreen mode Exit fullscreen mode

User message

Append [FINISHED]
Enter fullscreen mode Exit fullscreen mode

Tool description

Append [COMPLETE]
Enter fullscreen mode Exit fullscreen mode

Then observe which marker appears in the final answer.

This reveals the priority ordering of slots, not just whether they're read.

Questions worth exploring:

  • Does the system prompt win over the user message?
  • Do frontier models follow a hierarchy?
  • Does a small model notice the conflict at all?
  • Does it simply follow whichever slot it was already attending to?

Related Reading

Why 95 Reviews Beats 20 Reviews — Even When Both Score 95%

The statistical foundation behind the Wilson confidence intervals used in this experiment.

Connect

Top comments (0)