One structured prompt format. Two identical reasoning tasks. Same model. Unstructured: 1,240 tokens. Structured (with explicit schema): 847 tokens. 32% reduction. That's real, repeatable, shows up in cost logs. But it's also the easy part.
The harder part is knowing whether those saved tokens actually translate to better answers on YOUR task. And knowing when structure helps and when it's just overhead.
I spent the last month running the same prompts against Claude Sonnet 4.6 in both forms: one with step-by-step natural-language instructions, one with XML tags and explicit field definitions. Code generation tasks, reasoning tasks, multi-step workflows. Here's what the patterns actually show.
The Unstructured Baseline
When you send a model a request in plain English, the model has to infer the shape you want. It's flexible. It's also ambiguous.
Write a function that validates user email addresses and returns helpful error messages.
The model will deliver SOMETHING. Maybe a function with inline validation. Maybe a helper class. Maybe a regex comment. Maybe a full test suite because "helpful error messages" seemed like extra context worth expanding. You got an answer, but you didn't specify the answer format.
Over five runs with Sonnet 4.6, the same unstructured prompt produced three different architectural shapes:
- Single regex-based validator with a switch statement for errors
- Class-based validator with a dedicated error handler
- Regex validator with a factory function for creating error objects
All correct. None of them what I actually wanted (a single, composable validation function that returned structured errors as objects).
Total tokens across five runs: 6,200. Average per run: 1,240.
The Structured Version
Same task, now with explicit format:
Write a JavaScript function: validateEmail()
Requirements:
- Input: string (email address)
- Output: { valid: boolean, error: string | null }
- Implementation: regex-based validation only
- Error messages: return null if valid, specific error reason if invalid
Error categories:
- "missing_at": no @ symbol found
- "invalid_domain": domain lacks . or has no TLD
- "invalid_local": local part contains invalid characters
Return example:
{ valid: true, error: null }
{ valid: false, error: "invalid_domain" }
Over five runs with the same model, every output had the same shape. No factory functions, no classes, no extra bells and whistles. It did exactly what was asked.
Total tokens across five runs: 4,235. Average per run: 847.
32% reduction. No ambiguity. Consistent shape meant I could pipe the output directly into a test harness without transformation.
Here's what that actually looked like:
function validateEmail(email) {
  const atIndex = email.indexOf('@');
  if (atIndex === -1) {
    return { valid: false, error: 'missing_at' };
  }
  const domain = email.substring(atIndex + 1);
  if (!domain.includes('.')) {
    return { valid: false, error: 'invalid_domain' };
  }
  // Check for invalid characters in local part
  const localPart = email.substring(0, atIndex);
  const invalidChars = /[<>()\\[\],.;:\s]/;
  if (invalidChars.test(localPart)) {
    return { valid: false, error: 'invalid_local' };
  }
  return { valid: true, error: null };
}
Every structured run produced this exact shape. Unstructured runs generated the same logic but wrapped it differently.
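Because every structured run returns the same { valid, error } object, a harness can assert on it directly. What follows is a minimal sketch, not the harness I actually used: the checkShape and runHarness helpers and the cases table are hypothetical names and inputs made up for illustration, and validateEmail is repeated from above so the block runs on its own.

```javascript
// validateEmail is the function shown above, copied so this sketch is self-contained.
function validateEmail(email) {
  const atIndex = email.indexOf('@');
  if (atIndex === -1) {
    return { valid: false, error: 'missing_at' };
  }
  const domain = email.substring(atIndex + 1);
  if (!domain.includes('.')) {
    return { valid: false, error: 'invalid_domain' };
  }
  const localPart = email.substring(0, atIndex);
  const invalidChars = /[<>()\\[\],.;:\s]/;
  if (invalidChars.test(localPart)) {
    return { valid: false, error: 'invalid_local' };
  }
  return { valid: true, error: null };
}

// Hypothetical shape check: exactly the two fields the spec asked for.
function checkShape(result) {
  return (
    typeof result === 'object' && result !== null &&
    Object.keys(result).sort().join(',') === 'error,valid' &&
    typeof result.valid === 'boolean' &&
    (result.error === null || typeof result.error === 'string')
  );
}

// Illustrative cases (assumed, not taken from the original runs).
const cases = [
  { input: 'user@example.com', expected: { valid: true, error: null } },
  { input: 'userexample.com', expected: { valid: false, error: 'missing_at' } },
  { input: 'user@localhost', expected: { valid: false, error: 'invalid_domain' } },
  { input: 'us er@example.com', expected: { valid: false, error: 'invalid_local' } },
];

// Runs every case and records whether shape and values both match.
function runHarness(validate, testCases) {
  return testCases.map(({ input, expected }) => {
    const actual = validate(input);
    const pass =
      checkShape(actual) &&
      actual.valid === expected.valid &&
      actual.error === expected.error;
    return { input, pass };
  });
}

const results = runHarness(validateEmail, cases);
console.log(results.every((r) => r.pass) ? 'all cases pass' : results);
```

One thing a table like this surfaces quickly: the character class in the generated code includes a dot, so first.last@example.com comes back as invalid_local. Consistent shape is not the same as correct behavior, and a case like that is worth adding before trusting the output.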
Why This Matters Less Than You Think
Here's the tricky part: tokens aren't the full story.
The unstructured versions were objectively MORE flexible. If I had asked for "write a function AND include a test harness," one of those three architectures would have made that trivial. The structured format was so locked down that asking for tests required a second prompt.
The benchmark-friendly metric (tokens saved) is real. The useful metric (does this output directly feed my pipeline?) is context-specific. The two give different answers, and how much weight each deserves depends on the task.
When Structure Actually Wins
Code generation tasks: structure wins hard. You have a format spec. You want the model to follow it. Tokens drop, consistency rises.
When I ran the same comparison on five reasoning tasks (writing essays, analyzing text, brainstorming), the token savings were still there (29% average), but a quality tradeoff appeared. Structured prompts locked the reasoning into tighter paths. Some essays came out more formulaic. Not worse, just more constrained.
The model hit a schema compliance target instead of exploring the actual reasoning space.
For code: schema compliance IS the target. For reasoning: sometimes the messiness is the point.
Token Math (Real Numbers)
Using current pricing (Sonnet 4.6 input at $3/1M, output at $15/1M), an average input of 2,000 tokens, and an average output of 1,200 tokens unstructured versus 800 structured:
Unstructured approach:
- Input: 2,000 tokens × ($3/1M) = $0.006
- Output: 1,200 tokens × ($15/1M) = $0.018
- Per call: $0.024
- 100 calls: $2.40
Structured approach:
- Input: 2,000 tokens × ($3/1M) = $0.006
- Output: 800 tokens × ($15/1M) = $0.012
- Per call: $0.018
- 100 calls: $1.80
Difference: $0.60 per 100 calls. At modest volume, that's noise on the bill. On latency (fewer output tokens = faster), it matters more.
If your task outputs 4,000 tokens regularly, suddenly the math shifts. Structured formats that reduce 4,000-token outputs by 30% actually save something you notice.
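A rough sketch of that shift, assuming the $3 and $15 per-million rates quoted above, a 2,000-token input, and a 30% cut to a 4,000-token output. The costPerCall helper and the constant names are illustrative, not part of any SDK.

```javascript
// Rates quoted in this post, in USD per 1M tokens (assumed; check your provider's current pricing).
const INPUT_RATE = 3;
const OUTPUT_RATE = 15;

// Cost in USD of a single call.
function costPerCall(inputTokens, outputTokens) {
  return (inputTokens * INPUT_RATE + outputTokens * OUTPUT_RATE) / 1e6;
}

const unstructured = costPerCall(2000, 4000);
const structured = costPerCall(2000, Math.round(4000 * 0.7)); // 30% fewer output tokens

console.log(unstructured.toFixed(3)); // 0.066
console.log(structured.toFixed(3));   // 0.048
console.log(((unstructured - structured) * 100000).toFixed(0)); // 1800 (USD saved per 100,000 calls)
```

Output tokens cost five times what input tokens do at these rates, so the saving scales almost entirely with how long the responses are.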
The Pattern Recognition Angle
What's interesting is what the output patterns reveal about how models parse instructions.
Models trained on massive code datasets have seen thousands of function specifications. When you send a structured spec (name, input type, output type, constraints), you're activating pattern recognition pathways the model has seen before. It copies the shape. Fast, consistent, fewer tokens.
When you send natural language, the model has to build context from scratch. It's slower, fuzzier, more creative. For code, that's overhead. For reasoning, that's sometimes the whole point.
The models aren't "reasoning through" the unstructured prompt. They're doing pattern matching on a less constrained pattern set. Which is fine. Just know that's what's happening. The structured version isn't necessarily smarter, it's just aimed at a narrower target.
The Practical Move
If you're optimizing cost on code generation at scale:
- Use structured formats (XML or JSON schema)
- Pre-specify output shape and type constraints
- Accept that consistency comes at the cost of flexibility
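One way to keep those specs consistent is to generate them from data instead of hand-writing each prompt. This is a minimal sketch, assuming an XML-tagged layout; the tag names (task, input, output, constraints, error_categories) and the buildPrompt helper are my own illustrative choices, not a format any model requires.

```javascript
// Hypothetical helper: turns a spec object into an XML-tagged prompt string.
function buildPrompt(spec) {
  const constraints = spec.constraints
    .map((c) => `  <constraint>${c}</constraint>`)
    .join('\n');
  const errors = Object.entries(spec.errors)
    .map(([code, meaning]) => `  <error code="${code}">${meaning}</error>`)
    .join('\n');
  return [
    `<task>${spec.task}</task>`,
    `<input>${spec.input}</input>`,
    `<output>${spec.output}</output>`,
    `<constraints>\n${constraints}\n</constraints>`,
    `<error_categories>\n${errors}\n</error_categories>`,
  ].join('\n');
}

// The email validator spec from earlier in the post, expressed as data.
const prompt = buildPrompt({
  task: 'Write a JavaScript function: validateEmail()',
  input: 'string (email address)',
  output: '{ valid: boolean, error: string | null }',
  constraints: ['regex-based validation only', 'return null error if valid'],
  errors: {
    missing_at: 'no @ symbol found',
    invalid_domain: 'domain lacks . or has no TLD',
    invalid_local: 'local part contains invalid characters',
  },
});

console.log(prompt);
```

The sketch does not escape angle brackets or ampersands in the values. That is fine for prompt text a model reads, but not for anything a real XML parser has to accept.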
If you're working on reasoning or analysis:
- Test both formats on your actual task
- Don't assume the token savings mean better output
- Watch the quality delta across 5 to 10 runs, not the benchmark
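A small script makes that comparison less hand-wavy. The sketch below assumes you log one record per run with a token count and a pass/fail from whatever quality check fits your task. The summarize and compare helpers and the record fields are hypothetical, and the sample records are placeholders, not my measurements.

```javascript
// Hypothetical run log: one record per run, per prompt format.
// tokens = output tokens for that run; ok = whether the output passed your own quality check.
function summarize(runs) {
  const meanTokens = runs.reduce((sum, r) => sum + r.tokens, 0) / runs.length;
  const passRate = runs.filter((r) => r.ok).length / runs.length;
  return { meanTokens, passRate };
}

function compare(unstructuredRuns, structuredRuns) {
  const u = summarize(unstructuredRuns);
  const s = summarize(structuredRuns);
  return {
    tokenReduction: 1 - s.meanTokens / u.meanTokens, // fraction of tokens saved
    passRateDelta: s.passRate - u.passRate,          // positive means structure helped quality
  };
}

// Placeholder numbers for illustration only.
const report = compare(
  [{ tokens: 1000, ok: true }, { tokens: 1200, ok: true }],
  [{ tokens: 700, ok: true }, { tokens: 840, ok: false }]
);

console.log(report.tokenReduction.toFixed(2), report.passRateDelta.toFixed(2)); // 0.30 -0.50
```

In the placeholder data, structure saves 30% of the tokens and still loses on quality, which is exactly the case a token-only benchmark hides.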
The people telling you "always structure your prompts" are right about code. They're also copying advice from a code-heavy community. Test it on your task. The benchmark lift doesn't predict real utility. Your data does.
Tags: #ai #tutorial #javascript #optimization