
Kristofer Jussmann

Posted on • Originally published at blog.kaelux.dev

AI Format Wars: Does the Shape of Your Prompt Matter? (1,080 Evals Later)


We spend hours tweaking the words in our prompts, but how much thought do we give to the structure? If you ask an AI to return data in JSON vs. Markdown, or if you write a concise 50-word prompt vs. a detailed 500-word prompt, does the quality of the reasoning actually change?

To find out, I ran Study E v2: The Format Wars.

I subjected 5 frontier models to 1,080 rigorous evaluations across 12 distinct task domains (coding, math, data extraction, analysis, creative writing, and more). Every single evaluation was scored blindly by a 3-judge LLM jury on a 100-point scale.

The results completely changed how I build AI applications.


🔬 The Setup: 1,080 Evaluations

We tested five heavyweight models:

  • GPT-5.4 (OpenAI)

  • Nemotron 3 Super 120B (Nvidia)

  • Claude Sonnet 4.6 (Anthropic)

  • Gemini 3.1 Pro (Google)

  • Qwen 3.5 397B (Alibaba)

For each model, we ran 216 evaluations testing 18 unique prompt configurations:

  • 6 Formats: Plain Text, Markdown, XML, JSON, YAML, Hybrid (Text + Code Blocks)

  • 3 Lengths: Short (<50 words), Medium (~150 words), Long (>300 words)

The scoring was handled by a ruthless 3-judge panel (Llama 4 Maverick, Claude Opus 4.6, and Atla Selene Mini) grading on instruction following, reasoning quality, formatting adherence, and edge-case handling.
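The 6 × 3 grid above is small enough to enumerate directly. Here is a minimal sketch of how the cross product yields the study's totals (the variable names are mine, not from the study's harness):

```python
from itertools import product

# The six output formats and three prompt-length buckets from the study.
FORMATS = ["plain_text", "markdown", "xml", "json", "yaml", "hybrid"]
LENGTHS = ["short", "medium", "long"]  # <50, ~150, >300 words

# Every (format, length) pair is one prompt configuration.
configs = list(product(FORMATS, LENGTHS))

print(len(configs))           # 18 configurations per model
print(len(configs) * 12 * 5)  # 18 configs x 12 domains x 5 models = 1080
```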


πŸ† Finding 1: The Model Rankings

Before looking at formats, how did the models perform overall across all 18 permutations?

Overall Model Rankings — Average Score out of 100 (1,080 evaluations)

  1. 🥇 GPT-5.4: 88.1 / 100 — Won 10 out of 12 task domains

  2. 🥈 Nemotron 120B: 85.1 / 100 — Won 1 domain (Data Extraction), extremely close to GPT-5.4

  3. 🥉 Claude Sonnet 4.6: 69.5 / 100

  4. Gemini 3.1 Pro: 62.6 / 100 — Won 1 domain (Question Answering)

  5. Qwen 397B: 61.0 / 100

Takeaway: GPT-5.4 is the undeniable reasoning king right now. But Nvidia's Nemotron 120B is a shocking powerhouse: it finished within three points overall and actually beat GPT-5.4 outright in Data Extraction tasks. If you aren't testing Nemotron in your pipelines, you are missing out.

Task Domain Winners — GPT-5.4 dominates 10/12, but Nemotron owns Extraction


🧱 Finding 2: The Best Format is... JSON?

If you want the highest quality reasoning and instruction following from an LLM, what format should you ask it to return?

Format Impact on Reasoning Quality — Averaged Over All 5 Models

  1. YAML: 74.6 / 100

  2. JSON: 74.4 / 100 (Statistical tie with YAML)

  3. Hybrid: 73.5 / 100

  4. XML: 73.3 / 100

  5. Markdown: 72.9 / 100

  6. Plain Text: 70.8 / 100

Takeaway: Asking the model to structure its output in JSON or YAML doesn't just make it easier for your code to parse — it actually improves the model's reasoning.

Why? Forcing the model into a strict structural schema (like JSON keys) acts as a cognitive scaffold. It forces the model to categorize its thoughts before generating output, leading to fewer hallucinations and better instruction adherence. Plain unstructured text performed the worst across the board.

But here's the nuance: different models prefer different formats:

Format × Model Heatmap — The sweet spot varies by model

Note: While JSON and YAML were statistically tied at the top overall, Nemotron and Qwen performed slightly better when outputting YAML.
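In practice, "demand structure" just means telling the model exactly which keys to fill and rejecting replies that don't comply. A minimal sketch, assuming a hypothetical incident-summary task (the schema, field names, and helper are my own illustration, not from the study):

```python
import json

# A short prompt that pins the model to a fixed JSON schema.
PROMPT = (
    "Summarize the incident report below. "
    "Reply with ONLY this JSON object:\n"
    '{"summary": "<one sentence>", "severity": "low|medium|high", '
    '"action_items": ["<step>", ...]}'
)

def parse_reply(reply: str) -> dict:
    """Validate that a model reply matches the requested schema."""
    data = json.loads(reply)  # raises json.JSONDecodeError on non-JSON output
    missing = {"summary", "severity", "action_items"} - data.keys()
    if missing:
        raise ValueError(f"model omitted keys: {sorted(missing)}")
    return data

# A well-formed reply parses cleanly; a malformed one fails loudly.
reply = ('{"summary": "DB outage at 02:00.", "severity": "high", '
         '"action_items": ["add alerting"]}')
print(parse_reply(reply)["severity"])  # high
```

The validator is the point: a structural schema gives you a cheap, deterministic check on every response, which unstructured text never can.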


πŸ“ Finding 3: The Prompt Length Paradox

We've been trained to write massive, highly detailed "megaprompts" with endless context. But the data reveals a startling paradox:

The Length Paradox — Shorter Prompts Win Across All Models

  • Short Prompts (<50 words): 80.1 / 100

  • Medium Prompts (~150 words): 72.8 / 100

  • Long Prompts (>300 words): 66.9 / 100

Takeaway: Across all 5 models and all 6 formats, short prompts absolutely demolished long prompts.

When you flood the context window with too many instructions, constraints, and examples, the model suffers from attention dilution. It forgets the primary objective and gets bogged down trying to satisfy secondary constraints.

The worst combination in the entire study? Qwen 397B given a Long prompt asking for Plain Text (38.8/100).


πŸ… Finding 4: The Best and Worst Combinations

What are the absolute best and worst model + format + length trios?

Top 5 vs Bottom 5 Combinations — The gap is massive (53+ points)

The Golden Combo scored 92.2 / 100: GPT-5.4 + Hybrid Output + Short Prompt.


🚀 The Ultimate Prompting Formula

If you want to maximize the performance of a modern LLM, the data points to a clear formula:

  1. Keep it brief: State your objective clearly in under 50 words. Drop the fluff.

  2. Demand structure: Always ask the model to return its answer in JSON or YAML. Avoid asking for unstructured text.

  3. Use the right model: GPT-5.4 for general reasoning/coding, Nemotron 120B for extraction.
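Putting the three rules together, here is a sketch of the kind of rewrite the data favors. Both prompts and the word-count guard are my own illustration, not drawn from the study's dataset:

```python
# A bloated megaprompt vs. its "Short + Structured" rewrite.
LONG_PROMPT = (
    "You are a world-class analyst with decades of experience. I need you to "
    "carefully read the customer feedback, think step by step, consider every "
    "angle, be thorough but concise, avoid speculation, and then produce a "
    "detailed breakdown covering sentiment, themes, and suggested follow-ups. "
    "Make sure to format everything nicely, use bullet points where it helps, "
    "double-check your work, and do not include anything that is not directly "
    "supported by the text."
)

SHORT_PROMPT = (
    "Classify the customer feedback below. Reply with ONLY this YAML:\n"
    "sentiment: positive|neutral|negative\n"
    "themes: [<up to 3>]\n"
    "follow_up: <one sentence>"
)

def is_short(prompt: str, limit: int = 50) -> bool:
    """Rule 1: the objective fits in under `limit` words."""
    return len(prompt.split()) < limit

print(is_short(SHORT_PROMPT), is_short(LONG_PROMPT))  # True False
```

Note how the rewrite also bakes in rule 2: the requested YAML keys replace several sentences of soft instructions.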

I built PromptTriage specifically to help developers automatically refactor those bloated 500-word megaprompts down into the high-scoring "Short + Structured" style this data points to.

Data lovers: The full 1,080-row dataset and analysis script are open-sourced in the PromptTriage repo.
