
Kristofer Jussmann

Posted on • Originally published at blog.kaelux.dev

AI Format Wars: Does the Shape of Your Prompt Matter? (1,080 Evals Later)


We spend hours tweaking the words in our prompts, but how much thought do we give to the structure? If you ask an AI to return data in JSON vs. Markdown, or if you write a concise 50-word prompt vs. a detailed 500-word prompt, does the quality of the reasoning actually change?

To find out, I ran Study E v2: The Format Wars.

I subjected 5 frontier models to 1,080 rigorous evaluations across 12 distinct task domains (coding, math, data extraction, analysis, creative writing, and more). Every single evaluation was scored blindly by a 3-judge LLM jury on a 100-point scale.

The results completely changed how I build AI applications.


🔬 The Setup: 1,080 Evaluations

We tested five heavyweight models:

  • GPT-5.4 (OpenAI)

  • Nemotron 3 Super 120B (Nvidia)

  • Claude Sonnet 4.6 (Anthropic)

  • Gemini 3.1 Pro (Google)

  • Qwen 3.5 397B (Alibaba)

For each model, we ran 216 evaluations testing 18 unique prompt configurations:

  • 6 Formats: Plain Text, Markdown, XML, JSON, YAML, Hybrid (Text + Code Blocks)

  • 3 Lengths: Short (<50 words), Medium (~150 words), Long (>300 words)

The scoring was handled by a ruthless 3-judge panel (Llama 4 Maverick, Claude Opus 4.6, and Atla Selene Mini) grading on instruction following, reasoning quality, formatting adherence, and edge-case handling.
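The 6 × 3 grid above is small enough to enumerate directly. Here is a minimal sketch of how the cross product yields the study's totals (the variable names are mine, not from the study's harness):

```python
from itertools import product

# The six output formats and three prompt-length buckets from the study.
FORMATS = ["plain_text", "markdown", "xml", "json", "yaml", "hybrid"]
LENGTHS = ["short", "medium", "long"]  # <50, ~150, >300 words

# Every (format, length) pair is one prompt configuration.
configs = list(product(FORMATS, LENGTHS))

print(len(configs))           # 18 configurations per model
print(len(configs) * 12 * 5)  # 18 configs x 12 domains x 5 models = 1080
```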


πŸ† Finding 1: The Model Rankings

Before looking at formats, how did the models perform overall across all 18 permutations?

Overall Model Rankings — Average Score out of 100 (1,080 evaluations)

  1. 🥇 GPT-5.4: 88.1 / 100 — Won 10 out of 12 task domains

  2. 🥈 Nemotron 120B: 85.1 / 100 — Won 1 domain (Data Extraction), extremely close to GPT-5.4

  3. 🥉 Claude Sonnet 4.6: 69.5 / 100

  4. Gemini 3.1 Pro: 62.6 / 100 — Won 1 domain (Question Answering)

  5. Qwen 397B: 61.0 / 100

Takeaway: GPT-5.4 is the undeniable reasoning king right now. But Nvidia's Nemotron 120B is a shocking powerhouse: it finished within three points overall and actually beat GPT-5.4 outright in Data Extraction tasks. If you aren't testing Nemotron in your pipelines, you are missing out.

Task Domain Winners — GPT-5.4 dominates 10/12, but Nemotron owns Extraction


🧱 Finding 2: The Best Format is... JSON?

If you want the highest quality reasoning and instruction following from an LLM, what format should you ask it to return?

Format Impact on Reasoning Quality — Averaged Over All 5 Models

  1. YAML: 74.6 / 100

  2. JSON: 74.4 / 100 (Statistical tie with YAML)

  3. Hybrid: 73.5 / 100

  4. XML: 73.3 / 100

  5. Markdown: 72.9 / 100

  6. Plain Text: 70.8 / 100

Takeaway: Asking the model to structure its output in JSON or YAML doesn't just make it easier for your code to parse — it actually improves the model's reasoning.

Why? Forcing the model into a strict structural schema (like JSON keys) acts as a cognitive scaffold. It forces the model to categorize its thoughts before generating output, leading to fewer hallucinations and better instruction adherence. Plain unstructured text performed the worst across the board.

But here's the nuance: different models prefer different formats:

Format × Model Heatmap — The sweet spot varies by model

Note: While JSON and YAML were statistically tied at the top overall, Nemotron and Qwen performed slightly better when outputting YAML.
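In practice, "demand structure" just means telling the model exactly which keys to fill and rejecting replies that don't comply. A minimal sketch, assuming a hypothetical incident-summary task (the schema, field names, and helper are my own illustration, not from the study):

```python
import json

# A short prompt that pins the model to a fixed JSON schema.
PROMPT = (
    "Summarize the incident report below. "
    "Reply with ONLY this JSON object:\n"
    '{"summary": "<one sentence>", "severity": "low|medium|high", '
    '"action_items": ["<step>", ...]}'
)

def parse_reply(reply: str) -> dict:
    """Validate that a model reply matches the requested schema."""
    data = json.loads(reply)  # raises json.JSONDecodeError on non-JSON output
    missing = {"summary", "severity", "action_items"} - data.keys()
    if missing:
        raise ValueError(f"model omitted keys: {sorted(missing)}")
    return data

# A well-formed reply parses cleanly; a malformed one fails loudly.
reply = ('{"summary": "DB outage at 02:00.", "severity": "high", '
         '"action_items": ["add alerting"]}')
print(parse_reply(reply)["severity"])  # high
```

The validator is the point: a structural schema gives you a cheap, deterministic check on every response, which unstructured text never can.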


πŸ“ Finding 3: The Prompt Length Paradox

We've been trained to write massive, highly detailed "megaprompts" with endless context. But the data reveals a startling paradox:

The Length Paradox — Shorter Prompts Win Across All Models

  • Short Prompts (<50 words): 80.1 / 100

  • Medium Prompts (~150 words): 72.8 / 100

  • Long Prompts (>300 words): 66.9 / 100

Takeaway: Across all 5 models and all 6 formats, short prompts absolutely demolished long prompts.

When you flood the context window with too many instructions, constraints, and examples, the model suffers from attention dilution. It forgets the primary objective and gets bogged down trying to satisfy secondary constraints.

The worst combination in the entire study? Qwen 397B given a Long prompt asking for Plain Text (38.8/100).


πŸ… Finding 4: The Best and Worst Combinations

What are the absolute best and worst model + format + length trios?

Top 5 vs Bottom 5 Combinations — The gap is massive (53+ points)

The Golden Combo scored 92.2 / 100: GPT-5.4 + Hybrid Output + Short Prompt.


🚀 The Ultimate Prompting Formula

If you want to maximize the performance of a modern LLM, the data points to a clear formula:

  1. Keep it brief: State your objective clearly in under 50 words. Drop the fluff.

  2. Demand structure: Always ask the model to return its answer in JSON or YAML. Avoid asking for unstructured text.

  3. Use the right model: GPT-5.4 for general reasoning/coding, Nemotron 120B for extraction.
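Putting the three rules together, here is a sketch of the kind of rewrite the data favors. Both prompts and the word-count guard are my own illustration, not drawn from the study's dataset:

```python
# A bloated megaprompt vs. its "Short + Structured" rewrite.
LONG_PROMPT = (
    "You are a world-class analyst with decades of experience. I need you to "
    "carefully read the customer feedback, think step by step, consider every "
    "angle, be thorough but concise, avoid speculation, and then produce a "
    "detailed breakdown covering sentiment, themes, and suggested follow-ups. "
    "Make sure to format everything nicely, use bullet points where it helps, "
    "double-check your work, and do not include anything that is not directly "
    "supported by the text."
)

SHORT_PROMPT = (
    "Classify the customer feedback below. Reply with ONLY this YAML:\n"
    "sentiment: positive|neutral|negative\n"
    "themes: [<up to 3>]\n"
    "follow_up: <one sentence>"
)

def is_short(prompt: str, limit: int = 50) -> bool:
    """Rule 1: the objective fits in under `limit` words."""
    return len(prompt.split()) < limit

print(is_short(SHORT_PROMPT), is_short(LONG_PROMPT))  # True False
```

Note how the rewrite also bakes in rule 2: the requested YAML keys replace several sentences of soft instructions.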

I built PromptTriage specifically to help developers automatically refactor those bloated 500-word megaprompts down into the high-scoring "Short + Structured" style this data points to.

Data lovers: The full 1,080-row dataset and analysis script are open-sourced in the PromptTriage repo.
