I spent months using a mix of AI models without much science behind it. A bit of Claude here, some GPT there, with the logic that "more expensive = better." Spoiler: that logic was completely wrong.
A few weeks ago I got tired of paying without knowing exactly what I was buying. So I decided to do something I should have done long ago: a systematic benchmark. No "it feels better" — real numbers, real tasks, category-by-category evaluation.
The result was uncomfortable. Models I had on the podium failed at basic things. Models I'd barely considered crushed them on my real tasks. And the number that hit hardest: I could be saving over 95% in API costs while maintaining exactly the same output quality.
This post is the honest debrief of that process.
The Benchmark: What I Tested and How I Validated It
The idea was simple: no trusting external leaderboards. Public benchmarks measure academic things that rarely reflect real work. I needed to know which model won on my specific tasks.
I evaluated 18 models across 14 categories that correspond exactly to what I do every day:
- Orchestration and chat (my main assistant, OpenClaw)
- Content creation (blog, LinkedIn)
- Social media
- Python code
- n8n workflows (JSON generation)
- Technical course content
- SEO
- Automated customer support (Chatwoot)
- Data analysis
- Summarization and information extraction
- Logical reasoning
- Response speed
- Price-to-quality ratio
- Consistency (does it give the same output if I ask twice?)
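That last category can be quantified instead of eyeballed. A minimal sketch using only Python's standard library (the similarity metric and any pass/fail threshold are my own choices here, not part of any provider's API):

```python
from difflib import SequenceMatcher

def consistency(output_a: str, output_b: str) -> float:
    """Similarity between two runs of the same prompt (1.0 = identical)."""
    return round(SequenceMatcher(None, output_a, output_b).ratio(), 2)
```

One approach: run each prompt twice per model and flag any pair scoring below some threshold (say 0.8) for manual review.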
The process: for each category, I ran the same prompts across all models. No cherry-picking. No "this prompt works better in this model." Same input, compare outputs.
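The loop itself is simple enough to sketch. This is a generic harness, not my exact script: `call_model` (your API client) and `score` (your rubric) are placeholders you'd supply yourself.

```python
from statistics import mean

def benchmark(models, prompts, call_model, score):
    """Run every prompt through every model with identical input, then
    average per-prompt scores.  call_model(name, prompt) -> str and
    score(output) -> float are supplied by you (API client, rubric)."""
    results = {}
    for name in models:
        # Same input for every model -- no per-model prompt tweaking.
        outputs = [call_model(name, p) for p in prompts]
        results[name] = round(mean(score(o) for o in outputs), 2)
    return results
```

The important constraint is in the loop: every model sees exactly the same prompt list, so the scores are comparable.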
Tedious? Very. Worth it? Completely.
The Results That Surprised Me Most (The Uncomfortable Part)
I'll be direct: I had biases. Strong ones.
I'm an Anthropic fan. I'd been using Claude Sonnet and Haiku in production for months, had almost automatic respect for them. The narrative that Anthropic is "the serious lab" that does things right had convinced me their models were a safe bet.
And then the Python code results came in.
Claude Haiku 3.5 and Claude Sonnet 4.6 failed with runtime errors on code that other models executed without issues. Not obscure code or weird edge cases. I'm talking about generating moderately complex Python scripts for my workflows. Syntax errors, incorrect imports, logic that breaks on execution.
That raised an uncomfortable question: was I paying for a name or for results?
Honest answer: a bit of both, and that needs to be corrected.
The other finding that shook me was Qwen3-32b via Groq. This Alibaba model, priced at practically nothing ($0.0007 per thousand tokens — not a typo), gave me 10/10 in orchestration and general chat. It's now the model my personal assistant uses for 80% of daily interactions.
For context: GPT-5.2 Pro costs ~$0.015/1K tokens; Qwen3-32b via Groq costs $0.0007. Same quality level for the tasks I actually do, at more than a 20x price difference.
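To put that gap in money terms, a quick back-of-the-envelope estimator. The 200K tokens/day volume is an illustrative assumption for a busy assistant, not my actual usage:

```python
def monthly_cost(price_per_1k: float, tokens_per_day: int, days: int = 30) -> float:
    """Estimated monthly spend for a steady daily token volume."""
    return round(price_per_1k * tokens_per_day / 1000 * days, 2)

# Illustrative volume: ~200K tokens/day.
print(monthly_cost(0.015, 200_000))   # GPT-5.2 Pro: 90.0 USD/month
print(monthly_cost(0.0007, 200_000))  # Qwen3-32b:    4.2 USD/month
```

At that volume the cheaper model costs about 95% less, which is where the headline savings figure comes from.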
And Llama 3.3 70B, also via Groq: $0.0014/1K, scoring 10/10 on LinkedIn and blog content. The model that basically helps me write this (plus my review, of course) costs less than a coffee per month.
My Current Stack by Task
This is the configuration I have running today. No theory, no "it depends." This is what I use:
| Task | Model | Provider | Cost/1K tokens |
|---|---|---|---|
| Daily chat / Orchestration (OpenClaw) | Qwen3-32b | Groq | $0.0007 |
| LinkedIn and blog content | Llama 3.3 70B | Groq | $0.0014 |
| n8n workflows (JSON generation) | GPT-5.3 Codex | OpenAI | $0.0158 |
| Technical courses | Qwen 3.5 397B MoE | OpenRouter | $0.0044 |
| SEO content | Claude Sonnet 4.6 | Anthropic | — |
| AI support (Chatwoot) | Llama 3.3 70B | Groq | $0.0014 |
Why GPT-5.3 Codex for n8n: It's the only model I found that generates clean JSON without wrapping it in Markdown. Sounds like a minor detail until your workflow breaks for the third time because the model decided to wrap the output in a Markdown code fence. The rest failed at this systematically. Codex doesn't.
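If you're stuck with a model that does add fences, a small defensive parser can rescue the output before `json.loads` sees it. This is a generic helper I'm sketching, not something built into n8n or any provider SDK:

```python
import json
import re

def strip_code_fence(text: str) -> str:
    """Remove a wrapping Markdown code fence (with or without a 'json'
    info string), so the payload parses as plain JSON."""
    match = re.match(r"^```(?:json)?\s*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text.strip()

def parse_model_json(raw: str) -> dict:
    """Fence-tolerant JSON parsing for model output."""
    return json.loads(strip_code_fence(raw))
```

It's a band-aid, not a fix — a model that needs this for every call is still the wrong model for JSON generation.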
Why I kept Sonnet for SEO: Honestly, I haven't finished evaluating this task thoroughly enough. There's something about how Sonnet handles search intent and semantic structure that I haven't replicated with other models. Could be wrong. It's the one place I still have doubt.
Qwen 3.5 397B MoE for technical courses: This monster with 397 billion parameters in MoE (Mixture of Experts — only activates a fraction of its parameters per query) gave me 10/10 on detailed technical content. Code explanations, module structure, practical exercises. And at $0.0044/1K it's ridiculously cheap for what it delivers.
What I'd Change in a Month
The field moves too fast to say "this is the definitive configuration."
Things on my radar:
- Review Sonnet for SEO. If I find a model that replicates the semantic quality I see there at a fraction of the cost, I switch. Looking especially at Qwen 3.5 and Mistral Large.
- Monitor Groq pricing changes. The $0.0007 pricing for Qwen3-32b is extraordinary, but inference prices constantly drop. What seems "almost free" today might be normal in six months.
- Test Gemini 2.5 Pro on code. Didn't include it in this benchmark to keep it manageable. Several data points suggest it could be very competitive on Python code. Testing it next month.
- Deepseek V4. If it becomes reliably available outside China with good latency, it goes straight into the benchmark.
This configuration has a shelf life of two to six months before it needs revisiting. I say that not to discourage you, but so you don't spend forever optimizing something that'll change anyway.
Recommendation If You're Just Starting
If you're beginning with AI and don't know which model to use, here's the simplified version:
For chat and general tasks: Qwen3-32b via Groq. Practically zero cost, excellent quality. If you don't know how to configure it, use n8n with an LLM node pointing to Groq — everything's connected in 5 minutes.
For writing content: Llama 3.3 70B via Groq. Surprisingly good for blog and LinkedIn.
For code: GPT-5.3 Codex if you need clean JSON for automations. If it's just Python code, try Qwen3-32b first — might surprise you.
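The "connected in 5 minutes" Groq setup also works from plain Python, since Groq exposes an OpenAI-compatible endpoint. A sketch — the base URL is what Groq documents at the time of writing, and the model ID `qwen3-32b` is an assumption, so check Groq's current model list:

```python
import json

# OpenAI-compatible endpoint documented by Groq (verify in their docs).
GROQ_BASE = "https://api.groq.com/openai/v1"

def chat_request(prompt: str, api_key: str, model: str = "qwen3-32b"):
    """Build the pieces of a chat-completions call.
    Send with any HTTP client (requests, urllib, httpx...)."""
    url = f"{GROQ_BASE}/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}",
               "Content-Type": "application/json"}
    body = json.dumps({"model": model,
                       "messages": [{"role": "user", "content": prompt}]})
    return url, headers, body
```

Because the endpoint speaks the OpenAI chat-completions format, any OpenAI-style client library should also work by pointing its base URL at Groq.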
The most expensive trap: assuming the most expensive model is the best for your case. It isn't. I assumed this for months and was throwing money away.
The benchmark will save me, conservatively, 95%+ in API costs compared to using GPT-5.2 Pro for everything — without sacrificing quality on any real task. That's real. And it took a couple of days of systematic work to discover.
If you have questions about building your AI stack or want to see how I structure these benchmarks, join my community Cágala, Aprende, Repite — a group of entrepreneurs and builders navigating this together, without filters or formalities.
Last updated: March 2026. The field moves fast — if you're reading this in six months, I've probably updated the stack.
📝 Originally published in Spanish at cristiantala.com