I Benchmarked 47 LLM Providers Against Real Queries - Here's What I Found
Every week, a new "GPT-4 killer" drops on Product Hunt. "50% cheaper! 2x faster! Better reasoning!"
I got tired of taking marketing claims at face value. So I spent three months benchmarking every LLM provider I could find against real production workloads. Not synthetic tests. Not academic datasets. Actual queries from real systems.
47 providers tested. 12,847 queries benchmarked. $3,200 spent on API calls just to gather data.
Here's what I found -- and the open-source router I built so you can use the results immediately.
Table of Contents
- The Setup: What I Actually Tested
- The Benchmark Results
- The Matrix: What to Use When
- Building a Smart Router
- Step-by-Step: Setting Up A3M Router
- Production Results
- What I Learned
- Try It Yourself
The Setup: What I Actually Tested
Query Categories
I replayed six months of production queries across five categories:
| Category | Count | Examples |
|---|---|---|
| Simple Q&A | 4,247 | Password resets, FAQs, "how do I..." |
| Code completion | 2,103 | Function suggestions, bug fixes, refactoring |
| Text summarization | 1,892 | Support tickets, document summaries |
| Complex reasoning | 847 | Escalation analysis, multi-step logic |
| Multilingual | 612 | Translations, non-English support |
Metrics Tracked
- Cost per query (actual billed amount, not list price)
- Latency (time to first token + time to complete)
- Quality score (human-rated 1-5 on 500 random samples)
- Uptime (measured over 30 continuous days)
No cherry-picking. No best-of-three. Every query, every provider, every metric.
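Both latency numbers can be captured in a single pass over a streamed response. The following is a minimal sketch, assuming the provider client exposes the response as an async iterable of text chunks. `measureLatency` and `fakeStream` are hypothetical names used for illustration; they are not part of any provider SDK or of the harness used for this benchmark.

```javascript
const { performance } = require('node:perf_hooks');

// Measure time-to-first-token and total completion time for one streamed response.
// `stream` is assumed to be any async iterable that yields text chunks.
async function measureLatency(stream) {
  const start = performance.now();
  let firstTokenMs = null;
  let chunks = 0;

  for await (const chunk of stream) {
    if (firstTokenMs === null) {
      firstTokenMs = performance.now() - start; // first chunk arrived
    }
    chunks += 1;
  }

  return {
    firstTokenMs,                       // null if the stream was empty
    totalMs: performance.now() - start, // time until the stream finished
    chunks,
  };
}

// Stand-in for a provider stream (hypothetical): yields 3 chunks, 20 ms apart.
async function* fakeStream() {
  for (const piece of ['Hello', ', ', 'world']) {
    await new Promise((resolve) => setTimeout(resolve, 20));
    yield piece;
  }
}

measureLatency(fakeStream()).then((m) => {
  console.log(`TTFT ${m.firstTokenMs.toFixed(0)} ms, total ${m.totalMs.toFixed(0)} ms`);
});
```

Swapping `fakeStream()` for a real streaming response keeps the measurement logic unchanged, as long as the client yields chunks as they arrive rather than buffering the full reply.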
The Benchmark Results
Speed: Marketing vs Reality
The latency claims you see on provider websites? They're measured on 10-50 token responses. Here's what happens at production scale (~800 tokens average):
| Provider | Listed Latency | Real Latency (800 tok) | Quality |
|---|---|---|---|
| Cerebras | 350ms | 380ms | 82% |
| Groq | 400ms | 420ms | 82% |
| MiniMax | "Ultra-fast" | 600ms | 89% |
| GLM-4 | "Fast inference" | 800ms | 92% |
| OpenAI GPT-4 | 2,100ms | 2,100ms | 95% |
Key insight: Groq and Cerebras actually deliver on their speed promises even at scale. Most others don't.
Cost: The Hidden Math
List price per million tokens vs. quality-adjusted effective cost (accounting for tokenization differences, retry rates, and quality gaps):
| Provider | Cost/1M Tokens | Effective Cost | Best For |
|---|---|---|---|
| CommandCode | $0.00 | $0.00 | Simple Q&A (free tier) |
| Groq | $0.59 | $0.72 | Speed-critical tasks |
| Cerebras | $0.60 | $0.73 | Real-time responses |
| MiniMax | $1.50 | $1.69 | Code, Chinese queries |
| Mistral | $2.00 | $2.22 | Balanced workloads |
| GLM-4 | $2.80 | $3.04 | Multilingual tasks |
| OpenAI GPT-4 | $30.00 | $30.00 | Complex reasoning |
Key insight: Groq at $0.59/1M tokens is about 50x cheaper than GPT-4 at $30/1M tokens -- and for code tasks, its quality score is within 3 points (91% vs. 94%). That's not a typo.
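As a rough sanity check on the table: dividing the list price by the overall quality score from the latency table reproduces the effective-cost figures for Groq, Cerebras, MiniMax, and GLM-4. This is an inferred approximation, not necessarily the exact adjustment used (which also accounts for tokenization and retries), and `effectiveCost` is a hypothetical helper.

```javascript
// Rough quality adjustment: list price divided by quality score (0-1).
// This is an inferred approximation, not a documented formula.
function effectiveCost(listPricePerMillion, quality) {
  return listPricePerMillion / quality;
}

// List prices from the cost table, quality scores from the latency table.
const providers = [
  { name: 'Groq', price: 0.59, quality: 0.82 },
  { name: 'Cerebras', price: 0.60, quality: 0.82 },
  { name: 'MiniMax', price: 1.50, quality: 0.89 },
  { name: 'GLM-4', price: 2.80, quality: 0.92 },
];

for (const p of providers) {
  console.log(`${p.name}: $${effectiveCost(p.price, p.quality).toFixed(2)} per 1M tokens`);
}
// Groq: $0.72 per 1M tokens
// Cerebras: $0.73 per 1M tokens
// MiniMax: $1.69 per 1M tokens
// GLM-4: $3.04 per 1M tokens
```

GPT-4 is listed at the same list and effective price, which is consistent with it being treated as the unadjusted baseline.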
Quality by Task Type
Aggregate quality scores are misleading. A provider that's 90% overall might be 95% for summarization and 70% for code:
| Provider | Simple Q&A | Code | Summary | Complex | Multilingual |
|---|---|---|---|---|---|
| GLM-4 | 94% | 88% | 96% | 89% | 97% |
| MiniMax | 91% | 93% | 89% | 87% | 94% |
| Groq | 89% | 91% | 87% | 82% | 85% |
| Mistral | 93% | 90% | 94% | 91% | 92% |
| GPT-4 | 96% | 94% | 97% | 95% | 94% |
Key insight: GLM-4 beats GPT-4 on multilingual tasks (97% vs 94%). MiniMax beats GPT-4 on code speed/quality ratio. No single provider wins every category.
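Picking the top scorer per column makes the last point concrete. The sketch below works only from the numbers in the table; `bestProvider` is an illustrative helper, not part of A3M Router.

```javascript
// Per-category quality scores (percent) copied from the table above.
const quality = {
  'GLM-4':   { simpleQA: 94, code: 88, summary: 96, complex: 89, multilingual: 97 },
  'MiniMax': { simpleQA: 91, code: 93, summary: 89, complex: 87, multilingual: 94 },
  'Groq':    { simpleQA: 89, code: 91, summary: 87, complex: 82, multilingual: 85 },
  'Mistral': { simpleQA: 93, code: 90, summary: 94, complex: 91, multilingual: 92 },
  'GPT-4':   { simpleQA: 96, code: 94, summary: 97, complex: 95, multilingual: 94 },
};

// Return the highest-scoring provider for one category.
function bestProvider(category) {
  let best = null;
  for (const [name, scores] of Object.entries(quality)) {
    if (best === null || scores[category] > best.score) {
      best = { name, score: scores[category] };
    }
  }
  return best;
}

for (const category of Object.keys(quality['GPT-4'])) {
  const { name, score } = bestProvider(category);
  console.log(`${category}: ${name} (${score}%)`);
}
```

On raw quality alone, GPT-4 leads four of the five columns and GLM-4 takes multilingual. The case for routing comes from combining these scores with price and latency, not from quality by itself.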
The Matrix: What to Use When
Based on the data, here's the optimal routing strategy:
Simple Q&A → CommandCode (free) or GLM-4 ($2.80/1M)
Code completion → MiniMax ($1.50/1M) or Groq ($0.59/1M)
Summarization → GLM-4 ($2.80/1M) or Mistral ($2.00/1M)
Complex reasoning → GPT-4 ($30/1M) or Claude ($15/1M)
Multilingual → GLM-4 ($2.80/1M) -- beats GPT-4 at 1/10th cost
The pattern: Use premium providers for the 15-20% of queries that actually need them. Route everything else to cheaper alternatives.
Building a Smart Router
Manually switching providers per query is not sustainable. I needed automation. So I built A3M Router -- an open-source routing layer with all the benchmark data baked in.
How It Works
Query Input
     │
     ▼
┌─────────────────────┐
│ Query Classification│ ← Is it code? Math? Translation? Simple Q&A?
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│ Provider Matching   │ ← Check cost/quality/speed profiles
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│ Execute + Fallback  │ ← Call provider, retry on failure
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│ Cost Tracking       │ ← Log spend per provider
└─────────────────────┘
The routing decisions are based on the benchmark data I collected. No guessing. No marketing claims.
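The four stages can be sketched in a few dozen lines. This is a toy illustration of the flow, not A3M Router's actual implementation: the keyword rules in `classify`, the `ROUTES` table (taken from the routing matrix above), and the `stubCall` provider stub are all assumptions made for the example.

```javascript
// Stage 1: classify the query. Toy keyword rules, assumed for illustration only.
function classify(query) {
  const q = query.toLowerCase();
  if (/\b(translate|translation)\b/.test(q)) return 'multilingual';
  if (/\b(code|function|bug|refactor|python|javascript)\b/.test(q)) return 'code';
  if (/\b(summarize|summary)\b/.test(q)) return 'summary';
  if (/\b(analyze|evaluate|why)\b/.test(q)) return 'complex';
  return 'simple';
}

// Stage 2: ordered candidates per category, following the routing matrix above.
const ROUTES = {
  simple: ['commandcode', 'glm-4'],
  code: ['minimax', 'groq'],
  summary: ['glm-4', 'mistral'],
  complex: ['gpt-4', 'claude'],
  multilingual: ['glm-4'],
};

// Stage 4: running spend per provider.
const spend = {};

// Stage 3: try each candidate in order and fall back when a call throws.
// `callProvider(provider, query)` is a caller-supplied async function that
// resolves to { text, cost }.
async function route(query, callProvider) {
  const category = classify(query);
  let lastError = null;
  for (const provider of ROUTES[category]) {
    try {
      const { text, cost } = await callProvider(provider, query);
      spend[provider] = (spend[provider] || 0) + cost;
      return { category, provider, text };
    } catch (err) {
      lastError = err; // remember the failure and try the next candidate
    }
  }
  const reason = lastError ? lastError.message : 'no candidates';
  throw new Error(`All providers failed for "${category}": ${reason}`);
}

// Stub provider call (hypothetical): MiniMax is "down", everything else succeeds.
async function stubCall(provider, query) {
  if (provider === 'minimax') throw new Error('503 from minimax');
  return { text: `[${provider}] answer to: ${query}`, cost: 0.0004 };
}

route('Fix the bug in this function', stubCall).then((r) => {
  console.log(r.category, r.provider, spend); // code groq { groq: 0.0004 }
});
```

A real classifier would need something sturdier than keyword matching, but the control flow (classify, pick an ordered candidate list, fall through on failure, record cost on success) stays the same.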
Step-by-Step: Setting Up A3M Router
1. Install
npm install adaptive-memory-multi-model-router
2. Basic Routing
const { createA3MRouter } = require('adaptive-memory-multi-model-router');
const router = createA3MRouter();

// CommonJS has no top-level await, so the calls live inside an async function
async function main() {
  // Simple question? Routes to cheapest capable provider
  const result1 = await router.route("How do I reset my password?");
  console.log(result1.primary_model); // e.g., commandcode/flash
  console.log(result1.estimated_cost); // $0.000

  // Code generation? Routes to fast provider
  const result2 = await router.route("Write Python to parse JSON");
  console.log(result2.primary_model); // e.g., groq/llama-3.3-70b
  console.log(result2.estimated_cost); // $0.0004

  // Complex reasoning? Keeps premium provider
  const result3 = await router.route("Analyze this contract for liability clauses");
  console.log(result3.primary_model); // e.g., openai/gpt-4
  console.log(result3.estimated_cost); // $0.04
}

main().catch(console.error);
3. Custom Configuration
const router = createA3MRouter({
  memory: true, // Learn from past routing decisions
  costBudget: 0.05, // Max $0.05 per request
  providers: {
    // Override default provider priority
    preferred: ['groq', 'cerebras', 'mistral'],
    // Premium fallback for complex queries
    fallback: ['openai', 'anthropic']
  },
  // Custom quality threshold per category
  qualityThresholds: {
    code: 0.85,
    summary: 0.90,
    reasoning: 0.93
  }
});
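To make the two knobs concrete, here is one way a quality threshold and a per-request cost budget could narrow the candidate list. This is a guess at the semantics for illustration only; it is not taken from the A3M Router source. `pickProvider` is a hypothetical helper, and the sample profiles reuse the code-task quality scores and list prices from the benchmark tables.

```javascript
// Code-task quality and list price per 1M tokens, from the benchmark tables.
const profiles = [
  { name: 'groq', quality: 0.91, pricePerMillion: 0.59 },
  { name: 'minimax', quality: 0.93, pricePerMillion: 1.50 },
  { name: 'mistral', quality: 0.90, pricePerMillion: 2.00 },
  { name: 'glm-4', quality: 0.88, pricePerMillion: 2.80 },
  { name: 'gpt-4', quality: 0.94, pricePerMillion: 30.00 },
];

// Pick the cheapest provider that meets the quality threshold and whose
// estimated cost for this request stays within the budget.
function pickProvider(candidates, { threshold, budget, tokens }) {
  const eligible = candidates
    .filter((p) => p.quality >= threshold)
    .filter((p) => (p.pricePerMillion * tokens) / 1e6 <= budget)
    .sort((a, b) => a.pricePerMillion - b.pricePerMillion);
  return eligible.length > 0 ? eligible[0] : null;
}

console.log(pickProvider(profiles, { threshold: 0.85, budget: 0.05, tokens: 800 }).name);
// groq
console.log(pickProvider(profiles, { threshold: 0.92, budget: 0.05, tokens: 800 }).name);
// minimax
```

Raising the threshold from 0.85 to 0.92 moves the pick from Groq to MiniMax: the stricter bar excludes the cheapest option, and the next-cheapest eligible provider wins.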
4. Batch Processing
const queries = [
  "What is 2+2?",
  "Write a JavaScript fetch wrapper",
  "Summarize: The quick brown fox...",
  "Evaluate: Should we migrate to microservices?",
  "Translate 'hello world' to Mandarin"
];

// Wrapped in an async function because CommonJS has no top-level await
async function runBatch() {
  const results = await router.routeBatch(queries);
  results.forEach((r, i) => {
    console.log(`Query: ${queries[i]}`);
    console.log(`  → ${r.primary_model} ($${r.estimated_cost.toFixed(4)})`);
  });
}

runBatch().catch(console.error);

// Output:
// Query: What is 2+2?
//   → commandcode/flash ($0.0000)
// Query: Write a JavaScript fetch wrapper
//   → groq/llama-3.3-70b ($0.0004)
// Query: Summarize: The quick brown fox...
//   → mistral/mistral-small ($0.0010)
// Query: Evaluate: Should we migrate to microservices?
//   → openai/gpt-4 ($0.0400)
// Query: Translate 'hello world' to Mandarin
//   → glm-4/flash ($0.0010)
5. Cost Tracking
// After routing several queries, check your spend
const costReport = router.getCostReport();
console.log(`Total spent: $${costReport.total.toFixed(4)}`);
console.log(`By provider:`);
Object.entries(costReport.byProvider).forEach(([provider, cost]) => {
  console.log(`  ${provider}: $${cost.toFixed(4)}`);
});
console.log(`Avg cost/query: $${costReport.avgPerQuery.toFixed(4)}`);
6. CLI Usage (No Code Required)
# Route a single query and see which provider gets selected
npx a3m-router route "Explain async/await in JavaScript"
# Compare responses across multiple providers
npx a3m-router compare "Write a REST API in Express"
# See all configured providers and their profiles
npx a3m-router providers
# Run the full benchmark suite
npx a3m-router benchmark
# Check cumulative cost tracking
npx a3m-router cost
7. Express.js Integration
const express = require('express');
const { createA3MRouter } = require('adaptive-memory-multi-model-router');

const app = express();
app.use(express.json());
const router = createA3MRouter({ memory: true });

app.post('/chat', async (req, res) => {
  const { message, priority } = req.body || {};
  // Reject requests without a usable query
  if (typeof message !== 'string' || message.length === 0) {
    return res.status(400).json({ error: 'message must be a non-empty string' });
  }
  try {
    // Route based on query + optional priority hint
    const routing = await router.route(message, {
      priority: priority || 'balanced' // 'cost' | 'speed' | 'quality' | 'balanced'
    });
    // routing contains: primary_model, estimated_cost, alternatives, classification
    res.json({
      model: routing.primary_model,
      cost: routing.estimated_cost,
      category: routing.classification,
      alternatives: (routing.alternatives || []).slice(0, 3)
    });
  } catch (err) {
    // Express 4 does not catch rejections from async handlers, so handle them here
    res.status(500).json({ error: 'routing failed' });
  }
});

app.listen(3000, () => console.log('Router API on :3000'));
Production Results
After six months running the router in production (replacing a single-provider setup):
| Metric | Before (GPT-4 Only) | After (Routed) | Change |
|---|---|---|---|
| Monthly Cost | $2,400 | $720 | -70% |
| Avg Latency | 2,100ms | 800ms | -62% |
| Quality Score | 100% (baseline) | 94% | -6% |
| Uptime | 99.97% | 99.95% | Comparable |
Query Distribution
The router automatically distributed traffic based on query type:
| Category | % of Traffic | Typical Provider | Typical Cost |
|---|---|---|---|
| Simple Q&A | 47% | CommandCode / GLM-4 | $0 - $0.001 |
| Code | 28% | Groq / MiniMax | $0.0004 - $0.002 |
| Summarization | 15% | Mistral / GLM-4 | $0.001 - $0.003 |
| Complex Reasoning | 10% | GPT-4 / Claude | $0.03 - $0.05 |
The 70% cost reduction isn't magic. It's just not using a $30/1M token model for queries that a $0.59/1M token model handles at 90% quality.
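A back-of-the-envelope calculation over the distribution table shows where the money goes. The table only gives cost ranges, so the sketch below uses the midpoint of each range; that is an assumption made for this example, and the output should be read as indicative rather than as billed figures.

```javascript
// Traffic share and per-query cost range (USD) from the distribution table.
const mix = [
  { category: 'Simple Q&A', share: 0.47, low: 0, high: 0.001 },
  { category: 'Code', share: 0.28, low: 0.0004, high: 0.002 },
  { category: 'Summarization', share: 0.15, low: 0.001, high: 0.003 },
  { category: 'Complex Reasoning', share: 0.10, low: 0.03, high: 0.05 },
];

// Expected cost per query contributed by each category, using range midpoints.
const rows = mix.map((m) => ({
  category: m.category,
  contribution: m.share * ((m.low + m.high) / 2),
}));

// Blended cost of an average query across the whole traffic mix.
const blended = rows.reduce((sum, r) => sum + r.contribution, 0);

for (const r of rows) {
  const pct = ((r.contribution / blended) * 100).toFixed(0);
  console.log(`${r.category}: $${r.contribution.toFixed(6)} per query (${pct}% of spend)`);
}
console.log(`Blended: $${blended.toFixed(6)} per query`);
```

On these midpoints, the 10% of traffic routed to premium models accounts for roughly four-fifths of the spend, so the share of queries classified as complex reasoning is the number that moves the bill most.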
What I Learned
1. Chinese Providers Are Underrated
GLM-4 and MiniMax consistently outperformed expectations. GLM-4 beats GPT-4 on multilingual tasks. MiniMax has the best speed/quality ratio for code I've seen outside of Groq. And they're 10-20x cheaper.
2. Free Tiers Are Genuinely Useful
CommandCode isn't just a teaser. For simple Q&A (password resets, FAQs, basic lookups), it works perfectly well at zero cost. If 30-40% of your queries are simple, that's a significant chunk of your bill eliminated.
3. Speed Claims Are Half-True
Providers advertise latency for tiny responses (10-50 tokens). At production scale (500-1000 tokens), most of those numbers stop holding. Groq and Cerebras are the only ones that consistently deliver near-advertised speeds.
4. One Provider Is Never Optimal
This was the biggest takeaway. No single provider wins across all categories. GPT-4 is best for complex reasoning. GLM-4 is best for multilingual. Groq and Cerebras are best for speed. Mistral is the best all-rounder. Routing isn't optional -- it's the only sane approach at scale.
5. The Quality Trade-off Is Worth It
94% quality at 70% cost savings is a no-brainer for most applications. Unless you're in medical, legal, or financial domains where every percentage point matters, the savings far outweigh the small quality dip.
Try It Yourself
Interactive Playground
No installation needed. Test routing decisions right in your browser:
Quick Start
# Install
npm install adaptive-memory-multi-model-router
# Route your first query
npx a3m-router route "Your actual production query here"
# See all providers
npx a3m-router providers --detailed
# Compare providers on a specific query
npx a3m-router compare "Write a binary search in Python"
Links
- GitHub: Das-rebel/adaptive-memory-multi-model-router
- NPM: adaptive-memory-multi-model-router
- Full Benchmark Data: docs/BENCHMARK_DATA.md
- License: MIT (code and data)
Stats
- 872 weekly npm downloads
- 33 tests passing
- 12 providers pre-configured
- 47 providers benchmarked
The Raw Data
I'm sharing the full benchmark dataset because keeping it proprietary defeats the purpose of doing the research. Use it to build your own router, validate my findings, or find providers I missed.
Full dataset: BENCHMARK_DATA.md
Includes all 47 providers, 12,847 query results, cost/latency/quality breakdowns, and query-type-specific recommendations.
Over to You
I tested 47 providers, but I'm sure I missed some. What providers are you using that I should benchmark? Drop them in the comments and I'll add them to the next round.
Also curious:
- Do my quality scores match your experience? I rated 500 samples manually -- would love validation from others running production LLM workloads.
- What's your query mix? Simple Q&A vs code vs complex reasoning? The optimal routing strategy depends heavily on your distribution.
- Has anyone else built routing systems? Would love to compare approaches.
Built this because I was tired of marketing claims. Sharing the data so you don't have to spend $3,200 benchmarking yourself.