My content pipeline needed to process 10,000 articles a month.
I had three serious API options: OpenAI, Anthropic, and Cohere.
Every comparison article I found online was either two years old, benchmarked on toy examples, or written by someone with a vendor relationship. So I ran my own.
Three weeks, 4,200 test requests, one specific use case: bulk content generation at production scale. Here's what happened.
My Evaluation Criteria (And Why These Specific Metrics)
Before I get to the numbers, let me be clear about what I was optimizing for. "Best LLM API" is meaningless; "best for my use case" is what I cared about:
Output quality on structured content — I needed articles with consistent heading structure, tone, and word count. Not just fluent text.
Cost per 1,000 words — At 10K articles/month, a $0.002 difference per article is $20/month. A $0.02 difference is $200/month.
Latency (p50 and p95) — The p95 matters more than p50 for bulk work. One slow request holds up a queue.
Instruction adherence — If I say "use h2 headers, not h3," does it actually do that across 1,000 requests? Or does it drift?
Error rate over volume — Rate limit errors, context errors, malformed responses. What breaks at scale?
I didn't test: coding tasks, reasoning, math, or anything multimodal. Those benchmarks exist everywhere. This one doesn't.
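A quick note on how the latency percentiles later in this post are computed: nearest-rank percentiles over the logged latency samples. A minimal version (the sample values below are illustrative, not my benchmark data):

```javascript
// Nearest-rank percentile: sort the samples, take the value at rank
// ceil(p/100 * n). Simple and good enough for benchmark reporting.
function percentile(samples, p) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Illustrative latencies in ms — note how one slow outlier dominates p95
const latencies = [320, 388, 397, 410, 433, 450, 501, 505, 612, 2900];
console.log(percentile(latencies, 50)); // 433
console.log(percentile(latencies, 95)); // 2900
```

This is why p95 matters for queues: the median looks fine while one outlier per twenty requests quietly stalls everything behind it.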
The Test Setup
Same prompt, same word count target, same structural requirements, across all three providers. I wrote a simple Node.js harness to run the tests and log results to a SQLite database.
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";
import { CohereClient } from "cohere-ai";
import Database from "better-sqlite3";
const db = new Database("benchmark.db");
db.exec(`
  CREATE TABLE IF NOT EXISTS results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    provider TEXT,
    model TEXT,
    prompt_tokens INTEGER,
    completion_tokens INTEGER,
    latency_ms INTEGER,
    cost_usd REAL,
    heading_count INTEGER,
    word_count INTEGER,
    error TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
  )
`);
async function runBenchmark(provider, generateFn, prompt) {
  const start = Date.now();
  try {
    const result = await generateFn(prompt);
    const latency = Date.now() - start;

    // Count h2 headings to measure instruction adherence
    const headings = (result.text.match(/^## /gm) || []).length;
    const words = result.text.split(/\s+/).length;

    db.prepare(`
      INSERT INTO results (provider, model, prompt_tokens, completion_tokens,
                           latency_ms, cost_usd, heading_count, word_count)
      VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    `).run(
      provider,
      result.model,
      result.promptTokens,
      result.completionTokens,
      latency,
      result.cost,
      headings,
      words
    );

    return { success: true, latency };
  } catch (err) {
    db.prepare(`
      INSERT INTO results (provider, model, error) VALUES (?, ?, ?)
    `).run(provider, "unknown", err.message);
    return { success: false, error: err.message };
  }
}
I ran 1,400 requests against each provider — enough to get stable percentiles and surface intermittent errors. The prompt asked for a 600-word article with exactly 3 ## sections and a specific tone. Straightforward structural requirements. The kind of thing you'd run 10,000 times.
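For reference, the cost_usd column came from token counts times per-model prices. A sketch of that helper — the prices below are placeholders for illustration, not current list prices; check each provider's pricing page before relying on them:

```javascript
// Per-MILLION-token prices in USD. ILLUSTRATIVE placeholders only —
// verify against each provider's current pricing page.
const PRICING = {
  "gpt-4o":      { input: 2.5,  output: 10.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
  // ...one entry per model under test
};

function requestCost(model, promptTokens, completionTokens) {
  const p = PRICING[model];
  if (!p) throw new Error(`No pricing entry for ${model}`);
  return (promptTokens * p.input + completionTokens * p.output) / 1_000_000;
}
```

Multiplying a per-request cost like this by 10,000 is how the monthly deltas later in this post were derived.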
The Numbers
At 10,000 articles a month (averaging 600 words each), the cost difference between gpt-4o and gpt-4o-mini is roughly $167/month. That's not nothing.
The p95 numbers are where things get interesting. Cohere's command-r-plus had the highest p95 latency by a significant margin — nearly 2x OpenAI's gpt-4o and almost 4x Anthropic's claude-haiku-4-5. For synchronous use cases this would be painful. For queued bulk generation it's manageable, but you need to account for it in your timeout settings.
Claude Haiku had the best p95 of any capable model. If latency matters more than cost in your use case, that's worth noting.
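Sizing timeouts from p95 rather than p50 is the practical takeaway here. A generic wrapper I use for this (my own helper, not from any SDK):

```javascript
// Reject if the wrapped promise doesn't settle within `ms` milliseconds.
// Size `ms` from your observed p95 plus headroom, not from the median.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Usage is just `withTimeout(generateFn(prompt), observedP95 * 2)` — a queue worker that times out and retries beats one that hangs on a slow request.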
Instruction Adherence
This is the metric nobody else was measuring. I asked for exactly 3 ## sections in every request.
Claude Sonnet followed structural instructions the most consistently. This surprised me — I expected the output quality gap between Sonnet and Haiku to be larger than the instruction adherence gap. It wasn't. Haiku drifted noticeably more on structure.
Cohere's models had the most drift. command-r would frequently add extra sections or collapse two sections into one. For casual content this is fine. For template-driven content pipelines where downstream parsing depends on consistent structure, it's a problem.
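If downstream parsing depends on structure, it's worth rejecting these responses explicitly rather than letting them through. A check along the same lines as the harness's heading count (the expected count is hardcoded to match my prompt):

```javascript
// Validate the structural contract: exactly `expected` h2 sections.
// A response that is fluent text but breaks the template is still a
// failure for a template-driven pipeline.
function validateStructure(text, expected = 3) {
  const headings = (text.match(/^## /gm) || []).length;
  return { ok: headings === expected, headings };
}
```

Wiring this into the retry path means structural drift gets the same treatment as a 500: regenerate instead of shipping a malformed article.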
At 10,000 requests/month, Cohere's 2.8% error rate means 280 failed generations that need retries. That's not catastrophic, but it's a cost: retry logic, queue overhead, and the occasional job that fails three times and needs manual intervention.
OpenAI and Anthropic both had error rates under 1% in my test. Cohere's error rate was high enough that I'd budget for retry infrastructure before relying on it at scale.
Output Quality: The Part That's Hard to Put in a Table
I spot-checked 150 outputs across providers — 50 per provider, sampled across model tiers. I evaluated them on:
- Tone consistency with the prompt
- Logical flow between sections
- Avoidance of filler phrases ("In conclusion...", "It's important to note...")
- Whether the content was actually useful or just plausible-sounding
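Most of that list is human judgment, but the filler-phrase check is mechanical enough to automate as a first pass. A crude detector (the phrase list is mine and deliberately short; extend it for your domain):

```javascript
// Count distinct filler patterns present in a draft. A nonzero score
// flags the output for closer review, nothing more.
const FILLER = [
  /in conclusion/i,
  /it'?s important to note/i,
  /in today'?s (fast-paced|digital) world/i,
];

function fillerHits(text) {
  return FILLER.filter((re) => re.test(text)).length;
}
```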
The honest assessment:
Claude Sonnet produced the most editable drafts. The structure was clean, the tone held throughout, and it was less prone to the kind of filler-heavy conclusions that make AI content feel generic. If I were generating content that humans would lightly edit before publishing, Sonnet gave editors the least work.
GPT-4o was close behind. Slightly more verbose, occasionally padded, but strong structural instincts and good default tone. If you're already in the OpenAI ecosystem and using the Assistants API, there's no compelling reason to switch just for content generation.
Claude Haiku surprised me on quality given its cost. The outputs weren't Sonnet-level, but they were significantly better than I expected from a model at that price point. For high-volume, lower-stakes content (product tags, meta descriptions, brief blurbs), Haiku is underrated.
Cohere command-r-plus had the most inconsistent quality. Some outputs were excellent. Others had structural problems or tonal drift mid-article. For human-reviewed content pipelines this is manageable. For automated pipelines where content goes straight to a CMS, the variance is a real issue.
GPT-4o-mini was fine. Not inspiring. Solid enough for use cases where you're generating high volumes of content that gets human review anyway. At its price point, the quality-per-dollar ratio is hard to beat.
The Retry Handler I Ended Up Writing
Every provider needs retry logic. Here's the one I landed on after testing various approaches:
async function generateWithRetry(generateFn, prompt, options = {}) {
  const {
    maxRetries = 3,
    baseDelay = 1000,
    maxDelay = 30000,
  } = options;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await generateFn(prompt);
    } catch (err) {
      const isRateLimit =
        err.status === 429 ||
        err.message?.includes("rate limit") ||
        err.message?.includes("too many requests");
      const isRetryable = isRateLimit || err.status >= 500;

      if (!isRetryable || attempt === maxRetries) {
        throw err;
      }

      // Exponential backoff with jitter.
      // Without jitter, retrying clients hit the API in waves and cause more rate limits.
      const exponential = baseDelay * Math.pow(2, attempt);
      const jitter = Math.random() * 0.3 * exponential;
      const delay = Math.min(exponential + jitter, maxDelay);

      console.log(
        `Attempt ${attempt + 1} failed (${err.status ?? "no status"}). Retrying in ${Math.round(delay)}ms...`
      );
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
The jitter is important. I learned this the hard way: without it, rate-limited requests all retry at the same interval, which creates another burst that triggers another rate limit. Jitter spreads the retry load.
What Can Go Wrong (And Did)
Rate limits hit differently at scale than in testing. My test harness ran requests at a controlled rate. In production, queue drains aren't that clean — you get bursts when a lot of jobs land at once. I hit OpenAI rate limits in production that I never hit in testing because of this. Solution: implement a token bucket limiter, not just a fixed delay between requests.
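A token bucket in this context is small enough to write inline. A minimal sketch — capacity and refill rate are tuning knobs you'd set per provider, not values from my config:

```javascript
// Minimal token bucket: allows bursts up to `capacity` requests,
// then refills at `refillPerSec` tokens per second.
class TokenBucket {
  constructor(capacity, refillPerSec) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillPerSec = refillPerSec;
    this.lastRefill = Date.now();
  }

  _refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.lastRefill = now;
  }

  // Resolves once a token is available, consuming it.
  async acquire() {
    for (;;) {
      this._refill();
      if (this.tokens >= 1) {
        this.tokens -= 1;
        return;
      }
      // Sleep roughly until the next token is due
      const waitMs = ((1 - this.tokens) / this.refillPerSec) * 1000;
      await new Promise((r) => setTimeout(r, Math.max(waitMs, 10)));
    }
  }
}
```

Each worker calls `await bucket.acquire()` before firing a request, so a burst of queued jobs drains at a rate the provider will actually accept instead of slamming the API all at once.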
Instruction adherence degrades with longer prompts. The 3-section test used a clean, short prompt. When I added more context (brand guidelines, examples, negative constraints), adherence dropped across all models. Claude Sonnet held up best under prompt complexity. GPT-4o-mini degraded the most.
Cohere's failure signaling is different. A few of my requests hit a content-filtering response that wasn't a standard API error — it returned a 200 with a specific response body structure. My generic error handler missed it and logged a successful request with garbled output. Read Cohere's error documentation more carefully than I did.
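Since then I validate "successful" responses too, rather than trusting status codes. A generic shape check — the thresholds here are arbitrary starting points, not values from my pipeline; tune them to your templates:

```javascript
// Treat a 200 response as failed if the body doesn't look like the
// article we asked for. Catches filtered/truncated/garbled outputs
// that arrive with a success status.
function looksLikeArticle(text, { minWords = 300, minHeadings = 1 } = {}) {
  if (typeof text !== "string" || text.trim().length === 0) return false;
  const words = text.split(/\s+/).filter(Boolean).length;
  const headings = (text.match(/^## /gm) || []).length;
  return words >= minWords && headings >= minHeadings;
}
```

A response that fails this check goes back through the retry path exactly as if the request had thrown.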
Cold starts are real with Anthropic's API. A small percentage of Haiku requests (roughly 2-3% in my data) had latencies 3-5x higher than normal. Not errors — just slow. I don't know if this is model loading, infrastructure, or something else, but it showed up consistently enough to affect the p95.
My Recommendation (Conditional, As It Should Be)
For high-volume, cost-sensitive generation where content gets human review: gpt-4o-mini or command-r. The cost savings are significant. The quality gap is real but acceptable if humans are in the loop.
For high-volume, automated pipelines where structure consistency matters: claude-haiku-4-5. Best p95 latency, solid instruction adherence, reasonable cost. The quality is better than the price suggests.
For lower-volume, higher-quality generation that feeds into editorial workflows: claude-sonnet-4-5. The instruction adherence and output editability are worth the cost premium when you're generating content that humans will touch.
For Cohere: If you have a specific reason to use it (enterprise contract, data residency requirements, a use case where Command-R performs unusually well for your specific domain), fine. For general content generation benchmarked against my criteria, it didn't compete with the OpenAI and Anthropic options.
One thing I'd do differently: I didn't test Anthropic's prompt caching for repeated system prompts. For bulk content generation where you're sending the same lengthy system prompt with each request, caching can significantly reduce input token costs. That's the next benchmark I'm running.
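For what it's worth, the request shape for that caching test would look roughly like this, based on my reading of Anthropic's prompt caching docs (a cache_control marker on a system block) — verify the field names against the current API reference before using:

```javascript
// Build a Messages API request body with the repeated system prompt
// marked cacheable, so subsequent requests reusing the same prefix can
// pay the reduced cached-input rate. Field names per my reading of
// Anthropic's prompt caching docs — verify against the API reference.
function cachedArticleRequest(systemPrompt, userPrompt) {
  return {
    model: "claude-haiku-4-5",
    max_tokens: 2048,
    system: [
      {
        type: "text",
        text: systemPrompt, // the long, repeated brand/style prompt
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userPrompt }],
  };
}
```

You'd pass this object to the SDK's messages.create call; the savings only materialize when the same system prefix repeats across requests, which is exactly the bulk-generation pattern.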
What's your use case? I'm curious whether the instruction adherence numbers match what others are seeing — especially if you're doing high-volume structured generation. Different domains might surface different failure modes than content generation did.