How We Reduced LLM Costs by 95%: Cache + Batch + Cascade in PHP
At Alesta WEB we build news software: a content platform used by more than 200 publishers. Once we wired language models into the newsroom workflow (headline suggestions, summaries, SEO fields, tag extraction, draft scaffolding), something predictable happened: the AI bill started growing faster than the feature list.
The naive version of "add AI" is a thin wrapper around one expensive frontier model, called fresh on every request. It works in a demo. In production, across thousands of articles a day, it's a slow way to set money on fire.
This is the architecture we settled on after eighteen months of running it. Three layers — cache, batch, cascade — plus the quality gates that make the cheap layers safe to rely on. The result was roughly a 95% reduction in per-task cost versus the naive "frontier-model-only, no cache" baseline, with no measurable drop in editorial quality.
The code is PHP, because the platform is PHP. The ideas are language-agnostic.
1. The Naive Approach (and What It Costs)
Here's the version almost everyone ships first:
function generateHeadline(object $client, string $articleBody): string
{
    // $client is your provider SDK wrapper, injected by the caller
    $response = $client->chat([
        'model' => 'gpt-4o',
        'messages' => [
            ['role' => 'system', 'content' => 'You write concise news headlines.'],
            ['role' => 'user', 'content' => $articleBody],
        ],
    ]);

    return $response['choices'][0]['message']['content'];
}
Nothing is wrong with this code. The problem is the usage pattern around it:
- The same wire-service story gets posted by dozens of sites, so we generate near-identical headlines over and over.
- Editors regenerate three or four times to compare options.
- A frontier model is doing work — extracting tags, normalizing a category — that a model costing a fraction as much would do just as well.
- Every call is synchronous and real-time, even when nothing about the task is urgent.
When you multiply "fresh frontier call every time" by real newsroom volume, the per-article AI cost lands somewhere that makes the CFO ask uncomfortable questions. Each of the three layers below attacks one of those waste sources.
2. Layer 1 — Cache: What to Cache, and What Not To
The single biggest win is the most boring one: don't ask the same question twice.
A large share of LLM calls in a news system are functionally identical. The key insight is that the cache key should be derived from the meaningful inputs — the task type, the model, the prompt template version, and a normalized hash of the content — not from the raw request object.
function cachedComplete(object $cache, string $task, string $model, string $content, callable $compute): string
{
    $key = sprintf(
        'llm:%s:%s:%s:%s',
        $task,
        $model,                 // a different model is a different answer
        PROMPT_VERSION[$task],  // bump to invalidate on prompt change
        hash('xxh128', normalize($content)) // xxh128 needs PHP 8.1+
    );

    $hit = $cache->get($key);
    if ($hit !== null) {
        return $hit; // ~0 cost, sub-millisecond
    }

    $result = $compute($content); // the actual API call
    $cache->set($key, $result, ttlFor($task));

    return $result;
}

function normalize(string $s): string
{
    // Collapse whitespace so trivially different inputs map to the same key.
    // Stripping volatile boilerplate would also go here; this version only
    // handles whitespace.
    return trim(preg_replace('/\s+/u', ' ', $s) ?? $s);
}
Two details matter more than the cache engine you pick (we use Redis, but a database table works):
Version the prompt in the key. When you change a prompt template, you want every cached answer for that task to become a miss. Putting a PROMPT_VERSION constant into the key turns prompt edits into a clean, instant invalidation instead of a stale-output bug you chase for a week.
Know what not to cache. Anything personalized, anything real-time, anything where two identical inputs should legitimately produce different outputs (a "give me a fresh alternative" button) must bypass the cache. We mark those tasks explicitly rather than relying on TTL alone.
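One way to mark those tasks explicitly is a small deny-list consulted before the cache is touched. This is a minimal sketch, not our production code: the task names, `UNCACHEABLE_TASKS`, `isCacheable()`, and `completeWithBypass()` are hypothetical, and a plain array stands in for Redis.

```php
// Hypothetical task names; tasks listed here never touch the cache.
const UNCACHEABLE_TASKS = [
    'headline_alternative' => true, // "give me a fresh alternative" button
    'personalized_digest'  => true, // output depends on the reader
];

/**
 * Decide whether a request may be served from (and written to) the cache.
 * $forceFresh lets a caller bypass the cache for a single request.
 */
function isCacheable(string $task, bool $forceFresh = false): bool
{
    return !$forceFresh && !isset(UNCACHEABLE_TASKS[$task]);
}

/**
 * Wrap a compute callable with an in-memory array cache (a stand-in for Redis).
 */
function completeWithBypass(
    array &$cache,
    string $task,
    string $key,
    callable $compute,
    bool $forceFresh = false
): string {
    if (!isCacheable($task, $forceFresh)) {
        return $compute(); // neither read nor write the cache
    }

    // Compute only on a miss, then remember the result.
    return $cache[$key] ??= $compute();
}
```

The bypass skips the write as well as the read, so a forced regeneration does not overwrite the answer other callers are sharing.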
In our workload the cache hit rate sits around 60–70%, mostly because syndicated content repeats across sites. That one layer alone removes well over half the spend.
3. Layer 2 — Batch APIs: Trade Latency for Money
A surprising amount of LLM work in a newsroom is not time-sensitive. Nightly re-tagging of the archive. Generating summaries for the previous day's articles. Backfilling SEO descriptions on older content. None of it needs an answer in 800 milliseconds.
The major providers offer batch endpoints that run asynchronously — you submit a file of requests, and within a window (typically up to 24 hours) you get the results back, at roughly half the price of the synchronous API.
// Collect non-urgent jobs into a single batch submission
$lines = [];
foreach ($pendingJobs as $job) {
    $lines[] = json_encode([
        'custom_id' => (string) $job->id, // used to match results back to jobs
        'method'    => 'POST',
        'url'       => '/v1/chat/completions',
        'body'      => [
            'model'    => $job->model,
            'messages' => $job->messages(),
        ],
    ]);
}

// uploadJsonl() is a helper (not shown) that uploads the JSONL payload
$batchFile = uploadJsonl(implode("\n", $lines));

$batch = $client->batches()->create([
    'input_file_id'     => $batchFile['id'],
    'endpoint'          => '/v1/chat/completions',
    'completion_window' => '24h',
]);

// A worker polls for completion and writes results back by custom_id
$queue->later('poll_batch', ['batch_id' => $batch['id']], minutes: 15);
The discipline this forces is healthy: you have to classify each task as interactive (editor is waiting) or deferred (a cron job can handle it tonight). Once we did that audit, far more work turned out to be deferrable than we expected. Roughly a quarter of our remaining spend — after caching — moved onto batch pricing for a flat ~50% discount on that slice.
A caveat: batch pricing differs by provider, and so do the completion window and the failure behavior. Build your batch layer behind an interface so the provider is swappable, and always handle partial failures: a batch of 5,000 requests will occasionally return 4,997.
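Handling partial failures mostly means reconciling what came back against what was sent. The sketch below assumes an OpenAI-style output file, one JSON object per line with `custom_id`, `response.status_code`, `response.body`, and `error` fields; other providers will need a different parser. `reconcileBatch()` is a hypothetical name, not part of any SDK.

```php
/**
 * Split a batch output file (one JSON object per line) into successes,
 * explicit failures, and requests that never came back.
 *
 * @param string[] $submittedIds custom_id values sent in the batch
 * @return array{ok: array, failed: string[], missing: string[]}
 */
function reconcileBatch(array $submittedIds, string $outputJsonl): array
{
    $ok = [];
    $failed = [];

    foreach (preg_split('/\R/', trim($outputJsonl)) as $line) {
        if ($line === '') {
            continue;
        }
        $row = json_decode($line, true);
        if (!is_array($row) || !isset($row['custom_id'])) {
            continue; // unparseable line: its request will show up as missing
        }

        $id = (string) $row['custom_id'];
        $status = $row['response']['status_code'] ?? null;

        if (($row['error'] ?? null) === null && $status === 200) {
            $ok[$id] = $row['response']['body'] ?? [];
        } else {
            $failed[] = $id;
        }
    }

    // Anything submitted but absent from the output needs a retry as well.
    $seen = array_map('strval', array_merge(array_keys($ok), $failed));
    $missing = array_values(array_diff($submittedIds, $seen));

    return ['ok' => $ok, 'failed' => $failed, 'missing' => $missing];
}
```

The `failed` and `missing` lists can be fed into the next batch or, if only a few remain, retried synchronously.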
4. Layer 3 — Cascade Routing: Match the Model to the Task
The last layer is the one people resist, because it feels like cutting corners. It isn't — it's refusing to pay frontier prices for kindergarten work.
Not every task needs the smartest model. Extracting tags from a story, mapping a category, cleaning up whitespace, classifying sentiment: small, cheap models handle these well in the large majority of cases. Reserve the expensive model for genuinely hard generation: nuanced summaries, editorial rewriting, anything where a mistake is visible to readers.
const TASK_TIER = [
    'tag_extraction'    => 'cheap',
    'category_map'      => 'cheap',
    'sentiment'         => 'cheap',
    'summary'           => 'mid',
    'headline'          => 'mid',
    'editorial_rewrite' => 'frontier',
];

const TIER_MODEL = [
    'cheap'    => 'gpt-4o-mini',
    'mid'      => 'gpt-4o',
    'frontier' => 'the-strongest-model-you-trust',
];

function modelFor(string $task): string
{
    return TIER_MODEL[TASK_TIER[$task] ?? 'mid'];
}
We run six providers behind one interface (the editorial team never sees which one answered), which means cascade routing can also fail over: if a cheap model's output fails a quality gate, the task is automatically re-run one tier up. That gives you the cost of the cheap tier on the 90%+ of cases it handles well, and the safety of the expensive tier on the cases it doesn't.
Cascade routing is what takes you from "big savings" to "almost free for the easy majority."
5. Quality Gates: Keeping Cheap Models Honest
Cascade routing only works if you can detect when a cheap model got it wrong — otherwise you're trading money for garbage. Quality gates are cheap, deterministic checks that run on the output before it's accepted:
// looksTruncated(), sentenceCount() and isValidJsonArray() are small
// string helpers, not shown here.
function passesGate(string $task, string $output): bool
{
    return match ($task) {
        'headline' => mb_strlen($output) <= 90
            && !str_contains($output, "\n")
            && !looksTruncated($output),
        'summary' => mb_strlen($output) >= 100
            && sentenceCount($output) >= 2,
        'tag_extraction' => isValidJsonArray($output),
        default => $output !== '',
    };
}
None of these call an LLM. They're string length, format validity, structure checks — the kind of thing that costs nothing and catches the most common cheap-model failures (truncation, wrong format, empty output). When a gate fails, the cascade re-runs the task one tier up and logs it. If a particular task fails its gate too often, that's your signal to move it up a tier permanently.
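The "re-run one tier up and log it" behavior can be sketched as a short loop. This is an illustration under assumptions rather than our exact implementation: `ESCALATION_ORDER` and `runWithEscalation()` are hypothetical names, and the model call, the gate, and the logger are passed in as callables so the snippet stands alone.

```php
// Tiers ordered cheapest first (hypothetical constant name).
const ESCALATION_ORDER = ['cheap', 'mid', 'frontier'];

/**
 * Run a task starting at $startTier, moving up one tier at a time while
 * the output fails its gate.
 *
 * @param callable $run        fn(string $tier): string, calls the model
 * @param callable $gate       fn(string $output): bool, deterministic check
 * @param callable|null $onEscalate fn(string $failedTier): void, logging hook
 * @return array{tier: string, output: string, accepted: bool}
 */
function runWithEscalation(
    string $startTier,
    callable $run,
    callable $gate,
    ?callable $onEscalate = null
): array {
    $start = array_search($startTier, ESCALATION_ORDER, true);
    if ($start === false) {
        throw new InvalidArgumentException("Unknown tier: $startTier");
    }

    $tier = $startTier;
    $output = '';
    for ($i = $start; $i < count(ESCALATION_ORDER); $i++) {
        $tier = ESCALATION_ORDER[$i];
        $output = $run($tier);
        if ($gate($output)) {
            return ['tier' => $tier, 'output' => $output, 'accepted' => true];
        }
        if ($onEscalate !== null) {
            $onEscalate($tier); // record which tier failed its gate
        }
    }

    // Even the top tier failed: report it instead of accepting silently.
    return ['tier' => $tier, 'output' => $output, 'accepted' => false];
}
```

Returning `accepted => false` at the top tier leaves the decision (retry, flag for an editor, drop) to the caller instead of hiding a bad output behind the most expensive model.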
This is the piece that makes the whole architecture trustworthy. Without gates, "use a cheaper model" is a gamble. With gates, it's a measured decision with an automatic safety net.
6. A Cost Dashboard You Actually Look At
You can't optimize what you don't measure. We log every LLM call with four fields: task, tier, whether it was a cache hit, and the token counts. That's enough to answer the only questions that matter:
- Which task is costing the most? (Usually the surprise here is a "cheap" task being called a million times.)
- What's our real cache hit rate, per task?
- How often is the cascade escalating — and which tasks?
A weekly rollup of per-task economics turns cost control from a panic ("the bill doubled!") into a routine ("tag extraction escalation rate crept up, the prompt drifted, fix the template"). The dashboard is boring on purpose. Boring means in control.
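A rollup like that needs very little code. The sketch below is hypothetical: the row shape and `rollupByTask()` are assumptions, and the `escalated` flag is an extra field added here for illustration (the same signal could instead be derived by comparing the logged tier with the task's default tier).

```php
/**
 * Roll per-call log rows up into per-task numbers.
 *
 * Assumed row shape:
 *   ['task' => string, 'tier' => string, 'cache_hit' => bool,
 *    'tokens_in' => int, 'tokens_out' => int, 'escalated' => bool (optional)]
 */
function rollupByTask(array $rows): array
{
    $stats = [];
    foreach ($rows as $row) {
        $task = $row['task'];
        $stats[$task] ??= ['calls' => 0, 'hits' => 0, 'escalations' => 0, 'tokens' => 0];
        $stats[$task]['calls']++;

        if ($row['cache_hit']) {
            $stats[$task]['hits']++;
            continue; // a hit spends no tokens and cannot escalate
        }

        $stats[$task]['tokens'] += $row['tokens_in'] + $row['tokens_out'];
        $stats[$task]['escalations'] += empty($row['escalated']) ? 0 : 1;
    }

    foreach ($stats as $task => $s) {
        $misses = $s['calls'] - $s['hits'];
        $stats[$task]['hit_rate'] = $s['hits'] / $s['calls'];
        // Escalation rate is measured against real model calls only.
        $stats[$task]['escalation_rate'] = $misses > 0 ? $s['escalations'] / $misses : 0.0;
    }

    return $stats;
}
```

Multiplying the token totals by each tier's price would turn this into a cost figure; prices are left out because they vary by provider and change often.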
7. The Numbers, After Eighteen Months
Against the naive baseline (one frontier model, every call fresh and synchronous), the layers compound:
- Cache removes ~60–70% of calls outright.
- Batch takes ~50% off a meaningful slice of what remains.
- Cascade routes the easy majority of the rest to models costing a fraction of frontier prices.
Stacked, that lands at roughly 95% lower per-task cost for the same workload — and, because cache hits are instant and cheap models are fast, the median latency for AI features actually improved. Cheaper and faster, which is not the trade-off people expect when they hear "we cut the AI budget."
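To see how the layers compound, here is the arithmetic as a few lines of PHP. The inputs are midpoints of the ranges quoted in this post, not separately measured constants, and the model is deliberately crude: it assumes cost scales with call count and that the cascade applies evenly to whatever the cache and batch layers leave behind.

```php
// Illustrative inputs taken from the figures quoted in this post.
$cacheHitRate  = 0.65; // midpoint of the 60-70% hit rate
$batchShare    = 0.25; // share of post-cache spend moved to batch
$batchDiscount = 0.50; // price reduction on that share
$targetCost    = 0.05; // a 95% reduction leaves 5% of baseline

// Fraction of baseline spend left once the cache absorbs its hits.
$afterCache = 1.0 - $cacheHitRate;

// Batch discounts only its own share of what remains.
$afterBatch = $afterCache * (1.0 - $batchShare * $batchDiscount);

// What the cascade has to deliver for the stack to reach the target.
$impliedCascadeFactor = $targetCost / $afterBatch;

printf("after cache:    %.3f of baseline\n", $afterCache);           // 0.350
printf("after batch:    %.3f of baseline\n", $afterBatch);           // 0.306
printf("cascade factor: %.3f of that\n", $impliedCascadeFactor);     // 0.163
```

Under those assumptions, cache and batch together bring spend to about 31% of baseline, so the cascade has to cut the blended cost of the remaining calls to roughly a sixth to arrive at 5%.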
The editorial quality held because the expensive model still does all the work that's actually hard; we just stopped paying it to do the easy work.
8. What's Next: Prompt Caching
The newest lever we're rolling out is provider-side prompt caching — where a long, stable system prompt (style guide, formatting rules, examples) is cached on the provider's side and billed at a steep discount on repeat calls. For a news system with a large, rarely-changing editorial style prompt prepended to thousands of calls, that's a natural fit on top of the three layers above.
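Providers that offer prompt caching generally match on an identical leading portion of the prompt, so message order matters: stable text first, per-article text last. The sketch below only shows that ordering. `buildMessages()` and `sharedPrefixCount()` are hypothetical helpers, and provider-specific details (minimum prefix length, opt-in markers, pricing) are intentionally left out.

```php
/**
 * Build a message list with the stable style guide first and the
 * per-article content last, so repeated calls share the longest prefix.
 */
function buildMessages(string $styleGuide, string $taskInstruction, string $articleBody): array
{
    return [
        ['role' => 'system', 'content' => $styleGuide],      // identical on every call
        ['role' => 'system', 'content' => $taskInstruction], // identical per task
        ['role' => 'user',   'content' => $articleBody],     // changes on every call
    ];
}

/**
 * Count how many leading messages two requests have in common; a rough
 * local check that a template change did not break the shared prefix.
 */
function sharedPrefixCount(array $a, array $b): int
{
    $n = 0;
    while (isset($a[$n], $b[$n]) && $a[$n] === $b[$n]) {
        $n++;
    }

    return $n;
}
```

Putting anything volatile (a timestamp, an article ID) ahead of the style guide would make every request's prefix unique, which is the easiest way to lose the discount without noticing.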
The throughline across all of it is the same: a language model is a power tool, not a default. Cache the repeats, defer what isn't urgent, route easy work to cheap models, and verify cheap output with checks that cost nothing. Do that, and AI features stop being a line item that scares the finance team and go back to being what they should be — a feature.
We build news software used by 200+ publishers — agency integration, AI-assisted editorial workflows, native mobile apps, and subscription infrastructure. More on the platform at alestaweb.com.