
Mahmut Gündüzalp

Building a Multi-LLM News CMS with PHP 8.2: Lessons from 200+ Production Sites

Introduction

Over the past 21 years, our team has helped build and maintain a news content management ecosystem that now powers 200+ active news portals across Turkey. In the last 18 months, we've integrated six different LLM providers (OpenAI, Anthropic, Google Gemini, DeepSeek, Groq, and Mistral) into news production workflows.

This article shares the architectural decisions and lessons we've learned: not implementation specifics, but the why-and-when of multi-LLM systems for news publishing, and the patterns that let us reduce AI inference costs by ~95% while keeping quality high.

No buzzwords. Just decisions that hold up in production.

Why Multi-LLM Instead of "Just Use GPT-4"?

Three reasons we don't rely on a single AI provider:

1. Cost optimization. GPT-4o costs $2.50/M input tokens. Gemini Flash costs $0.075/M — 33x cheaper. A simple summary task doesn't need GPT-4o's reasoning. Routing tasks to the right model means massive savings.

2. Vendor independence. When OpenAI had outages in 2024-2025, sites that relied solely on GPT broke. Multi-provider setups fell back to Claude or Gemini seamlessly.

3. Specialized strengths. Claude is better at long-context reasoning. Gemini is better at structured output. Groq is fastest for real-time chat. Mistral handles multilingual content well. Each provider has a sweet spot.

The Cascade Routing Strategy

The core idea: try the fastest and cheapest model first; fall back to more capable (and expensive) models only when needed.

For a news CMS, this means categorizing tasks by complexity:

  • Simple summaries → fast cheap models (Groq Llama, Gemini Flash)
  • Headline suggestions → mid-tier models (GPT-4o-mini, Claude Haiku)
  • SEO meta generation → cheap models suffice
  • Long-form content generation → premium models (Claude Sonnet, GPT-4o)
  • Fact-checking → highest reliability tier (accuracy critical)
  • Translation → mid-tier multilingual models

Each task type gets a fallback chain. If the primary model is rate-limited or unavailable, the system tries the next one — no human intervention needed.
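
Here's a minimal PHP sketch of that idea. The interface, exception type, task names, and provider variables are illustrative placeholders, not our actual codebase:

```php
<?php

// Hypothetical provider contract; real SDK calls live behind it.
interface LlmProvider
{
    /** @throws ProviderUnavailableException on rate limits or outages */
    public function complete(string $prompt): string;
}

final class ProviderUnavailableException extends RuntimeException {}

// Cascade router: try each provider in order until one succeeds.
final class CascadeRouter
{
    /** @param array<string, list<LlmProvider>> $chains task type => ordered providers */
    public function __construct(private readonly array $chains) {}

    public function run(string $taskType, string $prompt): string
    {
        $lastError = null;

        foreach ($this->chains[$taskType] ?? [] as $provider) {
            try {
                return $provider->complete($prompt);
            } catch (ProviderUnavailableException $e) {
                $lastError = $e; // rate-limited or down: fall through to the next one
            }
        }

        throw new RuntimeException("All providers failed for '$taskType'", 0, $lastError);
    }
}

// Example chains: cheap and fast first, premium last.
// $groqLlama, $geminiFlash, etc. are LlmProvider implementations.
$router = new CascadeRouter([
    'summary'  => [$groqLlama, $geminiFlash, $gpt4oMini],
    'longform' => [$claudeSonnet, $gpt4o],
]);
```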

The savings come from the realization that most news CMS tasks don't need the smartest model. A two-line summary of a news article doesn't require frontier reasoning. Reserving premium models for the genuinely hard tasks (complex analysis, fact-checking) is where multi-LLM pays off.

Why Abstraction Matters

Each AI provider has different SDKs, request formats, authentication, error handling, and pricing models. Hiding all of this behind a common interface is what makes multi-LLM practical.

The principle is simple: any provider should be swappable without changing the calling code. A workflow that "summarizes an article" shouldn't care if it's OpenAI, Anthropic, or Google under the hood. Today it might be Gemini Flash; next month it might be a new provider that didn't exist when the code was written.
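
As a sketch, here's one adapter behind that interface. The endpoint and payload follow OpenAI's public chat completions API; the class name is a placeholder and error handling is trimmed for brevity:

```php
<?php

// One provider adapter behind the LlmProvider interface from the cascade sketch.
final class OpenAiProvider implements LlmProvider
{
    public function __construct(
        private readonly string $apiKey,
        private readonly string $model = 'gpt-4o-mini',
    ) {}

    public function complete(string $prompt): string
    {
        $ch = curl_init('https://api.openai.com/v1/chat/completions');
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POST           => true,
            CURLOPT_HTTPHEADER     => [
                'Authorization: Bearer ' . $this->apiKey,
                'Content-Type: application/json',
            ],
            CURLOPT_POSTFIELDS => json_encode([
                'model'    => $this->model,
                'messages' => [['role' => 'user', 'content' => $prompt]],
            ]),
        ]);
        $response = curl_exec($ch);
        $status   = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
        curl_close($ch);

        if ($response === false || $status === 429 || $status >= 500) {
            throw new ProviderUnavailableException("OpenAI unavailable (HTTP $status)");
        }

        return json_decode($response, true)['choices'][0]['message']['content'];
    }
}

// Calling code depends only on the interface, never on a concrete SDK:
function summarize(LlmProvider $llm, string $article): string
{
    return $llm->complete("Summarize this news article in two sentences:\n\n" . $article);
}
```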

This abstraction also makes A/B testing painless. Want to know if Claude Sonnet produces better summaries than GPT-4o for Turkish news? Route 50% of traffic to each, measure quality and cost, decide. Without abstraction, this experiment would require parallel codebases.
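
A minimal split can be as simple as hashing on a stable ID, so the same article always lands in the same bucket and repeated runs stay comparable. A hypothetical sketch reusing the interface from above:

```php
<?php

// 50/50 deterministic split between two providers behind the same interface.
function pickVariant(int $articleId, LlmProvider $a, LlmProvider $b): LlmProvider
{
    return ($articleId % 2 === 0) ? $a : $b;
}

$summary = summarize(pickVariant($article->id, $claudeSonnet, $gpt4o), $article->body);
// Log variant, latency, and token cost alongside editor quality ratings,
// then compare the two populations once you have enough samples.
```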

Cost Optimization: 95% Reduction in Practice

The cost reduction comes from three compounding layers:

Layer 1: Caching (~60% of savings)

Many news CMS tasks are deterministic: "summarize this article" with the same article produces the same answer. Cache once, reuse forever (until the source content changes).

Real-world cache hit rate in production: ~70% for common tasks like summaries, SEO meta tags, and headline suggestions.

The trick is knowing what to cache and what not to. Personalized content, real-time chat, and time-sensitive analysis shouldn't be cached. But the bread-and-butter of news editing (summarize, tag, rewrite headline) is highly cacheable.
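
A caching layer can wrap the same provider interface as a decorator. The sketch below assumes a PSR-16 cache backend (Redis, APCu, whatever you run); the class name and key scheme are illustrative:

```php
<?php

use Psr\SimpleCache\CacheInterface; // PSR-16; any backend works

// Deterministic tasks are keyed on task type, plus a hash of the exact input,
// so the cache invalidates itself whenever the source article changes.
final class CachedLlm implements LlmProvider
{
    public function __construct(
        private readonly LlmProvider    $inner,
        private readonly CacheInterface $cache,
        private readonly string         $taskType,
    ) {}

    public function complete(string $prompt): string
    {
        $key = 'llm_' . $this->taskType . '_' . hash('sha256', $prompt);

        $cached = $this->cache->get($key);
        if ($cached !== null) {
            return $cached; // ~70% of summary/SEO/headline calls end here
        }

        $result = $this->inner->complete($prompt);
        $this->cache->set($key, $result); // no TTL: valid until the content changes
        return $result;
    }
}
```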

Layer 2: Batch APIs (~25% additional savings)

OpenAI's Batch API offers a 50% discount with a 24-hour completion window; Anthropic offers the same. Many news tasks don't need to happen in real time:

  • Overnight SEO meta generation for the day's articles
  • Bulk product description generation for e-commerce catalogs
  • Archived content tagging and categorization
  • Translation backlog processing

Workers collect these into batches and submit them periodically. The savings compound across thousands of operations per day.
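
As a rough sketch of the batching side, here's how queued tasks might be serialized into the JSONL input format OpenAI's Batch API expects (the function and model choice are illustrative; Anthropic's Message Batches API follows a similar shape):

```php
<?php

// Collect queued tasks into a JSONL file for an overnight batch job.
function buildBatchFile(array $tasks, string $path): void
{
    $fh = fopen($path, 'w');
    foreach ($tasks as $id => $prompt) {
        fwrite($fh, json_encode([
            'custom_id' => (string) $id,          // map results back to articles
            'method'    => 'POST',
            'url'       => '/v1/chat/completions',
            'body'      => [
                'model'    => 'gpt-4o-mini',
                'messages' => [['role' => 'user', 'content' => $prompt]],
            ],
        ]) . "\n");
    }
    fclose($fh);
}

// After uploading the file via the /v1/files endpoint (purpose "batch"),
// POST /v1/batches with the returned file id and completion_window "24h",
// then poll for the output file and write results back into the CMS.
```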

Layer 3: Cascade Routing (~10% additional savings)

By the time tasks reach a premium model, they've already been filtered through:

  1. Cache (free)
  2. Cheap model attempt (Groq, Gemini Flash)
  3. Mid-tier model attempt (GPT-4o-mini, Claude Haiku)
  4. Premium model (only when truly needed)

Quality gates between layers reject inadequate outputs from cheap models, but most outputs pass the gate. Premium model usage drops to <10% of total inference calls.
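
A quality gate can be as mundane as a handful of sanity checks. The thresholds below are invented for illustration, not our production values:

```php
<?php

// Gate for a two-line summary task: cheap-model output must pass these
// checks before being accepted; otherwise the cascade escalates to the
// next (more capable) model.
function passesSummaryGate(string $output, string $sourceArticle): bool
{
    $length = mb_strlen($output);

    return $length >= 80 && $length <= 400            // not truncated, not rambling
        && stripos($output, 'as an AI') === false     // naive refusal/meta-leak check
        && substr_count($output, "\n") <= 1           // two lines max
        && $output !== $sourceArticle;                // didn't just echo the input
}
```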

Turkish News Agency Landscape

This is where geographic context matters. Turkey's news ecosystem revolves around 8 major wire services:

  • Anadolu Ajansı (AA) — state news agency
  • Demirören Haber Ajansı (DHA) — major commercial wire
  • İhlas Haber Ajansı (İHA) — conservative-aligned wire
  • ANKA, THA, HİBYA, İGFA, BHA — regional and specialized agencies

Each has its own content format, distribution protocol, and category taxonomy. Generic global CMS platforms (WordPress, Drupal) don't handle this — there's no "Turkish news agency plugin" that connects all eight.

The implementation pattern here is adapter-style integration: each agency gets its own integration module that conforms to a common interface, so the downstream workflow doesn't care which agency the content came from. Adding a 9th or 10th agency becomes a few days of work, not a months-long rewrite.
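
In PHP terms, the contract might look like this (both type names are hypothetical):

```php
<?php

// Normalized shape every agency module must produce.
final class WireArticle
{
    public function __construct(
        public readonly string $externalId,  // agency's own article id, for dedupe
        public readonly string $title,
        public readonly string $body,
        public readonly string $category,    // already mapped to the CMS taxonomy
        public readonly array  $imageUrls,
    ) {}
}

interface NewsAgencyAdapter
{
    public function name(): string;

    /** @return list<WireArticle> articles published since the given time */
    public function fetchSince(DateTimeImmutable $since): array;
}
```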

A scheduled job runs every few minutes, fetches new articles from all enabled agencies in parallel, normalizes image formats (WebP conversion for web performance), generates responsive thumbnails, deduplicates against existing content, and stores everything in a moderation queue. Editors then review, approve, edit, or reject — never publishing raw wire content blindly.
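
A stripped-down sketch of that job, assuming the adapter contract above plus a hypothetical repository for dedupe and the moderation queue (a real deployment would fetch in parallel and handle the WebP/thumbnail steps before queueing):

```php
<?php

/** @param list<NewsAgencyAdapter> $adapters */
function pollAgencies(array $adapters, ArticleRepository $repo, DateTimeImmutable $since): void
{
    foreach ($adapters as $adapter) {
        foreach ($adapter->fetchSince($since) as $article) {
            // Dedupe on a content fingerprint, since the same story
            // often arrives from several wires at once.
            $fingerprint = hash('sha256', mb_strtolower($article->title . $article->body));
            if ($repo->existsByFingerprint($fingerprint)) {
                continue;
            }
            $repo->queueForModeration($adapter->name(), $article, $fingerprint);
        }
    }
}
```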

AI Visibility: SEO's Next Frontier

Traditional SEO targets Google. AI visibility targets ChatGPT, Claude, Gemini, Perplexity, and their successors. The standards are emerging:

  • llms.txt — a markdown file similar to robots.txt but content-focused. It tells LLM crawlers what your site is about, what its key sections are, and how to navigate it.
  • ai-sitemap.xml — like sitemap.xml but with article summaries and structured metadata that LLMs can ingest efficiently.
  • Schema.org JSON-LD — NewsArticle, NewsMediaOrganization, BreadcrumbList, and FAQPage markup gives crawlers structured access to content semantics.
  • Bot allow/disallow rules — explicitly permitting GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, CCBot, Bytespider, AppleBot, and Google-Extended.

The bet is that LLM-based search and answer engines will eventually rival Google for content discovery. Sites optimized only for traditional SEO will lose visibility in this new layer. Adding AI visibility to the standard SEO checklist is cheap insurance.
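
As a concrete example of the bot rules in the list above, a robots.txt excerpt might look like this (verify each crawler's documented user-agent string before relying on it):

```text
# robots.txt excerpt: explicitly welcome LLM crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /
```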

Production Numbers

After 18 months of running multi-LLM stacks across news production:

  • AI inference cost reduction: approximately 95% vs naive GPT-4o-only approach
  • Cache hit rate: approximately 70% on common tasks (summaries, headlines, SEO meta)
  • Provider availability: 99.97% (vs single-provider ~99.5%)
  • Processing throughput: sub-second cached responses, 1-3 seconds for fresh inference
  • Agency content ingestion: 8 agencies polled regularly, thousands of articles processed daily

These numbers come from real production environments, not benchmarks. Your mileage will vary depending on traffic patterns, cache TTL strategy, and quality requirements.

Lessons Learned

After 21 years building CMS software and 18 months optimizing for AI:

1. Don't lock into one provider. It's tempting to "just use OpenAI." Don't. The day they have an outage or change pricing, you'll wish you had alternatives ready.

2. Cache aggressively, but thoughtfully. Most AI tasks repeat with deterministic outputs. Cache them. But know which tasks must always be fresh (personalized, real-time, time-sensitive).

3. Route by task complexity, not by hype. Most tasks don't need GPT-4o or Claude Opus. A cheap model gets 90% of the work done at 5% of the cost. Save premium models for genuinely hard tasks.

4. Local regulations are first-class concerns. In Turkey: KVKK (data protection), İYS (marketing consent registry), BİK (Press Advertising Authority) compliance. In EU: GDPR, AI Act. Don't bolt these on later — design for them from day one.

5. Quality gates matter. A cheap model giving wrong answers is more expensive than an expensive model giving right ones (especially when wrong outputs damage brand trust). Add validation between cascade layers.

6. Stable beats shiny. Modern PHP isn't trendy. Smarty isn't trendy. MySQL isn't trendy. They all run reliably for years. The newest framework will be deprecated in three. Pick stable.

Closing Thoughts

The "best" architecture for a news CMS isn't the most novel. It's the one that:

  • Works reliably for years
  • Costs less than what you charge clients
  • Handles local quirks (regulatory, linguistic, cultural)
  • Survives provider deprecations

Multi-LLM with cascade routing and aggressive caching fits that bill in 2026. It will probably fit it in 2030 too — the providers will change, but the abstraction principle won't.

If you're building or evaluating a multi-provider AI architecture, focus on decision points rather than specific implementations. The provider you start with may not be the one you finish with. The pattern that works today should still work when the entire AI landscape has rotated through three or four hype cycles.


I write about practical software architecture, multi-LLM systems, and lessons from running CMS at scale. Feel free to drop questions in the comments — I read all of them, even when I don't reply quickly.
