Building Newsroom AI Modules in PHP: 50+ Specialized Workflows

#ai #php #news #architecture

When people picture "AI in a newsroom," they usually imagine one big chat box bolted onto the editor. That's the wrong mental model, and it's the reason most of those features get used twice and then ignored.

A newsroom doesn't have one AI need. It has dozens of small, specific, repetitive ones: write three headline options, pull a 280-character social blurb, suggest a category, tag entities, generate alt text for the lead image, flag a defamation risk, translate the dek into English, propose an SEO title under 60 characters. Each is a tiny job with its own input shape, its own output contract, and its own tolerance for being wrong.

After running this across 200+ news sites, the architecture that actually stuck was not "an AI feature." It was a registry of small, specialized workflows behind a uniform interface. This article shows how that's built in PHP, along with the handful of decisions that matter more than which model you pick.

The unit of work: a task, not a prompt

The first thing to get right is the boundary. The reusable unit is a task — a named workflow with a fixed contract — not a free-floating prompt string. A task knows three things: what it needs, what it promises to return, and how expensive it's allowed to be.

interface AiTask
{
    public function name(): string;          // 'headline.suggest'
    public function tier(): string;          // 'cheap' | 'standard' | 'premium'
    public function ttl(): int;              // cache lifetime in seconds, read by the router
    public function build(array $input): array;   // -> messages for the model
    public function parse(string $raw): array;    // -> validated structured output
}

That parse() method is the part teams skip and regret. A model that returns prose when you expected JSON is not an edge case; it's Tuesday. If every task is responsible for validating its own output, a bad response fails inside the task where you can retry or fall back, instead of leaking malformed data into the editor.

A concrete task looks like this:

final class HeadlineSuggestTask implements AiTask
{
    public function name(): string { return 'headline.suggest'; }
    public function tier(): string { return 'cheap'; }
    public function ttl(): int { return 3600; }   // illustrative value

    public function build(array $input): array
    {
        return [
            ['role' => 'system', 'content' =>
                'You are a Turkish news copy editor. Return exactly 3 headline '
                . 'options as a JSON array of strings. Max 70 characters each. '
                . 'No clickbait, no ALL CAPS.'],
            // Cap the body so a long article cannot blow up the prompt size.
            ['role' => 'user', 'content' => mb_substr($input['body'], 0, 4000)],
        ];
    }

    public function parse(string $raw): array
    {
        $data = json_decode($this->stripFences($raw), true);
        // Keep only string entries so nested arrays or nulls never reach the editor.
        $headlines = is_array($data) ? array_values(array_filter($data, 'is_string')) : [];
        if ($headlines === []) {
            throw new BadOutputException('headline.suggest: not a list of strings');
        }
        return array_slice($headlines, 0, 3);
    }

    // Remove a Markdown code fence that models often wrap around JSON.
    private function stripFences(string $raw): string
    {
        return trim(preg_replace('/^\s*`{3}[a-zA-Z]*\s*|\s*`{3}\s*$/', '', $raw) ?? $raw);
    }
}

Once you have one of these, you have the shape for fifty. The interesting work isn't writing the fiftieth task — it's the machinery around them.
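To show how little changes from one task to the next, here is a minimal sketch of the parse() half of a category.suggest task, written as a standalone function so it runs on its own. The output shape ({"category": "<slug>"}), the function name parseCategory(), and the example slugs are assumptions for illustration; BadOutputException is redeclared here only because the article never shows its definition.

```php
<?php
declare(strict_types=1);

// Assumed definition; the article uses this exception but does not show it.
class BadOutputException extends RuntimeException {}

/**
 * Validate the raw model output for a hypothetical category.suggest task.
 *
 * @param list<string> $allowed category slugs this site actually uses
 * @return array{category: string}
 */
function parseCategory(string $raw, array $allowed): array
{
    $data = json_decode(trim($raw), true);
    $slug = is_array($data) ? ($data['category'] ?? null) : null;

    // Valid JSON is not enough: the slug must come from the closed list.
    if (!is_string($slug) || !in_array($slug, $allowed, true)) {
        throw new BadOutputException('category.suggest: unknown category');
    }

    return ['category' => $slug];
}

// A well-formed answer passes through unchanged.
$ok = parseCategory('{"category":"spor"}', ['spor', 'ekonomi']);
```

The contract differs from the headline task (one value from a closed set instead of a list of free strings), but the pattern is the same: decode, check against the contract, throw on anything else.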

Why "50+" is an architecture choice, not a brag

The number isn't the point; the granularity is. You could collapse "suggest headline," "suggest SEO title," and "suggest social blurb" into one mega-prompt that returns all three. Don't. Three reasons:

Different cost tiers. A category suggestion can run on a fast, cheap model. A legal-risk flag should not.
Different failure handling. If headline generation fails, you shrug and the editor types one. If entity tagging fails silently, your archive search quietly rots.
Different UX surfaces. The blurb belongs to the social scheduler; the alt text belongs to the image picker. Coupling them in one call couples two unrelated screens.

Small tasks compose. Big prompts calcify.

Here's the rough taxonomy that emerged — grouped, because grouping is how editors actually find them:

Group	Example tasks
Headlines & framing	headline options, SEO title, social blurb, push-notification text
Structure	summary/dek, key-points list, "read more" suggestions
Classification	category suggest, tag/entity extraction, topic clustering
Media	image alt text, caption draft, thumbnail crop hint
Quality & risk	tone check, defamation/risk flag, fact-claim highlighter, profanity filter
SEO & distribution	meta description, schema keywords, related-article linking
Language	translate dek, simplify, localize idiom

That's already two dozen or so before you count per-language and per-section variants. The registry is what keeps it from becoming chaos.

The registry and the router

The registry is boring on purpose: a name-to-task map. The router is where the one genuinely valuable idea lives — routing by tier, not by vibes.

final class AiRouter
{
    /** @param array<string, AiTask> $tasks */
    public function __construct(
        private array $tasks,
        private ModelGateway $gateway,
        private AiCache $cache,
    ) {}

    public function run(string $taskName, array $input): array
    {
        $task = $this->tasks[$taskName]
            ?? throw new UnknownTaskException($taskName);

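        $key = $this->cacheKey($taskName, $input);
        // Compare against null so a cached empty result still counts as a hit
        // (assumes AiCache::get() returns null on a miss).
        if (($hit = $this->cache->get($key)) !== null) {
            return $hit;
        }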

        $messages = $task->build($input);
        $raw      = $this->gateway->complete($task->tier(), $messages);

        try {
            $out = $task->parse($raw);
        } catch (BadOutputException $e) {
            // one retry on the next tier up before giving up
            $raw = $this->gateway->complete($this->escalate($task->tier()), $messages);
            $out = $task->parse($raw);
        }

        $this->cache->put($key, $out, $task->ttl());
        return $out;
    }
}

Three things are doing real work here and each earns its place:

Caching by (task, input) hash. The same article body gets a headline suggestion once. Editors click these buttons repeatedly; without a cache you pay for every nervous re-click. This single layer was the biggest cost reduction we measured.
Tier-based escalation on bad output. Cheap model first. If it returns garbage that fails parse(), retry once on a stronger tier. Most cheap-model failures are formatting failures, and they don't repeat on the better model. You get cheap-model economics with premium-model reliability on the tail.
The gateway hides the provider. complete(tier, messages) is the entire surface the task sees. Whether 'cheap' maps to one provider this month and another next month is an ops decision, not a code change in 50 tasks.
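
The router calls two helpers it never shows, cacheKey() and escalate(). Here is a minimal sketch of both as plain functions, assuming PHP 8.1+ (for array_is_list() and match). The 'ai:' key prefix, the choice of SHA-256, the canonicalize() helper, and the cheap-to-standard-to-premium ladder are assumptions for illustration, not the production code.

```php
<?php
declare(strict_types=1);

/** Sort map keys recursively so equal inputs hash the same regardless of key order. */
function canonicalize(array $input): array
{
    // Lists keep their order (it is meaningful); only keyed maps are sorted.
    if (!array_is_list($input)) {
        ksort($input);
    }
    foreach ($input as $k => $v) {
        if (is_array($v)) {
            $input[$k] = canonicalize($v);
        }
    }
    return $input;
}

/** Build a cache key from the task name plus a hash of the normalized input. */
function cacheKey(string $taskName, array $input): string
{
    $payload = json_encode(
        canonicalize($input),
        JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE
    );
    return 'ai:' . $taskName . ':' . hash('sha256', $payload);
}

/** Return the next tier up; the top tier escalates to itself. */
function escalate(string $tier): string
{
    return match ($tier) {
        'cheap'               => 'standard',
        'standard', 'premium' => 'premium',
        default               => throw new InvalidArgumentException("Unknown tier: {$tier}"),
    };
}
```

Including the task name in the key matters: the same article body feeds many tasks, and a headline result must never be served for an SEO-title request.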

The gateway: one seam for every provider

The gateway is what makes provider diversity survivable. News work is bursty and rate limits are real, so you want the freedom to move a tier between providers without touching task code.

final class ModelGateway
{
    public function __construct(private array $tierConfig) {}

    public function complete(string $tier, array $messages): string
    {
        $cfg = $this->tierConfig[$tier]           // provider + model + limits
            ?? throw new InvalidArgumentException("Unknown tier: {$tier}");
        $provider = ProviderFactory::make($cfg['provider']);

        return $provider->chat(
            model:    $cfg['model'],
            messages: $messages,
            maxTokens: $cfg['max_tokens'],
        );
    }
}

The payoff is operational, not architectural elegance for its own sake. When a provider degrades at 9 a.m. on an election morning — and it will — you change a config map, not a deployment. Keeping the model identity out of the task and in the tier config is the difference between a five-minute mitigation and a panic.
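
A rough sketch of what that config map might look like, and what "changing it" amounts to. Every provider and model name below is a placeholder rather than a real product, and repointTier() is a hypothetical helper; in practice the same change could simply be an edit to a config file.

```php
<?php
declare(strict_types=1);

// Placeholder names throughout: these are not real providers or models.
$tierConfig = [
    'cheap'    => ['provider' => 'provider_a', 'model' => 'small-fast',    'max_tokens' => 256],
    'standard' => ['provider' => 'provider_a', 'model' => 'mid-general',   'max_tokens' => 512],
    'premium'  => ['provider' => 'provider_b', 'model' => 'large-careful', 'max_tokens' => 1024],
];

/** Return a copy of the config with one tier pointed at another provider and model. */
function repointTier(array $config, string $tier, string $provider, string $model): array
{
    if (!isset($config[$tier])) {
        throw new InvalidArgumentException("Unknown tier: {$tier}");
    }
    $config[$tier]['provider'] = $provider;
    $config[$tier]['model']    = $model;
    return $config; // arrays are copied by value, so the original is untouched
}

// provider_a is degrading: move the cheap tier and leave everything else alone.
$mitigated = repointTier($tierConfig, 'cheap', 'provider_b', 'small-backup');
```

No task class appears anywhere in that change, which is the whole point of routing by tier name.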

Quality gates: cheap models lie confidently

A specialized-task design tempts you toward cheap models everywhere, because each job is small. The trap is that small jobs still produce confidently wrong output. The defense is deterministic gates around non-deterministic output — code, not another model, checks the boring constraints.

function validateHeadlines(array $headlines): array
{
    return array_values(array_filter($headlines, function (string $h): bool {
        $len = mb_strlen($h);
        if ($len < 15 || $len > 70)           return false; // length contract
        if ($h === mb_strtoupper($h))          return false; // no ALL CAPS
        if (preg_match('/!{2,}|\?{2,}/', $h))  return false; // no "!!!"
        return true;
    }));
}

Notice there's no model in that function. The expensive judgment ("is this a good headline?") stays with the human editor. The cheap, mechanical judgment ("is this even a valid headline-shaped string?") is plain PHP that runs in microseconds and never hallucinates. Push every constraint you can express as code out of the prompt and into a gate. Prompts are for taste; code is for rules.
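
The same idea carries over to the other length contracts mentioned at the top: the 280-character social blurb and the SEO title under 60 characters. A minimal sketch, where the function names and the minimum length of 10 are assumptions for illustration:

```php
<?php
declare(strict_types=1);

/** Generic length gate, counting characters rather than bytes. */
function withinLength(string $text, int $min, int $max): bool
{
    $len = mb_strlen(trim($text));
    return $len >= $min && $len <= $max;
}

function validateSocialBlurb(string $blurb): bool
{
    // 280 is the limit from the blurb contract; 10 is an assumed floor.
    return withinLength($blurb, 10, 280);
}

function validateSeoTitle(string $title): bool
{
    // "Under 60" is read strictly here, so 59 is the longest accepted title.
    return withinLength($title, 10, 59);
}
```

Counting with mb_strlen() rather than strlen() matters for Turkish text, where characters such as ş and ı take two bytes each in UTF-8.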

The one place to spend a premium model deliberately is risk — defamation, sensitive claims, anything where a wrong call has legal weight. That task should run on your strongest tier, never cache its "looks fine" verdict for long, and always present as advisory to a human. AI flags; people decide.
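
Those three rules can be sketched in a few lines. The verdict shape ({"level": "none" | "review", "reasons": [...]}), both function names, and the TTL numbers are assumptions for illustration; BadOutputException is redeclared only so the sketch runs on its own.

```php
<?php
declare(strict_types=1);

// Assumed definition; the article uses this exception but does not show it.
class BadOutputException extends RuntimeException {}

/** Parse a risk verdict and force it into an advisory shape. */
function parseRiskVerdict(string $raw): array
{
    $data  = json_decode(trim($raw), true);
    $level = is_array($data) ? ($data['level'] ?? null) : null;

    if (!in_array($level, ['none', 'review'], true)) {
        throw new BadOutputException('risk.flag: unknown level');
    }

    $reasons = is_array($data['reasons'] ?? null) ? $data['reasons'] : [];

    return [
        'level'    => $level,
        'reasons'  => array_values(array_filter($reasons, 'is_string')),
        'advisory' => true, // always set by code: the model never gets to decide
    ];
}

/** Cache lifetime in seconds; a "looks fine" result is kept only briefly. */
function riskTtl(array $verdict): int
{
    return $verdict['level'] === 'none' ? 60 : 600;
}
```

The advisory flag is hard-coded instead of read from the model output, so no response can promote itself into a decision.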

Asynchronous by default

Editors will not wait four seconds for a button. Anything slower than roughly a second belongs in the background, with the result arriving when it's ready rather than blocking the save.

// On save: enqueue, don't block.
$queue->push('ai.enrich', [
    'article_id' => $id,
    'tasks'      => ['summary.make', 'tag.extract', 'category.suggest', 'seo.meta'],
]);

// A worker drains the queue and writes suggestions back as drafts
// the editor can accept or ignore — never auto-published.

Two rules that saved us real pain:

Suggestions are drafts, never silent writes. AI output lands in a "suggested" state. A human accepts it. The day you let a model write directly to the published field is the day you explain a hallucinated dateline to your editor-in-chief.
Idempotent jobs. Queues retry. If summary.make runs twice, the second run should overwrite the same suggestion slot, not create a duplicate. Key the write by (article_id, task_name).
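
Both rules fit in a small sketch. SuggestionStore is a hypothetical in-memory stand-in for a suggestions table with a unique key on (article_id, task_name); in a real database the same effect would come from an upsert against that key. The $runTask callable stands in for the router and is likewise an assumption.

```php
<?php
declare(strict_types=1);

/** In-memory stand-in for a table with a unique (article_id, task_name) key. */
final class SuggestionStore
{
    /** @var array<string, array> */
    private array $rows = [];

    public function upsert(int $articleId, string $taskName, array $output): void
    {
        // Same slot on every run, so a retried job overwrites instead of duplicating.
        $this->rows[$articleId . ':' . $taskName] = [
            'article_id' => $articleId,
            'task_name'  => $taskName,
            'output'     => $output,
            'status'     => 'suggested', // a human must accept it before it is used
        ];
    }

    public function get(int $articleId, string $taskName): ?array
    {
        return $this->rows[$articleId . ':' . $taskName] ?? null;
    }

    public function size(): int
    {
        return count($this->rows);
    }
}

/**
 * Handle one 'ai.enrich' job.
 * $runTask is assumed to have the shape fn(string $taskName, int $articleId): array.
 */
function handleEnrichJob(array $job, callable $runTask, SuggestionStore $store): void
{
    foreach ($job['tasks'] as $taskName) {
        $output = $runTask($taskName, $job['article_id']);
        $store->upsert($job['article_id'], $taskName, $output);
    }
}
```

Running the same job twice leaves one row per task, which is the property that makes queue retries harmless.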

What I'd tell my past self

Model the task contract first, the prompt second. The interface outlives any specific model.
Validate output inside the task. Malformed responses are normal; treat them as control flow, not exceptions to your worldview.
Route by tier, cache by input. These two together did more for cost and reliability than any prompt-engineering cleverness.
Keep rules in code, taste in prompts. Every constraint you can check deterministically is one the model can't violate.
Async, advisory, idempotent. The newsroom trusts a tool that suggests and never surprises.

The "50+" isn't a feature count to put on a slide. It's what falls out naturally once each AI job is small enough to have a single clear contract. Build the seam — task, registry, router, gateway, gate — and adding the fifty-first workflow is an afternoon, not a project.

We've been refining this pattern in production news software at Alesta WEB, across publishers of very different sizes, and the lesson keeps repeating: the architecture, not the model, is what makes newsroom AI feel reliable instead of gimmicky.