Gabriel Anhaia

Spud Was the Rumored GPT-6. It Shipped as GPT-5.5, Two Tiers Inside.


The codename was "Spud." Pretraining reportedly finished in late March 2026, the rumor calendar settled on April 14, and then April 14 came and went with no blog post and no model. Nine days later, on April 23, the model actually shipped — but as GPT-5.5, not GPT-6 (Axios coverage, OpenAI's announcement). OpenAI kept the GPT-5 family name and shipped three variants (standard, Thinking, Pro), API reasoning-effort knobs (low, medium, high, xhigh, non-reasoning), and a claimed 60% reduction in hallucinations versus the previous release.

The "delay" headline is the boring read. The interesting read, and the one worth a Sunday afternoon of architecture thinking, is what the tiered reasoning surface tells you about where inference is heading. The Spud release is not one model. It's a routing layer over multiple compute tiers, with the prompt and the reasoning.effort parameter deciding which tier handles your call. That's the same pattern the rest of us have been hand-rolling for two years under names like "verifier loop" and "cheap-fast plus deep." OpenAI is normalizing it.

If you're shipping anything that calls an LLM, the question now is whether you're ready when the tier choice becomes a first-class API parameter on every provider, the way model name is today.

What "two-tier" actually means in 2026

The fast version: System-1 is the cheap, fast pass that answers most queries. System-2 is the expensive verifier or reasoner that checks the work, does math, walks the chain. The naming comes from Kahneman. The engineering pattern is older than that. It's just if confidence < threshold: ask the smart model.

What changed in April 2026 is how the API surfaces it. Look at OpenAI's GPT-5.5 docs. You don't pick "GPT-5.5 Fast" or "GPT-5.5 Verifier." Instead, you pick one model and pass reasoning.effort=high or reasoning.effort=low. Internally, OpenAI routes your call through different compute paths. Externally, it looks like a single model with a dial. The dial is the API.
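
Concretely, the dial looks something like this. A sketch against the Responses API shape; the gpt-5.5 model string and the xhigh value come from the release notes above, so treat the exact spelling as provisional:

from openai import OpenAI

client = OpenAI()

# Same model string, different compute path: only the effort value changes.
cheap = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "low"},    # fast path, minimal deliberation
    input="Format this JSON: {'a': 1}",
)
deep = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "xhigh"},  # slow path, full reasoning budget
    input="Prove that sqrt(2) is irrational.",
)
print(cheap.output_text)
print(deep.output_text)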

The hallucination claim that came with the release (60% fewer than the prior release on OpenAI's own evals) is real but methodology-locked. One independent benchmark tells a more complicated story: as reported by VentureBeat citing Artificial Analysis, GPT-5.5 hit a hallucination rate of 86% on the AA-Omniscience eval, against Opus 4.7 at 36%. On that eval, the model knows more than the competition. On that eval, the model is also more confident when it's wrong. That's the part the press release skipped.

The two-tier framing makes the methodology question even more important. Hallucination rates on effort=xhigh are not hallucination rates on effort=low, and your production code is probably calling low to keep the bill down.

Why this pattern was always coming

The economic argument writes itself. Most queries are easy. "What's the capital of France." "Format this JSON." "Rephrase this email." A 400B-parameter reasoner is overkill, and the latency is bad enough to hurt the UX. Routing those to a smaller fast path is free latency and free money.

The remaining queries (chain-of-thought, multi-step planning, math, code) benefit from the expensive path. So you build a router. The fast path handles 80% of traffic at 1/10 the cost. The slow path handles the 20% where it actually earns its keep. Average cost drops, average quality goes up, the user can't tell which path their query took.
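
The arithmetic is worth writing down once, using the 80/20 split and the 1/10 price ratio from above (prices illustrative):

SLOW_COST = 1.00            # illustrative $ per call on the expensive path
FAST_COST = SLOW_COST / 10  # the cheap path at 1/10 the cost

blended = 0.8 * FAST_COST + 0.2 * SLOW_COST  # = 0.28
print(f"blended: {blended:.2f}x vs 1.00x for slow-only")
# A 72% cost cut before you count the latency win on the easy 80%.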

This is also how Anthropic's extended thinking, Google Gemini's thinking modes, and DeepSeek's reasoning models all work. They share one architecture: a cheap pass with a gated escalation to an expensive one. The convergence is not a coincidence. It's the only design that makes economic sense at scale.
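
For comparison, Anthropic spells the same dial as a token budget rather than an effort enum. A sketch using the extended-thinking parameters; the claude-opus-4-7 model string is borrowed from the eval above, and the budget is illustrative:

from anthropic import Anthropic

client = Anthropic()

# Same idea, different spelling: buy deliberation by the token.
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Plan a three-service migration."}],
)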

What you should build now

The next 12 months will see every major LLM provider expose two-tier semantics. If your code calls one model for everything, you'll waste money on easy queries and miss quality on hard ones. The fix is to wrap your LLM calls in a router today, even if the router is dumb to start. When two-tier APIs land, you swap the dumb router for a smarter one and the rest of your code doesn't change.
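
One way to keep that swap cheap is to hide the decision behind a tiny interface. Names here are illustrative, and call_single_model stands in for whatever you call today:

from typing import Protocol

class Router(Protocol):
    def complete(self, prompt: str) -> str: ...

class DumbRouter:
    """Day-one version: everything goes to one model."""
    def complete(self, prompt: str) -> str:
        return call_single_model(prompt)  # your existing single-model path

# Later, swap DumbRouter for the two-tier router sketched below without
# touching any handler that depends only on Router.complete.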

A first pass, sketch only. The two helpers (avg_top_logprob, needs_reasoning) are stubbed with naive placeholders; swap in your team's own confidence-scoring and prompt-classification heuristics:

from anthropic import Anthropic
from openai import OpenAI

fast = OpenAI()
slow = Anthropic()


def avg_top_logprob(resp) -> float:
    """Mean token logprob of the draft. Naive placeholder: swap in your
    own calibration (min logprob, entropy over the top-5, etc.)."""
    tokens = resp.choices[0].logprobs.content
    return sum(t.logprob for t in tokens) / max(len(tokens), 1)


def needs_reasoning(prompt: str) -> bool:
    """Keyword placeholder for a real prompt classifier."""
    return any(k in prompt.lower() for k in ("prove", "calculate", "step by step"))


def two_tier_complete(prompt: str) -> str:
    """Cheap pass first, verifier on low-confidence answers."""
    fast_resp = fast.chat.completions.create(
        model="gpt-5.5-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        logprobs=True,
        top_logprobs=5,
    )
    answer = fast_resp.choices[0].message.content
    confidence = avg_top_logprob(fast_resp)

    # Confident draft on an easy prompt: return without escalating.
    if confidence > -0.5 and not needs_reasoning(prompt):
        return answer

    # Escalate: the verifier reads the draft instead of generating from zero.
    verify_prompt = (
        f"Question: {prompt}\n"
        f"Draft answer: {answer}\n"
        "If the draft is correct, repeat it verbatim. "
        "If incorrect, give the corrected answer with brief reasoning."
    )
    slow_resp = slow.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        messages=[{"role": "user", "content": verify_prompt}],
    )
    return slow_resp.content[0].text

Three things to notice. First, the gating signal is a mix of model self-confidence (logprobs) and a heuristic on the prompt (needs_reasoning). Second, the slow model isn't called from scratch. It sees the fast model's draft. That's how you get the cost savings: the verifier reads a short draft instead of generating from zero. Third, the API contract upstream stays the same. Your handler still calls two_tier_complete(prompt). The tiering is internal.

The verifier prompt is the hard part

The gating logic above is easy. The verifier prompt is where teams spend their time and where the quality lives. A naive "is this answer correct" prompt will rubber-stamp the fast model's output most of the time because the verifier sees a confident answer and pattern-matches to "looks fine."

The pattern that works is to make the verifier answer the question fresh, with the draft supplied only as a hint. The verifier does its own reasoning, then compares. If its answer matches the draft, ship the draft. If it diverges, ship the verifier's answer.

verify_prompt = (
    "You are answering this question independently. "
    f"Question: {prompt}\n\n"
    f"Hint (may be wrong): {draft}\n\n"
    "Give your own answer. Then say 'AGREES' or 'DIFFERS' "
    "compared to the hint."
)
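
Closing the loop on that prompt, a naive parse of the marker. It assumes the verifier follows the format; in production, harden this with a retry or a stricter output schema:

def pick_answer(draft: str, verifier_text: str) -> str:
    # The verifier independently landed on the same answer: ship the
    # cheap draft and keep the fast-path result.
    if "AGREES" in verifier_text:
        return draft
    # Otherwise trust the verifier's own answer over the draft.
    return verifier_text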

This trick keeps the verifier honest. It's also, plausibly, how OpenAI's reasoning.effort=high mode avoids the rubber-stamp failure internally. The deep model isn't checking the shallow model's homework. It's doing the homework from scratch with the shallow model's draft as scaffolding.

What to instrument before two-tier APIs land

When the official two-tier APIs ship, every call you make will be one of three things: pure fast tier, pure slow tier, or fast-then-verified. Your traces need to capture which one. Otherwise you'll burn money on accidental effort=high calls and not notice for a month.

Three fields, every span:

span.set_attribute("llm.tier", "fast" | "slow" | "verified")
span.set_attribute("llm.fast_cost_usd", fast_cost)
span.set_attribute("llm.slow_cost_usd", slow_cost or 0)

Add a daily report that breaks cost down by tier and answer length. The leading-indicator metric is "% of queries that escalated to slow tier." If that climbs from 12% to 35% over a week, your gating heuristic is broken or your traffic mix shifted, and either way you want to know now, not at month-end.

The second metric to watch is disagreement rate. Of the queries where slow-tier verification ran, what fraction produced a different answer than the fast tier? A healthy ratio is 5–15% as a rough rule of thumb; tune to your traffic. If it's 40%, your fast tier is too dumb for your traffic. If it's 0%, the verifier is rubber-stamping and you're paying for nothing.
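
Both metrics fall out of the spans above. A sketch, assuming each span was exported as a dict carrying the three llm.* attributes plus a hypothetical llm.agreed flag set on verified calls:

def tier_report(spans: list[dict]) -> dict:
    total = len(spans)
    verified = [s for s in spans if s["llm.tier"] == "verified"]
    slow = [s for s in spans if s["llm.tier"] == "slow"]
    disagreed = sum(1 for s in verified if not s["llm.agreed"])  # assumed flag
    return {
        "escalation_rate": (len(verified) + len(slow)) / total,
        "disagreement_rate": disagreed / max(len(verified), 1),  # 5-15% healthy
        "slow_spend_usd": sum(s.get("llm.slow_cost_usd", 0) for s in spans),
    }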

The architectural read

Strip away the headline drama and OpenAI's "delay" was mostly a rebrand. The interesting story is the API shape, not the version number.

Value here goes to teams that wrapped their LLM calls in a router before the providers did. They already know which queries belong in which tier. Their bill doesn't spike when traffic does. That work pays off whether the next big release is from OpenAI, Anthropic, or DeepSeek. The router survives the model swap. The model doesn't.

Build the wrapper this week. Even a dumb one. The day a competitor ships an "effort" knob, you'll already have a place to plug it in.

If this was useful

Two-tier inference is an agent-design pattern wearing an inference-API costume. The pocket guides below cover the underlying shape: how to design the verifier loops, what prompts make the second tier earn its keep, and the wrapping patterns that survive a provider swap.

AI Agents Pocket Guide

Prompt Engineering Pocket Guide
