Gabriel Anhaia

OpenAI Outage Postmortem: What Status Pages Don't Tell You


On April 20, 2026, around 10:00 AM ET, ChatGPT users started losing access to projects mid-session, with reports of in-flight work being lost. The OpenAI status page eventually flagged a "partial outage" that ran roughly 90 minutes before the recovery banner went up, per Tom's Guide live coverage. A month earlier, customers reported that an Azure-hosted GPT-5.2 endpoint returned HTTP 400 and HTTP 429 to a slice of traffic from 23:20 UTC on March 9 to 19:32 UTC on March 10 while the rest of the model lineup ran fine (per the Azure OpenAI Service status history). Two incidents in six weeks. Both status pages flipped late. Both postmortems focus on aggregate availability rather than per-call behavior.

The status page is a coarse aggregate signal with a green dot. Vendor postmortems are written after the fact and typically report aggregate availability, not what your p99 looked like at 10:14 AM. If your service depends on a hosted LLM, you cannot wait on vendor lights and you cannot wait for the writeup. You instrument your side of the wire.

Below: the five signals I instrument on every LLM call and the OpenTelemetry snippet that captures them. Tested against the shape of real incidents, not the shape of "is the API up."

Why status pages miss what your users feel

The status page is a binary signal computed from a smoothed metric over a global aggregate. It flips only when that aggregate crosses a threshold, which means three failure modes routinely fly under it:

  • Silent latency creep. p50 stays normal, p99 doubles, the average sits inside the SLA band, the dot stays green. Your users notice. The status page does not.
  • Regional skew. US-East is fine, EU is degraded. Aggregate is fine. Your EU traffic is bleeding.
  • Model-routing shifts. Your call to gpt-4o lands on a fallback variant during incident recovery. Latency is fine. Output quality drifts. No status page tracks that.

To these three, add two more that I have seen burn teams over the last 12 months. Partial token-throughput degradation: your stream keeps producing tokens, but at half the speed, so time-to-first-byte looks fine while total response time blows past your timeout. Schema-validation drift: the model's structured output starts failing your validator at a higher rate, which is invisible to latency metrics and shows up only in your downstream pipeline.

The five signals

Worth instrumenting on every LLM call. If you have one of them, you have an alarm. If you have all five, you have an incident timeline.

  1. Per-model p50 / p95 / p99 latency, broken out by model_name. Aggregate latency hides model-specific regressions.
  2. Per-region error rate, broken out by your egress region. If your traffic crosses regions, this catches regional skew.
  3. Token throughput, in tokens per second, measured during streaming. The hot signal for streaming workloads. Status pages cannot measure this; you can.
  4. Time-to-first-token (TTFT), separately from total latency. TTFT spikes are the earliest hint of provider-side queueing.
  5. Structured-output validation rate, the share of calls that pass your downstream schema. Catches the silent drift that latency metrics miss.

These five are the leading indicators. Cost-per-call, total volume, and cache hit rate are the lagging ones; useful for the postmortem but not for the alarm.

The OTel snippet

OpenTelemetry traces and metrics for an OpenAI call, with the five signals captured. Python, but the shape is the same in every language with an OTel SDK.

import time
from contextlib import contextmanager
from openai import OpenAI
from opentelemetry import metrics, trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("llm.client")
meter = metrics.get_meter("llm.client")

latency_hist = meter.create_histogram(
    "llm.request.duration",
    unit="s",
    description="End-to-end LLM call latency.",
)
ttft_hist = meter.create_histogram(
    "llm.request.ttft",
    unit="s",
    description="Time to first streamed token.",
)
tps_hist = meter.create_histogram(
    "llm.request.tokens_per_second",
    unit="tok/s",
    description="Streaming token throughput.",
)
err_counter = meter.create_counter(
    "llm.request.errors",
    description="LLM call errors by region and model.",
)
schema_counter = meter.create_counter(
    "llm.request.schema_failures",
    description="Structured-output schema validation failures.",
)

client = OpenAI()


@contextmanager
def llm_span(model: str, region: str):
    attrs = {"llm.model": model, "llm.region": region}
    start = time.perf_counter()
    with tracer.start_as_current_span("llm.call", attributes=attrs) as span:
        try:
            yield span, attrs
        except Exception as e:
            err_counter.add(1, attrs)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
        finally:
            latency_hist.record(time.perf_counter() - start, attrs)


def call_streaming(model: str, region: str, prompt: str, validate):
    with llm_span(model, region) as (span, attrs):
        # Start the clock before the request so TTFT includes connection
        # setup and provider-side queueing, not just the streaming phase.
        start = time.perf_counter()
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        first_token_at = None
        token_count = 0
        chunks = []

        for chunk in stream:
            delta = chunk.choices[0].delta.content or ""
            if delta and first_token_at is None:
                first_token_at = time.perf_counter()
                ttft_hist.record(first_token_at - start, attrs)
            # Word-count proxy. For accurate tokens/sec, swap with
            # tiktoken or chunk.usage.completion_tokens on the final chunk.
            token_count += len(delta.split())
            chunks.append(delta)

        elapsed = time.perf_counter() - (first_token_at or start)
        if elapsed > 0:
            tps_hist.record(token_count / elapsed, attrs)

        text = "".join(chunks)
        if not validate(text):
            schema_counter.add(1, attrs)
            span.set_attribute("llm.schema.valid", False)
        else:
            span.set_attribute("llm.schema.valid", True)
        return text
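
One thing the snippet assumes: a tracer provider and meter provider are already configured. Without them, the instruments above are silent no-ops. A minimal wiring sketch, assuming an OTLP collector on the default local endpoint; the llm-gateway service name is a placeholder.

from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Run this once at process startup, before traffic starts flowing.
resource = Resource.create({"service.name": "llm-gateway"})  # placeholder name

trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))

metrics.set_meter_provider(
    MeterProvider(
        resource=resource,
        metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter())],
    )
)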

A few choices worth naming. TTFT is recorded on first non-empty delta, not on the first network packet, because the upstream sometimes streams empty deltas as a keepalive and you want the first useful token. Tokens-per-second is computed from the first-token boundary, not from request start, so you measure the streaming path and not the queue. Schema validation runs after the response and increments a separate counter, because schema drift is a different alarm class from latency.
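
For completeness, a hypothetical caller. The validator here is an assumption about what your downstream expects, a JSON payload with a summary and a sentiment label; swap in pydantic or jsonschema for anything real.

import json


def validate_summary(text: str) -> bool:
    # Hypothetical schema: {"summary": "...", "sentiment": "positive|neutral|negative"}.
    try:
        payload = json.loads(text)
    except ValueError:
        return False
    return isinstance(payload.get("summary"), str) and payload.get("sentiment") in {
        "positive", "neutral", "negative"
    }


text = call_streaming(
    model="gpt-4o",
    region="eu-west-1",  # your egress region, not the provider's
    prompt='Summarize this ticket as JSON with keys "summary" and "sentiment".',
    validate=validate_summary,
)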

Alert thresholds that hold up

Default to multi-window burn-rate alerts, not static thresholds. A static "p99 > 3 seconds" alert pages you every time a long prompt comes in. A burn-rate alert pages you when the breach rate over a short window (say 5 minutes) is high enough to exhaust your error budget for the day, with a longer confirmation window so a single blip never pages anyone.

Thresholds I have seen hold up across teams shipping on hosted LLMs:

  • Latency. p95 over a rolling 5-minute window > 2x the trailing-7-day p95 baseline, page after sustained 10 minutes.
  • TTFT. p95 > 1.5x trailing baseline, page after 5 minutes. TTFT spikes ahead of total-latency spikes; this is your earliest signal.
  • Token throughput. p50 < 0.5x trailing baseline, page after 5 minutes. The classic "the model is generating but slowly" failure.
  • Per-region error rate. > 2% sustained over 5 minutes, broken out by region. Catches the regional skew.
  • Schema validation rate. > 5% drop in pass rate from baseline over 30 minutes (equivalently, a rise in the failure rate). Slower-burning, but the one that catches silent quality regressions.

You will tune these. The point is that the thresholds are rates of change relative to your own baseline, not absolutes. The hosted model's "normal" drifts over time as the provider re-routes and re-balances; your baseline drifts with it. Static thresholds get you paged on quiet days and silent on loud ones.
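
If your metrics backend can express baseline-relative rules directly, do it there. If not, the logic fits in a few lines of Python. A sketch, assuming you can query a per-minute rolling p95 and a trailing-7-day baseline from wherever the histograms land; the rule values mirror the list above.

from dataclasses import dataclass


@dataclass
class BaselineRule:
    name: str
    multiplier: float      # breach when value > multiplier * baseline
    sustain_minutes: int   # page only after this many consecutive breached minutes


def should_page(rule: BaselineRule, recent: list[float], baseline: float) -> bool:
    # recent: one stat per minute (e.g. the rolling-5m p95), oldest first, newest last.
    # Token throughput inverts the comparison: page when value < 0.5 * baseline.
    if baseline <= 0 or len(recent) < rule.sustain_minutes:
        return False
    tail = recent[-rule.sustain_minutes:]
    return all(value > rule.multiplier * baseline for value in tail)


latency_rule = BaselineRule("p95 latency vs 7d baseline", multiplier=2.0, sustain_minutes=10)
ttft_rule = BaselineRule("p95 TTFT vs 7d baseline", multiplier=1.5, sustain_minutes=5)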

What the March incident would have looked like on these dashboards

During the March 9-10 Azure GPT-5.2 window, customers reported HTTP 400 and HTTP 429 scoped to that model variant. On a status page that aggregates across the model family, that kind of scoped failure can stay below the threshold until enough customers complain. On a per-model error-rate dashboard, you would see a clean spike on gpt-5.2 and a flat line on every other model. Two-minute response: the model variant is degraded, route to a sibling.
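
"Route to a sibling" can be as small as a wrapper around the client. A sketch, assuming a routing table you maintain yourself (the model names are placeholders) and using the OpenAI SDK's APIStatusError to scope the fallback to the 400/429 classes the incident showed:

from openai import APIStatusError, OpenAI

client = OpenAI()
SIBLINGS = {"gpt-5.2": "gpt-5.1"}  # placeholder routing table; use your own lineup


def complete_with_fallback(model: str, messages: list[dict]) -> str:
    try:
        resp = client.chat.completions.create(model=model, messages=messages)
        return resp.choices[0].message.content
    except APIStatusError as exc:
        sibling = SIBLINGS.get(model)
        # Re-route only on the error classes scoped to one variant; anything
        # else should surface normally so the real failure stays visible.
        if sibling and exc.status_code in (400, 429):
            resp = client.chat.completions.create(model=sibling, messages=messages)
            return resp.choices[0].message.content
        raise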

The April 20 incident degraded ChatGPT projects; based on the symptom shape, that path appears to sit separately from the API. If you only watch the status page, you wait for the official banner. If you watch your TTFT and token-throughput dashboards, you see the API path stay healthy, you see the projects path degrade, and you know to fail traffic that depends on projects to a different feature path.

Neither incident required heroics. Both required not trusting the vendor's lights.

What to log on every call, beyond metrics

Metrics give you the alarm. Logs give you the postmortem. On every LLM call worth logging:

  • Request ID returned by the provider. OpenAI returns it on the response object. Always log it. Without it, the support ticket goes nowhere.
  • Routing region or PoP, if your provider exposes it.
  • Model variant fingerprint (the model name alone is not enough). Some providers return a system_fingerprint that pins the exact deployment. When this changes mid-day, your output changes, and you want to correlate it.
  • Prompt cache key or hash, if you use caching. Cache invalidation is a silent latency killer.
  • Full token usage breakdown, prompt and completion separately. Cost reconstruction depends on it.

Sample at a sane rate. Log 100% of errors and 1-5% of successes. Hash and truncate prompts before logging unless your data policy says otherwise.
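
A sketch of what that log line can look like with the OpenAI Python SDK. with_raw_response exposes the provider's x-request-id header, and usage plus system_fingerprint come back on the parsed body; the logger name and 2% sample rate are assumptions.

import hashlib
import json
import logging
import random

from openai import OpenAI

logger = logging.getLogger("llm.audit")  # assumed logger name
client = OpenAI()
SUCCESS_SAMPLE_RATE = 0.02  # errors are logged at 100% elsewhere


def logged_call(model: str, region: str, messages: list[dict]) -> str:
    raw = client.chat.completions.with_raw_response.create(model=model, messages=messages)
    completion = raw.parse()
    if random.random() < SUCCESS_SAMPLE_RATE:
        logger.info(json.dumps({
            "request_id": raw.headers.get("x-request-id"),
            "region": region,  # your egress region
            "model": completion.model,
            "system_fingerprint": completion.system_fingerprint,
            "prompt_tokens": completion.usage.prompt_tokens,
            "completion_tokens": completion.usage.completion_tokens,
            # Hash, do not log, the prompt itself.
            "prompt_hash": hashlib.sha256(messages[-1]["content"].encode()).hexdigest()[:16],
        }))
    return completion.choices[0].message.content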

The closer

Hosted LLMs are infrastructure. Infrastructure you do not own and cannot see inside should be the part of your stack you instrument hardest. The five signals above are the part you can ship today, and the part that would have paged you at 10:02 AM on April 20 instead of leaving you waiting for the 10:48 AM banner update.

If this was useful

The shape of LLM observability — what to put on a span, how to define an SLO that survives a model migration, how to wire evals to traces so quality regressions page you the same way latency regressions do — is the topic of LLM Observability Pocket Guide. It is the book I wish had existed the first time I had to explain to a CTO why our p95 was fine but our customers were furious.

LLM Observability Pocket Guide
