Gabriel Anhaia
Your LLM Gateway Is a Blind Spot. Here's How to Instrument It After the LiteLLM Incident.


In March 2026, the LiteLLM security incident put a spotlight on a layer most production LLM stacks have added without thinking about it: the gateway.

If your application calls OpenAI, Anthropic, Gemini, Bedrock, and a couple of open-weight models through a unified interface, that interface is almost certainly a gateway. LiteLLM. Portkey. OpenRouter. Kong AI Gateway. Cloudflare AI Gateway. A home-grown adapter. Whatever you call it, it is a single chokepoint that sees every prompt, every response, every API key.

It is also, for most teams, the least-instrumented service in the stack.

Why the gateway is a blind spot

Three reasons it's under-observed:

  1. It's infrastructure, not application code. Teams instrument the app. The gateway came later; nobody added a tracing pass to it.
  2. It often runs as a sidecar or a library. LiteLLM's Python SDK is imported inline. There's no separate service to add observability to. The calls blend into the application's existing spans.
  3. Its failure modes don't look like failures. A gateway that silently routes to the wrong model, or strips a header, or drops a retry budget, returns HTTP 200. The downstream call succeeds. The application never knows the gateway did something weird.

After March 2026, there is a fourth: the gateway is also an attack surface. If compromised, it sees every prompt your users send and every API key you use to call providers.

What to instrument on the gateway path

Three layers, in priority order.

Layer 1: OTel GenAI spans on every call

Every outbound LLM call through the gateway should emit a chat (or embeddings, retrieval, execute_tool) span with the full GenAI semantic-convention attribute set. Most gateways now ship this; some don't. Verify it or add it.

For LiteLLM specifically:

# gateway_instrumentation.py
import litellm
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("gateway")

def traced_completion(**kwargs):
    # OTel rejects None attribute values, so resolve the model name first.
    model = kwargs.get("model", "unknown")
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)
        try:
            resp = litellm.completion(**kwargs)
            # The model the provider actually served -- may differ from the
            # requested model when the gateway aliases or falls back.
            span.set_attribute("gen_ai.response.model", resp.model)
            span.set_attribute(
                "gen_ai.usage.input_tokens", resp.usage.prompt_tokens
            )
            span.set_attribute(
                "gen_ai.usage.output_tokens", resp.usage.completion_tokens
            )
            span.set_attribute("gen_ai.response.id", resp.id)
            return resp
        except Exception as exc:
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            span.record_exception(exc)
            raise

The attribute that matters most for the gateway layer is gen_ai.response.model — the model the provider actually served. Gateways with aliasing, model routing, or fallback logic can end up serving a different model than your code requested. Capturing both request and response model is how you detect that.
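That comparison is easy to automate. A minimal sketch of the check, assuming span attributes arrive as a dict; `detect_fallback` and the alias map are illustrative names, not part of any gateway's API:

```python
def detect_fallback(span_attrs: dict, aliases: dict[str, str]) -> bool:
    """Return True when the served model differs from the requested one.

    `aliases` maps gateway alias names to the concrete model they are
    expected to resolve to, so a routine alias resolution does not count
    as a fallback.
    """
    requested = span_attrs.get("gen_ai.request.model", "")
    served = span_attrs.get("gen_ai.response.model", "")
    expected = aliases.get(requested, requested)
    return bool(served) and served != expected
```

Run this over exported spans (or inside a span processor) and count mismatches: a nonzero rate on a quiet day means your gateway is routing somewhere you didn't ask it to.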

Layer 2: Routing decisions as span events

When the gateway decides to fall back, retry, or route differently, that's a decision. Log it as a span event on the parent span:

span.add_event(
    "gateway.fallback",
    {
        "from_provider": "anthropic",
        "to_provider": "openai",
        "reason": "anthropic 529 overloaded",
        "attempt": 2,
    },
)

Span events are cheap and searchable. When the on-call is triaging an incident and wants to know "why did this call end up on GPT instead of Claude," this event answers the question in one click.

Layer 3: Security auditability

Post-LiteLLM-incident, treat the gateway as auditable infrastructure. Minimum bar:

  • Outbound request logging (redacted). Every call through the gateway is logged with timestamp, requesting tenant, model, token count. Not the prompt itself — that's a privacy decision — but the metadata.
  • API key rotation tracking. The gateway knows which keys it holds. A key that hasn't rotated in 90 days should be visible on a dashboard.
  • Version pinning. The gateway SDK version is a supply-chain risk. Pin it. Watch its GitHub releases. Alert on the gateway's own CVE feed.
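The first bullet is mostly a schema decision: define the metadata fields once and make it structurally impossible for the prompt to leak into the audit log. A minimal sketch, assuming a JSON-lines log; `audit_record` and the field names are illustrative:

```python
import json
import logging
import time

audit_log = logging.getLogger("gateway.audit")

# The only fields the audit log is allowed to carry. No prompt, no
# completion, no key material -- metadata only.
ALLOWED_FIELDS = ("ts", "tenant", "model", "input_tokens", "output_tokens")

def audit_record(tenant: str, model: str,
                 input_tokens: int, output_tokens: int) -> dict:
    """Build the redacted audit entry for one outbound gateway call."""
    record = {
        "ts": time.time(),
        "tenant": tenant,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    # Fail loudly if anyone widens the schema without updating the allowlist.
    assert set(record) == set(ALLOWED_FIELDS)
    return record

def log_outbound(**kwargs) -> None:
    audit_log.info(json.dumps(audit_record(**kwargs)))
```

Because the record is built from named parameters rather than by copying the request object, adding the prompt to the log requires a deliberate schema change, not an accidental `**request`.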

The fallback-tier observability gap

Most gateways support multi-provider fallback. A call to claude-sonnet-4-6 falls back to gpt-5.4 when Anthropic brownouts, then to gemini-3-pro when GPT brownouts. This is good. This is also a failure mode your evals are not watching.

The fallback tiers have different tokenizers, different context limits, and different instruction-following profiles. A prompt tuned for Sonnet 4.6 produces measurably worse output on GPT-5.4 and worse again on Gemini 3 Pro. "Measurably worse" is exactly the thing you built online evals to detect.

The rule: run your online judge on the fallback tiers in steady state, not just during the incident. A tertiary tier that scores 0.55 on a Tuesday afternoon is a tertiary tier that will fail you during the outage that forces you to use it.

The instrumentation for this is an existing online-eval pipeline with one new slice: group by gen_ai.response.model (not gen_ai.request.model). The delta between what you asked for and what you got is the fallback signal.
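The slice itself is a one-liner's worth of logic. A sketch, assuming eval results come back as dicts carrying the span attribute and a judge score; the function name is illustrative:

```python
from collections import defaultdict

def judge_by_served_model(evals: list[dict]) -> dict[str, float]:
    """Mean judge score grouped by gen_ai.response.model -- the model
    that was actually served, not the one that was requested."""
    buckets: defaultdict[str, list[float]] = defaultdict(list)
    for e in evals:
        buckets[e["gen_ai.response.model"]].append(e["judge_score"])
    return {model: sum(scores) / len(scores)
            for model, scores in buckets.items()}
```

If the fallback tier's mean sits well below your primary's, you have found the quality gap before an outage forces all your traffic onto it.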

The gateway-aware circuit breaker

A quality-aware circuit breaker around the gateway trips on either of two signals: HTTP error rate or judge-score drop. Adapted from Chapter 18:

import time
from collections import deque

class GatewayCircuitBreaker:
    def __init__(
        self,
        err_threshold: float = 0.02,
        judge_threshold: float = 0.70,
        cooldown_s: int = 300,
        min_samples: int = 50,
        window: int = 500,
    ):
        self.err_threshold = err_threshold
        self.judge_threshold = judge_threshold
        self.cooldown_s = cooldown_s
        self.min_samples = min_samples
        self.state = "CLOSED"
        self.opened_at = 0.0
        # Rolling windows over the most recent calls.
        self._results: deque[bool] = deque(maxlen=window)
        self._scores: deque[float] = deque(maxlen=window)

    def record(self, ok: bool, judge_score: float | None = None):
        self._results.append(ok)
        if judge_score is not None:
            self._scores.append(judge_score)
        if self.state == "HALF_OPEN":
            # A clean probe closes the breaker; a failed probe re-opens it.
            if ok:
                self.state = "CLOSED"
            else:
                self._open()
            return
        if len(self._results) < self.min_samples:
            return  # not enough signal to trip on -- avoids tripping on noise
        err_rate = 1 - sum(self._results) / len(self._results)
        if err_rate > self.err_threshold:
            self._open()
            return
        if self._scores:
            judge_mean = sum(self._scores) / len(self._scores)
            if judge_mean < self.judge_threshold:
                self._open()

    def allow(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at > self.cooldown_s:
                self.state = "HALF_OPEN"
                return True  # let a single probe through
            return False
        return True

    def _open(self):
        self.state = "OPEN"
        self.opened_at = time.monotonic()
        self._results.clear()
        self._scores.clear()

A breaker that watches only HTTP status will happily keep serving a provider that has silently degraded. A breaker that watches judge score without a minimum sample size will trip on noise. You want both signals, and you want the breaker to trip pessimistically and recover conservatively: in the half-open state, send a single probe request, and only ramp traffic back up if it comes back clean. Do not flip straight back to full traffic.
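Wiring the breaker into the call path looks roughly like this. A sketch: `call_via_gateway` is an illustrative wrapper, and it only assumes a breaker object with `allow()` and `record()` methods:

```python
def call_via_gateway(breaker, completion_fn, judge_fn=None,
                     fallback_fn=None, **kwargs):
    """Route one call through the breaker.

    `completion_fn` is the primary gateway call, `judge_fn` optionally
    scores the response (feeding the quality signal), and `fallback_fn`
    is the degraded path used while the breaker is open.
    """
    if not breaker.allow():
        if fallback_fn is None:
            raise RuntimeError("gateway breaker open and no fallback configured")
        return fallback_fn(**kwargs)
    try:
        resp = completion_fn(**kwargs)
    except Exception:
        breaker.record(ok=False, judge_score=None)
        raise
    score = judge_fn(resp) if judge_fn else None
    breaker.record(ok=True, judge_score=score)
    return resp
```

Note that the judge score is recorded on the same code path as the HTTP outcome, so the breaker sees both signals per call without a separate pipeline.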

If this was useful

The gateway is one layer of a five-layer stack: application, gateway, provider, observability backend, eval runtime. Observability for LLM Applications covers all five. Chapter 4 has the OTel GenAI semconv that the gateway should emit. Chapter 15 covers the roll-your-own gateway + Collector path. Chapter 18 covers the incident response playbook when the gateway is the blast radius.

Observability for LLM Applications — the book
