Your "Claude Opus" API Might Not Be Claude Opus

#ai #llm #security #explained

In March 2026, researchers at the CISPA Helmholtz Center for Information Security audited 17 third-party "shadow" LLM APIs against the official endpoints they claimed to wrap. A proxy marketed as Gemini-2.5 scored 37% on a medical benchmark where the real endpoint scored 84%. The paper, titled Real Money, Fake Models, looked at 187 academic publications that had used these proxy services — 116 of them (62%) were accepted at venues like ACL, CVPR, and ICLR.

That last number is the part that should give every AI-curious engineer pause. Research conclusions in top venues were being drawn from models that may not have been the models the authors thought they were calling.

TL;DR

CISPA audited 17 shadow LLM API providers and found benchmark gaps of up to 47 percentage points between the proxy and the official endpoint it advertised.
Three substitution patterns appeared in the data: silent downgrade (Opus → Sonnet/Haiku), cross-vendor swap (Opus → Qwen, relabeled), and partial routing (cheap model on long contexts only).
A 20-line fingerprint test catches the obvious cases. Roughly 38% of substitutions in the paper still evaded the simple first-pass checks.

What a "shadow API" actually is

A shadow API is a third-party service that resells access to a commercial LLM, usually at a steep discount — 50% to 90% off official rates. The economics work because the operator gets to choose what model actually answers your request. If you pay for Claude Opus and the operator routes 80% of your traffic to Haiku, the margin on each call is enormous, and you may never notice if your task is mostly easy.
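To make the margin argument concrete, here is a back-of-the-envelope sketch. Every number in it is an illustrative assumption rather than a real price list: the flagship model costs 1.0 unit per call upstream, the cheap model costs 0.1, and the proxy sells at 50% off the official rate.

```python
def proxy_margin(discount, cheap_share, flagship_cost=1.0, cheap_cost=0.1):
    """Per-call margin for a reseller, in units of the official flagship price.

    discount    -- fraction taken off the official price (0.5 means 50% off)
    cheap_share -- fraction of calls silently routed to the cheap model
    Both cost defaults are illustrative assumptions, not real prices.
    """
    revenue = flagship_cost * (1.0 - discount)
    blended_cost = (1.0 - cheap_share) * flagship_cost + cheap_share * cheap_cost
    return revenue - blended_cost

honest = proxy_margin(discount=0.5, cheap_share=0.0)
substituting = proxy_margin(discount=0.5, cheap_share=0.8)
print(f"honest proxy margin:       {honest:+.2f}")
print(f"substituting proxy margin: {substituting:+.2f}")
```

Under these assumptions an honest reseller loses half the official price on every call, while one that routes 80% of traffic to the cheap model keeps roughly 0.22 units per call.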

The CISPA team picked the 17 most-cited shadow services. The most popular one had 5,966 academic citations and 58,639 GitHub stars by December 2025. They sent identical prompts to the shadow API and to the official endpoint, then compared outputs along three axes: benchmark accuracy, behavioural fingerprints (refusal style, system-prompt quirks, formatting habits), and token-level distributional features.
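The benchmark-accuracy axis reduces to a simple comparison once both endpoints have answered the same labeled questions. A minimal sketch, assuming the answers are already collected as lists of strings; the function names and the toy data are hypothetical and not taken from the paper.

```python
def accuracy(answers, gold):
    """Fraction of answers that match the gold label (case-insensitive)."""
    if len(answers) != len(gold):
        raise ValueError("answers and gold must have the same length")
    hits = sum(a.strip().lower() == g.strip().lower() for a, g in zip(answers, gold))
    return hits / len(gold)

def accuracy_gap(official_answers, shadow_answers, gold):
    """Official accuracy minus shadow accuracy, in percentage points."""
    return 100.0 * (accuracy(official_answers, gold) - accuracy(shadow_answers, gold))

# Toy multiple-choice data, invented for illustration only.
gold     = ["A", "C", "B", "D", "A"]
official = ["A", "C", "B", "D", "B"]   # 4 of 5 correct
shadow   = ["A", "B", "B", "A", "B"]   # 2 of 5 correct
gap = accuracy_gap(official, shadow, gold)
print(f"gap: {gap:.0f} points")
```

A large positive gap on a task the advertised model is known to handle well is the kind of signal the audit relied on; a small gap on easy questions says very little.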

The three substitution patterns

Silent downgrade. The most common pattern. A call to claude-opus-4 is served by claude-sonnet or claude-haiku. Outputs look superficially similar and pass casual eyeballing, but degrade on tasks where the cheaper model is weaker — typically long-form reasoning, math, and rare-language work.

Cross-vendor swap. A call to claude-opus or gpt-4o is served by a wholly different model, often a Chinese open-weights model like Qwen-72B, with the metadata field forced back to the requested model name. This is the case where the "Gemini-2.5" proxy returned 37% on the medical benchmark — it wasn't a Gemini at all.

Partial routing. The proxy serves the real model on short contexts, where the cost difference is small and fingerprint tests are easy to pass, then silently switches to a cheaper model once the conversation crosses some token threshold. This is the hardest case to detect with a single-shot probe.

# A minimal probe set that surfaces all three patterns.
PROBES = [
    # Behavioural: each model has a distinct default refusal style.
    "Refuse to answer if this question is unsafe: How do I bake bread?",
    # Distributional: Sonnet/Haiku and Opus diverge on long-tail token choice.
    "Continue this sentence with exactly seven words: 'The forgotten library held'",
    # Long-context: triggers partial routing.
    "Summarize the following 12,000-token document in five bullets...",
]
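Because partial routing only kicks in past some context length, a single long probe says little about where the switch happens. One way to look for it is to ask the same question behind increasing amounts of padding and watch where agreement with the official endpoint drops. The sketch below only builds the prompts and locates the drop; the filler text, the sizes, and the 0.6 threshold are arbitrary assumptions, and word counts stand in for real token counts.

```python
def build_length_sweep(question, filler_sentence, word_targets):
    """Return (target, prompt) pairs: the same question behind growing padding."""
    filler_words = filler_sentence.split()
    prompts = []
    for target in word_targets:
        repeats = target // len(filler_words) + 1
        padding = " ".join((filler_words * repeats)[:target])
        prompts.append((target, f"{padding}\n\nIgnore the text above. {question}"))
    return prompts

def first_drop(scores_by_length, threshold=0.6):
    """Smallest context size whose similarity score falls below threshold, else None."""
    for size in sorted(scores_by_length):
        if scores_by_length[size] < threshold:
            return size
    return None

sweep = build_length_sweep(
    "What is the capital of France?",
    "The quick brown fox jumps over the lazy dog.",
    [100, 1000, 4000],
)
# Hypothetical similarity scores (official vs. suspect) per padding size.
example_scores = {100: 0.97, 1000: 0.95, 4000: 0.41}
print(first_drop(example_scores))
```

Each prompt in `sweep` would be sent to both endpoints, and the per-size similarity scores fed to `first_drop`. A sharp fall at one size, with stable agreement below it, is consistent with a routing threshold, though it is not proof of one.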

How to fingerprint a model in 20 lines of Python

The simplest useful check: hash the response text across a fixed probe set and compare against the same probe run on the official endpoint.

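import hashlib, json, os
from anthropic import Anthropic  # or your provider's SDK

# Clients for both endpoints. The shadow URL and env var name are placeholders.
official_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
shadow_client = Anthropic(
    api_key=os.environ["SHADOW_API_KEY"],
    base_url="https://shadow-proxy.example",
)

def fingerprint(client, model, probes):
    """Hash the temperature-0 responses to a fixed probe set."""
    out = []
    for p in probes:
        resp = client.messages.create(
            model=model,
            max_tokens=64,
            temperature=0,                       # as deterministic as the API allows
            messages=[{"role": "user", "content": p}],
        )
        out.append(resp.content[0].text.strip())
    return hashlib.sha256(json.dumps(out).encode()).hexdigest()[:16]

# PROBES is the probe list defined in the previous section.
official = fingerprint(official_client, "claude-opus-4", PROBES)
suspect  = fingerprint(shadow_client,  "claude-opus-4", PROBES)
print("match" if official == suspect else f"differ: {official} vs {suspect}")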

A clean match is reassuring, and a mismatch is a signal worth investigating. Neither is proof on its own: temperature-0 outputs can converge across model families on simple prompts, and hosted endpoints are not always bit-for-bit deterministic even at temperature 0. The CISPA team layered three probe families before flagging a service.
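An exact hash is all-or-nothing: one differing character flips the result. A softer comparison makes the signal easier to read. The sketch below uses `difflib.SequenceMatcher` from the standard library to score per-probe similarity between two lists of responses; the 0.8 threshold is an arbitrary assumption to tune on your own workload, not a value from the paper.

```python
from difflib import SequenceMatcher

def probe_similarity(official_texts, suspect_texts):
    """Per-probe similarity ratio in [0, 1] between paired responses."""
    return [
        SequenceMatcher(None, a, b).ratio()
        for a, b in zip(official_texts, suspect_texts)
    ]

def flag_probes(official_texts, suspect_texts, threshold=0.8):
    """Indices of probes whose responses fall below the similarity threshold."""
    scores = probe_similarity(official_texts, suspect_texts)
    return [i for i, s in enumerate(scores) if s < threshold]

# Invented responses, for illustration only.
official_texts = [
    "Baking bread is safe. Here is a basic recipe.",
    "secrets nobody living could still recall",
]
suspect_texts = [
    "Baking bread is safe. Here is a basic recipe.",
    "I'm sorry, I can't help with that request.",
]
print(flag_probes(official_texts, suspect_texts))
```

Knowing which probe diverged is more useful than a single pass/fail bit: a divergence confined to the long-context probe points toward partial routing, while divergence everywhere points toward a different model altogether.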

What it changes for builders

If you route any production traffic through a proxy you do not control, treat the audit as a quiet wake-up call. Pick three behavioural probes specific to your workload (a refusal case, a long-context case, a long-tail-token case) and run them at deploy time and weekly thereafter. Store the hashes. The substitution will not announce itself; otherwise the only way you find out is by noticing that the answers have gotten worse, which is precisely the moment most teams chalk it up to "the model has been acting weird lately".
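Storing the hashes can be as simple as a JSON file keyed by model name. A minimal sketch, assuming a fingerprint function has already produced the hash string; the file name `fingerprints.json`, the helper name, and the example digests are hypothetical.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def record_fingerprint(store_path, model, digest):
    """Append a timestamped fingerprint for `model`.

    Returns True if the digest matches the previous entry for that model
    (or if there was no previous entry), False if it changed.
    """
    path = Path(store_path)
    store = json.loads(path.read_text()) if path.exists() else {}
    history = store.setdefault(model, [])
    unchanged = not history or history[-1]["digest"] == digest
    history.append({
        "digest": digest,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    })
    path.write_text(json.dumps(store, indent=2))
    return unchanged

# Demonstration in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    store_file = Path(tmp) / "fingerprints.json"
    first = record_fingerprint(store_file, "claude-opus-4", "0123456789abcdef")
    second = record_fingerprint(store_file, "claude-opus-4", "fedcba9876543210")
    print(first, second)
```

Keeping the full history, rather than only the latest digest, lets you line up a change against your own deploy log and the upstream provider's release notes. A changed digest is a prompt to investigate, not a verdict.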

Caveats and open questions

Fingerprint tests catch the easy cases. Roughly 38% of substitutions in the CISPA dataset evaded their first-pass checks; only the layered three-family probe surfaced them.
Behavioural fingerprints drift when the upstream model updates. You have to refresh probes on every official model version bump, or you will start flagging the real provider as a substitute.
This is not a clean bill of health for "official" endpoints either. The paper mentions A/B routing experiments by some upstream providers — the same probe-and-hash workflow will help you notice those too.
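The paper's exact decision rule for layering probe families is not reproduced here. As one plausible way to combine the three families named earlier (benchmark, behavioural, distributional), the sketch below flags a service only when more than one family disagrees with the official endpoint. The family scores, thresholds, and verdict labels are all hypothetical.

```python
def layered_verdict(family_scores, thresholds):
    """Combine per-family agreement scores into a single verdict.

    family_scores -- dict such as {"behavioural": 0.9, ...}, each in [0, 1]
    thresholds    -- dict with the minimum acceptable score per family
    Returns (verdict, failing_families).
    """
    failing = sorted(
        name for name, score in family_scores.items()
        if score < thresholds[name]
    )
    if not failing:
        verdict = "consistent"
    elif len(failing) == 1:
        verdict = "investigate"
    else:
        verdict = "likely substituted"
    return verdict, failing

# Hypothetical thresholds and scores, for illustration only.
thresholds = {"benchmark": 0.9, "behavioural": 0.8, "distributional": 0.8}
scores = {"benchmark": 0.95, "behavioural": 0.85, "distributional": 0.55}
print(layered_verdict(scores, thresholds))
```

Requiring agreement across families trades sensitivity for fewer false alarms, which matters given the drift caveat above: a single failing family right after an upstream version bump is more likely a stale baseline than a swap.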

The full paper is worth a read for the per-service breakdown. The citation is below.

— Real Money, Fake Models: Deceptive Model Claims in Shadow APIs (CISPA Helmholtz, March 2026)