varun pratap Bhardwaj

Originally published at qualixar.com

The Pass^k Wall: One Failure Mode Behind AI's Quietly Disastrous Week

Last week was loud for AI. Five separate stories ran the front page of every tech outlet, every newsletter, every Slack channel that takes itself seriously.

Anthropic admitted three quality regressions its own evaluation suite missed. GitHub announced the end of flat-rate Copilot. Uber's CTO publicly conceded the company had burned through its entire 2026 AI budget in four months. Cyera disclosed CVE-2026-7482 — a critical heap leak affecting 300,000 Ollama deployments. And Princeton's HAL leaderboard paused new model additions to launch a Reliability Dashboard.

Most readers saw five separate stories: AI is too unpredictable. AI is too expensive. AI is too vulnerable. AI evaluation is moving on. AI tooling is fragmenting.

That's the wrong read. There is one story here, written five different ways. Every headline above documents the same engineering failure: the industry knows how to measure capability under fresh inputs and has no idea how to measure reliability under accumulated state.

This is the gap AI Reliability Engineering exists to close. Let me walk through the evidence.


Signal 1 — Anthropic missed regressions in its own product

On April 23, Anthropic published a postmortem on three quality regressions in Claude Code: a March 4 change to default reasoning effort; a March 26 caching bug that wiped multi-turn thinking state on every turn instead of once per idle session; and an April 16 verbosity-reduction system prompt that survived multiple weeks of internal testing without a single flagged regression.

When ablations finally ran across a broader evaluation set, the verbosity prompt showed a 3% drop on Opus 4.6 and 4.7.

Anthropic's eval shop is the most sophisticated in the industry. They have an unfair amount of compute, an unfair amount of internal user data, and an unfair number of researchers per regression. They missed three issues for weeks. AMD's Stella Laurenzo published an audit of 6,852 sessions and 234,000 tool calls before Anthropic confirmed anything was wrong.

If their evaluations missed it, your evaluations are missing more.

Signal 2 — GitHub's flat-rate Copilot model broke under agentic load

On April 27, GitHub announced that Copilot moves to usage-based billing on June 1. PRUs (Premium Request Units) become "GitHub AI Credits" priced against actual token consumption.

The internal driver is more interesting than the announcement. Microsoft's leaked planning documents show the weekly cost of running Copilot has doubled since the start of the year. New sign-ups for Copilot Pro and Pro+ were temporarily paused April 20-22 because agentic workflows — long-running, parallelized, tool-using sessions — were consuming far more compute than the original plan structure was built to support.

Code completion is bounded. An agent reasoning across a multi-step trajectory is not. The flat-rate pricing model assumed bounded usage. Production reality blew the assumption apart.

Signal 3 — Uber burned its entire 2026 AI budget in four months

Uber's CTO Praveen Neppalli Naga gave a candid interview admitting the company exhausted its annual AI spending allocation in the first third of the year. Adoption of Claude Code went from 32% of engineers in February to 84% in March. Per-engineer spend reached $500–$2,000 per month against a list-price tier that advertises $20.

The pattern is not unique. Visa consumed 1.9 trillion tokens in March, double its February run-rate. JPMorgan, Disney, and others have rolled out internal AI adoption dashboards with leaderboards and gamification. The phrase "tokenmaxxing" is now in the industry vocabulary, and engineering culture is rewarding token consumption as a productivity proxy.

This is the cloud-billing-shock pattern from 2010 repeating with one new variable: nobody can predict the consumption curve because nobody is measuring spend-per-task. They are measuring monthly aggregates and getting blindsided every quarter.

Signal 4 — Bleeding Llama exposed 300k Ollama servers

On April 28, Cyera Research disclosed CVE-2026-7482 — a critical (CVSS 9.1-9.3) heap leak in Ollama affecting roughly 300,000 publicly exposed deployments. The exploit chain takes three unauthenticated API calls. First, send a crafted GGUF file with a tensor offset larger than the file itself, so the loader reads past the end of the file into heap memory. Second, request F16-to-F32 quantization; the conversion is lossless, so the leaked heap bytes survive intact. Third, push the resulting model, now containing readable heap memory, to an attacker-controlled registry via /api/push.

Output: API keys, system prompts, environment variables, and concurrent users' conversation data, exfiltrated cleanly.

The engineering takeaway is not "patch Ollama." It is that local LLM deployments now have an enterprise-grade threat surface, and the assumption of "we run it on-prem so it's safe" was always a category error. Defense by obscurity ages worse for AI infrastructure than it did for any prior generation of internet-facing software.

Signal 5 — Princeton paused its leaderboard

The Holistic Agent Leaderboard at Princeton is the most-cited agent benchmark in academic literature. Its Reliability Dashboard launch marked a public pivot: HAL paused adding new models to focus on a multi-dimensional reliability view — consistency, robustness, safety, self-awareness — beyond raw accuracy.

The metric anchoring this pivot is pass^k, introduced in the original τ-bench paper: the probability an agent succeeds on the same task across k independent trials. Sierra Research's published experiments show gpt-4o-class function-calling agents below 50% pass@1 on retail customer-service tasks. Pass^8 falls below 25%.

Translation: even the strongest generalist agents on the strongest benchmarks complete the same task across 8 trials less than a quarter of the time.

That is not an evaluation footnote. That is the reason your production agent's "97% accuracy" feels nothing like 97% to your support team.
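
For the skeptical: pass^k is easy to compute from your own trial logs. Below is a minimal sketch of an unbiased estimator (the pass@k combinatorics inverted to require that all k sampled trials succeed); the per-task numbers are invented for illustration, not τ-bench data. One nuance worth internalizing: if every task had a flat 50% per-trial success rate, pass^8 would be 0.5^8, roughly 0.4%. A measured ~25% means tasks are heterogeneous: agents ace some tasks every time and fail others every time.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased pass^k estimate for one task: the probability that
    k trials drawn without replacement from n recorded trials,
    c of which succeeded, are all successes."""
    return comb(c, k) / comb(n, k)

# Invented per-task results: (trials run, trials passed).
results = [(10, 10), (10, 9), (10, 5), (10, 0)]

k = 8
score = sum(pass_hat_k(n, c, k) for n, c in results) / len(results)
print(f"pass^{k} = {score:.3f}")  # 0.300 for these numbers
```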


The unifying gap — capability versus reliability under state

The five stories above look like five different problems. They share one root.

Capability is the property frontier labs optimize. Given a fresh input, a clean context window, and a well-formed prompt, can the model produce the right output? Pass@1 measures capability. Every leaderboard score we have ever celebrated measures capability.

Reliability under state is the property production breaks on. Given accumulated context — earlier tool outputs that may have been wrong, retrieved snippets that may have been stale, intermediate decisions that may have been suboptimal — does the agent still produce the right output? Across 8 trials with statistically equivalent inputs, does it produce the right output 80% of the time, or 25%?

Anthropic's regressions were reliability regressions, not capability regressions. The model could still answer the same benchmark questions correctly. It performed worse over the course of long sessions, where state accumulated and compounded. Anthropic's evaluation suite tested capability. The world tested reliability. The two diverged for weeks before anyone noticed.

GitHub's Copilot bill explosion is a reliability problem. An agent that reliably converges in 200 tokens costs a tenth of what one that wanders for 2,000 does. Frontier models' capability per token has improved more slowly than their reliability has degraded over long trajectories.

Uber's budget burn is the same problem with a finance department. When per-task spend is unpredictable because trajectories are non-deterministic, monthly forecasting breaks.

Bleeding Llama is a reliability failure on a different surface — the state of the Ollama process becomes the attack surface, because /api/create accepts inputs that mutate process memory in ways the original threat model never anticipated.

Princeton's HAL pivot is the formal admission. The most credible agent-evaluation institution in academia has effectively said: we have been measuring the wrong thing. Pass@1 was a useful metric for a few years. It is no longer the metric the field needs. pass^k is.


The engineering term — stateful trajectory decay

Once you see the pattern, the engineering term names itself.

Stateful trajectory decay: the failure mode where an agent's correctness degrades along its execution trajectory because internal state — context, intermediate results, tool outputs, retrieved facts — mutates without verification. No persistent reliable substrate grounds it. No behavioral contract asserts the properties you care about must continue to hold. No statistical gate fires when distributional drift exceeds tolerance.
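
A toy simulation makes the decay concrete. The numbers below are illustrative assumptions, not measurements: each step has a small chance of silently corrupting state, and a corrupted state halves the odds of every later step landing. Watch end-to-end success fall off a cliff as trajectories lengthen.

```python
import random

def run_trajectory(steps: int, p_corrupt: float = 0.05,
                   p_step_ok: float = 0.98) -> bool:
    """Toy model of stateful trajectory decay: one unverified state
    corruption degrades every subsequent step in the trajectory."""
    corrupted = False
    for _ in range(steps):
        if not corrupted and random.random() < p_corrupt:
            corrupted = True  # bad tool output, stale retrieval, etc.
        if random.random() >= p_step_ok * (0.5 if corrupted else 1.0):
            return False  # the step failed; the trajectory is wrong
    return True

N = 10_000
for steps in (5, 10, 20, 40):
    ok = sum(run_trajectory(steps) for _ in range(N)) / N
    print(f"{steps:>2} steps -> {ok:.1%} end-to-end success")
```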

The metaphor that fits is structural fatigue. A bridge does not fail because the load exceeded its instantaneous capacity. It fails because micro-fractures accumulated under repeated loading until a fracture became a fault. Capability is instantaneous strength. Reliability is fatigue resistance. We have been engineering AI agents for instantaneous strength.

pass^k is the fatigue test. Pass@1 is the static load test. You can ship a bridge that holds today's traffic. You cannot ship one that holds traffic 50,000 times across the next decade unless you measure differently.


Three things to run on Monday

Reading the failure mode is half the job. Naming what to do about it is the other half. The actions below are not abstract: each comes with a minimal sketch you can adapt and run.

1. Run pass^k against your top 3 agent tasks before next deploy

Pick the three most production-critical agent tasks you ship. For each, pick k = 8 (the τ-bench standard). Generate 8 independent trials with statistically equivalent inputs. Score them.

The deployment gate: succeed on at least 7 of the 8 trials. Not a perfect 8/8; at most one failure across the 8.

If you can't hit that bar, you don't have a production system. You have a demo.

You will hate this number the first time you run it. That is the point.
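
Here is a minimal sketch of the gate, assuming you put your own agent entry point and task-specific grader behind run_agent; the task names and the simulated 90% success rate are placeholders.

```python
import random
import sys

# Stand-in: wire run_agent to your real agent entry point and
# task-specific grader. Simulated here with a 90% per-trial success.
def run_agent(task: str) -> bool:
    return random.random() < 0.90

TASKS = ["refund_flow", "ticket_triage", "order_lookup"]  # your top 3
K, MIN_SUCCESSES = 8, 7  # the 7-of-8 gate: at most one failure

exit_code = 0
for task in TASKS:
    successes = sum(run_agent(task) for _ in range(K))
    verdict = "PASS" if successes >= MIN_SUCCESSES else "FAIL"
    print(f"{task}: {successes}/{K} -> {verdict}")
    if successes < MIN_SUCCESSES:
        exit_code = 1  # block the deploy

sys.exit(exit_code)
```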

2. Instrument spend-per-task as a first-class metric

Every team measures latency. Almost no team measures spend-per-task with the same operational rigor.

Add a per-trajectory token counter to your observability stack. Set a hard budget per task class — for example, customer_support_resolution_max_tokens = 50,000. Reject (or alert on) trajectories that exceed it. Track the median, p95, and p99 spend across trajectories per task class, every day. When p99 starts walking up, your agent is wandering — which is also a signal that something earlier in the trajectory is breaking.
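
A minimal sketch of that instrumentation, assuming your LLM client already reports per-call token counts; record_trajectory, alert, and the fake trajectory numbers are illustrative stand-ins for your own hooks and traffic.

```python
from collections import defaultdict
from statistics import median, quantiles

MAX_TOKENS = {"customer_support_resolution": 50_000}  # hard budget per class
spend = defaultdict(list)  # task class -> tokens per finished trajectory

def alert(msg: str) -> None:
    print("ALERT:", msg)  # stand-in for your paging hook

def record_trajectory(task_class: str, tokens_used: int) -> None:
    """Call once per finished trajectory with its total token count
    (prompt + completion, summed over every model call it made)."""
    budget = MAX_TOKENS.get(task_class)
    if budget is not None and tokens_used > budget:
        alert(f"{task_class}: {tokens_used} tokens > {budget} budget")
    spend[task_class].append(tokens_used)

def daily_report() -> None:
    for task_class, samples in spend.items():
        cuts = quantiles(samples, n=100)  # 99 percentile cut points
        print(f"{task_class}: median={median(samples):.0f} "
              f"p95={cuts[94]:.0f} p99={cuts[98]:.0f}")

for tokens in (1_200, 3_400, 8_000, 51_000):  # fake trajectories
    record_trajectory("customer_support_resolution", tokens)
daily_report()
```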

This is the lesson Uber learned in production. Spend-per-task is the canary. Latency is the bird that already died.

3. Inject one failure mode into staging before launch

Pick one of:

  • a corrupted tool output (return malformed JSON from a tool the agent depends on)
  • a 5× latency spike on a downstream service
  • a stale retrieval (return a result from 30 days ago when the agent expects fresh)

Inject it in your staging agent loop. Run the trajectory. Observe what happens.

If the agent does not have a recovery path — a circuit breaker, a re-query, a graceful degradation — your system is not resilient. It is a happy-path demo with the staging environment doing the work the recovery logic was supposed to do.
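
A sketch of the first option, assuming your staging agent reaches tools through functions you can wrap; lookup_order and the 20% corruption rate are invented for illustration.

```python
import json
import random

def with_fault_injection(tool_fn, corruption_rate: float = 0.2):
    """Wrap a tool the agent depends on: in staging, randomly swap
    its output for malformed JSON and watch whether the agent
    recovers (re-query, circuit breaker) or quietly derails."""
    def wrapped(*args, **kwargs):
        if random.random() < corruption_rate:
            return '{"status": "ok", "data": [unterminated'  # broken on purpose
        return tool_fn(*args, **kwargs)
    return wrapped

# Stand-in tool; swap in a real one from your staging agent loop.
def lookup_order(order_id: str) -> str:
    return json.dumps({"status": "ok", "order_id": order_id, "eta_days": 2})

lookup_order = with_fault_injection(lookup_order)
for i in range(5):
    print(lookup_order(f"order-{i}"))
```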

This is the chaos engineering discipline applied to AI. Netflix's Chaos Monkey was called paranoid for the first three years and prescient for the next twenty. The same calendar applies here.


One engineered answer — and why we built it

The three Monday actions above are platform-agnostic. You can run them with any tool stack. They will tell you, ruthlessly, where your reliability gaps live.

Closing the gaps requires something the open guardrails frameworks do not have: a runtime contract system with formal mathematical backing. Guardrails AI, NeMo Guardrails, AWS Bedrock Guardrails, AgentCore Policy — they catch per-message violations. None of them measures session-level distributional drift. None of them gives you a single deployment-readiness score with statistical bounds underneath. None of them composes safely across multi-agent pipelines.

We built AgentAssert — the Agent Behavioral Contract framework — because that gap was not closing on its own. Six pillars in one library: a YAML contract DSL with 14 operators, hard/soft constraint separation with graduated recovery, Jensen-Shannon Divergence drift detection, (p, δ, k)-satisfaction as a three-parameter compliance contract, compositional safety bounds for multi-agent pipelines, and Ornstein-Uhlenbeck stability dynamics with a Lyapunov convergence proof.
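
The drift pillar, at least, is plain information theory. Below is a from-scratch Jensen-Shannon divergence over two behavioral histograms (say, tool-call frequencies for a baseline week versus today). This illustrates the concept, not AgentAssert's internals; the histograms are made up.

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence, log base 2, so bounded in [0, 1]."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Invented tool-call histograms: baseline week vs. today.
baseline = np.array([400, 300, 250, 50])   # search, lookup, update, escalate
today    = np.array([380, 150, 150, 320])  # escalations ballooned

print(f"JSD = {js_divergence(baseline, today):.3f}")  # gate on a tolerance
```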

The reason (p, δ, k) has three parameters and not one threshold is that every real compliance contract at scale has three knobs hiding behind it: how often does compliance hold (p), how far can soft drift go (δ), and how fast must recovery happen (k). Reduce it to a single number and you throw away two thirds of what regulators actually want to know.
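
To make the three knobs concrete, here is one plausible reading of a (p, δ, k) check over a session trace. This is my illustrative interpretation of the semantics described above, not AgentAssert's actual API.

```python
def satisfies_pdk(trace, p: float, delta: float, k: int) -> bool:
    """One reading of (p, δ, k)-satisfaction over a session trace of
    (hard_ok: bool, soft_drift: float) steps:
      p     -- minimum fraction of steps where hard constraints hold
      delta -- ceiling on soft-constraint drift
      k     -- max consecutive steps allowed above that ceiling
    """
    hard_ok = sum(ok for ok, _ in trace) / len(trace) >= p
    excursion = 0  # consecutive steps spent above the soft ceiling
    for _, drift in trace:
        excursion = excursion + 1 if drift > delta else 0
        if excursion > k:
            return False  # drifted too far for too long: no recovery
    return hard_ok

trace = [(True, 0.02), (True, 0.09), (False, 0.12), (True, 0.03)]
print(satisfies_pdk(trace, p=0.75, delta=0.10, k=1))  # -> True
```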

The output is a single number — the Reliability Index Θ — bounded in [0, 1], with a deployment threshold of Θ ≥ 0.90. None of GPT-5.3, Claude Sonnet 4.6, or Mistral-Large-3 cleared it on the retail-shopping benchmark in the paper. The number is a deployment-readiness signal, not a safety guarantee — but it is the first such signal that combines compliance, drift, recovery, and statistical certification under one bound.

Released on PyPI as agentassert-abc. AGPL-3.0 with a commercial license for production use. The math is in the paper. The runtime ships the math.

If you have read this far and the failure mode is recognizable, evaluate it. If you have a better answer, ship it and tell me. Either way, stop measuring capability and pretending it is reliability.


"Shallow men believe in luck or in circumstance. Strong men believe in cause and effect."
— Ralph Waldo Emerson

The five stories last week were not bad luck. They were cause and effect. The cause was an industry that measured the wrong thing for a few years too long. The effect is a reliability wall that frontier capability alone will not climb.

What's the worst pass^k collapse you have seen in production? Reply or send me a note — I will feature the most instructive case in Issue #3 of the AI Reliability Engineering newsletter.

Newsletter Issue #2 ships at 7 PM IST tonight.

🌐 varunpratap.com · 🐦 @varunPbhardwaj · 🔗 qualixar.com · ⭐ github.com/qualixar · 📺 @myhonestdiary
