Last month we decided to take advantage of Anthropic’s 67% price reduction on Opus 4.6 and its new 1M context window.
On paper, the move looked straightforward: lower marginal cost, stronger reasoning on long-horizon causal tasks, same 5/25 token pricing as the previous generation. A clear win for any data science team watching burn rate.
The complication appeared the moment we reran our established evaluation suite.
Even with temperature fixed at 0 and identical system prompts, the new model shifted our primary causal risk difference estimates by 0.12 to 0.19 percentage points across key subgroups. More concerning, bootstrap confidence interval widths increased by 23% on protected attribute cohorts. What felt like “better intelligence” was silently altering the statistical conclusions we had been using for fairness audits and product decisions.
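The drift check that caught this is simple in principle: recompute the per-unit risk differences under both model versions and compare the point estimates and bootstrap interval widths. A minimal sketch of that comparison, with illustrative data and a plain percentile bootstrap (not our production estimator):

```python
import random

def bootstrap_ci(samples, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-unit risk differences."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        sum(rng.choice(samples) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Per-unit risk differences under the old and new model (illustrative numbers).
old = [0.10, 0.12, 0.08, 0.11, 0.09, 0.13, 0.10, 0.12]
new = [0.13, 0.15, 0.11, 0.14, 0.12, 0.16, 0.13, 0.15]

old_lo, old_hi = bootstrap_ci(old)
new_lo, new_hi = bootstrap_ci(new)

# The two quantities we alert on: point-estimate shift and CI width change.
point_shift = sum(new) / len(new) - sum(old) / len(old)
width_change = (new_hi - new_lo) / (old_hi - old_lo) - 1.0
print(f"point shift: {point_shift:+.3f}")
print(f"CI width change: {width_change:+.1%}")
```

Running this per subgroup and thresholding both numbers is what turns "the new model feels different" into a concrete re-baseline decision.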
Then came the April 6–7 Claude outages. The failure mode was not clean refusals or 503s. We received partial JSON responses that passed initial schema validation but contained truncated reasoning traces. Because longer treatment prompts (average 4.2k input tokens) were more likely to degrade, our missing-at-random assumption broke. This introduced a measurable selection bias that inflated apparent treatment lift by roughly 8.7% in the affected cohorts before detection.
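The lesson was to validate completeness, not just shape. A hedged sketch of the kind of check that would have caught the truncated traces (the field names and thresholds are illustrative, not our production schema):

```python
import json

REQUIRED_KEYS = {"answer", "reasoning"}

def is_complete(raw: str, min_reasoning_chars: int = 50) -> bool:
    """Reject responses that parse as valid JSON but look truncated.

    Plain schema validation passed the degraded outage responses;
    checks on the reasoning trace itself would have flagged them.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not REQUIRED_KEYS <= obj.keys():
        return False
    reasoning = obj.get("reasoning", "")
    # Truncated traces were suspiciously short and ended mid-sentence.
    if len(reasoning) < min_reasoning_chars:
        return False
    if not reasoning.rstrip().endswith((".", "!", "?")):
        return False
    return True

ok = is_complete(
    '{"answer": "B", "reasoning": '
    '"Treatment group shows higher lift because exposure precedes outcome."}'
)
bad = is_complete('{"answer": "B", "reasoning": "Treatment group shows hi"}')
```

Rejected responses then go into the retry queue rather than the dataset, which is what keeps the missingness mechanism from correlating with prompt length.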
Replaying 3,412 logged requests against a fallback provider and correcting the downstream datasets took the better part of two days. The config change to switch models? Under an hour. The statistical cleanup? Still not fully closed three weeks later.
Around the same time, GLM-5.1 dropped on April 7 under the MIT license: a 744B MoE model showing strong performance on long-horizon agentic tasks. We routed a subset of our autonomous causal discovery workflows through it for a week. Completion rate rose from 64% to 71% at ~40% lower token cost, but human review time per task jumped from 11 to 19 minutes due to subtle factual drift in intermediate steps.
The pattern is clear: every time the provider landscape shifts, whether through pricing, deprecation (Claude 3 Haiku retiring mid-April), or new model releases like Meta's Muse Spark on April 8, the surface-level change is trivial. The second-order effects on eval baselines, alerting thresholds, attribution chains, and fairness metrics are not.
After repeating this cycle too many times, we introduced a thin, consistent abstraction layer in front of all LLM calls. We now define stable model groups with explicit fallback rules and output normalization. One provider can change behaviour or disappear without forcing the entire statistical pipeline to re-baseline.
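The abstraction itself is thin. A minimal sketch of the model-group idea, where the provider names and the `call_provider` function are illustrative stand-ins, not any particular gateway's API:

```python
from dataclasses import dataclass

@dataclass
class ModelGroup:
    """A stable name the pipeline targets; providers behind it can churn."""
    name: str
    providers: list  # ordered: first is primary, the rest are fallbacks

def route(group: ModelGroup, prompt: str, call_provider) -> dict:
    """Try providers in order, normalizing output so downstream code
    never sees provider-specific response shapes."""
    last_err = None
    for provider in group.providers:
        try:
            raw = call_provider(provider, prompt)
            return {"text": raw["text"], "provider": provider}
        except Exception as e:  # sketch-level handling, not production
            last_err = e
    raise RuntimeError(f"all providers failed for {group.name}") from last_err

# Pipeline code only ever references the group name, never a provider.
causal_eval = ModelGroup("causal-eval", ["opus-4.6", "glm-5.1", "fallback-local"])

def fake_call(provider, prompt):
    """Stub simulating an outage on the primary provider."""
    if provider == "opus-4.6":
        raise TimeoutError("outage")
    return {"text": f"{provider}: ok"}

result = route(causal_eval, "estimate lift", fake_call)
```

Because eval baselines are keyed to the group name rather than a provider, swapping or dropping a provider is a config edit, not a statistical event.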
(We use this one: https://github.com/maximhq/bifrost, though LiteLLM and Portkey also handle similar routing needs.)
The infrastructure change itself was five minutes. The downstream reconciliation of old baselines took most of a week, but the next provider addition or outage will not require repeating the exercise.
In causal and fairness-sensitive work, model quality is only one variable. Stability of the experimental surface matters more. Treating the LLM layer as just another interchangeable dependency is an expensive illusion. A small, well-governed routing layer turns chaotic provider churn into something closer to boring, predictable infrastructure.
We should have done this way earlier.