<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Claire Dubois</title>
    <description>The latest articles on DEV Community by Claire Dubois (@claire_dubois).</description>
    <link>https://dev.to/claire_dubois</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3864020%2Fbdb7b70a-fc3f-4c16-9911-856d000ab374.jpg</url>
      <title>DEV Community: Claire Dubois</title>
      <link>https://dev.to/claire_dubois</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/claire_dubois"/>
    <language>en</language>
    <item>
      <title>Why Model Upgrades in Causal Inference Pipelines Are Never Just a Pricing Decision</title>
      <dc:creator>Claire Dubois</dc:creator>
      <pubDate>Tue, 14 Apr 2026 13:00:28 +0000</pubDate>
      <link>https://dev.to/claire_dubois/why-model-upgrades-in-causal-inference-pipelines-are-never-just-a-pricing-decision-2bn1</link>
      <guid>https://dev.to/claire_dubois/why-model-upgrades-in-causal-inference-pipelines-are-never-just-a-pricing-decision-2bn1</guid>
      <description>&lt;p&gt;&lt;strong&gt;Last month we decided to take advantage of Anthropic’s 67% price reduction on Opus 4.6 and its new 1M context window.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;On paper, the move looked straightforward: lower marginal cost, stronger reasoning on long-horizon causal tasks, same 5/25 token pricing as the previous generation. A clear win for any data science team watching burn rate.&lt;/p&gt;

&lt;p&gt;The complication appeared the moment we reran our established evaluation suite.&lt;/p&gt;

&lt;p&gt;Even with temperature fixed at 0 and identical system prompts, the new model shifted our primary causal risk difference estimates by &lt;strong&gt;0.12 to 0.19 percentage points&lt;/strong&gt; across key subgroups. More concerning, bootstrap confidence interval widths increased by &lt;strong&gt;23%&lt;/strong&gt; on protected attribute cohorts. What felt like “better intelligence” was silently altering the statistical conclusions we had been using for fairness audits and product decisions.&lt;/p&gt;

&lt;p&gt;Then came the April 6–7 Claude outages. The failure mode was not clean refusals or 503s. We received partial JSON responses that passed initial schema validation but contained truncated reasoning traces. Because longer treatment prompts (average &lt;strong&gt;4.2k&lt;/strong&gt; input tokens) were more likely to degrade, our missing-at-random assumption broke. This introduced a measurable selection bias that inflated apparent treatment lift by roughly &lt;strong&gt;8.7%&lt;/strong&gt; in the affected cohorts before detection.&lt;/p&gt;

&lt;p&gt;Replaying &lt;strong&gt;3,412&lt;/strong&gt; logged requests against a fallback provider and correcting the downstream datasets took the better part of two days. The config change to switch models? Under an hour. The statistical cleanup? Still not fully closed three weeks later.&lt;/p&gt;

&lt;p&gt;Around the same time, &lt;strong&gt;GLM-5.1&lt;/strong&gt; dropped on April 7 under an MIT license: a 744B MoE model showing strong performance on long-horizon agentic tasks. We routed a subset of our autonomous causal discovery workflows through it for a week. Completion rate rose from &lt;strong&gt;64%&lt;/strong&gt; to &lt;strong&gt;71%&lt;/strong&gt; at ~&lt;strong&gt;40%&lt;/strong&gt; lower token cost, but human review time per task jumped from &lt;strong&gt;11&lt;/strong&gt; to &lt;strong&gt;19&lt;/strong&gt; minutes due to subtle factual drifts in intermediate steps.&lt;/p&gt;

&lt;p&gt;The pattern is clear: every time the provider landscape shifts, whether through pricing, deprecation (Claude 3 Haiku retiring mid-April), or new model releases like Meta’s Muse Spark on April 8, the surface-level change is trivial. The second-order effects on eval baselines, alerting thresholds, attribution chains, and fairness metrics are not.&lt;/p&gt;

&lt;p&gt;After repeating this cycle too many times, we introduced a thin, consistent abstraction layer in front of all LLM calls. We now define stable model groups with explicit fallback rules and output normalization. One provider can change behaviour or disappear without forcing the entire statistical pipeline to re-baseline.&lt;/p&gt;

&lt;p&gt;(We use this one: &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;, though LiteLLM and Portkey also handle similar routing needs.)&lt;/p&gt;
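
&lt;p&gt;Stripped of any particular tool, the shape of the idea is small. Below is a rough, hand-rolled sketch; the group definition, the client interface, and every name in it are invented for illustration and are not the configuration syntax of Bifrost, LiteLLM, or Portkey.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only: a hand-rolled "stable model group" with fallback
# and output normalization. All names and the client interface are hypothetical.

MODEL_GROUPS = {
    "causal-reasoning": {
        "models": ["primary-large-model", "fallback-open-weights-model"],
        "params": {"temperature": 0, "max_tokens": 4096},
    },
}

def normalize(raw: dict, model: str) -&gt; dict:
    # Pin every provider to one output schema so downstream statistical code
    # never sees provider-specific response shapes.
    return {"model": model, "text": raw.get("text", ""), "finish": raw.get("finish_reason")}

def call_with_fallback(client, group: str, prompt: str) -&gt; dict:
    cfg = MODEL_GROUPS[group]
    for model in cfg["models"]:
        try:
            raw = client.complete(model=model, prompt=prompt, **cfg["params"])
            if not raw.get("text"):
                # Partial or truncated payloads count as failures, not data.
                raise ValueError("empty or truncated response")
            return normalize(raw, model)
        except (TimeoutError, ValueError):
            continue  # fall through to the next model in the group
    raise RuntimeError(f"all models in group '{group}' failed")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;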

&lt;p&gt;The infrastructure change itself was five minutes. The downstream reconciliation of old baselines took most of a week, but the next provider addition or outage will not require repeating the exercise.&lt;/p&gt;

&lt;p&gt;In causal and fairness-sensitive work, model quality is only one variable. Stability of the experimental surface matters more. Treating the LLM layer as just another interchangeable dependency is an expensive illusion. A small, well-governed routing layer turns chaotic provider churn into something closer to boring, predictable infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We should have done this way earlier.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>ai</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Causal inference for credit risk: why prediction alone isn't enough</title>
      <dc:creator>Claire Dubois</dc:creator>
      <pubDate>Tue, 07 Apr 2026 13:03:23 +0000</pubDate>
      <link>https://dev.to/claire_dubois/causal-inference-for-credit-risk-why-prediction-alone-isnt-enough-1j3e</link>
      <guid>https://dev.to/claire_dubois/causal-inference-for-credit-risk-why-prediction-alone-isnt-enough-1j3e</guid>
      <description>&lt;p&gt;There's a pattern I've seen repeatedly in financial ML: a model achieves excellent predictive performance — AUC above 0.80, stable on holdout — and the team ships it. Then, six months later, someone asks "but why is the model denying more applicants from this postal code?" and nobody has a good answer.&lt;/p&gt;

&lt;p&gt;Prediction and causation are different things, and conflating them is expensive in credit risk specifically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core issue
&lt;/h2&gt;

&lt;p&gt;When you train a credit risk model, you're typically predicting P(default | features). This is a conditional probability — it tells you what tends to be true about people who look like this applicant. It doesn't tell you what &lt;em&gt;caused&lt;/em&gt; their credit behavior, and it doesn't tell you what &lt;em&gt;will happen&lt;/em&gt; if you lend to them.&lt;/p&gt;

&lt;p&gt;This distinction matters for two reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, selection bias.&lt;/strong&gt; Your training data only contains outcomes for people who were previously approved for credit. The people who were denied — perhaps by a prior model or manual policy — have no observed outcome. Your model is learning from a censored dataset, and it will systematically underestimate creditworthiness for groups that historical policies excluded. This is a causal problem masquerading as a data problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, feature confounding.&lt;/strong&gt; A feature that predicts default might do so because it's a proxy for the thing that actually causes default, not because of any direct relationship. If you act on that feature — use it to set rates or deny applications — you can create feedback loops that make the proxy worse over time.&lt;/p&gt;
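
&lt;p&gt;The first point is easy to see in a toy simulation. Everything below is made up; the only feature that matters is that outcomes are never observed for the applicants the old policy denied.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# One latent score drives both true default risk and the old approval policy.
score = rng.normal(size=n)
true_default_prob = 1 / (1 + np.exp(2 - 1.5 * score))  # risk rises with score
defaulted = rng.random(n) &amp;lt; true_default_prob
approved = score &amp;lt; 0.0  # the old policy denies the riskier half

df = pd.DataFrame({"score": score, "approved": approved, "defaulted": defaulted})

# Outcomes are only observed for approved applicants, so the labeled training
# data is a censored and systematically safer slice of the population.
print("population default rate:       ", round(df["defaulted"].mean(), 3))
print("default rate in training data: ",
      round(df.loc[df["approved"], "defaulted"].mean(), 3))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;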

&lt;h2&gt;
  
  
  A small worked example
&lt;/h2&gt;

&lt;p&gt;Suppose you're building a model and you notice that applicants with shorter employment tenure have higher default rates. You might add employment tenure as a feature. But is tenure &lt;em&gt;causing&lt;/em&gt; default? Or is it correlated with income stability, which is correlated with the actual causal factors?&lt;/p&gt;

&lt;p&gt;In causal notation using a simple DAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Income Stability → Employment Tenure
Income Stability → Default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this is the true structure, tenure is not a cause of default at all: it is a downstream proxy for income stability, the variable doing the causal work. Adding it might actually hurt generalization if the tenure-income-stability relationship changes across different applicant populations (which it does: gig economy workers, self-employed people, recent graduates).&lt;/p&gt;

&lt;p&gt;You can sketch this in Python using the &lt;code&gt;pgmpy&lt;/code&gt; library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pgmpy.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BayesianNetwork&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pgmpy.factors.discrete&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TabularCPD&lt;/span&gt;

&lt;span class="c1"&gt;# Define the causal structure
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BayesianNetwork&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IncomeStability&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EmploymentTenure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IncomeStability&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# This structure tells you: controlling for IncomeStability,
# EmploymentTenure is independent of Default.
# If you can't observe IncomeStability directly, tenure is a noisy proxy —
# useful, but not causal.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
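
&lt;p&gt;If you want the library to confirm that reading, you can ask it to enumerate the conditional independencies the structure implies. A minimal, self-contained version, assuming a reasonably recent pgmpy release:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pgmpy.models import BayesianNetwork

model = BayesianNetwork([
    ('IncomeStability', 'EmploymentTenure'),
    ('IncomeStability', 'Default'),
])

# Enumerate the conditional independencies implied by the assumed structure.
# Expected: EmploymentTenure is independent of Default given IncomeStability.
print(model.get_independencies())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;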



&lt;p&gt;This is toy-level, but the point is that before you add a feature, it's worth asking: what does the causal graph look like here? Where does this variable sit in it?&lt;/p&gt;

&lt;h2&gt;
  
  
  Counterfactual thinking for policy decisions
&lt;/h2&gt;

&lt;p&gt;The more practically important application of causal thinking in credit is when you're setting policy rather than just predicting outcomes.&lt;/p&gt;

&lt;p&gt;Suppose your model predicts that applicant A has a 12% probability of default. Should you approve them? At what interest rate? The answer depends not just on the predicted probability but on what &lt;em&gt;would happen&lt;/em&gt; if you approved them vs. denied them — a counterfactual question.&lt;/p&gt;

&lt;p&gt;Difference-in-differences can help here if you have policy variation. For instance, if your institution ran a pilot that approved a random subset of borderline applicants in some segments, you can compare how default rates moved in the pilot segments against how they moved everywhere else:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Assume df has columns: pilot (bool), approved (bool), defaulted (bool)
# pilot == True means the applicant was in the random approval pilot
&lt;/span&gt;
&lt;span class="n"&gt;pilot_approved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pilot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;defaulted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;control_approved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pilot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;defaulted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# The naive comparison is biased — control group approved via selection model
# DiD removes the selection bias if pilot assignment was truly random
&lt;/span&gt;
&lt;span class="c1"&gt;# More carefully:
&lt;/span&gt;&lt;span class="n"&gt;pilot_group&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pilot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;defaulted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;control_group&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pilot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;defaulted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# This is the actual causal effect of approval on default rate
&lt;/span&gt;&lt;span class="n"&gt;did_estimate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pilot_group&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;control_group&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Causal effect of approval on default rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;did_estimate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is only valid under the parallel trends assumption and proper randomization, but the point is that this kind of analysis tells you something a predictive model can't: what the policy &lt;em&gt;does&lt;/em&gt;, not just what it predicts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fairness is a causal question
&lt;/h2&gt;

&lt;p&gt;I want to say this plainly because I see it get glossed over: fairness in credit models is not a matter of removing demographic variables from your feature set.&lt;/p&gt;

&lt;p&gt;If the causal structure includes a path from race or gender to default that runs &lt;em&gt;through&lt;/em&gt; structural inequity (lower access to credit history, discrimination in employment, etc.), then your model is going to pick up that relationship through proxies — zip code, credit utilization patterns, employment tenure — regardless of whether you included race explicitly.&lt;/p&gt;

&lt;p&gt;Equalized odds and demographic parity are useful measures, but they're metrics on your model's output, not fixes for the underlying structural problem. You can satisfy equalized odds and still be making decisions that replicate the effects of discrimination via features correlated with protected attributes.&lt;/p&gt;

&lt;p&gt;What actually helps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drawing out the causal graph for your features and understanding which paths you want to block&lt;/li&gt;
&lt;li&gt;Orthogonalization: regressing out protected attribute variation from proxy features before training (double ML approach; a sketch follows this list)&lt;/li&gt;
&lt;li&gt;Testing calibration &lt;em&gt;across groups&lt;/em&gt;, not just aggregate calibration
&lt;/li&gt;
&lt;/ul&gt;
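
&lt;p&gt;The second item is the one that needs the most unpacking, so here is a deliberately simplified sketch: residualize a proxy feature on the protected attribute and train on the residual. The column names (&lt;code&gt;employment_tenure&lt;/code&gt;, &lt;code&gt;protected_group&lt;/code&gt;) are placeholders, and a real double ML setup would use cross-fitting and flexible nuisance models rather than a single linear regression.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from sklearn.linear_model import LinearRegression

# Strip the variation in a proxy feature that is linearly explained by the
# protected attribute, and keep only the residual for model training.
X_group = pd.get_dummies(df['protected_group'], drop_first=True).astype(float)
proxy = df['employment_tenure'].astype(float)

reg = LinearRegression().fit(X_group, proxy)
df['employment_tenure_resid'] = proxy - reg.predict(X_group)

# By construction, the residual is linearly uncorrelated with group membership.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The third item takes only a few lines with scikit-learn:&lt;/p&gt;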

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.calibration&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calibration_curve&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;groups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;protected_group&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;groups&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;subset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;protected_group&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;fraction_pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calibration_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;defaulted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;predicted_prob&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;n_bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fraction_pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Group: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;k--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Perfect calibration&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mean predicted probability&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Fraction positive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Calibration by group&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tight_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A model that's miscalibrated for one group — even if aggregate calibration looks fine — is making systematically wrong decisions for that group.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd actually recommend
&lt;/h2&gt;

&lt;p&gt;This is not "you must learn do-calculus before you're allowed to build models." That's not realistic or helpful. But a few concrete things make a material difference:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sketch the causal graph for your most important features.&lt;/strong&gt; Even informally. Ask: is this a cause, a proxy, or a consequence of the thing I'm trying to predict?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check whether your training data is censored&lt;/strong&gt; and in which direction. If prior decisions affect who appears in your training set, your model is not learning from a representative sample.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Separate your prediction task from your policy decision.&lt;/strong&gt; The model gives you a probability. The policy is what you do with it. Causal thinking belongs in the policy layer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Calibrate across subgroups, always.&lt;/strong&gt; Aggregate calibration is necessary but not sufficient.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Prediction is a tool. It's a useful one. But in high-stakes decisions — credit, healthcare, hiring — prediction without causal reasoning is how you build systems that perform well on metrics and cause harm in the world.&lt;/p&gt;

&lt;p&gt;Anyway. That's the argument I make internally at least twice a month, so I figured I'd write it down properly.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
      <category>python</category>
    </item>
  </channel>
</rss>
