There's a pattern I've seen repeatedly in financial ML: a model achieves excellent predictive performance — AUC above 0.80, stable on holdout — and the team ships it. Then, six months later, someone asks "but why is the model denying more applicants from this postal code?" and nobody has a good answer.
Prediction and causation are different things, and conflating them is especially expensive in credit risk.
## The core issue
When you train a credit risk model, you're typically predicting P(default | features). This is a conditional probability — it tells you what tends to be true about people who look like this applicant. It doesn't tell you what caused their credit behavior, and it doesn't tell you what will happen if you lend to them.
This distinction matters for two reasons.
First, selection bias. Your training data only contains outcomes for people who were previously approved for credit. The people who were denied — perhaps by a prior model or manual policy — have no observed outcome. Your model is learning from a censored dataset, and it will systematically underestimate creditworthiness for groups that historical policies excluded. This is a causal problem masquerading as a data problem.
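A toy simulation makes the censoring visible. Everything here is invented (the score distribution, the cutoff, the link to default), but it shows the mechanism: the default rate in your training data is not the default rate in your applicant population.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: a latent credit score drives both the historical
# approval decision and the true default probability.
n = 100_000
score = rng.normal(0, 1, n)
p_default = 1 / (1 + np.exp(2 * score))   # lower score, higher risk
defaulted = rng.random(n) < p_default

# Historical policy: only applicants above a score cutoff were approved,
# so outcomes are observed only for them.
approved = score > 0.0

print(f"True population default rate:  {defaulted.mean():.3f}")
print(f"Default rate in training data: {defaulted[approved].mean():.3f}")
```

The training-set default rate comes out far below the population rate, and a model fit on the approved slice inherits that distortion for exactly the applicants the old policy screened out.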
Second, feature confounding. A feature that predicts default might do so because it's a proxy for the thing that actually causes default, not because of any direct relationship. If you act on that feature — use it to set rates or deny applications — you can create feedback loops that make the proxy worse over time.
## A small worked example
Suppose you're building a model and you notice that applicants with shorter employment tenure have higher default rates. You might add employment tenure as a feature. But is tenure causing default? Or is it correlated with income stability, which is correlated with the actual causal factors?
In causal notation, as a simple DAG:

```
IncomeStability → EmploymentTenure
IncomeStability → Default
```
If this is the true structure, tenure is not a cause of default at all: it's a descendant of the common cause, a proxy for income stability. Adding it might actually hurt generalization if the tenure-income-stability relationship changes across different applicant populations (which it does: gig economy workers, self-employed people, recent graduates).
You can sketch this in Python using the pgmpy library:
```python
from pgmpy.models import BayesianNetwork

# Define the causal structure
model = BayesianNetwork([
    ('IncomeStability', 'EmploymentTenure'),
    ('IncomeStability', 'Default'),
])

# The structure encodes: controlling for IncomeStability,
# EmploymentTenure is independent of Default.
print(model.get_independencies())

# If you can't observe IncomeStability directly, tenure is a noisy proxy —
# useful, but not causal.
```
This is toy-level, but the point is that before you add a feature, it's worth asking: what does the causal graph look like here? Where does this variable sit in it?
## Counterfactual thinking for policy decisions
The more practically important application of causal thinking in credit is when you're setting policy rather than just predicting outcomes.
Suppose your model predicts that applicant A has a 12% probability of default. Should you approve them? At what interest rate? The answer depends not just on the predicted probability but on what would happen if you approved them vs. denied them — a counterfactual question.
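The approval decision itself is a policy calculation layered on top of the prediction. A minimal sketch, where the loan amount, interest rate, and loss-given-default are invented placeholders rather than real pricing inputs:

```python
def approval_decision(p_default: float,
                      loan_amount: float = 10_000.0,
                      interest_rate: float = 0.12,
                      loss_given_default: float = 0.6) -> bool:
    """Approve if expected profit is positive under these (invented) economics."""
    expected_gain = (1 - p_default) * loan_amount * interest_rate
    expected_loss = p_default * loan_amount * loss_given_default
    return expected_gain - expected_loss > 0

# Under these numbers, a 12% predicted PD is still a profitable approval,
# while a 20% PD is not:
print(approval_decision(0.12))  # True
print(approval_decision(0.20))  # False
```

Note what this doesn't capture: how the applicant would behave under a different rate, or how today's approvals reshape tomorrow's applicant pool. Those are exactly the counterfactual questions the predicted probability alone can't answer.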
Difference-in-differences can help here if you have policy variation. A truly random pilot is the gold standard: approving a random subset of borderline applicants lets you directly observe default rates for people the normal policy would have denied. But when you only have non-random variation, say one region loosening its approval cutoff while a comparable region kept the old one, DiD can still recover the policy's effect:

```python
import pandas as pd

# Assume df has one row per approved loan, with columns (names illustrative):
#   pilot (bool)     - originated in the region that loosened its cutoff
#   post (bool)      - originated after the policy change
#   defaulted (bool)

# Naive comparison: biased, because the two regions differ in ways
# unrelated to the policy.
naive = df[df['pilot']]['defaulted'].mean() - df[~df['pilot']]['defaulted'].mean()

# Difference-in-differences: compare each region's *change* over time,
# which nets out fixed differences between the regions.
rates = df.groupby(['pilot', 'post'])['defaulted'].mean()
did_estimate = ((rates.loc[(True, True)] - rates.loc[(True, False)])
                - (rates.loc[(False, True)] - rates.loc[(False, False)]))
print(f"DiD estimate of the policy's effect on default rate: {did_estimate:.3f}")
```
This is only valid under the parallel trends assumption, that the two groups' default rates would have moved in parallel absent the policy change. But the point stands: this kind of analysis tells you something a predictive model can't: what the policy does, not just what it predicts.
## Fairness is a causal question
I want to say this plainly because I see it get glossed over: fairness in credit models is not a matter of removing demographic variables from your feature set.
If the causal structure includes a path from race or gender to default that runs through structural inequity (lower access to credit history, discrimination in employment, etc.), then your model is going to pick up that relationship through proxies — zip code, credit utilization patterns, employment tenure — regardless of whether you included race explicitly.
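A toy simulation (structure and coefficients invented) makes the point concrete: drop the protected attribute entirely, train only on a proxy for it, and the model's scores still differ sharply by group.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000

# Invented structure: group membership shifts a proxy feature (think of a
# zip-code risk index), and the proxy carries the signal into the model.
group = rng.integers(0, 2, n)                   # protected attribute
proxy = rng.normal(0, 1, n) + 1.0 * group       # proxy correlated with group
p_default = 1 / (1 + np.exp(-(0.8 * proxy - 1.0)))
defaulted = (rng.random(n) < p_default).astype(int)

# Train WITHOUT the protected attribute: the proxy is the only feature.
clf = LogisticRegression().fit(proxy.reshape(-1, 1), defaulted)
scores = clf.predict_proba(proxy.reshape(-1, 1))[:, 1]

print(f"Mean predicted PD, group 0: {scores[group == 0].mean():.3f}")
print(f"Mean predicted PD, group 1: {scores[group == 1].mean():.3f}")
```

The protected attribute never enters the model, yet the group gap in predicted PD is large, because the proxy transmits it.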
Equalized odds and demographic parity are useful measures, but they're metrics on your model's output, not fixes for the underlying structural problem. You can satisfy equalized odds and still be making decisions that replicate the effects of discrimination via features correlated with protected attributes.
What actually helps:
- Drawing out the causal graph for your features and understanding which paths you want to block
- Orthogonalization: regressing out protected attribute variation from proxy features before training (double ML approach)
- Testing calibration across groups, not just aggregate calibration
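The orthogonalization bullet can be sketched as residualizing a proxy feature on the protected attribute before training. This is a deliberately simplified version: real double ML uses cross-fitting and flexible nuisance models rather than one linear regression, and all the variable names here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def residualize(feature: np.ndarray, protected: np.ndarray) -> np.ndarray:
    """Return the part of `feature` not linearly explained by `protected`."""
    protected = protected.reshape(-1, 1)
    fitted = LinearRegression().fit(protected, feature).predict(protected)
    return feature - fitted

# Toy check: a proxy strongly correlated with group membership
rng = np.random.default_rng(2)
group = rng.integers(0, 2, 10_000).astype(float)
proxy = 2.0 * group + rng.normal(0, 1, 10_000)

cleaned = residualize(proxy, group)
print(f"corr(proxy, group) before: {np.corrcoef(proxy, group)[0, 1]:.2f}")
print(f"corr(proxy, group) after:  {np.corrcoef(cleaned, group)[0, 1]:.2f}")
```

The residuals are orthogonal to the regressor by construction; the real judgment call is which causal paths you want to block, which takes you back to the first bullet.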
For the last bullet, checking calibration by group takes a few lines:

```python
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Assume df has columns: protected_group, defaulted, predicted_prob
fig, ax = plt.subplots()
for group in df['protected_group'].unique():
    subset = df[df['protected_group'] == group]
    fraction_pos, mean_pred = calibration_curve(
        subset['defaulted'],
        subset['predicted_prob'],
        n_bins=10,
    )
    ax.plot(mean_pred, fraction_pos, label=f'Group: {group}')

ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
ax.legend()
ax.set_xlabel('Mean predicted probability')
ax.set_ylabel('Fraction positive')
ax.set_title('Calibration by group')
plt.tight_layout()
plt.show()
```
A model that's miscalibrated for one group — even if aggregate calibration looks fine — is making systematically wrong decisions for that group.
## What I'd actually recommend
This is not "you must learn do-calculus before you're allowed to build models." That's not realistic or helpful. But a few concrete things make a material difference:
Sketch the causal graph for your most important features. Even informally. Ask: is this a cause, a proxy, or a consequence of the thing I'm trying to predict?
Check whether your training data is censored and in which direction. If prior decisions affect who appears in your training set, your model is not learning from a representative sample.
Separate your prediction task from your policy decision. The model gives you a probability. The policy is what you do with it. Causal thinking belongs in the policy layer.
Calibrate across subgroups, always. Aggregate calibration is necessary but not sufficient.
Prediction is a tool. It's a useful one. But in high-stakes decisions — credit, healthcare, hiring — prediction without causal reasoning is how you build systems that perform well on metrics and cause harm in the world.
Anyway. That's the argument I make internally at least twice a month, so I figured I'd write it down properly.