The engineering behind three fintech machine-learning projects: behavioural default risk, affordability-based risk, and Open Banking transaction categorisation, with the decisions that actually matter.
Most "predict loan default" tutorials make the same three mistakes: they report accuracy on an imbalanced target, they leak future information into the features, and they stop at a probability instead of a decision. This write-up is about avoiding all three, across three projects on real lending data that together build toward an affordability-first view of credit risk.
A companion narrative piece covers why this matters. This one is about how: the data wrangling, the feature engineering, the experiment design, and the production logic. All the code is real, lifted from the repositories linked at the end.
The throughline
Three projects, three datasets, one argument:
- Behavioural default risk on 30,000 real credit-card customers: the traditional credit-history approach, done with the right metrics and a cost-based decision.
- Affordability-based risk on 1.35 million real Lending Club loans: a controlled test of whether affordability out-predicts a credit score.
- Transaction categorisation on 259,000 real bank transactions: the Open Banking data layer that turns a raw bank feed into the income and spending signals affordability needs.
Lesson 1: under imbalance, accuracy is a trap
Roughly one borrower in five defaults in these datasets. A model that predicts "everyone repays" scores about 80% accuracy and is worthless. So across the two risk projects the metrics are PR-AUC (average precision), the KS statistic, and recall at a chosen operating point, never raw accuracy. The categorisation project follows the same principle with macro-F1.
The precision-recall curve is the meaningful picture because the no-skill baseline is not 0.5, it is the prevalence:
from sklearn.metrics import average_precision_score, roc_auc_score

# PR-AUC floor is the positive rate (~0.20), not 0.5
print(f"PR-AUC {average_precision_score(y_test, probs):.3f} "
      f"(no-skill = {y_test.mean():.3f})")
For the credit-card model this is the difference between a believable ROC-AUC of 0.78 / PR-AUC 0.56 and a meaningless "97% accurate" headline.
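The KS statistic mentioned above is easy to state but rarely shown. The following is a minimal, dependency-free sketch, not the code from the repositories: it measures the largest gap between the cumulative score distributions of defaulters and repayers. The function name `ks_statistic` and the toy labels and scores are assumptions made up for illustration.

```python
def ks_statistic(y_true, scores):
    """Largest gap between the cumulative score distributions of the two classes."""
    pairs = sorted(zip(scores, y_true))
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    seen_pos = seen_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        # consume every row tied at this score before measuring the gap
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                seen_pos += 1
            else:
                seen_neg += 1
            j += 1
        best = max(best, abs(seen_pos / n_pos - seen_neg / n_neg))
        i = j
    return best


# Hypothetical toy data: 2 defaulters among 10 borrowers (20% prevalence)
y_true = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.15, 0.80, 0.25, 0.08]

# "Everyone repays" is right 80% of the time and separates nothing
all_repay_accuracy = sum(1 for y in y_true if y == 0) / len(y_true)
ks = ks_statistic(y_true, scores)
print(f"all-repay accuracy = {all_repay_accuracy:.2f}, KS = {ks:.2f}")
```

A KS of 0 means the two score distributions are indistinguishable and 1 means they do not overlap at all, so unlike accuracy it is unaffected by how rare defaults are.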
Lesson 2: a probability is not a decision
A risk score becomes useful only when you decide where to cut. That cut is a business choice, because the two errors cost different amounts: a missed default loses the loan, while a wrongly declined good customer only loses the margin. So instead of defaulting to 0.5, sweep the threshold to minimise expected cost.
import numpy as np

def optimal_threshold(probs, fn_to_fp_ratio):
    # y_test is the held-out label array from the surrounding notebook
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = [
        fn_to_fp_ratio * ((probs < t) & (y_test == 1)).sum()  # missed defaults
        + ((probs >= t) & (y_test == 0)).sum()                # good customers declined
        for t in thresholds
    ]
    return thresholds[int(np.argmin(costs))]
The threshold moves with the cost ratio, and so does the business outcome: at a 10:1 ratio the credit-card model catches 92% of defaulters; at 2:1 it approves far more applicants and catches half. There is no single correct cutoff without a cost view, and the valuable artefact is the cost-versus-threshold curve, not the raw probability.
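To make the "threshold moves with the cost ratio" point concrete, here is a small pure-Python sketch of the same sweep with the labels passed in explicitly. The helper names (`expected_cost`, `best_threshold`) and the ten toy applicants are hypothetical; the real projects use the NumPy version on the full test set.

```python
def expected_cost(y_true, probs, threshold, fn_to_fp_ratio):
    """Cost in units of one wrongly declined good customer."""
    missed = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < threshold)
    declined_good = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
    return fn_to_fp_ratio * missed + declined_good


def best_threshold(y_true, probs, fn_to_fp_ratio):
    """Lowest-cost threshold on a 0.01..0.99 grid (first one wins on ties)."""
    grid = [round(0.01 * k, 2) for k in range(1, 100)]
    return min(grid, key=lambda t: expected_cost(y_true, probs, t, fn_to_fp_ratio))


# Hypothetical scores for ten applicants, three of whom default
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.25, 0.55, 0.60, 0.90]

t_costly_miss = best_threshold(y_true, probs, fn_to_fp_ratio=10)
t_cheap_miss = best_threshold(y_true, probs, fn_to_fp_ratio=2)
print(t_costly_miss, t_cheap_miss)
```

On this toy data the 10:1 ratio pulls the cutoff down far enough to catch every defaulter at the price of declining three good customers, while the 2:1 ratio accepts one missed default in exchange for declining nobody good.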
Lesson 3: leakage is how credit models cheat
This is the big one, and it is the whole reason the affordability project's numbers are trustworthy. The Lending Club file has 151 columns, and many of them are recorded only after the loan has been issued: total payments received, recoveries, the latest FICO pull. Train on those and you get a spectacular AUC that collapses in production, because at decision time they do not exist.
So the rule is strict: keep only what a lender knows at origination. Every post-loan field is dropped, and so are Lending Club's own grade and int_rate, because those already encode its internal risk model and would short-circuit the experiment.
import pandas as pd

BAD = {"Charged Off", "Default", "Does not meet the credit policy. Status:Charged Off"}
GOOD = {"Fully Paid", "Does not meet the credit policy. Status:Fully Paid"}

# the raw file is 1.6 GB, so stream it in chunks and keep only finished loans
parts = []
for chunk in pd.read_csv(path, usecols=ORIGINATION_COLS,
                         compression="gzip", chunksize=300_000):
    done = chunk[chunk.loan_status.isin(BAD | GOOD)].copy()
    done["default"] = done.loan_status.isin(BAD).astype("int8")
    parts.append(engineer(done))  # engineer(): per-chunk feature step, defined elsewhere in the repo
df = pd.concat(parts, ignore_index=True)  # ~1.35M completed loans
Two things worth noting: usecols plus chunksize keeps a 1.6 GB file inside a few hundred MB of RAM, and filtering inside the loop means you never materialise the rows you do not need.
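The origination-only rule is easier to keep if it is enforced in code rather than by memory. The following is a sketch of one way to do that, not something taken from the repositories: a guard that refuses any feature list containing a known post-loan or lender-model column. The set names and the function `assert_origination_only` are hypothetical, and the column names are assumed to match the public Lending Club file.

```python
# Assumed column names; extend these sets for the file you actually have.
POST_ORIGINATION = {"total_pymnt", "recoveries", "last_fico_range_high"}
ENCODES_LENDER_MODEL = {"grade", "int_rate"}


def assert_origination_only(columns):
    """Raise if any column would not be known (or allowed) at decision time."""
    banned = POST_ORIGINATION | ENCODES_LENDER_MODEL
    leaked = sorted(set(columns) & banned)
    if leaked:
        raise ValueError(f"leaky columns in feature set: {leaked}")
    return list(columns)


safe = assert_origination_only(["annual_inc", "dti", "loan_amnt"])

try:
    assert_origination_only(["annual_inc", "recoveries", "grade"])
    leak_message = None
except ValueError as err:
    leak_message = str(err)
print(leak_message)
```

Calling a guard like this on the `usecols` list before reading the file means a leaky column fails loudly at load time instead of silently inflating the AUC.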
Lesson 4: engineer affordability properly
Affordability is a ratio of obligations to income, so the features have to express that, and they have to be correct for joint applications (where two people share the liability and one person's income understates capacity).
joint = df.application_type.eq("Joint App")
df["income"] = np.where(joint & df.annual_inc_joint.notna(),
                        df.annual_inc_joint, df.annual_inc)
df["dti_eff"] = np.where(joint & df.dti_joint.notna(),
                         df.dti_joint, df.dti)
# rows with zero income produce inf or NaN in the two ratios below
df["payment_to_income"] = df.installment * 12 / df["income"]  # annual burden
df["loan_to_income"] = df.loan_amnt / df["income"]
These three (payment-to-income, debt-to-income, loan-to-income) each produce a clean monotonic default gradient on their own, before any model touches them.
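A "clean monotonic default gradient" can be checked in a few lines before any model is trained. This is a minimal sketch under assumptions: the helper names are hypothetical, the twelve rows are invented, and it expects at least as many rows as buckets. It sorts by the feature, cuts the rows into equal-sized buckets, and reports the default rate in each.

```python
def default_rate_by_bucket(values, defaults, n_buckets=4):
    """Default rate in each equal-sized bucket, from lowest to highest feature value."""
    rows = sorted(zip(values, defaults))
    size = len(rows) / n_buckets
    rates = []
    for b in range(n_buckets):
        chunk = rows[round(b * size): round((b + 1) * size)]
        rates.append(sum(d for _, d in chunk) / len(chunk))
    return rates


def is_monotonic_increasing(rates):
    return all(a <= b for a, b in zip(rates, rates[1:]))


# Hypothetical payment-to-income ratios and default flags for 12 loans
payment_to_income = [0.02, 0.03, 0.04, 0.06, 0.07, 0.08,
                     0.10, 0.11, 0.12, 0.15, 0.18, 0.20]
defaulted = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

rates = default_rate_by_bucket(payment_to_income, defaulted)
print(rates, is_monotonic_increasing(rates))
```

If the bucketed rates rise steadily, the feature carries signal on its own; a flat or jagged profile is a prompt to revisit how the ratio is defined.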
Lesson 5: design the experiment so the result means something
The headline claim, that affordability out-predicts a credit score, is only credible if the comparison is controlled. So I held the algorithm fixed and changed only the feature set: credit-and-bureau features, affordability features, then both. Any difference in performance is then attributable to the features, not to model tuning.
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

AFFORD = ["income", "dti_eff", "payment_to_income", "loan_to_income",
          "installment", "loan_amnt", "term_months", "emp_length_num"]
CREDIT = ["fico", "delinq_2yrs", "inq_last_6mths", "revol_util",
          "open_acc", "pub_rec", "total_acc", "credit_history_yrs"]

def evaluate(features):
    # same algorithm and hyperparameters every time; only the feature list changes
    # spw is the negative/positive ratio of the training labels
    model = XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05,
                          scale_pos_weight=spw, tree_method="hist", n_jobs=4)
    model.fit(X_train[features], y_train)
    p = model.predict_proba(X_test[features])[:, 1]
    return roc_auc_score(y_test, p), average_precision_score(y_test, p)

evaluate(CREDIT)           # ROC-AUC 0.617
evaluate(AFFORD)           # ROC-AUC 0.699  <- affordability alone wins
evaluate(AFFORD + CREDIT)  # ROC-AUC 0.706
tree_method="hist" is what makes XGBoost comfortable on 1.35 million rows, and scale_pos_weight set to the negative/positive ratio handles the imbalance without resampling. The result is clear: affordability features alone (ROC-AUC 0.699) beat the bureau features (0.617), and the two combined (0.706) beat either alone, though adding the bureau features on top of affordability lifts ROC-AUC by only 0.007.
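The weight passed as scale_pos_weight is a single number, and it should come from the training labels only. The snippet below is a small sketch of that calculation in plain Python; the function name and the ten toy labels are assumptions, not code from the project.

```python
def scale_pos_weight(labels):
    """Ratio of negatives to positives, as used to reweight the rare class."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in the training labels")
    return negatives / positives


# Hypothetical training labels at roughly the 20% default rate described above
y_train_toy = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
spw_toy = scale_pos_weight(y_train_toy)
print(spw_toy)
```

At a one-in-five default rate the ratio is 4, meaning each defaulter counts as much as four repayers in the loss.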
Lesson 6: cleaning real transaction text
The categorisation project is a different kind of engineering. Real bank descriptions are hostile: Earnin PAYMENT Donatas Danyal, transfers buried in authorisation codes, dates and reference numbers everywhere. None of that noise carries category signal, so it gets stripped before vectorising.
import re

def clean(s):
    s = re.sub(r"\d+", " ", str(s).lower())  # auth codes, dates, amounts
    s = re.sub(r"[^a-z\s]", " ", s)          # punctuation
    s = re.sub(r"\b\w{1,2}\b", " ", s)       # 1-2 char fragments
    return re.sub(r"\s+", " ", s).strip()

# 'Earnin PAYMENT Donatas Danyal' -> 'earnin payment donatas danyal'
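The example in the comment contains no digits or punctuation, so it only exercises the lower-casing. As an illustration of the other rules, here is the same function applied to two noisier descriptions. Both input strings are invented for this sketch, not taken from the dataset.

```python
import re


def clean(s):
    s = re.sub(r"\d+", " ", str(s).lower())  # auth codes, dates, amounts
    s = re.sub(r"[^a-z\s]", " ", s)          # punctuation
    s = re.sub(r"\b\w{1,2}\b", " ", s)       # 1-2 char fragments
    return re.sub(r"\s+", " ", s).strip()


# Hypothetical raw descriptions with the kinds of noise described above
card_purchase = clean("POS 1234 STARBUCKS #0042 NY")
transfer = clean("TFR 12/03/2024 REF:AB99X to J. Smith")
print(card_purchase)
print(transfer)
```

Note the trade-off in the last rule: dropping one- and two-character tokens removes reference fragments such as "ab" and "x", but it also removes short real words such as "to" and the state code "ny".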
Then TF-IDF over unigrams and bigrams, with transaction amount stitched on as a numeric feature, because payroll and loan amounts are large and coffees are small:
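from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=3,
                        max_features=30_000, sublinear_tf=True)
X_text = tfidf.fit_transform(df.clean_desc)
# scaled_log_amount must be a column of shape (n_samples, 1) to stack beside the text matrix
X = hstack([X_text, csr_matrix(scaled_log_amount)])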
A class-weighted linear SVM on this reaches 0.80 macro-F1 across 31 categories (baseline 0.01), and crucially it is fully inspectable: reading the top-weighted terms per class shows it learned that mcdonald means restaurants and uber/lyft in incoming payments mean gig income. Macro-F1, not accuracy, again, because the categories are heavily imbalanced.
The confusions are reassuring rather than alarming: the various transfer types blur into each other because their text genuinely overlaps, which is label ambiguity, not model failure.
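Why macro-F1 rather than accuracy is easiest to see on a degenerate classifier. The sketch below computes macro-F1 from scratch in plain Python for a model that always predicts the majority category. The category names and counts are invented for illustration, and the function is a teaching version rather than the scoring code used in the project.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced labels: one dominant category, two rare ones
y_true = ["groceries"] * 8 + ["payroll", "rent"]
y_pred = ["groceries"] * 10  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}, macro-F1 = {score:.2f}")
```

The majority-class predictor looks respectable on accuracy and poor on macro-F1, because the two categories it never predicts each contribute an F1 of zero to the average.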
Lesson 7: turn confidence into an operating policy
A model that labels everything is a research artefact. A model that knows when to defer is a system. For the linear SVM, the gap between the top two class scores is a usable confidence signal:
margins = clf.decision_function(X_test)        # (n_samples, n_classes)
top2 = np.partition(margins, -2, axis=1)
confidence = top2[:, -1] - top2[:, -2]         # best minus runner-up
order = np.argsort(-confidence)                # most confident first
correct = np.asarray(preds == y_test)[order]   # positional indexing, even if y_test is a pandas Series
# cumulative accuracy as coverage grows -> the auto-classify / review trade-off
cum_accuracy = np.cumsum(correct) / np.arange(1, len(correct) + 1)
Auto-classifying the most-confident 80% holds 98% accuracy, sending only a fifth to human review. That single curve is what turns the categoriser into something a lending operation could actually deploy.
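Turning that curve into a routing rule means picking the largest share of transactions that can be auto-classified while still meeting an accuracy target. The following is a dependency-free sketch of that step. The function names, the confidence values and the correctness flags are all hypothetical.

```python
def coverage_accuracy_curve(confidence, correct):
    """(coverage, accuracy) pairs as predictions are accepted most-confident first."""
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])
    curve, hits = [], 0
    for rank, i in enumerate(order, start=1):
        hits += 1 if correct[i] else 0
        curve.append((rank / len(order), hits / rank))
    return curve


def max_coverage_at(curve, target_accuracy):
    """Largest coverage whose cumulative accuracy still meets the target."""
    eligible = [cov for cov, acc in curve if acc >= target_accuracy]
    return max(eligible) if eligible else 0.0


# Hypothetical margins and whether each prediction was right
confidence = [2.1, 1.8, 1.5, 1.2, 0.9, 0.7, 0.4, 0.3, 0.2, 0.1]
correct = [True, True, True, True, True, True, True, False, True, False]

curve = coverage_accuracy_curve(confidence, correct)
auto_share = max_coverage_at(curve, target_accuracy=0.95)
print(f"auto-classify {auto_share:.0%}, review the rest")
```

In production the chosen coverage translates into a margin cutoff: anything above it is labelled automatically and anything below goes to a reviewer.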
Lesson 8: explainability is non-negotiable
Regulated lending has to justify decisions, so every model here is explained with SHAP, at both the portfolio level and the individual decision. The single-applicant view is the one a compliance team cares about:
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)
# waterfall for one applicant -> exactly what an adverse-action notice needs
On the credit-card model, this produces a legible story for a single 97%-risk applicant: behind in every month, two months behind most recently, repaying almost nothing. Not a black box, a defensible decision.
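Once per-feature contributions exist for an applicant, turning them into a short list of reasons is a sorting problem. This is a sketch of that last step only, assuming the contributions have already been computed (for example from SHAP values). The function `top_reasons`, the feature names and the numbers are all invented for illustration.

```python
def top_reasons(contributions, k=3):
    """Return the k features pushing risk up the most, largest first."""
    risk_raising = [(name, value) for name, value in contributions.items() if value > 0]
    risk_raising.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in risk_raising[:k]]


# Hypothetical contributions for one high-risk applicant (positive = raises risk)
applicant = {
    "pay_status_latest": 0.9,
    "months_behind": 0.6,
    "repayment_ratio": 0.4,
    "credit_limit": -0.2,
    "age": 0.05,
}

reasons = top_reasons(applicant)
print(reasons)
```

Features that lower the risk score are excluded on purpose, since a decline notice should list only the factors that counted against the applicant.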
Putting it together
The engineering themes repeat across all three projects, and they are the transferable part:
- Judge models on the metric the problem demands (PR-AUC and KS under imbalance, macro-F1 across many classes), never on accuracy.
- Treat the threshold as a cost decision, not a default.
- Be ruthless about leakage; it is the single biggest reason credit models look great offline and fail in production.
- Design comparisons so the result is attributable to the thing you are testing.
- Engineer for the data you actually have, whether that is a 1.6 GB file that needs streaming or transaction text that needs aggressive cleaning.
- Ship the human-in-the-loop logic, because confidence-based routing is what separates a model from a system.
Each project stands alone, but the line they trace, from credit history to affordability to the transaction data that makes affordability computable, is the actual shape of modern credit decisioning.
Full code, data notes and reproducible notebooks: credit-default risk, affordability-based risk, transaction categorisation.

