Your model isn't underfitting. Your features are lazy.

#data #ai #python

Here's the scene I've watched play out on a dozen teams. Accuracy plateaus. Someone rips out the logistic regression, drops in XGBoost, and waits for the jump. It doesn't come — or it comes with two points you can't explain to anyone. So the week disappears into hyperparameter tuning, and you end up with a slower, heavier, less interpretable model that's barely better than where you started.

The model was almost never the bottleneck. The features were.

This post is the long, practical version of that argument. We'll define the two camps in plain language, run real code, look at when boosting genuinely wins, and then walk through the failure mode nobody warns you about — the one where the fancy model is "winning" because it's quietly cheating.

A note before we start: the examples here are deliberately generic. We'll predict a binary outcome on a tabular dataset (think cancelled or not, converted or not). The same principles carry over to numeric targets such as demand, a quantity, or a score, and you should validate them on your own data.

The two camps, in plain terms

Linear / logistic regression fits a straight-line relationship: each feature gets a weight (a coefficient), and the prediction is a weighted sum. Logistic regression is the same idea bent for classification — it outputs a probability.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# the whole model, readable in one line per feature:
for name, weight in zip(feature_names, model.coef_[0]):
    print(f"{name:<20} {weight:+.3f}")

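That loop is the entire model, give or take the intercept stored in model.intercept_. A positive weight means "more of this pushes the prediction up," and you can hand that table to a stakeholder and defend every number. (Weights are only comparable across features when the features share a scale.) The cost: it assumes each feature contributes additively, and linearly in the log-odds, with no interactions unless you supply them. Real data often isn't that polite.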
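If "weighted sum, bent into a probability" feels abstract, here is a minimal sketch that rebuilds the prediction by hand. The data is synthetic and the names X_demo, y_demo, and clf are placeholders of mine, not part of any real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; any numeric feature matrix would do).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(size=500) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Weighted sum of the features plus the intercept gives the log-odds...
log_odds = X_demo @ clf.coef_[0] + clf.intercept_[0]
# ...and the sigmoid bends that into a probability.
manual_proba = 1 / (1 + np.exp(-log_odds))

# Compare against what scikit-learn reports for the positive class.
matches = np.allclose(manual_proba, clf.predict_proba(X_demo)[:, 1])
print("manual calculation matches predict_proba:", matches)
```

Nothing is hidden between the coefficients and the output, which is the whole interpretability argument in a few lines.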

Gradient boosting (XGBoost, LightGBM, sklearn's GradientBoostingClassifier) builds hundreds of small decision trees, each one correcting the mistakes of the last. It captures nonlinearity and feature interactions for free, and on messy tabular data it usually wins on raw accuracy.

from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_train, y_train)

The cost is the mirror image: it's a black box. You can't read it the way you read coefficients, it will happily overfit if you let it, and — this is the part that bites — it will exploit any leakage in your data with terrifying enthusiasm.
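To make "each tree corrects the mistakes of the last" concrete, here is a stripped-down sketch of the idea for squared-error regression. It is a teaching toy under my own assumptions (synthetic data, a fixed learning rate, depth-2 trees), not how XGBoost or LightGBM are actually implemented; those add regularization, second-order gradients, and a lot of engineering.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear target (an assumption for illustration).
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(400, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=400)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant guess
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    residual = y_toy - prediction  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    prediction += learning_rate * tree.predict(X_toy)  # small corrective step
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

Note that this is training error, which is exactly why it keeps falling. Left unchecked, the same loop will happily fit noise.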

When boosting genuinely wins

Let me be fair to boosting, because it deserves it. Build a dataset with a real interaction effect — where the target depends on two features multiplied together, not added — and watch what happens.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# the signal lives in the INTERACTION: x1 * x2, not x1 + x2
logit = 3 * (x1 * x2)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([x1, x2])

lr  = LogisticRegression()
xgb = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)

print("logreg:", cross_val_score(lr,  X, y, cv=5, scoring="roc_auc").mean())
print("xgb:   ", cross_val_score(xgb, X, y, cv=5, scoring="roc_auc").mean())

Logistic regression will score around chance here — close to 0.5 AUC — because there's no straight-line relationship between either feature alone and the target. Boosting will score much higher, because trees can split on x1 and then split on x2 inside that branch, which is exactly an interaction.
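You can check the "no straight-line relationship" claim directly, without fitting any model. This sketch regenerates the same synthetic data as above and compares plain correlations; the variable names corr_x1, corr_x2, and corr_product are mine.

```python
import numpy as np

# Same synthetic setup as the experiment above.
rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-3 * x1 * x2))).astype(int)

# Each raw feature on its own carries almost no linear signal...
corr_x1 = np.corrcoef(x1, y)[0, 1]
corr_x2 = np.corrcoef(x2, y)[0, 1]
# ...while the product of the two carries plenty.
corr_product = np.corrcoef(x1 * x2, y)[0, 1]

print(f"corr(x1, y)      = {corr_x1:+.3f}")
print(f"corr(x2, y)      = {corr_x2:+.3f}")
print(f"corr(x1 * x2, y) = {corr_product:+.3f}")
```

The first two numbers should hover near zero and the third should not, which is all a linear model needs to know about why it is stuck.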

That's the honest case for boosting: when the signal is nonlinear or lives in interactions, and you don't know that ahead of time. Trees find structure you didn't hand-engineer.

But notice the catch in that last sentence — "you didn't hand-engineer." What if you had?

The plot twist: features close the gap

Give the linear model the interaction term explicitly, and it catches right up:

# hand the interaction to the linear model as a feature
X_better = np.column_stack([x1, x2, x1 * x2])

print("logreg + feature:", cross_val_score(lr, X_better, y, cv=5,
                                            scoring="roc_auc").mean())

One engineered column — x1 * x2 — and the "weak" model is now competitive with boosting, while staying fully interpretable. You can look at the coefficient on that interaction term and know what the model learned.
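Here is what "look at the coefficient" means in practice: a small sketch that refits the linear model on the three columns and prints the weights. It regenerates the same synthetic data so it runs on its own; the name fitted is a placeholder of mine.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same synthetic setup as the experiment above: the true logit is 3 * x1 * x2.
rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-3 * x1 * x2))).astype(int)

X_better = np.column_stack([x1, x2, x1 * x2])
fitted = LogisticRegression(max_iter=1000).fit(X_better, y)

# One readable line per feature, including the engineered one.
for name, weight in zip(["x1", "x2", "x1 * x2"], fitted.coef_[0]):
    print(f"{name:<10} {weight:+.3f}")
```

The weights on x1 and x2 should land near zero and the weight on the interaction should land near the 3 we baked into the data (slightly shrunk by scikit-learn's default regularization). The model is telling you, in plain numbers, where the signal lives.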

This is the whole thesis in one experiment. Boosting wasn't smarter. It was compensating for a feature you forgot to create. The accuracy gap between a simple model and a complex one is very often just the complex model rediscovering, internally and opaquely, a feature you could have written by hand.

Better features beat a better algorithm, and they cost less to run and far less to trust.

The failure mode nobody warns you about: leakage

Here's where boosting's enthusiasm turns dangerous. Data leakage is when information sneaks into your features that wouldn't actually be available at prediction time — usually because it's downstream of the very thing you're predicting.

A concrete example. Say you're predicting whether an order will be cancelled. Someone adds a feature refund_amount. It's wildly predictive — accuracy jumps ten points. Ship it!

Except refunds only happen after a cancellation. At the moment you actually need to predict, refund_amount is always zero. You've trained a model to predict cancellations using a column that only exists because of cancellations. In production it's useless, and you won't find out until the numbers quietly fall apart.

# This "feature" is the answer wearing a disguise.
# It is only populated after the event you're trying to predict.
df["refund_amount"]   # leaks the target

Why does this matter more for boosting? Because in a linear model, a single leaky column tends to show up as one suspiciously dominant coefficient, which you might actually notice. Boosting will find the leak, latch onto it, and route most of its trees through it, handing you a gorgeous validation score that's pure fiction. The more powerful the model, the more efficiently it exploits a mistake in your data.
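One cheap habit that catches the refund_amount kind of leak: score every column on its own before you train anything. The sketch below is an illustration on made-up data. The columns basket_value and days_to_ship, the helper single_feature_auc, and the 0.95 threshold are all my assumptions, and a screen like this flags suspects for a human to review; it does not prove leakage.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical order data; only refund_amount is downstream of the target.
rng = np.random.default_rng(0)
n = 2000
cancelled = (rng.uniform(size=n) < 0.2).astype(int)
basket_value = rng.uniform(20, 80, size=n)
days_to_ship = rng.normal(3, 1, size=n) + 0.5 * cancelled
refund_amount = np.where(cancelled == 1, basket_value, 0.0)  # populated after the event

features = {
    "basket_value": basket_value,
    "days_to_ship": days_to_ship,
    "refund_amount": refund_amount,
}

def single_feature_auc(column, target):
    """AUC of one raw column used as the score, direction-agnostic."""
    auc = roc_auc_score(target, column)
    return max(auc, 1 - auc)

# Anything that nearly solves the problem alone deserves a hard look.
suspects = [name for name, column in features.items()
            if single_feature_auc(column, cancelled) > 0.95]
print("columns to interrogate:", suspects)
```

A column that separates the classes almost perfectly by itself is either the best feature you'll ever find or the answer wearing a disguise. Ask when it gets populated before you celebrate.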

There's a subtler version too — preprocessing leakage — where you compute something over the whole dataset before splitting:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# WRONG: the scaler sees the test rows' statistics before you split
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# RIGHT: fit preprocessing on train only, inside a pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)   # scaler refits on each fold's train

A Pipeline isn't a style preference. It's the thing that keeps test information from bleeding into training, and it's the difference between a validation score you can believe and one you can't.
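With a scaler the damage is often small, so it's easy to shrug off. To see preprocessing leakage at full strength, here is a sketch that swaps the scaler for supervised feature selection (my substitution, not the example above) and runs it on pure noise, where no honest model should beat a coin flip. The sizes and k=20 are arbitrary choices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: the labels have nothing to do with the features.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))
y_noise = np.array([0, 1] * 50)

# WRONG: pick the "best" columns using every row, then cross-validate.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=5
).mean()

# RIGHT: selection happens inside each fold, on that fold's training rows only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_score = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print(f"selection before CV (leaky): {leaky_score:.2f}")
print(f"selection inside pipeline:   {honest_score:.2f}")
```

The leaky number should look like a real result and the honest one should sit around chance. Same data, same model; the only difference is where the preprocessing was fit.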

So how do I actually choose?

Here's the decision I'd hand a junior engineer, in order:

→ Start with the simple model. Logistic or linear regression, clean features, a real cross-validation setup. This is your baseline and your sanity check — if it scores absurdly well, you probably have leakage, and the simple model made it easy to spot.

→ Spend your effort on features, not models. Interactions, ratios, time-since-event, sensible encodings. Most of the accuracy you're chasing lives here. Every feature you engineer by hand is one the black box doesn't have to reconstruct opaquely.

→ Reach for boosting when the simple model plateaus and you've ruled out leakage and you've exhausted obvious features. Now you're using boosting for what it's actually good at — nonlinearity you genuinely can't hand-engineer — instead of as a band-aid over lazy features.

→ When you do use boosting, demand interpretability back. Feature importances, SHAP values, partial dependence. If you can't explain why it predicts what it predicts, you can't catch it when it's wrong.
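For the last step, one way to "demand interpretability back" is permutation importance, which ships with scikit-learn. This is a minimal sketch on synthetic data, using sklearn's GradientBoostingClassifier as a stand-in for XGBoost; the feature names and sizes are my assumptions, and SHAP or partial dependence would be reasonable alternatives.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: one column drives the target, two are pure noise.
rng = np.random.default_rng(0)
X_imp = rng.normal(size=(1000, 3))
y_imp = (X_imp[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)
names = ["signal", "noise_a", "noise_b"]

X_tr, X_te, y_tr, y_te = train_test_split(X_imp, y_imp, random_state=0)
gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the score drop.
result = permutation_importance(gbm, X_te, y_te, n_repeats=10, random_state=0)

for name, drop in sorted(zip(names, result.importances_mean), key=lambda pair: -pair[1]):
    print(f"{name:<10} {drop:+.3f}")
```

Measuring on held-out rows matters here: importances computed on training data can reward features the model merely memorized. And if one column towers over everything else, treat that as a prompt to check for leakage, not as a victory lap.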

The principle underneath all of it: model choice is a data decision, not a leaderboard contest. A clean regression on good features will beat boosting on dirty ones almost every time, and it'll be cheaper to run and easier to defend. XGBoost won't save you from a pipeline that feeds it lies. Nothing will.

When your accuracy last stalled — did you reach for a new model, or did you go back and interrogate the features first? I'm curious which instinct fired, because it tells you a lot about where you are in this.

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. I build and fix the data pipelines that feed models like these. If this is your world, this is the work I do at vf-insights.com.