The Data Poverty Problem Nobody Talks About
Most quant ML tutorials assume you have thousands of stocks and years of daily data. In reality, you're often stuck with 200 monthly observations of a niche universe—emerging market small-caps, sector rotation signals, or alternative data that only goes back five years. Throw a random forest at that and watch it memorize your training set while delivering zero alpha out-of-sample.
This isn't hypothetical. I've seen teams burn weeks tuning XGBoost hyperparameters on datasets where a three-factor linear model outperforms by 20% in Sharpe ratio, simply because it has 197 fewer parameters to overfit. The dirty secret of quant finance is that data scarcity is the norm, not the exception.
Factor models—linear combinations of known risk premia like value, momentum, quality—aren't sexy. But they encode decades of financial economics research into a handful of coefficients. When your sample size is in the low hundreds, that prior knowledge is worth more than any gradient boosting magic.
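To make that concrete, here's a minimal sketch of the data-poor regime: 200 simulated monthly observations generated from three factors plus noise (the factor returns, loadings, and noise scale are all hypothetical stand-ins), fit with ordinary least squares. With only three coefficients to estimate, there is very little room to overfit even on a chronological 150/50 split.

```python
import numpy as np

# Hypothetical setup: 200 monthly observations, 3 known factors
# (think value, momentum, quality) plus noise -- the data-poor
# regime described above. All numbers here are illustrative.
rng = np.random.default_rng(0)
n, k = 200, 3
F = rng.normal(size=(n, k))                  # monthly factor returns
beta_true = np.array([0.4, 0.3, 0.2])        # true factor loadings
y = F @ beta_true + rng.normal(scale=0.5, size=n)  # asset excess returns

# Split chronologically: first 150 months train, last 50 test.
F_tr, F_te = F[:150], F[150:]
y_tr, y_te = y[:150], y[150:]

# Three-factor linear model: just 3 coefficients to estimate.
beta_hat, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)
pred = F_te @ beta_hat

# Out-of-sample R^2 -- with 3 parameters and 150 training points,
# the estimated loadings land close to the truth and generalize.
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(beta_hat.round(2), round(r2, 2))
```

Swap the `lstsq` call for a random forest on the same 150 rows and the in-sample fit improves while the out-of-sample R² typically degrades; that gap is the whole argument for strong priors in small samples.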