Berkan Sesen

Posted on Jun 29 • Originally published at sesen.ai

AIC and BIC: Choosing the Right Model Without Overfitting

#statistics #supervisedlearning #discriminative

Imagine you're fitting a curve to noisy data. A straight line misses the shape entirely, so you try a quadratic, then a cubic, then keep going. By degree 10 the curve passes through nearly every point, training error is almost zero, and your model is worthless on new data.

This is the oldest trap in machine learning: more parameters always improve the fit to training data, but at some point the improvement is just noise-fitting. You need a principled way to say "this model is complex enough." That is exactly what the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) provide. By the end of this post, you'll compute both from scratch, understand why they penalise complexity differently, and know when to reach for each one.

The Model Selection Problem

Before we look at code, watch what happens when we increase the complexity of a polynomial fit. A straight line (degree 1) misses the pattern entirely. The true cubic (degree 3) captures the shape. But by degree 10, the model is chasing individual data points:

Both AIC and BIC drop sharply at degree 3 and then climb as complexity increases. The information criteria are doing exactly what we want: rewarding better fit but penalising unnecessary parameters.

Let's Build It

The script below fits polynomials of degree 1 through 10 to simulated data and computes AIC and BIC for each:

import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Generate data from a true cubic with Gaussian noise
n = 50
x = np.linspace(-3, 3, n)
y_true = 0.5 * x**3 - 2 * x + 1
y = y_true + np.random.normal(0, 3, n)

# Fit polynomials of degree 1 through 10
for degree in range(1, 11):
    coeffs = np.polyfit(x, y, degree)
    y_pred = np.polyval(coeffs, x)
    residuals = y - y_pred

    # ML estimate of sigma
    sigma_ml = np.sqrt(np.mean(residuals**2))

    # Log-likelihood under Gaussian errors
    ll = np.sum(norm.logpdf(y, loc=y_pred, scale=sigma_ml))

    # Number of parameters: (degree+1) coefficients + 1 sigma
    k = degree + 2

    aic = 2 * k - 2 * ll
    bic = k * np.log(n) - 2 * ll

    print(f"Degree {degree:2d}: k={k:2d}, LL={ll:.2f}, "
          f"AIC={aic:.2f}, BIC={bic:.2f}")

Degree  1: k= 3, LL=-127.93, AIC=261.87, BIC=267.60
Degree  2: k= 4, LL=-127.29, AIC=262.58, BIC=270.23
Degree  3: k= 5, LL=-119.19, AIC=248.38, BIC=257.94
Degree  4: k= 6, LL=-119.19, AIC=250.37, BIC=261.84
Degree  5: k= 7, LL=-118.74, AIC=251.49, BIC=264.87
Degree  6: k= 8, LL=-117.25, AIC=250.50, BIC=265.79
Degree  7: k= 9, LL=-117.07, AIC=252.14, BIC=269.34
Degree  8: k=10, LL=-117.05, AIC=254.10, BIC=273.22
Degree  9: k=11, LL=-115.98, AIC=253.96, BIC=274.99
Degree 10: k=12, LL=-115.90, AIC=255.80, BIC=278.75

Both AIC and BIC select degree 3, which is the true generating function. The log-likelihood keeps increasing (better fit) as degree grows, but the penalty terms eventually outweigh the improvement.

What Just Happened?

The Log-Likelihood: Measuring Goodness of Fit

The foundation of both AIC and BIC is the log-likelihood: how probable is the observed data under a given model? For a linear model with Gaussian errors, each data point $y_i$ is assumed to follow:

$$y_i \sim \mathcal{N}(\hat{y}_i, \sigma^2)$$

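Here $\hat{y}_i$ is the model's prediction and $\sigma^2$ is the error variance. The log-likelihood of the entire dataset is the sum of individual log-probabilities:

$$\ell = \sum_{i=1}^{n} \ln p(y_i \mid \hat{y}_i, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
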
This is exactly what the original R code computed manually. The code calculated $\sigma$ using the ML estimate (dividing by $n$, not $n-p$) and evaluated dnorm() at each residual:

# The R code's approach, translated to Python:
# LL = sum(log(dnorm(x=residuals, mean=0, sd=sigma_ML)))
# Which is equivalent to:
ll = np.sum(norm.logpdf(residuals, loc=0, scale=sigma_ml))
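Because $\sigma$ is set to its ML estimate, the sum of squared residuals divided by $2\sigma^2$ is exactly $n/2$, so the log-likelihood collapses to $-\frac{n}{2}\left(\ln(2\pi\hat{\sigma}^2) + 1\right)$. The sketch below checks that shortcut against the `norm.logpdf` sum. The residuals are simulated stand-ins with an arbitrary seed, not the post's data.

```python
import numpy as np
from scipy.stats import norm

# Stand-in residuals (hypothetical data, arbitrary seed)
rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 3.0, size=50)

n = residuals.size
sigma_ml = np.sqrt(np.mean(residuals**2))  # ML estimate: divide by n

# Sum of pointwise Gaussian log-densities, as in the main script
ll_numeric = np.sum(norm.logpdf(residuals, loc=0.0, scale=sigma_ml))

# Closed form that holds only when sigma is the ML estimate
ll_closed = -0.5 * n * (np.log(2 * np.pi * sigma_ml**2) + 1)

print(f"logpdf sum: {ll_numeric:.6f}, closed form: {ll_closed:.6f}")
```

The two numbers should agree to floating-point precision. The shortcut stops being valid if $\sigma$ is estimated any other way (for example with the $n-p$ divisor).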

A higher log-likelihood means the model assigns higher probability to the observed data. But here is the catch: for nested models, adding more parameters always increases the log-likelihood (or at least does not decrease it). A degree-10 polynomial has 11 coefficients, so it can pass exactly through any 11 points with distinct $x$ values, and the log-likelihood grows without bound as the residuals vanish. So we need a way to penalise complexity.
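The interpolation effect is easy to check on a small case. In this minimal sketch, a polynomial with as many coefficients as data points (degree 5 on 6 points, sizes chosen purely for illustration) is fitted to pure noise, and the residuals come out at essentially zero.

```python
import numpy as np

rng = np.random.default_rng(1)       # arbitrary seed for illustration
m = 6                                # number of data points
x_small = np.linspace(-3, 3, m)
y_small = rng.normal(0, 3, m)        # pure noise, no signal at all

# Degree m-1 gives m coefficients: one per data point
coeffs = np.polyfit(x_small, y_small, deg=m - 1)
residuals_small = y_small - np.polyval(coeffs, x_small)
max_abs_residual = np.max(np.abs(residuals_small))

print(f"Largest absolute residual: {max_abs_residual:.2e}")
```

With residuals this small, the ML estimate of $\sigma$ shrinks toward zero and the Gaussian log-likelihood becomes arbitrarily large, even though the fit has learned nothing but noise.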

AIC: Penalising for Prediction

The Akaike Information Criterion adds a penalty proportional to the number of parameters:

$$\text{AIC} = 2k - 2\ell(\hat{\theta})$$

Here $k$ is the number of estimated parameters and $\ell(\hat{\theta})$ is the maximised log-likelihood. Lower AIC is better. The $2k$ term penalises complexity: each additional parameter adds 2 to the score, so it pays for itself only if it raises the log-likelihood by more than 1.
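As a worked example of that trade-off, the sketch below plugs log-likelihood values from the output table above into a small helper (`aic` is an illustrative name, not a library function). Degree 4 has the same rounded log-likelihood as degree 3, so it pays the full 2-point cost. Degree 6 gains 1.94 log-likelihood units for 3 extra parameters, which falls short of the 3 units it would need.

```python
def aic(log_likelihood: float, k: int) -> float:
    """AIC = 2k - 2*ell; lower is better."""
    return 2 * k - 2 * log_likelihood

# Rounded log-likelihoods taken from the output table above
aic_deg3 = aic(-119.19, k=5)
aic_deg4 = aic(-119.19, k=6)   # same fit, one more parameter
aic_deg6 = aic(-117.25, k=8)   # better fit, three more parameters

print(f"degree 3: {aic_deg3:.2f}")
print(f"degree 4: {aic_deg4:.2f} (difference {aic_deg4 - aic_deg3:+.2f})")
print(f"degree 6: {aic_deg6:.2f} (difference {aic_deg6 - aic_deg3:+.2f})")
```

Small discrepancies in the last digit relative to the table come from using the rounded log-likelihoods as inputs.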

BIC: Penalising for Truth

The Bayesian Information Criterion uses a heavier, sample-size-dependent penalty:

$$\text{BIC} = k\ln(n) - 2\ell(\hat{\theta})$$

Here $n$ is the number of data points. Since $\ln(n) > 2$ for any $n \geq 8$, BIC penalises complexity more harshly than AIC. This means BIC tends to select simpler models, especially as the sample size grows.
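The threshold of 8 follows from $e^2 \approx 7.39$: the per-parameter cost $\ln(n)$ first exceeds AIC's flat 2 at the next integer. The sketch below tabulates that cost for a few sample sizes (chosen only for illustration) and reproduces the degree-3 BIC from the output table. `bic` is an illustrative helper name.

```python
import math

def bic(log_likelihood: float, k: int, n: int) -> float:
    """BIC = k*ln(n) - 2*ell; lower is better."""
    return k * math.log(n) - 2 * log_likelihood

# Per-parameter penalty at different sample sizes (AIC always charges 2)
for n_obs in (7, 8, 50, 1_000, 1_000_000):
    print(f"n={n_obs:>9,}: BIC penalty per parameter = {math.log(n_obs):.3f}")

threshold = math.exp(2)                  # n above this makes ln(n) > 2
bic_deg3 = bic(-119.19, k=5, n=50)       # degree-3 row of the table above
print(f"crossover at n > {threshold:.3f}, degree-3 BIC = {bic_deg3:.2f}")
```

The penalty keeps growing with $n$, but only logarithmically, so even a million observations cost under 14 points per parameter.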

The Penalty Decomposition

To see the difference concretely, look at the decomposition of each criterion into its fit and penalty components:

The light-coloured portion is the goodness of fit ( $-2\ell$ ), which shrinks as the model improves. The dark-coloured cap is the penalty, which grows with complexity. For AIC, the penalty grows by 2 per parameter. For BIC (with $n = 50$ ), it grows by $\ln(50) \approx 3.91$ per parameter, nearly double.

Underfitting vs Overfitting at a Glance

The degree-1 line misses the cubic curvature (AIC=261.9, BIC=267.6). The degree-3 polynomial captures the true shape (AIC=248.4, BIC=257.9). The degree-10 polynomial chases noise with wild oscillations at the edges (AIC=255.8, BIC=278.7). Both criteria correctly identify degree 3 as the sweet spot.

Going Deeper

AIC vs BIC: When to Use Which

The choice between AIC and BIC depends on your goal.

Use AIC when you want to predict. AIC estimates the relative expected Kullback-Leibler divergence between the true data-generating process and each fitted model, so choosing the lowest AIC targets the candidate that loses the least information. It does not assume the true model is in your candidate set. AIC is willing to select a slightly more complex model if the extra parameters improve prediction, even if those parameters are not "real" effects.

Use BIC when you want to explain. BIC is derived from a Bayesian model comparison perspective and is consistent: as $n \to \infty$ , BIC selects the true model with probability 1 (assuming it is among the candidates). BIC's heavier penalty makes it more conservative, preferring simpler models.

In practice, when AIC and BIC agree (as they do here), you can be quite confident in the selection. When they disagree, AIC will choose the more complex model and BIC the simpler one. Neither is universally better.

Cross-Validation as an Alternative

Cross-validation is the main competitor to information criteria for model selection. Instead of penalising complexity analytically, it estimates out-of-sample performance directly by holding out data.

from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for degree in range(1, 11):
    pipe = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(pipe, x.reshape(-1, 1), y, cv=kf,
                             scoring='neg_mean_squared_error')
    print(f"Degree {degree:2d}: CV MSE = {-scores.mean():.3f} ± {scores.std():.3f}")

Degree  1: CV MSE = 10.404 ± 3.612
Degree  2: CV MSE = 11.268 ± 3.149
Degree  3: CV MSE = 8.181 ± 2.801
Degree  4: CV MSE = 8.333 ± 2.661
Degree  5: CV MSE = 9.262 ± 3.775
Degree  6: CV MSE = 9.958 ± 4.022
Degree  7: CV MSE = 10.662 ± 4.458
Degree  8: CV MSE = 11.776 ± 5.469
Degree  9: CV MSE = 11.331 ± 6.141
Degree 10: CV MSE = 12.503 ± 7.651

All three methods agree: degree 3 is the best model.

The key tradeoffs:

| | AIC / BIC | Cross-Validation |
|---|---|---|
| Speed | One model fit per candidate | One model fit per fold per candidate (5 in the example above) |
| Assumptions | Requires likelihood function | Model-agnostic |
| Small samples | Works well (especially AICc) | Noisy with few data points |
| Non-parametric models | Not directly applicable | Works with any model |

For linear models and GLMs, AIC and BIC are fast and effective. For black-box models like random forests or neural networks, cross-validation is the standard approach.

Common Pitfalls

Comparing models on different data. AIC and BIC values are only comparable across models fitted to the same dataset. If you subset or transform the data differently between models, the scores are meaningless.
Forgetting to count parameters. For polynomial regression, $k$ includes all coefficients plus the noise parameter $\sigma$. A degree-3 polynomial has 4 coefficients (intercept, $x$, $x^2$, $x^3$) plus $\sigma$, so $k = 5$.
Using AIC with small samples. When $n/k < 40$, use the corrected version AICc:

$$\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n - k - 1}$$

This adds an extra penalty that vanishes as $n$ grows but prevents overfitting with small datasets.
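A minimal sketch of the correction, assuming the standard form $\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n-k-1}$ and the parameter count used throughout this post ($k$ = degree + 2). The helper names `n_params` and `aicc` are illustrative, and the log-likelihoods are the rounded values from the output table.

```python
def n_params(degree: int) -> int:
    """Polynomial coefficients (degree + 1) plus the noise parameter sigma."""
    return degree + 2

def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Small-sample corrected AIC; requires n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)

n = 50
aicc_deg3 = aicc(-119.19, n_params(3), n)    # correction = 60/44
aicc_deg10 = aicc(-115.90, n_params(10), n)  # correction = 312/37

print(f"degree  3: AICc = {aicc_deg3:.2f}")
print(f"degree 10: AICc = {aicc_deg10:.2f}")
```

The correction adds about 1.4 points at degree 3 but about 8.4 at degree 10, so it mostly punishes the models that are large relative to the sample.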

Try It Yourself

Change the noise level. Set the noise standard deviation to 1 instead of 3. Do AIC and BIC still agree? (Hint: with less noise, the signal is clearer and both criteria become more decisive.)
Try AICc. Implement the small-sample correction and compare it to AIC for $n = 20$ data points. Which degrees does each select?
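Apply to real data. Load a regression dataset from scikit-learn, such as the California housing or diabetes data (the Boston housing dataset was removed in scikit-learn 1.2), fit models with different feature subsets, and use AIC/BIC to choose the best.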

Where This Comes From

Hirotsugu Akaike's criterion first appeared in the proceedings of the Second International Symposium on Information Theory, published in 1973 (the symposium itself was held in 1971), with the widely cited journal paper following in 1974. Working at the Institute of Statistical Mathematics in Tokyo, Akaike had a key insight: model selection could be framed as an information-theoretic problem. The Kullback-Leibler divergence measures the information lost when one probability distribution is used to approximate another. Akaike showed that $-2\ell + 2k$ provides an asymptotically unbiased estimate of this divergence (up to a constant that cancels when comparing models).

"The choice of model is treated as an estimation procedure. The criterion for the 'best approximating model' is the minimum of the expected entropy."

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.

Five years later, Gideon Schwarz at the Hebrew University of Jerusalem derived a competing criterion from a Bayesian perspective. He showed that under certain regularity conditions, $k \ln(n) - 2\ell$ approximates $-2$ times the log marginal likelihood of a model (up to terms that do not grow with $n$), so the difference in BIC between two models approximates $-2$ times the log of their Bayes factor. Where Akaike asked "which model predicts best?", Schwarz asked "which model is most probably true?"

"The dimension of the true model is determined with probability 1 as the sample size tends to infinity."

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-464.
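One practical reading of the Bayes factor connection: a BIC difference can be converted into a rough evidence ratio via $\exp(\Delta\text{BIC}/2)$. The sketch below applies this to BIC values from the output table. The helper name `approx_bayes_factor` is hypothetical, and the result is a large-sample approximation rather than an exact Bayes factor.

```python
import math

def approx_bayes_factor(bic_a: float, bic_b: float) -> float:
    """Rough Bayes factor favouring model A over model B, from their BICs."""
    return math.exp((bic_b - bic_a) / 2)

# BIC values taken from the output table above
bf_3_vs_1 = approx_bayes_factor(257.94, 267.60)  # cubic vs straight line
bf_3_vs_4 = approx_bayes_factor(257.94, 261.84)  # cubic vs quartic

print(f"degree 3 vs degree 1: about {bf_3_vs_1:.0f} to 1")
print(f"degree 3 vs degree 4: about {bf_3_vs_4:.0f} to 1")
```

On this scale the cubic is favoured strongly over the line and only modestly over the quartic, which matches the intuition that the quartic is a near miss rather than a bad model.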

The Mathematical Connection

Both criteria share the same structure: a goodness-of-fit term ($-2\ell$) plus a complexity penalty. The difference lies entirely in the penalty:

$$\text{AIC} = -2\ell + 2k, \qquad \text{BIC} = -2\ell + k\ln(n)$$

AIC's penalty is fixed regardless of sample size. BIC's penalty grows with $\ln(n)$ , which means BIC becomes increasingly conservative as you collect more data. For our 50-point dataset, BIC penalises each parameter by 3.91 compared to AIC's 2.0, nearly twice as harsh.

This difference has a theoretical consequence: AIC is efficient (it minimises prediction error asymptotically) but not consistent (it may select overly complex models even with infinite data). BIC is consistent (it selects the true model as $n \to \infty$ ) but not efficient (it may underfit with finite data). There is no free lunch.



Frequently Asked Questions

What is the difference between AIC and BIC?

Both penalise model complexity to prevent overfitting, but they differ in philosophy and penalty strength. AIC uses a fixed penalty of 2 per parameter and is optimised for prediction accuracy. BIC uses a penalty of ln(n) per parameter, which grows with sample size and is designed to identify the true generating model. For any dataset with 8 or more observations, BIC penalises complexity more harshly than AIC.

Can I compare AIC or BIC values across different datasets?

No. AIC and BIC values are only meaningful when comparing models fitted to the exact same dataset. If you change the number of observations, transform the response variable, or subset the data differently between models, the scores are not comparable. Always ensure identical data across all candidate models.

When should I use AICc instead of AIC?

Use the corrected version AICc when the ratio of sample size to parameters (n/k) is below 40. AICc adds an extra penalty term that prevents overfitting with small datasets but vanishes as n grows. Because AICc converges to AIC for large n, some practitioners use it by default regardless of sample size.

Can AIC and BIC be used with non-linear models like neural networks?

Not directly. Both criteria require a well-defined likelihood function and a meaningful count of free parameters. For black-box models such as random forests or deep neural networks, neither is readily available, so cross-validation is the standard approach to model selection. AIC and BIC work best with linear models, generalised linear models, and other parametric statistical models.

What does it mean when AIC and BIC disagree on the best model?

When the two criteria disagree, AIC will favour the more complex model and BIC will favour the simpler one. This reflects their different goals: AIC prioritises predictive accuracy while BIC prioritises parsimony. If your goal is forecasting, lean toward the AIC-selected model. If your goal is understanding the true data-generating process, lean toward the BIC-selected model.
