DEV Community

Berkan Sesen

Posted on • Originally published at sesen.ai

AIC and BIC: Choosing the Right Model Without Overfitting

Imagine you're fitting a curve to noisy data. A straight line misses the shape entirely, so you try a quadratic, then a cubic, then keep going. By degree 10 the curve passes through nearly every point, training error is almost zero, and your model is worthless on new data.

This is the oldest trap in machine learning: more parameters always improve the fit to training data, but at some point the improvement is just noise-fitting. You need a principled way to say "this model is complex enough." That is exactly what the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) provide. By the end of this post, you'll compute both from scratch, understand why they penalise complexity differently, and know when to reach for each one.

The Model Selection Problem

Before we look at code, watch what happens when we increase the complexity of a polynomial fit. A straight line (degree 1) misses the pattern entirely. The true cubic (degree 3) captures the shape. But by degree 10, the model is chasing individual data points:

Animation showing polynomial fits from degree 1 to 10, with AIC and BIC scores updating as each model is evaluated

Both AIC and BIC drop sharply at degree 3 and then trend back up as complexity increases. The information criteria are doing exactly what we want: rewarding better fit but penalising unnecessary parameters.

Let's Build It


import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Generate data from a true cubic with Gaussian noise
n = 50
x = np.linspace(-3, 3, n)
y_true = 0.5 * x**3 - 2 * x + 1
y = y_true + np.random.normal(0, 3, n)

# Fit polynomials of degree 1 through 10
for degree in range(1, 11):
    coeffs = np.polyfit(x, y, degree)
    y_pred = np.polyval(coeffs, x)
    residuals = y - y_pred

    # ML estimate of sigma
    sigma_ml = np.sqrt(np.mean(residuals**2))

    # Log-likelihood under Gaussian errors
    ll = np.sum(norm.logpdf(y, loc=y_pred, scale=sigma_ml))

    # Number of parameters: (degree+1) coefficients + 1 sigma
    k = degree + 2

    aic = 2 * k - 2 * ll
    bic = k * np.log(n) - 2 * ll

    print(f"Degree {degree:2d}: k={k:2d}, LL={ll:.2f}, "
          f"AIC={aic:.2f}, BIC={bic:.2f}")
Output:
Degree  1: k= 3, LL=-127.93, AIC=261.87, BIC=267.60
Degree  2: k= 4, LL=-127.29, AIC=262.58, BIC=270.23
Degree  3: k= 5, LL=-119.19, AIC=248.38, BIC=257.94
Degree  4: k= 6, LL=-119.19, AIC=250.37, BIC=261.84
Degree  5: k= 7, LL=-118.74, AIC=251.49, BIC=264.87
Degree  6: k= 8, LL=-117.25, AIC=250.50, BIC=265.79
Degree  7: k= 9, LL=-117.07, AIC=252.14, BIC=269.34
Degree  8: k=10, LL=-117.05, AIC=254.10, BIC=273.22
Degree  9: k=11, LL=-115.98, AIC=253.96, BIC=274.99
Degree 10: k=12, LL=-115.90, AIC=255.80, BIC=278.75

Both AIC and BIC select degree 3, which is the degree of the true generating function. The log-likelihood keeps increasing (better fit) as degree grows, but the penalty terms eventually outweigh the improvement.

AIC and BIC plotted against polynomial degree, showing both criteria reaching their minimum at degree 3, with log-likelihood on the secondary axis

What Just Happened?

The Log-Likelihood: Measuring Goodness of Fit

The foundation of both AIC and BIC is the log-likelihood: how probable is the observed data under a given model? For a linear model with Gaussian errors, each data point $y_i$ is assumed to follow:

$$y_i \sim \mathcal{N}(\hat{y}_i, \sigma^2)$$

Where $\hat{y}_i$ is the model's prediction and $\sigma^2$ is the error variance. The log-likelihood of the entire dataset is the sum of individual log-probabilities:

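$$\ell = \sum_{i=1}^{n} \ln p(y_i \mid \hat{y}_i, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$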

This is exactly what the code computes. It estimates $\sigma$ with the ML estimate (dividing by $n$, not $n-p$) and evaluates the Gaussian log-density at each residual, mirroring the original R code's use of dnorm():

# The R code's approach, translated to Python:
# LL = sum(log(dnorm(x=residuals, mean=0, sd=sigma_ML)))
# Which is equivalent to:
ll = np.sum(norm.logpdf(residuals, loc=0, scale=sigma_ml))

A higher log-likelihood means the model assigns higher probability to the observed data. But here is the catch: for nested models, such as polynomials of increasing degree, adding parameters never decreases the maximised log-likelihood. A degree-10 polynomial has 11 coefficients, so it can pass exactly through any 11 points with distinct $x$ values, driving the residuals to zero and the log-likelihood arbitrarily high. So we need a way to penalise complexity.
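As a quick check on the log-likelihood described here, the following is a minimal sketch rather than part of the original code. The helper name `gaussian_log_likelihood` and the freshly simulated data are my own assumptions. It shows that, once $\sigma$ is set to its ML estimate, the sum of Gaussian log-densities collapses to the closed form $-\frac{n}{2}\left(\ln(2\pi\hat{\sigma}^2) + 1\right)$, and that the maximised log-likelihood does not decrease as the degree grows on the same data.

```python
import numpy as np
from scipy.stats import norm


def gaussian_log_likelihood(residuals):
    """Maximised Gaussian log-likelihood, with sigma at its ML estimate."""
    n = residuals.size
    sigma2_ml = np.mean(residuals**2)  # divides by n, not n - p
    return -0.5 * n * (np.log(2 * np.pi * sigma2_ml) + 1)


# Illustrative data only (same cubic shape as the post, different random draw)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 0.5 * x**3 - 2 * x + 1 + rng.normal(0, 3, x.size)

log_liks = []
for degree in range(1, 7):
    residuals = y - np.polyval(np.polyfit(x, y, degree), x)
    sigma_ml = np.sqrt(np.mean(residuals**2))
    ll_summed = np.sum(norm.logpdf(residuals, loc=0, scale=sigma_ml))
    ll_closed = gaussian_log_likelihood(residuals)
    # Both routes give the same number
    assert np.isclose(ll_summed, ll_closed)
    log_liks.append(ll_closed)

print(np.round(log_liks, 2))
```

Because each polynomial contains all lower-degree polynomials as special cases, the residual sum of squares can only shrink, so the printed values form a non-decreasing sequence.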

AIC: Penalising for Prediction

The Akaike Information Criterion adds a penalty proportional to the number of parameters:

$$\text{AIC} = 2k - 2\ell(\hat{\theta})$$

Where $k$ is the number of estimated parameters and $\ell(\hat{\theta})$ is the maximised log-likelihood. Lower AIC is better. The $2k$ term penalises complexity: each additional parameter adds 2 to the score, so it pays for itself only if it raises the log-likelihood by more than 1.
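To see that trade-off in numbers, here is a small sketch that reuses the $k$ and log-likelihood values printed earlier for degrees 2, 3 and 4. The helper `aic` and the `results` dictionary are my own naming, and the values are copied from the rounded output, so the last digit can differ slightly from the printed AIC column.

```python
def aic(log_lik, k):
    """AIC = 2k - 2 * log-likelihood; lower is better."""
    return 2 * k - 2 * log_lik


# degree: (k, log-likelihood), copied from the printed results
results = {2: (4, -127.29), 3: (5, -119.19), 4: (6, -119.19)}

for smaller, larger in [(2, 3), (3, 4)]:
    k_small, ll_small = results[smaller]
    k_large, ll_large = results[larger]
    gain = ll_large - ll_small
    delta_aic = aic(ll_large, k_large) - aic(ll_small, k_small)
    verdict = "worth it" if delta_aic < 0 else "not worth it"
    print(f"degree {smaller} -> {larger}: LL gain = {gain:.2f}, "
          f"delta AIC = {delta_aic:+.2f} ({verdict})")
```

Going from degree 2 to 3 gains about 8.1 in log-likelihood for one parameter, far above the break-even gain of 1. Going from 3 to 4 gains essentially nothing, so AIC rises by the full penalty of 2.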

BIC: Penalising for Truth

The Bayesian Information Criterion uses a heavier, sample-size-dependent penalty:

$$\text{BIC} = k\ln(n) - 2\ell(\hat{\theta})$$

Where $n$ is the number of data points. Since $\ln(n) > 2$ for any $n \geq 8$, BIC penalises complexity more harshly than AIC. This means BIC tends to select simpler models, especially as the sample size grows.
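The crossover at $n = 8$ is easy to verify. The sketch below (the function name `penalty_per_parameter` is my own) tabulates what one extra parameter costs under each criterion for a few sample sizes.

```python
import math


def penalty_per_parameter(n):
    """Return the (AIC, BIC) cost of one extra parameter at sample size n."""
    return 2.0, math.log(n)


for n in (5, 7, 8, 50, 1_000, 100_000):
    aic_cost, bic_cost = penalty_per_parameter(n)
    stricter = "BIC" if bic_cost > aic_cost else "AIC"
    print(f"n={n:>6}: AIC cost={aic_cost:.2f}, BIC cost={bic_cost:.2f} "
          f"-> {stricter} is stricter")
```

The two penalties are equal at $n = e^2 \approx 7.39$, which is why 8 is the first whole sample size at which BIC is the stricter criterion.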

The Penalty Decomposition

To see the difference concretely, look at the decomposition of each criterion into its fit and penalty components:

Stacked bar chart showing AIC and BIC decomposed into the negative log-likelihood (fit) and penalty terms for each polynomial degree

The light-coloured portion is the goodness of fit ($-2\ell$), which shrinks as the model improves. The dark-coloured cap is the penalty, which grows with complexity. For AIC, the penalty grows by 2 per parameter. For BIC (with $n = 50$), it grows by $\ln(50) \approx 3.91$ per parameter, nearly double.
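The same decomposition can be reproduced numerically. This is a sketch under the assumption that the rounded $k$ and log-likelihood values printed earlier are accurate enough. `decompose` and `printed` are illustrative names of my own.

```python
import math


def decompose(k, log_lik, n):
    """Split the criteria into the shared fit term and each penalty term."""
    return {
        "fit": -2 * log_lik,              # shared by AIC and BIC
        "aic_penalty": 2 * k,
        "bic_penalty": k * math.log(n),
    }


n = 50
# degree: (k, log-likelihood), copied from the printed results
printed = {1: (3, -127.93), 3: (5, -119.19), 10: (12, -115.90)}

for degree, (k, ll) in printed.items():
    parts = decompose(k, ll, n)
    aic = parts["fit"] + parts["aic_penalty"]
    bic = parts["fit"] + parts["bic_penalty"]
    print(f"degree {degree:2d}: fit={parts['fit']:.2f}, "
          f"AIC penalty={parts['aic_penalty']:.2f} (AIC={aic:.2f}), "
          f"BIC penalty={parts['bic_penalty']:.2f} (BIC={bic:.2f})")
```

Between degree 3 and degree 10 the fit term improves by only about 6.6, while the AIC penalty grows by 14 and the BIC penalty by roughly 27.4, which is why both criteria turn upward.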

Underfitting vs Overfitting at a Glance

Three-panel comparison showing degree 1 (underfitting), degree 3 (optimal), and degree 10 (overfitting), each with AIC and BIC scores

The degree-1 line misses the cubic curvature (AIC=261.9, BIC=267.6). The degree-3 polynomial captures the true shape (AIC=248.4, BIC=257.9). The degree-10 polynomial chases noise with wild oscillations at the edges (AIC=255.8, BIC=278.7). Both criteria correctly identify degree 3 as the sweet spot.

Going Deeper

AIC vs BIC: When to Use Which

The choice between AIC and BIC depends on your goal.

Use AIC when you want to predict. AIC estimates, up to a constant shared by all candidates, the expected Kullback-Leibler divergence between the true data-generating process and the fitted model, so picking the lowest AIC approximately minimises that divergence. It does not assume the true model is in your candidate set. AIC is willing to select a slightly more complex model if the extra parameters improve prediction, even if those parameters are not "real" effects.

Use BIC when you want to explain. BIC is derived from a Bayesian model comparison perspective and is consistent: as $n \to \infty$, BIC selects the true model with probability 1 (assuming it is among the candidates). BIC's heavier penalty makes it more conservative, preferring simpler models.

Comparison diagram showing AIC versus BIC, their formulas, use cases, and shared properties

In practice, when AIC and BIC agree (as they do here), you can be quite confident in the selection. When they disagree, AIC will choose the more complex model and BIC the simpler one. Neither is universally better.

Cross-Validation as an Alternative

Cross-validation is the main competitor to information criteria for model selection. Instead of penalising complexity analytically, it estimates out-of-sample performance directly by holding out data.

from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for degree in range(1, 11):
    pipe = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(pipe, x.reshape(-1, 1), y, cv=kf,
                             scoring='neg_mean_squared_error')
    print(f"Degree {degree:2d}: CV MSE = {-scores.mean():.3f} ± {scores.std():.3f}")
Output:
Degree  1: CV MSE = 10.404 ± 3.612
Degree  2: CV MSE = 11.268 ± 3.149
Degree  3: CV MSE = 8.181 ± 2.801
Degree  4: CV MSE = 8.333 ± 2.661
Degree  5: CV MSE = 9.262 ± 3.775
Degree  6: CV MSE = 9.958 ± 4.022
Degree  7: CV MSE = 10.662 ± 4.458
Degree  8: CV MSE = 11.776 ± 5.469
Degree  9: CV MSE = 11.331 ± 6.141
Degree 10: CV MSE = 12.503 ± 7.651

All three methods agree: degree 3 is the best model.

Normalised comparison of AIC, BIC, and 5-fold cross-validation scores across polynomial degrees, all reaching their minimum at degree 3

The key tradeoffs:

| | AIC / BIC | Cross-Validation |
|---|---|---|
| Speed | One model fit per candidate | One fit per fold per candidate (5 fits for 5-fold CV) |
| Assumptions | Requires likelihood function | Model-agnostic |
| Small samples | Works well (especially AICc) | Noisy with few data points |
| Non-parametric models | Not directly applicable | Works with any model |

For linear models and GLMs, AIC and BIC are fast and effective. For black-box models like random forests or neural networks, cross-validation is the standard approach.

Common Pitfalls

  1. Comparing models on different data. AIC and BIC values are only comparable across models fitted to the same dataset. If you subset or transform the data differently between models, the scores are meaningless.
  2. Forgetting to count parameters. For polynomial regression, $k$ includes all coefficients plus the variance parameter $\sigma^2$. A degree-3 polynomial has 4 coefficients (intercept, $x$, $x^2$, $x^3$) plus $\sigma$, so $k = 5$.
  3. Using AIC with small samples. When $n/k < 40$, use the corrected version AICc:

$$\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n-k-1}$$

This adds an extra penalty that vanishes as $n$ grows but prevents overfitting with small datasets.
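A minimal sketch of the correction, assuming the standard form $\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n-k-1}$. The function name `aicc` is my own. It shows how quickly the extra penalty fades for a cubic fit ($k = 5$) as $n$ grows.

```python
def aicc(log_lik, k, n):
    """Small-sample corrected AIC; only defined when n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k - 2 * log_lik
    return aic + (2 * k * (k + 1)) / (n - k - 1)


k = 5  # degree-3 polynomial: 4 coefficients + sigma
for n in (20, 50, 200, 2000):
    # With log_lik = 0 the AIC is just 2k, so the difference isolates the correction
    correction = aicc(0.0, k, n) - 2 * k
    print(f"n={n:4d}, n/k={n / k:6.1f}: extra penalty = {correction:.3f}")
```

For the 50-point dataset the correction for the cubic is 60/44, about 1.36, so AICc for degree 3 would sit a little above the printed AIC of 248.38. The guard clause matters in practice: the formula blows up as $k$ approaches $n - 1$.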

Try It Yourself

  1. Change the noise level. Set the noise standard deviation to 1 instead of 3. Do AIC and BIC still agree? (Hint: with less noise, the signal is clearer and both criteria become more decisive.)
  2. Try AICc. Implement the small-sample correction and compare it to AIC for $n = 20$ data points. Which degrees does each select?
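  3. Apply to real data. Load a regression dataset from scikit-learn, such as the diabetes dataset (load_diabetes) or California housing (fetch_california_housing), since the Boston housing dataset was removed in scikit-learn 1.2. Fit models with different feature subsets, and use AIC/BIC to choose the best.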

Where This Comes From

Hirotsugu Akaike presented his criterion at the Second International Symposium on Information Theory, held in 1971. The proceedings appeared in 1973, and the widely cited journal paper followed in 1974. Working at the Institute of Statistical Mathematics in Tokyo, Akaike had a key insight: model selection could be framed as an information-theoretic problem. The Kullback-Leibler divergence measures the information lost when one probability distribution is used to approximate another. Akaike showed that $-2\ell + 2k$ provides an asymptotically unbiased estimate of this divergence (up to a constant that cancels when comparing models).

"The choice of model is treated as an estimation procedure. The criterion for the 'best approximating model' is the minimum of the expected entropy."

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.

In 1978, Gideon Schwarz at the Hebrew University of Jerusalem derived a competing criterion from a Bayesian perspective. He showed that under certain regularity conditions, $k \ln(n) - 2\ell$ approximates $-2$ times the log marginal likelihood of a model, so the difference in BIC between two models approximates $-2$ times the log of their Bayes factor. Where Akaike asked "which model predicts best?", Schwarz asked "which model is most probably true?"

"The dimension of the true model is determined with probability 1 as the sample size tends to infinity."

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-464.

The Mathematical Connection

Both criteria share the same structure: a goodness-of-fit term ($-2\ell$) plus a complexity penalty. The difference lies entirely in the penalty:

$$\text{AIC} = -2\ell + 2k, \qquad \text{BIC} = -2\ell + k\ln(n)$$

AIC's penalty is fixed regardless of sample size. BIC's penalty grows with $\ln(n)$, which means BIC becomes increasingly conservative as you collect more data. For our 50-point dataset, BIC penalises each parameter by 3.91 compared to AIC's 2.0, nearly twice as harsh.
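To make the penalty gap concrete, here is a short worked step of my own, using the printed results for degrees 5 and 6. It asks how much log-likelihood a model with $\Delta k$ extra parameters must gain before each criterion prefers it.

```latex
\begin{align*}
\Delta\text{AIC} &= 2\,\Delta k - 2\,\Delta\ell < 0
  &&\iff \Delta\ell > \Delta k \\
\Delta\text{BIC} &= \Delta k \ln(n) - 2\,\Delta\ell < 0
  &&\iff \Delta\ell > \tfrac{1}{2}\,\Delta k \ln(n)
\end{align*}

% With n = 50, the break-even gain per extra parameter is
% 1 for AIC and ln(50)/2 \approx 1.96 for BIC.
%
% Degree 5 -> 6: \Delta k = 1, \Delta\ell = -117.25 - (-118.74) = 1.49
% Since 1 < 1.49 < 1.96, AIC falls (251.49 -> 250.50)
% while BIC rises (264.87 -> 265.79).
```

So the sixth-degree term clears AIC's bar but not BIC's: a small instance of the two criteria pulling in different directions, even though both still rank degree 3 best overall.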

This difference has a theoretical consequence: AIC is efficient (it minimises prediction error asymptotically) but not consistent (it may select overly complex models even with infinite data). BIC is consistent (it selects the true model as $n \to \infty$) but not efficient (it may underfit with finite data). There is no free lunch.

Further Reading

  • The original AIC paper: Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723. Read Section III for the entropy-based derivation.
  • The BIC paper: Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-464. Remarkably short at only 4 pages.
  • Accessible comparison: Burnham, K. P. & Anderson, D. R. (2004). Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociological Methods & Research, 33(2), 261-304. Excellent discussion of when to prefer AIC over BIC.
  • Next algorithm to learn: If you're interested in model selection for unsupervised learning, see our post on Gaussian Mixture Models, which uses BIC to choose the number of clusters.

Interactive Tools

  • Overfitting Explorer — Drag a polynomial degree slider and watch the bias-variance tradeoff in real time

Frequently Asked Questions

What is the difference between AIC and BIC?

Both penalise model complexity to prevent overfitting, but they differ in philosophy and penalty strength. AIC uses a fixed penalty of 2 per parameter and is optimised for prediction accuracy. BIC uses a penalty of ln(n) per parameter, which grows with sample size and is designed to identify the true generating model. For any dataset with 8 or more observations, BIC penalises complexity more harshly than AIC.

Can I compare AIC or BIC values across different datasets?

No. AIC and BIC values are only meaningful when comparing models fitted to the exact same dataset. If you change the number of observations, transform the response variable, or subset the data differently between models, the scores are not comparable. Always ensure identical data across all candidate models.

When should I use AICc instead of AIC?

Use the corrected version AICc when the ratio of sample size to parameters (n/k) is below 40. AICc adds an extra penalty term that prevents overfitting with small datasets but vanishes as n grows. Because AICc converges to AIC for large n, some practitioners use it by default regardless of sample size.

Can AIC and BIC be used with non-linear models like neural networks?

Not directly. Both criteria require a well-defined likelihood function and a countable number of parameters. For black-box models such as random forests or deep neural networks, cross-validation is the standard approach to model selection. AIC and BIC work best with linear models, generalised linear models, and other parametric statistical models.

What does it mean when AIC and BIC disagree on the best model?

When the two criteria disagree, AIC will favour the more complex model and BIC will favour the simpler one. This reflects their different goals: AIC prioritises predictive accuracy while BIC prioritises parsimony. If your goal is forecasting, lean toward the AIC-selected model. If your goal is understanding the true data-generating process, lean toward the BIC-selected model.
