You've trained a Random Forest and it works — 85% accuracy out of the box. But you used the default hyperparameters. What if n_estimators=500 with max_features=0.3 and min_samples_leaf=10 pushes that to 91%? Only one way to find out: search.
The problem is combinatorial. Our Random Forest has 4 hyperparameters. If you try 10 values for each in a grid, that's $10^4 = 10{,}000$ combinations. Each combination requires 5-fold cross-validation. That's 50,000 model fits — and we only have 4 dimensions. Neural networks routinely have 10–20 tunable hyperparameters, where exhaustive search is physically impossible.
This post compares three strategies of increasing sophistication:
- Grid Search — try every combination on a predefined grid
- Random Search — sample combinations at random (surprisingly effective)
- Bayesian Optimization — build a model of the objective and use it to choose the next point intelligently
We'll run all three on the same classification task, using the same Random Forest and the same hyperparameter ranges. You'll see that for an easy problem, all three reach ~90% accuracy — but the way they get there reveals fundamentally different philosophies about search. The real payoff of Bayesian optimization comes when evaluations are expensive: training a neural network for hours, running a simulation, or calling a paid API.
If you've read the genetic algorithms post, you've already seen one approach to gradient-free optimization. Hyperparameter optimization is another — and Bayesian optimization is arguably the most elegant solution.
Quick Win: Run All Three Methods
Click the badge to open the notebook and run everything yourself:
Setup: The Problem
We'll classify a synthetic dataset with 2,000 samples, 20 features, and 4 classes — a moderately challenging multi-class problem:
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn import metrics
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical

# Generate dataset (tuned to match the original Kaggle mobile price dataset's ~90% RF accuracy)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=15,
    n_clusters_per_class=1, flip_y=0.01,
    n_classes=4, random_state=42
)

print(f'{X.shape[1]} features, {len(np.unique(y))} classes, {X.shape[0]} samples')
# 20 features, 4 classes, 2000 samples
```
All three methods will tune the same 4 hyperparameters with 5-fold stratified cross-validation:
| Hyperparameter | Type | Range | What it controls |
|---|---|---|---|
| `max_features` | Continuous | [0.1, 1.0] | Fraction of features per split |
| `n_estimators` | Integer | [100, 1000] | Number of trees |
| `min_samples_leaf` | Integer | [5, 25] | Minimum samples in a leaf |
| `criterion` | Categorical | {gini, entropy} | Split quality measure |
Method 1: Grid Search
Grid search evaluates every combination on a predefined grid:
```python
param_grid = {
    'n_estimators': [200, 400, 600, 800],
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [5, 10, 20],
    'max_features': [0.3, 0.5, 0.8]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid=param_grid,
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    verbose=0,
    n_jobs=-1
)
grid_search.fit(X, y)

print(f'Grid Search — Best accuracy: {grid_search.best_score_:.4f}')
print(f'Combinations evaluated: {len(grid_search.cv_results_["mean_test_score"])}')
print(f'Best params: {grid_search.best_params_}')
```
With $4 \times 2 \times 3 \times 3 = 72$ combinations and 5 folds each, that's 360 model fits.
Method 2: Random Search
Random search samples 15 combinations uniformly at random from the ranges:
```python
param_distributions = {
    'n_estimators': np.arange(100, 1001),
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': np.arange(5, 26),
    'max_features': np.linspace(0.1, 1.0, 100)
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
    param_distributions=param_distributions,
    n_iter=15,
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    verbose=0,
    n_jobs=-1,
    random_state=42
)
random_search.fit(X, y)

print(f'Random Search — Best accuracy: {random_search.best_score_:.4f}')
print(f'Combinations evaluated: 15')
print(f'Best params: {random_search.best_params_}')
```
Only 15 combinations, 75 model fits — a fraction of grid search.
Method 3: Bayesian Optimization (Gaussian Process)
Bayesian optimization builds a probabilistic model of the objective and uses it to decide where to evaluate next:
```python
def evaluate_params(params):
    """5-fold stratified CV accuracy for a RandomForest configuration.
    Returns negative accuracy (since gp_minimize minimises)."""
    max_features, n_estimators, min_samples_leaf, criterion = params
    model = RandomForestClassifier(
        max_features=max_features,
        n_estimators=n_estimators,
        min_samples_leaf=min_samples_leaf,
        criterion=criterion,
        n_jobs=-1,
        random_state=42
    )
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    accuracies = []
    for train_idx, val_idx in kf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[val_idx])
        accuracies.append(metrics.accuracy_score(y[val_idx], preds))
    return -np.mean(accuracies)

param_space = [
    Real(0.1, 1.0, prior='uniform', name='max_features'),
    Integer(100, 1000, name='n_estimators'),
    Integer(5, 25, name='min_samples_leaf'),
    Categorical(['gini', 'entropy'], name='criterion')
]

result = gp_minimize(
    evaluate_params,
    dimensions=param_space,
    n_calls=15,
    n_random_starts=10,
    random_state=42,
    verbose=False
)

best_params = dict(zip(
    ['max_features', 'n_estimators', 'min_samples_leaf', 'criterion'],
    result.x
))
print(f'Bayesian (GP) — Best accuracy: {-result.fun:.4f}')
print(f'Evaluations: 15 (10 random + 5 guided)')
print(f'Best params: {best_params}')
```
Same 15 evaluations, but the last 5 are informed by the GP model built from the first 10.
Results Comparison
| Method | Best CV Accuracy | Evaluations | Strategy |
|---|---|---|---|
| Grid Search | ~90% | 72 | Exhaustive (predefined grid) |
| Random Search | ~90% | 15 | Uniform random sampling |
| Bayesian (GP) | ~90% | 15 (10 random + 5 guided) | Model-based sequential |
All three methods converge to approximately the same accuracy — around 90%. For this well-behaved 4-dimensional problem, the objective landscape is smooth enough that random search finds good regions quickly.
The convergence chart tells the real story. Grid search plods through its predefined grid systematically. Random search jumps around but finds a good region early. Bayesian optimization starts random (first 10 points) then accelerates — the GP model narrows in on promising regions.
The punchline: the methods differ most when evaluations are expensive. For a Random Forest on 2,000 samples, each evaluation takes under a second and you can afford to be wasteful. Train a Transformer for 8 hours per evaluation, and the difference between 15 smart evaluations and 72 brute-force ones is the difference between two days and two weeks.
What Just Happened?
All three methods solve the same problem — find the hyperparameter combination that maximises cross-validated accuracy — but they navigate the search space in fundamentally different ways.
Grid Search: The Cartesian Product
Grid search evaluates every point on a regular lattice. If you specify 4 values for n_estimators, 2 for criterion, 3 for min_samples_leaf, and 3 for max_features, you get $4 \times 2 \times 3 \times 3 = 72$ combinations. It's the Cartesian product of your parameter lists.
Think of a tourist exploring a city by visiting every intersection on the street grid. Thorough? Yes. Efficient? Not even close. The problem is that most intersections are uninteresting — the accuracy surface for a Random Forest is usually smooth, so neighbouring grid points give nearly identical results.
The deeper issue is the curse of dimensionality. In $d$ dimensions with $k$ values per dimension, the grid has $k^d$ points. With 10 values per dimension: 2D gives 100 points (fine), 4D gives 10,000 (slow), 10D gives 10 billion (impossible). Grid search is fundamentally limited to low-dimensional problems with coarse grids.
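The growth is easy to check:

```python
# Grid size k^d: 10 candidate values per hyperparameter across d dimensions
k = 10
for d in (2, 4, 10):
    print(f'{d} dimensions: {k ** d:,} grid points')
# 2 dimensions: 100 grid points
# 4 dimensions: 10,000 grid points
# 10 dimensions: 10,000,000,000 grid points
```

Multiply each count by 5 CV folds and the per-fit training time, and the budget collapses fast.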
Random Search: The Bergstra-Bengio Insight
Random search replaces the grid with uniform random samples. This seems like it should be worse — throwing darts blindfolded. But Bergstra and Bengio (2012) showed it's almost always better than grid search for the same computational budget.
The insight is elegant. Suppose only 2 of your 4 hyperparameters actually matter (a common situation). Projected onto those 2 important dimensions, a $4 \times 4$ grid gives 16 points but only 4 unique values per dimension. Random search with 16 points gives you 16 unique values per dimension. You're covering the important dimensions far more densely.
The figure makes this vivid. Grid search evaluates 9 points, but only 3 unique values along each axis. Random search with 9 points explores 9 unique values per axis. Bayesian optimization clusters its samples in the high-accuracy region (top-right), exploring broadly first and then exploiting the best area.
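You can reproduce the counting argument in a few lines (synthetic points here, standing in for the figure's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# 16-point grid: 4 values per axis, crossed
axis = np.linspace(0, 1, 4)
grid = np.array([(a, b) for a in axis for b in axis])

# 16 random points: a fresh value in every coordinate
rand = rng.uniform(0, 1, size=(16, 2))

print(len(np.unique(grid[:, 0])))   # 4 unique values along the important axis
print(len(np.unique(rand[:, 0])))   # 16 unique values along the same axis
```

If only the first axis matters, the grid has effectively evaluated 4 configurations; random search has evaluated 16.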
Bayesian Optimization: Learning from History
Grid and random search are memoryless — each evaluation is independent of the others. The 15th evaluation in random search knows nothing about the first 14.
Bayesian optimization is fundamentally different. It maintains a model of the objective function — a Gaussian process (GP) that predicts what the accuracy will be at any point in the hyperparameter space, along with an uncertainty estimate. After each evaluation, the model updates, and an acquisition function decides where to evaluate next.
The three components:
- Surrogate model (Gaussian Process) — A probabilistic model that interpolates between observed points. At observed points, uncertainty is zero. Far from observations, uncertainty is high. Think of it as drawing a smooth surface through scattered data points, with error bars that widen between points.
- Acquisition function — A cheap-to-evaluate function that balances exploration (high uncertainty) and exploitation (high predicted value). The most common is Expected Improvement (EI): "how much better than the current best do I expect this point to be?"
- Optimize the acquisition function — Find the point that maximises EI (a standard optimization problem, but cheap because the GP is fast to evaluate), then run the expensive evaluation there.
This loop — fit GP, maximize acquisition, evaluate, repeat — is what makes Bayesian optimization sequential and adaptive. If you've read the Bayesian inference post, you'll recognise the pattern: start with a prior (the GP), observe data, update to a posterior, and use the posterior to make decisions. The GP generalises this from updating parameter estimates to updating an entire function estimate.
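Here is the whole loop condensed onto a toy 1D problem — a minimal sketch, not skopt's implementation: the objective `f` is made up, and for simplicity the acquisition function is maximised over a dense candidate grid rather than with a proper optimizer.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    # Toy objective to maximise (stands in for CV accuracy)
    return -(x - 0.6) ** 2 + 0.1 * np.sin(8 * x)

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, 5).reshape(-1, 1)          # 5 initial random evaluations
y = f(X).ravel()
candidates = np.linspace(0, 1, 500).reshape(-1, 1)

for _ in range(10):                               # 10 guided evaluations
    # 1. Fit the GP surrogate to everything observed so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    # 2. Expected Improvement over the current best
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    # 3. Evaluate the objective where EI peaks, then repeat
    x_next = candidates[[np.argmax(ei)]]
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next).ravel())

print(f'best x = {X[np.argmax(y), 0]:.3f}, best f(x) = {y.max():.4f}')
```

The three numbered comments are the three components above: surrogate, acquisition, evaluation.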
Going Deeper
The GP Surrogate in Action
The animation shows Bayesian optimization on a 1D function. The blue line is the true (unknown) objective. The black line is the GP's mean prediction, with the shaded region showing the 95% confidence interval. The green area below is the Expected Improvement — it peaks where the model expects to find values better than the current best.
Watch how the GP evolves:
- Early frames: Few observations, wide uncertainty, EI spreads across unexplored regions (exploration)
- Middle frames: The GP starts to capture the function's shape, uncertainty shrinks near data points
- Late frames: EI concentrates around the global optimum as the model becomes confident elsewhere (exploitation)
GP Predictive Distribution
At any unobserved point $\mathbf{x}_*$, the GP predicts a Gaussian distribution:

$$f(\mathbf{x}_*) \mid \{(\mathbf{x}_i, y_i)\}_{i=1}^n \sim \mathcal{N}\big(\mu(\mathbf{x}_*),\ \sigma^2(\mathbf{x}_*)\big)$$

Where:
- $\mu(\mathbf{x}_*)$ — the mean prediction, interpolating nearby observations
- $\sigma^2(\mathbf{x}_*)$ — the predictive variance: high far from data, zero at observed points
- The predictions are conditioned on all previous observations $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$

The mean and variance are computed in closed form from the kernel function (typically Matérn 5/2 in skopt):

$$\mu(\mathbf{x}_*) = \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}, \qquad \sigma^2(\mathbf{x}_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*$$

Where $\mathbf{K}$ is the kernel matrix between all observed points, $\mathbf{k}_*$ is the kernel vector between $\mathbf{x}_*$ and observed points, and $\sigma_n^2$ is the observation noise.
In plain English: the GP prediction at a new point is a weighted average of observed values, where the weights come from how "similar" (in kernel space) the new point is to each observation.
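The closed-form prediction is a few lines of NumPy. This sketch uses an RBF kernel and made-up 1D data (skopt defaults to Matérn 5/2, but the algebra is identical):

```python
import numpy as np

def rbf(a, b, ls=0.3):
    # Squared-exponential kernel between two sets of 1D points
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

X_obs = np.array([0.1, 0.4, 0.8])     # observed inputs
y_obs = np.array([0.2, 0.9, 0.5])     # observed values
sigma_n2 = 1e-6                       # tiny observation noise

K = rbf(X_obs, X_obs) + sigma_n2 * np.eye(len(X_obs))   # kernel matrix K

def gp_predict(x_star):
    x_star = np.atleast_1d(x_star)
    ks = rbf(X_obs, x_star)                              # kernel vector k*
    mu = ks.T @ np.linalg.solve(K, y_obs)                # posterior mean
    var = rbf(x_star, x_star) - ks.T @ np.linalg.solve(K, ks)  # posterior variance
    return mu.item(), var.item()

print(gp_predict(0.4))   # at an observed point: mean ≈ 0.9, variance ≈ 0
print(gp_predict(0.6))   # between points: interpolated mean, positive variance
```

Note how the mean is literally a kernel-weighted combination of `y_obs` — the "weighted average" described above.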
Acquisition Functions: Deciding Where to Look Next
The acquisition function turns the GP's prediction into a decision: where should we evaluate next? Three common choices:
Expected Improvement (EI) — the default in skopt:

$$\mathrm{EI}(\mathbf{x}) = \mathbb{E}\big[\max\big(0,\ f(\mathbf{x}) - f(\mathbf{x}^+)\big)\big]$$
Where $f(\mathbf{x}^+)$ is the best value observed so far. EI asks: "in expectation, how much will this point improve on the current best?" Points with high predicted mean and high uncertainty score well — this naturally balances exploitation and exploration.
Lower Confidence Bound (LCB):

$$\mathrm{LCB}(\mathbf{x}) = \mu(\mathbf{x}) - \kappa\,\sigma(\mathbf{x})$$
Minimize the mean minus a multiple of the standard deviation. The parameter $\kappa$ controls the exploration–exploitation trade-off directly: higher $\kappa$ means more exploration.
Probability of Improvement (PI):

$$\mathrm{PI}(\mathbf{x}) = P\big(f(\mathbf{x}) > f(\mathbf{x}^+)\big) = \Phi\!\left(\frac{\mu(\mathbf{x}) - f(\mathbf{x}^+)}{\sigma(\mathbf{x})}\right)$$
The probability that a point improves on the current best. Simpler than EI, but tends to exploit too aggressively — it doesn't care how much better a point might be, only whether it's better at all.
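All three are one-liners given the GP's $\mu$ and $\sigma$ at a candidate point. A sketch using scipy (the numbers below are made up):

```python
from scipy.stats import norm

# Acquisition functions from a GP's predicted mean mu and std sigma
# at a candidate point (maximisation convention, best = f(x+))
def expected_improvement(mu, sigma, best):
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def lower_confidence_bound(mu, sigma, kappa=1.96):
    return mu - kappa * sigma          # minimise this

def probability_of_improvement(mu, sigma, best):
    return norm.cdf((mu - best) / sigma)

best = 0.90
# Two candidates with the same mean: EI rewards the more uncertain one
print(expected_improvement(0.89, 0.05, best) > expected_improvement(0.89, 0.01, best))  # True
print(round(probability_of_improvement(0.91, 0.02, best), 2))                           # 0.69
```

The first print is the exploration bonus in action: both points are predicted slightly below the current best, but the uncertain one still has a real chance of beating it.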
Exploration vs Exploitation
This tension appears everywhere in machine learning. In Q-learning, epsilon-greedy balances trying new actions (exploration) with choosing the best-known action (exploitation). In MCMC, the proposal distribution must explore the parameter space while spending time in high-probability regions.
Bayesian optimization handles this automatically through the acquisition function. EI naturally favours:
- Points with high predicted accuracy (exploitation)
- Points with high uncertainty (exploration)
No manual schedule needed — the GP's uncertainty estimates do the work.
When NOT to Use Bayesian Optimization
Bayesian optimization isn't always the right tool:
- Cheap evaluations — If each evaluation takes seconds (like our Random Forest), random search with 100 iterations is simpler and nearly as effective
- High dimensions ($d > 20$) — GPs scale poorly with dimensionality. The kernel becomes uninformative and the acquisition function has too many local optima
- Massive parallelism — If you have 1,000 GPUs, you can evaluate 1,000 random configurations simultaneously. Bayesian optimization is inherently sequential (each evaluation depends on all previous ones)
- Discrete/conditional spaces — GPs assume smooth, continuous objectives. Deeply nested conditional hyperparameters (e.g., "layer type" → "layer-specific params") are better handled by tree-based methods like Optuna's TPE
Rule of thumb: Use Bayesian optimization when each evaluation costs minutes to hours (neural network training, simulation runs, expensive API calls) and you're in 5–15 dimensions.
Hyperparameters
All values come directly from the original code in quant_code/python/HyperParameter_Optimization/src/:
| Parameter | Source file | Value | Why |
|---|---|---|---|
| `max_features` | `gp_min_optim.py` | Real(0.1, 1.0) | Fraction of features per split |
| `n_estimators` | `gp_min_optim.py` | Integer(100, 1000) | Number of trees in the forest |
| `min_samples_leaf` | `gp_min_optim.py` | Integer(5, 25) | Minimum leaf samples (regularisation) |
| `criterion` | `grid_n_random_search.py` | {gini, entropy} | Split quality metric |
| `n_calls` | `gp_min_optim.py` | 15 | Total evaluations for GP |
| `n_random_starts` | `gp_min_optim.py` | 10 | Initial random exploration |
| `cv` | all files | 5-fold stratified | Evaluation protocol |
Deep Dive: The Papers
Bergstra & Bengio (2012) — Random Search for Hyper-Parameter Optimization
The key paper that changed how practitioners think about hyperparameter search. Random Search for Hyper-Parameter Optimization (JMLR) demonstrated that random search is more efficient than grid search for most practical problems.
The core theorem is deceptively simple. Define the effective dimensionality of a search problem as the number of hyperparameters that significantly affect performance. Bergstra and Bengio showed:
"Random search is more efficient than grid search because it allows each trial to explore a different value of every hyperparameter. For problems with low effective dimensionality, this results in dramatically better coverage of the important dimensions."
Their famous figure (reproduced in our sampling comparison above) shows a 2D search where only one dimension matters. Grid search with 9 points wastes 6 of them evaluating the same 3 values of the important dimension. Random search with 9 points gives 9 unique values along the important dimension.
The practical implication: for the same computational budget, random search achieves equal or better results than grid search in virtually all cases. There is essentially no scenario where grid search is preferable.
Snoek, Larochelle & Adams (2012) — Practical Bayesian Optimization
Practical Bayesian Optimization of Machine Learning Algorithms (NeurIPS) introduced the Spearmint system and demonstrated that Bayesian optimization could match or beat human experts at tuning neural networks.
Their algorithm:
```
Input: Search space S, objective function f, budget N
1. Initialize with n₀ random evaluations
2. For i = n₀+1 to N:
   a. Fit GP to all observations {(x₁,y₁), ..., (xᵢ₋₁,yᵢ₋₁)}
   b. Find xᵢ = argmax_x EI(x) using the GP
   c. Evaluate yᵢ = f(xᵢ)
3. Return x with best observed y
```
This is exactly what skopt.gp_minimize implements. The key contributions beyond earlier work:
- Automatic Relevance Determination (ARD) kernels — learns which hyperparameters matter most
- Integrated acquisition function — marginalises over GP hyperparameters rather than fixing them
- Pending evaluations — supports parallel evaluations through "fantasised" outcomes
Snoek et al. demonstrated that Bayesian optimization with 30 evaluations outperformed a human expert who had months to tune the same neural networks. The quote that launched a thousand AutoML papers:
"Bayesian optimization spent 2 minutes of overhead in exchange for saving the researchers days of manual tuning."
Mockus et al. (1978) & Jones et al. (1998) — The Origins
Bayesian optimization predates machine learning. Mockus, Tiesis, and Žilinskas (1978) formalised the idea of using a Bayesian model to guide sequential optimization in their work on Bayesian Methods for Seeking the Extremum. They introduced the Expected Improvement criterion, proving it is the optimal policy for a one-step lookahead under certain assumptions.
Two decades later, Jones, Schonlau, and Welch (1998) brought these ideas to engineering design optimization with the EGO (Efficient Global Optimization) algorithm. Their paper Efficient Global Optimization of Expensive Black-Box Functions established the GP + EI framework that Snoek et al. later applied to ML hyperparameters.
Historical Timeline
- Mockus et al. (1978) — Bayesian optimization and Expected Improvement
- Jones et al. (1998) — EGO: GP + EI for engineering design
- Hutter et al. (2011) — SMAC: random forests as surrogate instead of GPs (scales to high dimensions)
- Bergstra et al. (2011) — Hyperopt and TPE: tree-structured Parzen estimators (handles conditional spaces)
- Bergstra & Bengio (2012) — The random search paper
- Snoek et al. (2012) — Practical Bayesian optimization (Spearmint)
- Akiba et al. (2019) — Optuna: define-by-run API, pruning, modern TPE (now the most popular framework)
Modern Alternatives: Optuna and Hyperopt
The original source code also includes implementations in Optuna and Hyperopt. Both use Tree-structured Parzen Estimators (TPE) instead of Gaussian Processes:
```python
# Optuna (from optuna_optim.py) — define-by-run API
import optuna

def objective(trial):
    criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
    n_estimators = trial.suggest_int('n_estimators', 100, 1000)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 5, 25)
    max_features = trial.suggest_float('max_features', 0.1, 1.0)
    model = RandomForestClassifier(
        n_estimators=n_estimators, criterion=criterion,
        max_features=max_features, min_samples_leaf=min_samples_leaf,
        n_jobs=-1, random_state=42
    )
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in kf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        scores.append(metrics.accuracy_score(y[val_idx], model.predict(X[val_idx])))
    return -np.mean(scores)

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=15)
```
TPE models $P(\mathbf{x} | y < y^*)$ and $P(\mathbf{x} | y \geq y^*)$ separately (two Parzen estimators), then selects points that maximise the ratio. This handles conditional and discrete parameters more naturally than GPs.
When to use which:
- skopt (GP) — smooth, low-dimensional spaces; small evaluation budgets
- Optuna (TPE) — large search spaces; conditional parameters; early pruning of bad trials
- Hyperopt (TPE) — similar to Optuna, but older API; still widely used
Further Reading
- Bergstra & Bengio (2012) — Random Search for Hyper-Parameter Optimization — the paper that made random search the default
- Snoek et al. (2012) — Practical Bayesian Optimization of Machine Learning Algorithms — the Spearmint paper
- Jones et al. (1998) — Efficient Global Optimization of Expensive Black-Box Functions — the EGO algorithm
- Akiba et al. (2019) — Optuna: A Next-generation Hyperparameter Optimization Framework — the most popular modern framework
- Shahriari et al. (2016) — Taking the Human Out of the Loop: A Review of Bayesian Optimization — comprehensive survey
Interactive Tools
- Regression Playground — Experiment with model complexity and see how different hyperparameters affect the fit
- Overfitting Explorer — Visualise the bias-variance tradeoff that hyperparameter tuning aims to optimise
Related Posts
- Genetic Algorithms from Scratch — Another gradient-free optimizer for black-box functions, using evolutionary strategies instead of surrogate models
- From MLE to Bayesian Inference — The Gaussian Process generalises Bayesian updating from parameters to entire functions
- MCMC Island Hopping — Exploration vs exploitation in sampling — the same tension that drives acquisition functions
- MLE from Scratch — Cross-validated accuracy is a likelihood proxy: we're maximising the probability that the model explains held-out data
Try It Yourself
The interactive notebook includes exercises:
- More iterations — Increase `n_calls` to 50 for the Bayesian optimizer. Does the extra budget find meaningfully better configurations, or does it plateau?
- Higher dimensions — Add `max_depth` (Integer, 5–50) and `min_samples_split` (Integer, 2–20) as hyperparameters. How do the methods scale with 6 dimensions?
- Optuna comparison — Run the Optuna TPE sampler alongside GP-based optimization. Compare convergence curves. Does TPE find good configurations faster?
- Acquisition function sweep — Try `acq_func='LCB'` and `acq_func='PI'` in `gp_minimize`. How does the exploration–exploitation balance change?
- Simulated expensive evaluations — Add `time.sleep(2)` inside `evaluate_params` to simulate expensive evaluations. Now the wall-clock difference between 15 evaluations (Bayesian) and 72 (grid) becomes tangible
The three methods represent a progression in sophistication: grid search makes no assumptions, random search exploits the low effective dimensionality of most problems, and Bayesian optimization builds a model to make each evaluation count. For cheap models like Random Forests, random search is usually sufficient. But when each evaluation costs real time — training a Transformer, running a physics simulation, querying a paid API — the model-based approach of Bayesian optimization can save days of compute. The cost of fitting a GP is negligible compared to hours of training, and even 5 guided evaluations out of 15 total can find configurations that random search might need 50 iterations to reach.
Frequently Asked Questions
What is hyperparameter optimisation and why does it matter?
Hyperparameters are settings you choose before training a model (learning rate, tree depth, regularisation strength) that cannot be learned from the data. Poor choices lead to underfitting or overfitting. Systematic optimisation finds the best combination, often improving model performance significantly compared to manual tuning.
When should I use random search instead of grid search?
Almost always. Random search is more efficient because it explores more unique values of each hyperparameter. Grid search wastes evaluations on unimportant parameter combinations, especially when only one or two hyperparameters actually matter. Random search achieves the same or better results with fewer evaluations in most practical settings.
What is Bayesian optimisation and when is it worth the overhead?
Bayesian optimisation builds a probabilistic model (typically a Gaussian process) of the objective function and uses it to choose the next hyperparameters to evaluate intelligently. It is worth the overhead when each evaluation is expensive (training takes hours) and the search space has fewer than about 20 dimensions. For cheap evaluations, random search is often sufficient.
How many hyperparameter evaluations should I run?
For random search, 60 evaluations is usually enough to find a good configuration if only 1–3 hyperparameters matter (there is a 95% chance of sampling within the top 5% of the space). For Bayesian optimisation, 20–50 evaluations often suffice due to the intelligent search strategy. Scale up for higher-dimensional spaces.
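The 95% figure is a one-line calculation: the chance that at least one of $n$ uniform samples lands in the top 5% of the space is $1 - 0.95^n$:

```python
# P(at least one of n uniform random samples lands in the top 5%)
for n in (10, 30, 60):
    print(n, round(1 - 0.95 ** n, 3))
# 10 0.401
# 30 0.785
# 60 0.954
```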
Should I use cross-validation during hyperparameter optimisation?
Yes. Evaluating hyperparameters on a single train-test split is noisy and can lead to overfitting the validation set. K-fold cross-validation gives more reliable performance estimates. Use 5-fold as a default, or 3-fold if training is expensive. Always keep a final held-out test set that is never used during optimisation.
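A minimal sketch of that protocol on synthetic data (the search space here is hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Split off the test set FIRST; the search only ever sees the training
# portion, and the test set is scored exactly once at the end.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={'min_samples_leaf': np.arange(1, 21)},
    n_iter=5, cv=5, scoring='accuracy', random_state=0
)
search.fit(X_tr, y_tr)        # 5-fold CV happens inside the training set

print(f'CV estimate:  {search.best_score_:.3f}')
print(f'Test (final): {search.score(X_te, y_te):.3f}')
```

If the two numbers diverge sharply, the search has started overfitting the validation folds.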