Imagine you're planning a road trip through 25 cities. The number of distinct round trips is (25 − 1)!/2 — roughly 3.1 × 10²³, more than the estimated number of stars in the observable universe. You can't try them all. And unlike fitting a line with gradient descent, there's no gradient to follow — the search space is discrete and rugged.
Nature solved this class of problem billions of years ago. Evolution doesn't need gradients. It works through variation and selection: generate diverse candidates, keep the ones that work, combine their traits, add random mutations, and repeat. Genetic algorithms (GAs) formalise this process into a general-purpose optimiser.
By the end of this post, you'll implement GAs from scratch for two problems: fitting a regression line (where we can verify the answer) and the Travelling Salesman Problem (where we can't). You'll understand the core operators — selection, crossover, and mutation — and see how a single evolutionary loop solves fundamentally different problems.
This tutorial draws on Holland's foundational Adaptation in Natural and Artificial Systems (1975) and Whitley's excellent A Genetic Algorithm Tutorial (1994).
Quick Win: Evolving a Line
Let's see evolution in action on a simple problem. We have 50 data points generated from $y = 3x + 8$ (with some noise), and we want to evolve the coefficients $(a, b)$ that best fit the data — no calculus required.
The GA starts with 100 random $(a, b)$ pairs — wild guesses from $[-10, 10]$. Each generation, the best-fitting individuals survive and breed. Within ~20 generations, the population converges to the true parameters.
```python
import numpy as np

def ga_line_fit(x, y, pop_size=100, n_elites=20, mutation_rate=0.1,
                generations=50):
    """Evolve (a, b) to minimise RMSE for y = ax + b."""
    pop = np.random.uniform(-10, 10, size=(pop_size, 2))
    history = []
    for gen in range(generations):
        # FITNESS: how well each individual fits the data
        preds = pop[:, 0:1] * x + pop[:, 1:2]
        rmse = np.sqrt(((preds - y) ** 2).mean(axis=1))
        fitness = 1.0 / (rmse + 1e-10)
        best_idx = np.argmax(fitness)
        history.append((pop[best_idx, 0], pop[best_idx, 1], rmse[best_idx]))

        # SELECTION: keep elite performers + sample by fitness
        ranked = np.argsort(-fitness)
        elites = pop[ranked[:n_elites]]
        probs = fitness / fitness.sum()
        parents = pop[np.random.choice(pop_size, pop_size - n_elites, p=probs)]

        # CROSSOVER + MUTATION: average parents, add Gaussian noise
        children = np.empty_like(parents)
        for i in range(len(parents)):
            child = (parents[i] + parents[-(i + 1)]) / 2
            child *= 1 + mutation_rate * np.random.randn(2)
            children[i] = child
        pop = np.vstack([elites, children])
    return history

# Generate noisy data: y = 3x + 8
np.random.seed(42)
x = np.random.rand(50)
y = 3 * x + 8 + 0.3 * np.random.randn(50)

history = ga_line_fit(x, y)
a, b, rmse = history[-1]
print(f"Evolved: y = {a:.2f}x + {b:.2f} (RMSE: {rmse:.4f})")
print("True: y = 3.00x + 8.00")

# Evolved: y = 2.92x + 8.03 (RMSE: 0.2722)
# True: y = 3.00x + 8.00
```
The result: Starting from 100 random guesses, the GA converges to $a \approx 2.92$, $b \approx 8.03$ — close to the true values. No gradients, no closed-form solution — just evolution.
Scaling Up: The Travelling Salesman
Now let's tackle a problem where gradients don't exist. Given 25 cities at random positions, find the shortest route that visits every city exactly once and returns to the start.
The representation changes everything. A line is defined by two real numbers — you can average them. A route is a permutation of cities — you can't average city 3 and city 7 to get city 5. This means we need different crossover and mutation operators.
```python
import numpy as np

def ordered_crossover(p1, p2):
    """Take a slice from parent 1, fill gaps from parent 2's order."""
    n = len(p1)
    start, end = sorted(np.random.choice(n, 2, replace=False))
    child = -np.ones(n, dtype=int)
    child[start:end] = p1[start:end]
    fill = [g for g in p2 if g not in child[start:end]]
    child[child == -1] = fill
    return child

def swap_mutate(route, rate):
    """Randomly swap cities with given probability per position."""
    route = route.copy()
    for i in range(len(route)):
        if np.random.random() < rate:
            j = np.random.randint(len(route))
            route[i], route[j] = route[j], route[i]
    return route

def ga_tsp(cities, pop_size=100, n_elites=20, mutation_rate=0.01,
           generations=500):
    """Evolve shortest route through all cities."""
    n = len(cities)

    def total_distance(route):
        shifts = np.roll(route, -1)
        return np.sqrt(((cities[route] - cities[shifts]) ** 2).sum(axis=1)).sum()

    pop = [np.random.permutation(n) for _ in range(pop_size)]
    history = []
    for gen in range(generations):
        distances = np.array([total_distance(r) for r in pop])
        fitness = 1.0 / distances
        history.append((pop[np.argmin(distances)].copy(), distances.min()))

        # Selection: elites + fitness-proportionate
        ranked = np.argsort(-fitness)
        elites = [pop[i].copy() for i in ranked[:n_elites]]
        probs = fitness / fitness.sum()
        idx = np.random.choice(pop_size, pop_size - n_elites, p=probs)
        np.random.shuffle(idx)

        # Crossover + mutation
        children = []
        for i in range(0, len(idx) - 1, 2):
            for p, q in [(idx[i], idx[i + 1]), (idx[i + 1], idx[i])]:
                children.append(swap_mutate(
                    ordered_crossover(pop[p], pop[q]), mutation_rate))
        children = children[:pop_size - n_elites]
        pop = elites + children
    return history

# 25 random cities
np.random.seed(42)
cities = np.random.rand(25, 2) * 200

history = ga_tsp(cities)
best_route, best_dist = history[-1]
print(f"Initial distance: {history[0][1]:.0f}")
print(f"Evolved distance: {best_dist:.0f}")

# Initial distance: 2043
# Evolved distance: 978
```
The result: The route distance drops from 2043 to 978 — a 52% improvement. Generation by generation, evolution untangles the route as the chaotic crossings disappear.
What Just Happened?
Both problems follow the same evolutionary loop:
- Create a random population
- Evaluate fitness (how good is each individual?)
- Select the fittest parents
- Breed offspring via crossover
- Mutate for diversity
- Repeat
The crucial difference is the representation — how we encode a solution as a "chromosome." This determines which operators make sense.
| | Line Fitting | TSP |
|---|---|---|
| Individual | $(a, b)$ pair | City permutation |
| Gene | Real number | City index |
| Example | `[3.2, 7.9]` | `[0, 14, 3, 21, ...]` |
| Crossover | Average parents | Ordered crossover |
| Mutation | Gaussian noise | Swap two cities |
Fitness: Measuring Quality
Fitness must be higher for better solutions. Both problems use inverse loss:

$$\text{fitness} = \frac{1}{\text{loss} + \varepsilon}$$

where the loss is RMSE for line fitting and total route distance for TSP (the small $\varepsilon$ guards against division by zero). A line with RMSE 0.1 (fitness 10) is ten times "fitter" than one with RMSE 1.0 (fitness 1). This ratio drives the selection pressure.
Selection: Survival of the Fittest
We combine two mechanisms:
```python
# Keep the top 20 unchanged (elitism)
elites = pop[ranked[:n_elites]]

# Fill remaining slots by fitness-proportionate sampling
probs = fitness / fitness.sum()
parents = pop[np.random.choice(pop_size, pop_size - n_elites, p=probs)]
```
Elitism guarantees we never lose our best solution. Roulette wheel selection gives fitter individuals a higher probability of being chosen as parents, but still allows less-fit individuals a chance — maintaining diversity.
Crossover: Where Representation Matters
This is the key insight. For line fitting, crossover is trivial — average the parents' genes:
```python
child = (parent1 + parent2) / 2  # [3.5, 7.2] + [2.8, 8.1] → [3.15, 7.65]
```
For TSP, you can't average permutations — city 3.5 doesn't exist. Instead, we use ordered crossover (OX), which preserves relative ordering from both parents:
- Copy a random slice from Parent 1
- Fill remaining positions with cities from Parent 2, in order
```
Parent 1:                     [0, 1, 2, 3, 4, 5, 6, 7]
Parent 2:                     [2, 5, 0, 7, 1, 3, 6, 4]

Slice (pos 3-6):              [_, _, _, 3, 4, 5, _, _]
Fill from P2 (skip 3, 4, 5):  [2, 0, 7, 1, 6]
Child:                        [2, 0, 7, 3, 4, 5, 1, 6]  ← valid permutation
```
The child inherits a contiguous segment from Parent 1 and the relative ordering of remaining cities from Parent 2.
Mutation: Random Exploration
Without mutation, the population would eventually become clones of the fittest individual — stuck in a local optimum. Mutation injects diversity:
- Line fitting: Gaussian noise — `child *= 1 + rate * N(0, 1)`
- TSP: swap two random cities — `route[i], route[j] = route[j], route[i]`
The mutation rate controls the exploration-exploitation trade-off. Too high and you're doing random search. Too low and you converge prematurely.
Going Deeper
Hyperparameters
| Parameter | What it controls | Line fitting | TSP |
|---|---|---|---|
| `pop_size` | Search breadth | 100 | 100 |
| `n_elites` | Exploitation strength | 20 (20%) | 20 (20%) |
| `mutation_rate` | Exploration strength | 0.1 (10%) | 0.01 (1%) |
| `generations` | Computation budget | 50 | 500 |
Notice the mutation rate difference: line fitting uses 10% because Gaussian noise is gentle (multiplicative perturbation), while TSP uses 1% because each swap can dramatically change the route.
GA vs Gradient Descent
For the line-fitting problem, we can compare GA with gradient descent directly:
```python
def gd_line_fit(x, y, lr=0.1, iterations=50):
    """Gradient descent to fit y = ax + b."""
    a, b = 0.0, 0.0
    history = []
    for _ in range(iterations):
        preds = a * x + b
        error = preds - y
        a -= lr * 2 * (error * x).mean()
        b -= lr * 2 * error.mean()
        history.append((a, b, np.sqrt((error ** 2).mean())))
    return history
```
GD converges in ~15 iterations. GA takes ~30 generations with a population of 100 — that's 3,000 fitness evaluations vs 15. When gradients exist, use them. GA shines when they don't: combinatorial problems (TSP), black-box functions, or multi-modal landscapes where gradient descent gets stuck in local optima.
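To try the comparison yourself, here is a self-contained driver (my own harness, regenerating the same noisy data as in the Quick Win section; the exact RMSE depends on the seed and iteration count):

```python
import numpy as np

def gd_line_fit(x, y, lr=0.1, iterations=50):
    """Gradient descent to fit y = ax + b (same function as in the post)."""
    a, b = 0.0, 0.0
    history = []
    for _ in range(iterations):
        error = a * x + b - y          # residuals at current (a, b)
        a -= lr * 2 * (error * x).mean()  # d(MSE)/da
        b -= lr * 2 * error.mean()        # d(MSE)/db
        history.append((a, b, np.sqrt((error ** 2).mean())))
    return history

# Same data-generating process as the GA example: y = 3x + 8 + noise
np.random.seed(42)
x = np.random.rand(50)
y = 3 * x + 8 + 0.3 * np.random.randn(50)

history = gd_line_fit(x, y)
a, b, rmse = history[-1]
print(f"GD: y = {a:.2f}x + {b:.2f} (RMSE: {rmse:.4f})")
```

Each gradient-descent step costs one pass over the data, versus one fitness evaluation per individual per generation for the GA, which is where the 15 vs 3,000 gap comes from.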
When NOT to Use Genetic Algorithms
- Differentiable objectives — gradient descent is orders of magnitude faster
- Small search spaces — just enumerate all solutions
- Need optimality guarantees — GA only guarantees "good" solutions, not the best
- Tight compute budget — evaluating the entire population each generation is expensive
Alternatives:
- Gradient descent for differentiable problems
- Simulated annealing for combinatorial optimisation with a single solution
- Bayesian optimisation for expensive black-box functions
- Ant colony optimisation for routing problems specifically
Deep Dive: The Papers
Holland (1975)
John Holland introduced genetic algorithms in Adaptation in Natural and Artificial Systems, published by the University of Michigan Press. Holland was interested in a fundamental question: how do complex adaptive systems — biological organisms, economies, ecosystems — improve over time?
His key insight was that adaptation can be studied as an algorithm. By abstracting the essentials of natural selection into a computational framework, he showed that a simple loop of selection, crossover, and mutation could solve optimisation problems across domains.
The Schema Theorem
Holland's most famous theoretical result is the Schema Theorem, sometimes called the Fundamental Theorem of Genetic Algorithms.
A schema is a template describing a subset of individuals. In a binary GA, the schema 1**0*1 matches any individual with a 1 in positions 0 and 5 and a 0 in position 3 (where * is a wildcard).
The theorem states:
Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations.
Formally:

$$m(H, t+1) \geq m(H, t) \cdot \frac{f(H)}{\bar{f}} \left[ 1 - p_c \frac{\delta(H)}{l - 1} - o(H)\, p_m \right]$$

Where:

- $m(H, t)$ = instances of schema $H$ at generation $t$
- $f(H) / \bar{f}$ = schema fitness relative to population average
- $\delta(H)$ = defining length (distance between outermost fixed positions)
- $o(H)$ = order (number of fixed positions)
- $p_c, p_m$ = crossover and mutation probabilities
- $l$ = chromosome length
In plain English: patterns that are short (small $\delta$), simple (small $o$), and above average ($f(H) > \bar{f}$) spread exponentially through the population. This is why crossover works — it recombines short, successful building blocks.
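To make schemata concrete, here is a small illustrative sketch (my own, not from Holland) that checks schema membership and counts $m(H, t)$ for a toy binary population:

```python
def matches(schema, individual):
    """True if the individual matches the schema (* is a wildcard)."""
    return all(s == '*' or s == bit for s, bit in zip(schema, individual))

def schema_count(schema, population):
    """m(H, t): number of instances of schema H in the population."""
    return sum(matches(schema, ind) for ind in population)

pop = ['100011', '110001', '101101', '000001']
print(schema_count('1**0*1', pop))  # → 2 ('100011' and '110001' match)
```

Tracking such counts across generations is exactly what the Schema Theorem bounds from below.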
Whitley (1994)
Darrell Whitley's A Genetic Algorithm Tutorial in Statistics and Computing provides an accessible theoretical treatment:
"A genetic algorithm is a model of machine learning which derives its behavior from a metaphor of the processes of evolution in nature. This is done by the creation within a machine of a population of individuals represented by chromosomes."
Whitley explains why different representations need different operators — exactly the issue we encountered with line fitting vs TSP. For permutation problems, he discusses the importance of operators that preserve feasibility (like our ordered crossover).
Our Implementation vs the Framework
| GA concept | Line fitting | TSP |
|---|---|---|
| Chromosome | $(a, b)$ pair | City permutation |
| Fitness function | $1 / \text{RMSE}$ | $1 / \text{distance}$ |
| Selection | Elitism + roulette wheel | Elitism + roulette wheel |
| Crossover | Arithmetic average | Ordered crossover (OX) |
| Mutation | Gaussian perturbation | Swap mutation |
Historical Context
- Turing (1950) — Proposed "genetical or evolutionary search" for machine learning
- Rechenberg (1965) — Evolution strategies for engineering optimisation
- Holland (1975) — Formal GA framework with the Schema Theorem
- De Jong (1975) — First systematic empirical study of GA behaviour
- Goldberg (1989) — Popularised GAs in engineering with practical examples
- Koza (1992) — Extended GAs to genetic programming (evolving programs)
Further Reading
- Holland (1975) — The foundational book on genetic algorithms
- Whitley (1994) — Excellent tutorial in Statistics and Computing
- Goldberg (1989) — Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley
- Mitchell (1996) — An Introduction to Genetic Algorithms, MIT Press
- Bishop's PRML Section 14.2 — Modern coverage of evolutionary methods
Interactive Tools
- Overfitting Explorer — See how model complexity affects generalisation, the same tradeoff GAs navigate when fitting functions
Related Posts
- Backpropagation Demystified — gradient-based optimisation (the alternative for differentiable problems)
- The EM Algorithm — iterative optimisation with convergence guarantees
- From K-Means to GMM — another iterative optimisation algorithm
- Maximum Likelihood Estimation — parameter estimation (what GA replaces here)
Try It Yourself
The interactive notebook includes exercises:
- Noise sensitivity — Add increasing noise to the line data. How does GA cope compared to MLE?
- More cities — Try 50 or 100 cities in the TSP. How does convergence change?
- Tournament selection — Replace roulette wheel with tournament selection (pick k random individuals, keep the best). Compare convergence speed.
- GA for neural nets — Use GA instead of backprop to train the XOR network from the backpropagation post. How many generations does it take?
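As a starting point for the tournament-selection exercise, here is a minimal sketch (the function name and the toy fitness below are my own):

```python
import numpy as np

def tournament_select(pop, fitness, k=3):
    """Pick k random individuals, return a copy of the fittest one."""
    contenders = np.random.choice(len(pop), k, replace=False)
    winner = contenders[np.argmax(fitness[contenders])]
    return pop[winner].copy()

# Example: select one parent from a toy (a, b) population,
# scoring individuals by closeness to the true parameters (3, 8)
np.random.seed(42)
toy_pop = np.random.uniform(-10, 10, size=(100, 2))
fit = 1.0 / (np.abs(toy_pop[:, 0] - 3) + np.abs(toy_pop[:, 1] - 8) + 1e-10)
parent = tournament_select(toy_pop, fit)
```

Unlike roulette wheel, tournament selection depends only on fitness *rank*, not magnitude, so it behaves the same whether your fitness values span [0.9, 1.1] or [1, 1000].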
Understanding genetic algorithms opens the door to evolutionary computation — a family of methods including evolution strategies, genetic programming, and neuroevolution. The core insight — that selection and variation can optimise anything with a fitness function — is one of the most versatile ideas in computer science.
Frequently Asked Questions
What is a genetic algorithm and when should I use it?
A genetic algorithm (GA) is an optimisation method inspired by natural selection. It maintains a population of candidate solutions that evolve through selection, crossover, and mutation over generations. Use GAs when you have a complex optimisation problem with no useful gradient information, many local optima, or a discrete/combinatorial search space.
What is the difference between genetic algorithms and gradient descent?
Gradient descent follows the gradient of the objective function to find a minimum, requiring the function to be differentiable. GAs are gradient-free: they only need to evaluate a fitness function, so they work on non-differentiable, discrete, or black-box problems. GAs explore more broadly but converge more slowly than gradient descent for smooth problems.
How do I choose the population size and mutation rate?
Larger populations explore more of the search space but are slower per generation. Start with 50-200 individuals for most problems. Mutation rate controls exploration: too low and the population converges prematurely; too high and it becomes random search. Typical rates are 1-5% per gene. Tune both by observing convergence curves.
What is crossover and why does it help?
Crossover combines parts of two parent solutions to create offspring, exploiting the idea that good solutions may share useful building blocks. Single-point crossover splits parents at one random position; uniform crossover mixes genes independently. Crossover accelerates search by combining useful traits from different parts of the population.
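Both variants above are a few lines each for binary chromosomes. A sketch, assuming NumPy arrays of 0s and 1s:

```python
import numpy as np

def single_point_crossover(p1, p2):
    """Split both parents at one random point and join head of p1 to tail of p2."""
    point = np.random.randint(1, len(p1))
    return np.concatenate([p1[:point], p2[point:]])

def uniform_crossover(p1, p2):
    """Pick each gene independently from either parent with equal probability."""
    mask = np.random.random(len(p1)) < 0.5
    return np.where(mask, p1, p2)

p1 = np.array([1, 1, 1, 1, 1, 1])
p2 = np.array([0, 0, 0, 0, 0, 0])
print(single_point_crossover(p1, p2))  # e.g. [1 1 1 0 0 0]
print(uniform_crossover(p1, p2))       # e.g. [1 0 1 1 0 0]
```

Neither operator is safe for permutations, which is why the TSP section uses ordered crossover instead.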
Can genetic algorithms solve the travelling salesman problem optimally?
GAs can find good solutions to TSP but do not guarantee optimality. For small instances (under 20-30 cities), exact methods are feasible. For larger instances, GAs with specialised operators (ordered crossover, inversion mutation) find near-optimal solutions efficiently. For the best results, combine GAs with local search heuristics like 2-opt.
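The 2-opt heuristic mentioned above can be sketched as follows (a simple, non-exhaustive version; in practice you would seed it with the GA's best route rather than a random one):

```python
import numpy as np

def route_length(route, cities):
    """Total length of the closed tour."""
    return np.sqrt(((cities[route] - cities[np.roll(route, -1)]) ** 2)
                   .sum(axis=1)).sum()

def two_opt(route, cities):
    """Reverse route segments while any reversal shortens the tour."""
    best = route.copy()
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best.copy()
                candidate[i:j] = candidate[i:j][::-1]  # undo one crossing
                if route_length(candidate, cities) < route_length(best, cities):
                    best = candidate
                    improved = True
    return best

np.random.seed(42)
cities = np.random.rand(25, 2) * 200
route = np.random.permutation(25)
print(f"{route_length(route, cities):.0f} -> "
      f"{route_length(two_opt(route, cities), cities):.0f}")
```

Each accepted reversal removes a pair of crossing edges, which is exactly the "untangling" visible in the GA's evolution.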