Imagine you're planning a road trip through 25 cities. The number of distinct round trips is (25 − 1)!/2 — roughly 3.1 × 10²³, more than the estimated number of stars in the observable universe. You can't try them all. And unlike fitting a line with gradient descent, there's no gradient to follow — the search space is discrete and rugged.
Nature solved this class of problem billions of years ago. Evolution doesn't need gradients. It works through variation and selection: generate diverse candidates, keep the ones that work, combine their traits, add random mutations, and repeat. Genetic algorithms (GAs) formalise this process into a general-purpose optimiser.
By the end of this post, you'll implement GAs from scratch for two problems: fitting a regression line (where we can verify the answer) and the Travelling Salesman Problem (where we can't). You'll understand the core operators — selection, crossover, and mutation — and see how a single evolutionary loop solves fundamentally different problems.
This tutorial draws on Holland's foundational Adaptation in Natural and Artificial Systems (1975) and Whitley's excellent A Genetic Algorithm Tutorial (1994).
Quick Win: Evolving a Line
Let's see evolution in action on a simple problem. We have 50 data points generated from $y = 3x + 8$ (with some noise), and we want to evolve the coefficients $(a, b)$ that best fit the data — no calculus required.
The GA starts with 100 random $(a, b)$ pairs — wild guesses from $[-10, 10]$. Each generation, the best-fitting individuals survive and breed. Within ~20 generations, the population converges to the true parameters.
```python
import numpy as np

def ga_line_fit(x, y, pop_size=100, n_elites=20, mutation_rate=0.1,
                generations=50):
    """Evolve (a, b) to minimise RMSE for y = ax + b."""
    pop = np.random.uniform(-10, 10, size=(pop_size, 2))
    history = []
    for gen in range(generations):
        # FITNESS: how well each individual fits the data
        preds = pop[:, 0:1] * x + pop[:, 1:2]
        rmse = np.sqrt(((preds - y) ** 2).mean(axis=1))
        fitness = 1.0 / (rmse + 1e-10)
        best_idx = np.argmax(fitness)
        history.append((pop[best_idx, 0], pop[best_idx, 1], rmse[best_idx]))

        # SELECTION: keep elite performers + sample by fitness
        ranked = np.argsort(-fitness)
        elites = pop[ranked[:n_elites]]
        probs = fitness / fitness.sum()
        parents = pop[np.random.choice(pop_size, pop_size - n_elites, p=probs)]

        # CROSSOVER + MUTATION: average parents, add Gaussian noise
        children = np.empty_like(parents)
        for i in range(len(parents)):
            child = (parents[i] + parents[-(i + 1)]) / 2
            child *= 1 + mutation_rate * np.random.randn(2)
            children[i] = child
        pop = np.vstack([elites, children])
    return history

# Generate noisy data: y = 3x + 8
np.random.seed(42)
x = np.random.rand(50)
y = 3 * x + 8 + 0.3 * np.random.randn(50)

history = ga_line_fit(x, y)
a, b, rmse = history[-1]
print(f"Evolved: y = {a:.2f}x + {b:.2f} (RMSE: {rmse:.4f})")
print("True: y = 3.00x + 8.00")

# Evolved: y = 2.92x + 8.03 (RMSE: 0.2722)
# True: y = 3.00x + 8.00
```
The result: Starting from 100 random guesses, the GA converges to $a \approx 2.92$, $b \approx 8.03$ — close to the true values. No gradients, no closed-form solution — just evolution.
Scaling Up: The Travelling Salesman
Now let's tackle a problem where gradients don't exist. Given 25 cities at random positions, find the shortest route that visits every city exactly once and returns to the start.
The representation changes everything. A line is defined by two real numbers — you can average them. A route is a permutation of cities — you can't average city 3 and city 7 to get city 5. This means we need different crossover and mutation operators.
```python
import numpy as np

def ordered_crossover(p1, p2):
    """Take a slice from parent 1, fill gaps from parent 2's order."""
    n = len(p1)
    start, end = sorted(np.random.choice(n, 2, replace=False))
    child = -np.ones(n, dtype=int)
    child[start:end] = p1[start:end]
    fill = [g for g in p2 if g not in child[start:end]]
    child[child == -1] = fill
    return child

def swap_mutate(route, rate):
    """Randomly swap cities with given probability per position."""
    route = route.copy()
    for i in range(len(route)):
        if np.random.random() < rate:
            j = np.random.randint(len(route))
            route[i], route[j] = route[j], route[i]
    return route

def ga_tsp(cities, pop_size=100, n_elites=20, mutation_rate=0.01,
           generations=500):
    """Evolve shortest route through all cities."""
    n = len(cities)

    def total_distance(route):
        shifts = np.roll(route, -1)
        return np.sqrt(((cities[route] - cities[shifts]) ** 2).sum(axis=1)).sum()

    pop = [np.random.permutation(n) for _ in range(pop_size)]
    history = []
    for gen in range(generations):
        distances = np.array([total_distance(r) for r in pop])
        fitness = 1.0 / distances
        history.append((pop[np.argmin(distances)].copy(), distances.min()))

        # Selection: elites + fitness-proportionate
        ranked = np.argsort(-fitness)
        elites = [pop[i].copy() for i in ranked[:n_elites]]
        probs = fitness / fitness.sum()
        idx = np.random.choice(pop_size, pop_size - n_elites, p=probs)
        np.random.shuffle(idx)

        # Crossover + mutation
        children = []
        for i in range(0, len(idx) - 1, 2):
            for p, q in [(idx[i], idx[i + 1]), (idx[i + 1], idx[i])]:
                children.append(swap_mutate(
                    ordered_crossover(pop[p], pop[q]), mutation_rate))
        children = children[:pop_size - n_elites]
        pop = elites + children
    return history

# 25 random cities
np.random.seed(42)
cities = np.random.rand(25, 2) * 200

history = ga_tsp(cities)
best_route, best_dist = history[-1]
print(f"Initial distance: {history[0][1]:.0f}")
print(f"Evolved distance: {best_dist:.0f}")

# Initial distance: 2043
# Evolved distance: 978
```
The result: The route distance drops from 2043 to 978 — a 52% improvement. Generation by generation, evolution untangles the route as the chaotic crossings disappear.
What Just Happened?
Both problems follow the same evolutionary loop:
- Create a random population
- Evaluate fitness (how good is each individual?)
- Select the fittest parents
- Breed offspring via crossover
- Mutate for diversity
- Repeat
The crucial difference is the representation — how we encode a solution as a "chromosome." This determines which operators make sense.
| | Line Fitting | TSP |
|---|---|---|
| Individual | $(a, b)$ pair | City permutation |
| Gene | Real number | City index |
| Example | `[3.2, 7.9]` | `[0, 14, 3, 21, ...]` |
| Crossover | Average parents | Ordered crossover |
| Mutation | Gaussian noise | Swap two cities |
Fitness: Measuring Quality
Fitness must be higher for better solutions. Both problems use inverse loss:

$$\text{fitness} = \frac{1}{\text{loss} + \varepsilon}$$

where the loss is RMSE for line fitting and total route distance for TSP (the small $\varepsilon$ guards against division by zero). A line with RMSE 0.1 (fitness 10) is ten times "fitter" than one with RMSE 1.0 (fitness 1). This ratio drives the selection pressure.
Selection: Survival of the Fittest
We combine two mechanisms:
```python
# Keep the top 20 unchanged (elitism)
elites = pop[ranked[:n_elites]]

# Fill remaining slots by fitness-proportionate sampling
probs = fitness / fitness.sum()
parents = pop[np.random.choice(pop_size, pop_size - n_elites, p=probs)]
```
Elitism guarantees we never lose our best solution. Roulette wheel selection gives fitter individuals a higher probability of being chosen as parents, but still allows less-fit individuals a chance — maintaining diversity.
Crossover: Where Representation Matters
This is the key insight. For line fitting, crossover is trivial — average the parents' genes:
```python
child = (parent1 + parent2) / 2  # [3.5, 7.2] + [2.8, 8.1] → [3.15, 7.65]
```
For TSP, you can't average permutations — city 3.5 doesn't exist. Instead, we use ordered crossover (OX), which preserves relative ordering from both parents:
- Copy a random slice from Parent 1
- Fill remaining positions with cities from Parent 2, in order
```
Parent 1:                     [0, 1, 2, 3, 4, 5, 6, 7]
Parent 2:                     [2, 5, 0, 7, 1, 3, 6, 4]

Slice (pos 3-6):              [_, _, _, 3, 4, 5, _, _]
Fill from P2 (skip 3, 4, 5):  [2, 0, 7, 1, 6]
Child:                        [2, 0, 7, 3, 4, 5, 1, 6]  ← valid permutation
```
The child inherits a contiguous segment from Parent 1 and the relative ordering of remaining cities from Parent 2.
Mutation: Random Exploration
Without mutation, the population would eventually become clones of the fittest individual — stuck in a local optimum. Mutation injects diversity:
- Line fitting: Gaussian noise — `child *= 1 + rate * N(0, 1)`
- TSP: swap two random cities — `route[i], route[j] = route[j], route[i]`
The mutation rate controls the exploration-exploitation trade-off. Too high and you're doing random search. Too low and you converge prematurely.
Going Deeper
Hyperparameters
| Parameter | What it controls | Line fitting | TSP |
|---|---|---|---|
| `pop_size` | Search breadth | 100 | 100 |
| `n_elites` | Exploitation strength | 20 (20%) | 20 (20%) |
| `mutation_rate` | Exploration strength | 0.1 (10%) | 0.01 (1%) |
| `generations` | Computation budget | 50 | 500 |
Notice the mutation rate difference: line fitting uses 10% because Gaussian noise is gentle (multiplicative perturbation), while TSP uses 1% because each swap can dramatically change the route.
GA vs Gradient Descent
For the line-fitting problem, we can compare GA with gradient descent directly:
```python
def gd_line_fit(x, y, lr=0.1, iterations=50):
    """Gradient descent to fit y = ax + b."""
    a, b = 0.0, 0.0
    history = []
    for _ in range(iterations):
        preds = a * x + b
        error = preds - y
        a -= lr * 2 * (error * x).mean()
        b -= lr * 2 * error.mean()
        history.append((a, b, np.sqrt((error ** 2).mean())))
    return history
```
GD converges in ~15 iterations. GA takes ~30 generations with a population of 100 — that's 3,000 fitness evaluations vs 15. When gradients exist, use them. GA shines when they don't: combinatorial problems (TSP), black-box functions, or multi-modal landscapes where gradient descent gets stuck in local optima.
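To try the comparison yourself, here is a self-contained driver (my own harness, regenerating the same noisy data as in the Quick Win section; the exact RMSE depends on the seed and iteration count):

```python
import numpy as np

def gd_line_fit(x, y, lr=0.1, iterations=50):
    """Gradient descent to fit y = ax + b (same function as in the post)."""
    a, b = 0.0, 0.0
    history = []
    for _ in range(iterations):
        error = a * x + b - y          # residuals at current (a, b)
        a -= lr * 2 * (error * x).mean()  # d(MSE)/da
        b -= lr * 2 * error.mean()        # d(MSE)/db
        history.append((a, b, np.sqrt((error ** 2).mean())))
    return history

# Same data-generating process as the GA example: y = 3x + 8 + noise
np.random.seed(42)
x = np.random.rand(50)
y = 3 * x + 8 + 0.3 * np.random.randn(50)

history = gd_line_fit(x, y)
a, b, rmse = history[-1]
print(f"GD: y = {a:.2f}x + {b:.2f} (RMSE: {rmse:.4f})")
```

Each gradient-descent step costs one pass over the data, versus one fitness evaluation per individual per generation for the GA, which is where the 15 vs 3,000 gap comes from.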
When NOT to Use Genetic Algorithms
- Differentiable objectives — gradient descent is orders of magnitude faster
- Small search spaces — just enumerate all solutions
- Need optimality guarantees — GA only guarantees "good" solutions, not the best
- Tight compute budget — evaluating the entire population each generation is expensive
Alternatives:
- Gradient descent for differentiable problems
- Simulated annealing for combinatorial optimisation with a single solution
- Bayesian optimisation for expensive black-box functions
- Ant colony optimisation for routing problems specifically
Deep Dive: The Papers
Holland (1975)
John Holland introduced genetic algorithms in Adaptation in Natural and Artificial Systems, published by the University of Michigan Press. Holland was interested in a fundamental question: how do complex adaptive systems — biological organisms, economies, ecosystems — improve over time?
His key insight was that adaptation can be studied as an algorithm. By abstracting the essentials of natural selection into a computational framework, he showed that a simple loop of selection, crossover, and mutation could solve optimisation problems across domains.
The Schema Theorem
Holland's most famous theoretical result is the Schema Theorem, sometimes called the Fundamental Theorem of Genetic Algorithms.
A schema is a template describing a subset of individuals. In a binary GA, the schema 1**0*1 matches any individual with a 1 in positions 0 and 5 and a 0 in position 3 (where * is a wildcard).
The theorem states:
Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations.
Formally:

$$m(H, t+1) \geq m(H, t) \cdot \frac{f(H)}{\bar{f}} \left[ 1 - p_c \frac{\delta(H)}{l - 1} - o(H)\, p_m \right]$$

Where:

- $m(H, t)$ = instances of schema $H$ at generation $t$
- $f(H) / \bar{f}$ = schema fitness relative to population average
- $\delta(H)$ = defining length (distance between outermost fixed positions)
- $o(H)$ = order (number of fixed positions)
- $p_c, p_m$ = crossover and mutation probabilities
- $l$ = chromosome length
In plain English: patterns that are short (small $\delta$), simple (small $o$), and above average ($f(H) > \bar{f}$) spread exponentially through the population. This is why crossover works — it recombines short, successful building blocks.
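To make schemata concrete, here is a small illustrative sketch (my own, not from Holland) that checks schema membership and counts $m(H, t)$ for a toy binary population:

```python
def matches(schema, individual):
    """True if the individual matches the schema (* is a wildcard)."""
    return all(s == '*' or s == bit for s, bit in zip(schema, individual))

def schema_count(schema, population):
    """m(H, t): number of instances of schema H in the population."""
    return sum(matches(schema, ind) for ind in population)

pop = ['100011', '110001', '101101', '000001']
print(schema_count('1**0*1', pop))  # → 2 ('100011' and '110001' match)
```

Tracking such counts across generations is exactly what the Schema Theorem bounds from below.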
Whitley (1994)
Darrell Whitley's A Genetic Algorithm Tutorial in Statistics and Computing provides an accessible theoretical treatment:
"A genetic algorithm is a model of machine learning which derives its behavior from a metaphor of the processes of evolution in nature. This is done by the creation within a machine of a population of individuals represented by chromosomes."
Whitley explains why different representations need different operators — exactly the issue we encountered with line fitting vs TSP. For permutation problems, he discusses the importance of operators that preserve feasibility (like our ordered crossover).
Our Implementation vs the Framework
| GA concept | Line fitting | TSP |
|---|---|---|
| Chromosome | $(a, b)$ pair | City permutation |
| Fitness function | $1 / \text{RMSE}$ | $1 / \text{distance}$ |
| Selection | Elitism + roulette wheel | Elitism + roulette wheel |
| Crossover | Arithmetic average | Ordered crossover (OX) |
| Mutation | Gaussian perturbation | Swap mutation |
Historical Context
- Turing (1950) — Proposed "genetical or evolutionary search" for machine learning
- Rechenberg (1965) — Evolution strategies for engineering optimisation
- Holland (1975) — Formal GA framework with the Schema Theorem
- De Jong (1975) — First systematic empirical study of GA behaviour
- Goldberg (1989) — Popularised GAs in engineering with practical examples
- Koza (1992) — Extended GAs to genetic programming (evolving programs)
Further Reading
- Holland (1975) — The foundational book on genetic algorithms
- Whitley (1994) — Excellent tutorial in Statistics and Computing
- Goldberg (1989) — Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley
- Mitchell (1996) — An Introduction to Genetic Algorithms, MIT Press
- Bishop's PRML Section 14.2 — Modern coverage of evolutionary methods
Interactive Tools
- Overfitting Explorer — See how model complexity affects generalisation, the same tradeoff GAs navigate when fitting functions
Related Posts
- Backpropagation Demystified — gradient-based optimisation (the alternative for differentiable problems)
- The EM Algorithm — iterative optimisation with convergence guarantees
- From K-Means to GMM — another iterative optimisation algorithm
- Maximum Likelihood Estimation — parameter estimation (what GA replaces here)
Try It Yourself
The interactive notebook includes exercises:
- Noise sensitivity — Add increasing noise to the line data. How does GA cope compared to MLE?
- More cities — Try 50 or 100 cities in the TSP. How does convergence change?
- Tournament selection — Replace roulette wheel with tournament selection (pick k random individuals, keep the best). Compare convergence speed.
- GA for neural nets — Use GA instead of backprop to train the XOR network from the backpropagation post. How many generations does it take?
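As a starting point for the tournament-selection exercise, here is a minimal sketch (the function name and the toy fitness below are my own):

```python
import numpy as np

def tournament_select(pop, fitness, k=3):
    """Pick k random individuals, return a copy of the fittest one."""
    contenders = np.random.choice(len(pop), k, replace=False)
    winner = contenders[np.argmax(fitness[contenders])]
    return pop[winner].copy()

# Example: select one parent from a toy (a, b) population,
# scoring individuals by closeness to the true parameters (3, 8)
np.random.seed(42)
toy_pop = np.random.uniform(-10, 10, size=(100, 2))
fit = 1.0 / (np.abs(toy_pop[:, 0] - 3) + np.abs(toy_pop[:, 1] - 8) + 1e-10)
parent = tournament_select(toy_pop, fit)
```

Unlike roulette wheel, tournament selection depends only on fitness *rank*, not magnitude, so it behaves the same whether your fitness values span [0.9, 1.1] or [1, 1000].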
Understanding genetic algorithms opens the door to evolutionary computation — a family of methods including evolution strategies, genetic programming, and neuroevolution. The core insight — that selection and variation can optimise anything with a fitness function — is one of the most versatile ideas in computer science.
Frequently Asked Questions
What is a genetic algorithm and when should I use it?
A genetic algorithm (GA) is an optimisation method inspired by natural selection. It maintains a population of candidate solutions that evolve through selection, crossover, and mutation over generations. Use GAs when you have a complex optimisation problem with no useful gradient information, many local optima, or a discrete/combinatorial search space.
What is the difference between genetic algorithms and gradient descent?
Gradient descent follows the gradient of the objective function to find a minimum, requiring the function to be differentiable. GAs are gradient-free: they only need to evaluate a fitness function, so they work on non-differentiable, discrete, or black-box problems. GAs explore more broadly but converge more slowly than gradient descent for smooth problems.
How do I choose the population size and mutation rate?
Larger populations explore more of the search space but are slower per generation. Start with 50-200 individuals for most problems. Mutation rate controls exploration: too low and the population converges prematurely; too high and it becomes random search. Typical rates are 1-5% per gene. Tune both by observing convergence curves.
What is crossover and why does it help?
Crossover combines parts of two parent solutions to create offspring, exploiting the idea that good solutions may share useful building blocks. Single-point crossover splits parents at one random position; uniform crossover mixes genes independently. Crossover accelerates search by combining useful traits from different parts of the population.
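Both variants above are a few lines each for binary chromosomes. A sketch, assuming NumPy arrays of 0s and 1s:

```python
import numpy as np

def single_point_crossover(p1, p2):
    """Split both parents at one random point and join head of p1 to tail of p2."""
    point = np.random.randint(1, len(p1))
    return np.concatenate([p1[:point], p2[point:]])

def uniform_crossover(p1, p2):
    """Pick each gene independently from either parent with equal probability."""
    mask = np.random.random(len(p1)) < 0.5
    return np.where(mask, p1, p2)

p1 = np.array([1, 1, 1, 1, 1, 1])
p2 = np.array([0, 0, 0, 0, 0, 0])
print(single_point_crossover(p1, p2))  # e.g. [1 1 1 0 0 0]
print(uniform_crossover(p1, p2))       # e.g. [1 0 1 1 0 0]
```

Neither operator is safe for permutations, which is why the TSP section uses ordered crossover instead.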
Can genetic algorithms solve the travelling salesman problem optimally?
GAs can find good solutions to TSP but do not guarantee optimality. For small instances (under 20-30 cities), exact methods are feasible. For larger instances, GAs with specialised operators (ordered crossover, inversion mutation) find near-optimal solutions efficiently. For the best results, combine GAs with local search heuristics like 2-opt.
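The 2-opt heuristic mentioned above can be sketched as follows (a simple, non-exhaustive version; in practice you would seed it with the GA's best route rather than a random one):

```python
import numpy as np

def route_length(route, cities):
    """Total length of the closed tour."""
    return np.sqrt(((cities[route] - cities[np.roll(route, -1)]) ** 2)
                   .sum(axis=1)).sum()

def two_opt(route, cities):
    """Reverse route segments while any reversal shortens the tour."""
    best = route.copy()
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best.copy()
                candidate[i:j] = candidate[i:j][::-1]  # undo one crossing
                if route_length(candidate, cities) < route_length(best, cities):
                    best = candidate
                    improved = True
    return best

np.random.seed(42)
cities = np.random.rand(25, 2) * 200
route = np.random.permutation(25)
print(f"{route_length(route, cities):.0f} -> "
      f"{route_length(two_opt(route, cities), cities):.0f}")
```

Each accepted reversal removes a pair of crossing edges, which is exactly the "untangling" visible in the GA's evolution.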