
Berkan Sesen

Posted on • Originally published at sesen.ai

Genetic Algorithms: From Line Fitting to the Travelling Salesman

Imagine you're planning a road trip through 25 cities. The number of distinct round-trip routes is (25 − 1)!/2 — roughly 3.1 × 10²², comparable to the number of stars in the observable universe. You can't try them all. And unlike fitting a line with gradient descent, there's no gradient to follow — the search space is discrete and rugged.

Nature solved this class of problem billions of years ago. Evolution doesn't need gradients. It works through variation and selection: generate diverse candidates, keep the ones that work, combine their traits, add random mutations, and repeat. Genetic algorithms (GAs) formalise this process into a general-purpose optimiser.

By the end of this post, you'll implement GAs from scratch for two problems: fitting a regression line (where we can verify the answer) and the Travelling Salesman Problem (where we can't). You'll understand the core operators — selection, crossover, and mutation — and see how a single evolutionary loop solves fundamentally different problems.

This tutorial draws on Holland's foundational Adaptation in Natural and Artificial Systems (1975) and Whitley's excellent A Genetic Algorithm Tutorial (1994).

Quick Win: Evolving a Line

Let's see evolution in action on a simple problem. We have 50 data points generated from $y = 3x + 8$ (with some noise), and we want to evolve the coefficients $(a, b)$ that best fit the data — no calculus required.

Click the badge to open the interactive notebook:

Open In Colab

Genetic algorithm evolving a population of lines — random guesses converge to the true fit y = 3x + 8 over 30 generations

The GA starts with 100 random $(a, b)$ pairs — wild guesses from $[-10, 10]$. Each generation, the best-fitting individuals survive and breed. Within ~20 generations, the population converges to the true parameters.

import numpy as np

def ga_line_fit(x, y, pop_size=100, n_elites=20, mutation_rate=0.1,
                generations=50):
    """Evolve (a, b) to minimise RMSE for y = ax + b."""
    pop = np.random.uniform(-10, 10, size=(pop_size, 2))
    history = []

    for gen in range(generations):
        # FITNESS: how well each individual fits the data
        preds = pop[:, 0:1] * x + pop[:, 1:2]
        rmse = np.sqrt(((preds - y) ** 2).mean(axis=1))
        fitness = 1.0 / (rmse + 1e-10)

        best_idx = np.argmax(fitness)
        history.append((pop[best_idx, 0], pop[best_idx, 1], rmse[best_idx]))

        # SELECTION: keep elite performers + sample by fitness
        ranked = np.argsort(-fitness)
        elites = pop[ranked[:n_elites]]
        probs = fitness / fitness.sum()
        parents = pop[np.random.choice(pop_size, pop_size - n_elites, p=probs)]

        # CROSSOVER + MUTATION: average parents, apply multiplicative Gaussian noise
        children = np.empty_like(parents)
        for i in range(len(parents)):
            child = (parents[i] + parents[-(i + 1)]) / 2
            child *= 1 + mutation_rate * np.random.randn(2)
            children[i] = child

        pop = np.vstack([elites, children])

    return history

# Generate noisy data: y = 3x + 8
np.random.seed(42)
x = np.random.rand(50)
y = 3 * x + 8 + 0.3 * np.random.randn(50)

history = ga_line_fit(x, y)
a, b, rmse = history[-1]
print(f"Evolved: y = {a:.2f}x + {b:.2f}  (RMSE: {rmse:.4f})")
print(f"True:    y = 3.00x + 8.00")
# Evolved: y = 2.92x + 8.03  (RMSE: 0.2722)
# True:    y = 3.00x + 8.00

The result: Starting from 100 random guesses, the GA converges to $a \approx 2.92$, $b \approx 8.03$ — close to the true values. No gradients, no closed-form solution — just evolution.

Scaling Up: The Travelling Salesman

Now let's tackle a problem where gradients don't exist. Given 25 cities at random positions, find the shortest route that visits every city exactly once and returns to the start.

The representation changes everything. A line is defined by two real numbers — you can average them. A route is a permutation of cities — you can't average city 3 and city 7 to get city 5. This means we need different crossover and mutation operators.

Genetic algorithm evolving TSP routes — chaotic initial routes untangle into a short path over 500 generations

import numpy as np

def ordered_crossover(p1, p2):
    """Take a slice from parent 1, fill gaps from parent 2's order."""
    n = len(p1)
    start, end = sorted(np.random.choice(n, 2, replace=False))
    child = -np.ones(n, dtype=int)
    child[start:end] = p1[start:end]
    fill = [g for g in p2 if g not in child[start:end]]
    child[child == -1] = fill
    return child

def swap_mutate(route, rate):
    """Randomly swap cities with given probability per position."""
    route = route.copy()
    for i in range(len(route)):
        if np.random.random() < rate:
            j = np.random.randint(len(route))
            route[i], route[j] = route[j], route[i]
    return route

def ga_tsp(cities, pop_size=100, n_elites=20, mutation_rate=0.01,
           generations=500):
    """Evolve shortest route through all cities."""
    n = len(cities)

    def total_distance(route):
        shifts = np.roll(route, -1)
        return np.sqrt(((cities[route] - cities[shifts]) ** 2).sum(axis=1)).sum()

    pop = [np.random.permutation(n) for _ in range(pop_size)]
    history = []

    for gen in range(generations):
        distances = np.array([total_distance(r) for r in pop])
        fitness = 1.0 / distances
        history.append((pop[np.argmin(distances)].copy(), distances.min()))

        # Selection: elites + fitness-proportionate
        ranked = np.argsort(-fitness)
        elites = [pop[i].copy() for i in ranked[:n_elites]]
        probs = fitness / fitness.sum()
        idx = np.random.choice(pop_size, pop_size - n_elites, p=probs)
        np.random.shuffle(idx)

        # Crossover + mutation
        children = []
        for i in range(0, len(idx) - 1, 2):
            for p, q in [(idx[i], idx[i+1]), (idx[i+1], idx[i])]:
                children.append(swap_mutate(
                    ordered_crossover(pop[p], pop[q]), mutation_rate))
        children = children[:pop_size - n_elites]
        pop = elites + children

    return history

# 25 random cities
np.random.seed(42)
cities = np.random.rand(25, 2) * 200

history = ga_tsp(cities)
best_route, best_dist = history[-1]
print(f"Initial distance: {history[0][1]:.0f}")
print(f"Evolved distance: {best_dist:.0f}")
# Initial distance: 2043
# Evolved distance: 978

The result: The route distance drops from 2043 to 978 — a 52% improvement. Watch the GIF — chaotic crossings disappear as evolution untangles the route, generation by generation.

What Just Happened?

Both problems follow the same evolutionary loop:

  1. Create a random population
  2. Evaluate fitness (how good is each individual?)
  3. Select the fittest parents
  4. Breed offspring via crossover
  5. Mutate for diversity
  6. Repeat
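In code, the shared loop can be written once and reused. Here is a minimal skeleton (a sketch of the loop above, not the post's notebook code), demonstrated on the classic OneMax toy problem — maximise the number of 1s in a bitstring:

```python
import random

def evolve(pop, fitness_fn, crossover_fn, mutate_fn, generations, n_elites=2):
    """Generic GA loop; the line-fitting and TSP examples are both instances of this shape."""
    for _ in range(generations):
        pop.sort(key=fitness_fn, reverse=True)                # evaluate + rank
        elites = pop[:n_elites]                               # elitism
        children = []
        while len(children) < len(pop) - n_elites:
            p1, p2 = random.sample(pop[: len(pop) // 2], 2)   # breed from the top half
            children.append(mutate_fn(crossover_fn(p1, p2)))  # crossover + mutation
        pop = elites + children
    return max(pop, key=fitness_fn)

# OneMax: maximise the number of 1s in a length-20 bitstring
random.seed(0)
n = 20
pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(30)]
best = evolve(
    pop,
    fitness_fn=sum,
    crossover_fn=lambda a, b: a[: n // 2] + b[n // 2:],       # single-point crossover
    mutate_fn=lambda c: [1 - g if random.random() < 0.02 else g for g in c],
    generations=40,
)
print(sum(best))  # close to 20
```

Only the fitness, crossover, and mutation functions change between problems; the loop itself is problem-agnostic.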

The crucial difference is the representation — how we encode a solution as a "chromosome." This determines which operators make sense.

|            | Line Fitting     | TSP                  |
|------------|------------------|----------------------|
| Individual | $(a, b)$ pair    | City permutation     |
| Gene       | Real number      | City index           |
| Example    | `[3.2, 7.9]`     | `[0, 14, 3, 21, ...]` |
| Crossover  | Average parents  | Ordered crossover    |
| Mutation   | Gaussian noise   | Swap two cities      |

Fitness: Measuring Quality

Fitness must be higher for better solutions. Both problems use inverse loss:

$$\text{fitness} = \frac{1}{\text{loss} + \epsilon}$$

where the loss is RMSE for line fitting and route distance for TSP, and $\epsilon$ is a tiny constant guarding against division by zero.

A line with RMSE 0.1 (fitness 10) is ten times "fitter" than one with RMSE 1.0 (fitness 1). This ratio drives the selection pressure.
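That ratio is exactly what roulette-wheel sampling turns into selection probabilities — a two-individual illustration:

```python
import numpy as np

rmse = np.array([0.1, 1.0])       # two candidate lines
fitness = 1.0 / rmse              # [10.0, 1.0]
probs = fitness / fitness.sum()   # roulette-wheel selection probabilities
print(probs.round(3))             # [0.909 0.091]
```

The fitter line is picked as a parent roughly ten times as often, but the weaker one still gets an occasional turn.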

Selection: Survival of the Fittest

We combine two mechanisms:

# Keep the top 20 unchanged (elitism)
elites = pop[ranked[:n_elites]]

# Fill remaining slots by fitness-proportionate sampling
probs = fitness / fitness.sum()
parents = pop[np.random.choice(pop_size, pop_size - n_elites, p=probs)]

Elitism guarantees we never lose our best solution. Roulette wheel selection gives fitter individuals a higher probability of being chosen as parents, but still allows less-fit individuals a chance — maintaining diversity.

Crossover: Where Representation Matters

This is the key insight. For line fitting, crossover is trivial — average the parents' genes:

child = (parent1 + parent2) / 2  # [3.5, 7.2] + [2.8, 8.1] → [3.15, 7.65]

For TSP, you can't average permutations — city 3.5 doesn't exist. Instead, we use ordered crossover (OX), which preserves relative ordering from both parents:

  1. Copy a random slice from Parent 1
  2. Fill remaining positions with cities from Parent 2, in order
Parent 1: [0, 1, 2, 3, 4, 5, 6, 7]
Parent 2: [2, 5, 0, 7, 1, 3, 6, 4]
                  ┌─────┐
Slice (pos 3-6):  [_, _, _, 3, 4, 5, _, _]

Fill from P2 (skip 3, 4, 5): [2, 0, 7, 1, 6]

Child:            [2, 0, 7, 3, 4, 5, 1, 6]  ← valid permutation

The child inherits a contiguous segment from Parent 1 and the relative ordering of remaining cities from Parent 2.

Mutation: Random Exploration

Without mutation, the population would eventually become clones of the fittest individual — stuck in a local optimum. Mutation injects diversity:

  • Line fitting: Gaussian noise — child *= 1 + rate × N(0, 1)
  • TSP: Swap two random cities — route[i], route[j] = route[j], route[i]

The mutation rate controls the exploration-exploitation trade-off. Too high and you're doing random search. Too low and you converge prematurely.

Going Deeper

Hyperparameters

| Parameter       | What it controls      | Line fitting | TSP        |
|-----------------|-----------------------|--------------|------------|
| `pop_size`      | Search breadth        | 100          | 100        |
| `n_elites`      | Exploitation strength | 20 (20%)     | 20 (20%)   |
| `mutation_rate` | Exploration strength  | 0.1 (10%)    | 0.01 (1%)  |
| `generations`   | Computation budget    | 50           | 500        |

Notice the mutation rate difference: line fitting uses 10% because Gaussian noise is gentle (multiplicative perturbation), while TSP uses 1% because each swap can dramatically change the route.
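To make that concrete, here is a small toy illustration (not from the post's notebook) of what a single mutation does in each representation:

```python
import numpy as np

np.random.seed(0)

# Line fitting: multiplicative Gaussian noise nudges each coefficient,
# typically by on the order of 10% — the child stays near the parent
genes = np.array([3.0, 8.0])
nudged = genes * (1 + 0.1 * np.random.randn(2))

# TSP: one swap rewires up to four edges of the tour at once
route = np.arange(8)
i, j = 1, 5
route[i], route[j] = route[j], route[i]  # [0 5 2 3 4 1 6 7]
print(nudged, route)
```

A coefficient nudge is a small, local move; a city swap is a large, discrete jump — hence the tenfold difference in rates.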

GA vs Gradient Descent

For the line-fitting problem, we can compare GA with gradient descent directly:

def gd_line_fit(x, y, lr=0.1, iterations=50):
    """Gradient descent to fit y = ax + b."""
    a, b = 0.0, 0.0
    history = []
    for _ in range(iterations):
        preds = a * x + b
        error = preds - y
        a -= lr * 2 * (error * x).mean()
        b -= lr * 2 * error.mean()
        history.append((a, b, np.sqrt((error ** 2).mean())))
    return history

GA vs gradient descent convergence on the same line-fitting problem — GD converges faster per iteration but GA works without requiring gradients

GD converges in ~15 iterations. GA takes ~30 generations with a population of 100 — that's 3,000 fitness evaluations vs 15. When gradients exist, use them. GA shines when they don't: combinatorial problems (TSP), black-box functions, or multi-modal landscapes where gradient descent gets stuck in local optima.

When NOT to Use Genetic Algorithms

  1. Differentiable objectives — gradient descent is orders of magnitude faster
  2. Small search spaces — just enumerate all solutions
  3. Need optimality guarantees — GA only guarantees "good" solutions, not the best
  4. Tight compute budget — evaluating the entire population each generation is expensive

Alternatives:

  • Gradient descent for differentiable problems
  • Simulated annealing for combinatorial optimisation with a single solution
  • Bayesian optimisation for expensive black-box functions
  • Ant colony optimisation for routing problems specifically

Deep Dive: The Papers

Holland (1975)

John Holland introduced genetic algorithms in Adaptation in Natural and Artificial Systems, published by the University of Michigan Press. Holland was interested in a fundamental question: how do complex adaptive systems — biological organisms, economies, ecosystems — improve over time?

His key insight was that adaptation can be studied as an algorithm. By abstracting the essentials of natural selection into a computational framework, he showed that a simple loop of selection, crossover, and mutation could solve optimisation problems across domains.

The Schema Theorem

Holland's most famous theoretical result is the Schema Theorem, sometimes called the Fundamental Theorem of Genetic Algorithms.

A schema is a template describing a subset of individuals. In a binary GA, the schema 1**0*1 matches any individual with a 1 in positions 0 and 5 and a 0 in position 3 (where * is a wildcard).

The theorem states:

Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations.

Formally:

$$m(H, t+1) \;\geq\; m(H, t) \cdot \frac{f(H)}{\bar{f}} \left[ 1 - p_c \, \frac{\delta(H)}{l - 1} - o(H) \, p_m \right]$$

Where:

  • $m(H, t)$ = instances of schema H at generation t
  • $f(H) / \bar{f}$ = schema fitness relative to population average
  • $\delta(H)$ = defining length (distance between outermost fixed positions)
  • $o(H)$ = order (number of fixed positions)
  • $l$ = chromosome length
  • $p_c, p_m$ = crossover and mutation probabilities

In plain English: patterns that are short (small $\delta$), simple (small $o$), and above average ($f(H) > \bar{f}$) spread exponentially through the population. This is why crossover works — it recombines short, successful building blocks.
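The theorem's quantities are easy to compute for a concrete schema. A small helper (a sketch; the schema is given as a string with `*` wildcards, and the bracketed term is the disruption bound from crossover and mutation):

```python
def schema_stats(schema, p_c=0.7, p_m=0.01):
    """Order, defining length, and the Schema Theorem's survival bound for a schema string."""
    l = len(schema)
    fixed = [i for i, c in enumerate(schema) if c != '*']
    order = len(fixed)                 # o(H): number of fixed positions
    delta = fixed[-1] - fixed[0]       # δ(H): defining length
    survival = 1 - p_c * delta / (l - 1) - order * p_m
    return order, delta, survival

order, delta, survival = schema_stats('1**0*1')
print(order, delta, round(survival, 2))  # 3 5 0.27
```

The long, high-order schema `1**0*1` spans the whole string, so with a typical crossover probability it survives recombination barely a quarter of the time — short, compact schemata fare far better.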

Whitley (1994)

Darrell Whitley's A Genetic Algorithm Tutorial in Statistics and Computing provides an accessible theoretical treatment:

"A genetic algorithm is a model of machine learning which derives its behavior from a metaphor of the processes of evolution in nature. This is done by the creation within a machine of a population of individuals represented by chromosomes."

Whitley explains why different representations need different operators — exactly the issue we encountered with line fitting vs TSP. For permutation problems, he discusses the importance of operators that preserve feasibility (like our ordered crossover).

Our Implementation vs the Framework

| GA concept       | Line fitting             | TSP                      |
|------------------|--------------------------|--------------------------|
| Chromosome       | $(a, b)$ pair            | City permutation         |
| Fitness function | $1 / \text{RMSE}$        | $1 / \text{distance}$    |
| Selection        | Elitism + roulette wheel | Elitism + roulette wheel |
| Crossover        | Arithmetic average       | Ordered crossover (OX)   |
| Mutation         | Gaussian perturbation    | Swap mutation            |

Historical Context

  • Turing (1950) — Proposed "genetical or evolutionary search" for machine learning
  • Rechenberg (1965) — Evolution strategies for engineering optimisation
  • Holland (1975) — Formal GA framework with the Schema Theorem
  • De Jong (1975) — First systematic empirical study of GA behaviour
  • Goldberg (1989) — Popularised GAs in engineering with practical examples
  • Koza (1992) — Extended GAs to genetic programming (evolving programs)

Further Reading

  • Holland (1975) — The foundational book on genetic algorithms
  • Whitley (1994) — Excellent tutorial in Statistics and Computing
  • Goldberg (1989) — Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley
  • Mitchell (1996) — An Introduction to Genetic Algorithms, MIT Press
  • Bishop's PRML Section 14.2 — Modern coverage of evolutionary methods

Interactive Tools

  • Overfitting Explorer — See how model complexity affects generalisation, the same tradeoff GAs navigate when fitting functions


Try It Yourself

The interactive notebook includes exercises:

  1. Noise sensitivity — Add increasing noise to the line data. How does GA cope compared to MLE?
  2. More cities — Try 50 or 100 cities in the TSP. How does convergence change?
  3. Tournament selection — Replace roulette wheel with tournament selection (pick k random individuals, keep the best). Compare convergence speed.
  4. GA for neural nets — Use GA instead of backprop to train the XOR network from the backpropagation post. How many generations does it take?
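As a starting point for exercise 3, tournament selection can be sketched as follows (a hypothetical drop-in for the roulette-wheel step; `tournament_select` is not part of the notebook code):

```python
import numpy as np

def tournament_select(pop, fitness, k=3):
    """Pick k random individuals; the fittest of the k becomes a parent."""
    contenders = np.random.choice(len(pop), k, replace=False)
    return pop[contenders[np.argmax(fitness[contenders])]].copy()

# With k equal to the population size, the tournament always returns the global best
pop = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
fitness = np.array([0.1, 0.9, 0.5])
print(tournament_select(pop, fitness, k=3))  # [3. 4.]
```

Unlike roulette wheel, tournament selection depends only on fitness rankings, not magnitudes, which makes selection pressure easy to tune via `k`.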

Understanding genetic algorithms opens the door to evolutionary computation — a family of methods including evolution strategies, genetic programming, and neuroevolution. The core insight — that selection and variation can optimise anything with a fitness function — is one of the most versatile ideas in computer science.

Frequently Asked Questions

What is a genetic algorithm and when should I use it?

A genetic algorithm (GA) is an optimisation method inspired by natural selection. It maintains a population of candidate solutions that evolve through selection, crossover, and mutation over generations. Use GAs when you have a complex optimisation problem with no useful gradient information, many local optima, or a discrete/combinatorial search space.

What is the difference between genetic algorithms and gradient descent?

Gradient descent follows the gradient of the objective function to find a minimum, requiring the function to be differentiable. GAs are gradient-free: they only need to evaluate a fitness function, so they work on non-differentiable, discrete, or black-box problems. GAs explore more broadly but converge more slowly than gradient descent for smooth problems.

How do I choose the population size and mutation rate?

Larger populations explore more of the search space but are slower per generation. Start with 50-200 individuals for most problems. Mutation rate controls exploration: too low and the population converges prematurely; too high and it becomes random search. Typical rates are 1-5% per gene. Tune both by observing convergence curves.

What is crossover and why does it help?

Crossover combines parts of two parent solutions to create offspring, exploiting the idea that good solutions may share useful building blocks. Single-point crossover splits parents at one random position; uniform crossover mixes genes independently. Crossover accelerates search by combining useful traits from different parts of the population.
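The two operators named above are a few lines each for real- or binary-valued chromosomes (a sketch; these helper names are illustrative, not from the post's notebooks):

```python
import numpy as np

def single_point_crossover(p1, p2):
    """Cut both parents at one random point; child = head of p1 + tail of p2."""
    point = np.random.randint(1, len(p1))
    return np.concatenate([p1[:point], p2[point:]])

def uniform_crossover(p1, p2):
    """Each gene is inherited from either parent independently with probability 1/2."""
    mask = np.random.random(len(p1)) < 0.5
    return np.where(mask, p1, p2)

p1, p2 = np.zeros(8, dtype=int), np.ones(8, dtype=int)
print(single_point_crossover(p1, p2))  # e.g. [0 0 0 1 1 1 1 1]
print(uniform_crossover(p1, p2))       # e.g. [0 1 1 0 0 1 0 1]
```

Neither operator produces valid permutations, which is why TSP needs ordered crossover instead.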

Can genetic algorithms solve the travelling salesman problem optimally?

GAs can find good solutions to TSP but do not guarantee optimality. For small instances (under 20-30 cities), exact methods are feasible. For larger instances, GAs with specialised operators (ordered crossover, inversion mutation) find near-optimal solutions efficiently. For the best results, combine GAs with local search heuristics like 2-opt.
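A minimal sketch of the 2-opt idea mentioned above (reverse a segment of the tour whenever doing so shortens it; a greedy illustration, not a tuned implementation):

```python
import numpy as np

def two_opt(route, cities):
    """Greedy 2-opt: reverse segments until no reversal shortens the tour."""
    def tour_len(r):
        return np.sqrt(((cities[r] - cities[np.roll(r, -1)]) ** 2).sum(axis=1)).sum()
    best = np.array(route)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                # Candidate tour with the segment [i, j) reversed
                cand = np.concatenate([best[:i], best[i:j][::-1], best[j:]])
                if tour_len(cand) < tour_len(best) - 1e-12:
                    best, improved = cand, True
    return best

# A self-crossing tour of a unit square untangles to the 4-unit perimeter
cities = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
route = two_opt([0, 2, 1, 3], cities)
print(route)  # [0 1 2 3]
```

In a hybrid (memetic) GA, a pass of 2-opt is typically applied to offspring after crossover and mutation, combining evolution's global search with local refinement.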
