Emrah G.

Posted on Feb 8 • Originally published at routebot.com

How We Solve the Vehicle Routing Problem: Clustering, TSP Heuristics, and Real-World Constraints

#algorithms #routeoptimization #travelingsalesman #programming

If you've ever looked at the Traveling Salesman Problem and thought "interesting, but when would I actually need this?" — fleet route optimization is one of those real-world domains where combinatorial optimization isn't academic. It's the difference between a transport operation that burns cash and one that runs efficiently.

I work on RouteBot, a transport management platform for school buses and personnel shuttles. Every day, our system distributes thousands of passengers across hundreds of vehicles and sequences their stops. This article walks through the algorithmic pipeline we use — and the engineering tradeoffs involved.

For a broader, less code-focused comparison of manual vs. algorithmic route planning (including operational scenarios and implementation roadmaps), see the full guide on our blog.

The Problem, Formally

Given:

A set of n passengers, each with a geographic coordinate (lat/lng)
A set of k vehicles, each with a capacity c
A destination (school, factory, office)

Find:

An assignment of passengers to vehicles such that no vehicle exceeds capacity and total intra-cluster distance is minimized (Capacitated Clustering)
For each vehicle, a stop sequence that minimizes total travel distance (TSP variant)

This is a variant of the Capacitated Vehicle Routing Problem (CVRP), which is NP-hard. Exact solutions are computationally infeasible for real-world fleet sizes. We need heuristics that produce near-optimal solutions in bounded time.

Step 1: Capacitated Clustering

The first stage assigns passengers to vehicles. This is not standard k-means — we have hard capacity constraints.

Why k-means Doesn't Work Directly

Standard k-means minimizes intra-cluster variance without any constraint on cluster size. In transport routing, an unconstrained cluster might assign 40 passengers to a 25-seat vehicle. You could run k-means and then redistribute overflow, but that post-hoc rebalancing tends to break geographic compactness.

Our Approach: Constrained k-means with Multi-Restart

We use a modified k-means that enforces capacity during assignment, combined with a swap-based local search to escape local minima.

def capacitated_clustering(passengers, k, capacity, n_restarts=12, max_iter=100):
    best_assignment = None
    best_cost = float('inf')

    for _ in range(n_restarts):
        # Initialize centroids (k-means++ seeding)
        centroids = kmeans_plus_plus_init(passengers, k)
        assignment = [None] * len(passengers)

        for iteration in range(max_iter):
            cluster_sizes = [0] * k
            # Assignment step: nearest centroid with capacity check
            # Sort passengers by distance to nearest centroid (farthest first)
            # to ensure hard-to-place passengers get priority
            passengers_by_difficulty = sorted(
                range(len(passengers)),
                key=lambda i: min(
                    haversine(passengers[i], centroids[j]) for j in range(k)
                ),
                reverse=True
            )

            for i in passengers_by_difficulty:
                # Find nearest centroid with available capacity
                candidates = sorted(
                    range(k),
                    key=lambda j: haversine(passengers[i], centroids[j])
                )
                for j in candidates:
                    if cluster_sizes[j] < capacity:
                        assignment[i] = j
                        cluster_sizes[j] += 1
                        break

            # Update step: recompute centroids
            new_centroids = compute_centroids(passengers, assignment, k)
            if centroids_converged(centroids, new_centroids):
                break
            centroids = new_centroids

        # Local search: try swapping passengers between clusters
        assignment = swap_optimization(passengers, assignment, k, capacity)

        cost = total_intra_cluster_distance(passengers, assignment, k)
        if cost < best_cost:
            best_cost = cost
            best_assignment = assignment[:]

    return best_assignment

Key decisions:

k-means++ initialization prevents poor starting configurations. Random init can produce wildly different results between runs.
Farthest-first assignment during the constrained step ensures outlier passengers (who have the fewest viable cluster options) get assigned before the easy cases. Without this, outliers end up in distant clusters because all nearby ones filled up.
Multi-restart (typically 10–15 restarts) is critical because the constrained assignment step makes the algorithm more sensitive to initial conditions than standard k-means.
Swap optimization as a post-processing step tries all pairwise swaps between clusters. If swapping passenger A from cluster X with passenger B from cluster Y reduces total distance without violating capacity, the swap is accepted. This catches improvements that the greedy assignment misses.

Complexity

Each restart runs in O(max_iter × n × k) for the assignment step. The swap optimization is O(n² × k) in the worst case but is fast in practice because we only evaluate swaps between geographically adjacent clusters. Total: O(n_restarts × (max_iter × n × k + n²)).

For 1,200 passengers and 40 vehicles with 12 restarts, this completes in under 2 seconds on commodity hardware.

Step 2: Distance Matrix Computation

Before we can sequence stops, we need the pairwise driving distances between all stops in a cluster. This is where the problem stops being pure geometry and becomes an engineering challenge.

Haversine vs. Driving Distance

Great-circle distance (haversine) is fine for clustering — it's a fast proxy for geographic proximity. But for stop sequencing, you need actual road distances. Two stops might be 500m apart as the crow flies but 3km by road if a river separates them.

Batching API Calls

Mapping APIs (Google Maps, Mapbox, HERE) provide distance matrix endpoints. For a route with s stops, we need an s × s matrix — that's s² distance pairs. Most APIs cap matrix requests at 25×25 or similar.

async function computeDistanceMatrix(
  stops: LatLng[],
  batchSize: number = 25
): Promise<number[][]> {
  const n = stops.length;
  const matrix: number[][] = Array.from({ length: n }, () =>
    new Array(n).fill(0)
  );

  // For small matrices, single API call
  if (n <= batchSize) {
    const result = await mapsApi.distanceMatrix({
      origins: stops,
      destinations: stops,
    });
    return parseMatrixResponse(result);
  }

  // For larger matrices, partition into sub-matrices
  for (let i = 0; i < n; i += batchSize) {
    for (let j = 0; j < n; j += batchSize) {
      const originSlice = stops.slice(i, Math.min(i + batchSize, n));
      const destSlice = stops.slice(j, Math.min(j + batchSize, n));

      const result = await mapsApi.distanceMatrix({
        origins: originSlice,
        destinations: destSlice,
      });

      // Fill the corresponding sub-matrix
      const subMatrix = parseMatrixResponse(result);
      for (let oi = 0; oi < originSlice.length; oi++) {
        for (let di = 0; di < destSlice.length; di++) {
          matrix[i + oi][j + di] = subMatrix[oi][di];
        }
      }
    }
  }

  return matrix;
}

Caching Strategy

Distance matrices are expensive (both in latency and API cost). We cache them aggressively:

L1 cache: In-memory LRU for the current optimization session
L2 cache: Database-backed cache keyed on sorted coordinate pairs with a configurable TTL (we use 30 days — road networks don't change that often)
Cache invalidation: Manual purge when an operator reports a new road or closure

For routes with 15–20 stops, the matrix has 225–400 entries. With caching, repeat optimizations (e.g., after adding one passenger) hit ~90%+ cache rate.

Step 3: Stop Sequencing (TSP)

With passengers clustered and the distance matrix computed, each route needs an optimal stop order.

Exact Solution for Small Routes

For routes with ≤ 11 stops, we evaluate all permutations. 11! = 39,916,800 — that's tight but feasible if we prune aggressively:

def tsp_exact(distance_matrix, depot):
    """Exact TSP via branch-and-bound for small instances."""
    n = len(distance_matrix)
    stops = [i for i in range(n) if i != depot]
    best_distance = float('inf')
    best_order = None

    def branch_and_bound(current, remaining, cost_so_far):
        nonlocal best_distance, best_order

        if not remaining:
            total = cost_so_far + distance_matrix[current][depot]
            if total < best_distance:
                best_distance = total
                best_order = [depot] + visited + [depot]
            return

        # Pruning: if current cost already exceeds best, abandon
        if cost_so_far >= best_distance:
            return

        # Lower bound: minimum remaining edge cost
        min_remaining = sum(
            min(distance_matrix[r][j] for j in remaining if j != r)
            for r in remaining
        ) if len(remaining) > 1 else 0
        if cost_so_far + min_remaining >= best_distance:
            return

        for next_stop in sorted(
            remaining,
            key=lambda s: distance_matrix[current][s]
        ):
            visited.append(next_stop)
            branch_and_bound(
                next_stop,
                remaining - {next_stop},
                cost_so_far + distance_matrix[current][next_stop]
            )
            visited.pop()

    visited = []
    branch_and_bound(depot, set(stops), 0)
    return best_order, best_distance

Heuristic for Larger Routes

For routes with 12+ stops, exact solutions become impractical. We fall back to a nearest-neighbor heuristic followed by 2-opt improvement:

def tsp_heuristic(distance_matrix, depot):
    n = len(distance_matrix)
    unvisited = set(range(n)) - {depot}
    tour = [depot]
    current = depot

    # Phase 1: Nearest-neighbor construction
    while unvisited:
        nearest = min(unvisited, key=lambda s: distance_matrix[current][s])
        tour.append(nearest)
        unvisited.remove(nearest)
        current = nearest
    tour.append(depot)

    # Phase 2: 2-opt improvement
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 2):
            for j in range(i + 1, len(tour) - 1):
                # Cost of current edges
                d_current = (
                    distance_matrix[tour[i-1]][tour[i]] +
                    distance_matrix[tour[j]][tour[j+1]]
                )
                # Cost if we reverse the segment between i and j
                d_new = (
                    distance_matrix[tour[i-1]][tour[j]] +
                    distance_matrix[tour[i]][tour[j+1]]
                )
                if d_new < d_current:
                    tour[i:j+1] = reversed(tour[i:j+1])
                    improved = True

    return tour, tour_distance(tour, distance_matrix)

Why 2-opt? It's O(n²) per pass and typically converges in 2–5 passes. For routes with 15–25 stops, it reliably gets within 2–5% of optimal. More sophisticated approaches (3-opt, LKH) offer diminishing returns for this problem size and don't justify the added complexity.

Mapping API Waypoint Optimization

As an alternative (or validation), we can delegate stop ordering to the mapping API itself. Google Maps' Directions API, for example, accepts optimizeWaypoints: true:

const response = await mapsApi.directions({
  origin: depot,
  destination: depot,
  waypoints: stops.map(s => ({ location: s })),
  optimizeWaypoints: true,
});

// response.routes[0].waypoint_order gives the optimized sequence
const optimizedOrder: number[] = response.routes[0].waypoint_order;

This has the advantage of using real-time traffic data and actual road geometry. The downside: it's a black box, you're rate-limited by the API, and most providers cap waypoints at 23–25 per request.

In practice, we use our own TSP solver as the primary method and cross-reference against the mapping API's ordering for quality validation.

Putting It All Together: The Pipeline

┌─────────────────────────────────────────────────────┐
│  Input: Passenger list (CSV/Excel)                  │
│  [name, address, school, vehicle_capacity]           │
└───────────────────────┬─────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│  Geocoding                                          │
│  address → (lat, lng) via Maps API                  │
│  Flag invalid/ambiguous addresses                   │
└───────────────────────┬─────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│  Capacitated Clustering                             │
│  Constrained k-means + swap optimization            │
│  12 restarts × 100 iterations                       │
│  Output: passenger → vehicle assignment             │
└───────────────────────┬─────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│  Distance Matrix Computation                        │
│  Per-cluster pairwise driving distances             │
│  Batched API calls + multi-layer caching            │
└───────────────────────┬─────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│  Stop Sequencing (TSP)                              │
│  ≤11 stops → exact (branch & bound)                 │
│  >11 stops → nearest-neighbor + 2-opt               │
│  Cross-validate with mapping API                    │
└───────────────────────┬─────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│  Output                                             │
│  Optimized routes with stop sequences               │
│  Distance & time estimates per route                │
│  Route polylines for map visualization              │
└─────────────────────────────────────────────────────┘

For a concrete example: 1,200 passengers across 40 vehicles. The clustering stage runs in ~2 seconds. Distance matrix computation takes 10–30 seconds (depending on cache hit rate). TSP for 40 routes runs in parallel and completes in <1 second. Total wall time: under a minute.

A manual planner doing the same work? 8–10 working days, with routes typically 15–30% longer than our output.

The Hard Parts Aren't the Algorithms

The algorithms above are well-understood CS fundamentals. The genuinely hard engineering problems are:

1. Address Data Quality

Real-world passenger data is messy. Addresses are misspelled, incomplete, or ambiguous. Geocoding "123 Main St" might return 15 results across different cities. We run a multi-pass geocoding pipeline:

Geocode with the full address
If confidence is low, retry with normalized components (street, city, postal code separately)
If still ambiguous, flag for manual review and skip from optimization

Bad coordinates silently poison the entire optimization. A single passenger geocoded to the wrong city can distort an entire cluster.

2. Re-optimization vs. Full Optimization

When 5 new passengers join mid-term, should you re-optimize everything from scratch or surgically insert them into existing routes?

Full re-optimization produces mathematically better results but reshuffles routes that drivers and parents have memorized. We offer both modes:

Full re-optimization: Run the entire pipeline. Best total distance, but maximum disruption.
Incremental insertion: For each new passenger, find the route with available capacity where inserting them increases total distance the least. O(new_passengers × routes × stops_per_route).

def incremental_insert(new_passengers, existing_routes, capacity):
    for passenger in new_passengers:
        best_route = None
        best_position = None
        best_cost_increase = float('inf')

        for route in existing_routes:
            if len(route.passengers) >= capacity:
                continue

            for pos in range(1, len(route.stops)):
                # Cost of inserting at this position
                cost_increase = (
                    distance(route.stops[pos-1], passenger.location) +
                    distance(passenger.location, route.stops[pos]) -
                    distance(route.stops[pos-1], route.stops[pos])
                )
                if cost_increase < best_cost_increase:
                    best_cost_increase = cost_increase
                    best_route = route
                    best_position = pos

        if best_route:
            best_route.insert(best_position, passenger)

3. The Human Override Layer

Algorithmic output is a starting point. Planners need to drag passengers between routes, reorder stops, and apply constraints the algorithm doesn't know about ("this child shouldn't be on the same bus as that child", "this street is under construction").

Every manual override needs to be validated: does the swap violate capacity? Does reordering increase total distance? The system needs to provide instant feedback on the cost impact of every manual change.

Results in Practice

On production data from operations we manage:

Metric	Manual Planning	Algorithmic + Human Review
Planning time (40 routes)	8–10 days	< 2 hours
Avg route distance vs optimal	+20–25%	+2–4%
Capacity utilization std dev	8.3 passengers	2.1 passengers
Routes requiring post-review changes	N/A	~12% get minor tweaks

The algorithmic approach doesn't eliminate human involvement — it reduces it from days of tedious manual work to a focused hour of review and refinement.

DEV Community