If you've ever looked at the Traveling Salesman Problem and thought "interesting, but when would I actually need this?" — fleet route optimization is one of those real-world domains where combinatorial optimization isn't academic. It's the difference between a transport operation that burns cash and one that runs efficiently.
I work on RouteBot, a transport management platform for school buses and personnel shuttles. Every day, our system distributes thousands of passengers across hundreds of vehicles and sequences their stops. This article walks through the algorithmic pipeline we use — and the engineering tradeoffs involved.
For a broader, less code-focused comparison of manual vs. algorithmic route planning (including operational scenarios and implementation roadmaps), see the full guide on our blog.
The Problem, Formally
Given:
- A set of
npassengers, each with a geographic coordinate (lat/lng) - A set of
kvehicles, each with a capacityc - A destination (school, factory, office)
Find:
- An assignment of passengers to vehicles such that no vehicle exceeds capacity and total intra-cluster distance is minimized (Capacitated Clustering)
- For each vehicle, a stop sequence that minimizes total travel distance (TSP variant)
This is a variant of the Capacitated Vehicle Routing Problem (CVRP), which is NP-hard. Exact solutions are computationally infeasible for real-world fleet sizes. We need heuristics that produce near-optimal solutions in bounded time.
Step 1: Capacitated Clustering
The first stage assigns passengers to vehicles. This is not standard k-means — we have hard capacity constraints.
Why k-means Doesn't Work Directly
Standard k-means minimizes intra-cluster variance without any constraint on cluster size. In transport routing, an unconstrained cluster might assign 40 passengers to a 25-seat vehicle. You could run k-means and then redistribute overflow, but that post-hoc rebalancing tends to break geographic compactness.
Our Approach: Constrained k-means with Multi-Restart
We use a modified k-means that enforces capacity during assignment, combined with a swap-based local search to escape local minima.
def capacitated_clustering(passengers, k, capacity, n_restarts=12, max_iter=100):
best_assignment = None
best_cost = float('inf')
for _ in range(n_restarts):
# Initialize centroids (k-means++ seeding)
centroids = kmeans_plus_plus_init(passengers, k)
assignment = [None] * len(passengers)
for iteration in range(max_iter):
cluster_sizes = [0] * k
# Assignment step: nearest centroid with capacity check
# Sort passengers by distance to nearest centroid (farthest first)
# to ensure hard-to-place passengers get priority
passengers_by_difficulty = sorted(
range(len(passengers)),
key=lambda i: min(
haversine(passengers[i], centroids[j]) for j in range(k)
),
reverse=True
)
for i in passengers_by_difficulty:
# Find nearest centroid with available capacity
candidates = sorted(
range(k),
key=lambda j: haversine(passengers[i], centroids[j])
)
for j in candidates:
if cluster_sizes[j] < capacity:
assignment[i] = j
cluster_sizes[j] += 1
break
# Update step: recompute centroids
new_centroids = compute_centroids(passengers, assignment, k)
if centroids_converged(centroids, new_centroids):
break
centroids = new_centroids
# Local search: try swapping passengers between clusters
assignment = swap_optimization(passengers, assignment, k, capacity)
cost = total_intra_cluster_distance(passengers, assignment, k)
if cost < best_cost:
best_cost = cost
best_assignment = assignment[:]
return best_assignment
Key decisions:
- k-means++ initialization prevents poor starting configurations. Random init can produce wildly different results between runs.
- Farthest-first assignment during the constrained step ensures outlier passengers (who have the fewest viable cluster options) get assigned before the easy cases. Without this, outliers end up in distant clusters because all nearby ones filled up.
- Multi-restart (typically 10–15 restarts) is critical because the constrained assignment step makes the algorithm more sensitive to initial conditions than standard k-means.
- Swap optimization as a post-processing step tries all pairwise swaps between clusters. If swapping passenger A from cluster X with passenger B from cluster Y reduces total distance without violating capacity, the swap is accepted. This catches improvements that the greedy assignment misses.
Complexity
Each restart runs in O(max_iter × n × k) for the assignment step. The swap optimization is O(n² × k) in the worst case but is fast in practice because we only evaluate swaps between geographically adjacent clusters. Total: O(n_restarts × (max_iter × n × k + n²)).
For 1,200 passengers and 40 vehicles with 12 restarts, this completes in under 2 seconds on commodity hardware.
Step 2: Distance Matrix Computation
Before we can sequence stops, we need the pairwise driving distances between all stops in a cluster. This is where the problem stops being pure geometry and becomes an engineering challenge.
Haversine vs. Driving Distance
Great-circle distance (haversine) is fine for clustering — it's a fast proxy for geographic proximity. But for stop sequencing, you need actual road distances. Two stops might be 500m apart as the crow flies but 3km by road if a river separates them.
Batching API Calls
Mapping APIs (Google Maps, Mapbox, HERE) provide distance matrix endpoints. For a route with s stops, we need an s × s matrix — that's s² distance pairs. Most APIs cap matrix requests at 25×25 or similar.
async function computeDistanceMatrix(
stops: LatLng[],
batchSize: number = 25
): Promise<number[][]> {
const n = stops.length;
const matrix: number[][] = Array.from({ length: n }, () =>
new Array(n).fill(0)
);
// For small matrices, single API call
if (n <= batchSize) {
const result = await mapsApi.distanceMatrix({
origins: stops,
destinations: stops,
});
return parseMatrixResponse(result);
}
// For larger matrices, partition into sub-matrices
for (let i = 0; i < n; i += batchSize) {
for (let j = 0; j < n; j += batchSize) {
const originSlice = stops.slice(i, Math.min(i + batchSize, n));
const destSlice = stops.slice(j, Math.min(j + batchSize, n));
const result = await mapsApi.distanceMatrix({
origins: originSlice,
destinations: destSlice,
});
// Fill the corresponding sub-matrix
const subMatrix = parseMatrixResponse(result);
for (let oi = 0; oi < originSlice.length; oi++) {
for (let di = 0; di < destSlice.length; di++) {
matrix[i + oi][j + di] = subMatrix[oi][di];
}
}
}
}
return matrix;
}
Caching Strategy
Distance matrices are expensive (both in latency and API cost). We cache them aggressively:
- L1 cache: In-memory LRU for the current optimization session
- L2 cache: Database-backed cache keyed on sorted coordinate pairs with a configurable TTL (we use 30 days — road networks don't change that often)
- Cache invalidation: Manual purge when an operator reports a new road or closure
For routes with 15–20 stops, the matrix has 225–400 entries. With caching, repeat optimizations (e.g., after adding one passenger) hit ~90%+ cache rate.
Step 3: Stop Sequencing (TSP)
With passengers clustered and the distance matrix computed, each route needs an optimal stop order.
Exact Solution for Small Routes
For routes with ≤ 11 stops, we evaluate all permutations. 11! = 39,916,800 — that's tight but feasible if we prune aggressively:
def tsp_exact(distance_matrix, depot):
"""Exact TSP via branch-and-bound for small instances."""
n = len(distance_matrix)
stops = [i for i in range(n) if i != depot]
best_distance = float('inf')
best_order = None
def branch_and_bound(current, remaining, cost_so_far):
nonlocal best_distance, best_order
if not remaining:
total = cost_so_far + distance_matrix[current][depot]
if total < best_distance:
best_distance = total
best_order = [depot] + visited + [depot]
return
# Pruning: if current cost already exceeds best, abandon
if cost_so_far >= best_distance:
return
# Lower bound: minimum remaining edge cost
min_remaining = sum(
min(distance_matrix[r][j] for j in remaining if j != r)
for r in remaining
) if len(remaining) > 1 else 0
if cost_so_far + min_remaining >= best_distance:
return
for next_stop in sorted(
remaining,
key=lambda s: distance_matrix[current][s]
):
visited.append(next_stop)
branch_and_bound(
next_stop,
remaining - {next_stop},
cost_so_far + distance_matrix[current][next_stop]
)
visited.pop()
visited = []
branch_and_bound(depot, set(stops), 0)
return best_order, best_distance
Heuristic for Larger Routes
For routes with 12+ stops, exact solutions become impractical. We fall back to a nearest-neighbor heuristic followed by 2-opt improvement:
def tsp_heuristic(distance_matrix, depot):
n = len(distance_matrix)
unvisited = set(range(n)) - {depot}
tour = [depot]
current = depot
# Phase 1: Nearest-neighbor construction
while unvisited:
nearest = min(unvisited, key=lambda s: distance_matrix[current][s])
tour.append(nearest)
unvisited.remove(nearest)
current = nearest
tour.append(depot)
# Phase 2: 2-opt improvement
improved = True
while improved:
improved = False
for i in range(1, len(tour) - 2):
for j in range(i + 1, len(tour) - 1):
# Cost of current edges
d_current = (
distance_matrix[tour[i-1]][tour[i]] +
distance_matrix[tour[j]][tour[j+1]]
)
# Cost if we reverse the segment between i and j
d_new = (
distance_matrix[tour[i-1]][tour[j]] +
distance_matrix[tour[i]][tour[j+1]]
)
if d_new < d_current:
tour[i:j+1] = reversed(tour[i:j+1])
improved = True
return tour, tour_distance(tour, distance_matrix)
Why 2-opt? It's O(n²) per pass and typically converges in 2–5 passes. For routes with 15–25 stops, it reliably gets within 2–5% of optimal. More sophisticated approaches (3-opt, LKH) offer diminishing returns for this problem size and don't justify the added complexity.
Mapping API Waypoint Optimization
As an alternative (or validation), we can delegate stop ordering to the mapping API itself. Google Maps' Directions API, for example, accepts optimizeWaypoints: true:
const response = await mapsApi.directions({
origin: depot,
destination: depot,
waypoints: stops.map(s => ({ location: s })),
optimizeWaypoints: true,
});
// response.routes[0].waypoint_order gives the optimized sequence
const optimizedOrder: number[] = response.routes[0].waypoint_order;
This has the advantage of using real-time traffic data and actual road geometry. The downside: it's a black box, you're rate-limited by the API, and most providers cap waypoints at 23–25 per request.
In practice, we use our own TSP solver as the primary method and cross-reference against the mapping API's ordering for quality validation.
Putting It All Together: The Pipeline
┌─────────────────────────────────────────────────────┐
│ Input: Passenger list (CSV/Excel) │
│ [name, address, school, vehicle_capacity] │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Geocoding │
│ address → (lat, lng) via Maps API │
│ Flag invalid/ambiguous addresses │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Capacitated Clustering │
│ Constrained k-means + swap optimization │
│ 12 restarts × 100 iterations │
│ Output: passenger → vehicle assignment │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Distance Matrix Computation │
│ Per-cluster pairwise driving distances │
│ Batched API calls + multi-layer caching │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Stop Sequencing (TSP) │
│ ≤11 stops → exact (branch & bound) │
│ >11 stops → nearest-neighbor + 2-opt │
│ Cross-validate with mapping API │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Output │
│ Optimized routes with stop sequences │
│ Distance & time estimates per route │
│ Route polylines for map visualization │
└─────────────────────────────────────────────────────┘
For a concrete example: 1,200 passengers across 40 vehicles. The clustering stage runs in ~2 seconds. Distance matrix computation takes 10–30 seconds (depending on cache hit rate). TSP for 40 routes runs in parallel and completes in <1 second. Total wall time: under a minute.
A manual planner doing the same work? 8–10 working days, with routes typically 15–30% longer than our output.
The Hard Parts Aren't the Algorithms
The algorithms above are well-understood CS fundamentals. The genuinely hard engineering problems are:
1. Address Data Quality
Real-world passenger data is messy. Addresses are misspelled, incomplete, or ambiguous. Geocoding "123 Main St" might return 15 results across different cities. We run a multi-pass geocoding pipeline:
- Geocode with the full address
- If confidence is low, retry with normalized components (street, city, postal code separately)
- If still ambiguous, flag for manual review and skip from optimization
Bad coordinates silently poison the entire optimization. A single passenger geocoded to the wrong city can distort an entire cluster.
2. Re-optimization vs. Full Optimization
When 5 new passengers join mid-term, should you re-optimize everything from scratch or surgically insert them into existing routes?
Full re-optimization produces mathematically better results but reshuffles routes that drivers and parents have memorized. We offer both modes:
- Full re-optimization: Run the entire pipeline. Best total distance, but maximum disruption.
-
Incremental insertion: For each new passenger, find the route with available capacity where inserting them increases total distance the least.
O(new_passengers × routes × stops_per_route).
def incremental_insert(new_passengers, existing_routes, capacity):
for passenger in new_passengers:
best_route = None
best_position = None
best_cost_increase = float('inf')
for route in existing_routes:
if len(route.passengers) >= capacity:
continue
for pos in range(1, len(route.stops)):
# Cost of inserting at this position
cost_increase = (
distance(route.stops[pos-1], passenger.location) +
distance(passenger.location, route.stops[pos]) -
distance(route.stops[pos-1], route.stops[pos])
)
if cost_increase < best_cost_increase:
best_cost_increase = cost_increase
best_route = route
best_position = pos
if best_route:
best_route.insert(best_position, passenger)
3. The Human Override Layer
Algorithmic output is a starting point. Planners need to drag passengers between routes, reorder stops, and apply constraints the algorithm doesn't know about ("this child shouldn't be on the same bus as that child", "this street is under construction").
Every manual override needs to be validated: does the swap violate capacity? Does reordering increase total distance? The system needs to provide instant feedback on the cost impact of every manual change.
Results in Practice
On production data from operations we manage:
| Metric | Manual Planning | Algorithmic + Human Review |
|---|---|---|
| Planning time (40 routes) | 8–10 days | < 2 hours |
| Avg route distance vs optimal | +20–25% | +2–4% |
| Capacity utilization std dev | 8.3 passengers | 2.1 passengers |
| Routes requiring post-review changes | N/A | ~12% get minor tweaks |
The algorithmic approach doesn't eliminate human involvement — it reduces it from days of tedious manual work to a focused hour of review and refinement.
Further Reading
- Full non-technical guide: Manual vs Algorithmic Route Planning — covers operational scenarios, implementation roadmap, and decision framework
- Google OR-Tools — Vehicle Routing — open-source solver for CVRP variants
- TSPLIB — benchmark instances for TSP research
- RouteBot Live Demo — see the pipeline above in action (no signup required)
- RouteBot Docs — API reference and workflow documentation
If you're working on a similar problem or have questions about the approach, I'd be happy to discuss in the comments. And if you manage a transport operation and the scenarios above sound familiar, check out RouteBot.
Top comments (0)