Rajkiran

Posted on Jun 15

System Design - 24. Geospatial Indexing: How Uber Finds the Nearest Driver Among Millions in Milliseconds

#softwaredevelopment #google #systemdesign #productivity

Geospatial Indexing: How Uber Finds the Nearest Driver Among Millions in Milliseconds

Series: System Design Mastery — Day 8 of 15
Reading time: 11 min
Covers: Quadtrees, Geohash, Google S2, Uber H3, Real-Time Index Updates

The Query That SQL Was Never Built For

You open Uber. The app needs to answer: "Which drivers are within 2km of this exact latitude/longitude, right now, out of the millions of drivers active worldwide?"

A naive SQL approach:

SELECT driver_id, latitude, longitude
FROM drivers
WHERE 
  SQRT(POW(latitude - :user_lat, 2) + POW(longitude - :user_lng, 2)) < 0.02

This computes the distance for every single row in the table — every active driver on Earth — for every search. At Uber's scale (millions of drivers, location updates every 3-5 seconds), this is computationally impossible to run in real time.

The fundamental problem: standard database indexes (B+ trees, hash indexes) are designed for 1-dimensional data — sort by a single value (a timestamp, an ID, a name). Latitude and longitude are 2-dimensional. A B+ tree index on latitude alone, or longitude alone, doesn't help you efficiently find "nearby" points — nearby in 2D space doesn't mean nearby in either dimension individually.

Geospatial indexing solves this by transforming 2D space into something that CAN be indexed efficiently in 1D — and there are three major approaches.

Approach 1: Quadtrees — Recursive Spatial Division

A Quadtree recursively divides 2D space into four equal quadrants, continuing to subdivide quadrants that contain "too much" data.

Start: entire map (1 region)

┌─────────────┬─────────────┐
│             │             │
│   NW        │    NE       │
│             │             │
├─────────────┼─────────────┤
│             │             │
│   SW        │    SE       │
│             │             │
└─────────────┴─────────────┘

If NE quadrant has too many points (e.g., dense city center), 
subdivide it further:

┌─────────────┬──────┬──────┐
│             │ NE-NW│ NE-NE│
│   NW        ├──────┼──────┤
│             │ NE-SW│ NE-SE│
├─────────────┼──────┴──────┤
│             │             │
│   SW        │    SE       │
│             │             │
└─────────────┴─────────────┘

This creates a tree structure where densely populated areas (city centers, with thousands of drivers per square km) have many small leaf nodes, while sparsely populated areas (rural regions, with few drivers per 100 square km) have few large leaf nodes.

Finding nearby drivers:

1. Locate the leaf node(s) containing the user's location
2. Drivers in that leaf node are candidates
3. If not enough candidates, check neighboring leaf nodes
   (a leaf representing a small area in a dense city center might
   need to check 1-2 neighbors; a leaf representing a large rural 
   area might already have all nearby drivers)

Advantages:

Naturally adapts to data density — dense areas get fine-grained subdivision automatically
Conceptually simple — a tree structure most engineers already understand

Disadvantages:

Tree traversal for neighbor lookups can be complex — finding "all leaves adjacent to this leaf" isn't trivial when leaves are different sizes
Rebalancing as data shifts (rush hour moves driver density from residential areas to business districts) requires tree restructuring

Used by: Many GIS (Geographic Information Systems) applications, MongoDB's older geospatial indexes (2dsphere indexes use a similar recursive subdivision).

Approach 2: Geohash — Encoding Location as a String

Geohash takes a completely different approach: encode a latitude/longitude pair into a single string, where the key property is — the longer the shared prefix between two geohashes, the closer the locations are.

How Geohash Encoding Works

Latitude/Longitude: (37.7749, -122.4194)  ← San Francisco

The encoding interleaves bits from latitude and longitude ranges,
recursively halving the search space:

Step 1: Is longitude in [-180, 0] or [0, 180]? 
        -122.4194 is in [-180, 0] → bit = 0
Step 2: Is latitude in [-90, 0] or [0, 90]?
        37.7749 is in [0, 90] → bit = 1
Step 3: Continue alternating longitude/latitude, halving each time...

After enough bits, group into base32 characters:
Result: "9q8yyk8ytpxr" ← this is the Geohash for San Francisco

The Prefix Property

"9q8yyk8ytpxr"  → San Francisco (precise location)
"9q8yyk8ytpx"   → San Francisco (slightly larger area, ~1m precision)
"9q8yyk8yt"     → San Francisco area (~150m precision)
"9q8yyk"        → San Francisco region (~5km precision)
"9q8y"          → Bay Area (~80km precision)
"9q"            → California / Nevada region (~1700km precision)

Each character you remove from the end roughly multiplies the represented area by 32 (since base32 has 32 possible characters per position).

Finding Nearby Points with Geohash

-- Find all locations with geohash starting with "9q8yyk" 
-- (same ~5km grid cell as our target location)
SELECT * FROM drivers WHERE geohash LIKE '9q8yyk%'

This is a prefix match — something B+ tree indexes (Topic 25) handle extremely efficiently! You've converted a 2D "nearby" query into a 1D "string prefix" query that any standard database index can answer fast.

The Edge Case: Boundary Problem

Two points that are physically VERY close to each other can have 
COMPLETELY DIFFERENT geohash prefixes if they're on opposite sides 
of a grid cell boundary:

Point A: geohash "9q8yyk..." (just inside one grid cell)
Point B: geohash "9q8yym..." (just across the boundary, 
         physically 10 meters from Point A)

A prefix search for "9q8yyk%" would MISS Point B entirely, 
even though it's very close.

The fix: When searching, also check the 8 neighboring grid cells (the cells surrounding the cell containing the query point) — not just the cell containing the point itself. Most Geohash implementations include a "neighbors" function for exactly this purpose.

Approach 3: Google S2 and Uber H3 — Cell-Based Indexing at Scale

Geohash has a subtle issue: its grid cells are rectangular, and rectangles on a sphere (the Earth) have wildly varying actual sizes depending on latitude — a "1-degree by 1-degree" cell near the equator covers a much larger physical area than the same cell near the poles. This distortion complicates distance calculations.

Google S2: Projecting the Sphere onto a Cube

S2 projects the Earth's surface onto the 6 faces of a cube, then recursively subdivides each face into a hierarchy of cells — similar in spirit to a Quadtree, but designed specifically to minimize the distortion of spherical geometry.

Earth (sphere) → projected onto cube faces → each face recursively 
subdivided into a hierarchy of cells (S2 cells)

Each S2 cell has a unique 64-bit ID.
Cells at the same "level" have roughly similar areas, 
regardless of where on Earth they are — solving the 
rectangular-distortion problem of simple lat/lng grids.

Used by: Google Maps internally for spatial indexing and proximity queries at global scale.

Uber H3: Hexagonal Grid System

Uber open-sourced H3, which divides the Earth into a hierarchical grid of hexagonal cells.

Why hexagons (and not squares/rectangles)?

Square grid:
┌───┬───┬───┐
│   │   │   │   Each cell has 8 neighbors, but 4 are "diagonal" 
├───┼───┼───┤   (share only a corner) and 4 are "adjacent" 
│   │ X │   │   (share an edge) — DIFFERENT distances to 
├───┼───┼───┤   center, creating asymmetry
│   │   │   │
└───┴───┴───┘

Hexagonal grid:
  ╱ ╲ ╱ ╲ ╱ ╲
 │   │ X │   │    Each cell has exactly 6 neighbors, 
  ╲ ╱ ╲ ╱ ╲ ╱     ALL at the SAME distance from the center.
   ╲ ╱ ╲ ╱ ╲      Uniform adjacency — no diagonal/edge distinction.

Why uniform adjacency matters for ride-hailing: When Uber's matching algorithm asks "find drivers in cells near this rider," with hexagons, "near" has a consistent, uniform meaning in every direction. With square grids, a driver in a "diagonal" cell is actually farther away than a driver in an "edge-adjacent" cell — even though both appear as "1 cell away" — creating subtle biases in matching.

H3's hierarchical resolution:

H3 has 16 resolution levels:
  Resolution 0: ~4,250,000 km² per cell (continent-scale)
  Resolution 7: ~5.2 km² per cell (city neighborhood-scale)
  Resolution 9: ~0.1 km² per cell (city block-scale)
  Resolution 15: ~0.9 m² per cell (single building-scale)

Uber typically uses resolution 7-9 for driver matching — 
city-block to neighborhood granularity.

Real Uber matching flow using H3:

1. Rider requests a trip at location (lat, lng)
2. Convert to H3 cell at resolution 9: cell_id = "8928308280fffff"
3. Query: "which drivers are in THIS cell, or its 6 neighbors?"
   (a simple lookup — drivers' current cell_ids are indexed)
4. Rank candidate drivers by actual road-network distance/ETA
5. Dispatch the best match

The H3 cell lookup (step 3) is essentially O(1) — a hash lookup by cell ID, returning a small set of candidates. The expensive part (step 4 — actual routing/ETA calculation) only runs on this small candidate set, not on every driver in the city.

Comparing the Three Approaches

	Quadtree	Geohash	H3 / S2
Shape	Rectangular, variable size (adaptive)	Rectangular, fixed grid	Hexagonal (H3) / roughly square (S2), fixed hierarchical grid
Adapts to density?	Yes — subdivides dense areas	No — fixed grid regardless of density	No — fixed grid, but multiple resolution levels available
Neighbor queries	Complex (variable-size neighbors)	Simple but has boundary issues	Simple, uniform (especially H3's hexagons)
Distortion	N/A (works in 2D plane)	High at extreme latitudes	Low (designed for spherical geometry)
Used by	GIS systems, older spatial databases	General-purpose location encoding	Uber (H3), Google Maps (S2)

The practical guidance: For most system design interviews, Geohash is the easiest to explain and implement (prefix matching on a string — leverages existing database indexes). H3/S2 are the "I know what real companies use at scale" answer — mentioning them, especially H3's hexagonal uniform-adjacency property, signals deeper knowledge.

Handling Real-Time Updates: The Hard Part

Static geospatial indexing (indexing restaurant locations, which rarely move) is one problem. Indexing millions of moving objects, each updating position every few seconds, is a much harder operational challenge.

The challenge:
  - 1 million active drivers
  - Each sends a GPS update every 3-5 seconds
  - = roughly 200,000-300,000 location updates PER SECOND globally

For each update:
  1. Remove driver from their OLD geo-index cell
  2. Add driver to their NEW geo-index cell
  (most updates: old cell == new cell, since drivers move slowly 
   relative to update frequency — but the system must still process it)

Why this rules out traditional databases for the "live" index:
A SQL database with a spatial index, receiving 250,000 writes/second to update positions, would be overwhelmed — especially since most "writes" are actually small UPDATEs to existing rows, which still require index maintenance.

The production pattern:

Live driver locations → Redis (in-memory, ephemeral)
  - Redis GEOADD / GEORADIUS commands provide built-in geospatial 
    indexing using geohash internally
  - In-memory writes handle 250K updates/sec easily
  - TTL on location keys — if a driver stops sending updates 
    (app crashed, phone died), they automatically disappear 
    from the index after a short timeout (e.g., 30 seconds)

Historical location data (for analytics, ETAs, surge calculation)
  → Cassandra — write-heavy, time-series friendly

Redis GEO commands (built on Geohash internally):

GEOADD drivers:active -122.4194 37.7749 "driver_12345"
GEORADIUS drivers:active -122.4194 37.7749 2 km
  → returns all drivers within 2km, computed using geohash 
    prefix matching under the hood

This is why "Redis for live location, Cassandra for history" is such a common pattern in ride-hailing and delivery system designs — it directly addresses the write-throughput requirement that a traditional spatial database can't meet.

Interview Scenario: "Design 'Find Nearby Drivers'"

The structured answer:

"I'd use a geospatial index based on Geohash or H3 — I'll go with H3 for this since it's what Uber actually uses, and its uniform hexagonal adjacency avoids the corner-case distance distortions of square grids.

Each driver's current location gets converted to an H3 cell ID at a resolution appropriate for city-block granularity — roughly resolution 8 or 9. This cell ID becomes part of the key when storing the driver's location.

For live location data, I'd use Redis — drivers send GPS updates every few seconds, and at scale that's hundreds of thousands of writes per second, which Redis handles in-memory easily. Each driver's entry has a short TTL, so a driver who stops sending updates (crashed app, dead phone) automatically disappears from the active index.

To find nearby drivers: convert the rider's location to its H3 cell, then query that cell plus its 6 neighboring cells — a simple set of lookups. This gives a small candidate set, typically tens of drivers even in a dense city, not thousands.

Only THEN do I run the expensive operation — actual routing/ETA calculation via a routing service — on this small candidate set, not on every driver in the city. The geo-index's job is to cheaply narrow millions of drivers down to dozens; the routing service's job is to rank those dozens accurately."

Key Takeaways

Standard B+ tree indexes can't efficiently answer 2D "nearby" queries — geospatial indexing transforms 2D proximity into something indexable.
Quadtree: recursive spatial subdivision, adapts to data density, but complex neighbor queries.
Geohash: encodes lat/lng as a string where shared prefixes = proximity — leverages standard database prefix indexes, but has a boundary problem (check neighboring cells too).
Google S2 projects the sphere onto a cube to minimize distortion; Uber H3 uses hexagonal cells for uniform adjacency in all directions.
Real-time location updates (hundreds of thousands/second) require in-memory storage — Redis GEO commands (built on Geohash) are the standard production pattern.
The architectural pattern: geo-index narrows millions of candidates to dozens cheaply; expensive ranking (routing/ETA) only runs on that small set.

What's Next

Topic 25 — with a grab-bag of essential data structures: Skip Lists (how Redis sorted sets work), HyperLogLog (counting a billion unique visitors with almost no memory), Tries (autocomplete), and LSM Trees vs B+ Trees (the fundamental write-optimized vs read-optimized database internals choice).

Topic 24 of the System Design Mastery series.

Tags: system-design geospatial uber data-structures backend algorithms interview-prep

DEV Community