Introduction
Google Maps gives 1 billion users turn by turn navigation, real time traffic, and instant place search simultaneously. On the surface it is just a map with directions. Underneath it is one of the most sophisticated distributed systems ever built — spanning graph algorithms, real time data pipelines, tile rendering, geographic search, and live traffic computation, all working together seamlessly. This post walks through each challenge, question by question, including wrong turns and how to navigate out of them.
Challenge 1: Representing Earth's Road Network
Interview Question: The entire road network of Earth has billions of roads and intersections. Finding the shortest path between two points on a graph this size seems computationally impossible in under 1 second. How do you represent the road network and what algorithm finds the shortest path?
Solution: A weighted directed graph with Dijkstra's algorithm as the foundation.
Road network as a graph:
- Every intersection is a node — roughly 50 million nodes globally
- Every road segment between intersections is a directed edge — roughly 100 million edges
- Each edge has weights — distance, speed limit, road type, turn restrictions
- One way roads are directed edges — two way roads are two edges in opposite directions
Dijkstra's algorithm finds the shortest path by exploring nodes in order of cumulative cost from the source. It guarantees the optimal path in any graph with non-negative edge weights.
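A minimal sketch of Dijkstra's priority-queue exploration on a toy road graph (node names and weights are illustrative, not real map data):

```python
import heapq

def dijkstra(graph, source, target):
    """Shortest path over non-negative edge weights.
    graph: dict mapping node -> list of (neighbor, weight) pairs."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, already found a cheaper route
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    # Reconstruct the path by walking predecessors back from the target
    path, node = [], target
    while node != source:
        path.append(node)
        node = prev[node]
    path.append(source)
    return dist[target], path[::-1]

# Toy network: intersections A..D, directed weighted edges
roads = {
    "A": [("B", 4), ("C", 1)],
    "C": [("B", 2), ("D", 5)],
    "B": [("D", 1)],
}
print(dijkstra(roads, "A", "D"))  # (4, ['A', 'C', 'B', 'D'])
```

Note that the search pops nodes strictly in order of cumulative cost, which is exactly why it cannot skip the millions of irrelevant side streets at Earth scale.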
Why Dijkstra alone fails at Earth scale:
- Graph has 50 million nodes and 100 million edges
- Dijkstra complexity is O(E log V) — roughly 100 million × log₂(50 million) ≈ a few billion operations
- On a single machine this takes many seconds per query — far beyond the latency budget
- Google Maps responds in under 1 second for 1 billion daily queries
- Dijkstra on raw Earth graph is completely unworkable at this scale
Key Insight: Dijkstra is the correct algorithmic foundation but cannot scale to Earth's road network without significant optimization. The graph structure and search space must be dramatically reduced.
Challenge 2: Fast Routing with Contraction Hierarchies
Interview Question: Dijkstra exploring all roads equally for a Mumbai to Delhi query wastes enormous computation on tiny side streets that could never be part of the optimal route. How do you make routing dramatically faster?
Intuition: humans navigate hierarchically — for long distances you immediately think of major highways, not local streets. The routing algorithm should do the same: structure the road network into importance levels and search only the levels relevant to each query distance.
Solution: Contraction Hierarchies — hierarchical road network with pre-computed shortcuts.
Road hierarchy levels:
- Level 1 — Local roads — walking distance queries
- Level 2 — City roads — intra-city queries
- Level 3 — State highways — inter-city queries
- Level 4 — National highways — long distance queries
Routing by distance:
- Mumbai to Delhi — 1400km — search Level 4 national highways only — find path instantly
- Mumbai to Pune — 150km — Level 4 has no direct path — search Level 3 state highways — find path
- Street to nearby mall — search Level 2 city roads — find path
- Walking to neighbor — search Level 1 local roads — find path
Pre-computed shortcuts:
Google does not compute hierarchies at query time. Shortcuts between important nodes are pre-computed offline during map processing. The algorithm contracts less important nodes and creates direct shortcut edges between important nodes that bypass them. Query time just looks up these shortcuts rather than exploring the full graph.
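Real contraction hierarchies order every node by importance and run a witness search before adding each shortcut; the sketch below shows only the core contraction step on a toy graph (node names are illustrative, and the witness search is omitted):

```python
def contract(graph, node):
    """Contract `node`: for every in-neighbor v and out-neighbor w,
    add a shortcut edge v->w preserving the path cost through `node`,
    then remove `node` from the graph.
    graph: dict of dicts, graph[u][v] = weight of directed edge u->v."""
    in_nbrs = [(v, edges[node]) for v, edges in graph.items()
               if node in edges and v != node]
    out_nbrs = [(w, c) for w, c in graph[node].items() if w != node]
    for v, c_in in in_nbrs:
        for w, c_out in out_nbrs:
            if v == w:
                continue
            # Keep the shortcut only if it improves (or creates) v->w
            if c_in + c_out < graph[v].get(w, float("inf")):
                graph[v][w] = c_in + c_out
    for v, _ in in_nbrs:
        del graph[v][node]
    del graph[node]

# Side street S sits between highway nodes A and B:
g = {"A": {"S": 2}, "S": {"B": 3}, "B": {}}
contract(g, "S")
print(g)  # {'A': {'B': 5}, 'B': {}} -- the shortcut bypasses S entirely
```

Repeating this offline from the least important nodes upward leaves long-distance queries touching only high-level nodes and shortcuts.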
Result: Route query time drops from several minutes to milliseconds. A Mumbai to Delhi query explores only a few thousand highway nodes instead of 50 million total nodes.
Key Insight: Contraction Hierarchies exploit the natural importance hierarchy of road networks. Long distance routing never touches local roads. Pre-computed shortcuts transform an intractable global graph problem into a fast hierarchical lookup.
Challenge 3: Real Time Traffic Data Pipeline
Interview Question: 500 million Android phones with Google Maps open send GPS location every few seconds. Google must process this to detect traffic patterns in near real time. How do you design the pipeline that converts billions of location updates into real time traffic conditions?
Solution: Kafka streaming pipeline with Traffic Computation Service and Redis storage.
Full traffic pipeline:
- 500 million phones send GPS coordinates every few seconds
- Kafka receives billions of location events per day
- Traffic Computation Service consumes from Kafka
- Groups location updates by road segment using map matching
- Computes average speed of all phones on each road segment
- Compares against normal free flow speed for that segment
- Classifies traffic status — FREE FLOW above 80 percent of speed limit, SLOW between 40 and 80 percent, CONGESTED below 40 percent
- Updates Redis with current traffic status per segment
- Pushes traffic updates to affected users via WebSocket
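The per-segment classification step above can be sketched as follows (segment IDs and free-flow speeds are illustrative; a real service would read them from a reference table rather than a hardcoded dict):

```python
from statistics import mean

# Illustrative free-flow reference speeds per road segment (km/h)
FREE_FLOW = {"seg_42": 60.0, "seg_43": 40.0}

def classify(segment_id, observed_speeds_kmh):
    """Classify a segment from the average speed of phones currently on it,
    using the 80% / 40% thresholds described above."""
    avg = mean(observed_speeds_kmh)
    ratio = avg / FREE_FLOW[segment_id]
    if ratio > 0.8:
        status = "FREE_FLOW"
    elif ratio >= 0.4:
        status = "SLOW"
    else:
        status = "CONGESTED"
    return {"segment": segment_id, "avg_speed": round(avg, 1), "status": status}

print(classify("seg_42", [55, 58, 62]))  # FREE_FLOW
print(classify("seg_42", [12, 15, 10]))  # CONGESTED
```

The resulting record is exactly what gets written to Redis keyed by segment ID.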
Redis traffic data structure:
- Key is road segment ID
- Value contains current speed, traffic status, and last updated timestamp
- Sub-millisecond reads for route computation and map rendering
WebSocket push to users:
- Traffic status changes on a segment
- Notification Service identifies users currently navigating through that segment
- Pushes rerouting suggestion via WebSocket instantly
- User sees updated route without refreshing
Key Insight: Kafka decouples data collection from processing. Traffic Computation Service processes billions of events asynchronously without ever blocking the phones sending data. Redis provides sub-millisecond traffic status reads for real time route computation.
Challenge 4: GPS Noise Filtering for Accurate Traffic
Interview Question: Phone GPS has 5 to 15 meter accuracy. A stationary phone looks identical to a traffic jam. A phone moving slowly through a school zone looks like congestion. With billions of noisy GPS points how do you ensure traffic computation reflects actual road conditions?
Solution: Three layer noise filtering pipeline.
Layer 1 — Crowd sourced statistical averaging:
- Single phone showing slow speed on a segment — could be anything — ignore
- Minimum threshold of phones needed before declaring traffic status — typically 5 to 10 phones
- Statistical outlier trimming removes readings that deviate significantly from segment average
- Genuine congestion shows up across many phones simultaneously — impossible to fake with noise
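A sketch of the minimum-phone threshold plus outlier trimming (the 1.5-standard-deviation cutoff is an illustrative choice, not a documented Google value):

```python
from statistics import mean, stdev

MIN_PHONES = 5  # minimum-phone threshold from the text (5 to 10 phones)

def trusted_segment_speed(readings_kmh):
    """Average speed for a segment, or None when too few phones report.
    Readings more than 1.5 standard deviations from the mean are trimmed
    as outliers (parked cars, pedestrians, GPS glitches)."""
    if len(readings_kmh) < MIN_PHONES:
        return None
    mu, sigma = mean(readings_kmh), stdev(readings_kmh)
    kept = [r for r in readings_kmh if abs(r - mu) <= 1.5 * sigma]
    return mean(kept) if kept else None

print(trusted_segment_speed([30, 32, 29, 31, 90]))  # 30.5 -- the 90 km/h outlier is trimmed
print(trusted_segment_speed([5, 6]))                # None -- too few phones to trust
```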
Layer 2 — Red light detection versus genuine jam:
- All phones on segment stopped for 30 to 60 seconds then resuming normal speed — red light — ignore, not congestion
- All phones stopped for 5 or more minutes with sustained slow movement — genuine traffic jam — flag as congested
- Pattern recognition on stop duration and subsequent speed distinguishes red lights from jams reliably
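The temporal pattern check can be sketched as below (the 5 km/h "stopped" cutoff and 30 km/h "recovered" cutoff are illustrative assumptions):

```python
def stop_classification(samples):
    """samples: list of (t_seconds, speed_kmh) observations on a segment.
    Applies the durations above: a short stop followed by recovery is a
    red light; 5+ minutes of sustained crawl is a genuine jam."""
    slow_times = [t for t, v in samples if v < 5]   # near-stationary readings
    if not slow_times:
        return "FLOWING"
    if max(slow_times) - min(slow_times) >= 300:    # 5 minutes of crawl
        return "JAM"
    last_slow = max(slow_times)
    recovered = any(v > 30 for t, v in samples if t > last_slow)
    return "RED_LIGHT" if recovered else "UNKNOWN"

print(stop_classification([(0, 40), (10, 2), (50, 1), (70, 45)]))   # RED_LIGHT
print(stop_classification([(0, 3), (120, 2), (300, 4), (400, 3)]))  # JAM
```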
Layer 3 — Map matching from GPS coordinates to road segments:
- Raw GPS coordinates snapped to nearest road segment using road network graph
- Eliminates readings from parallel footpaths, buildings, parking lots
- Only road-snapped coordinates contribute to traffic computation
- The same Viterbi map matching algorithm from the Uber design applies here
Key Insight: No single GPS reading is trusted. Only crowd sourced patterns across many phones over time produce reliable traffic signals. Statistical averaging, temporal pattern recognition, and map matching together filter noise to near zero.
Challenge 5: Map Tile Rendering
Interview Question: Earth has billions of geographic features. At different zoom levels users see different detail. How do you structure map data so you serve only the small portion a user is currently viewing at their specific zoom level?
Solution: Tile pyramid with pre-rendered cached tiles.
Tile pyramid structure:
- World map divided into a grid of square tiles — each 256 by 256 pixels
- At zoom level 1 — 4 tiles cover entire Earth
- Each zoom level quadruples the number of tiles
- At zoom level 15 — roughly 1 billion tiles — individual streets visible
- At zoom level 20 — roughly 1 trillion tiles — individual buildings visible
- Each tile identified by three coordinates — zoom level, x position, y position
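The zoom/x/y coordinates follow the standard Web Mercator slippy-map scheme, where each zoom level doubles the grid in both dimensions. A minimal conversion from a lat/lng to its containing tile:

```python
import math

def tile_for(lat_deg, lon_deg, zoom):
    """Tile (z, x, y) containing a coordinate in the standard Web Mercator
    slippy-map scheme, where each zoom level quadruples the tile count."""
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat_deg))) / math.pi) / 2.0 * n)
    return zoom, x, y

print(tile_for(0.0, 0.0, 1))         # (1, 1, 1) -- just south-east of the origin
print(tile_for(19.076, 72.877, 15))  # the zoom-15 tile covering Mumbai
```

The app runs exactly this computation for the corners of the viewport to decide which 9 to 12 tiles to request.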
Serving tiles:
- User opens Google Maps centered on Mumbai at zoom level 15
- App calculates which tile coordinates cover current viewport — typically 9 to 12 tiles
- Requests only those specific tiles from CDN
- Downloads maybe 500KB of tile images instead of petabytes
- User pans map — new tiles requested only for newly visible area
- Zoom in — higher zoom level tiles requested for same area
Pre-rendering and CDN caching:
- Google pre-renders all tiles at all zoom levels offline during map processing
- Tiles stored on CDN servers globally
- Tile request hits nearest CDN server — responses in a few tens of milliseconds
- Base map tiles change infrequently — roads and buildings rarely move
- CDN cache TTL can be days or weeks for base tiles
Key Insight: The tile pyramid transforms a petabyte scale global dataset into millions of small independent cacheable images. Users only ever download the tiny fraction of tiles visible in their current viewport.
Challenge 6: Real Time Traffic Overlay Without Re-rendering Tiles
Interview Question: You have pre-rendered static tiles cached on CDN. Traffic conditions change every few minutes. Re-rendering and re-caching billions of tiles every few minutes is too slow and expensive. How do you overlay dynamic traffic data on static tiles?
Solution: Two layer rendering — static base tiles plus dynamic vector data rendered client side.
Layer 1 — Static base tiles from CDN:
- Pre-rendered map imagery showing roads, buildings, parks, labels
- Never changes — cached with long TTL on CDN
- Served instantly from nearest CDN server
- No re-rendering ever needed for base map
Layer 2 — Dynamic traffic vector data from server:
- Not images — just data — road segment IDs mapped to traffic status
- Tiny JSON payload — a few kilobytes for entire city
- Updated every few minutes from Redis traffic data
- Pushed to app via WebSocket when significant changes occur
Client side rendering:
- App fetches base tiles from CDN
- App fetches traffic vector data from server
- App renders colored traffic overlay on top of base tiles locally
- Green segments for free flow, yellow for slow, red for congested
- Traffic changes — server pushes tiny update — app re-renders only affected segments
- Base tiles never invalidated — CDN cache always fresh
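A sketch of the client-side step, with a hypothetical payload shape (the field names and segment IDs are illustrative, not an actual Google Maps wire format):

```python
# Hypothetical traffic vector payload pushed over WebSocket:
# a few kilobytes of segment IDs and statuses, never images.
payload = {
    "city": "mumbai",
    "segments": {"seg_42": "FREE_FLOW", "seg_43": "SLOW", "seg_44": "CONGESTED"},
}

COLORS = {"FREE_FLOW": "green", "SLOW": "yellow", "CONGESTED": "red"}

def overlay_colors(payload):
    """Client-side step: map each segment's status to a polyline color
    that the app draws on top of the cached base tiles."""
    return {seg: COLORS[status] for seg, status in payload["segments"].items()}

print(overlay_colors(payload))
# {'seg_42': 'green', 'seg_43': 'yellow', 'seg_44': 'red'}
```

When a delta arrives, only the segments present in the new payload need re-coloring; the base tiles underneath are untouched.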
Benefits:
- Base tile CDN cache never invalidated by traffic changes
- Traffic updates are tiny data payloads not images
- Client renders overlay locally — zero server rendering cost per user
- Smooth real time traffic visualization with minimal bandwidth
Key Insight: Separating static geographic data from dynamic traffic data allows each to be optimized independently. Static tiles cached forever on CDN. Dynamic data pushed as tiny updates. Client side rendering combines both seamlessly.
Challenge 7: Location Search — Text Plus Proximity
Interview Question: User searches "Italian restaurants near me." Results must consider both text relevance and geographic proximity simultaneously. A Starbucks 200 meters away is more relevant than a better rated one 20km away. How do you design location search?
Solution: Two phase search combining Redis GEORADIUS and Elasticsearch.
Phase 1 — Geographic filter via Redis GEO:
- User location known from GPS
- GEORADIUS places user_longitude user_latitude 2 km returns all place IDs within the radius
- Returns maybe 500 place IDs — fast O(log n) proximity query
- Pure geographic filter — no text matching yet
Phase 2 — Text search and ranking via Elasticsearch:
- Pass 500 place IDs to Elasticsearch as filter
- Elasticsearch searches name and category fields for "Italian restaurant"
- Typo tolerance — "Italain" still matches "Italian"
- Synonym matching — "eatery" matches "restaurant"
- Returns 50 matching places with text relevance scores
Final ranking combining three signals:
- Final Score = 0.4 × Text Relevance + 0.4 × Proximity Score + 0.2 × Rating Score
- Top 10 results returned to user
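The combined scoring step can be sketched as below; the linear proximity decay and the [0, 1] normalization of the inputs are simplifying assumptions:

```python
def final_score(text_relevance, distance_km, rating, radius_km=2.0):
    """Combine the three signals with the 0.4 / 0.4 / 0.2 weights above.
    Assumes text_relevance and rating are already normalized to [0, 1] and
    proximity decays linearly from 1 at the user to 0 at the search radius."""
    proximity = max(0.0, 1.0 - distance_km / radius_km)
    return 0.4 * text_relevance + 0.4 * proximity + 0.2 * rating

nearby = final_score(text_relevance=0.7, distance_km=0.2, rating=0.8)
far = final_score(text_relevance=0.9, distance_km=1.9, rating=1.0)
print(nearby > far)  # True -- the good place nearby outranks the better one far away
```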
Why this weighting:
- Text relevance and proximity equally important for location search
- Rating matters but a highly rated place far away should not outrank a good place nearby
- Weights tunable based on query type — "best restaurant in Mumbai" weights rating higher
Key Insight: Location search is a two dimensional problem — text relevance and geographic proximity. Redis GEO handles the geographic dimension efficiently. Elasticsearch handles the text dimension. A scoring function combines both into a single ranked result list.
Challenge 8: Keeping Data Stores in Sync
Interview Question: Place data lives in three stores — PostgreSQL as source of truth, Redis GEO for proximity queries, and Elasticsearch for text search. Millions of place updates happen daily — new businesses opening, closing, changing addresses. How do you keep all three in sync?
Solution: Kafka event driven async synchronization.
Update flow:
- New restaurant opens — Place Service writes to PostgreSQL — source of truth committed
- Place Service publishes event to Kafka instantly
- Three independent consumers read from Kafka in parallel
Consumer 1 — Redis GEO updater:
GEOADD places longitude latitude placeID — place immediately available for proximity queries
Consumer 2 — Elasticsearch indexer:
Index new place document with name, category, rating, and metadata — immediately searchable
Consumer 3 — CDN cache invalidator:
Identify map tiles containing this location — invalidate cached tiles — trigger re-render of affected tiles — new business appears on map
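The three consumers can be sketched with in-memory stand-ins for the stores (a real deployment would use redis-py, the Elasticsearch client, and a CDN purge API, each running in its own Kafka consumer group; they are collapsed into one function here for brevity, and the event shape is hypothetical):

```python
import json

# In-memory stand-ins for the three downstream stores
redis_geo = {}      # place_id -> (lng, lat), like GEOADD places lng lat id
es_docs = {}        # place_id -> searchable document
purged_tiles = []   # tile keys invalidated on the CDN

def on_place_event(raw_event):
    evt = json.loads(raw_event)
    # Consumer 1: Redis GEO updater
    redis_geo[evt["place_id"]] = (evt["lng"], evt["lat"])
    # Consumer 2: Elasticsearch indexer
    es_docs[evt["place_id"]] = {
        "name": evt["name"], "category": evt["category"], "rating": evt["rating"],
    }
    # Consumer 3: CDN invalidator -- purge the tile(s) covering this spot
    purged_tiles.append(f"tile@({evt['lat']:.3f},{evt['lng']:.3f})")  # placeholder key

on_place_event('{"place_id": "pl_123", "name": "Trattoria Roma", '
               '"category": "italian_restaurant", "lat": 19.076, "lng": 72.877, "rating": 4.5}')
print(redis_geo["pl_123"], "pl_123" in es_docs, purged_tiles)
```

Because each consumer group tracks its own Kafka offset, one store falling behind never blocks the other two.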
Failure handling:
- PostgreSQL write succeeds — Kafka event published
- Redis consumer fails — Kafka retains event — consumer retries automatically on restart
- Elasticsearch consumer fails — same retry pattern
- All three stores eventually consistent with PostgreSQL source of truth
- No data loss — Kafka durably retains events, and each consumer tracks its own offset and resumes where it left off
Key Insight: Kafka decouples the Place Service from all downstream data stores. A single write to PostgreSQL propagates asynchronously to Redis, Elasticsearch, and CDN without the Place Service knowing or caring about any of them. Failures in any consumer self-heal via Kafka retry.
Full Architecture Summary
Road network — Weighted directed graph with 50 million nodes and 100 million edges
Fast routing — Contraction Hierarchies with pre-computed shortcuts per road level
Traffic pipeline — Kafka streaming to Traffic Computation Service to Redis
GPS noise filtering — Crowd sourced averaging plus red light detection plus map matching
Map rendering — Pre-rendered tile pyramid cached on global CDN
Traffic overlay — Static base tiles plus dynamic vector data rendered client side
Location search — Two phase Redis GEORADIUS plus Elasticsearch with combined scoring
Data sync — Kafka event driven updates to Redis, Elasticsearch, and CDN in parallel
Final Thoughts
Google Maps is a system where every layer has a fundamentally different performance characteristic. Routing needs millisecond graph traversal. Traffic needs near real time stream processing. Map rendering needs globally distributed static caching. Search needs both geographic and text indexing simultaneously. Data sync needs eventual consistency across multiple specialized stores.
The recurring theme is that no single data store or algorithm solves all problems. The right architecture uses each tool for what it does best — Redis for proximity and traffic state, Elasticsearch for text search, CDN for static tiles, Kafka for event propagation, and pre-computed hierarchical graphs for fast routing. The skill is knowing which tool fits which problem.
Happy building. 🚀