Year: 2015–2020 · Crisis: Event traffic too massive for any monolith
The Problem: A "Trip Started" Signal Reaching 47 Services
When a ride starts on Uber, it's not just the "Driver" and "Rider" services that are affected. Here's what actually happens:
- Pricing service: Locks the fare
- Billing service: Prepares to generate an invoice
- Maps service: Starts route optimization
- ETA service: Calculates estimated arrival
- Payment service: Pre-authorizes the card
- Insurance service: Activates the policy
- Customer support service: Makes ticket creation available
- Analytics service: Starts data streaming
- Fraud service: Monitors for suspicious activity
- ...and 37 more services
If each one called the next via synchronous HTTP, a single tap on "start trip" would trigger a chain of nearly 50 HTTP calls. If one service slows down, the whole request slows down. If one service crashes, the whole request fails.
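A rough back-of-the-envelope calculation shows how quickly a synchronous chain degrades. The 99.9% per-service availability and the 20 ms per-call latency are assumed figures chosen for illustration, not numbers published by Uber, and the sketch assumes failures are independent.

```python
# Back-of-the-envelope: a request that must pass through N services in sequence.
# ASSUMPTIONS (illustrative, not Uber's numbers): each service is up 99.9% of
# the time, failures are independent, and each call takes 20 ms.

SERVICES = 47
PER_SERVICE_AVAILABILITY = 0.999
PER_CALL_LATENCY_MS = 20


def chain_availability(n: int, availability: float) -> float:
    """Probability that all n services in a synchronous chain respond."""
    return availability ** n


def chain_latency_ms(n: int, latency_ms: int) -> int:
    """Total latency when n calls run one after another."""
    return n * latency_ms


availability = chain_availability(SERVICES, PER_SERVICE_AVAILABILITY)
latency = chain_latency_ms(SERVICES, PER_CALL_LATENCY_MS)

print(f"Chain availability: {availability:.1%}")  # Chain availability: 95.4%
print(f"Chain latency: {latency} ms")             # Chain latency: 940 ms
```

Under these assumptions, 47 individually reliable services add up to a chain that fails roughly one request in twenty and takes close to a second to answer.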
Architectural Decision: Event-Driven Architecture with Kafka
Uber built a massive event-driven architecture on top of Apache Kafka. Every state change is published as an event. Interested services subscribe to it.
┌─────────────────────────────────────────────────────────┐
│ DRIVER APP: "I started the trip" (trip_id: 12345)       │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│ KAFKA BROKER (topic: "trip-events")                     │
│ {                                                       │
│   "event_type": "TRIP_STARTED",                         │
│   "trip_id": "12345",                                   │
│   "driver_id": "drv_789",                               │
│   "rider_id": "rdr_456",                                │
│   "timestamp": 1705244400000,                           │
│   "pickup_location": { "lat": 40.7128, "lng": -74.0060 }│
│ }                                                       │
└───────────────────────────┬─────────────────────────────┘
                            │
    ┌───────────┬───────────┼───────────┬───────────┐
    ▼           ▼           ▼           ▼           ▼
[Billing]   [Pricing]  [Analytics] [Insurance]   [Fraud]
 (async)     (async)     (async)     (async)     (async)
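The fan-out in the diagram can be sketched with a tiny in-memory event bus. This is a stand-in for illustration only: `EventBus`, `subscribe`, and `publish` are hypothetical names, not Kafka APIs, and delivery here is a plain loop, whereas real Kafka consumers pull from the topic at their own pace.

```python
# In-memory stand-in for a Kafka topic, to illustrate fan-out only.
# EventBus, subscribe, and publish are hypothetical names, not Kafka APIs.
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, object]
Handler = Callable[[Event], None]


class EventBus:
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> List[str]:
        """Deliver the event to every subscriber; return the names of failed handlers."""
        failed = []
        for handler in self._subscribers[topic]:
            try:
                handler(event)
            except Exception:
                # One failing consumer must not block the others.
                failed.append(handler.__name__)
        return failed


handled: List[str] = []


def billing(event: Event) -> None:
    handled.append(f"billing:{event['trip_id']}")


def pricing(event: Event) -> None:
    raise RuntimeError("pricing is down")


def fraud(event: Event) -> None:
    handled.append(f"fraud:{event['trip_id']}")


bus = EventBus()
for consumer in (billing, pricing, fraud):
    bus.subscribe("trip-events", consumer)

failed = bus.publish("trip-events", {"event_type": "TRIP_STARTED", "trip_id": "12345"})
print(handled)  # ['billing:12345', 'fraud:12345']
print(failed)   # ['pricing']
```

The producer never names its consumers, and the simulated Pricing outage does not stop Billing or Fraud from processing the event.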
Schema Registry: The Key to Preventing Chaos
With thousands of event types and thousands of producers and consumers on Kafka, a single mismatch is enough to cause damage: if Billing expects amount as a number while Pricing sends it as a string, Billing silently breaks.
Uber's solution: a schema registry, the pattern best known from Confluent's Schema Registry.
// Avro schema for trip-events
{
  "type": "record",
  "name": "TripStartedEvent",
  "namespace": "com.uber.events",
  "fields": [
    { "name": "event_id", "type": "string" },
    { "name": "trip_id", "type": "string" },
    { "name": "driver_id", "type": "string" },
    { "name": "rider_id", "type": "string" },
    { "name": "timestamp", "type": "long" },
    { "name": "pickup_location", "type": {
      "type": "record",
      "name": "GeoPoint",
      "fields": [
        { "name": "lat", "type": "double" },
        { "name": "lng", "type": "double" }
      ]
    }}
  ]
}
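To make the "number vs. string" failure concrete, here is a minimal sketch of checking an event against a schema of this shape. It covers only a small subset of Avro (three primitive types and nested records), and `validate` is a hypothetical helper written for this example, not a function from an Avro library.

```python
# Simplified validator for a small subset of Avro: three primitive types and
# nested records. validate() is a hypothetical helper, not an Avro library call.
from typing import Any, Dict, List

PRIMITIVES = {"string": str, "long": int, "double": float}

SCHEMA = {
    "type": "record",
    "name": "TripStartedEvent",
    "fields": [
        {"name": "trip_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "pickup_location", "type": {
            "type": "record",
            "name": "GeoPoint",
            "fields": [
                {"name": "lat", "type": "double"},
                {"name": "lng", "type": "double"},
            ],
        }},
    ],
}


def validate(schema: Dict[str, Any], value: Dict[str, Any], path: str = "") -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        where = f"{path}{name}"
        if name not in value:
            errors.append(f"{where}: missing")
        elif isinstance(ftype, dict):
            # Nested record: recurse with a dotted path prefix.
            if isinstance(value[name], dict):
                errors.extend(validate(ftype, value[name], path=f"{where}."))
            else:
                errors.append(f"{where}: expected record")
        elif ftype not in PRIMITIVES:
            errors.append(f"{where}: unsupported type {ftype}")
        elif not isinstance(value[name], PRIMITIVES[ftype]):
            errors.append(f"{where}: expected {ftype}")
    return errors


good = {"trip_id": "12345", "timestamp": 1705244400000,
        "pickup_location": {"lat": 40.7128, "lng": -74.0060}}
bad = {"trip_id": 12345, "timestamp": 1705244400000,
       "pickup_location": {"lat": "40.7128", "lng": -74.0060}}

print(validate(SCHEMA, good))  # []
print(validate(SCHEMA, bad))   # ['trip_id: expected string', 'pickup_location.lat: expected double']
```

With a check like this at the boundary, a mistyped field is rejected loudly at publish or consume time instead of corrupting a downstream service.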
Rules:
- Every event type has a versioned schema
- Producers register their schema with the Schema Registry before publishing events
- Consumers deserialize events according to the schema
- Non-backward-compatible changes are REJECTED
This way, a schema change in one service can't silently break the services that consume its events.
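The "non-backward-compatible changes are rejected" rule can be sketched as a check a registry might run before accepting a new schema version. This is a deliberately simplified assumption: it treats any type change as incompatible, whereas Avro's real schema resolution allows some promotions (for example `int` to `long`). `backward_compat_errors` is a hypothetical name.

```python
# Simplified backward-compatibility check: can a consumer using the NEW schema
# still read events written with the OLD one? Hypothetical helper; real Avro
# resolution rules are richer (e.g. numeric type promotions are allowed).
from typing import Any, Dict, List


def backward_compat_errors(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """Return the reasons `new` cannot read data written with `old`."""
    old_fields = {f["name"]: f for f in old["fields"]}
    errors = []
    for field in new["fields"]:
        name = field["name"]
        if name not in old_fields:
            # Old events lack this field, so the reader needs a default value.
            if "default" not in field:
                errors.append(f"{name}: new field has no default")
        elif field["type"] != old_fields[name]["type"]:
            old_type = old_fields[name]["type"]
            errors.append(f"{name}: type changed from {old_type} to {field['type']}")
    return errors


v1 = {"fields": [
    {"name": "trip_id", "type": "string"},
    {"name": "timestamp", "type": "long"},
]}
v2_ok = {"fields": v1["fields"] + [
    {"name": "city", "type": "string", "default": "unknown"},
]}
v2_bad = {"fields": [
    {"name": "trip_id", "type": "long"},
    {"name": "timestamp", "type": "long"},
    {"name": "surge", "type": "double"},
]}

print(backward_compat_errors(v1, v2_ok))   # []
print(backward_compat_errors(v1, v2_bad))
# ['trip_id: type changed from string to long', 'surge: new field has no default']
```

A registry that runs a check like this on every registration accepts `v2_ok` and refuses `v2_bad` before any consumer ever sees it.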
Uber's "Domain Gateway" Architecture
Uber grouped its microservices around domains:
- Rider Domain: All rider-related services
- Driver Domain: All driver-related services
- Trip Domain: All trip-related services
- Payment Domain: All payment-related services
Each domain has a Domain Gateway: the single entry point into that domain for everything outside it.
┌──────────────────────────────────────────┐
│ TRIP DOMAIN GATEWAY                      │
│ (trip.uber.com, single entry point)      │
└──────┬───────────────────────────────────┘
       │
       ├── /trip-service      (trip CRUD)
       ├── /eta-service       (estimated arrival)
       ├── /route-service     (route optimization)
       └── /dispatch-service  (driver matching)
Advantages:
- External services don't know about internal domain details
- Services within a domain can be freely refactored
- Rate limiting, auth, and caching are managed centrally at the gateway
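A minimal sketch of the gateway idea, assuming a simple token check and a per-token request counter: callers only see path prefixes, while auth and rate limiting live in one place. `DomainGateway`, the handler functions, the token, and the limit are all hypothetical, and a production rate limiter would use a time window rather than a lifetime counter.

```python
# Sketch of a domain gateway: central auth, rate limiting, and prefix routing.
# All names, the token, and the limit are hypothetical illustration values.
from typing import Callable, Dict, Tuple


def trip_service(path: str) -> str:
    return f"trip-service: {path}"


def eta_service(path: str) -> str:
    return f"eta-service: {path}"


ROUTES: Dict[str, Callable[[str], str]] = {
    "/trip-service": trip_service,
    "/eta-service": eta_service,
}
VALID_TOKENS = {"secret-token"}


class DomainGateway:
    def __init__(self, routes: Dict[str, Callable[[str], str]], rate_limit: int) -> None:
        self._routes = routes
        self._rate_limit = rate_limit
        self._calls: Dict[str, int] = {}  # token -> requests seen so far

    def handle(self, token: str, path: str) -> Tuple[int, str]:
        # Auth is enforced once, here, instead of in every internal service.
        if token not in VALID_TOKENS:
            return 401, "unauthorized"
        used = self._calls.get(token, 0)
        if used >= self._rate_limit:
            return 429, "rate limit exceeded"
        self._calls[token] = used + 1
        # Route by prefix; internal services can move without callers noticing.
        for prefix, service in self._routes.items():
            if path == prefix or path.startswith(prefix + "/"):
                return 200, service(path[len(prefix):] or "/")
        return 404, "unknown route"


gateway = DomainGateway(ROUTES, rate_limit=2)
print(gateway.handle("bad-token", "/trip-service/12345"))    # (401, 'unauthorized')
print(gateway.handle("secret-token", "/eta-service/12345"))  # (200, 'eta-service: /12345')
print(gateway.handle("secret-token", "/billing"))            # (404, 'unknown route')
print(gateway.handle("secret-token", "/trip-service/12345")) # (429, 'rate limit exceeded')
```

Replacing `eta_service` with a different implementation only requires changing the `ROUTES` table; nothing outside the domain has to know.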
Schemaless DB: Uber's Database Revolution
Uber initially used PostgreSQL. But as they grew, vertical scaling wasn't enough. PostgreSQL's sharding capabilities were insufficient for horizontal scaling.
Uber developed Schemaless, their own storage layer. It was built on MySQL but used MySQL as a "key-value store":
-- Schemaless's simple but powerful schema
CREATE TABLE entity (
    uuid        BINARY(16) PRIMARY KEY,
    body        MEDIUMBLOB,   -- All data here, as JSON
    entity_type VARCHAR(64),
    created_at  TIMESTAMP,
    updated_at  TIMESTAMP,
    KEY (entity_type, created_at)
);
Why?
- MySQL has strong transaction and replication capabilities
- But schema changes (ALTER TABLE) are extremely slow on large tables
- Schemaless moved the schema to the application layer, using MySQL purely as a storage engine
This architecture allowed Uber to store trillions of entities.
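To illustrate what "schema in the application layer" means in practice, here is a small sketch that stores JSON blobs in a relational table. It uses Python's built-in `sqlite3` instead of MySQL so that it runs anywhere, and the `put`/`get` helpers and the `surge_multiplier` field are hypothetical; it is not Uber's Schemaless API.

```python
# Sketch: a relational table used as a key-value store for JSON blobs.
# Uses sqlite3 as a stand-in for MySQL; put()/get() are hypothetical helpers.
import json
import sqlite3
import uuid
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entity (
        uuid        BLOB PRIMARY KEY,
        body        BLOB NOT NULL,
        entity_type TEXT NOT NULL,
        created_at  TEXT NOT NULL
    )
""")


def put(entity_type: str, data: dict) -> bytes:
    """Serialize the entity to JSON and store it under a fresh 16-byte UUID."""
    key = uuid.uuid4().bytes
    conn.execute(
        "INSERT INTO entity (uuid, body, entity_type, created_at) VALUES (?, ?, ?, ?)",
        (key, json.dumps(data).encode("utf-8"), entity_type,
         datetime.now(timezone.utc).isoformat()),
    )
    return key


def get(key: bytes) -> dict:
    """Load an entity by key; the database never inspects the body."""
    row = conn.execute("SELECT body FROM entity WHERE uuid = ?", (key,)).fetchone()
    return json.loads(row[0])


# The second entity carries a new field, yet no ALTER TABLE is needed.
old_trip = put("trip", {"trip_id": "12345", "status": "STARTED"})
new_trip = put("trip", {"trip_id": "12346", "status": "STARTED", "surge_multiplier": 1.5})

print(get(old_trip).get("surge_multiplier"))  # None
print(get(new_trip)["surge_multiplier"])      # 1.5
```

The trade-off is visible here as well: the application, not the database, has to cope with older entities that lack the new field.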
Trade-offs
✅ Gains:
- Loose coupling: One service crashing doesn't affect the rest
- Scalability: Each service scales independently
- Development speed: Teams can ship without waiting for each other
❌ Costs:
- Eventual consistency: "How many active trips are in the system right now?" doesn't always have a clear answer
- Debugging difficulty: Finding why an event wasn't processed means digging through Kafka, Schema Registry, and consumer logs
- Data duplication: Each service keeps its own copy of the data it needs, so those copies have to be kept in sync
🛠️ Takeaways
- Adding Kafka to a small project is like using a sledgehammer to drive a nail, but at scale it becomes essential.
- Without schema management, chaos is inevitable in event-driven systems; a Schema Registry or a similar tool is a must.
- Uber's Domain Gateway approach is a textbook application of Conway's Law (system architecture mirrors org structure).
- Uber's Schemaless is living proof that "one database can't do everything."
Next up — Chapter 3: Amazon's API Mandate and the memo that changed everything. 📦