Scale Wars #2 — Uber: How They Processed 100 Billion Events Per Day

#systemdesign #architecture #backend #programming

Year: 2015–2020 · Crisis: Event traffic too massive for any monolith

The Problem: A "Trip Started" Signal Reaching 47 Services

When a ride starts on Uber, it's not just the "Driver" and "Rider" services that are affected. Here's what actually happens:

Pricing service: Locks the fare
Billing service: Prepares to generate an invoice
Maps service: Starts route optimization
ETA service: Calculates estimated arrival
Payment service: Pre-authorizes the card
Insurance service: Activates the policy
Customer support service: Makes ticket creation available
Analytics service: Starts data streaming
Fraud service: Monitors for suspicious activity
...and 38 more services

If these services were wired together with synchronous HTTP calls, a single tap on "start trip" would trigger 47 HTTP calls. If one service slows down, the whole request slows down. If one crashes, the whole request fails.
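To put rough numbers on that fragility, here is a back-of-the-envelope sketch. The 99.9% per-service availability and the 20 ms per-hop latency are assumptions chosen for illustration, not figures published by Uber.

```python
# Assumed numbers for illustration only; these are not Uber's measured figures.
SERVICES = 47
PER_SERVICE_AVAILABILITY = 0.999  # each service is up 99.9% of the time
PER_CALL_LATENCY_MS = 20          # hypothetical latency of one HTTP hop


def chain_availability(services: int, availability: float) -> float:
    """A synchronous chain succeeds only if every single hop succeeds."""
    return availability ** services


def chain_latency_ms(services: int, latency_ms: float) -> float:
    """Sequential calls add their latencies together."""
    return services * latency_ms


if __name__ == "__main__":
    avail = chain_availability(SERVICES, PER_SERVICE_AVAILABILITY)
    print(f"end-to-end availability: {avail:.3f}")
    print(f"end-to-end latency: {chain_latency_ms(SERVICES, PER_CALL_LATENCY_MS)} ms")
```

Under these assumptions, 47 individually reliable services combine into a chain that succeeds only about 95% of the time and takes close to a second, which is why the fan-out has to leave the request path.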

Architectural Decision: Event-Driven Architecture with Kafka

Uber built a massive event-driven architecture on top of Apache Kafka. Every state change is published as an event. Interested services subscribe to it.

┌─────────────────────────────────────────────────────────┐
│  DRIVER APP: "I started the trip" (trip_id: 12345)      │
└─────────────┬───────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────┐
│  KAFKA BROKER — topic: "trip-events"                    │
│  {                                                       │
│    "event_type": "TRIP_STARTED",                        │
│    "trip_id": "12345",                                  │
│    "driver_id": "drv_789",                              │
│    "rider_id": "rdr_456",                               │
│    "timestamp": 1705244400000,                          │
│    "pickup_location": { "lat": 40.7128, "lng": -74.0060 }│
│  }                                                       │
└─────────────┬───────────────────────────────────────────┘
              │
   ┌──────────┼──────────┬──────────┬──────────┐
   ▼          ▼          ▼          ▼          ▼
[Billing]  [Pricing]  [Analytics]  [Insurance]  [Fraud]
 (async)    (async)     (async)      (async)     (async)
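The fan-out in the diagram can be sketched with a tiny in-memory bus. This is a toy stand-in, not Kafka and not Uber's code: the `InMemoryBus` class and the three consumer functions are hypothetical names, and a real broker would let each consumer pull from a durable log at its own pace instead of being called in a loop. The point it shows is that the producer never references its consumers, and one failing consumer does not stop the others.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, object]
Handler = Callable[[Event], None]


class InMemoryBus:
    """Toy stand-in for a broker: each topic maps to a list of subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self.failures: List[str] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # The producer does not know who is listening; each handler is isolated.
        for handler in self._subscribers[topic]:
            try:
                handler(event)
            except Exception as exc:
                self.failures.append(f"{handler.__name__}: {exc}")


handled: List[str] = []


def billing(event: Event) -> None:
    handled.append(f"billing:{event['trip_id']}")


def fraud(event: Event) -> None:
    raise RuntimeError("fraud model unavailable")  # simulated outage


def analytics(event: Event) -> None:
    handled.append(f"analytics:{event['trip_id']}")


bus = InMemoryBus()
for consumer in (billing, fraud, analytics):
    bus.subscribe("trip-events", consumer)

bus.publish("trip-events", {"event_type": "TRIP_STARTED", "trip_id": "12345"})
print(handled)       # billing and analytics both processed the event
print(bus.failures)  # the fraud failure is recorded, not propagated
```

Adding a 48th consumer is one more `subscribe` call; the driver app's publishing code does not change.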

Schema Registry: The Key to Preventing Chaos

Kafka carries thousands of event types between thousands of producers and consumers. If Billing expects amount as a number while Pricing sends it as a string, Billing breaks silently.

Uber's solution: a schema registry (Confluent's Schema Registry is the best-known implementation of the idea).

// Avro Schema — for trip-events
{
  "type": "record",
  "name": "TripStartedEvent",
  "namespace": "com.uber.events",
  "fields": [
    { "name": "event_id", "type": "string" },
    { "name": "trip_id", "type": "string" },
    { "name": "driver_id", "type": "string" },
    { "name": "rider_id", "type": "string" },
    { "name": "timestamp", "type": "long" },
    { "name": "pickup_location", "type": {
        "type": "record",
        "name": "GeoPoint",
        "fields": [
          { "name": "lat", "type": "double" },
          { "name": "lng", "type": "double" }
        ]
    }}
  ]
}

Rules:

Every event type has a versioned schema
Producers register the event's schema with the Schema Registry before publishing events
Consumers deserialize events according to the schema
Non-backward-compatible changes are REJECTED

This way, when one service changes its schema, it doesn't break the other 46 services.
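The "reject non-backward-compatible changes" rule can be illustrated with a deliberately simplified checker. "Backward compatible" here means a consumer using the new schema can still read events written with the old one. The function name and the `surge_multiplier` field are hypothetical, and the logic covers only two cases (new fields need defaults, existing fields keep their type); real Avro schema resolution is richer and, for example, allows some type promotions.

```python
from typing import Any, Dict, List

Schema = Dict[str, Any]  # Avro-style record: {"fields": [{"name", "type", "default"?}]}


def backward_incompatibilities(old: Schema, new: Schema) -> List[str]:
    """List reasons a reader on `new` could not decode data written with `old`."""
    old_fields = {f["name"]: f for f in old["fields"]}
    problems: List[str] = []
    for field in new["fields"]:
        previous = old_fields.get(field["name"])
        if previous is None:
            # Old writers never sent this field, so the reader needs a default.
            if "default" not in field:
                problems.append(f"new field '{field['name']}' has no default")
        elif previous["type"] != field["type"]:
            # Simplification: treat every type change as breaking.
            problems.append(
                f"field '{field['name']}' changed type "
                f"{previous['type']} -> {field['type']}"
            )
    return problems


V1 = {"fields": [{"name": "trip_id", "type": "string"},
                 {"name": "timestamp", "type": "long"}]}

# Adds an optional field with a default: old events remain readable.
V2_OK = {"fields": V1["fields"] + [
    {"name": "surge_multiplier", "type": "double", "default": 1.0}]}

# Changes the type of an existing field: old events can no longer be decoded.
V2_BAD = {"fields": [{"name": "trip_id", "type": "string"},
                     {"name": "timestamp", "type": "string"}]}

for candidate in (V2_OK, V2_BAD):
    issues = backward_incompatibilities(V1, candidate)
    print("ACCEPTED" if not issues else f"REJECTED: {issues}")
```

A registry would run a check of this kind at registration time, so the breaking change is stopped before any producer can publish with it.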

Uber's "Domain Gateway" Architecture

Uber grouped its microservices around domains:

Rider Domain: All rider-related services
Driver Domain: All driver-related services
Trip Domain: All trip-related services
Payment Domain: All payment-related services

Each domain has a Domain Gateway — the single entry point to the outside world.

┌──────────────────────────────────────────┐
│           TRIP DOMAIN GATEWAY            │
│  (trip.uber.com — single entry point)    │
└──────┬───────────────────────────────────┘
       │
       ├── /trip-service       (trip CRUD)
       ├── /eta-service        (estimated arrival)
       ├── /route-service      (route optimization)
       └── /dispatch-service   (driver matching)

Advantages:

External services don't know about internal domain details
Services within a domain can be freely refactored
Rate limiting, auth, and caching are managed centrally at the gateway

Schemaless DB: Uber's Database Revolution

Uber initially used PostgreSQL. As the company grew, vertical scaling stopped being enough, and PostgreSQL offered no built-in sharding for scaling horizontally.

Uber developed Schemaless, their own storage layer. It runs on top of MySQL but treats MySQL as little more than a key-value store:

-- Simplified, illustrative table in the spirit of Schemaless (not Uber's exact DDL)
CREATE TABLE entity (
    uuid        BINARY(16) PRIMARY KEY,
    body        MEDIUMBLOB,        -- All data here, as JSON
    entity_type VARCHAR(64),
    created_at  TIMESTAMP,
    updated_at  TIMESTAMP,
    KEY (entity_type, created_at)
);

Why?

MySQL has strong transaction and replication capabilities
But schema changes (ALTER TABLE) are extremely slow on large tables
Schemaless moved the schema to the application layer, using MySQL purely as a storage engine

This architecture allowed Uber to store trillions of entities.
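Here is a minimal sketch of "schema in the application layer". SQLite stands in for MySQL so the example runs anywhere, the column set is trimmed down, and the `put`, `get`, and `tip_cents` helpers are hypothetical names, not part of Schemaless. It shows a new field appearing in stored entities without any ALTER TABLE.

```python
import json
import sqlite3
import uuid

# SQLite stands in for MySQL here so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity ("
    " uuid BLOB PRIMARY KEY,"
    " body BLOB,"
    " entity_type TEXT,"
    " created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)


def put(entity_type: str, body: dict) -> bytes:
    """Store an entity as an opaque JSON blob; the database never inspects it."""
    key = uuid.uuid4().bytes
    conn.execute(
        "INSERT INTO entity (uuid, body, entity_type) VALUES (?, ?, ?)",
        (key, json.dumps(body).encode("utf-8"), entity_type),
    )
    return key


def get(key: bytes) -> dict:
    row = conn.execute("SELECT body FROM entity WHERE uuid = ?", (key,)).fetchone()
    return json.loads(row[0])


def tip_cents(trip: dict) -> int:
    """The application, not the table, decides how to treat a missing field."""
    return trip.get("tip_cents", 0)


# An entity written before the hypothetical `tip_cents` field existed...
old_key = put("trip", {"trip_id": "12345", "status": "STARTED"})
# ...and one written after. No ALTER TABLE was needed in between.
new_key = put("trip", {"trip_id": "12346", "status": "STARTED", "tip_cents": 250})

print(tip_cents(get(old_key)), tip_cents(get(new_key)))
```

The cost of this design is visible too: the database cannot validate or index fields inside `body`, so that responsibility moves into application code.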

Trade-offs

✅ Gains:

Loose coupling: One service crashing doesn't affect the rest
Scalability: Each service scales independently
Development speed: Teams can ship without waiting for each other

❌ Costs:

Eventual consistency: "How many active trips are in the system right now?" doesn't always have a clear answer
Debugging difficulty: Finding why an event wasn't processed means digging through Kafka, Schema Registry, and consumer logs
Data duplication: Each service maintains its own data → duplication and synchronization challenges

🛠️ Takeaways

Adding Kafka to a small project is like using a sledgehammer to drive a nail, but at scale it becomes essential.
Without schema management in event-driven systems, chaos is inevitable; Schema Registry or similar tools are a must.
Uber's Domain Gateway approach is a textbook application of Conway's Law (system architecture mirrors org structure).
Uber's Schemaless is living proof that "one database can't do everything."