<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ishaan Gaba</title>
    <description>The latest articles on DEV Community by Ishaan Gaba (@ishaanthedev).</description>
    <link>https://dev.to/ishaanthedev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3790044%2F7320718c-3d73-4a9d-9908-db8ec9e451cc.png</url>
      <title>DEV Community: Ishaan Gaba</title>
      <link>https://dev.to/ishaanthedev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ishaanthedev"/>
    <language>en</language>
    <item>
      <title>Designing Uber: A Real-Time Ride Matching System at Scale</title>
      <dc:creator>Ishaan Gaba</dc:creator>
      <pubDate>Sun, 22 Mar 2026 17:26:38 +0000</pubDate>
      <link>https://dev.to/ishaanthedev/designing-uber-a-real-time-ride-matching-system-at-scale-pc9</link>
      <guid>https://dev.to/ishaanthedev/designing-uber-a-real-time-ride-matching-system-at-scale-pc9</guid>
      <description>&lt;p&gt;A user opens Uber. Taps a destination. Hits confirm.&lt;/p&gt;

&lt;p&gt;Within 10 seconds, a driver is assigned, a route is calculated, and an ETA appears on screen.&lt;/p&gt;

&lt;p&gt;That moment feels instant. It is not. Behind it sits one of the most demanding real-time distributed systems ever built — coordinating millions of moving devices, unpredictable networks, and a matching problem that must resolve in seconds or users abandon the app entirely.&lt;/p&gt;

&lt;p&gt;This is not a tutorial on how Uber works. This is an engineering breakdown of &lt;em&gt;why&lt;/em&gt; it's hard, and how you'd design it if you were the one responsible for keeping it running.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Uber Is a Hard System
&lt;/h2&gt;

&lt;p&gt;Most systems are hard because of scale. Uber is hard because of &lt;strong&gt;scale plus real-time constraints plus physical world unreliability&lt;/strong&gt; — all at once.&lt;/p&gt;

&lt;p&gt;Consider what has to be true simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Millions of drivers are broadcasting their GPS coordinates every 4–5 seconds&lt;/li&gt;
&lt;li&gt;Riders expect a match response in under 10 seconds&lt;/li&gt;
&lt;li&gt;Driver locations are stale the moment they're recorded&lt;/li&gt;
&lt;li&gt;Networks drop. Phones die. Drivers cancel mid-assignment.&lt;/li&gt;
&lt;li&gt;The matching decision has to be &lt;strong&gt;globally fair&lt;/strong&gt; — not just fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A ride-matching system is not a database lookup. It's a real-time geospatial optimization problem running at internet scale, with human behavior injected at every layer.&lt;/p&gt;

&lt;p&gt;Miss the latency target and riders churn. Miss the consistency target and two riders get the same driver.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Flow, Before the Diagrams
&lt;/h2&gt;

&lt;p&gt;Before any architecture discussion, walk through what actually happens when a ride is requested.&lt;/p&gt;

&lt;p&gt;A rider opens the app. The client is already maintaining a WebSocket connection to Uber's backend — location updates, surge data, and nearby driver markers are streaming in continuously. The rider is not idle; they're already consuming data before they book anything.&lt;/p&gt;

&lt;p&gt;The rider submits a trip request. This hits the API Gateway, gets authenticated, and routes to the &lt;strong&gt;Dispatch Service&lt;/strong&gt; — the core orchestrator of the entire ride lifecycle.&lt;/p&gt;

&lt;p&gt;The Dispatch Service doesn't look up drivers in a SQL table. It queries a &lt;strong&gt;geospatial index&lt;/strong&gt; — a live, in-memory store of driver locations organized by geography. It pulls a candidate set of nearby available drivers, ranks them by proximity, ETA, and acceptance rate, and sends a trip offer to the top candidate.&lt;/p&gt;

&lt;p&gt;The driver receives a push notification and has roughly 15 seconds to accept. If they accept, the match is confirmed. If they don't — timeout, decline, or network failure — the system moves to the next candidate.&lt;/p&gt;

&lt;p&gt;The whole thing has to complete in under 10 seconds from the rider's perspective. Every second beyond that degrades conversion.&lt;/p&gt;




&lt;h2&gt;
  
  
  High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3ku9z0ghqgfjnef1c5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3ku9z0ghqgfjnef1c5n.png" alt=" " width="800" height="1105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The API Gateway handles auth, rate limiting, and routing. The Location Service is a high-throughput write path — it needs to ingest tens of millions of location pings per minute and keep the geospatial index fresh. The Dispatch Service owns the match lifecycle. The Notification Service handles the real-time push to drivers.&lt;/p&gt;

&lt;p&gt;These are not monolithic. Each is a horizontally scalable service with its own failure boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-Time Location Tracking
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Every active driver is a moving data point. The system needs to know where they are, right now, not 30 seconds ago. At 5 million active drivers globally, each pinging every 4–5 seconds, that's roughly 1 million location updates per second hitting your infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's hard:&lt;/strong&gt; You can't write every GPS ping to a relational database. The write throughput would bury it. But you also can't lose location data — stale coordinates mean wrong ETAs and bad matches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution approach:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Driver apps send a location update every 4–5 seconds over a persistent connection. These updates hit the Location Service, which does two things in parallel:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Updates the &lt;strong&gt;in-memory geospatial index&lt;/strong&gt; for live querying&lt;/li&gt;
&lt;li&gt;Publishes to a &lt;strong&gt;Kafka topic&lt;/strong&gt; for downstream consumers — ETA computation, analytics, surge pricing
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified location update handler
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_location_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Update in-memory geo index (sub-millisecond)
&lt;/span&gt;    &lt;span class="n"&gt;geo_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Publish to Kafka async — don't block the write path
&lt;/span&gt;    &lt;span class="n"&gt;kafka_producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;driver-locations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;driver_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;driver_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lng&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the geospatial index, Uber has used &lt;strong&gt;S2 geometry&lt;/strong&gt; — a Google library that maps the Earth's surface to a hierarchical grid of cells — and later built H3, its own hexagonal grid system, for the same job. Each driver is indexed by their current cell. A proximity query becomes a cell lookup, which is O(1) in the average case.&lt;/p&gt;
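&lt;p&gt;The core idea is to map coordinates to a cell and bucket drivers by cell, so both updates and proximity lookups become hash operations. A toy sketch with a fixed-size lat/lng grid (the cell size and class names are illustrative, not Uber's implementation):&lt;/p&gt;

```python
from collections import defaultdict

CELL_DEG = 0.01  # grid cell of ~1.1 km of latitude; real S2/H3 cells are hierarchical

class GeoIndex:
    """Toy cell-bucketed index: O(1) update, proximity query scans 9 cells."""

    def __init__(self):
        self.cells = defaultdict(set)  # cell coords -> set of driver_ids
        self.positions = {}            # driver_id -> (lat, lng, cell)

    def _cell(self, lat, lng):
        return (int(lat // CELL_DEG), int(lng // CELL_DEG))

    def update(self, driver_id, lat, lng):
        new_cell = self._cell(lat, lng)
        old = self.positions.get(driver_id)
        if old is not None and old[2] != new_cell:
            self.cells[old[2]].discard(driver_id)  # driver moved to a new cell
        self.cells[new_cell].add(driver_id)
        self.positions[driver_id] = (lat, lng, new_cell)

    def nearby(self, lat, lng):
        """Drivers in the query cell and its 8 neighbors."""
        cx, cy = self._cell(lat, lng)
        found = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                found |= self.cells.get((cx + dx, cy + dy), set())
        return found
```

&lt;p&gt;Real cell systems are hierarchical, so the search radius can widen by stepping to coarser cells instead of scanning ever more neighbors.&lt;/p&gt;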

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; In-memory indexing is fast but volatile. If the Location Service crashes, the index rebuilds by replaying the Kafka topic. That's acceptable: the rebuild is fast, and a freshly replayed index is more current than any persisted snapshot would be.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Matching Engine
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Given a rider at a location, find the best available driver, send them an offer, handle their response, and confirm the match — all within the latency budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's hard:&lt;/strong&gt; "Best available driver" is not a simple nearest-neighbor query. You're optimizing across proximity, ETA, driver rating, acceptance rate, vehicle type, and surge zone simultaneously. And the candidate set is changing every second as drivers move.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution approach:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdcmksqugu1fo3u9xgxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdcmksqugu1fo3u9xgxn.png" alt=" " width="736" height="1140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The matching engine pulls a candidate pool from the geospatial index — typically drivers within a configurable radius. It scores each candidate using a weighted function that factors in ETA (dominant), acceptance rate (secondary), and trip efficiency. The top-scored driver gets the offer first.&lt;/p&gt;
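&lt;p&gt;A weighted scoring function of that shape is straightforward to sketch. The weights and the ETA normalization below are illustrative, not Uber's actual coefficients:&lt;/p&gt;

```python
WEIGHTS = {"eta": 0.6, "acceptance": 0.25, "efficiency": 0.15}  # illustrative

def score(candidate):
    """Higher is better. ETA dominates: 0 minutes maps to 1.0,
    15+ minutes maps to 0.0 (the cap is illustrative)."""
    eta_score = max(0.0, 1.0 - candidate["eta_min"] / 15.0)
    return (WEIGHTS["eta"] * eta_score
            + WEIGHTS["acceptance"] * candidate["acceptance_rate"]
            + WEIGHTS["efficiency"] * candidate["efficiency"])

pool = [
    {"id": "d1", "eta_min": 3, "acceptance_rate": 0.9,  "efficiency": 0.7},
    {"id": "d2", "eta_min": 2, "acceptance_rate": 0.4,  "efficiency": 0.9},
    {"id": "d3", "eta_min": 9, "acceptance_rate": 0.95, "efficiency": 0.8},
]
best = max(pool, key=score)  # the first driver to receive the offer
```

&lt;p&gt;The pool is re-scored per request: a driver who just declined or moved away of course ranks differently on the next trip.&lt;/p&gt;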

&lt;p&gt;&lt;strong&gt;The offer timeout matters.&lt;/strong&gt; 15 seconds is long enough for a driver to react, short enough that the rider's session doesn't expire. If the driver doesn't respond, the system moves to the next candidate immediately — no waiting, no retry on the same driver.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; Sequential offering is fair but slow under high demand. Uber has experimented with &lt;strong&gt;batched offers&lt;/strong&gt; — sending to multiple drivers simultaneously and accepting the first response. This reduces latency but requires careful deduplication to prevent double-assignment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-Time Communication
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Both the rider and driver need live updates — driver moving on map, ETA countdown, trip status changes. Polling is not acceptable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's hard:&lt;/strong&gt; WebSocket connections are stateful. You can't just add servers and route arbitrarily — the connection has to be maintained on a specific node. At millions of concurrent connections, connection management becomes infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution approach:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Uber uses &lt;strong&gt;long-lived WebSocket connections&lt;/strong&gt; for both rider and driver apps. These are maintained by a dedicated connection management layer — a stateful service that maps user IDs to open connections.&lt;/p&gt;
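&lt;p&gt;The mapping layer can be sketched as a per-node registry; a real deployment adds a shared routing table so any service can find which node holds a given user's connection (class and method names are illustrative):&lt;/p&gt;

```python
class ConnectionRegistry:
    """Maps user_id -> live connection on this node. In a multi-node
    deployment, a shared routing table maps user_id -> node first."""

    def __init__(self):
        self._conns = {}

    def register(self, user_id, conn):
        self._conns[user_id] = conn

    def unregister(self, user_id):
        self._conns.pop(user_id, None)

    def push(self, user_id, message):
        conn = self._conns.get(user_id)
        if conn is None:
            return False      # not connected here: fall back to push notification
        conn.append(message)  # stand-in for ws.send(message)
        return True

reg = ConnectionRegistry()
rider_conn = []               # a plain list stands in for a WebSocket
reg.register("rider-1", rider_conn)
delivered = reg.push("rider-1", {"driver_lat": 12.97, "driver_lng": 77.59})
missed = reg.push("rider-2", {"driver_lat": 0, "driver_lng": 0})
```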

&lt;p&gt;When a location update needs to be pushed to a rider (driver is moving), the flow is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmml3210urcz04qqyyo7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmml3210urcz04qqyyo7x.png" alt=" " width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; WebSockets are more operationally complex than HTTP polling. Connection drops require reconnect logic with exponential backoff. But the UX difference is stark — smooth driver movement on the map is a core product experience, not a nice-to-have.&lt;/p&gt;




&lt;h2&gt;
  
  
  End-to-End Ride Flow: From Booking to Completion
&lt;/h2&gt;

&lt;p&gt;A user opens Uber, enters a destination, and taps "Book Ride."&lt;/p&gt;

&lt;p&gt;From that single tap, a coordinated chain of services fires in sequence — some in milliseconds, some running in parallel, all of them invisible to the person staring at their phone waiting for a match.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 1 — Ride Request
&lt;/h3&gt;

&lt;p&gt;The rider submits a trip request from the client app over HTTPS.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;API Gateway&lt;/strong&gt; handles authentication, rate limiting, and routes the request to the &lt;strong&gt;Trip Service&lt;/strong&gt;, which initializes a trip record with state set to &lt;code&gt;pending&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data flowing:&lt;/strong&gt; rider ID, pickup coordinates, destination, vehicle type preference, timestamp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key challenge:&lt;/strong&gt; The trip record must be created atomically. A partial write — trip exists but state is undefined — causes inconsistency downstream. The Trip Service writes to a persistent store with a single transaction before anything else proceeds.&lt;/p&gt;
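&lt;p&gt;The atomicity requirement maps onto a single transactional insert, with NOT NULL constraints guaranteeing that no half-initialized row can exist. A sketch with SQLite standing in for the persistent store (the schema is illustrative):&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE trips (
    trip_id   TEXT PRIMARY KEY,
    rider_id  TEXT NOT NULL,
    pickup    TEXT NOT NULL,
    dest      TEXT NOT NULL,
    state     TEXT NOT NULL DEFAULT 'pending'
)""")

def create_trip(trip_id, rider_id, pickup, dest):
    # One INSERT inside one transaction: the row either exists fully
    # (state included, via NOT NULL + DEFAULT) or not at all.
    with db:  # commits on success, rolls back on exception
        db.execute(
            "INSERT INTO trips (trip_id, rider_id, pickup, dest) VALUES (?, ?, ?, ?)",
            (trip_id, rider_id, pickup, dest),
        )

create_trip("t1", "r42", "12.97,77.59", "12.93,77.62")
state = db.execute("SELECT state FROM trips WHERE trip_id = 't1'").fetchone()[0]
```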




&lt;h3&gt;
  
  
  Step 2 — Driver Discovery
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Matching Engine&lt;/strong&gt; receives the trip request and immediately queries the &lt;strong&gt;Geospatial Index&lt;/strong&gt; for available drivers within the configured radius.&lt;/p&gt;

&lt;p&gt;This is not a SQL &lt;code&gt;SELECT WHERE distance &amp;lt; X&lt;/code&gt; query. The index is an in-memory structure built on S2 geometry cells — the query resolves in under a millisecond and returns a ranked candidate pool, not a flat list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data flowing:&lt;/strong&gt; pickup coordinates, search radius, driver availability flags, vehicle type filters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key challenge:&lt;/strong&gt; The candidate pool is stale the moment it's returned. Drivers move. The system must account for position drift between query time and offer delivery — typically handled by padding ETA estimates conservatively.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 3 — Matching and Dispatch
&lt;/h3&gt;

&lt;p&gt;The Matching Engine scores each candidate in the pool against a weighted function: ETA carries the most weight, followed by acceptance rate, driver rating, and trip efficiency.&lt;/p&gt;

&lt;p&gt;The top-scored driver gets the first offer. This is not random. It is a deliberate optimization — sending offers to drivers most likely to accept reduces total match latency across the system, which matters at scale.&lt;/p&gt;

&lt;p&gt;The offer is dispatched through the &lt;strong&gt;Notification Service&lt;/strong&gt; via push notification and WebSocket simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data flowing:&lt;/strong&gt; driver ID, trip ID, pickup location, estimated pickup distance, offer expiry timestamp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key challenge:&lt;/strong&gt; The offer window is 15 seconds. Too short and drivers miss it. Too long and the rider's patience expires. This number is not arbitrary — it is derived from real conversion data.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 4 — Driver Acceptance
&lt;/h3&gt;

&lt;p&gt;The driver taps accept. The acceptance hits the Matching Engine, which immediately attempts to acquire a &lt;strong&gt;distributed lock&lt;/strong&gt; on the trip ID.&lt;/p&gt;

&lt;p&gt;First write wins. If two drivers somehow accept within the same window — possible in a batched offer scenario — only one lock acquisition succeeds. The second driver receives a "trip no longer available" response. No double-assignment. Ever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data flowing:&lt;/strong&gt; driver ID, trip ID, acceptance timestamp, current driver location.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key challenge:&lt;/strong&gt; The lock must be acquired, the trip state updated, and the rider notified — all before the offer window closes on another candidate. This is the tightest latency window in the entire flow.&lt;/p&gt;
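&lt;p&gt;The first-write-wins claim is a compare-and-set on the trip ID. In production this is typically a distributed primitive such as Redis &lt;code&gt;SET&lt;/code&gt; with &lt;code&gt;NX&lt;/code&gt; and a TTL; here is a single-process sketch of the semantics only:&lt;/p&gt;

```python
import threading

class TripLocks:
    """In-memory stand-in for a distributed lock (e.g. Redis SET NX + TTL)."""

    def __init__(self):
        self._owners = {}
        self._mu = threading.Lock()

    def try_claim(self, trip_id, driver_id):
        # Atomically: succeed only if the trip is still unclaimed.
        with self._mu:
            if trip_id in self._owners:
                return False           # lost the race: "trip no longer available"
            self._owners[trip_id] = driver_id
            return True                # first write wins

locks = TripLocks()
first = locks.try_claim("trip-1", "driver-A")
second = locks.try_claim("trip-1", "driver-B")  # rejected, no double-assignment
```

&lt;p&gt;The TTL matters in the real version: if the winner crashes before confirming, the claim must expire so the trip isn't stranded.&lt;/p&gt;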




&lt;h3&gt;
  
  
  Step 5 — Trip Start
&lt;/h3&gt;

&lt;p&gt;The driver navigates to the pickup location. Both the rider and driver are now exchanging state through the &lt;strong&gt;Trip Service&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When the driver arrives and taps "Arrived," the trip moves to &lt;code&gt;awaiting_rider&lt;/code&gt; state. When the rider boards and the driver taps "Start Trip," state transitions to &lt;code&gt;active&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data flowing:&lt;/strong&gt; driver GPS coordinates, trip state transitions, ETA updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key challenge:&lt;/strong&gt; State transitions must be idempotent. A driver tapping "Start Trip" twice — due to a slow network and a frustrated re-tap — must not corrupt the trip record or trigger duplicate downstream events.&lt;/p&gt;
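&lt;p&gt;Idempotency falls out of an explicit state machine: an event applies only from its expected predecessor state, so a duplicate tap is a no-op. A sketch using the state names above (the transition table itself is illustrative):&lt;/p&gt;

```python
# Legal transitions: applying an event from any other state is a no-op.
TRANSITIONS = {
    "arrived":    ("pending",        "awaiting_rider"),
    "start_trip": ("awaiting_rider", "active"),
    "end_trip":   ("active",         "completed"),
}

def apply(trip, event):
    """Returns True if the event changed state, False if it was a duplicate
    or out-of-order event (ignored rather than corrupting the record)."""
    expected, nxt = TRANSITIONS[event]
    if trip["state"] != expected:
        return False
    trip["state"] = nxt
    return True

trip = {"id": "t1", "state": "pending"}
apply(trip, "arrived")
apply(trip, "start_trip")
changed_again = apply(trip, "start_trip")  # the frustrated re-tap: no-op
```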




&lt;h3&gt;
  
  
  Step 6 — Ride in Progress
&lt;/h3&gt;

&lt;p&gt;This is the longest phase and, architecturally, the one with the most sustained load.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Location Service&lt;/strong&gt; ingests driver GPS updates every 4–5 seconds. These are published to a Kafka topic, consumed by the Trip Update Service, and pushed to the rider's app over a persistent WebSocket connection. The rider sees the driver moving on the map in near real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data flowing:&lt;/strong&gt; driver GPS stream, trip ID, rider connection reference, ETA recalculations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key challenge:&lt;/strong&gt; WebSocket connections drop. The client must implement reconnect logic with exponential backoff, and the server must resume the stream without duplicating or dropping location events. The Kafka offset acts as the source of truth for where the stream left off.&lt;/p&gt;
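&lt;p&gt;The reconnect policy is typically capped exponential backoff with jitter, and the resume request carries the last consumed offset rather than asking for "everything recent". A sketch of the backoff schedule (the constants are illustrative):&lt;/p&gt;

```python
import random

BASE_S, CAP_S = 0.5, 30.0

def backoff_delay(attempt, rng=random.random):
    """Delay before reconnect attempt `attempt` (0-based): exponential
    growth capped at CAP_S, with full jitter so a fleet of clients
    doesn't reconnect in lockstep after a shared outage."""
    ceiling = min(CAP_S, BASE_S * (2 ** attempt))
    return ceiling * rng()  # uniform in [0, ceiling)

# With jitter pinned to 1.0 the schedule is visible directly:
delays = [backoff_delay(a, rng=lambda: 1.0) for a in range(8)]
```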




&lt;h3&gt;
  
  
  Step 7 — Trip Completion
&lt;/h3&gt;

&lt;p&gt;The driver taps "End Trip." The Trip Service captures the final GPS coordinate, marks the trip &lt;code&gt;completed&lt;/code&gt;, and calculates the billable distance using the full GPS trace recorded during the ride — not straight-line origin-to-destination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data flowing:&lt;/strong&gt; final GPS coordinate, full route trace, trip duration, computed fare.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key challenge:&lt;/strong&gt; GPS traces are noisy. A driver on a highway doesn't teleport — but raw coordinates sometimes suggest otherwise. Fare calculation uses map-snapped routes, not raw GPS, to prevent billing anomalies.&lt;/p&gt;
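&lt;p&gt;Map snapping is a heavyweight step, but a first line of defense against noisy traces can be sketched directly: drop any segment whose implied speed is physically impossible before summing distances (the speed cap is illustrative):&lt;/p&gt;

```python
import math

MAX_SPEED_MPS = 60.0  # ~216 km/h: anything faster is GPS noise, not driving

def haversine_m(p, q):
    """Great-circle distance in meters between (lat, lng) pairs."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 6371000 * 2 * math.asin(math.sqrt(a))

def billable_distance_m(trace):
    """trace: list of (lat, lng, ts_seconds). Sum segment distances,
    skipping segments whose implied speed exceeds the cap (a teleport)."""
    total = 0.0
    for (la1, ln1, t1), (la2, ln2, t2) in zip(trace, trace[1:]):
        d = haversine_m((la1, ln1), (la2, ln2))
        speed = d / max(t2 - t1, 1e-9)
        if speed > MAX_SPEED_MPS:
            continue  # GPS spike: don't bill the rider for it
        total += d
    return total
```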




&lt;h3&gt;
  
  
  Step 8 — Payment Processing
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Payment Service&lt;/strong&gt; receives the computed fare and initiates the charge against the rider's stored payment method.&lt;/p&gt;

&lt;p&gt;This is an asynchronous operation. The rider sees "Trip Complete" immediately — the payment processes in the background. If the charge fails, the system retries with exponential backoff. Persistent failures are queued for manual resolution and the rider is notified separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data flowing:&lt;/strong&gt; rider ID, payment method token, fare amount, trip ID, idempotency key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key challenge:&lt;/strong&gt; Idempotency is non-negotiable. A retry must never result in a double charge. Every payment request carries a unique idempotency key tied to the trip ID — the payment processor uses it to deduplicate retries at the transaction level.&lt;/p&gt;
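&lt;p&gt;On the processor side, the key is a lookup before the charge: a replayed key returns the stored result instead of charging again. A minimal sketch, with an in-memory dict standing in for the processor's durable dedup table:&lt;/p&gt;

```python
class PaymentProcessor:
    def __init__(self):
        self._results = {}   # idempotency_key -> recorded charge result
        self.charges_made = 0

    def charge(self, idempotency_key, rider_id, amount):
        # A replay of a known key returns the stored result: no second charge.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"status": "charged", "rider": rider_id, "amount": amount}
        self._results[idempotency_key] = result
        return result

proc = PaymentProcessor()
key = "trip-t1-final-fare"               # derived from the trip ID
first = proc.charge(key, "r42", 18.40)
retry = proc.charge(key, "r42", 18.40)   # network retry carries the same key
```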




&lt;h3&gt;
  
  
  Step 9 — Post-Ride Actions
&lt;/h3&gt;

&lt;p&gt;Trip data fans out to multiple downstream consumers asynchronously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rating prompts are pushed to both rider and driver&lt;/li&gt;
&lt;li&gt;The receipt is generated and emailed&lt;/li&gt;
&lt;li&gt;The trip record is written to the analytics pipeline&lt;/li&gt;
&lt;li&gt;Driver earnings are updated in the earnings ledger&lt;/li&gt;
&lt;li&gt;Surge pricing models are updated with the completed trip signal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these block the core flow. They consume from the same Kafka event that marked the trip completed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key challenge:&lt;/strong&gt; Downstream consumers must be designed for eventual consistency. The analytics pipeline does not need to reflect the trip in real-time. The earnings ledger does — a driver who completes a trip expects their balance to update promptly.&lt;/p&gt;




&lt;h3&gt;
  
  
  Where the Flow Breaks
&lt;/h3&gt;

&lt;p&gt;Every handoff above is a potential failure point. The ones that matter most:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Driver doesn't respond within 15 seconds.&lt;/strong&gt; The Matching Engine moves to the next candidate immediately. From the rider's perspective, the search continues. The timed-out driver sees the offer expire silently.&lt;/p&gt;
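&lt;p&gt;That fallthrough is, at its core, a loop over the ranked candidates with a per-offer deadline. A sketch with the offer transport stubbed out (names are illustrative):&lt;/p&gt;

```python
OFFER_TIMEOUT_S = 15  # per-driver offer window

def dispatch(candidates, send_offer):
    """Offer the trip to ranked candidates one at a time.

    send_offer(driver_id, timeout) returns True on accept, False on
    decline, timeout, or network failure. All three are treated the
    same way: move on, and never re-offer to that driver this round.
    """
    for driver_id in candidates:
        if send_offer(driver_id, OFFER_TIMEOUT_S):
            return driver_id  # match confirmed
    return None               # pool exhausted; caller widens the search

responses = {"d1": False, "d2": False, "d3": True}
matched = dispatch(["d1", "d2", "d3"], lambda d, t: responses[d])
```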

&lt;p&gt;&lt;strong&gt;Multiple drivers accept simultaneously.&lt;/strong&gt; Distributed lock on the trip ID ensures only one match is confirmed. The losing driver gets a rejection with no trip context exposed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network failure mid-ride.&lt;/strong&gt; The driver's GPS stream pauses and the rider's map freezes. Buffered location points are replayed when connectivity returns, and fare calculation uses the recovered trace rather than interpolating across the gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Payment failure at completion.&lt;/strong&gt; Trip is marked complete regardless. Payment retries asynchronously. The rider is not held at the end screen waiting for a payment confirmation that might never come.&lt;/p&gt;

&lt;p&gt;The end-to-end flow is where the system's real complexity lives — not in any single service, but in the handoffs between them, the state transitions under failure, and the guarantees the system must maintain while moving faster than the user can perceive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure Scenarios
&lt;/h2&gt;

&lt;p&gt;This is where most system design discussions go shallow. Real systems fail in real ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Driver accepts but network drops:&lt;/strong&gt;&lt;br&gt;
The driver taps accept, but the acknowledgment never reaches the server. From the server's perspective, the offer timed out. The system moves to the next driver. The first driver's app eventually reconnects and tries to confirm — the system rejects it because the match is already assigned elsewhere. The first driver sees an error. Rider experience is unaffected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple drivers accept simultaneously (race condition):&lt;/strong&gt;&lt;br&gt;
In a batched offer scenario, two drivers both accept within milliseconds of each other. The system must guarantee only one match. This is solved with &lt;strong&gt;distributed locking&lt;/strong&gt; on the trip ID — first write wins, second write is rejected with a conflict response. The second driver gets a "trip no longer available" notification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No drivers available:&lt;/strong&gt;&lt;br&gt;
The geo query returns an empty candidate set. The Dispatch Service doesn't fail — it enters a &lt;strong&gt;retry loop with expanding radius&lt;/strong&gt;, checking every few seconds as new drivers become available or existing drivers complete trips. The rider sees a "finding your driver" state. If no driver is found within a threshold, the request fails gracefully with a clear user message.&lt;/p&gt;
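&lt;p&gt;The retry loop can be sketched as a walk over expanding radii with an explicit budget (the radii and the query function are illustrative):&lt;/p&gt;

```python
RADII_KM = [1, 2, 4, 8]  # widen the search each round

def find_driver(query_nearby, rounds=len(RADII_KM)):
    """query_nearby(radius_km) returns a (possibly empty) candidate list.
    Expand the radius until a driver appears or the budget runs out; the
    real system also sleeps a few seconds between rounds so newly freed
    drivers can enter the pool."""
    for attempt in range(rounds):
        candidates = query_nearby(RADII_KM[min(attempt, len(RADII_KM) - 1)])
        if candidates:
            return candidates[0]
    return None  # fail gracefully: "no drivers available"

# A stub index where the nearest driver sits 3 km away:
result = find_driver(lambda r_km: ["d7"] if r_km >= 3 else [])
```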




&lt;h2&gt;
  
  
  The Trade-offs That Matter
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Consistency vs. availability:&lt;/strong&gt; For matching, you want consistency — two riders must not get the same driver. Uber leans toward consistency here, accepting that under extreme load, some match attempts will fail and retry rather than produce a double-assignment. The cost is occasional latency. The alternative — a confused driver with two riders — is worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency vs. accuracy:&lt;/strong&gt; Driver location is inherently stale. A location updated 4 seconds ago is already wrong for a moving vehicle. ETA calculations account for this by using historical speed models per road segment rather than pure straight-line distance. Accepting slight inaccuracy in location enables the throughput needed to serve the system at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost vs. performance:&lt;/strong&gt; In-memory geospatial indexes are fast but expensive. You're paying for RAM at massive scale. The alternative — disk-based geospatial queries — is cheaper but too slow for the latency target. For a system where matching latency directly drives revenue conversion, the memory cost is justified. This is a deliberate engineering economics decision, not an oversight.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Insights
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The matching problem is not a database problem.&lt;/strong&gt; It's a real-time geospatial optimization problem. Design accordingly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate your write path from your query path.&lt;/strong&gt; Location ingestion and location querying are different workloads with different requirements. Don't conflate them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure handling is not error handling.&lt;/strong&gt; Design every step of the match lifecycle assuming the next step will fail. What does the system do? That answer needs to be encoded, not improvised.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful services are unavoidable.&lt;/strong&gt; WebSocket connections, geospatial indexes, distributed locks — you cannot build this system as purely stateless services. Accept the operational complexity and design for it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency budgets are product decisions, not engineering ones.&lt;/strong&gt; The 10-second match window isn't arbitrary — it's derived from rider churn data. Your architecture must be built to serve that number, not the other way around.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every second of that 10-second match window has a system decision behind it.&lt;/strong&gt; The end-to-end flow is where the system's real complexity lives — not in any single service, but in the handoffs, the state transitions, and the guarantees maintained under failure.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;A user opens Uber and gets a driver in 10 seconds.&lt;/p&gt;

&lt;p&gt;What they don't see: a geospatial index serving thousands of queries per second, a matching engine racing through candidate ranking, a WebSocket layer maintaining millions of live connections, a distributed lock preventing double-assignment, and a failure handler quietly retrying in the background.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The system's job is to make complexity invisible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the real benchmark for production-grade distributed systems — not whether they work when everything goes right, but whether users never notice when things go wrong.&lt;/p&gt;

&lt;p&gt;Build for the failure path first. The happy path takes care of itself.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>distributedsystems</category>
      <category>kafka</category>
      <category>backend</category>
    </item>
    <item>
      <title>Why Most AI Agents Fail (And How to Design Them Right)</title>
      <dc:creator>Ishaan Gaba</dc:creator>
      <pubDate>Tue, 17 Mar 2026 16:54:46 +0000</pubDate>
      <link>https://dev.to/ishaanthedev/why-most-ai-agents-fail-and-how-to-design-them-right-o70</link>
      <guid>https://dev.to/ishaanthedev/why-most-ai-agents-fail-and-how-to-design-them-right-o70</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Most AI agents shipped to production are not agents. They are dressed-up chatbots with a tool list and a prayer.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's a provocative claim, but after building and reviewing LLM-powered systems across customer support, internal tooling, and real-time messaging platforms, the pattern is impossible to ignore. Teams integrate an LLM, wire up a few API calls, and call it an "agent." Then latency spikes, context breaks down, the agent calls the wrong tool, and suddenly the engineering post-mortem is asking: &lt;em&gt;what went wrong?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This post breaks down exactly why AI agents fail in production — and how to engineer them so they don't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hype vs. The Reality
&lt;/h2&gt;

&lt;p&gt;The demo looks flawless. The agent reads a user message, reasons over it, calls a function, and returns a clean response. Three minutes to build. Everyone applauds.&lt;/p&gt;

&lt;p&gt;Production is different. Real users are unpredictable. Messages are ambiguous. Tool calls fail. Latency matters. And the agent, designed for linear tasks in controlled demos, collapses under the weight of real-world variability.&lt;/p&gt;

&lt;p&gt;The problem is not the LLM. The problem is architecture.&lt;/p&gt;

&lt;p&gt;Agents are not just LLMs with tools attached. They are &lt;strong&gt;autonomous reasoning systems&lt;/strong&gt; that must handle state, uncertainty, failure, and feedback — often across multiple steps. Treating them otherwise is where most teams go wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Real-World Case: The Chat Platform Agent
&lt;/h2&gt;

&lt;p&gt;Imagine you're building an AI-powered layer on top of a Slack/WhatsApp-style messaging system. The agent is supposed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate smart reply suggestions&lt;/li&gt;
&lt;li&gt;Detect and flag inappropriate messages&lt;/li&gt;
&lt;li&gt;Summarize long threads&lt;/li&gt;
&lt;li&gt;Trigger message actions (react, reply, archive)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a realistic scope. Here is what the naive architecture looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfuvxowo1wug5ksbhnyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsfuvxowo1wug5ksbhnyj.png" alt="scope" width="800" height="823"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simple enough. But now trace through what actually happens in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1 — Context Collapse:&lt;/strong&gt; A user sends a sarcastic message: &lt;em&gt;"Oh great, another outage."&lt;/em&gt; The smart reply agent, lacking thread history, suggests: &lt;em&gt;"Glad to hear things are going well!"&lt;/em&gt; The agent had no context. It made up a coherent but catastrophically wrong response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2 — Tool Ambiguity:&lt;/strong&gt; The user says, &lt;em&gt;"React to that with a thumbs up."&lt;/em&gt; The agent calls &lt;code&gt;send_message&lt;/code&gt; instead of &lt;code&gt;add_reaction&lt;/code&gt; because both tools accept similar inputs and the descriptions are vague. The action is wrong. The user is confused.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3 — Latency Death Spiral:&lt;/strong&gt; The agent decides it needs to summarize the thread, moderate the content, and suggest a reply — all in a single turn. Three LLM calls, two API calls, and a 14-second response time. Users abandon it.&lt;/p&gt;

&lt;p&gt;These are not edge cases. These are the first three weeks of production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Six Ways Agents Fail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Treating the Agent Like a Chatbot
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What goes wrong:&lt;/strong&gt; Engineers build a single-turn request-response loop. User sends a message, LLM responds. No planning, no persistence, no feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt; The chatbot mental model is deeply ingrained. Most LLM tutorials are structured this way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix it:&lt;/strong&gt; Design agents around a &lt;strong&gt;Think → Plan → Act → Validate&lt;/strong&gt; loop. The agent should reason about &lt;em&gt;what it needs to do&lt;/em&gt; before doing it. This means separating the reasoning step from the execution step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudx4mmlhmr1ajkgk3fym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudx4mmlhmr1ajkgk3fym.png" alt="Diagram" width="800" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A chatbot responds. An agent &lt;em&gt;decides&lt;/em&gt;, then responds.&lt;/p&gt;
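&lt;p&gt;A minimal sketch of that loop, in Python. The &lt;code&gt;llm_plan&lt;/code&gt; and &lt;code&gt;run_tool&lt;/code&gt; helpers are hypothetical stubs standing in for a real model call and tool dispatcher — the shape of the loop is the point, not the stubs:&lt;/p&gt;

```python
# Sketch of a Think -> Plan -> Act -> Validate loop.
# llm_plan and run_tool are illustrative stubs; swap in a real
# LLM call and tool dispatcher.

def llm_plan(message, history, tools):
    # Stub planner: a real one calls the LLM with structured context.
    if not history and "react" in message.lower():
        return {"action": "add_reaction", "args": {"emoji": ":thumbsup:"}}
    return {"action": "respond", "text": "Done - reaction added."}

def run_tool(name, args):
    # Stub dispatcher: a real one validates args and calls the platform API.
    return {"tool": name, "ok": True}

def run_agent(message, tools, max_steps=5):
    history = []
    for _ in range(max_steps):
        # Think + Plan: decide the next action from message + history
        plan = llm_plan(message, history, tools)
        if plan["action"] == "respond":
            return plan["text"]
        # Act: execute exactly one tool call per iteration
        result = run_tool(plan["action"], plan.get("args", {}))
        # Validate: the observed result feeds the next planning step
        history.append({"action": plan["action"], "result": result})
    return "Could not complete the request within the step budget."
```

&lt;p&gt;The step budget matters: without &lt;code&gt;max_steps&lt;/code&gt;, a confused planner loops forever.&lt;/p&gt;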




&lt;h3&gt;
  
  
  2. Poor Tool Design
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What goes wrong:&lt;/strong&gt; Tools are defined as thin wrappers around APIs with no semantic clarity. Names like &lt;code&gt;do_action&lt;/code&gt; or &lt;code&gt;process_data&lt;/code&gt; give the LLM no guidance. Overlapping tool signatures cause the model to pick the wrong one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt; Engineers think about tools as functions — not as cognitive affordances for the LLM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix it:&lt;/strong&gt; Treat tool design like API design for a developer who has never seen your codebase. Names, descriptions, and parameter schemas must be unambiguous.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;BAD&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"message_action"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Do something with a message"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;GOOD&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"add_reaction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Adds an emoji reaction to a specific message in a channel. Use this when the user wants to react to a message, not when they want to send a reply."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string — The unique ID of the target message"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"emoji"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string — Emoji shortcode, e.g. ':thumbsup:'"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The test:&lt;/strong&gt; If you gave these tool definitions to a junior engineer with no context, would they know exactly when to call each one? If not, neither will the LLM.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Lack of Context Structuring
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What goes wrong:&lt;/strong&gt; The agent receives a raw message and nothing else. Or worse — it receives a 4,000-token chat dump with no structure. Either way, the LLM reasons from noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt; Teams focus on getting the tool wiring right and treat context as an afterthought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix it:&lt;/strong&gt; Structure context as a deliberate input, not a log dump. Separate signal from noise.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;msg_991&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;React to that with a thumbs up&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sender&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-15T10:42:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recent_thread&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sender&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_77&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The deploy broke staging again&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sender&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Yeah I saw that — not great&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;available_actions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add_reaction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;send_reply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flag_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_permissions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;react&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent now knows the message, the thread context, what it's allowed to do, and who it's talking to. This is the minimum viable context for a messaging agent.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. No Execution Control
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What goes wrong:&lt;/strong&gt; The agent acts immediately. No retry logic, no rollback, no confirmation for destructive operations. A moderation agent deletes a message it misclassified. No undo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt; Execution is treated as a side effect, not a first-class concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix it:&lt;/strong&gt; Classify tool calls by risk level and enforce execution gates accordingly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Risk Level&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Execution Policy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Read thread, summarize, suggest reply&lt;/td&gt;
&lt;td&gt;Execute directly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Send message, add reaction&lt;/td&gt;
&lt;td&gt;Execute with logging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Delete message, ban user, bulk action&lt;/td&gt;
&lt;td&gt;Require confirmation or human review&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For high-risk actions, surface a &lt;strong&gt;confirmation step&lt;/strong&gt; before execution. In async systems, push high-risk operations into a review queue with a timeout.&lt;/p&gt;
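&lt;p&gt;The policy table reduces to a small deterministic gate that sits in front of every tool call. The risk mapping below is illustrative; the key design choice is that unknown tools default to the high-risk path:&lt;/p&gt;

```python
# Sketch of a risk gate enforcing the policy table above.
# The RISK_LEVELS mapping is illustrative, not exhaustive.

RISK_LEVELS = {
    "summarize_thread": "low",
    "suggest_reply": "low",
    "send_message": "medium",
    "add_reaction": "medium",
    "delete_message": "high",
    "ban_user": "high",
}

def gate(tool_name):
    # Unknown tools default to "high": fail closed, not open.
    level = RISK_LEVELS.get(tool_name, "high")
    if level == "low":
        return "execute"
    if level == "medium":
        return "execute_with_log"
    return "require_confirmation"  # high risk: human in the loop
```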




&lt;h3&gt;
  
  
  5. Weak Memory Handling
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What goes wrong:&lt;/strong&gt; Every conversation starts cold. The agent has no memory of previous interactions, user preferences, or prior decisions. Users repeat themselves. The agent contradicts itself across sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt; Stateless is the default. Teams don't architect memory as a subsystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix it:&lt;/strong&gt; Build a layered memory model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8h2upyjaadaep24yg13z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8h2upyjaadaep24yg13z.png" alt="Diagram" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Long-term memory does not have to be complex. A vector store with summarized user interactions and outcomes is sufficient for most production use cases. What matters is that memory is &lt;strong&gt;retrieved&lt;/strong&gt;, not assumed.&lt;/p&gt;
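&lt;p&gt;A toy sketch of the layered assembly. The relevance scoring here is a keyword-overlap stand-in for real vector similarity, and the stored memories are invented examples — what the sketch shows is that long-term memory is pulled by a query at request time:&lt;/p&gt;

```python
# Sketch of layered memory assembly. The keyword-overlap scoring is a
# stand-in for vector similarity; the memory entries are illustrative.

session_memory = ["User asked for a summary of #deploys yesterday"]
long_term_memory = [
    "user_42 prefers terse replies",
    "user_42 is on the infra team",
]

def retrieve(query, memories, k=1):
    # Toy relevance: count shared words (replace with embedding search)
    def score(m):
        return len(set(query.lower().split()).intersection(m.lower().split()))
    return sorted(memories, key=score, reverse=True)[:k]

def build_context(current_message):
    return {
        "working": current_message,          # this turn
        "session": session_memory[-5:],      # recent turns, bounded
        "long_term": retrieve(current_message, long_term_memory),
    }
```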




&lt;h3&gt;
  
  
  6. Missing Guardrails
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What goes wrong:&lt;/strong&gt; The agent hallucinates a tool parameter. It calls a tool outside its defined scope. It enters an infinite planning loop. It generates a reply that violates policy. None of this is caught before it reaches the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt; Guardrails are seen as post-launch polish, not core architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix it:&lt;/strong&gt; Guardrails are not optional. They are the difference between a demo and a production system. Implement them at three layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input validation:&lt;/strong&gt; Classify and sanitize user input before it reaches the agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output validation:&lt;/strong&gt; Check the agent's planned actions and tool calls against a schema and policy ruleset before execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response filtering:&lt;/strong&gt; Run final responses through a lightweight classifier before delivery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the messaging agent, this means: if the agent attempts to call &lt;code&gt;delete_message&lt;/code&gt; outside of a moderation flow, reject the action and route to a human reviewer.&lt;/p&gt;
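&lt;p&gt;The output-validation layer can be sketched as a schema check that runs before any tool call executes. Tool names and schemas below are illustrative:&lt;/p&gt;

```python
# Sketch of output validation: a planned tool call is checked against
# an allowlist and a parameter schema before it is allowed to run.
# Tool names and schemas are illustrative.

TOOL_SCHEMAS = {
    "add_reaction": {"message_id": str, "emoji": str},
    "send_reply": {"message_id": str, "text": str},
}

def validate_call(call, allowed_tools):
    name = call.get("name")
    if name not in allowed_tools or name not in TOOL_SCHEMAS:
        return False, "tool not permitted in this flow"
    schema = TOOL_SCHEMAS[name]
    args = call.get("args", {})
    if set(args) != set(schema):
        return False, "argument names do not match schema"
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            return False, f"bad type for {key}"
    return True, "ok"
```

&lt;p&gt;A hallucinated &lt;code&gt;delete_message&lt;/code&gt; call fails the allowlist check and never reaches execution — exactly the reject-and-route behavior described above.&lt;/p&gt;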




&lt;h2&gt;
  
  
  The Right Architecture
&lt;/h2&gt;

&lt;p&gt;Here is what a production-grade chat agent architecture looks like when these principles are applied:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93sojq8mubf7fcu81j3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93sojq8mubf7fcu81j3o.png" alt="Correct Flow" width="800" height="1046"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not overengineering. Every node in this diagram corresponds to a failure mode described above. Each one exists because something broke in production without it.&lt;/p&gt;

&lt;p&gt;The key design properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning is separate from execution.&lt;/strong&gt; The LLM plans; execution is deterministic and validated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context is structured and retrieved&lt;/strong&gt;, not raw and assumed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every tool call passes through a risk gate&lt;/strong&gt; before execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory is a subsystem&lt;/strong&gt;, not an afterthought.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop is built in&lt;/strong&gt;, not bolted on.&lt;/li&gt;
&lt;/ul&gt;
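&lt;p&gt;Those properties compose into a single request path: one non-deterministic planning step wrapped in deterministic validation, risk gating, and logging. A sketch, with the collaborators passed in as functions (all names illustrative):&lt;/p&gt;

```python
# Sketch of the deterministic pipeline around the planner. plan_fn is
# the only non-deterministic step; everything after it is validated,
# risk-gated, and logged. All helper names are illustrative.

def handle(message, plan_fn, validate_fn, risk_fn, execute_fn, log):
    plan = plan_fn(message)            # LLM plans (non-deterministic)
    ok, reason = validate_fn(plan)     # schema + policy check
    if not ok:
        log.append(("rejected", reason))
        return "Action blocked: " + reason
    if risk_fn(plan) == "high":        # risk gate: human in the loop
        log.append(("queued_for_review", plan["name"]))
        return "Queued for human review."
    result = execute_fn(plan)          # deterministic execution
    log.append(("executed", plan["name"]))
    return result
```

&lt;p&gt;Because reasoning is isolated in &lt;code&gt;plan_fn&lt;/code&gt;, every other stage can be unit-tested without an LLM in the loop.&lt;/p&gt;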




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;On architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design the Think → Plan → Act → Validate loop first. Wire up tools second.&lt;/li&gt;
&lt;li&gt;Separate reasoning from execution. Treat them as distinct subsystems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On tooling:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write tool descriptions for an LLM, not for a developer reading docs.&lt;/li&gt;
&lt;li&gt;Every tool should have a clear, non-overlapping purpose.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On context:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structure context deliberately. Signal-to-noise ratio matters more than raw token count.&lt;/li&gt;
&lt;li&gt;Retrieval beats injection — pull what's needed, don't dump everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On reliability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Risk-gate every tool call. Not all actions are reversible.&lt;/li&gt;
&lt;li&gt;Guardrails are not polish — they are architecture.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On memory:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build layered memory from day one. Cold-start agents fail users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every additional reasoning step costs latency. Profile your loop and set token budgets per step.&lt;/li&gt;
&lt;li&gt;Human-in-the-loop adds latency but prevents catastrophic errors for high-stakes actions. Make this a deliberate design choice, not an oversight.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The gap between a demo agent and a production agent is not a gap in capability — it is a gap in &lt;strong&gt;systems thinking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;LLMs have given engineers a powerful new primitive. But primitives do not build reliable systems. Architecture does. The teams shipping agents that actually work in production are not the ones who found the best prompt. They are the ones who treated context, memory, tooling, and execution control as first-class engineering concerns from day one.&lt;/p&gt;

&lt;p&gt;Build the loop. Structure the context. Gate the execution. Ship the guardrails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An AI agent is only as reliable as the system it runs inside.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
