Siddhesh Surve

Posted on Feb 11

🍔 Stop Building "Wrappers": How Yelp Architected a Real-World AI Assistant

#systemdesign #ai #yelp #softwareengineering

Most AI tutorials lie to you.
They show you import openai, a few lines of Python, and a vector database, and call it a "Product."
But if you deploy that to production, you get:

Latency that makes users quit.
Hallucinations that destroy trust.
API bills that bankrupt your startup.

Building a chat interface is easy. Building a System that handles millions of queries about restaurants, reservations, and reviews is hard engineering.

Yelp recently pulled back the curtain on how they built Yelp Assistant, and it is a masterclass in modern System Design.
They moved beyond the "Chatbot" phase and built an Agentic Orchestrator.

Here is the breakdown of their architecture, why "Function Calling" is the new SQL, and how you can apply these patterns to your own apps.

🏗️ The High-Level Architecture (The "Brain")

Yelp didn't just hook up a frontend to GPT-4. They built a sophisticated middleware layer.
When you ask Yelp: "Find me a cozy Italian spot for a date night in San Francisco," here is what actually happens:

Interaction Service: Handles the WebSocket/SSE connection (for speed).
Orchestrator (The Manager): The brain that holds state and decides what to do next.
LLM Gateway: A unified interface to talk to models (OpenAI, Anthropic, or open-source).
Tools (The Hands): APIs for Search, Reservations, and Reviews.

💡 The Big Shift: "Routing" over "Reasoning"

Novice AI engineers send everything to the LLM.
Yelp realized that is inefficient.
They implemented Intent Classification (Routing) early in the pipeline.

Why this matters:
If a user says "Hi," you don't need a Retrieval-Augmented Generation (RAG) pipeline. You just say "Hi" back.
If a user says "Book a table," you don't need to search for reviews.

Adding value to your stack:
Instead of one giant prompt, use a small, fast model (like GPT-4o-mini or Claude Haiku) to classify the intent first.

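# Pseudo-code for intent routing.
# classifier_model and the handlers below are placeholders for your own async components.
async def route_request(user_query: str):
    # A small, fast model labels the query before any expensive work runs.
    intent = await classifier_model.predict(user_query)

    if intent == "GREETING":
        return await quick_reply(user_query)
    elif intent == "SEARCH":
        return await search_orchestrator(user_query)
    elif intent == "ACTION_BOOKING":
        return await booking_tool(user_query)
    # Unknown intent: fall back to search instead of silently returning None.
    # (This fallback choice is an assumption, pick whatever is safest for your app.)
    return await search_orchestrator(user_query)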

🛠️ RAG 2.0: Function Calling & Hybrid Search

Yelp sits on a goldmine of data: Reviews.
But feeding millions of reviews into a context window is impossible.

The Problem with Basic Vector Search

Standard RAG (Vector Search) is great for "general vibes," but bad for specifics.
If you search for "Pizza place open now," a vector search might return a place that has great pizza but is currently closed, because "open now" is a hard filter, not a semantic similarity.
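
To make the "hard filter vs. semantic similarity" point concrete, here is a toy sketch. The `Business` class, the 2-d embeddings, and both ranking functions are made up for illustration and have nothing to do with Yelp's actual code. It only shows why a filter has to run before (or alongside) the similarity ranking.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import sqrt


@dataclass
class Business:
    name: str
    open_now: bool
    embedding: list[float]  # toy 2-d vectors; real embeddings are much larger


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_only(query_vec, businesses):
    # Pure semantic ranking: "open now" plays no part in the similarity score.
    return sorted(businesses, key=lambda b: cosine(query_vec, b.embedding), reverse=True)


def filter_then_rank(query_vec, businesses, open_now=None):
    # Apply the hard constraint first, then rank the survivors semantically.
    candidates = [b for b in businesses if open_now is None or b.open_now == open_now]
    return vector_only(query_vec, candidates)


businesses = [
    Business("Closed Pizza Palace", open_now=False, embedding=[1.0, 0.0]),
    Business("Open Slice Shop", open_now=True, embedding=[0.8, 0.6]),
    Business("Open Sushi Bar", open_now=True, embedding=[0.0, 1.0]),
]
pizza_query = [1.0, 0.0]  # pretend this is the embedding of "pizza place"

print(vector_only(pizza_query, businesses)[0].name)       # Closed Pizza Palace
print(filter_then_rank(pizza_query, businesses, open_now=True)[0].name)  # Open Slice Shop
```

The vector-only ranking happily puts the closed restaurant on top because it is the closest semantic match. The filtered version can never return it.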

The Solution: Function Calling

Yelp uses the LLM as a Translator, not a Database.
The LLM converts natural language into a structured API call to Yelp's existing (and very powerful) search engine.

User: "Cheap sushi in SoHo open now."
LLM (Function Call):

{
  "tool": "business_search",
  "parameters": {
    "cuisine": "sushi",
    "location": "SoHo, NY",
    "price_tier": 1,
    "open_now": true
  }
}
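
On the receiving end, you should not pass the model's JSON straight to your backend. A minimal sketch of a validating dispatcher is below. The `business_search` stub, the `TOOLS` registry, and the parameter schema are assumptions for illustration, not Yelp's real API.

```python
import json
from typing import Any

# Hypothetical schema: allowed parameter names and their expected types.
BUSINESS_SEARCH_PARAMS = {"cuisine": str, "location": str, "price_tier": int, "open_now": bool}


def business_search(**params: Any) -> dict:
    # Stand-in for the real search backend; it just echoes the structured query.
    return {"query": params, "results": []}


# Registry of callable tools: name -> (function, parameter schema).
TOOLS = {"business_search": (business_search, BUSINESS_SEARCH_PARAMS)}


def dispatch_tool_call(raw: str) -> dict:
    """Parse the model's JSON, validate it, and only then call the tool."""
    call = json.loads(raw)
    tool_name = call.get("tool")
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name!r}")
    func, schema = TOOLS[tool_name]
    params = call.get("parameters", {})
    for key, value in params.items():
        if key not in schema:
            raise ValueError(f"unexpected parameter: {key!r}")
        # Exact type check, so that True is not accepted where an int is expected.
        if type(value) is not schema[key]:
            raise ValueError(f"{key!r} should be {schema[key].__name__}")
    return func(**params)


llm_output = (
    '{"tool": "business_search", "parameters": '
    '{"cuisine": "sushi", "location": "SoHo, NY", "price_tier": 1, "open_now": true}}'
)
result = dispatch_tool_call(llm_output)
print(result["query"])
```

The model only ever produces a request. Your code decides whether that request is allowed to run.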

Viral Takeaway:
Stop trying to teach your LLM facts. Teach it how to use Tools.
Your existing SQL/Elasticsearch database is smarter than a Vector DB for structured queries. Use the LLM to write the query, not answer the question.

🚦 The "LLM Gateway" Pattern

This is the most critical infrastructure piece, and the one many developers skip.
Yelp doesn't let services call OpenAI directly. They go through a centralized LLM Gateway.

Why you need this immediately:

Model Swapping: Did OpenAI go down? Switch to Anthropic instantly without redeploying code.
Observability: Track token usage, latency, and costs per user.
Rate Limiting: Prevent one user from draining your API credits.
PII Redaction: The Gateway can automatically scrub emails/phone numbers before they leave your servers.
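
Here is a rough sketch of what such a gateway could look like. Everything in it is an assumption for illustration: the `LLMGateway` class, the provider callables (plain functions standing in for real SDK clients), and the deliberately naive regexes. It covers failover, a per-user request cap, redaction, and latency logging. Token and cost tracking would hook into the same log.

```python
import re
import time
from collections import defaultdict

# Deliberately simple patterns; real PII detection needs far more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


class LLMGateway:
    def __init__(self, providers, max_requests_per_user=100):
        # providers: ordered list of (name, callable taking a prompt, returning text)
        self.providers = providers
        self.max_requests_per_user = max_requests_per_user
        self.request_counts = defaultdict(int)
        self.log = []

    def complete(self, user_id: str, prompt: str) -> str:
        # Rate limiting: a simple lifetime cap per user for this sketch.
        if self.request_counts[user_id] >= self.max_requests_per_user:
            raise RuntimeError(f"rate limit exceeded for {user_id}")
        self.request_counts[user_id] += 1

        safe_prompt = redact_pii(prompt)  # scrub before anything leaves the server
        last_error = None
        for name, call in self.providers:  # model swapping: try providers in order
            start = time.perf_counter()
            try:
                reply = call(safe_prompt)
            except Exception as exc:
                last_error = exc
                continue
            # Observability: record who called what and how long it took.
            self.log.append(
                {"user": user_id, "provider": name, "latency_s": time.perf_counter() - start}
            )
            return reply
        raise RuntimeError("all providers failed") from last_error


def flaky_primary(prompt: str) -> str:
    raise ConnectionError("primary provider is down")


def backup(prompt: str) -> str:
    return f"echo: {prompt}"


gateway = LLMGateway([("primary", flaky_primary), ("backup", backup)], max_requests_per_user=2)
reply = gateway.complete("user-1", "Email me at jane@example.com or call 415-555-0100")
print(reply)  # echo: Email me at [EMAIL] or call [PHONE]
```

The calling service never learns which provider answered, which is exactly the point.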

If you are building for production, do not scatter provider API keys and direct model calls across your services. Build a Gateway.

⚡ Latency is the UX Killer (Streaming)

Yelp uses Server-Sent Events (SSE) to stream the response.
In the world of LLMs, "Time to First Token" (TTFT) is the metric that matters most for perceived speed.

If your backend takes 5 seconds to "think" (Search -> RAG -> Generation), the user will leave.
By streaming, Yelp displays:

"Thinking..."
"Searching for Italian places..." (Tool output)
"Here are a few options..." (LLM Token stream)

This progressive feedback (status updates streamed ahead of the final answer) keeps the user engaged even while the heavy lifting is happening in the background.
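
The three stages above map naturally onto an async generator that emits SSE-formatted chunks. The sketch below assumes made-up event names (`status`, `token`, `done`) and uses fake search and token functions in place of real services. Only the `event:` / `data:` wire format is standard SSE.

```python
import asyncio
import json


def sse_event(event: str, data: dict) -> str:
    # Server-Sent Events wire format: "event:" and "data:" lines, then a blank line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def fake_search(query: str) -> list:
    await asyncio.sleep(0.01)  # stands in for a slow search call
    return ["Trattoria A", "Osteria B"]


async def fake_llm_tokens(results: list):
    for token in ["Here ", "are ", "a ", "few ", "options: ", ", ".join(results)]:
        await asyncio.sleep(0)  # yield control, as a real token stream would
        yield token


async def stream_answer(query: str):
    # Emit a status event immediately so the client has something to render.
    yield sse_event("status", {"text": "Thinking..."})
    yield sse_event("status", {"text": "Searching for Italian places..."})
    results = await fake_search(query)
    async for token in fake_llm_tokens(results):
        yield sse_event("token", {"text": token})
    yield sse_event("done", {})


async def collect(query: str) -> list:
    # In a web framework you would return the generator as a streaming response
    # with content type text/event-stream; here we just gather the chunks.
    return [chunk async for chunk in stream_answer(query)]


events = asyncio.run(collect("cozy Italian spot"))
print(events[0], end="")
```

The first chunk goes out before any search or generation has started, which is what keeps TTFT low from the user's point of view.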

🔮 Future-Proofing: What We Learned

Yelp's architecture proves that the future of AI isn't about better models—it's about Better Systems.

To build like Yelp, follow these rules:

Don't Monolith: Break your AI logic into Router, Tools, and Generator.
Trust your Legacy Search: Hybrid search (Keyword + Vector) usually beats pure Vector search when queries carry hard filters like hours, price, or location.
Guardrails are Mandatory: Sanitize inputs and outputs.
Observability Wins: You can't fix what you can't measure.

Are you building a Chatbot or an Agent? There is a difference.

🗣️ Discussion

Do you trust "Function Calling" to handle critical business logic (like booking a reservation), or do you prefer traditional UI flows? Let me know in the comments! 👇

DEV Community