Seung Park

Posted on Mar 20

How I Built an AI Voice Agent That Answers Restaurant Phone Calls

#ai #typescript #openai #startup

Last year, I started building an AI voice agent designed to answer phone calls for restaurants. The idea came from a simple observation: restaurants miss a lot of phone calls during busy hours, and those missed calls translate directly into lost revenue.

I want to share the technical journey — what worked, what didn't, and the architecture I ended up with.

The Problem

I talked to about 30 restaurant owners before writing a single line of code. The pattern was consistent: during peak hours (lunch rush, dinner rush, weekends), staff are too busy serving in-house customers to answer the phone. Calls go to voicemail, and callers almost never leave messages. They just call the next place.

One owner told me he tracked missed calls for a week and found he was missing 8-12 calls per day during peak hours. At an average table value of $75, the math gets painful fast: if every one of those calls had turned into a table, that is $600-$900 a day.

Tech Stack

Here's what I landed on after several iterations:

Runtime: Node.js with TypeScript (Fastify framework)
Voice/Telephony: Twilio for call handling and phone numbers
AI: OpenAI's Realtime API for conversational AI
Calendar: Google Calendar API for reservation management
Database: PostgreSQL for tenant configuration and menu data
Hosting: AWS (ECS Fargate for containers)
POS Integration: Square and Toast APIs for menu sync

Architecture Overview

The system uses a multi-agent architecture. When a call comes in through Twilio, a triage agent first determines the caller's intent: are they trying to make a reservation, place a takeout order, ask about hours, or something else?

Based on intent classification, the call gets routed to a specialized agent:

Reservation Agent: Handles booking, modification, and cancellation. Checks Google Calendar for real-time availability.
Order Agent: Takes takeout/delivery orders with full menu knowledge pulled from the POS system.
Inquiry Agent: Answers questions about hours, menu items, specials, location, parking.
Feedback Agent: Records customer complaints and feedback for the owner.

Each agent has its own system prompt optimized for its specific task, with access to the restaurant's actual data (menu, hours, table configuration, etc.).
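To make the routing concrete, here is a minimal sketch of how a triage result could be mapped to a specialized agent. The names (Intent, AgentConfig, routeCall) and the prompt wording are hypothetical, not the production code, and the fallback to a human transfer for unrecognized intents is an assumption.

```typescript
// Hypothetical types and names for illustration only.
type Intent = "reservation" | "order" | "inquiry" | "feedback" | "unknown";

interface RestaurantData {
  name: string;
  hours: string;
}

interface AgentConfig {
  name: string;
  systemPrompt: string;
}

// One prompt builder per specialized agent, each fed the restaurant's own data.
const agents: Record<Exclude<Intent, "unknown">, (r: RestaurantData) => AgentConfig> = {
  reservation: (r) => ({
    name: "Reservation Agent",
    systemPrompt: `You book, modify, and cancel reservations for ${r.name}. Hours: ${r.hours}.`,
  }),
  order: (r) => ({
    name: "Order Agent",
    systemPrompt: `You take takeout and delivery orders for ${r.name}.`,
  }),
  inquiry: (r) => ({
    name: "Inquiry Agent",
    systemPrompt: `You answer questions about ${r.name}. Hours: ${r.hours}.`,
  }),
  feedback: (r) => ({
    name: "Feedback Agent",
    systemPrompt: `You record complaints and feedback for the owner of ${r.name}.`,
  }),
};

// Map the triage result to an agent; anything unrecognized goes to a person
// (assumed behavior for this sketch).
function routeCall(intent: Intent, restaurant: RestaurantData): AgentConfig | "transfer_to_human" {
  if (intent === "unknown") return "transfer_to_human";
  return agents[intent](restaurant);
}

const demoAgent = routeCall("order", { name: "Example Bistro", hours: "11:00-22:00" });
console.log(demoAgent);
```

Keeping each agent's prompt behind a small builder function means the triage step only has to produce one label, and per-restaurant data is injected in a single place.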

The Hardest Parts

1. Handling Natural Conversation

Restaurant callers don't speak in clean, structured sentences. They say things like "uh yeah can we get a table for like... four? no wait, five, my mother-in-law is coming too, so five, on Friday around sevenish?"

Getting the AI to handle this gracefully required a lot of prompt engineering and a robust confirmation step. The agent always repeats back what it understood before finalizing: "Let me confirm — a table for 5, this Friday at 7 PM?"
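The confirmation step can be sketched as a pure function over whatever structured draft the agent has extracted so far. This is an illustration under assumptions: the ReservationDraft shape and the function names are hypothetical, and the day label is assumed to be resolved already.

```typescript
// Hypothetical shape of what the agent has understood so far.
interface ReservationDraft {
  partySize: number;
  dayLabel: string; // e.g. "this Friday", assumed to be resolved upstream
  hour24: number;   // 0-23
  minute: number;   // 0-59
}

// Format a 24-hour time the way a host would say it on the phone.
function formatTime(hour24: number, minute: number): string {
  const suffix = hour24 >= 12 ? "PM" : "AM";
  const hour12 = hour24 % 12 === 0 ? 12 : hour24 % 12;
  if (minute === 0) return `${hour12} ${suffix}`;
  const mm = minute < 10 ? `0${minute}` : `${minute}`;
  return `${hour12}:${mm} ${suffix}`;
}

// Build the sentence the agent reads back before anything is booked.
function confirmationPrompt(d: ReservationDraft): string {
  return `Let me confirm: a table for ${d.partySize}, ${d.dayLabel} at ${formatTime(d.hour24, d.minute)}?`;
}

// Only hand the draft on for booking once the caller has explicitly agreed.
function finalize(d: ReservationDraft, callerSaidYes: boolean): ReservationDraft | null {
  return callerSaidYes ? d : null;
}

const draft: ReservationDraft = { partySize: 5, dayLabel: "this Friday", hour24: 19, minute: 0 };
console.log(confirmationPrompt(draft));
// prints "Let me confirm: a table for 5, this Friday at 7 PM?"
```

The point of the design is that the caller's self-corrections ("four? no wait, five") only ever update the draft; nothing is written until the read-back is accepted.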

2. Multi-Language Support

In cities like Vancouver (where I'm based), Toronto, or Los Angeles, callers speak dozens of languages. The OpenAI Realtime API handles language detection automatically, which was a huge win. The agent detects the caller's language and responds accordingly — no configuration needed.

3. Table Management Logic

Smart table management turned out to be more complex than expected. You need to handle table joining (a party of 8 might need two 4-tops pushed together), prevent double-booking, account for buffer time between seatings, and handle the restaurant's specific preferences about which tables can be combined.
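A minimal sketch of that logic, assuming times are stored as minutes since midnight and the restaurant supplies an explicit list of table combinations it allows. The names, the 15-minute buffer, and the "smallest single table first" preference are all assumptions for illustration.

```typescript
// All names and values are hypothetical; times are minutes since midnight.
interface DiningTable { id: string; seats: number; }
interface Booking { tableIds: string[]; startMin: number; endMin: number; }

const BUFFER_MIN = 15; // assumed turnover gap between seatings

// A table is free if each booking either does not use it, or is separated
// from the requested slot by at least the buffer on one side.
function isFree(tableId: string, startMin: number, endMin: number, bookings: Booking[]): boolean {
  return bookings.every((b) =>
    b.tableIds.indexOf(tableId) === -1 ||
    endMin + BUFFER_MIN <= b.startMin ||
    startMin >= b.endMin + BUFFER_MIN
  );
}

function findTables(
  partySize: number,
  startMin: number,
  endMin: number,
  tables: DiningTable[],
  joinable: string[][], // combinations the restaurant allows, e.g. [["T1", "T2"]]
  bookings: Booking[]
): string[] | null {
  const seatsById: Record<string, number> = {};
  tables.forEach((t) => { seatsById[t.id] = t.seats; });

  // Prefer the smallest single table that fits the party.
  const singles = tables
    .filter((t) => t.seats >= partySize && isFree(t.id, startMin, endMin, bookings))
    .sort((a, b) => a.seats - b.seats);
  const best = singles[0];
  if (best) return [best.id];

  // Otherwise try the allowed joins; every table in the join must be free.
  for (const combo of joinable) {
    const seats = combo.reduce((sum, id) => sum + (seatsById[id] || 0), 0);
    const allFree = combo.every((id) => isFree(id, startMin, endMin, bookings));
    if (seats >= partySize && allFree) return combo;
  }
  return null; // nothing available for this slot
}

const floor: DiningTable[] = [{ id: "T1", seats: 4 }, { id: "T2", seats: 4 }];
console.log(findTables(8, 19 * 60, 20 * 60 + 30, floor, [["T1", "T2"]], []));
```

Checking availability per table, rather than per booking, is what prevents a joined booking from double-booking one of its member tables.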

4. Menu OCR

Most restaurant owners don't have their menu in a structured digital format. They have a PDF or a photo. I built a pipeline that takes a menu image/PDF, runs OCR, and extracts items with prices and descriptions into structured data. This was necessary to make onboarding fast — the goal was under 30 minutes from signup to live.
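The extraction step after OCR can be sketched as a line parser. This is a simplified illustration, not the actual pipeline: it assumes the OCR stage has already produced plain text with one item per line, a price at the end of each item line, and an optional description on the following line. The ParsedMenuItem type and parseMenuText function are hypothetical names, and section headings or descriptions that end in a number would need extra handling.

```typescript
// Hypothetical output shape for one menu entry.
interface ParsedMenuItem {
  name: string;
  priceCents: number;
  description?: string;
}

// Item name, optional dot leaders or spaces, optional "$", then a price
// such as "14", "14.50", or "$14.50" at the end of the line.
const PRICE_LINE = /^(.+?)[\s.]*\$?(\d+)(?:\.(\d{2}))?\s*$/;

function parseMenuText(ocrText: string): ParsedMenuItem[] {
  const items: ParsedMenuItem[] = [];
  let last: ParsedMenuItem | null = null;

  for (const raw of ocrText.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;

    const m = PRICE_LINE.exec(line);
    if (m) {
      // Store money as integer cents to avoid floating-point rounding.
      const cents = parseInt(m[2], 10) * 100 + (m[3] ? parseInt(m[3], 10) : 0);
      last = { name: m[1].trim(), priceCents: cents };
      items.push(last);
    } else if (last !== null && last.description === undefined) {
      // First price-less line after an item is treated as its description.
      last.description = line;
    }
  }
  return items;
}

const sampleMenu = "Margherita Pizza ..... $14.50\nTomato, mozzarella, basil\nPad Thai 12";
console.log(parseMenuText(sampleMenu));
```

Real menus are messier than this (multi-column layouts, size variants, prices without decimals), so a parser like this is best treated as a first pass that the owner reviews during onboarding.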

What I Learned

Start with the narrowest possible use case. I initially tried to build a general-purpose voice agent. It was mediocre at everything. When I narrowed the focus to restaurants specifically, I could optimize every part of the experience for that domain.

Latency matters more than accuracy. In voice conversations, a 2-second pause feels like an eternity. I spent more time optimizing response latency than improving response accuracy. The OpenAI Realtime API was a game-changer here: previous approaches that chained speech-to-text → LLM → text-to-speech had too much latency, because each stage adds its own delay before the caller hears anything.

Restaurants are a surprisingly good market for AI. The problem is clear (missed calls = lost money), the ROI is measurable, and the target users (restaurant owners) don't need to be technical to benefit. Setup takes about 30 minutes, and the system handles calls autonomously from day one.

Current State

The system is live and handling calls for restaurants across the US and Canada. Around 85-90% of calls are handled fully automatically, and the rest are transferred to a human. The most common reason for a transfer is a request that requires human judgment, such as negotiating a large private event.

If you're interested in the project, I've been building it at RingFoods. Happy to answer questions about the architecture or implementation details in the comments.

This is my first post on DEV. I'll be writing more about the technical challenges of building voice AI systems for real-world applications. Follow along if that sounds interesting.
