Seung Park

Posted on Mar 21

Building a Real-Time Voice AI Agent for Restaurants with OpenAI and Twilio

#ai #openai #webdev #tutorial

I've been building voice AI systems for small businesses, and I wanted to share the architecture behind a real-time voice agent designed specifically for restaurants. This post walks through how we connected OpenAI's Realtime API with Twilio to create an AI that answers phone calls, handles reservations, and takes orders — all without human intervention.

The Problem

Restaurants miss 15-25% of incoming calls. During peak hours, that number can hit 40%. Every missed call is a lost reservation, a missed takeout order, or a frustrated customer who calls your competitor instead.

Hiring a dedicated phone person costs $2,500-4,000/month and only covers one shift. Answering services cost $500-1,500/month and mostly just take messages. We wanted to build something that could actually handle the calls — book reservations, take orders, answer questions — 24/7.

Architecture Overview

The system uses a triage agent pattern with specialized sub-agents:

Incoming Call (Twilio)
    → WebSocket Connection
        → Triage Agent (classifies intent)
            → Reservation Agent (books/modifies/cancels)
            → Order Agent (takes takeout/delivery orders)
            → Inquiry Agent (hours, menu, location)
            → Feedback Agent (complaints, suggestions)

Key Components

Twilio Voice + Media Streams — Handles the telephony layer. When a call comes in, TwiML instructs Twilio to open a WebSocket connection to our server, and Twilio streams the call audio over it as base64-encoded 8 kHz mu-law chunks.
OpenAI Realtime API — Processes the audio in real time. We use function calling to give the AI structured tools for booking reservations, checking availability, and so on.
Google Calendar Integration — Real-time sync for reservations. The AI checks availability before confirming any booking.
Menu OCR Pipeline — Restaurant owners upload a PDF or photo of their menu. We extract items, prices, and descriptions automatically.
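
To make the telephony layer more concrete, here is a minimal sketch of the two pieces Twilio needs: the TwiML that asks it to open a Media Stream, and a handler for the JSON messages it then sends over the WebSocket. The helper names (buildStreamTwiml, handleTwilioMessage, CallState) are hypothetical and not from our codebase, and the WebSocket server itself is left out. Only the start, media, and stop events are handled here.

```typescript
// Escape characters that would break an XML attribute value.
function escapeXmlAttr(value: string): string {
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;').replace(/</g, '&lt;');
}

// Hypothetical helper: TwiML that tells Twilio to connect the call
// to a WebSocket endpoint via <Connect><Stream>.
function buildStreamTwiml(wsUrl: string): string {
  return '<?xml version="1.0" encoding="UTF-8"?>' +
    `<Response><Connect><Stream url="${escapeXmlAttr(wsUrl)}" /></Connect></Response>`;
}

// Subset of the JSON messages Twilio sends over the Media Stream.
type TwilioMessage =
  | { event: 'connected' }
  | { event: 'start'; start: { streamSid: string; callSid: string } }
  | { event: 'media'; media: { payload: string } } // base64 audio chunk
  | { event: 'stop' };

// Hypothetical per-call state kept by our server.
interface CallState {
  streamSid: string | null;
  audioChunks: string[]; // base64 payloads waiting to be forwarded
  ended: boolean;
}

function newCallState(): CallState {
  return { streamSid: null, audioChunks: [], ended: false };
}

// Apply one raw WebSocket message to the call state.
function handleTwilioMessage(raw: string, state: CallState): CallState {
  const msg = JSON.parse(raw) as TwilioMessage;
  switch (msg.event) {
    case 'start':
      state.streamSid = msg.start.streamSid;
      break;
    case 'media':
      state.audioChunks.push(msg.media.payload);
      break;
    case 'stop':
      state.ended = true;
      break;
    default:
      break; // 'connected' and any other events are ignored in this sketch
  }
  return state;
}
```

In a real server the queued chunks would be forwarded to the speech model instead of accumulating in an array; the array just keeps the sketch easy to test.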

The Triage Pattern

The most important architectural decision was the triage pattern. Instead of one monolithic prompt trying to handle everything, we route calls to specialized agents:

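// Simplified triage logic: classify the caller's intent, then hand the call
// to the agent built for it. Agent is assumed to be the interface shared by
// all agent classes; the services passed in are created elsewhere in the app.
async function triageCall(transcript: string): Promise<Agent> {
  const intent = await classifyIntent(transcript);

  switch (intent) {
    case 'reservation':
      return new ReservationAgent(calendarService);
    case 'order':
      return new OrderAgent(menuService, posIntegration);
    case 'inquiry':
      return new InquiryAgent(restaurantInfo);
    case 'feedback':
      return new FeedbackAgent(notificationService);
    default:
      // Unrecognized intent: fall back to a general-purpose agent
      return new GeneralAgent();
  }
}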

Each agent has its own system prompt, tool definitions, and context. Because an agent only sees the instructions and tools for its own job, responses stay focused and, in our experience, hallucinations drop noticeably.
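
As a rough illustration of "its own system prompt and tool definitions", here is a minimal sketch of a per-intent config table. Everything in it is an assumption for illustration: the prompts are made up, the tool names other than check_availability and create_reservation are hypothetical, and classifyIntentByKeyword is a plain keyword matcher standing in for the real classifyIntent, which is not shown in this post.

```typescript
type Intent = 'reservation' | 'order' | 'inquiry' | 'feedback' | 'general';

interface AgentConfig {
  instructions: string; // system prompt scoped to one job
  toolNames: string[];  // only the tools this agent may call
}

// Illustrative prompts and tool names (mostly hypothetical, see lead-in).
const agentConfigs: Record<Intent, AgentConfig> = {
  reservation: {
    instructions: 'You book, modify, and cancel reservations. Confirm details before booking.',
    toolNames: ['check_availability', 'create_reservation'],
  },
  order: {
    instructions: 'You take takeout and delivery orders from the menu.',
    toolNames: ['place_order'],
  },
  inquiry: {
    instructions: 'You answer questions about hours, menu, and location.',
    toolNames: ['get_restaurant_info'],
  },
  feedback: {
    instructions: 'You record complaints and suggestions politely.',
    toolNames: ['log_feedback'],
  },
  general: {
    instructions: 'You greet the caller and ask how you can help.',
    toolNames: [],
  },
};

// Stand-in classifier: keyword matching, first matching rule wins.
const intentRules: Array<[Intent, string[]]> = [
  ['reservation', ['reservation', 'reserve', 'book a table', 'table for']],
  ['order', ['order', 'takeout', 'delivery', 'pickup']],
  ['feedback', ['complaint', 'feedback', 'suggestion']],
  ['inquiry', ['hours', 'open', 'menu', 'location', 'address']],
];

function classifyIntentByKeyword(transcript: string): Intent {
  const text = transcript.toLowerCase();
  for (const [intent, keywords] of intentRules) {
    if (keywords.some((k) => text.indexOf(k) !== -1)) return intent;
  }
  return 'general'; // nothing matched
}

// Pick the prompt and tool set for a call from its first utterance.
function configForCall(transcript: string): AgentConfig {
  return agentConfigs[classifyIntentByKeyword(transcript)];
}
```

The point of the table is that the reservation agent never sees ordering tools and vice versa, so a confused model has fewer wrong actions available to it.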

Handling Reservations

The reservation agent validates everything before confirming:

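// Tool definitions for OpenAI function calling. `parameters` must be a
// JSON Schema object, so the expected date/time formats go in descriptions.
const reservationTools = [
  {
    type: 'function',
    name: 'check_availability',
    description: 'Check if a specific date/time has open tables',
    parameters: {
      type: 'object',
      properties: {
        date: { type: 'string', description: 'Date as YYYY-MM-DD' },
        time: { type: 'string', description: '24-hour time as HH:MM' },
        party_size: { type: 'integer', minimum: 1 }
      },
      required: ['date', 'time', 'party_size']
    }
  },
  {
    type: 'function',
    name: 'create_reservation',
    description: 'Book a confirmed reservation',
    parameters: {
      type: 'object',
      properties: {
        date: { type: 'string', description: 'Date as YYYY-MM-DD' },
        time: { type: 'string', description: '24-hour time as HH:MM' },
        party_size: { type: 'integer', minimum: 1 },
        customer_name: { type: 'string' },
        phone: { type: 'string' },
        special_requests: { type: 'string' }
      },
      required: ['date', 'time', 'party_size', 'customer_name', 'phone']
    }
  }
];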

The AI naturally confirms details back to the caller: "Let me confirm — party of 4, this Friday at 7pm, under the name Johnson?"
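
To show what "validates everything before confirming" can look like on the server side, here is a minimal sketch of a handler for the two tool calls. It is not our production code: an in-memory object stands in for the Google Calendar lookup, TABLES_PER_SLOT is a made-up capacity, and handleToolCall is a hypothetical name. The key idea is that create_reservation re-checks availability itself instead of trusting an earlier check_availability call.

```typescript
interface ReservationArgs {
  date: string;       // YYYY-MM-DD
  time: string;       // HH:MM
  party_size: number;
  customer_name?: string;
  phone?: string;
  special_requests?: string;
}

const TABLES_PER_SLOT = 2; // hypothetical capacity per time slot
const bookedTables: Record<string, number> = {}; // "date time" -> tables booked

// Return an error message for malformed arguments, or null if they look fine.
function validateArgs(args: ReservationArgs): string | null {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(args.date)) return 'date must be YYYY-MM-DD';
  if (!/^\d{2}:\d{2}$/.test(args.time)) return 'time must be HH:MM';
  const size = args.party_size;
  if (typeof size !== 'number' || Math.floor(size) !== size || size < 1) {
    return 'party_size must be a positive integer';
  }
  return null;
}

// Execute a tool call from the model and return a JSON string result.
function handleToolCall(name: string, argsJson: string): string {
  const args = JSON.parse(argsJson) as ReservationArgs;
  const error = validateArgs(args);
  if (error) return JSON.stringify({ ok: false, error });

  const key = `${args.date} ${args.time}`;
  const available = (bookedTables[key] || 0) < TABLES_PER_SLOT;

  if (name === 'check_availability') {
    return JSON.stringify({ ok: true, available });
  }
  if (name === 'create_reservation') {
    if (!args.customer_name || !args.phone) {
      return JSON.stringify({ ok: false, error: 'customer_name and phone are required' });
    }
    // Re-check here: the slot may have filled since the availability check.
    if (!available) return JSON.stringify({ ok: false, error: 'slot is full' });
    bookedTables[key] = (bookedTables[key] || 0) + 1;
    return JSON.stringify({ ok: true, confirmed: true });
  }
  return JSON.stringify({ ok: false, error: `unknown tool: ${name}` });
}
```

Returning a structured error instead of throwing lets the model read the failure and ask the caller for a different time or a missing detail.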

Multi-Language Support

One unexpected win: the system automatically responds in the caller's language. OpenAI's Realtime API handles language detection natively. For restaurants in diverse cities, this is huge — no need to hire multilingual staff.

What We Learned

Things that work well:

Structured tool calling prevents most hallucination issues
The triage pattern keeps each agent focused and accurate
Real-time audio processing feels natural to callers (sub-second latency)
Automatic language detection is a massive differentiator

Things that need work:

Very noisy environments on the caller's end can cause transcription issues
Complex multi-party negotiations (event planning for 50+ people) still need human handoff
Some older callers are uncomfortable talking to an AI

Results

For a typical restaurant doing ~25 calls/day:

Missed calls dropped from ~20% to near 0%
~$1,200/month in recovered revenue from calls that would have gone to voicemail
Staff freed up from phone duty during peak hours
Setup time: ~30 minutes (connect calendar, upload menu, forward number)
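
As a quick sanity check on these numbers, the arithmetic below works out what the revenue figure implies per call. It uses only the figures above plus two assumptions of mine: a 30-day month, and that every previously missed call is now answered.

```typescript
// Figures quoted above
const callsPerDay = 25;
const missedPercent = 20;
const recoveredRevenuePerMonth = 1200; // USD

// Assumption (not a measured value): a 30-day month
const daysPerMonth = 30;

const missedCallsPerDay = (callsPerDay * missedPercent) / 100; // 5
const missedCallsPerMonth = missedCallsPerDay * daysPerMonth;   // 150
const impliedRevenuePerCall = recoveredRevenuePerMonth / missedCallsPerMonth; // 8

console.log(`${missedCallsPerMonth} missed calls/month, ~$${impliedRevenuePerCall} each`);
// prints "150 missed calls/month, ~$8 each"
```

So the $1,200 figure corresponds to roughly $8 of revenue per recovered call under those assumptions.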

Resources

If you're interested in building something similar or want to see how this works in practice:

How AI Phone Systems Reduce Missed Calls for Busy Restaurants — Deep dive into the missed call problem and how AI solves it
Virtual Receptionist vs AI Phone Agent for Restaurants — Comparison of different approaches by cost and capability

The full architecture handles edge cases I didn't cover here — call transfers, SMS confirmations, POS integration with Square and Toast, and more. Happy to answer questions in the comments.

What's your experience with voice AI in production? I'd love to hear about other real-world use cases.

DEV Community