voipbin

Posted on May 5

Build an AI IVR in an Afternoon — No Telephony Expertise Needed

#ai #voip #tutorial #webdev

Phone trees. We all hate them.

"Press 1 for sales. Press 2 for support. Press 3 for billing. Press 9 to repeat this menu."

Even the companies running them hate them. They are expensive to build, painful to maintain, and callers abandon them in frustration. But the alternative — routing every call to a human — does not scale.

For years, the only way to build a smarter IVR was to hire a telephony specialist, negotiate a contract with an enterprise platform like Avaya or Genesys, and spend months in integration hell.

That is no longer true.

What Changed: AI Understands Intent

The old IVR model was brittle because it depended on exact inputs: press a number, say a keyword, follow a script. One mismatch and the caller was lost.

Modern LLMs understand intent. A caller can say "I got double-charged last month" and an AI agent knows that means billing. It does not need a menu. It does not need a keyword. It listens, understands, and routes — or resolves the issue entirely.
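
To make the brittleness concrete, here is a minimal sketch of the kind of exact-match routing the old model relied on. The keyword table and the function name are made up for illustration and are not taken from any real IVR product.

```python
from typing import Optional

# Hypothetical keyword table, for illustration only.
KEYWORDS = {
    "billing": "BILLING",
    "invoice": "BILLING",
    "support": "SUPPORT",
    "sales": "SALES",
}

def keyword_route(utterance: str) -> Optional[str]:
    """Return a department only if an exact keyword appears in the utterance."""
    for word in utterance.lower().split():
        word = word.strip(".,!?")
        if word in KEYWORDS:
            return KEYWORDS[word]
    return None  # the caller is "lost": no keyword matched

print(keyword_route("I have a billing question"))
print(keyword_route("I got double-charged last month"))
```

The second utterance is clearly about billing, yet it contains no keyword from the table, so the matcher returns nothing. That gap is what intent classification with an LLM is meant to close.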

The remaining challenge is the plumbing: how do you connect an LLM to an actual phone call?

The Traditional IVR Stack (Why It Is Hard)

If you tried to build this from scratch, here is what you would need:

SIP trunk — a carrier connection to receive phone calls
Media server — to handle the RTP audio stream
ASR/STT pipeline — to convert speech to text in real time
NLU layer — to extract intent from the transcript
TTS engine — to synthesize the AI response back to speech
Call routing logic — to transfer, queue, or hang up
Monitoring and failover — because calls cannot go down at 2AM

Each of these is a non-trivial system. Telephony engineers spend careers specializing in just the media layer. Most backend developers have never touched SIP or RTP.

The result: AI IVR projects get stuck in the plumbing phase and never ship.

A Different Model: Offload the Telephony

VoIPBin handles the entire telephony stack — SIP, RTP, STT, TTS, DTMF — and exposes it through a simple REST API.

Your AI agent only ever sees text. It receives a caller's transcribed message via webhook, responds with text, and VoIPBin handles everything else: synthesizing the voice, playing it on the call, listening for the next utterance, and looping back.

Here is how the architecture looks:

Caller ──► VoIPBin (SIP/RTP/STT) ──► Webhook (your server)
                                          │
                                    LLM (GPT, Claude...)
                                          │
                              Text response back to VoIPBin
                                          │
                              VoIPBin (TTS) ──► Caller hears voice

Your server is just an HTTP endpoint. No telephony knowledge required.
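
Because the server only exchanges text, the conversation logic can be exercised without placing a phone call. The sketch below assumes the payload fields used by the handler later in this post (`call_id`, `speech_result`, `action`); `simulate_call` and `stub_handler` are hypothetical helpers, not part of VoIPBin.

```python
from typing import Callable, Dict, List

def simulate_call(handler: Callable[[Dict], Dict],
                  utterances: List[str],
                  call_id: str = "test-call") -> List[Dict]:
    """Feed an empty first turn, then each utterance, to a handler function.

    Stops as soon as the handler returns anything other than a "talk" action,
    since a transfer ends the AI part of the call.
    """
    responses = []
    for speech in [""] + utterances:
        reply = handler({"call_id": call_id, "speech_result": speech})
        responses.append(reply)
        if reply.get("action") != "talk":
            break
    return responses

def stub_handler(body: Dict) -> Dict:
    """Stand-in for the real LLM-backed handler, so the loop runs offline."""
    speech = body["speech_result"]
    if not speech:
        return {"action": "talk", "text": "How can I help you today?"}
    if "charged" in speech:
        return {"action": "transfer", "text": "Connecting you to billing.",
                "destination": "+18005550100"}
    return {"action": "talk", "text": "Could you tell me more?"}

for reply in simulate_call(stub_handler, ["I was double-charged in April"]):
    print(reply["action"], "-", reply["text"])
```

Swapping `stub_handler` for a function that wraps your real logic gives a quick regression test for prompts and routing.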

Building the AI IVR: Step by Step

Step 1: Get Your API Key

curl -X POST https://api.voipbin.net/v1.0/auth/signup \
  -H "Content-Type: application/json" \
  -d '{
    "name": "your-name",
    "email": "you@example.com",
    "password": "yourpassword"
  }'

The response includes accesskey.token — that is your API key for all subsequent requests.
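
If you script the signup, a small helper can pull the token out of the response body. This is a sketch that assumes the JSON nests the token as `{"accesskey": {"token": "..."}}`, which is inferred from the `accesskey.token` name above; check the actual response before relying on that shape.

```python
import json

def extract_token(response_body: str) -> str:
    """Return accesskey.token from a signup response body.

    The nesting is an assumption based on the field name; adjust it if the
    real response is shaped differently.
    """
    data = json.loads(response_body)
    token = (data.get("accesskey") or {}).get("token")
    if not token:
        raise ValueError("signup response did not contain accesskey.token")
    return token

# Hypothetical response body, for illustration only.
sample = '{"accesskey": {"token": "example-token"}}'
print(extract_token(sample))
```

Keeping the token in an environment variable rather than in source code makes it easier to rotate later.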

Step 2: Get a Phone Number

curl -X POST https://api.voipbin.net/v1.0/numbers \
  -H "Authorization: Bearer <your-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "country_code": "US",
    "number_type": "local"
  }'

Save the number — callers will dial this to reach your AI IVR.

Step 3: Build the Webhook Handler

When a call comes in, VoIPBin sends your server a webhook with the caller's speech transcribed to text. Your job is to:

Run the text through your LLM
Return a response with the next action

Here is a minimal Python example using FastAPI and the OpenAI Python SDK (the client reads your key from the OPENAI_API_KEY environment variable):

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

SYSTEM_PROMPT = """
You are an AI receptionist for Acme Corp.

When a caller describes their issue, classify it into one of:
- BILLING: payment issues, invoices, refunds
- SUPPORT: technical problems, bugs, how-to questions
- SALES: pricing, demos, new accounts
- OTHER: anything else

If you can resolve the issue directly with a short answer, do so.
Otherwise, tell the caller you are connecting them to the right team
and include the routing tag at the end of your message like [ROUTE:BILLING].
"""

@app.post("/ivr/webhook")
async def handle_call(request: Request):
    body = await request.json()
    caller_speech = body.get("speech_result", "")
    call_id = body.get("call_id", "")

    if not caller_speech:
        # First turn: greet the caller
        return JSONResponse({
            "action": "talk",
            "text": "Hello, thank you for calling Acme Corp. How can I help you today?",
            "webhook_url": "https://yourserver.com/ivr/webhook"
        })

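    # Process with LLM. Note: this synchronous call blocks the event loop while
    # it waits; to serve many calls at once, consider AsyncOpenAI with await.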
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": caller_speech}
        ]
    )

    ai_text = response.choices[0].message.content

    # Check if routing is needed
    if "[ROUTE:BILLING]" in ai_text:
        clean_text = ai_text.replace("[ROUTE:BILLING]", "").strip()
        return JSONResponse({
            "action": "transfer",
            "text": clean_text,
            "destination": "+18005550100"  # billing team
        })
    elif "[ROUTE:SUPPORT]" in ai_text:
        clean_text = ai_text.replace("[ROUTE:SUPPORT]", "").strip()
        return JSONResponse({
            "action": "transfer",
            "text": clean_text,
            "destination": "+18005550200"  # support team
        })
    elif "[ROUTE:SALES]" in ai_text:
        clean_text = ai_text.replace("[ROUTE:SALES]", "").strip()
        return JSONResponse({
            "action": "transfer",
            "text": clean_text,
            "destination": "+18005550300"  # sales team
        })
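    else:
        # No transfer needed: the AI answered directly or tagged the call as
        # OTHER. Strip that tag so it is not read aloud, then keep talking.
        clean_text = ai_text.replace("[ROUTE:OTHER]", "").strip()
        return JSONResponse({
            "action": "talk",
            "text": clean_text + " Is there anything else I can help you with?",
            "webhook_url": "https://yourserver.com/ivr/webhook"
        })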
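
The per-department branches above all share the same shape, so one way to keep new routes down to a single line is a lookup table plus a regular expression. This is a sketch, not the only way to structure it: `ROUTES`, `ROUTE_TAG`, and `build_action` are names introduced here, and the payload fields simply mirror the ones used in the handler.

```python
import re
from typing import Dict

# Same example destinations as in the handler; add a route by adding a line.
ROUTES = {
    "BILLING": "+18005550100",
    "SUPPORT": "+18005550200",
    "SALES": "+18005550300",
}

ROUTE_TAG = re.compile(r"\[ROUTE:([A-Z]+)\]")

def build_action(ai_text: str, webhook_url: str) -> Dict[str, str]:
    """Turn the LLM reply into a response payload, stripping any routing tag."""
    match = ROUTE_TAG.search(ai_text)
    clean_text = ROUTE_TAG.sub("", ai_text).strip()
    if match and match.group(1) in ROUTES:
        return {
            "action": "transfer",
            "text": clean_text,
            "destination": ROUTES[match.group(1)],
        }
    # No tag, or a tag without a destination (such as OTHER): keep talking.
    return {"action": "talk", "text": clean_text, "webhook_url": webhook_url}

print(build_action("Connecting you to billing now. [ROUTE:BILLING]",
                   "https://yourserver.com/ivr/webhook"))
```

Because the tag is removed with the same pattern that detects it, an unknown or unrouted tag is never read aloud to the caller.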

Step 4: Link the Number to Your Webhook

curl -X POST https://api.voipbin.net/v1.0/flows \
  -H "Authorization: Bearer <your-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "AI IVR Flow",
    "actions": [
      {
        "type": "webhook",
        "url": "https://yourserver.com/ivr/webhook",
        "method": "POST"
      }
    ]
  }'

Assign this flow to your phone number, and every incoming call is routed through your AI agent.

The Conversation Loop

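What makes this more powerful than a traditional IVR is that the conversation can carry state across turns, as long as your handler keeps the history (the Step 3 handler sends only the latest utterance to the LLM). The caller does not have to start over. They can say: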

"I was double-charged in April"

And if the AI asks a follow-up:

"What are the last four digits of the card that was charged?"

The caller answers, the AI gathers the information, and when it transfers to billing, it can pass along a structured summary of what was discussed.

You can maintain conversation history keyed to call_id:

# Simple in-memory store (use Redis in production)
call_history = {}

@app.post("/ivr/webhook")
async def handle_call(request: Request):
    body = await request.json()
    call_id = body.get("call_id")
    caller_speech = body.get("speech_result", "")

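    # Retrieve or initialize conversation history
    history = call_history.setdefault(call_id, [])

    if not caller_speech:
        # First turn: greet the caller instead of calling the LLM with no input
        greeting = "Hello, thank you for calling Acme Corp. How can I help you today?"
        history.append({"role": "assistant", "content": greeting})
        return JSONResponse({
            "action": "talk",
            "text": greeting,
            "webhook_url": "https://yourserver.com/ivr/webhook"
        })

    history.append({"role": "user", "content": caller_speech})

    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history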

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )

    ai_text = response.choices[0].message.content
    history.append({"role": "assistant", "content": ai_text})
    call_history[call_id] = history

    # ... routing logic as before
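
A plain dict like this grows for as long as the process runs, because nothing removes a call once it ends. Below is a minimal sketch of an in-memory store with idle expiry. `CallHistoryStore` and its methods are hypothetical names, and the one-hour default is an arbitrary assumption; a Redis key with a TTL would serve the same purpose in production.

```python
import time
from typing import Dict, List, Optional

class CallHistoryStore:
    """In-memory conversation history keyed by call_id, with idle expiry."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self.ttl_seconds = ttl_seconds
        self._histories: Dict[str, List[dict]] = {}
        self._last_seen: Dict[str, float] = {}

    def append(self, call_id: str, role: str, content: str,
               now: Optional[float] = None) -> List[dict]:
        """Add one turn and return the full history for that call."""
        now = time.monotonic() if now is None else now
        history = self._histories.setdefault(call_id, [])
        history.append({"role": role, "content": content})
        self._last_seen[call_id] = now
        return history

    def end_call(self, call_id: str) -> List[dict]:
        """Remove a call and return its history (empty if unknown)."""
        self._last_seen.pop(call_id, None)
        return self._histories.pop(call_id, [])

    def purge_idle(self, now: Optional[float] = None) -> int:
        """Drop calls with no activity for longer than the TTL."""
        now = time.monotonic() if now is None else now
        stale = [cid for cid, seen in self._last_seen.items()
                 if now - seen > self.ttl_seconds]
        for cid in stale:
            self.end_call(cid)
        return len(stale)
```

Calling `end_call` when a transfer is issued, and `purge_idle` periodically for calls that were abandoned, keeps memory bounded.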

What You Did Not Have to Do

Look at what is missing from this code:

No SIP stack
No RTP handling
No audio codec configuration
No jitter buffer tuning
No STT streaming pipeline
No TTS voice synthesis
No telephony carrier negotiation

You wrote an HTTP server that talks to an LLM. VoIPBin handled everything else.

Going Further

Once the basic IVR is working, a few extensions that are straightforward to add:

Post-call summary: When the call ends, VoIPBin sends a final webhook. Log the full transcript, run it through your LLM to generate a structured summary, and store it in your CRM.
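
As a rough sketch of the summary step, the stored history can be flattened into a transcript and wrapped in a summarization prompt. The prompt wording and the helper names here are assumptions, and the shape of VoIPBin's final webhook is not shown because it is not described in this post.

```python
from typing import Dict, List

# Hypothetical prompt; tune the keys to whatever your CRM expects.
SUMMARY_PROMPT = (
    "Summarize this phone call as JSON with the keys "
    "'category', 'issue', and 'next_step'."
)

def format_transcript(history: List[Dict[str, str]]) -> str:
    """Render chat-style history as a readable Caller/Agent transcript."""
    speakers = {"user": "Caller", "assistant": "Agent"}
    lines = [
        f"{speakers.get(turn['role'], turn['role'])}: {turn['content']}"
        for turn in history
    ]
    return "\n".join(lines)

def build_summary_messages(history: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Build the messages list to send to the chat completions call."""
    return [
        {"role": "system", "content": SUMMARY_PROMPT},
        {"role": "user", "content": format_transcript(history)},
    ]

history = [
    {"role": "user", "content": "I was double-charged in April"},
    {"role": "assistant", "content": "I can help with that."},
]
print(format_transcript(history))
```

The result of `build_summary_messages` can be passed as `messages` to the same `client.chat.completions.create` call used in the handler.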

Business hours logic: Check the current time in your webhook handler. Outside business hours, tell the caller and offer a callback option instead.
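
A minimal sketch of that check, assuming a Monday to Friday, 9:00 to 17:00 schedule. The hours, the default timezone, and the closed-office message are placeholders; for a real office, pass something like `zoneinfo.ZoneInfo("America/New_York")` as `office_tz` so daylight saving time is handled.

```python
from datetime import datetime, time, timezone, tzinfo
from typing import Dict

OPEN_AT = time(9, 0)    # assumed opening time
CLOSE_AT = time(17, 0)  # assumed closing time (exclusive)

def is_business_hours(now: datetime, office_tz: tzinfo = timezone.utc) -> bool:
    """True on weekdays between OPEN_AT and CLOSE_AT in the office timezone.

    `now` must be timezone-aware so the conversion is unambiguous.
    """
    local = now.astimezone(office_tz)
    return local.weekday() < 5 and OPEN_AT <= local.time() < CLOSE_AT

def after_hours_action(webhook_url: str) -> Dict[str, str]:
    """Payload that tells the caller the office is closed and offers a callback."""
    return {
        "action": "talk",
        "text": "Our office is closed right now. Would you like us to call you back?",
        "webhook_url": webhook_url,
    }

print(is_business_hours(datetime.now(timezone.utc)))
```

In the handler, this check would run before the LLM call, returning `after_hours_action(...)` when the office is closed.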

Voicemail fallback: If no agents are available, switch the action from transfer to record to capture a voicemail.
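
In code, the switch might look like the sketch below. The `record` action name comes from the paragraph above, but the fields of its payload are assumptions, and `agents_available` stands in for whatever availability check your team uses.

```python
from typing import Dict

def transfer_or_voicemail(clean_text: str, destination: str,
                          agents_available: bool) -> Dict[str, str]:
    """Transfer when someone can answer; otherwise ask for a voicemail."""
    if agents_available:
        return {
            "action": "transfer",
            "text": clean_text,
            "destination": destination,
        }
    # Assumed payload shape for the record action; verify against the API docs.
    return {
        "action": "record",
        "text": "All of our agents are busy. Please leave a message after the tone.",
    }

print(transfer_or_voicemail("Connecting you to billing.", "+18005550100", False))
```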

Sentiment escalation: If the caller seems frustrated (detectable via LLM prompt), skip the routing step and connect directly to a senior agent.
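
One possible way to wire this up is to have the LLM emit an extra tag and check for it before the normal routing. The `[ESCALATE]` tag, the prompt addition, and the senior agent number below are all hypothetical choices made for this sketch.

```python
import re
from typing import Dict, Optional

# Hypothetical addition to the system prompt.
ESCALATION_PROMPT = (
    "If the caller sounds angry or frustrated, add [ESCALATE] "
    "at the end of your message."
)

SENIOR_AGENT = "+18005550400"  # placeholder number, not from the original post
ANY_TAG = re.compile(r"\[(?:ROUTE:[A-Z]+|ESCALATE)\]")

def escalation_action(ai_text: str) -> Optional[Dict[str, str]]:
    """Return a transfer to a senior agent if the reply carries [ESCALATE].

    Returns None otherwise, so the caller can fall through to normal routing.
    """
    if "[ESCALATE]" not in ai_text:
        return None
    # Remove every tag so none of them is read aloud.
    clean_text = ANY_TAG.sub("", ai_text).strip()
    return {
        "action": "transfer",
        "text": clean_text,
        "destination": SENIOR_AGENT,
    }

print(escalation_action("I am sorry about that. [ROUTE:BILLING] [ESCALATE]"))
```

Appending `ESCALATION_PROMPT` to the system prompt and calling `escalation_action` first lets a frustrated caller bypass the department queues.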

All of these are just logic in your webhook handler — no telephony config changes needed.

The Bottom Line

Traditional IVR systems are expensive and inflexible because they were built in an era when telephony was the hard part. Today, the hard part is building good AI — and the telephony layer can be abstracted away entirely.

Your IVR is now just a Python function. Update the system prompt, and the routing behavior changes. Adding a new route takes a few lines. The entire system can be deployed and iterated on like any other web service.

If you want to try it:

Sign up: https://voipbin.net
API docs: https://api.voipbin.net/v1.0
MCP server: uvx voipbin-mcp — works in Claude Code and Cursor
Go SDK: go get github.com/voipbin/voipbin-go

The phone tree that took six months to build can now be replaced in an afternoon. And when your requirements change next week, you update a prompt — not a call flow diagram.

DEV Community