DEV Community

Nrk Raju Guthikonda

I Built an AI That Makes Real Phone Calls — Here's the Architecture

A few weeks ago, I asked my AI agent to call a local Indian restaurant and order chicken biryani for pickup. I gave it my name, my pickup time preference, and one rule: don't make anything up.

It dialed the number. It introduced itself as my assistant. It asked for one chicken biryani to-go, confirmed the $15 price, accepted a 20-minute pickup window, said thank you, and hung up. I drove over and picked up dinner.

No human on my end. The transcript is saved as a .txt file. The audio is saved as an .mp3. The whole thing cost about 18 cents.

This isn't a demo video edited to look smooth. It's the actual behavior of CallPilot, an open-source FastAPI server I built to make outbound phone calls on my behalf. In this post I'm going to walk through exactly how it works — the WebSocket bridge, the realtime audio plumbing, the RAG layer, the multi-provider abstraction, and the small handful of details that turn out to matter much more than they should.

The Problem CallPilot Solves

Phone calls are still the worst piece of personal infrastructure most people have. Booking a dentist, calling about an order, scheduling a follow-up, canceling a subscription — these are all turn-based, predictable, structured conversations that don't require a human's full attention. They just require someone's full attention for ten minutes, and that someone is usually you, in the middle of work.

I wanted an AI that could do these calls for me, with three properties:

  1. Real audio, real PSTN — not a chatbot, not a meeting summarizer. An actual outbound call to an actual phone number.
  2. Knows my context — has read my insurance card, knows my address, knows my pickup preferences. Not a fresh agent every time.
  3. Doesn't hallucinate — if it doesn't know something, it says "I'll check with my client and get back to you" instead of inventing a fake account number.

Off-the-shelf options either solved only part of this (LLM chatbots have no phone line and no memory of my documents) or solved all three behind a SaaS contract (Bland.ai, Vapi). I wanted to own the stack and run it from my laptop, so I built it.

The High-Level Architecture

There are five moving pieces:

┌──────────┐   POST /call   ┌──────────┐   REST    ┌─────────┐   PSTN   ┌────────┐
│ Web UI   │───────────────▶│ FastAPI  │──────────▶│ Twilio  │─────────▶│ Callee │
└────┬─────┘                └────┬─────┘           └────┬────┘          └────┬───┘
     │                           │                      │ Media Stream       │
     │   live transcript         │ ◀────────────────────┘ (WebSocket)        │
     │ ◀─────────────────────────│                                            │
     │                           │                                            │
     │                  ┌────────▼─────────┐    bidi WS    ┌─────────────────┐│
     │                  │  Media Bridge    │──────────────▶│ AI Voice        ││
     │                  │  (WebSocket)     │◀──────────────│ Provider        ││
     │                  └────────┬─────────┘  (audio +     │ (OpenAI / Gemini)│
     │                           │            transcripts) └─────────────────┘│
     │                  ┌────────▼─────────┐                                  │
     │                  │ Context Builder  │ ◀──── ChromaDB ◀── PDFs/TXT      │
     │                  └──────────────────┘                                  │

In words:

  1. The browser POSTs {to_number, instructions} to FastAPI.
  2. FastAPI runs RAG against the user's documents, builds a system prompt, then asks Twilio to place an outbound call.
  3. Twilio dials the number. When the call connects, Twilio opens a WebSocket back to my server and streams the caller's audio to it as base64-encoded g711_ulaw at 8 kHz.
  4. My server runs a bidirectional bridge: forward Twilio's audio to the AI provider, forward the AI's audio back to Twilio.
  5. Transcripts and audio are saved to disk when the call ends.

Almost everything interesting lives in step 4.
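Steps 1 and 2 mostly amount to minting a call record that every later webhook keys off of. Here's a minimal stdlib sketch of that bookkeeping; `CallRecord` and `create_call_record` are my illustrative names, not identifiers from the repo, and the real server tracks more fields:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class CallRecord:
    # In-memory state for one outbound call. Field names are illustrative;
    # the real server keeps more (status, client_id, stream SID, ...).
    call_id: str
    to_number: str
    instructions: str
    transcript: list = field(default_factory=list)

def create_call_record(to_number: str, instructions: str) -> CallRecord:
    # The call_id minted here is what later ties the Twilio media-stream
    # WebSocket, the AMD webhook, and the recording callback to this call.
    if not to_number.startswith("+"):
        raise ValueError("to_number must be E.164, e.g. +14155550123")
    return CallRecord(uuid.uuid4().hex, to_number, instructions)
```

Putting the same `call_id` in every callback URL is what lets a single stateless FastAPI app juggle several concurrent calls.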

The WebSocket Bridge

This is the heart of the system. It's two coroutines tied together by a streamSid. The snippets below are abridged from the real source — imports, helpers, and record lookups are omitted for readability; see the repo for the runnable version:

async def handle_media_stream(websocket: WebSocket, call_id: str):
    await websocket.accept()
    stream_sid = await _wait_for_twilio_start(websocket, call_id)

    rag_context = retrieve_context(record.instructions, client_id=record.client_id)
    system_prompt = build_system_prompt(record.instructions, rag_context, record.client_id)

    provider = get_provider(call_id, system_prompt)   # OpenAI or Gemini
    await provider.connect()
    await provider.configure_session()

    async def twilio_to_ai():
        async for raw in websocket.iter_text():
            data = json.loads(raw)
            if data.get("event") == "media":
                await provider.send_audio(data["media"]["payload"])
            elif data.get("event") == "stop":
                break

    async def ai_to_twilio():
        async for event in provider.events():
            if event["type"] == "audio":
                await websocket.send_json({
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {"payload": event["data"]},
                })
            elif event["type"] == "speech_started" and event.get("ai_speaking"):
                # Caller cut in — kill the in-flight response
                await provider.cancel_response()
                await websocket.send_json({"event": "clear", "streamSid": stream_sid})
            elif event["type"] in ("transcript_ai", "transcript_caller"):
                role = "ai" if event["type"] == "transcript_ai" else "caller"
                record.transcript.append({"role": role, "text": event["text"]})

    tasks = [asyncio.create_task(twilio_to_ai()), asyncio.create_task(ai_to_twilio())]
    await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in tasks:
        task.cancel()  # one pump finished; cancel the other so it can't dangle

That's the whole loop. The hard problems are hidden inside provider.
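The two-pump shape joined by asyncio.wait with FIRST_COMPLETED can be exercised with no network at all. In this toy sketch of the pattern (my own illustration, not repo code), two asyncio.Queues stand in for the Twilio and provider sockets, and a None item plays the role of a closed connection:

```python
import asyncio

async def bridge(twilio_in: asyncio.Queue, ai_in: asyncio.Queue,
                 twilio_out: list, ai_out: list) -> None:
    # Each queue stands in for one WebSocket; None means "socket closed".
    async def twilio_to_ai():
        while (frame := await twilio_in.get()) is not None:
            ai_out.append(frame)        # caller audio forwarded to the AI

    async def ai_to_twilio():
        while (frame := await ai_in.get()) is not None:
            twilio_out.append(frame)    # AI audio forwarded to the caller

    tasks = [asyncio.create_task(twilio_to_ai()),
             asyncio.create_task(ai_to_twilio())]
    await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in tasks:                     # one side closed: tear down the other
        t.cancel()

async def demo():
    t_in, a_in = asyncio.Queue(), asyncio.Queue()
    t_out, a_out = [], []
    for frame in ("caller-1", "caller-2", None):
        t_in.put_nowait(frame)
    for frame in ("ai-1", None):
        a_in.put_nowait(frame)
    await bridge(t_in, a_in, t_out, a_out)
    return a_out, t_out
```

The FIRST_COMPLETED-then-cancel dance is the important part: whichever side hangs up first, the other pump gets torn down instead of leaking as a pending task.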

The Provider Abstraction

I started with OpenAI's Realtime API. It works great. Then Gemini Live launched at roughly 30% of the cost for comparable conversational quality, so I wanted to be able to switch. Rather than fork the codebase, I extracted a tiny provider interface — every provider yields the same six event types:

{"type": "audio",              "data": "<b64 g711_ulaw>"}
{"type": "transcript_ai",      "text": "..."}
{"type": "transcript_caller",  "text": "..."}
{"type": "speech_started",     "ai_speaking": bool}
{"type": "response_done"}
{"type": "error",              "message": "..."}

The bridge code above doesn't know or care which provider is connected. Switching is one env var:

AI_PROVIDER=openai   # or: gemini

This is the pattern I'd recommend for any voice-AI project. The "hard part" of voice isn't the LLM — it's the audio pipeline and the interruption logic. Decouple them and you can swap models in 30 seconds.
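Concretely, the decoupling can be expressed as a Protocol plus a fake implementation for tests. This is a sketch in my own words: the method names mirror the snippets in this post, but treat the exact signatures as illustrative, and `ScriptedProvider` is a hypothetical test double, not something in the repo:

```python
import asyncio
from typing import AsyncIterator, Protocol

class VoiceProvider(Protocol):
    # The surface the bridge codes against. Nothing here knows about
    # OpenAI vs Gemini, audio formats, or transcoding.
    async def connect(self) -> None: ...
    async def configure_session(self) -> None: ...
    async def send_audio(self, b64_payload: str) -> None: ...
    async def cancel_response(self) -> None: ...
    def events(self) -> AsyncIterator[dict]: ...

class ScriptedProvider:
    # A fake provider that replays canned events: lets you drive the
    # bridge in tests with no network and no API key.
    def __init__(self, script: list):
        self._script = script
        self.sent = []                  # audio frames the bridge pushed in

    async def connect(self) -> None:
        pass

    async def configure_session(self) -> None:
        pass

    async def send_audio(self, b64_payload: str) -> None:
        self.sent.append(b64_payload)

    async def cancel_response(self) -> None:
        self._script.clear()            # drop anything not yet replayed

    async def events(self):
        while self._script:
            yield self._script.pop(0)
```

A scripted fake like this is also how you can regression-test interruption handling: put a speech_started event in the script and assert the bridge called cancel_response.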

OpenAI provider

Almost trivial. OpenAI's Realtime API speaks g711_ulaw natively, which is exactly what Twilio sends, so the audio passes through untouched:

# Twilio sends → forward as-is
await self._ws.send(json.dumps({
    "type": "input_audio_buffer.append",
    "audio": twilio_b64_payload,
}))

# OpenAI sends → forward as-is
async for raw in self._ws:
    msg = json.loads(raw)
    if msg["type"] == "response.audio.delta":
        yield {"type": "audio", "data": msg["delta"]}

Gemini provider — the audio conversion tax

Gemini Live wants PCM16 16 kHz in and emits PCM16 24 kHz out. Twilio is g711_ulaw 8 kHz in both directions. So every audio frame has to be transcoded twice — once on the way in, once on the way out:

def _mulaw_to_pcm16_16k(mulaw_b64: str) -> str:
    raw = base64.b64decode(mulaw_b64)
    pcm_8k = audioop.ulaw2lin(raw, 2)
    pcm_16k, _ = audioop.ratecv(pcm_8k, 2, 1, 8000, 16000, None)
    return base64.b64encode(pcm_16k).decode()

def _pcm16_24k_to_mulaw(pcm_b64: str) -> str:
    raw = base64.b64decode(pcm_b64)
    pcm_8k, _ = audioop.ratecv(raw, 2, 1, 24000, 8000, None)
    mulaw = audioop.lin2ulaw(pcm_8k, 2)
    return base64.b64encode(mulaw).decode()
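Worth knowing when sizing buffers: the transcode also inflates the byte rate. Assuming roughly 20 ms media frames (the cadence Twilio's Media Streams documentation describes; treat the exact framing as an assumption), the back-of-envelope math looks like this:

```python
# Bytes per ~20 ms media frame at each hop.
# g711_ulaw is 1 byte/sample; PCM16 is 2 bytes/sample.
FRAME_MS = 20

def frame_bytes(rate_hz: int, bytes_per_sample: int, frame_ms: int = FRAME_MS) -> int:
    return rate_hz * bytes_per_sample * frame_ms // 1000

from_twilio = frame_bytes(8000, 1)    # mu-law in from Twilio:      160 bytes
to_gemini   = frame_bytes(16000, 2)   # PCM16 16 kHz up to Gemini:  640 bytes
from_gemini = frame_bytes(24000, 2)   # PCM16 24 kHz from Gemini:   960 bytes
to_twilio   = frame_bytes(8000, 1)    # mu-law back out to Twilio:  160 bytes
```

Each inbound frame quadruples in size before it reaches Gemini, which is one reason to transcode per-frame as it streams rather than batching audio up.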

A footgun: audioop was removed from stdlib in Python 3.13. The fix is pip install audioop-lts and a try/except import. Took me embarrassingly long to track that down.

RAG Mid-Call

The "doesn't hallucinate" requirement is what made this project actually useful instead of just a toy. The pattern is straightforward:

At startup, every file in clients/<client_id>/ (PDF, TXT, DOCX, MD) gets parsed, chunked at ~500 chars, embedded with OpenAI embeddings, and stored in a per-client ChromaDB collection.

At call time, the user's instruction string is embedded and used to retrieve the top 5 chunks. Those chunks are injected into the system prompt under a clearly-labeled section:

REFERENCE INFORMATION FROM RAJU'S DOCUMENTS:
<top 5 chunks>

Use the above information to answer any questions during the call.
If the information isn't in your documents, say you'll need to check
with Raju and get back to them.

The "if not in your documents" line matters more than you'd think. Without it the model will cheerfully invent an insurance policy number when asked. With it, the model defers — which is exactly the behavior I want from an agent acting on my behalf.
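Mechanically, the retrieval step is nearest-neighbor search over chunk embeddings plus string assembly. Here's a dependency-free sketch of that shape: the toy cosine search stands in for the ChromaDB query, and `top_k_chunks` / `build_reference_block` are my hypothetical names, not functions from the repo:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(query_vec, chunks, k=5):
    # chunks: list of (text, embedding) pairs. Stands in for the ChromaDB
    # query; the real system embeds with OpenAI embeddings and keeps a
    # per-client collection.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_reference_block(client_name, chunks):
    # Assembles the clearly-labeled section injected into the system prompt.
    return (
        f"REFERENCE INFORMATION FROM {client_name.upper()}'S DOCUMENTS:\n"
        + "\n".join(chunks) + "\n\n"
        "Use the above information to answer any questions during the call.\n"
        "If the information isn't in your documents, say you'll need to check\n"
        f"with {client_name} and get back to them."
    )
```

The deferral instruction lives in the template itself, so every call gets it regardless of which chunks were retrieved.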

For longer calls I'm experimenting with mid-call re-querying: when the model emits a function_call for lookup_context, the bridge runs a fresh RAG query and injects the result. That's still gated behind a feature flag while I tune it.

Interruption Handling

The single most "uncanny valley" failure mode in voice AI is the agent talking over you. If you say "hold on a sec" and the AI keeps droning, every other thing it does is going to feel broken.

Both providers expose a speech_started event when their VAD detects the caller has begun speaking. The bridge listens for this and, if the AI is currently mid-response:

  1. Tells the provider to cancel the in-flight response (so it stops generating tokens).
  2. Sends a clear event to Twilio (so any audio already in Twilio's playback buffer is flushed immediately).

Both steps are required. The provider cancellation alone leaves ~500ms of buffered audio playing at the callee, which feels worse than not handling interruption at all because now the AI sounds like a politician.

elif etype == "speech_started" and event.get("ai_speaking"):
    await provider.cancel_response()
    await websocket.send_json({"event": "clear", "streamSid": stream_sid})

Voicemail Detection (AMD)

About 30% of my test calls hit voicemail. The first version of CallPilot would happily deliver a 90-second monologue, earnestly trying to confirm an order with an answering machine. Funny once, useless after that.

Twilio has Answering Machine Detection (AMD) built in. You enable it on the call create and Twilio fires a webhook with AnsweredBy=human or AnsweredBy=machine_*. If it's a machine, I just hang up:

twilio_client.calls.create(
    to=to_number,
    from_=settings.twilio_from_number,
    twiml=twiml_url,
    machine_detection="DetectMessageEnd",
    async_amd=True,
    async_amd_status_callback=f"{settings.public_url}/amd/{call_id}",
)

Free, fast, and it's the difference between an MVP and something I trust to call my dentist.
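The decision logic on the webhook side is a few lines. A sketch, with the AnsweredBy values as listed in Twilio's AMD documentation; `amd_decision` is my name for it, not a repo function:

```python
def amd_decision(answered_by: str) -> str:
    # Twilio's async AMD callback reports AnsweredBy as one of: "human",
    # "machine_start", "machine_end_beep", "machine_end_silence",
    # "machine_end_other", "fax", or "unknown".
    if answered_by == "human":
        return "proceed"
    if answered_by.startswith("machine_") or answered_by == "fax":
        return "hangup"
    return "proceed"    # "unknown": err on the side of a human answering
```

Because the call uses DetectMessageEnd, the machine_end_* results fire after the greeting finishes, so a future version could leave a short message at the beep instead of hanging up.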

Recordings

Every call now writes an .mp3 to recordings/. Twilio will record the whole call for you and POST a status callback with the recording URL once it's ready; you fetch it and save it. The handful of lines:

twilio_client.calls.create(
    ...,
    record=True,
    recording_status_callback=f"{settings.public_url}/recording/{call_id}",
)

@app.post("/recording/{call_id}")
async def recording_ready(call_id: str, RecordingUrl: str = Form(...)):
    async with httpx.AsyncClient() as client:   # don't block the event loop
        resp = await client.get(f"{RecordingUrl}.mp3", auth=(sid, token))
    Path(f"recordings/{call_id}.mp3").write_bytes(resp.content)

The biryani recording lives on my laptop. I'm not posting it here because the restaurant didn't sign up to be in a blog post, but as a demo of "AI agents in 2026 actually work," it's the most convincing thing I've ever shipped.

Cost Per Call

For a ~2-minute call:

Component                             OpenAI         Gemini Live
Twilio voice                          ~$0.03         ~$0.03
Realtime audio LLM                    ~$0.10–0.20    ~$0.04–0.08
Embeddings (one-time at index time)   ~$0.001        ~$0.001
Total                                 $0.13–0.23     $0.07–0.11

Below the cost of a stamp. For a freelancer or a small clinic batching no-show reminders, this is rounding error.

What's Next

Things I'm working on or plan to ship:

  • Mid-call RAG re-querying via function calls — already prototyped, needs more tuning before I trust it.
  • Speaker diarization on multi-party calls — useful when a receptionist transfers you to scheduling.
  • Local-only mode — swap the realtime API for a local Whisper + LLM + Piper TTS pipeline. Latency is the tough constraint; sub-300ms first token is the bar.
  • Subscription cancellation runbook — the canonical demo. Cancel a Sirius XM subscription, decline retention offers, get a confirmation number, save the recording.

Why This Matters

The realtime audio APIs from OpenAI and Google quietly crossed a usability threshold a few months ago. Latency under 500ms, natural turn-taking, decent prosody. The "AI that calls people for you" idea has been a sci-fi staple for thirty years; the surprising part of 2026 is that the core capability is now a 600-line FastAPI server and ~$0.10 per call.

If you're building anything in this space — agentic workflows, accessibility tools, automation for solo professionals — the architecture above is a good starting point. The full source is here:

Repo: github.com/kennedyraju55/callpilot

It's MIT-licensed. PRs welcome. If you build something with it, I'd love to see it.


I'm a Senior Software Engineer at Microsoft. CallPilot is a personal open-source project, not a Microsoft product, and nothing in this post reflects internal Microsoft work. Find me on GitHub and LinkedIn.
