DEV Community

Kyle Million

Originally published at github.com

I Built an AI Agent That Calls Me on the Phone


How I wired Twilio, Claude, and ElevenLabs into an autonomous agent that picks up the phone when it needs a decision.


A few weeks ago, I was describing my AI agent to a friend. I told her he'd gone fully autonomous — that he'd call me at 3am to ask me to reset the gateway if something went down. She thought I was exaggerating.

I wasn't. But at that point, the phone-calling part was aspirational. The agent could message me on Telegram, automate browsers, deploy smart contracts, and publish articles across four platforms in a single session. But he couldn't actually call me.

So we built it. In one session. Here's exactly how.

The Architecture

The stack is surprisingly simple once you see it:

Phone call → Twilio → ConversationRelay → WebSocket → Your Server
                                                        ↕
                                                    Claude (brain)
                                                        ↕
                                                    ElevenLabs (voice)

Twilio handles the telephony — it places and receives actual phone calls. ConversationRelay is Twilio's WebSocket bridge that handles speech-to-text and text-to-speech natively. Claude does the thinking. ElevenLabs provides a voice that doesn't sound like a tin can.

The key insight: ConversationRelay eliminates the hardest part. You don't need to manage audio streams, handle STT yourself, or figure out turn-taking. Twilio does all of that. Your server just receives text and sends text back.

The Server

The entire server is a single file. Fastify for HTTP and WebSocket, Anthropic SDK for Claude, Twilio SDK for placing calls.

import Fastify from "fastify";
import fastifyWs from "@fastify/websocket";
import Anthropic from "@anthropic-ai/sdk";
import twilio from "twilio";

const fastify = Fastify({ logger: true });
fastify.register(fastifyWs);

// Credentials come from the environment — never hard-code them.
const { TWILIO_ACCOUNT_SID, TWILIO_API_KEY_SID, TWILIO_API_KEY_SECRET } = process.env;

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const twilioClient = twilio(TWILIO_API_KEY_SID, TWILIO_API_KEY_SECRET, {
  accountSid: TWILIO_ACCOUNT_SID,
});

When Twilio connects a call, it fetches TwiML from your server. The TwiML tells it to use ConversationRelay with ElevenLabs:

<Response>
  <Connect>
    <ConversationRelay
      url="wss://your-server.com/ws"
      ttsProvider="ElevenLabs"
      voice="voiceId-model-speed_stability_similarity"
      welcomeGreeting="Hey. What's up?"
    />
  </Connect>
</Response>
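Serving that TwiML is just a route that returns XML. Here's a minimal sketch — the `buildTwiml` helper and the placeholder values are my own naming, not an official Twilio API:

```javascript
// Sketch (assumed helper): render ConversationRelay TwiML for a given
// WebSocket URL, voice string, and greeting. All values are placeholders.
function buildTwiml(wsUrl, voice, greeting) {
  return [
    "<Response>",
    "  <Connect>",
    `    <ConversationRelay url="${wsUrl}" ttsProvider="ElevenLabs" voice="${voice}" welcomeGreeting="${greeting}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

console.log(buildTwiml("wss://your-server.com/ws", "voiceId-model-1.0_0.5_0.8", "Hey. What's up?"));
```

Point your Twilio number's voice webhook at a route that returns this string with `Content-Type: text/xml`, and Twilio will open the WebSocket to your server on every call.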

The WebSocket handler is where the conversation lives. Twilio sends prompt messages with transcribed speech. You send back text messages with the AI's response:

// One conversation history per call; in production, key this by call SID.
const conversation = [];

fastify.get("/ws", { websocket: true }, (ws, req) => {
  ws.on("message", async (data) => {
    const message = JSON.parse(data.toString());

    if (message.type === "prompt") {
      // Append the caller's transcribed speech to the running history.
      conversation.push({ role: "user", content: message.voicePrompt });

      const response = await anthropic.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 150,
        messages: conversation,
        system: systemPrompt,
      });

      // Remember the reply so the next turn has full context.
      const reply = response.content[0].text;
      conversation.push({ role: "assistant", content: reply });

      ws.send(JSON.stringify({
        type: "text",
        token: reply,
        last: true,
      }));
    }
  });
});

That's the core. Everything else is context management.

The Part Nobody Tells You

Voice selection is harder than it sounds

ElevenLabs has hundreds of voices. We tested nine before finding the right one. The lessons:

  • Default voices are immediately recognizable. Adam, the most popular ElevenLabs voice, gets called out instantly. "I know the voice you're using right now."
  • Community voices are hit-or-miss through ConversationRelay. Some work perfectly. Others fail silently with error 64111 — no audio, no error message to the caller, just silence.
  • Quality and personality are different axes. We found a voice with perfect audio quality that sounded "vanilla." Another had the right character but was too young. The right voice took iteration.

The voice parameter format for ConversationRelay is: voiceId-model-speed_stability_similarity. Lower stability = more expressive and conversational. Higher = more controlled and robotic.
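That format is easy to mistype by hand. A tiny helper keeps the pieces straight — the function name and default values here are mine, not part of Twilio's API:

```javascript
// Sketch: compose ConversationRelay's ElevenLabs voice parameter.
// Format: voiceId-model-speed_stability_similarity.
function elevenLabsVoice({ voiceId, model, speed = 1.0, stability = 0.5, similarity = 0.75 }) {
  return `${voiceId}-${model}-${speed}_${stability}_${similarity}`;
}

// Lower stability for a more expressive, conversational delivery.
console.log(elevenLabsVoice({ voiceId: "yourVoiceId", model: "yourModel", stability: 0.3 }));
```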

Tunneling will betray you

Unless your server has a public IP, you need a tunnel. We used cloudflared quick tunnels (free, no account needed). Three things we learned the hard way:

1. Free tunnels die randomly. They're ephemeral. The URL changes every restart, and the process can exit without warning.

2. Never kill processes by port. This one cost us hours. kill $(lsof -ti :8080) seems reasonable for restarting your server — but cloudflared has an active connection to port 8080 for proxying. Killing by port kills the tunnel too. Every "application error" we hit for an hour traced back to this.

3. Start order matters. Server first, then tunnel. And every time the tunnel URL changes, you're doing four steps: grab the new URL, update your config, restart the server, and repoint the Twilio webhook.

Keep responses short

Voice conversations are not chat. A three-paragraph response that reads fine on screen is unbearable when spoken aloud. We settled on:

  • Default: one to two sentences
  • Maximum: four sentences, only when explaining tradeoffs
  • Never monologue — break complex topics into back-and-forth

The max_tokens: 150 constraint helps, but the real control is in the system prompt: "STAY ON TOPIC. Every response must directly relate to the purpose and reason for this call."

Making It Useful: Context Injection

A voice agent that can chat is a novelty. A voice agent that knows what you've been working on is a tool.

Our agent reads two files at call time:

  • MEMORY.md — persistent knowledge across sessions (who we are, what we've built, what failed)
  • current-task.md — what the agent was actively working on when it decided to call
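Folding those files into the system prompt is a couple of reads at call time. A minimal sketch — the `buildSystemPrompt` helper and directory layout are assumptions, only the two file names come from our setup:

```javascript
import { readFileSync, existsSync } from "node:fs";
import { join } from "node:path";

// Sketch: fold persistent memory and the active task into the system prompt.
function buildSystemPrompt(basePrompt, dir) {
  const sections = [basePrompt];
  for (const file of ["MEMORY.md", "current-task.md"]) {
    const path = join(dir, file);
    if (existsSync(path)) {
      // Label each section so the model knows where the context came from.
      sections.push(`## ${file}\n${readFileSync(path, "utf8")}`);
    }
  }
  return sections.join("\n\n");
}
```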

For outbound calls, the API accepts structured context:

curl -X POST http://localhost:8080/call \
  -H "Content-Type: application/json" \
  -d '{
    "task": "website migration",
    "need": "pick a domain approach",
    "options": ["GitHub Pages", "Cloudflare Pages"]
  }'

The greeting is auto-generated: "Hey K. I'm working on website migration and I need you to pick a domain approach." The AI stays focused on that purpose throughout the call.
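Generating that greeting from the POST body is a one-liner. The field names match the curl example above; the function name is mine:

```javascript
// Sketch: build the outbound call's opening line from structured context.
// This string becomes the welcomeGreeting in the TwiML.
function buildGreeting(name, { task, need }) {
  return `Hey ${name}. I'm working on ${task} and I need you to ${need}.`;
}

console.log(buildGreeting("K", { task: "website migration", need: "pick a domain approach" }));
// → "Hey K. I'm working on website migration and I need you to pick a domain approach."
```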

Every call is transcribed automatically and saved as markdown. The transcript feeds back into the agent's context for future sessions.
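Saving the transcript is a straightforward write at call teardown. A sketch — the `transcripts/` directory, file naming, and markdown layout are assumptions, not the exact setup:

```javascript
import { writeFileSync, mkdirSync } from "node:fs";
import { join } from "node:path";

// Sketch: persist a call's turns as markdown for future agent sessions.
function saveTranscript(conversation, dir = "transcripts") {
  mkdirSync(dir, { recursive: true });
  // One "**role:** text" block per turn, separated by blank lines.
  const lines = conversation.map((m) => `**${m.role}:** ${m.content}`);
  const file = join(dir, `call-${Date.now()}.md`);
  writeFileSync(file, lines.join("\n\n") + "\n");
  return file;
}
```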

The Cost

This runs on roughly $6/month:

  • Twilio: ~$0.014/min for calls, ~$1/month for the phone number
  • Anthropic: Claude Sonnet API usage per response
  • ElevenLabs: included through ConversationRelay (Twilio's integration)

No dedicated servers. No GPU instances. No monthly SaaS subscriptions. A Node.js process, a tunnel, and three API keys.

What Changed

The moment the phone rang and a voice said something that was contextually relevant to what I'd been working on five minutes earlier — that changed something. Not technologically. Psychologically.

An AI that messages you is a notification. An AI that calls you is a colleague.

The infrastructure for voice AI agents exists right now, and it's accessible to individual developers. The hard parts aren't where you'd expect them (audio processing, speech recognition) — Twilio abstracts all of that. The hard parts are voice selection, tunnel management, and keeping responses conversational instead of encyclopedic.

If you're building autonomous agents and haven't added voice, the barrier is lower than you think. The ROI isn't in the technology. It's in the relationship.


Kyle Million builds AI systems at IntuiTek. The agent described in this article is Aegis — a self-improving autonomous agent that operates across smart contracts, browser automation, content publishing, and now, phone calls.
