DEV Community

David
David

Posted on

How I built a voice AI agent that answers phone calls and remembers callers

I got tired of seeing small businesses miss calls. So I built Vokio — a voice AI agent that answers real phone calls, remembers callers between sessions, and generates a post-call summary automatically.

Here's the full architecture and how I solved the hard parts.

Stack

  • Python + Flask (webhook server)
  • Vapi (telephony + STT)
  • Claude Haiku (conversation)
  • Deepgram Nova 3 (Spanish STT)
  • Azure TTS (Spanish voices)
  • SQLite (memory between calls)

How it works

Vapi handles the phone call and speech recognition. Instead of using a built-in LLM, I configured Vapi to use a Custom LLM — pointing it to my Flask server. Every time the caller says something, Vapi sends a POST request to my endpoint and expects an OpenAI-compatible streaming response.

@app.route("/chat/completions", methods=["POST"])
def chat():
    data = request.json or {}
    phone = data["call"]["customer"]["number"]
    last_message = [m for m in data["messages"] if m["role"] == "user"][-1]["content"]
    response = generate_response(phone, last_message)
    # return SSE streaming response
Enter fullscreen mode Exit fullscreen mode

The memory system

This is the interesting part. Every caller is stored in SQLite by phone number. When a call comes in, I query the DB and inject the caller's history into the system prompt.

def generate_response(phone: str, user_message: str) -> str:
    caller = get_caller(phone)
    history = get_history(phone)

    if caller and caller.get("name"):
        system += f"\n\nYou already know this caller, their name is {caller['name']}."

    if caller and caller.get("sentiment") == "frustrated":
        system += "\n\nNOTE: This caller was frustrated last time. Be especially patient."
Enter fullscreen mode Exit fullscreen mode

If the caller was frustrated last time, Claude knows and adjusts its tone automatically.

Post-call analysis

When the call ends, Vapi sends a webhook to /end-call. I pass the transcript to Claude and get back:

  • 2-line summary
  • Urgency score (1-5)
  • Required action ("call back today", "send quote", "none")
  • Customer sentiment (satisfied / neutral / frustrated)

All stored in SQLite. The business knows at a glance which calls need attention today.

The tricky parts

Streaming is required. Vapi expects SSE streaming responses. If you return regular JSON the call hangs up immediately. Took me a while to figure that out.

Tool use + verbal response. When Claude uses a tool (like saving the caller's name) without producing text, you get an empty response. I added a follow-up API call to get the verbal response without Claude announcing what it just saved.

dotenv loading. If environment variables already exist in the system, load_dotenv() won't override them. Use override=True.

Result

A 5-minute call costs approximately €0.28. The agent answers in Spanish, detects if the caller switches to English or Catalan, and responds in the same language automatically.

I packaged it as a ready-to-deploy template with 6 business sector prompts included (dental clinic, restaurant, hair salon, real estate, mechanic, hostel).

https://kitbot.gumroad.com/l/vokio

Top comments (0)