Most tutorials about AI voice calling assume you already know telephony. They throw terms like SIP trunking, RTP streams, codec negotiation, and DTMF detection at you before you've even written a line of code.
This isn't that tutorial.
In 10 minutes, you'll have a real AI voice call running — a bot that picks up, talks back with synthesized speech, and hangs up cleanly. No telephony background required.
## What You're Building
A simple flow:
- You trigger an outbound call via a REST API
- VoIPBin connects the call and hits your webhook
- Your webhook returns instructions: "speak this text, then wait for input"
- VoIPBin handles all the audio — STT, TTS, RTP — and sends transcriptions back to you
- You respond with the next action
Your code never touches audio. It just speaks HTTP.
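That whole contract can be sketched as plain data. This is a minimal sketch using the event type and action field names from the steps later in this post; it's just the shape of the conversation, not a VoIPBin SDK:

```python
# A sketch of the webhook contract: VoIPBin POSTs an event, your
# webhook returns an ordered list of actions. Field names here match
# the examples later in this tutorial.

def respond(event: dict) -> dict:
    if event.get("type") == "call.answered":
        # Call just connected: greet, then wait for speech.
        return {"actions": [
            {"type": "talk", "text": "Hello!", "language": "en-US", "gender": "female"},
            {"type": "listen", "timeout": 5000},
        ]}
    # Anything unexpected: end the call cleanly.
    return {"actions": [{"type": "hangup"}]}

print(respond({"type": "call.answered"}))
```

Everything after this point is just filling in that `respond` function.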
## Prerequisites
- Any language that can run an HTTP server (examples below use Python)
- A publicly reachable URL for your webhook (use ngrok if you're local)
- 10 minutes
## Step 1: Sign Up and Get Your API Key
```bash
curl -s -X POST "https://api.voipbin.net/v1.0/auth/signup" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "your-name",
    "email": "you@example.com",
    "password": "yourpassword"
  }'
```
Response:
```json
{
  "accesskey": {
    "token": "YOUR_API_TOKEN_HERE"
  }
}
```
No email confirmation. No OTP. Just a token. Keep it somewhere safe; every request below sends it as a Bearer token.
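One habit worth adopting right away: read the token from an environment variable instead of pasting it into code. A minimal sketch (the `VOIPBIN_API_TOKEN` variable name is my own convention, not something VoIPBin requires):

```python
import os

# export VOIPBIN_API_TOKEN=... in your shell; never commit the token.
API_TOKEN = os.environ.get("VOIPBIN_API_TOKEN", "YOUR_API_TOKEN_HERE")

# Reused by every authenticated request in this tutorial.
HEADERS = {
    "Authorization": f"Bearer {API_TOKEN}",
    "Content-Type": "application/json",
}
```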
## Step 2: Write Your Webhook Handler
This is the core of your AI bot. VoIPBin calls this URL when the call connects, and whenever speech is transcribed.
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/call-webhook", methods=["POST"])
def handle_call():
    data = request.json
    print("Incoming event:", data)

    call_type = data.get("type")

    if call_type == "call.answered":
        # Call just connected — greet the user
        return jsonify({
            "actions": [
                {
                    "type": "talk",
                    "text": "Hello! This is your AI assistant. How can I help you today?",
                    "language": "en-US",
                    "gender": "female"
                },
                {
                    "type": "listen",
                    "timeout": 5000
                }
            ]
        })
    elif call_type == "call.speech":
        transcript = data.get("transcript", "")
        print(f"User said: {transcript}")
        # Simple echo response — replace with your AI logic
        return jsonify({
            "actions": [
                {
                    "type": "talk",
                    "text": f"You said: {transcript}. Thanks for calling. Goodbye!",
                    "language": "en-US",
                    "gender": "female"
                },
                {
                    "type": "hangup"
                }
            ]
        })

    # Default: hang up
    return jsonify({"actions": [{"type": "hangup"}]})

if __name__ == "__main__":
    app.run(port=5000)
```
Run it:
```bash
pip install flask
python app.py
```
Expose it with ngrok:
```bash
ngrok http 5000
# Copy the https URL, e.g. https://abc123.ngrok.io
```
## Step 3: Make the Call
Now trigger an outbound call. Replace the values with your token, webhook URL, and a real destination number:
```bash
curl -s -X POST "https://api.voipbin.net/v1.0/calls" \
  -H "Authorization: Bearer YOUR_API_TOKEN_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "flow_id": null,
    "source": {
      "type": "sip",
      "target": "sip:bot@voipbin.net"
    },
    "destination": {
      "type": "tel",
      "target": "+15551234567"
    },
    "webhook_url": "https://abc123.ngrok.io/call-webhook"
  }'
```
VoIPBin dials the number. When it answers, your webhook gets called. The bot speaks, listens, echoes back what was said, and hangs up.
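If you'd rather trigger the call from Python than curl, the same request can be built with the standard library. This mirrors the curl payload above; nothing is sent over the network until you call `urlopen`:

```python
import json
import os
import urllib.request

# Same payload as the curl example; swap in your own number and webhook URL.
payload = {
    "flow_id": None,
    "source": {"type": "sip", "target": "sip:bot@voipbin.net"},
    "destination": {"type": "tel", "target": "+15551234567"},
    "webhook_url": "https://abc123.ngrok.io/call-webhook",
}

def build_call_request(token: str) -> urllib.request.Request:
    # Build the POST without sending it; urlopen() below actually dials.
    return urllib.request.Request(
        "https://api.voipbin.net/v1.0/calls",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_call_request(os.environ.get("VOIPBIN_API_TOKEN", "YOUR_API_TOKEN_HERE"))
# Uncomment to actually place the call:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```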
That's it.
## What Just Happened (Under the Hood)
Here's why this was so simple:
You wrote zero audio code. VoIPBin handled:
- Dialing via PSTN
- Encoding and streaming RTP audio
- Speech-to-text (STT) on the caller's voice
- Text-to-speech (TTS) for your bot's responses
- Session state and timing
Your server only processed JSON and returned JSON. This is VoIPBin's Media Offloading model — your AI logic runs as a stateless HTTP service, and the telephony infrastructure handles everything media-related.
This matters a lot when you scale. Your webhook can be a serverless function, a container, or any backend. No persistent connections. No WebSocket management. No codec knowledge needed.
## Swap in Real AI Logic
The echo response is a placeholder. Replacing it with GPT-4o, Claude, or any LLM is straightforward:
```python
import openai

client = openai.OpenAI()

def ai_response(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful phone assistant. Keep responses short and clear."},
            {"role": "user", "content": transcript}
        ]
    )
    return response.choices[0].message.content
```

Then, in your webhook, replace the echo branch:

```python
elif call_type == "call.speech":
    transcript = data.get("transcript", "")
    reply = ai_response(transcript)
    return jsonify({
        "actions": [
            {"type": "talk", "text": reply, "language": "en-US", "gender": "female"},
            {"type": "listen", "timeout": 5000}
        ]
    })
```
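One practical note: on a live call, an LLM error or an overly long answer shouldn't strand the caller in silence. Here's a small wrapper (my own sketch, not part of VoIPBin or OpenAI) that works with any `ai_response`-style callable, falling back to a canned reply and capping reply length for speech:

```python
def safe_reply(
    transcript: str,
    generate,
    fallback: str = "Sorry, I'm having trouble right now. Please try again later.",
    max_chars: int = 300,
) -> str:
    # generate is any callable like ai_response above. If it raises or
    # returns nothing, the caller still hears something instead of dead air.
    try:
        reply = (generate(transcript) or "").strip()
    except Exception:
        return fallback
    if not reply:
        return fallback
    # Spoken replies should stay short; truncate overly long answers.
    return reply[:max_chars]
```

In the webhook branch above, `reply = safe_reply(transcript, ai_response)` is a drop-in replacement.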
Now you have a real AI phone assistant — no telephony expertise required, no audio handling, no infrastructure to manage.
## What's Next
From here, you can extend this pattern in many directions:
- Multi-turn conversations — maintain chat history between turns using a session store
- Inbound calls — assign a phone number (or use a Direct Hash SIP URI for no-number setups) and handle incoming calls the same way
- Structured data extraction — after each call, log the transcript and run a summarization pass
- Outbound campaigns — loop over a list of numbers and trigger calls programmatically
The core loop stays the same: webhook in, JSON out.
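For the multi-turn case, the simplest session store is an in-memory dict keyed by a call identifier. This is a sketch assuming each webhook event carries some call ID you can key on (check the VoIPBin docs for the exact field name); swap the dict for Redis or a database once you run more than one webhook instance:

```python
from collections import defaultdict

# History of {"role", "content"} turns per call, LLM-message shaped.
SESSIONS: dict = defaultdict(list)

def remember(call_id: str, role: str, text: str) -> list:
    # Append one turn and return the full history to use as the LLM prompt.
    SESSIONS[call_id].append({"role": role, "content": text})
    return SESSIONS[call_id]

def end_call(call_id: str) -> None:
    # Drop state when the call ends so memory doesn't grow unbounded.
    SESSIONS.pop(call_id, None)
```

In the `call.speech` branch you'd call `remember(call_id, "user", transcript)`, pass the returned history to your LLM, then `remember(call_id, "assistant", reply)` before responding.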
## Resources
- VoIPBin API docs
- MCP Server (for Claude Desktop / Cursor): `uvx voipbin-mcp`
- Golang SDK: `go get github.com/voipbin/voipbin-go`
- Sign up: `POST https://api.voipbin.net/v1.0/auth/signup`
If you build something with this, drop a comment — always curious what people connect AI phone calls to first.