Many teams build an AI chatbot first, then realize they also need voice support. So they build a second system — separate logic, separate prompts, separate maintenance burden.
It does not have to be that way.
This post walks through an architecture where the same AI business logic handles both incoming voice calls and web chat sessions. You write the intelligence once. The transport layer handles the rest.
## The Problem: Two UIs, Two Codebases
When users contact your AI assistant, they might:
- Open a chat widget on your website
- Call a phone number
Both are valid. Both need responses. But they feel like completely different engineering problems:
- **Web chat:** HTTP requests, JSON payloads, WebSocket for streaming
- **Voice call:** SIP signaling, RTP audio, STT/TTS, codec negotiation
The typical response is to build them separately. Two prompt engineering efforts. Two conversation state machines. Two sets of edge cases to handle.
This is expensive — and it creates drift. The phone bot knows things the chat bot does not, and vice versa. Users get inconsistent answers depending on which channel they use.
## The Solution: Abstract the Transport
The key insight is that your AI does not care how input arrives. It processes text and returns text. Everything else — audio encoding, channel management, delivery — is a transport-layer concern.
```
Voice Call ──→ VoIPBin (STT) ──→ text ──→ Your AI Core ──→ text ──→ VoIPBin (TTS) ──→ voice
Web Chat   ──→ Your API      ──→ text ──→ Your AI Core ──→ text ──→ Your API      ──→ text
```
Your AI Core only sees text in, text out. The channel-specific plumbing lives outside it.
VoIPBin handles the entire telephony stack — SIP signaling, RTP streams, STT, TTS, codec transcoding. Your server never touches raw audio. It just receives transcribed text and replies with text.
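To make that boundary concrete, here is one way to express the contract in code. This is a minimal sketch: the `Transport` name and method signatures are illustrative, not part of any SDK.

```python
from typing import Protocol

class Transport(Protocol):
    """Anything that can carry a conversation: voice, web chat, SMS."""

    async def receive_text(self) -> str: ...           # transcript or typed message
    async def send_text(self, text: str) -> None: ...  # reply text; TTS, if any, happens downstream
```

Your AI core depends only on this shape. Each channel adapter satisfies it in its own way.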
## Building the Unified AI Core
Here is a complete FastAPI service that handles both channels:
```python
from fastapi import FastAPI, Request
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()  # async client so the LLM call doesn't block the event loop

# ── Shared in-memory history (use Redis in production) ────────
_history: dict[str, list] = {}

def get_history(session_id: str) -> list:
    return _history.get(session_id, [])

def save_history(session_id: str, user_msg: str, ai_msg: str):
    h = _history.get(session_id, [])
    h.append({"role": "user", "content": user_msg})
    h.append({"role": "assistant", "content": ai_msg})
    _history[session_id] = h[-20:]  # keep last 10 turns

# ── Core AI logic — channel-agnostic ─────────────────────────
async def get_ai_response(
    user_message: str,
    history: list,
    channel: str = "text",
) -> str:
    system = "You are a helpful customer support assistant."
    if channel == "voice":
        system += " Keep answers under two sentences. The user is listening, not reading."
    else:
        system += " You can be detailed. Use markdown when helpful."

    messages = [
        {"role": "system", "content": system},
        *history,
        {"role": "user", "content": user_message},
    ]
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=300,
    )
    return resp.choices[0].message.content

# ── Voice channel (VoIPBin webhook) ──────────────────────────
@app.post("/webhook/voice")
async def voice_webhook(request: Request):
    data = await request.json()
    call_id = data.get("call_id", "")
    transcript = data.get("transcript", "")

    if not transcript:
        ai_reply = "Hello! How can I help you today?"
    else:
        history = get_history(call_id)
        ai_reply = await get_ai_response(transcript, history, channel="voice")
        save_history(call_id, transcript, ai_reply)

    # VoIPBin reads these actions and handles TTS + next listen turn
    return {
        "actions": [
            {"type": "talk", "text": ai_reply},
            {"type": "listen", "webhook_url": "https://yourdomain.com/webhook/voice"},
        ]
    }

# ── Web chat channel ──────────────────────────────────────────
@app.post("/chat")
async def chat_endpoint(request: Request):
    data = await request.json()
    session_id = data.get("session_id", "")
    message = data.get("message", "")

    history = get_history(session_id)
    ai_reply = await get_ai_response(message, history, channel="text")
    save_history(session_id, message, ai_reply)
    return {"response": ai_reply}
```
Both channels call the same `get_ai_response()` function. Same model. Same business logic. The only difference is a channel flag that adjusts response length.
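The in-memory `_history` dict disappears on restart and cannot be shared across processes. Here is a sketch of the Redis swap the code comment suggests, using redis-py; the key naming and one-hour TTL are assumptions, not requirements:

```python
import json

import redis

r = redis.Redis(decode_responses=True)

def get_history(session_id: str) -> list:
    # Last 20 entries = the same 10-turn window as the in-memory version
    return [json.loads(item) for item in r.lrange(f"history:{session_id}", -20, -1)]

def save_history(session_id: str, user_msg: str, ai_msg: str):
    key = f"history:{session_id}"
    r.rpush(key, json.dumps({"role": "user", "content": user_msg}))
    r.rpush(key, json.dumps({"role": "assistant", "content": ai_msg}))
    r.ltrim(key, -20, -1)  # enforce the 10-turn cap
    r.expire(key, 3600)    # let finished conversations age out after an hour
```

Both endpoints keep calling the same two functions; only the storage behind them changes.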
## Setting Up the Voice Channel with VoIPBin
Get your API key — signup returns a token immediately, no OTP:
```bash
curl -s -X POST "https://api.voipbin.net/v1.0/auth/signup" \
  -H "Content-Type: application/json" \
  -d '{
    "username": "yourname",
    "password": "yourpassword",
    "email": "you@example.com"
  }'
# → { "token": "<your-access-token>" }
```
Then create a flow that points inbound calls at your webhook:
```python
import httpx

async def create_voice_flow(token: str, webhook_url: str) -> dict:
    async with httpx.AsyncClient() as c:
        resp = await c.post(
            "https://api.voipbin.net/v1.0/flows",
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
            json={
                "name": "AI Support Flow",
                "actions": [
                    {
                        "type": "webhook",
                        "url": webhook_url,
                        "method": "POST",
                    }
                ],
            },
        )
        return resp.json()
```
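Calling it once during setup might look like this; the token and webhook URL are placeholders:

```python
import asyncio

# One-off setup: register the flow that routes inbound calls to your webhook
flow = asyncio.run(create_voice_flow(
    token="<your-access-token>",
    webhook_url="https://yourdomain.com/webhook/voice",
))
print(flow)
```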
VoIPBin runs STT on incoming audio, sends the transcript to your webhook, receives your text response, and converts it to speech before the caller hears anything. Your server is a pure text API.
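For reference, here are the request and response shapes the `/webhook/voice` handler above works with, written out as plain dicts. The field names mirror the handler; check the VoIPBin docs for the authoritative schema:

```python
# What the handler expects VoIPBin to POST on each turn
inbound = {
    "call_id": "call-abc123",  # stable per call; used as the history key
    "transcript": "What are your business hours?",
}

# What the handler returns: speak the reply, then listen for the next turn
outbound = {
    "actions": [
        {"type": "talk", "text": "We're open 9 to 5, Monday through Friday."},
        {"type": "listen", "webhook_url": "https://yourdomain.com/webhook/voice"},
    ]
}
```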
## The Web Chat Side
For the browser, a single `fetch` call is all you need:
```javascript
async function sendMessage(sessionId, message) {
  const res = await fetch("/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ session_id: sessionId, message })
  });
  const { response } = await res.json();
  return response;
}

// Usage
const reply = await sendMessage("user-42", "What are your business hours?");
console.log(reply);
```
Same backend. Same AI. Different wire format.
## What You Actually Maintain
| Concern | Voice | Web Chat |
|---|---|---|
| Audio encoding | VoIPBin | n/a |
| STT | VoIPBin | n/a |
| TTS | VoIPBin | n/a |
| SIP / RTP | VoIPBin | n/a |
| AI logic | Your code | Your code |
| Conversation state | Your code | Your code |
You own the intelligence. VoIPBin owns the telephony. Neither side needs to know about the other.
## Adding a Third Channel Later
Because your AI core is isolated, adding SMS or a Slack bot later is additive — not a rewrite:
```python
# SMS channel — exact same core function
@app.post("/sms")
async def sms_endpoint(request: Request):
    data = await request.json()
    from_ = data["from"]
    message = data["body"]

    history = get_history(from_)
    ai_reply = await get_ai_response(message, history, channel="text")
    save_history(from_, message, ai_reply)
    return {"reply": ai_reply}
```
New channel, same `get_ai_response()`. Zero changes to your prompt or business logic.
## What You Get From This Architecture
- Half the AI logic to write and maintain — one prompt engineering effort covers all channels
- Consistent answers — voice and chat users get the same information from the same model
- Zero telephony expertise needed — VoIPBin handles everything below the text layer
- Easy to extend — new channels are thin adapters, not separate systems
- Faster iteration — improve your prompt once and both channels benefit immediately
## Try It
If you already have a chatbot and want to add a phone number — or are starting fresh and want both from day one — this architecture keeps your codebase small and your channels consistent.
VoIPBin signup is instant (token returned in the response, no email verification flow), and you can test inbound calls with a real number in minutes.
→ voipbin.net
→ MCP for Claude/Cursor: uvx voipbin-mcp
→ Go SDK: go get github.com/voipbin/voipbin-go