Iftekhairul Alam

Posted on Feb 6

Connecting AI Voice Agents to SIP & PSTN Using NextGenSwitch

#asterisk #freeswitch #ai #sip

Bridging AI Voice Agents with Real Phone Calls

Building an AI voice agent is relatively easy today.

Connecting that agent to real phone calls (SIP, PBX, PSTN) is not.

Most AI voice systems are designed to work with WebSockets and raw audio streams, while production telephony still relies on SIP, RTP, and PSTN infrastructure. This mismatch is where many voice-AI projects struggle to move beyond demos.

This post explains how NextGenSwitch bridges that gap: any AI voice system can talk to real phone callers through a Twilio-style streaming interface, and AI developers never have to deal with SIP or RTP directly.

🔗 Original post:

https://nextgenswitch.com/blog/connecting-ai-voice-agents-to-sip-pstn-using-nextgenswitch/

The Core Problem

AI voice systems typically expect:


WebSocket → PCM audio → AI pipeline → PCM audio

Telephony systems operate very differently:


PSTN → SIP Trunk → PBX → RTP (μ-law / A-law)

Key challenges include:

SIP and RTP are stateful and codec-sensitive
AI systems expect clean, ordered audio frames
Handling barge-in, latency, and scaling is non-trivial
Most AI frameworks are not PBX-aware

The Role of NextGenSwitch

NextGenSwitch acts as a telephony abstraction layer between traditional phone systems and modern AI services.

It provides:

SIP & PSTN termination
Integration with PBX systems (Asterisk / FreeSWITCH)
A Twilio-style Programmable Voice API
Real-time WebSocket audio streaming
Codec and sample-rate normalization

Your AI service never has to interact directly with SIP or RTP.

High-Level Architecture


Caller
|
[PSTN / SIP Trunk]
|
[Asterisk / FreeSWITCH]
|
[NextGenSwitch]
| <WebSocket Audio Stream>
|
[AI Voice Service]

The AI voice service can be:

A custom WebSocket server
A cloud-based AI endpoint
An on-prem STT + LLM + TTS stack
Any framework capable of handling real-time audio

Twilio-Style XML Call Control

When a call reaches NextGenSwitch, it fetches XML instructions, similar to Twilio’s TwiML.

Minimal XML (only the stream URL is required)

<Response>
  <Connect>
    <Stream url="wss://ai.yourdomain.com/ws/voice-agent"/>
  </Connect>
</Response>

This instruction:

Answers the call
Opens a bidirectional WebSocket
Starts real-time audio streaming

Optional Parameters (Examples Only)

Parameters are optional. When present, they are passed as metadata to your AI service.

<Response>
  <Connect>
    <Stream url="wss://ai.yourdomain.com/ws/voice-agent">
      <Parameter name="agent" value="support-bot"/>
      <Parameter name="tenant_id" value="company-01"/>
      <Parameter name="language" value="en-US"/>
    </Stream>
  </Connect>
</Response>

These values appear in the JSON start event and can be used for routing, prompts, or CRM lookups.
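If the XML is generated per call rather than served as a static file, it can be built programmatically. The following is a minimal sketch using only the Python standard library; the function name `build_stream_response` is hypothetical, and how NextGenSwitch requests the document (URL, HTTP method) is not covered here.

```python
import xml.etree.ElementTree as ET


def build_stream_response(stream_url, parameters=None):
    """Return a <Response><Connect><Stream> document as a string.

    `parameters` is an optional dict; each entry becomes a
    <Parameter name="..." value="..."/> child of <Stream>.
    """
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": stream_url})
    for name, value in (parameters or {}).items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": str(value)})
    # ElementTree escapes attribute values, so caller-supplied strings are safe.
    return ET.tostring(response, encoding="unicode")


# Example: same values as the XML shown above.
xml_doc = build_stream_response(
    "wss://ai.yourdomain.com/ws/voice-agent",
    {"agent": "support-bot", "tenant_id": "company-01", "language": "en-US"},
)
print(xml_doc)
```

Calling the function without `parameters` yields the minimal form, with only the stream URL.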

WebSocket Streaming Protocol (JSON)

NextGenSwitch uses a Twilio Media Streams–style JSON protocol.

Your AI service only needs to handle a small set of events.

`start` Event

Sent once when the stream begins.

{
  "event": "start",
  "streamId": "NGS_STREAM_123456",
  "start": {
    "callId": "NGS_CALL_abc",
    "from": "+8801XXXXXXXXX",
    "to": "5000",
    "customParameters": {
      "agent": "support-bot",
      "tenant_id": "company-01"
    }
  }
}

Save the streamId: it must be included in every outbound audio message.
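One way to keep the start event's data at hand is to parse it into a small per-call record. This is a sketch based only on the fields in the sample start event; the `StreamSession` class and `parse_start_event` function are hypothetical names, and a real payload may carry additional fields.

```python
import json
from dataclasses import dataclass, field


@dataclass
class StreamSession:
    """Per-call state extracted from the start event."""
    stream_id: str
    call_id: str
    caller: str
    callee: str
    params: dict = field(default_factory=dict)


def parse_start_event(raw_message):
    """Parse a JSON start event and return a StreamSession."""
    msg = json.loads(raw_message)
    if msg.get("event") != "start":
        raise ValueError(f"expected a start event, got {msg.get('event')!r}")
    start = msg.get("start", {})
    return StreamSession(
        stream_id=msg["streamId"],  # required for outbound audio later
        call_id=start.get("callId", ""),
        caller=start.get("from", ""),
        callee=start.get("to", ""),
        params=dict(start.get("customParameters", {})),
    )
```

The `params` dict then holds the optional XML parameters, for example to choose a prompt per `agent` or `tenant_id`.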

`media` Event (Inbound Audio)

{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}

Audio characteristics:

Codec: G.711 μ-law
Sample rate: 8 kHz
Payload: base64-encoded audio frames
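Most STT engines expect linear PCM rather than μ-law, so the inbound payload usually has to be decoded first. The sketch below does this in pure Python with the standard G.711 μ-law expansion (the `audioop` module that used to provide it was removed in Python 3.13). The function names are hypothetical, and resampling from 8 kHz to whatever rate your STT engine needs is left out.

```python
import base64
import struct


def ulaw_byte_to_pcm16(u):
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


# Precompute all 256 values once; decoding is then a table lookup per byte.
_ULAW_TABLE = [ulaw_byte_to_pcm16(b) for b in range(256)]


def decode_media_payload(payload_b64):
    """Return little-endian 16-bit PCM bytes for a base64 mu-law payload."""
    ulaw = base64.b64decode(payload_b64)
    samples = [_ULAW_TABLE[b] for b in ulaw]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Each μ-law byte becomes one 16-bit sample, so the decoded buffer is twice the size of the decoded payload.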

`media` Event (Outbound Audio)

Your AI service responds using the same structure:

{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}

NextGenSwitch converts this audio back into telephony format and sends it to the caller.
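TTS engines typically produce linear PCM, so outbound audio needs the reverse step. The sketch below assumes the outbound payload uses the same G.711 μ-law, 8 kHz, base64 format as inbound audio, which the text above implies but does not state outright; check the Stream API documentation before relying on it. The function names are hypothetical, and the input is assumed to be 16-bit little-endian PCM already resampled to 8 kHz.

```python
import base64
import json
import struct

_BIAS = 0x84    # standard G.711 mu-law bias
_CLIP = 32635   # largest magnitude that still fits after adding the bias


def pcm16_to_ulaw_byte(sample):
    """Compress one signed 16-bit linear sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), _CLIP) + _BIAS
    # Find the segment (exponent): position of the highest set bit, 14 down to 7.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def build_media_message(stream_id, pcm16_le):
    """Wrap 16-bit little-endian PCM as an outbound JSON media event."""
    count = len(pcm16_le) // 2
    samples = struct.unpack(f"<{count}h", pcm16_le[: count * 2])
    ulaw = bytes(pcm16_to_ulaw_byte(s) for s in samples)
    return json.dumps({
        "event": "media",
        "streamId": stream_id,  # the value saved from the start event
        "media": {"payload": base64.b64encode(ulaw).decode("ascii")},
    })
```

The returned string can be sent as a text frame over the same WebSocket the inbound audio arrives on.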

`stop` Event

{
  "event": "stop",
  "streamId": "NGS_STREAM_123456",
  "stop": {
    "reason": "hangup"
  }
}
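Taken together, the three event types form a small state machine per call. The sketch below keeps that logic independent of any WebSocket library, so it can be dropped into whichever server framework you use. The `VoiceAgentSession` class is a hypothetical name, and ignoring unknown events is a design assumption, not documented NextGenSwitch behavior.

```python
import json


class VoiceAgentSession:
    """Tracks one call's stream state from start to stop."""

    def __init__(self):
        self.stream_id = None
        self.inbound_payloads = []   # base64 strings awaiting decoding / STT
        self.closed = False
        self.stop_reason = None

    def handle_message(self, raw_message):
        """Apply one JSON text frame and return the event name it carried."""
        msg = json.loads(raw_message)
        event = msg.get("event")
        if event == "start":
            self.stream_id = msg["streamId"]
        elif event == "media":
            # Only accept audio for the stream announced by the start event.
            if msg.get("streamId") == self.stream_id:
                self.inbound_payloads.append(msg["media"]["payload"])
        elif event == "stop":
            self.closed = True
            self.stop_reason = msg.get("stop", {}).get("reason")
        # Any other event type is ignored in this sketch.
        return event
```

A WebSocket handler would create one `VoiceAgentSession` per connection, feed every incoming text frame to `handle_message`, and release its AI pipeline resources once `closed` becomes true.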

AI Stack: Fully Flexible

NextGenSwitch does not require any specific AI framework.

You can use:

Any STT engine
Any LLM
Any TTS engine
Any programming language

Frameworks like Pipecat can be used as reference implementations, but they are not required.

Why This Architecture Works

No SIP or RTP handling in AI code
Twilio-style, developer-friendly interface
Real-time, low-latency audio streaming
Vendor-neutral AI integration
Production-ready PSTN scalability

Common Use Cases

AI receptionist
AI call center agents
Voice-based order processing
Appointment booking
IVR replacement
Multilingual voice bots

Key Takeaways

Only the <Stream url> is mandatory
XML parameters are optional metadata
Streaming protocol is Twilio-style JSON
Telephony audio uses G.711 μ-law at 8 kHz
AI logic is completely decoupled from PBX logic

Learn More

Programmable Voice Stream API
https://nextgenswitch.com/docs/programmable-voice-api/#stream
AI streaming examples
https://github.com/nextgenswitch/ai_agents
