DEV Community

Iftekhairul Alam

Connecting AI Voice Agents to SIP & PSTN Using NextGenSwitch

Bridging AI Voice Agents with Real Phone Calls

Building an AI voice agent is relatively easy today.

Connecting that agent to real phone calls (SIP, PBX, PSTN) is not.

Most AI voice systems are designed to work with WebSockets and raw audio streams, while production telephony still relies on SIP, RTP, and PSTN infrastructure. This mismatch is where many voice-AI projects struggle to move beyond demos.

This post explains how NextGenSwitch bridges that gap—allowing any AI voice system to interact with real phone callers using a Twilio-style streaming interface, without exposing SIP or RTP complexity to AI developers.

🔗 Original post:

https://nextgenswitch.com/blog/connecting-ai-voice-agents-to-sip-pstn-using-nextgenswitch/


The Core Problem

AI voice systems typically expect:


WebSocket → PCM audio → AI pipeline → PCM audio

Telephony systems operate very differently:


PSTN → SIP Trunk → PBX → RTP (μ-law / A-law)

Key challenges include:

  • SIP and RTP are stateful and codec-sensitive
  • AI systems expect clean, ordered audio frames
  • Handling barge-in, latency, and scaling is non-trivial
  • Most AI frameworks are not PBX-aware

The Role of NextGenSwitch

NextGenSwitch acts as a telephony abstraction layer between traditional phone systems and modern AI services.

It provides:

  • SIP & PSTN termination
  • Integration with PBX systems (Asterisk / FreeSWITCH)
  • A Twilio-style Programmable Voice API
  • Real-time WebSocket audio streaming
  • Codec and sample-rate normalization

Your AI service never has to interact directly with SIP or RTP.


High-Level Architecture


Caller
|
[PSTN / SIP Trunk]
|
[Asterisk / FreeSWITCH]
|
[NextGenSwitch]
| <WebSocket Audio Stream>
|
[AI Voice Service]

The AI voice service can be:

  • A custom WebSocket server
  • A cloud-based AI endpoint
  • An on-prem STT + LLM + TTS stack
  • Any framework capable of handling real-time audio

Twilio-Style XML Call Control

When a call reaches NextGenSwitch, it fetches XML instructions—similar to Twilio’s TwiML.

Minimal XML (only the stream URL is required)

```xml
<Response>
  <Connect>
    <Stream url="wss://ai.yourdomain.com/ws/voice-agent"/>
  </Connect>
</Response>
```

This instruction:

  • Answers the call
  • Opens a bidirectional WebSocket
  • Starts real-time audio streaming
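One way to serve this instruction is a small HTTP endpoint that returns the XML. The sketch below is a minimal WSGI app built only on the Python standard library. It assumes NextGenSwitch fetches the XML over plain HTTP and accepts an `application/xml` response, which this post does not specify; the names `app` and `serve` and port 8080 are illustrative choices.

```python
from wsgiref.simple_server import make_server

# Placeholder endpoint; replace with the WebSocket URL of your AI service.
STREAM_URL = "wss://ai.yourdomain.com/ws/voice-agent"

CALL_XML = (
    "<Response>\n"
    "  <Connect>\n"
    f'    <Stream url="{STREAM_URL}"/>\n'
    "  </Connect>\n"
    "</Response>\n"
)


def app(environ, start_response):
    """WSGI app that answers every request with the same call-control XML."""
    body = CALL_XML.encode("utf-8")
    start_response(
        "200 OK",
        [
            # Assumed content type; adjust if your deployment expects another.
            ("Content-Type", "application/xml; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]


def serve(port=8080):
    """Run the app locally. The port is an arbitrary choice for testing."""
    with make_server("", port, app) as server:
        server.serve_forever()
```

Calling `serve()` starts the endpoint; point the call's XML URL at it from wherever your deployment configures inbound call handling.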

Optional Parameters (Examples Only)

Parameters are optional.
When present, they are passed to your AI service as metadata.

```xml
<Response>
  <Connect>
    <Stream url="wss://ai.yourdomain.com/ws/voice-agent">
      <Parameter name="agent" value="support-bot"/>
      <Parameter name="tenant_id" value="company-01"/>
      <Parameter name="language" value="en-US"/>
    </Stream>
  </Connect>
</Response>
```

These values appear under customParameters in the JSON start event and can be used for routing, prompts, or CRM lookups.
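When parameters vary per call, building the XML programmatically avoids attribute-escaping mistakes. Here is a minimal sketch using Python's `xml.etree.ElementTree`. Note that `build_stream_xml` is a hypothetical helper written for this example, not part of NextGenSwitch.

```python
import xml.etree.ElementTree as ET


def build_stream_xml(url, parameters=None):
    """Return <Response><Connect><Stream> XML with optional <Parameter> children."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": url})
    # Each key/value pair becomes one <Parameter name="..." value="..."/>.
    for name, value in (parameters or {}).items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": str(value)})
    # ElementTree escapes special characters such as & in attribute values.
    return ET.tostring(response, encoding="unicode")


xml_text = build_stream_xml(
    "wss://ai.yourdomain.com/ws/voice-agent",
    {"agent": "support-bot", "tenant_id": "company-01", "language": "en-US"},
)
```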


WebSocket Streaming Protocol (JSON)

NextGenSwitch uses a Twilio Media Streams–style JSON protocol.

Your AI service only needs to handle a small set of events.


start Event

Sent once when the stream begins.

```json
{
  "event": "start",
  "streamId": "NGS_STREAM_123456",
  "start": {
    "callId": "NGS_CALL_abc",
    "from": "+8801XXXXXXXXX",
    "to": "5000",
    "customParameters": {
      "agent": "support-bot",
      "tenant_id": "company-01"
    }
  }
}
```

Save the streamId—it must be included in outbound audio messages.
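A minimal sketch of capturing that state in Python, assuming the field names shown in the example event. `CallSession` and `parse_start` are illustrative names, and treating every field except `streamId` as optional is a defensive assumption rather than documented behavior.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CallSession:
    """Per-call state the AI service keeps for the lifetime of a stream."""
    stream_id: str
    call_id: str = ""
    caller: str = ""
    callee: str = ""
    params: dict = field(default_factory=dict)


def parse_start(raw):
    """Build a CallSession from the text of a start event."""
    msg = json.loads(raw)
    if msg.get("event") != "start":
        raise ValueError(f"expected a start event, got {msg.get('event')!r}")
    start = msg.get("start") or {}
    return CallSession(
        stream_id=msg["streamId"],  # required later for outbound audio
        call_id=start.get("callId", ""),
        caller=start.get("from", ""),
        callee=start.get("to", ""),
        params=dict(start.get("customParameters") or {}),
    )
```

The `params` dictionary is where values such as `agent` or `tenant_id` would be read to pick a prompt or route the call.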


media Event (Inbound Audio)

```json
{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}
```

Audio characteristics:

  • Codec: G.711 μ-law
  • Sample rate: 8 kHz
  • Payload: base64-encoded audio frames

media Event (Outbound Audio)

Your AI service responds using the same structure:

```json
{
  "event": "media",
  "streamId": "NGS_STREAM_123456",
  "media": {
    "payload": "BASE64_AUDIO_BYTES=="
  }
}
```

NextGenSwitch converts this audio back into telephony format and sends it to the caller.
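A sketch of building such a message from TTS output. It assumes the outbound payload should use the same format as inbound audio (G.711 μ-law at 8 kHz, base64-encoded), which the post implies but does not state outright. It also assumes the TTS audio is already 8 kHz mono 16-bit PCM; resampling is out of scope here. `build_media_message` is a hypothetical helper.

```python
import base64
import json
import struct

_BIAS = 0x84   # bias constant used by G.711 mu-law
_CLIP = 32635  # largest magnitude that can be encoded after adding the bias


def _pcm16_to_ulaw_byte(sample):
    """Compress one signed 16-bit linear sample into a mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), _CLIP) + _BIAS
    # Find the segment (exponent): position of the highest set bit above bit 7.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def build_media_message(stream_id, pcm16):
    """Wrap little-endian 16-bit PCM audio in an outbound media event."""
    count = len(pcm16) // 2  # ignore a trailing odd byte, if any
    samples = struct.unpack(f"<{count}h", pcm16[: count * 2])
    ulaw = bytes(_pcm16_to_ulaw_byte(s) for s in samples)
    return json.dumps({
        "event": "media",
        "streamId": stream_id,  # the value saved from the start event
        "media": {"payload": base64.b64encode(ulaw).decode("ascii")},
    })
```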


stop Event

```json
{
  "event": "stop",
  "streamId": "NGS_STREAM_123456",
  "stop": {
    "reason": "hangup"
  }
}
```
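The three event types can be handled by one small dispatcher. The sketch below is transport-agnostic: feed it each text frame received from whichever WebSocket library you use. `StreamHandler` and its `on_audio` callback are illustrative names, and silently ignoring unknown events and mismatched stream IDs is a design assumption, not documented NextGenSwitch behavior.

```python
import base64
import json


class StreamHandler:
    """Tracks one stream and forwards inbound audio to a callback."""

    def __init__(self, on_audio):
        self.on_audio = on_audio   # called with raw (base64-decoded) audio bytes
        self.stream_id = None      # set by the start event
        self.stop_reason = None    # set by the stop event

    def handle(self, raw):
        """Process one JSON text frame and return its event name."""
        msg = json.loads(raw)
        event = msg.get("event")
        if event == "start":
            self.stream_id = msg["streamId"]
        elif event == "media":
            # Only accept audio that belongs to the stream we started.
            if msg.get("streamId") == self.stream_id:
                self.on_audio(base64.b64decode(msg["media"]["payload"]))
        elif event == "stop":
            self.stop_reason = (msg.get("stop") or {}).get("reason", "unknown")
        return event
```

A `stop_reason` other than `None` signals that the call has ended and per-call resources can be released.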


AI Stack: Fully Flexible

NextGenSwitch does not require any specific AI framework.

You can use:

  • Any STT engine
  • Any LLM
  • Any TTS engine
  • Any programming language

Frameworks like Pipecat can be used as reference implementations, but they are not required.
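One way to keep the stack swappable is to code against small interfaces rather than specific vendors. The sketch below uses Python's `typing.Protocol`. The method names (`transcribe`, `reply`, `synthesize`) and the `Fake*` stand-in engines are invented for illustration and do not correspond to any particular SDK.

```python
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, pcm16: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, text: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class VoicePipeline:
    """Chains STT -> LLM -> TTS; any object with the right method fits."""

    def __init__(self, stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech):
        self.stt, self.llm, self.tts = stt, llm, tts

    def respond(self, pcm16: bytes) -> bytes:
        text = self.stt.transcribe(pcm16)
        answer = self.llm.reply(text)
        return self.tts.synthesize(answer)


# Stand-in engines so the sketch runs without any external service.
class FakeSTT:
    def transcribe(self, pcm16: bytes) -> str:
        return "hello"


class FakeLLM:
    def reply(self, text: str) -> str:
        return f"You said: {text}"


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = VoicePipeline(FakeSTT(), FakeLLM(), FakeTTS())
```

Swapping an engine then means writing one adapter class with the matching method, with no change to the telephony side.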


Why This Architecture Works

  • No SIP or RTP handling in AI code
  • Twilio-style, developer-friendly interface
  • Real-time, low-latency audio streaming
  • Vendor-neutral AI integration
  • Production-ready PSTN scalability

Common Use Cases

  • AI receptionist
  • AI call center agents
  • Voice-based order processing
  • Appointment booking
  • IVR replacement
  • Multilingual voice bots

Key Takeaways

  • Only the <Stream url> is mandatory
  • XML parameters are optional metadata
  • Streaming protocol is Twilio-style JSON
  • Telephony audio uses μ-law @ 8 kHz
  • AI logic is completely decoupled from PBX logic

Learn More
