Mohamed-Amine BENHIMA
What makes Pipecat different from other voice agent frameworks?

It's not an LLM problem

I thought building a voice agent was an LLM problem. Turns out, 80% of the work has nothing to do with the model.

You're actually orchestrating a chain of AI services: VAD, STT, LLM, and TTS, or a single speech-to-speech model like the one used here. On top of that you need audio streaming, turn cancellation, context management, WebRTC transport, observability, and async concurrency. All at once. All low latency.

Pipecat handles all of that. In a few lines of code.


Most frameworks weren't built for this

Frameworks like LangChain are great. But they're built for LLM calls and agentic workflows. Text in, text out. That's not what a real-time voice agent is.

The first thing that breaks is transport. REST doesn't work here. You can't poll a server for audio. You need to stream the mic directly from the user's browser to your server, and stream the voice response back in real time.

Most people jump to WebSockets. They work fine for many streaming use cases, but they run on TCP. TCP guarantees delivery and order, which sounds good until you realize that in real-time audio, a delayed packet is worse than a lost one. You don't want the protocol retrying. You want speed.

WebRTC runs on UDP. It was built exactly for this: low latency, browser-to-server, real-time media streaming. That's why it's the right transport for voice agents.
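Some rough arithmetic makes the TCP problem concrete. The numbers below are hypothetical (a 20 ms audio frame, a 150 ms round trip), but the shape of the problem is real: TCP's in-order delivery means one lost packet stalls every frame behind it until the retransmission arrives, while UDP just skips the missing frame.

```python
# Illustrative arithmetic (hypothetical numbers, not a measurement):
# compare the audible gap caused by one lost 20 ms audio frame.

FRAME_MS = 20   # one audio frame worth of speech
RTT_MS = 150    # assumed round-trip time for a retransmission

# TCP: ordered delivery means every frame behind the lost one is held back
# (head-of-line blocking) until the retransmitted copy arrives ~1 RTT later.
tcp_gap_ms = RTT_MS

# UDP (WebRTC media): the lost frame is simply skipped; the gap is just
# the missing frame, usually masked by packet loss concealment.
udp_gap_ms = FRAME_MS

print(f"TCP stall: ~{tcp_gap_ms} ms, UDP glitch: ~{udp_gap_ms} ms")
```

A 150 ms stall is clearly audible in conversation; a 20 ms glitch usually isn't.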

But transport is just the start. Once audio hits your server you still need to orchestrate VAD to detect when the user is speaking, a speech-to-speech model or a full STT + LLM + TTS chain, audio streaming back to the client, turn cancellation when the user interrupts, and context management across turns.

Without a framework built around this, you're writing all of it from scratch. And that's months of debugging edge cases that have nothing to do with your actual product.

That's what Pipecat is built for.


Building a voice agent pipeline in Pipecat

Step 1: The entry point

Everything starts with a single POST /api/offer endpoint. The browser sends a WebRTC offer, the server processes it and returns an answer, and the connection is established.

```python
@router.post("/api/offer", response_model=WebRTCAnswer)
async def offer(
    request: SmallWebRTCRequest,
    background_tasks: BackgroundTasks,
    small_webrtc_handler: SmallWebRTCRequestHandler = Depends(get_handler),
) -> WebRTCAnswer:
    async def webrtc_connection_callback(connection):
        webrtc_transport = SmallWebRTCTransport(
            webrtc_connection=connection,
            params=TransportParams(
                audio_in_enabled=True, audio_out_enabled=True, audio_out_10ms_chunks=2
            ),
        )
        background_tasks.add_task(run_bot, webrtc_transport)

    answer = await small_webrtc_handler.handle_web_request(
        request=request, webrtc_connection_callback=webrtc_connection_callback
    )
    return WebRTCAnswer(**answer)
```

Once the connection is ready, run_bot is called as a background task. FastAPI doesn't block waiting for the bot to finish. Each user gets their own transport instance and their own pipeline running concurrently.
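That per-connection isolation can be sketched with plain asyncio (no Pipecat involved): each session runs as its own task with its own state, mirroring how every WebRTC offer spawns an independent `run_bot` background task.

```python
import asyncio

# Minimal sketch (plain asyncio, not Pipecat) of the per-connection pattern:
# each session is its own task with its own state.
async def run_bot(session_id: str, results: dict) -> None:
    history: list[str] = []           # per-session state, never shared
    for turn in range(3):
        await asyncio.sleep(0)        # yield so sessions interleave
        history.append(f"{session_id}-turn{turn}")
    results[session_id] = history

async def main() -> dict:
    results: dict = {}
    # Two users connect at once; both "pipelines" run concurrently.
    await asyncio.gather(run_bot("alice", results), run_bot("bob", results))
    return results

results = asyncio.run(main())
print(results["alice"])   # only alice's turns: no cross-session leakage
```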

There's also a PATCH /api/offer endpoint for ICE candidates. WebRTC uses ICE to negotiate the best network path between browser and server. This endpoint handles those negotiation messages as they come in.

Step 2: Managing WebRTC connections

The SmallWebRTCRequestHandler is a singleton, initialized once at startup and shared across all connections. It manages the WebRTC state for every active session.

```python
small_webrtc_handler = SmallWebRTCRequestHandler()

def get_handler() -> SmallWebRTCRequestHandler:
    return small_webrtc_handler
```

On shutdown, the lifespan context manager closes the handler cleanly so no connections are left hanging.

```python
@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    await small_webrtc_handler.close()
```

For transport you have two options. Daily is the managed, paid solution. It runs the WebRTC infrastructure for you and is the right choice if your server sits behind a load balancer on a private network, where direct WebRTC connections don't work without a managed relay.

We use SmallWebRTC, the open source option. It works perfectly fine as long as your VM has a public IP. No extra cost, no external dependency.

Step 3: The pipeline

Once the transport is ready, run_bot builds and runs the pipeline. The core idea in Pipecat is simple. You define a list of processors that handle frames flowing through them in order. Audio in, intelligence, audio out.

```python
pipeline = Pipeline(
    [
        transport.input(),
        user_aggregator,
        llm,
        transport.output(),
        assistant_aggregator,
    ]
)
```

transport.input() streams raw audio from the user's browser directly to the server over WebRTC. No buffering, no polling.

user_aggregator combines two things: Silero VAD to detect when the user starts and stops speaking, and SmartTurn to decide when they actually finished their thought. VAD gives you the audio boundaries. SmartTurn uses a local model to predict if the turn is really complete, not just a pause. Without this, the bot cuts in mid-sentence.

llm here is Gemini Live, a speech-to-speech model. You send it audio, it responds with audio. No STT, no TTS in between. That removes two network hops from your latency budget.

transport.output() streams the bot's audio response back to the browser in real time.

assistant_aggregator handles context. It keeps track of the conversation history and compresses the context window when it gets too long, so the model doesn't run out of memory mid-conversation.
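The processor chain can be pictured with a toy version (illustrative only, not Pipecat's actual classes): frames flow through a list of processors in order, and the aggregator processors accumulate shared context along the way.

```python
# Toy model of the pipeline idea (not Pipecat's API): frames traverse
# processors in order; some transform the frame, some accumulate state.

class Processor:
    def process(self, frame: str) -> str:
        return frame

class UserAggregator(Processor):
    """Records user turns into a shared context, like user_aggregator."""
    def __init__(self, context: list):
        self.context = context
    def process(self, frame: str) -> str:
        self.context.append(("user", frame))
        return frame

class EchoLLM(Processor):
    """Stand-in for the speech-to-speech model: audio in, audio out."""
    def process(self, frame: str) -> str:
        return f"reply-to({frame})"

class AssistantAggregator(Processor):
    """Records the bot's responses, like assistant_aggregator."""
    def __init__(self, context: list):
        self.context = context
    def process(self, frame: str) -> str:
        self.context.append(("assistant", frame))
        return frame

context: list = []
pipeline = [UserAggregator(context), EchoLLM(), AssistantAggregator(context)]

frame = "hello"
for processor in pipeline:          # frames flow through in order
    frame = processor.process(frame)

print(frame)      # reply-to(hello)
print(context)    # [('user', 'hello'), ('assistant', 'reply-to(hello)')]
```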

Step 4: Smart Turn detection

Most voice agents use basic silence detection. Wait 500ms of no audio, assume the user is done, send to the LLM. Simple, but it breaks constantly. People pause mid-sentence. They think out loud. A fixed silence threshold either cuts them off too early or adds noticeable delay.

SmartTurn solves this with a small local model that runs on every audio chunk. It doesn't just detect silence, it predicts whether the turn is actually complete.

```python
stop_strategy = TurnAnalyzerUserTurnStopStrategy(
    turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams(stop_secs=2)),
    timeout=0.3,
)
```

stop_secs=2 is the fallback. If the model is uncertain for 2 seconds, it ends the turn anyway.

This is wired into the user aggregator alongside VAD:

```python
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context=context,
    user_params=LLMUserAggregatorParams(
        vad_analyzer=SileroVADAnalyzer(),
        user_turn_strategies=UserTurnStrategies(stop=[stop_strategy]),
    ),
)
```

VAD detects speech boundaries. SmartTurn decides when to act on them. Together they make interruptions and natural pauses feel handled correctly.
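The contrast with a plain silence threshold can be sketched with a toy simulation (illustrative only; the real SmartTurn uses a trained model, not keyword rules):

```python
# Toy turn detection. A fixed silence threshold fires during a mid-sentence
# pause; a completeness-aware check (stand-in for SmartTurn) keeps waiting.

SILENCE_THRESHOLD_MS = 500

def fixed_threshold_done(silence_ms: int) -> bool:
    """Classic VAD-only rule: enough silence means the turn is over."""
    return silence_ms >= SILENCE_THRESHOLD_MS

def turn_aware_done(transcript: str, silence_ms: int) -> bool:
    """Stand-in for a completeness predictor: don't end the turn right
    after an obvious filler word, even once the silence threshold hits."""
    fillers = ("umm", "uh", "so")
    trailing_filler = transcript.rstrip().split()[-1].lower() in fillers
    return silence_ms >= SILENCE_THRESHOLD_MS and not trailing_filler

# User pauses mid-thought after a filler word:
transcript, silence_ms = "I want to book a flight to, umm", 600

print(fixed_threshold_done(silence_ms))          # True  -> bot cuts in
print(turn_aware_done(transcript, silence_ms))   # False -> bot waits
```

The same 600 ms of silence after "Book it for Friday" would correctly end the turn in both versions; the difference only shows up on incomplete utterances.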

Step 5: Observability

```python
task = PipelineTask(
    pipeline=pipeline,
    params=PipelineParams(
        enable_metrics=True,
        enable_usage_metrics=True,
        observers=[latency_observer],
    ),
)
```

enable_metrics=True gives you TTFB and processing time per service. You see exactly how long each stage takes.

enable_usage_metrics=True gives you token usage from the LLM and character count from TTS, per interaction.

UserBotLatencyObserver measures the total end-to-end latency: from the moment the user stops speaking to the moment the bot starts speaking. That's the number that actually matters for how natural the conversation feels.

```python
@latency_observer.event_handler("on_latency_measured")
async def on_latency_measured(observer, latency):
    logger.debug(f"User-to-bot latency: {latency:.3f}s")
```

One callback. You get the full picture.


What surprised me

SmartTurn handles filler words

I expected turn detection to be a silence threshold. Wait long enough, assume the user is done. Simple.

The problem is people don't speak in clean sentences. They say "umm" and "uh" and pause mid-thought. Normal VAD hears that silence and ends the turn. The bot cuts in. The user feels interrupted. The conversation breaks.

SmartTurn runs a local model on every audio chunk and predicts whether the turn is actually complete. It hears "umm" followed by silence and knows the user isn't done yet. It waits. That one thing has a bigger impact on conversation quality than almost anything else in the pipeline.

Concurrency is handled for you

I expected to manage concurrent sessions myself. Thread safety, shared state, making sure one user's pipeline doesn't interfere with another's.

Pipecat handles this through its frame-based architecture. Each WebRTC connection spins up its own pipeline instance as an async background task. Sessions are fully isolated. You don't write any of that isolation logic yourself.

```python
background_tasks.add_task(run_bot, webrtc_transport)
```

That one line is doing a lot. Each call to run_bot gets its own transport, its own context, its own pipeline. No shared state to worry about.

Observability is a first-class citizen

I planned to wire up my own latency tracking after getting the core working. I assumed it would be a separate logging layer I'd have to build.

It wasn't. Three lines of config and you get TTFB per service, token and character usage per interaction, and full end-to-end latency from user stop speaking to bot start speaking.

```python
params=PipelineParams(
    enable_metrics=True,
    enable_usage_metrics=True,
    observers=[latency_observer],
)
```

Most frameworks make observability an afterthought. In Pipecat it's built into the pipeline task itself.


Full code is on GitHub: pipecat-demos/fastapi-pipecat

What's coming next

In the next posts I'll cover:

Integrating LangChain with Pipecat — how to bring agentic workflows into a real-time voice pipeline without killing your latency.

Communicating with the frontend — streaming transcription as the user speaks, streaming LLM output word by word, and highlighting the sentence the bot is currently speaking. The stuff that makes a voice agent feel alive, not just functional.
