DEV Community: Sylvain Boily

RoomKit, Pipecat, TEN Framework, LiveKit Agents: Choosing the Right Conversational AI Framework

Sylvain Boily — Mon, 09 Feb 2026 15:09:36 +0000

The conversational AI space is booming. Whether you're building a voice assistant, a customer support system, or a multi-channel communication platform, there's no shortage of open-source frameworks to choose from. But with so many options, picking the right one can be overwhelming.

In this article, I compare four open-source frameworks that developers frequently encounter when building conversational AI systems: RoomKit, Pipecat, TEN Framework, and LiveKit Agents. Each takes a fundamentally different approach to the same problem space, and understanding their philosophies will save you weeks of going down the wrong path.

Full disclosure: I'm the creator of RoomKit. I'll do my best to keep this comparison fair and focused on helping you choose the right tool for your use case — including cases where RoomKit is not the right answer.

The Four Philosophies in 30 Seconds

Before diving into code, let's understand what makes each framework tick:

RoomKit thinks in rooms and channels. A conversation is a room. SMS, Email, WhatsApp, Voice, AI — they're all channels in that room.
Pipecat thinks in pipelines and frames. Data (audio, text, images) flows as frames through a linear chain of processors.
TEN Framework thinks in graphs and extensions. Extensions are nodes in a directed graph, connected via typed messages in a JSON configuration.
LiveKit Agents thinks in sessions and agents. An agent joins a WebRTC room as a participant, just like a human would.

These aren't just implementation details — they're design philosophies that determine what's easy, what's possible, and what's painful with each framework.

Show Me the Code

The best way to understand a framework is to see it in action. Here's what a minimal voice AI setup looks like with each one.

RoomKit — Voice Pipeline Meets Multi-Channel Rooms

import asyncio
from roomkit import (
    RoomKit, VoiceChannel, ChannelCategory,
    HookTrigger, HookResult, HookExecution, create_vllm_provider, VLLMConfig,
)
from roomkit.channels.ai import AIChannel
from roomkit.voice.backends.local import LocalAudioBackend
from roomkit.voice.pipeline import AudioPipelineConfig, RecordingConfig, WavFileRecorder
from roomkit.voice.pipeline.vad.sherpa_onnx import SherpaOnnxVADProvider, SherpaOnnxVADConfig
from roomkit.voice.stt.sherpa_onnx import SherpaOnnxSTTProvider, SherpaOnnxSTTConfig
from roomkit.voice.tts.sherpa_onnx import SherpaOnnxTTSProvider, SherpaOnnxTTSConfig

async def main():
    kit = RoomKit()

    # --- Audio pipeline: Mic → VAD → STT → LLM → TTS → Speaker (AEC and denoiser are optional stages, not configured here) ---
    backend = LocalAudioBackend(input_sample_rate=16000, output_sample_rate=22050)

    vad = SherpaOnnxVADProvider(SherpaOnnxVADConfig(
        model="ten-vad.onnx", model_type="ten",  # or "silero"
        threshold=0.5, silence_threshold_ms=600,
    ))
    stt = SherpaOnnxSTTProvider(SherpaOnnxSTTConfig(
        mode="transducer", encoder="encoder.onnx", decoder="decoder.onnx",
        joiner="joiner.onnx", tokens="tokens.txt",
    ))
    tts = SherpaOnnxTTSProvider(SherpaOnnxTTSConfig(
        model="en_US-amy-low.onnx", tokens="tokens.txt",
        data_dir="espeak-ng-data",
    ))
    pipeline = AudioPipelineConfig(
        vad=vad,
        recorder=WavFileRecorder(),
        recording_config=RecordingConfig(storage="./recordings"),
    )

    # Voice is a channel — same abstraction as SMS or Email
    voice = VoiceChannel("voice", stt=stt, tts=tts, backend=backend, pipeline=pipeline)
    ai = AIChannel("ai", provider=create_vllm_provider(
        VLLMConfig(model="qwen3:8b", base_url="http://localhost:11434/v1")
    ), system_prompt="You are a helpful voice assistant.")

    kit.register_channel(voice)
    kit.register_channel(ai)

    await kit.create_room(room_id="support-call")
    await kit.attach_channel("support-call", "voice")
    await kit.attach_channel("support-call", "ai", category=ChannelCategory.INTELLIGENCE)

    # Hooks intercept pipeline events — the same hook system used by all channels
    @kit.hook(HookTrigger.ON_TRANSCRIPTION)
    async def on_stt(text, ctx):
        print(f"User said: {text}")
        return HookResult.allow()

    @kit.hook(HookTrigger.BEFORE_TTS)
    async def before_tts(text, ctx):
        print(f"Assistant: {text}")
        return HookResult.allow()

asyncio.run(main())

What stands out: RoomKit has a full audio pipeline (VAD → STT → LLM → TTS) with optional AEC, denoiser, and WAV recording — but it's still a channel in a room. The same hook system that filters SMS spam can intercept speech events (ON_SPEECH_START, ON_TURN_COMPLETE, ON_VAD_SILENCE, etc.). You could add an SMS or WhatsApp channel to this same room and every message would flow across all channels automatically.

The audio backend is pluggable: LocalAudioBackend (microphone), FastRTCVoiceBackend (WebRTC), or RTPVoiceBackend (for SIP/telephony integration — RTP is already supported, with a SIP module in progress). VAD supports both TEN-VAD and Silero models via sherpa-onnx, and semantic turn detection is built into the pipeline.

Pipecat — A Linear Pipeline of Processors

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.transports.daily.transport import DailyTransport, DailyParams
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

async def main():
    transport = DailyTransport(
        room_url="https://your-domain.daily.co/room",
        params=DailyParams(vad_analyzer=SileroVADAnalyzer()),
    )
    stt = DeepgramSTTService(api_key="...")
    llm = OpenAILLMService(api_key="...", model="gpt-4o")
    tts = CartesiaTTSService(api_key="...", voice_id="...")

    context = OpenAILLMContext([{"role": "system", "content": "You are a helpful assistant."}])
    context_aggregator = llm.create_context_aggregator(context)

    # The pipeline: "frames" flow through this list from top to bottom
    pipeline = Pipeline([
        transport.input(),          # Audio from user
        stt,                        # Speech → Text
        context_aggregator.user(),  # Collect user turn
        llm,                        # Generate response
        tts,                        # Text → Speech
        transport.output(),         # Audio back to user
        context_aggregator.assistant(),
    ])

    task = PipelineTask(pipeline)
    runner = PipelineRunner()
    await runner.run(task)

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

What stands out: the pipeline metaphor is immediately intuitive. Audio goes in, gets transcribed, processed by an LLM, synthesized back to speech, and sent out. Each processor in the chain does one thing. Swapping Deepgram for Whisper or Cartesia for ElevenLabs is a one-line change.
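
To make the frame idea concrete, here is a toy sketch in plain Python. It is not Pipecat's API: `Frame`, `run_pipeline`, and the three stand-in processors are hypothetical names, and real processors are asynchronous and stream audio rather than passing strings around.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Frame:
    kind: str      # "audio" or "text"
    payload: str


# A processor takes one frame and returns the next one (hypothetical type).
Processor = Callable[[Frame], Frame]


def fake_stt(frame: Frame) -> Frame:
    # Stand-in for speech-to-text: audio frame in, text frame out.
    return Frame("text", f"transcript({frame.payload})")


def fake_llm(frame: Frame) -> Frame:
    # Stand-in for the language model: text in, text out.
    return Frame("text", f"reply({frame.payload})")


def fake_tts(frame: Frame) -> Frame:
    # Stand-in for text-to-speech: text frame in, audio frame out.
    return Frame("audio", f"speech({frame.payload})")


def run_pipeline(processors: List[Processor], frame: Frame) -> Frame:
    # Each processor consumes the previous processor's output, in list order.
    for process in processors:
        frame = process(frame)
    return frame


result = run_pipeline([fake_stt, fake_llm, fake_tts], Frame("audio", "mic"))
print(result.kind, result.payload)  # audio speech(reply(transcript(mic)))
```

Swapping a service in this model means replacing one element of the list, which is the property the one-line-change claim relies on.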

TEN Framework — A Graph of Extensions in JSON

{
  "ten": {
    "predefined_graphs": [{
      "name": "voice_assistant",
      "nodes": [
        {
          "name": "agora_rtc",
          "addon": "agora_rtc",
          "extension_group": "default",
          "property": { "app_id": "${env:AGORA_APP_ID}" }
        },
        {
          "name": "stt",
          "addon": "deepgram_asr_python",
          "extension_group": "default",
          "property": { "api_key": "${env:DEEPGRAM_API_KEY}" }
        },
        {
          "name": "llm",
          "addon": "openai_chatgpt_python",
          "extension_group": "default",
          "property": { "api_key": "${env:OPENAI_API_KEY}", "model": "gpt-4o" }
        },
        {
          "name": "tts",
          "addon": "elevenlabs_tts_python",
          "extension_group": "default",
          "property": { "api_key": "${env:ELEVENLABS_TTS_KEY}" }
        }
      ],
      "connections": [
        {
          "extension": "agora_rtc",
          "audio_frame": [{ "name": "pcm_frame", "dest": [{ "extension": "stt" }] }]
        },
        {
          "extension": "stt",
          "data": [{ "name": "text_data", "dest": [{ "extension": "llm" }] }]
        },
        {
          "extension": "llm",
          "data": [{ "name": "text_data", "dest": [{ "extension": "tts" }] }]
        },
        {
          "extension": "tts",
          "audio_frame": [{ "name": "pcm_frame", "dest": [{ "extension": "agora_rtc" }] }]
        }
      ]
    }]
  }
}

What stands out: there's no Python to write for a standard setup. You define your agent as a graph of extensions connected by typed messages (audio frames, text data, commands). The visual TMAN Designer lets you drag-and-drop these nodes. Extensions can be written in C++, Go, or Python, and they all run in the same process.
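
One way to read a graph like this is as a list of edges. The sketch below is plain Python, not part of TEN: it uses a trimmed copy of the connections above, `edges` and `undeclared` are hypothetical helper names, and it only handles the two message types that appear in this example.

```python
import json

# Trimmed copy of the graph above: node names plus the same four connections.
GRAPH = json.loads("""
{
  "nodes": ["agora_rtc", "stt", "llm", "tts"],
  "connections": [
    {"extension": "agora_rtc",
     "audio_frame": [{"name": "pcm_frame", "dest": [{"extension": "stt"}]}]},
    {"extension": "stt",
     "data": [{"name": "text_data", "dest": [{"extension": "llm"}]}]},
    {"extension": "llm",
     "data": [{"name": "text_data", "dest": [{"extension": "tts"}]}]},
    {"extension": "tts",
     "audio_frame": [{"name": "pcm_frame", "dest": [{"extension": "agora_rtc"}]}]}
  ]
}
""")


def edges(graph):
    """Yield (source, message_type, message_name, destination) per connection."""
    for conn in graph["connections"]:
        source = conn["extension"]
        for msg_type in ("audio_frame", "data"):
            for message in conn.get(msg_type, []):
                for dest in message["dest"]:
                    yield source, msg_type, message["name"], dest["extension"]


def undeclared(graph):
    """Return names used in connections that are missing from the node list."""
    declared = set(graph["nodes"])
    used = {name for s, _, _, d in edges(graph) for name in (s, d)}
    return sorted(used - declared)


for source, msg_type, name, dest in edges(GRAPH):
    print(f"{source} --{msg_type}:{name}--> {dest}")
print("undeclared nodes:", undeclared(GRAPH))
```

A check like `undeclared` is the kind of validation that catches a typo in a `dest` name before the graph is ever started.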

LiveKit Agents — An Agent Joins the Room

from livekit.agents import Agent, AgentSession, JobContext, cli, WorkerOptions
from livekit.plugins import silero, deepgram, openai, cartesia

async def entrypoint(ctx: JobContext):
    await ctx.connect()

    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4.1-mini"),
        tts=cartesia.TTS(voice="9626c31c-..."),
    )

    await session.start(
        agent=Agent(instructions="You are a helpful voice assistant."),
        room=ctx.room,
    )

    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

What stands out: the code is remarkably concise. The AgentSession handles the entire voice pipeline (VAD → STT → LLM → TTS) internally. The agent joins a LiveKit room as a regular participant, which means you get all of LiveKit's WebRTC infrastructure for free — noise cancellation, SIP telephony, multi-participant rooms.

Comparing on Voice (Apples to Apples)

All four frameworks can power a voice AI agent, so let's compare them where they overlap.

Capability	RoomKit	Pipecat	TEN Framework	LiveKit Agents
Pipeline model	Full audio pipeline + hook intercepts	Linear frame chain	Directed graph (JSON)	Session-managed pipeline
STT/TTS	Pluggable (Deepgram, ElevenLabs, SherpaOnnx local)	40+ services	Deepgram, Azure, Whisper, etc.	Deepgram, OpenAI, Cartesia, etc.
Speech-to-speech	OpenAI Realtime, Gemini Live	OpenAI, Gemini	OpenAI, Gemini	OpenAI Realtime API
VAD	TEN-VAD, Silero (sherpa-onnx), semantic turn detection	Silero, custom Smart Turn	Proprietary TEN VAD	Silero + semantic turn detection
Audio transport	FastRTC (WebRTC), RTP, WebSocket, local mic	Daily.co (WebRTC)	Agora RTC	LiveKit (WebRTC)
Barge-in	Via hook pipeline events	Built-in interruption handling	Interrupt detector extension	Built-in with min_words config
SIP/Telephony	RTP backend supported, SIP module in progress	Twilio WebSocket	SIP extension	Native SIP trunking
Audio processing	AEC (Speex), denoiser (GTCRN), WAV recorder	—	Noise reduction extension	BVC noise cancellation
Install	`pip install roomkit[all]`	`pip install pipecat-ai`	Docker + Agora SDK	`pip install livekit-agents`

On voice alone, Pipecat and LiveKit Agents are the most polished ecosystems with the widest service integrations. But RoomKit has a surprisingly complete audio pipeline under the hood: neural VAD (TEN-VAD or Silero via sherpa-onnx), semantic turn detection, echo cancellation (Speex AEC), neural denoising (GTCRN), WAV recording, and pluggable backends (WebRTC, RTP, local mic). The difference is architectural: RoomKit's pipeline lives inside a channel that coexists with SMS, Email, and WhatsApp in the same room. TEN goes deepest on real-time media with proprietary VAD and avatar lip-sync.

Where Each Framework Shines Alone

This is where the real differences emerge.

RoomKit: Multi-Channel Conversation Orchestration

RoomKit's unique strength is that voice is just one of many channels in a room. You can have a conversation that spans SMS, Email, WhatsApp, Voice, Microsoft Teams, and AI — all in the same room, with automatic content transcoding between them. A rich card sent to WhatsApp becomes plain text over SMS. An AI response broadcasts to every channel simultaneously.

But don't mistake "multi-channel" for "voice-light." The voice subsystem has a full audio pipeline: neural VAD (TEN-VAD or Silero), AEC, denoiser, STT, TTS, semantic turn detection, and WAV recording with stereo/mixed/separate modes. The pipeline fires granular events (ON_SPEECH_START, ON_SPEECH_END, ON_VAD_SILENCE, ON_TURN_COMPLETE, ON_TURN_INCOMPLETE, ON_BACKCHANNEL, ON_DTMF) that the hook system can intercept — the same hook system used by all channels. Audio backends are pluggable: local mic, FastRTC (WebRTC), or RTP (for telephony integration).

Choose RoomKit when: you're building a multi-channel conversation system (contact center, B2B2C messaging, omnichannel support) where voice is one touchpoint among many, or when you need the full audio pipeline integrated with hooks and identity resolution across channels.
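
The content transcoding described above (a rich card becoming plain text over SMS) could look roughly like this. This is an illustration of the idea only, not RoomKit's implementation: `RichCard`, `SUPPORTS_RICH`, and `render_for` are hypothetical names, and the capability table is made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class RichCard:
    title: str
    body: str
    buttons: List[str] = field(default_factory=list)


# Hypothetical capability table: which channels can render a rich card.
SUPPORTS_RICH = {"whatsapp": True, "email": True, "sms": False}


def render_for(channel: str, card: RichCard) -> Union[RichCard, str]:
    """Pass the card through when the channel supports it, else flatten it."""
    if SUPPORTS_RICH.get(channel, False):
        return card
    lines = [card.title, card.body]
    # Buttons degrade to a numbered list the user can reply to.
    lines += [f"{i}. {label}" for i, label in enumerate(card.buttons, start=1)]
    return "\n".join(lines)


card = RichCard("Order #42", "Shipped yesterday.", ["Track", "Talk to an agent"])
print(render_for("sms", card))
```

Unknown channels fall back to plain text here, on the assumption that plain text is the one format every channel can deliver.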

Pipecat: The Widest Ecosystem for Voice AI

Pipecat's frame-based pipeline is the most intuitive model for building voice agents. With 40+ service integrations, client SDKs for every platform (React, Swift, Kotlin, C++), and debugging tools like Whisker and Tail, it has the most mature developer ecosystem for pure voice AI.

Pipecat Flows adds structured conversation state management on top, letting you build complex multi-step interactions (patient intake, order tracking) without reinventing the wheel.

Choose Pipecat when: you're building a voice-first AI agent and want the widest choice of AI services with the fastest path to production.

TEN Framework: Visual Builder + Multi-Language Extensions

TEN is the most ambitious in scope. The TMAN Designer lets non-developers visually wire together voice agents. Extensions can be written in C++, Go, or Python and run in the same process. The proprietary VAD and turn detection models are highly optimized. And the lip-sync avatar integrations (with Trulience, HeyGen, Tavus) make it the clear choice for "wow" demos with visual characters.

The tradeoff is weight: TEN requires Docker, an Agora account, and a Go server. It's a full platform, not a library.

Choose TEN when: you need visual agent building, multi-language extensions, or real-time avatar experiences — and you're okay with the Agora ecosystem.

LiveKit Agents: Production-Grade Voice with Observability

LiveKit Agents is the most developer-friendly framework for getting a production voice agent running. The v1.0 AgentSession API is beautifully clean. Semantic turn detection (using a custom open-weights model) reduces false interruptions. The built-in test framework and metrics collection make it easy to iterate. And with native SIP trunking, you can deploy phone agents without extra infrastructure.

LiveKit's open-source media server means you can self-host the entire stack, which matters for compliance-sensitive deployments.

Choose LiveKit Agents when: you're building a production voice AI agent and need the best balance of developer experience, observability, and deployment flexibility.

Decision Tree

Still not sure? Here's a quick guide:

"I need voice + SMS + Email + WhatsApp in one conversation"
→ RoomKit. None of the others do multi-channel.

"I want the fastest path to a working voice agent"
→ LiveKit Agents or Pipecat. Both get you to "hello world" in under 50 lines.

"I need to visually build agents without writing code"
→ TEN Framework's TMAN Designer.

"I need lip-sync avatars and video in real-time"
→ TEN Framework.

"I need the widest choice of AI providers"
→ Pipecat (40+ integrations).

"I need to self-host everything for compliance"
→ LiveKit Agents (open-source media server) or RoomKit (no external dependencies for core).

"I need SIP/telephony integration"
→ LiveKit Agents (native SIP) or Pipecat (Twilio WebSocket). RoomKit has RTP support already, with a SIP module in development.

"I'm building a hook/middleware system around conversations"
→ RoomKit. The hook pipeline (PRE_INBOUND, POST_DELIVERY, BEFORE_BROADCAST, etc.) is purpose-built for this.

Conclusion

These four frameworks are not in direct competition — they solve overlapping but different problems. Pipecat, TEN, and LiveKit Agents are voice/AI agent frameworks optimized for real-time audio interactions. RoomKit is a multi-channel conversation framework with a full voice pipeline that happens to coexist with SMS, Email, WhatsApp, and a dozen other channels in the same room abstraction.

The landscape is evolving fast. Speech-to-speech models (OpenAI Realtime, Gemini Live) are making traditional STT → LLM → TTS pipelines less necessary. All four frameworks are adapting. What won't change is the underlying architectural bet each one makes: pipelines vs. graphs vs. rooms vs. sessions. Choose the abstraction that matches your problem, and you'll be building on solid ground.

All four frameworks are open source. Go explore:

RoomKit: github.com/roomkit-live/roomkit — roomkit.live
Pipecat: github.com/pipecat-ai/pipecat — pipecat.ai
TEN Framework: github.com/TEN-framework/ten-framework — theten.ai
LiveKit Agents: github.com/livekit/agents — livekit.io

Your Voice Assistant Doesn't Need the Cloud

Sylvain Boily — Sat, 07 Feb 2026 19:42:50 +0000

Build a fully local, open-source voice assistant in Python — no API keys, no subscriptions, no data leaving your machine.

If you've ever built a voice assistant, you know the drill: sign up for Deepgram, get an ElevenLabs API key, wire up OpenAI, watch the invoices stack up, and hope your users are comfortable sending their audio to three different cloud providers.

I wanted something different. A voice assistant that runs entirely on my machine — STT, LLM, TTS, everything — with zero cloud dependencies. And I wanted to build it with the same clean abstractions I'd use for a cloud-based setup.

Here's what I ended up with: a fully local voice pipeline running on a single NVIDIA RTX 4070, with the LLM responding in under 300 ms after the initial warmup. Everything open source. Everything local.

I built it with RoomKit, an open-source Python framework I created for multi-channel conversation orchestration.

The Stack

No API keys. No cloud. Just models running on your GPU:

Component	Tool	Role
STT	Kroko ASR	Speech-to-text (streaming Zipformer via sherpa-onnx)
LLM	Qwen 2.5 4B	Language model via Ollama
TTS	Piper (fr_FR-siwis-medium)	Text-to-speech (VITS via sherpa-onnx)
VAD	TEN-VAD	Voice activity detection (sherpa-onnx)
Orchestration	RoomKit	Wires it all together

The entire audio pipeline looks like this:

Mic → [Resampler] → [AEC] → [Denoiser] → VAD → STT → LLM → TTS → Speaker

Every component runs locally. The heaviest lift is the LLM at 4B parameters — small enough to leave plenty of VRAM for the ONNX models.

Why sherpa-onnx?

When I started building the local voice pipeline for RoomKit, I looked at the usual suspects: faster-whisper, whisper.cpp, Coqui TTS. All great projects. But sherpa-onnx stood out for one reason: it covers the entire audio stack.

One library gives you streaming STT (Zipformer, Whisper, Paraformer), TTS (VITS/Piper voices), neural VAD (TEN-VAD, Silero), speech enhancement (GTCRN denoiser), and even echo cancellation support. All through ONNX Runtime, which means you get CUDA acceleration without PyTorch overhead.

For RoomKit, this was ideal. I could build providers for STT, TTS, VAD, and denoising that all share the same runtime, the same model format, and the same deployment story.

Show Me the Code

Let's build this step by step, starting simple.

Step 1: The providers

Each piece of the pipeline is a RoomKit provider — a pluggable component you can swap without changing your application logic:

from roomkit.voice.stt.sherpa_onnx import SherpaOnnxSTTProvider, SherpaOnnxSTTConfig
from roomkit.voice.tts.sherpa_onnx import SherpaOnnxTTSProvider, SherpaOnnxTTSConfig
from roomkit.voice.pipeline.vad.sherpa_onnx import SherpaOnnxVADProvider, SherpaOnnxVADConfig

# Speech-to-text: Kroko ASR (streaming Zipformer)
stt = SherpaOnnxSTTProvider(SherpaOnnxSTTConfig(
    mode="transducer",
    encoder="kroko-encoder.onnx",
    decoder="kroko-decoder.onnx",
    joiner="kroko-joiner.onnx",
    tokens="tokens.txt",
    sample_rate=16000,
    provider="cuda",  # or "cpu"
))

# Text-to-speech: Piper French voice
tts = SherpaOnnxTTSProvider(SherpaOnnxTTSConfig(
    model="fr_FR-siwis-medium.onnx",
    tokens="tokens.txt",
    data_dir="espeak-ng-data",
    sample_rate=22050,
    provider="cuda",
))

# Voice activity detection: TEN-VAD
vad = SherpaOnnxVADProvider(SherpaOnnxVADConfig(
    model="ten-vad.onnx",
    model_type="ten",
    threshold=0.5,
    silence_threshold_ms=600,
    sample_rate=16000,
    provider="cpu",  # VAD is tiny — CPU is actually faster
))

Notice that VAD runs on CPU. The model is tiny and runs every 20 ms, so the GPU transfer overhead makes CUDA slower than the CPU for it. RoomKit lets you mix providers: CUDA for the heavy models, CPU for the lightweight ones.

Step 2: The audio pipeline

RoomKit's AudioPipelineConfig chains the preprocessing stages together:

from roomkit.voice.pipeline import AudioPipelineConfig

pipeline = AudioPipelineConfig(
    vad=vad,
    # Optional: add denoiser and echo cancellation
    # denoiser=SherpaOnnxDenoiserProvider(config),
    # aec=SpeexAECProvider(frame_size=320),
)

Want to add noise reduction? Uncomment one line. Echo cancellation for a speaker setup? One more line. Each stage is optional and pluggable.
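
A quick sanity check on the numbers: the `frame_size=320` in the commented-out AEC line lines up with the 20 ms cadence mentioned above, assuming `frame_size` is counted in samples at 16 kHz (the source does not state the unit). `samples_per_frame` is a helper written for this example.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of audio samples in one frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000


print(samples_per_frame(16000, 20))  # 320 samples per 20 ms frame at 16 kHz
print(samples_per_frame(22050, 20))  # 441 samples per 20 ms frame at 22.05 kHz
```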

Step 3: The LLM

Any OpenAI-compatible server works — Ollama, vLLM, LM Studio:

from roomkit import VLLMConfig, create_vllm_provider
from roomkit.channels.ai import AIChannel

ai_provider = create_vllm_provider(VLLMConfig(
    model="qwen2.5:4b",
    base_url="http://localhost:11434/v1",  # Ollama
    max_tokens=256,
))

ai = AIChannel("ai", provider=ai_provider, system_prompt=(
    "You are a friendly voice assistant. Keep responses "
    "short and conversational — one or two sentences."
))

Step 4: Wire it all together

This is where RoomKit shines. The room is the conversation — channels attach to it, messages flow through hooks:

from roomkit import RoomKit, VoiceChannel, ChannelCategory
from roomkit.voice.backends.local import LocalAudioBackend

kit = RoomKit()

# Local mic + speakers
backend = LocalAudioBackend(
    input_sample_rate=16000,
    output_sample_rate=22050,
    channels=1,
    block_duration_ms=20,
)

# Register channels
voice = VoiceChannel("voice", stt=stt, tts=tts, backend=backend, pipeline=pipeline)
kit.register_channel(voice)
kit.register_channel(ai)

# Create a room and attach both channels
# (these awaits must run inside an async function, e.g. one started with asyncio.run)
await kit.create_room(room_id="local-voice")
await kit.attach_channel("local-voice", "voice")
await kit.attach_channel("local-voice", "ai", category=ChannelCategory.INTELLIGENCE)

That's it. Speak into your mic and the audio flows through VAD → STT → LLM → TTS → speaker. No cloud. No API keys.

Step 5: Add hooks for visibility

RoomKit's hook system lets you intercept events at any point — logging, moderation, analytics, custom routing:

from roomkit import HookTrigger, HookResult, HookExecution

@kit.hook(HookTrigger.ON_TRANSCRIPTION)
async def on_transcription(text, ctx):
    print(f"You said: {text}")
    return HookResult.allow()

@kit.hook(HookTrigger.BEFORE_TTS)
async def before_tts(text, ctx):
    print(f"Assistant: {text}")
    return HookResult.allow()

@kit.hook(HookTrigger.ON_SPEECH_START, execution=HookExecution.ASYNC)
async def on_speech_start(session, ctx):
    print("Listening...")

Hooks aren't just for logging. You can block messages, reroute them, transform text before TTS, or trigger external actions — all without touching the pipeline code.

What It Looks Like in Practice

Here's an actual conversation running on my machine (RTX 4070, Ollama with Qwen 2.5 4B, French models):

14:05:08 You said: Salut, comment ça va
14:05:12 Assistant: Ça va bien, merci! Et toi?

14:05:18 You said: Ouais, ça va plutôt pas mal
14:05:18 Assistant: C'est cool! Fais-toi repos et profite du bon temps.

14:05:28 You said: Qu'est-ce que tu sais faire en fait ?
14:05:29 Assistant: Je peux répondre aux questions, donner des suggestions,
          raconter des histoires et aider avec plein de choses!

The first response takes ~4 seconds (Ollama cold start). After that, the LLM responds in ~280ms. The full loop — end of speech to start of audio playback — is well under 2 seconds.
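
As a rough cross-check, the gap between each "You said" line and the following "Assistant" line can be computed from the timestamps in the transcript above. The log has one-second resolution, so this only bounds the latency and cannot confirm the ~280ms figure. `LOG` and `turn_gaps` are names made up for this sketch.

```python
from datetime import datetime

# Timestamps copied from the transcript above, tagged by speaker.
LOG = [
    ("14:05:08", "user"), ("14:05:12", "assistant"),
    ("14:05:18", "user"), ("14:05:18", "assistant"),
    ("14:05:28", "user"), ("14:05:29", "assistant"),
]


def turn_gaps(log):
    """Seconds between each user line and the assistant line that follows it."""
    gaps = []
    last_user = None
    for stamp, speaker in log:
        t = datetime.strptime(stamp, "%H:%M:%S")
        if speaker == "user":
            last_user = t
        elif last_user is not None:
            gaps.append(int((t - last_user).total_seconds()))
            last_user = None
    return gaps


print(turn_gaps(LOG))  # [4, 0, 1]
```

The first gap matches the ~4 second cold start; the later turns land within the same or the next second.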

And yes, barge-in works. Start talking while the assistant is speaking and the TTS stops immediately:

14:05:34 TTS interrupted: reason=barge_in, position=4751ms
14:05:34 Speech started (new utterance)

This is the kind of thing that's painful to implement from scratch but comes built into RoomKit's voice pipeline.
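
For a sense of what barge-in involves, here is a minimal asyncio sketch of the core move: cancel the playback task as soon as new speech is detected. It is not RoomKit's implementation. `speak` and `demo` are hypothetical, playback is simulated with short sleeps, and a real pipeline also has to flush audio buffers and track the playback position.

```python
import asyncio


async def speak(text: str, spoken: list) -> None:
    # Stand-in for TTS playback: "plays" one word every 10 ms.
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(0.01)


async def demo() -> tuple:
    spoken: list = []
    playback = asyncio.create_task(
        speak("one two three four five six seven eight", spoken)
    )
    await asyncio.sleep(0.025)   # the user starts talking mid-sentence
    playback.cancel()            # barge-in: stop playback right away
    try:
        await playback
    except asyncio.CancelledError:
        pass                     # expected: the task was cancelled on purpose
    return spoken, playback.cancelled()


spoken, interrupted = asyncio.run(demo())
print(f"played {len(spoken)} of 8 words, interrupted={interrupted}")
```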

The Real Point: Swap Without Rewriting

Here's what I find most interesting about this setup. The local voice pipeline uses the exact same RoomKit abstractions as the cloud-based one. If tomorrow you want to switch to Deepgram for STT and ElevenLabs for TTS, you swap the providers:

# Local
stt = SherpaOnnxSTTProvider(stt_config)
tts = SherpaOnnxTTSProvider(tts_config)

# Cloud (same interface, same hooks, same room)
stt = DeepgramSTTProvider(api_key="...")
tts = ElevenLabsTTSProvider(api_key="...", voice_id="...")

Your hooks don't change. Your room logic doesn't change. Your recording, moderation, and routing code stays the same. That's the whole point of RoomKit — the channel and provider are implementation details; the conversation is the abstraction.
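
The pattern behind this is ordinary interface-based design. The sketch below shows it with `typing.Protocol` in plain Python. `STTProvider`, `LocalSTT`, `CloudSTT`, and `handle_utterance` are hypothetical names for illustration and do not reflect RoomKit's actual provider interface.

```python
from typing import Protocol


class STTProvider(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LocalSTT:
    def transcribe(self, audio: bytes) -> str:
        # Stand-in for an on-device model.
        return f"local:{len(audio)} bytes"


class CloudSTT:
    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def transcribe(self, audio: bytes) -> str:
        # Stand-in for a network call to a hosted service.
        return f"cloud:{len(audio)} bytes"


def handle_utterance(stt: STTProvider, audio: bytes) -> str:
    # Application code depends only on the interface, never on a vendor.
    return stt.transcribe(audio).upper()


audio = b"\x00" * 320
print(handle_utterance(LocalSTT(), audio))               # LOCAL:320 BYTES
print(handle_utterance(CloudSTT(api_key="..."), audio))  # CLOUD:320 BYTES
```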

Getting Started

Prerequisites:

Linux with NVIDIA GPU (tested on RTX 4070)
Ollama installed and running
Python 3.12+

Install:

pip install "roomkit[local-audio,openai,sherpa-onnx]"
ollama pull qwen2.5:4b

Download models (one time):

# VAD
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/vad-models/ten-vad.onnx

# STT - Kroko ASR
# (download from https://huggingface.co/Banafo/Kroko-ASR)

# TTS - Piper French voice
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-fr_FR-siwis-medium.tar.bz2
tar xf vits-piper-fr_FR-siwis-medium.tar.bz2

For GPU acceleration (optional but recommended):

# Install cuDNN 9
sudo apt-get install cudnn9-cuda-12

# Install CUDA wheel for sherpa-onnx
pip install sherpa-onnx==1.12.23+cuda12.cudnn9 \
    -f https://k2-fsa.github.io/sherpa/onnx/cuda.html

export ONNX_PROVIDER=cuda

Run the full example:

git clone https://github.com/roomkit-live/roomkit
cd roomkit

# Set your model paths and run
LLM_MODEL=qwen2.5:4b \
LLM_BASE_URL=http://localhost:11434/v1 \
VAD_MODEL=ten-vad.onnx \
STT_ENCODER=<path-to-encoder> \
STT_DECODER=<path-to-decoder> \
STT_JOINER=<path-to-joiner> \
STT_TOKENS=<path-to-tokens> \
TTS_MODEL=fr_FR-siwis-medium.onnx \
TTS_TOKENS=tokens.txt \
TTS_DATA_DIR=espeak-ng-data \
ONNX_PROVIDER=cuda \
python examples/voice_local_onnx_vllm.py

The full example on GitHub includes everything: echo cancellation, noise reduction, WAV recording, debug taps, and graceful shutdown.

What's Next

This example is a starting point. RoomKit's architecture means you can extend it in any direction:

Add a WebSocket channel to the same room — now your voice assistant also handles text chat
Plug in MCP tools — let the assistant search documents, query databases, or control devices
Record and analyze — the built-in WAV recorder captures both sides of the conversation
Switch languages — swap STT/TTS models for English, German, Spanish, or any language sherpa-onnx supports

The full source is at github.com/roomkit-live/roomkit. Star the repo, try the example, open an issue. The framework is MIT licensed and actively looking for contributors.

Website: roomkit.live
GitHub: github.com/roomkit-live
PyPI: pypi.org/project/roomkit

I Built a Multi-Channel Conversation Framework in Python. Here's Why.

Sylvain Boily — Fri, 06 Feb 2026 14:58:30 +0000

If you've ever integrated SMS, email, voice, and chat into the same app, you know the pain. Each channel has its own SDK, its own webhooks, its own quirks. A customer starts on SMS, continues on email, finishes on chat — and your system treats them as three strangers.

I spent 20 years building telecom infrastructure. After the third time rebuilding the same "route messages between channels" plumbing, I extracted the pattern into a library.

It's called RoomKit.

The Problem

Every conversation system I've worked on had the same architecture smell: channel-specific code scattered everywhere, identity stitched together with duct tape, and zero shared context between channels.

The typical approach looks like this:

# The "just add another if-statement" pattern
if source == "sms":
    handle_sms(message)
elif source == "email":
    handle_email(message)
elif source == "whatsapp":
    handle_whatsapp(message)
# ... repeat for every new channel

Each handler has its own storage, its own user lookup, its own response logic. Switching from Twilio to Telnyx means rewriting half the codebase. Adding AI to the conversation means threading it through every handler.

The Idea: Rooms, Not Channels

RoomKit introduces a single abstraction: the room. A room is a conversation. Channels attach to rooms. Messages flow in, get processed through hooks, and broadcast to all attached channels.

[SMS] ──┐
[Email] ─┤──→ Room ──→ Hooks ──→ Broadcast ──→ [All Channels]
[AI]  ───┘

The channel doesn't matter. The room is the conversation.
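
Stripped of everything else, the room idea fits in a few lines. This toy version is not RoomKit's code: `ToyRoom` is a hypothetical class, hooks are left out, and it assumes a message is not echoed back to the channel it came from.

```python
from collections import defaultdict


class ToyRoom:
    def __init__(self) -> None:
        self.channels = []               # attached channel ids
        self.timeline = []               # unified (source, body) history
        self.outbox = defaultdict(list)  # what each channel would deliver

    def attach(self, channel_id: str) -> None:
        self.channels.append(channel_id)

    def process_inbound(self, source: str, body: str) -> None:
        # One timeline for the whole conversation, whatever the channel.
        self.timeline.append((source, body))
        # Broadcast to every attached channel except the sender (assumption).
        for channel_id in self.channels:
            if channel_id != source:
                self.outbox[channel_id].append(body)


room = ToyRoom()
for channel_id in ("sms", "email", "ai"):
    room.attach(channel_id)
room.process_inbound("sms", "Where is my order?")
print(dict(room.outbox))
```

Adding a fourth channel is one more `attach` call; nothing in `process_inbound` needs to know which channels exist.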

Show Me the Code

Install:

pip install roomkit

Here's a working example — a support room where a customer on WebSocket talks to an AI assistant:

import asyncio
from roomkit import (
    RoomKit, WebSocketChannel, AIChannel, MockAIProvider,
    ChannelCategory, InboundMessage, TextContent,
)

async def main():
    kit = RoomKit()

    # Register channels
    kit.register_channel(WebSocketChannel("customer-ws"))
    kit.register_channel(AIChannel("assistant", provider=MockAIProvider(
        responses=["I found your order — it shipped yesterday."]
    )))

    # Create room and attach channels
    await kit.create_room(room_id="support-42")
    await kit.attach_channel("support-42", "customer-ws")
    await kit.attach_channel("support-42", "assistant",
                              category=ChannelCategory.INTELLIGENCE)

    # Process an inbound message
    await kit.process_inbound(InboundMessage(
        channel_id="customer-ws",
        sender_id="customer-1",
        content=TextContent(body="Where is my order?"),
    ))

    # Check the conversation timeline
    for event in await kit.store.list_events("support-42"):
        print(f"[{event.source.channel_id}] {event.content.body}")

asyncio.run(main())

Output:

[customer-ws] Where is my order?
[assistant] I found your order — it shipped yesterday.

That's it. The customer's message enters the room, the AI channel picks it up, responds, and everything is stored in a unified timeline. Replace MockAIProvider with AnthropicAIProvider or OpenAIAIProvider for production.

Hooks: Where Your Logic Lives

The hook system is where RoomKit gets interesting. Instead of scattering logic across handlers, you intercept events at well-defined points:

from roomkit import HookTrigger, HookResult

# contains_profanity() and needs_ai_response() stand in for your own application logic.

@kit.hook(HookTrigger.BEFORE_BROADCAST)
async def moderate_content(event, ctx):
    if contains_profanity(event.content.body):
        return HookResult.block("Content policy violation")
    return HookResult.allow()

@kit.hook(HookTrigger.BEFORE_BROADCAST)
async def route_to_ai(event, ctx):
    if needs_ai_response(event, ctx):
        return HookResult.inject_to(["ai-channel"])
    return HookResult.allow()

Content moderation, AI routing, analytics, transformations — all in one place, applied uniformly regardless of which channel the message came from.
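
Conceptually, a hook chain is a list of checks where the first block wins. The sketch below shows that control flow in plain Python. It is not RoomKit's hook engine: `Result`, `run_hooks`, and both example hooks are hypothetical, and the real system also supports async hooks and injection.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Result:
    allowed: bool
    reason: Optional[str] = None


Hook = Callable[[str], Result]


def no_profanity(text: str) -> Result:
    # Toy word list; a real filter would be far more thorough.
    if "darn" in text.lower():
        return Result(False, "Content policy violation")
    return Result(True)


def max_length(text: str) -> Result:
    if len(text) > 160:
        return Result(False, "Message too long")
    return Result(True)


def run_hooks(hooks: List[Hook], text: str) -> Result:
    for hook in hooks:
        result = hook(text)
        if not result.allowed:
            return result  # first block wins; later hooks do not run
    return Result(True)


hooks = [no_profanity, max_length]
print(run_hooks(hooks, "Where is my order?"))
print(run_hooks(hooks, "Where is my darn order?"))
```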

What RoomKit Is (and Isn't)

RoomKit is not a platform. There's no dashboard, no hosted infrastructure, no per-message fees. It's a Python library — primitives you compose into your own system.

Think of it as the missing layer between your channels and your logic:

vs. Twilio/Telnyx: RoomKit doesn't send messages. It orchestrates them. Use Twilio as a provider inside RoomKit.
vs. Chatwoot/Intercom: Those are full applications. RoomKit is what you'd use to build one.
vs. Rasa/Dialogflow: Those focus on NLP. RoomKit focuses on routing messages to AI (or humans) across channels.
vs. LiveKit: LiveKit is WebRTC infrastructure for real-time media. RoomKit is conversation orchestration for any channel.

The Channel Matrix

Built-in channel types with pluggable providers:

Channel	Providers
SMS	Twilio, Telnyx, Sinch, VoiceMeUp
RCS	Twilio, Telnyx
Email	ElasticEmail, SMTP
WhatsApp	Business & Personal
Messenger	Facebook
Teams	Bot Framework
Voice	Deepgram STT, ElevenLabs TTS, FastRTC
Realtime Voice	Gemini Live, OpenAI Realtime
WebSocket	Built-in
AI	Anthropic, OpenAI, Gemini
HTTP	Generic webhooks

Swap providers without touching application logic. Twilio today, Telnyx tomorrow — your hooks don't change.
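
The provider swap can be illustrated with a small sketch based on structural typing. Everything here is an assumption made for illustration: SMSProvider, FakeTwilio, FakeTelnyx, and notify are hypothetical names, no real provider SDK is called, and RoomKit's actual provider interface may look different.

```python
from typing import Protocol


class SMSProvider(Protocol):
    """Hypothetical provider interface; RoomKit's real one may differ."""

    def send(self, to: str, body: str) -> str: ...


class FakeTwilio:
    def send(self, to: str, body: str) -> str:
        # Stand-in for a real API call; returns a made-up delivery ID.
        return f"twilio:{to}:{len(body)}"


class FakeTelnyx:
    def send(self, to: str, body: str) -> str:
        # Same signature, different backend.
        return f"telnyx:{to}:{len(body)}"


def notify(provider: SMSProvider, to: str) -> str:
    """Application logic that depends only on the interface."""
    return provider.send(to, "Your order shipped yesterday.")
```

Calling notify(FakeTwilio(), ...) or notify(FakeTelnyx(), ...) exercises the same application code. Only the object passed in changes, which is the sense in which hooks and business logic stay untouched when a provider is replaced.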

Production-Ready Patterns

RoomKit ships with the resilience patterns you'd eventually build yourself:

Pluggable storage: In-memory for dev, Redis/PostgreSQL for production
Circuit breakers to isolate failing providers
Rate limiting with token buckets
Retry with exponential backoff
Identity resolution across channels (the same person on SMS and email becomes one participant)
Event sources with auto-restart, health monitoring, and backpressure
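
For readers unfamiliar with two of the patterns listed above, here is a generic sketch of a token bucket and an exponential backoff schedule. This is textbook logic, not RoomKit's code: TokenBucket, backoff_delays, and all parameter names are hypothetical, and the injectable clock exists only to make the behavior easy to test.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_delays(base: float, factor: float, retries: int,
                   cap: float) -> list[float]:
    """Delay before each retry: base * factor**attempt, capped at `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]
```

A production retry loop would usually add random jitter to these delays so that many clients do not retry at the same instant. That part is left out here to keep the schedule deterministic.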

AI-Native

This is 2026. Every conversation system needs AI integration. RoomKit was designed for it:

AI channels are first-class citizens, not bolted on
Two voice modes: STT/TTS pipeline or speech-to-speech (Gemini Live, OpenAI Realtime)
llms.txt and AGENTS.md built into the package so AI coding assistants understand the codebase
MCP integration — use RoomKit as a tool in Claude, Cursor, or any MCP-compatible agent
Programmatic AI context: get_llms_txt() and get_agents_md() for feeding documentation into LLM context windows

from roomkit import get_llms_txt, get_agents_md

# Give your AI assistant full context on RoomKit
llms_content = get_llms_txt()
agents_guidelines = get_agents_md()

Protocol-First: The RFC

This is the part I'm most excited about. RoomKit isn't just a Python library — it's a protocol.

Early on, I made a deliberate choice: write the specification before locking in the implementation. The result is roomkit-rfc.md — a language-agnostic RFC that defines rooms, channels, hooks, identity resolution, event schemas, and the full message lifecycle. The Python library is the reference implementation, but the spec stands on its own.

Why does this matter? Because conversation orchestration shouldn't be a Python-only problem. The same room/channel/hook model makes sense in Go, Rust, TypeScript, Java — anywhere you're building multi-channel systems.

This is where you come in. The spec is stable and ready for other language bindings. If you're building conversation systems in Go and want a RoomKit SDK, the RFC gives you everything you need to build one that's compatible with the Python implementation. Same concepts, same semantics, interoperable by design.

What's available today:

roomkit-specs: The protocol RFC — start here if you want to build a binding
roomkit: Python reference implementation
roomkit-docs: Documentation site
roomkit-website: Landing page

I'd love to see a roomkit-go, roomkit-ts, or roomkit-rust emerge from the community. The protocol is designed to make that possible — and I'm happy to support anyone who wants to take it on.

Getting Started

# Core library (only dependency: Pydantic)
pip install roomkit

# With AI providers (quoted so shells such as zsh don't expand the brackets)
pip install "roomkit[anthropic]"
pip install "roomkit[openai]"

# Everything
pip install "roomkit[all]"

The library is fully typed, async-first, and runs on Python 3.12+. The API is stable, the test suite is comprehensive, and the documentation covers everything from quickstart to production deployment.

Website: roomkit.live
GitHub: github.com/roomkit-live
PyPI: pypi.org/project/roomkit
Docs: roomkit.live/docs

RoomKit is open source, MIT licensed, and looking for early adopters. If you're building anything that involves conversations across multiple channels, I'd love to hear how it fits your use case.

Star the repo, try the quickstart, open an issue. Let's build this together.