DEV Community

Sylvain Boily

Posted on • Originally published at roomkit.live

Your Voice Assistant Doesn't Need the Cloud

Build a fully local, open-source voice assistant in Python — no API keys, no subscriptions, no data leaving your machine.


If you've ever built a voice assistant, you know the drill: sign up for Deepgram, get an ElevenLabs API key, wire up OpenAI, watch the invoices stack up, and hope your users are comfortable sending their audio to three different cloud providers.

I wanted something different. A voice assistant that runs entirely on my machine — STT, LLM, TTS, everything — with zero cloud dependencies. And I wanted to build it with the same clean abstractions I'd use for a cloud-based setup.

Here's what I ended up with: a fully local voice pipeline running on a single NVIDIA RTX 4070, with the LLM responding in under 300ms after the initial warmup. Everything open source. Everything local.

I built it with RoomKit, an open-source Python framework I created for multi-channel conversation orchestration.

The Stack

No API keys. No cloud. Just models running on your GPU:

| Component | Tool | Role |
| --- | --- | --- |
| STT | Kroko ASR | Speech-to-text (streaming Zipformer via sherpa-onnx) |
| LLM | Qwen 2.5 4B | Language model via Ollama |
| TTS | Piper (fr_FR-siwis-medium) | Text-to-speech (VITS via sherpa-onnx) |
| VAD | TEN-VAD | Voice activity detection (sherpa-onnx) |
| Orchestration | RoomKit | Wires it all together |

The entire audio pipeline looks like this:

Mic → [Resampler] → [AEC] → [Denoiser] → VAD → STT → LLM → TTS → Speaker

Every component runs locally. The heaviest lift is the LLM at 4B parameters — small enough to leave plenty of VRAM for the ONNX models.
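Conceptually, each stage in that chain is just a function over 20ms audio frames, applied left to right. A minimal sketch of the chaining idea — the stage functions here are hypothetical stand-ins, not RoomKit's actual providers:

```python
# Illustrative only: models each pipeline stage as a function over
# audio frames and chains them in order. The stage bodies are fake;
# they just tag the frame so the data flow is visible.
def resample(frame):
    # pretend: convert from the device rate down to 16 kHz
    return {**frame, "rate": 16000}

def denoise(frame):
    # pretend: GTCRN-style speech enhancement
    return {**frame, "denoised": True}

def vad(frame):
    # pretend: flag the frame as speech or silence
    return {**frame, "is_speech": frame["energy"] > 0.1}

PIPELINE = [resample, denoise, vad]

def process(frame):
    for stage in PIPELINE:
        frame = stage(frame)
    return frame

out = process({"rate": 48000, "energy": 0.3})
# the frame comes out resampled, denoised, and flagged as speech
```

Adding or removing a stage is just editing the list — which is exactly the property the real pipeline config exposes.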

Why sherpa-onnx?

When I started building the local voice pipeline for RoomKit, I looked at the usual suspects: faster-whisper, whisper.cpp, Coqui TTS. All great projects. But sherpa-onnx stood out for one reason: it covers the entire audio stack.

One library gives you streaming STT (Zipformer, Whisper, Paraformer), TTS (VITS/Piper voices), neural VAD (TEN-VAD, Silero), speech enhancement (GTCRN denoiser), and even echo cancellation support. All through ONNX Runtime, which means you get CUDA acceleration without PyTorch overhead.

For RoomKit, this was ideal. I could build providers for STT, TTS, VAD, and denoising that all share the same runtime, the same model format, and the same deployment story.

Show Me the Code

Let's build this step by step, starting simple.

Step 1: The providers

Each piece of the pipeline is a RoomKit provider — a pluggable component you can swap without changing your application logic:

from roomkit.voice.stt.sherpa_onnx import SherpaOnnxSTTProvider, SherpaOnnxSTTConfig
from roomkit.voice.tts.sherpa_onnx import SherpaOnnxTTSProvider, SherpaOnnxTTSConfig
from roomkit.voice.pipeline.vad.sherpa_onnx import SherpaOnnxVADProvider, SherpaOnnxVADConfig

# Speech-to-text: Kroko ASR (streaming Zipformer)
stt = SherpaOnnxSTTProvider(SherpaOnnxSTTConfig(
    mode="transducer",
    encoder="kroko-encoder.onnx",
    decoder="kroko-decoder.onnx",
    joiner="kroko-joiner.onnx",
    tokens="tokens.txt",
    sample_rate=16000,
    provider="cuda",  # or "cpu"
))

# Text-to-speech: Piper French voice
tts = SherpaOnnxTTSProvider(SherpaOnnxTTSConfig(
    model="fr_FR-siwis-medium.onnx",
    tokens="tokens.txt",
    data_dir="espeak-ng-data",
    sample_rate=22050,
    provider="cuda",
))

# Voice activity detection: TEN-VAD
vad = SherpaOnnxVADProvider(SherpaOnnxVADConfig(
    model="ten-vad.onnx",
    model_type="ten",
    threshold=0.5,
    silence_threshold_ms=600,
    sample_rate=16000,
    provider="cpu",  # VAD is tiny — CPU is actually faster
))

Notice that VAD runs on CPU. The model is tiny and gets called every 20ms, so the overhead of shuttling each frame to the GPU makes CUDA slower than plain CPU inference. RoomKit lets you mix providers: CUDA for the heavy models, CPU for the lightweight ones.
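The numbers behind that cadence are worth a quick sanity check. At 16kHz with 20ms blocks, each VAD call sees only a few hundred samples, and the 600ms silence threshold amounts to a few dozen consecutive silent frames (values taken from the config above):

```python
# Frame math for the VAD config above: 16 kHz audio, 20 ms blocks,
# 600 ms of silence to close a turn.
sample_rate = 16000   # Hz, matches SherpaOnnxVADConfig
frame_ms = 20         # block_duration_ms in the audio backend
silence_ms = 600      # silence_threshold_ms

samples_per_frame = sample_rate * frame_ms // 1000   # samples per VAD call
frames_to_end_of_turn = silence_ms // frame_ms       # consecutive silent frames

print(samples_per_frame, frames_to_end_of_turn)  # 320 30
```

320 samples per call is a trivial workload — hence CPU winning once you account for the host-to-device copy.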

Step 2: The audio pipeline

RoomKit's AudioPipelineConfig chains the preprocessing stages together:

from roomkit.voice.pipeline import AudioPipelineConfig

pipeline = AudioPipelineConfig(
    vad=vad,
    # Optional: add denoiser and echo cancellation
    # denoiser=SherpaOnnxDenoiserProvider(config),
    # aec=SpeexAECProvider(frame_size=320),
)

Want to add noise reduction? Uncomment one line. Echo cancellation for a speaker setup? One more line. Each stage is optional and pluggable.

Step 3: The LLM

Any OpenAI-compatible server works — Ollama, vLLM, LM Studio:

from roomkit import VLLMConfig, create_vllm_provider
from roomkit.channels.ai import AIChannel

ai_provider = create_vllm_provider(VLLMConfig(
    model="qwen2.5:4b",
    base_url="http://localhost:11434/v1",  # Ollama
    max_tokens=256,
))

ai = AIChannel("ai", provider=ai_provider, system_prompt=(
    "You are a friendly voice assistant. Keep responses "
    "short and conversational — one or two sentences."
))
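"OpenAI-compatible" just means the server accepts the standard `/v1/chat/completions` JSON payload. Here's what that request body looks like when built by hand — constructed but not sent, so it runs without a server (the message contents are illustrative):

```python
# Sketch: the chat-completions payload an OpenAI-compatible client
# would POST to Ollama at http://localhost:11434/v1/chat/completions.
import json

payload = {
    "model": "qwen2.5:4b",
    "max_tokens": 256,
    "messages": [
        {"role": "system", "content": "You are a friendly voice assistant. "
                                      "Keep responses short and conversational."},
        {"role": "user", "content": "Salut, comment ça va ?"},
    ],
}

body = json.dumps(payload).encode()  # ready to send as the request body
```

Because Ollama, vLLM, and LM Studio all speak this same dialect, swapping LLM backends is a one-line `base_url` change.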

Step 4: Wire it all together

This is where RoomKit shines. The room is the conversation — channels attach to it, messages flow through hooks:

from roomkit import RoomKit, VoiceChannel, ChannelCategory
from roomkit.voice.backends.local import LocalAudioBackend

kit = RoomKit()

# Local mic + speakers
backend = LocalAudioBackend(
    input_sample_rate=16000,
    output_sample_rate=22050,
    channels=1,
    block_duration_ms=20,
)

# Register channels
voice = VoiceChannel("voice", stt=stt, tts=tts, backend=backend, pipeline=pipeline)
kit.register_channel(voice)
kit.register_channel(ai)

# Create a room and attach both channels
await kit.create_room(room_id="local-voice")
await kit.attach_channel("local-voice", "voice")
await kit.attach_channel("local-voice", "ai", category=ChannelCategory.INTELLIGENCE)

That's it. Speak into your mic and the audio flows through VAD → STT → LLM → TTS → speaker. No cloud. No API keys.

Step 5: Add hooks for visibility

RoomKit's hook system lets you intercept events at any point — logging, moderation, analytics, custom routing:

from roomkit import HookTrigger, HookResult, HookExecution

@kit.hook(HookTrigger.ON_TRANSCRIPTION)
async def on_transcription(text, ctx):
    print(f"You said: {text}")
    return HookResult.allow()

@kit.hook(HookTrigger.BEFORE_TTS)
async def before_tts(text, ctx):
    print(f"Assistant: {text}")
    return HookResult.allow()

@kit.hook(HookTrigger.ON_SPEECH_START, execution=HookExecution.ASYNC)
async def on_speech_start(session, ctx):
    print("Listening...")

Hooks aren't just for logging. You can block messages, reroute them, transform text before TTS, or trigger external actions — all without touching the pipeline code.
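The allow/block semantics are easy to picture as a chain of async filters. A toy dispatcher in the same spirit — this is the pattern, not RoomKit's implementation:

```python
# Toy allow/block hook chain: hooks run in registration order, and any
# hook returning False vetoes the event.
import asyncio

hooks = []

def hook(fn):
    hooks.append(fn)
    return fn

@hook
async def log_transcription(text):
    print(f"You said: {text}")
    return True                    # allow: let the event continue

@hook
async def block_empty(text):
    return bool(text.strip())      # block blank transcriptions

async def dispatch(text):
    for h in hooks:
        if not await h(text):
            return False           # a hook vetoed the event
    return True

allowed = asyncio.run(dispatch("Salut"))   # passes both hooks
blocked = asyncio.run(dispatch("   "))     # stopped by block_empty
```

A moderation hook, a rerouting hook, or a text-transform hook all slot into the same chain without the pipeline knowing they exist.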

What It Looks Like in Practice

Here's an actual conversation running on my machine (RTX 4070, Ollama with Qwen 2.5 4B, French models):

14:05:08 You said: Salut, comment ça va
14:05:12 Assistant: Ça va bien, merci! Et toi?

14:05:18 You said: Ouais, ça va plutôt pas mal
14:05:18 Assistant: C'est cool! Fais-toi repos et profite du bon temps.

14:05:28 You said: Qu'est-ce que tu sais faire en fait ?
14:05:29 Assistant: Je peux répondre aux questions, donner des suggestions,
          raconter des histoires et aider avec plein de choses!

The first response takes ~4 seconds (Ollama cold start). After that, the LLM responds in ~280ms. The full loop — end of speech to start of audio playback — is well under 2 seconds.

And yes, barge-in works. Start talking while the assistant is speaking and the TTS stops immediately:

14:05:34 TTS interrupted: reason=barge_in, position=4751ms
14:05:34 Speech started (new utterance)

This is the kind of thing that's painful to implement from scratch but comes built into RoomKit's voice pipeline.
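The core of barge-in is a small state machine: if the VAD reports speech while TTS audio is playing, cancel playback and treat the incoming audio as a new utterance. A stripped-down sketch of that logic (illustrative, not RoomKit's code):

```python
# Minimal barge-in state machine: speech detected during playback
# cancels TTS and records where the interruption happened.
class BargeIn:
    def __init__(self):
        self.tts_playing = False
        self.interrupted_at_ms = None

    def start_tts(self):
        self.tts_playing = True

    def on_vad_speech(self, position_ms):
        """Called by the VAD when speech starts."""
        if self.tts_playing:
            self.tts_playing = False            # stop playback immediately
            self.interrupted_at_ms = position_ms
            return "barge_in"
        return "new_utterance"

s = BargeIn()
s.start_tts()
reason = s.on_vad_speech(position_ms=4751)  # user talks over the assistant
```

The hard parts in practice — echo cancellation so the assistant doesn't barge in on itself, and flushing buffered TTS audio — are exactly what the framework handles for you.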

The Real Point: Swap Without Rewriting

Here's what I find most interesting about this setup. The local voice pipeline uses the exact same RoomKit abstractions as the cloud-based one. If tomorrow you want to switch to Deepgram for STT and ElevenLabs for TTS, you swap the providers:

# Local
stt = SherpaOnnxSTTProvider(config)
tts = SherpaOnnxTTSProvider(config)

# Cloud (same interface, same hooks, same room)
stt = DeepgramSTTProvider(api_key="...")
tts = ElevenLabsTTSProvider(api_key="...", voice_id="...")

Your hooks don't change. Your room logic doesn't change. Your recording, moderation, and routing code stays the same. That's the whole point of RoomKit — the channel and provider are implementation details; the conversation is the abstraction.
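That swap works because providers share a common interface, and the application only ever codes against it. The shape of the contract, sketched as a Python `Protocol` — illustrative; RoomKit's actual base classes may differ:

```python
# Illustrative provider contract: anything with a matching transcribe()
# satisfies the Protocol, so local and cloud backends are interchangeable.
from typing import Protocol

class STTProvider(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LocalSTT:
    def transcribe(self, audio: bytes) -> str:
        return "local transcript"   # stand-in for sherpa-onnx inference

class CloudSTT:
    def transcribe(self, audio: bytes) -> str:
        return "cloud transcript"   # stand-in for a cloud API call

def handle_utterance(stt: STTProvider, audio: bytes) -> str:
    # Application logic never knows which implementation it got.
    return stt.transcribe(audio)

print(handle_utterance(LocalSTT(), b""))
print(handle_utterance(CloudSTT(), b""))
```

Hooks, rooms, and routing all sit above this seam, which is why none of them change when the provider does.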

Getting Started

Prerequisites:

  • Linux with NVIDIA GPU (tested on RTX 4070)
  • Ollama installed and running
  • Python 3.12+

Install:

pip install "roomkit[local-audio,openai,sherpa-onnx]"
ollama pull qwen2.5:4b

Download models (one time):

# VAD
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/vad-models/ten-vad.onnx

# STT - Kroko ASR
# (download from https://huggingface.co/Banafo/Kroko-ASR)

# TTS - Piper French voice
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-fr_FR-siwis-medium.tar.bz2
tar xf vits-piper-fr_FR-siwis-medium.tar.bz2

For GPU acceleration (optional but recommended):

# Install cuDNN 9
sudo apt-get install cudnn9-cuda-12

# Install CUDA wheel for sherpa-onnx
pip install sherpa-onnx==1.12.23+cuda12.cudnn9 \
    -f https://k2-fsa.github.io/sherpa/onnx/cuda.html

export ONNX_PROVIDER=cuda

Run the full example:

git clone https://github.com/roomkit-live/roomkit
cd roomkit

# Set your model paths and run
LLM_MODEL=qwen2.5:4b \
LLM_BASE_URL=http://localhost:11434/v1 \
VAD_MODEL=ten-vad.onnx \
STT_ENCODER=<path-to-encoder> \
STT_DECODER=<path-to-decoder> \
STT_JOINER=<path-to-joiner> \
STT_TOKENS=<path-to-tokens> \
TTS_MODEL=fr_FR-siwis-medium.onnx \
TTS_TOKENS=tokens.txt \
TTS_DATA_DIR=espeak-ng-data \
ONNX_PROVIDER=cuda \
python examples/voice_local_onnx_vllm.py

The full example on GitHub includes everything: echo cancellation, noise reduction, WAV recording, debug taps, and graceful shutdown.

What's Next

This example is a starting point. RoomKit's architecture means you can extend it in any direction:

  • Add a WebSocket channel to the same room — now your voice assistant also handles text chat
  • Plug in MCP tools — let the assistant search documents, query databases, or control devices
  • Record and analyze — the built-in WAV recorder captures both sides of the conversation
  • Switch languages — swap STT/TTS models for English, German, Spanish, or any language sherpa-onnx supports

The full source is at github.com/roomkit-live/roomkit. Star the repo, try the example, open an issue. The framework is MIT licensed and actively looking for contributors.

Website: roomkit.live
GitHub: github.com/roomkit-live
PyPI: pypi.org/project/roomkit
