John

Build and run real-time media pipelines, Speech to Text, Voice Agents, live audio processing

StreamKit.dev lets you build and run real-time media pipelines on your own infrastructure: speech-to-text, voice agents, live audio processing. Pipelines are composable, observable, and self-hosted, and the project is fully open source. Full reference documentation is at https://streamkit.dev/

Who is this for?
StreamKit is built for developers who need to process real-time media — whether you’re building voice features for an app, prototyping an AI audio pipeline, or self-hosting alternatives to cloud speech APIs.

What you can build
Live transcription — Ingest audio via MoQ, run Whisper or SenseVoice STT, stream transcription updates to clients
Voice agents — TTS-powered bots using Kokoro, Piper, or Matcha that respond to audio input
Real-time translation — Bilingual streams with live subtitles using NLLB or Helsinki models
Audio processing — Mixing, gain control, format conversion, encoding/decoding pipelines
Content analysis — VAD for speech detection, keyword spotting, or custom safety filters
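To make the "composable pipeline" idea concrete, here is a minimal sketch of a DAG-style audio pipeline with a gain stage feeding an energy-based VAD stage. This is illustrative only: the stage names, `Frame` type, and `run_pipeline` helper are assumptions for the sketch, not StreamKit's actual API.

```python
# Hypothetical sketch of a composable audio pipeline.
# All names here are illustrative, not StreamKit's real interface.
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]  # one frame of PCM samples, normalized to [-1.0, 1.0]

@dataclass
class Stage:
    name: str
    fn: Callable[[Frame], Frame]

def gain(db: float) -> Stage:
    """Scale samples by a decibel amount (e.g. +6 dB roughly doubles amplitude)."""
    factor = 10 ** (db / 20)
    return Stage("gain", lambda frame: [s * factor for s in frame])

def vad(threshold: float) -> Stage:
    """Energy-based voice activity detection: silence frames become empty."""
    def fn(frame: Frame) -> Frame:
        energy = sum(s * s for s in frame) / max(len(frame), 1)
        return frame if energy >= threshold else []
    return Stage("vad", fn)

def run_pipeline(stages: List[Stage], frame: Frame) -> Frame:
    """Run a frame through the stages in order, stopping if a stage drops it."""
    for stage in stages:
        frame = stage.fn(frame)
        if not frame:  # a stage dropped the frame (e.g. silence)
            break
    return frame

pipeline = [gain(6.0), vad(0.01)]
speech = [0.5, -0.4, 0.3]
silence = [0.001, -0.002, 0.001]
print(len(run_pipeline(pipeline, speech)))   # speech frame passes through → 3
print(len(run_pipeline(pipeline, silence)))  # silence frame is dropped → 0
```

In a real deployment, the STT, TTS, or translation models listed above would sit behind stages like these, with MoQ handling ingest and delivery at the edges.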

You can try the engine in a live demo at https://demo.streamkit.dev/

Top comments (1)

Ankush Banyal

Hey, this StreamKit.dev looks really interesting for real-time processing. I was checking out the Rust engine and how it handles the DAG-based pipelines—the modularity is quite clean for something open source.

Actually, we are using Ant Media Server for our streaming needs since it handles the high-scale WebRTC and LL-HLS delivery very well with sub-500ms latency. But sometimes, the challenge is what to do with the audio before it reaches the edge—like running VAD or Whisper for live captions.

I think there is a great use case here where Ant Media handles the massive distribution and clustering, and StreamKit acts as a "processing sidecar" for the AI agents or mixing. Since both are focusing on Media over QUIC (MoQ) and low-latency, they could work together very nicely in a self-hosted setup.

Is there any plan to support video frames in the pipeline soon, or is it strictly audio-only agents for now? Good job on the demo btw, it works very smoothly.