Building a real-time voice and vision agent feels hard. The hard parts are:
- streaming audio and video in real time
- handling barge-in and reconnection
- wiring STT->LLM->TTS without a pile of glue code
This is where LLMRTC comes in.
Links (start here):
- Project homepage + docs hub: LLMRTC
- Docs (LLMRTC Getting Started): quickstarts for install, backend, and the web client (LLMRTC Quickstart)
- Source, packages, architecture, and examples: GitHub repo
Table of Contents
- Why Real-Time AI Still Feels Hard
- What Is LLMRTC (and Who Is It For)?
- The 60-Second Mental Model
- What You Get Out of the Box
- 5-Minute “Hello Voice Agent”
- Install
- Run a Backend
- Connect from the Browser
- Adding Vision (Camera / Screen-Aware Agents)
- Tool Calling: From “Chat” to “Do Things”
- Wrap-Up + Links (Docs + GitHub)
Why real time AI is hard
If you have tried to build a "talk to an app" experience, you know the trap: the demo is simple, but the system is complex.
A real-time agent is not just an LLM call. It is WebRTC, STT, LLM, TTS, and often vision, plus reconnection, sessions, and observability.
Here are the two things that usually break first.
Latency
A voice agent can be smart but still feel unusable if it is slow. Humans are very sensitive to conversational timing: awkward pauses make an agent feel uncomfortable, robotic, or laggy.
The difficult part is that latency is not one hop; it is the sum of capture -> transport -> model -> synthesis -> playback. If each stage adds a few hundred milliseconds, the round trip quickly drifts past the point where the exchange still feels like a conversation.
Glue + provider drift
When you build this kind of app by hand, it tends to collapse under its own integration weight.
Every provider has its own streaming semantics and event formats, and each handles barge-in differently.
After a while the codebase is mostly glue, not logic related to your product.
Where LLMRTC comes in
LLMRTC aims to make the production path the default path: the one you take on day one and keep shipping on.
What is LLMRTC?
LLMRTC is an open-source TypeScript SDK for building real-time voice and vision AI apps.
It uses WebRTC for low-latency audio and video streaming and provides a unified, provider-agnostic orchestration layer for the complete STT->LLM->TTS pipeline (plus vision), so you can focus on your app logic instead of stitching together streaming, tool calling, and session/reconnect handling.
What you get with LLMRTC
- Real-time voice over WebRTC + server-side VAD: low-latency audio and server-side speech detection, so your agent knows when to listen vs. when to respond, without you wiring the audio plumbing yourself.
- Barge-in (interrupt mid-speech): users can cut the assistant off naturally, and the pipeline handles the "stop talking, start listening" switch, just like a real conversation.
- Provider-agnostic by design: swap or mix providers (OpenAI / Anthropic / Gemini / Bedrock / OpenRouter / local) via config instead of rewriting your app around each vendor's event model.
- Tool calling with JSON Schema: define tools once, get structured arguments back, and keep "agent actions" predictable and debuggable (see the sketch after this list).
- Playbooks for multi-stage flows: move beyond a single prompt into structured multi-step conversations (triage -> confirm -> act -> follow-up).
- Hooks/metrics + reconnection/session persistence: the unsexy production stuff (events, observability, reconnect behaviour, session continuity) is part of the SDK story, not an afterthought.
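To make the tool-calling bullet concrete, here is a minimal sketch of the pattern: a tool described with a JSON Schema, plus a handler that receives the structured arguments. The shape below (a get_weather tool with a handler field) is illustrative, not LLMRTC's exact API, so check the tool-calling docs for the real surface.

// Illustrative tool definition: JSON Schema for the arguments, plus a handler
// that receives already-structured data instead of raw model text.
const getWeatherTool = {
  name: "get_weather",
  description: "Look up the current weather for a city",
  parameters: {
    type: "object",
    properties: {
      city: { type: "string", description: "City name, e.g. Berlin" },
    },
    required: ["city"],
  },
  // The pipeline is expected to validate the model's arguments against the
  // schema before calling this, so the handler stays small and testable.
  handler: async ({ city }: { city: string }) => {
    // Replace with a real lookup in your app.
    return { city, temperatureC: 21, condition: "sunny" };
  },
};

The payoff is that "agent actions" become schema-validated, loggable calls instead of free-form text you have to parse.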
5-minute "Hello Voice Agent"
Installing LLMRTC
You can install LLMRTC with npm:
npm install @llmrtc/llmrtc-backend
npm install @llmrtc/llmrtc-web-client
These packages cover the Node backend (WebRTC + providers) and the browser client (capture/playback + events).
Start a backend
LLMRTC gives you two ways to run the server:
- Library mode (recommended): import LLMRTCServer and configure it in code.
- CLI mode: run npx llmrtc-backend and configure via env/.env.
What you configure:
- Providers: llm, stt, tts
- A systemPrompt
- A port (your browser will connect to this via the signalling URL)
Library mode example
import {
LLMRTCServer,
OpenAILLMProvider,
OpenAIWhisperProvider,
ElevenLabsTTSProvider
} from "@llmrtc/llmrtc-backend";
const server = new LLMRTCServer({
providers: {
llm: new OpenAILLMProvider({ apiKey: process.env.OPENAI_API_KEY! }),
stt: new OpenAIWhisperProvider({ apiKey: process.env.OPENAI_API_KEY! }),
tts: new ElevenLabsTTSProvider({ apiKey: process.env.ELEVENLABS_API_KEY! }),
},
port: 8787,
systemPrompt: "You are a helpful voice assistant.",
});
await server.start();
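Save this as something like server.ts and run it with your usual TypeScript runner (for example npx tsx server.ts). The server then listens on port 8787, which is the signalling URL the browser client connects to in the next step.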
CLI mode example (minimal)
echo "OPENAI_API_KEY=sk-..." > .env
echo "ELEVENLABS_API_KEY=xi-..." >> .env
npx llmrtc-backend
Connect from the browser
Minimal flow:
- Create an LLMRTCWebClient
- Listen for transcripts + streamed LLM chunks
- getUserMedia({ audio: true }) -> client.shareAudio(stream)
Browser example
import { LLMRTCWebClient } from "@llmrtc/llmrtc-web-client";
const client = new LLMRTCWebClient({
signallingUrl: "ws://localhost:8787",
});
client.on("transcript", (text) => console.log("User:", text));
client.on("llmChunk", (chunk) => console.log("Assistant:", chunk));
await client.start();
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
await client.shareAudio(stream);
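If you want to render the conversation in the page instead of the console, the same two events are enough. Here is a minimal sketch that reuses the client created above and swaps in for the console.log handlers; the DOM structure (a #log container) is just an example:

// Assumes <div id="log"></div> in your HTML and the `client` created above.
const log = document.getElementById("log")!;
let assistantLine: HTMLParagraphElement | null = null;

client.on("transcript", (text) => {
  // Each finished user utterance gets its own line.
  const p = document.createElement("p");
  p.textContent = `User: ${text}`;
  log.appendChild(p);
  assistantLine = null; // the next llmChunk starts a fresh assistant line
});

client.on("llmChunk", (chunk) => {
  // Stream the assistant's reply into a single line as chunks arrive.
  if (!assistantLine) {
    assistantLine = document.createElement("p");
    assistantLine.textContent = "Assistant: ";
    log.appendChild(assistantLine);
  }
  assistantLine.textContent = (assistantLine.textContent ?? "") + chunk;
});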
How to implement vision
Send camera frames or screen captures alongside speech, and the agent becomes "screen-aware" or camera-aware instead of voice-only.
Vision-capable models can see what the user sees, which is great for "help me with what's on my screen" or "what am I pointing at?" experiences.
Don't reinvent the patterns: for full walkthroughs, jump into the Concepts and Recipes sections in the docs.
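As a rough browser-side sketch, capturing the screen uses the standard getDisplayMedia API. How you hand the video track to LLMRTC depends on the SDK's video API; shareVideo below is a placeholder name for illustration, so use the method the vision recipes describe:

// Capture the screen (or use getUserMedia({ video: true }) for the camera).
const screenStream = await navigator.mediaDevices.getDisplayMedia({ video: true });

// Hand the stream to the client alongside the audio you are already sharing.
// NOTE: shareVideo is a placeholder; check the LLMRTC vision recipes for the
// actual call.
await client.shareVideo(screenStream);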
Conclusion
LLMRTC is an infrastructure layer for real-time voice + vision agents in TypeScript: WebRTC transport, streaming STT->LLM->TTS, tool calling, and the production details (sessions, reconnects), so you can spend your time on the product.
Here are the key links:
- Docs: https://www.llmrtc.org
- GitHub Repo: https://github.com/llmrtc/llmrtc
