Building a real-time voice and vision agent feels hard. The hard parts are:
- streaming audio and video in real time
- handling barge-in and reconnection
- wiring STT->LLM->TTS without a pile of glue code
This is where LLMRTC comes in.
Links (start here):
- Project homepage + docs hub: LLMRTC
- Docs (LLMRTC Getting Started): quickstarts for install, backend, and the web client (LLMRTC Quickstart)
- Source, packages, architecture, and examples: GitHub repo
Table of Contents
- Why Real-Time AI Still Feels Hard
- What Is LLMRTC (and Who Is It For)?
- The 60-Second Mental Model
- What You Get Out of the Box
- 5-Minute “Hello Voice Agent”
- Install
- Run a Backend
- Connect from the Browser
- Adding Vision (Camera / Screen-Aware Agents)
- Tool Calling: From “Chat” to “Do Things”
- Wrap-Up + Links (Docs + GitHub)
Why real time AI is hard
If you have tried to build a "talk to an app" experience, you know the trap: the demo is simple, but the system is complex.
A real-time agent is not just an LLM call. It is WebRTC, STT, LLM, TTS, and often vision, plus reconnection, sessions, and observability.
Here are the two things that usually break first.
Latency
A voice agent can be smart but still feel unusable if it is slow. Humans are very sensitive to conversational timing: awkward pauses make an agent feel uncomfortable, robotic, or laggy.
The difficult part is that latency is not one hop; it is the sum of capture -> transport -> model -> synthesis -> playback. If each stage adds a few hundred milliseconds, the round trip quickly drifts past the point where the exchange still feels like a conversation.
Glue + provider drift
When you build this kind of app by hand, it tends to collapse under its own integration weight.
Every provider has its own streaming semantics and event formats, and each handles barge-in differently.
After a while the codebase is mostly glue, not logic related to your product.
Where LLMRTC comes in
LLMRTC aims to make the production path the default path: the one you take on day one and keep shipping on.
What is LLMRTC?
LLMRTC is an open-source TypeScript SDK for building real-time voice and vision AI apps.
It uses WebRTC for low-latency audio and video streaming and provides a unified, provider-agnostic orchestration layer for the complete STT->LLM->TTS pipeline (plus vision), so you can focus on your app logic instead of stitching together streaming, tool calling, and session/reconnect handling.
What you get with LLMRTC
- Real-time voice over WebRTC + server-side VAD: low-latency audio and server-side speech detection, so your agent knows when to listen vs. when to respond, without you wiring the audio plumbing yourself.
- Barge-in (interrupt mid-speech): users can cut the assistant off naturally, and the pipeline handles the "stop talking, start listening" switch, just like a real conversation.
- Provider-agnostic by design: swap or mix providers (OpenAI / Anthropic / Gemini / Bedrock / OpenRouter / local) via config instead of rewriting your app around each vendor's event model.
- Tool calling with JSON Schema: define tools once, get structured arguments back, and keep "agent actions" predictable and debuggable (see the sketch after this list).
- Playbooks for multi-stage flows: move beyond a single prompt into structured multi-step conversations (triage -> confirm -> act -> follow-up).
- Hooks/metrics + reconnection/session persistence: the unsexy production stuff (events, observability, reconnect behaviour, session continuity) is part of the SDK story, not an afterthought.
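To make the tool-calling bullet concrete, here is a minimal sketch of the pattern: a tool described with a JSON Schema, plus a handler that receives the structured arguments. The shape below (a get_weather tool with a handler field) is illustrative, not LLMRTC's exact API, so check the tool-calling docs for the real surface.

// Illustrative tool definition: JSON Schema for the arguments, plus a handler
// that receives already-structured data instead of raw model text.
const getWeatherTool = {
  name: "get_weather",
  description: "Look up the current weather for a city",
  parameters: {
    type: "object",
    properties: {
      city: { type: "string", description: "City name, e.g. Berlin" },
    },
    required: ["city"],
  },
  // The pipeline is expected to validate the model's arguments against the
  // schema before calling this, so the handler stays small and testable.
  handler: async ({ city }: { city: string }) => {
    // Replace with a real lookup in your app.
    return { city, temperatureC: 21, condition: "sunny" };
  },
};

The payoff is that "agent actions" become schema-validated, loggable calls instead of free-form text you have to parse.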
5-minute "Hello Voice Agent"
Installing LLMRTC
You can install LLMRTC with npm:
npm install @llmrtc/llmrtc-backend
npm install @llmrtc/llmrtc-web-client
These packages cover the Node backend (WebRTC + providers) and the browser client (capture/playback + events).
Start a backend
LLMRTC gives you two ways to run the server:
- Library mode (recommended): import LLMRTCServer and configure it in code.
- CLI mode: run npx llmrtc-backend and configure via env/.env.
What you configure:
- Providers: llm, stt, tts
- A systemPrompt
- A port (your browser will connect to this via the signalling URL)
Library mode example
import {
LLMRTCServer,
OpenAILLMProvider,
OpenAIWhisperProvider,
ElevenLabsTTSProvider
} from "@llmrtc/llmrtc-backend";
const server = new LLMRTCServer({
providers: {
llm: new OpenAILLMProvider({ apiKey: process.env.OPENAI_API_KEY! }),
stt: new OpenAIWhisperProvider({ apiKey: process.env.OPENAI_API_KEY! }),
tts: new ElevenLabsTTSProvider({ apiKey: process.env.ELEVENLABS_API_KEY! }),
},
port: 8787,
systemPrompt: "You are a helpful voice assistant.",
});
await server.start();
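Save this as something like server.ts and run it with your usual TypeScript runner (for example npx tsx server.ts). The server then listens on port 8787, which is the signalling URL the browser client connects to in the next step.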
CLI mode example (minimal)
echo "OPENAI_API_KEY=sk-..." > .env
echo "ELEVENLABS_API_KEY=xi-..." >> .env
npx llmrtc-backend
Connect from the browser
Minimal flow:
- Create an LLMRTCWebClient
- Listen for transcripts + streamed LLM chunks
- getUserMedia({ audio: true }) -> client.shareAudio(stream)
Browser example
import { LLMRTCWebClient } from "@llmrtc/llmrtc-web-client";
const client = new LLMRTCWebClient({
signallingUrl: "ws://localhost:8787",
});
client.on("transcript", (text) => console.log("User:", text));
client.on("llmChunk", (chunk) => console.log("Assistant:", chunk));
await client.start();
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
await client.shareAudio(stream);
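If you want to render the conversation in the page instead of the console, the same two events are enough. Here is a minimal sketch that reuses the client created above and swaps in for the console.log handlers; the DOM structure (a #log container) is just an example:

// Assumes <div id="log"></div> in your HTML and the `client` created above.
const log = document.getElementById("log")!;
let assistantLine: HTMLParagraphElement | null = null;

client.on("transcript", (text) => {
  // Each finished user utterance gets its own line.
  const p = document.createElement("p");
  p.textContent = `User: ${text}`;
  log.appendChild(p);
  assistantLine = null; // the next llmChunk starts a fresh assistant line
});

client.on("llmChunk", (chunk) => {
  // Stream the assistant's reply into a single line as chunks arrive.
  if (!assistantLine) {
    assistantLine = document.createElement("p");
    assistantLine.textContent = "Assistant: ";
    log.appendChild(assistantLine);
  }
  assistantLine.textContent = (assistantLine.textContent ?? "") + chunk;
});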
How to implement vision
Send camera frames or screen captures alongside speech, and the agent becomes "screen-aware" or camera-aware instead of voice-only.
Vision-capable models can see what the user sees, which is great for "help me with what's on my screen" or "what am I pointing at?" experiences.
Don't reinvent the patterns: for full walkthroughs, jump into the Concepts and Recipes sections in the docs.
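As a rough browser-side sketch, capturing the screen uses the standard getDisplayMedia API. How you hand the video track to LLMRTC depends on the SDK's video API; shareVideo below is a placeholder name for illustration, so use the method the vision recipes describe:

// Capture the screen (or use getUserMedia({ video: true }) for the camera).
const screenStream = await navigator.mediaDevices.getDisplayMedia({ video: true });

// Hand the stream to the client alongside the audio you are already sharing.
// NOTE: shareVideo is a placeholder; check the LLMRTC vision recipes for the
// actual call.
await client.shareVideo(screenStream);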
Conclusion
LLMRTC is an infrastructure layer for real-time voice + vision agents in TypeScript: WebRTC transport, streaming STT->LLM->TTS, tool calling, and the production details (sessions, reconnects), so you can spend your time on the product.
Here are the key links:
- Docs: https://www.llmrtc.org
- GitHub Repo: https://github.com/llmrtc/llmrtc
