Speech in. Speech out. No fluff. Just vibes.
## What Are We Building?

A voice AI agent that:

- Listens to you speak
- Transcribes your words in real time
- Thinks with an LLM
- Talks back, out loud

All of this, running locally, with Groq and Pipecat.
This is my quickstart into the world of real-time multimodal AI. And honestly? The code is surprisingly clean.
## Why This Is Hard (and Why It's Not Anymore)
Building a real-time voice agent used to require:
- Custom WebRTC servers
- Streaming audio pipelines from scratch
- Gluing together 5 different SDKs
- Fighting with async concurrency bugs
Today? A framework called Pipecat abstracts all of that away.
You declare a pipeline. You plug in services. It just works.
## The Stack

| Layer | Tool |
|---|---|
| STT | Groq (Whisper) |
| LLM | Groq (LLaMA) |
| TTS | Groq (PlayAI) |
| Transport | WebRTC |
| VAD | Silero |
| Framework | Pipecat |
Groq is one of the fastest inference providers available, which matters a lot for real-time voice.
## How the Pipeline Works
Think of it as an assembly line for audio.
```
Microphone
 └──► WebRTC input
 └──► Groq STT (Whisper)
 └──► User Context Aggregator + Silero VAD
 └──► Groq LLM
 └──► Groq TTS
 └──► WebRTC output
 └──► Assistant Context Aggregator
```
Each stage is a processor that receives frames, transforms them, and passes them downstream.
> A "frame" in Pipecat is just a unit of data: audio bytes, text, or a signal to trigger the LLM.
The magic of Pipecat is that you don't manage this flow manually. You declare it, and the framework handles scheduling, buffering, and async coordination.
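To make the frame idea concrete, here's a toy processor in plain Python. The class names and transcript format are illustrative stand-ins, not Pipecat's actual API:

```python
from dataclasses import dataclass

# Illustrative frame types (stand-ins, not Pipecat's real classes).
@dataclass
class AudioFrame:
    pcm: bytes  # raw audio samples

@dataclass
class TextFrame:
    text: str  # a transcript or LLM output

def stt_processor(frame):
    """Toy STT stage: consumes audio frames, emits text frames.

    Frames it doesn't recognize pass through untouched, which is
    how pipeline stages compose without knowing about each other.
    """
    if isinstance(frame, AudioFrame):
        return TextFrame(text=f"<transcript of {len(frame.pcm)} bytes>")
    return frame

print(stt_processor(AudioFrame(pcm=b"\x00" * 320)).text)
# → <transcript of 320 bytes>
```

The pass-through branch is the important part: it lets you splice new stages into the chain without touching the others.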
## VAD: The Unsung Hero
Voice Activity Detection (VAD) is what makes the bot feel responsive.
Without VAD, the pipeline wouldn't know when you've finished speaking. It would either:
- Wait forever
- Cut you off mid-sentence
Silero VAD listens to the audio stream continuously. It fires a signal when you start speaking, and another when you stop. Only after the stop signal does the pipeline forward your speech to STT.
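The turn-taking mechanics can be sketched with a toy energy-gate VAD. Silero uses a small neural model rather than an energy threshold, so treat this purely as an illustration of the start/stop logic:

```python
def detect_turns(energies, threshold=0.1, silence_chunks=3):
    """Toy energy-gate VAD over per-chunk energy values.

    A turn starts at the first chunk above `threshold` and ends once
    `silence_chunks` consecutive chunks fall below it. (Silero does this
    with a neural model, not an energy gate; the mechanics are the point.)
    """
    turns, start, quiet = [], None, 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i                    # speech started
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= silence_chunks:      # enough silence: turn is over
                turns.append((start, i - silence_chunks))
                start, quiet = None, 0
    if start is not None:                    # stream ended mid-speech
        turns.append((start, len(energies) - 1))
    return turns

print(detect_turns([0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0, 0.0, 0.0]))
# → [(1, 2), (6, 7)]
```

The `silence_chunks` grace period is why the bot doesn't cut you off at every breath pause.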
```python
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
)
```
One line. Fully streaming VAD. That's the abstraction Pipecat gives you.
## The Code: Let's Walk Through It

The full bot lives in a single file: `main.py`. Let's break it down.

### 1. Services: Plug in Groq

```python
stt = GroqSTTService(api_key=GROQ_API_KEY)
tts = GroqTTSService(api_key=GROQ_API_KEY)
llm = GroqLLMService(api_key=GROQ_API_KEY)
```
Three services, three lines. STT, TTS, LLM. All running on Groq.
### 2. Context: Give the Bot a Memory

```python
messages = [
    {
        "role": "system",
        "content": "You are a friendly AI assistant. Respond naturally and keep your answers conversational. Always give short, concise answers, no more than 2-3 sentences.",
    }
]

context = LLMContext(messages)
```
This is the conversation history. Every turn (user speech and bot response) gets appended here automatically by the aggregators.
The system prompt is where you shape the bot's personality. Short sentences, direct tone, no essays.
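Conceptually, each turn just appends to that list, so the LLM always sees the full history. A sketch of what the aggregators maintain (the message contents here are made up):

```python
messages = [{"role": "system", "content": "You are a friendly AI assistant."}]

# After one full turn, the aggregators have appended both sides:
messages.append({"role": "user", "content": "Give me some info about Morocco"})
messages.append({"role": "assistant", "content": "Morocco is a country in North Africa..."})

print([m["role"] for m in messages])
# → ['system', 'user', 'assistant']
```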
### 3. The Pipeline: Declare the Flow

```python
pipeline = Pipeline([
    transport.input(),      # audio in
    stt,                    # transcribe
    user_aggregator,        # accumulate + VAD
    llm,                    # think
    tts,                    # speak
    transport.output(),     # audio out
    assistant_aggregator,   # save response to context
])
```
Read it top to bottom. That's literally the data flow. Clean. Declarative. No callback spaghetti.
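The declarative style boils down to "a pipeline is a list of stages, and running it is a fold over that list." A minimal sketch using dicts as frames (all names here are hypothetical, not Pipecat's):

```python
from functools import reduce

# Hypothetical stages: each is just a function from frame to frame.
def transcribe(frame):
    return frame | {"text": "hello"}

def think(frame):
    return frame | {"reply": frame["text"].upper()}

def speak(frame):
    return frame | {"audio": f"tts({frame['reply']})"}

# Declared top to bottom, exactly like the Pipeline list above.
pipeline = [transcribe, think, speak]

result = reduce(lambda frame, stage: stage(frame), pipeline, {"pcm": b"..."})
print(result["audio"])
# → tts(HELLO)
```

The real framework adds async scheduling and buffering on top, but the mental model is the same fold.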
### 4. Events: Connect and Disconnect

```python
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
    context.add_message(
        {"role": "system", "content": "Say Hello, and briefly introduce yourself."}
    )
    await task.queue_frames([LLMRunFrame()])
```
When a client connects, we inject a message into the context and trigger the LLM manually with `LLMRunFrame()`. This fires the greeting before the user says anything.
```python
@transport.event_handler("on_client_disconnected")
async def client_disconnected(transport, client):
    await task.cancel()
```

On disconnect: clean shutdown. No zombie pipelines.
> **Common mistake:** registering `on_client_disconnected` on `task` instead of `transport`. The event lives on the transport. Get this wrong and the handler silently never fires.
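Why is the failure silent? Firing an event only consults handlers registered on that same object; a handler parked on the wrong object is simply never looked up. A toy emitter (not Pipecat's actual implementation) shows the effect:

```python
class Emitter:
    """Toy event emitter, illustrative only."""
    def __init__(self):
        self.handlers = {}

    def event_handler(self, name):
        def register(fn):
            self.handlers.setdefault(name, []).append(fn)
            return fn
        return register

    def fire(self, name):
        # Unknown event names are a silent no-op: no error, no handler call.
        for fn in self.handlers.get(name, []):
            fn()

transport, task = Emitter(), Emitter()
calls = []

@task.event_handler("on_client_disconnected")       # wrong object!
def never_fires():
    calls.append("task handler")

@transport.event_handler("on_client_disconnected")  # right object
def fires():
    calls.append("transport handler")

transport.fire("on_client_disconnected")
print(calls)
# → ['transport handler']
```

The mis-registered handler sits in `task.handlers` forever, untouched, with no warning.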
### 5. Run It

```python
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
await runner.run(task)
```
The runner manages the lifecycle of the task: starts it, keeps it alive, handles signals.
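Under the hood this is ordinary asyncio task lifecycle management. A minimal sketch of the cancel-and-drain pattern (names are illustrative, and the "signal" is simulated by a direct cancel):

```python
import asyncio

state = {"clean": False}

async def pipeline_task():
    """Stand-in for a pipeline task: runs until cancelled."""
    try:
        while True:
            await asyncio.sleep(0.01)  # pretend to process frames
    except asyncio.CancelledError:
        state["clean"] = True          # flush buffers, close connections, etc.
        raise

async def main():
    task = asyncio.create_task(pipeline_task())
    await asyncio.sleep(0.05)  # a real runner would wait for SIGINT here
    task.cancel()              # roughly what handle_sigint wires up for you
    try:
        await task             # drain: wait for the cleanup to finish
    except asyncio.CancelledError:
        pass

asyncio.run(main())
print("clean shutdown:", state["clean"])
# → clean shutdown: True
```

The `await task` after `cancel()` is what guarantees the cleanup code actually ran before the process exits.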
## Try It Yourself

```shell
git clone https://github.com/BENHIMA-Mohamed-Amine/pipecat-demos.git
cd pipecat-demos
uv sync
cp .env.example .env
# Add your GROQ_API_KEY
uv run python main.py
```
Open the URL printed in your terminal, click **Connect** in the top-right corner, and say:

> "Give me some info about Morocco"
## What's Next?
This is just the beginning.
The bot you just built runs locally, using Pipecat's built-in WebRTC playground. Great for prototyping. Not production.
In the next post, we'll go deeper:
- Wrap the bot in a FastAPI web app
- Expose a proper `/connect` endpoint
- Replace the toy client with a real frontend
- Make it deployable
The architecture shifts from a script to a service. That's where it gets real.
## Final Thoughts
What strikes me most about this stack is how much complexity Pipecat hides.
WebRTC negotiation, audio buffering, frame scheduling, async pipelines. All gone. You write business logic. The framework handles the plumbing.
That's the right abstraction level for building production-grade real-time AI agents.
We're genuinely at an inflection point. The tools are here. The APIs are accessible.
There's never been a better time to build voice AI.