Mohamed-Amine BENHIMA
๐ŸŽ™๏ธ I Built a Real-Time Voice AI Agent in ~90 Lines of Python

Speech in. Speech out. No fluff. Just vibes.


🧠 What Are We Building?

A voice AI agent that:

  • 🎤 Listens to you speak
  • 📝 Transcribes your words in real time
  • 🤖 Thinks with an LLM
  • 🔊 Talks back, out loud

All of this from a single local script, with Groq handling inference and Pipecat handling the plumbing.

This is my quickstart into the world of real-time multimodal AI. And honestly? The code is surprisingly clean.



🤔 Why This Is Hard (and Why It's Not Anymore)

Building a real-time voice agent used to require:

  • โŒ Custom WebRTC servers
  • โŒ Streaming audio pipelines from scratch
  • โŒ Gluing together 5 different SDKs
  • โŒ Fighting with async concurrency bugs

Today? A framework called Pipecat abstracts all of that away.

You declare a pipeline. You plug in services. It just works.


🧩 The Stack

Layer          Tool
🎙️ STT         Groq (Whisper)
🧠 LLM         Groq (LLaMA)
🔊 TTS         Groq (PlayAI)
📡 Transport   WebRTC
🔇 VAD         Silero
🔧 Framework   Pipecat

Groq is one of the fastest inference providers available, which matters a lot for real-time voice.


๐Ÿ—๏ธ How the Pipeline Works

Think of it as an assembly line for audio.

🎤 Microphone
    └─► 📡 WebRTC input
            └─► 📝 Groq STT (Whisper)
                    └─► 🧩 User Context Aggregator + Silero VAD
                                └─► 🧠 Groq LLM
                                        └─► 🔊 Groq TTS
                                                └─► 📡 WebRTC output
                                                        └─► 🧩 Assistant Context Aggregator

Each stage is a processor that receives frames, transforms them, and passes them downstream.

💡 A "frame" in Pipecat is just a unit of data: audio bytes, text, or a signal to trigger the LLM.

The magic of Pipecat is that you don't manage this flow manually. You declare it, and the framework handles scheduling, buffering, and async coordination.
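To make the frame idea concrete, here's a toy version of that flow in plain Python. None of this is Pipecat's actual API; the frame and processor names are made up purely to illustrate the pattern of processors passing frames downstream.

```python
from dataclasses import dataclass

# Toy frame types, loosely inspired by Pipecat's (names are illustrative).
@dataclass
class AudioFrame:
    data: bytes

@dataclass
class TextFrame:
    text: str

class Processor:
    """A pipeline stage: receives a frame, transforms it, passes it on."""
    def process(self, frame):
        return frame  # default: pass through unchanged

class FakeSTT(Processor):
    """Stand-in for a speech-to-text stage."""
    def process(self, frame):
        if isinstance(frame, AudioFrame):
            return TextFrame(text="hello")  # pretend we transcribed the audio
        return frame

def run_pipeline(stages, frame):
    # Each stage hands its output to the next one, assembly-line style.
    for stage in stages:
        frame = stage.process(frame)
    return frame

result = run_pipeline([FakeSTT()], AudioFrame(data=b"\x00\x01"))
print(result.text)  # the audio frame came out the other end as text
```

The real framework does this asynchronously and in both directions, but the mental model is the same.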


๐Ÿ‘๏ธ VAD: The Unsung Hero

Voice Activity Detection (VAD) is what makes the bot feel responsive.

Without VAD, the pipeline wouldn't know when you've finished speaking. It would either:

  • Wait forever ⏳
  • Cut you off mid-sentence ✂️

Silero VAD listens to the audio stream continuously. It fires a signal when you start speaking, and another when you stop. Only after the stop signal does the pipeline forward your speech to STT.
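Here's a toy illustration of that start/stop signalling. It uses a simple energy threshold in place of Silero's neural model, so it's purely illustrative, but the events it emits play the same role:

```python
def vad_events(energies, threshold=0.5):
    """Emit ("start", i) / ("stop", i) events from per-chunk energy levels.

    A crude stand-in for Silero VAD: the real thing uses a trained model,
    but the start/stop signalling it produces works the same way.
    """
    events, speaking = [], False
    for i, energy in enumerate(energies):
        if energy >= threshold and not speaking:
            events.append(("start", i))   # user began speaking
            speaking = True
        elif energy < threshold and speaking:
            events.append(("stop", i))    # user went quiet: forward to STT
            speaking = False
    return events

# silence, speech, speech, silence → one start and one stop signal
print(vad_events([0.1, 0.8, 0.9, 0.2]))  # [('start', 1), ('stop', 3)]
```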

user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer())
)

One line. Fully streaming VAD. That's the abstraction Pipecat gives you. 🙌


๐Ÿง‘โ€๐Ÿ’ป The Code: Let's Walk Through It

The full bot lives in a single file: main.py. Let's break it down.

1๏ธโƒฃ Services: Plug in Groq

stt = GroqSTTService(api_key=GROQ_API_KEY)
tts = GroqTTSService(api_key=GROQ_API_KEY)
llm = GroqLLMService(api_key=GROQ_API_KEY)

Three services, three lines. STT, TTS, LLM. All running on Groq.


2๏ธโƒฃ Context: Give the Bot a Memory

messages = [
    {
        "role": "system",
        "content": "You are a friendly AI assistant. Respond naturally and keep your answers conversational. Always give short, concise answers - no more than 2-3 sentences.",
    }
]

context = LLMContext(messages)

This is the conversation history. Every turn (user speech and bot response) gets appended here automatically by the aggregators.

The system prompt is where you shape the bot's personality. Short sentences, direct tone, no essays. ✅
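To picture what the aggregators do, here's a hand-rolled sketch of the context growing over one turn. The helper functions are hypothetical; in the real bot the aggregators append these messages for you:

```python
# A conversation context is just a growing list of role/content messages.
messages = [
    {"role": "system", "content": "You are a friendly AI assistant."},
]

def add_user_turn(messages, text):
    # What the user aggregator does: append the transcribed speech.
    messages.append({"role": "user", "content": text})

def add_assistant_turn(messages, text):
    # What the assistant aggregator does: append the bot's spoken reply.
    messages.append({"role": "assistant", "content": text})

add_user_turn(messages, "Give me some info about Morocco")
add_assistant_turn(messages, "Morocco is a country in North Africa.")

print(len(messages))  # 3: the LLM sees the whole history on the next turn
```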


3๏ธโƒฃ The Pipeline: Declare the Flow

pipeline = Pipeline([
    transport.input(),       # 🎤 Audio in
    stt,                     # 📝 Transcribe
    user_aggregator,         # 🧩 Accumulate + VAD
    llm,                     # 🧠 Think
    tts,                     # 🔊 Speak
    transport.output(),      # 📡 Audio out
    assistant_aggregator,    # 🧩 Save response to context
])

Read it top to bottom. That's literally the data flow. Clean. Declarative. No callback spaghetti.


4๏ธโƒฃ Events: Connect and Disconnect

@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
    context.add_message(
        {"role": "system", "content": "Say Hello, and briefly introduce yourself."}
    )
    await task.queue_frames([LLMRunFrame()])

When a client connects, we inject a message into the context and trigger the LLM manually with LLMRunFrame(). This fires the greeting before the user says anything.

@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
    await task.cancel()

On disconnect: clean shutdown. No zombie pipelines. 🧹

โš ๏ธ Common mistake: registering on_client_disconnected on task instead of transport. The event lives on the transport. Get this wrong and the handler silently never fires.


5๏ธโƒฃ Run It

runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
await runner.run(task)

The runner manages the lifecycle of the task: starts it, keeps it alive, handles signals.
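Under the hood this is ordinary asyncio task management. Here's a miniature, Pipecat-free sketch of the same lifecycle: start a long-running task, keep it alive, then cancel it cleanly (names here are made up for illustration):

```python
import asyncio

async def pipeline_task():
    # Stand-in for the Pipecat task: runs until it is cancelled.
    try:
        while True:
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        print("clean shutdown")  # teardown hook runs here
        raise

async def runner():
    # What a runner does, in miniature: start the task, keep it alive,
    # and cancel it when it's time to stop (signal, disconnect, etc.).
    task = asyncio.create_task(pipeline_task())
    await asyncio.sleep(0.05)   # the pipeline is "alive" during this window
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # cancellation is the expected way to stop

asyncio.run(runner())  # prints "clean shutdown"
```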


🚀 Try It Yourself

git clone https://github.com/BENHIMA-Mohamed-Amine/pipecat-demos.git
cd pipecat-demos

uv sync

cp .env.example .env
# Add your GROQ_API_KEY

uv run python main.py

Open the URL printed in your terminal, click Connect in the top-right corner, and say:

"Give me some info about Morocco" 🇲🇦


🔭 What's Next?

This is just the beginning.

The bot you just built runs locally, using Pipecat's built-in WebRTC playground. Great for prototyping. Not production.

In the next post, we'll go deeper:

  • ๐Ÿ Wrap the bot in a FastAPI web app
  • ๐Ÿ“ฆ Expose a proper /connect endpoint
  • ๐ŸŒ Replace the toy client with a real frontend
  • ๐Ÿšข Make it deployable

The architecture shifts from a script to a service. That's where it gets real.


💬 Final Thoughts

What strikes me most about this stack is how much complexity Pipecat hides.

WebRTC negotiation, audio buffering, frame scheduling, async pipelines. All gone. You write business logic. The framework handles the plumbing.

That's the right abstraction level for building production-grade real-time AI agents.

We're genuinely at an inflection point. The tools are here. The APIs are accessible.

There's never been a better time to build voice AI. 🎙️
