How I Designed a Modular, Event-Driven Architecture for Real-Time Voice AI

Most voice AI systems today are built as a fixed chain:

STT → LLM → TTS → Audio Output.

This works for demos, but falls apart the moment you need:

  • Custom business logic
  • CRM integrations
  • Multi-agent routing
  • Knowledge lookups
  • Scheduling flows
  • Post-call actions
  • Pipeline branching
  • Swappable providers (Claude vs GPT, Deepgram vs Whisper, etc.)

So for EchoStack, I scrapped the idea of a “voice bot pipeline” entirely and built a voice automation platform powered by an event-driven orchestration layer.

Here’s how the architecture works — and why it has completely changed what’s possible with real-time AI.

LiveKit Only Handles Ingress & Egress

Not STT.
Not LLM.
Not TTS.

Just pure audio transport:

User Mic → LiveKit → EchoStack  
EchoStack → LiveKit → User Speaker

Inside EchoStack, every audio frame becomes an event:

processing.livekit.audio_frame

This makes the audio layer fully modular and independent of AI logic.
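
To make that concrete, here is a minimal sketch of the idea: raw frames get wrapped as events on an internal bus, and nothing downstream knows LiveKit exists. The EventBus class, payload shape, and handler are illustrative stand-ins, not EchoStack's actual internals; only the topic name comes from the article.

import asyncio
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]

class EventBus:
    """Tiny in-process pub/sub; topics are dot-separated strings."""
    def __init__(self) -> None:
        self._subs: dict[str, list[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subs.setdefault(topic, []).append(handler)

    async def publish(self, topic: str, payload: dict) -> None:
        for handler in self._subs.get(topic, []):
            await handler(payload)

async def on_audio_frame(event: dict) -> None:
    # A downstream STT connector would consume this topic in the real system.
    print(f"got {len(event['pcm'])} bytes of PCM at {event['sample_rate']} Hz")

async def main() -> None:
    bus = EventBus()
    bus.subscribe("processing.livekit.audio_frame", on_audio_frame)

    # In production the payload comes from a LiveKit track; here we fake a
    # 20 ms frame of 16 kHz mono PCM so the sketch stays self-contained.
    frame = {"pcm": bytes(640), "sample_rate": 16_000, "track_id": "user-mic"}
    await bus.publish("processing.livekit.audio_frame", frame)

asyncio.run(main())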

Everything Inside EchoStack Is a Connector

A connector can be:

  • Deepgram STT
  • WhisperX
  • AssemblyAI
  • Claude
  • GPT-4o
  • Llama 3
  • ElevenLabs
  • Azure Neural TTS
  • HubSpot
  • Salesforce
  • Zendesk
  • Calendly
  • A custom HTTP API
  • A knowledge search
  • A database entry
  • Or even another AI agent

Every connector declares what it consumes and what it produces:

{
  "consumes": ["processing.deepgram.text"],
  "produces": ["processing.claude.agent_message"]
}

EchoStack uses this to decide where events flow next.

This creates a real-time version of Zapier or LangGraph.
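
A hedged sketch of how that routing decision could work: fold every connector manifest into a topic-to-consumers table, then look up the next hops whenever an event is published. The connector names and the next_hops helper are illustrative, not EchoStack's real registry.

from collections import defaultdict

manifests = {
    "deepgram.stt":   {"consumes": ["processing.livekit.audio_frame"],
                       "produces": ["processing.deepgram.text"]},
    "claude.agent":   {"consumes": ["processing.deepgram.text"],
                       "produces": ["processing.claude.agent_message"]},
    "hubspot.logger": {"consumes": ["processing.claude.agent_message"],
                       "produces": []},
}

# topic -> list of connectors that want events on that topic
routes: dict[str, list[str]] = defaultdict(list)
for name, manifest in manifests.items():
    for topic in manifest["consumes"]:
        routes[topic].append(name)

def next_hops(topic: str) -> list[str]:
    """Who should receive an event published on this topic?"""
    return routes.get(topic, [])

print(next_hops("processing.deepgram.text"))          # ['claude.agent']
print(next_hops("processing.claude.agent_message"))   # ['hubspot.logger']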

Pipelines Are Just Manifests

Instead of hardcoded logic, pipelines are defined like this:

{
  "pipeline": [
    "ingress.livekit.audio_frame → deepgram.stt",
    "deepgram.stt → claude.agent",
    "claude.agent → elevenlabs.tts",
    "elevenlabs.tts → egress.livekit.audio_chunk"
  ]
}

No code.
No wiring.
Just declarative routing.

Want to swap Deepgram for Whisper?
Edit one line.

Want to add sentiment analysis between STT and LLM?
Add one rule.

Want multi-agent routing?
Add a router connector.
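
Here is a rough sketch of how such a manifest could be loaded: each rule is just a source-to-target edge, so swapping Deepgram for Whisper is a string edit rather than a code change. The parser and the whisper.stt name are illustrative, not EchoStack's actual loader.

import json

manifest = json.loads("""
{
  "pipeline": [
    "ingress.livekit.audio_frame → deepgram.stt",
    "deepgram.stt → claude.agent",
    "claude.agent → elevenlabs.tts",
    "elevenlabs.tts → egress.livekit.audio_chunk"
  ]
}
""")

def parse_edges(pipeline: list[str]) -> list[tuple[str, str]]:
    """Split each "a → b" rule into a (source, target) edge."""
    edges = []
    for rule in pipeline:
        source, target = (part.strip() for part in rule.split("→"))
        edges.append((source, target))
    return edges

edges = parse_edges(manifest["pipeline"])

# Swap Deepgram for Whisper: rewrite only the edges that mention it.
edges = [(s.replace("deepgram.stt", "whisper.stt"),
          t.replace("deepgram.stt", "whisper.stt")) for s, t in edges]

for source, target in edges:
    print(f"{source} -> {target}")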

Multi-Playbook Orchestration (The Real Game-Changer)

Traditional voice agents can only run one flow.

EchoStack can run many — and switch between them in real time:

LeadQualifier.json  
MeetingBooker.json  
FAQBot.json  
SupportAgent.json  
CRMLogger.json

If the user says:

“I want to book a meeting.”

A routing connector switches the active playbook:

processing.deepgram.text → intent.router → meeting_booker.playbook

This is impossible in a linear voice bot pipeline, but trivial in an event system.
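
A toy version of that router shows the shape of the idea: it consumes transcripts and returns the playbook that should own the next turn. The keyword matching is deliberately naive (a real router might call an LLM or a classifier), and every name here is made up for illustration.

ACTIVE_PLAYBOOK = "lead_qualifier"

INTENT_MAP = {
    "book a meeting":  "meeting_booker",
    "talk to support": "support_agent",
    "pricing":         "faq_bot",
}

def route(transcript: str) -> str:
    """Return the playbook that should handle the next turn."""
    text = transcript.lower()
    for phrase, playbook in INTENT_MAP.items():
        if phrase in text:
            return playbook
    return ACTIVE_PLAYBOOK  # no intent matched: stay in the current flow

print(route("I want to book a meeting."))         # meeting_booker
print(route("Tell me more about your product."))  # lead_qualifier (no switch)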

Real-Time Streaming (STT, LLM, TTS)

Because everything is async events, the system supports:

  • Streaming STT transcripts
  • Streaming LLM tokens (Claude / GPT-4o)
  • Streaming TTS audio chunks
  • Barge-in and interruption
  • Live agent escalation
  • Parallel processing
  • Multi-agent collaboration

Example LLM output stream event:

processing.claude.agent_message.partial

Example TTS stream:

processing.elevenlabs.audio_chunk.stream

The user hears responses as they are generated — not after the full LLM response.
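
As a rough illustration of why streaming matters, the sketch below pipes fake partial LLM tokens straight into a fake TTS call, so "playback" can start while generation is still in flight. Both stubs are stand-ins for real providers, not actual Claude or ElevenLabs API calls.

import asyncio
from typing import AsyncIterator

async def llm_stream(prompt: str) -> AsyncIterator[str]:
    # Stand-in for a streaming Claude / GPT-4o call.
    for token in ["Sure,", " I", " can", " book", " that", " meeting."]:
        await asyncio.sleep(0.05)          # simulated network latency
        yield token                        # -> processing.claude.agent_message.partial

async def tts_stream(text: str) -> bytes:
    # Stand-in for a streaming TTS provider: returns a fake audio chunk.
    await asyncio.sleep(0.02)
    return text.encode()                   # -> processing.elevenlabs.audio_chunk.stream

async def main() -> None:
    async for partial in llm_stream("book a meeting"):
        chunk = await tts_stream(partial)  # synthesize each partial immediately
        print(f"play {len(chunk)} bytes while the LLM is still generating")

asyncio.run(main())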

Full Pipeline Simulation (No LiveKit Needed)

This is my favorite feature.

EchoStack can simulate:

  • Audio → STT
  • STT → LLM
  • LLM → TTS
  • TTS → Egress
  • All connector interactions

Without touching real providers.

It uses a mock runtime registry to generate realistic (but fake) outputs.
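
Conceptually, that registry can be as simple as a map from connector name to a fake implementation. The sketch below is illustrative only (the registry shape and outputs are made up), but it shows how a full STT, LLM, and TTS dry run needs zero network calls.

from typing import Callable

MOCK_REGISTRY: dict[str, Callable[[dict], dict]] = {
    "deepgram.stt":   lambda ev: {"topic": "processing.deepgram.text",
                                  "text": "I want to book a meeting."},
    "claude.agent":   lambda ev: {"topic": "processing.claude.agent_message",
                                  "text": f"Sure! Replying to: {ev['text']}"},
    "elevenlabs.tts": lambda ev: {"topic": "processing.elevenlabs.audio_chunk",
                                  "audio": b"\x00" * 320},
}

def simulate(connector: str, event: dict) -> dict:
    """Run one connector against the mock registry instead of a real provider."""
    return MOCK_REGISTRY[connector](event)

# Dry-run the STT -> LLM -> TTS hops with no real providers involved.
stt_out = simulate("deepgram.stt", {"topic": "processing.livekit.audio_frame"})
llm_out = simulate("claude.agent", stt_out)
tts_out = simulate("elevenlabs.tts", llm_out)
print(stt_out["text"], "->", llm_out["text"], "->", len(tts_out["audio"]), "bytes")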

This allows:

  • Visual debugging
  • Step-by-step replay
  • Education demos
  • Test-driven development
  • Predictable QA
  • “Dry runs” before deployment

This is something even Retell & Vapi don’t have today.

And It Scales Like a Distributed System

Because everything is events:

  • Each connector is a worker
  • Workers scale horizontally
  • Backpressure is manageable
  • Failures can be contained
  • Retries & fallbacks are simple
  • Pipelines can fork or merge
  • Multi-agent flows work naturally
  • Audioless connectors (CRM, DB, API) blend seamlessly

It behaves like:

  • Zapier
  • AWS EventBridge
  • LangGraph
  • Airflow
  • N8N

…but optimized for real-time audio.
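
To illustrate the worker model, here is a small sketch of one connector as an asyncio worker: it pulls events from a queue, retries a flaky primary provider, and falls back to an alternate. The provider names, retry counts, and in-memory queue are all illustrative assumptions, not EchoStack's real runtime.

import asyncio
import random

async def call_provider(name: str, event: dict) -> dict:
    if random.random() < 0.5:              # simulate a flaky provider
        raise RuntimeError(f"{name} timed out")
    return {"topic": f"processing.{name}.text", "text": "hello"}

async def worker(queue: asyncio.Queue, primary: str, fallback: str) -> None:
    while True:
        event = await queue.get()
        for provider in (primary, primary, fallback):   # two tries, then fallback
            try:
                result = await call_provider(provider, event)
                print(f"{provider} handled event -> {result['topic']}")
                break
            except RuntimeError as err:
                print(f"retrying after: {err}")
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(worker(queue, primary="deepgram", fallback="whisper"))
    await queue.put({"topic": "processing.livekit.audio_frame"})
    await queue.join()                     # wait until the event is processed

asyncio.run(main())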

What This Unlocks for Businesses

This is where the architecture stops being “cool tech” and becomes actual value:

  • Lead qualification
  • After-hours support
  • Customer triage
  • Booking assistants
  • Helpdesk automation
  • Sales follow-ups
  • Knowledge Q&A
  • Order tracking
  • Multi-agent escalation
  • CRM syncing
  • Custom playbooks per industry
  • Complex routing between AI tools

You don’t just deploy “a bot.”

You deploy a network of intelligent voice automations.

Closing Thoughts

Voice AI is moving fast, but most of what exists today is still:

  • rigid
  • non-composable
  • difficult to integrate
  • tied to single vendors
  • non-debuggable
  • non-portable

By making the entire system event-driven and connector-based, EchoStack becomes:

A real-time automation platform where voice is the entry point — not the limitation.

If you’re into real-time systems, LiveKit, STT/LLM/TTS pipelines, or voice automation, I’d love to exchange ideas.
