How I Designed a Modular, Event-Driven Architecture for Real-Time Voice AI

Most voice AI systems today are built as a fixed chain:

STT → LLM → TTS → Audio Output.

This works for demos, but falls apart the moment you need:

  • Custom business logic
  • CRM integrations
  • Multi-agent routing
  • Knowledge lookups
  • Scheduling flows
  • Post-call actions
  • Pipeline branching
  • Swappable providers (Claude vs GPT, Deepgram vs Whisper, etc.)

So for EchoStack, I scrapped the idea of a “voice bot pipeline” entirely and built a voice automation platform powered by an event-driven orchestration layer.

Here’s how the architecture works — and why it has completely changed what’s possible with real-time AI.

LiveKit Only Handles Ingress & Egress

Not STT.
Not LLM.
Not TTS.

Just pure audio transport:

User Mic → LiveKit → EchoStack  
EchoStack → LiveKit → User Speaker

Inside EchoStack, every audio frame becomes an event:

processing.livekit.audio_frame

This makes the audio layer fully modular and independent of AI logic.
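
To make that concrete, here is a minimal sketch of the idea: raw frames get wrapped as events on an internal bus, and nothing downstream knows LiveKit exists. The EventBus class, payload shape, and handler are illustrative stand-ins, not EchoStack's actual internals; only the topic name comes from the article.

import asyncio
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]

class EventBus:
    """Tiny in-process pub/sub; topics are dot-separated strings."""
    def __init__(self) -> None:
        self._subs: dict[str, list[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subs.setdefault(topic, []).append(handler)

    async def publish(self, topic: str, payload: dict) -> None:
        for handler in self._subs.get(topic, []):
            await handler(payload)

async def on_audio_frame(event: dict) -> None:
    # A downstream STT connector would consume this topic in the real system.
    print(f"got {len(event['pcm'])} bytes of PCM at {event['sample_rate']} Hz")

async def main() -> None:
    bus = EventBus()
    bus.subscribe("processing.livekit.audio_frame", on_audio_frame)

    # In production the payload comes from a LiveKit track; here we fake a
    # 20 ms frame of 16 kHz mono PCM so the sketch stays self-contained.
    frame = {"pcm": bytes(640), "sample_rate": 16_000, "track_id": "user-mic"}
    await bus.publish("processing.livekit.audio_frame", frame)

asyncio.run(main())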

Everything Inside EchoStack Is a Connector

A connector can be:

  • Deepgram STT
  • WhisperX
  • AssemblyAI
  • Claude
  • GPT-4o
  • Llama 3
  • ElevenLabs
  • Azure Neural TTS
  • HubSpot
  • Salesforce
  • Zendesk
  • Calendly
  • A custom HTTP API
  • A knowledge search
  • A database entry
  • Or even another AI agent

Every connector declares what it consumes and what it produces:

{
  "consumes": ["processing.deepgram.text"],
  "produces": ["processing.claude.agent_message"]
}

EchoStack uses this to decide where events flow next.

This creates a real-time version of Zapier or LangGraph.
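
A hedged sketch of how that routing decision could work: fold every connector manifest into a topic-to-consumers table, then look up the next hops whenever an event is published. The connector names and the next_hops helper are illustrative, not EchoStack's real registry.

from collections import defaultdict

manifests = {
    "deepgram.stt":   {"consumes": ["processing.livekit.audio_frame"],
                       "produces": ["processing.deepgram.text"]},
    "claude.agent":   {"consumes": ["processing.deepgram.text"],
                       "produces": ["processing.claude.agent_message"]},
    "hubspot.logger": {"consumes": ["processing.claude.agent_message"],
                       "produces": []},
}

# topic -> list of connectors that want events on that topic
routes: dict[str, list[str]] = defaultdict(list)
for name, manifest in manifests.items():
    for topic in manifest["consumes"]:
        routes[topic].append(name)

def next_hops(topic: str) -> list[str]:
    """Who should receive an event published on this topic?"""
    return routes.get(topic, [])

print(next_hops("processing.deepgram.text"))          # ['claude.agent']
print(next_hops("processing.claude.agent_message"))   # ['hubspot.logger']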

Pipelines Are Just Manifests

Instead of hardcoded logic, pipelines are defined like this:

{
  "pipeline": [
    "ingress.livekit.audio_frame → deepgram.stt",
    "deepgram.stt → claude.agent",
    "claude.agent → elevenlabs.tts",
    "elevenlabs.tts → egress.livekit.audio_chunk"
  ]
}

No code.
No wiring.
Just declarative routing.

Want to swap Deepgram for Whisper?
Edit one line.

Want to add sentiment analysis between STT and LLM?
Add one rule.

Want multi-agent routing?
Add a router connector.
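
Here is a rough sketch of how such a manifest could be loaded: each rule is just a source-to-target edge, so swapping Deepgram for Whisper is a string edit rather than a code change. The parser and the whisper.stt name are illustrative, not EchoStack's actual loader.

import json

manifest = json.loads("""
{
  "pipeline": [
    "ingress.livekit.audio_frame → deepgram.stt",
    "deepgram.stt → claude.agent",
    "claude.agent → elevenlabs.tts",
    "elevenlabs.tts → egress.livekit.audio_chunk"
  ]
}
""")

def parse_edges(pipeline: list[str]) -> list[tuple[str, str]]:
    """Split each "a → b" rule into a (source, target) edge."""
    edges = []
    for rule in pipeline:
        source, target = (part.strip() for part in rule.split("→"))
        edges.append((source, target))
    return edges

edges = parse_edges(manifest["pipeline"])

# Swap Deepgram for Whisper: rewrite only the edges that mention it.
edges = [(s.replace("deepgram.stt", "whisper.stt"),
          t.replace("deepgram.stt", "whisper.stt")) for s, t in edges]

for source, target in edges:
    print(f"{source} -> {target}")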

Multi-Playbook Orchestration (The Real Game-Changer)

Traditional voice agents can only run one flow.

EchoStack can run many — and switch between them in real time:

LeadQualifier.json  
MeetingBooker.json  
FAQBot.json  
SupportAgent.json  
CRMLogger.json

If the user says:

“I want to book a meeting.”

A routing connector switches the active playbook:

processing.deepgram.text → intent.router → meeting_booker.playbook

This is impossible in a linear voice bot pipeline, but trivial in an event system.
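
A toy version of that router shows the shape of the idea: it consumes transcripts and returns the playbook that should own the next turn. The keyword matching is deliberately naive (a real router might call an LLM or a classifier), and every name here is made up for illustration.

ACTIVE_PLAYBOOK = "lead_qualifier"

INTENT_MAP = {
    "book a meeting":  "meeting_booker",
    "talk to support": "support_agent",
    "pricing":         "faq_bot",
}

def route(transcript: str) -> str:
    """Return the playbook that should handle the next turn."""
    text = transcript.lower()
    for phrase, playbook in INTENT_MAP.items():
        if phrase in text:
            return playbook
    return ACTIVE_PLAYBOOK  # no intent matched: stay in the current flow

print(route("I want to book a meeting."))         # meeting_booker
print(route("Tell me more about your product."))  # lead_qualifier (no switch)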

Real-Time Streaming (STT, LLM, TTS)

Because everything is async events, the system supports:

  • Streaming STT transcripts
  • Streaming LLM tokens (Claude / GPT-4o)
  • Streaming TTS audio chunks
  • Barge-in and interruption
  • Live agent escalation
  • Parallel processing
  • Multi-agent collaboration

Example LLM output stream event:

processing.claude.agent_message.partial

Example TTS stream:

processing.elevenlabs.audio_chunk.stream

The user hears responses as they are generated — not after the full LLM response.
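
As a rough illustration of why streaming matters, the sketch below pipes fake partial LLM tokens straight into a fake TTS call, so "playback" can start while generation is still in flight. Both stubs are stand-ins for real providers, not actual Claude or ElevenLabs API calls.

import asyncio
from typing import AsyncIterator

async def llm_stream(prompt: str) -> AsyncIterator[str]:
    # Stand-in for a streaming Claude / GPT-4o call.
    for token in ["Sure,", " I", " can", " book", " that", " meeting."]:
        await asyncio.sleep(0.05)          # simulated network latency
        yield token                        # -> processing.claude.agent_message.partial

async def tts_stream(text: str) -> bytes:
    # Stand-in for a streaming TTS provider: returns a fake audio chunk.
    await asyncio.sleep(0.02)
    return text.encode()                   # -> processing.elevenlabs.audio_chunk.stream

async def main() -> None:
    async for partial in llm_stream("book a meeting"):
        chunk = await tts_stream(partial)  # synthesize each partial immediately
        print(f"play {len(chunk)} bytes while the LLM is still generating")

asyncio.run(main())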

Full Pipeline Simulation (No LiveKit Needed)

This is my favorite feature.

EchoStack can simulate:

  • Audio → STT
  • STT → LLM
  • LLM → TTS
  • TTS → Egress
  • All connector interactions

Without touching real providers.

It uses a mock runtime registry to generate realistic (but fake) outputs.
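
Conceptually, that registry can be as simple as a map from connector name to a fake implementation. The sketch below is illustrative only (the registry shape and outputs are made up), but it shows how a full STT, LLM, and TTS dry run needs zero network calls.

from typing import Callable

MOCK_REGISTRY: dict[str, Callable[[dict], dict]] = {
    "deepgram.stt":   lambda ev: {"topic": "processing.deepgram.text",
                                  "text": "I want to book a meeting."},
    "claude.agent":   lambda ev: {"topic": "processing.claude.agent_message",
                                  "text": f"Sure! Replying to: {ev['text']}"},
    "elevenlabs.tts": lambda ev: {"topic": "processing.elevenlabs.audio_chunk",
                                  "audio": b"\x00" * 320},
}

def simulate(connector: str, event: dict) -> dict:
    """Run one connector against the mock registry instead of a real provider."""
    return MOCK_REGISTRY[connector](event)

# Dry-run the STT -> LLM -> TTS hops with no real providers involved.
stt_out = simulate("deepgram.stt", {"topic": "processing.livekit.audio_frame"})
llm_out = simulate("claude.agent", stt_out)
tts_out = simulate("elevenlabs.tts", llm_out)
print(stt_out["text"], "->", llm_out["text"], "->", len(tts_out["audio"]), "bytes")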

This allows:

  • Visual debugging
  • Step-by-step replay
  • Education demos
  • Test-driven development
  • Predictable QA
  • “Dry runs” before deployment

This is something even Retell & Vapi don’t have today.

And It Scales Like a Distributed System

Because everything is events:

  • Each connector is a worker
  • Workers scale horizontally
  • Backpressure is manageable
  • Failures can be contained
  • Retries & fallbacks are simple
  • Pipelines can fork or merge
  • Multi-agent flows work naturally
  • Audioless connectors (CRM, DB, API) blend seamlessly

It behaves like:

  • Zapier
  • AWS EventBridge
  • LangGraph
  • Airflow
  • N8N

…but optimized for real-time audio.
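
To illustrate the worker model, here is a small sketch of one connector as an asyncio worker: it pulls events from a queue, retries a flaky primary provider, and falls back to an alternate. The provider names, retry counts, and in-memory queue are all illustrative assumptions, not EchoStack's real runtime.

import asyncio
import random

async def call_provider(name: str, event: dict) -> dict:
    if random.random() < 0.5:              # simulate a flaky provider
        raise RuntimeError(f"{name} timed out")
    return {"topic": f"processing.{name}.text", "text": "hello"}

async def worker(queue: asyncio.Queue, primary: str, fallback: str) -> None:
    while True:
        event = await queue.get()
        for provider in (primary, primary, fallback):   # two tries, then fallback
            try:
                result = await call_provider(provider, event)
                print(f"{provider} handled event -> {result['topic']}")
                break
            except RuntimeError as err:
                print(f"retrying after: {err}")
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(worker(queue, primary="deepgram", fallback="whisper"))
    await queue.put({"topic": "processing.livekit.audio_frame"})
    await queue.join()                     # wait until the event is processed

asyncio.run(main())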

What This Unlocks for Businesses

This is where the architecture stops being “cool tech” and becomes actual value:

  • Lead qualification
  • After-hours support
  • Customer triage
  • Booking assistants
  • Helpdesk automation
  • Sales follow-ups
  • Knowledge Q&A
  • Order tracking
  • Multi-agent escalation
  • CRM syncing
  • Custom playbooks per industry
  • Complex routing between AI tools

You don’t just deploy “a bot.”

You deploy a network of intelligent voice automations.

Closing Thoughts

Voice AI is moving fast, but most of what exists today is still:

  • rigid
  • non-composable
  • difficult to integrate
  • tied to single vendors
  • non-debuggable
  • non-portable

By making the entire system event-driven and connector-based, EchoStack becomes:

A real-time automation platform where voice is the entry point — not the limitation.

If you’re into real-time systems, LiveKit, STT/LLM/TTS pipelines, or voice automation, I’d love to exchange ideas.
