Most voice AI systems today are built as a fixed chain:
STT → LLM → TTS → Audio Output.
This works for demos, but falls apart the moment you need:
- Custom business logic
- CRM integrations
- Multi-agent routing
- Knowledge lookups
- Scheduling flows
- Post-call actions
- Pipeline branching
- Swappable providers (Claude vs GPT, Deepgram vs Whisper, etc.)
So for EchoStack, I scrapped the idea of a “voice bot pipeline” entirely and built a voice automation platform powered by an event-driven orchestration layer.
Here’s how the architecture works — and why it has completely changed what’s possible with real-time AI.
## LiveKit Only Handles Ingress & Egress
Not STT.
Not LLM.
Not TTS.
Just pure audio transport:
User Mic → LiveKit → EchoStack
EchoStack → LiveKit → User Speaker
Inside EchoStack, every audio frame becomes an event:
processing.livekit.audio_frame
This makes the audio layer fully modular and independent of AI logic.
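As a rough sketch of what that looks like (the envelope fields and helper below are my own illustration, not EchoStack's actual schema), every frame gets wrapped in an event keyed by topic:

```python
import time

def make_event(topic: str, payload: dict) -> dict:
    """Wrap a payload in a minimal event envelope keyed by topic."""
    return {
        "topic": topic,
        "ts": time.time(),
        "payload": payload,
    }

# Every inbound LiveKit frame becomes an event on the bus.
frame_event = make_event(
    "processing.livekit.audio_frame",
    {"pcm_b64": "<base64 audio>", "sample_rate": 16000},
)
print(frame_event["topic"])  # processing.livekit.audio_frame
```

Once audio is just another event topic, nothing downstream needs to know LiveKit exists.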
## Everything Inside EchoStack Is a Connector
A connector can be:
- Deepgram STT
- WhisperX
- AssemblyAI
- Claude
- GPT-4o
- Llama 3
- ElevenLabs
- Azure Neural TTS
- HubSpot
- Salesforce
- Zendesk
- Calendly
- A custom HTTP API
- A knowledge search
- A database entry
- Or even another AI agent
Each connector declares in its manifest which events it consumes and which it produces:

```json
{
  "consumes": ["processing.deepgram.text"],
  "produces": ["processing.claude.agent_message"]
}
```
EchoStack uses this to decide where events flow next.
This creates a real-time version of Zapier or LangGraph.
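A minimal dispatcher over those manifests might look like this (the connector names and registry shape are illustrative, not EchoStack's real interface):

```python
# Each connector's manifest declares the topics it consumes and produces.
MANIFESTS = {
    "deepgram.stt": {
        "consumes": ["processing.livekit.audio_frame"],
        "produces": ["processing.deepgram.text"],
    },
    "claude.agent": {
        "consumes": ["processing.deepgram.text"],
        "produces": ["processing.claude.agent_message"],
    },
}

def build_routes(manifests: dict) -> dict:
    """Invert manifests: topic -> list of connector names that consume it."""
    routes = {}
    for name, manifest in manifests.items():
        for topic in manifest["consumes"]:
            routes.setdefault(topic, []).append(name)
    return routes

routes = build_routes(MANIFESTS)
# An incoming transcript event flows to every connector that consumes it.
print(routes["processing.deepgram.text"])  # ['claude.agent']
```

Routing then reduces to a dictionary lookup per event, which is what makes it cheap to do in real time.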
## Pipelines Are Just Manifests
Instead of hardcoded logic, pipelines are defined like this:
```json
{
  "pipeline": [
    "ingress.livekit.audio_frame → deepgram.stt",
    "deepgram.stt → claude.agent",
    "claude.agent → elevenlabs.tts",
    "elevenlabs.tts → egress.livekit.audio_chunk"
  ]
}
```
No code.
No wiring.
Just declarative routing.
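To show how little machinery that manifest needs, here is a sketch of turning those "source → target" rules into a routing table (parsing details are assumptions on my part):

```python
PIPELINE = [
    "ingress.livekit.audio_frame → deepgram.stt",
    "deepgram.stt → claude.agent",
    "claude.agent → elevenlabs.tts",
    "elevenlabs.tts → egress.livekit.audio_chunk",
]

def parse_pipeline(rules: list) -> dict:
    """Turn 'source → target' strings into a source -> target lookup."""
    table = {}
    for rule in rules:
        src, dst = (part.strip() for part in rule.split("→"))
        table[src] = dst
    return table

table = parse_pipeline(PIPELINE)
# Swapping Deepgram for Whisper is one edited rule, not a code change.
print(table["deepgram.stt"])  # claude.agent
```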
Want to swap Deepgram for Whisper?
Edit one line.
Want to add sentiment analysis between STT and LLM?
Add one rule.
Want multi-agent routing?
Add a router connector.
## Multi-Playbook Orchestration (The Real Game-Changer)
Traditional voice agents can only run one flow.
EchoStack can run many — and switch between them in real time:
- LeadQualifier.json
- MeetingBooker.json
- FAQBot.json
- SupportAgent.json
- CRMLogger.json
If the user says:
“I want to book a meeting.”
The routing connector switches the playbook:
processing.deepgram.text → intent.router → meeting_booker.playbook
This is impossible in a linear voice bot pipeline, but trivial in an event system.
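A toy version of that intent router could be keyword-based (a real one would sit behind an intent model; the trigger phrases and playbook names here are hypothetical):

```python
# Hypothetical trigger phrases mapped to playbook files.
PLAYBOOK_TRIGGERS = {
    "book a meeting": "meeting_booker.playbook",
    "cancel my order": "support_agent.playbook",
}

def route(transcript: str, default: str = "faq_bot.playbook") -> str:
    """Pick the playbook whose trigger phrase appears in the transcript."""
    lowered = transcript.lower()
    for phrase, playbook in PLAYBOOK_TRIGGERS.items():
        if phrase in lowered:
            return playbook
    return default

print(route("I want to book a meeting."))  # meeting_booker.playbook
```

Because the router is itself just a connector consuming `processing.deepgram.text`, swapping the keyword matcher for an LLM classifier changes nothing upstream or downstream.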
## Real-Time Streaming (STT, LLM, TTS)
Because everything is async events, the system supports:
- Streaming STT transcripts
- Streaming LLM tokens (Claude / GPT-4o)
- Streaming TTS audio chunks
- Barge-in and interruption
- Live agent escalation
- Parallel processing
- Multi-agent collaboration
Example LLM output stream event:
processing.claude.agent_message.partial
Example TTS stream:
processing.elevenlabs.audio_chunk.stream
The user hears responses as they are generated — not after the full LLM response.
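The partial-event pattern is simple to sketch (event names follow the document; the generator itself is my illustration):

```python
def stream_tokens(tokens):
    """Yield a partial-message event per LLM token, then the final event."""
    text = ""
    for tok in tokens:
        text += tok
        yield {"topic": "processing.claude.agent_message.partial", "text": text}
    yield {"topic": "processing.claude.agent_message", "text": text}

events = list(stream_tokens(["Hel", "lo ", "there"]))
# TTS can start speaking on the first partial instead of waiting for the final message.
print(events[0]["text"], "→", events[-1]["topic"])
```

Downstream connectors subscribe to `.partial` topics when they can act incrementally, or only to the final topic when they need the complete message.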
## Full Pipeline Simulation (No LiveKit Needed)
This is my favorite feature.
EchoStack can simulate:
- Audio → STT
- STT → LLM
- LLM → TTS
- TTS → Egress
- All connector interactions
Without touching real providers.
It uses a mock runtime registry to generate realistic fake outputs.
This allows:
- Visual debugging
- Step-by-step replay
- Education demos
- Test-driven development
- Predictable QA
- “Dry runs” before deployment
This is something even Retell & Vapi don’t have today.
## And It Scales Like a Distributed System
Because everything is events:
- Each connector is a worker
- Workers scale horizontally
- Backpressure is manageable
- Failures can be contained
- Retries & fallbacks are simple
- Pipelines can fork or merge
- Multi-agent flows work naturally
- Audioless connectors (CRM, DB, API) blend seamlessly
It behaves like:
- Zapier
- AWS EventBridge
- LangGraph
- Airflow
- N8N
…but optimized for real-time audio.
## What This Unlocks for Businesses
This is where the architecture stops being “cool tech” and becomes actual value:
- Lead qualification
- After-hours support
- Customer triage
- Booking assistants
- Helpdesk automation
- Sales follow-ups
- Knowledge Q&A
- Order tracking
- Multi-agent escalation
- CRM syncing
- Custom playbooks per industry
- Complex routing between AI tools
You don’t just deploy “a bot.”
You deploy a network of intelligent voice automations.
## Closing Thoughts
Voice AI is moving fast, but most of what exists today is still:
- rigid
- non-composable
- difficult to integrate
- tied to single vendors
- non-debuggable
- non-portable
By making the entire system event-driven and connector-based, EchoStack becomes:
A real-time automation platform where voice is the entry point — not the limitation.
If you’re into real-time systems, LiveKit, STT/LLM/TTS pipelines, or voice automation, I’d love to exchange ideas.