Arina from Graza.ai

Voice AI in 2 Months: Tech Stack, Latency Challenges, and Multilingual Insights

We didn’t set out to build Voice AI. We set out to stop missing calls that mattered.

Constant interruptions vs. missing critical info — that was the problem we faced. We wanted something that handled calls intelligently without adding another app or dashboard.

Our 2-Month Tech Stack

Graza.ai uses a multi‑provider approach to balance performance, cost, and reliability while maintaining human‑like quality for voice interactions.

  • Twilio – Voice call routing and real‑time communication
  • Deepgram – High‑performance speech‑to‑text transcription
  • ElevenLabs – Natural voice synthesis in 70+ languages
  • OpenAI & Anthropic – Context understanding and human-like responses
  • Google AI Services – AI processing and infrastructure support
  • AWS – Additional hosting for scalability
  • Microsoft Azure – Backup AI services and flexibility
  • Plausible Analytics – Privacy‑focused, cookie‑less tracking
  • Postmark – Transactional email delivery
  • Mailgun – Marketing and broader email capabilities
  • Google Cloud & Firebase – Firestore, Functions, and Cloud Storage

Big decision: We used proven APIs instead of training custom models. It let us ship fast and focus on orchestration logic.
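To make "orchestration logic" concrete, here's a simplified sketch of a single utterance hand‑off between providers: transcript in, reply audio out. This is illustrative only, not our production code; the model name, voice ID, and missing error handling are all placeholders.

```typescript
// Minimal per-utterance orchestration sketch: transcript in, reply audio out.
// Model name and voice ID are placeholders, not production values. Node 18+ (global fetch).
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function handleUtterance(transcript: string, systemPrompt: string): Promise<Buffer> {
  // 1. Ask the LLM for the assistant's reply.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o", // assumed model
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: transcript },
    ],
  });
  const replyText = completion.choices[0].message.content ?? "";

  // 2. Synthesize the reply with ElevenLabs' streaming TTS endpoint.
  const voiceId = "YOUR_VOICE_ID"; // placeholder
  const tts = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`, {
    method: "POST",
    headers: {
      "xi-api-key": process.env.ELEVENLABS_API_KEY ?? "",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text: replyText, model_id: "eleven_multilingual_v2" }),
  });

  // 3. Return raw audio bytes; the call layer (e.g. a Twilio media stream) plays them back.
  return Buffer.from(await tts.arrayBuffer());
}
```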

Hard Lessons

Phone audio is brutal. Crystal‑clear tests worked. Real calls? A mess of accents, background noise, and bad connections. We rebuilt our pipeline for real‑world conditions.
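For a concrete reference point, here's roughly what the STT connection looks like once it's tuned for telephony rather than studio audio: a Deepgram live‑transcription WebSocket configured for the 8 kHz mu‑law audio that Twilio media streams deliver. The query parameters are standard Deepgram options; the model name is an assumption.

```typescript
// Open a Deepgram live-transcription socket tuned for phone-call audio.
// The model name is an assumption; the other parameters match Twilio's media format.
import WebSocket from "ws";

const params = new URLSearchParams({
  model: "nova-2-phonecall", // assumed telephony-tuned model
  encoding: "mulaw",         // Twilio media streams send 8 kHz mu-law audio
  sample_rate: "8000",
  punctuate: "true",
  interim_results: "true",   // partial transcripts keep perceived latency low
});

const dg = new WebSocket(`wss://api.deepgram.com/v1/listen?${params}`, {
  headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` },
});

dg.on("message", (data) => {
  const msg = JSON.parse(data.toString());
  const transcript = msg.channel?.alternatives?.[0]?.transcript;
  if (transcript && msg.is_final) {
    console.log("caller said:", transcript);
  }
});

// Raw audio chunks from the call get forwarded into the socket, e.g. dg.send(audioChunk);
```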

Context Is Everything

The AI must remember (a rough sketch follows this list):

  • Who’s calling and why
  • Past conversations
  • VIP lists and language preferences
  • Current availability
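Here's a rough sketch of what that per‑caller context can look like once it's folded into the system prompt. The interface fields and the helper are illustrative, not our actual schema.

```typescript
// Hypothetical per-caller context; field names are illustrative only.
interface CallerContext {
  phoneNumber: string;
  name?: string;
  isVip: boolean;
  preferredLanguage?: string;   // e.g. "es", "zh", "fr"
  pastCallSummaries: string[];  // short summaries of earlier conversations
  ownerAvailability: "available" | "busy" | "do-not-disturb";
}

// Fold the context into the system prompt so every reply is grounded in it.
function buildSystemPrompt(ctx: CallerContext): string {
  return [
    "You are a phone assistant answering on the owner's behalf.",
    ctx.name ? `The caller is ${ctx.name} (${ctx.phoneNumber}).` : `The caller's number is ${ctx.phoneNumber}.`,
    ctx.isVip ? "This caller is on the VIP list; offer to connect them if possible." : "",
    ctx.preferredLanguage ? `The caller prefers to speak ${ctx.preferredLanguage}.` : "",
    ctx.pastCallSummaries.length ? `Previous conversations: ${ctx.pastCallSummaries.join(" | ")}` : "",
    `The owner is currently: ${ctx.ownerAvailability}.`,
  ].filter(Boolean).join("\n");
}
```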

Latency Kills Experience

Even 2–3 second delays felt broken. We cut it down to ~800ms average.
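A common way to get from multi‑second to sub‑second perceived latency is to stream end to end: start synthesizing speech as soon as the first sentence of the reply exists instead of waiting for the full completion. The sketch below shows the idea with OpenAI's streaming API; the TTS hand‑off is stubbed, and this is the general technique rather than our exact production pipeline.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Stub: in the real pipeline this hands the chunk to the TTS provider.
async function sendToTTS(chunk: string): Promise<void> {
  console.log("TTS:", chunk);
}

// Stream the LLM reply and flush sentence-sized chunks to TTS immediately,
// so the caller hears the start of the answer before the full reply exists.
async function streamReply(systemPrompt: string, userText: string): Promise<void> {
  const stream = await openai.chat.completions.create({
    model: "gpt-4o", // assumed model
    stream: true,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userText },
    ],
  });

  let buffer = "";
  for await (const part of stream) {
    buffer += part.choices[0]?.delta?.content ?? "";
    const boundary = buffer.search(/[.!?]\s/); // flush on sentence boundaries
    if (boundary !== -1) {
      await sendToTTS(buffer.slice(0, boundary + 1));
      buffer = buffer.slice(boundary + 1).trimStart();
    }
  }
  if (buffer.trim()) await sendToTTS(buffer); // flush whatever is left
}
```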

Multilingual Surprise

Users tested Spanish, Mandarin, and French calls immediately. We hadn’t planned for it, but GPT‑4 handled them well with one simple rule:

“Detect the caller’s language and respond naturally in the same language. If uncertain, ask for preference.”
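In practice that rule is just one line appended to the system prompt. A tiny sketch (the base prompt is whatever caller context the assistant already has):

```typescript
// The multilingual behaviour is a single instruction appended to the system prompt.
const LANGUAGE_RULE =
  "Detect the caller's language and respond naturally in the same language. " +
  "If uncertain, ask for their preference.";

function withLanguageRule(basePrompt: string): string {
  return `${basePrompt}\n${LANGUAGE_RULE}`;
}
```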

What Works After 2 Months

  • Handles deliveries, sales, and family calls intelligently
  • Responds in multiple languages
  • Summarizes calls clearly
  • No app needed — works through your existing phone

Current Beta Performance (Real-World Metrics):

  • ~900ms average response time (measured end‑to‑end on 50+ calls)
  • ~92% transcription accuracy in clean conditions (quiet environment)
  • ~76% accuracy in noisy conditions (mobile calls, background chatter)
  • 8.3/10 average beta tester satisfaction (small cohort of early users)

We’re actively optimizing latency and noise handling: sub‑500ms and >80% noisy‑call accuracy are our next targets.

Mistakes You Can Avoid

  • Testing only with perfect audio — real calls are messy
  • Underestimating context complexity — conversations build on each other
  • Skipping observability — when calls fail, you must know where and why
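On the observability point: even simple structured per‑stage logs go a long way. Every stage gets the call SID, a duration, and an ok/failed flag, so a broken call points straight at the stage that failed. A rough sketch (stage names and fields are illustrative):

```typescript
// Log each pipeline stage (STT, LLM, TTS, playback) per call with timing,
// so failures can be traced to a specific stage. Field names are illustrative.
type Stage = "stt" | "llm" | "tts" | "playback";

async function timedStage<T>(callSid: string, stage: Stage, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    console.log(JSON.stringify({ callSid, stage, ms: Date.now() - start, ok: true }));
    return result;
  } catch (err) {
    console.error(JSON.stringify({ callSid, stage, ms: Date.now() - start, ok: false, error: String(err) }));
    throw err;
  }
}

// Usage (askModel is a placeholder): const reply = await timedStage(callSid, "llm", () => askModel(transcript));
```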

Resources That Saved Us Time

  • Deepgram’s real‑time API docs (excellent)
  • OpenAI function calling for structured responses
  • Twilio voice webhooks for handling call flows
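If you haven't used Twilio voice webhooks before: they're just HTTP endpoints that return TwiML telling Twilio what to do with the call. A minimal sketch that answers a call and forks its audio to a WebSocket media stream (the wss:// URL is a placeholder):

```typescript
// Minimal Twilio voice webhook: answer the call and stream its audio to a
// WebSocket for transcription. The stream URL is a placeholder.
import express from "express";
import twilio from "twilio";

const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded params

app.post("/voice", (req, res) => {
  const response = new twilio.twiml.VoiceResponse();
  const connect = response.connect();
  connect.stream({ url: "wss://example.com/media" }); // placeholder media-stream endpoint
  res.type("text/xml").send(response.toString());
});

app.listen(3000, () => console.log("Voice webhook listening on :3000"));
```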

What’s Next

  • Custom wake words for hands‑free use
  • Calendar/email integration for richer context
  • Sub‑500ms response times

Try It (And Break It)

We’re live in beta, free for now. Check it out here.

Questions for the Community

What’s the most useful “invisible” tool you’ve built?

Sometimes the best tech is the stuff you don’t notice, and that’s exactly what we wanted Graza.ai to be.

Always happy to chat about the technical details if anyone's curious about specific parts of the implementation!
