Arina from Graza.ai

Voice AI in 2 Months: Tech Stack, Latency Challenges, and Multilingual Insights

We didn’t set out to build Voice AI. We set out to stop missing calls that mattered.

Constant interruptions vs. missing critical info — that was the problem we faced. We wanted something that handled calls intelligently without adding another app or dashboard.

Our 2-Month Tech Stack

Graza.ai uses a multi‑provider approach to balance performance, cost, and reliability while maintaining human‑like quality for voice interactions.

  • Twilio – Voice call routing and real‑time communication
  • Deepgram – High‑performance speech‑to‑text transcription
  • ElevenLabs – Natural voice synthesis in 70+ languages
  • OpenAI & Anthropic – Context understanding and human-like responses
  • Google AI Services – AI processing and infrastructure support
  • AWS – Additional hosting for scalability
  • Microsoft Azure – Backup AI services and flexibility
  • Plausible Analytics – Privacy‑focused, cookie‑less tracking
  • Postmark – Transactional email delivery
  • Mailgun – Marketing and broader email capabilities
  • Google Cloud & Firebase – Firestore, Functions, and Cloud Storage

Big decision: We used proven APIs instead of training custom models. It let us ship fast and focus on orchestration logic.
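To make "orchestration logic" concrete, here's a simplified sketch of a single utterance hand‑off between providers: transcript in, reply audio out. This is illustrative only, not our production code; the model name, voice ID, and missing error handling are all placeholders.

```typescript
// Minimal per-utterance orchestration sketch: transcript in, reply audio out.
// Model name and voice ID are placeholders, not production values. Node 18+ (global fetch).
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function handleUtterance(transcript: string, systemPrompt: string): Promise<Buffer> {
  // 1. Ask the LLM for the assistant's reply.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o", // assumed model
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: transcript },
    ],
  });
  const replyText = completion.choices[0].message.content ?? "";

  // 2. Synthesize the reply with ElevenLabs' streaming TTS endpoint.
  const voiceId = "YOUR_VOICE_ID"; // placeholder
  const tts = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`, {
    method: "POST",
    headers: {
      "xi-api-key": process.env.ELEVENLABS_API_KEY ?? "",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text: replyText, model_id: "eleven_multilingual_v2" }),
  });

  // 3. Return raw audio bytes; the call layer (e.g. a Twilio media stream) plays them back.
  return Buffer.from(await tts.arrayBuffer());
}
```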

Hard Lessons

Phone audio is brutal. Crystal‑clear tests worked. Real calls? A mess of accents, background noise, and bad connections. We rebuilt our pipeline for real‑world conditions.
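For a concrete reference point, here's roughly what the STT connection looks like once it's tuned for telephony rather than studio audio: a Deepgram live‑transcription WebSocket configured for the 8 kHz mu‑law audio that Twilio media streams deliver. The query parameters are standard Deepgram options; the model name is an assumption.

```typescript
// Open a Deepgram live-transcription socket tuned for phone-call audio.
// The model name is an assumption; the other parameters match Twilio's media format.
import WebSocket from "ws";

const params = new URLSearchParams({
  model: "nova-2-phonecall", // assumed telephony-tuned model
  encoding: "mulaw",         // Twilio media streams send 8 kHz mu-law audio
  sample_rate: "8000",
  punctuate: "true",
  interim_results: "true",   // partial transcripts keep perceived latency low
});

const dg = new WebSocket(`wss://api.deepgram.com/v1/listen?${params}`, {
  headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` },
});

dg.on("message", (data) => {
  const msg = JSON.parse(data.toString());
  const transcript = msg.channel?.alternatives?.[0]?.transcript;
  if (transcript && msg.is_final) {
    console.log("caller said:", transcript);
  }
});

// Raw audio chunks from the call get forwarded into the socket, e.g. dg.send(audioChunk);
```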

Context Is Everything

The AI must remember (a rough sketch follows this list):

  • Who’s calling and why
  • Past conversations
  • VIP lists and language preferences
  • Current availability
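Here's a rough sketch of what that per‑caller context can look like once it's folded into the system prompt. The interface fields and the helper are illustrative, not our actual schema.

```typescript
// Hypothetical per-caller context; field names are illustrative only.
interface CallerContext {
  phoneNumber: string;
  name?: string;
  isVip: boolean;
  preferredLanguage?: string;   // e.g. "es", "zh", "fr"
  pastCallSummaries: string[];  // short summaries of earlier conversations
  ownerAvailability: "available" | "busy" | "do-not-disturb";
}

// Fold the context into the system prompt so every reply is grounded in it.
function buildSystemPrompt(ctx: CallerContext): string {
  return [
    "You are a phone assistant answering on the owner's behalf.",
    ctx.name ? `The caller is ${ctx.name} (${ctx.phoneNumber}).` : `The caller's number is ${ctx.phoneNumber}.`,
    ctx.isVip ? "This caller is on the VIP list; offer to connect them if possible." : "",
    ctx.preferredLanguage ? `The caller prefers to speak ${ctx.preferredLanguage}.` : "",
    ctx.pastCallSummaries.length ? `Previous conversations: ${ctx.pastCallSummaries.join(" | ")}` : "",
    `The owner is currently: ${ctx.ownerAvailability}.`,
  ].filter(Boolean).join("\n");
}
```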

Latency Kills Experience

Even 2–3 second delays felt broken. We cut it down to ~800ms average.
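A common way to get from multi‑second to sub‑second perceived latency is to stream end to end: start synthesizing speech as soon as the first sentence of the reply exists instead of waiting for the full completion. The sketch below shows the idea with OpenAI's streaming API; the TTS hand‑off is stubbed, and this is the general technique rather than our exact production pipeline.

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Stub: in the real pipeline this hands the chunk to the TTS provider.
async function sendToTTS(chunk: string): Promise<void> {
  console.log("TTS:", chunk);
}

// Stream the LLM reply and flush sentence-sized chunks to TTS immediately,
// so the caller hears the start of the answer before the full reply exists.
async function streamReply(systemPrompt: string, userText: string): Promise<void> {
  const stream = await openai.chat.completions.create({
    model: "gpt-4o", // assumed model
    stream: true,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userText },
    ],
  });

  let buffer = "";
  for await (const part of stream) {
    buffer += part.choices[0]?.delta?.content ?? "";
    const boundary = buffer.search(/[.!?]\s/); // flush on sentence boundaries
    if (boundary !== -1) {
      await sendToTTS(buffer.slice(0, boundary + 1));
      buffer = buffer.slice(boundary + 1).trimStart();
    }
  }
  if (buffer.trim()) await sendToTTS(buffer); // flush whatever is left
}
```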

Multilingual Surprise

Users tested Spanish, Mandarin, and French calls immediately. We hadn’t planned for it, but GPT‑4 handled them well with one simple rule:

“Detect the caller’s language and respond naturally in the same language. If uncertain, ask for preference.”
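In practice that rule is just one line appended to the system prompt. A tiny sketch (the base prompt is whatever caller context the assistant already has):

```typescript
// The multilingual behaviour is a single instruction appended to the system prompt.
const LANGUAGE_RULE =
  "Detect the caller's language and respond naturally in the same language. " +
  "If uncertain, ask for their preference.";

function withLanguageRule(basePrompt: string): string {
  return `${basePrompt}\n${LANGUAGE_RULE}`;
}
```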

What Works After 2 Months

  • Handles deliveries, sales, and family calls intelligently
  • Responds in multiple languages
  • Summarizes calls clearly
  • No app needed — works through your existing phone

Current Beta Performance (Real-World Metrics):

  • ~900ms average response time (measured end‑to‑end on 50+ calls)
  • ~92% transcription accuracy in clean conditions (quiet environment)
  • ~76% accuracy in noisy conditions (mobile calls, background chatter)
  • 8.3/10 average beta tester satisfaction (small cohort of early users)

We’re actively optimizing latency and noise handling: sub‑500ms and >80% noisy‑call accuracy are our next targets.

Mistakes You Can Avoid

  • Testing only with perfect audio — real calls are messy
  • Underestimating context complexity — conversations build on each other
  • Skipping observability — when calls fail, you must know where and why
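On the observability point: even simple structured per‑stage logs go a long way. Every stage gets the call SID, a duration, and an ok/failed flag, so a broken call points straight at the stage that failed. A rough sketch (stage names and fields are illustrative):

```typescript
// Log each pipeline stage (STT, LLM, TTS, playback) per call with timing,
// so failures can be traced to a specific stage. Field names are illustrative.
type Stage = "stt" | "llm" | "tts" | "playback";

async function timedStage<T>(callSid: string, stage: Stage, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    console.log(JSON.stringify({ callSid, stage, ms: Date.now() - start, ok: true }));
    return result;
  } catch (err) {
    console.error(JSON.stringify({ callSid, stage, ms: Date.now() - start, ok: false, error: String(err) }));
    throw err;
  }
}

// Usage (askModel is a placeholder): const reply = await timedStage(callSid, "llm", () => askModel(transcript));
```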

Resources That Saved Us Time

  • Deepgram’s real‑time API docs (excellent)
  • OpenAI function calling for structured responses
  • Twilio voice webhooks for handling call flows
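If you haven't used Twilio voice webhooks before: they're just HTTP endpoints that return TwiML telling Twilio what to do with the call. A minimal sketch that answers a call and forks its audio to a WebSocket media stream (the wss:// URL is a placeholder):

```typescript
// Minimal Twilio voice webhook: answer the call and stream its audio to a
// WebSocket for transcription. The stream URL is a placeholder.
import express from "express";
import twilio from "twilio";

const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded params

app.post("/voice", (req, res) => {
  const response = new twilio.twiml.VoiceResponse();
  const connect = response.connect();
  connect.stream({ url: "wss://example.com/media" }); // placeholder media-stream endpoint
  res.type("text/xml").send(response.toString());
});

app.listen(3000, () => console.log("Voice webhook listening on :3000"));
```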

What’s Next

  • Custom wake words for hands‑free use
  • Calendar/email integration for richer context
  • Sub‑500ms response times

Try It (And Break It)

We’re live in beta, free for now. Check it out here.

Questions for the Community

What’s the most useful “invisible” tool you’ve built?

Sometimes the best tech is the stuff you don’t notice, and that’s exactly what we wanted Graza.ai to be.

Always happy to chat about the technical details if anyone's curious about specific parts of the implementation!
