We didn’t set out to build Voice AI. We set out to stop missing calls that mattered.
Constant interruptions vs. missing critical info — that was the problem we faced. We wanted something that handled calls intelligently without adding another app or dashboard.
Our 2-Month Tech Stack
Graza.ai uses a multi‑provider approach to balance performance, cost, and reliability while maintaining human-like quality for voice interactions.
- Twilio – Voice call routing and real‑time communication
- Deepgram – High‑performance speech‑to‑text transcription
- ElevenLabs – Natural voice synthesis in 70+ languages
- OpenAI & Anthropic – Context understanding and human-like responses
- Google AI Services – AI processing and infrastructure support
- AWS – Additional hosting for scalability
- Microsoft Azure – Backup AI services and flexibility
- Plausible Analytics – Privacy‑focused, cookie‑less tracking
- Postmark – Transactional email delivery
- Mailgun – Marketing and broader email capabilities
- Google Cloud & Firebase – Firestore, Functions, and Cloud Storage
Big decision: We used proven APIs instead of training custom models. It let us ship fast and focus on orchestration logic.
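For the curious, here's roughly what "orchestration instead of custom models" means in practice. This is a minimal sketch, not our production code: the model choice, voice ID, and function shape are placeholders.

```typescript
// Sketch: one conversational turn, stitched together from hosted APIs.
// Assumes OPENAI_API_KEY, ELEVENLABS_API_KEY, and ELEVENLABS_VOICE_ID env vars.
import OpenAI from "openai";

const openai = new OpenAI();

async function handleTurn(transcript: string, systemPrompt: string): Promise<Buffer> {
  // 1. Reason about the caller's utterance with a hosted LLM.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o", // illustrative model choice
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: transcript },
    ],
  });
  const reply = completion.choices[0].message.content ?? "Sorry, could you repeat that?";

  // 2. Synthesize the reply with ElevenLabs' text-to-speech REST endpoint.
  const ttsResponse = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${process.env.ELEVENLABS_VOICE_ID}`,
    {
      method: "POST",
      headers: {
        "xi-api-key": process.env.ELEVENLABS_API_KEY!,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text: reply, model_id: "eleven_multilingual_v2" }),
    }
  );

  // 3. Return raw audio to be streamed back over the Twilio call leg.
  return Buffer.from(await ttsResponse.arrayBuffer());
}
```

Everything hard (speech recognition, reasoning, synthesis) is someone else's API; our job is the wiring, the context, and the error handling.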
Hard Lessons
Phone audio is brutal
Crystal‑clear tests worked. Real calls? A mess of accents, background noise, and bad connections. We rebuilt our pipeline for real‑world conditions.
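One concrete example of what rebuilding for real‑world conditions tends to involve: telling the speech‑to‑text layer exactly what phone audio looks like. The sketch below uses Deepgram's streaming API parameters for 8 kHz mu‑law telephony audio (the format Twilio Media Streams deliver); the exact options are a moving target, so treat the values as illustrative.

```typescript
// Sketch: opening a Deepgram live-transcription socket tuned for telephony
// audio rather than studio-quality input.
import WebSocket from "ws";

const params = new URLSearchParams({
  model: "nova-2",          // general model; a phone-call-tuned variant is another option
  encoding: "mulaw",        // Twilio sends 8-bit mu-law frames
  sample_rate: "8000",
  channels: "1",
  interim_results: "true",  // partial transcripts keep latency down
  smart_format: "true",
});

const dg = new WebSocket(`wss://api.deepgram.com/v1/listen?${params}`, {
  headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` },
});

dg.on("message", (data) => {
  const msg = JSON.parse(data.toString());
  const transcript = msg.channel?.alternatives?.[0]?.transcript;
  if (transcript) {
    // Hand the (possibly partial) transcript to the conversation loop.
    console.log(msg.is_final ? "final:" : "partial:", transcript);
  }
});

// Elsewhere: forward raw audio chunks from the Twilio media stream with dg.send(chunk).
```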
Context is everything
The AI must remember:
- Who’s calling and why
- Past conversations
- VIP lists and language preferences
- Current availability
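In code, "remembering" is just a context record we assemble before the model sees a single word, then fold into the system prompt. A sketch with illustrative field names (the real record lives in Firestore):

```typescript
// Sketch: the per-call context assembled before each conversation.
interface CallContext {
  callerNumber: string;
  callerName?: string;          // from contacts or past calls
  reasonGuess?: string;         // e.g. "delivery", "sales", "family"
  pastSummaries: string[];      // summaries of previous conversations
  isVip: boolean;
  preferredLanguage?: string;
  ownerAvailable: boolean;      // current availability of the person being called
}

function buildSystemPrompt(ctx: CallContext): string {
  return [
    "You are a phone assistant answering on the owner's behalf.",
    ctx.callerName ? `The caller is ${ctx.callerName}.` : "The caller is unknown.",
    ctx.isVip ? "This caller is on the VIP list: offer to connect them immediately." : "",
    ctx.preferredLanguage ? `Respond in ${ctx.preferredLanguage}.` : "",
    ctx.pastSummaries.length
      ? `Previous conversations: ${ctx.pastSummaries.join(" | ")}`
      : "",
    ctx.ownerAvailable
      ? "The owner is available for urgent matters."
      : "The owner is unavailable; take a detailed message.",
  ].filter(Boolean).join("\n");
}
```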
Latency kills experience
Even 2–3 second delays felt broken. We cut it down to ~800ms average.
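There's no single trick here; the biggest class of win is streaming every stage instead of waiting for complete responses. A sketch of one piece of that, streaming LLM tokens and handing complete sentences to TTS as they arrive (the `synthesize` callback stands in for the ElevenLabs call shown earlier):

```typescript
// Sketch: start synthesizing speech at sentence boundaries instead of waiting
// for the full LLM reply.
import OpenAI from "openai";

const openai = new OpenAI();

async function streamReply(
  messages: OpenAI.Chat.ChatCompletionMessageParam[],
  synthesize: (sentence: string) => Promise<void>
): Promise<void> {
  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    stream: true,
  });

  let buffer = "";
  for await (const chunk of stream) {
    buffer += chunk.choices[0]?.delta?.content ?? "";
    // Flush whenever a complete sentence is available, so audio starts playing
    // while the rest of the reply is still being generated.
    const match = buffer.match(/^(.+?[.!?])\s+(.*)$/s);
    if (match) {
      await synthesize(match[1]);
      buffer = match[2];
    }
  }
  if (buffer.trim()) await synthesize(buffer.trim());
}
```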
Multilingual Surprise
Users tested Spanish, Mandarin, and French calls immediately. We hadn’t planned for it, but GPT‑4 handled them well with one simple rule:
“Detect the caller’s language and respond naturally in the same language. If uncertain, ask for preference.”
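That rule is literally just prompt text appended to the system prompt built from the call context; there's no separate language‑detection service. A tiny sketch (the function name is ours, for illustration):

```typescript
// Sketch: the language rule is one more line in the system prompt.
const LANGUAGE_RULE =
  "Detect the caller's language and respond naturally in the same language. " +
  "If uncertain, ask for preference.";

function withLanguageRule(systemPrompt: string): string {
  return `${systemPrompt}\n${LANGUAGE_RULE}`;
}
```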
What Works After 2 Months
- Handles deliveries, sales, and family calls intelligently
- Responds in multiple languages
- Summarizes calls clearly
- No app needed — works through your existing phone
Current Beta Performance (Real-World Metrics):
- ~900ms average response time (measured end‑to‑end on 50+ calls)
- ~92% transcription accuracy in clean conditions (quiet environment)
- ~76% accuracy in noisy conditions (mobile calls, background chatter)
- 8.3/10 average beta tester satisfaction (small cohort of early users)
We’re actively optimizing latency and noise handling; sub‑500ms and >80% noisy‑call accuracy are our next targets.
Mistakes You Can Avoid
- Testing only with perfect audio — real calls are messy
- Underestimating context complexity — conversations build on each other
- Skipping observability — when calls fail, you must know where and why
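On that last point: since Firestore is already in the stack, one lightweight pattern is writing an event per pipeline stage per call, so a failed call can be traced end to end. A sketch; the collection name and fields are illustrative, not our exact schema.

```typescript
// Sketch: one event per pipeline stage so a failed call can be traced.
import { initializeApp } from "firebase-admin/app";
import { getFirestore, FieldValue } from "firebase-admin/firestore";

initializeApp();
const db = getFirestore();

type Stage = "telephony" | "stt" | "llm" | "tts";

async function logCallEvent(
  callSid: string,                          // Twilio's call identifier ties events to one call
  stage: Stage,
  ok: boolean,
  detail: Record<string, unknown> = {}      // e.g. latency in ms, error message, transcript length
): Promise<void> {
  await db.collection("call_events").add({
    callSid,
    stage,
    ok,
    detail,
    at: FieldValue.serverTimestamp(),
  });
}

// Usage: await logCallEvent(callSid, "stt", false, { error: err.message });
```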
Resources That Saved Us Time
- Deepgram’s real‑time API docs (excellent)
- OpenAI function calling for structured responses (see the sketch after this list)
- Twilio voice webhooks for handling call flows
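As promised above, here's how function calling can pin the model to a structured call summary instead of free text. The schema fields are illustrative, not our exact ones.

```typescript
// Sketch: use OpenAI tool/function calling to get a structured call summary.
import OpenAI from "openai";

const openai = new OpenAI();

async function summarizeCall(transcript: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "Summarize the finished phone call." },
      { role: "user", content: transcript },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "record_call_summary",
          description: "Store a structured summary of the call",
          parameters: {
            type: "object",
            properties: {
              caller: { type: "string" },
              intent: { type: "string", enum: ["delivery", "sales", "personal", "other"] },
              summary: { type: "string" },
              follow_up_needed: { type: "boolean" },
            },
            required: ["caller", "intent", "summary", "follow_up_needed"],
          },
        },
      },
    ],
    // Force the model to answer via the function so the output is always parseable.
    tool_choice: { type: "function", function: { name: "record_call_summary" } },
  });

  const toolCall = completion.choices[0].message.tool_calls?.[0];
  return toolCall && toolCall.type === "function"
    ? JSON.parse(toolCall.function.arguments)
    : null;
}
```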
What’s Next
- Custom wake words for hands‑free use
- Calendar/email integration for richer context
- Sub‑500ms response times
Questions for the Community
What’s the most useful “invisible” tool you’ve built?
Try It (And Break It)
We’re live in beta — check it out here — free for now.
Sometimes the best tech is the stuff you don’t notice — and that’s exactly what we wanted Graza.ai to be.
Always happy to chat about the technical details if anyone's curious about specific parts of the implementation!