Every voice AI tool I evaluated did the same thing: listen to speech, convert to text, send to an LLM, return audio. Essentially a chatbot with a microphone.
But I wanted something different. I wanted voice AI that could actually do things on a website — click buttons, fill forms, navigate pages. A voice agent, not a voice chatbot.
So I built AnveVoice.
The Problem with Voice Chatbots
Here's what most "voice AI" tools do:
- User speaks
- Speech-to-text converts it
- Text goes to an LLM
- LLM generates a response
- Text-to-speech reads it back
That's it. The AI talks back, but it doesn't do anything. It can't click your "Book Appointment" button. It can't fill in your contact form. It can't navigate to your pricing page.
For websites, this is a huge missed opportunity. 96.3% of websites fail basic accessibility standards (WebAIM 2025). Voice navigation isn't just a feature — it's an accessibility requirement.
The Architecture: Voice → Intent → DOM Action
Here's how AnveVoice works differently:
```
User Speech → STT (sub-200ms) → Intent Parser → Action Router
                                                     │
                        ┌────────────────────────────┼──────────────────┐
                        ▼                            ▼                  ▼
                  DOM Actions                   Navigation          Form Fill
                (click, scroll)              (page redirect)     (input values)
                        │                            │                  │
                        └────────────────────────────┼──────────────────┘
                                                     ▼
                     Visual Feedback → TTS Response → State Update
```
The key innovation is the Action Router. Instead of just generating text responses, the AI interprets user intent and maps it to real DOM actions using 46 MCP (Model Context Protocol) tools over JSON-RPC 2.0.
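In MCP terms, each DOM action is exposed as a tool and invoked through a JSON-RPC 2.0 `tools/call` request. A minimal sketch of the message shape (the tool name `dom.click` and its arguments are illustrative, not AnveVoice's actual tool set):

```javascript
// Build an MCP-style tool invocation. "tools/call" is MCP's standard
// tool-invocation method; the specific tool and arguments here are
// hypothetical examples.
function buildToolCall(id, tool, args) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: tool, arguments: args },
  };
}

const msg = buildToolCall(42, "dom.click", { selector: "#book-appointment" });
console.log(JSON.stringify(msg, null, 2));
```

The Action Router's job is then just dispatch: parse the intent, pick one of the 46 tools, and send a message like this over the wire.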
Real DOM Actions
When a user says "Book an appointment for Tuesday," AnveVoice doesn't just say "I'd be happy to help you book an appointment." It actually:
- Identifies the booking form on the page
- Fills in the date field with next Tuesday's date
- Clicks the submit button
- Confirms the booking with voice feedback
This is possible because we maintain a real-time DOM map of the page and use semantic understanding to match user intents to actionable elements.
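The two pieces — a pre-computed element index and intent-to-element matching — can be sketched roughly like this. The map entries, labels, and keyword-overlap scoring are illustrative stand-ins for the real semantic matcher:

```javascript
// Sketch of a pre-computed DOM map: actionable elements are indexed at page
// load with semantic labels, so query time is a lookup, not a DOM traversal.
const domMap = [
  { selector: "#booking-form input[name=date]",    labels: ["appointment", "date", "book"] },
  { selector: "#booking-form button[type=submit]", labels: ["submit", "confirm", "send"] },
  { selector: "nav a[href='/pricing']",            labels: ["pricing", "plans", "cost"] },
];

// Toy matcher: score each indexed element by keyword overlap with the
// parsed intent and return the best hit.
function matchIntent(intentWords, map) {
  let best = null, bestScore = 0;
  for (const entry of map) {
    const score = entry.labels.filter((l) => intentWords.includes(l)).length;
    if (score > bestScore) { bestScore = score; best = entry; }
  }
  return best;
}

const hit = matchIntent(["book", "appointment", "tuesday"], domMap);
console.log(hit.selector); // → "#booking-form input[name=date]"
```

A production matcher would use embeddings rather than keyword overlap, but the shape is the same: intent in, selector out, then a DOM action against that selector.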
The Technical Challenge: Sub-700ms Latency
End-to-end voice latency needs to stay under about a second to feel natural; we target 700ms. Here's our pipeline:
| Stage | Target | Actual |
|---|---|---|
| STT | < 200ms | ~180ms |
| Intent Parse | < 100ms | ~80ms |
| Action Execution | < 200ms | ~150ms |
| TTS | < 200ms | ~190ms |
| Total | < 700ms | ~600ms |
We achieve this by:
- Streaming STT — processing audio chunks as they arrive, not waiting for silence detection
- Pre-computed DOM maps — indexing actionable elements on page load so we don't need to traverse the DOM at query time
- Parallel TTS — starting speech synthesis while the action is still executing
- Edge inference — running intent classification at the edge, not round-tripping to a central server
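The parallel-TTS idea in particular is easy to sketch: once the intent is parsed, action execution and speech synthesis can overlap instead of running back to back. The stage functions below are stand-ins, with the table's target timings simulated as delays:

```javascript
// Simulate pipeline stages as timed promises (timings are the targets above).
const delay = (ms, value) => new Promise((r) => setTimeout(() => r(value), ms));

async function handleUtterance() {
  const text = await delay(180, "book an appointment");  // streaming STT
  const intent = await delay(80, { action: "click" });   // intent parse
  // Action execution and TTS overlap: total cost is max(150, 190),
  // not 150 + 190.
  const [actionResult, audio] = await Promise.all([
    delay(150, "clicked #submit"),                       // action execution
    delay(190, "<tts audio>"),                           // TTS synthesis
  ]);
  return { text, intent, actionResult, audio };
}

handleUtterance().then((r) => console.log(r.actionResult));
```

The same overlap trick is why streaming STT matters: each stage starts on partial input from the previous one instead of waiting for it to finish.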
The Embed: One Script Tag
The entire integration is a single script tag:
```html
<script
  src="https://widget.anvevoice.app/embed.js"
  data-agent-id="YOUR_AGENT_ID">
</script>
```
That's it. No WebRTC server management. No complex API integration. Works with React, Vue, Angular, Next.js, Shopify, WordPress, or any HTML page.
The widget handles:
- Microphone permission and audio capture
- Real-time speech recognition in 50+ languages
- Intent classification and action routing
- DOM manipulation and visual feedback
- Text-to-speech response in the detected language
50+ Languages (Including 22 Indian Languages)
This was non-negotiable for us. India has 700M+ smartphone users, and roughly 65% of mobile searches there happen in languages other than English.
We support all 22 scheduled Indian languages plus Hinglish (Hindi-English code-switching), which is how most urban Indians actually communicate with technology.
The language detection works automatically — if a user starts speaking Hindi, the system detects it, locks to Hindi for the session, and responds in Hindi. No configuration needed.
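A session-level language lock can be sketched like this. The detector, the 0.8 confidence threshold, and the API shape are all assumptions for illustration, not the real implementation:

```javascript
// Pin the first confidently detected language for the whole session so
// responses don't flip between languages mid-conversation.
function createSession(detect) {
  let locked = null;
  return {
    language(utterance) {
      if (locked === null) {
        const { lang, confidence } = detect(utterance);
        if (confidence >= 0.8) locked = lang; // assumed lock threshold
        else return lang;                     // low confidence: don't lock yet
      }
      return locked;
    },
  };
}

// Toy detector: treat Devanagari text as Hindi, everything else as English.
const detect = (s) => /[\u0900-\u097F]/.test(s)
  ? { lang: "hi", confidence: 0.95 }
  : { lang: "en", confidence: 0.95 };

const session = createSession(detect);
console.log(session.language("अपॉइंटमेंट बुक करो")); // locks to "hi"
console.log(session.language("okay thanks"));        // still "hi" this session
```

Handling Hinglish is harder than this sketch suggests, since a single utterance mixes scripts and vocabularies; that's where the real detector earns its keep.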
Pricing: Flat-Rate vs. Per-Minute
Most voice AI tools charge per minute:
- Retell AI: ~$0.13-0.31/min
- Vapi: ~$0.15-0.33/min
- ElevenLabs: ~$0.08-0.10/min
At 1,000 minutes/month, that's $80-$330.
AnveVoice uses flat-rate token pricing:
- Free: $0/mo (50K tokens)
- Growth: $35/mo (500K tokens, 3 bots)
- Enterprise: Custom
Predictable costs. No surprise bills.
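The comparison above is simple arithmetic, sketched here with the quoted per-minute rates (rates vary by provider, plan, and model):

```javascript
// Per-minute billing scales linearly with usage; round to cents.
function perMinuteCost(minutes, ratePerMin) {
  return Math.round(minutes * ratePerMin * 100) / 100;
}

const minutes = 1000;
console.log(perMinuteCost(minutes, 0.08)); // low end  → 80
console.log(perMinuteCost(minutes, 0.33)); // high end → 330
// A $35/mo flat plan stays $35 at any usage level within its token cap.
```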
What's Next
We're currently focused on:
- Healthcare — 94% appointment booking success rate in pilot clinics
- E-commerce — Voice-powered product discovery and checkout
- Government portals — Citizen services in vernacular languages
- Accessibility — Making WCAG 2.1 AA compliance achievable through voice
Try It
You can try AnveVoice at anvevoice.app or see the experience hub at experience.anvevoice.app.
The embed is free to start. If you're building a website that needs voice interaction — especially if accessibility or multilingual support matters — give it a try.
I'm Adarsh, founder of ANVE.AI. I'm a cybersecurity professional (CISA/CEH certified) who got obsessed with making the web more accessible through voice. If you have questions about the architecture or want to discuss voice AI, drop a comment below or find me on LinkedIn.