Every voice AI tool I evaluated did the same thing: listen to speech, convert to text, send to an LLM, return audio. Essentially a chatbot with a microphone.
But I wanted something different. I wanted voice AI that could actually do things on a website — click buttons, fill forms, navigate pages. A voice agent, not a voice chatbot.
So I built AnveVoice.
The Problem with Voice Chatbots
Here's what most "voice AI" tools do:
- User speaks
- Speech-to-text converts it
- Text goes to an LLM
- LLM generates a response
- Text-to-speech reads it back
That's it. The AI talks back, but it doesn't do anything. It can't click your "Book Appointment" button. It can't fill in your contact form. It can't navigate to your pricing page.
For websites, this is a huge missed opportunity. 96.3% of websites fail basic accessibility standards (WebAIM 2025). Voice navigation isn't just a feature — it's an accessibility requirement.
The Architecture: Voice → Intent → DOM Action
Here's how AnveVoice works differently:
```
User Speech → STT (sub-200ms) → Intent Parser → Action Router
                                                     │
                        ┌────────────────────────────┼──────────────────┐
                        ▼                            ▼                  ▼
                  DOM Actions                   Navigation          Form Fill
                (click, scroll)              (page redirect)     (input values)
                        │                            │                  │
                        └────────────────────────────┼──────────────────┘
                                                     ▼
                     Visual Feedback → TTS Response → State Update
```
The key innovation is the Action Router. Instead of just generating text responses, the AI interprets user intent and maps it to real DOM actions using 46 MCP (Model Context Protocol) tools over JSON-RPC 2.0.
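In MCP terms, each DOM action is exposed as a tool and invoked through a JSON-RPC 2.0 `tools/call` request. A minimal sketch of the message shape (the tool name `dom.click` and its arguments are illustrative, not AnveVoice's actual tool set):

```javascript
// Build an MCP-style tool invocation. "tools/call" is MCP's standard
// tool-invocation method; the specific tool and arguments here are
// hypothetical examples.
function buildToolCall(id, tool, args) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: tool, arguments: args },
  };
}

const msg = buildToolCall(42, "dom.click", { selector: "#book-appointment" });
console.log(JSON.stringify(msg, null, 2));
```

The Action Router's job is then just dispatch: parse the intent, pick one of the 46 tools, and send a message like this over the wire.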
Real DOM Actions
When a user says "Book an appointment for Tuesday," AnveVoice doesn't just say "I'd be happy to help you book an appointment." It actually:
- Identifies the booking form on the page
- Fills in the date field with next Tuesday's date
- Clicks the submit button
- Confirms the booking with voice feedback
This is possible because we maintain a real-time DOM map of the page and use semantic understanding to match user intents to actionable elements.
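The two pieces — a pre-computed element index and intent-to-element matching — can be sketched roughly like this. The map entries, labels, and keyword-overlap scoring are illustrative stand-ins for the real semantic matcher:

```javascript
// Sketch of a pre-computed DOM map: actionable elements are indexed at page
// load with semantic labels, so query time is a lookup, not a DOM traversal.
const domMap = [
  { selector: "#booking-form input[name=date]",    labels: ["appointment", "date", "book"] },
  { selector: "#booking-form button[type=submit]", labels: ["submit", "confirm", "send"] },
  { selector: "nav a[href='/pricing']",            labels: ["pricing", "plans", "cost"] },
];

// Toy matcher: score each indexed element by keyword overlap with the
// parsed intent and return the best hit.
function matchIntent(intentWords, map) {
  let best = null, bestScore = 0;
  for (const entry of map) {
    const score = entry.labels.filter((l) => intentWords.includes(l)).length;
    if (score > bestScore) { bestScore = score; best = entry; }
  }
  return best;
}

const hit = matchIntent(["book", "appointment", "tuesday"], domMap);
console.log(hit.selector); // → "#booking-form input[name=date]"
```

A production matcher would use embeddings rather than keyword overlap, but the shape is the same: intent in, selector out, then a DOM action against that selector.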
The Technical Challenge: Sub-700ms Latency
End-to-end voice latency needs to stay under about a second to feel natural; we target 700ms. Here's our pipeline:
| Stage | Target | Actual |
|---|---|---|
| STT | < 200ms | ~180ms |
| Intent Parse | < 100ms | ~80ms |
| Action Execution | < 200ms | ~150ms |
| TTS | < 200ms | ~190ms |
| Total | < 700ms | ~600ms |
We achieve this by:
- Streaming STT — processing audio chunks as they arrive, not waiting for silence detection
- Pre-computed DOM maps — indexing actionable elements on page load so we don't need to traverse the DOM at query time
- Parallel TTS — starting speech synthesis while the action is still executing
- Edge inference — running intent classification at the edge, not round-tripping to a central server
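The parallel-TTS idea in particular is easy to sketch: once the intent is parsed, action execution and speech synthesis can overlap instead of running back to back. The stage functions below are stand-ins, with the table's target timings simulated as delays:

```javascript
// Simulate pipeline stages as timed promises (timings are the targets above).
const delay = (ms, value) => new Promise((r) => setTimeout(() => r(value), ms));

async function handleUtterance() {
  const text = await delay(180, "book an appointment");  // streaming STT
  const intent = await delay(80, { action: "click" });   // intent parse
  // Action execution and TTS overlap: total cost is max(150, 190),
  // not 150 + 190.
  const [actionResult, audio] = await Promise.all([
    delay(150, "clicked #submit"),                       // action execution
    delay(190, "<tts audio>"),                           // TTS synthesis
  ]);
  return { text, intent, actionResult, audio };
}

handleUtterance().then((r) => console.log(r.actionResult));
```

The same overlap trick is why streaming STT matters: each stage starts on partial input from the previous one instead of waiting for it to finish.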
The Embed: One Script Tag
The entire integration is a single script tag:
```html
<script
  src="https://widget.anvevoice.app/embed.js"
  data-agent-id="YOUR_AGENT_ID">
</script>
```
That's it. No WebRTC server management. No complex API integration. Works with React, Vue, Angular, Next.js, Shopify, WordPress, or any HTML page.
The widget handles:
- Microphone permission and audio capture
- Real-time speech recognition in 50+ languages
- Intent classification and action routing
- DOM manipulation and visual feedback
- Text-to-speech response in the detected language
50+ Languages (Including 22 Indian Languages)
This was non-negotiable for us. India has 700M+ smartphone users, and roughly 65% of mobile searches there happen in languages other than English.
We support all 22 scheduled Indian languages plus Hinglish (Hindi-English code-switching), which is how most urban Indians actually communicate with technology.
The language detection works automatically — if a user starts speaking Hindi, the system detects it, locks to Hindi for the session, and responds in Hindi. No configuration needed.
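A session-level language lock can be sketched like this. The detector, the 0.8 confidence threshold, and the API shape are all assumptions for illustration, not the real implementation:

```javascript
// Pin the first confidently detected language for the whole session so
// responses don't flip between languages mid-conversation.
function createSession(detect) {
  let locked = null;
  return {
    language(utterance) {
      if (locked === null) {
        const { lang, confidence } = detect(utterance);
        if (confidence >= 0.8) locked = lang; // assumed lock threshold
        else return lang;                     // low confidence: don't lock yet
      }
      return locked;
    },
  };
}

// Toy detector: treat Devanagari text as Hindi, everything else as English.
const detect = (s) => /[\u0900-\u097F]/.test(s)
  ? { lang: "hi", confidence: 0.95 }
  : { lang: "en", confidence: 0.95 };

const session = createSession(detect);
console.log(session.language("अपॉइंटमेंट बुक करो")); // locks to "hi"
console.log(session.language("okay thanks"));        // still "hi" this session
```

Handling Hinglish is harder than this sketch suggests, since a single utterance mixes scripts and vocabularies; that's where the real detector earns its keep.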
Pricing: Flat-Rate vs. Per-Minute
Most voice AI tools charge per minute:
- Retell AI: ~$0.13-0.31/min
- Vapi: ~$0.15-0.33/min
- ElevenLabs: ~$0.08-0.10/min
At 1,000 minutes/month, that's $80-$330.
AnveVoice uses flat-rate token pricing:
- Free: $0/mo (50K tokens)
- Growth: $35/mo (500K tokens, 3 bots)
- Enterprise: Custom
Predictable costs. No surprise bills.
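The comparison above is simple arithmetic, sketched here with the quoted per-minute rates (rates vary by provider, plan, and model):

```javascript
// Per-minute billing scales linearly with usage; round to cents.
function perMinuteCost(minutes, ratePerMin) {
  return Math.round(minutes * ratePerMin * 100) / 100;
}

const minutes = 1000;
console.log(perMinuteCost(minutes, 0.08)); // low end  → 80
console.log(perMinuteCost(minutes, 0.33)); // high end → 330
// A $35/mo flat plan stays $35 at any usage level within its token cap.
```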
What's Next
We're currently focused on:
- Healthcare — 94% appointment booking success rate in pilot clinics
- E-commerce — Voice-powered product discovery and checkout
- Government portals — Citizen services in vernacular languages
- Accessibility — Making WCAG 2.1 AA compliance achievable through voice
Try It
You can try AnveVoice at anvevoice.app or see the experience hub at experience.anvevoice.app.
The embed is free to start. If you're building a website that needs voice interaction — especially if accessibility or multilingual support matters — give it a try.
I'm Adarsh, founder of ANVE.AI. I'm a cybersecurity professional (CISA/CEH certified) who got obsessed with making the web more accessible through voice. If you have questions about the architecture or want to discuss voice AI, drop a comment below or find me on LinkedIn.