What if you could add a voice AI assistant to any website with a single line of code?
That's what I built. One `<script>` tag. The user talks. The AI listens, understands, and takes real actions on the page — clicking buttons, filling forms, navigating pages.
Here's how it works under the hood.
The Problem
Most websites are built for mouse-and-keyboard users. But:
- 15-20% of the global population has some form of disability
- Voice search is growing 35% year over year
- WCAG 2.1 AA compliance is now legally required for government and healthcare sites (deadline: April 24, 2026)
- Mobile users on the go need hands-free interaction
Traditional chatbots just answer questions. They don't do anything on the page. I wanted to build something that actually takes action.
The Architecture
AnveVoice has three core layers:
1. Speech-to-Text (STT)
We use a streaming STT pipeline that achieves sub-200ms first-token latency. The audio is captured via the Web Audio API:
```javascript
// Simplified audio capture
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
// Stream chunks to the STT service via WebSocket
```
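Before those chunks go over the wire, the Float32 samples the Web Audio graph produces typically need converting to 16-bit PCM, which is what most streaming STT services accept. A minimal sketch of that conversion (an assumption about the wire format, not AnveVoice's actual encoder):

```javascript
// Convert Web Audio Float32 samples (range -1..1) to 16-bit PCM.
// Illustrative helper; the real widget's wire format may differ.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i])); // clamp to valid range
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;               // scale to int16
  }
  return pcm;
}
```

Each converted chunk can then be sent as a binary WebSocket frame.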
We support 53 languages with automatic language detection. The system identifies the language within the first 500ms of audio.
2. Intent Resolution + DOM Mapping
This is the hard part. Once we have the transcribed text, we need to:
- Understand intent: "I want to buy the blue shoes in size 10" maps to `{ action: "click", target: "product-variant-blue", then: ["select-size-10", "add-to-cart"] }`
- Map to DOM elements: we crawl the page's accessibility tree and semantic HTML to find matching elements
- Execute actions: click, scroll, fill form fields, navigate
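The chained `then` steps imply a queue of sequential actions. A hypothetical helper (treating `then` as a list of follow-up click targets, which is an assumption about the intent schema) that flattens a chained intent into an ordered action list:

```javascript
// Hypothetical: flatten a chained intent into an ordered list of actions.
// Treats `then` as an array of follow-up click targets; AnveVoice's real
// intent schema may differ.
function flattenIntent(intent) {
  const { then = [], ...first } = intent;
  return [first, ...then.map((target) => ({ action: 'click', target }))];
}

flattenIntent({
  action: 'click',
  target: 'product-variant-blue',
  then: ['select-size-10', 'add-to-cart'],
});
// -> three click actions, executed in order
```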
```javascript
// Simplified DOM action executor
async function executeVoiceAction(intent) {
  const { action, target, value } = intent;

  // Find the target element using multiple strategies
  const element = await findElement(target, [
    'aria-label',    // ARIA attributes first
    'data-testid',   // Test IDs
    'innerText',     // Visible text matching
    'semantic-role', // HTML5 semantic roles
  ]);
  if (!element) return; // no confident match — ask the user to rephrase

  switch (action) {
    case 'click':
      element.click();
      break;
    case 'fill':
      element.value = value;
      element.dispatchEvent(new Event('input', { bubbles: true }));
      break;
    case 'navigate':
      window.location.href = element.href;
      break;
  }
}
```
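The executor leans on `findElement`, whose implementation isn't shown. Here's a sketch of the strategy-ordered matching it implies, operating on plain descriptor objects rather than live DOM nodes so the logic is easy to follow (the real lookup would query the page; the function and field names here are assumptions):

```javascript
// Sketch of strategy-ordered element matching. Candidates are plain objects
// standing in for DOM nodes; earlier strategies take priority.
function matchElement(target, strategies, candidates) {
  const needle = target.toLowerCase();
  for (const strategy of strategies) {
    const hit = candidates.find((el) =>
      (el[strategy] || '').toLowerCase().includes(needle)
    );
    if (hit) return hit; // first strategy with a match wins
  }
  return null; // no confident match
}
```

Ordering the strategies this way means a well-labeled ARIA attribute always beats a fuzzy text match, which keeps behavior predictable on accessible pages.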
3. Text-to-Speech (TTS) Response
After executing the action, the system confirms what it did via natural speech. We use streaming TTS for sub-300ms response time.
The total pipeline: STT (200ms) + Intent (100ms) + Action (50ms) + TTS (300ms) = 650ms end-to-end, under the 700ms budget.
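The budget above is simple enough to sanity-check in code (numbers are the per-stage targets quoted in this post, not live telemetry):

```javascript
// Per-stage latency budget for the pipeline described above (ms).
const budget = { stt: 200, intent: 100, action: 50, tts: 300 };
const total = Object.values(budget).reduce((sum, ms) => sum + ms, 0);
console.log(total); // 650 — inside the sub-700ms end-to-end target
```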
The One-Tag Integration
Here's what the actual integration looks like:
```html
<script src="https://app.anvevoice.app/widget.js"
        data-key="your-api-key">
</script>
```
That's it. The script:
- Injects a floating voice button into the page
- Handles microphone permissions
- Streams audio to our STT service
- Resolves intents against the current page's DOM
- Executes actions and provides voice feedback
No server-side changes. No framework dependencies. Works with React, Vue, Angular, vanilla HTML, Shopify, WordPress — anything with a DOM.
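One way a one-tag embed can find its own API key is by reading the `data-*` attributes off its own `<script>` element via `document.currentScript`. A hedged sketch of that pattern (the validation and config shape here are assumptions, not the real widget.js):

```javascript
// Hypothetical config reader for the embed script. In the browser it would
// be called as readWidgetConfig(document.currentScript.dataset), where the
// data-key attribute surfaces as dataset.key.
function readWidgetConfig(dataset) {
  if (!dataset || !dataset.key) {
    throw new Error('AnveVoice: missing data-key attribute on the script tag');
  }
  return { apiKey: dataset.key };
}
```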
What It Can Actually Do
Real examples from production:
- E-commerce: "Show me red dresses under fifty dollars" → filters products, scrolls to results
- Healthcare forms: "Fill in my date of birth, March 15, 1985" → finds the DOB field, enters the date
- Government portals: "Navigate to the benefits application page" → clicks through menu navigation
- Multi-language: A user says the same command in Hindi, Spanish, or Japanese — same result
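The healthcare example implies normalizing a spoken date into the `YYYY-MM-DD` format that `<input type="date">` fields expect. A minimal sketch for English month-day-year phrasing (illustrative only; a production parser needs many more formats and locales):

```javascript
// Illustrative: "March 15, 1985" -> "1985-03-15". English-only sketch.
const MONTHS = [
  'january', 'february', 'march', 'april', 'may', 'june',
  'july', 'august', 'september', 'october', 'november', 'december',
];

function spokenDateToISO(text) {
  const m = text.toLowerCase().match(/([a-z]+)\s+(\d{1,2}),?\s+(\d{4})/);
  if (!m) return null;                    // no month-day-year pattern found
  const month = MONTHS.indexOf(m[1]) + 1;
  if (month === 0) return null;           // unrecognized month name
  const pad = (n) => String(n).padStart(2, '0');
  return `${m[3]}-${pad(month)}-${pad(m[2])}`;
}
```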
The WCAG Compliance Angle
The April 24, 2026 WCAG 2.1 AA deadline affects:
- Government sites serving 50,000+ people
- Healthcare organizations receiving federal funding
- Any site that wants to avoid accessibility lawsuits ($55K+/day penalties for government entities)
Voice interfaces aren't just nice to have anymore. They're becoming a legal requirement for accessible web experiences.
Performance Numbers
After 6 months of optimization:
| Metric | Target | Actual |
|---|---|---|
| End-to-end latency | <1000ms | 680ms avg |
| Language detection | <500ms | 420ms |
| DOM action execution | <100ms | 45ms |
| Languages supported | 20+ | 53 |
| Integration time | <5 min | ~60 seconds |
Try It
AnveVoice is live at anvevoice.app.
- Free tier: 60 conversations/month
- Growth: $36/month for 2,100 conversations
- Scale: $120/month for high-volume sites
If you're working on accessibility, multilingual support, or just want to make your site more interactive — I'd love to hear what you think.
Drop a comment if you have questions about the architecture, the DOM mapping approach, or the STT/TTS pipeline. Happy to go deeper on any of these.
I'm Adarsh, solo founder building AnveVoice. Currently pivoting from horizontal positioning to three urgent verticals: healthcare, government, and international e-commerce. Building in public on Twitter/X.