What if you could add a voice AI assistant to any website with a single line of code?
That's what I built. One `<script>` tag. The user talks. The AI listens, understands, and takes real actions on the page — clicking buttons, filling forms, navigating pages.
Here's how it works under the hood.
The Problem
Most websites are built for mouse-and-keyboard users. But:
- 15-20% of the global population has some form of disability
- Voice search is growing 35% year over year
- WCAG 2.1 AA compliance is now legally required for government and healthcare sites (deadline: April 24, 2026)
- Mobile users on the go need hands-free interaction
Traditional chatbots just answer questions. They don't do anything on the page. I wanted to build something that actually takes action.
The Architecture
AnveVoice has three core layers:
1. Speech-to-Text (STT)
We use a streaming STT pipeline that achieves sub-200ms first-token latency. The audio is captured via the Web Audio API:
```javascript
// Simplified audio capture
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
// Stream chunks to the STT service via WebSocket
```
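Before those chunks go over the wire, the Float32 samples the Web Audio graph produces typically need converting to 16-bit PCM, which is what most streaming STT services accept. A minimal sketch of that conversion (an assumption about the wire format, not AnveVoice's actual encoder):

```javascript
// Convert Web Audio Float32 samples (range -1..1) to 16-bit PCM.
// Illustrative helper; the real widget's wire format may differ.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i])); // clamp to valid range
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;               // scale to int16
  }
  return pcm;
}
```

Each converted chunk can then be sent as a binary WebSocket frame.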
We support 53 languages with automatic language detection. The system identifies the language within the first 500ms of audio.
2. Intent Resolution + DOM Mapping
This is the hard part. Once we have the transcribed text, we need to:
- Understand intent: "I want to buy the blue shoes in size 10" maps to `{ action: "click", target: "product-variant-blue", then: ["select-size-10", "add-to-cart"] }`
- Map to DOM elements: we crawl the page's accessibility tree and semantic HTML to find matching elements
- Execute actions: click, scroll, fill form fields, navigate
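The chained `then` steps imply a queue of sequential actions. A hypothetical helper (treating `then` as a list of follow-up click targets, which is an assumption about the intent schema) that flattens a chained intent into an ordered action list:

```javascript
// Hypothetical: flatten a chained intent into an ordered list of actions.
// Treats `then` as an array of follow-up click targets; AnveVoice's real
// intent schema may differ.
function flattenIntent(intent) {
  const { then = [], ...first } = intent;
  return [first, ...then.map((target) => ({ action: 'click', target }))];
}

flattenIntent({
  action: 'click',
  target: 'product-variant-blue',
  then: ['select-size-10', 'add-to-cart'],
});
// -> three click actions, executed in order
```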
```javascript
// Simplified DOM action executor
async function executeVoiceAction(intent) {
  const { action, target, value } = intent;

  // Find the target element using multiple strategies
  const element = await findElement(target, [
    'aria-label',    // ARIA attributes first
    'data-testid',   // Test IDs
    'innerText',     // Visible text matching
    'semantic-role', // HTML5 semantic roles
  ]);
  if (!element) return; // no confident match — ask the user to rephrase

  switch (action) {
    case 'click':
      element.click();
      break;
    case 'fill':
      element.value = value;
      element.dispatchEvent(new Event('input', { bubbles: true }));
      break;
    case 'navigate':
      window.location.href = element.href;
      break;
  }
}
```
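The executor leans on `findElement`, whose implementation isn't shown. Here's a sketch of the strategy-ordered matching it implies, operating on plain descriptor objects rather than live DOM nodes so the logic is easy to follow (the real lookup would query the page; the function and field names here are assumptions):

```javascript
// Sketch of strategy-ordered element matching. Candidates are plain objects
// standing in for DOM nodes; earlier strategies take priority.
function matchElement(target, strategies, candidates) {
  const needle = target.toLowerCase();
  for (const strategy of strategies) {
    const hit = candidates.find((el) =>
      (el[strategy] || '').toLowerCase().includes(needle)
    );
    if (hit) return hit; // first strategy with a match wins
  }
  return null; // no confident match
}
```

Ordering the strategies this way means a well-labeled ARIA attribute always beats a fuzzy text match, which keeps behavior predictable on accessible pages.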
3. Text-to-Speech (TTS) Response
After executing the action, the system confirms what it did via natural speech. We use streaming TTS for sub-300ms response time.
The total pipeline: STT (200ms) + Intent (100ms) + Action (50ms) + TTS (300ms) = 650ms end-to-end, under the 700ms budget.
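The budget above is simple enough to sanity-check in code (numbers are the per-stage targets quoted in this post, not live telemetry):

```javascript
// Per-stage latency budget for the pipeline described above (ms).
const budget = { stt: 200, intent: 100, action: 50, tts: 300 };
const total = Object.values(budget).reduce((sum, ms) => sum + ms, 0);
console.log(total); // 650 — inside the sub-700ms end-to-end target
```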
The One-Tag Integration
Here's what the actual integration looks like:
```html
<script src="https://app.anvevoice.app/widget.js"
        data-key="your-api-key">
</script>
```
That's it. The script:
- Injects a floating voice button into the page
- Handles microphone permissions
- Streams audio to our STT service
- Resolves intents against the current page's DOM
- Executes actions and provides voice feedback
No server-side changes. No framework dependencies. Works with React, Vue, Angular, vanilla HTML, Shopify, WordPress — anything with a DOM.
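One way a one-tag embed can find its own API key is by reading the `data-*` attributes off its own `<script>` element via `document.currentScript`. A hedged sketch of that pattern (the validation and config shape here are assumptions, not the real widget.js):

```javascript
// Hypothetical config reader for the embed script. In the browser it would
// be called as readWidgetConfig(document.currentScript.dataset), where the
// data-key attribute surfaces as dataset.key.
function readWidgetConfig(dataset) {
  if (!dataset || !dataset.key) {
    throw new Error('AnveVoice: missing data-key attribute on the script tag');
  }
  return { apiKey: dataset.key };
}
```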
What It Can Actually Do
Real examples from production:
- E-commerce: "Show me red dresses under fifty dollars" → filters products, scrolls to results
- Healthcare forms: "Fill in my date of birth, March 15, 1985" → finds the DOB field, enters the date
- Government portals: "Navigate to the benefits application page" → clicks through menu navigation
- Multi-language: A user says the same command in Hindi, Spanish, or Japanese — same result
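The healthcare example implies normalizing a spoken date into the `YYYY-MM-DD` format that `<input type="date">` fields expect. A minimal sketch for English month-day-year phrasing (illustrative only; a production parser needs many more formats and locales):

```javascript
// Illustrative: "March 15, 1985" -> "1985-03-15". English-only sketch.
const MONTHS = [
  'january', 'february', 'march', 'april', 'may', 'june',
  'july', 'august', 'september', 'october', 'november', 'december',
];

function spokenDateToISO(text) {
  const m = text.toLowerCase().match(/([a-z]+)\s+(\d{1,2}),?\s+(\d{4})/);
  if (!m) return null;                    // no month-day-year pattern found
  const month = MONTHS.indexOf(m[1]) + 1;
  if (month === 0) return null;           // unrecognized month name
  const pad = (n) => String(n).padStart(2, '0');
  return `${m[3]}-${pad(month)}-${pad(m[2])}`;
}
```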
The WCAG Compliance Angle
The April 24, 2026 WCAG 2.1 AA deadline affects:
- Government sites serving 50,000+ people
- Healthcare organizations receiving federal funding
- Any site that wants to avoid accessibility lawsuits ($55K+/day penalties for government entities)
Voice interfaces aren't just nice to have anymore. They're becoming a legal requirement for accessible web experiences.
Performance Numbers
After 6 months of optimization:
| Metric | Target | Actual |
|---|---|---|
| End-to-end latency | <1000ms | 680ms avg |
| Language detection | <500ms | 420ms |
| DOM action execution | <100ms | 45ms |
| Languages supported | 20+ | 53 |
| Integration time | <5 min | ~60 seconds |
Try It
AnveVoice is live at anvevoice.app.
- Free tier: 60 conversations/month
- Growth: $36/month for 2,100 conversations
- Scale: $120/month for high-volume sites
If you're working on accessibility, multilingual support, or just want to make your site more interactive — I'd love to hear what you think.
Drop a comment if you have questions about the architecture, the DOM mapping approach, or the STT/TTS pipeline. Happy to go deeper on any of these.
I'm Adarsh, solo founder building AnveVoice. Currently pivoting from horizontal positioning to three urgent verticals: healthcare, government, and international e-commerce. Building in public on Twitter/X.