DEV Community

Adarsh Kant


I Added Voice AI to Any Website with One Script Tag

What if you could add a voice AI assistant to any website with a single line of code?

That's what I built. One `<script>` tag. The user talks. The AI listens, understands, and takes real actions on the page: clicking buttons, filling forms, navigating pages.

Here's how it works under the hood.

The Problem

Most websites are built for mouse-and-keyboard users. But:

  • 15-20% of the global population has some form of disability
  • Voice search is growing 35% year over year
  • WCAG 2.1 AA compliance is now legally required for government and healthcare sites (deadline: April 24, 2026)
  • Mobile users on the go need hands-free interaction

Traditional chatbots just answer questions. They don't do anything on the page. I wanted to build something that actually takes action.

The Architecture

AnveVoice has three core layers:

1. Speech-to-Text (STT)

We use a streaming STT pipeline that achieves sub-200ms first-token latency. The audio is captured via the Web Audio API:

// Simplified audio capture
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
// Stream chunks to STT service via WebSocket
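The snippet stops at "stream chunks," so here is a minimal sketch of the conversion step that has to happen first. This is my own illustration, not code from the widget: STT services generally expect 16-bit PCM, while the Web Audio API yields Float32 samples in [-1, 1], so each chunk gets converted before it goes over the wire. `floatTo16BitPCM` and `socket` are assumed names.

```javascript
// Hypothetical helper (not from the post): convert a Float32 audio chunk
// from the Web Audio API into 16-bit PCM for a streaming STT service.
function floatTo16BitPCM(float32) {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp out-of-range samples
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;      // scale into the int16 range
  }
  return int16;
}

// In the browser, a chunk would then ship as binary over the WebSocket:
//   socket.send(floatTo16BitPCM(chunk).buffer);
```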

We support 53 languages with automatic language detection. The system identifies the language within the first 500ms of audio.

2. Intent Resolution + DOM Mapping

This is the hard part. Once we have the transcribed text, we need to:

  1. Understand intent: "I want to buy the blue shoes in size 10" maps to {action: "click", target: "product-variant-blue", then: "select-size-10", then: "add-to-cart"}
  2. Map to DOM elements: We crawl the page's accessibility tree and semantic HTML to find matching elements
  3. Execute actions: Click, scroll, fill form fields, navigate

// Simplified DOM action executor
async function executeVoiceAction(intent) {
  const { action, target, value } = intent;

  // Find the target element using multiple strategies
  const element = await findElement(target, [
    'aria-label',      // ARIA attributes first
    'data-testid',     // Test IDs
    'innerText',       // Visible text matching
    'semantic-role',   // HTML5 semantic roles
  ]);

  // Bail out gracefully if no strategy matched
  if (!element) {
    console.warn(`AnveVoice: no element found for target "${target}"`);
    return;
  }

  switch (action) {
    case 'click':
      element.click();
      break;
    case 'fill':
      element.value = value;
      // Fire an input event so frameworks (React, Vue) see the change
      element.dispatchEvent(new Event('input', { bubbles: true }));
      break;
    case 'navigate':
      window.location.href = element.href;
      break;
    default:
      console.warn(`AnveVoice: unsupported action "${action}"`);
  }
}
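The executor above awaits a `findElement` helper it never defines, so here is a rough sketch of what the strategy ordering could look like. This is an assumption on my part: the real widget walks the live accessibility tree (and is presumably async), while this version matches against plain objects keyed by strategy name so the priority logic is visible on its own.

```javascript
// Sketch of a multi-strategy element lookup. Earlier strategies win:
// an aria-label hit beats a fuzzy visible-text match. "elements" here
// are plain objects standing in for DOM nodes.
function findElement(target, strategies, elements) {
  const needle = target.toLowerCase();
  for (const strategy of strategies) {
    const match = elements.find((el) => {
      const value = el[strategy];
      return typeof value === 'string' && value.toLowerCase().includes(needle);
    });
    if (match) return match; // first strategy that hits wins
  }
  return null; // caller decides how to report "not found" to the user
}

// Example: resolving the spoken target "add to cart"
const candidates = [
  { 'aria-label': 'Search products', innerText: 'Search' },
  { 'aria-label': 'Add to cart', innerText: 'Add to cart' },
];
const hit = findElement('add to cart', ['aria-label', 'data-testid', 'innerText'], candidates);
// hit → the "Add to cart" candidate
```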

3. Text-to-Speech (TTS) Response

After executing the action, the system confirms what it did via natural speech. We use streaming TTS for sub-300ms response time.

The total pipeline: STT (200ms) + intent (100ms) + action (50ms) + TTS (300ms) ≈ 650ms, comfortably under 700ms end-to-end.

The One-Tag Integration

Here's what the actual integration looks like:

<script src="https://app.anvevoice.app/widget.js"
        data-key="your-api-key">
</script>

That's it. The script:

  1. Injects a floating voice button into the page
  2. Handles microphone permissions
  3. Streams audio to our STT service
  4. Resolves intents against the current page's DOM
  5. Executes actions and provides voice feedback

No server-side changes. No framework dependencies. Works with React, Vue, Angular, vanilla HTML, Shopify, WordPress — anything with a DOM.
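If you can't edit static HTML (say, a React SPA), the same one-tag integration can be injected after the app mounts. The pattern below is my own sketch, not an official API: only the `widget.js` URL and the `data-key` attribute come from the post; `widgetScriptAttrs` is a made-up helper.

```javascript
// Hypothetical SPA integration: build the attributes for the widget
// <script> tag so it can be injected programmatically.
function widgetScriptAttrs(apiKey) {
  return {
    src: 'https://app.anvevoice.app/widget.js',
    'data-key': apiKey,
    async: '', // don't block the host page while the widget loads
  };
}

// Browser side: create and append the tag once, e.g. in a useEffect:
//   const s = document.createElement('script');
//   for (const [k, v] of Object.entries(widgetScriptAttrs(key))) {
//     s.setAttribute(k, v);
//   }
//   document.body.appendChild(s);
```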

What It Can Actually Do

Real examples from production:

  • E-commerce: "Show me red dresses under fifty dollars" → filters products, scrolls to results
  • Healthcare forms: "Fill in my date of birth, March 15, 1985" → finds the DOB field, enters the date
  • Government portals: "Navigate to the benefits application page" → clicks through menu navigation
  • Multi-language: A user says the same command in Hindi, Spanish, or Japanese — same result

The WCAG Compliance Angle

The April 24, 2026 WCAG 2.1 AA deadline affects:

  • Government sites serving 50,000+ people
  • Healthcare organizations receiving federal funding
  • Any site that wants to avoid accessibility lawsuits ($55K+/day penalties for government entities)

Voice interfaces aren't just nice to have anymore. They're becoming a legal requirement for accessible web experiences.

Performance Numbers

After 6 months of optimization:

| Metric | Target | Actual |
| --- | --- | --- |
| End-to-end latency | <1000ms | 680ms avg |
| Language detection | <500ms | 420ms |
| DOM action execution | <100ms | 45ms |
| Languages supported | 20+ | 53 |
| Integration time | <5 min | ~60 seconds |

Try It

AnveVoice is live at anvevoice.app.

  • Free tier: 60 conversations/month
  • Growth: $36/month for 2,100 conversations
  • Scale: $120/month for high-volume sites

If you're working on accessibility, multilingual support, or just want to make your site more interactive — I'd love to hear what you think.

Drop a comment if you have questions about the architecture, the DOM mapping approach, or the STT/TTS pipeline. Happy to go deeper on any of these.


I'm Adarsh, solo founder building AnveVoice. Currently pivoting from horizontal positioning to three urgent verticals: healthcare, government, and international e-commerce. Building in public on Twitter/X.
