Seryl Lns
Speech-to-Text on any Field in 70 Lines of Stimulus

Typing is slow. Users hate it.

Sandrine from accounting types with two fingers. Kevin (14) types like he's on TikTok.

Your PM types slowly. Your users type badly.

So I built voice input for any field in ~70 lines of Stimulus. No backend. No API. No billing.

The end result

<!-- That's it. Any input becomes voice-enabled. -->
<div data-controller="speech-to-text">
  <input data-speech-to-text-target="input" placeholder="Search stays...">
  <button data-speech-to-text-target="mic" data-action="speech-to-text#toggle">🎤</button>
</div>

Click the mic, speak, the text appears. Done.

The controller: 70 lines

// app/javascript/controllers/speech_to_text_controller.js
import { Controller } from "@hotwired/stimulus"

export default class extends Controller {
  static targets = ["input", "mic"]
  static values = {
    lang:          { type: String,  default: "fr-FR" },
    autoSubmit:    { type: Boolean, default: false },
    listeningText: { type: String,  default: "Speak..." }
  }

  connect() {
    this.recognition = null
    this.isListening = false
    this.originalPlaceholder = this.inputTarget.placeholder
    this.initRecognition()
  }

  disconnect() {
    this.stop()
  }

  initRecognition() {
    const SR = window.SpeechRecognition || window.webkitSpeechRecognition
    if (!SR) {
      // Browser doesn't support it: hide the mic button
      if (this.hasMicTarget) this.micTarget.style.display = "none"
      return
    }

    this.recognition = new SR()
    this.recognition.lang = this.langValue
    this.recognition.continuous = false
    this.recognition.interimResults = true

    this.recognition.onresult = (event) => {
      const transcript = Array.from(event.results)
        .map(r => r[0].transcript)
        .join("")
      this.inputTarget.value = transcript

      if (event.results[event.results.length - 1].isFinal) {
        this.stop()
        // Fire standard events so other JS/Stimulus controllers react
        this.inputTarget.dispatchEvent(new Event("input", { bubbles: true }))
        this.inputTarget.dispatchEvent(new Event("change", { bubbles: true }))
        // Optional: auto-submit the parent form
        if (this.autoSubmitValue && transcript.trim()) {
          const form = this.inputTarget.closest("form")
          if (form) form.requestSubmit()
        }
      }
    }

    this.recognition.onerror = () => this.stop()
    this.recognition.onend = () => { if (this.isListening) this.stop() }
  }

  toggle() {
    if (!this.recognition) return
    this.isListening ? this.stop() : this.start()
  }

  start() {
    if (!this.recognition || this.isListening) return
    this.isListening = true
    this.inputTarget.placeholder = this.listeningTextValue
    if (this.hasMicTarget) this.micTarget.classList.add("stt-active")
    this.dispatch("start")
    try { this.recognition.start() } catch (_) { this.stop() }
  }

  stop() {
    if (!this.isListening) return
    this.isListening = false
    this.inputTarget.placeholder = this.originalPlaceholder
    if (this.hasMicTarget) this.micTarget.classList.remove("stt-active")
    this.dispatch("stop")
    try { this.recognition?.stop() } catch (_) { /* noop */ }
  }
}

That's the whole thing. Let's break down the design decisions.

How it works

1. Progressive enhancement

The controller checks for SpeechRecognition support in connect(). If the browser doesn't support it (Firefox, for example), the mic button is hidden. No errors, no broken UI. The input keeps working normally.

const SR = window.SpeechRecognition || window.webkitSpeechRecognition
if (!SR) {
  if (this.hasMicTarget) this.micTarget.style.display = "none"
  return
}

2. Interim results

interimResults: true means the user sees their words appear in real-time as they speak. It feels responsive. When the final result arrives, we fire input and change events so any other Stimulus controllers or event listeners react naturally.
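The two moving parts here can be mocked with plain arrays and an EventTarget: assembling the transcript from the result chunks, then re-dispatching "input" because setting .value programmatically fires no events on its own (a minimal sketch, not the real SpeechRecognition objects):

```javascript
// Mock of the onresult logic. event.results is a list of results; taking the
// top alternative of each and joining gives the phrase so far.
const fakeResults = [
  [{ transcript: "book a " }],  // earlier chunk
  [{ transcript: "room" }],     // current chunk
]
const transcript = Array.from(fakeResults).map(r => r[0].transcript).join("")

// Setting .value directly is silent, so listeners never notice; dispatching a
// bubbling "input" event is what lets other controllers react.
const field = new EventTarget()
const seen = []
field.addEventListener("input", () => seen.push(field.value))

field.value = transcript                                   // silent on its own
field.dispatchEvent(new Event("input", { bubbles: true })) // listeners fire now

console.log(transcript) // "book a room"
console.log(seen)       // ["book a room"]
```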

3. Configurable via HTML

All behavior is controlled through Stimulus values; no JS changes needed:

<!-- English, auto-submit when done speaking -->
<div data-controller="speech-to-text"
     data-speech-to-text-lang-value="en-US"
     data-speech-to-text-auto-submit-value="true"
     data-speech-to-text-listening-text-value="Listening...">
  <textarea data-speech-to-text-target="input"></textarea>
  <button data-speech-to-text-target="mic"
          data-action="speech-to-text#toggle">🎤</button>
</div>
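The kebab-case attribute names above follow mechanically from the controller identifier and the camelCase value names. A quick sketch of the mapping (the helper function is illustrative, not part of Stimulus's API):

```javascript
// Sketch: how a Stimulus value name maps to its data attribute.
// data-<identifier>-<kebab-cased value name>-value
function valueAttribute(identifier, valueName) {
  const kebab = valueName.replace(/([A-Z])/g, "-$1").toLowerCase()
  return `data-${identifier}-${kebab}-value`
}

console.log(valueAttribute("speech-to-text", "autoSubmit"))
// "data-speech-to-text-auto-submit-value"
console.log(valueAttribute("speech-to-text", "listeningText"))
// "data-speech-to-text-listening-text-value"
```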

4. Stimulus events for composition

The controller dispatches speech-to-text:start and speech-to-text:stop events. Other controllers can listen to them:

<div data-controller="speech-to-text my-other-controller"
     data-action="speech-to-text:start->my-other-controller#onListening
                  speech-to-text:stop->my-other-controller#onDone">
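Those data-action bindings work because this.dispatch("start") emits a DOM event named "&lt;identifier&gt;:&lt;name&gt;". A minimal sketch, with a plain EventTarget standing in for the controller's element:

```javascript
// Sketch: what dispatch("start") / dispatch("stop") boil down to.
// A plain EventTarget stands in for the DOM element here.
const element = new EventTarget()
const identifier = "speech-to-text"

const log = []
element.addEventListener(`${identifier}:start`, () => log.push("listening"))
element.addEventListener(`${identifier}:stop`, () => log.push("done"))

// Dispatching the prefixed events, as the controller does:
element.dispatchEvent(new Event(`${identifier}:start`, { bubbles: true }))
element.dispatchEvent(new Event(`${identifier}:stop`, { bubbles: true }))

console.log(log) // ["listening", "done"]
```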

Usage examples

Search bar with auto-submit

<form action="/search" method="get">
  <div data-controller="speech-to-text"
       data-speech-to-text-auto-submit-value="true">
    <input name="q"
           data-speech-to-text-target="input"
           placeholder="Search...">
    <button type="button"
            data-speech-to-text-target="mic"
            data-action="speech-to-text#toggle">🎤</button>
  </div>
</form>

Speak → text fills in → form submits automatically.
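The controller calls form.requestSubmit() rather than form.submit() deliberately: submit() bypasses submit listeners and constraint validation, so Turbo (or any other handler) would never intercept the request. A mocked sketch of the difference, with an EventTarget standing in for the form:

```javascript
// Why requestSubmit() and not submit(): only requestSubmit() fires the form's
// "submit" event, which is what Turbo hooks into. FakeForm is a stand-in.
class FakeForm extends EventTarget {
  submit() { /* navigates immediately; no "submit" event fires */ }
  requestSubmit() { this.dispatchEvent(new Event("submit")) }
}

const form = new FakeForm()
let intercepted = 0
form.addEventListener("submit", () => intercepted++) // e.g. Turbo's hook

form.submit()        // listener never runs
form.requestSubmit() // listener runs

console.log(intercepted) // 1
```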

Textarea for notes

<div data-controller="speech-to-text"
     data-speech-to-text-lang-value="fr-FR">
  <textarea data-speech-to-text-target="input"
            placeholder="Ajoutez une note..."
            rows="4"></textarea>
  <button type="button"
          data-speech-to-text-target="mic"
          data-action="speech-to-text#toggle">🎤</button>
</div>

No auto-submit: the user dictates, reviews, then saves manually.

Styled mic button with active state

Add a few lines of CSS for visual feedback:

.stt-active {
  background: #ef4444 !important;
  color: white !important;
  animation: stt-pulse 1.5s infinite;
}

@keyframes stt-pulse {
  0%, 100% { box-shadow: 0 0 0 0 rgba(239, 68, 68, 0.4); }
  50% { box-shadow: 0 0 0 8px rgba(239, 68, 68, 0); }
}

The controller adds/removes the stt-active class on the mic target automatically.

Multiple instances on the same page

Each instance is independent. Put 10 on the same page, they all work:

<div data-controller="speech-to-text">
  <input data-speech-to-text-target="input" placeholder="First name">
  <button data-speech-to-text-target="mic" data-action="speech-to-text#toggle">🎤</button>
</div>

<div data-controller="speech-to-text">
  <input data-speech-to-text-target="input" placeholder="Last name">
  <button data-speech-to-text-target="mic" data-action="speech-to-text#toggle">🎤</button>
</div>
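This works because Stimulus instantiates one controller object per data-controller element, each with its own isListening flag and recognition instance. A toy sketch of that isolation (stand-in class, not the real controller):

```javascript
// Sketch: per-element controller instances share no state, so toggling one
// mic cannot affect another. Instance is an illustrative stand-in.
class Instance {
  constructor() { this.isListening = false }
  toggle() { this.isListening = !this.isListening }
}

const firstName = new Instance()
const lastName = new Instance()

firstName.toggle() // start listening on the first field only

console.log(firstName.isListening, lastName.isListening) // true false
```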

Browser support

| Browser | Support |
|---|---|
| Chrome / Edge | Full (v33+) |
| Safari | Full (v14.1+) |
| Firefox | Not supported |
| Mobile Chrome | Full |
| Mobile Safari | Full (iOS 14.5+) |

That covers ~85% of users. On unsupported browsers, the mic button simply disappears.

Why not use a paid API?

| | Web Speech API | Whisper API | Google Cloud STT |
|---|---|---|---|
| Cost | $0 | $0.006/min | $0.006/min |
| Latency | Instant | 1-3s roundtrip | 1-2s roundtrip |
| Privacy | On-device* | Sent to OpenAI | Sent to Google |
| Offline | Partial | No | No |
| Setup | 0 lines backend | API key + endpoint | API key + endpoint |

*Chrome may send audio to Google for processing, but it's handled transparently by the browser.

For a dictation use-case (filling forms, search bars, notes), the Web Speech API is more than enough. Save the paid APIs for when you need transcription of audio files or real-time streaming in production.
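To put the $0.006/min in perspective, a rough back-of-envelope (the user count and minutes below are invented for illustration):

```javascript
// Illustrative cost estimate for a paid STT API; all inputs are assumptions.
const pricePerMinute = 0.006      // Whisper / Google Cloud STT list price
const users = 1000                // hypothetical active users
const minutesPerUserPerMonth = 30 // hypothetical dictation volume

const monthlyCost = pricePerMinute * users * minutesPerUserPerMonth
console.log(monthlyCost.toFixed(2)) // "180.00"
```

$180/month for something the browser does for free is the whole argument.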

Going further: a Rails helper to drop it in one line

The HTML boilerplate works fine for a one-off. But if you're adding voice input to 10+ forms across your app, you'll want something shorter. Here's a Rails helper that reduces it to a single line:

# app/helpers/speech_to_text_helper.rb
module SpeechToTextHelper
  def stt_input(form, field, lang: "fr-FR", **input_options)
    disabled = input_options[:disabled]
    input_html = (input_options.delete(:input_html) || {})
      .merge(data: { speech_to_text_target: "input" })
    input_options[:input_html] = input_html

    wrapper_attrs = {
      data: { controller: "speech-to-text", speech_to_text_lang_value: lang },
      style: "position: relative;"
    }

    content_tag(:div, wrapper_attrs) do
      concat form.input(field, **input_options)
      concat stt_mic_button unless disabled
    end
  end

  private

  def stt_mic_button
    content_tag(:button, type: "button",
      data: { speech_to_text_target: "mic", action: "speech-to-text#toggle" },
      style: "position: absolute; right: 1rem; top: 2.2rem; border-radius: 50%; " \
             "width: 1.8rem; height: 1.8rem; padding: 0; display: flex; " \
             "align-items: center; justify-content: center; " \
             "border: 1px solid #d1d5db; background: white; cursor: pointer;") do
      mic_svg_icon
    end
  end

  def mic_svg_icon
    tag.svg(xmlns: "http://www.w3.org/2000/svg", width: 12, height: 12,
            viewBox: "0 0 24 24", fill: "none", stroke: "currentColor",
            stroke_width: 2.5, stroke_linecap: "round", stroke_linejoin: "round") do
      safe_join([
        tag.path(d: "M12 1a3 3 0 0 0-3 3v8a3 3 0 0 0 6 0V4a3 3 0 0 0-3-3z"),
        tag.path(d: "M19 10v2a7 7 0 0 1-14 0v-2"),
        tag.line(x1: 12, y1: 19, x2: 12, y2: 23),
        tag.line(x1: 8, y1: 23, x2: 16, y2: 23)
      ])
    end
  end
end

This works with simple_form (if you use plain Rails form helpers, swap form.input for form.text_field and pass the data attributes directly instead of via input_html). Now any form field becomes voice-enabled in one line:

<%# Notes field with mic: one line %>
<%= stt_input f, :note, input_html: { class: "form-control" } %>

<%# Disabled field: mic is hidden automatically %>
<%= stt_input f, :note, disabled: @record.locked? %>

<%# English recognition %>
<%= stt_input f, :description, lang: "en-US", as: :text %>

<%# Any simple_form option works %>
<%= stt_input f, :address, placeholder: "Dictate your address...",
              input_html: { rows: 3 } %>

What the helper does:

  1. Wraps the input in a div with data-controller="speech-to-text" and relative positioning
  2. Adds data-speech-to-text-target="input" to the field automatically
  3. Appends the mic button (positioned absolute, top-right of the field)
  4. Hides the mic when disabled: true is passed
  5. Forwards every simple_form option (as:, placeholder:, input_html:, etc.) to the original input

The mic SVG is rendered server-side: no icon library needed.

Before: 15 lines of HTML per field:

<div data-controller="speech-to-text" style="position: relative;">
  <%= f.input :note, input_html: {
    class: "form-control",
    data: { speech_to_text_target: "input" }
  } %>
  <button type="button"
    data-speech-to-text-target="mic"
    data-action="speech-to-text#toggle"
    style="position: absolute; right: 1rem; top: 2.2rem; ...">
    <svg ...><!-- 4 paths --></svg>
  </button>
</div>

After: 1 line:

<%= stt_input f, :note, input_html: { class: "form-control" } %>

Wrapping up

70 lines of JS. 40 lines of Ruby helper. Zero dependencies. Zero cost. Works with <input>, <textarea>, and any Rails + Stimulus app.

The full controller and helper are copy-pasteable from the code blocks above. Drop them in your project and start adding voice to your forms β€” whether it's one field or fifty.

If you want to go further, you can combine this with the SpeechSynthesis API to read responses back, turning any UI into a voice assistant. But that's another article.
