Typing is slow. Users hate it.
Sandrine from accounting types with two fingers. Kevin (14) types like he's on TikTok.
Your PM types slowly. Your users type badly.
So I built voice input for any field in ~70 lines of Stimulus. No backend. No API. No billing.
The end result
<!-- That's it. Any input becomes voice-enabled. -->
<div data-controller="speech-to-text">
<input data-speech-to-text-target="input" placeholder="Search stays...">
<button data-speech-to-text-target="mic" data-action="speech-to-text#toggle">🎤</button>
</div>
Click the mic, speak, the text appears. Done.
The controller (70 lines)
// app/javascript/controllers/speech_to_text_controller.js
import { Controller } from "@hotwired/stimulus"
export default class extends Controller {
static targets = ["input", "mic"]
static values = {
lang: { type: String, default: "fr-FR" },
autoSubmit: { type: Boolean, default: false },
listeningText: { type: String, default: "Speak..." }
}
connect() {
this.recognition = null
this.isListening = false
this.originalPlaceholder = this.inputTarget.placeholder
this.initRecognition()
}
disconnect() {
this.stop()
}
initRecognition() {
const SR = window.SpeechRecognition || window.webkitSpeechRecognition
if (!SR) {
// Browser doesn't support it: hide the mic button
if (this.hasMicTarget) this.micTarget.style.display = "none"
return
}
this.recognition = new SR()
this.recognition.lang = this.langValue
this.recognition.continuous = false
this.recognition.interimResults = true
this.recognition.onresult = (event) => {
const transcript = Array.from(event.results)
.map(r => r[0].transcript)
.join("")
this.inputTarget.value = transcript
if (event.results[event.results.length - 1].isFinal) {
this.stop()
// Fire standard events so other JS/Stimulus controllers react
this.inputTarget.dispatchEvent(new Event("input", { bubbles: true }))
this.inputTarget.dispatchEvent(new Event("change", { bubbles: true }))
// Optional: auto-submit the parent form
if (this.autoSubmitValue && transcript.trim()) {
const form = this.inputTarget.closest("form")
if (form) form.requestSubmit()
}
}
}
this.recognition.onerror = () => this.stop()
this.recognition.onend = () => { if (this.isListening) this.stop() }
}
toggle() {
if (!this.recognition) return
this.isListening ? this.stop() : this.start()
}
start() {
if (!this.recognition || this.isListening) return
this.isListening = true
this.inputTarget.placeholder = this.listeningTextValue
if (this.hasMicTarget) this.micTarget.classList.add("stt-active")
this.dispatch("start")
try { this.recognition.start() } catch (_) { this.stop() }
}
stop() {
if (!this.isListening) return
this.isListening = false
this.inputTarget.placeholder = this.originalPlaceholder
if (this.hasMicTarget) this.micTarget.classList.remove("stt-active")
this.dispatch("stop")
try { this.recognition?.stop() } catch (_) { /* noop */ }
}
}
That's the whole thing. Let's break down the design decisions.
How it works
1. Progressive enhancement
The controller checks for SpeechRecognition support on connect(). If the browser doesn't support it (Firefox), the mic button is hidden. No errors, no broken UI. The input works normally.
const SR = window.SpeechRecognition || window.webkitSpeechRecognition
if (!SR) {
if (this.hasMicTarget) this.micTarget.style.display = "none"
return
}
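The same check can be pulled out into a small pure function, which makes it easy to try outside a browser. This is only a sketch: `resolveSpeechRecognition` is a hypothetical name, and the plain objects below stand in for `window`.

```javascript
// Sketch: resolve the recognition constructor from a window-like object.
// `resolveSpeechRecognition` is a hypothetical helper, not part of the controller above.
function resolveSpeechRecognition(globalLike) {
  if (!globalLike) return null
  // Try the standard name first, then the webkit-prefixed one.
  return globalLike.SpeechRecognition || globalLike.webkitSpeechRecognition || null
}

// Simulated environments (plain objects standing in for `window`).
class FakeRecognition {}
const supportedBrowser = { webkitSpeechRecognition: FakeRecognition }
const unsupportedBrowser = {}

console.log(resolveSpeechRecognition(supportedBrowser) === FakeRecognition) // true
console.log(resolveSpeechRecognition(unsupportedBrowser)) // null
```

When the function returns null, the controller's behavior applies: hide the mic and leave the input untouched.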
2. Interim results
interimResults: true means the user sees their words appear in real-time as they speak. It feels responsive. When the final result arrives, we fire input and change events so any other Stimulus controllers or event listeners react naturally.
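To see what the onresult handler computes, here is a minimal sketch that reduces a results list to the text shown in the input plus a "final" flag. `summarizeResults` and `fakeResult` are hypothetical names, and the fake objects only imitate the shape the controller reads (`result[0].transcript` and `result.isFinal`).

```javascript
// Sketch: reduce a results list to the displayed text and a "final" flag.
// Mirrors the logic of the controller's onresult handler.
function summarizeResults(results) {
  const list = Array.from(results)
  const transcript = list.map(r => r[0].transcript).join("")
  const isFinal = list.length > 0 && list[list.length - 1].isFinal === true
  return { transcript, isFinal }
}

// Hypothetical helper: a plain object shaped like one recognition result.
function fakeResult(text, isFinal) {
  return { 0: { transcript: text }, length: 1, isFinal }
}

// While the user is still speaking: text is shown, no events fired yet.
const interim = summarizeResults([fakeResult("search sta", false)])
// Once the browser marks the result final: the controller fires input/change.
const final = summarizeResults([fakeResult("search stays in Lyon", true)])

console.log(interim.transcript, interim.isFinal) // search sta false
console.log(final.transcript, final.isFinal) // search stays in Lyon true
```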
3. Configurable via HTML
All behavior is controlled through Stimulus values, so no JS changes are needed:
<!-- English, auto-submit when done speaking -->
<div data-controller="speech-to-text"
data-speech-to-text-lang-value="en-US"
data-speech-to-text-auto-submit-value="true"
data-speech-to-text-listening-text-value="Listening...">
<textarea data-speech-to-text-target="input"></textarea>
<button data-speech-to-text-target="mic"
data-action="speech-to-text#toggle">🎤</button>
</div>
4. Stimulus events for composition
The controller dispatches speech-to-text:start and speech-to-text:stop events. Other controllers can listen to them:
<div data-controller="speech-to-text my-other-controller"
data-action="speech-to-text:start->my-other-controller#onListening
speech-to-text:stop->my-other-controller#onDone">
<!-- input and mic button go here, as in the examples above -->
</div>
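These are ordinary DOM events, so code outside Stimulus can listen for them too. The sketch below uses a bare EventTarget as a stand-in for the element carrying data-controller="speech-to-text", and simulates the dispatch with plain Event objects (Stimulus itself emits CustomEvent instances under the same names).

```javascript
// Sketch: reacting to the controller's events without a second Stimulus controller.
// A bare EventTarget stands in for the controller's element.
const element = new EventTarget()
const log = []

element.addEventListener("speech-to-text:start", () => log.push("listening"))
element.addEventListener("speech-to-text:stop", () => log.push("done"))

// Simulate what this.dispatch("start") and this.dispatch("stop") emit.
element.dispatchEvent(new Event("speech-to-text:start", { bubbles: true }))
element.dispatchEvent(new Event("speech-to-text:stop", { bubbles: true }))

console.log(log.join(", ")) // listening, done
```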
Usage examples
Search bar with auto-submit
<form action="/search" method="get">
<div data-controller="speech-to-text"
data-speech-to-text-auto-submit-value="true">
<input name="q"
data-speech-to-text-target="input"
placeholder="Search...">
<button type="button"
data-speech-to-text-target="mic"
data-action="speech-to-text#toggle">🎤</button>
</div>
</form>
Speak → text fills in → form submits automatically.
Textarea for notes
<div data-controller="speech-to-text"
data-speech-to-text-lang-value="fr-FR">
<textarea data-speech-to-text-target="input"
placeholder="Ajoutez une note..."
rows="4"></textarea>
<button type="button"
data-speech-to-text-target="mic"
data-action="speech-to-text#toggle">🎤</button>
</div>
No auto-submit: the user dictates, reviews, then saves manually.
Styled mic button with active state
Add a few lines of CSS for visual feedback:
.stt-active {
background: #ef4444 !important;
color: white !important;
animation: stt-pulse 1.5s infinite;
}
@keyframes stt-pulse {
0%, 100% { box-shadow: 0 0 0 0 rgba(239, 68, 68, 0.4); }
50% { box-shadow: 0 0 0 8px rgba(239, 68, 68, 0); }
}
The controller adds/removes the stt-active class on the mic target automatically.
Multiple instances on the same page
Each instance is independent. Put 10 on the same page, they all work:
<div data-controller="speech-to-text">
<input data-speech-to-text-target="input" placeholder="First name">
<button data-speech-to-text-target="mic" data-action="speech-to-text#toggle">🎤</button>
</div>
<div data-controller="speech-to-text">
<input data-speech-to-text-target="input" placeholder="Last name">
<button data-speech-to-text-target="mic" data-action="speech-to-text#toggle">🎤</button>
</div>
Browser support
| Browser | Support |
|---|---|
| Chrome / Edge | Full (v33+) |
| Safari | Full (v14.1+) |
| Firefox | Not supported |
| Mobile Chrome | Full |
| Mobile Safari | Full (iOS 14.5+) |
That covers ~85% of users. On unsupported browsers, the mic button simply disappears.
Why not use a paid API?
| | Web Speech API | Whisper API | Google Cloud STT |
|---|---|---|---|
| Cost | $0 | $0.006/min | $0.006/min |
| Latency | Instant | 1-3s roundtrip | 1-2s roundtrip |
| Privacy | On-device* | Sent to OpenAI | Sent to Google |
| Offline | Partial | No | No |
| Setup | 0 lines backend | API key + endpoint | API key + endpoint |
*In Chrome, audio is generally sent to Google's servers for recognition, so "on-device" only holds for browsers that recognize speech locally. Either way, the browser handles it transparently.
For a dictation use-case (filling forms, search bars, notes), the Web Speech API is more than enough. Save the paid APIs for when you need transcription of audio files or real-time streaming in production.
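To put the per-minute pricing in perspective, here is a rough estimate. It is a sketch only: `monthlyCostUsd` is a hypothetical helper, the workload is made up, and the default rate is the $0.006/min figure from the table, so check current pricing before relying on it.

```javascript
// Sketch: monthly cost of a per-minute transcription API.
// The default rate comes from the comparison table above.
function monthlyCostUsd({ users, minutesPerUserPerDay, days = 30, ratePerMinute = 0.006 }) {
  const minutes = users * minutesPerUserPerDay * days
  // Round to cents to avoid floating-point noise in the result.
  return Math.round(minutes * ratePerMinute * 100) / 100
}

// Hypothetical workload: 500 users dictating 2 minutes a day for 30 days.
console.log(monthlyCostUsd({ users: 500, minutesPerUserPerDay: 2 })) // 180
```

The same workload costs nothing with the Web Speech API, which is the point of the comparison for short dictation.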
Going further: a Rails helper to drop it in one line
The HTML boilerplate works fine for a one-off. But if you're adding voice input to 10+ forms across your app, you'll want something shorter. Here's a Rails helper that reduces it to a single line:
# app/helpers/speech_to_text_helper.rb
module SpeechToTextHelper
def stt_input(form, field, lang: "fr-FR", **input_options)
disabled = input_options[:disabled]
# deep_merge keeps any data: attributes the caller already passed in input_html
input_html = (input_options.delete(:input_html) || {})
.deep_merge(data: { speech_to_text_target: "input" })
input_options[:input_html] = input_html
wrapper_attrs = {
data: { controller: "speech-to-text", speech_to_text_lang_value: lang },
style: "position: relative;"
}
content_tag(:div, wrapper_attrs) do
concat form.input(field, **input_options)
concat stt_mic_button unless disabled
end
end
private
def stt_mic_button
content_tag(:button, type: "button",
data: { speech_to_text_target: "mic", action: "speech-to-text#toggle" },
style: "position: absolute; right: 1rem; top: 2.2rem; border-radius: 50%; " \
"width: 1.8rem; height: 1.8rem; padding: 0; display: flex; " \
"align-items: center; justify-content: center; " \
"border: 1px solid #d1d5db; background: white; cursor: pointer;") do
mic_svg_icon
end
end
def mic_svg_icon
tag.svg(xmlns: "http://www.w3.org/2000/svg", width: 12, height: 12,
viewBox: "0 0 24 24", fill: "none", stroke: "currentColor",
stroke_width: 2.5, stroke_linecap: "round", stroke_linejoin: "round") do
safe_join([
tag.path(d: "M12 1a3 3 0 0 0-3 3v8a3 3 0 0 0 6 0V4a3 3 0 0 0-3-3z"),
tag.path(d: "M19 10v2a7 7 0 0 1-14 0v-2"),
tag.line(x1: 12, y1: 19, x2: 12, y2: 23),
tag.line(x1: 8, y1: 23, x2: 16, y2: 23)
])
end
end
end
This works with simple_form. With plain Rails form helpers, swap form.input for form.text_field and pass the HTML attributes directly, since those helpers have no input_html: option. Now any form field becomes voice-enabled in one line:
<%# Notes field with mic, in one line %>
<%= stt_input f, :note, input_html: { class: "form-control" } %>
<%# Disabled field: mic is hidden automatically %>
<%= stt_input f, :note, disabled: @record.locked? %>
<%# English recognition %>
<%= stt_input f, :description, lang: "en-US", as: :text %>
<%# Any simple_form option works %>
<%= stt_input f, :address, placeholder: "Dictate your address...",
input_html: { rows: 3 } %>
What the helper does:
- Wraps the input in a div with data-controller="speech-to-text" and relative positioning
- Adds data-speech-to-text-target="input" to the field automatically
- Appends the mic button (positioned absolute, top-right of the field)
- Hides the mic when disabled: true is passed
- Forwards every simple_form option (as:, placeholder:, input_html:, etc.) to the original input
The mic SVG is rendered server-side, so no icon library is needed.
Before (15 lines of HTML per field):
<div data-controller="speech-to-text" style="position: relative;">
<%= f.input :note, input_html: {
class: "form-control",
data: { speech_to_text_target: "input" }
} %>
<button type="button"
data-speech-to-text-target="mic"
data-action="speech-to-text#toggle"
style="position: absolute; right: 1rem; top: 2.2rem; ...">
<svg ...><!-- 4 paths --></svg>
</button>
</div>
After (1 line):
<%= stt_input f, :note, input_html: { class: "form-control" } %>
Wrapping up
70 lines of JS. 40 lines of Ruby helper. Zero dependencies. Zero cost. Works on <input>, <textarea>, any Rails + Stimulus app.
The full controller and helper are copy-pasteable from the code blocks above. Drop them in your project and start adding voice to your forms, whether it's one field or fifty.
If you want to go further, you can combine this with the SpeechSynthesis API to read responses back, turning any UI into a voice assistant. But that's another article.