I Built Speech-to-Text That Runs 100% On My Machine — No Cloud, No API Key (Whisper)

#machinelearning #ai #beginners #tutorial

Every "AI dictation" app I tried had the same fine print: your microphone gets streamed to someone's servers. For quick notes that's fine. For a doctor's patient notes, a lawyer's case memo, or a journalist's source interview — it's a dealbreaker. So for Day 47 of my TechFromZero series I built dictation that runs entirely on your own machine. No cloud, no API key, no per-minute bill, and it works on a plane with the wifi off.

The secret is Whisper, running locally through whisper.cpp.

Cloud STT vs on-device STT

The difference is just where the model runs:

cloud:      mic → 🌐 someone's server → text   (private data leaves your box)
on-device:  mic → your own CPU       → text   (nothing leaves at all)

On-device wins on privacy, cost (free forever), offline support, and rate limits (there aren't any). It loses a little on raw accuracy versus the biggest cloud models — but with a good Whisper model the gap is small, and for most dictation it's invisible.

What Whisper actually is

Whisper is an open speech-recognition model from OpenAI. You hand it a chunk of audio and it hands back text, and it's impressively robust to accents and background noise. The important word is open: the weights are downloadable, so you can run the model yourself instead of renting an API.

But the original Whisper wants Python, PyTorch, and ideally a GPU. That's a lot to ask of a laptop. Enter whisper.cpp.

whisper.cpp: the same model, on a plain CPU

whisper.cpp is a tiny C++ port of Whisper with no Python and no heavy dependencies. Two things make it laptop-friendly:

It's native C++ — compiles to a single binary, runs on the CPU.
Quantization: the weights are stored in about 5 bits each instead of the usual 16. For the base.en model used here, that shrinks the file from roughly 140 MB to around 60 MB and speeds up inference, with little accuracy loss.
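To see where the savings come from, here is a back-of-the-envelope size estimate. It's a sketch under assumptions that aren't stated above: the base model has roughly 74 million parameters, the unquantized ggml file stores them as 16-bit floats, and the q5_1 format packs each block of 32 weights into 24 bytes (5 bits per weight plus a 16-bit scale and a 16-bit minimum per block), which works out to 6 bits per weight. Real files come out a little larger because some tensors stay unquantized. The function name is made up for illustration.

```javascript
// Rough model-size estimate: parameters x bits per weight, ignoring file headers.
function estimateModelMB(paramCount, bitsPerWeight) {
  const bytes = (paramCount * bitsPerWeight) / 8;
  return bytes / 1e6; // decimal megabytes
}

const BASE_PARAMS = 74e6;          // assumption: ~74M parameters in Whisper base
const F16_BITS = 16;               // unquantized ggml weights
const Q5_1_BITS = (24 * 8) / 32;   // 24-byte block per 32 weights = 6 bits each

console.log(estimateModelMB(BASE_PARAMS, F16_BITS).toFixed(1));   // 148.0
console.log(estimateModelMB(BASE_PARAMS, Q5_1_BITS).toFixed(1));  // 55.5
```

The same arithmetic explains why the larger Whisper models benefit even more: the saving scales with the parameter count.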

Getting it running is genuinely three commands:

git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp && make
./models/download-ggml-model.sh base.en-q5_1   # quantized base.en model, roughly 60 MB

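And transcribing a file is one more (recent whisper.cpp checkouts build with CMake and name the binary build/bin/whisper-cli, so adjust the path if ./main is missing):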

./main -m models/ggml-base.en-q5_1.bin -f my-recording.wav

Run that, then unplug your internet and run it again. It still works. That's the whole point.
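If you want just the words from that output, a small parser helps. This sketch assumes the CLI's default output, where each segment is printed as a line like `[00:00:00.000 --> 00:00:04.000]   some text`. The function names and the sample text are made up for illustration.

```javascript
// Parse whisper.cpp-style segment lines into { start, end, text } objects.
// Assumption: lines look like "[00:00:00.000 --> 00:00:04.000]   some text".
const SEGMENT_RE = /^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$/;

function parseSegments(output) {
  const segments = [];
  for (const line of output.split(/\r?\n/)) {
    const match = SEGMENT_RE.exec(line.trim());
    if (match) {
      segments.push({ start: match[1], end: match[2], text: match[3].trim() });
    }
  }
  return segments;
}

// Join the segments into one plain-text transcript.
function toPlainText(output) {
  return parseSegments(output).map(s => s.text).join(" ");
}

const sample = [
  "[00:00:00.000 --> 00:00:04.000]   Hello from my laptop.",
  "[00:00:04.000 --> 00:00:07.500]   No network required.",
].join("\n");

console.log(toPlainText(sample)); // Hello from my laptop. No network required.
```

Lines that don't match the pattern (log messages, blank lines) are simply skipped.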

How Whisper turns sound into words

A neural network can't read a raw waveform, so Whisper does a few things under the hood (whisper.cpp handles all of it for you):

Resample to 16 kHz mono. Whisper expects this exact format — it's why every example feeds it a 16 kHz WAV.
Convert to a log-mel spectrogram — basically a picture of which frequencies are loud over time. That image is what the model reads.
Encoder → decoder. Whisper is a transformer with two halves: an encoder turns the spectrogram into a rich numeric summary, and a decoder writes out text one token at a time while attending to that summary. It's the same encoder–decoder shape behind translation models — here it "translates" sound into words.
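The first step above can be sketched in a few lines. This is a deliberately naive version: it downmixes by averaging channels and resamples with linear interpolation, whereas a production resampler would low-pass filter first to avoid aliasing. The frame count assumes Whisper's usual hop of 160 samples (10 ms at 16 kHz) between spectrogram columns. All function names are made up for illustration.

```javascript
// Downmix interleaved multi-channel audio to mono by averaging the channels.
function toMono(interleaved, channels) {
  const frames = Math.floor(interleaved.length / channels);
  const mono = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    let sum = 0;
    for (let c = 0; c < channels; c++) sum += interleaved[i * channels + c];
    mono[i] = sum / channels;
  }
  return mono;
}

// Resample with linear interpolation (sketch only: no anti-alias filter).
function resampleLinear(samples, fromRate, toRate) {
  if (fromRate === toRate) return Float32Array.from(samples);
  const outLength = Math.floor((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, samples.length - 1);
    const frac = pos - i0;
    out[i] = samples[i0] * (1 - frac) + samples[i1] * frac;
  }
  return out;
}

// Number of spectrogram columns for a clip, assuming a 160-sample hop.
function melFrameCount(numSamples, hop = 160) {
  return Math.floor(numSamples / hop);
}

const oneSecondAt48k = new Float32Array(48000);
console.log(resampleLinear(oneSecondAt48k, 48000, 16000).length); // 16000
console.log(melFrameCount(30 * 16000));                            // 3000
```

That last number is the width of the "picture" the encoder sees for a 30-second window of audio.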

You don't have to implement any of that. But knowing it explains the quirks: why audio must be 16 kHz, why bigger models are slower, and why it occasionally invents a word when the audio is silent (the decoder is, at heart, a tiny language model).
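One cheap guard against the silent-audio quirk is to skip transcription when the clip is basically empty. The sketch below measures the root-mean-square level of the samples and compares it to a threshold. The 0.01 threshold is a hand-picked assumption, not something from whisper.cpp, and would need tuning per microphone.

```javascript
// Root-mean-square level of a block of samples in the range [-1, 1].
function rms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

// True when the clip is quiet enough that transcription is not worth running.
// Assumption: 0.01 is a hand-picked threshold; tune it per microphone.
function isSilent(samples, threshold = 0.01) {
  return rms(samples) < threshold;
}

const silence = new Float32Array(16000);            // one second of zeros
const tone = Float32Array.from({ length: 16000 },   // one second of a 440 Hz tone
  (_, i) => 0.5 * Math.sin((2 * Math.PI * 440 * i) / 16000));

console.log(isSilent(silence)); // true
console.log(isSilent(tone));    // false
```

In a dictation app, a check like this would run before the audio is handed to the model, so an accidental tap on the mic button doesn't produce a phantom word.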

Wrapping it in a real dictation app

A command-line transcriber isn't dictation yet. To make it feel like Apple's dictation, I wrapped whisper.cpp in a thin Electron shell. The architecture is dead simple — capture, transcribe, insert:

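// renderer: capture the mic. MediaRecorder emits compressed audio (usually WebM/Opus),
// so the recording still has to be converted to 16 kHz mono WAV before Whisper sees it
const chunks = [];
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const rec = new MediaRecorder(stream);
rec.ondataavailable = e => chunks.push(e.data);
rec.onstop = () => {
  const blob = new Blob(chunks, { type: rec.mimeType });   // hand the blob to the main process
};
rec.start();

// main: run the binary, read the text back. zero network calls.
const { execFile } = require("child_process");
// -nt drops timestamps so stdout is plain text (-otxt would write a .txt file instead)
execFile("./main", ["-m", model, "-f", wavPath, "-nt"], (err, stdout) => {
  if (err) return console.error("whisper.cpp failed:", err);
  win.webContents.send("transcript", stdout.trim());   // paste wherever the cursor is
});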

That's it. Mic → whisper.cpp → text, and the audio never leaves the device.
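One piece the snippet above glosses over is producing the WAV file itself. Here is a minimal sketch of a 16-bit PCM WAV encoder, assuming you already have mono Float32 samples at 16 kHz (for example from decoding the recording with the Web Audio API). The function name is made up for illustration, and this is not necessarily how the project's own code does it.

```javascript
// Encode mono Float32 samples in [-1, 1] as a 16-bit PCM WAV file.
function encodeWav(samples, sampleRate = 16000) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);    // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);              // fmt chunk size for PCM
  view.setUint16(20, 1, true);               // audio format 1 = PCM
  view.setUint16(22, 1, true);               // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);  // block align
  view.setUint16(34, 16, true);              // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(44 + i * 2, Math.round(scaled), true);
  }
  return new Uint8Array(buffer);
}

const wav = encodeWav(Float32Array.from([0, 1, -1]));
console.log(wav.length); // 50
```

The returned bytes can be written to disk in the main process and passed to the binary as the -f argument.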

Try it without installing anything

The full project needs the build step above, so for the live page I added a zero-install taste using the browser's built-in Web Speech API: press the mic and watch your words appear in real time (grey while it's still listening, black once final). It's not Whisper, and it isn't necessarily private either: some browsers, Chrome included, send Web Speech audio to a server for recognition. But the shape is identical: capture → transcribe → insert. It's the fastest way to feel what this style of dictation is like before you build the real on-device thing.
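For the curious, the grey-then-black behaviour comes from the API reporting interim and final results separately. The sketch below is not the demo page's actual code. It assumes the standard SpeechRecognition interface (prefixed as webkitSpeechRecognition in some browsers), and the helper names are made up for illustration.

```javascript
// Split recognition results into finalized text and still-changing interim text.
// `results` mirrors the shape of SpeechRecognitionEvent.results.
function splitTranscript(results) {
  let finalText = "";
  let interimText = "";
  for (let i = 0; i < results.length; i++) {
    const transcript = results[i][0].transcript;
    if (results[i].isFinal) finalText += transcript;
    else interimText += transcript;
  }
  return { finalText, interimText };
}

// Browser-only wiring; returns null where the API is missing (Node, some browsers).
function startDictation(onUpdate) {
  const Recognition =
    typeof window !== "undefined" &&
    (window.SpeechRecognition || window.webkitSpeechRecognition);
  if (!Recognition) return null;

  const recognition = new Recognition();
  recognition.continuous = true;      // keep listening across pauses
  recognition.interimResults = true;  // emit partial guesses as well as finals
  recognition.onresult = event => onUpdate(splitTranscript(event.results));
  recognition.start();
  return recognition;
}
```

A page would call startDictation with a callback that renders finalText in black and interimText in grey.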

Why this matters

We've gotten used to "AI feature = send my data to a server." Whisper is a reminder that a genuinely good model can run on the computer you already own. Private, offline, free, unlimited. Once you've built dictation this way, cloud STT starts to feel like an odd default rather than a necessity.

👉 Try the live demo + read the full walkthrough: https://dev48v.infy.uk/tech/day47-whisper.html

💻 Code: https://github.com/dev48v/whisper-from-zero

🌐 All 47 days: https://dev48v.infy.uk/techfromzero.php

This is Day 47 of a 50-day series — a new technology every day, built from scratch. Day 48 lands next.