Filipi Youssef
I built a pay-per-use video transcription tool with Next.js and Whisper — here's the full breakdown

Why I built this

I kept running into the same frustration: I had a video file and I needed the text inside it.

The available options were either too manual (typing it yourself), too unreliable (YouTube auto-captions), or too expensive for occasional use (most transcription SaaS tools charge a flat monthly fee regardless of how much you actually transcribe).

So I built Tonivox — a web app that accepts a video file, extracts the audio, runs it through a transcription model, and returns the full text. Pay per transcription, no subscription.

This post covers the technical decisions, the problems I ran into, and what I'd do differently.


The stack

  • Next.js 15 (App Router) — frontend and API routes
  • Prisma + PostgreSQL — data layer
  • Better Auth — authentication (email/password + email verification)
  • Stripe — credit purchases via Checkout Sessions
  • OpenAI Whisper — transcription model
  • FFmpeg — audio extraction from video files
  • Tailwind CSS — styling

How it works

The flow is straightforward:

  1. User uploads a video file (MP4, MOV, AVI, WebM — up to 60 min / 25 MB)
  2. The server extracts the audio track using FFmpeg
  3. The audio is sent to Whisper via the OpenAI API
  4. The transcription is returned and stored
  5. One credit is deducted atomically from the user's balance
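Step 1's limits can be enforced before any processing starts. Here is a minimal validation sketch for the upload constraints above (the function and type names are illustrative, not Tonivox's actual code):

```typescript
// Limits from the flow above: MP4/MOV/AVI/WebM, up to 25 MB and 60 minutes.
const ALLOWED_EXTENSIONS = new Set(["mp4", "mov", "avi", "webm"]);
const MAX_BYTES = 25 * 1024 * 1024;   // 25 MB
const MAX_DURATION_SEC = 60 * 60;     // 60 minutes

interface UploadCheck {
  ok: boolean;
  reason?: string;
}

function validateUpload(
  filename: string,
  sizeBytes: number,
  durationSec: number,
): UploadCheck {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  if (!ALLOWED_EXTENSIONS.has(ext)) {
    return { ok: false, reason: `unsupported format: .${ext}` };
  }
  if (sizeBytes > MAX_BYTES) {
    return { ok: false, reason: "file exceeds 25 MB" };
  }
  if (durationSec > MAX_DURATION_SEC) {
    return { ok: false, reason: "video longer than 60 minutes" };
  }
  return { ok: true };
}
```

Rejecting bad uploads at this stage is cheap; rejecting them after FFmpeg has already run is not.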

The credit model: users buy credits upfront (three tiers, $0.99 to $6.99). Each transcription costs one credit. No monthly commitment.
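As a sketch, the pricing can live in a small config table. The $0.99 and $6.99 endpoints come from above; the middle tier and all credit counts here are placeholders, not the real numbers:

```typescript
interface Tier {
  priceUsd: number;
  credits: number; // each credit = one transcription
}

// Hypothetical tiers — only the price endpoints match the post.
const TIERS: Record<string, Tier> = {
  starter: { priceUsd: 0.99, credits: 1 },  // placeholder count
  plus:    { priceUsd: 2.99, credits: 5 },  // placeholder tier
  bulk:    { priceUsd: 6.99, credits: 15 }, // placeholder count
};

// Effective price per transcription for a given tier.
function pricePerCredit(tierId: string): number {
  const t = TIERS[tierId];
  return t.priceUsd / t.credits;
}
```

Keeping tiers in one table makes it trivial to map a Stripe Checkout Session back to a credit amount via the session's metadata.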


The problems

FFmpeg on serverless

Running FFmpeg in a serverless environment is not as simple as npm install ffmpeg. You need a static binary compatible with the runtime's OS. I ended up using @ffmpeg-installer/ffmpeg and validating that the binary path was actually accessible at runtime — something that failed silently in a few deployment configurations before I caught it.
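The runtime check can be as simple as verifying the binary exists and is executable before accepting work. A sketch (the `@ffmpeg-installer/ffmpeg` usage in the comment is the real package API; the helper names are mine):

```typescript
import { existsSync, accessSync, constants } from "node:fs";

// Returns true only if the binary exists and is executable in this runtime.
// A missing or non-executable binary is exactly the silent failure mode
// described above.
function binaryIsUsable(binPath: string): boolean {
  if (!existsSync(binPath)) return false;
  try {
    accessSync(binPath, constants.X_OK);
    return true;
  } catch {
    return false;
  }
}

// In the real app the path comes from the installer package:
//   import ffmpegInstaller from "@ffmpeg-installer/ffmpeg";
//   const ffmpegPath = ffmpegInstaller.path;
// Failing fast at startup beats failing silently mid-request:
function assertFfmpeg(binPath: string): void {
  if (!binaryIsUsable(binPath)) {
    throw new Error(`ffmpeg binary not accessible at ${binPath}`);
  }
}
```

Running this once at cold start turns "transcriptions mysteriously return empty" into a clear deploy-time error.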

File size and timeouts

Serverless functions have timeout limits. A 60-minute video file takes time to process, and the default limits on most platforms are not built for that. I moved the heavy processing (audio extraction + Whisper API call) to a dedicated background worker. The API route just enqueues the job and the client polls for status.

This also meant I had to handle the "queued" and "processing" states in the UI, which added complexity but made the experience significantly more reliable.
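The client side of the enqueue-and-poll pattern can be sketched like this. `fetchStatus` stands in for a GET to a status endpoint such as `/api/jobs/:id` (the endpoint shape and names are assumptions, not the actual Tonivox API):

```typescript
type JobStatus = "queued" | "processing" | "done" | "failed";

interface JobSnapshot {
  status: JobStatus;
  text?: string; // transcription, present once status is "done"
}

// Polls until the job leaves the queued/processing states,
// or gives up after maxAttempts.
async function pollUntilDone(
  fetchStatus: () => Promise<JobSnapshot>,
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<JobSnapshot> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const snap = await fetchStatus();
    if (snap.status === "done" || snap.status === "failed") return snap;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("polling timed out");
}
```

The "queued" and "processing" states map directly onto UI states, which is where the extra complexity mentioned above comes from.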

Atomic credit deduction

I wanted to make sure credits were never deducted if the transcription failed, and never skipped if it succeeded. I handled this with a database transaction that wraps the Whisper API call — if the call throws, the transaction rolls back and the credit is preserved.
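In the real app this would be a Prisma interactive transaction (`prisma.$transaction(async (tx) => { ... })`). The rollback semantics can be simulated in memory to show the invariant — this is an illustrative sketch, not the production code:

```typescript
interface Account {
  credits: number;
}

// Invariant: the credit is committed only if the transcription succeeds.
// If `transcribe` throws, the balance is untouched — mirroring a DB
// transaction rolling back.
async function transcribeWithCredit(
  account: Account,
  transcribe: () => Promise<string>,
): Promise<string> {
  if (account.credits < 1) throw new Error("insufficient credits");
  const tentative = account.credits - 1; // "begin transaction"
  const text = await transcribe();       // throws => deduction never commits
  account.credits = tentative;           // "commit" only after success
  return text;
}
```

The key property is that the deduction and the successful result are a single atomic outcome: either both happen or neither does.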

Portuguese accuracy

Most transcription tools are tuned for English. Whisper handles Portuguese well out of the box, but the language detection needs to be explicit — if you don't specify the language, it can misidentify short clips and produce garbage output. I pass language: "pt" or language: "en" based on a user preference stored at account level.
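Resolving the preference into an explicit language code is a one-liner; defaulting to "en" when no preference is stored is my assumption, not necessarily Tonivox's rule:

```typescript
type SupportedLanguage = "pt" | "en";

// Maps the account-level preference to an explicit Whisper language code.
// Never letting Whisper auto-detect avoids the short-clip misidentification
// problem described above.
function resolveLanguage(
  preference: string | null | undefined,
): SupportedLanguage {
  return preference === "pt" ? "pt" : "en"; // default "en" is an assumption
}

// The resolved code is then passed explicitly on the transcription request,
// e.g. with the official openai Node SDK (illustrative):
//   const result = await openai.audio.transcriptions.create({
//     file: audioStream,
//     model: "whisper-1",
//     language: resolveLanguage(user.languagePreference),
//   });
```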


What I deliberately left out

I made a list of features I wanted but decided not to ship in v1:

  • Speaker diarization (who said what)
  • Word-level timestamps
  • SRT/VTT subtitle export
  • Audio-only file support (MP3, WAV)
  • Live recording

Each of these is buildable. I left them out because I wanted to test whether the core loop — upload a video, get text — was something people actually paid for before investing more time.


What I'd do differently

Start with the worker architecture from day one. I initially built the transcription as a synchronous API call. Refactoring to async/poll later was more work than just building it right the first time would have been.

Set up error observability earlier. For the first few weeks I was flying blind on server-side failures. Adding structured logging and an alerting channel earlier would have saved me from discovering bugs through confused users instead of dashboards.

Spend less time on the landing page, more time talking to potential users. I polished the UI for longer than I should have before showing it to anyone. The feedback I got in the first 48 hours after launch pointed to things I never would have caught by looking at my own code.


Where it stands now

The app is live at tonivox.com. It supports English and Portuguese. The codebase is a few thousand lines, built and maintained solo.

If you're working on something similar — pay-per-use SaaS, file processing on serverless, or Whisper integration — happy to answer questions in the comments.

And if you work with video or audio professionally and want to try it, I'm actively looking for feedback on what's missing or broken.
