Last year I transcribed forty hours of developer interviews by hand because I didn't trust the AI tools. My wrists hurt. I missed a deadline. I still botched a quote. One participant said they hated Docker. I typed "loved Docker." That single error skewed my feature priority matrix for a week.
Now I use a workflow that is boring, repeatable, and won't let garbage audio wreck your dataset. It goes like this.
1. Record audio that won't wreck your accuracy
I learned this in a glass-walled conference room with a laptop mic. The echo was so bad that "Git" became "get" for two straight hours. I had to guess context on twelve different lines. Never again.
Use a directional mic or a decent USB interface. Laptop mics grab keyboard clatter and fan hum. Record in a small, carpeted room if you can. Hard surfaces bounce sound and confuse speech engines.
For remote sessions, make participants wear headphones. It stops their speakers from bleeding into your recording. Ask people not to talk over each other. If you need speaker labels later, have them introduce themselves at the top.
If you are pulling audio from a Zoom recording, normalize the format first. This is a format conversion, not a loudness fix. ffmpeg handles it in one pass:
ffmpeg -i interview.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 interview.wav
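Mono WAV at 16 kHz. Most speech engines prefer it, and Whisper works on 16 kHz audio internally anyway. Mono removes stereo separation weirdness, and a 16 kHz sample rate captures frequencies up to 8 kHz, which covers most speech energy without bloating your file.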
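If you have a whole folder of recordings, a small Python wrapper can run the same conversion on each file. This is a minimal sketch: the function names, the `.mp4` glob, and the output folder layout are assumptions for illustration, and it expects ffmpeg to be on your PATH.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: Path, out_dir: Path) -> list[str]:
    """Return the ffmpeg argument list that converts `src` to 16 kHz mono WAV."""
    dst = out_dir / (src.stem + ".wav")
    return [
        "ffmpeg",
        "-i", str(src),
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        str(dst),
    ]


def convert_all(in_dir: Path, out_dir: Path) -> None:
    """Convert every .mp4 in `in_dir` (hypothetical layout); stops on the first ffmpeg error."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(in_dir.glob("*.mp4")):
        subprocess.run(build_ffmpeg_cmd(src, out_dir), check=True)
```

Calling `convert_all(Path("recordings"), Path("wav"))` would write one WAV per recording. Building the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.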
2. Generate the draft automatically
I upload the file to whatever engine I'm using that month. Lately it is Whisper.cpp running locally because I got paranoid about participant data hitting cloud APIs. Last year I burned through Otter credits. The service matters less than the settings.
I pick verbatim mode when I am looking for hesitation or power dynamics. It keeps the ums, false starts, and pauses. I pick clean mode for thematic analysis or when I am handing quotes to a PM who just wants the point, not the verbal stumbles.
If the tool offers auto-detect for language, verify it. I once ran a mixed English-German session and the engine tagged the whole file as Dutch. The gibberish propagated for pages before I noticed.
Do not treat the raw file as final. Automated transcripts hit maybe 85-95% accuracy in ideal conditions. Accents, jargon, and crosstalk drop that number fast.
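One way to put a number on "accuracy" for your own recordings is to hand-correct a short sample and compute the word error rate against the raw draft. The sketch below is a plain word-level edit distance; the function name is made up for illustration, and it ignores case but not punctuation, so strip punctuation first if that matters to you.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the draft
                curr[j - 1] + 1,     # extra word in the draft
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("we hated docker", "we loved docker")` counts one substitution in three words. A low error rate still does not tell you which words are wrong, which is why the review step matters.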
3. Review like you are being audited
This is the step I used to skip. It cost me.
Open the draft next to the audio. Play it at 1.0x or 1.25x. Fix these exact things:
- Misheard domain terms. "React" becomes "reactant." "Kubernetes" turns into phonetic mush. These look tiny but destroy coding accuracy.
- Speaker labels. Auto-tools merge speakers during crosstalk. Label each turn yourself.
- Crosstalk and drops. When two people talk at once, the transcript may mash both voices into nonsense. Mark gaps with [inaudible] so you do not code silence as agreement.
- Punctuation for meaning. A missing comma can flip enthusiasm into sarcasm.
- Nonverbal cues only if they matter. I mark laughter or long pauses with a standard notation. I skip them if I am just hunting for feature requests.
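For the misheard-term check, a rough pre-pass can point you at words that look like a glossary term but are not spelled like one. This is a sketch that assumes you keep a project glossary as a Python list; the function name is hypothetical, and it uses the standard library's `difflib.get_close_matches` for fuzzy matching. It only suggests places to listen again and will not catch every error.

```python
import difflib
import re


def flag_suspect_terms(
    text: str, glossary: list[str], cutoff: float = 0.6
) -> list[tuple[str, str]]:
    """Return (word, glossary_term) pairs where a word resembles a term
    but is not an exact case-insensitive match."""
    by_lower = {term.lower(): term for term in glossary}
    flagged = []
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower in by_lower:
            continue  # already spelled like the glossary term
        matches = difflib.get_close_matches(lower, list(by_lower), n=1, cutoff=cutoff)
        if matches:
            flagged.append((word, by_lower[matches[0]]))
    return flagged
```

Running it on `"We use reactant and get for deploys"` with the glossary `["React", "Git"]` flags "reactant" and "get". Short terms like "Git" produce plenty of false positives, so treat the output as a list of timestamps to re-listen to, not as automatic corrections.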
Here is the template I paste into my editor:
[00:03:15] Interviewer: Walk me through how you deploy to production.
[00:03:18] Participant: Usually we just run the script, wait for the build, and then... actually, sometimes we check the logs first.
[00:03:24] [pause 3s]
[00:03:27] Participant: If it's a Friday, we don't deploy at all.
[00:03:30] [laughter]
HH:MM:SS timestamps and bracketed tags. I keep lines under 100 characters where I can so they import cleanly into qualitative tools.
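A quick script can confirm that a file follows this template before you import it. The sketch below assumes every line is either `[HH:MM:SS] Speaker: text` or `[HH:MM:SS] [tag]`, as in the example above; the regex and function name are illustrative, so adjust them if your notation differs.

```python
import re

LINE_RE = re.compile(
    r"^\[(\d{2}):(\d{2}):(\d{2})\] "            # [HH:MM:SS]
    r"(?:(?P<speaker>[^:\[\]]+): (?P<text>.+)"  # Speaker: utterance
    r"|\[(?P<tag>[^\]]+)\])$"                   # or a bracketed tag like [laughter]
)


def check_transcript(lines: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file passed."""
    problems = []
    last_seconds = -1
    for number, line in enumerate(lines, start=1):
        match = LINE_RE.match(line)
        if not match:
            problems.append(f"line {number}: does not match the template")
            continue
        hours, minutes, seconds = (int(part) for part in match.group(1, 2, 3))
        total = hours * 3600 + minutes * 60 + seconds
        if total < last_seconds:
            problems.append(f"line {number}: timestamp goes backwards")
        last_seconds = total
    return problems
```

It only checks the line shape and that timestamps never go backwards. It does not verify that the text matches the audio.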
4. Format for your analysis stack
I've imported transcripts into NVivo, Dovetail, Atlas.ti, and plain Git repos. Consistency matters more than aesthetics. NVivo chokes on inconsistent timestamps. Dovetail gets weird if your speaker labels change format between files.
Standardize these before you call it done:
- Speaker names. Pick "Interviewer / Participant" or "P1 / P2" and stick with it across every file.
- Paragraph breaks. Start a new paragraph when the topic shifts, not just when someone stops talking.
- Timestamps. Drop them every 30-60 seconds, or at every speaker turn if your tool requires it.
If your team works async across time zones, timestamps are the only way another researcher can pull the original clip for context.
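To show how a timestamp leads back to the clip, here is a small sketch that builds an ffmpeg command to cut a short excerpt starting at a transcript timestamp. The function name, file names, and 30-second default are assumptions; `-ss`, `-t`, and `-c copy` are standard ffmpeg options.

```python
def clip_cmd(audio: str, timestamp: str, out: str, seconds: int = 30) -> list[str]:
    """Build an ffmpeg command that copies `seconds` of audio starting at `timestamp`.

    `timestamp` may be written as "[00:03:15]" or "00:03:15".
    """
    start = timestamp.strip("[]")
    return [
        "ffmpeg",
        "-ss", start,        # seek to the transcript timestamp
        "-i", audio,
        "-t", str(seconds),  # clip length in seconds
        "-c", "copy",        # copy the audio stream without re-encoding
        out,
    ]
```

Pass the result to `subprocess.run(..., check=True)` to produce the clip. The timestamps only line up if the transcript was made from the same audio file you are cutting.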
When I store transcripts in Git for team review, I add YAML frontmatter so we can search later:
---
project: onboarding-research
session_id: 2024-06-12_p5
participant_id: P5
method: semi-structured-interview
transcript_type: clean
duration_minutes: 42
---
This turns a folder of text files into something you can actually query.
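As a sketch of what "query" can mean here, the code below reads flat `key: value` frontmatter and filters a folder by it. It is deliberately not a full YAML parser: it assumes one level of keys like the example above, treats every value as a string, and assumes transcripts are stored as `.md` files. Both function names are made up for illustration.

```python
from pathlib import Path


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the opening and closing `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter, so treat the block as missing


def find_transcripts(folder: Path, **wanted: str) -> list[Path]:
    """Return transcript files whose frontmatter matches every given key/value."""
    hits = []
    for path in sorted(folder.glob("*.md")):
        meta = read_frontmatter(path.read_text(encoding="utf-8"))
        if all(meta.get(key) == value for key, value in wanted.items()):
            hits.append(path)
    return hits
```

For example, `find_transcripts(Path("transcripts"), project="onboarding-research", transcript_type="clean")` would list the clean transcripts for that project. If your frontmatter grows nested keys or lists, switch to a real YAML library.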
5. Export and version your files
Save two copies. Every time.
- Raw automated output. This is your audit trail.
- Corrected transcript with final labels and formatting.
Export depends on your pipeline:
- TXT or MD for coding platforms.
- DOCX if your team lives in Word comments.
- JSON if you are feeding them into a custom NLP pipeline.
Keep both. Six months later, when a stakeholder asks if that brutal quote is real, you need to trace it back to the source audio without starting from zero.
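For the JSON export, one possible shape is a list of records with a timestamp, speaker, and text per turn. The field names below are an assumption, not a format any particular tool requires, and the sketch expects lines in the `[HH:MM:SS] Speaker: text` form used earlier.

```python
import json
import re

TURN_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] ([^:\[\]]+): (.+)$")


def transcript_to_json(lines: list[str]) -> str:
    """Convert speaker turns to a JSON array; tag-only lines such as [laughter] are skipped."""
    records = []
    for line in lines:
        match = TURN_RE.match(line)
        if match:
            timestamp, speaker, text = match.groups()
            records.append({"timestamp": timestamp, "speaker": speaker, "text": text})
    return json.dumps(records, indent=2, ensure_ascii=False)
```

Generate the JSON from the corrected transcript, not the raw output, and keep the text files as the versions people review.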
Verbatim vs. clean: pick one and lock it in
I once switched formats mid-project because I got lazy. I had to re-review every file. Do not be me.
Use verbatim when:
- You are studying language patterns, pauses, or interaction dynamics.
- Your methodology is discourse or conversation analysis.
Use clean when:
- You are hunting for themes, pain points, or feature requests.
- A PM or executive will read it and only cares about the content, not the delivery.
Most of my UX work uses clean transcripts. Academic work usually needs verbatim. You can generate both. Keep the verbatim file as your source of truth, then derive a cleaned copy for reporting.
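Deriving the clean copy can be partly scripted. The sketch below assumes the timestamped template from the review step, drops tag-only lines such as pauses and laughter, and removes a short list of fillers. The filler list and function name are illustrative, and the result still needs a human read: it will not repair capitalization or false starts.

```python
import re

FILLER_RE = re.compile(r"\b(?:um+|uh+)\b,?\s*", re.IGNORECASE)
TAG_ONLY_RE = re.compile(r"^\[\d{2}:\d{2}:\d{2}\] \[[^\]]+\]$")


def derive_clean(verbatim_lines: list[str]) -> list[str]:
    """Return a cleaned copy of the lines; the verbatim input is left untouched."""
    clean = []
    for line in verbatim_lines:
        if TAG_ONLY_RE.match(line):
            continue  # drop [pause 3s], [laughter], and similar tag-only lines
        line = FILLER_RE.sub("", line)
        clean.append(re.sub(r" {2,}", " ", line).rstrip())
    return clean
```

Because the function returns a new list, the verbatim file stays the source of truth and the clean copy can be regenerated whenever the filler list changes.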
Closing the loop
Good transcription is a quality gate for your whole project. A transcript with mislabeled speakers or missing context will send your coding sideways. You will not notice until you are writing findings at 11 PM and questioning your sanity.
Record clean audio. Generate a draft. Review it line by line with the original playing. Format it consistently. Lock in your raw and edited versions.
Your future self, staring at fifty coded segments at 11 PM, will thank you.