Alex Neamtu

How We Built Transcript-Powered Video Editing in Go

You record a five-minute product walkthrough. You say "um" eleven times. The first thirty seconds are you fumbling with screen share. The title reads "Recording 2/20/2026 3:45:12 PM."

Fixing any of this used to mean dragging tiny handles on a timeline, scrubbing back and forth to find the right millisecond. If you wanted to remove filler words, you'd have to find each one manually and trim them out one at a time.

We already had word-level transcripts from whisper.cpp. The data was sitting right there — timestamps, text, segment boundaries. We just weren't using it for editing.

SendRec v1.45.0 adds three features that change that: trim by transcript, filler word removal, and AI-generated title suggestions. All three build on the same transcript data.

Trim by transcript

The existing trim tool works with draggable handles on a time bar. That's fine for rough cuts, but hard to use when you need to cut at the exact moment a sentence starts.

The trim modal now shows the video's transcript segments below the timeline. Two mode buttons — "Set Start" and "Set End" — control what happens when you click a segment.

Click a segment in "Set Start" mode and the trim start jumps to that segment's timestamp. The mode automatically switches to "Set End." Click another segment and you've defined your range. Segments inside the range highlight green. Segments outside dim to 40% opacity.

The existing drag handles still work. The transcript panel is an addition, not a replacement. And since the trim endpoint hasn't changed, this was a frontend-only enhancement — no backend work needed.

If the video doesn't have a transcript yet (still processing, or transcription failed), the modal shows the original timeline-only UI.

Filler word removal

This is the feature that saves the most time. Instead of finding and trimming each "um" individually, you preview all of them at once and remove them in a single operation.

Click "Remove fillers" in the Library overflow menu and a modal scans the transcript for filler-only segments. The detection uses a simple regex:

/^\s*(?:um+|uh+|uhh+|hmm+|ah+|er+|you know|like|so|basically|actually|right|i mean)[.,!?\s]*$/i

The key word is "filler-only." A segment that says "um" gets flagged. A segment that says "um, so the next thing we need to discuss" does not. We only flag segments where the entire text is a filler word — no risk of cutting real content.
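The check ports directly to Go if you ever want it server-side. Here's a sketch mirroring the JavaScript regex above; the function name is mine, not SendRec's actual API:

```go
package main

import (
	"fmt"
	"regexp"
)

// fillerOnly matches segments whose ENTIRE text is a filler word,
// optionally followed by punctuation or whitespace. Go's regexp
// uses RE2 syntax; (?i) makes it case-insensitive.
var fillerOnly = regexp.MustCompile(
	`(?i)^\s*(?:um+|uh+|uhh+|hmm+|ah+|er+|you know|like|so|basically|actually|right|i mean)[.,!?\s]*$`)

// IsFillerOnly reports whether a transcript segment contains nothing
// but a filler word, so it is safe to flag for removal.
func IsFillerOnly(text string) bool {
	return fillerOnly.MatchString(text)
}

func main() {
	fmt.Println(IsFillerOnly("Um,"))                                      // true
	fmt.Println(IsFillerOnly("um, so the next thing we need to discuss")) // false
}
```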

The modal shows every detected filler with a checkbox, its timestamp, and the text. All are checked by default. A summary line shows the count and total time you'd save: "Found 11 filler words (4.2s total)." Uncheck any you want to keep, then click Remove.

The backend: one ffmpeg pass

The removal endpoint accepts a list of time ranges to cut:

POST /api/videos/{id}/remove-segments

{
  "segments": [
    {"start": 3.2, "end": 3.8},
    {"start": 12.1, "end": 12.9},
    {"start": 25.0, "end": 25.6}
  ]
}

Validation rejects overlapping segments, unsorted segments, negative timestamps, segments beyond the video duration, more than 200 segments per request, and any cut that would leave less than one second of video.
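In Go, those checks reduce to a single pass over the sorted list. This is a sketch of the validation logic as described, not the exact production handler (struct and function names are mine):

```go
package main

import (
	"fmt"
	"sort"
)

// Segment is one time range to cut, in seconds.
type Segment struct {
	Start, End float64
}

const maxSegments = 200

// ValidateSegments enforces the rules described above: sorted,
// non-overlapping, non-negative, within duration, at most 200
// segments, and at least one second of video left after the cut.
func ValidateSegments(segs []Segment, duration float64) error {
	if len(segs) > maxSegments {
		return fmt.Errorf("too many segments: %d > %d", len(segs), maxSegments)
	}
	if !sort.SliceIsSorted(segs, func(i, j int) bool { return segs[i].Start < segs[j].Start }) {
		return fmt.Errorf("segments must be sorted by start time")
	}
	removed := 0.0
	for i, s := range segs {
		if s.Start < 0 || s.End <= s.Start {
			return fmt.Errorf("segment %d has an invalid range", i)
		}
		if s.End > duration {
			return fmt.Errorf("segment %d extends past the video duration", i)
		}
		if i > 0 && s.Start < segs[i-1].End {
			return fmt.Errorf("segment %d overlaps the previous one", i)
		}
		removed += s.End - s.Start
	}
	if duration-removed < 1.0 {
		return fmt.Errorf("cut would leave less than one second of video")
	}
	return nil
}

func main() {
	segs := []Segment{{3.2, 3.8}, {12.1, 12.9}, {25.0, 25.6}}
	fmt.Println(ValidateSegments(segs, 300)) // <nil>
}
```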

The actual removal uses ffmpeg's select and aselect filters to keep only the frames NOT in the removed ranges:

ffmpeg -i input -filter_complex
  "[0:v]select='not(between(t,3.2,3.8)+between(t,12.1,12.9)+between(t,25.0,25.6))',setpts=N/FRAME_RATE/TB[v];
   [0:a]aselect='not(between(t,3.2,3.8)+between(t,12.1,12.9)+between(t,25.0,25.6))',asetpts=N/SR/TB[a]"
  -map "[v]" -map "[a]" output

This handles any number of segments in a single pass. The setpts and asetpts filters reset the timestamps so the output plays continuously without gaps.

After processing, the video gets a new thumbnail and goes back through transcription — the old transcript's timestamps no longer match.

Why not word-level timestamps?

Whisper segments are phrase-level, not word-level. A segment might be "um, so basically" — three words, one timestamp range. We can't cut just the "um" and keep "so basically" because we don't know exactly when each word starts and ends within the segment.

This means some fillers get missed: the ones embedded in longer phrases. That's a deliberate tradeoff. We flag the segments we're certain about and leave the rest alone. No false positives.

AI title suggestions

After transcription completes, the summary worker checks whether the video still has an auto-generated title — patterns like "Recording 2/20/2026 3:45:12 PM" or "Untitled Recording." If it does, it sends the first portion of the transcript to the AI client and asks for a concise title.
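The "is this still an auto-generated title?" check is a pattern match. Something like this sketch, where the exact patterns are illustrative rather than SendRec's real list:

```go
package main

import (
	"fmt"
	"regexp"
)

// autoTitle matches default recorder titles such as
// "Recording 2/20/2026 3:45:12 PM" or "Untitled Recording".
// The patterns here are examples, not the production set.
var autoTitle = regexp.MustCompile(
	`^(Recording \d{1,2}/\d{1,2}/\d{4}.*|Untitled Recording)$`)

// HasAutoTitle reports whether a video still carries a default title
// and is therefore a candidate for an AI suggestion.
func HasAutoTitle(title string) bool {
	return autoTitle.MatchString(title)
}

func main() {
	fmt.Println(HasAutoTitle("Recording 2/20/2026 3:45:12 PM")) // true
	fmt.Println(HasAutoTitle("Q1 roadmap walkthrough"))         // false
}
```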

The prompt is simple:

Given this video transcript, generate a concise title (3-8 words)
that captures the main topic. Return ONLY the title text, no quotes,
no explanation. Write in the same language as the transcript.

The suggestion is stored in a suggested_title column. It doesn't overwrite anything — the original title stays until the user decides.

In the Library, videos with a suggested title show a small indicator below the title text. Click it to see the suggestion with Accept and Dismiss buttons. Accept updates the title. Dismiss clears the suggestion. Either way, it's a one-click decision.
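The accept/dismiss semantics fit in a few lines. Here's a toy model of the state transition (the `Video` type and method names are mine; the real code does this as a database update against the `suggested_title` column):

```go
package main

import "fmt"

// Video models just the two title fields involved in the suggestion flow.
type Video struct {
	Title          string
	SuggestedTitle string // empty means no pending suggestion
}

// Accept promotes the suggestion into the title and clears it.
func (v *Video) Accept() {
	if v.SuggestedTitle != "" {
		v.Title = v.SuggestedTitle
		v.SuggestedTitle = ""
	}
}

// Dismiss discards the suggestion; the original title is untouched.
func (v *Video) Dismiss() {
	v.SuggestedTitle = ""
}

func main() {
	v := Video{Title: "Recording 2/20/2026 3:45:12 PM", SuggestedTitle: "Product walkthrough demo"}
	v.Accept()
	fmt.Println(v.Title) // Product walkthrough demo
}
```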

The title generation piggybacks on the existing summary worker. No separate queue, no extra infrastructure. If the AI service is down, no suggestion is stored — the video just keeps its original title.

What we didn't build

  • Word-level transcript timestamps (whisper segments are phrase-level)
  • Real-time filler detection during recording
  • Automatic filler removal without user confirmation
  • Batch filler removal across multiple videos

Each of these has tradeoffs that would add complexity without clear benefit yet. Word-level timestamps would improve filler detection but require a different whisper model configuration. Auto-removal without preview would be faster but risky — "like" and "so" are sometimes real words, not fillers.

Try it

Transcript-powered editing is live at app.sendrec.eu in v1.45.0. Self-hosters get all three features on upgrade — the migration runs automatically on startup. AI title suggestions require the AI_ENABLED environment variable and a compatible LLM endpoint.

If you're self-hosting SendRec, check out the self-hosting guide.
