You record a 10-minute product demo. Someone on your team gets the link a week later. They want to know: what's in this video? Is this the one about the new API, or the one about the dashboard redesign?
Transcripts help — you can skim the text. But skimming a wall of timestamped segments isn't much faster than watching the video. What you really want is a summary at the top and chapter markers that let you jump to the part you care about.
SendRec already transcribes every video with whisper.cpp. We had the transcript sitting in the database as structured JSON segments with timestamps. All we needed was something to read that transcript and produce a summary and chapter list.
Why an LLM and not heuristics
The straightforward approach would be to split the transcript into chunks and use the first sentence of each chunk as a chapter title. That works for scripted content with clear section breaks. It falls apart for real recordings — people repeat themselves, go on tangents, circle back to earlier points.
An LLM can understand the semantic structure of a conversation. It knows that the section about "deployment" started at 3:42 even though the speaker said "anyway, let me show you how we push this to production" — not the word "deployment" at all.
We use the OpenAI-compatible chat completions API. The default provider is Mistral, but any compatible endpoint works — OpenAI, Ollama, anything that speaks /v1/chat/completions.
The prompt
The system prompt asks for structured JSON with two fields:
You are a video content analyzer. Given a timestamped transcript, produce a JSON object with:
- "summary": A 2-3 sentence overview of what the video covers.
- "chapters": An array of objects with "title" (string, 3-6 words) and "start" (number, seconds)
marking major topic changes. Include 2-8 chapters depending on video length.
The first chapter should start at 0.
Write the summary and chapter titles in the same language as the transcript.
Return ONLY valid JSON, no markdown formatting.
The user message is the transcript, formatted as timestamped lines:
[00:00] Hi everyone, welcome to this walkthrough
[00:05] Today I'm going to show you the new dashboard
[00:12] Let's start with the navigation changes
...
The timestamps come directly from the whisper output. The LLM uses them to determine where each chapter starts, so the chapter markers align with the actual video timeline.
The language instruction matters. Whisper detects the spoken language automatically and produces transcripts in that language. Without the explicit instruction, the LLM tends to respond in English regardless of the transcript language. With it, a German transcript gets a German summary and German chapter titles.
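Given a transcript like the one above, a well-behaved response might look like this (illustrative values, not actual model output):

```json
{
  "summary": "A walkthrough of the new dashboard. The video opens with a quick intro, then covers the navigation changes.",
  "chapters": [
    { "title": "Intro and welcome", "start": 0 },
    { "title": "New dashboard overview", "start": 5 },
    { "title": "Navigation changes", "start": 12 }
  ]
}
```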
Cost guards
LLM API calls cost money. We add two guards before sending a transcript to the API.
Minimum segment count. Videos with fewer than 2 transcript segments are skipped. A 5-second screen recording of a button click doesn't need a summary.
Maximum transcript length. The formatted transcript is capped at 30,000 characters. At a typical speaking pace, that covers roughly 60 minutes of video. Longer transcripts are truncated — the summary covers what fits, and the chapters span the included portion. This keeps token costs bounded.
const maxTranscriptChars = 30000

func formatTranscriptForLLM(segments []TranscriptSegment) string {
	var result string
	for _, seg := range segments {
		minutes := int(seg.Start) / 60
		seconds := int(seg.Start) % 60
		line := fmt.Sprintf("[%02d:%02d] %s\n", minutes, seconds, seg.Text)
		// Stop before the line that would push us over the cap.
		if len(result)+len(line) > maxTranscriptChars {
			break
		}
		result += line
	}
	return result
}
The background worker
Summaries are generated by a polling worker, the same pattern we use for transcription. The worker runs on a 10-second ticker and claims one job at a time using FOR UPDATE SKIP LOCKED:
UPDATE videos SET summary_status = 'processing', summary_started_at = now()
WHERE id = (
    SELECT id FROM videos
    WHERE summary_status = 'pending' AND status != 'deleted'
    ORDER BY updated_at ASC LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, transcript_json
The atomic claim handles concurrent instances gracefully: if you scale to multiple app containers, they won't double-process the same video.
Stuck jobs (processing for more than 10 minutes) are reset to pending at the start of each tick. If the AI API times out or the container restarts mid-summary, the job retries automatically.
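The reset itself can be a single statement run at the top of each tick; roughly this (column names follow the claim query above):

```sql
UPDATE videos
SET summary_status = 'pending', summary_started_at = NULL
WHERE summary_status = 'processing'
  AND summary_started_at < now() - interval '10 minutes'
```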
The status flow mirrors transcription: none → pending → processing → ready or failed.
Triggering summaries
When AI is enabled, summaries are queued automatically after transcription completes. The transcription worker already updates the transcript status and stores the segments — we added one line to also set summary_status = 'pending' in the same query:
UPDATE videos
SET transcript_status = 'ready', transcript_json = $1, transcript_key = $2,
summary_status = 'pending'
WHERE id = $3
The summary worker picks it up on the next tick. From the user's perspective, the video finishes transcribing and the summary appears a few seconds later.
Users can also manually trigger summaries from the Library. The "Summarize" button resets the summary status to pending, and the worker picks it up. If a summary already exists, the button says "Re-summarize" — useful after re-transcribing a video or if the first summary wasn't quite right.
Parsing LLM output
LLMs don't always follow instructions precisely. The prompt says "return ONLY valid JSON, no markdown formatting." Most of the time, that works. Sometimes the response comes wrapped in markdown code fences.
We try parsing the raw response first. If that fails, we strip markdown fences and try again:
func parseSummaryJSON(content string) (*SummaryResult, error) {
	var result SummaryResult
	if err := json.Unmarshal([]byte(content), &result); err == nil {
		return &result, nil
	}
	stripped := stripMarkdownFences(content)
	if err := json.Unmarshal([]byte(stripped), &result); err != nil {
		return nil, fmt.Errorf("parse summary JSON: %w", err)
	}
	return &result, nil
}
This handles both cases without requiring the LLM to be perfectly compliant.
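The stripMarkdownFences helper isn't shown above; a minimal sketch, assuming the fences are the usual ``` or ```json wrappers, might look like this:

```go
package main

import (
	"fmt"
	"strings"
)

// stripMarkdownFences removes a leading ```lang line and a trailing
// ``` line, returning the inner content. A sketch; the real helper
// may handle more edge cases.
func stripMarkdownFences(s string) string {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, "```") {
		return s
	}
	// Drop the opening fence line (e.g. ``` or ```json).
	i := strings.Index(s, "\n")
	if i < 0 {
		return s
	}
	s = s[i+1:]
	// Drop the closing fence if present.
	s = strings.TrimSpace(s)
	s = strings.TrimSuffix(s, "```")
	return strings.TrimSpace(s)
}

func main() {
	wrapped := "```json\n{\"summary\": \"ok\"}\n```"
	fmt.Println(stripMarkdownFences(wrapped)) // prints {"summary": "ok"}
}
```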
The watch page
When both a transcript and a summary are ready, the watch page shows a tabbed panel with two tabs: Summary and Transcript. The Summary tab is shown by default.
The summary text is displayed at the top, followed by a chapter list. Each chapter is a clickable row with a formatted timestamp and title:
<div class="chapter-item" data-start="{{.Start}}">
  <span class="chapter-time">{{formatTimestamp .Start}}</span>
  <span class="chapter-title">{{.Title}}</span>
</div>
Clicking a chapter seeks the video to that timestamp — the same pattern we use for transcript segments. The Transcript tab shows the full clickable transcript that was already there.
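The formatTimestamp helper the template references can be registered as a template func; a sketch, rendering seconds in the same MM:SS style as the transcript lines:

```go
package main

import (
	"fmt"
	"html/template"
	"os"
)

// formatTimestamp renders a start time in seconds as MM:SS,
// matching the [MM:SS] style of the transcript. A sketch of the
// helper the template above assumes.
func formatTimestamp(seconds float64) string {
	s := int(seconds)
	return fmt.Sprintf("%02d:%02d", s/60, s%60)
}

func main() {
	tmpl := template.Must(template.New("chapter").
		Funcs(template.FuncMap{"formatTimestamp": formatTimestamp}).
		Parse(`<span class="chapter-time">{{formatTimestamp .Start}}</span>`))
	// Prints <span class="chapter-time">03:42</span>
	tmpl.Execute(os.Stdout, struct{ Start float64 }{Start: 222})
}
```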
If the video has a transcript but no summary (AI is disabled, or summarization hasn't run yet), the watch page shows just the transcript panel without tabs. The summary is additive — it doesn't change the experience for users who don't enable AI.
Polling for completion
The watch page already polled for transcript completion. We extended it to also wait for summary completion. If a viewer arrives while the summary is still processing, the page polls every 10 seconds and reloads when both the transcript and summary have reached terminal states (ready or failed).
This means a viewer who opens a freshly recorded video sees the summary appear automatically within a minute — first the transcript lands, then the summary follows a few seconds later.
Provider flexibility
The AI client speaks the OpenAI chat completions API. Three environment variables configure it:
AI_ENABLED: "true"
AI_BASE_URL: "https://api.mistral.ai"
AI_API_KEY: "your-key"
AI_MODEL: "mistral-small-latest"
Swap the base URL and model to use OpenAI:
AI_BASE_URL: "https://api.openai.com"
AI_MODEL: "gpt-4o-mini"
Or point it at a local Ollama instance for fully offline operation:
AI_BASE_URL: "http://localhost:11434"
AI_MODEL: "llama3"
AI_API_KEY: ""
Self-hosters who don't want AI summaries leave AI_ENABLED unset. The summary worker doesn't start, the Library doesn't show the Summarize button, and transcription works exactly as before.
Try it
SendRec is open source (AGPL-3.0) and self-hostable. AI summaries and chapters are live at app.sendrec.eu — upload a video, wait for transcription, and the summary appears automatically. The implementation is in ai_client.go and summary_worker.go.