Mason K

Posted on May 21

Building a shot-detection worker for an upload pipeline with PySceneDetect 0.7

#ffmpeg #video #tutorial #python

📦 Code: github.com/USER/shot-detection-worker (replace before publishing)

TL;DR

We are going to build an upload worker that runs shot detection on a video, writes the boundary list to Postgres, and produces a 6-frame storyboard PNG for use as a hover-preview sprite. Stack is PySceneDetect 0.7 (released earlier this month), FFmpeg 8.1.1, Python 3.12, and a Redis-backed queue. About 120 lines of code.

Shot detection is one of those building blocks that quietly powers a lot of features people associate with "video AI": chapter generation, smart thumbnails, hover-preview sprites, the first step of any auto-clipping pipeline. The open-source path covers more ground than people expect, and PySceneDetect 0.7 (which dropped on 2026-05-03) is the version I would build on today.

Let's wire it up end to end.

🛠️ 1. Setup

# bash
mkdir shot-worker && cd shot-worker
python3.12 -m venv .venv && source .venv/bin/activate
pip install scenedetect==0.7 opencv-python-headless==4.10.0.84 \
            rq==1.16 "psycopg[binary]==3.2.0" boto3==1.34.0

We also need FFmpeg on the host:

# bash
$ ffmpeg -version | head -1
ffmpeg version 8.1.1 ...

A throwaway DB for the boundaries table:

-- schema.sql
CREATE TABLE shots (
    asset_id   TEXT NOT NULL,
    shot_index INT  NOT NULL,
    start_s    NUMERIC(10, 3) NOT NULL,
    end_s      NUMERIC(10, 3) NOT NULL,
    metric     NUMERIC(10, 3),
    PRIMARY KEY (asset_id, shot_index)
);

-- No separate index on asset_id: the primary key's leading column
-- already serves lookups by asset.

# bash
psql -d shotworker -f schema.sql

🎬 2. The detector picker

PySceneDetect 0.7 ships five detectors, and the right one depends on the content. A small router up front saves a lot of pain later.

# worker/detectors.py
from scenedetect import AdaptiveDetector, ContentDetector, ThresholdDetector
from scenedetect.detectors import HistogramDetector, HashDetector

def pick_detector(content_class: str):
    """Pick the PySceneDetect detector that suits the content class.

    content_class comes from the upload metadata: 'talking_head', 'sports',
    'animation', 'screen_recording', etc. Tune thresholds against your own
    samples; the defaults here are starting points, not ground truth.
    """
    match content_class:
        case "sports" | "action":
            # Rolling-average baseline handles fast camera motion better.
            return AdaptiveDetector(adaptive_threshold=3.0, min_scene_len=15)
        case "animation":
            # Histogram delta works better than HSV deltas on animated content.
            return HistogramDetector(threshold=0.05, min_scene_len=15)
        case "screen_recording":
            # Perceptual hashes tolerate small changes (cursor movement, a
            # blinking caret), so long near-static stretches stay one shot.
            return HashDetector(threshold=0.395, min_scene_len=30)
        case "fade_heavy":
            # Threshold-based detection catches fade-to-black act breaks.
            return ThresholdDetector(threshold=12.0, min_scene_len=15)
        case _:
            # Default: HSV content delta. Works on most diverse content.
            return ContentDetector(threshold=27.0, min_scene_len=15)

💡 Tip: keep a small set of labeled test clips (one per content class) and re-run the detector on them whenever you touch thresholds. That makes the "did my change regress?" question cheap to answer.
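
One way to turn that tip into a number is to score detected cuts against hand-labeled ones. The sketch below is a hypothetical helper (score_boundaries and the 0.1 s tolerance are assumptions, not part of PySceneDetect); it takes plain lists of cut timestamps in seconds and reports precision and recall.

```python
# tools/score_boundaries.py (hypothetical helper, not part of PySceneDetect)

def score_boundaries(expected, detected, tolerance_s=0.1):
    """Match detected cut times to labeled cut times within a tolerance.

    Both inputs are lists of cut timestamps in seconds. Each labeled cut can
    be claimed by at most one detected cut. Returns (precision, recall).
    """
    remaining = sorted(expected)
    hits = 0
    for t in sorted(detected):
        # Find the closest labeled cut that has not been matched yet.
        best = min(remaining, key=lambda e: abs(e - t), default=None)
        if best is not None and abs(best - t) <= tolerance_s:
            remaining.remove(best)
            hits += 1
    precision = hits / len(detected) if detected else 1.0
    recall = hits / len(expected) if expected else 1.0
    return precision, recall


labeled = [4.12, 9.48, 13.0]   # cuts marked by hand
found = [4.16, 9.48, 11.2]     # cuts reported by the detector
precision, recall = score_boundaries(labeled, found)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.67
```

Store one labeled list per test clip and fail the build when either number drops below whatever floor you settle on for that content class.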

🔍 3. The detection function

PySceneDetect's high-level API is small enough that the whole detection step is a dozen lines:

# worker/detect.py
from pathlib import Path

from scenedetect import open_video, SceneManager
from worker.detectors import pick_detector


def detect_shots(video_path: Path, content_class: str = "default"):
    """Run shot detection on a video file.

    Returns a list of (start_seconds, end_seconds, metric_score) tuples.
    """
    video = open_video(str(video_path))
    scene_manager = SceneManager()
    detector = pick_detector(content_class)
    scene_manager.add_detector(detector)

    scene_manager.detect_scenes(video=video, show_progress=False)

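    # start_in_scene=True yields one scene spanning the whole video when no
    # cuts are found, instead of an empty list.
    scene_list = scene_manager.get_scene_list(start_in_scene=True)
    # A per-scene metric is not exposed directly in 0.7, so the third slot is None.
    return [(start.get_seconds(), end.get_seconds(), None)
            for start, end in scene_list]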

You can also grab the per-frame metric data by adding a StatsManager and writing it to CSV, which is invaluable when tuning thresholds:

# worker/detect.py (extended version with stats)
from scenedetect import StatsManager

def detect_shots_with_stats(video_path, content_class, stats_csv):
    video = open_video(str(video_path))
    stats = StatsManager()
    scene_manager = SceneManager(stats_manager=stats)
    scene_manager.add_detector(pick_detector(content_class))
    scene_manager.detect_scenes(video=video, show_progress=False)
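    stats.save_to_csv(str(stats_csv))
    # start_in_scene=True keeps single-shot videos from returning an empty list.
    return [(s.get_seconds(), e.get_seconds(), None)
            for s, e in scene_manager.get_scene_list(start_in_scene=True)]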
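
To show how that CSV helps with tuning, here is a minimal sketch that counts how many frames would clear each candidate threshold. It assumes the file has a single header row and a metric column named content_val (check the header of your own file; other detectors write different column names), and it ignores min_scene_len, so treat the counts as an upper bound on cuts. The sample data is made up.

```python
# tools/threshold_sweep.py (hypothetical helper; column name is an assumption)
import csv
import io


def count_frames_over(csv_text, thresholds, column="content_val"):
    """Count frames whose metric is at or above each candidate threshold.

    Rows where the metric is missing or non-numeric are skipped.
    """
    values = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            values.append(float(row[column]))
        except (KeyError, TypeError, ValueError):
            continue
    return {t: sum(v >= t for v in values) for t in thresholds}


# Made-up sample in the assumed layout.
SAMPLE = """Frame Number,Timecode,content_val
1,00:00:00.033,0.0
2,00:00:00.067,1.8
3,00:00:00.100,31.4
4,00:00:00.133,2.1
5,00:00:00.167,24.9
"""

print(count_frames_over(SAMPLE, [20.0, 27.0, 35.0]))
# prints: {20.0: 2, 27.0: 1, 35.0: 0}
```

For a real file, pass Path(stats_csv).read_text() as csv_text.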

🖼️ 4. The storyboard

For a hover-preview sprite, we want one frame per shot, capped at six. FFmpeg does the heavy lifting; PySceneDetect tells it where to look:

# worker/storyboard.py
import subprocess
from pathlib import Path


def extract_keyframes(video_path: Path, shots, out_dir: Path, max_frames: int = 6):
    """Extract one frame per shot, capped at max_frames total."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # If there are more shots than slots, sample evenly.
    if len(shots) > max_frames:
        step = len(shots) / max_frames
        sampled = [shots[int(i * step)] for i in range(max_frames)]
    else:
        sampled = shots

    paths = []
    for i, (start_s, end_s, _) in enumerate(sampled):
        midpoint = start_s + (end_s - start_s) / 2
        out_path = out_dir / f"frame_{i:02d}.jpg"
        subprocess.run(
            [
                "ffmpeg", "-ss", f"{midpoint:.3f}", "-i", str(video_path),
                "-frames:v", "1", "-q:v", "3", "-vf", "scale=320:-2",
                "-y", str(out_path),
            ],
            check=True,
            stderr=subprocess.DEVNULL,
        )
        paths.append(out_path)

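    if not paths:
        raise ValueError("no shots to build a storyboard from")

    # Stitch into a single-row sprite (tile takes COLUMNSxROWS). With fewer
    # than max_frames frames, the remaining cells are left black.
    sprite_path = out_dir / "sprite.png"
    subprocess.run(
        ["ffmpeg", "-i", str(out_dir / "frame_%02d.jpg"),
         "-vf", f"tile={max_frames}x1", "-y", str(sprite_path)],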
        check=True,
        stderr=subprocess.DEVNULL,
    )
    return sprite_path

-ss before -i is intentional: as an input option it lets FFmpeg seek close to the target before decoding, which keeps each extract cheap on long videos. Putting -ss after -i makes it an output option, so FFmpeg decodes and discards everything from the start of the file up to the timestamp, which is fine on a 30-second clip and miserable on a 90-minute one.
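
The sampling and seek logic is easier to test once it is pulled out of the subprocess calls. The sketch below is a hypothetical refactor (sample_evenly and midpoint_seek_cmd are names introduced here, not from the worker); it only builds the argument list, so it runs without FFmpeg installed.

```python
# worker/storyboard_plan.py (hypothetical refactor for testing)

def sample_evenly(items, max_items):
    """Pick at most max_items entries, spread evenly across the list."""
    if len(items) <= max_items:
        return list(items)
    step = len(items) / max_items
    return [items[int(i * step)] for i in range(max_items)]


def midpoint_seek_cmd(video_path, start_s, end_s, out_path):
    """Build an ffmpeg argv that grabs one frame from the middle of a shot."""
    midpoint = start_s + (end_s - start_s) / 2
    return [
        "ffmpeg",
        "-ss", f"{midpoint:.3f}",  # input option: must come before -i
        "-i", str(video_path),
        "-frames:v", "1",
        "-y", str(out_path),
    ]


# 14 five-second shots, the same count as the verify output later in the post.
shots = [(i * 5.0, i * 5.0 + 5.0, None) for i in range(14)]
picked = sample_evenly(shots, 6)
print([start for start, _end, _metric in picked])
# prints: [0.0, 10.0, 20.0, 35.0, 45.0, 55.0]

cmd = midpoint_seek_cmd("input.mp4", 4.12, 9.48, "frame_00.jpg")
print(cmd[1:3])
# prints: ['-ss', '6.800']
```

Note that this sampling always includes the first shot and never the last one when there are more shots than slots; that is usually fine for a hover preview, but worth knowing.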

🔌 5. The worker

We tie everything together with an RQ job. Whichever queue you use, the shape is the same: pull asset, detect, write boundaries, produce storyboard, mark asset ready.

# worker/job.py
import logging
import tempfile
from pathlib import Path

import boto3
import psycopg

from worker.detect import detect_shots_with_stats
from worker.storyboard import extract_keyframes

logger = logging.getLogger(__name__)
s3 = boto3.client("s3")


def process_upload(asset_id: str, bucket: str, key: str,
                   content_class: str, dsn: str):
    """End-to-end shot detection + storyboard for a single uploaded asset."""
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / "input.mp4"
        s3.download_file(bucket, key, str(local))
        logger.info("downloaded %s/%s -> %s", bucket, key, local)

        stats_csv = Path(tmp) / "stats.csv"
        shots = detect_shots_with_stats(local, content_class, stats_csv)
        logger.info("detected %d shots in %s", len(shots), asset_id)

        # Persist boundaries
        with psycopg.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    "DELETE FROM shots WHERE asset_id = %s", (asset_id,),
                )
                cur.executemany(
                    "INSERT INTO shots (asset_id, shot_index, start_s, end_s, metric) "
                    "VALUES (%s, %s, %s, %s, %s)",
                    [(asset_id, i, s, e, m) for i, (s, e, m) in enumerate(shots)],
                )

        # Storyboard
        out_dir = Path(tmp) / "storyboard"
        sprite = extract_keyframes(local, shots, out_dir, max_frames=6)
        s3.upload_file(str(sprite), bucket, f"storyboards/{asset_id}.png")
        s3.upload_file(str(stats_csv), bucket, f"shot-stats/{asset_id}.csv")

        return {
            "asset_id": asset_id,
            "shots": len(shots),
            "sprite_key": f"storyboards/{asset_id}.png",
        }

Hooking it up to RQ:

# worker/run.py
from rq import Queue, Worker
from redis import Redis
import os

if __name__ == "__main__":
    redis = Redis.from_url(os.environ["REDIS_URL"])
    q = Queue("uploads", connection=redis)
    # Blocks and processes jobs from the queue until the worker is stopped.
    Worker([q], connection=redis).work()

Enqueue an upload job from your upload handler:

# in your upload handler
import os

from redis import Redis
from rq import Queue

q = Queue("uploads", connection=Redis.from_url(os.environ["REDIS_URL"]))
q.enqueue(
    "worker.job.process_upload",
    asset_id="upload_abc123",
    bucket="my-uploads",
    key="raw/abc123.mp4",
    content_class="talking_head",
    dsn=os.environ["DB_DSN"],
    job_timeout=600,
)

⚡ 6. Throughput and gotchas

A few notes from real deployments:

PySceneDetect is CPU-bound and decode-heavy. The detector itself is fast; the bottleneck is OpenCV decoding the video. On commodity CPU (4 vCPU, 8 GB RAM), 1080p talking-head content processes faster than real time. Mileage varies on dense 4K content.
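Use opencv-python-headless on servers. The full opencv-python package expects GUI libraries that slim container images usually do not ship, so it tends to break in containerized environments.
Trade precision for speed if you do not need frame-accurate boundaries. There are two separate knobs: spatial downscaling (SceneManager.auto_downscale, on by default, or a manual SceneManager.downscale factor) and temporal skipping (the frame_skip argument to detect_scenes). For chapter-generation use cases, downscaling changes the boundary list by a frame or two at most; measure the speedup on your own content, since decoding remains the bottleneck.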
The CSV stats file is gold. Save it. The day a content class regresses and the PM asks "why did chapters drop", that file is the answer.

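# faster detection at the cost of frame-level precision
from scenedetect import SceneManager, open_video
from worker.detectors import pick_detector

video = open_video(str(video_path), backend="opencv")
scene_manager = SceneManager()  # no StatsManager: it requires frame_skip=0
scene_manager.add_detector(pick_detector(content_class))
scene_manager.auto_downscale = True  # the default; shrinks frames before scoring
scene_manager.detect_scenes(video=video, frame_skip=2)  # process 1 frame in 3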

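⚠️ Note: frame_skip improves throughput, but at frame_skip=2 the detector only looks at one frame in three, so boundaries can land a couple of frames late and very short shots (a 4-frame quick cut, for example) may not register. It also cannot be combined with a StatsManager, so this path gives up the stats CSV. Tune to your content.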
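
To get a feel for the precision cost, here is a simplified model of frame skipping. It assumes the detector looks at frame 0 and then every (frame_skip + 1)-th frame, and it says nothing about thresholds or min_scene_len; the function names are made up for this sketch.

```python
# tools/frame_skip_model.py (simplified model, not PySceneDetect code)

def first_sampled_frame(frame, frame_skip):
    """First frame at or after `frame` that the detector actually looks at."""
    stride = frame_skip + 1
    return -(-frame // stride) * stride  # round up to a multiple of stride


def samples_inside(first_frame, last_frame, frame_skip):
    """How many sampled frames fall inside an inclusive frame range."""
    stride = frame_skip + 1
    first = first_sampled_frame(first_frame, frame_skip)
    if first > last_frame:
        return 0
    return (last_frame - first) // stride + 1


# A cut whose new shot starts at frame 100 is first seen at frame 102.
print(first_sampled_frame(100, frame_skip=2))   # prints: 102

# A 4-frame shot covering frames 101..104:
print(samples_inside(101, 104, frame_skip=0))   # prints: 4
print(samples_inside(101, 104, frame_skip=2))   # prints: 1
print(samples_inside(101, 104, frame_skip=4))   # prints: 0 (never sampled)
```

A shot with zero samples is invisible to any detector, whatever the threshold, so the shortest shot you care about puts a hard ceiling on frame_skip.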

🧪 7. Verifying the output

A small script that opens the storyboard and prints the boundary list:

# bash
$ python -m worker.cli verify upload_abc123
asset upload_abc123
  shots: 14
  total: 73.21s
  storyboard: s3://my-uploads/storyboards/upload_abc123.png
  sample boundaries:
    [00:00:00.000 - 00:00:04.120]  (shot 0)
    [00:00:04.120 - 00:00:09.480]  (shot 1)
    [00:00:09.480 - 00:00:13.000]  (shot 2)
    ...

Open the sprite. You should see six frames in a single row, each taken from the midpoint of one of six shots sampled evenly across the 14.
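
The worker.cli module is not shown in this post. As a rough sketch of its formatting half, the helpers below turn the (start_s, end_s, metric) tuples into the boundary lines printed above; the function names are assumptions, and fetching rows from Postgres is left out.

```python
# worker/format.py (hypothetical helper for the verify output)

def format_timecode(seconds):
    """Format a duration in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_boundaries(shots):
    """Render (start_s, end_s, metric) tuples as one line per shot."""
    return [
        f"[{format_timecode(start)} - {format_timecode(end)}]  (shot {i})"
        for i, (start, end, _metric) in enumerate(shots)
    ]


sample = [(0.0, 4.12, None), (4.12, 9.48, None), (9.48, 13.0, None)]
for line in format_boundaries(sample):
    print(line)
# prints:
# [00:00:00.000 - 00:00:04.120]  (shot 0)
# [00:00:04.120 - 00:00:09.480]  (shot 1)
# [00:00:09.480 - 00:00:13.000]  (shot 2)
```

Rounding to whole milliseconds first avoids float artifacts such as 4.12 rendering as 04.119.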

What's next

A few directions to take this once the baseline works:

Chapter generation. Merge consecutive shots until each group runs at least 30 seconds; the rest is a rules engine on top of the boundary list.
Smart thumbnails. Pipe the keyframes through a sharpness + face score and pick the best per shot.
Auto-clipping. Detect "interesting" moments by some other signal (audio energy, transcript keywords) and snap them to the nearest shot boundary; the clips stop looking like clips and start looking like edits.
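
Two of these are small enough to sketch on top of the boundary list alone. The code below shows one possible reading of each, not a finished feature: merge_into_chapters greedily merges consecutive shots until a chapter reaches a minimum length (30 seconds here), and snap_to_boundary moves a timestamp to the nearest shot boundary. Both function names are made up for this example.

```python
# worker/next_steps.py (illustrative sketches, names are hypothetical)
from bisect import bisect_left


def merge_into_chapters(shots, min_len_s=30.0):
    """Greedily merge consecutive shots into chapters of at least min_len_s.

    Accepts (start_s, end_s) or (start_s, end_s, metric) tuples. A tail that
    is too short on its own is folded into the previous chapter.
    """
    chapters = []
    current_start = None
    for start_s, end_s, *_ in shots:
        if current_start is None:
            current_start = start_s
        if end_s - current_start >= min_len_s:
            chapters.append((current_start, end_s))
            current_start = None
    if current_start is not None:
        tail_end = shots[-1][1]
        if chapters:
            chapters[-1] = (chapters[-1][0], tail_end)
        else:
            chapters.append((current_start, tail_end))
    return chapters


def snap_to_boundary(t, boundaries):
    """Return the entry of a sorted, non-empty boundary list closest to t."""
    i = bisect_left(boundaries, t)
    candidates = boundaries[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda b: abs(b - t))


shots = [(0, 12), (12, 31), (31, 40), (40, 65), (65, 73)]
print(merge_into_chapters(shots))
# prints: [(0, 31), (31, 73)]

print(snap_to_boundary(10.0, [0.0, 4.12, 9.48, 13.0]))
# prints: 9.48
```

For auto-clipping you would snap the clip's start and end separately, then drop clips whose two ends snap to the same boundary.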

The library does the boring part well, so you get to spend the engineering budget on the parts that actually feel like product.
