📦 Code: github.com/USER/shot-detection-worker (replace before publishing)
TL;DR
We are going to build an upload worker that runs shot detection on a video, writes the boundary list to Postgres, and produces a 6-frame storyboard PNG for use as a hover-preview sprite. Stack is PySceneDetect 0.7 (released earlier this month), FFmpeg 8.1.1, Python 3.12, and a Redis-backed queue. About 120 lines of code.
Shot detection is one of those building blocks that quietly powers a lot of features people associate with "video AI": chapter generation, smart thumbnails, hover-preview sprites, the first step of any auto-clipping pipeline. The open-source path covers more ground than people expect, and PySceneDetect 0.7 (which dropped on 2026-05-03) is the version I would build on today.
Let's wire it up end to end.
🛠️ 1. Setup
# bash
mkdir shot-worker && cd shot-worker
python3.12 -m venv .venv && source .venv/bin/activate
pip install scenedetect==0.7 opencv-python-headless==4.10.0.84 \
    rq==1.16 "psycopg[binary]==3.2.0" boto3==1.34.0
We also need FFmpeg on the host:
# bash
$ ffmpeg -version | head -1
ffmpeg version 8.1.1 ...
A throwaway DB for the boundaries table:
-- schema.sql
CREATE TABLE shots (
    asset_id   TEXT NOT NULL,
    shot_index INT NOT NULL,
    start_s    NUMERIC(10, 3) NOT NULL,
    end_s      NUMERIC(10, 3) NOT NULL,
    metric     NUMERIC(10, 3),
    PRIMARY KEY (asset_id, shot_index)
);
-- No separate index on asset_id is needed: the primary key's index
-- leads with asset_id, so per-asset lookups already use it.
# bash
psql -d shotworker -f schema.sql
🎬 2. The detector picker
PySceneDetect 0.7 ships five detectors, and the right one depends on the content. A small router up front saves a lot of pain later.
# worker/detectors.py
from scenedetect import AdaptiveDetector, ContentDetector, ThresholdDetector
from scenedetect.detectors import HistogramDetector, HashDetector


def pick_detector(content_class: str):
    """Pick the PySceneDetect detector that suits the content class.

    content_class comes from the upload metadata: 'talking_head', 'sports',
    'animation', 'screen_recording', etc. Tune thresholds against your own
    samples; the defaults here are starting points, not ground truth.
    """
    match content_class:
        case "sports" | "action":
            # Rolling-average baseline handles fast camera motion better.
            return AdaptiveDetector(adaptive_threshold=3.0, min_scene_len=15)
        case "animation":
            # Histogram delta works better than HSV deltas on animated content.
            return HistogramDetector(threshold=0.05, min_scene_len=15)
        case "screen_recording":
            # Perceptual hashing skips long static stretches.
            return HashDetector(threshold=0.395, min_scene_len=30)
        case "fade_heavy":
            # Threshold-based detection catches fade-to-black act breaks.
            return ThresholdDetector(threshold=12.0, min_scene_len=15)
        case _:
            # Default: HSV content delta. Works on most diverse content.
            return ContentDetector(threshold=27.0, min_scene_len=15)
💡 Tip: keep a small set of labeled test clips (one per content class) and re-run the detector on them whenever you touch thresholds. That makes the "did my change regress" question cheap to answer.
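One way to make that check concrete is a small comparison helper. This is a minimal sketch, not part of PySceneDetect: the name boundary_regressions, the tolerance default, and the clip data are all hypothetical, and it assumes you keep the expected cut times (in seconds) for each test clip.

```python
def boundary_regressions(expected, detected, tolerance_s=0.1):
    """Compare expected cut times against detected ones (all in seconds).

    Returns (missed, extra): expected cuts with no detected cut within
    tolerance_s, and detected cuts with no expected cut within tolerance_s.
    """
    missed = [e for e in expected
              if not any(abs(e - d) <= tolerance_s for d in detected)]
    extra = [d for d in detected
             if not any(abs(d - e) <= tolerance_s for e in expected)]
    return missed, extra


# Hypothetical labeled clip with cuts at 4.12 s and 9.48 s.
expected_cuts = [4.12, 9.48]
detected_cuts = [4.16, 13.0]
missed, extra = boundary_regressions(expected_cuts, detected_cuts)
print("missed:", missed)
print("extra:", extra)
```

An empty result on both sides means the threshold change did not move any boundary by more than the tolerance on that clip.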
3. The detection function
PySceneDetect's high-level API is small enough that the whole detection step is a dozen lines:
# worker/detect.py
from pathlib import Path

from scenedetect import open_video, SceneManager

from worker.detectors import pick_detector


def detect_shots(video_path: Path, content_class: str = "default"):
    """Run shot detection on a video file.

    Returns a list of (start_seconds, end_seconds, metric_score) tuples.
    """
    video = open_video(str(video_path))
    scene_manager = SceneManager()
    detector = pick_detector(content_class)
    scene_manager.add_detector(detector)
    scene_manager.detect_scenes(video=video, show_progress=False)

    shots = []
    for start, end in scene_manager.get_scene_list():
        shots.append((
            start.get_seconds(),
            end.get_seconds(),
            None,  # metric per scene not exposed directly in 0.7
        ))
    return shots
You can also grab the per-frame metric data by adding a StatsManager and writing it to CSV, which is invaluable when tuning thresholds:
# worker/detect.py (extended version with stats)
from scenedetect import StatsManager


def detect_shots_with_stats(video_path, content_class, stats_csv):
    video = open_video(str(video_path))
    stats = StatsManager()
    scene_manager = SceneManager(stats_manager=stats)
    scene_manager.add_detector(pick_detector(content_class))
    scene_manager.detect_scenes(video=video, show_progress=False)
    # Pass a plain string path so this works whether or not Path is accepted.
    stats.save_to_csv(str(stats_csv))
    return [(s.get_seconds(), e.get_seconds(), None)
            for s, e in scene_manager.get_scene_list()]
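To show how the per-frame data helps with tuning, here is a rough sketch of a threshold sweep. It assumes you have already loaded one metric column from the stats CSV into a list of floats (column names depend on the detector, so the loading step is left out). count_cuts and sweep are hypothetical helpers that only approximate a detector: they count frames at or above the threshold, spaced at least min_scene_len frames apart.

```python
def count_cuts(scores, threshold, min_scene_len=15):
    """Count frames scoring at or above threshold, at least min_scene_len apart.

    A rough approximation of threshold-plus-minimum-length cut logic,
    not a reimplementation of any PySceneDetect detector.
    """
    cuts = 0
    last_cut = None
    for frame, score in enumerate(scores):
        if score < threshold:
            continue
        if last_cut is None or frame - last_cut >= min_scene_len:
            cuts += 1
            last_cut = frame
    return cuts


def sweep(scores, thresholds, min_scene_len=15):
    """Map each candidate threshold to the number of cuts it would produce."""
    return {t: count_cuts(scores, t, min_scene_len) for t in thresholds}


# Hypothetical per-frame scores: two clear spikes and one borderline bump.
scores = [2.0] * 20 + [45.0] + [3.0] * 20 + [28.0] + [2.5] * 20 + [60.0] + [2.0] * 5
for threshold, cuts in sweep(scores, [20.0, 27.0, 30.0, 50.0]).items():
    print(f"threshold={threshold:>5}: {cuts} cuts")
```

A threshold where the cut count stays flat across neighboring values is usually a safer pick than one sitting right where the count changes.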
🖼️ 4. The storyboard
For a hover-preview sprite, we want one frame per shot. FFmpeg does the heavy lifting; PySceneDetect tells it where to look:
# worker/storyboard.py
import subprocess
from pathlib import Path


def extract_keyframes(video_path: Path, shots, out_dir: Path, max_frames: int = 6):
    """Extract one frame per shot, capped at max_frames total."""
    if not shots:
        # Nothing to sample; fail here rather than inside FFmpeg.
        raise ValueError("no shots to build a storyboard from")
    out_dir.mkdir(parents=True, exist_ok=True)

    # If there are more shots than slots, sample evenly.
    if len(shots) > max_frames:
        step = len(shots) / max_frames
        sampled = [shots[int(i * step)] for i in range(max_frames)]
    else:
        sampled = shots

    paths = []
    for i, (start_s, end_s, _) in enumerate(sampled):
        midpoint = start_s + (end_s - start_s) / 2
        out_path = out_dir / f"frame_{i:02d}.jpg"
        subprocess.run(
            [
                "ffmpeg", "-ss", f"{midpoint:.3f}", "-i", str(video_path),
                "-frames:v", "1", "-q:v", "3", "-vf", "scale=320:-2",
                "-y", str(out_path),
            ],
            check=True,
            stderr=subprocess.DEVNULL,
        )
        paths.append(out_path)

    # Stitch into a single-row sprite, max_frames columns wide (6x1 by default).
    sprite_path = out_dir / "sprite.png"
    subprocess.run(
        ["ffmpeg", "-i", str(out_dir / "frame_%02d.jpg"),
         "-vf", f"tile={max_frames}x1", "-frames:v", "1",
         "-y", str(sprite_path)],
        check=True,
        stderr=subprocess.DEVNULL,
    )
    return sprite_path
-ss before -i is intentional: as an input option it makes FFmpeg seek to the timestamp before decoding, which keeps each extract cheap on long videos. Placed after -i, -ss becomes an output option, so FFmpeg decodes and discards every frame up to the timestamp. That is fine on a 30-second clip and miserable on a 90-minute one.
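The difference is only in argument order, which a small helper makes easy to see. This is an illustrative sketch: frame_grab_args is a hypothetical function, not part of the worker above, and it only builds the argument list without running FFmpeg.

```python
def frame_grab_args(video_path, timestamp_s, out_path, input_seek=True):
    """Build an FFmpeg argument list that grabs one frame at timestamp_s.

    input_seek=True puts -ss before -i (seek first, then decode).
    input_seek=False puts -ss after -i (decode from the start and discard
    frames until the timestamp).
    """
    seek = ["-ss", f"{timestamp_s:.3f}"]
    source = ["-i", str(video_path)]
    head = seek + source if input_seek else source + seek
    return ["ffmpeg", *head, "-frames:v", "1", "-y", str(out_path)]


fast = frame_grab_args("input.mp4", 2712.5, "frame.jpg")
slow = frame_grab_args("input.mp4", 2712.5, "frame.jpg", input_seek=False)
print(" ".join(fast))
print(" ".join(slow))
```

Timing both variants against a long file with subprocess.run is a quick way to confirm the gap on your own footage.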
5. The worker
We tie everything together with an RQ job. Whichever queue you use, the shape is the same: pull asset, detect, write boundaries, produce storyboard, mark asset ready.
# worker/job.py
import logging
import tempfile
from pathlib import Path

import boto3
import psycopg

from worker.detect import detect_shots_with_stats
from worker.storyboard import extract_keyframes

logger = logging.getLogger(__name__)
s3 = boto3.client("s3")


def process_upload(asset_id: str, bucket: str, key: str,
                   content_class: str, dsn: str):
    """End-to-end shot detection + storyboard for a single uploaded asset."""
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / "input.mp4"
        s3.download_file(bucket, key, str(local))
        logger.info("downloaded %s/%s -> %s", bucket, key, local)

        stats_csv = Path(tmp) / "stats.csv"
        shots = detect_shots_with_stats(local, content_class, stats_csv)
        logger.info("detected %d shots in %s", len(shots), asset_id)

        # Persist boundaries
        with psycopg.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    "DELETE FROM shots WHERE asset_id = %s", (asset_id,),
                )
                cur.executemany(
                    "INSERT INTO shots (asset_id, shot_index, start_s, end_s, metric) "
                    "VALUES (%s, %s, %s, %s, %s)",
                    [(asset_id, i, s, e, m) for i, (s, e, m) in enumerate(shots)],
                )

        # Storyboard
        out_dir = Path(tmp) / "storyboard"
        sprite = extract_keyframes(local, shots, out_dir, max_frames=6)
        s3.upload_file(str(sprite), bucket, f"storyboards/{asset_id}.png")
        s3.upload_file(str(stats_csv), bucket, f"shot-stats/{asset_id}.csv")

        return {
            "asset_id": asset_id,
            "shots": len(shots),
            "sprite_key": f"storyboards/{asset_id}.png",
        }
Hooking it up to RQ:
# worker/run.py
import os

from redis import Redis
from rq import Queue, Worker

if __name__ == "__main__":
    redis = Redis.from_url(os.environ["REDIS_URL"])
    q = Queue("uploads", connection=redis)
    # work() blocks and processes jobs until the worker is stopped.
    Worker([q], connection=redis).work()
Enqueue an upload job from your upload handler:
# in your upload handler
import os

from redis import Redis
from rq import Queue

q = Queue("uploads", connection=Redis.from_url(os.environ["REDIS_URL"]))
q.enqueue(
    "worker.job.process_upload",
    asset_id="upload_abc123",
    bucket="my-uploads",
    key="raw/abc123.mp4",
    content_class="talking_head",
    dsn=os.environ["DB_DSN"],
    job_timeout=600,
)
⚡ 6. Throughput and gotchas
A few notes from real deployments:
- PySceneDetect is CPU-bound and decode-heavy. The detector itself is fast; the bottleneck is OpenCV decoding the video. On commodity CPU (4 vCPU, 8 GB RAM), 1080p talking-head content processes faster than real time. Mileage varies on dense 4K content.
- Use opencv-python-headless on servers. The full opencv-python package pulls in GTK/Qt and breaks in containerized environments.
- Downscale or skip frames if you do not need frame-accurate boundaries. SceneManager can shrink frames before analysis (auto_downscale, or an explicit downscale factor), and detect_scenes accepts a frame_skip argument. For chapter-generation use cases, a 2x downscale can roughly halve processing time and changes the boundary list by a frame or two at most.
- The CSV stats file is gold. Save it. The day a content class regresses and the PM asks "why did chapters drop", that file is the answer.
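# faster detection at the cost of frame-level precision
video = open_video(str(video_path), backend="opencv")
# Plain SceneManager here: in 0.6.x, frame_skip could not be combined
# with a StatsManager, so check that before mixing the two.
scene_manager = SceneManager()
scene_manager.add_detector(pick_detector(content_class))
scene_manager.auto_downscale = True
scene_manager.detect_scenes(video=video, frame_skip=2)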
⚠️ Note: frame_skip improves throughput but means the detector can miss very short shots (a 4-frame quick cut at frame_skip=2 may not register). Tune to your content.
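The trade-off is easy to put numbers on. The sketch below assumes frame_skip=N means one frame in every N+1 is processed, which is how the 0.6.x documentation describes it. The helper name and the 30 fps figure are illustrative.

```python
def frame_skip_cost(frame_skip, fps):
    """Summarize what a frame_skip setting implies at a given frame rate.

    Assumes one frame in every (frame_skip + 1) is processed.
    """
    stride = frame_skip + 1
    return {
        "fraction_processed": 1 / stride,
        # A cut can land on any skipped frame, so the reported boundary
        # may be late by up to frame_skip frames.
        "max_boundary_error_frames": frame_skip,
        "max_boundary_error_s": frame_skip / fps,
    }


cost = frame_skip_cost(frame_skip=2, fps=30.0)
print(cost)
```

For chapter markers a worst-case error of a couple of frames is rarely visible; for frame-accurate clipping it is a reason to leave frame_skip at 0.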
🧪 7. Verifying the output
A small script that opens the storyboard and prints the boundary list:
# bash
$ python -m worker.cli verify upload_abc123
asset upload_abc123
shots: 14
total: 73.21s
storyboard: s3://my-uploads/storyboards/upload_abc123.png
sample boundaries:
[00:00:00.000 - 00:00:04.120] (shot 0)
[00:00:04.120 - 00:00:09.480] (shot 1)
[00:00:09.480 - 00:00:13.000] (shot 2)
...
Open the sprite. It should show six frames, one from each of six shots sampled evenly across the 14.
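The formatting and the sampling can both be reproduced with two small helpers. This is a sketch under assumptions: worker.cli is not shown in this post, so format_timecode and sampled_indices are hypothetical names, and sampled_indices uses integer arithmetic to mirror the even-sampling step in extract_keyframes.

```python
def format_timecode(seconds):
    """Format a duration in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def sampled_indices(shot_count, max_frames=6):
    """Indices of the shots that end up in the sprite."""
    if shot_count <= max_frames:
        return list(range(shot_count))
    return [i * shot_count // max_frames for i in range(max_frames)]


# Boundaries taken from the sample output above.
sample = [(0.0, 4.12), (4.12, 9.48), (9.48, 13.0)]
for index, (start, end) in enumerate(sample):
    print(f"[{format_timecode(start)} - {format_timecode(end)}] (shot {index})")

print(sampled_indices(14))  # [0, 2, 4, 7, 9, 11]
```

With 14 shots, the sprite therefore shows shots 0, 2, 4, 7, 9, and 11, which is worth knowing when a frame looks unexpected.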
What's next
A few directions to take this once the baseline works:
- Chapter generation. Merge consecutive shots into groups that run at least 30 seconds; the rest is a rules engine on top of the boundary list.
- Smart thumbnails. Pipe the keyframes through a sharpness + face score and pick the best per shot.
- Auto-clipping. Detect "interesting" moments by some other signal (audio energy, transcript keywords) and snap them to the nearest shot boundary; the clips stop looking like clips and start looking like edits.
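Two of these directions reduce to a few lines on top of the boundary list. The sketch below is one possible reading, not a prescribed design: group_into_chapters merges consecutive shots until each group lasts at least 30 seconds (an assumed interpretation of the chapter rule), and snap_to_boundary moves a timestamp to the nearest shot start. Both names are hypothetical, and shots are plain (start_s, end_s) pairs.

```python
from bisect import bisect_left


def group_into_chapters(shots, min_len_s=30.0):
    """Merge consecutive (start_s, end_s) shots into chapters >= min_len_s.

    A trailing group shorter than min_len_s is folded into the previous
    chapter, or kept as the only chapter if there is no previous one.
    """
    chapters = []
    current_start = None
    for start, end in shots:
        if current_start is None:
            current_start = start
        if end - current_start >= min_len_s:
            chapters.append((current_start, end))
            current_start = None
    if current_start is not None:
        last_end = shots[-1][1]
        if chapters:
            chapters[-1] = (chapters[-1][0], last_end)
        else:
            chapters.append((current_start, last_end))
    return chapters


def snap_to_boundary(timestamp_s, boundaries):
    """Return the value in sorted, non-empty `boundaries` closest to timestamp_s."""
    pos = bisect_left(boundaries, timestamp_s)
    candidates = boundaries[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda b: abs(b - timestamp_s))


shots = [(0.0, 12.0), (12.0, 31.0), (31.0, 40.0), (40.0, 65.0), (65.0, 73.0)]
print(group_into_chapters(shots))
print(snap_to_boundary(38.2, [start for start, _ in shots]))
```

Here the five shots collapse into two chapters, and a moment flagged at 38.2 s snaps to the cut at 40.0 s.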
The library does the boring part well, so you get to spend the engineering budget on the parts that actually feel like product.