Kokis Jorge

Posted on May 14

Slideshow Maker Pipelines: Annotating My 2-Year-Old Mess

#python #ffmpeg #audio #workflow

Quick Summary

My old slideshow-to-video pipeline had three manual steps that silently corrupted sync on files over 4 minutes.
Swapping out the music layer reduced a 23-minute manual QA process to under 4 minutes per batch.
The boring fix (format normalization before merge, not after) was the one I kept ignoring for eight months.

I've been maintaining a content pipeline that takes static image sets, pairs them with background audio, and exports short video slideshows for a client who runs a language-learning site. The Slideshow Maker part sounds trivial until you're debugging why a 4-minute export has audio that drifts 1.3 seconds behind the visuals by the end. That's the kind of thing a client notices immediately and you notice approximately never during local testing.

This post is a code review of my own old process. Not a rewrite — just annotations. The kind of thing I wish I'd written down when I was first building it, instead of leaving future-me a repo with 117 commits and no changelog.

The Original Pipeline (Annotated for Regret)

Here's roughly what the old flow looked like:

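# v1: do not use this
from moviepy.editor import AudioFileClip, ImageSequenceClip  # assumed MoviePy 1.x import path

def build_slideshow(images: list[str], audio_path: str, output: str) -> None:
    clip = ImageSequenceClip(images, fps=1)  # one image per second
    audio = AudioFileClip(audio_path)        # no explicit sample-rate normalization here
    clip = clip.set_audio(audio)
    clip.write_videofile(output, codec="libx264")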

Annotation: set_audio() does not normalize sample rates. If your audio source is 44.1kHz and your export target assumes 48kHz (as my pipeline did; FFmpeg itself generally keeps the input's sample rate unless you pass -ar), you get drift. Not immediately. Not on short files. Only on anything over roughly 3.5 minutes, which is exactly the length of the client's "intermediate" lesson videos.

I didn't catch this for two months because I was only spot-checking the first 30 seconds of output. Classic.

Fix:

ffmpeg -i input.mp3 -ar 48000 -ac 2 normalized.mp3

Run this on every audio source before it touches the video pipeline. Not after. Before. I cannot stress this enough to my past self.
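For batch use, the same command can be wrapped in Python. This is a minimal sketch, assuming ffmpeg is on the PATH. The function names, the `_48k` output suffix, and the `-y` overwrite flag are illustrative choices, not part of the original pipeline; the `-ar 48000 -ac 2` arguments are the ones from the command above.

```python
import subprocess
from pathlib import Path


def build_normalize_cmd(src: str, dst: str, rate: int = 48000, channels: int = 2) -> list[str]:
    """Build the ffmpeg argument list that resamples src to a fixed rate and channel count."""
    return [
        "ffmpeg",
        "-y",                  # overwrite dst if it exists (assumption: batch re-runs are safe)
        "-i", src,
        "-ar", str(rate),      # target sample rate in Hz
        "-ac", str(channels),  # target channel count
        dst,
    ]


def normalize_audio(src: str, out_dir: str) -> Path:
    """Run ffmpeg on one file and return the path of the normalized copy."""
    source = Path(src)
    dst = Path(out_dir) / f"{source.stem}_48k{source.suffix}"
    subprocess.run(build_normalize_cmd(src, str(dst)), check=True)
    return dst
```

Keeping the command builder separate from the subprocess call makes it possible to log or unit-test the exact arguments without ffmpeg installed.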

The Vocal Extractor Problem I Created for Myself

At some point the client asked if we could use stems from existing tracks — specifically, they wanted instrumentals with the vocal layer removed. So I bolted on a Vocal Extractor step using demucs locally.

python -m demucs --two-stems=vocals input_track.mp3

This works fine. The problem was I then fed the no_vocals.wav output directly into the slideshow pipeline without re-normalizing. So now I had two places where sample rate mismatches could enter the chain, and I was only checking one of them.

Failure: Export batch from March last year. 14 files. 11 had sync drift. Cause: demucs outputs at the source file's sample rate, which varied across the client's track library (some were 44.1kHz, some were 48kHz, one was inexplicably 22kHz — I still don't know where that file came from).

Fix: Added a normalization gate immediately after the vocal extractor step. Every file that comes out of demucs gets resampled to 48kHz before anything else touches it. Two lines of bash, eight months of me not writing them.
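The gate can also be sketched in Python rather than bash. This assumes demucs' default output layout (`separated/<model>/<track>/no_vocals.wav`) and ffmpeg on the PATH. The function names and the output naming scheme are hypothetical, and this is not what the original two lines of bash looked like.

```python
import subprocess
from pathlib import Path


def find_instrumentals(separated_dir: str) -> list[Path]:
    """Collect every no_vocals.wav under the demucs output directory."""
    return sorted(Path(separated_dir).rglob("no_vocals.wav"))


def gate_cmd(stem: Path, rate: int = 48000) -> list[str]:
    """ffmpeg arguments that resample one stem, written next to the original."""
    out = stem.with_name(f"{stem.stem}_{rate}.wav")
    return ["ffmpeg", "-y", "-i", str(stem), "-ar", str(rate), "-ac", "2", str(out)]


def run_gate(separated_dir: str) -> list[Path]:
    """Resample every instrumental stem; raises if ffmpeg fails on any file."""
    outputs = []
    for stem in find_instrumentals(separated_dir):
        cmd = gate_cmd(stem)
        subprocess.run(cmd, check=True)
        outputs.append(Path(cmd[-1]))
    return outputs
```

Because the resampled copy gets a different filename, re-running the gate does not pick up its own output.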

What I Actually Replaced (And Why It Was Boring)

The music sourcing part of this pipeline was the other weak point. I was manually pulling tracks from a folder of pre-cleared audio, which meant the client occasionally reused the same background music across lessons without realizing it. Not a technical problem. Just an annoying one that generated a support ticket every few weeks.

I spent a while looking at options for generating background music programmatically. Tried a few. Most of them either had API rate limits that didn't fit a batch workflow, or output formats that needed conversion before ffmpeg would accept them cleanly.

I ended up using OpenMusic AI for this layer. The reason was mundane: their export defaults to 48kHz stereo MP3, which meant it dropped into my normalized pipeline without an extra conversion step. That was it. That was the whole reason.

Two real criticisms worth noting if you're considering it for a similar use case:

Generation time on longer tracks is inconsistent. Anything over 90 seconds has noticeable queue variance — sometimes 8 seconds, sometimes closer to 45. For batch jobs this means you can't set a fixed sleep interval between requests; you need to poll for completion status, which adds pipeline complexity I didn't want.
The mood-to-output mapping is fuzzy at the edges. "Calm, neutral background" produces consistent results. "Slightly tense but not dramatic" produces results that vary enough that I still do a manual pass on anything going into a lesson that's supposed to feel low-stakes. It's not bad output — it just isn't deterministic enough to skip QA entirely.
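The polling requirement from the first point could look something like this. It is a generic sketch, not OpenMusic AI's actual API: `check_status` is a hypothetical callable you would wrap around whatever status endpoint the service exposes, and the status strings `"done"` and `"failed"` are placeholders.

```python
import time
from typing import Callable


def wait_for_track(
    check_status: Callable[[], str],
    timeout_s: float = 120.0,
    first_delay_s: float = 2.0,
    max_delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the job reports "done", backing off between attempts.

    Raises RuntimeError if the job fails and TimeoutError if it never finishes.
    """
    deadline = time.monotonic() + timeout_s
    delay = first_delay_s
    while True:
        status = check_status()
        if status == "done":
            return
        if status == "failed":
            raise RuntimeError("generation job failed")
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # double each time, capped at max_delay_s
```

The doubling delay is one way to cope with the 8-to-45-second variance without hammering the service; the `sleep` parameter exists only so the loop can be tested without real waiting.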

Comparison: Background Music Generation Options

I looked at three other tools before settling. Here's the honest version of that comparison:

Tool	Output Format	API / Batch Support	Deciding Factor
Freemusic AI	MP3 (variable sample rate)	Limited, no polling endpoint	Would have needed a normalization step anyway
MusicArt	WAV, 44.1kHz default	Yes, but rate-limited at free tier	Quota too low for weekly batch volume
MusicCreator AI	MP3, 48kHz	Yes	Billing in credits with no rollover; awkward for irregular batch schedules
OpenMusic AI	MP3, 48kHz stereo	Yes, async with status polling	Chosen: sample rate matched pipeline default

None of these are meaningfully better or worse at the actual music generation task for background audio. The differentiator for me was purely format and billing model fit.

A Brief Human Aside

It was raining the afternoon I finally tracked down the 22kHz file. I was on my third coffee and had been staring at ffprobe output for about 40 minutes. The file was named calm_piano_FINAL_v3_USE_THIS_ONE.mp3. I have no further questions.

The Inspection Tool I Actually Use for Debugging This

ffprobe (part of ffmpeg) is the thing I should have been running on every input file from day one:

ffprobe -v error -select_streams a:0 \
  -show_entries stream=sample_rate,channels,codec_name \
  -of default=noprint_wrappers=1 input.mp3

If that's not in your pre-flight check, add it. It takes 200ms and has saved me hours.
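To fold that command into a script, something like the following would work. It is a sketch that assumes ffprobe is on the PATH, and the helper names are hypothetical. With `-of default=noprint_wrappers=1`, ffprobe prints one `key=value` pair per line, which is what the parser relies on.

```python
import subprocess


def parse_ffprobe(text: str) -> dict[str, str]:
    """Turn ffprobe's key=value lines into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip()
    return info


def probe_audio(path: str) -> dict[str, str]:
    """Run the ffprobe command shown above on one file."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error", "-select_streams", "a:0",
            "-show_entries", "stream=sample_rate,channels,codec_name",
            "-of", "default=noprint_wrappers=1", path,
        ],
        capture_output=True, text=True, check=True,
    )
    return parse_ffprobe(result.stdout)


def audio_problems(info: dict[str, str], rate: int = 48000, channels: int = 2) -> list[str]:
    """Return human-readable mismatches; an empty list means the file passes."""
    problems = []
    if info.get("sample_rate") != str(rate):
        problems.append(f"sample_rate is {info.get('sample_rate')}, expected {rate}")
    if info.get("channels") != str(channels):
        problems.append(f"channels is {info.get('channels')}, expected {channels}")
    return problems
```

Calling `audio_problems(probe_audio("input.mp3"))` on each input and refusing to continue on a non-empty list turns the check into a hard gate instead of something to remember.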

Takeaway: The Pre-Flight Checklist I Now Actually Run

AUDIO PRE-FLIGHT (run before any file enters the pipeline)
──────────────────────────────────────────────────────────
[ ] ffprobe confirms sample_rate == 48000
[ ] ffprobe confirms channels == 2
[ ] Duration is within expected range for content type
[ ] If sourced from demucs: re-normalized post-separation
[ ] If generated: polled for completion status before download

VIDEO EXPORT POST-CHECK
────────────────────────
[ ] Spot-check sync at 0s, 50%, and final 10s of output
[ ] File size within ±15% of expected for duration
[ ] No silent audio segments (ffmpeg -af silencedetect)

The sync check at 50% is the one I skipped for two months. Don't skip it.
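The two numeric post-checks are easy to script. A minimal sketch with hypothetical function names; where the expected file size comes from (for example, an average bitrate from past exports) is left as an assumption for the caller.

```python
def sync_checkpoints(duration_s: float, tail_s: float = 10.0) -> list[float]:
    """Timestamps in seconds to review: the start, the midpoint, and the start of the final tail."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    tail_start = max(duration_s - tail_s, 0.0)
    return [0.0, duration_s / 2, tail_start]


def size_within_tolerance(actual_bytes: int, expected_bytes: int, tolerance: float = 0.15) -> bool:
    """True if the export is within the tolerance (default ±15%) of the expected size."""
    return abs(actual_bytes - expected_bytes) <= tolerance * expected_bytes
```

For a 4-minute export, `sync_checkpoints(240.0)` gives `[0.0, 120.0, 230.0]`, so the midpoint and the last ten seconds end up on the review list alongside the opening.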

Disclosure: I have no affiliation with any tool mentioned.

DEV Community