My AI agent's YouTube Shorts pipeline died at 3am - Python 3.14 + moviepy v2 was the killer
I run an autonomous agent (Atlas) that generates and uploads a YouTube Short every day. For 37 days it worked. On day 38 it just stopped. No alarms. No exception bubbled up to a dashboard. The Short never appeared.
When I dug in, the root cause was the most mundane possible: a quiet language upgrade collided with a library that had renamed its import path between major versions.
Here is the post-mortem, because if you are running anything long-lived in Python you are probably one brew upgrade away from the same trap.
The failure mode
My pipeline lives in tools/create_short_v2.py. The first line of the video-rendering function looks like this:
from moviepy import VideoFileClip, AudioFileClip, concatenate_videoclips
That import was written against moviepy v2.x, which restructured the package and exposed top-level names directly.
But on this machine, pip show moviepy says:
Name: moviepy
Version: 1.0.3
And in moviepy 1.0.3, those names do not live at the top level. They live in `moviepy.editor`. So the import blows up with `ImportError: cannot import name 'VideoFileClip' from 'moviepy'`, the function never runs, and the agent shrugs and moves on to the next loop.
The Short is never generated. Nothing is logged at ERROR level because the agent treats "tool returned nothing" as "no work to do."
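That swallowing behavior is worth seeing concretely. Here is a minimal sketch of the anti-pattern that ate the ImportError - `run_tool` is a hypothetical helper, not my actual agent code:

```python
import subprocess
import sys

def run_tool(cmd: list[str]) -> str:
    """Anti-pattern: any tool failure collapses into an empty result."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except subprocess.CalledProcessError:
        return ""  # the ImportError dies here; the caller just sees "nothing"

# A loop built on this never distinguishes "no work" from "tool crashed":
result = run_tool([sys.executable, "-c", "from moviepy import VideoFileClip"])
if not result:
    pass  # treated as "no work to do" - no ERROR log, no alert
```

The subprocess's traceback lands in `stderr` of the `CalledProcessError` and is thrown away, which is exactly how a hard crash turns into a quiet no-op.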
Why it worked yesterday
Until last week, the agent ran under Python 3.12 with moviepy 2.x installed in a virtualenv that no longer exists on disk. Two things changed in the background:
- Homebrew rolled `python3` from 3.12 to 3.14. I did not `brew upgrade python` on purpose - it came along for the ride during an unrelated update of another formula.
- Homebrew's Python 3.14 is marked externally managed and pip enforces PEP 668 against it. That means `pip install` into the system interpreter is blocked by default - you get the screaming red error telling you to use `--break-system-packages` or a venv. The old venv was gone, so the agent's `python3` was now the system Python, which had only the old moviepy 1.0.3 left over from a system install years ago.
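PEP 668 works through a marker file named `EXTERNALLY-MANAGED` placed next to the interpreter's standard library; when pip sees it, it refuses to install. A quick way to check whether the interpreter you are on is affected:

```python
import sysconfig
from pathlib import Path

# PEP 668: an interpreter is "externally managed" when this marker file
# sits in its stdlib sysconfig directory. pip checks for it before installing.
marker = Path(sysconfig.get_path("stdlib")) / "EXTERNALLY-MANAGED"
print("externally managed:", marker.exists())
```

Running this under a venv prints `False` even on a machine whose system Python is locked down, which is one more reason the venv is the right place for an agent to live.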
Two boring upgrades. Zero changes to my code. Total pipeline death.
How I would have caught this earlier
The right answer is "do not run an autonomous agent against the system interpreter." Obvious in hindsight. But the more general lesson is about silent failure modes in pipelines that are not on the critical path of a request.
A user-facing endpoint that breaks gets noticed in minutes. A background generator that produces zero output gets noticed when you happen to look at the channel page.
A few things I am changing:
1. Pin the interpreter explicitly
The wrapper that invokes the pipeline now hard-codes the venv's Python by absolute path:
/Users/me/projects/whoff-agents/.venv/bin/python tools/create_short_v2.py
Not `python3`. Not `which python3`. The exact binary. If the venv disappears, the script fails loudly with "no such file or directory" instead of silently falling back to a stale system Python.
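The same idea can also be enforced from inside Python as a startup guard. A sketch - `assert_interpreter` is a hypothetical helper and the path is illustrative:

```python
import sys

# Illustrative pinned path, matching the wrapper above.
PINNED = "/Users/me/projects/whoff-agents/.venv/bin/python"

def assert_interpreter(expected: str = PINNED) -> None:
    """Refuse to run under anything but the pinned venv interpreter."""
    if sys.executable != expected:
        raise SystemExit(
            f"wrong interpreter: {sys.executable!r} (expected {expected!r})"
        )
```

Calling this at the top of the entry point gives you a belt-and-suspenders check even if someone invokes the script with the wrong Python directly.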
2. Defensive imports with version-aware fallback
The hot path now looks like this:
```python
try:
    from moviepy import VideoFileClip, AudioFileClip, concatenate_videoclips
except ImportError:
    # Legacy moviepy 1.x layout
    from moviepy.editor import VideoFileClip, AudioFileClip, concatenate_videoclips
```
This is ugly. I do not love it. But for a pipeline that has to keep running across library upgrades for which I cannot pause the agent, the fallback buys me a recovery window.
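If the bare try/except feels too implicit, an alternative is to branch on the installed version explicitly via `importlib.metadata`, which ships in the standard library. A sketch with hypothetical helpers:

```python
from importlib import metadata

def legacy_layout(version: str) -> bool:
    """moviepy < 2 keeps its public names under moviepy.editor."""
    return int(version.split(".")[0]) < 2

def moviepy_import_path() -> str:
    """Resolve where VideoFileClip et al. live in this environment."""
    try:
        version = metadata.version("moviepy")
    except metadata.PackageNotFoundError as exc:
        # Fail loudly instead of letting the loop shrug it off.
        raise RuntimeError("moviepy is not installed here") from exc
    return "moviepy.editor" if legacy_layout(version) else "moviepy"
```

The upside over try/except is that "not installed at all" becomes its own loud error rather than falling into the legacy branch and failing there with a more confusing message.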
3. Output-existence check at the end of every loop
The autonomous loop now ends with an assertion: "did this loop produce the artifact it was supposed to produce?" If the loop was supposed to write a Short and there is no Short, that is an error event, not a silent return. The agent posts a self-issued bug ticket to its own queue. The next loop picks it up.
This is the same principle as assert-no-leftover-work in a Sidekiq job: instead of trusting that no exception means success, you check the side-effect at the end.
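A minimal sketch of that end-of-loop check, with a hypothetical `assert_artifact` helper:

```python
from pathlib import Path

def assert_artifact(path: Path) -> None:
    """End-of-loop check: a missing or empty output is an error, not a no-op."""
    if not path.is_file() or path.stat().st_size == 0:
        raise RuntimeError(f"loop completed but artifact is missing/empty: {path}")
```

The empty-file check matters too: a renderer that dies mid-write can leave a zero-byte `.mp4` behind, which would pass a bare existence check.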
4. Dependency drift monitoring
`pip freeze` output is now checksummed and stored alongside the commit hash of the agent's code. When `pip freeze` differs from the last known-good freeze and the agent has not been redeployed, that is a signal to pause autonomous loops and ping me.
The bigger lesson: autonomous pipelines need explicit aliveness signals
I built this agent under a "no news is good news" mental model. As long as nothing screamed, I assumed work was happening.
That is wrong for any long-running system. The default for autonomy should be: every loop emits proof-of-life that names the artifact it produced. If the artifact is missing, the next loop investigates the previous loop's silence rather than just doing its own work.
I had heartbeat logging. What I did not have was output-attestation logging. A heartbeat says "the agent is breathing." An attestation says "the agent did the thing it was supposed to do." Those are different signals and you need both.
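An attestation record can be as small as one JSON line per loop. A sketch with a hypothetical `emit_attestation` helper:

```python
import json
import time
from pathlib import Path

def emit_attestation(log: Path, artifact: Path) -> None:
    """Proof-of-work, not just proof-of-life: name the artifact and its size."""
    record = {
        "ts": time.time(),
        "artifact": str(artifact),
        "exists": artifact.is_file(),
        "bytes": artifact.stat().st_size if artifact.is_file() else 0,
    }
    with log.open("a") as fh:
        fh.write(json.dumps(record) + "\n")
```

The next loop (or a cron watchdog) only has to read the last line and check `exists` to know whether yesterday actually happened.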
The fix in production
Patched in this order:
1. Add the `try/except` import fallback so existing loops can keep trying.
2. Build a `whoff-agents/.venv` with pinned `moviepy>=2.0`, `edge-tts`, and `faster-whisper`.
3. Update the wrapper to use the venv's Python by absolute path.
4. Add the output-attestation check to the end of the loop.
5. Run one end-to-end Short to verify.
End to end: about 90 minutes of work to fix a one-line bug (a missing `.editor`) that nuked a daily pipeline for a full day.
TL;DR for anyone running an autonomous pipeline
- Pin your interpreter by absolute path, not `python3`.
- Use a venv. Always. Even for "just a little script."
- Defensive imports across major version bumps are ugly but cheap insurance.
- "No exception" is not the same as "success." Check that the artifact exists at the end of the loop.
- Watch for silent `brew upgrade`s that touch Python.
If your agent runs unattended overnight, you have to assume something in its environment will change without your knowledge. The interesting question is not whether - it is how loud the failure is when it does.
Atlas was quiet. That is the bug I am actually fixing.