
Erik Anderson

Posted on • Originally published at pas.it.com

My YouTube Automation Uploaded 29 Videos in One Afternoon — Here is What Broke


I run 57 projects autonomously on two servers in my basement. One of them is a YouTube Shorts pipeline that generates, reviews, and uploads videos every day without me touching it.

Yesterday it uploaded 29 videos in a single afternoon. That was not the plan.

Here's the postmortem — what broke, why, and the 5-minute fix that stopped it.

The Architecture

The pipeline works like this:

  1. Cron job fires — triggers a pipeline (market scorecard, daily tip, promo, etc.)
  2. AI generates a script — based on market data, tips, or trending topics
  3. FFmpeg renders the video — text overlays, stock footage, voiceover
  4. Review panel scores it — if it scores above 6/10, it proceeds
  5. Uploader publishes — uploads to YouTube, posts to Twitter, announces on Discord

All of this runs on a NATS JetStream message bus called PrimeBus. Every step publishes an event. Every component listens for events it cares about.
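Conceptually, every event on the bus is just a subject plus a small JSON payload. A minimal sketch of building one (the subject name comes from the pipeline above; the payload fields and the `make_event` helper are illustrative assumptions, not PrimeBus internals):

```python
import json

def make_event(subject: str, payload: dict) -> tuple[str, bytes]:
    """Build a (subject, body) pair in the shape js.publish() expects."""
    return subject, json.dumps(payload).encode()

# e.g. the review panel rejecting a video
subject, body = make_event(
    "app.youtube.pipeline.rejected",
    {"type": "market_scorecard", "score": 4},
)
```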

The Retry Handler

Because videos sometimes get rejected by the review panel (score too low), I built a retry handler. It's a separate service that listens for app.youtube.pipeline.rejected events on the bus.

When a video gets rejected, the retry handler re-runs the pipeline with a fresh attempt. Max 5 retries per day. Sounded reasonable.
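The daily budget logic looked roughly like this (a sketch under assumptions: `should_retry` and the key format are my names for illustration, not the service's):

```python
from collections import defaultdict
from datetime import date

MAX_RETRIES = 5  # daily retry budget per pipeline type

_retry_counts: dict[str, int] = defaultdict(int)

def _retry_key(pipeline_type: str) -> str:
    # One budget per pipeline type per calendar day
    return f"{pipeline_type}:{date.today().isoformat()}"

def should_retry(pipeline_type: str) -> bool:
    """True while this pipeline type is under its daily retry budget."""
    key = _retry_key(pipeline_type)
    if _retry_counts[key] >= MAX_RETRIES:
        return False
    _retry_counts[key] += 1
    return True
```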

What Actually Happened

Here's the bug:

```python
if success:
    # Reset the retry counter on success
    _retry_counts[key] = 0
```

When a retry succeeded and the video uploaded, the counter reset to zero. So the next rejection event — from a completely normal pipeline run — would start the retry cycle all over again. The system forgot it had already succeeded.

But it gets worse.

I also had a PrimeBus rule called youtube-video-missing-alert that listened for *_failed events. Its job was to diagnose failures. But step 5 of its instructions said:

"Re-run the pipeline"

So now I had two independent systems both trying to retry failed videos. The retry handler AND a PrimeBus automation rule. They didn't know about each other. They both re-ran pipelines. Some of those runs succeeded and uploaded. Some failed and triggered more retries.

29 videos. One afternoon.

The Root Cause

The retry handler only listened for failure events. It had no idea when a video successfully uploaded. It was flying blind on the success side.

The success event (app.youtube.upload.complete) was being published by the uploader — but nobody was listening for it in the retry logic.
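For context, the uploader's side of that event might look like this (a sketch; the payload fields and the `announce_upload` helper are assumptions — only the subject name is from the actual system):

```python
import json

async def announce_upload(js, pipeline_type: str, video_id: str) -> None:
    """Publish the success event the retry handler should have subscribed to."""
    await js.publish(
        "app.youtube.upload.complete",
        json.dumps({"type": pipeline_type, "video_id": video_id}).encode(),
    )
```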

The Fix (3 Changes, 5 Minutes)

1. Listen for success, not just failure

```python
import json

async def handle_upload_success(msg):
    """When a video uploads, mark that pipeline as done for today."""
    payload = json.loads(msg.data)
    pipeline_type = payload.get("type", "unknown")
    key = _retry_key(pipeline_type)
    _uploaded_today.add(key)
    _retry_counts[key] = MAX_RETRIES + 1  # no more retries
    await msg.ack()

# Subscribe to BOTH events
await js.subscribe("app.youtube.pipeline.*_uploaded", cb=handle_upload_success, ...)
await js.subscribe("app.youtube.upload.complete", cb=handle_upload_success, ...)
```

2. Guard at the top of every retry

```python
if _already_uploaded_today(pipeline_type):
    log.info("SKIP: %s already uploaded today. No retry needed.", pipeline_type)
    await msg.ack()
    return
```

This checks both an in-memory set AND the actual log file as a belt-and-suspenders fallback.

3. Stop the competing automation

The PrimeBus youtube-video-missing-alert rule no longer re-runs pipelines. It just diagnoses and sends a Discord alert. One system handles retries. Not two.

The Lesson

Every retry system needs to listen for success, not just failure.

If your healing logic doesn't know when to stop healing, it becomes the problem. Self-healing systems are powerful — but the stop condition is more important than the retry logic.

The retry handler was the last thing I built. It should have been the first thing I tested end-to-end.

The Stack

For anyone curious, here's what's running this:

  • PrimeBus — NATS JetStream event bus with rule-based automation
  • Python pipelines — Claude AI for scripts, FFmpeg for rendering, YouTube Data API for uploads
  • Systemd services — retry handler runs as youtube-retry-handler.service
  • Cron — 10+ scheduled pipeline types across the week

All running on an Ubuntu server in my basement. No cloud. No team. Just systems.


Build daily. Break things. Fix them fast.

If you're building autonomous systems — I write about the wins, the failures, and everything in between. Follow for more real postmortems from production.
