DEV Community

Sebastian Chedal

Posted on • Originally published at fountaincity.tech

Completion-Triggered Orchestration: Why We Stopped Scheduling Our AI Pipeline

The Scheduling Problem

Completion-triggered orchestration is an architectural pattern where only the pipeline’s entry point runs on a schedule. Every downstream stage fires automatically when its predecessor completes.

We run a multi-stage autonomous content pipeline on fixed schedules — or we did, until the scheduling layer became the bottleneck. This article is about the scheduling architecture underneath the pipeline, and why we replaced it.

AI stages have variable execution times. LLM inference isn’t predictable the way a database query or file transform is. A research stage might take 8 minutes on Monday and 22 minutes on Tuesday, depending on topic complexity, number of sources, and model load. Writing a draft might take 12 minutes or 40. When every stage has variable duration, fixed scheduling always creates gaps.

This isn’t unique to content pipelines. Any multi-agent workflow where tasks involve LLM inference, image generation, or other AI operations faces the same problem. The execution time is inherently unpredictable, and cron jobs don’t care.

How It Worked Before: 9 Crons and a Lot of Waiting

The original architecture used 9 scheduled cron jobs: three per downstream stage, each running at fixed intervals with 30- to 60-minute gaps between them. Research at 7 AM, 11 AM, and 7 PM. Writing at 8 AM, 12 PM, and 8 PM. Self-review at 9 AM, 1 PM, and 9 PM. And so on through the remaining stages.

The scheduling looked tidy on paper. In practice, it created three failure modes that compounded each other.

Dead time between stages. If research finished at 7:12 AM, the write cron wouldn’t fire until 8:00 AM. That’s 48 minutes where a completed item sits idle, waiting for its number to be called. Multiply that across 6 stages, and a brief that could publish in 2 to 3 hours was taking 6 to 12.

Window misses. A stage completes at 9:01 AM; the next stage's cron fired at 9:00 AM. That item now waits until the 1:00 PM run, losing 4 hours to a one-minute timing gap. Roughly 30% of pipeline items were missing their window on any given day.

Invisible stalls. When an item got stuck between stages, there was no mechanism to detect it automatically. Someone had to notice the gap in the output, check the logs, and manually trigger the next stage. One to two manual interventions per day became the norm.


None of these problems were bugs. The pipeline stages all worked correctly. The scheduling layer was the bottleneck, and it was a bottleneck by design. Fixed-interval scheduling is built for predictable workloads. AI workloads aren’t predictable.

The Fix: Only Schedule the Entry Point

The architectural change was simple to describe: keep only the research crons on a schedule. Make everything downstream fire on completion.

When a research stage finishes and marks an item as “researched,” it calls a trigger function that immediately fires the write stage. When writing completes and marks the item “drafted,” it triggers self-review. Reviewed triggers dedup check. Dedup-checked triggers art direction. Art-directed triggers the final improve-and-publish stage.

The trigger chain looks like this:

  • Research completes → triggers Write
  • Write completes → triggers Self-Review
  • Self-Review completes → triggers Dedup Check
  • Dedup Check completes → triggers Art Direction
  • Art Direction completes → triggers Improve + Publish

Only research still runs on a schedule (three times daily). It’s the entry point, the only stage that needs to poll for new work. Everything else reacts to what actually happened instead of guessing when it might happen.

The implementation lives in a pipeline state manager. When a stage updates an item’s status, it calls a trigger-next function that maps the new status to the correct downstream cron and fires it immediately through the OpenClaw API. The original 9 downstream crons were disabled but not deleted, preserving their configuration as a fallback.
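In outline, the trigger-next logic reduces to a small status-to-stage map. The following is an illustrative Python sketch, not the actual implementation; the status names and the `fire_cron` callback are stand-ins for the real state manager and the OpenClaw trigger API:

```python
# Hypothetical status -> next-stage map. Names are illustrative,
# not the pipeline's real identifiers.
NEXT_STAGE = {
    "researched": "write",
    "drafted": "self_review",
    "reviewed": "dedup_check",
    "dedup_checked": "art_direction",
    "art_directed": "improve_and_publish",
}

def trigger_next(item_id, new_status, fire_cron):
    """Map the item's new status to its downstream stage and fire it.

    `fire_cron` stands in for the orchestration platform's trigger API.
    Returns the stage that was fired, or None for terminal statuses.
    """
    stage = NEXT_STAGE.get(new_status)
    if stage is not None:
        fire_cron(stage, item_id)  # fire immediately, no scheduled slot
    return stage
```

A terminal status (such as a published item) simply maps to nothing, so the chain ends without any special-case logic.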

The Pattern, Generalized

Strip away the content pipeline specifics and you get a pattern that applies to any multi-stage AI workflow:

  • Keep only the entry point scheduled. The first stage in your pipeline, the one that picks up new work, runs on a cron. Everything else is reactive.
  • Make every downstream stage completion-triggered. When Stage N finishes, it fires Stage N+1. No polling, no waiting for a scheduled slot.
  • Add a recovery sweep on a longer interval. More on this below, but you need a safety net for stuck items.
  • Preserve original schedules as disabled fallback. If the trigger mechanism fails, you can re-enable the old crons in minutes.

This works anywhere execution time varies between runs:

  • RAG pipelines (ingest → chunk → embed → index): embedding time varies with document length and complexity.
  • Code review chains (lint → test → security scan → deploy): security scans vary enormously depending on codebase size and vulnerability count.
  • Data processing (extract → transform → validate → load): transform complexity varies by source format and data volume.
  • Customer support automation (classify → route → respond → audit): response generation time varies by ticket complexity and required tool calls.

[Diagram: a single scheduled entry-point node branching into completion-triggered downstream workflow streams]

Scheduling only makes sense at the boundary where new work enters the system. Inside the system, work should flow based on actual completion, not estimated timing.

Why You Still Need a Recovery Cron

Completion-triggered execution trades one problem for another. Scheduled crons waste time but never lose items. Completion triggers eliminate waste but introduce a new failure mode: if a trigger fails silently or a stage crashes without calling its successor, the item vanishes into a gap with nothing scheduled to pick it up.

The solution is a recovery cron that runs on a longer interval (in our case, every 2 hours). It scans the pipeline state for items whose last activity is more than 2 hours old, checks whether they’re in a triggerable state, and fires the appropriate downstream stage.
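A sketch of what that sweep might look like, assuming each item carries a status and a last-activity timestamp (field and status names here are illustrative, not the real schema):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=2)
# Statuses from which a downstream stage can be re-fired (illustrative names).
TRIGGERABLE = {"researched", "drafted", "reviewed", "dedup_checked", "art_directed"}

def find_stuck_items(items, now=None):
    """Return items idle for more than 2 hours in a triggerable state."""
    now = now or datetime.now(timezone.utc)
    return [
        item for item in items
        if item["status"] in TRIGGERABLE
        and now - item["last_activity"] > STALE_AFTER
    ]
```

Each item this returns would then be handed to the same trigger-next logic the stages themselves use, after the retry safeguards below have been applied.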

The tricky part is doom spiral protection. Without safeguards, the recovery cron can make things worse. An item fails because the art direction model is down. Recovery fires it again. It fails again. Recovery fires it again. Now you have an infinite retry loop burning tokens on an item that will never succeed until the underlying issue is fixed.

Our recovery system handles this with four rules:

  • Max 3 recovery attempts per item per day. After three tries, the item is flagged for human attention.
  • Skip non-retryable errors. If a stage failed due to content policy rejection, authentication failure, or model refusal, retrying won’t help. The recovery cron reads the error classification and skips these.
  • Cooldown period. Skip any item that was already attempted in the last 2 hours. This prevents rapid-fire retries when the underlying issue is transient but hasn’t resolved yet.
  • Escalation alert. When an item hits max attempts, the system sends a notification to the team. The item doesn’t disappear; it sits in a known state with a clear error trail.
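
The four rules above can be collapsed into a single guard function. This is a sketch: the field names and error classification values are hypothetical, but the thresholds match the ones described in the text:

```python
from datetime import datetime, timedelta, timezone

MAX_ATTEMPTS_PER_DAY = 3
COOLDOWN = timedelta(hours=2)
# Error classes where a retry cannot succeed (illustrative labels).
NON_RETRYABLE = {"content_policy", "auth_failure", "model_refusal"}

def should_retry(item, now=None):
    """Apply the doom-spiral rules; return (retry?, reason)."""
    now = now or datetime.now(timezone.utc)
    if item.get("error_class") in NON_RETRYABLE:
        return False, "non-retryable error"
    if item.get("attempts_today", 0) >= MAX_ATTEMPTS_PER_DAY:
        return False, "max attempts, escalate to a human"
    last = item.get("last_attempt")
    if last is not None and now - last < COOLDOWN:
        return False, "cooldown"
    return True, "ok"
```

The escalation alert would fire wherever the "max attempts" branch is hit, so the item never silently disappears.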

[Diagram: the main completion-triggered flow with a circular watchdog recovery cron monitoring for stuck stages]

Without doom spiral protection, a recovery system is just an automated way to compound failures. The protection logic is arguably more important than the recovery logic itself.

Trade-offs: What We Gained and What We Lost

[Diagram: before and after, 9 scheduled cron jobs with waiting gaps versus one scheduled entry point with completion-triggered downstream stages]

| Dimension | Scheduled (Before) | Completion-Triggered (After) |
| --- | --- | --- |
| End-to-end time | 6–12 hours | 2–3 hours |
| Window-miss rate | ~30% of items daily | 0% (no windows to miss) |
| Manual interventions | 1–2 per day | Rare (recovery catches most) |
| Predictability | High (items publish at known times) | Lower (items publish "when ready") |
| Trigger complexity | None (cron handles everything) | Moderate (stage-to-cron mapping, trigger API) |
| Failure visibility | Low (stuck items go unnoticed) | High (recovery cron detects and alerts) |
| Debugging | Check cron run history | Read pipeline activity log |
The advantages are clear: no more scheduling waste, self-healing for most failures, a simpler mental model (items flow through the pipeline instead of waiting for time slots), and natural backpressure (if Stage 3 is slow, Stage 4 doesn’t fire pointlessly).

The disadvantages are real too. The trigger logic adds a layer of complexity. The recovery cron becomes a critical dependency, essentially a new single point of failure. Timing becomes less predictable, which matters if downstream consumers expect content at specific hours. And debugging shifts from “check when the cron ran” to “trace the item through the activity log.”

For our use case, the trade-off was unambiguous. But a system that runs once daily on a 3-stage pipeline probably doesn’t need this. The pattern earns its complexity when stages take 15 to 60 minutes and you’re processing multiple items per day.

What We’d Do Differently

Three things we’d change if we were building this from scratch.

A proper dead-letter queue. Items that fail recovery 3 times currently sit in the pipeline with a “max attempts” flag. They’re visible, but they’re not in a distinct state that separates them from items still in progress. A dedicated dead-letter queue would give us a single place to look for items that need human attention, with full error context attached.

Trigger latency instrumentation. We know the trigger fires, and we know the downstream cron starts. We don’t measure the gap between them. If the orchestration platform introduces latency (queuing, rate limiting, cold starts), we’d want to see it in a dashboard rather than discovering it when end-to-end times creep up without explanation.

Circuit breakers per stage. Currently, each item is independent. If the art direction model goes down, each item discovers this separately and fails separately. A circuit breaker that detects “Stage 4 has failed 5 times in the last hour” and pauses all triggers to that stage would prevent wasted cycles and give clearer signal about systemic issues versus item-specific problems.
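A minimal sketch of such a per-stage breaker, using the thresholds from the example above (5 failures in the last hour). This is illustrative, not something our pipeline currently runs:

```python
from collections import deque
from datetime import datetime, timedelta, timezone

class StageCircuitBreaker:
    """Pause triggers to a stage after too many recent failures."""

    def __init__(self, max_failures=5, window=timedelta(hours=1)):
        self.max_failures = max_failures
        self.window = window
        self.failures = deque()  # timestamps, oldest first

    def record_failure(self, when=None):
        self.failures.append(when or datetime.now(timezone.utc))

    def is_open(self, now=None):
        """True if triggers to this stage should be paused."""
        now = now or datetime.now(timezone.utc)
        # Evict failures that have aged out of the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        return len(self.failures) >= self.max_failures
```

The trigger-next path would consult `is_open()` before firing a stage, and an open breaker becomes one clear systemic alert instead of many per-item failures.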

The Results

End-to-end pipeline time dropped from 6–12 hours under the scheduled system to 2–3 hours with completion triggers. Items missing their execution window went from roughly 30% per day to zero, because there are no windows to miss. Daily manual interventions dropped from a consistent 1–2 to the rare cases the recovery cron doesn't catch.

Throughput on healthy days is 2 to 3 published pieces. On days when a stage has issues, recovery keeps it at 1 to 2 rather than letting it drop to zero until someone notices the problem.

According to Atlan, research shows that event-driven architectures can reduce AI agent latency by 70 to 90% compared to polling approaches. Our experience lands within that range. The scheduled system wasn’t polling exactly, but the principle is the same: reacting to actual events beats waiting for scheduled check-ins.


Scheduling is a proxy for responsiveness. When you can replace the proxy with the real thing — reacting to actual completions instead of guessing when they might happen — the gains compound at every stage. For teams building managed autonomous AI agents, this distinction between scheduled and completion-triggered execution is one of the first architectural decisions that shapes everything downstream.

Frequently Asked Questions

Can I use this pattern with LangGraph, Airflow, or n8n?

Yes. The pattern is tool-agnostic. Any system with an API to trigger workflow runs works. LangGraph can call downstream graphs on completion. Airflow has TriggerDagRunOperator. n8n supports webhook triggers between workflows. The implementation details change; the architecture doesn’t.

What happens if the recovery cron itself fails?

It’s a single point of failure, and we acknowledge that. Health monitoring on the recovery cron itself is the practical answer. In our setup, a separate system-level check verifies that the recovery cron ran within the last 3 hours. If it didn’t, an alert fires. You’re watching the watcher, but the watcher is simple enough that it rarely fails.
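The check itself is trivial, which is the point. A sketch, assuming the monitor can read the recovery cron's last run time (the 3-hour window matches our setup):

```python
from datetime import datetime, timedelta, timezone

def recovery_cron_healthy(last_run, now=None, max_gap=timedelta(hours=3)):
    """True if the recovery cron ran within the allowed window.

    `last_run` is the recovery cron's most recent run timestamp,
    or None if it has never run.
    """
    now = now or datetime.now(timezone.utc)
    return last_run is not None and now - last_run <= max_gap
```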

Does this work for pipelines with branching logic?

Yes, but the trigger map gets more complex. Instead of a linear chain (Stage 1 → Stage 2 → Stage 3), you need a router stage that reads the item’s state and decides which branch to trigger. Conditional triggers are a natural extension of the pattern. The recovery cron needs to understand all branches, which is the main added complexity.
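A router stage might look like this sketch, where the branch conditions and stage names are purely illustrative:

```python
def route_next(item):
    """Hypothetical router: pick the next stage from the item's state."""
    if item.get("is_duplicate"):
        return "archive"           # duplicate branch ends the chain early
    if item.get("needs_art"):
        return "art_direction"     # visual branch
    return "improve_and_publish"   # default linear path
```

The recovery cron then has to treat every value this router can return as a valid triggerable state, which is where the added complexity lives.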

Won’t completion triggers fire too fast and overwhelm downstream stages?

Completion-triggered systems have natural backpressure built in. The next stage can’t start until the current one completes. There’s no scenario where 10 triggers fire simultaneously unless 10 items all complete their current stage at once. If a stage fails repeatedly, doom spiral protection (described above) prevents runaway retries. For additional safety, circuit breakers can pause triggers to a failing stage entirely.

How do you debug when something gets stuck?

Every trigger, recovery attempt, and error is logged in a pipeline activity log with timestamps. The debugging workflow is: find the item by ID, read its activity entries in chronological order, and identify where the chain broke. It’s more reading than the old system (where you’d just check “did the cron run?”), but it gives you a complete picture of what happened and why.
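The trace step can be as simple as filtering and sorting the activity log. A sketch, assuming log entries carry an item ID and a sortable timestamp (field names illustrative):

```python
def trace_item(log_entries, item_id):
    """Return one item's activity entries in chronological order."""
    return sorted(
        (e for e in log_entries if e["item_id"] == item_id),
        key=lambda e: e["timestamp"],
    )
```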

Is this overkill for a simple 2 to 3 stage pipeline?

Probably. If your stages complete in minutes and you run once per hour, cron scheduling is perfectly fine. The dead time between stages is small relative to the interval. This pattern starts earning its complexity when stages take 15 to 60 minutes, you process multiple items daily, and the accumulated dead time across stages becomes a significant fraction of your total pipeline time.
