454 Autonomous Tasks Later: The Data on What Actually Works

#ai #agents #autonomous #data

After nine months of running autonomous task fleets, I analyzed 454+ completion artifacts and found something that surprised me: task duration predicts success better than complexity, priority, or tooling.

The Numbers That Changed How I Work

Task Duration	Success Rate
15-45 minutes	92%
2+ hours	33%

The gap is brutal. Tasks that fit in a lunch break succeed nearly three times as often (92% vs. 33%) as afternoon-long endeavors.
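For anyone who wants to run the same cut on their own fleet, here is a minimal sketch of the bucketing. The record shape (duration in minutes plus a success flag) and the function name are my assumptions for illustration, and the sample records are made up; they are not the 454 artifacts.

```python
from collections import defaultdict

# Bucket edges mirror the table above; None means open-ended.
BUCKETS = [("15-45 minutes", 15, 45), ("2+ hours", 120, None)]

def success_rate_by_bucket(artifacts, buckets=BUCKETS):
    """Return {bucket_label: success_rate} for (duration_minutes, succeeded) pairs.

    Artifacts that fall outside every bucket are ignored.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [successes, total]
    for duration, succeeded in artifacts:
        for label, low, high in buckets:
            if duration >= low and (high is None or duration <= high):
                counts[label][0] += int(succeeded)
                counts[label][1] += 1
                break
    return {label: s / t for label, (s, t) in counts.items() if t}

# Hypothetical sample records, NOT the article's dataset.
sample = [(20, True), (30, True), (40, False),
          (150, False), (180, True), (240, False)]
rates = success_rate_by_bucket(sample)
```

The useful part is less the arithmetic than the habit: if duration is not recorded per artifact, this cut is impossible to make later.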

Why Shorter Tasks Win

Failure mode #1: Context compaction
Every long-running task risks hitting context window limits. When that happens, you don't just lose data—you lose the thread.

Failure mode #2: External dependency drift
The longer a task runs, the more likely something external changes: API rate limits, session timeouts, package versions.

Failure mode #3: Scope creep
"Just one more thing" compounds over hours. One 2-hour task with three "small" features turned out to contain 6-8 logical tasks.

What 92% Success Looks Like

Single-threaded: One clear outcome, maximum one delegation
Scope-guarded: Explicit "out of scope" boundaries
Idempotent: Can safely resume without corruption
Tool-limited: Uses 1-2 skills, not dependency chains
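The four traits above can be checked mechanically before a task is enqueued. This is a rough sketch, not my actual fleet code: the TaskSpec fields and the check_short_task function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskSpec:
    """Hypothetical task description; field names are assumptions."""
    outcome: str
    delegations: int = 0
    out_of_scope: List[str] = field(default_factory=list)
    idempotent: bool = False
    skills: List[str] = field(default_factory=list)

def check_short_task(spec: TaskSpec) -> List[str]:
    """Return the list of traits the spec violates (empty means it passes)."""
    problems = []
    if not spec.outcome.strip():
        problems.append("single-threaded: no clear outcome stated")
    if spec.delegations > 1:
        problems.append("single-threaded: more than one delegation")
    if not spec.out_of_scope:
        problems.append("scope-guarded: no explicit out-of-scope list")
    if not spec.idempotent:
        problems.append("idempotent: not marked safe to resume")
    if not 1 <= len(spec.skills) <= 2:
        problems.append("tool-limited: expected 1-2 skills")
    return problems

good = TaskSpec(
    outcome="Publish weekly report",
    delegations=1,
    out_of_scope=["redesigning the template"],
    idempotent=True,
    skills=["fetch", "render"],
)
bad = TaskSpec(outcome="Fix everything", delegations=3, skills=["a", "b", "c"])
```

A check like this only verifies that the fields are filled in; whether the out-of-scope list is honest is still a human judgment.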

The 33% Isn't Useless

Long tasks that succeed share three traits:

Checkpoint-heavy: Write recovery state every 10 minutes
External-state aware: Check world state before major operations
Human-handoff ready: Predefined pause points
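The checkpoint-heavy pattern might look something like the sketch below. It assumes a task can be expressed as a list of callables and that state fits in a small JSON file; the file layout, function names, and the "next_step" key are all illustrative assumptions rather than a description of my setup.

```python
import json
import os
import tempfile
import time

CHECKPOINT_INTERVAL_S = 10 * 60  # "every 10 minutes"

def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a crash never leaves a half-written checkpoint

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0}

def run(steps, path, interval_s=CHECKPOINT_INTERVAL_S, clock=time.monotonic):
    """Run steps in order, resuming from the last checkpoint.

    Steps finished after the last write are re-run on resume,
    so each step must be idempotent.
    """
    state = load_checkpoint(path)
    last_write = clock()
    for i in range(state["next_step"], len(steps)):
        steps[i]()
        state["next_step"] = i + 1
        if clock() - last_write >= interval_s:
            save_checkpoint(path, state)
            last_write = clock()
    save_checkpoint(path, state)  # final write on clean completion
    return state
```

Note how the checkpoint interval and idempotency depend on each other: a time-based write means some completed steps can be repeated after a crash, which is only safe if repeating them does no harm.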

Practical Changes I Made

Decompose by default: Tasks >45 minutes get split before enqueueing
Recovery checkpoints: Every 10 minutes of execution gets a state write
Tool minimization: Prefer simpler skills over complex chains
Bounded retries: Short tasks get 3 retries; long tasks get 1
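The split rule and the retry budget are simple enough to sketch in a few lines. One assumption to flag loudly: I am reusing the 45-minute split threshold as the short/long boundary for retries, which the list above does not spell out. The function names are illustrative.

```python
SPLIT_THRESHOLD_MIN = 45  # tasks estimated above this get split first

def needs_split(estimated_minutes):
    """True if the task should be decomposed before enqueueing."""
    return estimated_minutes > SPLIT_THRESHOLD_MIN

def retry_budget(estimated_minutes):
    """3 retries for short tasks, 1 for long ones.

    ASSUMPTION: "short" means at or under the split threshold.
    """
    return 3 if estimated_minutes <= SPLIT_THRESHOLD_MIN else 1

def run_with_retries(task, estimated_minutes):
    """Call task() once, plus up to retry_budget() more times on failure."""
    retries = retry_budget(estimated_minutes)
    last_error = None
    for _attempt in range(1 + retries):
        try:
            return task()
        except Exception as exc:  # a real fleet would catch narrower errors
            last_error = exc
    raise last_error
```

The asymmetry is deliberate: retrying a 20-minute task three times costs about an hour at worst, while retrying a 2-hour task three times can burn most of a day.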

Result: Fleet success rate climbed from ~67% to 89%.

For Agent Builders

Design for interruption. Context windows will compact. APIs will time out.
Measure duration, not just completion.
Bias toward smaller. When in doubt, cut the task in half.
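Measuring duration alongside completion can be as small as a context manager around each run. This is a sketch under assumptions: the record_run name, the in-memory list used as a log, and the entry fields are hypothetical stand-ins for whatever artifact store a fleet already has.

```python
import time
from contextlib import contextmanager

@contextmanager
def record_run(log, task_id, clock=time.monotonic):
    """Append {"task_id", "succeeded", "duration_s"} to log when the block exits.

    Exceptions still propagate; the entry is recorded either way.
    """
    start = clock()
    entry = {"task_id": task_id, "succeeded": False}
    try:
        yield entry
        entry["succeeded"] = True  # only reached if the block did not raise
    finally:
        entry["duration_s"] = clock() - start
        log.append(entry)

# Usage: wrap the task body.
runs = []
with record_run(runs, "example-task"):
    pass  # the task's work would go here
```

Because failed runs are logged with their duration too, the success-by-duration comparison falls out of the data instead of needing a separate audit.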

The operators who respect the constraints—context limits, external dependencies, scope drift—build fleets that actually ship.

Data from 454+ completion artifacts. Posted March 19, 2026.

DEV Community