I built a delegation system that spawns AI agents to handle sub-tasks in parallel. Quality sweeps. Code audits. Checking every SDK directory for dead links. The idea: spin up cheap local agents, let them work, collect results.
They kept failing. Not crashing — just stopping. No output. No error. 600 seconds of silence, then a timeout.
I assumed the tasks were too complex. I assumed parallel delegation was unreliable. I never checked what model I was actually giving them.
The Root Cause
My delegation system was configured to use a small local model. Fine for single-turn questions. Useless for multi-step tool loops.
A quality sweep isn't one tool call. It's five: find the directory, list the files, search each one, flag issues, report results. Each step depends on the last. The small model lost coherence after the second call: the first step worked, but by the third it was hallucinating or hanging.
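Concretely, here's a minimal sketch of the loop a subagent has to survive. The client API is hypothetical (`model.chat`, the `reply.tool_call` shape, and the tool registry are stand-ins, not any specific SDK), but the structure is the point: each result feeds the next call.

```python
def run_tool_loop(model, task: str, tools: dict, max_steps: int = 10):
    """Drive a model through a sequence of dependent tool calls."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # Hypothetical client API: the model either requests a tool
        # or returns a final answer.
        reply = model.chat(messages, tools=list(tools))
        if reply.tool_call is None:
            return reply.content  # final report: the loop completed
        # One incoherent step here derails everything downstream.
        result = tools[reply.tool_call.name](**reply.tool_call.args)
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("model never finished the loop")
```

A quality sweep needs five clean trips around that loop. The small model couldn't make three.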
Meanwhile, the main agent handled the exact same tasks in minutes. Same instructions. Different model.
What I Assumed
I assumed any model that passes benchmarks can handle tool-calling. I assumed "cheap model for leaf tasks" was an optimization. I assumed if a model could answer a question correctly, it could execute a sequence of tool calls correctly.
Benchmarks measure knowledge. They don't measure whether a model can hold context across five sequential tool calls. Single-turn accuracy and agentic reliability are different things entirely.
What I No Longer Assume
I now test every model on a concrete multi-step task before adding it to the delegation pool: find a directory, search for a pattern, read the matching file, report what you found. If it can't complete that loop, it doesn't get delegated work.
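A sketch of that probe, reusing the `run_tool_loop` sketch above. The task wording, step budget, and time budget are my own choices, not a standard:

```python
import time

PROBE_TASK = (
    "Find the docs directory, search it for the pattern 'TODO', "
    "read the first file that matches, and report what you found."
)

def passes_probe(model, tools: dict, budget_s: float = 120.0) -> bool:
    """Admit a model to the delegation pool only if it closes the loop."""
    start = time.monotonic()
    try:
        report = run_tool_loop(model, PROBE_TASK, tools, max_steps=8)
    except Exception:
        return False  # hung on a step, errored, or ran out of steps
    # Completing the chain at all is the bar; grading answer quality
    # is a separate problem.
    return bool(report) and time.monotonic() - start < budget_s
```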
I also built a decision gate that evaluates task complexity against model capability before spawning a subagent. If the task requires three or more sequential tool calls and the target model has known reliability issues, it reroutes to a more capable model or handles the work inline. Better to burn a few extra tokens on a capable model than to wait ten minutes for nothing.
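A sketch of that gate, assuming each model carries a profile built from its probe results. The model names and step counts below are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelProfile:
    name: str
    max_reliable_steps: int  # longest tool chain it passed during probing

def route_task(task_steps: int, target: ModelProfile,
               fallback: ModelProfile) -> Optional[ModelProfile]:
    """Pick a model before spawning; None means do the work inline."""
    for model in (target, fallback):
        if task_steps <= model.max_reliable_steps:
            return model
    return None  # nothing in the pool is trusted: orchestrator handles it

# A five-step quality sweep reroutes past the small local model:
small = ModelProfile("local-3b", max_reliable_steps=2)
capable = ModelProfile("hosted-70b", max_reliable_steps=10)
assert route_task(5, small, capable) is capable
```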
What You Should Check
If you're building systems that delegate work between agents:
- Test subagent models on multi-step tool loops, not just benchmarks. Give them a real sequence of dependent calls. If they fail by step three, they're not ready for autonomous work.
- Gate delegation before it starts, not after it times out. A decision layer that checks task complexity against model capability catches failures before they become silent timeouts.
- Parallel delegation to weak models isn't faster — it's ten minutes of silence instead of two minutes of work. Before spawning subagents, ask: can the orchestrator just do this?
Both checks are open source in the agent-foundry repo. No promises about what breaks next — but something will.