Ian

Posted on May 20 • Originally published at github.com

I spawned 25 Claude Code subagents in one night. Here's what I learned.

#claudecode #ai #multiagent #productivity

I gave myself $1,000 and 24 hours to ship something live. By hour 8 I had spawned roughly 25 Claude Code subagents in parallel, built 37 Apify Actors, and pushed all of them into Apify's publish pipeline. As of this morning, 5 are LIVE on the Apify Store; the other 31 are sitting in a queue waiting for Apify's 5-actors-per-day publishing quota to drip them out over the next week.

This is the postmortem. Real numbers, real prompts, real failures. No "10x productivity" framing — just what worked and what didn't.

What got built

37 Apify Actors total — each one is a src/main.js, a .actor/input_schema.json, a .actor/dataset_schema.json, an actor.json, a README, and an apify push to the platform.
5 LIVE as of this writing (apify.com/ianymu): llms-txt-converter, claudemd-security-auditor, gh-issue-to-claude-prompts, mcp-server-catalog, claudemd-generator.
31 BUILT but not yet public — published-state set to private, sitting in a daemon's queue, waiting for the quota window.
~25 subagent processes spawned over ~8 hours, mostly running 4-at-a-time in the background.

The Actors themselves aren't the interesting part. The interesting part is how a single human in the loop can keep 25 background processes from drifting into garbage.

The four things that worked

1. Constrained prompts, not vague ones

Every subagent got a prompt of about 200-300 words with explicit, non-negotiable constraints. Here's a redacted skeleton of what I actually sent (this one was for the mcp-server-catalog actor):

You are building one Apify Actor: `mcp-server-catalog`.

Constraints:
- src/main.js uses the apify-client Actor.init() pattern. ESM.
- Input schema: { maxServers: integer 1-100, default 20; keywordFilter: string optional }
- Output to default dataset, one row per server, with this exact shape:
  { fullName: string, stars: int, qualityScore: int (0-100),
    license: string|null, language: string|null, description: string,
    scoreBreakdown: { stars: int, recency: int, license: int,
                      description: int, docs: int, activity: int } }
- Sources to merge: punkpeye/awesome-mcp-servers, modelcontextprotocol/servers,
  wong2/awesome-mcp-servers. Dedupe by fullName.
- README.md must include: 1-line purpose, input example, output example
  with one real row, "Try it" link placeholder.
- Run `apify push` at the end. Do not run `apify call` (costs money).
- Do NOT add tests, CI, TypeScript, eslint, or extra files I didn't ask for.
- Done = you can show me the actor page URL and one sample dataset row.

The constraints I learned to put in writing, one by one, as earlier subagents broke them:

"Do NOT add TypeScript" — one drifted into a tsconfig.json and a half-converted .ts file. Cost 20 minutes to clean up.
"Do NOT run apify call" — one happily burned ~$0.30 of platform credit running its own actor to "verify it works." It did work. That wasn't the point.
"Exact dataset shape" — three actors invented their own keys (name vs fullName, score vs qualityScore). Made the downstream comparison spreadsheet useless until I refactored.

Vague prompts produce vague output, every time. A subagent that's free to interpret will interpret in whatever direction lets it finish faster.

2. `run_in_background: true` was the unlock

The default in the Agent tool is foreground — you wait for the subagent to finish before the next tool call returns. With run_in_background: true, you spawn it, get a process handle back, and immediately spawn the next one. Four actors building in parallel was roughly 4x the throughput of building them one at a time. Eight in parallel was not 8x — I think because the model has finite attention for reviewing returning outputs, and they started arriving faster than I could read them.

The sweet spot in this run was four parallel subagents. Past that, I started missing drift signals.

3. Self-correction when the parent reviewed

A handful of subagents handed back output that didn't match the spec — wrong dataset shape, an extra tests/ directory I'd told them not to create, a package.json with dependencies I hadn't listed. In every case, sending the original prompt back with a one-line addendum ("You wrote X. The spec says Y. Fix it.") got a correct second pass in under a minute. Subagents don't argue.

What does NOT work: trying to debug what they did wrong. Just re-state the spec.

4. The daily quota forced a different design

Apify lets free accounts publish 5 actors per day to the public store. Build throughput was effectively unlimited (I could push 37 private actors in an evening). Publish throughput was hard-capped at 5/day.

The naive flow — "build it, immediately publish it, move on" — broke at actor #6. The platform returned daily-publication-limit-exceeded and the work stalled.

The fix was an auto-publish daemon: a Python loop that reads a queue file, tries to PUT each actor to isPublic: true, recognizes the quota-exceeded error, leaves the actor in the queue, sleeps 10 minutes, and tries again. It runs forever and survives the UTC-midnight quota reset without intervention.

The core of it (real code, slightly trimmed):

QUOTA_MARKERS = (
    "daily-publication-limit-exceeded",
    "daily publication limit",
    "publication-limit",
)

def try_publish(actor_id: str, token: str) -> str:
    actor = get_actor(actor_id, token)
    if actor is None:
        return "error"
    if actor.get("isPublic") is True:
        return "already_public"

    body = {
        "isPublic": True,
        "categories": ["AI", "DEVELOPER_TOOLS"],
        **derive_seo(actor),
    }
    status, payload = put_actor(actor_id, body, token)
    if 200 <= status < 300:
        return "published"

    blob = json.dumps(payload).lower()
    if any(m in blob for m in QUOTA_MARKERS):
        return "quota"
    return "error"

while True:
    queue = read_queue()
    keep = []
    for actor_id in queue:
        result = try_publish(actor_id, token)
        if result not in ("published", "already_public"):
            keep.append(actor_id)
    write_queue(keep)
    time.sleep(600)

That's the whole pattern. Rate-limited API + retry loop + persistent queue file. Nothing clever. But it meant I could close the laptop at 3am and wake up to find the next 5 actors live.

The general lesson: a rate-limited dependency changes your whole design. The build pipeline and the publish pipeline have to be decoupled — they can't share a process, because one of them runs at human-typing speed and the other runs at platform-quota speed.

The two things that didn't work

Subagents drifting from spec

Three out of ~25 subagents went off-script in a way I didn't catch until reviewing the output. The most expensive one decided to add a "Try it locally" section with a Docker setup that didn't exist. It looked plausible. It would have shipped if I hadn't randomly opened that README.

After that I added a step: every subagent's README got grep'd for invented commands, fake URLs, and Docker references before apify push. Two more were caught that way.

The pattern: subagents fabricate when the spec has a gap. Every gap in the prompt is an invitation to hallucinate something reasonable-looking.

The TODO.md file beat my memory, badly

I kept a TODO.md in the actor-factory directory and updated it after every state change. Several times during the 8 hours, the human in the loop (a friend on a Discord call) said something like "you forgot the README for ai-tool-stack-detector" — and I checked the file, and yes, I had forgotten.

The file was right. My working memory across 25 subagents was not.

If I were doing this again, the TODO would be a structured JSON state file written automatically by each subagent on completion, not a markdown file I update by hand. But even the hand-updated markdown beat trying to remember.

The false positive that saved my reputation

One of the actors I built is claudemd-security-auditor — it scans GitHub repos for dangerous patterns in CLAUDE.md files and .claude/hooks/* scripts. I ran it against three repos to dogfood it. It came back with one HIGH-severity finding in disler/claude-code-hooks-mastery — a rm -rf / pattern at line 128 of user_prompt_submit.py.

My first instinct was to file a GitHub issue against the repo. That would have been embarrassing.

Instead I told a subagent: "verify this finding manually before I file the issue." The subagent opened the file, read 10 lines of context, and reported back:

blocked_patterns = [
    # Add any patterns you want to block
    # Example: ('rm -rf /', 'Dangerous command detected'),
]

The rm -rf / was inside a Python comment. It was an example of what TO block, not an actual command. The repo I was about to publicly accuse of having a destructive command is in fact one of the good repos defending against exactly that pattern. (Their sibling pre_tool_use.py actively blocks rm -rf with exit code 2.)

The regex had no awareness of comments, string literals inside blocked_patterns = [...], or markdown fences. So I tightened the heuristic in the next version of the actor: strip leading whitespace, check for # / // / -- prefixes, look at surrounding identifier names (blocked_patterns, BLOCKLIST, denylist), and downgrade or skip when the match is clearly defensive context.

The lesson: confidence is cheap. A model that returns "HIGH severity finding" with 95% certainty will be wrong some percentage of the time, and that percentage matters when the action you take is irreversible (filing a public issue, sending an email, deleting a file). Build a verify step. Make it specific. I wrote the longer version of this story as my second dev.to post — link at the end.

Tiny details that compounded

Hand-curated stickers beat AI-narrator stickers. The factory ran on a public livestream URL. Without sticker_keys on each event, the feed was a wall of text. Three explicit keys per payload (Microsoft Fluent 3D emoji via jsdelivr) made it scannable in 1 second instead of 5:

data = json.dumps({
    "sticker_keys": ["package", "globe-network", "sparkles"],
    "actor_id": actor_id,
    "actor_name": actor_name,
})

The event logger had a 3-part --detail field: what happened, why it mattered, what's next. Without that structure, events read like a log file. With it, they read like a PM update.
Cross-audit before any outreach. I nearly emailed the same person twice across two lists. Every batch send now reads prior sent*.csv files and refuses any address that appeared within the last 7 days. Stupidly simple, prevents stupid damage.
2GB memory beats 4GB on Apify free tier. Default is 4GB. 5 actors at 4GB = 20GB requested; free tier ceiling is 8GB. Workaround: ?memory=2048 on every run URL. Every actor I built runs fine on 2GB.

What I'd give credit to

Anthropic's subagent design is doing more work here than it gets credit for. Specifically:

The Agent tool's description + subagent_type separation lets the parent stay coherent while the subagent burns context on a narrow task.
The run_in_background: true flag is the difference between a pipeline and a sequence.
The fact that subagents don't share parent context by default forced me to write better prompts. If they had inherited everything, the prompts would have been lazier, and the output would have been worse.

This wasn't an "AI did it all" night. It was a "AI did the typing, the human did the framing" night. The 8 hours of focused review and prompt-tightening were necessary. The 25 subagents made the output volume possible. Neither side substitutes for the other.

What's live and where the artifacts are

The 5 Actors currently public on the Apify Store: apify.com/ianymu. The other 31 are dripping out at 5/day as the quota allows. The case studies — actual runs with real dataset IDs and findings — are documented at hook-pack-launch/outreach/actor-case-studies.md in the repo and reproducible from the public actor pages.

DEV Community

I spawned 25 Claude Code subagents in one night. Here's what I learned.

What got built

The four things that worked

1. Constrained prompts, not vague ones

2. `run_in_background: true` was the unlock

3. Self-correction when the parent reviewed

4. The daily quota forced a different design

The two things that didn't work

Subagents drifting from spec

The TODO.md file beat my memory, badly

The false positive that saved my reputation

Tiny details that compounded

What I'd give credit to

What's live and where the artifacts are

More reading

Top comments (0)

What got built

The four things that worked

1. Constrained prompts, not vague ones

2. run_in_background: true was the unlock

3. Self-correction when the parent reviewed

4. The daily quota forced a different design

The two things that didn't work

Subagents drifting from spec

The TODO.md file beat my memory, badly

The false positive that saved my reputation

Tiny details that compounded

What I'd give credit to

What's live and where the artifacts are

More reading

2. `run_in_background: true` was the unlock