Ranjith Kumar

Posted on May 22 • Edited on Jun 1 • Originally published at Medium

From Routines to a Crew: Building a System That Plans Its Own Work and Executes It

#ai #productivity #automation #software

The Gap

In Part 1, I built a routine board, a system that runs Claude on a schedule, defined as Markdown files, backed by a Rust engine with cron scheduling, crash recovery, and a self-contained dashboard. It works well for what it does.

But real work isn't a cron job.

Consider this: you need to audit all the places a deprecated API is referenced across a large codebase. That means searching multiple code areas, cross-referencing findings, identifying which references are active vs. dead code, and producing a prioritized cleanup plan. No single Claude session handles this well. The context is too broad, the work needs decomposition, and some pieces depend on others.

That’s the difference between “task execution” and “work management.” Execution is running a prompt. Work management is deciding what to run, in what order, with what context, and what to do when something fails.

The Build Sprint

I built the entire system, from nothing to a multi-persona task engine with a dashboard, in 3 days of evenings.

Day 1 was intense: core orchestrator (Phase 0.5), a full Rust dashboard with CRUD operations (Phase 0.75), and a round of silent-failure bug fixes (Phase 0.9), all in one session. Day 3 was hardening: task creation, editing, and deletion from the dashboard, POSIX file locking for concurrency safety, and launchd scheduling (Phase 1), followed by the big architectural addition, planner/worker decomposition with a Worker Hive visualization (Phase 2).

The stack: Python for the orchestrator (subprocess management, YAML parsing, straightforward scripting), Rust for the dashboard (HTTP server, real-time worker status, the same "single binary" philosophy from the routine engine).

The recursive part: Claude helped build the system that orchestrates Claude. The design document, the orchestrator code, the Rust dashboard: all built with Claude as a pair programmer. I was designing a system for autonomous AI work while doing autonomous AI work. It's turtles all the way down.

The Task Schema and Activity Log

Every task lives in a YAML file with a rich schema:

id: "TASK-001"
name: "Audit all deprecated API references"
description: |
  Search codebase for deprecated API references.
  Check related tasks for migration status.
priority: P0
size: L
status: open
type: investigation
requires_human: review
human_loop_mode: blocking
dependencies: []
sub_tasks: []
activity_log: []

Priority (P0–P3), size (XS–XL), status, dependencies, human intervention config, the usual project management primitives. But the real innovation is the activity log.

Every action on a task gets timestamped with who did it (which persona), what they did, and what they found:

activity_log:
  - ts: "2026-03-02T22:37:29"
    persona: worker
    action: picked_up
    detail: "Selected as highest priority unblocked task"
  - ts: "2026-03-02T22:42:10"
    persona: worker
    action: completed
    detail: "Research complete. Found 27 references across 6 categories."

This is the system's memory and it's surprisingly powerful for how simple it is. When a task fails and retries, the retrying worker sees what was already attempted and tries a different approach. When a planner decomposes a task, it reads the log to understand context. When I look at a task at the end of the day, I can trace exactly what happened, which tools were used, what was found, what failed, without reading pages of raw Claude output.

The action types tell the story: picked_up, progress, planned, failed, retry, blocked, human_requested, human_responded, completed. You can scan a task's log and understand its entire lifecycle in seconds. It's the simplest possible implementation of agent memory, and it carries surprisingly far.
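A minimal sketch of how such a log could be appended to and then fed back into a retry prompt. The helper names `log_activity` and `summarize_attempts` are hypothetical, and the "failed" detail text is invented for illustration; only the field names (`ts`, `persona`, `action`, `detail`) and the action types come from the schema above.

```python
from datetime import datetime


def log_activity(task, persona, action, detail, now=None):
    """Append one timestamped entry to the task's activity_log (hypothetical helper)."""
    ts = (now or datetime.now()).replace(microsecond=0).isoformat()
    entry = {"ts": ts, "persona": persona, "action": action, "detail": detail}
    task.setdefault("activity_log", []).append(entry)
    return entry


def summarize_attempts(task):
    """Render earlier failed or retried attempts as plain text for a retry prompt."""
    lines = [
        f"- {e['ts']} [{e['persona']}] {e['action']}: {e['detail']}"
        for e in task.get("activity_log", [])
        if e["action"] in {"failed", "retry"}
    ]
    return "\n".join(lines) if lines else "No previous attempts."


task = {"id": "TASK-001", "activity_log": []}
log_activity(task, "worker", "picked_up",
             "Selected as highest priority unblocked task",
             now=datetime(2026, 3, 2, 22, 37, 29))
# Illustrative failure detail, not real output from the system.
log_activity(task, "worker", "failed",
             "Search timed out on vendored code",
             now=datetime(2026, 3, 2, 22, 40, 0))
print(summarize_attempts(task))
```

The retrying worker would receive the summary as part of its prompt, which is what lets it try a different approach instead of repeating the failed one.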

┌──────────┐
│   Open   │
└────┬─────┘
     │
┌────▼─────┐
│ Planning  │◄──────────────────────┐
└────┬─────┘                        │
     │                              │
┌────▼──────┐                       │
│ In Progress├──────────────────────┘
└────┬──────┘   (worker rejects plan → re-plan)
     │
┌────▼─────┐
│ Blocked   │  (needs human input or dependency)
└────┬─────┘
     │
┌────▼─────┐     ┌──────────────┐
│   Done   ├────►│ Spawn Follow-│
└──────────┘     │ up Tasks     │
                 └──────────────┘

Human-in-the-Loop as a Dial

One of the most useful design decisions was making human involvement a dial, not a switch. It's a small matrix: four requires_human levels crossed with two modes, blocking and non_blocking:

requires_human    blocking                   non_blocking
none              fully autonomous           fully autonomous
review            pauses before close        closes, sends summary
intervention      pauses at checkpoints      continues with best guess
approval          waits for plan sign-off    n/a

A P0 investigation should pause for human review — the stakes are too high for full autonomy. A P3 documentation task can run end-to-end without anyone looking at it. A task that needs plan approval waits after the planner proposes sub-tasks, showing them in the dashboard with Approve/Reject buttons.

Different tasks need different autonomy levels, and the system supports that as a per-task configuration rather than a global setting. In practice, I found that most tasks start as requires_human: none (fully autonomous) and I only add friction for high-stakes work. The default is trust, with guardrails where they matter.
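One way to picture the dial in code is a plain lookup keyed on the two task fields. This is a sketch, not the orchestrator's actual implementation: the names `BEHAVIOR` and `human_loop_behavior` are hypothetical, and the `blocking` default for a missing `human_loop_mode` is an assumption.

```python
# Behavior for each (requires_human, human_loop_mode) pair, copied from the table above.
# The dict and function names are illustrative assumptions.
BEHAVIOR = {
    ("none", "blocking"): "fully autonomous",
    ("none", "non_blocking"): "fully autonomous",
    ("review", "blocking"): "pauses before close",
    ("review", "non_blocking"): "closes, sends summary",
    ("intervention", "blocking"): "pauses at checkpoints",
    ("intervention", "non_blocking"): "continues with best guess",
    ("approval", "blocking"): "waits for plan sign-off",
    # ("approval", "non_blocking") has no entry in the table, so it is rejected below.
}


def human_loop_behavior(task):
    """Look up how much human involvement a task has asked for."""
    level = task.get("requires_human", "none")       # default: trust
    mode = task.get("human_loop_mode", "blocking")   # assumed default
    try:
        return BEHAVIOR[(level, mode)]
    except KeyError:
        raise ValueError(f"unsupported combination: {level}/{mode}") from None


print(human_loop_behavior({"requires_human": "review", "human_loop_mode": "blocking"}))
```

Because the setting lives on the task, raising the friction for one high-stakes task does not touch any other task.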

The Bugs That Taught Me

The most instructive bugs were all variations on the same theme: silent failure.

Empty output as success. Workers were returning exit code 0 with empty stdout; they'd gotten stuck on a permission prompt and hung until timeout. The orchestrator saw exit code 0 and marked the task as done. Fix: treat empty output as failure. A single if not output: check that routes through the failure handler:

if not output:
    msg = "Worker returned exit code 0 but produced no output"
    return False, msg
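For context, here is a sketch of where that check could sit in a worker runner. The function name `run_worker`, the 600-second default, and the other failure messages are assumptions; the stand-in commands use the Python interpreter rather than the Claude CLI so the example runs anywhere.

```python
import subprocess
import sys


def run_worker(cmd, timeout=600):
    """Run a worker command; return (ok, output_or_error_message)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, f"Worker timed out after {timeout}s"
    if proc.returncode != 0:
        return False, f"Worker exited with code {proc.returncode}"
    output = proc.stdout.strip()
    if not output:
        # Exit code 0 alone is not proof of success.
        return False, "Worker returned exit code 0 but produced no output"
    return True, output


# Stand-in commands; a real call would invoke the worker CLI instead.
print(run_worker([sys.executable, "-c", "pass"]))
print(run_worker([sys.executable, "-c", "print('done')"]))
```

Returning the same `(ok, message)` shape for every failure keeps a single failure handler responsible for logging and retries.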

Timeout gap in loop mode. The continuous mode spawned workers as background processes and polled for completion, but it wasn't tracking when each worker started. Workers could run forever, accumulating memory and burning API credits. Fix: track spawn_time per worker in the PID file (later enriched to full JSON metadata with persona and start time), check elapsed time each poll cycle, proc.kill() overdue workers.
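The elapsed-time check can be sketched as a pure function over the per-worker metadata. The name `find_overdue`, the dict layout, and the 600-second limit are assumptions; only `spawn_time` and the persona field are taken from the description above.

```python
import time


def find_overdue(workers, timeout_s, now=None):
    """Return PIDs whose elapsed run time exceeds timeout_s.

    `workers` maps pid -> metadata dict holding a `spawn_time` epoch timestamp
    (layout assumed for this sketch).
    """
    now = time.time() if now is None else now
    return [pid for pid, meta in workers.items()
            if now - meta["spawn_time"] > timeout_s]


workers = {
    101: {"persona": "worker", "spawn_time": 1000.0},
    102: {"persona": "planner", "spawn_time": 1900.0},
}
overdue = find_overdue(workers, timeout_s=600, now=2000.0)
print(overdue)  # [101]
```

In the poll loop, each PID returned here would have its process killed and a failure recorded, so no worker can outlive the limit unnoticed.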

Lock contention silence. When the orchestrator tried to run but another instance already held the POSIX flock, it would silently exit. No log entry, no notification, nothing. From the outside, it looked like the system had stopped working: you'd check the schedule and see it should have run, but there was no evidence it had even tried. Fix: write a "Skipped:lock held" entry to the run log before exiting.
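A sketch of the non-silent version, using Python's `fcntl.flock` with a non-blocking exclusive lock. The function name `acquire_or_log` and the file names are hypothetical, and this only runs on POSIX systems.

```python
import fcntl
import os
import tempfile
from datetime import datetime


def acquire_or_log(lock_path, run_log_path):
    """Try a non-blocking exclusive flock; record the skip instead of exiting silently.

    Returns the open lock file on success (keep it open to hold the lock),
    or None if another holder already has it.
    """
    lock_file = open(lock_path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        with open(run_log_path, "a") as log:
            log.write(f"{datetime.now().isoformat()} Skipped:lock held\n")
        return None
    return lock_file


tmp = tempfile.mkdtemp()
lock_path = os.path.join(tmp, "orchestrator.lock")
log_path = os.path.join(tmp, "run.log")
first = acquire_or_log(lock_path, log_path)   # acquires the lock
second = acquire_or_log(lock_path, log_path)  # contends, logs the skip, returns None
print(first is not None, second is None)
```

The skip entry is what turns "did it even try?" into a question the run log can answer.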

The meta-lesson: autonomous systems fail silently by default. You have to instrument every exit path, every edge case, every "this shouldn't happen" branch. If a human isn't watching, nobody is, unless you build the observability in.

Phase 2: Planners and Workers

The architecture shift in Phase 2 was routing tasks through different personas based on their size. Tasks sized M, L, or XL go through a "planner" that decomposes them into smaller sub-tasks. XS and S tasks go directly to workers, backward compatible with Phase 1 behavior.

The routing logic is remarkably compact, three checks that determine the entire system's behavior:

def needs_planning(task):
    if task.get("size") not in {"M", "L", "XL"}: return False
    if task.get("parent_task"): return False  # sub-tasks skip planning
    if task.get("sub_tasks"): return False     # already planned
    return True

Big enough? Not a child of another task? Not already planned? Send it to the planner. Everything else goes to a worker. That's the entire routing layer.
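To show how the routing check could drive persona selection, here is a small sketch. `needs_planning` is repeated from above so the block is self-contained; `PROMPTS`, `pick_persona`, and `build_prompt` are hypothetical names, and the template wording is a placeholder, not the real prompt text.

```python
def needs_planning(task):
    if task.get("size") not in {"M", "L", "XL"}:
        return False
    if task.get("parent_task"):
        return False  # sub-tasks skip planning
    if task.get("sub_tasks"):
        return False  # already planned
    return True


# Placeholder templates; the real prompt wording is not shown in this post.
PROMPTS = {
    "planner": "Decompose this task into sub-tasks as JSON:\n{description}",
    "worker": "Complete this task and report findings:\n{description}",
}


def pick_persona(task):
    """The persona is just the outcome of the routing check."""
    return "planner" if needs_planning(task) else "worker"


def build_prompt(task):
    template = PROMPTS[pick_persona(task)]
    return template.format(description=task.get("description", ""))


print(pick_persona({"size": "L"}))                             # planner
print(pick_persona({"size": "L", "parent_task": "TASK-001"}))  # worker
```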

The planner gets a specialized prompt asking it to output structured JSON with $N dependency references. The orchestrator resolves $1, $2 etc. to actual TASK-NNN IDs when materializing sub-tasks:

id_map = {}  # $N -> actual task ID (1-indexed)
for i, spec in enumerate(sub_tasks_spec):
    new_id = f"TASK-{next_num:03d}"
    id_map[f"${i + 1}"] = new_id
    # Resolve $N dependencies to real IDs
    resolved_deps = [id_map.get(dep, dep) for dep in spec.get("dependencies", [])]
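The excerpt above is part of a larger loop. A self-contained sketch of the same idea follows, with two deliberate differences that are my assumptions rather than the original code: IDs are assigned in a first pass so a spec can also reference a later sibling, and the spec fields `name` and `size` plus the function name `materialize_sub_tasks` are hypothetical.

```python
def materialize_sub_tasks(parent_id, sub_tasks_spec, next_num):
    """Turn planner output into task dicts, resolving "$N" references to real IDs."""
    # Pass 1: assign an ID to every spec ($N is 1-indexed).
    id_map = {
        f"${i + 1}": f"TASK-{next_num + i:03d}"
        for i in range(len(sub_tasks_spec))
    }
    # Pass 2: build the tasks and resolve their dependencies.
    tasks = []
    for i, spec in enumerate(sub_tasks_spec):
        tasks.append({
            "id": id_map[f"${i + 1}"],
            "name": spec["name"],
            "size": spec.get("size", "S"),
            "parent_task": parent_id,
            "status": "open",
            # Anything that is not a $N reference passes through unchanged.
            "dependencies": [id_map.get(dep, dep)
                             for dep in spec.get("dependencies", [])],
        })
    return tasks


plan = [
    {"name": "Search service code", "size": "XS"},
    {"name": "Check migration tasks", "size": "S"},
    {"name": "Write cleanup plan", "size": "S", "dependencies": ["$1", "$2"]},
]
tasks = materialize_sub_tasks("TASK-001", plan, next_num=2)
print(tasks[2]["dependencies"])  # ['TASK-002', 'TASK-003']
```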

One subtle but important design choice: planners run before workers in the task queue. When the orchestrator selects the next task, it prioritizes planner work over worker work. This unblocks sub-tasks sooner: you don't want a planner waiting behind three workers when its output would spawn three more parallelizable tasks.
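A sketch of what planner-first selection could look like. The selection function, the `PRIORITY_ORDER` map, and the status strings are assumptions for illustration; it also assumes that a parent whose plan has been materialized no longer has status `open`.

```python
PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


def needs_planning(task):
    if task.get("size") not in {"M", "L", "XL"}:
        return False
    if task.get("parent_task"):
        return False
    if task.get("sub_tasks"):
        return False
    return True


def select_next_task(tasks):
    """Pick the next runnable task: planner work first, then by priority."""
    done = {t["id"] for t in tasks if t.get("status") == "done"}
    ready = [
        t for t in tasks
        if t.get("status") == "open"
        and all(dep in done for dep in t.get("dependencies", []))
    ]
    if not ready:
        return None
    # False sorts before True, so tasks that need planning come first.
    return min(
        ready,
        key=lambda t: (not needs_planning(t),
                       PRIORITY_ORDER.get(t.get("priority"), 99)),
    )


queue = [
    {"id": "TASK-001", "size": "S", "priority": "P0", "status": "open"},
    {"id": "TASK-002", "size": "L", "priority": "P2", "status": "open"},
    {"id": "TASK-003", "size": "XS", "priority": "P1", "status": "open",
     "dependencies": ["TASK-001"]},
]
print(select_next_task(queue)["id"])  # TASK-002
```

Here the P2 task jumps ahead of the P0 worker task purely because it needs planning, which is the trade-off the paragraph above describes.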

When all sub-tasks complete, the parent auto-closes.

                    ┌──────────────────┐
                    │  Parent Task     │
                    │  (Size: L)       │
                    │  "Audit all API  │
                    │   endpoints"     │
                    └────────┬─────────┘
                             │
                       ┌─────▼─────┐
                       │  Planner   │
                       │  (Claude)  │
                       └─────┬─────┘
                             │ JSON plan with $N deps
               ┌─────────────┼─────────────┐
               │             │             │
        ┌──────▼──────┐ ┌───▼────────┐ ┌──▼───────────┐
        │ Sub-task 1  │ │ Sub-task 2 │ │ Sub-task 3   │
        │ (Size: XS)  │ │ (Size: S)  │ │ (Size: S)    │
        │ No deps     │ │ No deps    │ │ Depends: 1,2 │
        └──────┬──────┘ └─────┬──────┘ └──────┬───────┘
               │              │               │
          ┌────▼────┐   ┌─────▼────┐    ┌─────▼────┐
          │ Worker  │   │ Worker   │    │ Worker   │
          │ (Claude)│   │ (Claude) │    │ (Claude) │
          └─────────┘   └──────────┘    └──────────┘
               │              │               │
               └──────────────┼───────────────┘
                              │
                    All done → Parent auto-closes
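The auto-close step at the bottom of the diagram could be sketched like this. The function name `maybe_close_parent`, the `orchestrator` persona label, and the `done` status string are assumptions.

```python
def maybe_close_parent(parent, tasks_by_id):
    """Mark the parent done once every sub-task is done; return True if it closed."""
    sub_ids = parent.get("sub_tasks", [])
    if not sub_ids:
        return False  # nothing was planned, so there is nothing to roll up
    if all(tasks_by_id[sid]["status"] == "done" for sid in sub_ids):
        parent["status"] = "done"
        parent.setdefault("activity_log", []).append({
            "persona": "orchestrator",
            "action": "completed",
            "detail": f"All {len(sub_ids)} sub-tasks done; parent auto-closed",
        })
        return True
    return False


tasks_by_id = {
    "TASK-002": {"status": "done"},
    "TASK-003": {"status": "done"},
    "TASK-004": {"status": "open"},
}
parent = {"id": "TASK-001", "status": "in_progress",
          "sub_tasks": ["TASK-002", "TASK-003", "TASK-004"]}
print(maybe_close_parent(parent, tasks_by_id))  # False, TASK-004 is still open
tasks_by_id["TASK-004"]["status"] = "done"
print(maybe_close_parent(parent, tasks_by_id))  # True
```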

The Plan Rejection Protocol

Here's where it gets interesting. Workers can say "this plan is bad."

If a sub-task's plan is fundamentally unworkable (missing prerequisites, contradictory requirements, impossible constraints), the worker outputs PLAN_REJECTED: <reason> instead of completing the task. The orchestrator detects this marker, removes all the old sub-tasks, resets the parent task for re-planning, and includes the rejection reason in the next planner prompt.

┌──────────┐     plan      ┌──────────┐    sub-tasks    ┌──────────┐
│  Parent   │─────────────►│ Planner  │────────────────►│ Workers  │
│  Task     │              │          │                 │          │
└──────────┘              └──────────┘                 └────┬─────┘
     ▲                         ▲                            │
     │                         │     PLAN_REJECTED:         │
     │    reset parent,        │     "missing prereq X"     │
     │    include rejection    └────────────────────────────┘
     │    context
     │
     │   After 3 iterations:
     └── escalate to human intervention

max_plan_iterations defaults to 3. After three failed plan-reject cycles, the system escalates to human intervention: it sets requires_human: intervention and writes a notification explaining what happened.
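The reject, reset, and escalate cycle can be sketched in a few lines. Only the `PLAN_REJECTED:` marker, the limit of 3, and `requires_human: intervention` come from the text; the function name, the `plan_iterations` and `rejections` fields, and the status strings are hypothetical.

```python
REJECT_MARKER = "PLAN_REJECTED:"
MAX_PLAN_ITERATIONS = 3


def handle_worker_output(output, parent, tasks_by_id):
    """Process one sub-task's output; return 'ok', 'replan', or 'escalated'."""
    line = next((ln.strip() for ln in output.splitlines()
                 if ln.strip().startswith(REJECT_MARKER)), None)
    if line is None:
        return "ok"
    reason = line[len(REJECT_MARKER):].strip()

    # Remove the old sub-tasks and keep the reason for the next planner prompt.
    for sid in parent.get("sub_tasks", []):
        tasks_by_id.pop(sid, None)
    parent["sub_tasks"] = []
    parent["plan_iterations"] = parent.get("plan_iterations", 0) + 1
    parent.setdefault("rejections", []).append(reason)

    if parent["plan_iterations"] >= MAX_PLAN_ITERATIONS:
        parent["requires_human"] = "intervention"
        parent["status"] = "blocked"
        return "escalated"
    parent["status"] = "open"  # eligible for the planner again
    return "replan"


parent = {"id": "TASK-001", "size": "L", "sub_tasks": ["TASK-002"]}
tasks_by_id = {"TASK-002": {"status": "open"}}
print(handle_worker_output("PLAN_REJECTED: missing prereq X", parent, tasks_by_id))
print(parent["rejections"])
```

Clearing `sub_tasks` is what makes the parent qualify for planning again, and the stored reasons give the next planner run its context.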

This isn't error handling. It's a feedback loop between two AI personas. The planner proposes, the worker evaluates, and if the proposal doesn't survive contact with reality, the system iterates. With context. The rejection reason is fed back to the planner, so each iteration is informed by what went wrong before.

Real Results

The system's first real test was a codebase audit: finding all references to a deprecated API across a large repository, checking related tasks for migration status, and producing a prioritized cleanup plan. This is exactly the kind of task that's painful to do manually: boring, sprawling, and dependent on checking dozens of files and cross-referencing with issue trackers.

The planner decomposed it into 8 sub-tasks, each focused on a different code area or a different type of investigation (search this directory, check that task tracker, analyze this migration path). Workers ran independently, some completing in minutes (quick grep-style searches), others taking longer for deeper analysis.

Here's what the workers found, collectively:

27 distinct references across 6 categories, ranked into 3 priority tiers
3 out-of-scope items correctly identified as false positives — things that looked like matches but were actually unrelated (different product's API, different naming convention, already fully migrated). That's human time saved: instead of chasing false positives, I got a pre-filtered list
3 targets confirmed already removed: one sub-task discovered its target had been cleaned up in a previous effort over a year ago. The worker correctly reported "no changes needed" and moved on
One sub-task found that a supposedly abandoned migration task had actually been stalled since 2022, useful context for prioritization

Final output: a structured action plan with 5 independent code changes, prioritized by risk and effort, with pre-checks and test plans for each. All the changes were small (XS or S sized) and could be submitted in parallel.

What would have taken a full day of manual investigation (opening files, cross-referencing tasks, checking git history, reading old code reviews, and so on) was done by 8 coordinated AI workers. And because each sub-task produced a standalone output file, I could review them individually, at my own pace, in whatever order made sense.

What I Learned

The evolution tells a story: cron job → scheduler → orchestrator → planner/worker system. Each step was driven by a real limitation of the previous one, not by architectural ambition.

V0 : Serial                 One worker, one task at a time
╔═══╗     ╔══════╗     ╔══════╗
║ W ║────►║ Task ║────►║ Task ║────► ...
╚═══╝     ╚══════╝     ╚══════╝


V1 : Worker Pool             Independent tasks, concurrent workers
╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗
║ W ║ ║ W ║ ║ W ║ ║ W ║      worker queue
╚═╤═╝ ╚═╤═╝ ╚═╤═╝ ╚═╤═╝
  │     │     │     │
  ▼     ▼     ▼     ▼
┌───┐ ┌───┐ ┌───┐ ┌───┐
│ T │ │ T │ │ T │ │ T │      task pool
└───┘ └───┘ └───┘ └───┘


V2 : Planners + Workers      Decomposition before execution
╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗
║ W ║ ║ W ║ ║ W ║ ║ W ║ ║ W ║ ║ W ║  workers
╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝
╔═══╗ ╔═══╗ ╔═══╗
║ P ║ ║ P ║ ║ P ║                     planners
╚═══╝ ╚═══╝ ╚═══╝


V3 : Team (future)           Specialized personas, handoff protocol
╔═══╗ ╔═══╗
║ P ║ ║ P ║                           planners
╚═══╝ ╚═══╝
╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗ ╔═══╗
║ W ║ ║ W ║ ║ W ║ ║ W ║ ║ W ║ ║ W ║  workers
╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝ ╚═══╝
╔════╗ ╔═══╗
║ PM ║ ║ TL║                          creators
╚════╝ ╚═══╝
       ▲ handoff protocol ▲

Four patterns emerged that I think are generalizable:

Activity log as memory. Context survives across retries and sessions because every action is recorded. A retry doesn't start from zero; it starts from "here's what was tried and why it failed." This is the simplest possible implementation of agent memory, and it's surprisingly effective.

Personas as routing logic. "Planner" and "worker" aren't separate systems or separate models. They're the same Claude CLI called with different prompts. The persona distinction is a function call: if needs_planning() returns True, you use the planner prompt template; if it returns False, you use the worker template. That's it. No framework, no agent registry, no complex orchestration layer.

Human-in-the-loop as a dial. The matrix of requires_human levels × blocking/non_blocking modes lets each task declare its own autonomy level. This is more useful in practice than a global "autonomous mode" toggle.

Plan rejection as protocol. Not error handling, a first-class feedback mechanism. PLAN_REJECTED: is part of the prompt contract. Both sides know the rules. The system iterates with context rather than retrying blindly.

The broader ecosystem is exploring similar ideas. Tools and frameworks for autonomous AI workflows are emerging rapidly. There's no canonical architecture for this yet and that's what makes it an exciting space to build in. We're all figuring this out in real time.

What's Next

Phase 3 is where it gets ambitious. I considered a few paths. One was going deep on precisely defined personas, for example TL and PM personas that create new tasks. The other was to focus on the shared ecosystem instead of the individual personas.
New agents with different capabilities will keep showing up, so rather than focusing on a single agent in different flavors, I found it both more interesting and more challenging to focus on the shared ecosystem they’d operate under: a space where devs and their agents could co-work productively.

It stops being a tool you use and becomes a team you manage.

Closing

From a bash cron job to a multi-persona planning system in about 3 days of evenings. That's not a testament to my engineering speed. Rather, it's a testament to what becomes possible when you have an AI pair programmer that can help build the infrastructure for its own autonomy.

Honest assessment: it's still experimental. The failure rate is real. Silent failures lurk in every corner, and you have to instrument your way to reliability. But the ceiling is visible.

The pattern of "structured task → AI decomposition → parallel execution → human review" works, and it works better than doing everything interactively.

If you have Claude Code or any AI with a CLI, try building something autonomous. Start with a single routine: a cron job that generates your standup. See where it takes you. You might end up with a planning system that argues with itself about how to approach your work.

And that's a surprisingly useful thing to have.

This is Part 2 of a multi-part series. Part 1: "The Routine Board" covers the routine engine that started it all.
