Alessio Masucci
I Ported OpenAI's Symphony to Claude Code: A Complete Build Diary

A few weeks ago, OpenAI open-sourced Symphony — an Elixir-based orchestrator that polls a project board, claims tickets, spins up Codex agents in isolated workspaces, manages multi-turn sessions, and handles retries. It's the missing piece between "AI can write code" and "AI can work through a backlog."

I watched the demo and immediately thought: I want this, but for Claude Code.

This is the build diary of that port. Every phase, every bug, every design decision — and the one architectural flaw that made me question how AI agents should handle failure.


Credit first: this isn't my idea

Let me be clear upfront: the architecture is OpenAI's. Symphony is their project, their design, their Elixir implementation. The polling loop, workspace isolation, multi-turn agent sessions, exponential backoff — all of that comes from their work.

What I built is a port. Same concept, different stack:

           OpenAI Symphony    My Port
Language   Elixir             TypeScript
Agent      Codex              Claude Code CLI
Tracker    Linear             Linear
Runtime    BEAM/OTP           Node.js

I ported it because I use Claude Code, not Codex. And I wanted to understand every layer deeply enough to extend it for my own workflow.

I had a side project — Rentello — with a Linear backlog growing faster than I could work through it. Well-scoped tickets, clear acceptance criteria. A capable AI agent could handle most of them. I just needed the orchestration layer.

So I started building.


Phase 1: One file to rule them all

The first design decision was putting everything in a single WORKFLOW.md file per project.

---
tracker:
  kind: linear
  project_slug: "symphony-claude-12ab34cd56fg78"
  active_states: [Todo, In Progress, Merging, Rework]
workspace:
  root: ~/symphony-workspaces
hooks:
  after_create: |
    git clone --depth 1 https://github.com/mscalessio/symphony-claude .
agent:
  max_concurrent_agents: 10
  max_turns: 20
codex:
  command: claude
  approval_policy: bypassPermissions
---

You are working on Linear ticket `{{ issue.identifier }}`...

YAML front-matter defines the config. Everything below the separator is a Liquid template that becomes the agent's prompt. One file controls everything: which project to watch, which states trigger work, how to set up workspaces, and the full behavioral instructions.

I used Zod for schema validation — type safety at both compile time and runtime. If the YAML is malformed, the system says exactly what's wrong before anything starts. The config is hot-reloadable: a file watcher detects changes to WORKFLOW.md, re-parses, re-validates, and swaps the live config without restarting.


Phase 2: Talking to Linear — GraphQL lessons

Connecting to Linear's GraphQL API seemed straightforward. I needed three queries:

  1. Candidate issues — all tickets matching the project and active states
  2. State refresh — current state for running workers' tickets
  3. Terminal cleanup — issues in terminal states that need workspace removal

The candidate query worked immediately. Then I tried to filter blocker relations:

inverseRelations(type: "blocks") {
  nodes {
    issue { id identifier state { name } }
  }
}

HTTP 400. GRAPHQL_VALIDATION_FAILED.

Linear's inverseRelations field doesn't accept a type argument. Unlike GitHub's API, there's no server-side relation filtering. Every poll tick was failing silently — no candidates fetched, no agents dispatched. The system looked idle, but it was actually crashing every 5 seconds.

The fix: Fetch all relations, filter client-side in the normalization layer. Simple, but it cost me an afternoon of staring at "0 candidates" wondering why.
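The client-side filter can be sketched like this. The shapes are illustrative, but Linear's workflow state `type` values do include "completed" and "canceled" for terminal states:

```typescript
// Raw relation as returned by Linear (no server-side type filter).
interface RawRelation {
  type: string; // "blocks", "duplicate", "related", ...
  issue: {
    id: string;
    identifier: string;
    state: { name: string; type: string };
  };
}

// Keep only unresolved blockers: "blocks" relations whose blocking
// issue is not yet in a terminal state.
export function unresolvedBlockers(relations: RawRelation[]): RawRelation[] {
  const terminal = new Set(["completed", "canceled"]);
  return relations.filter(
    (r) => r.type === "blocks" && !terminal.has(r.issue.state.type)
  );
}
```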

Lesson: GraphQL APIs are snowflakes. What works on one platform won't work on another. Read the schema, don't assume.


Phase 3: Building the brain — the orchestration engine

The orchestrator is a state machine running on a timer. Every tick runs three phases:

Reconciliation

  • Check every running agent for stalls (no activity in 5+ minutes)
  • Refresh ticket states from Linear — if a human moved a ticket to "Done" while an agent was working, kill the agent

Validation

  • Sanity-check the config (project slug not empty, API key set, etc.)

Dispatch

  • Find eligible tickets: not already running, in an active state, has a concurrency slot, no unresolved blockers
  • Sort by priority, then creation date
  • Spawn workers
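The dispatch step can be sketched as a pure selection function. Field names are illustrative; note that Linear actually uses 0 to mean "no priority", which a real dispatcher would sort last rather than first as this simplified comparator does:

```typescript
interface Candidate {
  id: string;
  priority: number; // assumed: lower = more urgent (Linear's 0 = "none" needs special-casing)
  createdAt: string; // ISO timestamp
}

// Eligibility + ordering: drop tickets that already have a running
// agent, sort by priority then age, and take only the free slots.
export function pickDispatchable(
  candidates: Candidate[],
  runningIds: Set<string>,
  freeSlots: number
): Candidate[] {
  return candidates
    .filter((c) => !runningIds.has(c.id))
    .sort(
      (a, b) =>
        a.priority - b.priority || a.createdAt.localeCompare(b.createdAt)
    )
    .slice(0, freeSlots);
}
```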

Each agent is tracked as a RunningEntry with:

  • An AbortController for clean cancellation
  • Token counters (input + output)
  • A stall timer (last activity timestamp)
  • The current turn number and session ID
  • The most recent stream event (for TUI display)
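An illustrative shape for that tracking record, plus the stall check it enables — the field names are mine, not the project's:

```typescript
// Per-agent tracking record (illustrative fields).
interface RunningEntry {
  issueId: string;
  abort: AbortController; // clean cancellation
  inputTokens: number;
  outputTokens: number;
  lastActivityAt: number; // epoch ms, bumped on every stream event
  turn: number;
  sessionId?: string; // set after the first turn, reused via --resume
  lastEvent?: string; // most recent stream event, for the TUI
}

// Stall detection: no activity in 5+ minutes, per the reconciler above.
const STALL_MS = 5 * 60 * 1000;

export function isStalled(entry: RunningEntry, now = Date.now()): boolean {
  return now - entry.lastActivityAt > STALL_MS;
}
```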

The retry system was the trickiest part. I needed two modes:

  • Normal retries — the agent completed a turn but the ticket is still in an active state. Quick 1-second delay, re-dispatch.
  • Abnormal retries — crashes, timeouts, API errors. Exponential backoff: 10s, 20s, 40s, up to a configurable maximum (default 5 minutes).

This prevents thundering herd behavior when Linear has a transient outage.
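The abnormal-retry schedule above reduces to one formula — 10s base, doubling per attempt, capped at the configured maximum:

```typescript
// Exponential backoff for abnormal retries: 10s, 20s, 40s, ...
// capped at maxMs (default 5 minutes, matching the numbers above).
export function backoffMs(
  attempt: number, // 1-based abnormal-retry attempt
  baseMs = 10_000,
  maxMs = 5 * 60 * 1000
): number {
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}
```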


Phase 4: The subprocess bet

The biggest architectural decision was how to run Claude. Two options:

Option A: Claude SDK. Make API calls directly. Manage conversation state, tool execution, and permission enforcement in my code.

Option B: Claude CLI. Spawn claude -p --output-format stream-json as a child process. Parse the JSON event stream from stdout.

I chose the CLI. Here's the reasoning:

Concern          SDK                    CLI
Tool execution   I implement it         CLI handles it
Session state    I manage it            --resume handles it
Permissions      I enforce them         CLI enforces them
Observability    I build it             Stream events give it to me
Coupling         Tight to SDK version   Loose to CLI interface

The CLI does the hard work. The orchestrator just observes and reacts. The multi-turn loop is elegant:

while (true) {
  const result = await spawnClaudeTurn({
    prompt,
    cwd: workspacePath,
    sessionId: sessionId ?? undefined, // resume if we have one
    mcpConfigPath,
    // ...
  });

  if (!result.success) break;
  sessionId = result.sessionId; // carry the session into the next turn

  // Check if the ticket is still active
  const [state] = await tracker.fetchIssueStatesByIds([issue.id]);
  if (!activeStates.includes(state)) break;

  if (++turnNumber >= maxTurns) break;
}

The downside — dependency on CLI argument parsing — would bite me later. Hard.


Phase 5: The MCP server trick

Each agent needs to interact with Linear: read ticket details, post workpad comments, update states. But I didn't want API keys in the prompt.

I built a tiny MCP (Model Context Protocol) server — a standalone Node.js process exposing a single linear_graphql tool. For each agent session, the orchestrator writes a per-workspace MCP config:

{
  "mcpServers": {
    "symphony-linear": {
      "command": "node",
      "args": ["/path/to/linear-graphql-server.js"],
      "env": {
        "LINEAR_API_KEY": "lin_api_...",
        "LINEAR_ENDPOINT": "https://api.linear.app/graphql"
      }
    }
  }
}

The agent calls linear_graphql like any other tool. It doesn't know about authentication. It doesn't know about the orchestrator. Complete isolation.

This turned out to be the cleanest design decision in the project. Each workspace is a self-contained unit: its own directory, its own git clone, its own MCP config, its own credentials.
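A sketch of the per-workspace config writer. The server name and env vars follow the JSON above; the helper itself and the file name are illustrative:

```typescript
import { writeFileSync } from "node:fs";
import { join } from "node:path";

// Write a per-workspace MCP config so the agent gets the linear_graphql
// tool without ever seeing the API key in its prompt.
export function writeMcpConfig(workspaceDir: string, apiKey: string): string {
  const config = {
    mcpServers: {
      "symphony-linear": {
        command: "node",
        args: ["/path/to/linear-graphql-server.js"],
        env: {
          LINEAR_API_KEY: apiKey,
          LINEAR_ENDPOINT: "https://api.linear.app/graphql",
        },
      },
    },
  };
  const path = join(workspaceDir, "mcp-config.json");
  writeFileSync(path, JSON.stringify(config, null, 2));
  return path; // passed to the CLI via --mcp-config
}
```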


Phase 6: The terminal dashboard

I wanted the same visual experience as the original Elixir Symphony — a live terminal dashboard showing running agents, token throughput, and retry queues.

I built it with raw ANSI escape codes. No blessed, no ink, no external library.

 SYMPHONY  3 agents | 1,247 tok/s | 0:14:32 | 847,291 tokens
 ─────────────────────────────────────────────────────────
 ID      STATE        PID    AGE    TURN  TOKENS  EVENT
 MAS-5   In Progress  4821   2:31   1     12,847  Writing tests...
 MAS-8   In Progress  4935   1:07   1      5,221  Reading file...
 MAS-12  Todo         5012   0:03   1        423  Planning...

 BACKOFF QUEUE
 MAS-3   retry in 18s (attempt 2) — Linear API timeout

Alternate screen buffer (like vim), hidden cursor, 1-second refresh, SIGWINCH resize handling. When the TUI is active, Pino logs redirect to a file so they don't corrupt the display.

The renderer is a pure function — state in, ANSI string out. Easy to test: 24 tests covering column formatting, truncation, empty states, and terminal width edge cases.
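A toy version of one such pure function — a single table row, with the truncation behavior those tests cover. Column widths are illustrative and ANSI color codes are omitted for clarity:

```typescript
interface RowState {
  id: string;
  state: string;
  tokens: number;
  event: string;
}

// Pure renderer: row state in, fixed-width string out. No I/O, so it
// can be unit-tested without a terminal.
export function renderRow(row: RowState, width = 60): string {
  const line =
    row.id.padEnd(8) +
    row.state.padEnd(13) +
    String(row.tokens).padStart(8) +
    "  " +
    row.event;
  // Truncate to the terminal width so long events never wrap the table.
  return line.length > width ? line.slice(0, width - 1) + "…" : line;
}
```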


Phase 7: The cascade — three bugs hiding behind each other

Everything was built. 144 unit tests passing. TypeScript compiling clean. Time for the real test: a live ticket against a real Linear board.

I started the orchestrator. It claimed MAS-5. And crashed in 200 milliseconds.

Bug 1: ENAMETOOLONG

Error: Invalid MCP configuration:
Failed to read file: Error: ENAMETOOLONG: name too long, open

The system was trying to open() a 3,000-character string as a filename. But what string?

I traced the shell command being constructed:

claude -p --output-format stream-json --mcp-config /path/config.json "You are working on..."

That last argument — the rendered prompt template — was a positional arg. With --mcp-config present, Claude CLI changes how positional arguments are parsed. The prompt was being treated as a file path.

Fix: Always pipe the prompt through stdin. Never pass it as a positional argument.
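The shape of that fix, sketched. The flags are the ones mentioned in this post (`-p`, `--output-format stream-json`, `--verbose`, `--resume`, `--mcp-config`); the helper names are mine. The key invariant is that the prompt never appears in argv:

```typescript
import { spawn } from "node:child_process";

// Pure helper: build argv for one turn. Note the prompt is NOT here.
export function buildClaudeArgs(
  sessionId?: string,
  mcpConfigPath?: string
): string[] {
  const args = ["-p", "--output-format", "stream-json", "--verbose"];
  if (sessionId) args.push("--resume", sessionId);
  if (mcpConfigPath) args.push("--mcp-config", mcpConfigPath);
  return args;
}

// Spawn with the prompt piped over stdin, so the CLI's argument parsing
// can never mistake it for a file path.
export function spawnClaudeWithStdin(prompt: string, args: string[], cwd: string) {
  const child = spawn("claude", args, { cwd, stdio: ["pipe", "pipe", "inherit"] });
  child.stdin.write(prompt);
  child.stdin.end();
  return child;
}
```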

Bug 2: The silent flag requirement

Fixed Bug 1. Rebuilt. Ran again.

Error: When using --print, --output-format=stream-json requires --verbose

A new CLI validation rule. Not in any changelog I could find. Just a new gate that appeared between "last time this worked" and now.

Fix: Add --verbose to the args. One line.

Bug 3: The GraphQL ghost

Fixed Bug 2. An agent spawned. It read the ticket. It planned. It wrote code. It completed its first turn successfully.

Then the orchestrator tried to check if the ticket was still active:

Cannot query field "nodes" on type "Query".

The state-refresh query was using nodes(ids: [...]) — a Relay convention for fetching any node by global ID. GitHub has it. Shopify has it. Linear doesn't.

This query had been broken since day one. It never ran because Bug 1 killed the process before any agent could complete a turn. Fix Bug 1, Bug 2 appears. Fix Bug 2, Bug 3 appears.

Fix: Replace with issues(filter: { id: { in: $ids } }).

The meta-lesson

Layered failures are the defining challenge of agent orchestration systems. In a traditional web app, a database error is a database error. In an agent system, a CLI parsing quirk masks a missing GraphQL field, which masks a design flaw in the retry loop.

You have to fix bugs in strict sequential order. Each fix reveals the next failure. And unit tests — all 144 of them — caught none of these because each component worked perfectly in isolation.


Phase 8: The design flaw

With all three code bugs fixed, the system worked end-to-end. An agent claimed a ticket, implemented it, opened a PR, and moved the card to "Human Review."

But while tracing the full lifecycle during debugging, I read the Rework flow. When a reviewer requests changes:

  1. Close the existing PR
  2. Delete all progress notes
  3. Create a fresh branch from main
  4. Start the entire implementation from scratch

A reviewer says "rename this variable" and the agent throws away everything — 47 changed files, hours of compute — to re-implement from zero.

This isn't a code bug. It's a design flaw in the prompt template. The Rework instruction treats every review cycle as a total reset rather than an incremental fix.

The right approach: read the review comments, address each one on the existing branch, push updates to the same PR. But that requires a more sophisticated prompt — one that can parse GitHub review comments, map them to specific code locations, and make targeted changes while preserving everything else.

I haven't fixed it yet. It's the next evolution. But I would never have found it if the three code bugs hadn't forced me to trace the full lifecycle manually.

Sometimes the best thing a bug gives you is a reason to actually read your own system.


The final architecture

WORKFLOW.md
    |
    v
[Config Loader] ---> [Zod Validator] ---> [Resolver ($VAR, ~)]
    |
    v
[Orchestrator]
    |-- tick() every N seconds
    |-- [Reconciler] --> stall detection + Linear state refresh
    |-- [Dispatcher] --> eligibility + priority sorting
    |-- [Retry Queue] --> exponential backoff
    |
    v
[Worker Runner]
    |-- create workspace
    |-- write MCP config
    |-- run hooks
    |-- multi-turn loop:
    |     spawn claude CLI --> parse stream --> accumulate usage
    |     check ticket state --> continue or break
    |-- cleanup
    |
    v
[Claude CLI Subprocess]
    |-- stdin: prompt
    |-- stdout: stream-json events
    |-- MCP: linear_graphql tool
    |
    v
[Observers]
    |-- TUI Dashboard (ANSI)
    |-- HTTP API + Web Dashboard
    |-- Pino structured logs

~2,200 lines of source code. 144 tests. 11 test files. 10+ modular layers.


What I'd do differently

1. Integration tests from day one. I had excellent unit test coverage and zero integration tests. The three-bug cascade would have been caught by a single test that spawned a real Claude process against a mock Linear API.

2. Stdin from the start. Positional arguments for the CLI prompt were a ticking bomb. Stdin is universally stable.

3. Incremental rework, not reset. The "throw everything away" approach to review feedback was baked into the initial prompt design. I should have designed for the review cycle from the beginning, not as an afterthought.

4. GraphQL schema introspection. Instead of assuming query patterns from other APIs, I should have introspected Linear's schema on startup. Would have caught the nodes and inverseRelations issues immediately.
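A sketch of that startup check. The introspection query is standard GraphQL; the endpoint and auth header follow Linear's API (personal API keys go in `Authorization` as-is), but treat the whole helper as illustrative, not the project's code:

```typescript
// Standard GraphQL introspection: list the root query fields.
const ROOT_FIELDS_QUERY = `
  query { __schema { queryType { fields { name } } } }
`;

// Pure check: which required root fields are absent from the schema?
export function missingRootFields(
  available: string[],
  required: string[]
): string[] {
  return required.filter((f) => !available.includes(f));
}

export async function verifySchema(endpoint: string, apiKey: string): Promise<void> {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: apiKey },
    body: JSON.stringify({ query: ROOT_FIELDS_QUERY }),
  });
  const { data } = (await res.json()) as {
    data: { __schema: { queryType: { fields: { name: string }[] } } };
  };
  const names = data.__schema.queryType.fields.map((f) => f.name);
  // "issues" exists on Linear's root Query; "nodes" famously does not.
  const missing = missingRootFields(names, ["issues"]);
  if (missing.length > 0) {
    throw new Error(`Schema is missing expected root fields: ${missing.join(", ")}`);
  }
}
```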


What I learned

AI agent systems have emergent behavior. Individual components work perfectly. The failures only appear at integration boundaries — between your code and the CLI, between the CLI and the API, between the API and the schema.

CLI tools are moving targets. When you shell out to an external tool, you're taking a dependency on its argument parsing, its validation rules, its undocumented behaviors. Treat it like an untrusted external service, not a function call.

Run the full loop early. The gap between "all tests pass" and "the system works" was enormous. End-to-end execution with real APIs is where the truth lives.

Debug sessions are design reviews. I found the biggest design flaw — destructive rework — while debugging unrelated code bugs. Tracing the full lifecycle forced me to read my system as a user would experience it.


The system is running. It watches my Linear board, claims tickets, writes code, and opens PRs. Most of the time it works. Sometimes it throws away a perfectly good PR because someone asked it to rename a variable.

We're working on that part.


If you're building similar systems — autonomous agents operating on real codebases with real project trackers — I'd love to hear what's working for you. The tooling is evolving fast, and half the battle is just keeping up with the tools we depend on.

Top comments (2)

Mihir kanzariya

The "failures only appear at integration boundaries" insight is so real. I've been building with Claude Code for a few months now and this is the thing that gets me every time. Each piece works fine in isolation, but the second you wire them together through real APIs, everything breaks in ways unit tests won't catch.

Your point about treating CLI tools as untrusted external services is really smart. I hadn't thought about it that way but yeah, you're basically taking a runtime dependency on someone else's arg parsing and error codes. That's risky.

How's the Linear integration holding up? I'm curious if the GraphQL schema changes have caused any drift since you shipped this.

Alessio Masucci

Thanks! Yeah, the integration boundary thing is the biggest mental shift coming from traditional app dev. You can have 100% unit test coverage and the system is completely broken — that three-bug cascade was a humbling reminder.

On the CLI-as-dependency point — I've started thinking of it almost like an external API with no SLA. The stdin change was exactly that mindset: minimize the surface area you depend on, assume the contract will change.

The Linear integration has been mostly stable since the initial fixes. The two gotchas I hit (inverseRelations not accepting a type filter, and nodes not being a root query field) were both "wrong assumptions about the schema" rather than schema drift. Once I matched the actual API, it's been solid. That said, I should probably add schema introspection on startup as a safety net — it's on the list.

If you're building similar stuff with Claude Code, curious what patterns you've landed on for the agent lifecycle. The rework flow (incremental vs. full reset) is the thing I'm still iterating on.