DEV Community: Alessio Masucci

I Tried to Run Symphony for Real on Rentello. It Broke in Exactly the Right Place.

Alessio Masucci — Thu, 26 Mar 2026 08:10:26 +0000

A few weeks ago, I published a build diary about porting OpenAI's Symphony to Claude Code.

That piece was about getting the machine to exist.

This one is about what happened when I tried to trust it.

Not in a toy repo. Not with a fake ticket. Not with a carefully staged demo that only exercised the happy path. I wired the workflow into a real project, Rentello, pointed it at a real Linear board, and tried to make Codex, Claude, and Symphony-family orchestrators coexist on the same execution model.

That is when the most useful bug of the whole project appeared.

It wasn't a crash inside the orchestration loop.
It wasn't a retry bug.
It wasn't even a Linear label bug.

It was a hidden architectural assumption inside one innocent file: WORKFLOW.md.

Phase two of the same story

The first article was the infrastructure story.

I had taken OpenAI's Symphony architecture, ported the orchestration ideas to work with Claude Code, and documented the engineering journey: configuration, polling, workspaces, retries, MCP tooling, the terminal dashboard, the CLI quirks, the Linear GraphQL edges. It was the "can I build this?" phase.

This sequel starts after that.

Once the port existed, the obvious next question was: can this become a reliable workflow across agent surfaces?

I didn't just want a Claude-only orchestrator anymore.

I wanted a system where:

planning could happen in Codex App or Codex web
work could be split into Linear issues that were safe for parallel or sequential execution
exec:agent work could be handled by generic agent surfaces
exec:symphony work could be handled by Symphony-family orchestrators
the same repository could remain deterministic whether the active surface was Codex or Claude

That sounds tidy when you say it fast.

In practice, it meant turning "AI can work on tickets" into an actual contract.

The experiment: make the workflow explicit

The first big change was conceptual, not technical.

I stopped treating the workflow as an informal set of conventions and turned it into a real repository contract:

shared execution-owner labels: exec:agent and exec:symphony
wave labels like wave:1, wave:2
a stable issue template
a single persistent ## Workpad comment per issue
a shared machine-readable contract plus shared docs/templates

The important detail here is that I did not want separate "Codex rules" and "Claude rules" at the planning layer.

The workflow itself needed to be generic.

The orchestrator choice should affect how the work is run, not how the work is defined.

So I introduced a shared contract and kept runtime-specific prompts only where they belonged: under .codex/ and .claude/, as adapters to the same underlying rules.

That gave me a clean execution model:

exec:agent means "owned by the non-Symphony agent surface"
exec:symphony means "owned by a Symphony-compatible orchestrator"
blockedBy remains authoritative
wave labels are scheduling metadata, not permission to ignore dependencies
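
The rules above can be sketched as a small eligibility check. This is a minimal illustration, not code from the repo: the `PlannedIssue` shape, the `isDispatchable` helper, the `A-1`/`A-2` identifiers, and the `Done` state name are all assumptions.

```typescript
// Hypothetical issue shape; field names are assumptions for illustration.
type ExecOwner = "exec:agent" | "exec:symphony";

interface PlannedIssue {
  id: string;
  labels: string[];    // e.g. ["exec:agent", "wave:1", "docs"]
  state: string;       // e.g. "Todo", "In Progress", "Done"
  blockedBy: string[]; // ids of issues that must finish first
}

// An issue is dispatchable for a lane only if it carries that lane's label
// and every blocker is already finished. Wave labels are deliberately ignored:
// they are scheduling metadata, not permission to skip dependencies.
function isDispatchable(
  issue: PlannedIssue,
  lane: ExecOwner,
  byId: Map<string, PlannedIssue>,
  doneStates: string[] = ["Done"],
): boolean {
  if (!issue.labels.includes(lane)) return false;
  return issue.blockedBy.every((blockerId) => {
    const blocker = byId.get(blockerId);
    // An unknown blocker is treated as unresolved, to stay on the safe side.
    return blocker !== undefined && doneStates.includes(blocker.state);
  });
}

const first: PlannedIssue = { id: "A-1", labels: ["exec:agent", "wave:1"], state: "Todo", blockedBy: [] };
const second: PlannedIssue = { id: "A-2", labels: ["exec:symphony", "wave:2"], state: "Todo", blockedBy: ["A-1"] };
const byId = new Map<string, PlannedIssue>([[first.id, first], [second.id, second]]);

console.log(isDispatchable(first, "exec:agent", byId));     // true
console.log(isDispatchable(second, "exec:symphony", byId)); // false while A-1 is not Done
```

Each surface would call the same check with its own lane, which is what keeps the planning layer generic.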

That was the theory.

Then I decided to test it for real.

The smoke test: two issues, one dependency, no excuses

I created a new Linear project called Agent Workflow Smoke Test.

Then I used the repo's planning contract to generate the smallest meaningful DAG I could think of:

MAS-26 — exec:agent, wave:1, research, docs
MAS-27 — exec:symphony, wave:2, research, docs, blockedBy MAS-26

The goal was deliberately low-risk: a docs-only smoke test with a real sequencing edge.

If the contract was sound, I should be able to verify:

planning against the real repo
issue creation in Linear with the right labels and dependency edges
Codex App dispatch respecting exec:agent
Symphony respecting exec:symphony
a single persistent workpad model across the whole system

The first pass went well.

The repo-level checks passed.
The Linear issues were created correctly.
The labels were right.
The blockedBy relation was correct.

Then I moved MAS-26 into Todo and let Codex App handle the bootstrap.

That part worked exactly the way I wanted.

Codex App picked up only the exec:agent issue, moved it to In Progress, and created a single ## Workpad comment. More importantly, the next run reused that same comment and updated it in place instead of spamming the issue with milestone chatter.

That detail matters more than it sounds.

A lot of agent workflows die the death of a thousand comments. If every automation pass creates another "status update," the issue becomes unreadable. Reusing one persistent workpad is the difference between an automation system that looks operationally credible and one that looks like a bot farm.
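
A minimal sketch of the "one persistent workpad" rule, assuming the issue's comments have already been fetched. The `IssueComment` shape and the `planWorkpadWrite` helper are hypothetical; the actual Linear calls are left out.

```typescript
// Hypothetical comment shape and helper; the real Linear calls are left out.
interface IssueComment {
  id: string;
  body: string;
}

const WORKPAD_HEADING = "## Workpad";

type WorkpadAction =
  | { kind: "update"; commentId: string; body: string }
  | { kind: "create"; body: string };

// Decide whether to update the existing workpad or create the first one.
function planWorkpadWrite(comments: IssueComment[], content: string): WorkpadAction {
  const body = `${WORKPAD_HEADING}\n\n${content}`;
  const existing = comments.find((c) => c.body.trim().startsWith(WORKPAD_HEADING));
  return existing
    ? { kind: "update", commentId: existing.id, body }
    : { kind: "create", body };
}
```

Because the decision is made from the current comment list on every pass, a second run finds the first run's comment and edits it instead of adding another.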

At that point, the first conclusion felt obvious:

the workflow contract was working.

That conclusion was wrong.

The first false conclusion

Once the Codex App leg worked, I moved to the Symphony leg.

I launched openai/symphony against the repo's WORKFLOW.md.

And immediately noticed something odd: it was looking at the real Rentello project instead of the smoke-test one.

The reason was simple in hindsight.

WORKFLOW.md was doing two very different jobs at once:

it was the shared execution contract
it was the runtime entrypoint for a specific orchestrator, with a specific project slug and a specific launcher configuration

That meant my "shared workflow file" wasn't really shared at all.

It was secretly carrying environment-specific assumptions:

which Linear project to poll
which states count as active
which command to spawn for the agent runtime
what approval policy and sandbox mode to use

So before I even got to the deeper integration problem, I had already hit a design smell:

I was treating a runtime wrapper as if it were a universal contract.

I worked around it temporarily with a copied workflow file that targeted the smoke-test project.

That got me to the real failure.

The real failure: not labels, not blockers, not Linear

The next visible symptom was noisy and misleading.

Symphony would pick up the smoke-test issue, then back off with timeouts and failed runs.

At first glance, it looked like the kind of thing that sends you down the wrong rabbit hole:

maybe the new exec:* label routing was wrong
maybe the issue state transitions were inconsistent
maybe the blockedBy logic had a bug

None of that was the root cause.

The actual problem was much deeper and much simpler:

openai/symphony expected a Codex-compatible app-server runtime, while the repo's WORKFLOW.md was still configured for Claude-style execution.

In other words, the orchestrator and the launcher contract disagreed about what "the agent" even was.

The workflow file still contained a Claude-oriented runtime block. openai/symphony expected something more like:

a Codex app-server command
a compatible approval policy
the sandbox settings the OpenAI implementation expects

So the visible failure was timeout and retry noise.
The real failure was a runtime mismatch.

That distinction matters.

If I had blamed the new workflow model, I would have "fixed" the wrong layer.

The issue was not exec:agent vs exec:symphony.
The issue was not the shared issue template.
The issue was not the workpad model.

The issue was that one literal WORKFLOW.md cannot honestly be the runtime entrypoint for both openai/symphony and a Claude-oriented Symphony port.

The fix that changed the architecture

This was the moment the architecture got better.

The correct lesson was not "make WORKFLOW.md more clever."

It was: stop asking one file to represent two incompatible runtime contracts.

So I split the model into three layers:

a shared execution contract
a shared Symphony instruction body
runtime-specific workflow wrappers

Concretely, the repo now has:

a shared machine-readable contract for labels, states, and workpad expectations
shared docs/templates for planning and issue execution
a shared Symphony body template
WORKFLOW.openai.md
WORKFLOW.claude.md
a small render script to generate the wrappers from the shared body

That was the missing separation all along.

The shared workflow body defines how the issue should be executed.

The runtime wrapper defines how the orchestrator should launch the agent.

Those are not the same concern.

Trying to force them into one file had worked only because I was not yet testing both runtimes seriously enough.

Once I made that split, the system became easier to reason about immediately.

The repository now says, in effect:

the execution rules are shared
the runtime launcher is not

That is a much healthier contract.
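
The render step could be as small as the sketch below. It is an illustration under assumptions, not the repo's script: `RuntimeWrapper`, `renderWorkflow`, and the front-matter strings are hypothetical, and the OpenAI command is left as a placeholder.

```typescript
// Hypothetical runtime wrapper description; keys and values are illustrative.
interface RuntimeWrapper {
  name: string;        // e.g. "openai" or "claude"
  frontMatter: string; // runtime-specific YAML, already formatted
}

// Combine one runtime's front matter with the shared instruction body.
function renderWorkflow(
  wrapper: RuntimeWrapper,
  sharedBody: string,
): { fileName: string; content: string } {
  const content = `---\n${wrapper.frontMatter.trim()}\n---\n\n${sharedBody.trim()}\n`;
  return { fileName: `WORKFLOW.${wrapper.name}.md`, content };
}

const sharedBody = "You are working on Linear ticket `{{ issue.identifier }}`...";
const wrappers: RuntimeWrapper[] = [
  // Placeholder: the real Codex app-server command is not shown in this post.
  { name: "openai", frontMatter: "codex:\n  command: <codex app-server command>" },
  { name: "claude", frontMatter: "codex:\n  command: claude" },
];
const rendered = wrappers.map((w) => renderWorkflow(w, sharedBody));
console.log(rendered.map((r) => r.fileName)); // [ 'WORKFLOW.openai.md', 'WORKFLOW.claude.md' ]
```

Both generated files end with the same body, so a change to the execution rules lands in every wrapper on the next render.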

The second failure: the GraphQL ghost was still waiting

Of course, fixing the runtime mismatch didn't magically make the full run clean.

After the split, Symphony could start correctly.
It could pick up MAS-27.
It could begin doing real work.

And then it hit another boundary failure: stale assumptions in the Linear GraphQL layer.

The logs told the story:

Field "identifier" is not defined by type "IssueFilter".
Cannot query field "blockedByIssues" on type "Issue".

This was a different class of bug entirely.

Now the runtime was correct, but the orchestrator's GraphQL expectations were out of sync with the actual Linear schema. The result was a partial execution: the issue moved into In Progress, but the run did not complete cleanly, and no workpad comment was created on MAS-27.

That was important to verify because from the outside, a half-working orchestration system can look deceptively healthy.

The dashboard shows activity.
The issue state changes.
The agent session starts.

But the operational contract is still broken if the workpad never appears and the run dies on a schema edge.

What actually worked

This is the part I care about most, because I don't want the story to sound more broken than it really was.

A lot did work:

the shared execution-owner model with exec:agent and exec:symphony
the planning contract based on shared docs/templates
the Linear issue generation for MAS-26 and MAS-27
blocker gating via blockedBy
Codex App automation dispatching only exec:agent
in-place reuse of a single ## Workpad comment on Linear
Symphony respecting the exec:symphony lane
the runtime split into WORKFLOW.openai.md and WORKFLOW.claude.md

That is not a small list.

In fact, the test did exactly what a good smoke test should do:

it validated the architectural core, then exposed the next real boundary failure.

First the runtime wrapper problem.
Then the stale GraphQL problem.

That is progress.

The actual lesson

If you want cross-agent orchestration, the wrong abstraction is "one workflow file for everything."

What you actually need is:

one shared execution contract
one shared behavioral prompt body, if the execution model is the same
separate runtime launch wrappers for each orchestrator/runtime pair

The reason is simple.

Codex and Claude are not just different models. In this kind of system, they are different runtime surfaces with different launcher assumptions, different session protocols, and different orchestration expectations.

The workflow itself should stay stable across those surfaces.

The launcher should not.

That is the architectural correction this test forced me to make.

And in hindsight, it's exactly the kind of correction you only get from trying to run the system for real.

The happy path rarely teaches you where your abstractions are lying.
Live orchestration does.

Where this goes next

The runtime split is in.
The generic exec:agent / exec:symphony workflow is in.
The Codex App leg is validated.

The next cleanup is clear:

patch the stale Linear GraphQL assumptions in the Symphony runtime
rerun the MAS-27 path cleanly
verify the full end-to-end chain with both executor lanes

That will be the point where the system stops being "an interesting orchestration prototype" and starts becoming something I would trust against a real backlog.

And honestly, that is the part I find most interesting now.

Porting the orchestrator was fun.

Finding the seam between shared execution contracts and runtime-specific launch contracts was the part that actually made it usable.

I Ported OpenAI's Symphony to Claude Code: A Complete Build Diary

Alessio Masucci — Fri, 13 Mar 2026 15:20:17 +0000

A few weeks ago, OpenAI open-sourced Symphony — an Elixir-based orchestrator that polls a project board, claims tickets, spins up Codex agents in isolated workspaces, manages multi-turn sessions, and handles retries. It's the missing piece between "AI can write code" and "AI can work through a backlog."

I watched the demo and immediately thought: I want this, but for Claude Code.

This is the build diary of that port. Every phase, every bug, every design decision — and the one architectural flaw that made me question how AI agents should handle failure.

Credit first: this isn't my idea

Let me be clear upfront: the architecture is OpenAI's. Symphony is their project, their design, their Elixir implementation. The polling loop, workspace isolation, multi-turn agent sessions, exponential backoff — all of that comes from their work.

What I built is a port. Same concept, different stack:

Layer	OpenAI Symphony	My Port
Language	Elixir	TypeScript
Agent	Codex	Claude Code CLI
Tracker	Linear	Linear
Runtime	BEAM/OTP	Node.js

I ported it because I use Claude Code, not Codex. And I wanted to understand every layer deeply enough to extend it for my own workflow.

I had a side project — Rentello — with a Linear backlog growing faster than I could work through it. Well-scoped tickets, clear acceptance criteria. A capable AI agent could handle most of them. I just needed the orchestration layer.

So I started building.

Phase 1: One file to rule them all

The first design decision was putting everything in a single WORKFLOW.md file per project.

---
tracker:
  kind: linear
  project_slug: "symphony-claude-12ab34cd56fg78"
  active_states: [Todo, In Progress, Merging, Rework]
workspace:
  root: ~/symphony-workspaces
hooks:
  after_create: |
    git clone --depth 1 https://github.com/mscalessio/symphony-claude .
agent:
  max_concurrent_agents: 10
  max_turns: 20
codex:
  command: claude
  approval_policy: bypassPermissions
---

You are working on Linear ticket `{{ issue.identifier }}`...

YAML front-matter defines the config. Everything below the separator is a Liquid template that becomes the agent's prompt. One file controls everything: which project to watch, which states trigger work, how to set up workspaces, and the full behavioral instructions.
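
A rough sketch of that split, with loud caveats: `splitFrontMatter` and `renderTemplate` are hypothetical helpers, the YAML is returned as raw text rather than parsed, and the placeholder replacement is a stand-in that handles only `{{ a.b }}` lookups, not real Liquid.

```typescript
// Split a WORKFLOW.md-style document into raw YAML config and prompt template.
function splitFrontMatter(source: string): { yaml: string; template: string } {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(source);
  if (!match) throw new Error("WORKFLOW.md must start with a --- front-matter block");
  return { yaml: match[1] ?? "", template: (match[2] ?? "").trim() };
}

// Stand-in for a Liquid engine: only replaces {{ object.key }} style lookups.
function renderTemplate(
  template: string,
  context: Record<string, Record<string, string>>,
): string {
  return template.replace(
    /\{\{\s*(\w+)\.(\w+)\s*\}\}/g,
    (whole: string, obj: string, key: string) => {
      const value = context[obj]?.[key];
      return value ?? whole; // leave unknown placeholders untouched
    },
  );
}

const doc = "---\ntracker:\n  kind: linear\n---\n\nYou are working on Linear ticket `{{ issue.identifier }}`...";
const parts = splitFrontMatter(doc);
console.log(renderTemplate(parts.template, { issue: { identifier: "MAS-5" } }));
```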

I used Zod for schema validation — type safety at both compile time and runtime. If the YAML is malformed, the system says exactly what's wrong before anything starts. The config is hot-reloadable: a file watcher detects changes to WORKFLOW.md, re-parses, re-validates, and swaps the live config without restarting.

Phase 2: Talking to Linear — GraphQL lessons

Connecting to Linear's GraphQL API seemed straightforward. I needed three queries:

Candidate issues — all tickets matching the project and active states
State refresh — current state for running workers' tickets
Terminal cleanup — issues in terminal states that need workspace removal

The candidate query worked immediately. Then I tried to filter blocker relations:

inverseRelations(type: "blocks") {
  nodes {
    issue { id identifier state { name } }
  }
}

HTTP 400. GRAPHQL_VALIDATION_FAILED.

Linear's inverseRelations field doesn't accept a type argument. Unlike GitHub's API, there's no server-side relation filtering. Every poll tick was failing silently: no candidates fetched, no agents dispatched. The system looked idle, but the poll was actually erroring out every 5 seconds.

The fix: Fetch all relations, filter client-side in the normalization layer. Simple, but it cost me an afternoon of staring at "0 candidates" wondering why.
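
A sketch of what that client-side filter might look like. The `RelationNode` shape is an assumption: only the nested `issue` fields appear in the query above, and the `type` field and its `"blocks"` value are presumed to be available on each relation.

```typescript
// Assumed shape of one relation node; only `issue` appears in the original query.
interface RelationNode {
  type: string; // assumed values such as "blocks" or "related"
  issue: { id: string; identifier: string; state: { name: string } };
}

// Keep only "blocks" relations and reduce them to unresolved blockers.
function unresolvedBlockers(nodes: RelationNode[], terminalStates: string[]): string[] {
  return nodes
    .filter((n) => n.type === "blocks")
    .filter((n) => !terminalStates.includes(n.issue.state.name))
    .map((n) => n.issue.identifier);
}

const relations: RelationNode[] = [
  { type: "blocks", issue: { id: "1", identifier: "MAS-1", state: { name: "Done" } } },
  { type: "blocks", issue: { id: "2", identifier: "MAS-2", state: { name: "In Progress" } } },
  { type: "related", issue: { id: "3", identifier: "MAS-3", state: { name: "Todo" } } },
];
console.log(unresolvedBlockers(relations, ["Done", "Canceled"])); // [ 'MAS-2' ]
```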

Lesson: GraphQL APIs are snowflakes. What works on one platform won't work on another. Read the schema, don't assume.

Phase 3: Building the brain — the orchestration engine

The orchestrator is a state machine running on a timer. Every tick runs three phases:

Reconciliation

Check every running agent for stalls (no activity in 5+ minutes)
Refresh ticket states from Linear — if a human moved a ticket to "Done" while an agent was working, kill the agent

Validation

Sanity-check the config (project slug not empty, API key set, etc.)

Dispatch

Find eligible tickets: not already running, in an active state, has a concurrency slot, no unresolved blockers
Sort by priority, then creation date
Spawn workers
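
The dispatch phase can be sketched as a pure selection function. The `Candidate` shape and `selectForDispatch` are hypothetical, and the priority convention (lower number is more urgent, 0 means unset) is an assumption that should be checked against the tracker.

```typescript
interface Candidate {
  identifier: string;
  state: string;
  priority: number;  // assumption: lower number = more urgent, 0 = unset
  createdAt: string; // ISO timestamp, so string order matches time order
  hasOpenBlockers: boolean;
}

function selectForDispatch(
  candidates: Candidate[],
  running: Set<string>,
  activeStates: string[],
  maxConcurrent: number,
): Candidate[] {
  const freeSlots = Math.max(0, maxConcurrent - running.size);
  const rank = (p: number) => (p === 0 ? Number.MAX_SAFE_INTEGER : p); // unset sorts last
  const byDate = (a: string, b: string) => (a < b ? -1 : a > b ? 1 : 0);
  return candidates
    .filter((c) => !running.has(c.identifier))      // not already running
    .filter((c) => activeStates.includes(c.state))  // in an active state
    .filter((c) => !c.hasOpenBlockers)              // no unresolved blockers
    .sort((a, b) => rank(a.priority) - rank(b.priority) || byDate(a.createdAt, b.createdAt))
    .slice(0, freeSlots);                           // respect concurrency slots
}
```

Keeping this step free of I/O makes the ordering and slot logic easy to unit test separately from the Linear client.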

Each agent is tracked as a RunningEntry with:

An AbortController for clean cancellation
Token counters (input + output)
A stall timer (last activity timestamp)
The current turn number and session ID
The most recent stream event (for TUI display)
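
As a sketch, the entry and the stall check from the reconciliation phase might look like this. Field names and the `reapStalled` helper are illustrative assumptions; only the list of tracked items comes from the description above.

```typescript
// Field names are illustrative; the list above only describes what is tracked.
interface RunningEntry {
  issueId: string;
  abort: AbortController;
  inputTokens: number;
  outputTokens: number;
  lastActivityAt: number; // epoch milliseconds
  turn: number;
  sessionId: string | null;
  lastEvent: string | null;
}

const STALL_TIMEOUT_MS = 5 * 60 * 1000;

// Abort every entry with no activity inside the timeout; return their issue ids.
function reapStalled(
  entries: RunningEntry[],
  now: number,
  timeoutMs: number = STALL_TIMEOUT_MS,
): string[] {
  const stalled = entries.filter((e) => now - e.lastActivityAt >= timeoutMs);
  for (const entry of stalled) entry.abort.abort();
  return stalled.map((e) => e.issueId);
}
```

Passing `now` in as a parameter keeps the check deterministic in tests.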

The retry system was the trickiest part. I needed two modes:

Normal retries — the agent completed a turn but the ticket is still in an active state. Quick 1-second delay, re-dispatch.
Abnormal retries — crashes, timeouts, API errors. Exponential backoff: 10s, 20s, 40s, up to a configurable maximum (default 5 minutes).

The backoff keeps failed workers from retrying in a tight loop and hammering Linear while it recovers from a transient outage.
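
The abnormal-retry schedule described above works out to a one-line formula. A minimal sketch, where `backoffDelayMs` and its parameter names are hypothetical and the attempt counter is assumed to start at 1:

```typescript
// Exponential backoff: 10s, 20s, 40s, ... capped at a configurable maximum.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 10_000,
  maxMs: number = 5 * 60 * 1000,
): number {
  // attempt is 1-based: 1 -> base, 2 -> 2 * base, 3 -> 4 * base, ...
  const exponent = Math.max(0, attempt - 1);
  return Math.min(baseMs * 2 ** exponent, maxMs);
}

console.log([1, 2, 3, 4, 5, 6].map((n) => backoffDelayMs(n)));
// [ 10000, 20000, 40000, 80000, 160000, 300000 ]
```

With the defaults, the cap is reached on the sixth attempt, since the uncapped value would be 320 seconds.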

Phase 4: The subprocess bet

The biggest architectural decision was how to run Claude. Two options:

Option A: Claude SDK. Make API calls directly. Manage conversation state, tool execution, and permission enforcement in my code.

Option B: Claude CLI. Spawn claude -p --output-format stream-json as a child process. Parse the JSON event stream from stdout.

I chose the CLI. Here's the reasoning:

Concern	SDK	CLI
Tool execution	I implement it	CLI handles it
Session state	I manage it	`--resume` handles it
Permissions	I enforce them	CLI enforces them
Observability	I build it	Stream events give it to me
Coupling	Tight to SDK version	Loose to CLI interface

The CLI does the hard work. The orchestrator just observes and reacts. The multi-turn loop is elegant:

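let turnNumber = 0;

while (turnNumber < maxTurns) {
  turnNumber += 1;

  const result = await spawnClaudeTurn({
    prompt,
    cwd: workspacePath,
    sessionId: sessionId ?? undefined, // resume if we have one
    mcpConfigPath,
    // ...
  });

  if (!result.success) break;

  // Keep the session ID so the next turn resumes the same conversation
  // (assumes the turn result exposes it as `sessionId`).
  sessionId = result.sessionId ?? sessionId;

  // Check if the ticket is still active. The tracker is asked for a list of IDs,
  // so this assumes it returns one state name per requested ID.
  const [state] = await tracker.fetchIssueStatesByIds([issue.id]);
  if (state === undefined || !activeStates.includes(state)) break;
}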

The downside — dependency on CLI argument parsing — would bite me later. Hard.
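
Parsing the event stream from stdout can be sketched as a small line buffer. This assumes the stream-json output is one JSON object per line; `JsonLineParser` is a hypothetical name, and the `type` field in the usage lines is only an example payload.

```typescript
// Incremental parser for newline-delimited JSON arriving in arbitrary chunks.
class JsonLineParser {
  private buffer = "";

  // Feed one stdout chunk; returns every complete event that chunk finished.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line for later
    const events: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        events.push(JSON.parse(trimmed));
      } catch {
        // Non-JSON output is skipped here; a real runner would probably log it.
      }
    }
    return events;
  }
}

const parser = new JsonLineParser();
console.log(parser.push('{"type":"a"}\n{"ty').length); // 1 (second event is incomplete)
console.log(parser.push('pe":"b"}\n').length);         // 1 (second event completed)
```

Chunk boundaries from a child process do not line up with event boundaries, which is why the partial tail has to be carried over between calls.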

Phase 5: The MCP server trick

Each agent needs to interact with Linear: read ticket details, post workpad comments, update states. But I didn't want API keys in the prompt.

I built a tiny MCP (Model Context Protocol) server — a standalone Node.js process exposing a single linear_graphql tool. For each agent session, the orchestrator writes a per-workspace MCP config:

{
  "mcpServers": {
    "symphony-linear": {
      "command": "node",
      "args": ["/path/to/linear-graphql-server.js"],
      "env": {
        "LINEAR_API_KEY": "lin_api_...",
        "LINEAR_ENDPOINT": "https://api.linear.app/graphql"
      }
    }
  }
}

The agent calls linear_graphql like any other tool. It doesn't know about authentication. It doesn't know about the orchestrator. Complete isolation.

This turned out to be the cleanest design decision in the project. Each workspace is a self-contained unit: its own directory, its own git clone, its own MCP config, its own credentials.

Phase 6: The terminal dashboard

I wanted the same visual experience as the original Elixir Symphony — a live terminal dashboard showing running agents, token throughput, and retry queues.

I built it with raw ANSI escape codes. No blessed, no ink, no external library.

 SYMPHONY  3 agents | 1,247 tok/s | 0:14:32 | 847,291 tokens
 ─────────────────────────────────────────────────────────
 ID      STATE        PID    AGE    TURN  TOKENS  EVENT
 MAS-5   In Progress  4821   2:31   1     12,847  Writing tests...
 MAS-8   In Progress  4935   1:07   1      5,221  Reading file...
 MAS-12  Todo         5012   0:03   1        423  Planning...

 BACKOFF QUEUE
 MAS-3   retry in 18s (attempt 2) — Linear API timeout

Alternate screen buffer (like vim), hidden cursor, 1-second refresh, SIGWINCH resize handling. When the TUI is active, Pino logs redirect to a file so they don't corrupt the display.

The renderer is a pure function — state in, ANSI string out. Easy to test: 24 tests covering column formatting, truncation, empty states, and terminal width edge cases.
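
A sketch of the kind of pure row formatter that makes this testable. Column widths, the `AgentRow` shape, and the `fit` and `renderRow` helpers are assumptions, and the output is plain text with no ANSI escape codes.

```typescript
interface AgentRow {
  id: string;
  state: string;
  turn: number;
  tokens: number;
  event: string;
}

// Pad or truncate a cell to an exact width, marking truncation with "…".
function fit(text: string, width: number): string {
  if (width <= 0) return "";
  if (text.length <= width) return text.padEnd(width);
  return text.slice(0, width - 1) + "…";
}

// Fixed-width columns first; the event column takes whatever width is left.
function renderRow(row: AgentRow, terminalWidth: number): string {
  const fixed = [
    fit(row.id, 8),
    fit(row.state, 13),
    fit(String(row.turn), 6),
    fit(row.tokens.toLocaleString("en-US"), 8),
  ].join("");
  const remaining = Math.max(0, terminalWidth - fixed.length);
  return fixed + fit(row.event, remaining);
}
```

Because the function takes the terminal width as an argument, narrow-terminal edge cases can be asserted without a real TTY.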

Phase 7: The cascade — three bugs hiding behind each other

Everything was built. 144 unit tests passing. TypeScript compiling clean. Time for the real test: a live ticket against a real Linear board.

I started the orchestrator. It claimed MAS-5. And crashed in 200 milliseconds.

Bug 1: ENAMETOOLONG

Error: Invalid MCP configuration:
Failed to read file: Error: ENAMETOOLONG: name too long, open

The system was trying to open() a 3,000-character string as a filename. But what string?

I traced the shell command being constructed:

claude -p --output-format stream-json --mcp-config /path/config.json "You are working on..."

That last argument — the rendered prompt template — was a positional arg. With --mcp-config present, Claude CLI changes how positional arguments are parsed. The prompt was being treated as a file path.

Fix: Always pipe the prompt through stdin. Never pass it as a positional argument.

Bug 2: The silent flag requirement

Fixed Bug 1. Rebuilt. Ran again.

Error: When using --print, --output-format=stream-json requires --verbose

A new CLI validation rule. Not in any changelog I could find. Just a new gate that appeared between "last time this worked" and now.

Fix: Add --verbose to the args. One line.
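
Taken together, the first two fixes amount to an argument list that never contains the prompt. A sketch using only the flags mentioned in this post; `TurnOptions`, `buildClaudeArgs`, `PromptSink`, and `sendPrompt` are hypothetical names, and passing the session ID as the value of `--resume` is an assumption.

```typescript
interface TurnOptions {
  mcpConfigPath: string;
  sessionId?: string;
}

// Build the argument list. The prompt is deliberately absent: it goes to stdin.
function buildClaudeArgs(options: TurnOptions): string[] {
  const args = [
    "-p",
    "--output-format", "stream-json",
    "--verbose", // required alongside stream-json when printing
    "--mcp-config", options.mcpConfigPath,
  ];
  if (options.sessionId) args.push("--resume", options.sessionId);
  return args;
}

// Minimal writable interface so the sketch does not depend on Node's types.
interface PromptSink {
  write(data: string): unknown;
  end(): unknown;
}

function sendPrompt(stdin: PromptSink, prompt: string): void {
  stdin.write(prompt);
  stdin.end(); // closing stdin signals that the prompt is complete
}
```

In a Node runner, the stdin stream of the spawned child process would play the role of `PromptSink`.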

Bug 3: The GraphQL ghost

Fixed Bug 2. An agent spawned. It read the ticket. It planned. It wrote code. It completed its first turn successfully.

Then the orchestrator tried to check if the ticket was still active:

Cannot query field "nodes" on type "Query".

The state-refresh query was using nodes(ids: [...]) — a Relay convention for fetching any node by global ID. GitHub has it. Shopify has it. Linear doesn't.

This query had been broken since day one. It never ran because Bug 1 killed the process before any agent could complete a turn. Fix Bug 1, Bug 2 appears. Fix Bug 2, Bug 3 appears.

Fix: Replace with issues(filter: { id: { in: $ids } }).
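
A sketch of the replacement query and the mapping back to state names. The selected fields, the `[ID!]` variable type, and the `toStateMap` helper are assumptions to verify against Linear's schema; the filter shape is the one from the fix above.

```typescript
// Query text follows the fix above; selected fields and variable type are assumed.
const ISSUE_STATES_QUERY = `
  query IssueStates($ids: [ID!]) {
    issues(filter: { id: { in: $ids } }) {
      nodes { id state { name } }
    }
  }
`;

interface IssueStatesResponse {
  data?: { issues: { nodes: Array<{ id: string; state: { name: string } }> } };
  errors?: Array<{ message: string }>;
}

// Turn the response into id -> state name, failing loudly on GraphQL errors.
function toStateMap(response: IssueStatesResponse): Map<string, string> {
  if (response.errors && response.errors.length > 0) {
    throw new Error(response.errors.map((e) => e.message).join("; "));
  }
  const states = new Map<string, string>();
  for (const node of response.data?.issues.nodes ?? []) {
    states.set(node.id, node.state.name);
  }
  return states;
}
```

Throwing on the `errors` array is a deliberate choice here: a validation failure then surfaces on the first refresh instead of looking like an empty result.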

The meta-lesson

Layered failures are the defining challenge of agent orchestration systems. In a traditional web app, a database error is a database error. In an agent system, a CLI parsing quirk masks a missing GraphQL field, which in turn masks a design flaw in the rework flow.

You have to fix bugs in strict sequential order. Each fix reveals the next failure. And unit tests — all 144 of them — caught none of these because each component worked perfectly in isolation.

Phase 8: The design flaw

With all three code bugs fixed, the system worked end-to-end. An agent claimed a ticket, implemented it, opened a PR, and moved the card to "Human Review."

But while tracing the full lifecycle during debugging, I read the Rework flow. When a reviewer requests changes, the prompt tells the agent to:

Close the existing PR
Delete all progress notes
Create a fresh branch from main
Start the entire implementation from scratch

A reviewer says "rename this variable" and the agent throws away everything — 47 changed files, hours of compute — to re-implement from zero.

This isn't a code bug. It's a design flaw in the prompt template. The Rework instruction treats every review cycle as a total reset rather than an incremental fix.

The right approach: read the review comments, address each one on the existing branch, push updates to the same PR. But that requires a more sophisticated prompt — one that can parse GitHub review comments, map them to specific code locations, and make targeted changes while preserving everything else.

I haven't fixed it yet. It's the next evolution. But I would never have found it if the three code bugs hadn't forced me to trace the full lifecycle manually.

Sometimes the best thing a bug gives you is a reason to actually read your own system.

The final architecture

WORKFLOW.md
    |
    v
[Config Loader] ---> [Zod Validator] ---> [Resolver ($VAR, ~)]
    |
    v
[Orchestrator]
    |-- tick() every N seconds
    |-- [Reconciler] --> stall detection + Linear state refresh
    |-- [Dispatcher] --> eligibility + priority sorting
    |-- [Retry Queue] --> exponential backoff
    |
    v
[Worker Runner]
    |-- create workspace
    |-- write MCP config
    |-- run hooks
    |-- multi-turn loop:
    |     spawn claude CLI --> parse stream --> accumulate usage
    |     check ticket state --> continue or break
    |-- cleanup
    |
    v
[Claude CLI Subprocess]
    |-- stdin: prompt
    |-- stdout: stream-json events
    |-- MCP: linear_graphql tool
    |
    v
[Observers]
    |-- TUI Dashboard (ANSI)
    |-- HTTP API + Web Dashboard
    |-- Pino structured logs

~2,200 lines of source code. 144 tests. 11 test files. 10+ modular layers.

What I'd do differently

1. Integration tests from day one. I had excellent unit test coverage and zero integration tests. The three-bug cascade would have been caught by a single test that spawned a real Claude process against a mock Linear API.

2. Stdin from the start. Positional arguments for the CLI prompt were a ticking bomb. Stdin is universally stable.

3. Incremental rework, not reset. The "throw everything away" approach to review feedback was baked into the initial prompt design. I should have designed for the review cycle from the beginning, not as an afterthought.

4. GraphQL schema introspection. Instead of assuming query patterns from other APIs, I should have introspected Linear's schema on startup. Would have caught the nodes and inverseRelations issues immediately.
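
A startup check along the lines of point 4 could use standard GraphQL introspection. In this sketch the `__type` query is part of the GraphQL specification, while `TypeFieldsResponse`, `missingFields`, and the field names in the usage lines are hypothetical.

```typescript
// Standard GraphQL introspection query for one type's fields.
const TYPE_FIELDS_QUERY = `
  query TypeFields($name: String!) {
    __type(name: $name) { fields { name } }
  }
`;

interface TypeFieldsResponse {
  data?: { __type: { fields: Array<{ name: string }> | null } | null };
}

// Return the required field names that the schema does not define.
function missingFields(response: TypeFieldsResponse, required: string[]): string[] {
  const available = new Set((response.data?.__type?.fields ?? []).map((f) => f.name));
  return required.filter((name) => !available.has(name));
}

const sample: TypeFieldsResponse = {
  data: { __type: { fields: [{ name: "id" }, { name: "identifier" }, { name: "state" }] } },
};
console.log(missingFields(sample, ["state", "blockedByIssues"])); // [ 'blockedByIssues' ]
```

Running the query with `name: "Query"` covers root fields such as `nodes`. For input types such as filters, introspection reports `inputFields` rather than `fields`, so a complete check would need to query both.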

What I learned

AI agent systems have emergent behavior. Individual components work perfectly. The failures only appear at integration boundaries — between your code and the CLI, between the CLI and the API, between the API and the schema.

CLI tools are moving targets. When you shell out to an external tool, you're taking a dependency on its argument parsing, its validation rules, its undocumented behaviors. Treat it like an untrusted external service, not a function call.

Run the full loop early. The gap between "all tests pass" and "the system works" was enormous. End-to-end execution with real APIs is where the truth lives.

Debug sessions are design reviews. I found the biggest design flaw — destructive rework — while debugging unrelated code bugs. Tracing the full lifecycle forced me to read my system as a user would experience it.

The system is running. It watches my Linear board, claims tickets, writes code, and opens PRs. Most of the time it works. Sometimes it throws away a perfectly good PR because someone asked it to rename a variable.

We're working on that part.

If you're building similar systems — autonomous agents operating on real codebases with real project trackers — I'd love to hear what's working for you. The tooling is evolving fast, and half the battle is just keeping up with the tools we depend on.