
Raffaele Pizzari

Originally published at pixari.dev

AI-Assisted Product Engineering: Orchestrating Claude Code Across the Software Development Lifecycle

Most LLM coding tools live inside a single editor session. They suggest, complete, and refactor inside one file at a time. That is useful, but it is not where real product engineering happens.

Real engineering spans ticket breakdown, cross-repository implementation, code review, merge request management, and the knowledge that has to survive between sessions. None of that fits in one tool window.

I built a system that orchestrates Claude Code across that full lifecycle. It has been running daily for months. This post describes how it works, why it is structured the way it is, and what I have learned from the parts that broke.

The core thesis is one sentence.

The right unit of agent invocation is the judgment step, not the workflow.

Mechanical steps (the API calls, the test runs, the git operations) do not need an LLM. They need deterministic code. The agent should be invoked only when something genuinely requires judgment: writing the code, evaluating a review finding, choosing between two architectural options. Conflating these two categories is the most expensive mistake I see in agent systems.

The architecture

A terminological note before going further. Claude Code is not a raw API. It is an agent runtime: an LLM with tool use (file reads, shell commands), file system access, and a multi-turn loop. When the orchestrator "hands off to Claude Code", it is not a single API call. It is transferring control to an autonomous process that may read dozens of files, run commands, and iterate before returning. I will use "the agent" or "Claude Code" for what the system invokes, and "LLM" only when discussing the underlying model's behavior.

Three principles guide the design.

Python orchestrates, the agent reasons. Every workflow is split into phases. Phases that involve API calls, file operations, test execution, or data transformation are deterministic Python scripts. Claude Code is invoked only when the task requires judgment. This separation reduces token consumption, improves latency (mechanical phases complete in under two seconds), and makes the system auditable.

Propose, do not execute. The system never performs irreversible external actions (merging code, closing tickets, sending messages) without explicit human approval. It creates structured proposals that surface in a dashboard for review. This makes the system safe to leave running unattended.

Compound knowledge, do not re-derive it. Engineering context (architectural decisions, team ownership, ticket history) is captured in a persistent wiki and an operational database. Each session starts with this accumulated context rather than re-deriving it from scratch.
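
To make the first principle concrete, here is a minimal sketch of the split, assuming a hypothetical brief format and Claude Code's non-interactive print mode; the real orchestrators are considerably more involved.

```python
import json
import subprocess

def assemble_context(ticket_id: str) -> dict:
    """Phase 1, deterministic: fetch the ticket, search the wiki,
    create the worktree. No LLM involved; completes in seconds."""
    # ... API calls, git operations, data transformation ...
    return {"ticket": ticket_id, "brief": "...", "standards": "..."}

def implement(brief: dict) -> str:
    """Phase 2, judgment: hand the assembled brief to the agent."""
    result = subprocess.run(
        ["claude", "-p", json.dumps(brief)],  # non-interactive hand-off
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```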

The six layers

┌─────────────────────────────────────────────────────────┐
│  1. User          CLI + Dashboard                       │
├─────────────────────────────────────────────────────────┤
│  2. Skill         Command → orchestrator routing        │
├─────────────────────────────────────────────────────────┤
│  3. Orchestrator  Python, phased, JSON I/O              │
├─────────────────────────────────────────────────────────┤
│  4. Agent         Claude Code + specialized subagents   │
├─────────────────────────────────────────────────────────┤
│  5. Data          SQLite + Markdown wiki + ChromaDB     │
├─────────────────────────────────────────────────────────┤
│  6. External      Jira, GitLab, Confluence, K8s         │
└─────────────────────────────────────────────────────────┘

Layers 1 to 3 are deterministic. Layer 4 is where Claude Code operates. Layers 5 and 6 are stateful backends. The skill layer maps user commands to orchestrators via a YAML manifest, so the system's capabilities are explicit. Specialized agents (code review, knowledge synthesis, planning) run in isolated context windows with explicitly scoped tool permissions. The code review agent, for instance, cannot edit files.
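
A sketch of that routing, with the manifest shown already parsed and with invented command names and module paths:

```python
import importlib

# Illustrative manifest entries; the real system loads these from YAML.
SKILLS = {
    "/ticket":  {"orchestrator": "orchestrators.ticket:run"},
    "/standup": {"orchestrator": None},  # agent-native: no mechanical steps
}

def route(command: str, payload: dict):
    """Dispatch a command to its orchestrator, or straight to the agent."""
    skill = SKILLS[command]
    if skill["orchestrator"] is None:
        return invoke_agent(command, payload)  # hypothetical direct hand-off
    module_name, func_name = skill["orchestrator"].split(":")
    orchestrator = getattr(importlib.import_module(module_name), func_name)
    return orchestrator(payload)
```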

When a skill needs an orchestrator

Not every skill needs the full structure. The deciding factor is side effects.

Orchestrated skills have multi-step workflows with external side effects: ticket implementation, MR creation, CI analysis, code review remediation. They need deterministic coordination (create branch, run tests, push code) interleaved with agent judgment.

Agent-native skills are single-turn reasoning tasks: debugging a service issue, classifying an unknown input, generating a standup summary. The agent reads context and produces an output. There is nothing mechanical worth extracting.

If a skill creates branches, runs tests, calls external APIs, or modifies shared state, it gets an orchestrator. If it only reads and reasons, the agent handles it directly. Adding an orchestrator has a real cost: more code to maintain, more failure modes, more surface area to test. It is justified only when the mechanical steps are complex enough that the agent would be unreliable executing them.

A ticket from start to finish

To make this concrete, here is the lifecycle of a single ticket implementation.

                    ┌──────────────────────┐
                    │    User: /ticket     │
                    │    <ticket-id>       │
                    └──────────┬───────────┘
                               │
              ┌────────────────▼────────────────┐
              │  Phase 1: Context Assembly       │
              │  (Python orchestrator)           │
              │                                  │
              │  • Fetch Jira ticket             │
              │  • Search wiki for decisions     │
              │  • Create worktree + branch      │
              │  • Extract implementation brief  │
              │  • Return JSON bundle            │
              └────────────────┬────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │  Phase 2: Implementation         │
              │  (Claude Code)                   │
              │                                  │
              │  • Read brief + standards        │
              │  • Write / modify code           │
              └────────────────┬────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │  Phase 3: Validation             │
              │  (Orchestrator + Review Agent)   │
              │                                  │
              │  • Run tests, lint, format       │
              │  • If fail → back to agent (3x)  │
              │  • Dispatch code review agent    │
              │  • If blockers → back to agent   │
              └────────────────┬────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │  Phase 4: Proposal + Ship        │
              │  (Orchestrator → Human → Orch.)  │
              │                                  │
              │  • Create exchange proposal      │
              │  • ── HUMAN DECISION POINT ──    │
              │  • On approve: push + create MR  │
              │  • Log to activity trail         │
              └─────────────────────────────────┘

Claude Code is invoked only in Phase 2 and during fix iterations in Phase 3. Everything else is deterministic Python.
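
For a sense of scale, the bundle Phase 1 hands to Phase 2 is on the order of this sketch; every field name and value here is invented for illustration.

```python
brief = {
    "ticket": {
        "key": "PROJ-123",
        "summary": "Add rate limiting to the export endpoint",
        "acceptance_criteria": ["..."],
    },
    "wiki_context": [
        {"page": "decisions/rate-limiting.md", "confidence": "verified"},
    ],
    "worktree": "/path/to/worktrees/PROJ-123",
    "branch": "feature/PROJ-123-rate-limiting",
    "standards": "docs/coding-standards.md",
}
```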

Before and after

The first version of the system did not look like this. The agent orchestrated everything. It read 150-to-200-line configuration files, made API calls through tool use, managed git operations, and tracked its own progress.

That version had three problems.

Latency. A complete ticket workflow took several minutes, dominated by the agent parsing configuration and deciding which API call to make next.

Token consumption. The agent's context window filled with mechanical details (API responses, git output, test logs) that displaced the actual implementation context.

Brittleness. The agent would skip steps, hallucinate API parameters, or lose track of which phase it was in. These failures were non-deterministic and hard to reproduce.

After moving mechanical steps to Python orchestrators, Claude Code receives a 30-to-50-line context brief instead of navigating 200 lines of configuration. Workflow latency dropped by roughly an order of magnitude. Mechanical phases now complete in under two seconds. Failures produce deterministic error messages instead of vague agent confusion. Token consumption dropped substantially, because the agent no longer processes responses it only needs to pass through.

A second-order benefit is testability. Python orchestrators can be unit-tested with mock data, so I can verify the mechanical pipeline independently of the agent. That is not possible when the agent is the orchestrator.

This separation paid off immediately. It is the single most impactful design decision in the system.

Memory and observability

A system that acts on your behalf needs two things: memory, so it does not re-derive context every session, and transparency, so you can trust what it is doing. These are deeply intertwined.

The semantic wiki

Long-term memory is a collection of Markdown pages organized by category (features, tickets, teams, decisions, architectural concepts). Each page follows a structured template with metadata, cross-references, confidence tiers, and a changelog.

A specialized knowledge agent creates and maintains the pages, synthesizing information from Jira, Confluence, GitLab, and prior conversations. The wiki distinguishes between three kinds of facts:

  • Verified facts: directly cited from an authoritative source with a reference ID.
  • Inferred facts: synthesized from patterns across multiple sources.
  • Human-provided facts: explicitly stated by a user in an exchange response.

This provenance tracking matters more than I expected. It prevents the most common failure mode of LLM-driven knowledge bases: the model fabricates context, the system stores it, and a week later that fabrication is being cited as truth.
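
A minimal way to model that provenance, sketched with invented names:

```python
from dataclasses import dataclass
from enum import Enum

class Confidence(Enum):
    VERIFIED = "verified"  # directly cited from an authoritative source
    INFERRED = "inferred"  # synthesized from patterns across sources
    HUMAN = "human"        # explicitly stated in an exchange response

@dataclass
class Fact:
    text: str
    confidence: Confidence
    source_ref: str | None = None  # e.g. a Jira key; required when VERIFIED
```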

Wiki pages have field-level staleness thresholds. Team ownership becomes stale after 14 days. Architectural decisions remain fresh for 90 days. Ticket status is never cached, because it changes too often. When a stale page is queried, the knowledge agent silently re-ingests it before using it.
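
The threshold logic itself is simple. A sketch using the values from the text, with an assumed default for unlisted fields:

```python
from datetime import datetime, timedelta

STALENESS = {
    "team_ownership": timedelta(days=14),
    "architectural_decision": timedelta(days=90),
    "ticket_status": timedelta(0),  # never cached: always re-fetched
}

def is_stale(field: str, last_ingested: datetime) -> bool:
    ttl = STALENESS.get(field, timedelta(days=30))  # assumed default
    return datetime.now() - last_ingested >= ttl
```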

After sustained use, the wiki has become one of the most valuable parts of the system. It contains synthesized knowledge about ownership, decisions, and cross-repository dependencies that would take hours to reconstruct from scratch. The confidence tiers are essential. Without them, agents treat inferred knowledge as if it were verified, and you compound hallucinations into authoritative-looking documentation.

The operational database

Short-term state lives in SQLite and tracks four things:

  • Work items: tickets, MRs, and plans with current status, CI state, and cross-repo dependencies.
  • Exchange items: structured proposals from agents to humans (more on these below).
  • To-do items: a prioritized task queue with urgency levels and ownership.
  • Activity log: an append-only audit trail of every external action.

This database is the substrate for the dashboard and the heartbeat process.
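
An illustrative slice of that schema; table and column names are my sketch, not the actual database.

```python
import sqlite3

conn = sqlite3.connect("ops.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS work_items (
    id         INTEGER PRIMARY KEY,
    ticket_key TEXT,
    status     TEXT,
    ci_state   TEXT
);
CREATE TABLE IF NOT EXISTS exchange_items (
    id      INTEGER PRIMARY KEY,
    intent  TEXT CHECK (intent IN ('approval','decision','question','blocker','flag')),
    urgency TEXT,
    body_md TEXT,
    answer  TEXT,
    state   TEXT DEFAULT 'open'  -- open -> answered -> done
);
CREATE TABLE IF NOT EXISTS activity_log (
    id         INTEGER PRIMARY KEY,
    ts         TEXT DEFAULT CURRENT_TIMESTAMP,
    action     TEXT,
    resource   TEXT,
    ticket_key TEXT,
    skill      TEXT,
    repo       TEXT
);
""")
```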

The dashboard

A lightweight web dashboard, a single-file application with no external dependencies, gives real-time visibility into active work, pending proposals, the to-do queue, recent activity, knowledge health (stale pages, open questions, broken cross-references), and a heartbeat indicator.

The dashboard is also the primary approval surface for exchange items, with controls for approve, defer, and dismiss. It refreshes every five seconds.

The heartbeat indicator turned out to be unexpectedly important. Knowing that the background process is alive and polling gives me confidence that the system is aware of its environment. A stale heartbeat is an immediate signal that something needs attention.

Activity logging

Every external write is logged on success. The log captures the action type, the affected resource, the associated ticket, the skill that triggered it, and the target repository. Reads and internal state changes are not logged, which keeps the trail focused on externally visible effects.
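
Against the schema sketched above, a log write is a few lines; reads never touch this table.

```python
def log_action(conn, action: str, resource: str,
               ticket_key: str, skill: str, repo: str) -> None:
    """Record a successful external write in the append-only trail."""
    conn.execute(
        "INSERT INTO activity_log (action, resource, ticket_key, skill, repo) "
        "VALUES (?, ?, ?, ?, ?)",
        (action, resource, ticket_key, skill, repo),
    )
    conn.commit()
```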

The activity log powers the dashboard's feed, generates standup reports ("what did the system do yesterday?"), prevents duplicate work (the heartbeat checks the log before re-proposing), and gives me a forensic trail when I need to debug something unexpected.

Human-in-the-loop controls

Hard limits

Some operations are never performed without explicit human confirmation:

  • Merging an MR (the system creates MRs, humans merge them).
  • Transitioning a ticket to "Done".
  • Deleting branches, files, or database records.
  • Creating tickets in protected project spaces.
  • Force-pushing to protected branches.
  • Running database schema migrations.
  • Sending messages to external communication channels.

These are enforced at the agent level through explicit constraints. The code review agent, for example, has a deny-list of tools (Edit, Write) so it cannot modify code. Review and implementation are separate by construction.

Labeling

All agent-created artifacts carry explicit labels:

  • Tickets created by agents include an ai-generated label.
  • MRs created by agents include an ai-automated label.
  • Commit messages follow a conventional format that includes the originating ticket key.

This means human team members can identify agent-produced work at a glance during review, triage, and audit. No guessing.

The exchange protocol

The governance model rests on the exchange protocol: a structured format for agent-to-human communication that replaces ad-hoc permission checks with explicit proposals.

Each exchange item has an intent (approval, decision, question, blocker, or flag), an urgency level, a Markdown body with relevant links, and a human answer field. There is no informational intent. Every exchange item requires human action. If the system cannot ask for something, it should not be telling you about it.

Items move from open to answered (when the human responds) to done (when execution completes). If execution fails, the system retries up to three times, preserving the original approval. After three failures it escalates by creating a new blocker item. Users can defer proposals for 24 hours; deferred items re-surface when the deferral expires.
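
The retry-then-escalate behavior, sketched against the exchange_items table from earlier; create_exchange_item is a hypothetical helper.

```python
MAX_RETRIES = 3

def execute_approved(conn, item_id: int, action) -> None:
    """Run an approved action; the original approval survives retries."""
    last_error = None
    for _ in range(MAX_RETRIES):
        try:
            action()
            conn.execute("UPDATE exchange_items SET state = 'done' WHERE id = ?",
                         (item_id,))
            conn.commit()
            return
        except Exception as err:
            last_error = err
    # Three failures: stop retrying and surface a blocker for a human.
    create_exchange_item(conn, intent="blocker", urgency="high",
                         body_md=f"Item {item_id} failed 3x: {last_error}")
```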

I tried building a permission model first. Define what the system can do autonomously, define what needs approval. It was fragile. The risk of an action depends on context. Pushing to a feature branch is routine. Pushing to main is dangerous. Same operation, different risk.

The proposal-approval model sidesteps this entirely. The system proposes everything and executes nothing without approval, with a small list of hard-coded exceptions (like creating a to-do for CI failure triage). Simpler, easier to reason about, more trustworthy.

It also solves the asynchrony problem. Proposals created during a heartbeat cycle, when no user session is active, are queued and presented at the next session start. Every decision has a timestamp, a human answer, and an execution outcome. The whole system is auditable.

Pre-commit safety

A hook system intercepts operations before execution. Before any commit, the system runs the linter and formatter. Commits that would introduce lint violations are blocked. This prevents the agent from introducing code quality regressions even when its generated code is syntactically correct.
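
A minimal version of such a gate, assuming ruff for both checks; the actual toolchain may differ.

```python
import subprocess
import sys

def pre_commit() -> int:
    """Block the commit if either check fails; non-zero exit aborts it."""
    for cmd in (["ruff", "check", "."], ["ruff", "format", "--check", "."]):
        if subprocess.run(cmd).returncode != 0:
            print(f"commit blocked: {' '.join(cmd)} failed", file=sys.stderr)
            return 1
    return 0

if __name__ == "__main__":
    sys.exit(pre_commit())
```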

Proactive behavior

The heartbeat

A background process runs every five minutes, independent of any active user session. It polls external systems for state changes and creates exchange items when it detects actionable events:

  • A blocked ticket becomes unblocked: propose starting implementation.
  • An MR receives a review comment: propose investigating.
  • A CI pipeline fails: create a to-do for triage.
  • An MR has been awaiting review for more than 24 hours: flag staleness.

The heartbeat is deliberately conservative. It proposes but never executes. Its job is to keep the system aware of the engineering environment even when nobody is actively working with it.
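
The shape of the loop, with the pollers left as hypothetical stubs:

```python
import time

POLL_INTERVAL_SECONDS = 300  # five minutes

def heartbeat(conn) -> None:
    while True:
        for event in poll_external_systems():        # hypothetical: Jira, GitLab, CI
            if not already_proposed(conn, event):    # dedupe against the activity log
                create_exchange_item(conn, **event)  # propose; never execute
        record_heartbeat(conn)                       # feeds the dashboard indicator
        time.sleep(POLL_INTERVAL_SECONDS)
```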

Session initialization

Every new session begins with a checklist:

  1. Verify the dashboard and heartbeat are running.
  2. Fetch backlog items created since the last session.
  3. Scan open exchange items by urgency.
  4. List pending to-do items.
  5. Present a concise summary before accepting input.

Continuity matters. The system picks up where it left off, instead of starting from zero every morning.

Where it breaks

Plenty.

Hallucinated references

The LLM can hallucinate a ticket key, a file path, an API endpoint. The orchestrator validates external references before acting on them. Ticket keys are checked against Jira, branch names against git, file paths against the file system. When validation fails, the orchestrator returns a structured error rather than propagating the hallucination.
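
A sketch of that validation step; the Jira check is reduced to a key-format test here, where the real check is an API lookup, and the brief fields are invented.

```python
import os
import re
import subprocess

def validate_refs(brief: dict) -> list[str]:
    """Check agent-supplied references against ground truth before acting."""
    errors = []
    if not re.fullmatch(r"[A-Z][A-Z0-9]*-\d+", brief["ticket_key"]):
        errors.append(f"malformed ticket key: {brief['ticket_key']}")
    branch = subprocess.run(
        ["git", "rev-parse", "--verify", "--quiet", brief["branch"]],
        capture_output=True,
    )
    if branch.returncode != 0:
        errors.append(f"unknown branch: {brief['branch']}")
    errors.extend(f"missing file: {p}"
                  for p in brief.get("files", []) if not os.path.exists(p))
    return errors  # structured errors instead of a propagated hallucination
```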

Stale knowledge acting as truth

Despite staleness management, a window exists between a real-world change and the next re-ingestion. I mitigate this by never caching fast-changing data and by marking inferred knowledge with lower confidence. Agents are instructed to treat inferred facts as context, not constraint. This is not a perfect defense. It is defense in depth.

Proposal flooding

During active development, the heartbeat can generate a high volume of exchange items. Review fatigue follows: I start approving things without reading them carefully. Urgency levels and 24-hour deferral reduce the volume, but the underlying tension between proactivity and cognitive load is real and unsolved.

Scope creep

Given an implementation brief, the agent will sometimes implement more than requested. Error handling for impossible cases. Refactoring adjacent code. Abstractions for hypothetical future requirements. I mitigate this with explicit coding standards ("don't add features beyond what was asked") and by having the code review agent flag scope creep as a blocker. It is still one of the most common failure modes. Constant calibration.

Operational reality

Cost

The whole system runs on a single laptop. No GPU, no dedicated server, no cloud infrastructure beyond the team tools that already exist (Jira, GitLab, Confluence). The only operational cost is Claude Code's API usage, which scales with the number of tickets processed.

Mechanical phases consume zero API tokens. The knowledge agent and heartbeat consume modest amounts during re-ingestion and polling. The bulk of consumption comes from implementation and code review, which are exactly the steps where agent reasoning is genuinely needed.

The SQLite database, the Markdown wiki, and the ChromaDB vector store all run locally. The dashboard is a single-file Node.js app. This minimalism was deliberate. Every external dependency is a maintenance burden and a failure mode.

Maintainability

Three things need ongoing maintenance.

Orchestrators. When external APIs change (a new Jira field, a GitLab API deprecation), the affected orchestrator needs updating. Plain Python with structured JSON I/O makes this straightforward to test and deploy. A few hours per month.

Standards. The coding standards file is a living document. When I notice a new failure mode (the agent over-engineers, a test pattern is fragile), I update the standards. This is not different from maintaining a team style guide, except that the primary consumer is an LLM. The standards evolve through the same proposal mechanism as everything else: the code review agent flags a recurring pattern, and it becomes a candidate for a new standard.

Wiki schema. As the engineering environment evolves, the wiki's category structure and staleness thresholds need adjustment. The schema is a single YAML file, so changes are low-risk.

What does not need maintenance: the exchange protocol, the dashboard, and the activity log. Stable across months of use. The layered architecture pays off here. Stable components (governance, observability) are decoupled from evolving ones (orchestrators, standards, wiki schema).

What breaks if you stop maintaining it

If the orchestrators fall behind external API changes, mechanical phases start failing with deterministic errors. The system degrades gracefully. The agent can still reason, but the automated context assembly stops working and you have to provide context manually. Annoying, not catastrophic.

If the standards stop evolving, the code review agent keeps enforcing stale rules. They drift from what the team actually wants. The system still works, but its output becomes increasingly misaligned with reality. Subtler failure.

If the wiki stops being maintained, it becomes unreliable. Staleness thresholds mitigate this, but if the underlying sources change in ways the schema does not anticipate, the wiki compounds outdated information. This is the most dangerous failure mode, because it is silent.

Lessons

The silent overwrite

Early in deployment, the system implemented a ticket that required modifying a shared utility function. Claude Code correctly identified the function and modified it to satisfy the new ticket's acceptance criteria. In doing so, it broke three other features that depended on the function's original behavior. The test suite caught one regression. The other two had no test coverage.

The root cause was not the agent's code quality. The modification was locally correct. The root cause was the orchestrator's context assembly. Phase 1 had provided the ticket's acceptance criteria and the target file, but not the list of callers. The agent did not know what else depended on that function.

The fix was straightforward. The orchestrator now includes a dependency analysis step that identifies all callers of modified functions and adds them to the implementation brief. The code review agent was updated to explicitly check for behavioral changes in shared code.
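
The dependency step can be as blunt as a repository-wide search for call sites, as in this sketch; a real implementation might use proper static analysis instead.

```python
import subprocess

def find_callers(function_name: str, repo_root: str) -> list[str]:
    """List call sites so the implementation brief names every dependent."""
    result = subprocess.run(
        ["git", "grep", "-n", f"{function_name}(", "--", "*.py"],
        cwd=repo_root, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines()
            if f"def {function_name}(" not in line]  # skip the definition
```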

The broader lesson is the most useful one I have.

The agent's failure modes are usually upstream, in the context it receives, not in its reasoning.

Improving context assembly has had a larger impact on output quality than any prompt engineering I have ever done.

Proposals beat permissions

The proposal-approval model replaced a fragile permission system with a simple rule: the system proposes, the human decides. Easier to implement, easier to reason about, easier to trust. The only ongoing challenge is proposal volume during active development.

Where agents still fall short

Even with the architectural mitigations, certain limitations remain.

  • Cross-repository reasoning. When a feature spans multiple services, the agent struggles to maintain a coherent mental model of the full change set. Structured tracking helps, but does not solve it.
  • Ambiguous acceptance criteria. When ticket descriptions are vague, the agent produces reasonable but often wrong implementations. The system flags ambiguous tickets as blockers rather than guessing.
  • Scope creep. The agent's tendency to over-engineer requires constant calibration through standards and review.
  • Stale context windows. In long sessions, earlier context falls out of the underlying LLM's effective attention. Session-start re-initialization mitigates but does not eliminate this.

Bounded autonomy beats the demo

Most autonomous coding agents on the market optimize for the demo. End-to-end issue resolution. Watch the agent work. Marvel at the autonomy.

I am not interested in the demo. I am interested in Tuesday morning, when someone has to debug why a merge broke staging.

Bounded autonomy with explicit human decision points is less impressive in a screencast and far more useful in practice. The system I built is deliberately the opposite of an autonomous agent. It is a tool with a strong opinion about what humans should still do.

If I had one piece of advice for someone building something similar, it would be this.

Start with the orchestrator, not the prompt. Figure out what context the agent actually needs, assemble it mechanically, and hand it over in a clean bundle. The agent will do the rest.

The hard part is not getting the agent to reason well. It is giving it the right things to reason about.

