Before We Start
Let me lead with the cost: this system burned thousands of dollars in API tokens during development and debugging.
I still think it's worth writing about. Because this isn't a demo — it's an end-to-end pipeline running against real enterprise systems. A bug ticket in Jira goes in. The AI reads the logs, diagnoses the root cause, writes the fix, runs a Code Review, executes a SonarQube scan, runs unit tests, submits to Gerrit, polls CI/CD for results, adds a human reviewer, and writes back a comment to Jira.
AI drives the whole thing. Humans only step in at critical decision gates.
This article is a full retrospective: how the system is designed, what worked, what failed, what engineering problems I ran into, and how I solved them.
Why Build This
There's a category of software engineering work that consumes enormous human capacity every day but follows a highly standardized process: bug fixing.
A typical bug workflow looks like this:
Receive bug ticket → Read logs → Find root cause → Fix code → Self-test
→ Code Review → Static scan → Unit tests → Submit → Wait for CI
→ Add reviewer → Wait for CR → Merge → Update Jira
Every step involves fixed actions, tool interactions, and a certain amount of accumulated experience. This is exactly the kind of work AI agents are built for.
The challenge: this is a 12+ node long-chain pipeline spanning Jira, log systems, code repositories, review standards, Gerrit, and CI/CD infrastructure. A tool failure at any single node can break the entire flow.
I spent a significant amount of time designing and debugging this workflow on the OpenClaw platform, turning a whiteboard design into a system that actually runs.
Architecture: Three Layers
The system has three layers: Skills, Workflow, and Platform.
┌────────────────────────────────────────────────┐
│                OpenClaw Platform               │
│ (Enterprise AI Coding Assistant, Claude-based) │
└──────────────────┬─────────────────────────────┘
                   │
┌──────────────────▼─────────────────────────────┐
│                Bug E2E Workflow                │
│  Node sequence, branching logic, retry policy  │
│       Human gates: checkpoints A / B / C       │
└──────────────────┬─────────────────────────────┘
                   │
┌──────────────────▼─────────────────────────────┐
│                     Skills                     │
│  One independently runnable AI skill per node  │
│ jira-communication / rnd-automotive-issue-     │
│ analyzer / write-code / ph-code-review /       │
│ ph-sonar-scan / ph-junit-ut / commit-format /  │
│ gerrit-verify / ...                            │
└────────────────────────────────────────────────┘
Skills are atomic units. Each skill maps to a single capability and has its own SKILL.md that defines context, input/output contracts, and execution steps. When the agent runs a node, it reads the corresponding skill document and follows the spec.
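I won't reproduce the real skill documents here, but the shape is roughly this (section names are illustrative, not the actual OpenClaw schema):

```markdown
# SKILL: ph-code-review

## Context
Reviews a Kotlin/Android change against the team's coding standard.

## Input
- `change_dir`: path to the checked-out change
- `standard`: path to the review-standard document

## Output
- `review_rN.json`: score, violations (mandatory / recommended), per-file notes

## Steps
1. Load the standard and the diff
2. Evaluate each rule and classify findings
3. Write the structured result plus a short human-readable summary
```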
Workflow is the orchestration layer. Defined in OpenClaw: node execution order, conditional branches (e.g., retry paths when Code Review fails), human gate trigger conditions, and cross-session state management.
Platform is OpenClaw — a Claude-based enterprise AI coding assistant that supports multi-agent concurrency, sub-agent invocation, workspace persistence, and workflow orchestration.
The Full Workflow: 12 Nodes
The workflow starts when a bug ticket arrives and ends when Jira is updated. In between:
| # | Node | Skill | Status |
|---|---|---|---|
| 1 | Fetch bug info & logs | jira-communication | ✅ |
| 2 | Root cause analysis | rnd-automotive-issue-analyzer | ✅ |
| 3 | Fetch source code | code-fetch | ✅ |
| 4 | Fix code | write-android-code | ✅ |
| 5 | Code Review | ph-code-review | ✅ |
| 6 | Static analysis (SonarQube) | ph-sonar-scan | ✅ |
| 7 | Generate & run unit tests | ph-junit-ut | ✅ |
| 8 | Commit code | commit-format | ✅ |
| 9 | Poll CI/CD verify result | gerrit-verify | ✅ |
| 10 | Add Gerrit reviewer | — | ✅ |
| 11 | Automated regression tests | — | ✅ |
| 12 | Write back Jira comment | jira-communication | ✅ |
Each node is much more than "call an API." Take root cause analysis as an example:
- Download the log attachment from Jira (usually a `.zip`)
- Extract it and locate the relevant log files
- Invoke `rnd-automotive-issue-analyzer` — a skill built specifically for diagnosing crashes, black screens, and system stability issues in automotive Android systems
- Output: root cause judgment, affected modules, and suggested fix directions
Only then does code modification begin.
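For the first two steps, the tooling side is ordinary Jira REST plus archive handling. A rough sketch, using the standard Jira attachment fields; the base URL and credentials are placeholders:

```python
import io
import zipfile
import requests

JIRA = "https://jira.example.com"   # placeholder base URL
AUTH = ("svc-bot", "api-token")     # placeholder credentials

def download_bug_logs(issue_key: str, dest_dir: str = "./logs") -> list[str]:
    """Fetch .zip log attachments from a Jira issue and extract them."""
    resp = requests.get(
        f"{JIRA}/rest/api/2/issue/{issue_key}",
        params={"fields": "attachment"},
        auth=AUTH, timeout=30,
    )
    resp.raise_for_status()
    extracted = []
    for att in resp.json()["fields"]["attachment"]:
        if not att["filename"].endswith(".zip"):
            continue
        blob = requests.get(att["content"], auth=AUTH, timeout=120).content
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            zf.extractall(dest_dir)          # extracted logs feed the analyzer skill
            extracted.extend(zf.namelist())
    return extracted
```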
Why is "Fetch Source Code" the hardest node?
To automatically map a bug description to the correct repository, you need a maintained "module → repo" mapping table and bug tickets that have accurate module fields filled in at creation time. On top of that, some repos are massive — pulling fresh every time is too slow, but caching creates workspace storage and staleness problems. This node is currently in cross-team evaluation (DevOps + IT + Development + QA).
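The lookup itself is trivial; the hard part is keeping the table accurate and getting the module field filled in correctly upstream. A sketch with made-up entries:

```python
# Hypothetical "module → repo" table; in practice it has to be maintained
# jointly (DevOps + IT + Development + QA) and kept in sync with Gerrit.
MODULE_TO_REPO = {
    "SystemUI": "ssh://gerrit.example.com:29418/vehicle/systemui",
    "Settings": "ssh://gerrit.example.com:29418/vehicle/settings",
    "Performance": "ssh://gerrit.example.com:29418/vehicle/perf-tools",
}

def resolve_repo(bug_module: str) -> str:
    """Map the Jira module field to a repository, or fail loudly."""
    try:
        return MODULE_TO_REPO[bug_module]
    except KeyError:
        # A missing or sloppily filled module field is exactly what breaks
        # this node; hence the cross-team evaluation.
        raise ValueError(f"No repo mapping for module '{bug_module}'") from None
```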
Real Tests: 6 Scenarios
The design only matters if it holds up in practice. Here are the 6 most representative cases from actual test runs.
Case 1: The Closest to the Happy Path
This was the most satisfying run. The workflow completed end-to-end as designed:
Fetch bug info
→ Download logs → Extract → Analyze root cause
→ Fix code
→ Code Review (passed)
→ SonarQube scan
→ Unit tests
→ Commit to Gerrit
→ Poll verify status (scheduled)
→ Add Gerrit reviewer
→ Add Jira comment
All 12 nodes, fully AI-driven, zero human intervention. The Gerrit MR was submitted successfully; the Jira ticket was automatically updated with the analysis summary and action log.
One minor glitch: the Commit step was supposed to run automatically but triggered a human confirmation prompt. This was Claude Code's Permission system requiring a second confirmation for certain Git operations. Fixed later by adjusting the Hooks configuration.
Case 2: Tool Failure → Wait for Human
The UT step failed due to a JDK version mismatch in the environment.
When the workflow detected the failure, it didn't crash — it triggered the human notification path, paused, and waited. This is exactly the fallback route designed from the start:
Tool failure
→ Can it be auto-retried?
→ No → Push notification → Wait for human → Resume after confirmation
This case validated that the fault tolerance mechanism works.
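The decision behind that route is deliberately simple. A sketch, with the error classification and the notification as placeholders:

```python
def is_transient(error: Exception) -> bool:
    # Placeholder classification; the real check would inspect error types/codes.
    return isinstance(error, (TimeoutError, ConnectionError))

def notify_human(step: str, error: Exception) -> None:
    # Placeholder for the push-notification path (IM / email).
    print(f"[HUMAN GATE] step '{step}' blocked: {error}")

def handle_tool_failure(step: str, error: Exception, attempt: int,
                        max_retries: int = 2) -> str:
    """Decide between auto-retry and the wait-for-human path."""
    if is_transient(error) and attempt < max_retries:
        return "retry"
    notify_human(step, error)
    return "wait_for_human"   # the workflow pauses here until confirmation
```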
Case 3: Session Interrupted → Resume in New Session
A system notification interrupted the agent mid-execution. I opened a new session, sent the same task description, and the agent automatically read the previous workflow_state.json file and resumed from the checkpoint — no restart from scratch.
This relies on the workflow persisting a state file after every node completes:
{
"bug_id": "AE-33995",
"current_phase": 4,
"current_step": "4.3",
"completed_steps": ["0", "1", "2", "3", "4.1", "4.2"],
"artifacts": {
"log_path": "...",
"review_result": "review_r1.json",
...
}
}
Sessions can be interrupted. Tasks don't get lost. For long-running automation workflows, this is non-negotiable.
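A minimal sketch of the resume side, with step IDs mirroring the state file above (helper names are illustrative, not the actual workflow code):

```python
import json
from pathlib import Path

STATE_FILE = Path("workflow_state.json")
ALL_STEPS = ["0", "1", "2", "3", "4.1", "4.2", "4.3"]   # later nodes omitted

def resume_point() -> str:
    """Pick up where the previous session left off, or start fresh."""
    if not STATE_FILE.exists():
        return ALL_STEPS[0]
    state = json.loads(STATE_FILE.read_text())
    done = set(state["completed_steps"])
    for step in ALL_STEPS:
        if step not in done:
            return step                      # first unfinished node
    return "finished"
```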
Case 4: Code Review Fails → Multi-Round Retry Loop
This is the most interesting section of the whole pipeline.
Code Review failed on the first round (score: 57, with 8 mandatory violations). Instead of immediately escalating to a human, the workflow launched up to 3 automatic fix-and-retry rounds:
Round 1: CR → Failed (57 pts, 8 mandatory violations)
→ Launch sub-agent to fix all 8 violations
→ Round 2 CR → Failed (83 pts, 2 violations)
→ Launch Round 3 fix
→ Round 3 CR → Failed (74 pts, 4 violations)
→ Max retries reached → Trigger human gate B
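Stripped of the agent plumbing, the control flow is just a bounded loop. A sketch, with `run_code_review` and `run_fix_subagent` standing in for the real skill invocations:

```python
MAX_CR_ROUNDS = 3

def code_review_loop(change_dir: str) -> str:
    """Bounded fix-and-retry loop around the Code Review node (hypothetical helpers)."""
    for round_no in range(1, MAX_CR_ROUNDS + 1):
        result = run_code_review(change_dir)
        if result["mandatory_violations"] == 0:
            return "passed"
        if round_no == MAX_CR_ROUNDS:
            break                              # retry budget exhausted
        # Targeted fixes only: the sub-agent gets the violation list,
        # not an instruction to rewrite the change.
        run_fix_subagent(change_dir, result["violations"])
    return "human_gate_B"
```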
Each round, the agent reads the previous CR result and makes targeted fixes — not a full rewrite. Round 1 fix summary:
| Issue ID | Violation | File |
|---|---|---|
| B2-01 | Raw Thread → CoroutineScope | PerformanceInfoMonitor.kt |
| C1-01 | `binding!!` → safe call | PerformanceFloatingView.kt |
| C1-02 | `windowManager!!` (5x) → `?.let` | PerformanceFloatingView.kt |
| C5-01 | InterruptedException caught separately | PerformanceInfoMonitor.kt |
| D1-01 | Interface renamed → ICloseClickCallback | PerformanceFloatingView.kt |
| … | … | … |
Case 5: Max Retries Exceeded → Human Gate
Continuing from Case 4. After 3 rounds of fixes, Code Review still failed. But this exposed a deeper and more interesting problem:
Two of the four "mandatory" violations flagged in Round 3 had been classified as merely "recommended" in Round 2: the same issues, upgraded one round later.
When an LLM acts as a code review judge, its scoring criteria drift between rounds. The same issue gets rated differently depending on how much history has accumulated in the prompt context. The loop can't converge.
This is a fundamental engineering problem, not something a Prompt tweak can fix.
When human gate B fired, the agent presented a three-round comparison report and offered two options:
A: Accept current code and proceed to submission
(core bug fixed; remaining issues are style, not functional)
B: Human fixes the remaining 4 items, then notifies agent to re-run quality checks
At this point the human's role is judge, not executor. The AI did everything it could; a person makes the call.
Case 6: CI Pipeline Verify Failed → Human Gate
After submitting to Gerrit, the CI pipeline returned failure votes:
| Reviewer | Verified | Compile | UT | Code-Check | Smoke-Test |
|---|---|---|---|---|---|
| icvsbgci | -1 | +1 | +1 | +1 | 0 |
| jenkins.dl | 0 | 0 | 0 | 0 | -1 |
The agent read the Gerrit vote state, detected pipeline failures, and entered human gate C with three options:
A — Pipeline issue is known/acceptable; proceed with adding reviewers and updating Jira
B — Needs pipeline re-trigger (manually rebase in Gerrit, then agent re-polls)
C — Abandon this submission and re-evaluate
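Reading that vote state is plain Gerrit REST. A sketch of roughly what the gerrit-verify step has to do, with host and credentials as placeholders:

```python
import json
import requests

GERRIT = "https://gerrit.example.com"   # placeholder host
AUTH = ("svc-bot", "http-password")     # placeholder HTTP credentials

def verified_votes(change_id: str) -> list[tuple[str, int]]:
    """Read the Verified label votes on a change via Gerrit's REST API."""
    resp = requests.get(f"{GERRIT}/a/changes/{change_id}/detail",
                        auth=AUTH, timeout=30)
    resp.raise_for_status()
    # Gerrit prefixes JSON responses with )]}' to prevent XSSI.
    body = json.loads(resp.text.split("\n", 1)[1])
    votes = body.get("labels", {}).get("Verified", {}).get("all", [])
    return [(v.get("name", "?"), v.get("value", 0)) for v in votes]

# Any -1 on Verified from a CI account means the pipeline failed,
# which is what routes the workflow into human gate C.
```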
Engineering War Stories: Three Core Problems
Problem 1: Context Explosion on Long Pipelines
This is the biggest technical challenge the system faces.
A single bug-fix run accumulates: Jira reads, log downloads and extractions, parsing large log files, reading multiple source files, Code Review output, Sonar scan reports… all within one agent turn. In one real test run, a single turn executed 117 tool calls — and then, right after the Sonar scan completed and just as the agent was about to move on to the next step, the API request was aborted:
seq=115: sonar poll returned ← scan result received
seq=117: stopReason="aborted", errorMessage="Request was aborted."
The agent knew exactly what to do next. The turn's accumulated context was simply too large and the server rejected the request outright.
Solution: Sub-Agent Architecture.
Refactor each heavyweight phase into an independent sub-agent call. The main agent only orchestrates; sub-agents execute specific tasks and return structured results. After each sub-agent completes, the main agent receives only a summary — not the full execution trace. Each phase's context stays isolated and doesn't accumulate linearly across the whole pipeline.
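Conceptually, the orchestrator's view of each phase shrinks to something like this (`spawn_subagent` is a stand-in, not the platform's real API):

```python
def run_phase(phase: str, task: dict) -> dict:
    """Run one heavyweight phase in an isolated sub-agent and keep only a digest."""
    result = spawn_subagent(skill=phase, payload=task)   # hypothetical platform call
    # Bulky material (log excerpts, diffs, scanner output) stays in the
    # sub-agent's own context or on disk as artifacts.
    return {
        "phase": phase,
        "status": result["status"],
        "summary": result["summary"],            # a short digest, not 100+ tool calls
        "artifact_paths": result["artifact_paths"],
    }
```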
Problem 2: LLM Judge Score Drift
Already described in Case 5. The core issue: an LLM's judgment is influenced by its accumulated context. As fix rounds increase, the prompt history grows, and the model's evaluation baseline shifts.
No clean solution exists yet. Directions being explored:
- Start a fresh session for each Code Review round (clear all history context)
- Decouple the scoring logic from the LLM — use AST-based static rule checking for pass/fail decisions, with the LLM only providing human-readable explanations
- Add a consistency validation layer on top of the "mandatory/recommended" classification
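To make the third idea concrete: pin every issue to the severity it was first given, and surface any disagreement to the human gate instead of letting it reshape the score. A minimal sketch:

```python
def stabilize_severity(findings: list[dict], history: dict[str, str]) -> list[dict]:
    """Pin each issue's mandatory/recommended rating to the first one it received.

    findings: this round's results, e.g. {"id": "C1-01", "severity": "mandatory"}
    history:  issue id -> severity from the round where it was first reported
    """
    for f in findings:
        first = history.setdefault(f["id"], f["severity"])
        if f["severity"] != first:
            f["severity"] = first          # block round-to-round drift
            f["drift_flagged"] = True      # surface the disagreement to the human gate
    return findings
```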
Problem 3: Cross-Session Task Recovery
An AI session is fundamentally stateless. In enterprise environments, a bug fix might run for 20 minutes or be interrupted for hours while waiting for CI. You can't rely on a single session staying alive.
The solution is externalizing state. After every node completes, the workflow serializes the current state to the filesystem (workflow_state.json), recording: completed nodes, key artifact paths, and the current phase. When a new session starts, it reads this file first and resumes from the checkpoint.
This is essentially simulating a persistent task queue, with the filesystem as the state store.
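A minimal sketch of the write side, assuming the state shape shown earlier; the temp-file-plus-rename keeps a crash from leaving a torn checkpoint:

```python
import json
import os
from pathlib import Path

STATE_FILE = Path("workflow_state.json")

def mark_step_complete(state: dict, step: str, artifacts: dict) -> None:
    """Checkpoint after every node: write to a temp file, then rename."""
    state["completed_steps"].append(step)
    state["artifacts"].update(artifacts)
    tmp = STATE_FILE.with_name(STATE_FILE.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2, ensure_ascii=False))
    os.replace(tmp, STATE_FILE)   # atomic swap on the same filesystem
```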
What Works, What's Still Missing
What's Running
- Jira info retrieval + log download / extraction / analysis
- Bug root cause analysis for automotive Android systems (via `rnd-automotive-issue-analyzer`)
- Code fixes (Kotlin, Android application layer)
- Code Review with custom standards (multi-round retry + human gate)
- SonarQube static analysis (requires SonarQube 10.x — 9.9 doesn't work)
- Code submission to Gerrit (with standardized commit messages)
- CI/CD result polling + vote status interpretation
- Automated Gerrit reviewer assignment
- Jira comment write-back
What's Still Being Built
Automatic source code matching is the biggest open problem. Inferring the right repository from a bug description requires a maintained "module → repo" mapping and accurate module fields in bug tickets. This needs cross-team coordination and is currently handled by manually specifying the repo path.
Automated regression testing requires the QA team to co-design the smoke test trigger, execution environment, and result write-back — all of which involve pipeline changes.
Was It Worth It?
Thousands of dollars in tokens, for a pipeline that's still being refined. Worth it?
My answer: depends what you're measuring.
From a pure cost standpoint, no — the per-run token cost is still high and needs optimization.
But from another angle:
This is working proof on real enterprise systems, not a toy demo. It's connected to real production toolchains: Jira, Gerrit, SonarQube, Jenkins.
It reveals the actual boundary of AI agents in enterprise engineering: what can be automated, what needs human judgment, and what's an engineering problem rather than an AI capability limitation.
Sub-Agent architecture, externalized state, and human gates — these three engineering patterns were forged through real failures. They're applicable to any team trying to deploy AI agents in enterprise environments.
Most importantly: the path is viable. Many nodes are still being refined, but "a bug ticket goes in, a Gerrit MR comes out" has been demonstrated.
Summary
This isn't something you knock out in a month. Every skill requires deep familiarity with enterprise tooling. Every workflow node debugged reveals another edge case in how AI agents behave in complex real-world environments.
But the direction is right. AI isn't here to replace engineers — it's here to replace the part of engineering that engineers hate doing but that still has to get done.
If you're working on similar problems, I'd love to compare notes.
Questions or thoughts on your own AI automation experience? Drop a comment below.