Label an issue. Walk away. Come back to a reviewed, tested, and merged pull request.
We built an event-driven coding pipeline called dev-agents. You label a GitHub issue with dev-agents, and the system designs the solution, writes the code, runs the tests, reviews its own work, and merges the PR. The entire thing runs on a Raspberry Pi acting as a self-hosted GitHub Actions runner.
No cloud GPU. No API keys. Just the Claude Code CLI running on an 8GB ARM board under a desk.
This is how it works, what broke along the way, and why the architecture ended up the way it did.
The Pipeline
A single pipeline run walks through five stages. Each stage has a specific job and a specific failure mode we designed around.
GitHub Issue (labeled "dev-agents")
│
▼ repository_dispatch
┌──────────────────────────────────────────────┐
│ Self-hosted runner │
│ │
│ Tech Lead (sonnet, orchestrator) │
│ ├── DESIGN — explore codebase, write spec│
│ ├── IMPLEMENT — spawn Opus to write code │
│ ├── VERIFY — run tests/typecheck/build │
│ ├── QA — spawn Opus to write tests │
│ └── FINALIZE — commit stragglers │
│ │
│ Post-pipeline (shell, no LLM): │
│ ├── Rebase onto main │
│ ├── Push branch + create PR │
│ ├── REVIEW — sonnet posts inline comments │
│ ├── Auto-merge (squash) │
│ └── Comment on originating issue │
└──────────────────────────────────────────────┘
The Tech Lead is a Sonnet session with 100 turns. It maintains context across all stages — it knows what it designed, what the implementer wrote, what verification found, and what QA flagged. It delegates heavy work to Opus subagents that run in isolated contexts. This is deliberate: the orchestrator keeps the big picture while workers focus on implementation details without context pollution.
Single Orchestrator, Isolated Subagents
Early prototypes used separate agent sessions for each stage. The architect would design a solution in session one. The implementer would open a new session, re-read the spec, and start coding. Context was lost at every handoff.
The fix was a single orchestrator pattern. One Sonnet session runs from start to finish. When it needs code written, it spawns an Opus subagent via the Claude Code Agent tool. The subagent gets a focused prompt, writes code, commits, and exits. Control returns to the orchestrator, which still has the full conversation history.
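From the orchestrator's side, delegation can be sketched as a tiny helper that builds the worker invocation. This is illustrative only: `run_agent` is a hypothetical name, and while `claude -p`, `--model`, and `--max-turns` are real Claude Code CLI flags, the project's actual scripts may differ.

```shell
# Hypothetical helper: assemble the command used to spawn a worker session.
# It echoes the command instead of executing it, so the sketch runs
# without the Claude Code CLI installed.
run_agent() {
  local model="$1" max_turns="$2" prompt="$3"
  echo claude -p "$prompt" --model "$model" --max-turns "$max_turns"
}

# Orchestrator delegates implementation to an Opus worker with a 40-turn cap
run_agent opus 40 "Implement the approved spec; commit when tests pass"
```

The orchestrator blocks until the worker exits, then continues with its full conversation history intact.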
This matters because the verification stage needs to know what was designed, what was implemented, and what the test output means. A fresh session would need to re-derive all of that context from files. The orchestrator already has it.
Model allocation is intentional:
| Role | Model | Max Turns | Why |
|---|---|---|---|
| Tech Lead | Sonnet | 100 | Orchestration, exploration, coordination |
| Implementer | Opus | 40 | Heavy code generation, complex changes |
| QA | Opus | 20 | Test writing, edge case analysis |
| Reviewer | Sonnet | 20 | Diff review, inline comments |
| Monitor | Haiku | 15 | Lightweight status checks |
Sonnet orchestrates because it is fast and cheap for tool-heavy workflows. Opus implements because it writes better code on first attempt, which matters when you are paying per turn. Haiku monitors because you do not need a frontier model to check if a process is still running.
Event-Driven Trigger Architecture
Target repos stay lightweight. Each onboarded repo gets one small workflow file that fires a repository_dispatch event when an issue is labeled dev-agents. The dispatch lands on the dev-agents repo, where the self-hosted runner picks it up.
Target repo: "Add dark mode" issue labeled
→ repository_dispatch to dev-agents repo
→ GitHub Actions on self-hosted runner
→ Enqueue trigger to filesystem queue
→ Process queue (priority-sorted)
→ Run pipeline
This separation matters for two reasons. First, target repos do not need Claude Code installed or any AI dependencies. The only addition is a 30-line workflow file. Second, the queue lives on the runner, so it survives workflow restarts and can batch triggers from multiple repos.
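A sketch of what that per-repo workflow might look like. The event type name (`dev-agents-trigger`) and secret name (`DEV_AGENTS_PAT`) are placeholder assumptions, not the project's actual values:

```yaml
# .github/workflows/dev-agents-dispatch.yml (sketch, not the actual file)
name: dev-agents dispatch
on:
  issues:
    types: [labeled]
jobs:
  dispatch:
    if: github.event.label.name == 'dev-agents'
    runs-on: ubuntu-latest
    steps:
      - name: Forward to the pipeline repo
        run: |
          curl -sS -X POST \
            -H "Authorization: Bearer ${{ secrets.DEV_AGENTS_PAT }}" \
            -H "Accept: application/vnd.github+json" \
            https://api.github.com/repos/bing107/dev-agents/dispatches \
            -d '{"event_type":"dev-agents-trigger","client_payload":{"repo":"${{ github.repository }}","issue":${{ github.event.issue.number }}}}'
```

That is the entire footprint in the target repo.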
Pipeline type is auto-detected from issue labels: bug maps to bugfix (skip design, go straight to implementation), hotfix maps to highest priority. Everything else is a feature with full design-first flow.
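The label-to-pipeline mapping can be as small as a case statement. A sketch: the function name and the exact type/priority strings are assumptions, not the project's actual code.

```shell
# Hypothetical mapping from issue label to pipeline type + priority prefix
detect_pipeline() {
  case "$1" in
    hotfix) echo "hotfix 1" ;;   # highest priority, skips design
    bug)    echo "bugfix 2" ;;   # skips design
    *)      echo "feature 3" ;;  # full design-first flow
  esac
}

detect_pipeline bug   # → bugfix 2
```

The priority prefix feeds directly into the queue filenames described next.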
The Priority Queue
Triggers are enqueued to a persistent filesystem queue on the runner. No database, no Redis, no message broker. YAML files sorted by filename.
~/dev-agents-queue/
├── pending/myapp/
│ ├── 1-20260313T100000Z-fix-auth.yaml # hotfix
│ ├── 2-20260313T100100Z-fix-layout.yaml # bugfix
│ └── 3-20260313T100200Z-add-dark-mode.yaml # feature
├── active/myapp/ # currently running
├── completed/myapp/
├── failed/myapp/
└── locks/myapp.lock # flock per project
The priority prefix (1-, 2-, 3-) means sort naturally processes hotfixes before bugfixes before features. Within the same priority, timestamps provide FIFO ordering.
Concurrency rules: same project runs sequentially (one flock per project), different projects run in parallel (background processes). A hotfix for project A does not wait for project B's feature to finish.
Deduplication is by task ID. If the same issue triggers twice (user removes and re-adds the label), the second enqueue is a no-op.
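The dedup check on enqueue can be sketched as a glob over `pending/` and `active/`. The `enqueue` function is hypothetical; only the directory layout follows the article.

```shell
# Hypothetical enqueue with task-ID dedup: skip if the task already
# exists anywhere in the queue (pending or active).
QUEUE="${QUEUE:-$(mktemp -d)}"   # queue root; temp dir for this demo

enqueue() {
  local project="$1" priority="$2" task="$3"
  local dir
  for dir in pending active; do
    if ls "$QUEUE/$dir/$project/"*"-$task.yaml" >/dev/null 2>&1; then
      echo "duplicate: $task"   # second enqueue is a no-op
      return 0
    fi
  done
  mkdir -p "$QUEUE/pending/$project"
  local ts; ts=$(date -u +%Y%m%dT%H%M%SZ)
  touch "$QUEUE/pending/$project/$priority-$ts-$task.yaml"
  echo "enqueued: $task"
}

enqueue myapp 3 add-dark-mode   # → enqueued: add-dark-mode
enqueue myapp 3 add-dark-mode   # → duplicate: add-dark-mode
```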
Crash recovery: if a trigger sits in active/ for more than four hours, process-queue.sh moves it back to pending/. Long enough to handle legitimate large features, short enough to recover from a crashed pipeline before the next cycle.
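The recovery sweep is a one-liner around `find -mmin`. A sketch under the article's four-hour threshold; the function name is hypothetical.

```shell
# Hypothetical stale-trigger sweep: anything sitting in active/ for more
# than 4 hours (240 minutes) is assumed crashed and moved back to pending/.
QUEUE="${QUEUE:-$HOME/dev-agents-queue}"

requeue_stale() {
  local project="$1"
  [ -d "$QUEUE/active/$project" ] || return 0
  mkdir -p "$QUEUE/pending/$project"
  find "$QUEUE/active/$project" -maxdepth 1 -name '*.yaml' -mmin +240 \
    -exec mv {} "$QUEUE/pending/$project/" \;
}
```

Because filenames carry the priority prefix, a re-queued hotfix still sorts ahead of waiting features.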
One-Command Repo Onboarding
Onboarding a new repo takes one command:
./scripts/onboard-repo.sh myorg/myapp
This does six things:
- Shallow-clones the repo to a temp directory
- Auto-detects language, framework, and test/build/lint commands from `package.json`, `Cargo.toml`, `pyproject.toml`, or `go.mod`
- Creates a project config YAML in the dev-agents repo
- Creates a `dev-agents` label on the target repo
- Pushes the dispatch workflow to the target repo
- Prompts for a PAT secret
Framework detection reads dependency lists and maps them to commands:
| Detected | Test Command | Build Command |
|---|---|---|
| vitest in deps | `npx vitest run` | — |
| jest in deps | `npx jest` | — |
| next.js in deps | — | `npx next build` |
| Cargo.toml exists | `cargo test` | `cargo build` |
| pytest in deps | `pytest` | — |
The whole process is idempotent. Re-running on an already-onboarded repo skips existing steps.
Pre-Push Review, Not Post-PR
Code review happens before the branch is pushed. A separate Sonnet session reads the git diff, writes structured findings to a review file, and returns a verdict: APPROVE or REQUEST_CHANGES.
If the verdict is REQUEST_CHANGES, the pipeline spawns an Opus fix agent to address the critical and major issues. Then verification runs again — typecheck, tests, build. Only after gates pass does the branch get pushed and the PR created.
If hard gates fail after the review cycle, the PR is created as a draft with a pipeline-failed label. This creates a visible record of what happened without polluting the main branch.
This ordering was a deliberate choice. Post-PR review creates noise: a PR exists, reviewers see it, but it might have obvious issues that a pre-merge check would catch. Pre-push review means the PR that lands in your inbox has already been verified and reviewed. The PR is a record, not a gate.
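The gate logic reduces to a case statement over the reviewer's verdict. A sketch: the verdict-file format and function name are hypothetical.

```shell
# Hypothetical gate: read the reviewer's verdict file and pick the next step.
review_gate() {
  local verdict_file="$1"
  local verdict
  verdict=$(grep -oE 'APPROVE|REQUEST_CHANGES' "$verdict_file" | head -1)
  case "$verdict" in
    APPROVE)         echo "push-and-create-pr" ;;
    REQUEST_CHANGES) echo "spawn-fix-agent" ;;       # then re-run verification
    *)               echo "draft-pr-pipeline-failed" ;;  # hard-gate fallback
  esac
}
```

Only the `push-and-create-pr` branch ever produces a normal PR; everything else stays off the main path.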
Failure Memory
Agents learn from past mistakes. Every pipeline failure is appended to a per-project log file:
---
date: 2026-03-10T14:22:00+00:00
task: fix-auth
title: Fix OAuth token refresh
type: bugfix
exit_code: 1
error: |
TypeScript error: Property 'refresh_token' does not exist on type 'Session'
The last 50 lines of this failure log are injected into future pipeline prompts with a header: "These are recent pipeline failures. Learn from them — do NOT repeat the same mistakes."
This is not fine-tuning. It is context injection. But it works. After recording a TypeScript strict-mode failure, subsequent pipelines check for strict mode before writing code. After recording a test database teardown issue, QA agents started including cleanup steps.
The failure log is append-only, capped at 50 lines of context injection, and automatically pruned when tasks move to completed/ after 30 days.
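The injection itself is just `tail`. A sketch with hypothetical paths and function name; the header string is the one quoted above.

```shell
# Hypothetical context injection: append the last 50 lines of the
# project's failure log to the pipeline prompt.
build_prompt() {
  local base_prompt="$1" failure_log="$2"
  printf '%s\n' "$base_prompt"
  if [ -s "$failure_log" ]; then
    echo "These are recent pipeline failures. Learn from them — do NOT repeat the same mistakes:"
    tail -n 50 "$failure_log"
  fi
}
```

If the log is empty, the prompt goes out clean; the memory costs nothing until there is something to remember.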
The Kill Switch
touch data/.pause # Stop everything
rm data/.pause # Resume
Every script checks for .pause at the top. This is the same pattern we use in our sales pipeline — a filesystem-level circuit breaker that requires no process management.
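The per-script check is one guard at the top. A sketch; the function name and `PAUSE_FILE` override are hypothetical.

```shell
# Hypothetical preamble guard: bail out quietly if the kill switch is set.
check_pause() {
  if [ -f "${PAUSE_FILE:-data/.pause}" ]; then
    echo "paused"
    return 1
  fi
  return 0
}
```

Because the flag is a file, `touch` and `rm` are the entire control plane: no signals, no PID files, no supervisor.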
What This Cost
The self-hosted runner is a Raspberry Pi 4 (8GB) that also runs our sales pipeline. GitHub Actions self-hosted runners are free. Claude Code runs on a Pro subscription — no API key, no per-token billing. The marginal cost of each pipeline run is effectively zero.
A typical feature pipeline (design through merge) takes 15-30 minutes and uses 100-200K tokens across all agents. A bugfix pipeline skips design and finishes in 5-15 minutes.
Key Takeaways
- Single orchestrator + subagents preserves context across pipeline stages while enabling model specialization. The orchestrator coordinates; workers execute.
- Filesystem queues work. YAML files sorted by filename give you priority queuing, crash recovery, and human inspectability with zero infrastructure.
- Pre-push review catches more than post-PR review. If you are going to have an AI reviewer, run it before the PR exists.
- Failure memory is cheap context injection. Append failures to a log, inject the last N lines into future prompts. Agents stop repeating the same mistakes.
- Shell scripts over frameworks for orchestration. The entire pipeline is bash calling `claude -p`. No SDK, no dependency graph, no build step. When something breaks, you read the script.
- Event-driven keeps target repos clean. One workflow file, one label. The complexity lives in the pipeline repo, not in every project you onboard.
The source is at github.com/bing107/dev-agents.
SifrVentures builds dedicated engineering teams for tech companies. Based in Berlin. Learn how we work | Read more on our blog