DEV Community: Adrien Cossa

Software engineering became software architecture.

Adrien Cossa — Wed, 10 Jun 2026 12:39:23 +0000

Highly opinionated, based on my personal experience dogfooding my own setup.
Not a prescription. I'm scratching the surface, with a lot left to learn.

Is coding solved? I think so. But what's left to us? If there's something every developer can still refer to — even those who don't follow the buzz around agentic engineering, then harness engineering, and now loop engineering — it would be architecture. You could say we've been promoted, or soon will be, to manager of a team of agents. But the one part of the job that still speaks to each of us, the part we should keep doing ourselves, is software architecture.

I'm not claiming the agent writes clean, typed, tested code out of the box — it doesn't, and I'll get to how you set that up. But that setup is mostly a one-time job plus a little maintenance, and once it's done, coding is basically solved. What's really left to us is the architectural decisions — the ones that separate good quality from good quality plus a design that lasts. The taste. The judgment about what should exist and where the boundaries go. That layer is still on our side... or so I thought.

The reality, at least for me, is a bit different. Even though I knew it wasn't a good idea, I'd already handed that layer to the model — inside teatree, the code factory I build and run for myself. It wasn't planned. It just happened — and I'm still watching it slip further away from me.

"Coding is solved" assumes one thing I'll set aside here: a complete spec going in. I treat the spec I hand the agent as bulletproof — making sure it actually is, and asking questions when it isn't, is my job, just not this post's subject. Now let me start from the beginning.

What a gate is, and what it does

A gate, the way I'll use the word here, is a deterministic check that returns a pass-or-fail verdict and blocks the commit on a fail. Same input, same verdict, every time — no matter which model ran or what context was loaded. Run one on every commit and the output can only move one way: toward whatever the gate calls "good." That's convergence. Tech debt piles up when nothing pushes back. A gate pushes back, and you can rely on it to.

This matters more in the agent era, not less. The agent out-produces my reading — it writes more code, faster, than I would by hand, and more than I can carefully read. Convergence is what keeps that volume from becoming instant debt. Drop the gates and the same speed that ships features ships mess at the same rate. The gates aren't perfect, but they're necessary.

I work in Python, so my gates are ruff, ty and tach, run through prek, plus a stack of project-specific hooks. teatree — the personal code factory I'm dogfooding — is a Django project that turns a ticket URL into a merged PR (in theory). Its pre-commit pipeline runs more than sixty hooks in a numbered sequence. Most fire on every commit, a few only at push or in CI. The prose below leans on a handful of them:

Gate	What it blocks
Safety guards	commits and pushes that shouldn't happen at all — a merged branch, a public leak, a secret
Lint and structure	rule, boundary and duplication violations — `ruff` (every rule on), `tach`, `import-linter`, a 500-line file cap
Types	type errors and warnings — `ty`, with warnings failing the commit
Gate guards	silent relaxations of the gates themselves
Evals (token-free)	behaviour regressions — skill-triggers, pinned-regressions

The rest — formatting fixers, lockfile sync, doc generators, dependency audits, a conventional-commit check — do the unglamorous tidying you'd expect. Two of the gates above run stricter than people usually set them. ruff runs with every rule on, exceptions justified one by one. ty fails the commit on a warning instead of letting it scroll past. Coverage sits behind a hard floor, with a stricter per-module floor on the newer code, so one module can't rot quietly while the project number stays green. None of it is advisory. A failing gate is a failing commit.
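The two-level coverage floor can be sketched as a pure check. This is a minimal sketch, not teatree's actual hook: the function name, the thresholds and the module name are all hypothetical, and it assumes the coverage percentages have already been read from a report.

```python
def coverage_failures(
    project_pct: float,
    module_pcts: dict[str, float],
    project_floor: float,
    module_floors: dict[str, float],
) -> list[str]:
    """Return one message per violated floor; an empty list means the gate passes."""
    failures = []
    if project_pct < project_floor:
        failures.append(
            f"project coverage {project_pct:.1f}% is below the floor {project_floor:.1f}%"
        )
    # Per-module floors keep one module from rotting while the total stays green.
    for module, floor in sorted(module_floors.items()):
        pct = module_pcts.get(module, 0.0)  # a module missing from the report counts as 0%
        if pct < floor:
            failures.append(f"{module} coverage {pct:.1f}% is below its floor {floor:.1f}%")
    return failures


# Hypothetical numbers: the project total is green, but one newer module is not.
print(coverage_failures(93.0, {"core.scanner": 88.0}, 90.0, {"core.scanner": 95.0}))
```

A hook built on this would exit non-zero whenever the list is non-empty, which is what makes the floor a gate and not a report.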

The ratchet only turns one way

A check you — or your agent — can quietly widen or switch off stops being a constraint. The cheap way out of a red check is to widen the ignore list or lower the floor. So one of my gate guards watches for exactly that: it fails any commit that touches a lint-ignore list, a coverage floor, an omit pattern, or reaches for --no-cov or --no-verify, and tells you to fix the underlying issue instead. The escape hatches are themselves gated, so convergence stays one-directional.
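A guard like that can be approximated by scanning the staged diff for changed lines that touch a guarded setting. The sketch below is an assumption about how such a check might look, not the real hook: the pattern list (ruff's ignore keys, coverage.py's fail_under and omit, the two bypass flags) and the sample diff are illustrative.

```python
import re

# Assumed patterns: ruff ignore lists, coverage.py floor/omit keys, and the bypass flags.
GUARDED = re.compile(
    r"^\s*(ignore|extend-ignore|per-file-ignores|fail_under|omit)\s*="
    r"|--no-(cov|verify)(?![\w-])"
)


def relaxations(diff_text: str) -> list[str]:
    """Return the changed lines of a unified diff that touch a guarded setting."""
    hits = []
    for line in diff_text.splitlines():
        # Skip file headers and context lines; keep only added or removed lines.
        if line.startswith(("+++", "---")) or not line.startswith(("+", "-")):
            continue
        if GUARDED.search(line[1:]):
            hits.append(line)
    return hits


# Hypothetical diff: a lowered floor and a coverage bypass.
SAMPLE_DIFF = """\
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,3 +1,4 @@
-fail_under = 90
+fail_under = 85
+addopts = "--no-cov"
 name = "teatree"
"""

for hit in relaxations(SAMPLE_DIFF):
    print("blocked:", hit)
```

A line-based scan misses an edit to an entry inside a multi-line ignore list, so a stricter version would probably compare the parsed config before and after the commit.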

A second guard works on structure, not lint. A 500-line file cap and a cap on module-level functions don't force every oversized file under the limit at once. A file already over the line stays — but it can only shrink. A commit that grows it is blocked, and newly crossing a cap is blocked outright. Structure ratchets the same way coverage does.

But the durable thing isn't the rule. It's where the rule lives. A rule kept in prose — a skill, a note, a thing I have to remember — isn't reliable, because prose gets read inconsistently or not at all. The same rule as a hook is as reliable as your unit tests. My blueprint, the one document that records the system's current shape, puts it plainly: durability comes from enforcement encoded in code and structure, not prose that decays. You can push prose to hold more strictly than that — but it's a harder problem, and a later post.

So a lot of tooling that predates the agentic era just became more necessary than ever. The kind I lean on more lately are the structural ones — what I've seen called architectural fitness functions: a deterministic test that checks a property of the whole module graph rather than a single line. tach enforces dependency direction (which module is allowed to import which — the module DAG, the layering). A chokepoint registry maps each dangerous primitive to the one module allowed to call it: every outbound network call, say, has to go through a single egress module, so a raw HTTP call made anywhere else fails the check. I reach for these because of volume. "Remember not to do X" loses when the agent writes more than I can read. A structural test that makes the violation impossible doesn't depend on anyone reading anything — and it's declarative, so one line catches the whole class, including code that doesn't exist yet.
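To make the chokepoint idea concrete, here is a minimal sketch using the standard ast module. It is a simplification: it checks imports, not calls, and the registry contents and the module name teatree.egress are hypothetical stand-ins for whatever the real registry holds.

```python
import ast

# Hypothetical registry: dangerous top-level package -> the one module allowed to import it.
CHOKEPOINTS = {
    "requests": "teatree.egress",
    "httpx": "teatree.egress",
    "urllib": "teatree.egress",
}


def chokepoint_violations(module_name: str, source: str) -> list[str]:
    """List imports in `source` that bypass the registered chokepoint module."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # absolute "from x import y" only
        else:
            continue
        for name in names:
            allowed = CHOKEPOINTS.get(name.split(".")[0])
            if allowed is not None and module_name != allowed:
                violations.append(
                    f"{module_name}:{node.lineno} imports {name}; only {allowed} may"
                )
    return violations


print(chokepoint_violations("teatree.cli", "import os\nfrom urllib.request import urlopen\n"))
```

Adding one entry to the registry covers every module in the project, including ones written later, which is the declarative property the paragraph above describes.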

The part nothing gates

Every gate above checks the code. None of them checks whether the design is right.

ruff will tell you a function is too complex. It won't tell you the function shouldn't exist, or belongs in a different module, or that the boundary it sits behind is in the wrong place. ty catches a type error and waves through a wrong abstraction that happens to type-check. Coverage tells you the code is exercised, not that it's the right code to exercise. You can have all four legs from Part 0 in place — the model, the harness, the deterministic constraints (the gates this post is about), and the skills — pass every gate, and still ship a clean, fully-typed, fully-covered implementation of the wrong architecture.

Architecture has no deterministic gate. No fixed rule returns the same pass or fail every time on whether a design is right.

The closest thing in my setup is a design companion — a checklist that fires before any code touches the core surfaces (the CLI, the core models, the scanners, the overlay base class, a backend protocol) and makes the agent reason through nine checks first:

Layout — blueprint alignment, component boundaries, dependency direction
Contracts — FSM phase boundaries (which moves from one workflow phase to the next are legal), extension-point contracts, behavior preservation
Under change — test surface, resilience invariants, identity and key normalization

Of those nine, exactly one is backed by a real gate: dependency direction, which tach enforces. The other eight produce design questions and nothing more — no verdict behind them. Someone still has to reason about whether the answer is right, and no hook I can write resolves that. It isn't a gap I haven't gotten to yet. With fixed rules, it's the shape of the problem.

But fixed rules aren't the only kind of check. There's another kind — non-deterministic, grading behaviour rather than asserting a line — and that's where Part 2 goes. The tidy conclusion, "no gate, so it stays human," leans on a binary that doesn't hold, so I'd rather leave that door open than slam it here.

I never decided to hand it over

Here's the part I got wrong about my own setup.

I assumed those eight verdict-less checks would route to me. I'm the human, architecture is the human's job, so the companion fires and I sign off. They don't route to me, and the reason is mundane. The companion fires on every change to a core surface. Signing off on each pass means the agent stops and waits for me every few minutes. That doesn't scale — and I never sat down and decided it didn't. I just stopped doing it, the way you stop reading a dialog box you've seen a hundred times.

So the call defaults to the agent. It makes the architectural decision, writes the code, and only pulls me in when it flags its own uncertainty. Which means the agent is already making most of the architectural decisions in this system — not because I reasoned my way to delegating them, but because the alternative was an interruption I couldn't sustain.

Found by use, not by spec

If you can't gate whether the architecture is right, you find out the only other way: you run the thing and watch it break.

My own README is blunt about it — not a stable product, expected to break, expected to change shape, dogfooded daily on real work. A design flaw in a system you don't use is a hypothesis. A design flaw in a system you depend on every day is a stalled ticket, and a stalled ticket is impossible to ignore. Battle-testing isn't a phase after the design. It is the design process — the only honest signal I have about architecture.

teatree's current shape wasn't specced up front. It grew, in this order: it started as one monolithic skill, ac-multitask — take a ticket, run it end to end. I split that into about eight lifecycle skills, one per phase, and those became the t3-* skill system. A unified t3 CLI with a finite-state machine pulled them together. Then it became a Django extension — models, migrations, real persistence. Then the inversion: the whole thing became the Django project itself, with the overlays demoted to lightweight packages on top. Then I packaged it as a Claude plugin. Each shape change came from hitting a wall with the previous one, not from a plan that anticipated the wall.

The blueprint is where the current shape is written down — but after use confirmed it, as a record of what survived, not a spec dictated before the first line.

So what's the job now?

So the agent has the volume of architectural decisions now. It doesn't yet have the judgment to catch its own bad ones.

It still makes beginner mistakes — a boundary in the wrong place, a decision heading the wrong direction — the kind any experienced developer has made once and learned to spot on sight. I catch some of them by reading the agent's reasoning as it goes. Part of that is plain developer experience. The other part is knowing this particular model — where it oversells a fix, where it quietly papers over something it couldn't do, the kind of task it's already failed three times.

So if I've handed off the code and most of the first-pass calls, what's left? The part around them. Building the gates so the convergent work converges without me. Shaping what the agent reasons against — the blueprint, the boundaries, the chokepoints — so its default lands closer to right. Reading the reasoning on the decisions that matter, and catching the ones it gets wrong. Deciding what gets built now and what waits — the product calls, made even when there's no spec written down.

That last part is product work, not engineering: I'm the product manager here as much as the developer. I'm still at the keyboard all day, just not writing code — I read what the agent produces and write back what to do next. (Until that turns into talking out loud, which it will.)

How long that holds, I don't know. The model keeps narrowing the gap, and the day it catches its own architectural mistakes, the job moves again — the way it just moved from writing code to shaping the thing that writes it. I'd rather watch that line move than pretend it's holding still. That's most of why I'm writing this down.

Hosting OpenClaw: a money trap and two silent failures

Adrien Cossa — Wed, 10 Jun 2026 12:39:06 +0000

I run OpenClaw on a Hetzner CAX ARM VPS. It talks to me over Signal and does a morning press review. Three gotchas on that box are worth writing down: one money trap in the model-routing layer, and two silent failures that each left the briefing dead for days. In case someone is staring at the same thing.

OpenRouter can spend your credits on a provider you didn't pick

I use OpenRouter as the single door to a pile of models. Its BYOK (bring-your-own-key) feature has a trap. You add your own OpenAI key for a model, flip on "Always use for this provider," and read that as "never spend OpenRouter credits." It doesn't mean that.

The toggle only guarantees that OpenRouter uses your key for that provider. It does not stop OpenRouter from routing to a different provider that serves the same model when your key fails or is unavailable, and that fallback spends your OpenRouter credits, at pay-per-token rates, on a provider you never picked. Working as designed. Just not the design in your head when you flip the toggle.

The setting that does what I wanted is the provider.only routing param. It tells OpenRouter to fail the request rather than fall back:

"provider": { "only": ["your-provider-slug"] }

The part that surprised me even without BYOK: a single model spans a wide price range by provider. I checked openai/gpt-oss-120b (what I actually run) via GET /api/v1/models/openai/gpt-oss-120b/endpoints — it runs from about $0.039 to $0.95 per million tokens. Default routing optimises its own mix of price and availability and can land you on the expensive end silently.

So I pinned it. In OpenClaw the routing params live under models.providers.openrouter.params.provider:

"provider": { "only": ["deepinfra", "dekallm", "novita"], "sort": "price" }

Now OpenRouter uses only those three, cheapest first, and a failure drops to my free fallback instead of escalating to a pricey provider. One catch that cost me a few minutes: the "only" list wants provider slugs, not display names. The slug is the part before the / in each endpoint's tag field.
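Turning endpoint tags into a routing block is mechanical enough to sketch. The helper name and the sample tag values below are hypothetical; the only rule taken from the text is that the slug is the part of the tag before the slash.

```python
def provider_block(endpoint_tags: list[str], sort: str = "price") -> dict:
    """Build a `provider` routing block from endpoint tags, keeping only the slug."""
    slugs = []
    for tag in endpoint_tags:
        slug = tag.split("/", 1)[0]  # the slug is the part before the first "/"
        if slug not in slugs:  # one provider may appear under several tags
            slugs.append(slug)
    return {"only": slugs, "sort": sort}


# Hypothetical tag values; real ones come from the endpoints listing.
print(provider_block(["deepinfra/fp4", "novita/bf16", "deepinfra/turbo"]))
```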

Two guardrails worth setting regardless:

OpenRouter API keys have no spend limit by default — mine showed "limit": null. Set one at openrouter.ai/settings/keys. A cap is the one protection that doesn't depend on getting the routing right.
openrouter.ai/activity shows which provider actually served each request — where to look when you suspect a provider you didn't pick.

Sorting by price ("sort": "price") has a catch that pulls the opposite way: the cheapest providers are the least reliable. My daily digest started failing after a day, with "LLM request failed" and nothing else. The gateway was healthy, so it wasn't the auto-update below. The route was the problem: a long digest turn kept landing on whichever cheap provider was flaky that morning. The fix was to pin the cron's agent to a few reliable, still-cheap paid providers and sort by throughput:

"provider": { "only": ["groq", "together", "baseten"], "sort": "throughput" }

For a digest of about 30k tokens a day, that's around $0.20 a month — and the next run came through. Pin by price for anything you can afford to retry, but for an automation that must deliver, the cheapest route is a false economy. Free tiers rate-limit and cheap providers get flaky exactly when you lean on them. Two more things the same episode taught me: routing reliability belongs in the routing config, not the agent's model list — and an over-eager edit to that model block can hard-fail the gateway on restart, so back up the file first.
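The cost arithmetic is easy to check by hand. The sketch below assumes a flat blended rate per million tokens and a 30-day month, which is a simplification since real pricing usually separates input and output tokens. The $0.039 and $0.95 rates are the ends of the price range quoted earlier for openai/gpt-oss-120b.

```python
def monthly_cost(tokens_per_day: int, usd_per_million: float, days: int = 30) -> float:
    """Cost in USD of a daily job at a flat per-million-token rate."""
    return tokens_per_day * days / 1_000_000 * usd_per_million


def implied_rate(monthly_usd: float, tokens_per_day: int, days: int = 30) -> float:
    """Blended USD-per-million-token rate implied by a monthly bill."""
    return monthly_usd / (tokens_per_day * days / 1_000_000)


for rate in (0.039, 0.95):
    print(f"${rate}/M tokens -> ${monthly_cost(30_000, rate):.3f} a month")
print(f"$0.20 a month implies about ${implied_rate(0.20, 30_000):.2f}/M tokens")
```

At this volume even the expensive end stays under a dollar a month, so the argument for pinning reliable providers is about delivery, not about the bill.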

signal-cli crashes on ARM64, and the fix is a container

The briefing went quiet. The gateway was up and healthy, but signal-cli kept dying. The crash pointed at a SIGSEGV deep inside libsignal's native JNI code during message encryption — generation worked fine, every send killed the daemon, and the watchdog kept restarting it into the same wall.

The root cause is an ARM64 packaging gap. signal-cli's bundled libsignal-client jar ships a native library for Linux x86 and macOS ARM, but not Linux ARM64. On an ARM box it falls back to a mismatched libsignal_jni.so that corrupts JNI handles and segfaults on the encrypt path. I burned time on JVM flags first — interpreter-only (-Xint), a different garbage collector, disabling compressed oops — and every one still crashed. It's not a JIT or GC problem. The native library is simply wrong for the platform.

What worked: stop fixing the native build by hand and run signal-cli from a container that ships correct ARM64 binaries — bbernhard/signal-cli-rest-api. Mount the existing account data so there's no re-pairing and no safety-number change:

docker run -d --name signal-daemon --restart unless-stopped --no-healthcheck \
  -p 127.0.0.1:8080:8080 \
  -e XDG_DATA_HOME=/data -e HOME=/tmp \
  -v "$HOME/.local/share:/data" \
  --user 1000:1000 --entrypoint signal-cli \
  bbernhard/signal-cli-rest-api:latest \
  daemon --http 0.0.0.0:8080 --no-receive-stdout

Then set channels.signal.autoStart to false and channels.signal.httpUrl to http://127.0.0.1:8080 so OpenClaw connects to the container instead of spawning the broken local binary. Sends stopped crashing, and a direct send test came back SUCCESS. Updates are now a docker pull.

One rough edge still open: in the container, replies to incoming messages resolve their recipient fine, but the proactive scheduled send (the cron "announce") doesn't resolve the explicit recipient and errors before sending — even though the generated text is right there in the run record. For now I get a briefing by messaging the bot. Tracking down why the scheduled push differs is still on my list.

An auto-update can kill your scheduled jobs without a single error

The briefing went quiet again — but nothing crashed. The gateway was healthy, systemctl status said active (running), restart count low. The daily cron just stopped firing: no error, no run-log row. Two days passed before I noticed.

The cause: OpenClaw auto-updated on disk and migrated its cron store as part of the bump — the per-feature JSON files got consolidated into a single ~/.openclaw/state/openclaw.sqlite, the old ones renamed *.migrated. But it never restarted the running process. The old in-memory scheduler still pointed at the now-renamed jobs.json, which no longer existed. New code on disk, old code in memory, store moved out from under both.

The class is worth naming: an auto-updating daemon that migrates state without restarting fails silently. "It worked yesterday" means nothing across an auto-update. Newer OpenClaw ships a restart-after-update path, but the fallback may not pick up a store migration, and I didn't want to bet the briefing on that.

The fix watches the outcome, not the process — because the process was green the whole time. A small stdlib Python script on a daily systemd timer, run an hour after the brief is due, asks one question: did today's expected output actually arrive? On a miss it messages me over the same Signal channel, naming the problem and the fix (sudo systemctl restart openclaw.service). A healthy day sends nothing. Checking "is the daemon up" stays green for two days. Checking "did the thing I wanted happen" catches it the same morning.

One footgun while debugging: don't run the openclaw CLI on the host. openclaw cron list, openclaw doctor, any of them — the CLI detects the running gateway's PID and SIGTERMs it before starting its own, knocking the service over for 20-30 seconds. Edit the cron store directly, hit signal-cli's JSON-RPC for sends, and admin from anywhere but the box.

The pattern across all three: the thing that tells you it's fine — the toggle, systemctl status, the watchdog — is measuring something next to what you actually care about. The cure each time was to watch the outcome and keep a way back in. If you're running OpenClaw on your own box and hit a fourth one of these, or found a cleaner fix for that proactive-send gap, I'd like to hear it.

Text-to-Speech for Claude Code — Hear What the Agent Is Doing

Adrien Cossa — Sat, 06 Jun 2026 20:35:46 +0000

Claude Code can already listen to you. Run /voice and you get push-to-talk dictation: you speak, it transcribes into the prompt. What it does not do is talk back. When I leave a long task running, I either babysit the terminal or miss the moment it finishes or asks a question.

So I added the other half: text-to-speech. A hook reads the agent's replies aloud. I can be in another room and still hear "done, tests pass" or "I need a decision here". This post has two parts — a small recipe anyone can paste into their config, and how I wired the same idea into my own tooling for the times I'm not at my desk.

This is a personal hack, not a Claude Code feature. It reads short text aloud after the agent stops. That's it. No wake words, no conversation, no reading code blocks (you don't want that).

The recipe: a hook + your OS speech command

Claude Code hooks run a shell command on lifecycle events. The two that matter here:

Stop: fires when the agent finishes responding. It receives JSON on stdin whose transcript_path field holds the path to the conversation transcript.
Notification: fires when Claude Code wants your attention (a permission prompt, an idle nudge). It receives JSON on stdin with the notification text in a message field.

Notification is the simplest win, so start there. Every OS ships a speech command: say on macOS, spd-say or espeak-ng on Linux, and a one-line PowerShell call on Windows.

Here is a Notification hook that speaks the message. Put it in ~/.claude/settings.json:

{
  "hooks": {
    "Notification": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.message // empty' | say"
          }
        ]
      }
    ]
  }
}

jq reads the message field from the JSON on stdin, and say (macOS) reads piped text aloud. On Linux swap say for spd-say -e or espeak-ng, both of which also read stdin. On Windows, point the command at PowerShell:

"command": "jq -r '.message // empty' | powershell -Command \"Add-Type -AssemblyName System.Speech; (New-Object System.Speech.Synthesis.SpeechSynthesizer).Speak([Console]::In.ReadToEnd())\""

That covers the "needs your attention" case. If you also want the agent to read its actual reply, add a Stop hook. The wrinkle: Stop gives you the transcript path, not the text. The transcript is JSONL (one JSON object per line), so you pull the last assistant text block out of it:

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "jq -rs 'map(select(.type==\"assistant\")) | last | .message.content[]? | select(.type==\"text\") | .text' \"$(jq -r .transcript_path)\" 2>/dev/null | head -c 600 | say"
          }
        ]
      }
    ]
  }
}

A few honest caveats, because this is where it gets rough:

Cap the length. head -c 600 stops say droning through a 4 KB status report. Pick your own limit.
Strip markdown if you can. Read aloud, code fences and URLs are noise. The recipe above doesn't strip them — for a one-liner it's tolerable, but a real version should.
The transcript shape is not a stable public contract. The jq filter above matches the current JSONL layout. If Claude Code changes it, the filter breaks. Treat it as a hack, not an API.
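For the "real version" that strips markdown, a short Python script is easier to maintain than a longer jq one-liner. The sketch below makes the same assumption about the transcript layout as the jq filter (entries with type "assistant" and text blocks under message.content), so it carries the same fragility. The function names are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block
FENCED_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def last_assistant_text(jsonl: str) -> str:
    """Return the text blocks of the last assistant entry in a JSONL transcript."""
    last = None
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("type") == "assistant":
            last = entry
    if last is None:
        return ""
    content = last.get("message", {}).get("content", [])
    if isinstance(content, str):
        return content
    return "\n".join(
        block.get("text", "")
        for block in content
        if isinstance(block, dict) and block.get("type") == "text"
    )


def speakable(text: str, limit: int = 600) -> str:
    """Drop fenced code and URLs, strip light markdown, and cap the length."""
    text = FENCED_BLOCK.sub(" ", text)                      # fenced code blocks
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)    # [label](url) -> label
    text = re.sub(r"https?://\S+", " ", text)               # bare URLs
    text = re.sub(r"[`*#>]", "", text)                      # inline markers
    return " ".join(text.split())[:limit]
```

A Stop hook would read the JSON from stdin, open the file named by transcript_path, pass its contents through both functions, and pipe the result to the speech command. An unclosed fence is left in place, so this is still a best-effort cleanup.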

For most people the Notification hook alone is enough, and it's the part least likely to break.

The t3 extra: speak settings

I keep my Claude Code automation in a project called teatree. It has a t3 speak command driven by one [teatree.speak] table:

[teatree.speak]
local = "dm"   # what plays on this machine's speakers: "dm" | "all" | "off"
slack = true   # attach a spoken audio file to each bot→user Slack DM

local controls the speakers in front of you: dm reads only the bot's DMs to you, all also reads every agent turn aloud, off is silent. slack attaches a spoken audio file to each bot→user DM. The two are independent, and both default off, so it does nothing until you configure it.

Two destinations because there are two places I am. At the desk, local plays through the speakers the moment a DM lands — no clicking. Away from it, slack is what I reach for: the spoken text arrives as an audio file attached to the DM, and on the phone I press play. Not hands-free, but I can listen while moving instead of stopping to read.

Two operational notes. The voice comes from macOS say. And slack needs the bot's file-upload permission, so an existing bot has to be reinstalled once to grant it.

Where it stands

The hook recipe is the part I'd actually recommend trying — it's a few lines and it degrades gracefully. The teatree side is tied to my own setup, so take it as one way to structure the same idea rather than something to copy verbatim.

I'm still figuring out how much to read aloud. local = "all" gets chatty fast. dm is calmer but misses things. If you try this, I'd be curious what threshold works for you.

Coding is solved. The factory isn't.

Adrien Cossa — Fri, 05 Jun 2026 09:23:20 +0000

Highly opinionated, based on my personal experience. Not a prescription —
just notes from what I keep figuring out while dogfooding my own setup. I'm
scratching the surface, with a lot left to learn.

I'm building a multi-repo personal code factory. I don't spec it up front: I dogfood it day by day — using it, and asking for improvements or fixes when something breaks. The architectural decisions still can't be made blindly by the models, so daily use is how the system finds its shape. Two qualifiers about scope, then four claims.

Personal means local — for better and worse. I call this a personal code factory because it runs on my own laptop, with the credentials I already carry as a developer — not as managed infrastructure with its own auth, audit, and isolation. It's a personal dev tool, with a personal dev tool's trade-offs: the agent works inside the access I already have, the way I'd run anything else on my own machine.

The upside of that scope is speed. It's my own tool on my own machine, so there's nothing to coordinate — no infra to provision, no shared environment to keep in sync, nothing to wait on. It just automates what I'd otherwise do by hand, and that's what lets me experiment fast and reshape the thing every week.

The downside is the same thing said the other way: because it's mine, it doesn't help anyone else, and it's nowhere near as efficient as something that would run on GitLab or Slack directly. This is a POC. If it turns out to work, the right move is to promote it to actual company infrastructure.

Multi-repo means a particular kind of hard. I run it on multi-repo because that's what my work looks like. If your code lives in a single repo, a lot of what I describe in this series either disappears or shows up differently. I'm not claiming the multi-repo case is the interesting one — just that it's the one I have.

1. Coding is solved

Coding is solved — Cherny's phrase, and I think he's right. It took four things, and they're not the same thing.

The model: capable enough to write the code.

The harness: what lets the model act instead of just emitting text — read the repo, run the tests, iterate, fix. (Mine is Claude Code, but the principle isn't tied to it.)

A layer of deterministic constraints: checks that keep the output converging toward quality instead of tech debt. I work in Python, so for me that's ruff, ty, tach run through prek, plus gitleaks and a stack of project-specific hooks. Different language, different tools — the constraint is the point, not the toolchain.

And skills: written guidance that gives the model the business and project knowledge to make the right call in this codebase, not a generic one.

Take any one of the four away and it stops working. What none of them guarantees is that the architecture is right — and that is the next claim.

2. The factory around it isn't, and you can't specify it

The factory around it isn't solved. I don't think you can specify it up front.

There are two ways to get a system that builds and ships software for you. One: write the spec — every edge case, every failure mode, every integration — hand it to an agent, let it build. Two: use it every day and fix what breaks.

I don't believe in the first — at least I wouldn't try it. A spec for a system that builds, reviews, and ships software ends up being more or less the system itself: you don't find out which edges bite until they bite. And the architectural calls inside it still can't be made blindly by the models, so the spec would have to make all of them in advance — that's the part I don't see working yet.

That leaves the second way.

3. Dogfooding is the only loop where any of this works

That leaves dogfooding — using the thing every day, fixing what breaks, keeping it running tomorrow.

Dogfooding fuses three things into one loop that no spec can: verifying the system works, improving it where it's wrong, and keeping it running long enough to do both. The first two are the same act — you verify by trying to use it, and the parts that don't work are the parts you fix.

Making that verification less manual splits into two halves. The proactive half is a test suite that checks whether the agent behaves as intended (did it reach for the right tool, did it avoid the wrong one), so a behavior regression shows up as a red test instead of going unnoticed for days. I'm only starting on these: a handful of behavioral scenarios plus the deterministic checks around them, noisy enough that I don't lean on them yet. The reactive half is a runtime hook that catches a bad action as it happens and refuses it: the backstop for when the agent misbehaves anyway. I lean on those far more today. But every backstop I need is something the proactive half didn't catch in time. If the evals and the agent were good enough, the gates would be dead weight. They aren't yet, so I keep both.
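The deterministic part of a behavioral scenario can be as simple as comparing the tools a recorded run called against what the scenario expects. This is a sketch with hypothetical names, not my eval suite: it assumes the run has already been reduced to a list of tool names.

```python
def scenario_failures(
    tool_calls: list[str], required: set[str], forbidden: set[str]
) -> list[str]:
    """Compare the tools an agent actually called against a scenario's expectations."""
    called = set(tool_calls)
    failures = [f"never called required tool: {t}" for t in sorted(required - called)]
    failures += [f"called forbidden tool: {t}" for t in sorted(forbidden & called)]
    return failures


# Hypothetical recorded run: the agent force-pushed and never ran the tests.
print(scenario_failures(["read_file", "git_push_force"], {"run_tests"}, {"git_push_force"}))
```

An empty list is a green test. Anything else names the regression, which is the difference between a red test today and a surprise days later.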

The third thing in the loop is the precondition. Self-improvement and resilience are two sides of the same coin. A system that shuts down can't keep improving itself. If I had to pick which matters more, it's resilience — improvement stops the moment the loop stops. You don't get either by specifying them. You get both by running the thing every day and refusing to let it stay broken.

So who orchestrates the loop? That's the last claim.

4. Orchestration is the last thing to automate

Orchestration looks like the part that stays human: holding the big picture, deciding what gets attention first, noticing when two threads are about the same thing, deciding what to keep and what to drop.

In teatree most of it already runs without me. One orchestrator with the big picture, not a swarm — it arbitrates and hands the actual work to sub-agents. What still needs me is basically troubleshooting and steering, and I assume that the loop can't be fully closed as long as the behavioral evals are missing.

What the rest of the series is

I'll try to publish roughly one post a week. Each one is the thing I keep getting wrong and trying to get less wrong:

Part 1 — Software engineering became software architecture. Deterministic constraints solve code quality. Nothing solves whether the architecture is right — that's what's left for a human, and there is no gate for it.
Part 2 — Suppose the skill is never followed. Why I treat prose guidance as decorative and everything that matters as a hook, why I had to invent memory because skills aren't reliably read, and how I'm starting to write evals — a test suite that checks whether the agent actually behaved as planned — so I catch a skill being ignored before a hook has to.
Part 3 — Make yourself an optional reviewer. The closed-loop part, including the surprisingly hard subproblem of letting the system merge PRs without my approval without it feeling reckless.
Part 4 — One orchestrator, many loops. Why I run a single session with many sub-agents instead of many sessions, what that costs, and the honest ceiling I think it has.
Part 5 — FSM in the database. How concurrency, leaks, and crashes stopped being terrifying once the workflow state lived in a table instead of in memory — and how the same substrate carries resilience and a distributed improvement mechanism across multiple repos.

I'll change my mind about some of this between now and the last post. That's the point.

Installing OpenClaw the Easy Way

Adrien Cossa — Mon, 16 Mar 2026 16:34:05 +0000

I started installing OpenClaw manually: reading through the docs, figuring out each step, hitting the usual walls. It was taking too long, so I stopped and wrote a skill for it instead: a set of instructions and references that lets an AI coding agent handle the whole thing. This is the skill-driven development (SDD) approach I described in a previous post: let the agent do the work, fix the skill when it gets something wrong, repeat until it gets it right.

The result is a skill that handles the full installation and configuration of OpenClaw on a VPS. You tell your agent "set up OpenClaw on my server" and it walks through everything: provisioning, hardening, messaging channels, backups. The skill encodes the gotchas so you don't have to hit them yourself.

What you get

With the skill loaded, setting up OpenClaw goes roughly like this:

"Set up OpenClaw on my new Hetzner server"

The agent asks a few questions — SSH access details, which messaging channels you want, which model providers you have API keys for — then works through the phases in order. It handles:

Server provisioning — SSH setup, initial access
OS hardening — UFW firewall, Fail2Ban, SSH lockdown
Disk encryption — LUKS full-disk or encrypted block storage volumes
Runtime installation — Node.js, Python, build tools
OpenClaw installation — clone, configure, systemd service
Model configuration — API keys (Anthropic, OpenAI, etc.) or local Ollama
Messaging channels — Signal, Telegram, WhatsApp, Discord, and others
Multi-agent routing — different agents for different contacts
Remote access — Cloudflare Tunnel, Tailscale, or Caddy reverse proxy
Docker sandboxing — container isolation with proper firewall rules
Backups — local snapshots, GitHub push, cloud provider images
Social media integration — optional, third-party schedulers
Ongoing maintenance — updates, log rotation, health checks

At each step, the agent adapts to your choices. Pick Telegram instead of Signal? The Signal-specific build steps get skipped. Want Tailscale instead of Cloudflare Tunnel? Different config, different verification. The skill describes the trade-offs; the agent presents them and moves on.

Without the skill, an agent can still attempt all of this — the information exists in OpenClaw's docs and platform-specific guides. But it takes hours of trial and error, and several of the issues below are genuinely hard to figure out from scratch.

Issues the skill handles for you

These are the problems that came up during the SDD loop — things the agent got wrong, that I fixed in the skill, and that you won't have to deal with.

Docker bypasses UFW

This one is not well-known. When Docker publishes a port, it writes its own iptables rules that bypass UFW entirely. Your firewall says "deny all incoming," but every port Docker publishes is reachable anyway.

The skill includes explicit DOCKER-USER chain rules that block all inbound connections to containers unless they come from loopback (127.0.0.1) or your Tailscale CGNAT range (100.64.0.0/10). Without this, any container port you publish is exposed to the internet regardless of your UFW config.
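The allow-list behind those rules can be sketched in Python. This only illustrates the policy (which source addresses get through), not the iptables rules the skill ships; the function name `is_allowed_source` is hypothetical.

```python
import ipaddress

# Sources the text says may reach containers: loopback and the Tailscale CGNAT range.
ALLOWED_NETWORKS = (
    ipaddress.ip_network("127.0.0.1/32"),
    ipaddress.ip_network("100.64.0.0/10"),
)


def is_allowed_source(address: str) -> bool:
    """Return True if the source address falls inside an allowed network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    # Loopback, a Tailscale-style address, and a public documentation address.
    for candidate in ("127.0.0.1", "100.101.102.103", "203.0.113.7"):
        print(candidate, is_allowed_source(candidate))
```

Anything that fails this check is what the DOCKER-USER rules are meant to drop before it reaches a published port.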

Signal on ARM64 has no pre-built binary

If you're running on an ARM64 VPS (Hetzner CAX series, for example), signal-cli's libsignal JNI binding doesn't have a pre-built binary. You have to clone the signal-cli repo and build libsignal from Rust source yourself. This requires openjdk-25-jre-headless, build-essential, cmake, libclang-dev, protobuf-compiler, and rustc — a dependency chain that takes a while to sort out.

The skill documents the exact build steps and dependencies. Without it, the agent installs signal-cli, it silently fails at runtime, and you spend a while figuring out why.

Secrets end up in config files

Left to its own devices, the agent will store API keys in environment files or config files because it's the fastest path. The skill enforces storing all secrets in pass (the standard Unix password manager) and reading them at runtime. No API keys in plain text, ever.
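A minimal sketch of the "read at runtime" half, assuming secrets are stored as ordinary pass entries with the secret on the first line. The helper name `read_secret` and the entry name in the usage note are hypothetical; `pass show` is the standard command for printing an entry.

```python
import subprocess
from typing import Callable


def read_secret(
    name: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Fetch a secret from pass at runtime instead of storing it in a file."""
    # `pass show <name>` prints the entry; by convention the first line is the secret.
    result = run(["pass", "show", name], capture_output=True, text=True, check=True)
    lines = result.stdout.splitlines()
    if not lines:
        raise ValueError(f"pass entry {name!r} is empty")
    return lines[0]
```

A service wrapper could call `read_secret("openclaw/anthropic-api-key")` at startup and hand the value to the process through its environment, so nothing lands on disk in plain text. The `run` parameter exists only so the function can be exercised without a real password store.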

Services "started" but not actually running

A common agent failure: it runs systemctl start openclaw, gets no error, and declares the phase complete. But the service might have crashed immediately after starting, or it might be listening on the wrong port, or a dependency might be missing.

The skill marks verification as (Non-Negotiable) — the agent must confirm each service actually responds via HTTP before moving on. This single rule prevented most of the false-completion issues during the SDD loop.
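The check itself is small. Here is one way it could look, as a sketch rather than the skill's actual script: poll a URL until something answers over HTTP or a deadline passes. The function name and the default timings are assumptions.

```python
import http.client
import time
import urllib.error
import urllib.request


def responds_over_http(
    url: str,
    timeout: float = 30.0,
    interval: float = 1.0,
    request_timeout: float = 5.0,
) -> bool:
    """Poll url until it returns any HTTP response, or give up after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout):
                return True  # 2xx/3xx: the service answered
        except urllib.error.HTTPError:
            return True  # 4xx/5xx still proves something is listening and speaking HTTP
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            pass  # refused, reset, timed out, or garbage on the wire: not up yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Treating an error status as "responding" is a design choice: it separates "the process is up" from "the process is healthy". A stricter variant would return True only for a 2xx from a dedicated health endpoint.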

Disk encryption on a cloud VPS

LUKS full-disk encryption on a cloud VPS isn't straightforward. You need to boot into rescue mode, set up cryptsetup, and install dropbear-initramfs so you can unlock the disk remotely after every reboot. The skill documents this path but also offers a simpler alternative: Hetzner encrypted block storage volumes for sensitive data only, which avoids the rescue-boot complexity for most use cases.

Remote access done wrong

Exposing ports directly to the internet is the default instinct, but it's the wrong one. The skill requires one of three approaches: Cloudflare Tunnel (outbound-only, Zero Trust), Tailscale Serve (private mesh), or Caddy reverse proxy with password auth. Each has documented trade-offs. The agent asks which one you prefer and configures it accordingly.

How the skill is structured

The main SKILL.md file is around 160 lines and covers the decision flow and ordering constraints. Detailed procedures live in reference files that the agent pulls in as needed:

Reference	What it covers
Security hardening	SSH config, UFW rules, Fail2Ban jails, Docker iptables bypass
Local models	Ollama installation, model selection by RAM, GPU passthrough
Messaging channels	Per-channel setup (Signal ARM64 build, WhatsApp QR pairing, Telegram BotFather)
Multi-agent routing	Contact-to-agent bindings, DM access policies, agent personalities
Remote access	Cloudflare Tunnel vs Tailscale vs Caddy — trade-offs and setup
Social media	Third-party schedulers, risk considerations
Backups	Local snapshots, GitHub push with deploy key, cloud provider images

This progressive disclosure keeps context usage reasonable. The full Signal ARM64 build instructions only get loaded if you're actually setting up Signal on ARM64.

Limitations

The skill isn't tied to a specific setup. It asks what you're working with — VPS or local machine, which OS, which architecture, which provider — and adapts accordingly. It can fetch current VPS pricing (Hetzner CAX series, DigitalOcean, etc.) so you can compare options before committing. If you want to run OpenClaw on a spare laptop instead of a cloud server, it adjusts the flow — no VPS provisioning, different networking, local Ollama instead of cloud API keys.

That said, my own setup (Hetzner ARM64, Ubuntu 24.04) is the only tested path so far. Other combinations should work but may have gaps in the gotcha coverage.

The skill is also a snapshot as of early 2026. OpenClaw is actively developed. Signal might ship ARM64 binaries eventually. Docker might fix the UFW bypass. Treat specific version numbers with appropriate skepticism.

And while the skill handles the setup, it doesn't replace understanding what's running on your server. If something breaks outside the skill's playbook, you'll need basic comfort with SSH, firewalls, and systemd to debug it.

Getting it

npx skills add https://github.com/souliane/skills --skill ac-openclaw -g -y

It works with any AI agent that can read files and run commands.

OpenClaw | MIT License

Skill-Driven Development: Transferring your Craft to AI Agents

Adrien Cossa — Fri, 13 Mar 2026 16:40:28 +0000

Every project has its own way of doing things — migration patterns, transaction handling, deployment quirks, that one PDF template workflow nobody wants to touch. I've been writing this stuff down as markdown files with helper scripts so my AI agent can follow along. I've taken to calling this skill-driven development (building on agent skills, a convention that started with Anthropic's Claude Code and has since been adopted by several other AI coding agents).

This post walks through some skills I've put together and shows the feedback loop that makes them worth maintaining.

What skills add over a README

A project README or an AGENTS.md can capture conventions. The skills I've been writing try to go a bit further:

Progressive disclosure — a slim main file loads first, with detailed references pulled in only when needed. This keeps the agent's context window focused.
Composability — skills declare dependencies and load together. A Django skill + a project overlay skill + a TDD skill compose into a complete workflow.
Scripts — skills can ship executable scripts alongside the instructions. A script the agent calls is more reliable than a 15-step procedure the agent interprets.
Self-improvement — when something goes wrong, the fix goes into the skill. Next session, the agent follows the updated instructions.

That last point is what makes it worth maintaining: every correction you make goes into the skill, and the agent doesn't make the same mistake twice.

The SDD loop

There's a useful parallel with Test-Driven Development:

	TDD	SDD
Write	Write the test first	Write the skill first
Run	Run the code	Let the agent produce the code
Evaluate	Test fails → fix the code	Output is wrong → fix the skill
Loop	Until green	Until the agent gets it right

In TDD, you iterate on code until the tests pass. Here, you iterate on skills until the agent's output is what you wanted. The skill isn't a one-shot handoff — you keep tweaking it as you find gaps.

In practice, the two aren't separate activities — skills encode TDD as part of the workflow. The implementation skill says "write a failing test first, then implement," and the agent does both. Nobody writes tests by hand and then separately invokes a skill; the skill is what makes the agent follow TDD.

When tests fail, it's worth asking whether the code is wrong or the skill is incomplete. If the skill is the gap, fix it and re-run. Over time it adds up: each fix prevents one class of mistake, and after enough sessions you've built up a decent set of guardrails just from things that went wrong.

If you use teatree (a set of lifecycle skills I put together for multi-repo development), this loop can be automated: t3-retro runs a retrospective after each session and writes fixes into the skill files. But it works just as well manually — whenever you correct the agent, you can put that correction in a skill so it sticks.

What a skill looks like

A skill is a markdown file (SKILL.md) with YAML frontmatter, often accompanied by scripts for the mechanical parts. Here's a simplified example of the markdown side:

---
name: ac-django
description: Django coding conventions and best practices.
metadata:
  version: 0.0.1
---

# Django Conventions

## Models

### Fat Models Doctrine
- Business logic belongs in models, not views or serializers.
- Use model managers for complex queries.

### Migrations
- Always use `apps.get_model()` in data migrations — never import directly.
- Set `elidable=True` on data-only migrations.
- Include both `forwards` and `backwards` functions.

## Settings

### Storage Configuration (Non-Negotiable)
- Use `STORAGES` dict (Django 4.2+), not `DEFAULT_FILE_STORAGE`.
- The deprecated setting causes silent failures on deployment.

Rules marked (Non-Negotiable) are things I've learned the hard way. "Always verify services respond via HTTP before declaring running" sounds obvious, but without it, the agent will say "servers started" without checking whether anything actually came up.

These work with any agent that can read files — Claude Code, Codex, Cursor, whatever. The agent reads the skill and follows the instructions.
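To make the storage rule from the example skill concrete, this is roughly what it asks for in a settings file. The two backends shown are Django's own defaults; a real project would swap in whatever backend it uses.

```python
# settings.py fragment (Django 4.2+): storage backends are declared in one dict.
STORAGES = {
    "default": {
        "BACKEND": "django.core.files.storage.FileSystemStorage",
    },
    "staticfiles": {
        "BACKEND": "django.contrib.staticfiles.storage.StaticFilesStorage",
    },
}

# Deprecated form the rule forbids (removed in Django 5.1):
# DEFAULT_FILE_STORAGE = "django.core.files.storage.FileSystemStorage"
```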

What's in the repo

I've put the generic ones in a public repository in case any of them are useful to others. Here are the ones I'd recommend looking at first.

`ac-reviewing-skills` — keep your skills in shape

This is probably the most broadly useful one. It does a deep audit of your skill files: architecture, content quality, script correctness, stale cross-references, duplicated guidance. I run it periodically and it consistently finds things I missed: rules that drifted between files, references pointing at renamed sections, scripts with missing error handling. If you maintain more than a handful of skills, it's worth adding to your routine.

`ac-django` — Django conventions that models already "know" but get wrong

The agent knows Django. It doesn't know how you use Django. This skill covers the mistakes I kept correcting: outdated migration patterns (apps.get_model() vs direct imports), unsafe transaction handling, the STORAGES dict vs deprecated DEFAULT_FILE_STORAGE, post_migrate signal timing for permission assignments. It's a reference, not a tutorial — it assumes the agent already understands the framework and just needs guardrails for the non-obvious parts.

ac-python is its companion for generic Python: style, typing, OOP patterns, testing conventions. Less opinionated, but useful as a baseline.

`ac-adopting-ruff` — structured linter migration

A step-by-step playbook for replacing black + isort + flake8 with ruff, one rule category per MR. It handles the things I got stuck on — conflicting formatter settings, rule equivalences between linters, the unfixable vs ignore distinction. Doing it in one big MR is painful; the skill breaks it into reviewable increments.

`ac-openclaw` — self-hosted AI assistant setup

An interactive guide to install OpenClaw on a VPS or local machine. Covers server provisioning, OS hardening, model configuration (BYOK or local Ollama), messaging channel integration (Signal, WhatsApp, Telegram, etc.), and secure remote access (Cloudflare Tunnel, Tailscale, or Caddy). It walks through every decision point — useful if you want a self-hosted personal AI assistant without piecing together a dozen tutorials.

Everything else

Skill	What it covers
`ac-python`	Generic Python: style, typing, OOP design, testing, tooling
`ac-editing-acroforms`	AcroForm PDF templates: widget geometry, content streams, font subsetting
`ac-auditing-repos`	Cross-repo infrastructure audit: harmonize pre-commit, linter, and editor configs
`ac-writing-blog-posts`	Article writing + social media promotion + dev.to publishing pipeline
`ac-generating-slides`	Markdown to presentation slides via Marp
`ac-scaffolding-skill-repos`	Scaffold new skill repos with correct config and structure

ac-editing-acroforms deserves a mention — it came out of editing PDF form templates by hand. The internals (annotation dictionaries, appearance stream generation, widget flags) are barely documented. The agent can't figure this out from training data alone, so the skill ships with Python scripts that handle the tricky bits.

This blog post was written with ac-writing-blog-posts, for what it's worth.

How to use them

Install with npx skills:

npx skills add https://github.com/souliane/skills --skill '*' -g -y

This installs all skills globally for your default agent. To install for multiple agents at once:

npx skills add https://github.com/souliane/skills --skill '*' -g -y \
  --agent claude-code codex cursor github-copilot

If you want the SDD feedback loop — where retrospective fixes land in files you can commit — clone the repo and symlink it into your agent's skills directory:

git clone git@github.com:souliane/skills.git ~/workspace/souliane/skills

# Example for Claude Code; adjust the target for your agent runtime
mkdir -p ~/.claude/skills  # make sure the target directory exists
for skill in ~/workspace/souliane/skills/ac-*/; do
  ln -s "$skill" ~/.claude/skills/"$(basename "$skill")"
done

This points your agent at the live git checkout directly. When the agent (or you) updates a skill file, the change is immediately available in the next session and can be committed. Don't use npx skills add for this — it creates a managed copy that doesn't point back to your clone.

If you use teatree, its setup wizard can suggest these as companion skills for your project overlay — they're loaded automatically when you work in matching repos.

When it helps

Skills work best when:

You correct the agent for the same kind of mistake more than once
Your project has conventions that diverge from common patterns
You work across sessions and the agent keeps losing context
You use deterministic tools (PDF editors, linters, deployment scripts) where the agent needs exact steps
You want to share a recipe with others — a skill is a portable, self-contained package that anyone can install and use with their own agent

They're less useful for one-off tasks or when the model's defaults already match your preferences. But even something you only do once yourself might be worth writing as a skill if it's useful to someone else.

These skills reflect my own workflow — Django, Python, PDF templates, multi-repo infrastructure. They might not match yours at all. The most useful skills are probably ones you'd write yourself for your own project's conventions. These are just examples of what worked for me.

Update — June 2026

The tooling around skills has moved on since I wrote this, so a couple of additions.

There's now an attempt at a shared convention (agentskills.io) and lots of community skill collections worth borrowing from. A term I've picked up is "skillify" — capturing a workflow into a reusable skill draft after a session pulls it off, which is a nice bottom-up complement to writing the skill up front.

The bigger gap is the evaluate step in the loop above: I left it as a human judgment call, but the better answer is a runnable eval. Anthropic's skill-creator stores test cases (a prompt plus an expected_output) in an evals/ folder and grades a skill's output with and without the skill, so a regression shows up as a failing benchmark instead of a gut feeling. If I were drawing the SDD diagram today, that node would be a real eval. If you're picking this up now, lean on the current tooling rather than this post — it's a bit dated already.
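A toy version of that idea, to show the shape of it. This is not skill-creator's format or grader: only the pairing of a prompt with an expected_output comes from the description above, and the agent call, the substring grading, and every name here are assumptions.

```python
from typing import Callable

# Hypothetical test cases; a real setup would load these from an evals/ folder.
CASES = [
    {"prompt": "Write a data migration", "expected_output": "apps.get_model"},
    {"prompt": "Configure file storage", "expected_output": "STORAGES"},
]

# An agent is modelled as a function: (prompt, skill_loaded) -> output text.
Agent = Callable[[str, bool], str]


def pass_rate(agent: Agent, cases: list[dict[str, str]], with_skill: bool) -> float:
    """Fraction of cases whose output contains the expected text."""
    passed = sum(
        case["expected_output"] in agent(case["prompt"], with_skill) for case in cases
    )
    return passed / len(cases)


def skill_adds_value(agent: Agent, cases: list[dict[str, str]]) -> bool:
    """True when loading the skill beats the no-skill baseline."""
    return pass_rate(agent, cases, True) > pass_rate(agent, cases, False)
```

Substring matching is the crudest possible grader. The point is only that "fix the skill" becomes a number that can go red.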

GitHub | MIT License

Introducing Teatree: Parallel Multi-Repo Development with AI Agents

Adrien Cossa — Thu, 12 Mar 2026 13:16:50 +0000

I'm a Customer Success Engineer at Oper Credits. My daily work involves a multi-repo project — backend, frontend, translations, configuration — and I use AI coding agents constantly. The friction isn't writing code; agents handle that well. It's everything surrounding it: following different conventions across codebases, coordinating changes across services, managing local environments that diverge from what's in git, and encoding the workflow patterns we could all benefit from.

The agent can figure out most of these things, but it struggles with the specifics — it loops on troubleshooting, tries approaches that don't match the project's actual setup, and burns tokens on trial and error. I started putting together teatree to write down that knowledge so the agent doesn't have to rediscover it every session. It's also a way to define and automate your personal workflow without adding friction with your team — build it on your own, then push for adoption once it works.

This post walks through the architecture, the design choices I landed on, and how the pieces fit together. It's long because there's a lot of ground to cover. If you just want the quick pitch, the README has that.

What it looks like
The problem
Skills as markdown and scripts
The lifecycle graph
Multi-repo worktree management
The overlay and extension system
Auto-loading hooks
The retrospective loop
Companion skills
Getting started
When it helps (and when it doesn't)

What it looks like

Tell your AI agent what you want. Teatree skills guide it through the entire lifecycle:

https://gitlab.com/org/repo/-/issues/1234

The agent fetches the ticket, creates synchronized worktrees, provisions isolated databases and ports, implements the feature with TDD, writes a test plan, runs E2E tests, self-reviews, then pushes and creates the merge request.

Fix PROJ-5678

The agent fetches the failed test report from CI, reproduces locally, fixes, pushes, and monitors the pipeline until green.

Review https://gitlab.com/org/repo/-/merge_requests/456

The agent fetches the ticket for context, inspects every commit individually, and posts draft review comments inline on the correct file and line.

Run the test plan for !789

The agent generates a test plan from the MR changes, runs E2E tests, and posts evidence screenshots on the MR.

Follow up on my open tickets

The agent batch-processes your assigned tickets, checks CI statuses, nudges stale MRs, and starts work on anything that's ready.

The problem

AI coding agents can do a lot — reason about architecture, run tests, create merge requests. But without your project's specific context, they spend tokens and time rediscovering things you already know. Your repo layout, your CI conventions, your team's practices, your local tooling — none of that is in training data.

The friction is especially pronounced with:

Multi-repo setups — creating branches across 3+ repos for a single ticket, provisioning isolated databases, allocating non-conflicting ports
Atypical local environments — personal tooling that differs from what's in git, dev configurations the team hasn't adopted yet
Operational workflows — self-reviewing before pushing, creating properly formatted merge requests, monitoring pipelines, running retrospectives

The agent can attempt all of these. But without explicit guidance, it either asks twenty questions or confidently does the wrong thing — and when something fails, it loops instead of applying the fix you already know.

I tried shell scripts and aliases first, sometimes Python scripts too. They worked for the happy path but couldn't handle edge cases — the database import that fails because VPN is down, the port conflict because another worktree is still running, the CI format check that rejects your MR title. A shell script can't say "if the test fails, check if it's a known flake — here are the patterns." An AI agent can.

So I started writing this stuff down — as markdown instructions with tested Python and shell scripts for the mechanical parts. The markdown gives the agent enough context to handle edge cases; the scripts handle deterministic operations where you don't want the agent improvising.

Skills as markdown and scripts

A teatree skill starts with a markdown file (SKILL.md) with YAML frontmatter, but the heavy lifting often happens in scripts that ship alongside it. Teatree currently has 15 Python executables, 9 library modules, and 3 shell scripts — backed by 26 test files. Here's a simplified example of the markdown side:

---
name: t3-code
description: Writing code with TDD methodology.
requires:
  - t3-workspace
metadata:
  version: 0.0.1
---

# Writing Code (TDD)

## Dependencies

- **t3-workspace** (required) — provides dev servers for live reload.

## Workflow

### 1. Plan First (Non-Negotiable)

Always make a plan before writing code. Never jump straight to coding.
- Identify scope: which files, modules, and repos are affected.
- Review existing patterns in the codebase before writing new code.

### 2. TDD Cycle

Write failing test → Implement → Green → Refactor

### 3. Follow Conventions

- Language/framework conventions from the project's convention skills.
- Repository-specific patterns take precedence over generic guidance.

A few things to note:

Skills contain both instructions and scripts. The markdown tells the agent when and why to do things. The Python scripts handle deterministic operations: worktree creation, port allocation, database provisioning, branch finalization. A script the agent calls is more robust than a 15-step procedure in a markdown file. Instructions for judgment calls, scripts for mechanical work.

Skills declare dependencies. The requires: field in the frontmatter tells the loading system which other skills need to be present. When t3-code is loaded, t3-workspace comes along automatically. This eliminates wasted round-trips where the agent reads a skill, sees "Load /t3-workspace now", and then has to make a second call.

Skills use progressive disclosure. Most SKILL.md files are 80–160 lines, with detailed procedures in references/ files that the agent reads on demand. This keeps the typical skill set well within a reasonable context budget.

Skills have rules marked (Non-Negotiable). These are things I've had to learn the hard way. "Always verify services respond via HTTP before declaring running" sounds obvious, but without it, the agent will say "servers started" without checking whether anything actually came up.
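How a loader might turn `requires:` into a load order can be sketched as a depth-first walk. This is an illustration under my own assumptions, not teatree's loader: the `requires` mapping stands in for parsed frontmatter, and `resolve_load_order` is a hypothetical name.

```python
def resolve_load_order(skill: str, requires: dict[str, list[str]]) -> list[str]:
    """Return skill plus its transitive dependencies, dependencies first."""
    order: list[str] = []
    visiting: set[str] = set()

    def visit(name: str) -> None:
        if name in order:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"dependency cycle involving {name!r}")
        visiting.add(name)
        for dep in requires.get(name, []):
            visit(dep)
        visiting.discard(name)
        order.append(name)

    visit(skill)
    return order


if __name__ == "__main__":
    # Mirrors the frontmatter example: t3-code requires t3-workspace.
    print(resolve_load_order("t3-code", {"t3-code": ["t3-workspace"]}))
```

Resolving everything up front is what removes the second round-trip: the agent receives both skills in one load instead of discovering the dependency while reading.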

The lifecycle graph

Teatree organizes development into phases, each handled by a dedicated skill.

The flow is: ticket → code → test → review → ship → retro, with t3-workspace providing infrastructure to all phases and t3-debug available whenever something breaks.

Here's what each skill does:

Skill	Phase	What it handles
`t3-setup`	Bootstrapping	Interactive setup wizard, health checks, overlay scaffolding
`t3-workspace`	Infrastructure	Multi-repo worktrees, port allocation, DB provisioning, env files, dev servers, cleanup
`t3-ticket`	Intake	Fetch the issue, extract acceptance criteria, detect affected repos, detect tenant/variant, create worktrees
`t3-code`	Implementation	Plan-first workflow, TDD cycle, convention enforcement, feature flag checks
`t3-test`	Verification	Test execution, CI interaction, E2E test plans, quality gates
`t3-debug`	Troubleshooting	Systematic 5-phase debugging protocol, user-hint-first investigation
`t3-review`	Code review	Self-review checklist, giving review, receiving feedback
`t3-ship`	Delivery	Commit formatting, branch finalization, MR creation, pipeline monitoring
`t3-review-request`	Notifications	Post MR links to review channels, check for duplicate requests
`t3-retro`	Improvement	Conversation audit, root cause analysis, skill updates, privacy scans
`t3-contribute`	Contribution	Push skill improvements to fork, open upstream issues
`t3-followup`	Batch ops	Process assigned tickets, check CI statuses, nudge stale MRs

The skills mirror how development actually works. Implementing a ticket touches intake, coding, testing, review, and delivery, often across multiple repos. Making the skills fully independent would mean duplicating knowledge across every one of them, and duplicated knowledge always diverges over time.

The follow-up dashboard

One skill worth highlighting is t3-followup. It runs your daily routine: batch-processing new tickets, checking CI statuses, advancing tickets through their lifecycle, and nudging reviewers about stale MRs.

As it works, it builds a persistent cache (followup.json) of all in-flight work: tickets, merge requests, pipeline statuses, review request states, and review comment tracking. From that cache, it generates an HTML dashboard.

The dashboard gives you a single view of everything that's in flight: ticket lifecycle status, pipeline results (color-coded pills), review request state, and tracked review comments. Everything is a clickable link — tickets, MRs, CI pipelines, Slack messages — so you can jump directly into any conversation.

The cache is a plain JSON file, so project overlays can inject extra fields (external tracker status, deployment state, tenant info) via the followup_enrich_data extension point. Stale tickets are purged automatically after their MRs have been merged for 14 days (configurable via T3_FOLLOWUP_PURGE_DAYS).
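The purge rule is simple enough to sketch. The cache schema below is invented for illustration (one `merged_at` timestamp per ticket, ISO 8601 or None); the real followup.json tracks more than that, and a ticket with several MRs would need the latest merge date. Only the 14-day default and the T3_FOLLOWUP_PURGE_DAYS variable come from the description above.

```python
import os
from datetime import datetime, timedelta, timezone


def purge_stale(tickets: dict[str, dict], now: datetime, purge_days: int) -> dict[str, dict]:
    """Keep tickets that are unmerged or were merged less than purge_days ago."""
    cutoff = now - timedelta(days=purge_days)
    kept = {}
    for ticket_id, entry in tickets.items():
        merged_at = entry.get("merged_at")  # hypothetical field name
        if merged_at is None or datetime.fromisoformat(merged_at) > cutoff:
            kept[ticket_id] = entry
    return kept


# 14 is the default stated in the text; the environment variable overrides it.
PURGE_DAYS = int(os.environ.get("T3_FOLLOWUP_PURGE_DAYS", "14"))

if __name__ == "__main__":
    cache = {
        "PROJ-1": {"merged_at": "2026-03-01T00:00:00+00:00"},
        "PROJ-2": {"merged_at": None},
    }
    print(sorted(purge_stale(cache, datetime.now(timezone.utc), PURGE_DAYS)))
```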

Multi-repo worktree management

This is where I started, and it's the feature I use most.

Suppose your project has three repos: acme-backend, acme-frontend, and acme-translations. You're about to work on ticket PROJ-1234, so you run t3_ticket PROJ-1234.

Each ticket gets its own directory containing one git worktree per affected repo — lightweight checkouts that share the .git directory with the main clone but have their own branch and working tree. A shared .env.worktree file provides allocated ports, database name, and variant configuration.
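The layout follows the <ticket>/<repo>/ convention that t3_ticket enforces, and can be expressed as a few path computations. Treat this as a sketch: the workspace root, the use of the bare ticket ID as the directory name, and the location of .env.worktree at the top of the ticket directory are assumptions, and `worktree_paths` is a hypothetical helper.

```python
from pathlib import Path


def worktree_paths(workspace: Path, ticket: str, repos: list[str]) -> dict[str, Path]:
    """Map each repo to its worktree path under the <ticket>/<repo>/ convention."""
    ticket_dir = workspace / ticket
    paths = {repo: ticket_dir / repo for repo in repos}
    # Assumption: the shared env file sits at the top of the ticket directory.
    paths[".env.worktree"] = ticket_dir / ".env.worktree"
    return paths


if __name__ == "__main__":
    repos = ["acme-backend", "acme-frontend", "acme-translations"]
    for name, path in worktree_paths(Path.home() / "workspace", "PROJ-1234", repos).items():
        print(f"{name:20} {path}")
```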

After creating the worktrees, t3_setup provisions the environment:

Symlinks — .venv, node_modules, .python-version, and configurable shared directories are symlinked from the main repo (so you don't reinstall dependencies for every worktree)
Environment files — .env.worktree with unique ports, database URL, variant-specific overrides
Database — creates an isolated DB, imports from a snapshot or dump, runs migrations
direnv — auto-loads environment variables when you cd into the worktree
Frontend dependencies — installs if the lockfile changed

Then t3_start brings everything up: Docker services, migrations, backend server, frontend dev server. Each worktree is fully isolated — its own database, its own ports, its own services. You can have ticket 1234 and ticket 5678 running simultaneously without conflicts.
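One common way to get non-conflicting ports is to let the OS pick them. I'm not claiming this is how teatree allocates ports (a fixed offset per worktree would also work); it's a sketch of the general technique, with hypothetical service names.

```python
import socket


def allocate_ports(names: list[str]) -> dict[str, int]:
    """Ask the OS for one currently free TCP port per service name."""
    sockets = []
    ports: dict[str, int] = {}
    try:
        for name in names:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
            sockets.append(sock)  # keep it open so the next bind gets a different port
            ports[name] = sock.getsockname()[1]
    finally:
        for sock in sockets:
            sock.close()
    return ports


if __name__ == "__main__":
    print(allocate_ports(["backend", "frontend", "postgres"]))
```

The ports are only guaranteed free at the moment of allocation. Writing them into .env.worktree and starting the services promptly keeps the race window small, and a port registry shared across worktrees would close it.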

Why this matters

Without isolation, the most common failure is contamination between tickets. You're working on ticket A, make a database change, then switch to ticket B, which expects the old schema: migrations fail, the frontend shows stale data, and you spend time figuring out what went wrong. Worktree isolation avoids this. Each ticket is a clean room.

The other benefit is parallelism. While waiting for CI on ticket A, start working on ticket B in a completely separate environment. No branch switching, no stashing, no "wait, which database am I pointing at?"

Multi-tenant awareness

If your project serves multiple tenants — each with their own configuration, feature flags, and sometimes database — teatree handles that too. The variant system (wt_detect_variant) auto-detects the target tenant from ticket labels, descriptions, or external trackers, then provisions tenant-specific databases, environment variables, and configuration. Feature flag checks during code review ensure changes are properly scoped per tenant.

The project overlay wires in your tenant-to-variant mapping; teatree handles the rest. This means "set up a worktree for ticket X" automatically produces an environment configured for the correct tenant — no manual env file editing, no guesswork about which tenant you're in.

Why `t3_ticket` instead of raw git commands

The convention is <ticket>/<repo>/ — a ticket directory containing worktrees. Raw git worktree add creates flat worktrees at whatever path you give it, which breaks the ticket-directory structure that every other tool expects. t3_ticket enforces the convention, handles branch naming (with your prefix), and creates worktrees across all affected repos in one call. The skill file marks this as (Non-Negotiable) because flat worktrees cause subtle breakage downstream.

The overlay and extension system

Teatree knows how to create worktrees, allocate ports, and orchestrate a development lifecycle. It doesn't know how to start your backend, import your database, or create your merge requests. That project-specific knowledge lives in a project overlay.

The three-layer architecture

When teatree needs to do something project-specific (start the backend, import a database, create an MR), it calls an extension point through a registry. The registry resolves the implementation using a 3-layer priority:

Priority	Layer	Source	Example
Highest	Project	Your overlay's `project_hooks.py`	`t3_start` that runs Docker + Django + Angular
Middle	Framework	Framework integration (e.g., Django)	`wt_post_db` that runs `manage.py migrate`
Lowest	Default	Teatree core fallback	Usually a no-op or "not configured" message

The registry itself is simple — 45 lines of Python:

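from collections.abc import Callable

# Layers in ascending priority: a later entry overrides an earlier one.
_LAYERS = ("default", "framework", "project")
_LAYER_RANK = {layer: i for i, layer in enumerate(_LAYERS)}
_registry: dict[str, list[tuple[str, Callable]]] = {}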

def register(point: str, fn: Callable, layer: str = "default") -> None:
    entries = _registry.setdefault(point, [])
    entries[:] = [(lyr, func) for lyr, func in entries if lyr != layer]
    entries.append((layer, fn))
    entries.sort(key=lambda x: _LAYER_RANK[x[0]])

def get(point: str) -> Callable | None:
    entries = _registry.get(point)
    if not entries:
        return None
    return entries[-1][1]  # highest priority = last entry

def call(point: str, *args, **kwargs):
    fn = get(point)
    if fn is None:
        raise KeyError(f"No handler registered for extension point {point!r}")
    return fn(*args, **kwargs)

Registering a handler at the "project" layer automatically overrides anything at "framework" or "default". The framework layer is there so teatree can ship framework integrations (Django is the first) that work out of the box but can still be overridden by project-specific needs.

What an overlay looks like

A project overlay is a directory with this structure:

acme-overlay/
├── SKILL.md                    # Skill description + loading order
├── scripts/
│   └── lib/
│       ├── bootstrap.sh        # Shell wrappers (sourced after teatree)
│       ├── shell_helpers.sh    # Env loading, variant detection
│       └── project_hooks.py    # Extension point overrides
├── hook-config/
│   ├── context-match.yml       # Patterns that trigger this overlay
│   └── reference-injections.yml # References to load per lifecycle phase
└── references/
    ├── prerequisites-and-setup.md
    ├── troubleshooting.md
    └── playbooks/
        └── README.md

The project_hooks.py file registers your overrides:

from lib.registry import register

def register_acme():
    def wt_env_extra(envfile):
        with open(envfile, "a") as f:
            f.write("ACME_API_KEY=dev-key\n")

    def wt_db_import(db_name, variant, main_repo):
        # Import from your team's shared dump
        from lib.db import db_restore
        db_restore(db_name, f"{main_repo}/dumps/{variant}_latest.sql")
        return True

    def wt_run_backend(*args):
        import subprocess
        subprocess.run(["python", "manage.py", "runserver", "0.0.0.0:8000"],
                      check=False)

    register("wt_env_extra", wt_env_extra, "project")
    register("wt_db_import", wt_db_import, "project")
    register("wt_run_backend", wt_run_backend, "project")

The teatree core scripts call registry.call("wt_run_backend"), and your project handler runs instead of the default "not configured" stub. You only override what you need — everything else falls through to the framework or default layer.

There are 25 extension points

They cover the full lifecycle:

Category	Extension Points
Workspace setup	`wt_symlinks`, `wt_env_extra`, `wt_services`, `wt_detect_variant`
Database	`wt_db_import`, `wt_post_db`, `wt_restore_ci_db`, `wt_reset_passwords`
Dev servers	`wt_run_backend`, `wt_run_frontend`, `wt_build_frontend`, `wt_start_session`
Testing	`wt_run_tests`, `wt_trigger_e2e`, `wt_quality_check`
Delivery	`wt_create_mr`, `wt_monitor_pipeline`, `wt_send_review_request`, `wt_fetch_failed_tests`, `wt_fetch_ci_errors`
Ticket management	`ticket_check_deployed`, `ticket_update_external_tracker`, `ticket_get_mrs`
Follow-up	`followup_enrich_data`, `followup_enrich_dashboard`

The /t3-setup wizard can scaffold an overlay for you. Tell it your repos, your backend framework, and your database, and it generates the skeleton with commented-out examples for each relevant extension point. From there, fill in the blanks — or ask your AI agent to fill them in if it already knows your codebase (e.g., after working in the repos for a while).

The sourcing chain

Shell functions are loaded in order:

# In .zshrc:
source ~/.teatree                                     # load config
source "$T3_REPO/scripts/lib/bootstrap.sh"            # teatree core functions
source "$T3_OVERLAY/scripts/lib/bootstrap.sh"         # project overlay overrides

The overlay's bootstrap has a guard — it checks that teatree was sourced first (_T3_SCRIPTS_DIR must be set). This prevents confusing errors from running the overlay standalone.

Inside Python scripts, the pattern is similar:

import lib.init
lib.init.init()                 # registers defaults + auto-detects framework
from lib.project_hooks import register_project
register_project()              # registers project overrides at 'project' layer
from lib.registry import call as ext
ext("wt_post_db", project_dir)  # calls highest-priority handler

Auto-loading hooks

Skills don't help if the agent doesn't load them. I got tired of manually telling it which skill to read, so I added a hook that suggests the right skills automatically based on what you're doing.

The mechanism is ensure-skills-loaded.sh, a hook that runs before every message (in Claude Code, this is a UserPromptSubmit hook; other agent platforms would use their own equivalent). It does three things:

1. Project context detection

The hook scans all skill directories for hook-config/context-match.yml files. If any pattern in the file matches the current working directory or the active-repo tracker, that skill is identified as the project overlay. This is how teatree knows you're working in a specific project without you having to say so.

# hook-config/context-match.yml
cwd_patterns:
  - "acme-backend"
  - "acme-frontend"

If your $PWD contains acme-backend, the hook knows you're in the acme project and will suggest loading the ac-acme overlay alongside whatever lifecycle skill you need.
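The matching rule can be sketched in a few lines. This assumes plain substring matching, as the `$PWD` example suggests, and takes the patterns as already-parsed lists. The real hook is a shell script that also checks the active-repo tracker, and `find_overlay` is a name I made up.

```python
def find_overlay(cwd, overlays):
    """Return the first overlay whose cwd_patterns match the directory.

    `overlays` maps an overlay name to its already-parsed cwd_patterns;
    reading the YAML files is left out of this sketch.
    """
    for name, patterns in overlays.items():
        if any(pattern in cwd for pattern in patterns):
            return name
    return None


overlays = {"ac-acme": ["acme-backend", "acme-frontend"]}
print(find_overlay("/home/me/workspace/ac/1234/acme-backend", overlays))  # ac-acme
```

Substring matching is deliberately loose: a worktree path several levels deep still matches, but so would an unrelated directory whose name happens to contain the pattern.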

2. Intent detection

The hook parses the prompt to figure out which lifecycle phase you're in. It checks for:

URL patterns: a GitLab issue URL triggers t3-ticket, a Sentry URL triggers t3-debug
Keyword patterns: "implement" triggers t3-code, "push" triggers t3-ship, "broken" triggers t3-debug
End-of-session phrases: "done", "all set", or "that's it" trigger t3-retro (only if at least one other skill was loaded this session)
Bare imperative verbs: "Fix the login page" triggers t3-code

If nothing matches and you're in project context, it defaults to t3-code — because most prompts in a project directory are about coding.
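Here is a rough sketch of that decision order in Python. The regular expressions, the verb list, and the order in which the rules are tried are my own guesses for illustration; the actual hook is a shell script and its patterns are not reproduced here.

```python
import re

# Illustrative patterns only; the real hook's rules are more extensive.
URL_RULES = [
    (re.compile(r"https?://\S*gitlab\S*/issues/\d+"), "t3-ticket"),
    (re.compile(r"https?://\S*sentry\S*"), "t3-debug"),
]
KEYWORD_RULES = [
    (re.compile(r"\bimplement\b", re.I), "t3-code"),
    (re.compile(r"\bpush\b", re.I), "t3-ship"),
    (re.compile(r"\bbroken\b", re.I), "t3-debug"),
]
END_PHRASES = re.compile(r"^\s*(done|all set|that's it)\W*$", re.I)
IMPERATIVE = re.compile(r"^\s*(fix|add|update|remove|refactor)\b", re.I)


def detect_intent(prompt, in_project, loaded_skills):
    """Map a prompt to a lifecycle skill name, or None if nothing applies."""
    for pattern, skill in URL_RULES:
        if pattern.search(prompt):
            return skill
    # A retrospective only makes sense if the session did some work.
    if END_PHRASES.match(prompt) and loaded_skills:
        return "t3-retro"
    for pattern, skill in KEYWORD_RULES:
        if pattern.search(prompt):
            return skill
    if IMPERATIVE.match(prompt):
        return "t3-code"
    # Default: inside a project directory, assume the prompt is about coding.
    return "t3-code" if in_project else None


print(detect_intent("Fix the login page", in_project=False, loaded_skills=set()))  # t3-code
```

Trying URLs first reflects the idea that a pasted link is a stronger signal than any single word in the surrounding sentence.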

3. Dependency resolution and suggestion

Once the hook knows which skill you need, it:

Parses the skill's requires: frontmatter to find dependencies
Checks which skills are already loaded (tracked in a session file)
Builds a suggestion list of skills that need loading
Adds companion skills (e.g., ac-django for backend work in a Django project)
Adds reference file injections from reference-injections.yml

The output looks like:

LOAD THESE SKILLS NOW: /t3-workspace, /t3-code, /ac-acme.
ACME references to read: references/prerequisites-and-setup.md

The agent sees this as a system message and loads the skills before doing anything else. The wording is intentionally forceful ("LOAD THESE SKILLS NOW") — softer phrasing ("Consider loading...") gets ignored by models.
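The resolution step amounts to a small depth-first walk over the `requires:` graph. The sketch below assumes the frontmatter has already been parsed into a dict; the function names and the dependency of t3-code on t3-workspace are hypothetical.

```python
def resolve_skills(target, requires, loaded, companions=()):
    """Return skills to load, dependencies first, skipping loaded ones."""
    ordered, seen = [], set()

    def visit(skill):
        if skill in seen:  # also protects against dependency cycles
            return
        seen.add(skill)
        for dep in requires.get(skill, []):
            visit(dep)
        ordered.append(skill)

    visit(target)
    for companion in companions:
        visit(companion)
    return [skill for skill in ordered if skill not in loaded]


def format_suggestion(skills):
    """Render the forceful one-line instruction shown to the agent."""
    return "LOAD THESE SKILLS NOW: " + ", ".join(f"/{s}" for s in skills) + "."


requires = {"t3-code": ["t3-workspace"]}  # assumed dependency, for illustration
needed = resolve_skills("t3-code", requires, loaded=set(), companions=["ac-acme"])
print(format_suggestion(needed))
# LOAD THESE SKILLS NOW: /t3-workspace, /t3-code, /ac-acme.
```

Filtering against the loaded set happens last, so a dependency that is already loaded is skipped without hiding the skills that depend on it.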

Symlink health checks

The hook also runs a once-per-session health check on skills that you maintain (determined by an ownership config):

Verifies skill symlinks are actual symlinks (not stale copies)
Checks that the source is a real git repository (not a downloaded zip)
Validates that symlinks point into git repos (so retrospective commits work)

If anything is broken, it either auto-fixes (re-running the installer) or warns with a specific remediation.
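The three checks translate fairly directly to `pathlib`. This is a sketch under my own assumptions: the function name `check_skill_link` is made up, and it detects a git repository by looking for a `.git` entry in the target or one of its parents.

```python
from pathlib import Path


def check_skill_link(link):
    """Return a list of problems for one skill entry; empty means healthy."""
    link = Path(link)
    if not link.is_symlink():
        return [f"{link.name}: not a symlink (stale copy?)"]
    target = link.resolve()
    if not target.exists():
        return [f"{link.name}: broken symlink"]
    # Walk up from the target looking for a .git entry.
    in_repo = any((parent / ".git").exists() for parent in (target, *target.parents))
    if not in_repo:
        return [f"{link.name}: target is not inside a git repository"]
    return []
```

Returning messages rather than a boolean lets the caller choose between auto-fixing and warning with a specific remediation.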

The retrospective loop

After every non-trivial session, t3-retro runs a retrospective — a systematic audit of the conversation that produces concrete skill improvements and optionally contributes them upstream.

What the audit catches

The retrospective categorizes issues into specific types:

Category	What went wrong	Example
False completion	Claimed "done" without full verification	Said feature was complete but didn't run the test suite
Skill not loaded	A relevant skill existed but wasn't loaded	Worked in project context without the overlay
Playbook miss	A playbook covered the task but wasn't consulted	Didn't check the deployment playbook before pushing
Over-engineering	Did unnecessary work	Built a migration when admin config would have sufficed
Under-engineering	Missed required work	Updated the backend but forgot the frontend changes
Hook gap	Auto-loading should have triggered but didn't	Hook didn't detect intent from "fix the flaky test"
Stale guidance	Followed outdated instructions	Playbook referenced pre-refactoring patterns

For each issue, the retrospective determines the root cause and writes the fix directly into the skill system — a new guardrail, an updated playbook, a troubleshooting entry, a hook pattern.

Where improvements go

The retrospective respects a clear hierarchy:

Project overlay ($T3_OVERLAY) — receives project-specific improvements (troubleshooting, playbooks, guardrails). This is the default target when T3_CONTRIBUTE is false.
Core skills ($T3_REPO) — only modified when T3_CONTRIBUTE=true, and only for generic improvements (missing verification steps, hook gaps, stale core guidance)
Personal config (memory files, agent config like AGENTS.md) — for user preferences and environment-specific facts. Also serves as a fallback location when the overlay isn't maintained by the user.
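Read as a decision rule, the hierarchy could look like the sketch below. The three `kind` values and the function name are my own labels, and the real retrospective makes this call with far more context than three arguments.

```python
def choose_target(kind, contribute, overlay_is_mine):
    """Decide where a retrospective improvement should be written.

    kind: "project", "generic", or "preference" (labels invented here).
    contribute: the value of T3_CONTRIBUTE.
    overlay_is_mine: whether the user maintains the project overlay.
    """
    if kind == "preference":
        return "personal"
    if kind == "generic" and contribute:
        return "core"
    # Everything else goes to the overlay, with personal config as fallback.
    return "overlay" if overlay_is_mine else "personal"


print(choose_target("generic", contribute=False, overlay_is_mine=True))  # overlay
```

With T3_CONTRIBUTE off, even a generic improvement stays in the overlay, which keeps the core skills untouched by default.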

The contribution model

When you enable T3_CONTRIBUTE=true:

The retrospective creates a local commit on the current branch in your fork. It never pushes automatically.
A privacy scan checks for emails, home directory paths, API keys, internal hostnames, and any terms in $T3_BANNED_TERMS.
When you're ready, /t3-contribute reviews what will be pushed, checks for fork divergence, and optionally opens an issue on the upstream repo.

The idea is that every user's failures make the system better for all users — but only through an explicit, reviewed contribution path. Nothing happens without your consent. The default is T3_CONTRIBUTE=false, which means the retrospective only improves your project overlay and personal config.
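For a feel of what the privacy scan involves, here is a deliberately small sketch covering emails, home directory paths, and banned terms. The regular expressions are simplistic assumptions of mine, and API keys and internal hostnames are left out; the real scan is not reproduced here.

```python
import re

# Simplistic patterns for illustration; real-world matching needs more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HOME_PATH = re.compile(r"(?:/home/|/Users/)[^/\s]+")


def privacy_findings(text, banned_terms=()):
    """Return (line number, reason) pairs for anything that looks private."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if EMAIL.search(line):
            findings.append((lineno, "email"))
        if HOME_PATH.search(line):
            findings.append((lineno, "home path"))
        for term in banned_terms:
            if term and term.lower() in line.lower():
                findings.append((lineno, f"banned term: {term}"))
    return findings


sample = "ok line\ncontact bob@example.com\npath /Users/bob/secret"
print(privacy_findings(sample))  # [(2, 'email'), (3, 'home path')]
```

A scan like this is best treated as a tripwire before the human review, not a replacement for it.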

A concrete example

Suppose that during a session, the agent set up a multi-repo worktree and claimed it was ready, but the backend server had failed to start because of a port conflict with a previous worktree. The agent didn't verify that the infrastructure was actually running before declaring the setup complete.

The retrospective would:

Audit: Identify this as "false completion" — claimed infrastructure ready without verification evidence
Root cause: The t3-workspace script runs through all setup steps but has no way for projects to define and verify health checks before the agent declares the worktree usable
Fix (core): Add a new extension point wt_health_check to t3-workspace that projects can implement
Fix (overlay): Implement wt_health_check in the project's project_hooks.py to curl the backend, check the frontend dev server, verify the database is accessible
Verify: Check that the skill file parses, the extension point is registered correctly, and the overlay hook runs without errors
Commit: If T3_CONTRIBUTE=true, commit the core extension point to the fork's teatree core skills; overlay changes go to the project overlay repo

Next time the agent sets up a worktree, t3-workspace runs the project's health checks before finishing — the core provides the mechanism, the project overlay provides the specifics. Both are enforced going forward.
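A project-side `wt_health_check` could be as simple as probing the ports the worktree is supposed to serve. The sketch below is an assumption about what such a handler might look like: the signature, the `services` mapping, and the TCP probe are mine, and a real check would probably also hit an HTTP endpoint and the database.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wt_health_check(services):
    """Return the names of services that are not reachable.

    `services` maps a name to a (host, port) pair; an empty result
    means the worktree passed the check.
    """
    return [name for name, (host, port) in services.items()
            if not port_is_open(host, port)]
```

The overlay would register this at the "project" layer like any other handler, and the agent would only declare the worktree usable when the returned list is empty.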

It adds up

A single retrospective might add just one guardrail. Over many sessions they accumulate, and each one traces back to a specific failure that actually happened.

Companion skills

Teatree handles the lifecycle — ticket intake, worktree management, TDD, review, delivery. It doesn't know about your programming language's conventions or your framework's best practices. That's what companion skills are for.

Companion skills are standalone skills that live in separate repos and are loaded alongside teatree when relevant. I maintain a few (souliane/skills) covering Django and Python conventions, but the best companion skill for your stack is one you find (or build) yourself. I wrote a separate post about skill-driven development and the skills I'm open-sourcing.

The project overlay's hook-config/context-match.yml wires companion skills to repo patterns:

companion_skills:
  ac-django:
    - "acme-backend"
  ac-python:
    - "acme-backend"

When the hook detects you're working in acme-backend, it suggests loading ac-django and ac-python alongside the lifecycle skill. You get framework conventions without cluttering the core lifecycle skills with language-specific details.

This separation matters. Django conventions change on a different cadence than worktree management. Keeping them in separate skills means you can update one without touching the other, and teams using Flask or Express aren't burdened with Django-specific guidance.

Companion skills vs framework layer

These are different things. The framework layer is teatree's built-in middle priority in the 3-layer extension point registry — it ships stock implementations for common frameworks (e.g., a Django integration that auto-registers manage.py migrate as the post-DB hook). Companion skills are external standalone skills that teach the agent coding conventions — they don't register extension points, they provide guidelines. The framework layer handles infrastructure (how to run migrations); companion skills handle conventions (how to write good Django code).

Getting started

Prerequisites

An AI coding agent (the auto-loading hooks currently target Claude Code, but the skills and scripts work with any agent that can read files and run commands)
Python 3.12+
uv (Python package manager)

Installation

Teatree requires a local git clone — it has shared infrastructure (scripts/, references/, integrations/) that lives outside the individual skill directories, so npx skills add alone isn't enough.

Fork the repo on GitHub (or just clone it directly if you don't plan to contribute back), then:

git clone git@github.com:YOUR_USERNAME/teatree.git ~/workspace/teatree
cd ~/workspace/teatree
./scripts/install_skills.sh

The install script creates symlinks from your agent's skills directory to the clone. Then open your agent and run /t3-setup — it handles config, shell integration, hooks, and optionally scaffolds a project overlay for your repos.

If you want the retrospective loop to write improvements back into the core skill files, set T3_CONTRIBUTE=true in ~/.teatree (created by /t3-setup). This requires a fork: the agent commits to your clone of the fork, never to the upstream repo, and nothing is pushed automatically.

The setup wizard:

Checks prerequisites: verifies all required tools are installed and reports a summary table
Creates ~/.teatree: asks for workspace path, branch prefix, issue tracker, chat platform
Scaffolds a project overlay (optional): tell it your repos, framework, and database, and it generates the skeleton
Configures shell integration: adds sourcing lines to .zshrc or .bashrc
Installs skill symlinks: creates the symlink chain from the agent's skills directory to your clone
Configures hooks: sets up ensure-skills-loaded.sh and the statusline (Claude Code-specific; other agents would configure their own hooks)
Runs a smoke test: verifies hooks parse, statusline runs, Python imports work

After setup, restart your agent (or start a new conversation). Try: "start working on ticket PROJ-1234" — the hook should suggest /t3-ticket + /t3-workspace, and the agent will take it from there.

You can re-run /t3-setup at any time as a health check. It validates the existing installation, checks for broken symlinks, verifies hook wording, and reports what needs fixing.

The directory structure after setup

~/
├── .teatree                    # Config file (sourced by shell)
├── .local/share/teatree/       # Runtime data (ticket cache, dashboard, MR reminders)
├── .claude/                    # Claude Code example (adapt paths for your agent)
│   ├── CLAUDE.md               # Agent instructions (skill-loading block)
│   ├── settings.json           # Hooks, statusline
│   └── skills/
│       ├── t3-ticket -> ~/workspace/teatree/t3-ticket
│       ├── t3-code -> ~/workspace/teatree/t3-code
│       ├── ...
│       └── ac-acme -> ~/workspace/acme-overlay
└── workspace/
    ├── teatree/                # Teatree clone (or fork)
    ├── acme-overlay/           # Project overlay
    ├── acme-backend/           # Main repo clone
    ├── acme-frontend/          # Main repo clone
    └── ac/                     # Ticket worktrees
        ├── 1234/
        │   ├── acme-backend/   # Worktree
        │   ├── acme-frontend/  # Worktree
        │   └── .env.worktree   # Shared env
        └── 5678/
            └── ...

The symlinks ensure that skill files always resolve to the live git clone. This is important for the retrospective — when the agent writes improvements to skill files, the changes land in a real git repository where they can be committed and pushed.

When it helps (and when it doesn't)

It helps most with: structured, repeatable processes that span multiple repos or require project-specific knowledge. Ticket intake, worktree setup, TDD cycles, code review, MR creation, CI debugging. The kind of work that eats hours but follows a pattern.

It helps less with: one-off creative decisions, highly ambiguous tasks, or projects simple enough that a single repo with npm start covers everything. If your development workflow is "edit a file and push," teatree is overkill.

The sweet spot is when you have enough friction that encoding it pays off through repetition. The project works for my workflow but hasn't been tested beyond that. If something doesn't click for your setup, open an issue or a PR. Or point your AI agent at the problem and let it fix things until it works for you — that's kind of the point.

A note on security

Teatree skills are prompt instructions — they control what your AI agent does. That makes the supply chain a security surface. The defaults are conservative: self-improvement is off (T3_CONTRIBUTE=false), pushing is disabled (T3_PUSH=false), and there is no auto-update mechanism. You opt in to each level of automation explicitly. If you use a fork from someone else, you're trusting that person's skill files as agent instructions — review changes before pulling.

Why "teatree"?

TEA's Extensible Architecture for work*tree* management. Also, teatree oil cuts through grime, which felt fitting.