Building a Coffee Roaster with a Team of Agents

#ai #agents #programming #softwareengineering

In this series I want to share what it was like to build a real piece of
software with a team of AI agents, rather than typing the code myself or sitting
and prompting for it, and to be honest about where that worked, where it did
not, and what the job actually became. The thing we built is a coffee roaster
controller, which sounds like a
toy until you remember that a roaster is a hot drum with a heating element and a
fan, and that getting it wrong scorches a batch or worse. That is the whole
point: I wanted a build where being wrong was not free, because that is where
the interesting questions about delegating to agents live.

These posts are not tutorials. They are more of a field report: an overview of
how the pieces fit together, the decisions that mattered, and the receipts where
I have them. Each post will reference the previous one and signpost what comes next, so
you can follow the arc or dip into the one part you care about.

What we are building

The project is called RoastPilot. It drives a Hottop home coffee roaster
through a roast from charge (the moment the beans go in) to drop (the moment
they come out), watching the bean temperature and the rate of rise (RoR, how fast
the temperature is climbing), and listening for first crack (FC, the audible pop
that marks the start of the development phase). A roast is a curve, and a good
roast is a curve that gets the right shape at the right time.
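
To make "rate of rise" concrete, here is a minimal sketch of how it could be computed from timestamped bean-temperature samples. It is an illustration, not RoastPilot's actual code: the `Sample` type, the `rate_of_rise` function, the 30-second trailing window and the degrees-per-minute unit are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sample:
    t_s: float     # seconds since charge (hypothetical field name)
    bean_c: float  # bean temperature in degrees Celsius


def rate_of_rise(samples: Sequence[Sample], window_s: float = 30.0) -> Optional[float]:
    """Return RoR in degrees C per minute over a trailing window, or None.

    The 30 s default window is an assumption for this sketch.
    """
    if len(samples) < 2:
        return None
    latest = samples[-1]
    # Keep only samples inside the trailing window; the first is the oldest.
    in_window = [s for s in samples if latest.t_s - s.t_s <= window_s]
    oldest = in_window[0]
    dt = latest.t_s - oldest.t_s
    if dt <= 0:
        return None
    # Slope in degrees per second, scaled to degrees per minute.
    return (latest.bean_c - oldest.bean_c) / dt * 60.0


if __name__ == "__main__":
    curve = [Sample(0, 150.0), Sample(10, 152.0), Sample(20, 154.0), Sample(30, 156.0)]
    print(rate_of_rise(curve))
```

A wider window gives a smoother but slower-reacting number, and a narrower one does the opposite. That trade-off is one reason a single RoR value is only meaningful alongside the window it was computed over.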

The operator sees all of this live in a small web dashboard: the curve as it
draws, the current phase, the development time and ratio once first crack lands,
and the few controls that matter. It is deliberately a minimal interface. Heat
and fan are shown as read-outs, not dials, because on this machine those are the
controller's job, not a thing you want a human (or a model) nudging by hand
mid-roast.
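
For readers new to roasting, the "development time and ratio" read-outs can be sketched as below. This assumes the common definition, where development time is the time elapsed since first crack and the ratio is that time as a fraction of the whole roast so far. The function name and signature are hypothetical, not taken from RoastPilot.

```python
from typing import Tuple


def development(first_crack_s: float, now_s: float) -> Tuple[float, float]:
    """Return (development time in seconds, development ratio in 0..1).

    Both arguments are seconds since charge. The definition used here is an
    assumption of this sketch: ratio = time since first crack / total time.
    """
    if now_s <= 0 or first_crack_s < 0 or now_s < first_crack_s:
        raise ValueError("expected 0 <= first_crack_s <= now_s and now_s > 0")
    dev_time = now_s - first_crack_s
    return dev_time, dev_time / now_s


if __name__ == "__main__":
    # First crack at 8:00, currently 10:00 into the roast.
    print(development(first_crack_s=480.0, now_s=600.0))
```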

The live dashboard. The panel labelled LLM Advisory is the model's latest
recommendation; the decision history below it shows what the safety policy did
with each one. (Design prototype with mock data; the shipped verdicts read
ALLOW / CLAMP / REJECT.)

When a roast finishes the same data becomes a record you can read back: the whole
curve, the milestones, the headline numbers, and the full trace of every
recommendation the model made and what happened to it.

A finished roast. The decision trace at the bottom is the part I care most
about: every consult, the recommendation, and the verdict, kept for review.

That is the surface. The reason the series exists is underneath it, and it turns
on a word I will lean on throughout: harness. There are really two harnesses in
this story, and it is worth separating them now. The runtime harness is the
controller, advisor and safety system that runs a roast and keeps a model's
advice inside hard limits. The build harness is the way a team of agents was
steered to build that software in the first place. This post introduces both, and
the series moves between them. The next two sections take each in turn, starting
with how the work was run, then the shape of what it produced.

The idea: a PM steering a team of agents

I did not sit and write this code. I worked as the product manager (PM) and
architect of a small team of Claude Code agents, and my job was to own the plan,
the decisions, and the judgment calls, while the agents did the building. The
human stays in the loop for the things that do not reduce to a lookup, which in
practice is a surprisingly specific list.

The work was dispatched in three shapes, and choosing between them deliberately
turned out to be most of the craft:

A single sub-agent for a self-contained story, one owner from start to pull request.
An agent team when several pieces could genuinely be built in parallel, one teammate per page or surface, after a shared foundation was in place.
A dynamic workflow when the work was a repeatable pipeline over many items (build, review, verify), and I wanted the control flow to be deterministic rather than left to a model's discretion.

The reasons to reach for each, and the times fanning out was exactly the wrong
move, are a whole post later in the series. For now the headline is the seat
itself: with agents doing the typing, the bottleneck is no longer code
generation. It is judgment, memory, and verification. A good chunk of these
posts is really about those three.

                          You
            architect / domain owner / decision-maker
                       │  ▲
                 steer │  │ consult on the calls only you can make
                       ▼  │
                ┌───────────────────┐
                │    The PM seat    │
                │ plan · decisions  │
                │    · judgment     │
                └─────────┬─────────┘
                          │ picks the right shape per piece of work
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
   │   Single    │ │    Agent    │ │   Dynamic   │
   │  sub-agent  │ │    team     │ │  workflow   │
   │  one story  │ │  parallel,  │ │ build/review│
   │             │ │ per surface │ │ /verify pipe│
   └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
          └───────────────┼───────────────┘
                          ▼
            ┌────────────────────────────┐
            │  Shared truth: the plan    │
            │   repo + the agent code    │
            │  (no agent depends on      │
            │   another's chat history)  │
            └────────────────────────────┘

The operating model: you steer the PM seat, the PM seat picks the right shape
for each piece of work, and everything writes to the same shared truth so no
agent depends on another's chat history.

The high-level architecture: a deterministic controller, the LLM as advisor

The system has one invariant that everything else hangs off, and it is worth
stating plainly because it is the opposite of where a lot of agentic AI writing
points: the controller owns the loop, and the large language model (LLM) only
advises. The deterministic controller runs the roast on a fixed tick, reads
the sensors, and decides what the machine does. The LLM is consulted for a
recommendation, it returns typed data, and that recommendation is never wired
straight to the hardware. Every command, whether it comes from the model or from
the operator, passes through a safety policy first, and the policy can allow it,
clamp it into range, or reject it outright.
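
The allow / clamp / reject gate can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the shipped policy: the `Command` and `Decision` types, the 0 to 100 percent ranges, the `MAX_HEAT_STEP` limit and the rule for when to reject rather than clamp are all hypothetical. Only the three verdict names come from the dashboard described above.

```python
import math
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ALLOW = "ALLOW"
    CLAMP = "CLAMP"
    REJECT = "REJECT"


@dataclass(frozen=True)
class Command:
    source: str      # "llm" or "operator"; both pass through the same gate
    heat_pct: float
    fan_pct: float


@dataclass(frozen=True)
class Decision:
    verdict: Verdict
    command: Optional[Command]  # what may be sent on; None when rejected
    reason: str


# Hypothetical limits, chosen only for the example.
HEAT_RANGE = (0.0, 100.0)
FAN_RANGE = (0.0, 100.0)
MAX_HEAT_STEP = 20.0  # largest heat change accepted in one command


def _clamp(value: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, value))


def evaluate(cmd: Command, current_heat_pct: float) -> Decision:
    """Gate one command. Assumed rule: malformed or wild requests are
    rejected, slightly out-of-range ones are clamped, the rest pass."""
    if not all(math.isfinite(v) for v in (cmd.heat_pct, cmd.fan_pct)):
        return Decision(Verdict.REJECT, None, "non-finite value")
    if abs(cmd.heat_pct - current_heat_pct) > MAX_HEAT_STEP:
        return Decision(Verdict.REJECT, None, "heat step too large")
    heat = _clamp(cmd.heat_pct, *HEAT_RANGE)
    fan = _clamp(cmd.fan_pct, *FAN_RANGE)
    if (heat, fan) != (cmd.heat_pct, cmd.fan_pct):
        clamped = replace(cmd, heat_pct=heat, fan_pct=fan)
        return Decision(Verdict.CLAMP, clamped, "clamped into range")
    return Decision(Verdict.ALLOW, cmd, "within limits")


if __name__ == "__main__":
    print(evaluate(Command("llm", 110.0, 50.0), current_heat_pct=95.0).verdict)
```

The point of the shape is that the caller only ever forwards `Decision.command`, never the original request, so a rejected or clamped recommendation cannot reach the hardware by accident.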

So the model sits inside a hard box that it cannot reach past. That is not a lack
of ambition; it is the design: when being wrong means a ruined batch on a hot
drum, you want the unclever, predictable thing holding the levers and the clever
thing offering an opinion the box can veto. One of the sharper turns later in the
series is the roast that taught me to give the model less authority, not more,
and to move more of the control into the deterministic layer. I will not spoil it
here, but the shape above is the setup for that story.

The whole system on one page. The outer loop is the deterministic controller
that owns the machine; the inner loop is the LLM advice turn, and its output
comes back as a typed decision that the controller safety-gates before any
command reaches the roaster. The model is the only part that lives off the Pi.

Three rules hold that box shut, and they are what make this a harness rather than
just an app with an LLM in it. First, every write to the machine passes through
the safety policy, with no exception path, so neither the model nor the operator
can move heat or fan without the policy seeing it and allowing, clamping, or
rejecting it. Second, the advisor is never handed the tools to act: it only
ever returns typed advice, so even a confused or adversarial model has no route
of its own to the hardware. Third, a restart never auto-resumes heat or fan;
if the software comes back up in the middle of a roast it stops and asks the
operator rather than guessing, because resuming a hot drum on an assumption is
exactly the kind of confident wrong move the design exists to prevent.
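
The third rule is the easiest to show in code. Below is a minimal sketch of a startup path that never reapplies persisted heat or fan values. The `RoastSnapshot` type, the phase names and the `on_startup` function are hypothetical, invented for illustration rather than taken from RoastPilot.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    IDLE = "idle"
    AWAITING_OPERATOR = "awaiting_operator"


@dataclass(frozen=True)
class RoastSnapshot:
    phase: str       # hypothetical phase names, e.g. "drying", "development"
    heat_pct: float  # last commanded heat before the restart
    fan_pct: float   # last commanded fan before the restart


@dataclass(frozen=True)
class StartupState:
    mode: Mode
    prompt: Optional[str]  # message for the operator, if a decision is needed


def on_startup(persisted: Optional[RoastSnapshot]) -> StartupState:
    """Decide what to do after a restart.

    The persisted heat_pct and fan_pct are deliberately never reapplied here:
    a mid-roast restart always hands the decision to the operator.
    """
    if persisted is None or persisted.phase in ("idle", "done"):
        return StartupState(Mode.IDLE, prompt=None)
    return StartupState(
        Mode.AWAITING_OPERATOR,
        prompt=(
            f"Restarted during phase '{persisted.phase}'. "
            "Heat and fan were not resumed; confirm how to proceed."
        ),
    )


if __name__ == "__main__":
    print(on_startup(RoastSnapshot("development", 70.0, 40.0)).mode)
```

Because the function returns a mode and a prompt rather than output values, there is no code path in it that could resume the drum on an assumption.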

A few honest boundaries while we are here, because I would rather set them than
have you assume past them. This is an in-progress build, not a finished product.
The LLM is advisory only and never controls the hardware directly. I am not
going to quote determinism percentages or call anything fully autonomous, and I
will not call it production-ready before it has been validated end to end on the
real machine. Where the work is unproven, I will say so.

What the series covers

Here is the rough shape of what is coming, so you know where we are going:

How the team is actually wired on Claude Code: the roles, and which tool holds the plan.
How the work gets organised once you have a team: the plan repo as shared memory, and why the right way to split work is by which files collide, not by discipline.
What the build's own tooling caught that humans and reviews missed, including the test that was quietly hiding a failure rather than just skipping one.
What it cost to run a build this way, measured honestly.
The deepest technical story: choosing the roast advisor by replaying real roasts as a test set, then taking it onto the hot machine and finding out what the offline eval could not see.
Why any of this generalises beyond a safety-critical hobby project, with a second case study on ordinary product work.

I did not type this system into existence, and I barely prompted it either. The agents ran in loops, building, reviewing and checking each other, while I steered at the points where the decision was genuinely mine. The interesting part was never whether they could write the code; it was knowing which of their answers to trust, and where I still had to stand. The roaster is just where that gets tested. The next post gets concrete: the roles, the Claude Code building blocks they map to, and the one that ends up holding the plan when no single agent remembers the last conversation.