syamaner

Posted on Jun 7

One More Cup, Four Agents: Getting Warp to Roast My Coffee

#machinelearning #python #mcp #warp

We know what Warp is good at by now. Warp can be summarised as an Agentic Development Environment: an AI agent that plans, writes code, runs commands, and uses tools, in the terminal where the work already happens. It ships features, fixes CI, and wrangles infrastructure, while letting you use any model and any harness, your own way. The question nobody was asking: can it roast coffee?

It can, if you hand it the right MCP server. This post is about that server, the agents that built it, and the afternoon Warp drove a real drum roaster through two roasts.

Sunday, kitchen counter. A Hottop drum roaster, a USB microphone on a small stand pointed at it, and a laptop running Warp, connected to the roaster through an MCP server. For the past eight minutes I had been directing the agent through preheat ("set heat 100% and fan 10%") while the bean probe climbed past 180 °C.

Then I poured in the green beans and said nothing about it. No "beans in". No charge command. Automatic charge detection is itself a solved problem: Artisan has inferred CHARGE from the bean-probe drop for years. But there was no roast logger running here, just an MCP server keeping its own authoritative timeline while an AI agent drove the controls. The next time I asked Warp for the roast state, this was sitting in it:

{
  "kind": "beans_added",
  "payload": {
    "source": "auto_t0",
    "charge_temperature_c": 186.0,
    "detected_bean_temperature_c": 156.0,
    "drop_c": 30.0,
    "drop_threshold_c": 25.0
  }
}

The MCP server had watched the probe drop 30 °C as cold beans hit the hot drum and recorded the charge on its own, about five seconds after the pour. Nine minutes later, while I was leaning over the roaster listening for the first faint pops, another state read came back with something roast loggers do not do:

{
  "kind": "first_crack_detected",
  "payload": {
    "source": "first_crack_detector",
    "confidence": 0.9074,
    "confidence_threshold": 0.6,
    "positive_window_count": 5,
    "min_positive_windows": 5,
    "window_sequence_number": 337,
    "confirmed_by_window_sequence_number": 343
  }
}

The Audio Spectrogram Transformer I trained in Part 3, running as a quantized INT8 ONNX model against a £25 USB microphone, had heard first crack, accumulated five positive detection windows over eighteen seconds, and written the event into the roast timeline. There were no manual overrides in the entire roast. I made the roast-profile decisions (heat, fan, when to drop); the runtime detected every milestone.

This was the second roast of the day. The first, run earlier that afternoon with a stricter detector profile, had also caught first crack by audio.
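The automatic charge event from the opening comes down to a threshold check on the bean probe. Here is a minimal sketch of that idea, assuming the detector tracks the running peak during preheat and fires once a reading falls at least `drop_threshold_c` below it. The function and class names are hypothetical, the readings are illustrative, and the real server may well add debouncing or a time limit that this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ChargeEvent:
    charge_temperature_c: float         # running peak before the drop
    detected_bean_temperature_c: float  # reading that crossed the threshold
    drop_c: float
    drop_threshold_c: float


def detect_charge(
    bean_temps_c: Iterable[float], drop_threshold_c: float = 25.0
) -> Optional[ChargeEvent]:
    """Return a ChargeEvent once the probe falls far enough below its peak."""
    peak: Optional[float] = None
    for temp in bean_temps_c:
        if peak is None or temp > peak:
            peak = temp  # still preheating: track the highest reading so far
            continue
        drop = peak - temp
        if drop >= drop_threshold_c:
            return ChargeEvent(peak, temp, drop, drop_threshold_c)
    return None


# Illustrative readings only, shaped to match the payload shown earlier.
readings = [180.0, 184.0, 186.0, 171.0, 156.0]
event = detect_charge(readings)
print(event)
```

With these readings the sketch reports a 186.0 °C charge temperature and a 30.0 °C drop, the same shape as the `beans_added` payload.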

I want to be precise about who built this, because no single AI agent did. Four agents, with four different jobs, took the system from prototype to a verified production roast:

Agent	Job	Output
Warp (Oz)	Built the ML: dataset pipeline, training, tuning, ONNX export, Hugging Face publishing	coffee-first-crack-detection: 18 stories, 10 PRs, around 11k lines
Codex (GPT-5.5)	Built the production MCP server	coffee-roaster-mcp: 8 epics, 78 PRs, 394 tests
Warp (Gemini 3.5 Flash)	Operated the roaster live over MCP	Two supervised roasts, both first cracks audio-detected
Claude Code (Opus 4.8)	Supervised roast day through me: device discovery, pre-flight checks, the operator prompts Warp ran from, live log analysis, evidence collection	Validation reports, checksummed artifacts, and the final epic closed in coffee-roaster-mcp

A clarification on that table, because it is easy to misread: Warp appears twice, and the other two agents are not competitors I switched away to. Oz and Gemini 3.5 Flash are Warp's own agent running different models. Codex and Claude Code are independent harnesses, and I ran them inside Warp sessions as well. The whole project, building and roasting, happened in one environment, with the model and harness chosen per job. That is the "any model, any harness" part in practice.

Claude Code ran beside me in a second terminal for the whole roast day. It never touched the roaster; it prepared and verified everything around it, fed me the prompts I gave Warp, and afterwards turned the day's results into committed evidence and closed the open epic. I was the only human in the loop. During development I worked as the engineering lead. During the roast I worked as the human harness.

This post stands on its own, but it also concludes a longer series. It covers the part I had not written about yet: the MCP server rebuild, why one server replaced two, and what running a roast through an agent actually looks like.

The detector itself came out of a five-part build I documented separately, starting with Part 1: The Architecture. That work produced the model, but not a production way to use it. What had been using it in my kitchen since last November was a prototype: two MCP servers, an Auth0 layer, SSE transport, and an n8n agent orchestrating them. It worked, and it still works. It also made its own design problems clear.

Why One MCP Server Replaced Two

The prototype split responsibility along familiar service boundaries: one MCP server for roaster control, another for first-crack detection, an orchestrator above both.

The problem is that a roast is one timeline, and I had split its truth across processes.

The charge event lived in the control server. The first-crack event lived in the detection server. Development time, the most important number in roasting (seconds since first crack), required the orchestrating agent to fetch state from both servers, reconcile clocks, and compute the result in prompt-space on every state read. Every state change paid a synchronization cost:

Two clocks. Detection timestamps and control timestamps came from different processes, so correlating "first crack at T+9:01" depended on cross-process time agreement.
Two failure domains. A detection-server restart mid-roast orphaned the timeline while the beans kept roasting.
The agent as state machine. The LLM was the only place where the roast existed as a whole, which is a poor home for authoritative state.
Infrastructure without a purpose. Auth0 and SSE exist to let networked processes talk safely. Everything here ran on one machine next to a roaster.

The rebuild inverted this. coffee-roaster-mcp is one local stdio MCP server that owns the entire roast session: driver control, telemetry sampling, automatic T0 detection, the first-crack audio runtime, derived metrics (rate of rise, development time, DTR), and log export. One process, one monotonic clock, one append-only event timeline. The agent on top makes decisions. It is never the database.

The difference shows in the live roast logs. When first crack confirmed, the same process that recorded the event moved the session phase to development, started the development-time clock, and shut down its own audio capture. Three side effects, no synchronization, one timeline row.
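To make the "one clock, one timeline" point concrete, here is a small sketch of an append-only timeline that derives development time and DTR from its own events. `RoastTimeline` and its methods are hypothetical names, not the coffee-roaster-mcp API; the clock is injectable only so the example can replay the timings from the roast 1 transcript.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RoastTimeline:
    """Hypothetical single-process timeline: one clock, append-only events."""
    clock: Callable[[], float] = time.monotonic
    events: list = field(default_factory=list)

    def append(self, kind: str) -> None:
        # Every event is stamped by the same monotonic clock.
        self.events.append((self.clock(), kind))

    def _time_of(self, kind: str) -> Optional[float]:
        return next((t for t, k in self.events if k == kind), None)

    def metrics(self) -> dict:
        now = self.clock()
        t0 = self._time_of("beans_added")
        fc = self._time_of("first_crack_detected")
        elapsed = None if t0 is None else now - t0
        development = None if fc is None else now - fc
        dtr = None
        if elapsed and development is not None:
            # DTR: development time as a percentage of total roast time.
            dtr = round(100.0 * development / elapsed, 1)
        return {"elapsed_s": elapsed, "development_s": development, "dtr_percent": dtr}


# Replay with a fake clock: T0 at 0 s, first crack at 09:01, state read at 09:06.
fake_now = [0.0]
timeline = RoastTimeline(clock=lambda: fake_now[0])
timeline.append("beans_added")
fake_now[0] = 541.0
timeline.append("first_crack_detected")
fake_now[0] = 546.0
result = timeline.metrics()
print(result)
```

Five seconds of development over 546 seconds of roast rounds to 0.9% DTR. No second process is consulted and no clocks are reconciled, which is the whole point of the rebuild.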

How Codex Delivered It

The ML repository was Warp/Oz's project. For the MCP server rebuild I ran the same spec-driven method from Part 1 with a different agent: Codex, on GPT-5.5.

The numbers: eight epics decomposed into 78 GitHub issues, delivered one PR per story. The repository stands at 78 merged PRs across 87 commits, 394 tests at roughly 90% coverage, and four releases shipped to PyPI and the MCP Registry. It also contains 65 session summaries in docs/session-summaries/, one per working session, each written by the agent that did the work.

The rules of engagement live in an AGENTS.md file at the repository root. Two of them did most of the work:

Keep roaster hardware control conservative. Heat, fan, drop, cooling, and emergency stop behavior require explicit tests or manual validation notes.

The default roaster driver is mock. Default first-crack mode is disabled.

The mock-first constraint shaped everything. All 78 PRs ran hardware-free in CI. Real hardware was only reachable through a CLI that requires stating intent explicitly:

coffee-roaster-mcp hottop-validate \
  --config coffee-roaster-mcp.yaml \
  --i-understand-this-controls-hardware \
  --include-drop --include-emergency-stop

The irreversible steps, bean drop and emergency stop, are opt-in flags, and the command writes a JSON evidence file scoring every step. Neither an agent nor a tired human can drift into spinning a 240 °C drum by accident.
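A gate like this is straightforward to build with the standard library. The sketch below reuses the flag names from the command above, but everything else is an assumption for illustration: the parser name, the `plan_steps` helper, and the step names are hypothetical, and this is not the project's actual CLI code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # allow_abbrev=False: a hardware gate should not accept shortened flags.
    parser = argparse.ArgumentParser(prog="hottop-validate-sketch", allow_abbrev=False)
    parser.add_argument("--config", required=True)
    parser.add_argument("--i-understand-this-controls-hardware",
                        dest="acknowledged", action="store_true")
    parser.add_argument("--include-drop", action="store_true")
    parser.add_argument("--include-emergency-stop", action="store_true")
    return parser


def plan_steps(argv: list[str]) -> list[str]:
    """Return the validation steps to run, or raise if intent was not stated."""
    args = build_parser().parse_args(argv)
    if not args.acknowledged:
        raise PermissionError("refusing to touch hardware without explicit acknowledgement")
    steps = ["connect", "set_heat", "set_fan"]  # hypothetical reversible steps
    if args.include_drop:
        steps.append("drop")  # irreversible, so opt-in only
    if args.include_emergency_stop:
        steps.append("emergency_stop")  # irreversible, so opt-in only
    return steps


steps = plan_steps([
    "--config", "coffee-roaster-mcp.yaml",
    "--i-understand-this-controls-hardware",
    "--include-drop",
])
print(steps)
```

The design choice worth copying is that the default path does the least: forget a flag and the command either refuses to run or skips the irreversible step.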

The session summaries record more than what changed. They also record operating conditions. This is from the first hardware validation session:

Context window: 29% left (186K used / 258K) ... This was a high-context hardware validation story because it depended on the accumulated Epic 3 driver decisions from E3-S4 through E3-S8, prior Hottop review fixes, and live operator observations.

Sixty-five of these documents make the repository's history legible to any agent that picks it up later. Several did.

Roast Day: Warp at the Controls

The final epic story, E7-S6, required installing the published package the way an end user would, from the MCP Registry path, into a real MCP client, and running a full supervised roast on real hardware with the real microphone.

The client was Warp, running Gemini 3.5 Flash. The install used the published artifact:

{
  "RoastPilot": {
    "command": "uvx",
    "args": ["coffee-roaster-mcp==0.1.3", "serve"],
    "env": {
      "COFFEE_ROASTER_MCP_CONFIG": "/Users/.../roasts/coffee-roaster-mcp.yaml"
    }
  }
}

No dev checkout and no editable install. The same PyPI package anyone gets from the registry entry io.github.syamaner/coffee-roaster-mcp.

The rig, photographed during the first roast. The Warp session on screen is the transcript quoted below.

My role inverted completely. During development I directed agents as the engineering lead. During the roast I was the human harness, and I mean that precisely: the agent never acted on a timer or its own initiative. Every MCP call, including every state read, happened because I asked for it. I stood at the machine, pasted the operator prompt into Warp, and issued each instruction: "set heat to 60% and fan to 70%", "show me the roast state". Warp translated the instructions into tool calls, read the telemetry, and reported back in one-line status updates:

[08:55] 179.0°C / 229.0°C / +9.3°C/min / 100% / 30% / pending
FIRST CRACK DETECTED at 09:01 (9m 1s elapsed since T0)!
[09:06] 181.0°C / 230.0°C / +8.8°C/min / 100% / 30% / detected / 00:05 dev / 0.9% DTR

The same moment in the Warp client: each state read is a visible MCP tool call, and the detection announcement comes from a routine read, not a push.

The human-driven cadence is a design decision, not a limitation. For a first production validation on live heat I wanted no autonomous loop anywhere in the agent layer. The supervised human is the scheduler. The server is built for exactly this split: a background sampler logs telemetry every 5 seconds and feeds the detection runtime regardless of when anyone asks, so the roast record stays complete even while the harness is busy listening for cracks. The agent reads state when instructed. The runtime never stops watching.
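The split between "sampler always running" and "agent reads on request" can be sketched with a background thread. This is an illustration under assumptions, not the server's implementation: `BackgroundSampler` and `read_telemetry` are hypothetical names, and the demo uses a 10 ms interval in place of the real 5 seconds so it finishes quickly.

```python
import threading
import time
from typing import Callable


class BackgroundSampler:
    """Hypothetical sampler: records telemetry on its own schedule."""

    def __init__(self, read_telemetry: Callable[[], dict], interval_s: float = 5.0):
        self._read = read_telemetry
        self._interval_s = interval_s
        self._samples: list[tuple[float, dict]] = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            sample = (time.monotonic(), self._read())
            with self._lock:
                self._samples.append(sample)
            # Event.wait doubles as an interruptible sleep between samples.
            self._stop.wait(self._interval_s)

    def snapshot(self) -> list[tuple[float, dict]]:
        """What a state read would see: a copy of everything logged so far."""
        with self._lock:
            return list(self._samples)


# Demo with a fake probe; nobody calls snapshot() while the sampler runs.
sampler = BackgroundSampler(lambda: {"bean_c": 180.0}, interval_s=0.01)
sampler.start()
time.sleep(0.1)
sampler.stop()
samples = sampler.snapshot()
print(len(samples), "samples logged without a single state read")
```

The state read is just a copy of what the sampler already recorded, so a slow or distracted operator costs nothing in the roast record.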

We ran two roasts.

Roast 1 proved the live path: manual charge marking, default detector profile, first crack audio-detected at +09:01 with confidence 0.9066 against a 0.9 threshold. Drop at 198 °C, 15.0% DTR.

Roast 2 ran the production profile and is the run described in the opening: automatic T0 from the bean-temperature drop, and the sliding-window detector (10-second windows, 0.7 overlap, confidence threshold 0.6, five positive windows required) confirming first crack at +08:56. First positive at window 337, confirmation at window 343, about 18 seconds of accumulating evidence. No mark_beans_added, no mark_first_crack, no overrides. Across both roasts: more than 26,000 serial status packets, zero read errors, zero command-loop errors, zero faults.
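The window arithmetic is worth spelling out: 10-second windows with 0.7 overlap advance by about 3 seconds, so windows 337 to 343 span six hops, or about 18 seconds. The sketch below pairs that arithmetic with one possible confirmation rule, counting positive windows inside a short lookback. The rule, the `lookback` parameter, the class name, and the per-window confidences are all assumptions for illustration; only the window size, overlap, threshold, required count, and window numbers come from the roast.

```python
from collections import deque

WINDOW_S = 10.0
OVERLAP = 0.7
HOP_S = WINDOW_S * (1.0 - OVERLAP)  # a new window starts roughly every 3 s


class FirstCrackConfirmer:
    """Hypothetical rule: confirm once enough recent windows are positive."""

    def __init__(self, threshold: float = 0.6, min_positive: int = 5, lookback: int = 10):
        self.threshold = threshold
        self.min_positive = min_positive
        self._recent = deque(maxlen=lookback)  # (sequence_number, is_positive)

    def update(self, sequence_number: int, confidence: float):
        """Feed one window; return (first_positive, confirmed_by) or None."""
        self._recent.append((sequence_number, confidence >= self.threshold))
        positives = [seq for seq, is_pos in self._recent if is_pos]
        if len(positives) >= self.min_positive:
            return positives[0], sequence_number
        return None


# Invented confidences: five of these seven windows clear the 0.6 threshold.
scores = {337: 0.72, 338: 0.65, 339: 0.41, 340: 0.81, 341: 0.77, 342: 0.55, 343: 0.91}
confirmer = FirstCrackConfirmer()
result = None
for seq, score in scores.items():
    result = confirmer.update(seq, score)
    if result:
        break

first, confirmed = result
evidence_s = round((confirmed - first) * HOP_S, 1)
print(first, confirmed, evidence_s)
```

Requiring several positive windows trades a few seconds of latency for resistance to a single loud clatter in the kitchen, which is a reasonable trade when the event starts the development clock.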

On cost: roast 1 finished at 76.9 Warp credits, roast 2 at 82. Around 160 credits in total for two fully agent-operated, evidence-logged roasts, with individual turns (a state read, a heat change) costing 1.7 to 4.4 credits. Gemini 3.5 Flash never improvised a hardware command. The operator prompt forbade it, and the transcript shows it complied.

Three things went wrong, all in instructive ways:

"Preheat" heated nothing. My first operator prompt said "set heat and fan to the percentages I give you", and I said "preheat" with no percentages. The server, correctly, streamed safe-zero packets while the drum sat cold. This is not a bug. The server never implies heat. The prompt got default preheat values, and the safety property got documented.
The export raced the session end. Roast 2's summary.json snapshot was taken during cooling, so it froze at phase: cooling. The append-only roast.jsonl had the full timeline regardless, and the docs state the discrepancy rather than hiding it.
The registry metadata had a real bug, found only because we installed the real package: server.json lacked packageArguments, so a purely registry-driven launch would run uvx coffee-roaster-mcp without the required serve subcommand and exit at argument parsing. Found on roast day, fixed the same day.

The last one is the case for end-to-end validation in compressed form. The bug lived in the gap between "all tests pass" and "a stranger installs this".
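The failure mode is easy to reproduce in miniature. The sketch below builds a launch command purely from package metadata; the dictionary shape is a simplified assumption rather than the registry's actual schema, and `launch_command` is a hypothetical helper. Only the `packageArguments` field name, the package name, and the `serve` subcommand come from the bug described above.

```python
def launch_command(package: dict) -> list[str]:
    """Build the argv a registry-driven client would run for a uvx package."""
    argv = ["uvx", package["identifier"]]
    # Only arguments declared in the metadata reach the command line.
    for arg in package.get("packageArguments", []):
        argv.append(arg["value"])
    return argv


# Simplified, assumed metadata shapes: before and after the fix.
broken = {"identifier": "coffee-roaster-mcp"}
fixed = {
    "identifier": "coffee-roaster-mcp",
    "packageArguments": [{"type": "positional", "value": "serve"}],
}

broken_argv = launch_command(broken)
fixed_argv = launch_command(fixed)
print(broken_argv)
print(fixed_argv)
```

A hand-written client config that spells out `serve` hides the problem, which is why no unit test caught it.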

What Production-Ready Means Here

It would have been easy to write "validated" in the README and move on. Instead, roast day produced a paper trail: the validation report, per-roast summaries with full timelines, Warp transcript excerpts with the raw MCP tool payloads, the guarded-validation JSON (8 of 8 steps including drop and emergency stop, run against the published package), and both sessions' complete roast logs. Every artifact is SHA-256 checksummed and committed.

This is the same evidence discipline the agents were held to during development, applied to the final claim. A repository that calls itself production-ready should be able to show the production.
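Checksumming the artifacts needs nothing beyond the standard library. A minimal sketch, assuming a flat manifest in the two-space format that `sha256sum -c` reads; the `SHA256SUMS` file name and the helper names are hypothetical, and the repository may organise its checksums differently.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large roast logs never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(artifact_dir: Path, manifest_name: str = "SHA256SUMS") -> Path:
    """Write '<hex digest>  <relative path>' lines for every file in the directory."""
    manifest = artifact_dir / manifest_name
    lines = [
        f"{sha256_of(p)}  {p.relative_to(artifact_dir).as_posix()}"
        for p in sorted(artifact_dir.rglob("*"))
        if p.is_file() and p.name != manifest_name
    ]
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


# Demo against a throwaway directory holding one fake artifact.
payload = b'{"kind": "beans_added"}\n'
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "roast.jsonl").write_bytes(payload)
    manifest_text = write_manifest(root).read_text(encoding="utf-8")

print(manifest_text)
```

Committing the manifest alongside the artifacts means any later edit to a roast log shows up as a checksum mismatch rather than a silent change.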

Who Did What

The division of labour, end to end:

Warp/Oz built the model, the dataset, the training science, the edge deployment, and the Hugging Face publishing pipeline (Parts 1 to 5).
Codex built the production MCP server: the one-process architecture, the Hottop driver, the detection runtime, the metrics, the release machinery, and 65 session summaries documenting the work.
Warp with Gemini 3.5 Flash operated the roaster live, for about 160 credits.
Claude Code ran roast-day operations: found the serial port and microphone, pre-warmed the model cache, executed the guarded hardware validation, wrote the operator prompts Warp ran from, watched the logs in real time, compiled the evidence, and closed E7-S6 and Epic 7 in the repository.
I designed the architecture, wrote the specs, supervised the hardware, made the roast decisions, and drank the results.

The agents were interchangeable in one specific sense: every one of them worked from the same written specs, the same repository state documents, and the same rules files. Moving between them cost no momentum, because the context an agent needs was never trapped in a chat history. It lives in the repository, and whichever harness opens it next picks up mid-stream. Roast day showed the compressed version: two harnesses working side by side in Warp the same afternoon, one driving the roaster, the other compiling the evidence, both reading the same state files. That is the main lesson of this series. The workflow (spec first, state files as memory, evidence as the definition of done) survived contact with four agents running models from different vendors. The agents executed it.

What Comes Next

The roasts in this post were human-paced on purpose: I was the scheduler, and the agent acted only when asked. The next milestone inverts that, carefully. roastpilot-agent is a deterministic harness now in active development that drives the same coffee-roaster-mcp server with the roles pinned down in code rather than in prompts: a typed state machine owns a 1-second control loop (a cadence set by the thermocouples' response time, not by the LLM), a hard safety policy validates every command with typed verdicts (allow, clamp, reject, recovery, emergency stop), and the LLM is advisory only. It recommends heat and fan targets with a rationale; it never calls a tool. A restart never auto-resumes heat.
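As a rough picture of what "typed verdicts" could look like, here is a sketch of a policy function that checks one advisory recommendation before anything is sent to the roaster. The verdict names follow the list above; the limits, the rule order, and every identifier are hypothetical, and roastpilot-agent's real policy is not shown here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ALLOW = "allow"
    CLAMP = "clamp"
    REJECT = "reject"
    RECOVERY = "recovery"
    EMERGENCY_STOP = "emergency_stop"


@dataclass(frozen=True)
class Decision:
    verdict: Verdict
    heat_percent: Optional[int]  # None when no heat/fan command is issued
    fan_percent: Optional[int]
    reason: str


# Hypothetical limits, chosen only to make the sketch concrete.
MAX_BEAN_TEMP_C = 240.0
MIN_FAN_WHEN_HEATING = 10


def evaluate(heat_percent: int, fan_percent: int, bean_temp_c: float,
             just_restarted: bool) -> Decision:
    """Check one advisory recommendation before anything reaches the roaster."""
    if bean_temp_c >= MAX_BEAN_TEMP_C:
        return Decision(Verdict.EMERGENCY_STOP, None, None, "bean temperature over limit")
    if not (0 <= heat_percent <= 100 and 0 <= fan_percent <= 100):
        return Decision(Verdict.REJECT, None, None, "target outside 0-100%")
    if just_restarted and heat_percent > 0:
        # A restart never auto-resumes heat: force heat to zero, keep the fan.
        return Decision(Verdict.RECOVERY, 0, fan_percent, "heat held at zero after restart")
    if heat_percent > 0 and fan_percent < MIN_FAN_WHEN_HEATING:
        return Decision(Verdict.CLAMP, heat_percent, MIN_FAN_WHEN_HEATING,
                        "fan raised to minimum while heating")
    return Decision(Verdict.ALLOW, heat_percent, fan_percent, "within policy")


decision = evaluate(heat_percent=100, fan_percent=0, bean_temp_c=180.0, just_restarted=False)
print(decision)
```

Because the function returns a value instead of calling a tool, the LLM's recommendation and the command that is finally issued stay two separate, inspectable things.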

The build runs on the same method as everything above: a plan repository with decision records and cross-repo epics, mock-safe vertical slices before any hardware, and the MCP contract pinned by fixtures captured from the real server validated in this post. A cloud component for sharing roasts and collecting tasting feedback (roastpilot-cloud) follows after the harness, with one rule already fixed: the cloud never controls the roaster and is never required for an active roast.

The supervised manual roast you just read about is the baseline that work gets measured against.