<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: syamaner</title>
    <description>The latest articles on DEV Community by syamaner (@syamaner).</description>
    <link>https://dev.to/syamaner</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F851470%2F28232910-3c25-4488-ac27-a360494dcfc8.jpeg</url>
      <title>DEV Community: syamaner</title>
      <link>https://dev.to/syamaner</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/syamaner"/>
    <language>en</language>
    <item>
      <title>Part 1: The Architecture &amp; The Agent - Spec-Driven ML Development With Warp/Oz</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Tue, 14 Apr 2026 07:55:00 +0000</pubDate>
      <link>https://dev.to/syamaner/part-1-the-architecture-the-agent-spec-driven-ml-development-with-warpoz-3al6</link>
      <guid>https://dev.to/syamaner/part-1-the-architecture-the-agent-spec-driven-ml-development-with-warpoz-3al6</guid>
      <description>&lt;p&gt;Last year I built a prototype coffee first crack detector and wrote about it in a 3-part series. The prototype works - I have been running it on my own roasts since November - but it carries the technical debt of something built to prove a concept rather than to last.&lt;/p&gt;

&lt;p&gt;This series is the production rebuild. The outcome: an Audio Spectrogram Transformer at &lt;strong&gt;97.4% accuracy and 100% precision&lt;/strong&gt; on first crack detection, running on a Raspberry Pi 5 at 2.09 seconds per 10-second window. The full pipeline - data preparation, training, evaluation, ONNX INT8 export, edge validation, and a Gradio UI - shipped in two evenings.&lt;/p&gt;
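&lt;p&gt;As a quick sanity check on that headline latency, 2.09 seconds of compute per 10-second window works out to roughly a 0.21 real-time factor - comfortable headroom for live monitoring. A minimal sketch of the arithmetic:&lt;/p&gt;

```python
# Back-of-the-envelope check on the reported edge latency:
# 2.09 s of compute per 10 s audio window.
window_s = 10.0    # length of each inference window (seconds)
latency_s = 2.09   # reported RPi5 INT8 latency per window (seconds)

real_time_factor = latency_s / window_s  # fraction of real time spent computing
headroom_s = window_s - latency_s        # idle seconds per window

print(round(real_time_factor, 3), round(headroom_s, 2))
```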

&lt;p&gt;I didn't build this by brute-forcing the codebase myself. I acted strictly as the engineering lead, while Warp and its AI agent, Oz, handled the implementation from inside my terminal. My responsibilities were entirely architectural:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Designing the workflow:&lt;/strong&gt; Setting the strict rules of engagement between the agent and the codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Defining the science:&lt;/strong&gt; Dictating the specs, testing strategy, evaluation metrics, and dataset annotation approach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Directing the execution:&lt;/strong&gt; Guiding the agent through the implementation and reviewing the output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Operating this way over the weekend, Warp/Oz executed an 18-story (at the time) epic across 10 pull requests. That resulted in 11,087 lines of Python across 75 files, with 52 of those commits explicitly co-authored by the agent. Copilot reviewed every PR, flagging 111 individual issues across 28 review batches. The model is &lt;a href="https://huggingface.co/syamaner/coffee-first-crack-detection" rel="noopener noreferrer"&gt;published on Hugging Face&lt;/a&gt;, the dataset is &lt;a href="https://huggingface.co/datasets/syamaner/coffee-first-crack-audio" rel="noopener noreferrer"&gt;open-sourced&lt;/a&gt;, and the source is on &lt;a href="https://github.com/syamaner/coffee-first-crack-detection" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This post is about the system that made that possible - not the model itself. The ML science comes in Posts 2 and 3. Here, I want to show the exact architecture I used to direct an AI agent through a complex, multi-phase ML project without losing control of the engineering decisions that matter.&lt;/p&gt;

&lt;p&gt;Before the agent could train anything, I had to build the training data from scratch. There is no public audio dataset for coffee roasting first crack - not on Hugging Face, not on Kaggle, not in academic literature. That meant recording roasting sessions, annotating them in Label Studio, and architecting a recording-level data pipeline to prevent the chunk-level leakage that silently inflates test metrics in time-series audio ML. The full data engineering story is in Post 2 (coming soon).&lt;/p&gt;
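&lt;p&gt;To make the leakage point concrete, here is a minimal, hypothetical sketch of a recording-level split: every chunk from a given recording lands in exactly one split, so overlapping windows from the same roast can never straddle train and test. Function and field names are illustrative, not the project's actual code:&lt;/p&gt;

```python
import random
from collections import defaultdict

def recording_level_split(chunks, test_fraction=0.2, seed=42):
    """chunks: iterable of (recording_id, chunk) pairs. Returns (train, test) chunk lists."""
    by_recording = defaultdict(list)
    for recording_id, chunk in chunks:
        by_recording[recording_id].append(chunk)

    recordings = sorted(by_recording)        # deterministic base order
    random.Random(seed).shuffle(recordings)  # seeded shuffle for reproducibility

    n_test = max(1, round(len(recordings) * test_fraction))

    # Assign whole recordings, never individual chunks, to each split
    train = [c for rid in recordings[n_test:] for c in by_recording[rid]]
    test = [c for rid in recordings[:n_test] for c in by_recording[rid]]
    return train, test
```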

&lt;h2&gt;
  
  
  From Prototype to Production
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://dev.to/syamaner/part-1-training-a-neural-network-to-detect-coffee-first-crack-from-audio-an-agentic-development-1jei"&gt;prototype&lt;/a&gt; had accumulated real technical debt. The code was monolithic, the model had no reusable packaging, the MCP server architecture had flaws I had been working around, and nothing ran on edge hardware. I had to use my laptop for every roast.&lt;/p&gt;

&lt;p&gt;This series covers the production rebuild. Same domain, completely new architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A standalone, Hugging Face-native training repository.&lt;/li&gt;
&lt;li&gt;Strict data engineering to prevent audio leakage.&lt;/li&gt;
&lt;li&gt;ONNX INT8 quantization for Raspberry Pi 5 edge deployment.&lt;/li&gt;
&lt;li&gt;A live Gradio Space for public inference.&lt;/li&gt;
&lt;/ul&gt;
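&lt;p&gt;For intuition on the INT8 step: quantization maps each float weight onto an integer in a small fixed range via a scale factor, trading a little precision for a quarter of the FP32 memory. A toy, pure-Python illustration of the idea - not the actual ONNX Runtime implementation:&lt;/p&gt;

```python
def quantize_int8(values):
    """Symmetric INT8 quantization: floats to integers in [-127, 127] plus a scale."""
    m = max(abs(v) for v in values)
    scale = m / 127.0 if m else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [x * scale for x in q]

weights = [0.02, -0.5, 0.73, -0.01]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)  # close to the originals, at 1 byte per weight
```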

&lt;h2&gt;
  
  
  The Director/Coder Dynamic
&lt;/h2&gt;

&lt;p&gt;The core pattern was a strict, enforced separation of concerns between three actors:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I (the human) owned:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Architecture:&lt;/strong&gt; Defining repository structure, module boundaries, and enforcing Hugging Face's &lt;code&gt;save_pretrained&lt;/code&gt;/&lt;code&gt;from_pretrained&lt;/code&gt; as the standard packaging contract.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The ML Science:&lt;/strong&gt; Model selection (AST over CNN), data split strategy (recording-level to prevent leakage), class weighting, and hyperparameter math.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Workflow Constraints:&lt;/strong&gt; Defining the project rules, writing the parameterised skills, and managing the state of the epic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Quality Gates:&lt;/strong&gt; Reviewing every PR, interpreting the evaluation metrics, and deciding when to retrain versus when to ship.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Oz (Warp's terminal-native agent) owned:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Terminal Execution:&lt;/strong&gt; Running training loops, evaluations, ONNX exports, and SSH sessions directly on the Raspberry Pi.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Generation:&lt;/strong&gt; Writing the boilerplate - &lt;code&gt;WeightedLossTrainer&lt;/code&gt; subclasses, CLI argument parsers, pytest scaffolds, and audio data loaders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill Invocation:&lt;/strong&gt; Executing parameterised skill files (e.g., &lt;code&gt;.claude/skills/train-model/SKILL.md&lt;/code&gt;) that encoded exact command sequences and validation checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State Management:&lt;/strong&gt; Reading the epic document, updating context, and checking off stories after completing a phase.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitHub Copilot owned:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Async Code Review:&lt;/strong&gt; Flagging type safety issues, API misuse, missing error handling, and dependency hygiene across all 10 PRs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Reality Check:&lt;/strong&gt; Copilot never once caught a machine learning logic error. Every data leakage fix, hyperparameter correction, and precision/recall tradeoff decision came from me. &lt;em&gt;Copilot acts as an aggressive linter for code, not a reviewer for ML science.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This three-way split wasn't a gentleman's agreement - it was hardcoded into the project via an &lt;code&gt;AGENTS.md&lt;/code&gt; file. Whenever Oz started a task, it was forced to read this rulebook first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agentic Setup: AGENTS.md, Epics, and Skills
&lt;/h2&gt;

&lt;p&gt;Three files controlled the entire project.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;code&gt;AGENTS.md&lt;/code&gt; - The Rulebook
&lt;/h3&gt;

&lt;p&gt;This file sits at the repository root. The agent is instructed to read it before starting any task. It contains the project rules, quick commands, codebase architecture, and platform-specific constraints. Here is the exact rules section from this project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Rules&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Python 3.11+ with full type hints on all public functions and methods
&lt;span class="p"&gt;-&lt;/span&gt; Google-style docstrings
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`ruff check`&lt;/span&gt; and &lt;span class="sb"&gt;`ruff format`&lt;/span&gt; must pass before marking code complete
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`pyright`&lt;/span&gt; must pass with no errors on new code
&lt;span class="p"&gt;-&lt;/span&gt; All dependencies declared in &lt;span class="sb"&gt;`pyproject.toml`&lt;/span&gt; - never install ad-hoc
&lt;span class="p"&gt;-&lt;/span&gt; Large files (WAV, checkpoints, ONNX models) go to Hugging Face Hub - never commit to git
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`data/`&lt;/span&gt;, &lt;span class="sb"&gt;`experiments/`&lt;/span&gt;, and &lt;span class="sb"&gt;`exports/`&lt;/span&gt; are &lt;span class="sb"&gt;`.gitignore`&lt;/span&gt;'d - keep them that way
&lt;span class="p"&gt;-&lt;/span&gt; Seed all RNG using &lt;span class="sb"&gt;`configs/default.yaml`&lt;/span&gt; seed value
&lt;span class="p"&gt;-&lt;/span&gt; One PR per story, branch: &lt;span class="sb"&gt;`feature/{issue-number}-{slug}`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Before starting a task: read &lt;span class="sb"&gt;`docs/state/registry.md`&lt;/span&gt; → open epic file → check GitHub issue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That last line is the critical one. It forces the agent into a state-reading loop before writing any code. Without it, the agent starts generating based on stale context.&lt;/p&gt;

&lt;p&gt;The file also includes a codebase architecture map, quick commands for every operation (training, evaluation, export, benchmarking), and platform-specific notes for MPS, CUDA, and the RPi5. The &lt;a href="https://github.com/syamaner/coffee-first-crack-detection/blob/main/AGENTS.md" rel="noopener noreferrer"&gt;full file is on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Epic State Management - The Checklist
&lt;/h3&gt;

&lt;p&gt;A registry file (&lt;code&gt;docs/state/registry.md&lt;/code&gt;) points to the active epic. The epic file itself (&lt;code&gt;docs/state/epics/coffee-first-crack-detection.md&lt;/code&gt;) contains 18 stories grouped into 6 phases, each linked to a GitHub issue. Before and after every task, the agent reads the epic state and updates it according to this protocol:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before starting any task:
1. Read docs/state/registry.md to find the active epic
2. Open the epic file - check story status
3. Open the GitHub story issue - read comments for latest requirements
4. Work on a branch: feature/{issue-number}-{slug}

After completing a story:
1. Check off the story in the epic doc
2. Update Active Context section with what was built
3. Comment on the GitHub story issue, then close it
4. Tick the checkbox in GitHub epic issue #1
5. Open a PR referencing the story issue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is how 18 stories were delivered without losing track of what was done, what was next, or what had changed. The agent maintained its own project state.&lt;/p&gt;
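&lt;p&gt;The registry lookup itself is trivial - which is the point. A hypothetical sketch of that first state-reading step (the registry format here is invented for illustration):&lt;/p&gt;

```python
def active_epic(registry_text):
    """Return the path of the active epic declared in a registry file (invented format)."""
    for line in registry_text.splitlines():
        if line.lower().startswith("active epic:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no active epic declared in registry")

registry = "# Registry\nActive epic: docs/state/epics/coffee-first-crack-detection.md\n"
print(active_epic(registry))  # prints the epic path the agent should open next
```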

&lt;p&gt;Here is Oz running the full data preparation pipeline - chunking 973 audio segments, performing the recording-level split, and then invoking the &lt;code&gt;/train-model&lt;/code&gt; skill:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zv6zedgjlmthxwiwx6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zv6zedgjlmthxwiwx6a.png" alt="Oz Train model skill invocation" width="625" height="713"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Parameterised Skills - The Playbooks
&lt;/h3&gt;

&lt;p&gt;Skills are markdown files under &lt;code&gt;.claude/skills/&lt;/code&gt; that encode exact command sequences for common operations. Each skill defines the prerequisites, the commands, and the validation steps. I wrote four:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;train-model/SKILL.md&lt;/code&gt; - End-to-end training with data validation and checkpoint saving.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evaluate-model/SKILL.md&lt;/code&gt; - Test-set evaluation with metrics report generation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;export-onnx/SKILL.md&lt;/code&gt; - ONNX export (FP32 + INT8) with size and latency benchmarking.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;push-to-hub/SKILL.md&lt;/code&gt; - Publish model and dataset to the Hugging Face Hub.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I told Oz to "train the model," it didn't improvise. It read the skill file and followed the exact sequence I defined. This eliminated an entire class of errors where the agent guesses at flags, skips validation steps, or forgets to save the feature extractor configuration alongside the model weights.&lt;/p&gt;
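&lt;p&gt;Mechanically, a skill reduces to an ordered list of commands plus validation checks, executed strictly in sequence with no improvisation. A hedged sketch of that contract - the commands and structure here are illustrative, not the project's actual skill files:&lt;/p&gt;

```python
import subprocess

def run_skill(steps, dry_run=True):
    """Execute a skill: each step is a dict with 'cmd' and an optional 'check' callable."""
    executed = []
    for step in steps:
        if not dry_run:
            # fail fast: a non-zero exit aborts the whole skill
            subprocess.run(step["cmd"], shell=True, check=True)
        check = step.get("check")
        if check is not None and not check():
            raise RuntimeError("validation failed after: " + step["cmd"])
        executed.append(step["cmd"])
    return executed

# Illustrative step sequence, loosely modelled on a train-model skill
train_skill = [
    {"cmd": "python -m src.train --config configs/default.yaml"},
    {"cmd": "python -m src.evaluate --split val"},
]
print(run_skill(train_skill))  # dry run: lists the sequence without executing it
```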

&lt;p&gt;Here is Oz chaining the &lt;code&gt;/export-onnx&lt;/code&gt; and &lt;code&gt;/push-to-hub&lt;/code&gt; skills to export the model and publish everything to Hugging Face Hub in a single sequence:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7jyz7muoaz142yj4htf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7jyz7muoaz142yj4htf.png" alt="Oz Export ONNX and Push to HF Hub Skill" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  A Generalised AGENTS.md Template
&lt;/h3&gt;

&lt;p&gt;Here is a stripped-down version you can drop into any project. Replace the placeholders with your domain-specific rules.&lt;/p&gt;

&lt;p&gt;This file is not documentation for humans. It is a &lt;strong&gt;system prompt for your codebase&lt;/strong&gt;. Every rule you omit is a decision the agent will make on its own - and it will make it differently every time.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# AGENTS.md - [Project Name]&lt;/span&gt;

Project rules and context for AI coding agents.

&lt;span class="gu"&gt;## Rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [Language] [version]+ with [typing/linting requirements]
&lt;span class="p"&gt;-&lt;/span&gt; [Formatter] and [linter] must pass before marking code complete
&lt;span class="p"&gt;-&lt;/span&gt; All dependencies declared in [manifest file] - never install ad-hoc
&lt;span class="p"&gt;-&lt;/span&gt; Large files go to [remote storage] - never commit to git
&lt;span class="p"&gt;-&lt;/span&gt; Before starting a task: read &lt;span class="sb"&gt;`docs/state/registry.md`&lt;/span&gt; → open epic → check issue

&lt;span class="gu"&gt;## Quick Commands&lt;/span&gt;
&lt;span class="gu"&gt;### Setup&lt;/span&gt;
[environment setup commands]

&lt;span class="gu"&gt;### Build / Test / Deploy&lt;/span&gt;
[the exact commands for each operation]

&lt;span class="gu"&gt;## Codebase Architecture&lt;/span&gt;
[directory tree with one-line descriptions per module]

&lt;span class="gu"&gt;## Epic State Management&lt;/span&gt;
Before starting any task:
&lt;span class="p"&gt;1.&lt;/span&gt; Read docs/state/registry.md
&lt;span class="p"&gt;2.&lt;/span&gt; Check story status in the epic file
&lt;span class="p"&gt;3.&lt;/span&gt; Read the GitHub issue for latest requirements
&lt;span class="p"&gt;4.&lt;/span&gt; Branch: feature/{issue-number}-{slug}

After completing a story:
&lt;span class="p"&gt;1.&lt;/span&gt; Check off the story in the epic doc
&lt;span class="p"&gt;2.&lt;/span&gt; Update Active Context
&lt;span class="p"&gt;3.&lt;/span&gt; Close the GitHub issue
&lt;span class="p"&gt;4.&lt;/span&gt; Open a PR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  The Build &amp;amp; The Fails
&lt;/h2&gt;

&lt;p&gt;The first commit after the initial scaffold was &lt;code&gt;feat(S5/S6/S8): implement train.py, evaluate.py, inference.py&lt;/code&gt;. In a single pass, Oz generated the training pipeline, evaluation harness, and sliding-window inference module. It followed the &lt;code&gt;AGENTS.md&lt;/code&gt; rules, used the correct base model (&lt;code&gt;MIT/ast-finetuned-audioset-10-10-0.4593&lt;/code&gt;), and wired up the &lt;code&gt;WeightedLossTrainer&lt;/code&gt; subclass with class-weighted &lt;code&gt;CrossEntropyLoss&lt;/code&gt; exactly as I specified.&lt;/p&gt;
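&lt;p&gt;For readers unfamiliar with class-weighted loss: it scales each example's cross-entropy by a per-class weight, so the rare positive class ("first crack") is not drowned out by the majority class. A dependency-free sketch of the arithmetic (weights illustrative, not the project's values):&lt;/p&gt;

```python
import math

def weighted_cross_entropy(logits, label, weights):
    """Class-weighted CE for one example: -weights[label] * log_softmax(logits)[label]."""
    m = max(logits)  # stabilise the softmax numerically
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    log_prob = logits[label] - log_sum
    return -weights[label] * log_prob

# With uniform weights this is ordinary cross-entropy; up-weighting the rare
# positive class (index 1 here) scales its penalty for misclassification.
uniform = weighted_cross_entropy([0.0, 0.0], 1, [1.0, 1.0])   # log(2)
boosted = weighted_cross_entropy([0.0, 0.0], 1, [1.0, 4.0])   # 4 * log(2)
```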

&lt;p&gt;Then training failed.&lt;/p&gt;
&lt;h3&gt;
  
  
  The &lt;code&gt;input_features&lt;/code&gt; vs &lt;code&gt;input_values&lt;/code&gt; Bug
&lt;/h3&gt;

&lt;p&gt;Oz wrote the dataset adapter to return &lt;code&gt;input_features&lt;/code&gt; as the tensor key - a reasonable guess if you have seen other Hugging Face audio pipelines. But &lt;code&gt;ASTFeatureExtractor&lt;/code&gt; returns &lt;code&gt;input_values&lt;/code&gt;, not &lt;code&gt;input_features&lt;/code&gt;. The model silently received no input and the loss exploded.&lt;/p&gt;

&lt;p&gt;Here is the exact diff from the fix commit (&lt;a href="https://github.com/syamaner/coffee-first-crack-detection/commit/75bbb4b" rel="noopener noreferrer"&gt;&lt;code&gt;75bbb4b&lt;/code&gt;&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;# src/coffee_first_crack/train.py - _HFDatasetAdapter.__getitem__
&lt;span class="gd"&gt;-            "input_features": inputs["input_features"].squeeze(0),
&lt;/span&gt;&lt;span class="gi"&gt;+            "input_values": inputs["input_values"].squeeze(0),
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It was a one-line bug. The kind of bug that costs you an hour of staring at training logs if you don't know what to look for. Oz pattern-matched from Whisper examples - the most common audio model in Hugging Face tutorials - where &lt;code&gt;input_features&lt;/code&gt; is correct. For &lt;code&gt;ASTFeatureExtractor&lt;/code&gt;, the key is &lt;code&gt;input_values&lt;/code&gt;. This is a &lt;a href="https://github.com/huggingface/transformers/issues/20470" rel="noopener noreferrer"&gt;known, unresolved inconsistency&lt;/a&gt; in the Hugging Face audio API.&lt;/p&gt;
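&lt;p&gt;A cheap guard in the dataset adapter would have surfaced this class of bug at construction time instead of mid-training. A hedged sketch - names are illustrative, not the project's actual adapter:&lt;/p&gt;

```python
def adapter_item(inputs, expected_key="input_values"):
    """Wrap a feature-extractor output, failing loudly if the tensor key is wrong.

    ASTFeatureExtractor returns 'input_values'; Whisper-style extractors
    return 'input_features'. Guessing the wrong key means the model silently
    receives no input.
    """
    keys = list(inputs.keys())
    if expected_key not in keys:
        raise KeyError(
            "feature extractor returned " + str(keys)
            + ", expected '" + expected_key + "'"
        )
    return {expected_key: inputs[expected_key]}
```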

&lt;p&gt;The same commit also added &lt;code&gt;accelerate&amp;gt;=0.26.0&lt;/code&gt; to &lt;code&gt;pyproject.toml&lt;/code&gt; - a dependency the Hugging Face &lt;code&gt;Trainer&lt;/code&gt; requires at runtime but doesn't explicitly import at the top level. Oz didn't catch it during code generation because it never triggered an &lt;code&gt;ImportError&lt;/code&gt; until actual training.&lt;/p&gt;

&lt;p&gt;Here is the model evaluated on a Raspberry Pi 5 - 191 test samples, INT8 quantised, 4 threads, via SSH from Warp:&lt;/p&gt;


&lt;div class="ltag__warp"&gt;
  &lt;iframe src="https://app.warp.dev/block/embed/VrKfC5EyxNPSooEFJeRjr1" title="Warp Terminal Block" width="100%" height="400"&gt;
  &lt;/iframe&gt;
&lt;/div&gt;



&lt;p&gt;This is what the validation loop looks like in practice - Oz hitting a &lt;code&gt;pyright&lt;/code&gt; failure, diagnosing the type issues, fixing them, then running the full &lt;code&gt;ruff check&lt;/code&gt; → &lt;code&gt;ruff format&lt;/code&gt; → &lt;code&gt;pyright&lt;/code&gt; → &lt;code&gt;pytest&lt;/code&gt; chain until all checks pass:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvgkz48xufqhetb1qgd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvgkz48xufqhetb1qgd9.png" alt="Static code checking using Pyright and Ruff" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Copilot as the Third Actor
&lt;/h2&gt;

&lt;p&gt;Across the 10 PRs in this project, Copilot submitted 28 review batches containing 111 individual comments. Here is how they broke down by PR:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PR #23 (RPi5 ONNX validation):&lt;/strong&gt; 36 comments across 6 review rounds - the most reviewed PR by far.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #17 (Export, scripts, tests):&lt;/strong&gt; 26 comments across 5 rounds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #27 (Data prep + mic-2 expansion):&lt;/strong&gt; 16 comments across 3 rounds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #16 (Train, eval, inference):&lt;/strong&gt; 10 comments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR #28 (Gradio Space):&lt;/strong&gt; 10 comments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern was consistent. Copilot caught:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Type safety:&lt;/strong&gt; Missing type hints, incorrect return types, untyped function signatures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unused imports:&lt;/strong&gt; Dead code left behind after refactoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API misuse:&lt;/strong&gt; Deprecated parameters, missing synchronisation calls, incorrect exception handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency hygiene:&lt;/strong&gt; Missing explicit dependencies, version pinning issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs and copy:&lt;/strong&gt; Misleading docstrings, inaccurate UI text in the Gradio Space.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, Copilot did not catch the core machine learning logic issues. To be fair, this is largely because my workflow required me to intercept them before they ever reached a PR:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;input_features&lt;/code&gt; vs &lt;code&gt;input_values&lt;/code&gt; key mismatch:&lt;/strong&gt; This was fixed locally during the active dev loop before opening the PR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data leakage from chunk-level splitting:&lt;/strong&gt; This is the biggest ML risk in this project, but this was addressed architecturally during the setup phase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyperparameter choices:&lt;/strong&gt; Overfitting issues were identified and corrected interactively by reading the local training logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The precision/recall tradeoff:&lt;/strong&gt; The class weighting strategy was a deliberate human decision delivered prior to code review.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a criticism of Copilot. It is doing exactly what it should: catching code-level defects at review time. But if you are relying on AI code review to validate your ML pipeline logic, you will ship broken models with clean code.&lt;/p&gt;
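&lt;p&gt;For completeness, the precision/recall tradeoff mentioned above is plain arithmetic over confusion counts - a quick sketch with illustrative numbers:&lt;/p&gt;

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion counts; zero-safe."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Zero false positives yields 100% precision even with a few missed cracks.
p, r = precision_recall(tp=45, fp=0, fn=3)  # illustrative counts, not the project's
```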

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Wall-clock time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Two evenings (Fri → Sat)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stories completed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;18 across 6 phases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pull requests&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10 merged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total commits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~65 (55 non-merge)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Oz co-authored&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;52 commits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lines of code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;11,087 insertions across 75 files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Copilot reviews&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;28 batches, 111 individual comments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model accuracy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;97.4% test / 100% precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edge latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.09s per 10s window (RPi5, INT8, 4 threads)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dataset&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;First public coffee roasting audio dataset - 973 chunks, 15 roasts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The model is live at &lt;a href="https://huggingface.co/syamaner/coffee-first-crack-detection" rel="noopener noreferrer"&gt;huggingface.co/syamaner/coffee-first-crack-detection&lt;/a&gt;. The dataset is at &lt;a href="https://huggingface.co/datasets/syamaner/coffee-first-crack-audio" rel="noopener noreferrer"&gt;huggingface.co/datasets/syamaner/coffee-first-crack-audio&lt;/a&gt;. The source is on &lt;a href="https://github.com/syamaner/coffee-first-crack-detection" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The piece of this workflow I wouldn't give up: the state-reading loop in &lt;code&gt;AGENTS.md&lt;/code&gt;. Without it, agent context drifts within two or three tasks and it starts generating against stale assumptions. If you've run a long-form agentic project and solved the context problem differently, I'd be interested in the specifics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt; Post 2 - The Data (coming soon) covers how I built the first public audio dataset for coffee roasting first crack detection, and the data engineering decisions that got us to zero false positives.&lt;/p&gt;




&lt;p&gt;Try it - upload a 10-second roasting clip or use an existing sample:&lt;/p&gt;


&lt;div class="ltag__huggingface"&gt;
  &lt;iframe src="https://syamaner-coffee-first-crack-detection.hf.space" title="Hugging Face Space" width="100%" height="600"&gt;
  &lt;/iframe&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Project:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/syamaner/coffee-first-crack-detection" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/syamaner/coffee-first-crack-detection" rel="noopener noreferrer"&gt;Hugging Face Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/datasets/syamaner/coffee-first-crack-audio" rel="noopener noreferrer"&gt;Hugging Face Dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/spaces/syamaner/coffee-first-crack-detection" rel="noopener noreferrer"&gt;Live Gradio Space&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.warp.dev/" rel="noopener noreferrer"&gt;Warp - The Agentic Development Environment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.warp.dev/ai" rel="noopener noreferrer"&gt;Oz - Warp's AI Agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.warp.dev/features/blocks" rel="noopener noreferrer"&gt;Warp Block Sharing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>warp</category>
      <category>audio</category>
    </item>
    <item>
      <title>Part 3: From Neural Networks to Autonomous Coffee Roasting - Orchestrating MCP Servers with .NET Aspire 13 and n8n Agents</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Sun, 16 Nov 2025 18:48:48 +0000</pubDate>
      <link>https://dev.to/syamaner/part-3-from-neural-networks-to-autonomous-coffee-roasting-orchestrating-mcp-servers-with-net-58pd</link>
      <guid>https://dev.to/syamaner/part-3-from-neural-networks-to-autonomous-coffee-roasting-orchestrating-mcp-servers-with-net-58pd</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/syamaner/part-1-training-a-neural-network-to-detect-coffee-first-crack-from-audio-an-agentic-development-1jei"&gt;Part 1&lt;/a&gt;, we fine-tuned a neural network to detect coffee first crack from audio using PyTorch and the Audio Spectrogram Transformer. In &lt;a href="https://dev.to/syamaner/part-2-building-mcp-servers-to-control-a-home-coffee-roaster-an-agentic-development-journey-with-58ik"&gt;Part 2&lt;/a&gt;, we built two MCP (Model Context Protocol) servers - one to control my Hottop KN-8828B-2K+ roaster and another to detect first crack in real time using a microphone.&lt;/p&gt;

&lt;p&gt;This is where we put it all together. But first: &lt;strong&gt;can .NET Aspire orchestrate Python MCP servers and n8n workflows to autonomously roast coffee?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spoiler alert: &lt;strong&gt;Yes, it can.&lt;/strong&gt; And the coffee tastes spot on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;Autonomous coffee roasting isn't just about detecting when first crack happens. It's a complex orchestration problem involving:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multiple systems&lt;/strong&gt;: Python MCP servers to interact with hardware, an agent layer for orchestration (n8n workflows to begin with), and containerised services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time decision making&lt;/strong&gt;: Monitoring sensors every few seconds and deciding on actions based on the current state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety-critical control&lt;/strong&gt;: Managing heat and fan speed to avoid burning / wasting green beans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precise timing&lt;/strong&gt;: Detecting the bean charge event (when beans are added during the preheating stage), detecting first crack, and hitting the target development time percentage by adjusting the available controls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt;: Tracking telemetry across Python, n8n, and .NET components.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution? &lt;strong&gt;.NET Aspire 13&lt;/strong&gt; orchestrating everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Aspire 13?
&lt;/h2&gt;

&lt;p&gt;Aspire 13.0 (released with .NET 10) brings significant improvements for Python integration and container orchestration - perfect for this use case:&lt;/p&gt;

&lt;h3&gt;
  
  
  Simplified Python Hosting with &lt;code&gt;AddPythonModule&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Aspire 13 replaces the old &lt;code&gt;AddPythonApp&lt;/code&gt; API with three specialized methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AddPythonModule&lt;/code&gt;&lt;/strong&gt;: Runs Python modules with &lt;code&gt;-m&lt;/code&gt; flag (e.g., &lt;code&gt;python -m src.mcp_servers.roaster_control.sse_server&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AddPythonScript&lt;/code&gt;&lt;/strong&gt;: Runs standalone Python scripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AddPythonExecutable&lt;/code&gt;&lt;/strong&gt;: Runs executables from virtual environments (e.g., &lt;code&gt;uvicorn&lt;/code&gt;, &lt;code&gt;gunicorn&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For MCP servers running as modules, &lt;code&gt;AddPythonModule&lt;/code&gt; is cleaner and more explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Old way (Aspire 9)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddPythonApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"roaster-control"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;projectRoot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;venvPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithArgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"src.mcp_servers.roaster_control.sse_server"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// New way (Aspire 13)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddPythonModule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"roaster-control"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;projectRoot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"src.mcp_servers.roaster_control.sse_server"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithVirtualEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;venvPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cleaner AppHost Project Structure
&lt;/h3&gt;

&lt;p&gt;The new &lt;code&gt;Aspire.AppHost.Sdk/13.0.0&lt;/code&gt; simplifies project files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No separate &lt;code&gt;&amp;lt;Sdk Name="..." /&amp;gt;&lt;/code&gt; element needed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Aspire.Hosting.AppHost&lt;/code&gt; package included automatically.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;IsAspireHost&lt;/code&gt; property is no longer needed (it is now implicit).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Enhanced Container Orchestration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Better lifecycle management for containers (n8n in this project).&lt;/li&gt;
&lt;li&gt;Improved health check support.&lt;/li&gt;
&lt;li&gt;More granular control over container runtime arguments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Built-in OpenTelemetry Integration
&lt;/h3&gt;

&lt;p&gt;Out-of-the-box observability with &lt;code&gt;.WithOtlpExporter()&lt;/code&gt; for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured logging from Python processes.&lt;/li&gt;
&lt;li&gt;Distributed tracing across MCP calls.&lt;/li&gt;
&lt;li&gt;Real-time metrics in the Aspire dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Here's what .NET Aspire orchestrates:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08s6i4d8kto8qwdvftns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08s6i4d8kto8qwdvftns.png" alt="Architecture Overview" width="578" height="742"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Aspire?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single command startup&lt;/strong&gt;: &lt;code&gt;dotnet run&lt;/code&gt; starts all 3 services with proper dependency ordering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared configuration&lt;/strong&gt;: Environment variables, Auth0 credentials, OpenTelemetry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python support&lt;/strong&gt;: Built-in virtual environment management with &lt;code&gt;AddPythonModule&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container orchestration&lt;/strong&gt;: Manages n8n container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt;: Unified dashboard with logs, traces, and metrics from all components&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Development velocity&lt;/strong&gt;: Changes to Python code auto-reload, no container rebuilds needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The n8n Autonomous Roasting Workflow
&lt;/h2&gt;

&lt;p&gt;As a first step, n8n was selected for the agent layer. Its visual workflow editor and built-in constructs allowed rapid verification of the agentic roasting process. The heart of the system is an n8n workflow that acts as the "roasting brain." Here's what it does:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F827dnzzmh6pon8dpuz1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F827dnzzmh6pon8dpuz1q.png" alt="N8N Workflow" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Initialisation &amp;amp; Preheating (Preheating Agent)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Start → Read Roaster Status → Start Roaster → Monitor Temperature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connects to both MCP servers via SSE (Server-Sent Events).&lt;/li&gt;
&lt;li&gt;Starts the roaster at 100% heat, 30% fan.&lt;/li&gt;
&lt;li&gt;Monitors bean temperature rising toward ~170°C during preheating.&lt;/li&gt;
&lt;li&gt;Uses an AI Agent node (Preheating Agent) with custom instructions to detect preheating completion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key metrics tracked:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bean temperature.&lt;/li&gt;
&lt;li&gt;Rate of Rise (°C/min).&lt;/li&gt;
&lt;li&gt;Fan speed (%).&lt;/li&gt;
&lt;li&gt;Heat level (%).&lt;/li&gt;
&lt;/ul&gt;
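&lt;p&gt;The preheating check can be sketched in Python. This is an illustrative sketch only, not the actual MCP server code: the class name, window size, and thresholds are assumptions.&lt;/p&gt;

```python
from collections import deque

class PreheatMonitor:
    """Tracks bean temperature and Rate of Rise (RoR) during preheating.

    Hypothetical sketch: names, thresholds, and polling interval are
    illustrative, not the project's actual implementation.
    """

    def __init__(self, target_temp_c=170.0, poll_interval_s=2.0, window=15):
        self.target_temp_c = target_temp_c
        self.poll_interval_s = poll_interval_s
        # Keep the last `window` readings to smooth the RoR estimate.
        self.readings = deque(maxlen=window)

    def add_reading(self, temp_c):
        self.readings.append(temp_c)

    def rate_of_rise(self):
        """Approximate RoR in °C/min from the oldest and newest readings."""
        if len(self.readings) < 2:
            return 0.0
        span_s = (len(self.readings) - 1) * self.poll_interval_s
        return (self.readings[-1] - self.readings[0]) / span_s * 60.0

    def preheat_complete(self):
        """Preheating is done once bean temperature reaches the target."""
        return bool(self.readings) and self.readings[-1] >= self.target_temp_c
```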

&lt;h3&gt;
  
  
  Phase 2: Bean Charge Detection (Preheating Agent)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monitor Temp → Detect Temperature Delta threshold → Mark T0 Timestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When green beans are added to the hot roaster, the measured temperature drops sharply (e.g., from over 170°C to below 90°C). The workflow then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracks rolling temperature averages.&lt;/li&gt;
&lt;li&gt;Detects sudden drops &amp;gt; 40°C.&lt;/li&gt;
&lt;li&gt;Marks "T0" - the beginning of roast time.&lt;/li&gt;
&lt;li&gt;All subsequent metrics are relative to T0.&lt;/li&gt;
&lt;/ul&gt;
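&lt;p&gt;The T0 detection idea (rolling average plus a drop threshold) can be sketched as follows. Function and parameter names are illustrative, not the workflow's actual code.&lt;/p&gt;

```python
from collections import deque

def detect_bean_charge(temps, drop_threshold_c=40.0, window=5):
    """Return the index at which beans were charged (T0), or None.

    Illustrative sketch: each new reading is compared against the rolling
    average of the previous `window` readings, and a sudden drop larger
    than `drop_threshold_c` marks T0.
    """
    recent = deque(maxlen=window)
    for i, temp in enumerate(temps):
        if len(recent) == recent.maxlen:
            avg = sum(recent) / len(recent)
            if avg - temp > drop_threshold_c:
                return i  # T0: cold beans just hit the hot drum
        recent.append(temp)
    return None
```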

&lt;p&gt;&lt;strong&gt;From the logs:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"t0_detected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"beans_added_temp_c"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;96&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"t0_timestamp_utc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-11-15T21:21:56.490259+00:00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Phase 3: First Crack Detection (Roast Agent)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop: Poll First Crack MCP → Check Status → Wait
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflow continuously calls the First Crack Detection MCP server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streams microphone audio to the PyTorch model.&lt;/li&gt;
&lt;li&gt;Uses sliding window inference (10-second windows).&lt;/li&gt;
&lt;li&gt;Implements "pop-confirmation" logic (minimum 3 pops within 30 seconds).&lt;/li&gt;
&lt;li&gt;Reports when first crack is confirmed.&lt;/li&gt;
&lt;/ul&gt;
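&lt;p&gt;The pop-confirmation rule (minimum 3 pops within 30 seconds) can be sketched like this. The real detector feeds timestamps in from the sliding-window model inferences; names here are illustrative.&lt;/p&gt;

```python
from collections import deque

def make_pop_confirmer(min_pops=3, window_s=30.0):
    """Confirm first crack once `min_pops` pops land within `window_s` seconds."""
    pops = deque()

    def on_pop(timestamp_s):
        pops.append(timestamp_s)
        # Discard pops that have fallen out of the confirmation window.
        while pops and timestamp_s - pops[0] > window_s:
            pops.popleft()
        return len(pops) >= min_pops

    return on_pop
```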

&lt;p&gt;&lt;strong&gt;Detection event:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"first_crack_temp_c"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;184.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"first_crack_time_display"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"08:42"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"roast_elapsed_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;522&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Phase 4: Development Time Management (Roast Agent)
&lt;/h3&gt;

&lt;p&gt;This phase is critical: mishandling it leads to under-roasted or over-roasted beans. The agent's objective is to adjust fan and heat to extend the development time.&lt;/p&gt;

&lt;p&gt;Development time percentage is the share of the total roast time spent between first crack and the end of the roast. The goal is to keep this period around 15-20%. On my machine, I have found that this needs to be achieved before bean temperatures go above 196°C to get the results I prefer.&lt;br&gt;
&lt;/p&gt;
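&lt;p&gt;The definition above boils down to a one-line calculation (a sketch; the function name is illustrative):&lt;/p&gt;

```python
def development_time_percent(first_crack_s, total_roast_s):
    """Development time as a percentage of the total roast duration.

    Matches the definition above: time from first crack to drop,
    divided by the overall roast time.
    """
    development_s = total_roast_s - first_crack_s
    return round(development_s / total_roast_s * 100, 1)
```

&lt;p&gt;With the numbers from the roast profile logged later in this post (first crack at 522 s, total 584 s), this gives 10.6%.&lt;/p&gt;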

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop: Adjust Heat/Fan → Monitor Development % → Check Target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once first crack is detected, the &lt;strong&gt;critical development phase&lt;/strong&gt; begins. The workflow's AI agent:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitors:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current bean temperature&lt;/li&gt;
&lt;li&gt;Rate of Rise (to prevent stalling or rushing)&lt;/li&gt;
&lt;li&gt;Development time percentage (target: 15-20%)&lt;/li&gt;
&lt;li&gt;Time since first crack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Controls:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduces heat (100% → 60% → 40%)&lt;/li&gt;
&lt;li&gt;Increases fan speed (30% → 50% → 70%)&lt;/li&gt;
&lt;li&gt;Slows the roast to extend development time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision logic&lt;/strong&gt; (via AI agent):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IF development_time_percent &amp;gt;= 15% AND development_time_percent &amp;lt;= 20%:
    IF bean_temp_c &amp;gt;= 190 AND bean_temp_c &amp;lt;= 195:
        → DROP BEANS (optimal light roast)
    ELSE IF bean_temp_c &amp;gt; 195:
        → DROP BEANS (approaching medium roast)
ELSE:
    → CONTINUE MONITORING
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
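&lt;p&gt;For clarity, the drop/continue rule above rendered as Python. This is a sketch only: in the actual system the decision is made by the AI agent following its instructions, not by hard-coded thresholds.&lt;/p&gt;

```python
def development_decision(development_time_percent, bean_temp_c):
    """Python rendering of the agent's drop/continue rule (illustrative)."""
    if 15.0 <= development_time_percent <= 20.0:
        if 190.0 <= bean_temp_c <= 195.0:
            return "drop"   # optimal light roast
        if bean_temp_c > 195.0:
            return "drop"   # approaching medium roast
    return "monitor"        # keep adjusting heat/fan and re-check
```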



&lt;p&gt;&lt;strong&gt;Actual output from workflow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"phase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"development"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"monitor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bean_temp_c"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;191&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Development: 191°C, 8.9%"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then moments later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"phase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cooling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"drop"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bean_temp_c"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;193&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Optimal! Dropping beans."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Phase 5: Completion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Drop Beans → Set Cooling Fan to 100% → Stop Heat → Cool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commands the roaster to drop beans into cooling tray.&lt;/li&gt;
&lt;li&gt;Sets the cooling fan to 100% for maximum cooling.&lt;/li&gt;
&lt;li&gt;Cuts heat to 0%.&lt;/li&gt;
&lt;li&gt;Records final metrics for analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Final roast profile:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"roast_elapsed_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;584&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"roast_elapsed_display"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"09:44"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"beans_added_temp_c"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;175.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"first_crack_temp_c"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;184.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"first_crack_time_display"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"08:42"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"development_time_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;62&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"development_time_display"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01:02"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"development_time_percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"total_roast_duration_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;584&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Aspire Orchestration Code
&lt;/h2&gt;

&lt;p&gt;Here's how .NET Aspire 13 makes this all work (from &lt;code&gt;Program.cs&lt;/code&gt;):&lt;/p&gt;

&lt;h3&gt;
  
  
  Python MCP Servers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Roaster Control MCP Server&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;roasterControl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddPythonModule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"roaster-control"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;projectRoot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"src.mcp_servers.roaster_control.sse_server"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithVirtualEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sharedVenvPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithHttpEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5002&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"ROASTER_CONTROL_PORT"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"AUTH0_DOMAIN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth0Domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"AUTH0_AUDIENCE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth0Audience&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"USE_MOCK_HARDWARE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;useMockHardware&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"OTEL_EXPORTER_OTLP_PROTOCOL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"grpc"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithOtlpExporter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// First Crack Detection MCP Server&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;firstCrackDetection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddPythonModule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"first-crack-detection"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;projectRoot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"src.mcp_servers.first_crack_detection.sse_server"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithVirtualEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sharedVenvPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithHttpEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"FIRST_CRACK_DETECTION_PORT"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"AUTH0_DOMAIN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth0Domain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"AUTH0_AUDIENCE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth0Audience&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"OTEL_EXPORTER_OTLP_PROTOCOL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"grpc"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithOtlpExporter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's happening here:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AddPythonModule&lt;/code&gt;: New in Aspire 13, replaces the old &lt;code&gt;AddPythonApp&lt;/code&gt; API&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WithVirtualEnvironment&lt;/code&gt;: Points to shared Python 3.11 venv at repo root&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WithHttpEndpoint&lt;/code&gt;: Configures SSE endpoints for n8n to connect&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WithOtlpExporter&lt;/code&gt;: Sends telemetry to Aspire dashboard&lt;/li&gt;
&lt;li&gt;Modules run with &lt;code&gt;-m&lt;/code&gt; flag implicitly (e.g., &lt;code&gt;python -m src.mcp_servers.roaster_control.sse_server&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Container Services
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// n8n Workflow Engine&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;n8n&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"n8n"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"n8nio/n8n"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"latest"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithHttpEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5678&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targetPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5678&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"n8n-ui"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithBindMount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./n8n-data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/home/node/.n8n"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"N8N_HOST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"0.0.0.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"N8N_PORT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"5678"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"WEBHOOK_URL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:5678/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"N8N_METRICS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bind mount for persisting workflows and credentials.&lt;/li&gt;
&lt;li&gt;Exposes port 5678 for web UI.&lt;/li&gt;
&lt;li&gt;Metrics enabled for observability.&lt;/li&gt;
&lt;li&gt;Auto-restarts on failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real-World Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The First Autonomous Roast
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stats:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total roast time: 9:44 (584 seconds)&lt;/li&gt;
&lt;li&gt;First crack: 8:42 at 184°C
&lt;/li&gt;
&lt;li&gt;Development time: 1:02 (10.6% - slightly under target but acceptable)&lt;/li&gt;
&lt;li&gt;Final temperature: 193°C&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Light roast, consistent colour and smooth taste.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What worked:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temperature drop detection caught bean addition instantly.&lt;/li&gt;
&lt;li&gt;First crack detection was accurate (within 20 seconds of my own ears), which is why the 10.6% development percentage is not a concern.&lt;/li&gt;
&lt;li&gt;Heat/fan adjustments prevented burning.&lt;/li&gt;
&lt;li&gt;Development % monitoring kept roast in safe zone.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What could improve:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Development time was 10.6% instead of target 15-20%.&lt;/li&gt;
&lt;li&gt;Could start reducing heat earlier after first crack.&lt;/li&gt;
&lt;li&gt;Rate of Rise could be smoother in final phase.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Aspire Dashboard Experience
&lt;/h3&gt;

&lt;p&gt;The unified Aspire dashboard shows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;roaster-control (Python) - Running&lt;/li&gt;
&lt;li&gt;first-crack-detection (Python) - Running
&lt;/li&gt;
&lt;li&gt;n8n (Container) - Running&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8qopy9tbvau2z540l6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8qopy9tbvau2z540l6a.png" alt="Rate of Rise" width="800" height="605"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjtaufe1iz7ceikrmap9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjtaufe1iz7ceikrmap9.png" alt="Bean Temperature" width="800" height="639"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. MCP Server Design Matters
&lt;/h3&gt;

&lt;p&gt;The current design has two MCP servers: one for roaster control and one for first crack detection. The original idea was that the roaster control MCP server could run on a low-powered device connected to the roaster, while the First Crack Detector ran on the laptop due to its hardware requirements.&lt;/p&gt;

&lt;p&gt;This design adds coordination overhead to the agent and makes it more complicated than necessary. A unified MCP server that returns all metrics in a single call would simplify the agent logic and likely lead to more predictable behaviour. This is one area to improve before moving on to comparing multiple agent frameworks.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Aspire's Python Support is Production-Ready
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before Aspire:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple terminal windows or docker compose&lt;/li&gt;
&lt;li&gt;Manual venv activation&lt;/li&gt;
&lt;li&gt;Additional effort to set up OpenTelemetry collectors and dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With Aspire:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One command: &lt;code&gt;dotnet run&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Automatic venv management.&lt;/li&gt;
&lt;li&gt;Shared configuration.&lt;/li&gt;
&lt;li&gt;Structured logging and tracing.&lt;/li&gt;
&lt;li&gt;Custom metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. n8n is Powerful for Agent Orchestration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why n8n worked well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual debugging: See workflow execution in real-time.&lt;/li&gt;
&lt;li&gt;Built-in AI Agent node: Uses OpenAI with tool calling.&lt;/li&gt;
&lt;li&gt;MCP client support: Native SSE connections.&lt;/li&gt;
&lt;li&gt;Error handling: Built-in retry logic and error branches.&lt;/li&gt;
&lt;li&gt;State management: Workflow variables persist between runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. MCP Protocol Makes Tool Integration a Breeze
&lt;/h3&gt;

&lt;p&gt;The MCP servers exposed simple HTTP/SSE endpoints:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Roaster Control Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;read_roaster_status&lt;/code&gt; → Returns current sensors + metrics&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;adjust_heat(level: int)&lt;/code&gt; → Sets heat 0-100%&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;adjust_fan(speed: int)&lt;/code&gt; → Sets fan 0-100%&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stop_roaster()&lt;/code&gt; → Emergency stop&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;start_roaster()&lt;/code&gt; → Begin roast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;First Crack Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;start_first_crack_detection()&lt;/code&gt; → Start audio monitoring&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_first_crack_status()&lt;/code&gt; → Check if first crack detected&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stop_first_crack_detection()&lt;/code&gt; → Stop monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The n8n AI Agent called these tools naturally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent: "I need to check the roaster status"
→ Calls read_roaster_status
→ Receives JSON with temp, fan, heat, metrics
→ Makes decision
→ Calls adjust_heat(60) to reduce heat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
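
&lt;p&gt;That decision cycle can be sketched as a minimal control loop. The tool functions are stand-ins for the MCP calls above, and the temperature thresholds are illustrative only:&lt;/p&gt;

```python
# Minimal control-loop sketch. The four callables stand in for the MCP
# tools above; the 230C cutoff and 200C threshold are illustrative only.

def control_step(read_roaster_status, get_first_crack_status, adjust_heat, stop_roaster):
    status = read_roaster_status()          # JSON-like dict of sensors
    crack = get_first_crack_status()
    if status["bean_temp_c"] > 230:
        stop_roaster()                      # emergency stop, illustrative cutoff
    elif crack["detected"]:
        adjust_heat(40)                     # back off heat for the development phase
    elif status["bean_temp_c"] > 200:
        adjust_heat(60)                     # reduce heat approaching first crack
    return status, crack
```

&lt;p&gt;In the real workflow the decision in the middle is made by the AI Agent node rather than fixed rules, but the tool-calling shape is the same.&lt;/p&gt;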



&lt;h3&gt;
  
  
  5. Observability is Critical
&lt;/h3&gt;

&lt;p&gt;When the roast is in progress, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time monitoring&lt;/strong&gt;: See temperature changing every 2 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error visibility&lt;/strong&gt;: Know immediately if MCP server crashes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance metrics&lt;/strong&gt;: Ensure control commands complete in &amp;lt;500ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical data&lt;/strong&gt;: Review roast profile after completion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aspire's OpenTelemetry integration gave us all of this for free.&lt;/p&gt;
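
&lt;p&gt;In the project these numbers come from Aspire's OpenTelemetry integration; as a simplified stand-in, the 500 ms budget check boils down to something like:&lt;/p&gt;

```python
import time
from functools import wraps

# Simplified stand-in for the latency metrics the project gets from
# Aspire/OpenTelemetry: record each control command's duration and flag
# any call that blows the 500 ms budget mentioned above.
LATENCY_BUDGET_MS = 500.0

def timed_command(fn, samples):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000.0
            samples.append(elapsed_ms)
            if elapsed_ms > LATENCY_BUDGET_MS:
                print(f"{fn.__name__} exceeded budget: {elapsed_ms:.0f} ms")
    return wrapper
```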

&lt;h2&gt;
  
  
  The Development Experience with Warp Agent
&lt;/h2&gt;

&lt;p&gt;Throughout this project, I used Warp Agent extensively:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Aspire upgrade (9 → 13):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Warp Agent searched Microsoft Learn docs via MCP&lt;/li&gt;
&lt;li&gt;Found breaking changes in &lt;code&gt;AddPythonApp&lt;/code&gt; → &lt;code&gt;AddPythonModule&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Generated migration plan with test steps&lt;/li&gt;
&lt;li&gt;Verified builds and runtime behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For n8n workflow debugging:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyzed MCP server logs to diagnose connection issues&lt;/li&gt;
&lt;li&gt;Suggested retry logic for transient network errors&lt;/li&gt;
&lt;li&gt;Helped structure AI agent prompts for decision-making&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Python model optimization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Profiled inference latency&lt;/li&gt;
&lt;li&gt;Suggested caching strategies for feature extraction&lt;/li&gt;
&lt;li&gt;Optimized sliding window parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What made Warp Agent effective:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context awareness&lt;/strong&gt;: Understood the full stack (C#, Python, n8n)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP integration&lt;/strong&gt;: Could fetch the latest Microsoft docs, plus Context7 docs for n8n.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterative debugging&lt;/strong&gt;: Quickly test → analyse → fix cycles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code generation&lt;/strong&gt;: Created boilerplate while I focused on logic.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Short Term
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Roast profile tuning&lt;/strong&gt;: Adjust heat/fan curves to hit 15-20% development consistently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data collection&lt;/strong&gt;: Log every roast for analysis (temp curves, timestamps, outcomes).

&lt;ul&gt;
&lt;li&gt;Add support for automatically exporting roast statistics and the ability to rate roasts later.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved first crack detection&lt;/strong&gt;: Capture manual recording sessions in different environmental setups to improve detection. What we have is impressive given we only had 9 roasting sessions for fine-tuning, but we can do better.&lt;/li&gt;
&lt;li&gt;Implement multiple agent frameworks and compare their pros and cons.&lt;/li&gt;
&lt;li&gt;Test the MCP servers running on a Raspberry Pi 5.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Medium Term
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Train an emulation roast model from historical roast logs.

&lt;ul&gt;
&lt;li&gt;This will allow experimentation without the actual hardware, producing realistic responses to heat, fan, and time inputs that emulate the roaster's heating behaviour.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine learning on roast profiles&lt;/strong&gt;: Train a model to predict optimal heat/fan adjustments once there are enough roast samples and ratings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom UI&lt;/strong&gt;: Build a dedicated roasting interface (replacing n8n for end users) to provide a unified experience across agent frameworks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-origin support&lt;/strong&gt;: Adjust profiles based on bean origin (Kenya vs Brazil)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can .NET Aspire roast coffee?&lt;/strong&gt; Absolutely.&lt;/p&gt;

&lt;p&gt;More importantly, it provided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified orchestration&lt;/strong&gt; for polyglot services (C#, Python, Node.js containers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer productivity&lt;/strong&gt; with single-command startup and hot reload&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production observability&lt;/strong&gt; with unified logs, traces, and metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility&lt;/strong&gt; to iterate quickly on both code and workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PyTorch model for first crack detection (Part 1)&lt;/li&gt;
&lt;li&gt;MCP servers for hardware control and detection (Part 2)
&lt;/li&gt;
&lt;li&gt;.NET Aspire orchestration with n8n workflows (Part 3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...resulted in a &lt;strong&gt;fully autonomous coffee roasting system&lt;/strong&gt; that produces genuinely good coffee.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From 9 raw audio recordings to autonomous coffee roasting—all orchestrated with a single command: &lt;code&gt;dotnet run&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The coffee tastes great. The code is open source. And yes, .NET Aspire can definitely roast coffee.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For reference, today's roast incurred $0.76 in OpenAI API usage costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code and Articles
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code repository:&lt;/strong&gt; &lt;a href="https://github.com/syamaner/bean-agent" rel="noopener noreferrer"&gt;Bean Agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/syamaner/part-1-training-a-neural-network-to-detect-coffee-first-crack-from-audio-an-agentic-development-1jei"&gt;&lt;strong&gt;Part 1:&lt;/strong&gt; Training the Audio Detection Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/syamaner/part-2-building-mcp-servers-to-control-a-home-coffee-roaster-an-agentic-development-journey-with-58ik"&gt;&lt;strong&gt;Part 2:&lt;/strong&gt; Building MCP Servers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  .NET Aspire Documentation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aspire Overview:&lt;/strong&gt; &lt;a href="https://learn.microsoft.com/en-us/dotnet/aspire/" rel="noopener noreferrer"&gt;.NET Aspire documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade to Aspire 13:&lt;/strong&gt; &lt;a href="https://learn.microsoft.com/en-us/dotnet/aspire/get-started/upgrade-to-aspire-13" rel="noopener noreferrer"&gt;Upgrade guide&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python Hosting in Aspire:&lt;/strong&gt; &lt;a href="https://learn.microsoft.com/en-us/dotnet/aspire/get-started/build-aspire-apps-with-python" rel="noopener noreferrer"&gt;Orchestrate Python apps&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tools and Protocols
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;n8n Workflow Automation:&lt;/strong&gt; &lt;a href="https://n8n.io" rel="noopener noreferrer"&gt;n8n.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Context Protocol:&lt;/strong&gt; &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;modelcontextprotocol.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry:&lt;/strong&gt; &lt;a href="https://opentelemetry.io" rel="noopener noreferrer"&gt;opentelemetry.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Model &amp;amp; ML:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593" rel="noopener noreferrer"&gt;Audio Spectrogram Transformer (AST)&lt;/a&gt; - Pre-trained model&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/transformers/en/model_doc/audio-spectrogram-transformer" rel="noopener noreferrer"&gt;AST Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/fine-tune-the-audio-spectrogram-transformer-with-transformers-73333c9ef717/" rel="noopener noreferrer"&gt;Fine-Tuning AST Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2104.01778" rel="noopener noreferrer"&gt;Original AST Paper&lt;/a&gt; - Gong et al., 2021&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The first roast:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04p6g4kmxbc8pq39b2dg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04p6g4kmxbc8pq39b2dg.png" alt="First Roast" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>aspire</category>
      <category>agents</category>
    </item>
    <item>
      <title>Part 2: Building MCP Servers to Control a Home Coffee Roaster - An Agentic Development Journey with Warp Agent</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Sun, 02 Nov 2025 14:58:02 +0000</pubDate>
      <link>https://dev.to/syamaner/part-2-building-mcp-servers-to-control-a-home-coffee-roaster-an-agentic-development-journey-with-58ik</link>
      <guid>https://dev.to/syamaner/part-2-building-mcp-servers-to-control-a-home-coffee-roaster-an-agentic-development-journey-with-58ik</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this 3-part series, we are building an autonomous coffee roasting agent with Warp. The first part covered how we fine-tuned a model to detect first crack — a critical phase in the roasting process. This was a nice warm-up implementing a key component for our end goal, but detection alone isn't enough. &lt;strong&gt;Now we need to expose this functionality so the agent we'll build can both detect first crack and control the roasting process&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This post focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The objective&lt;/strong&gt;: Turning ML predictions into real-world roaster control actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution overview&lt;/strong&gt;: Model Context Protocol (MCP) servers as the bridge between AI agents and hardware&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: The two MCP servers we built—First Crack Detector MCP + Hottop Controller MCP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;📊 TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Connect trained ML model to physical roaster control using an agent to achieve autonomous coffee roasting&lt;/li&gt;
&lt;li&gt; Build two MCP servers — FirstCrackDetector + HottopController

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Stack&lt;/strong&gt;: Python MCP SDK, pyserial, pyhottop, Auth0 authentication&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt; Real-time detection + safe roaster control via AI agents (208+ tests passing)&lt;/li&gt;

&lt;li&gt; Using Warp Agent mode, Context7 MCP Server, Auth0 MCP Server during development&lt;/li&gt;

&lt;li&gt; Next Part: Part 3 orchestrates both servers with Microsoft Agent Framework&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Integrating software, hardware and agents without reinventing the wheel
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Traditional approach: Build custom APIs, handle authentication, manage state
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt; Write integration code using imperative/declarative patterns to manage task lifecycles.&lt;/li&gt;
&lt;li&gt; Homegrown specifications make it harder to leverage emerging ML/AI technologies.&lt;/li&gt;
&lt;li&gt; Each new AI model or agent requires custom integration work.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  The MCP option: Standardised protocol for AI &amp;lt;-&amp;gt; tool communication
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Provides deterministic tools for non-deterministic AI systems&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; Benefits: Discoverability, type safety, streaming support, composability, and interoperability across AI models and agents.&lt;/li&gt;
&lt;li&gt; Write once, connect to MCP-compatible AI (Claude, ChatGPT, custom agents).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🤖 Warp Agent Contributions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt; Provided MCP server scaffolding and tool definitions using standard MCP SDK.&lt;/li&gt;
&lt;li&gt; Helped integrate pyhottop library for Hottop serial protocol communication.&lt;/li&gt;
&lt;li&gt; Debugged serial communication timing issues and state synchronization.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Generated Auth0 authentication middleware&lt;/strong&gt; with role-based access control.&lt;/li&gt;
&lt;li&gt; Created comprehensive test suites (&lt;strong&gt;208+ tests&lt;/strong&gt; across both MCP servers).&lt;/li&gt;
&lt;li&gt; Suggested testing strategies including a simulation mode for hardware-free development.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is MCP (Model Context Protocol) and Why Use It?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is MCP?
&lt;/h3&gt;

&lt;p&gt;MCP is a protocol that enables AI assistants such as Claude, ChatGPT, and Warp Agent Mode to connect to external resources through a standard client-server architecture. This allows the same MCP server to be used with various agent technologies without modifying its code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Client-Server Concept
&lt;/h4&gt;

&lt;h5&gt;
  
  
  MCP Server
&lt;/h5&gt;

&lt;p&gt;A program that exposes specific data and tools (functionality) that AI agents can use. For example, a server might provide access to a database, a file system, or, as in our case, specific hardware.&lt;/p&gt;

&lt;h5&gt;
  
  
  MCP Client
&lt;/h5&gt;

&lt;p&gt;The application that connects to MCP servers and makes their capabilities available to the AI. For example, Claude or ChatGPT acts as an MCP client.&lt;/p&gt;

&lt;h4&gt;
  
  
  Transport Types
&lt;/h4&gt;

&lt;p&gt;MCP supports different ways for clients and servers to communicate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stdio (Standard Input/Output)

&lt;ul&gt;
&lt;li&gt;Most common for local integrations&lt;/li&gt;
&lt;li&gt;Server runs as a subprocess, communicating via stdin/stdout&lt;/li&gt;
&lt;li&gt;Simple and works well for local tools&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;SSE (Server-Sent Events)

&lt;ul&gt;
&lt;li&gt;Used for remote servers over HTTP&lt;/li&gt;
&lt;li&gt;Server pushes updates to the client&lt;/li&gt;
&lt;li&gt;Good for web-based integrations&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Custom transports can also be implemented

&lt;ul&gt;
&lt;li&gt;The protocol is designed to be transport-agnostic&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h5&gt;
  
  
  Flow
&lt;/h5&gt;

&lt;p&gt;When a user asks Claude (the MCP client) something that an MCP server exposes, the client can call the server to retrieve data or execute tools, then use that information in its response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F753jtozj6702t5mhezsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F753jtozj6702t5mhezsw.png" alt="MCP Overview" width="337" height="648"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Server 1: First Crack Detector MCP
&lt;/h2&gt;

&lt;p&gt;In this section, we will briefly cover how the detector we have trained in the previous article is exposed as an MCP Server.&lt;/p&gt;

&lt;p&gt;The following diagram illustrates how the components are exposed as an MCP Server:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgvfjy6afzrvr0dzlnze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgvfjy6afzrvr0dzlnze.png" alt="First Crack Detection MCP Server" width="441" height="753"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Details
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt; MCP SDK setup: Server initialisation using standard Python MCP SDK with stdio and SSE transports.&lt;/li&gt;
&lt;li&gt; Tool definitions: start_detection, stop_detection, get_status with Auth0 role-based authorisation.&lt;/li&gt;
&lt;li&gt; Session management: Thread-safe singleton pattern with idempotency enforcement.&lt;/li&gt;
&lt;li&gt; Real-time monitoring: Streaming detection events via SSE for live status updates.&lt;/li&gt;
&lt;li&gt; Error handling: Audio device enumeration failures, model loading issues, thread crashes, timeout scenarios.&lt;/li&gt;
&lt;/ul&gt;
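
&lt;p&gt;The thread-safe, idempotent session management mentioned above follows a pattern roughly like this (class and field names are assumptions, not the actual server code):&lt;/p&gt;

```python
import threading

# Illustrative sketch of the "thread-safe singleton with idempotency
# enforcement" pattern; the names here are assumptions, not the server's
# real API.

class DetectionSessionManager:
    def __init__(self):
        self._lock = threading.Lock()
        self._active = False

    def start_session(self, audio_config):
        with self._lock:
            if self._active:                 # idempotent: a second start is a no-op
                return {"status": "already_running"}
            self._active = True
            return {"status": "started", "config": audio_config}

    def stop_session(self):
        with self._lock:
            was_active, self._active = self._active, False
            return {"status": "stopped" if was_active else "not_running"}
```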

&lt;h3&gt;
  
  
  Code Walkthrough
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Key implementation components:
# - MCP SDK decorators for tool registration
# - Auth0 JWT validation middleware
# - Session manager with thread-safe state
# - OpenTelemetry tracing integration
&lt;/span&gt;
&lt;span class="c1"&gt;# First Crack Detection MCP Server setup
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TextContent&lt;/span&gt;

&lt;span class="n"&gt;mcp_server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first-crack-detection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@mcp_server.list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_first_crack_detection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Start monitoring audio for first crack events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;inputSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio_source_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usb_microphone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;builtin_microphone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio_source_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@mcp_server.call_tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_first_crack_detection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Thread-safe session management
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;audio_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AudioConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Testing Approach
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt; Custom test scripts: Python scripts using stdio communication (test_mcp_roaster.py)&lt;/li&gt;
&lt;li&gt;  Shell integration tests: Bash scripts for end-to-end workflows (test_roaster_server.sh)&lt;/li&gt;
&lt;li&gt; Unit test coverage: 86 passing tests with pytest, mocking audio devices and model inference&lt;/li&gt;
&lt;li&gt; Manual hardware testing: Real USB microphone validation with live roasting sessions&lt;/li&gt;
&lt;/ul&gt;
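
&lt;p&gt;The mocked-device unit tests have roughly this shape (&lt;code&gt;start_detection&lt;/code&gt; here is a simplified stand-in, not the project's real API):&lt;/p&gt;

```python
from unittest import mock

# Hypothetical shape of the device-mocking tests; start_detection and the
# device enumerator are simplified stand-ins for illustration.

def start_detection(list_devices):
    devices = list_devices()
    if not devices:
        return {"status": "error", "reason": "no audio devices found"}
    return {"status": "started", "device": devices[0]}

def test_start_detection_without_devices():
    fake_enumerate = mock.Mock(return_value=[])   # simulate no microphones present
    result = start_detection(fake_enumerate)
    assert result["status"] == "error"

def test_start_detection_with_usb_microphone():
    fake_enumerate = mock.Mock(return_value=["USB Audio Device"])
    assert start_detection(fake_enumerate)["device"] == "USB Audio Device"
```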

&lt;h2&gt;
  
  
  Server 2: Hottop Roaster Controller MCP
&lt;/h2&gt;

&lt;p&gt;The second MCP server exposes the roaster's status and runs commands to set heat and fan, as well as start/stop commands for the roast.&lt;/p&gt;

&lt;p&gt;This is implemented as a separate MCP server because its hardware requirements differ, which might mean running the two servers on different hosts.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hottop KN-8828B-2K+ Protocol
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial approach with pyhottop library&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Initially, we attempted to use the &lt;a href="https://github.com/splitkeycoffee/pyhottop" rel="noopener noreferrer"&gt;&lt;code&gt;pyhottop&lt;/code&gt;&lt;/a&gt; library for serial communication with the Hottop KN-8828B-2K+ roaster. However, we encountered compatibility issues that prevented reliable operation and had to consider an alternative approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adapting to Artisan's protocol&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After &lt;code&gt;pyhottop&lt;/code&gt; proved unreliable, we analysed the source code of the &lt;a href="https://artisan-roasterscope.blogspot.com/" rel="noopener noreferrer"&gt;Artisan roasting software&lt;/a&gt;, a mature, widely used open-source application for coffee roaster control. As a user of Artisan's Hottop integration, I knew it worked well, and since it has been battle-tested by the roasting community for years, it was an obvious next choice.&lt;/p&gt;

&lt;p&gt;Warp Agent Mode successfully analysed and adapted Artisan's serial protocol implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecmewzo4la7ovyuf1lwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecmewzo4la7ovyuf1lwo.png" alt="Roaster MCP Server" width="447" height="994"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Details
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Serial connection management&lt;/strong&gt;: USB serial at 115200 baud, continuous 0.3s command intervals (required by Hottop)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Command encoding&lt;/strong&gt;: Artisan-compatible 36-byte protocol with checksums&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status parsing&lt;/strong&gt;: Real-time temperature readings (bean + chamber) from serial responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input validation&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;Heat/fan values: 0-100% in 10% increments&lt;/li&gt;
&lt;li&gt;Connection state checks before commands&lt;/li&gt;
&lt;li&gt;Thread-safe state management with locks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Authentication&lt;/strong&gt;: Auth0 JWT with role-based access control

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;read:roaster&lt;/code&gt; - Status monitoring only&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write:roaster&lt;/code&gt; - Full hardware control&lt;/li&gt;
&lt;li&gt;Per-user audit logging&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
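
&lt;p&gt;The command encoding boils down to a fixed-length frame with a trailing checksum. The header bytes and field offsets below are assumptions for illustration; only the overall shape (36-byte frame, checksum byte, 10% increments) mirrors the description above:&lt;/p&gt;

```python
# Sketch of the 36-byte command-frame pattern. The header values and the
# heat/fan offsets are assumptions for illustration, not the verified
# Artisan protocol layout.

def encode_command(heat, fan):
    if heat % 10 or fan % 10 or heat not in range(0, 101) or fan not in range(0, 101):
        raise ValueError("heat/fan must be 0-100 in 10% increments")
    frame = bytearray(36)
    frame[0], frame[1] = 0xA5, 0x96        # assumed header bytes
    frame[10] = heat // 10                 # assumed heat field (0-10)
    frame[11] = fan // 10                  # assumed fan field (0-10)
    frame[35] = sum(frame[:35]) % 256      # trailing checksum over bytes 0-34
    return bytes(frame)
```

&lt;p&gt;Validating the 10% increments before encoding keeps malformed agent requests from ever reaching the serial port.&lt;/p&gt;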

&lt;h3&gt;
  
  
  Code Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Key implementation layers:
# 1. MCP Server (sse_server.py) - Auth0 + tool definitions
# 2. SessionManager - Thread-safe orchestration
# 3. HardwareInterface - Artisan serial protocol
# 4. Continuous command loop - 0.3s intervals with temperature polling
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjrybrvcvuwn3pw9itfm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjrybrvcvuwn3pw9itfm.png" alt="Roaster MCP Server Overview" width="712" height="1316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Testing Strategy
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt; MockRoaster: Realistic thermal simulation for development without hardware. Unfortunately, due to time constraints, it provided limited utility.&lt;/li&gt;
&lt;li&gt; Hardware verification: Validated with physical Hottop KN-8828B-2K+ (October 2025).&lt;/li&gt;
&lt;li&gt; Test coverage: 122 passing unit tests.&lt;/li&gt;
&lt;li&gt; Manual test scripts: test_hottop_interactive.py, test_hottop_auto.py.&lt;/li&gt;
&lt;li&gt; Integration tests: SSE transport, Auth0 authentication, command sequences.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Hardware verification
&lt;/h4&gt;

&lt;p&gt;The implementation was verified with physical Hottop KN-8828B-2K+ hardware on October 25, 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drum motor control&lt;/li&gt;
&lt;li&gt;Heat control (0-100%)&lt;/li&gt;
&lt;li&gt;Fan control (0-100%)&lt;/li&gt;
&lt;li&gt;Bean drop sequence&lt;/li&gt;
&lt;li&gt;Cooling system&lt;/li&gt;
&lt;li&gt;Continuous temperature readings (Bean &amp;amp; Chamber)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security and Safety Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Transport and Security Architecture
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Why SSE (Server-Sent Events)?
&lt;/h4&gt;

&lt;p&gt;Although both MCP servers currently run on the same machine as the agent &lt;em&gt;(hint hint: stdio transport would be simpler)&lt;/em&gt;, we designed for a distributed architecture from the start. The plan is to eventually deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Roasting MCP servers close to the hardware - running on the machine physically connected to the roaster and microphone.&lt;/li&gt;
&lt;li&gt;The agent on a separate device - a cloud server or different local machine for orchestration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach means we could potentially use a low-powered computer (such as a Raspberry Pi) for the servers, making the setup easier to plug in and start instead of having to use a laptop next to the roaster every time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSE benefits: Real-time streaming, works over standard HTTP/HTTPS, firewall-friendly&lt;/li&gt;
&lt;li&gt;MCP compatibility: Follows MCP specification for HTTP+SSE transport&lt;/li&gt;
&lt;li&gt;Future-proof: Easy transition from localhost to remote deployment&lt;/li&gt;
&lt;/ul&gt;
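&lt;p&gt;For context, SSE is just a long-lived HTTP response whose body is a stream of text frames: &lt;code&gt;event:&lt;/code&gt; and &lt;code&gt;data:&lt;/code&gt; lines terminated by a blank line. A simplified parser sketch, which ignores the spec's &lt;code&gt;id:&lt;/code&gt; and &lt;code&gt;retry:&lt;/code&gt; fields and CR handling:&lt;/p&gt;

```python
def parse_sse(stream_text):
    """Parse a Server-Sent Events stream (already decoded to text) into events."""
    events = []
    event = {"event": "message", "data": []}
    for line in stream_text.split("\n"):
        if line == "":
            # blank line dispatches the accumulated event
            if event["data"]:
                events.append({"event": event["event"],
                               "data": "\n".join(event["data"])})
            event = {"event": "message", "data": []}
        elif line.startswith("data:"):
            event["data"].append(line[5:].lstrip(" "))
        elif line.startswith("event:"):
            event["event"] = line[6:].lstrip(" ")
        elif line.startswith(":"):
            pass  # comment / keep-alive line, ignored
    return events
```

&lt;p&gt;Because the frames are plain text over a normal HTTP response, the same stream works through proxies and firewalls that would block custom protocols.&lt;/p&gt;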

&lt;h4&gt;
  
  
  Authentication and Authorisation with Auth0
&lt;/h4&gt;

&lt;p&gt;Once MCP servers are exposed over the Internet / network, security becomes critical—especially for hardware control. This section briefly covers our approach for integrating Auth0 for authentication and authorisation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Auth0?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ease of integration: Well-documented SDKs and middleware&lt;/li&gt;
&lt;li&gt;OAuth 2.0 Client Credentials: Perfect for machine-to-machine authentication&lt;/li&gt;
&lt;li&gt;Role-based access control: Granular permissions via scopes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bonus&lt;/strong&gt;: Auth0 MCP Server for Warp Agent Mode to perform configuration tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security implementation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT validation: Every MCP request validates Auth0 JWT tokens&lt;/li&gt;
&lt;li&gt;Scope-based authorization:

&lt;ul&gt;
&lt;li&gt;read:roaster - Status monitoring only (observer role)&lt;/li&gt;
&lt;li&gt;write:roaster - Full hardware control (operator role)&lt;/li&gt;
&lt;li&gt;admin:roaster - Administrative functions (future)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Token expiration: JWTs expire, requiring regular re-authentication&lt;/li&gt;

&lt;/ul&gt;
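&lt;p&gt;To make the scope check concrete, here is a minimal sketch of extracting scopes from a JWT payload. Note that it deliberately skips signature verification for brevity; per the JWT validation step above, a real server must first validate the RS256 signature against Auth0's JWKS:&lt;/p&gt;

```python
import base64
import json

def decode_claims(jwt_token):
    """Decode the payload segment of a JWT WITHOUT verifying the signature.
    Illustration only: a real server verifies the signature first."""
    payload_b64 = jwt_token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))

def require_scope(claims, needed):
    """Scope check in the style of read:roaster / write:roaster permissions."""
    granted = set(claims.get("scope", "").split())
    if needed not in granted:
        raise PermissionError("missing scope: " + needed)
```

&lt;p&gt;A read-only observer token would then pass &lt;code&gt;require_scope(claims, "read:roaster")&lt;/code&gt; but fail the &lt;code&gt;write:roaster&lt;/code&gt; check before any hardware command is issued.&lt;/p&gt;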

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfnkkc84t5n9vlzh3f0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfnkkc84t5n9vlzh3f0u.png" alt="Authorisation overview" width="800" height="899"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This architecture ensures that only authorised clients can control the roaster, with full traceability of who did what and when—essential for safety-critical hardware operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;It has been a fun side project seeing how far Warp Agent Mode (Coding Agent) and emerging MCP servers for developer documentation (Context7) and service access (Auth0 MCP Server) have come in terms of speeding up the development process. &lt;/p&gt;

&lt;p&gt;It is still necessary to have a clear architecture, requirements and a final picture before starting. When these are used in conjunction with agentic development tools, tasks that could take several days can be completed in a day or so.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Worked Well and Lessons Learned
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standard MCP SDK&lt;/strong&gt;: Provided solid foundation for both stdio and SSE transports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE transport&lt;/strong&gt;: Future-proofed the architecture for distributed deployment.

&lt;ul&gt;
&lt;li&gt;Tested working with N8N, LangFlow and Python based local agents.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Auth0 integration&lt;/strong&gt;: Straightforward OAuth 2.0 implementation with role-based access control.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Warp Agent Mode assistance&lt;/strong&gt;: Accelerated MCP protocol understanding, test generation, and Auth0 middleware implementation.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Context7 and Auth0 MCP servers&lt;/strong&gt;: Using MCP servers to build MCP servers (via Warp) streamlined development.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;pyhottop library&lt;/strong&gt;: Initial attempt failed, but pivoting to Artisan's proven protocol worked as expected.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;MockRoaster simulation&lt;/strong&gt;: Due to time restrictions, this was not explored sufficiently, and we ended up testing manually using local agents.

&lt;ul&gt;
&lt;li&gt;This needs to be revisited in the future, especially to be able to test the agent loop.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  MCP-Specific Insights
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transport choice matters&lt;/strong&gt;: SSE enables remote deployment but requires careful auth implementation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool vs resource patterns&lt;/strong&gt;: Tools for actions (hardware control), resources for data streams (status monitoring) so far working well with agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency is critical&lt;/strong&gt;: Start/stop commands must be safe to call multiple times.

&lt;ul&gt;
&lt;li&gt;Initial attempts caused the hardware to go into a start/stop loop and required manual intervention.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Scope-based authorization&lt;/strong&gt;: Fine-grained permissions (read:roaster vs write:roaster) essential for hardware safety.

&lt;ul&gt;
&lt;li&gt;Although only a single client is used in the current integration mode.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Observability&lt;/strong&gt;: Using OpenTelemetry helped troubleshoot issues quickly.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Testing strategies&lt;/strong&gt;: Combination of unit tests (208 total), MockRoaster simulation, and manual hardware validation.&lt;/li&gt;

&lt;/ul&gt;
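&lt;p&gt;The idempotency lesson above can be sketched as a simple state guard: repeated start/stop calls become no-ops rather than duplicate serial commands. The names below are illustrative, not the project's actual classes:&lt;/p&gt;

```python
class RoasterSession:
    """Illustrative guard making start/stop safe to call repeatedly."""
    def __init__(self, hw):
        self.hw = hw
        self.running = False

    def start_roaster(self):
        if self.running:
            return "already running"   # no duplicate serial command sent
        self.hw.start_drum()
        self.running = True
        return "started"

    def stop_roaster(self):
        if not self.running:
            return "already stopped"
        self.hw.stop_drum()
        self.hw.set_heat(0)            # heat off on stop, for safety
        self.running = False
        return "stopped"
```

&lt;p&gt;With this guard in place, an agent that retries a tool call (or two agents racing) cannot push the hardware into the start/stop loop described above.&lt;/p&gt;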

&lt;h2&gt;
  
  
  Real-World Usage Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Warp Agent with Roaster Control (initial testing)
&lt;/h3&gt;

&lt;p&gt;The following are sample prompts intended for testing the MCP server from Warp.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Start the roaster"
Warp Agent Mode: [calls roaster_control MCP: start_roaster()]
         "Roaster started. Ready to add beans."

User: "Check the roaster status"
Warp Agent Mode: [calls roaster_control MCP: read_roaster_status()]
         "Status:
         • Drum: Running
         • Bean temp: 172°C
         • Chamber temp: 185°C
         • Heat: 0%
         • Fan: 0%"

User: "Stop the drum"
Warp Agent Mode: [calls roaster_control MCP: stop_roaster()]
         "Roaster drum stopped. Heat automatically turned off for safety."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next: Part 3 Preview
&lt;/h2&gt;

&lt;p&gt;Now that we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A trained first crack detection model (Part 1).&lt;/li&gt;
&lt;li&gt;Two MCP servers with hardware control + ML inference (Part 2).&lt;/li&gt;
&lt;li&gt;Auth0 for security and SSE transport for distributed deployment (Part 2).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;strong&gt;Part 3: Building an Autonomous Roasting Agent&lt;/strong&gt;, we'll bring it all together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comparing multiple agent frameworks and assessing their suitability for such long-running workflows.

&lt;ul&gt;
&lt;li&gt;N8N&lt;/li&gt;
&lt;li&gt;LangFlow&lt;/li&gt;
&lt;li&gt;Python based&lt;/li&gt;
&lt;li&gt;Using Microsoft Agent Framework&lt;/li&gt;
&lt;li&gt;Using OpenAI Python SDK&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Real-time decision making&lt;/strong&gt;: Agent analyses temperature trends, RoR, and first crack to adjust roast.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Extending OpenTelemetry observability&lt;/strong&gt;: Distributed tracing across agent + MCP servers + hardware.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Safety systems&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;Temperature bounds monitoring.&lt;/li&gt;
&lt;li&gt;Emergency stop on anomalies.&lt;/li&gt;
&lt;li&gt;Human override via UI.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Full end-to-end test&lt;/strong&gt;: Press start -&amp;gt; Preheat -&amp;gt; Add beans -&amp;gt; Hands off -&amp;gt; First crack -&amp;gt; Development -&amp;gt; Drop -&amp;gt; (hope for) perfect roast :)&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The goal&lt;/strong&gt;: An AI that roasts coffee consistently, safely, and (hopefully) better than manual control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MCP &amp;amp; Standards:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol Spec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/modelcontextprotocol/python-sdk" rel="noopener noreferrer"&gt;Official MCP Python SDK&lt;/a&gt; - Used in this project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Project Code:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/syamaner/bean-agent" rel="noopener noreferrer"&gt;Coffee Roasting Repository&lt;/a&gt; - Complete source code&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/syamaner/bean-agent/tree/main/src/mcp_servers/first_crack_detection" rel="noopener noreferrer"&gt;First Crack Detector MCP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/syamaner/bean-agent/tree/main/src/mcp_servers/roaster_control" rel="noopener noreferrer"&gt;Roaster Control MCP&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hardware &amp;amp; Serial Communication:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://artisan-roasterscope.blogspot.com/" rel="noopener noreferrer"&gt;Artisan Roaster Scope&lt;/a&gt; - Source of Hottop protocol implementation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pyserial.readthedocs.io/" rel="noopener noreferrer"&gt;pyserial Documentation&lt;/a&gt; - USB serial communication&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.hottopamericas.com/KN-8828B-2Kplus.html" rel="noopener noreferrer"&gt;Hottop KN-8828B-2K+&lt;/a&gt; - Roaster hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Authentication:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://auth0.com/docs" rel="noopener noreferrer"&gt;Auth0 Documentation&lt;/a&gt; - Identity provider&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://auth0.com/docs/get-started/authentication-and-authorization-flow/client-credentials-flow" rel="noopener noreferrer"&gt;OAuth 2.0 Client Credentials&lt;/a&gt; - Machine-to-machine auth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://warp.dev" rel="noopener noreferrer"&gt;Warp Terminal&lt;/a&gt; - AI-assisted development environment&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://auth0.com/docs/get-started/auth0-mcp-server" rel="noopener noreferrer"&gt;Auth0 MCP Server&lt;/a&gt; - Used via Warp for Auth0 configuration&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.warp.dev/knowledge-and-collaboration/mcp" rel="noopener noreferrer"&gt;Warp MCP Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.warp.dev/university/mcp/using-context7-mcp-server" rel="noopener noreferrer"&gt;Context7 MCP Server&lt;/a&gt; - Used for documentation lookup during development&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mcp</category>
      <category>agents</category>
      <category>ai</category>
      <category>warpdev</category>
    </item>
    <item>
      <title>Part 1: Training a Neural Network to Detect Coffee First Crack from Audio - An Agentic Development Journey with Warp Agent</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Mon, 27 Oct 2025 20:59:05 +0000</pubDate>
      <link>https://dev.to/syamaner/part-1-training-a-neural-network-to-detect-coffee-first-crack-from-audio-an-agentic-development-1jei</link>
      <guid>https://dev.to/syamaner/part-1-training-a-neural-network-to-detect-coffee-first-crack-from-audio-an-agentic-development-1jei</guid>
      <description>&lt;p&gt;When it comes to coffee, everyone has their preferences. I usually prefer smooth, naturally sweet coffee with nice fragrance - no bitter or smoky flavours.&lt;/p&gt;

&lt;p&gt;There is a challenge though: achieving that perfect roast at home requires split-second timing. Miss the "first crack" by 30 seconds? You've got bitter, over-roasted beans. Finish the roast early? Enjoy your grassy / earthy tasting coffee.&lt;/p&gt;

&lt;p&gt;This post is about teaching a neural network to detect that critical moment from audio alone.&lt;/p&gt;

&lt;p&gt;While home roasting has been a niche hobby, in recent years more options have become available for roasting coffee at home. These devices usually have a smaller capacity (~250-500g) and are compact and lightweight enough to run on a counter.&lt;/p&gt;

&lt;p&gt;To achieve my desired roast level I generally aim for a light / medium roast, which requires the development phase to be about 10% - 15% of the roast time. The development phase is the duration from the start of first crack until the end of the roast, when the beans are ejected from the roaster.&lt;/p&gt;

&lt;p&gt;First crack is the audible popping sound that occurs when coffee beans rapidly expand and release moisture and CO2 due to the buildup of internal pressure during roasting. Many light roast profiles end just after first crack begins, while medium roasts continue for 1-3 minutes beyond this point. On my setup, first crack typically begins around 170°C-180°C, and I aim to finish the roast at approximately 195°C. This gives me 1-3 minutes of development time after first crack starts. This value is based on my observations on a Hottop KN8828B-2K+ home roaster. &lt;/p&gt;

&lt;p&gt;Detecting the First Crack event is important for the end goal as we need to adjust heat and fan from that point to slow down the roast and stretch the development phase.&lt;/p&gt;
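&lt;p&gt;The development-phase ratio mentioned above is simple arithmetic over two timestamps; a small helper for clarity:&lt;/p&gt;

```python
def development_ratio(first_crack_s, drop_s):
    """Fraction of total roast time spent in development (first crack to drop)."""
    return (drop_s - first_crack_s) / drop_s

# e.g. first crack at 9:00 into an 11:00 roast:
# development_ratio(540, 660) is about 0.18 (18%)
```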

&lt;p&gt;The current series of posts will cover the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 1: Training a Neural Network to Detect Coffee First Crack from Audio - An Agentic Development Journey&lt;/li&gt;
&lt;li&gt;Part 2: Building an MCP server to control a home coffee roaster&lt;/li&gt;
&lt;li&gt;Part 3: Building a Coffee roasting Agent with Aspire to automate coffee roasting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have been recording coffee roasting audio during the summer and have been looking into fine-tuning an existing model so that I can train and run inference on an ARM-based laptop. The task is binary classification on an audio stream: did first crack happen in the sample or not? A common baseline is random guessing (a coin toss), which yields 50% accuracy for any binary classification problem. Our goal is to beat this with the minimal data available for fine-tuning.&lt;/p&gt;

&lt;p&gt;My initial objective has been utilising a pre-trained AST (Audio Spectrogram Transformer) model from Hugging Face that was originally trained on AudioSet and fine-tuning it for the first crack vs no first crack binary classification task. In this approach, the model architecture remains the same, but we update the weights through training on our coffee roasting audio data.&lt;/p&gt;
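&lt;p&gt;For intuition about what the AST model consumes: the audio is converted to a (log-mel) spectrogram before it reaches the transformer. The numpy sketch below shows only the framing / FFT / log step; the actual pipeline would use Hugging Face's AST feature extractor, which also applies a 128-band mel filterbank:&lt;/p&gt;

```python
import numpy as np

def log_spectrogram(audio, win=400, hop=160):
    """Frame the waveform, apply a Hann window, FFT each frame, take log magnitude.
    Assumes len(audio) is at least `win` samples."""
    n_frames = 1 + max(0, len(audio) - win) // hop
    window = np.hanning(win)
    frames = np.stack([audio[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + 1e-10)   # shape: (n_frames, win // 2 + 1)
```

&lt;p&gt;The short pops of first crack show up as broadband vertical streaks in this time-frequency picture, which is exactly the kind of pattern an image-style transformer can learn.&lt;/p&gt;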

&lt;p&gt;To tackle this challenge systematically, I decided to leverage modern development tools and adopted an AI-first development approach. In the next section, the details of the setup will be discussed.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;📊 TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Problem&lt;/strong&gt;: Detect coffee "first crack" from audio to optimize roast profiles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Fine-tune MIT's AST model on 9 recording sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Results&lt;/strong&gt;: 93.3% accuracy, 0.986 ROC-AUC with minimal data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt;: Warp Agent, Label Studio, PyTorch, Hugging Face&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next&lt;/strong&gt;: Part 2 builds MCP servers, Part 3 creates autonomous roasting agent&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Automated First Crack Detection?
&lt;/h2&gt;

&lt;p&gt;Manual first crack detection requires constant attention during a 10-12 minute roast. Environmental factors (noisy extractors, ambient sounds) can mask the cracks and pops. &lt;/p&gt;

&lt;p&gt;This project aims to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free the roaster to multitask during the roast&lt;/li&gt;
&lt;li&gt;Provide consistent detection regardless of ambient noise&lt;/li&gt;
&lt;li&gt;Enable data-driven roast profile development&lt;/li&gt;
&lt;li&gt;Lay groundwork for fully autonomous roasting (Part 3)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🤖 Warp Agent Contributions
&lt;/h3&gt;

&lt;p&gt;Throughout development, Warp's Agent Mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Suggested Label Studio over manual Audacity annotation&lt;/li&gt;
&lt;li&gt;Generated data preprocessing pipeline architecture&lt;/li&gt;
&lt;li&gt;Created train/validation/test split logic&lt;/li&gt;
&lt;li&gt;Debugged overfitting with annotation strategy advice&lt;/li&gt;
&lt;li&gt;Auto-generated evaluation scripts&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting Up the Development Environment with Warp Agent
&lt;/h2&gt;

&lt;p&gt;Having used Warp Agent Mode at work for the past few months and seen how it transformed my development flow, it was a natural choice for this project.&lt;/p&gt;

&lt;p&gt;I started by creating a &lt;a href="https://github.com/syamaner/bean-agent/blob/main/README.md" rel="noopener noreferrer"&gt;readme file&lt;/a&gt; and shared my starting requirements and setup. I included links to the tutorials of interest, the libraries I intended to use and the model I wanted to fine-tune. &lt;/p&gt;

&lt;p&gt;Warp's Agent Mode helped me structure the project, suggest tools, and iterate on the implementation approach from training scripts, evaluation to inference and manual testing scripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Evolution and Documentation
&lt;/h2&gt;

&lt;p&gt;The readme above was pretty much all I shared with Warp and then asked to focus on Phase 1 and create an &lt;a href="https://github.com/syamaner/bean-agent/blob/main/PHASE1_PLAN.md" rel="noopener noreferrer"&gt;implementation plan for Phase 1&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I had lost the recordings I made over the summer and therefore had to start with minimal data - only 4 recording sessions of ~10 minutes each. This was enough to build the initial workflow with Warp. &lt;/p&gt;

&lt;h2&gt;
  
  
  Data Collection Strategy
&lt;/h2&gt;

&lt;p&gt;For data collection, I used a USB microphone pointed at the roaster and recorded each roasting session. A session takes about 10 - 12 minutes. At the time of starting, I only had 4 recording sessions available. &lt;/p&gt;

&lt;p&gt;Recordings have the following properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sample rate: 44.1kHz (recommended for compatibility)&lt;/li&gt;
&lt;li&gt;Format: WAV (uncompressed)&lt;/li&gt;
&lt;li&gt;Bit depth: 16-bit minimum&lt;/li&gt;
&lt;li&gt;Channels: Mono sufficient&lt;/li&gt;
&lt;li&gt;Recording duration: Full roast cycle (10-15 minutes)&lt;/li&gt;
&lt;/ul&gt;
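&lt;p&gt;These properties can be verified programmatically with Python's standard &lt;code&gt;wave&lt;/code&gt; module; a small check along these lines (the function name is illustrative):&lt;/p&gt;

```python
import wave

def check_recording(path):
    """Verify a session recording matches the expected capture settings."""
    with wave.open(path, "rb") as w:
        props = {
            "sample_rate": w.getframerate(),
            "channels": w.getnchannels(),
            "bit_depth": w.getsampwidth() * 8,
            "duration_s": w.getnframes() / w.getframerate(),
        }
    assert props["sample_rate"] == 44100, "expected 44.1 kHz"
    assert props["channels"] == 1, "mono is sufficient"
    assert props["bit_depth"] in (16, 24, 32), "16-bit minimum"
    return props
```

&lt;p&gt;Running this over every new session catches a mis-configured microphone before any time is spent annotating the file.&lt;/p&gt;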

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9whxv14zk5lmdb0la5pp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9whxv14zk5lmdb0la5pp.jpg" alt="Roasting session recording" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Annotation and Labeling
&lt;/h2&gt;

&lt;p&gt;When I started, I was intending to do manual annotation using Audacity, a free and open-source audio editor and recording application. However, Warp Agent pointed me towards Label Studio, provided the configuration snippets, and described how to use it.&lt;/p&gt;

&lt;p&gt;With the initial 4 recordings, I used sparse labels and proceeded to training and evaluation. This led to overfitting, and the results were not reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initial Results with Sparse Labeling
&lt;/h3&gt;

&lt;p&gt;With only 4 recording sessions and sparse annotation (marking only obvious first crack events), the model showed signs of overfitting:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Validation Accuracy&lt;/td&gt;
&lt;td&gt;100% (epochs 2-7)&lt;/td&gt;
&lt;td&gt;Perfect scores = memorisation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training Accuracy&lt;/td&gt;
&lt;td&gt;100% (epochs 3-7)&lt;/td&gt;
&lt;td&gt;No learning after epoch 3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test Precision&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;High false positive rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Class Imbalance&lt;/td&gt;
&lt;td&gt;15% / 85%&lt;/td&gt;
&lt;td&gt;Severe imbalance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: The model memorised the limited training data rather than learning generalisable acoustic features of first crack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution&lt;/strong&gt;: Expanding to 9 sessions with balanced annotation (equal first_crack and no_first_crack samples) dramatically improved precision from 75% → 95.2% while maintaining excellent recall.&lt;/p&gt;

&lt;p&gt;Once I had increased the recordings to 9, I spent more time annotating and aimed at building a balanced dataset with enough samples for first crack and no first crack. Each annotated sample was 3 - 6 seconds. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2495dlcmnjc0snqkrt6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2495dlcmnjc0snqkrt6r.png" alt="annotation example" width="720" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Warp Agent Mode also provided the configuration snippet for Label Studio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;View&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;Header&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"Coffee Roast First Crack Detection"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;Text&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"instructions"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"Listen to the audio and mark regions where first crack occurs. Mark other regions as no_first_crack."&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;Audio&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"audio"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"$audio"&lt;/span&gt; &lt;span class="na"&gt;zoom=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt; &lt;span class="na"&gt;speed=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;Labels&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"label"&lt;/span&gt; &lt;span class="na"&gt;toName=&lt;/span&gt;&lt;span class="s"&gt;"audio"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;Label&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"no_first_crack"&lt;/span&gt; &lt;span class="na"&gt;background=&lt;/span&gt;&lt;span class="s"&gt;"#3498db"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;Label&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"first_crack"&lt;/span&gt; &lt;span class="na"&gt;background=&lt;/span&gt;&lt;span class="s"&gt;"#e74c3c"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/Labels&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;TextArea&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"notes"&lt;/span&gt; &lt;span class="na"&gt;toName=&lt;/span&gt;&lt;span class="s"&gt;"audio"&lt;/span&gt; 
            &lt;span class="na"&gt;placeholder=&lt;/span&gt;&lt;span class="s"&gt;"Optional notes about this region (e.g., 'very clear pops', 'subtle', etc.)"&lt;/span&gt;
            &lt;span class="na"&gt;editable=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/View&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data Preprocessing Pipeline
&lt;/h2&gt;

&lt;p&gt;Coffee roasting is driven by many variables, ranging from ambient temperature to the bean type, machine and heating type. A basic electric roaster like the one used here is slow to respond to control changes, as the heating element needs to warm up and cool down depending on the command. Accurately identifying the current phase of the roast is crucial, and this can be done by audio analysis, visual inspection, or a combination of time and temperature, with varying degrees of success. In my manual roasts, I have recently been getting better results by adjusting the parameters once first crack is reached, and therefore decided to fine-tune a model to detect it. &lt;/p&gt;

&lt;p&gt;So, given we have a microphone pointing at the roaster during the roasting process and a relatively controlled environment, how do we get the recording and convert it into the format needed to support our fine-tuning process?&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Raw audio files are captured from multiple roasting sessions of varying length.

&lt;ul&gt;
&lt;li&gt;Additionally, ~10 previously recorded sessions were lost accidentally.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;First crack events are sparse. They occur in roughly 12-25% of the whole duration, and they are not continuous.

&lt;ul&gt;
&lt;li&gt;This leads to an imbalance in samples.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;We need a workflow and a pipeline to process these and end up with balanced datasets for training, evaluation and test.&lt;/li&gt;

&lt;li&gt;At the beginning we also had a limited number of sessions recorded (9 at the time of writing).&lt;/li&gt;

&lt;li&gt;Labelling should be easy and repeatable to avoid user errors.&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Labelling Process
&lt;/h3&gt;

&lt;p&gt;While the fine-tuning approach and the base model were specified to Warp, Label Studio was not in the original requirements. Warp not only recommended using Label Studio but also provided detailed steps for running it, configuring it and getting going. These worked out of the box.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                    Label Studio (Web UI)                        │
│          Manually annotate audio files                          │
│          Mark "first crack" / "not first crack" time regions    |
└────────────────────────┬────────────────────────────────────────┘
                         │
                         │ Export JSON
                         ▼
        📄 project-1-at-2025-10-18-20-44-9bc9cd1d.json
                         │
                         │
    ╔════════════════════▼═══════════════════════════════════════╗
    ║  STEP 1: convert_labelstudio_export.py                     ║
    ║  • Strip hash prefixes from filenames                      ║
    ║  • Compute audio durations                                 ║
    ║  • Extract labeled time regions from the raw files         ║
    ║  • Output one JSON per audio file                          ║
    ╚════════════════════╦═══════════════════════════════════════╝
                         │
                         ▼
              📁 data/labels/
              ├── roast-1.json
              ├── roast-2.json
              └── roast-3.json
                         │
                         │
    ╔════════════════════▼═══════════════════════════════════════╗
    ║  STEP 2: audio_processor.py                                ║
    ║  • Read annotation JSONs                                   ║
    ║  • Load raw audio files (44.1kHz mono)                     ║
    ║  • Extract time segments (start→end)                       ║
    ║  • Save chunks as WAV files by label                       ║
    ║  • Generate processing_summary.md                          ║
    ╚════════════════════╦═══════════════════════════════════════╝
                         │
                         ▼
              📁 data/processed/
              ├── first_crack/
              │   ├── roast-1_chunk_000.wav
              │   └── roast-1_chunk_001.wav
              └── no_first_crack/
                  ├── roast-1_chunk_002.wav
                  └── roast-2_chunk_000.wav
                         │
                         │
    ╔════════════════════▼═══════════════════════════════════════╗
    ║  STEP 3: dataset_splitter.py                               ║
    ║  • Collect all chunks by label                             ║
    ║  • Train, validation and test split                        ║
    ║    (70% train, 15% val, 15% test)                          ║ 
    ║  • Copy files to split directories                         ║
    ║  • Generate split_report.md                                ║
    ╚════════════════════╦═══════════════════════════════════════╝
                         │
                         ▼
              📁 data/splits/
              ├── train/     (70%)
              │   ├── first_crack/
              │   └── no_first_crack/
              ├── val/       (15%)
              │   ├── first_crack/
              │   └── no_first_crack/
              └── test/      (15%)
                  ├── first_crack/
                  └── no_first_crack/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the steps above are complete, we are ready for training and evaluation.&lt;/p&gt;
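&lt;p&gt;The stratified split in step 3 boils down to shuffling each class independently and slicing by ratio. A minimal sketch of that logic (the function signature is illustrative, not the actual &lt;code&gt;dataset_splitter.py&lt;/code&gt; API; the real script also copies the files and writes &lt;code&gt;split_report.md&lt;/code&gt;):&lt;/p&gt;

```python
import random

def stratified_split(chunks_by_label, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle each label's chunks and split them 70/15/15, preserving class balance."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for label, files in chunks_by_label.items():
        files = sorted(files)
        rng.shuffle(files)
        n_train = round(len(files) * ratios[0])
        n_val = round(len(files) * ratios[1])
        splits["train"] += [(label, f) for f in files[:n_train]]
        splits["val"] += [(label, f) for f in files[n_train:n_train + n_val]]
        splits["test"] += [(label, f) for f in files[n_train + n_val:]]
    return splits
```

&lt;p&gt;Splitting per label (rather than shuffling the pooled dataset) is what keeps the first_crack / no_first_crack ratio nearly identical across train, val and test.&lt;/p&gt;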

&lt;h2&gt;
  
  
  Dataset Overview
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Total Samples:&lt;/strong&gt; 298 chunks from 9 roasting sessions&lt;/p&gt;

&lt;h3&gt;
  
  
  Overall Class Balance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Class&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Percentage&lt;/th&gt;
&lt;th&gt;Avg Duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;first_crack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;145&lt;/td&gt;
&lt;td&gt;48.7%&lt;/td&gt;
&lt;td&gt;4.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;no_first_crack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;153&lt;/td&gt;
&lt;td&gt;51.3%&lt;/td&gt;
&lt;td&gt;4.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Split Distribution
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Split&lt;/th&gt;
&lt;th&gt;Total Samples&lt;/th&gt;
&lt;th&gt;first_crack&lt;/th&gt;
&lt;th&gt;no_first_crack&lt;/th&gt;
&lt;th&gt;Split Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Train&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;208&lt;/td&gt;
&lt;td&gt;101 (48.6%)&lt;/td&gt;
&lt;td&gt;107 (51.4%)&lt;/td&gt;
&lt;td&gt;69.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Validation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;22 (48.9%)&lt;/td&gt;
&lt;td&gt;23 (51.1%)&lt;/td&gt;
&lt;td&gt;15.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Test&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;22 (48.9%)&lt;/td&gt;
&lt;td&gt;23 (51.1%)&lt;/td&gt;
&lt;td&gt;15.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Class Balance Across Splits
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Class&lt;/th&gt;
&lt;th&gt;Train&lt;/th&gt;
&lt;th&gt;Validation&lt;/th&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;first_crack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;145&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;no_first_crack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;107&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;153&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Per-Session Breakdown
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Recording Session&lt;/th&gt;
&lt;th&gt;first_crack&lt;/th&gt;
&lt;th&gt;no_first_crack&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Balance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;25-10-19_1103-costarica-hermosa-5&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;48.1% / 51.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25-10-19_1136-brazil-1&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;50.0% / 50.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25-10-19_1204-brazil-2&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;57.1% / 42.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25-10-19_1236-brazil-3&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;51.4% / 48.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25-10-19_1315-brazil4&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;51.7% / 48.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;roast-1-costarica-hermosa-hp-a&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;48.5% / 51.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;roast-2-costarica-hermosa-hp-a&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;45.7% / 54.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;roast-3-costarica-hermosa-hp-a&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;40.6% / 59.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;roast-4-costarica-hermosa-hp-a&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;44.1% / 55.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key Observations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nearly balanced dataset (48.7% vs 51.3%)&lt;/li&gt;
&lt;li&gt;Stratified split maintains balance across train/val/test&lt;/li&gt;
&lt;li&gt;9 recording sessions, mix of Costa Rica and Brazil beans&lt;/li&gt;
&lt;li&gt;Average chunk duration: 4.2 seconds&lt;/li&gt;
&lt;li&gt;Total annotated audio: ~21 minutes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Evaluation Metrics
&lt;/h2&gt;

&lt;p&gt;For this binary classification task, we use multiple metrics to evaluate the model performance:&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy
&lt;/h3&gt;

&lt;p&gt;The proportion of correct predictions (true positives and true negatives) among all predictions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Accuracy = (TP + TN) / (TP + TN + FP + FN)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This metric provides an overall sense of model correctness. However, accuracy alone can be misleading with imbalanced datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Precision
&lt;/h3&gt;

&lt;p&gt;Of all samples predicted as &lt;code&gt;first_crack&lt;/code&gt;, what proportion actually were first crack events?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Precision = TP / (TP + FP)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;High precision means fewer false alarms, which is critical when we don't want to adjust roaster settings prematurely based on incorrect detections.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recall (Sensitivity)
&lt;/h3&gt;

&lt;p&gt;Of all actual &lt;code&gt;first_crack&lt;/code&gt; events, what proportion did the model correctly identify?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Recall = TP / (TP + FN)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;High recall means we catch most first crack events. Missing first crack (false negative) is likely to result in over-roasting.&lt;/p&gt;

&lt;h3&gt;
  
  
  F1 Score
&lt;/h3&gt;

&lt;p&gt;The harmonic mean of precision and recall, providing a single balanced metric.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F1 = 2 × (Precision × Recall) / (Precision + Recall)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Balances precision and recall, which is useful when both false positives and false negatives are costly.&lt;br&gt;
In roasting terms, these errors could mean an under-roasted batch or an unintended dark roast, neither of which is desirable for this project.&lt;/p&gt;
&lt;h3&gt;
  
  
  ROC-AUC (Area Under the Receiver Operating Characteristic Curve)
&lt;/h3&gt;

&lt;p&gt;Measures the model's ability to distinguish between classes across all classification thresholds.&lt;/p&gt;
&lt;h3&gt;
  
  
  Confusion Matrix
&lt;/h3&gt;

&lt;p&gt;The confusion matrix visualises the model's predictions versus actual labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    Predicted
                    first_crack  no_first_crack
Actual  first_crack      TP            FN
        no_first_crack   FP            TN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TP (True Positive):&lt;/strong&gt; Correctly predicted first crack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TN (True Negative):&lt;/strong&gt; Correctly predicted no first crack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP (False Positive):&lt;/strong&gt; Predicted first crack, but was actually no first crack (false alarm)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FN (False Negative):&lt;/strong&gt; Predicted no first crack, but was actually first crack (missed detection)&lt;/li&gt;
&lt;/ul&gt;
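&lt;p&gt;These four definitions are easy to sanity-check in code; feeding in this project's test-set counts (TP=20, TN=22, FP=1, FN=2) reproduces the headline metrics reported in the results section:&lt;/p&gt;

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the four headline metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics(tp=20, tn=22, fp=1, fn=2)
print(f"{acc:.1%} {prec:.1%} {rec:.1%} {f1:.1%}")  # 93.3% 95.2% 90.9% 93.0%
```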

&lt;h2&gt;
  
  
  Training and Evaluation
&lt;/h2&gt;

&lt;p&gt;With our dataset properly split and balanced, and our metrics defined, we're ready to fine-tune the Audio Spectrogram Transformer (AST) model for first crack detection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Architecture
&lt;/h3&gt;

&lt;p&gt;The project uses MIT's pre-trained AST model (&lt;code&gt;MIT/ast-finetuned-audioset-10-10-0.4593&lt;/code&gt;) from Hugging Face, which was originally trained on AudioSet. The model architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input&lt;/strong&gt;: Audio spectrograms (16kHz, 10-second windows)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt;: Vision Transformer adapted for audio&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transfer Learning&lt;/strong&gt;: We keep the pre-trained weights and fine-tune for binary classification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: Two classes - &lt;code&gt;first_crack&lt;/code&gt; vs &lt;code&gt;no_first_crack&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training Configuration
&lt;/h3&gt;

&lt;p&gt;The training process uses the following configuration (defined in &lt;code&gt;models/config.py&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;TRAINING_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;learning_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;num_epochs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;device&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mps&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Apple Silicon GPU
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sample_rate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;target_length_sec&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key training features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Class-weighted loss&lt;/strong&gt;: Addresses the small residual class imbalance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AdamW optimizer&lt;/strong&gt;: With cosine annealing learning rate schedule&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Early stopping&lt;/strong&gt;: Based on validation F1 score&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TensorBoard logging&lt;/strong&gt;: Real-time metrics visualization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training Process
&lt;/h3&gt;

&lt;p&gt;To start training:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./venv/bin/python src/training/train.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-dir&lt;/span&gt; data/splits &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--experiment-name&lt;/span&gt; baseline_v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The training script:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loads train/val data using &lt;code&gt;AudioDataset&lt;/code&gt; (automatic resampling to 16kHz)&lt;/li&gt;
&lt;li&gt;Applies class weights to handle imbalance&lt;/li&gt;
&lt;li&gt;Trains with early stopping (patience: 10 epochs)&lt;/li&gt;
&lt;li&gt;Saves best model based on validation F1 score&lt;/li&gt;
&lt;li&gt;Writes checkpoints to &lt;code&gt;experiments/runs/&amp;lt;experiment_name&amp;gt;/&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
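&lt;p&gt;The early stopping in step 3 is just a patience counter over validation F1. A minimal sketch of the logic (the class name is illustrative, not taken from the actual training script):&lt;/p&gt;

```python
class EarlyStopping:
    """Stop training when validation F1 has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_f1 = 0.0
        self.epochs_without_improvement = 0

    def step(self, val_f1):
        """Call once per epoch; returns True when training should stop."""
        if val_f1 > self.best_f1:
            self.best_f1 = val_f1  # new best: this is where the checkpoint is saved
            self.epochs_without_improvement = 0
            return False
        self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```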

&lt;h3&gt;
  
  
  Results: Exceeding Expectations
&lt;/h3&gt;

&lt;p&gt;With only &lt;strong&gt;9 recording sessions&lt;/strong&gt; (~21 minutes of annotated audio):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline (Random)&lt;/th&gt;
&lt;th&gt;Our Model&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;93.3%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+86.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+90.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;90.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+81.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F1 Score&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;93.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+86.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ROC-AUC&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.986&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+97.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Translation&lt;/strong&gt;: The model classifies roughly 93 of every 100 windows correctly,&lt;br&gt;
with only 1 false alarm and 2 missed detections across the 45-sample test set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Confusion Matrix
                    Predicted
                    no_first_crack  first_crack
Actual  no_first_crack     22            1
        first_crack         2           20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is excellent performance for a model trained on just 9 recording sessions! The higher overlap (70% vs previous experiments) likely contributed to the improved results. This demonstrates the power of transfer learning with pre-trained audio models.&lt;/p&gt;

&lt;p&gt;Performance breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only 1 false alarm (FP) - down from 2&lt;/li&gt;
&lt;li&gt;Only 2 missed detections (FN) - same as before&lt;/li&gt;
&lt;li&gt;22/23 correct no_first_crack predictions (95.7%)&lt;/li&gt;
&lt;li&gt;20/22 correct first_crack predictions (90.9%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This balanced performance is crucial for real-time roasting control where both missing first crack and triggering false adjustments have consequences.&lt;/p&gt;
&lt;h3&gt;
  
  
  Evaluation on Test Set
&lt;/h3&gt;

&lt;p&gt;To evaluate the final model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./venv/bin/python src/training/evaluate.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--checkpoint&lt;/span&gt; experiments/final_model/model.pt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--test-dir&lt;/span&gt; data/splits/test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classification report with per-class metrics&lt;/li&gt;
&lt;li&gt;Confusion matrix visualization&lt;/li&gt;
&lt;li&gt;ROC curve analysis&lt;/li&gt;
&lt;li&gt;Detailed results saved to text files&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Learnings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What Worked:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transfer learning from AudioSet significantly reduced data requirements&lt;/li&gt;
&lt;li&gt;Balanced annotation (equal first_crack/no_first_crack samples) improved performance&lt;/li&gt;
&lt;li&gt;10-second windows captured enough context for accurate detection&lt;/li&gt;
&lt;li&gt;Class-weighted loss handled remaining imbalance effectively&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initial sparse labelling with only 4 sessions led to overfitting&lt;/li&gt;
&lt;li&gt;Limited training data (9 sessions) required careful annotation strategy&lt;/li&gt;
&lt;li&gt;Environmental noise had to be kept to a minimum by recording in a controlled environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Future Improvements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect more diverse roasting sessions (different beans, temperatures, extractor configuration)&lt;/li&gt;
&lt;li&gt;Experiment with data augmentation (time stretching, pitch shifting)&lt;/li&gt;
&lt;li&gt;Test shorter inference windows for faster real-time detection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real-Time Inference
&lt;/h2&gt;

&lt;p&gt;The trained model can now detect first crack in real-time from either audio files or live microphone input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# File-based detection&lt;/span&gt;
./venv/bin/python src/inference/first_crack_detector.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--audio&lt;/span&gt; data/raw/roast-1.wav &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--checkpoint&lt;/span&gt; experiments/final_model/model.pt

&lt;span class="c"&gt;# Live microphone detection&lt;/span&gt;
./venv/bin/python src/inference/first_crack_detector.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--microphone&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--checkpoint&lt;/span&gt; experiments/final_model/model.pt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The detector uses sliding window inference with "pop-confirmation" logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyzes 10-second audio windows with 70% overlap (3-second hop between windows)&lt;/li&gt;
&lt;li&gt;Requires minimum of 3 positive detections (pops) within a 30-second confirmation window&lt;/li&gt;
&lt;li&gt;Maintains detection history to filter false positives&lt;/li&gt;
&lt;li&gt;Returns timestamp in MM:SS format when first crack is confirmed&lt;/li&gt;
&lt;/ul&gt;
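&lt;p&gt;The pop-confirmation logic itself is small. A hedged sketch of the idea in plain Python (names and the seconds-based return value are illustrative; the actual detector maintains history incrementally and formats the timestamp as MM:SS):&lt;/p&gt;

```python
def confirm_first_crack(detections, min_pops=3, confirmation_window=30.0):
    """Return the time (seconds) at which first crack is confirmed, or None.

    `detections` is a time-ordered list of (timestamp_sec, is_positive) pairs,
    one entry per analysed 10-second window.
    """
    positives = [t for t, is_pos in detections if is_pos]
    for i, t in enumerate(positives):
        # Count positive windows ("pops") inside the confirmation window starting here.
        pops = [p for p in positives[i:] if t + confirmation_window >= p]
        if len(pops) >= min_pops:
            return pops[min_pops - 1]  # confirmed at the min_pops-th pop
    return None  # isolated positives never reach the threshold, filtering false alarms
```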

&lt;p&gt;This forms the foundation for Part 2, where we'll wrap this detector in an MCP server for integration with AI agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;: Apple M3 Max (MPS - Metal Performance Shaders)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Real-Time Factor (RTF)&lt;/td&gt;
&lt;td&gt;87.64x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-window Latency&lt;/td&gt;
&lt;td&gt;70-90ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;~18 windows/second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processing Speed&lt;/td&gt;
&lt;td&gt;1 hour of audio in ~41 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Batch Inference Results&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Processing Time&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;roast-1&lt;/td&gt;
&lt;td&gt;10:39 (639.7s)&lt;/td&gt;
&lt;td&gt;7.67s&lt;/td&gt;
&lt;td&gt;83.46x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;roast-2&lt;/td&gt;
&lt;td&gt;10:16 (616.6s)&lt;/td&gt;
&lt;td&gt;6.92s&lt;/td&gt;
&lt;td&gt;89.06x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;roast-3&lt;/td&gt;
&lt;td&gt;10:25 (625.6s)&lt;/td&gt;
&lt;td&gt;7.05s&lt;/td&gt;
&lt;td&gt;88.74x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;roast-4&lt;/td&gt;
&lt;td&gt;9:44 (584.8s)&lt;/td&gt;
&lt;td&gt;6.55s&lt;/td&gt;
&lt;td&gt;89.29x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;41.1 min&lt;/td&gt;
&lt;td&gt;28.2s&lt;/td&gt;
&lt;td&gt;87.64x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Latency Breakdown (per 10s window)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Audio loading: 1-2ms&lt;/li&gt;
&lt;li&gt; Feature extraction: 20-30ms&lt;/li&gt;
&lt;li&gt; Model inference: 50-60ms&lt;/li&gt;
&lt;li&gt; Total: ~70-90ms per window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resource Usage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; CPU Usage: 5-10% during inference&lt;/li&gt;
&lt;li&gt; Memory: ~1.5GB for model + 100MB working&lt;/li&gt;
&lt;li&gt; GPU Memory: ~2GB on MPS&lt;/li&gt;
&lt;li&gt; Latency overhead: 0.9% (90ms used / 10,000ms available)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: The model processes audio roughly 87x faster than real-time; each 10-second window needs only ~90ms of compute out of the 10,000ms available, leaving about 111x headroom for streaming detection. A 10-minute roast is fully processed in just ~7 seconds, making real-time monitoring easily achievable even with additional processing overhead.&lt;/p&gt;
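&lt;p&gt;Both figures follow directly from the table: the real-time factor is audio duration divided by processing time, and per-window headroom is window length divided by per-window latency:&lt;/p&gt;

```python
total_audio_sec = 41.1 * 60  # total roast audio from the batch table
processing_sec = 28.2        # total batch processing time
rtf = total_audio_sec / processing_sec
print(round(rtf, 1))         # prints 87.4 (the table's 87.64x reflects unrounded durations)

window_ms, latency_ms = 10_000, 90
print(round(window_ms / latency_ms))  # prints 111 (per-window headroom)
```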

&lt;h2&gt;
  
  
  The Warp Agent Advantage
&lt;/h2&gt;

&lt;p&gt;Throughout this project, Warp's Agent Mode was instrumental in:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rapid Prototyping&lt;/strong&gt; - From idea to working pipeline in hours, not days&lt;br&gt;
&lt;strong&gt;Best Practice Guidance&lt;/strong&gt; - Suggested Label Studio and evaluation workflows&lt;br&gt;
&lt;strong&gt;Code Generation&lt;/strong&gt; - Created complete scripts for data processing, training, and inference&lt;br&gt;
&lt;strong&gt;Iterative Refinement&lt;/strong&gt; - Helped debug overfitting issues and improve annotation strategy&lt;br&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; - Generated summaries, reports, and README documentation automatically&lt;/p&gt;

&lt;p&gt;The development workflow felt more like pair programming with an engineer who knew PyTorch and audio processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;The first crack detector is working well, but it's just the beginning.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Part 2: Building MCP Servers for Coffee Roasting&lt;/strong&gt;, we'll:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrap the detector in an MCP server for real-time streaming&lt;/li&gt;
&lt;li&gt;Build a second MCP server to control the Hottop roaster (heat, fan, cooling)&lt;/li&gt;
&lt;li&gt;Implement authentication and safety controls&lt;/li&gt;
&lt;li&gt;Test end-to-end detection → action loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;strong&gt;Part 3: Creating an Autonomous Roasting Agent&lt;/strong&gt;, we'll bring it all together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use .NET Aspire to orchestrate multiple MCP servers&lt;/li&gt;
&lt;li&gt;Build AI agents that make real-time roasting decisions&lt;/li&gt;
&lt;li&gt;Implement safety rails and human override&lt;/li&gt;
&lt;li&gt;Roast a batch fully autonomously and compare against manual profiles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The goal&lt;/strong&gt;: Press start, add beans when prompted, then hand off, observe, and enjoy perfectly roasted coffee.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow along on &lt;a href="https://github.com/syamaner/bean-agent" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or subscribe for Part 2!&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Project &amp;amp; Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/syamaner/bean-agent" rel="noopener noreferrer"&gt;Project Repository&lt;/a&gt; - Complete code and documentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://warp.dev" rel="noopener noreferrer"&gt;Warp Terminal&lt;/a&gt; - AI-assisted development environment&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://labelstud.io/" rel="noopener noreferrer"&gt;Label Studio&lt;/a&gt; - Audio annotation tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Model &amp;amp; ML:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593" rel="noopener noreferrer"&gt;Audio Spectrogram Transformer (AST)&lt;/a&gt; - Pre-trained model&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/transformers/en/model_doc/audio-spectrogram-transformer" rel="noopener noreferrer"&gt;AST Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/fine-tune-the-audio-spectrogram-transformer-with-transformers-73333c9ef717/" rel="noopener noreferrer"&gt;Fine-Tuning AST Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2104.01778" rel="noopener noreferrer"&gt;Original AST Paper&lt;/a&gt; - Gong et al., 2021&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Audio Processing:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://librosa.org/" rel="noopener noreferrer"&gt;LibROSA&lt;/a&gt; - Audio analysis library&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pytorch.org/audio/stable/index.html" rel="noopener noreferrer"&gt;PyTorch Audio&lt;/a&gt; - Audio I/O and transforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Coffee Roasting Context:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://library.sweetmarias.com/first-crack-faq-what-is-first-crack-what-is-second-crack/" rel="noopener noreferrer"&gt;First Crack Explained&lt;/a&gt; - For readers unfamiliar with roasting&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>agents</category>
      <category>warpdev</category>
      <category>python</category>
    </item>
    <item>
      <title>Beyond Basic RAG: Measuring Embedding and Generation Performance with RAGAS</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Sat, 12 Apr 2025 14:33:03 +0000</pubDate>
      <link>https://dev.to/syamaner/beyond-basic-rag-measuring-embedding-and-generation-performance-with-ragas-ddk</link>
      <guid>https://dev.to/syamaner/beyond-basic-rag-measuring-embedding-and-generation-performance-with-ragas-ddk</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the previous post, we looked at a basic Retrieval Augmented Generation (RAG) example using .NET for both retrieval and generation. It was built using out-of-the-box components offered by Semantic Kernel, including a default chunking approach. &lt;/p&gt;

&lt;p&gt;The barrier to entry here is low, which helps democratise access to Large Language Models (LLMs) across wider ecosystems and drives innovation. For instance, in .NET it is possible to use &lt;code&gt;Microsoft Semantic Kernel&lt;/code&gt; or &lt;code&gt;Aspire.Azure.AI.OpenAI&lt;/code&gt; (OpenAI, Azure OpenAI, as well as compatible local options such as Ollama). There is even an emerging open-source .NET port of LangChain, with JetBrains as an official supporter. For those who would like to run inference in-process (CPU or GPU) without HTTP APIs, there is also LLamaSharp, a .NET wrapper around llama.cpp supporting CPU and GPU inference.&lt;/p&gt;

&lt;p&gt;However, given that there are many parameters / tweaks to ingestion, retrieval and generation, how can we measure the quality and outcome when building such applications?&lt;/p&gt;

&lt;p&gt;The following Google Trends chart compares the search terms LLM (blue), RAG (green), RAG Evaluation (red) and langchain (yellow) between January 2023 and April 2025. &lt;/p&gt;

&lt;p&gt;We observe that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Earlier in 2023, RAG was a more popular search term.&lt;/li&gt;
&lt;li&gt;Around January 2024, LLMs started to take over in popularity.&lt;/li&gt;
&lt;li&gt;langchain held a steady position throughout the time frame.&lt;/li&gt;
&lt;li&gt;RAG Evaluation has negligible existence in the trends.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Given that Google is a public, general-purpose search engine, these results do not mean there is no interest in evaluation; rather, the general public may not be thinking about these aspects yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yqw108szrr23y2ctoaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yqw108szrr23y2ctoaw.png" alt="Comparison of LLM, RAG Evaluation, langchain, RAG Google trends" width="800" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another look from an academic-papers perspective, comparing "RAG" and "RAG Evaluation", yields different results. Publications have grown from 14 papers on "RAG" and 3 on "RAG Evaluation" in 2022 to 1041 and 454 respectively in 2024 (Source: ArXiv Trends, 2025). As RAG becomes mainstream, evaluation methods are becoming a popular research topic in their own right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtt5c783qmefb9ygqir1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtt5c783qmefb9ygqir1.png" alt="ArXiv Trends - RAG vs RAG Evaluation" width="736" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post will cover the following sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RAG Evaluation&lt;/li&gt;
&lt;li&gt;System under Evaluation&lt;/li&gt;
&lt;li&gt;RAGAS&lt;/li&gt;
&lt;li&gt;Evaluation Approach&lt;/li&gt;
&lt;li&gt;Results&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  RAG Evaluation
&lt;/h2&gt;

&lt;p&gt;Evaluating Retrieval-Augmented Generation (RAG) systems is a crucial aspect of solutions that incorporate such technologies.&lt;/p&gt;

&lt;p&gt;Unlike traditional software, where testing is deterministic (given the input, we know the expected outcome), RAG outputs depend on several probabilistic, non-deterministic factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval accuracy&lt;/strong&gt; (finding relevant source data, rewriting user query, and similar approaches)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation quality&lt;/strong&gt; (producing coherent, factual responses)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variation in ingestion and generation&lt;/strong&gt; (chunking strategies, use of metadata, tweaked or versioned prompts and inference parameters)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without systematic evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How can we tell genuine improvements apart from random noise?
&lt;/li&gt;
&lt;li&gt;Hallucinations and irrelevant answers can go undetected.
&lt;/li&gt;
&lt;li&gt;Optimisation becomes guesswork: "let me change this parameter and see".&lt;/li&gt;
&lt;li&gt;How do we detect and deal with regressions?&lt;/li&gt;
&lt;li&gt;Cost / benefit trade-offs are invisible.

&lt;ul&gt;
&lt;li&gt;Runtime costs include input / output tokens, so for production applications these are also crucial metrics. &lt;/li&gt;
&lt;li&gt;If they are not included in evaluation and comparison, the end result could be a well-performing system that is too expensive to run. As this is a hobby project, this step is excluded from the current experiments.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;There are established datasets and benchmarks for Retrieval-Augmented Generation systems. These benchmarks typically provide test records containing the query, the expected answer, and the expected context, which are then run against the RAG system under test and scored using metrics such as those defined below. The &lt;code&gt;Google Frames Benchmark&lt;/code&gt; is one example: it provides a dataset based on Wikipedia articles to evaluate metrics such as factuality, retrieval accuracy, and reasoning.&lt;/p&gt;

&lt;p&gt;These approaches introduce some challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Such datasets are generic and do not necessarily capture domain-specific nuances of the target use case.&lt;/li&gt;
&lt;li&gt;The test data might have been included in the model's training data.&lt;/li&gt;
&lt;li&gt;There can be bias towards specific metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, we will focus on how to measure both retrieval performance (e.g., context recall) and generation quality (e.g., faithfulness, semantic similarity) using the RAGAS evaluation framework.&lt;/p&gt;

&lt;p&gt;We will use RAGAS with an LLM-as-a-judge approach to generate evaluation data from our documents, and then run the evaluation from Jupyter Notebooks hosted on Aspire to review the results.&lt;/p&gt;

&lt;h2&gt;
  
  
  System under Evaluation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;We ingest Markdown documentation from the official Microsoft .NET Aspire repository.&lt;/li&gt;
&lt;li&gt;We use Semantic Kernel for ingestion, with a very simple chunking approach.&lt;/li&gt;
&lt;li&gt;We use Semantic Kernel for search.&lt;/li&gt;
&lt;li&gt;We register a dedicated Qdrant vector store for each embedding model under evaluation, and register each of those embedding models with Semantic Kernel.&lt;/li&gt;
&lt;li&gt;Lastly, we register a chat completion model for each LLM under evaluation, using the model name as the key.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach lets the request parameters select the correct vector store at runtime for ingestion and retrieval, and likewise select the LLM used for generation during an evaluation run.&lt;/p&gt;
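&lt;p&gt;The routing idea behind keyed registration can be sketched in Python. This is an illustration only: the actual project uses Semantic Kernel's keyed services via .NET Dependency Injection, and the store names below are made up.&lt;/p&gt;

```python
# Hypothetical sketch of keyed registration: each embedding model name
# maps to its own vector store, resolved per request at runtime.
vector_stores = {
    "mxbai-embed-large": "qdrant-collection-mxbai",
    "text-embedding-3-large": "qdrant-collection-openai",
}

def resolve_store(embedding_model: str) -> str:
    """Return the vector store registered for the requested embedding model."""
    try:
        return vector_stores[embedding_model]
    except KeyError:
        raise ValueError(f"No vector store registered for '{embedding_model}'")

store = resolve_store("mxbai-embed-large")
```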

&lt;h3&gt;
  
  
  System Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c47ynpi271ewa8qlfoe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c47ynpi271ewa8qlfoe.png" alt="Sytem Overview" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingestion and Query
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxbtgqp56xim3eqil34j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxbtgqp56xim3eqil34j.png" alt="Ingestion and Query" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  RAGAS
&lt;/h2&gt;

&lt;p&gt;RAGAS is one of the libraries that simplify the evaluation of Large Language Model (LLM) applications. It provides tools both to generate test data and to evaluate results using a variety of approaches and metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAGAS Metrics
&lt;/h3&gt;

&lt;p&gt;This section gives a brief overview of the metrics used in this post. For more details, please refer to the RAGAS metrics documentation.&lt;/p&gt;

&lt;p&gt;The following metrics are summarised from official &lt;a href="https://docs.ragas.io/en/latest/concepts/metrics/" rel="noopener noreferrer"&gt;RAGAS documentation metrics section&lt;/a&gt;. &lt;/p&gt;

&lt;h4&gt;
  
  
  Semantic Similarity
&lt;/h4&gt;

&lt;p&gt;Measures the similarity between the answer from the LLM and the reference answer in the test dataset.&lt;/p&gt;

&lt;p&gt;It starts with the answer embeddings and the reference embeddings, then computes the &lt;a href="https://en.wikipedia.org/wiki/Cosine_similarity" rel="noopener noreferrer"&gt;cosine similarity&lt;/a&gt; between the two vectors.&lt;/p&gt;
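&lt;p&gt;The computation reduces to a cosine similarity between two embedding vectors. A minimal sketch, using toy vectors in place of real embeddings:&lt;/p&gt;

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the answer / reference embeddings.
answer_emb = np.array([0.2, 0.8, 0.1])
reference_emb = np.array([0.25, 0.75, 0.05])
score = cosine_similarity(answer_emb, reference_emb)
```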

&lt;h4&gt;
  
  
  Answer Relevancy
&lt;/h4&gt;

&lt;p&gt;Answer relevancy measures how relevant the response is to the user input. It is calculated as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using an LLM and the response under evaluation, generate a set of artificial questions (three by default).&lt;/li&gt;
&lt;li&gt;Compute the cosine similarity between the embedding of the user input and the embedding of each generated question.&lt;/li&gt;
&lt;li&gt;The average of these scores is the answer relevancy.&lt;/li&gt;
&lt;/ul&gt;
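&lt;p&gt;Leaving the LLM question-generation step aside, the scoring part can be sketched as follows (toy unit vectors stand in for real embeddings):&lt;/p&gt;

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy(user_input_emb, generated_question_embs) -> float:
    """Mean cosine similarity between the user input embedding and the
    embeddings of the questions an LLM generated back from the answer."""
    return float(np.mean([cosine(user_input_emb, q) for q in generated_question_embs]))

user_q = np.array([1.0, 0.0, 0.0])
gen_qs = [np.array([1.0, 0.0, 0.0]),   # rephrases the question exactly
          np.array([0.8, 0.6, 0.0]),   # somewhat related
          np.array([0.6, 0.8, 0.0])]   # drifting off-topic
score = answer_relevancy(user_q, gen_qs)
```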

&lt;h4&gt;
  
  
  Factual Correctness
&lt;/h4&gt;

&lt;p&gt;The Factual Correctness metric evaluates the factual accuracy of the generated response against the reference. It uses an LLM to break the response and the reference down into individual claims, then uses natural language comparison to determine the factual overlap between them. This overlap is quantified using precision, recall, and F1 score.&lt;/p&gt;

&lt;p&gt;Precision: the proportion of claims in the response that are correct. Precision is higher when there are few false positives.&lt;/p&gt;

&lt;p&gt;Recall: the proportion of reference claims that were correctly reproduced. Recall is higher when there are few false negatives.&lt;/p&gt;

&lt;p&gt;F1: the harmonic mean of precision and recall, useful when the two differ significantly. The F1 score sits closer to the lower of the two, so it can only be high if both precision and recall are high.&lt;/p&gt;
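&lt;p&gt;Once the claim comparison has produced counts of matched and unmatched claims, the three scores follow directly (the claim extraction itself is delegated to an LLM by RAGAS):&lt;/p&gt;

```python
def factual_scores(tp: int, fp: int, fn: int) -> dict:
    """Precision/recall/F1 over claim-overlap counts.
    tp: response claims supported by the reference,
    fp: response claims absent from the reference,
    fn: reference claims missing from the response."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = factual_scores(tp=8, fp=2, fn=2)
```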

&lt;h4&gt;
  
  
  Faithfulness
&lt;/h4&gt;

&lt;p&gt;The Faithfulness metric measures how factually consistent a response is with the retrieved context.&lt;/p&gt;

&lt;p&gt;A faithful response is one in which every claim is consistent with the context retrieved from the vector store.&lt;/p&gt;

&lt;p&gt;This metric addresses hallucination detection: if every part of the answer can be backed up by the retrieved context, the generative model has not added anything of its own.&lt;/p&gt;
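&lt;p&gt;Conceptually, faithfulness is the fraction of response claims supported by the context. A toy sketch using exact string matching (RAGAS delegates the actual comparison to an LLM, so the claims and matching below are purely illustrative):&lt;/p&gt;

```python
def faithfulness(response_claims, context_claims) -> float:
    """Toy faithfulness: fraction of response claims found in the context."""
    response_claims = set(response_claims)
    if not response_claims:
        return 0.0
    supported = response_claims.intersection(context_claims)
    return len(supported) / len(response_claims)

# One of the three claims is not backed by the context: an hallucination.
score = faithfulness(
    {"Aspire supports Qdrant", "Aspire runs containers", "Aspire is a game engine"},
    {"Aspire supports Qdrant", "Aspire runs containers", "Aspire has a dashboard"},
)
```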

&lt;h4&gt;
  
  
  Context Recall
&lt;/h4&gt;

&lt;p&gt;Context recall measures how much of the relevant information was actually retrieved from the vector store. If retrieval misses no important information, context recall is high.&lt;/p&gt;

&lt;p&gt;First, the reference answer from the evaluation dataset is broken down into claims. Each claim is then analysed to determine whether it can be attributed to the retrieved context. Ideally, all claims in the reference answer should be attributable to the retrieved context.&lt;/p&gt;
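&lt;p&gt;The same ratio idea can be sketched for context recall, this time over the reference answer's claims (again, the real claim matching is done by an LLM; substring matching here is purely illustrative):&lt;/p&gt;

```python
def context_recall(reference_claims, retrieved_context: str) -> float:
    """Toy context recall: fraction of reference-answer claims that can be
    attributed to (here: found verbatim in) the retrieved context."""
    reference_claims = list(reference_claims)
    if not reference_claims:
        return 0.0
    attributable = [c for c in reference_claims if c in retrieved_context]
    return len(attributable) / len(reference_claims)

context = "Aspire orchestrates containers. The dashboard shows telemetry."
score = context_recall(
    ["Aspire orchestrates containers",
     "The dashboard shows telemetry",
     "Aspire requires Kubernetes"],   # this claim is missing from the context
    context,
)
```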

&lt;h2&gt;
  
  
  Evaluation Approach
&lt;/h2&gt;

&lt;p&gt;Our evaluation approach involves multiple embedding and generation models. The process is as follows:&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/syamaner/moonbeans/blob/bulk-performance_evaluation/src/AspireRagDemo.AppHost/Jupyter/Notebooks/gpt-4o_ReducedAspireDocs_100.csv" rel="noopener noreferrer"&gt;Full evaluation dataset&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Embedding models:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mxbai-embed-large&lt;/code&gt; (335M parameters, Ollama local)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;text-embedding-3-large&lt;/code&gt; OpenAI&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;For each embedding model, we evaluate using the following generative models:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;phi4&lt;/code&gt; (14B parameters)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;chatgpt-4o-latest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;qwen2.5:32b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deepseek-r1&lt;/code&gt; (7B parameters)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama3.2&lt;/code&gt; (3B parameters)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama3.3&lt;/code&gt; (70B parameters)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;deepseek-r1:70b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma3:12b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mistral-small3.1&lt;/code&gt; (24B parameters)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gemma3&lt;/code&gt; (4B parameters)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma3:27b&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;We then save the evaluation results to a &lt;a href="https://github.com/syamaner/moonbeans/blob/bulk-performance_evaluation/src/AspireRagDemo.AppHost/Jupyter/Notebooks/evaluation_results.csv" rel="noopener noreferrer"&gt;CSV file&lt;/a&gt;.&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Test Data Generation (RAGAS, GPT-4o, Jupyter Notebooks)
&lt;/h3&gt;

&lt;p&gt;The selected approach is LLM-as-a-judge, and we use RAGAS to generate the test dataset from our documents (the .NET Aspire documentation).&lt;/p&gt;

&lt;p&gt;We use the &lt;code&gt;TestsetGenerator&lt;/code&gt; class from RAGAS as documented in its basic usage guide. The steps are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load our documents.&lt;/li&gt;
&lt;li&gt;Define the personas to be used for generating queries (technical, novice, expert, ...).&lt;/li&gt;
&lt;li&gt;Declare the distribution of question types (simple, complex, or reasoning).&lt;/li&gt;
&lt;li&gt;Generate the test dataset.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We use &lt;code&gt;GPT-4o&lt;/code&gt; as the generative model and &lt;code&gt;text-embedding-ada-002&lt;/code&gt; (the default) as the embedding model, on the assumption that state-of-the-art models will produce better quality test data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Initialise personas and generator
&lt;/span&gt;
&lt;span class="c1"&gt;#https://docs.ragas.io/en/stable/howtos/customizations/testgenerator/_persona_generator/#personas-in-testset-generation
&lt;/span&gt;
&lt;span class="n"&gt;personas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;Persona&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Technical Analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Focuses on detailed system specifications and API documentation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Persona&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Novice User&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Asks simple questions using layman terms and basic functionality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Persona&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Security Auditor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Focuses on compliance, data protection, and access control aspects&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Persona&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Docker expert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;role_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Has in depth experience with Docker and DSocker compose and expert at cloud native concepts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;generator_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LangchainLLMWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;generator_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LangchainEmbeddingsWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;generator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TestsetGenerator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;generator_llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;generator_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;persona_list&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;personas&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initialise query distribution and generate dataset.
&lt;/span&gt;
&lt;span class="c1"&gt;# https://docs.ragas.io/en/stable/references/synthesizers/
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas.testset.synthesizers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SingleHopSpecificQuerySynthesizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MultiHopAbstractQuerySynthesizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MultiHopSpecificQuerySynthesizer&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;query_distribution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SingleHopSpecificQuerySynthesizer&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Simple questions
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MultiHopSpecificQuerySynthesizer&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;   &lt;span class="c1"&gt;# Complex questions
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MultiHopAbstractQuerySynthesizer&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# Reasoning questions
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_with_langchain_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;testset_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query_distribution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_distribution&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generated test dataset contains the following columns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user_input: the question we will pass to our RAG system.&lt;/li&gt;
&lt;li&gt;reference_contexts: the context that would ideally be retrieved for the given question.&lt;/li&gt;
&lt;li&gt;reference: the ideal response to the user input.&lt;/li&gt;
&lt;li&gt;synthesizer_name: the type of synthesiser used to generate the row (single hop specific, multi hop specific, or multi hop abstract).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  SingleHopSpecificQuerySynthesizer
&lt;/h4&gt;

&lt;p&gt;RAGAS builds a knowledge graph from the input documents to create the test dataset. A single hop specific query synthesiser uses only one node from that graph (headlines or key phrases) to generate a query; "single hop" in this context means using a single node of the knowledge graph.&lt;/p&gt;

&lt;p&gt;Example question: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How does .NET Aspire manage launch profiles for ASP.NET Core service projects?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  MultiHopSpecificQuerySynthesizer
&lt;/h4&gt;

&lt;p&gt;Similar to the previous synthesiser, this one also uses specific properties, but it draws on multiple chunks that overlap with each other. The resulting questions require the retrieval process to fetch multiple documents or sections to assemble the expected context.&lt;/p&gt;

&lt;p&gt;Example question: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How does the Azure SDK impact the ability to run Azure services locally in containers and provision infrastructure using .NET Aspire?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  MultiHopAbstractQuerySynthesizer
&lt;/h4&gt;

&lt;p&gt;Intended to produce generalised (abstract) queries using multiple nodes of the knowledge graph.&lt;/p&gt;

&lt;p&gt;Example question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How can a custom command be created and tested in .NET Aspire to clear the cache of a Redis resource?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64okvnr4z591xfb2f20f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64okvnr4z591xfb2f20f.png" alt="RAGAS generated dataset using OpenAI" width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generated test data available via &lt;a href="https://github.com/syamaner/moonbeans/blob/bulk-performance_evaluation/src/AspireRagDemo.AppHost/Jupyter/Notebooks/gpt-4o_ReducedAspireDocs_100.csv" rel="noopener noreferrer"&gt;Github Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The notebook to generate the test data is also accessible via &lt;a href="https://github.com/syamaner/moonbeans/blob/bulk-performance_evaluation/src/AspireRagDemo.AppHost/Jupyter/Notebooks/generate_eval_data.ipynb" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG Pipeline (.NET)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ingestion: for each embedding model,

&lt;ul&gt;
&lt;li&gt;create a vector store for persistence;&lt;/li&gt;
&lt;li&gt;generate embeddings using the current embedding model and add them to the vector store matching the embedding model name.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Retrieval and generation:

&lt;ul&gt;
&lt;li&gt;The request specifies the embedding model and the generation model.&lt;/li&gt;
&lt;li&gt;Retrieve using the vector store named after the requested embedding model.&lt;/li&gt;
&lt;li&gt;Use the retrieved context with the generative model named in the request.&lt;/li&gt;
&lt;li&gt;Semantic Kernel simplifies this: keyed registration supports multiple embedding and chat models via .NET Dependency Injection.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The vector stores for embedding models can be seen below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasi4hfy7fxjysjukxtn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasi4hfy7fxjysjukxtn9.png" alt="Vector stores for given embedding models in Qdrant" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluation Run
&lt;/h3&gt;

&lt;p&gt;This step is implemented in Python, as RAGAS is a Python library. Evaluation is performed as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick n (5) random entries from the evaluation dataset.&lt;/li&gt;
&lt;li&gt;For each embedding model and each generative model:

&lt;ul&gt;
&lt;li&gt;Call our API to retrieve context using vector search.&lt;/li&gt;
&lt;li&gt;Call our API to run the RAG query and return the answer.&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;retrieved_contexts&lt;/code&gt; and &lt;code&gt;response&lt;/code&gt; in the evaluation dataset.&lt;/li&gt;
&lt;li&gt;Assemble the data and run the RAGAS evaluation.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
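&lt;p&gt;The loop can be sketched as follows. The &lt;code&gt;get_context&lt;/code&gt; and &lt;code&gt;get_answer&lt;/code&gt; helpers are hypothetical stand-ins for the &lt;code&gt;/vector-search&lt;/code&gt; and &lt;code&gt;/chat-with-context&lt;/code&gt; API calls; the model names and dataset are illustrative only:&lt;/p&gt;

```python
import random

def get_context(query: str, embedding_model: str) -> list:
    # Placeholder for the /vector-search API call.
    return [f"context for '{query}' via {embedding_model}"]

def get_answer(query: str, embedding_model: str, chat_model: str) -> str:
    # Placeholder for the /chat-with-context API call.
    return f"answer to '{query}' from {chat_model}"

embedding_models = ["mxbai-embed-large", "text-embedding-3-large"]
chat_models = ["phi4", "llama3.2"]
eval_dataset = [{"user_input": f"question {i}"} for i in range(20)]

sample = random.sample(eval_dataset, 5)  # n random rows from the eval set
runs = []
for embedding_model in embedding_models:
    for chat_model in chat_models:
        for row in sample:
            runs.append({
                **row,
                "retrieved_contexts": get_context(row["user_input"], embedding_model),
                "response": get_answer(row["user_input"], embedding_model, chat_model),
                "embedding_model": embedding_model,
                "chat_model": chat_model,
            })
# runs now holds 2 embeddings x 2 chat models x 5 rows, ready for RAGAS scoring
```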

&lt;h2&gt;RAG Evaluation Dataset Structure&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Column Name&lt;/th&gt;
      &lt;th&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;user_input&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Generated question used to query the system&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;reference_contexts&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Reference context documents generated prior to evaluation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;retrieved_contexts&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Actual context documents returned from the API during runtime&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;reference&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Reference answer generated prior to evaluation (ground truth)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;response&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Actual response generated by the RAG system during evaluation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;embedding_model&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;The embedding model used for retrieval (e.g., text-embedding-3-large)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;chat_model&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;The generative model used to produce the final response (e.g., ChatGPT-4o)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Code example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Retrieve context for evaluation (vector search) using eval input from the current dataset row.
&lt;/span&gt;
&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/vector-search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embeddingModel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# RAG query using the eval query, current embedding model and current generative model
&lt;/span&gt;
&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/chat-with-context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embeddingModel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatModel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chat_model&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full code is available in the &lt;a href="https://github.com/syamaner/moonbeans/blob/bulk-performance_evaluation/src/AspireRagDemo.AppHost/Jupyter/Notebooks/evaluation.ipynb" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Embedding and LLM Model Performance Comparison
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Edit 14/04/2025 - Results for the full dataset with 101 questions and answers&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As seen below, when using the full evaluation dataset (101 questions and answers), top performers for faithfulness are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;text-embedding-3-large + gemma3: 80.42%&lt;/li&gt;
&lt;li&gt;mxbai-embed-large + chatgpt-4o-latest: 79.37%&lt;/li&gt;
&lt;li&gt;text-embedding-3-large + phi4: 79.28%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of the top combinations above includes at least one component that is not open source. It also means we can optimise our model choices where needed by taking a hybrid approach.&lt;/p&gt;

&lt;p&gt;Looking at the top performers for semantic similarity, OpenAI embeddings come out on top. However, open source chat models can be competitive when combined with &lt;code&gt;text-embedding-3-large&lt;/code&gt;, as seen below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;text-embedding-3-large + gemma3:12b: 93.68%&lt;/li&gt;
&lt;li&gt;text-embedding-3-large + deepseek-r1:70b: 93.64%&lt;/li&gt;
&lt;li&gt;text-embedding-3-large + phi4: 93.59%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;text-embedding-3-large generally achieves higher semantic similarity scores than mxbai-embed-large.&lt;/li&gt;
&lt;li&gt;The combination of embedding model and chat model significantly impacts performance metrics.&lt;/li&gt;
&lt;li&gt;Larger models don't always outperform their smaller counterparts (e.g., gemma3:12b vs gemma3:27b).&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Embedding Model&lt;/th&gt;
&lt;th&gt;Chat Model&lt;/th&gt;
&lt;th&gt;Faithfulness Score&lt;/th&gt;
&lt;th&gt;Semantic Similarity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;gemma3&lt;/td&gt;
&lt;td&gt;80.42%&lt;/td&gt;
&lt;td&gt;93.41%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;chatgpt-4o-latest&lt;/td&gt;
&lt;td&gt;79.37%&lt;/td&gt;
&lt;td&gt;92.71%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;phi4&lt;/td&gt;
&lt;td&gt;79.28%&lt;/td&gt;
&lt;td&gt;93.59%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;deepseek-r1&lt;/td&gt;
&lt;td&gt;78.62%&lt;/td&gt;
&lt;td&gt;92.42%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;qwen2.5:32b&lt;/td&gt;
&lt;td&gt;78.26%&lt;/td&gt;
&lt;td&gt;93.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;gemma3:12b&lt;/td&gt;
&lt;td&gt;78.21%&lt;/td&gt;
&lt;td&gt;93.68%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;llama3.2&lt;/td&gt;
&lt;td&gt;77.85%&lt;/td&gt;
&lt;td&gt;93.46%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;gemma3&lt;/td&gt;
&lt;td&gt;77.83%&lt;/td&gt;
&lt;td&gt;92.22%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;gemma3:12b&lt;/td&gt;
&lt;td&gt;77.73%&lt;/td&gt;
&lt;td&gt;92.98%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;llama3.3&lt;/td&gt;
&lt;td&gt;77.64%&lt;/td&gt;
&lt;td&gt;93.58%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;gemma3:27b&lt;/td&gt;
&lt;td&gt;76.63%&lt;/td&gt;
&lt;td&gt;92.44%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;llama3.3&lt;/td&gt;
&lt;td&gt;76.59%&lt;/td&gt;
&lt;td&gt;92.91%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;deepseek-r1:70b&lt;/td&gt;
&lt;td&gt;76.46%&lt;/td&gt;
&lt;td&gt;93.64%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;mistral-small3.1&lt;/td&gt;
&lt;td&gt;76.40%&lt;/td&gt;
&lt;td&gt;93.01%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;deepseek-r1&lt;/td&gt;
&lt;td&gt;75.76%&lt;/td&gt;
&lt;td&gt;93.18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;llama3.2&lt;/td&gt;
&lt;td&gt;75.29%&lt;/td&gt;
&lt;td&gt;92.38%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;deepseek-r1:70b&lt;/td&gt;
&lt;td&gt;75.19%&lt;/td&gt;
&lt;td&gt;92.60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;chatgpt-4o-latest&lt;/td&gt;
&lt;td&gt;74.56%&lt;/td&gt;
&lt;td&gt;93.56%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;gemma3:27b&lt;/td&gt;
&lt;td&gt;74.53%&lt;/td&gt;
&lt;td&gt;92.83%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;qwen2.5:32b&lt;/td&gt;
&lt;td&gt;74.25%&lt;/td&gt;
&lt;td&gt;93.40%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;mistral-small3.1&lt;/td&gt;
&lt;td&gt;74.12%&lt;/td&gt;
&lt;td&gt;93.10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;phi4&lt;/td&gt;
&lt;td&gt;73.50%&lt;/td&gt;
&lt;td&gt;92.28%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Open source models can be competitive&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;mxbai-embed-large outperformed OpenAI's premium text-embedding-3-large + chatgpt-4o-latest combination in faithfulness metrics.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Small models can be mighty&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Local 3B-parameter models such as llama3.2 achieved 93.46% semantic similarity, showing that even small, quantised models have become remarkably capable.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The Hidden Cost of Accuracy&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;OpenAI's top-performing combination (text-embedding-3-large + chatgpt-4o-latest) scored 74.56% faithfulness versus 80.42% for the best open source alternative - a critical tradeoff between commercial API provider costs and the faithfulness achievable with local models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Microsoft is investing in AI on the .NET platform&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We can now build a RAG system and achieve decent performance with a nearly out-of-the-box implementation. .NET Aspire takes this even further, providing a flexible local development environment where we can mix and match the hosts for local inference: a container, the host machine, or a machine on the local network.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Future
&lt;/h2&gt;

&lt;p&gt;These results are based on out-of-the-box code for ingestion, generation and test data generation, and represent a first step towards establishing a baseline.&lt;/p&gt;

&lt;p&gt;Given that the evaluation process uses the OpenAI API, evaluating a large dataset is costly for a hobby project. The next steps I would ideally follow are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start experimenting with prompts and versioning.&lt;/li&gt;
&lt;li&gt;Run evaluation again.&lt;/li&gt;
&lt;li&gt;Consider further tests using different chunking, retrieval / reranking strategies and compare the results.&lt;/li&gt;
&lt;li&gt;If using a local LLM as a judge proves effective, continue the experiments using local models at a larger scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/syamaner/moonbeans/tree/bulk-performance_evaluation" rel="noopener noreferrer"&gt;Sample code repository - bulk-performance_evaluation branch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/datasets/google/frames-benchmark" rel="noopener noreferrer"&gt;Google Frames Benchmark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv-trends.com" rel="noopener noreferrer"&gt;ArXiv Trends&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.langchain.com" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/tryAGI/LangChain" rel="noopener noreferrer"&gt;LangChain C#&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/microsoft/semantic-kernel" rel="noopener noreferrer"&gt;Semantic Kernel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/dotnet/aspire/azureai/azureai-openai-integration?tabs=dotnet-cli" rel="noopener noreferrer"&gt;Aspire.Azure.AI.OpenAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/SciSharp/LLamaSharp" rel="noopener noreferrer"&gt;LLamaSharp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/" rel="noopener noreferrer"&gt;RAGAS - List of available metrics&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>aspire</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>Jupyter AI &amp; .NET Aspire: Building an LLM-Enabled Jupyter Environment</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Mon, 17 Feb 2025 20:58:44 +0000</pubDate>
      <link>https://dev.to/syamaner/jupyter-ai-net-aspire-building-an-llm-enabled-jupyter-environment-59bo</link>
      <guid>https://dev.to/syamaner/jupyter-ai-net-aspire-building-an-llm-enabled-jupyter-environment-59bo</guid>
      <description>&lt;p&gt;In this post, we will cover installing and configuring Jupyter AI with Jupyter while driving the configuration from .NET Aspire. The approach documented here provides an out-of-the-box Jupyter AI setup without having to configure it manually.&lt;/p&gt;

&lt;p&gt;What we will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is Jupyter AI?&lt;/li&gt;
&lt;li&gt;Adding Jupyter AI to Jupyter image&lt;/li&gt;
&lt;li&gt;Adding the Microsoft.dotnet-interactive Jupyter kernel for C# support in Jupyter Notebooks&lt;/li&gt;
&lt;li&gt;.NET Aspire Configuration to specify code and embedding models / model providers &lt;/li&gt;
&lt;li&gt;Running the custom image using .NET Aspire&lt;/li&gt;
&lt;li&gt;And a quick Python demo&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Jupyter AI
&lt;/h2&gt;

&lt;p&gt;Jupyter AI is an extension that adds generative AI support to JupyterLab. Its chat interface lets us ask questions about the code in our notebooks and about any accessible source files and documentation. It can also generate code and inject it into a cell.&lt;/p&gt;

&lt;p&gt;To get Jupyter AI working the following steps are necessary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install and activate the extension.&lt;/li&gt;
&lt;li&gt;Configure the embedding and language models (model name, endpoint, API keys if needed)&lt;/li&gt;
&lt;li&gt;Access the extension and interact with it&lt;/li&gt;
&lt;/ul&gt;
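For comparison, here is roughly what those steps look like done by hand. This is a minimal sketch, assuming a pip-managed environment; the model identifiers are illustrative examples, and the --AiExtension flags are the same ones driven from Aspire later in this post. The script only builds and prints the launch command so it can be inspected before running:

```shell
# Model identifiers are illustrative examples (provider:model format).
CODE_MODEL="ollama:qwen2.5-coder:14b"
EMBEDDING_MODEL="ollama:nomic-embed-text"

# Build the "jupyter lab" launch command with Jupyter AI configured up front,
# the same way the entry point script later in this post assembles it.
CMD="jupyter lab --NotebookApp.token=''"
CMD="$CMD --AiExtension.default_language_model=$CODE_MODEL"
CMD="$CMD --AiExtension.default_embeddings_model=$EMBEDDING_MODEL"

echo "$CMD"
# eval "$CMD"   # uncomment to actually start JupyterLab
```

The rest of this post replaces these manual steps with a Dockerfile and an entry point script driven by .NET Aspire.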

&lt;h2&gt;
  
  
  Installing Jupyter AI and .NET Interactive kernel using a custom Dockerfile
&lt;/h2&gt;

&lt;p&gt;In this section we will cover the Dockerfile used, as well as the configuration applied via a custom entry point script.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating the Dockerfile
&lt;/h3&gt;

&lt;p&gt;This part is straightforward. We start with an appropriate Jupyter base image and then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install .NET 9&lt;/li&gt;
&lt;li&gt;Install Python dependencies using &lt;a href="https://github.com/syamaner/aspire-jupyter-ai/blob/main/src/AspireJupyterAI.AppHost/Jupyter/requirements.txt" rel="noopener noreferrer"&gt;requirements.txt&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install Microsoft.dotnet-interactive (so we can install the .NET Interactive kernel)&lt;/li&gt;
&lt;li&gt;Copy our entry point file (run.sh)&lt;/li&gt;
&lt;li&gt;Call the entry point as a non-root user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this minimal &lt;a href="https://github.com/syamaner/aspire-jupyter-ai/blob/python/src/AspireJupyterAI.AppHost/Jupyter/Dockerfile" rel="noopener noreferrer"&gt;Dockerfile&lt;/a&gt; we get a JupyterLab server, Jupyter AI and even the .NET Interactive kernel, letting us use C#, F# and even PowerShell in our notebooks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; jupyter/base-notebook:ubuntu-22.04&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PYTHONDONTWRITEBYTECODE=1&lt;/span&gt;

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install &lt;/span&gt;software-properties-common cmake build-essential  libc6  &lt;span class="nt"&gt;-y&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;add-apt-repository ppa:dotnet/backports &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get update &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; dotnet-sdk-9.0  libgl1-mesa-dev  libglib2.0-0 &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get clean &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/cache/apt/archives /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; ${NB_UID}&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;dotnet tool &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; Microsoft.dotnet-interactive
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="${PATH}:/home/jovyan/.dotnet/tools"&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; ./requirements.txt /home/jovyan/requirements.txt&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /home/jovyan/requirements.txt
&lt;span class="k"&gt;RUN &lt;/span&gt;dotnet interactive jupyter &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; ./run.sh /home/jovyan/run.sh&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x /home/jovyan/run.sh

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; ${NB_UID}&lt;/span&gt;

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/home/jovyan/run.sh"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring Jupyter AI via entry point script
&lt;/h3&gt;

&lt;p&gt;This is achieved by our &lt;a href="https://github.com/syamaner/aspire-jupyter-ai/blob/main/src/AspireJupyterAI.AppHost/Jupyter/run.sh" rel="noopener noreferrer"&gt;run.sh&lt;/a&gt; file as follows:&lt;/p&gt;

&lt;p&gt;We know that .NET Aspire injects the connection strings and additional config as environment variables. So we will utilise this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pass an empty token so the local Jupyter server does not require authentication.&lt;/li&gt;
&lt;li&gt;The Jupyter AI extension is configured at startup by passing --AiExtension.* arguments to the &lt;code&gt;jupyter lab&lt;/code&gt; command.

&lt;ul&gt;
&lt;li&gt;Inject the relevant --AiExtension.* arguments, passed from our Aspire host via environment variables.&lt;/li&gt;
&lt;li&gt;Pass CODE_MODEL and EMBEDDING_MODEL as the language and embedding models using the relevant arguments.&lt;/li&gt;
&lt;li&gt;Optionally set the embedding and language model URLs (if not using OpenAI).&lt;/li&gt;
&lt;li&gt;Inject the API keys (required if using OpenAI).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Execute the built entry command.
&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;CODEMODELURL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ConnectionStrings__codemodel&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;'='&lt;/span&gt; &lt;span class="nt"&gt;-f2&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;';'&lt;/span&gt; &lt;span class="nt"&gt;-f1&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;EMBEDDINGMODELURL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ConnectionStrings__embeddingmodel&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;'='&lt;/span&gt; &lt;span class="nt"&gt;-f2&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;';'&lt;/span&gt; &lt;span class="nt"&gt;-f1&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Base command&lt;/span&gt;
&lt;span class="nv"&gt;CMD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"jupyter lab --NotebookApp.token=''"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"code model: &lt;/span&gt;&lt;span class="nv"&gt;$CODEMODELURL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"embedding model: &lt;/span&gt;&lt;span class="nv"&gt;$EMBEDDINGMODELURL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Add embedding model&lt;/span&gt;
&lt;span class="nv"&gt;CMD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMD&lt;/span&gt;&lt;span class="s2"&gt; --AiExtension.default_embeddings_model=&lt;/span&gt;&lt;span class="nv"&gt;$EMBEDDING_MODEL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Add code model&lt;/span&gt;
&lt;span class="nv"&gt;CMD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMD&lt;/span&gt;&lt;span class="s2"&gt; --AiExtension.default_language_model=&lt;/span&gt;&lt;span class="nv"&gt;$CODE_MODEL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;CMD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMD&lt;/span&gt;&lt;span class="s2"&gt; --AiExtension.default_api_keys='{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;HUGGINGFACEHUB_API_TOKEN&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$HUGGINGFACEHUB_API_TOKEN&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}'"&lt;/span&gt;
&lt;span class="nv"&gt;CMD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMD&lt;/span&gt;&lt;span class="s2"&gt; --AiExtension.default_max_chat_history=12"&lt;/span&gt;
&lt;span class="c"&gt;#,&lt;/span&gt;

&lt;span class="c"&gt;# Add embedding model URL if specified&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$EMBEDDINGMODELURL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nv"&gt;CMD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMD&lt;/span&gt;&lt;span class="s2"&gt; --AiExtension.model_parameters &lt;/span&gt;&lt;span class="nv"&gt;$EMBEDDING_MODEL&lt;/span&gt;&lt;span class="s2"&gt;='{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;base_url&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$EMBEDDINGMODELURL&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}'"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Add code model URL if specified&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CODEMODELURL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nv"&gt;CMD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMD&lt;/span&gt;&lt;span class="s2"&gt; --AiExtension.model_parameters &lt;/span&gt;&lt;span class="nv"&gt;$CODE_MODEL&lt;/span&gt;&lt;span class="s2"&gt;='{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;base_url&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$CODEMODELURL&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}'"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Execute the command&lt;/span&gt;
&lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CMD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entry command above will provide us the following when we run our Aspire host:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopcz0gd3gm8izsqadrzq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopcz0gd3gm8izsqadrzq.png" alt="Jupyter AI configuration" width="413" height="824"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  .NET Aspire configuration and execution
&lt;/h2&gt;

&lt;p&gt;In this section, we will cover the structure of launchSettings.json and the Aspire code putting it all together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;

&lt;p&gt;The provided example has three profiles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;http-ollama-host&lt;/code&gt;: uses Ollama running on the host, with &lt;code&gt;ollama:qwen2.5-coder:32b&lt;/code&gt; as the code model and &lt;code&gt;ollama:nomic-embed-text&lt;/code&gt; as the embedding model.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;http-ollama-local&lt;/code&gt;: &lt;code&gt;ollama:qwen2.5-coder:14b&lt;/code&gt; as the code model and &lt;code&gt;ollama:nomic-embed-text&lt;/code&gt; as the embedding model.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;http-openai&lt;/code&gt;: &lt;code&gt;openai-chat:chatgpt-4o-latest&lt;/code&gt; as the code model and &lt;code&gt;openai:text-embedding-3-large&lt;/code&gt; as the embedding model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As an example, the following shows how the settings are configured within &lt;a href="https://github.com/syamaner/aspire-jupyter-ai/blob/main/src/AspireJupyterAI.AppHost/Properties/launchSettings.json" rel="noopener noreferrer"&gt;launchSettings.json&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;"http-ollama-host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="nl"&gt;"environmentVariables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; 
        &lt;/span&gt;&lt;span class="nl"&gt;"CODE_MODEL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama:qwen2.5-coder:32b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"CODE_MODEL_PROVIDER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OllamaHost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"EMBEDDING_MODEL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama:nomic-embed-text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"EMBEDDING_MODEL_PROVIDER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OllamaHost"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"EXTERNAL_OLLAMA_CONNECTION_STRING"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Endpoint=http://host.docker.internal:11434;"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Aspire Code
&lt;/h3&gt;

&lt;p&gt;There is not much new here. We spin up a Jupyter container using the Dockerfile, and optionally spin up Ollama if the configuration requires it. For reference, the source file is available &lt;a href="https://github.com/syamaner/aspire-jupyter-ai/blob/main/src/AspireJupyterAI.AppHost/Program.cs" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The image built from the Dockerfile is run as a container as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddDockerfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Constants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectionStringNames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JupyterService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"./Jupyter"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithBuildArg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PORT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;applicationPorts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Constants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectionStringNames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JupyterService&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithBindMount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./Jupyter/Notebooks/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/home/jovyan/work"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithHttpEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;targetPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;applicationPorts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Constants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectionStringNames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JupyterService&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"PORT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithLifetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContainerLifetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithOtlpExporter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"OTEL_SERVICE_NAME"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"jupyterdemo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"OTEL_EXPORTER_OTLP_INSECURE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PYTHONUNBUFFERED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CODE_MODEL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chatConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CodeModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithEnvironment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"EMBEDDING_MODEL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chatConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmbeddingModel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Python Demo
&lt;/h2&gt;

&lt;p&gt;For the demo the following use case is considered:&lt;/p&gt;

&lt;p&gt;"Using the coding assistant, write code to extract SIFT features from two images and match them using approximate nearest neighbour approach (ANN). Then guide the assistant to implement RANSAC using Homography to improve the matches and eliminate false positives."&lt;/p&gt;

&lt;p&gt;The initial prompt was straightforward and produced working code without much effort. However, the result is also full of false positive matches, as seen below:&lt;/p&gt;

&lt;h3&gt;
  
  
  Initial Matches
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffygqyj1mfkof909dl2p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffygqyj1mfkof909dl2p8.png" alt="Initial matches" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the second prompt onwards, we ask for the matches to be improved with RANSAC, and things start going wrong. However, after a number of prompts and /fix commands, we eventually get working code without human intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved matches
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle6c17ddapwb6229jl63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle6c17ddapwb6229jl63.png" alt="Improved matches" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The conversation can be seen by inspecting the notebook snapshot:&lt;br&gt;
&lt;a href="https://github.com/syamaner/aspire-jupyter-ai/blob/main/src/AspireJupyterAI.AppHost/Jupyter/Notebooks/py-qwen-2-5-coder-32b-ollama-host.ipynb" rel="noopener noreferrer"&gt;src/AspireJupyterAI.AppHost/Jupyter/Notebooks/py-qwen-2-5-coder-32b-ollama-host.ipynb&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  C# Demo
&lt;/h2&gt;

&lt;p&gt;The same use case was tried with a &lt;a href="https://github.com/syamaner/aspire-jupyter-ai/blob/main/src/AspireJupyterAI.AppHost/Jupyter/Notebooks/csharp-openai-gpt-40-latest.ipynb" rel="noopener noreferrer"&gt;C# notebook&lt;/a&gt;, but it took GPT-4o to come up with a solution. Given that OpenCV support in .NET is somewhat niche, it is not surprising that the models are less effective. In addition, as we are using .NET Interactive in Jupyter, we are in niche territory there as well.&lt;/p&gt;

&lt;p&gt;Here is an example notebook in C# (unfortunately the prompts that led to the final code are missing): &lt;a href="https://github.com/syamaner/aspire-jupyter-ai/blob/main/src/AspireJupyterAI.AppHost/Jupyter/Notebooks/csharp-openai-gpt-40-latest.ipynb" rel="noopener noreferrer"&gt;csharp-openai-gpt-40-latest.ipynb&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get this working on an ARM laptop, the Dockerfile is also more complicated: we build OpenCV and OpenCvSharp in a dedicated stage of the Dockerfile, then copy the native libraries and bindings into the final stage. &lt;a href="https://github.com/syamaner/aspire-jupyter-ai/blob/main/src/AspireJupyterAI.AppHost/Jupyter/Dockerfile" rel="noopener noreferrer"&gt;Modified Dockerfile to support OpenCvSharp on ARM&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The changes to this file are adapted from one of my older &lt;a href="https://dev.to/syamaner/docker-multi-architecture-net-60-and-opencvsharp-1okd"&gt;posts - Docker multi-architecture, .NET 6.0 and OpenCVSharp&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/syamaner/aspire-jupyter-ai" rel="noopener noreferrer"&gt;Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dotnet/interactive" rel="noopener noreferrer"&gt;.NET Interactive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.jupyter.org/generative-ai-in-jupyter-3f7174824862" rel="noopener noreferrer"&gt;Jupyter AI Post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jupyter-ai.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Jupyter AI RTD&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jupyterlab/jupyter-ai" rel="noopener noreferrer"&gt;Jupyter AI GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aspire</category>
      <category>ai</category>
      <category>jupyter</category>
      <category>docker</category>
    </item>
    <item>
      <title>Ingesting documents using .NET to build a simple Retrieval Augmented Generation (RAG) system</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Sun, 16 Feb 2025 18:55:12 +0000</pubDate>
      <link>https://dev.to/syamaner/a-simple-approach-for-ingesting-documents-using-net-for-a-simple-retrieval-augmented-generation-47e1</link>
      <guid>https://dev.to/syamaner/a-simple-approach-for-ingesting-documents-using-net-for-a-simple-retrieval-augmented-generation-47e1</guid>
      <description>&lt;p&gt;Here is a quick post summarising how to use .NET Semantic Kernel, Qdrant and .Net to ingest markdown documents. One of the comments &lt;a href="https://dev.to/syamaner/building-a-simple-retrieval-augmented-generation-system-using-net-aspire-4pdp"&gt;a recent post&lt;/a&gt; related to the topic was about why using Python for ingestion instead of .NET. That was a personal preference at the time but also using .NET with Semantic Kernel to ingest documents for a simple pipeline is not necessarily any more work. &lt;/p&gt;

&lt;p&gt;In this post, we will go through the ingestion process using the high-level libraries available to us in the .NET ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;.NET Semantic Kernel and related connectors for managing vector store&lt;/li&gt;
&lt;li&gt;LangChain .NET for chunking&lt;/li&gt;
&lt;li&gt;.NET Aspire to bring it all together using one of the inference APIs (Ollama on the host, Ollama as a container managed by Aspire, or OpenAI)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use case
&lt;/h2&gt;

&lt;p&gt;In the Python version, we can either pull the documents from a GitHub repository or use a file generated by the &lt;a href="https://gitingest.com/" rel="noopener noreferrer"&gt;GitIngest UI&lt;/a&gt;. GitIngest is an open-source library that lets consumers scrape public GitHub repositories programmatically, or download a file manually using the web UI linked earlier.&lt;/p&gt;

&lt;p&gt;In this case, we have a single &lt;a href="https://github.com/syamaner/moonbeans/blob/performance_evaluation/src/AspireRagDemo.API/dotnet-docs-aspire.txt#:~:text=dotnet-,%2D,-docs%2Daspire.txt" rel="noopener noreferrer"&gt;file&lt;/a&gt; that contains the markdown and .yml files from the &lt;a href="https://github.com/dotnet/docs-aspire" rel="noopener noreferrer"&gt;official .NET Aspire documentation repository&lt;/a&gt;. The file was generated by the GitIngest UI and contains around 180 files concatenated into a single text file. &lt;/p&gt;

&lt;h2&gt;
  
  
  Ingestion Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  File Format
&lt;/h3&gt;

&lt;p&gt;The ingestion process in this example is straightforward; we follow the steps illustrated below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuje92yb6zrr5vvr5qg99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuje92yb6zrr5vvr5qg99.png" alt="Ingestion Process" width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Splitting actual files
&lt;/h3&gt;

&lt;p&gt;As we are using a single file containing multiple .md and .yml files as described above, the first step is to split it into (filename, file content) pairs. &lt;/p&gt;

&lt;p&gt;The files are separated by headers as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... content
================================================
File: README.md
================================================
... content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Given this is a throwaway example, the code below is just enough to demonstrate the process without too many distractions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GitIngestFileSplitter&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;SeparatorLine&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"====================="&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;FilePrefix&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"File:"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;ParseContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// declarations omitted &lt;/span&gt;
        &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SeparatorLine&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentFileName&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;isCollectingContent&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="n"&gt;skipNextSeperatorLine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;currentFileName&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;contentBuilder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;TrimEnd&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
                    &lt;span class="n"&gt;contentBuilder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Clear&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
                    &lt;span class="n"&gt;currentFileName&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="n"&gt;isCollectingContent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="n"&gt;skipNextSeperatorLine&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isCollectingContent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;StartsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FilePrefix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;currentFileName&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FilePrefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
                    &lt;span class="n"&gt;isCollectingContent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="n"&gt;skipNextSeperatorLine&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="n"&gt;currentFileName&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;skipNextSeperatorLine&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SeparatorLine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;contentBuilder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AppendLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;

                    &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Don't forget to add the last file if there is one&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentFileName&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;contentBuilder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;currentFileName&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;contentBuilder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;TrimEnd&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Chunking
&lt;/h3&gt;

&lt;p&gt;Now that we have a dictionary of file names and file contents, we need to generate chunks from the file contents.&lt;/p&gt;

&lt;p&gt;In this case, I have opted to experiment with the &lt;a href="https://github.com/tryAGI/LangChain" rel="noopener noreferrer"&gt;LangChain .NET project&lt;/a&gt;.&lt;br&gt;
We are using &lt;a href="https://github.com/tryAGI/LangChain/blob/main/src/Splitters/Abstractions/src/Text/MarkdownHeaderTextSplitter.cs" rel="noopener noreferrer"&gt;MarkdownHeaderTextSplitter&lt;/a&gt; and &lt;a href="https://github.com/tryAGI/LangChain/blob/main/src/Splitters/Abstractions/src/Text/CharacterTextSplitter.cs" rel="noopener noreferrer"&gt;CharacterTextSplitter&lt;/a&gt; from LangChain .NET.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GitIngestChunker&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IChunker&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// declarations / constructor omitted.&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;IAsyncEnumerable&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;FileChunks&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GetChunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;gitIngestFilePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Read the text file (this is the single file containing all markdown files)&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;gitIngestFileContent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ReadAllTextAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gitIngestFilePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;// Split the files as discussed earlier&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GitIngestFileSplitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ParseContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gitIngestFileContent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;// Start chunking each split file.&lt;/span&gt;
        &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunkingTimer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;MetricTimer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MetricNames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunking&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;            
            &lt;span class="c1"&gt;// omitted: get TextSplitter for given file type.            &lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;fileChunks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;FileChunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]);&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SplitText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="c1"&gt;// we are using markdown header splitter. So if generated chinks are large, we need to keep chunking them.&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="m"&gt;600&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="m"&gt;600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;subChunks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_characterSplitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SplitText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                        &lt;span class="n"&gt;fileChunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subChunks&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;fileChunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;fileChunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="c1"&gt;// return the chunks representing the current markdown or yml file&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fileChunks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="nf"&gt;CanChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DocumentType&lt;/span&gt; &lt;span class="n"&gt;documentType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;documentType&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;DocumentType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GitIngest&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Getting embedding for the chunks
&lt;/h3&gt;

&lt;p&gt;We are using Semantic Kernel, so this part is straightforward and will work with whichever API we choose to use. Having split the file and obtained the chunks for each document, we can use the registered ITextEmbeddingGenerationService (driven by the app and Aspire configuration) to compute the embeddings with the inference approach we have configured.&lt;/p&gt;

&lt;p&gt;We also track some custom metrics that are visible on the Aspire dashboard as we perform ingestion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IngestionPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Kernel&lt;/span&gt; &lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;ITextEmbeddingGenerationService&lt;/span&gt; &lt;span class="n"&gt;_embeddingGenerator&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
        &lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetRequiredService&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ITextEmbeddingGenerationService&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;IngestDataAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DocumentType&lt;/span&gt; &lt;span class="n"&gt;documentType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;fileChunk&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documentChunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;IList&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ReadOnlyMemory&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;?&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;MetricTimer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;MetricNames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;?&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"File"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;?&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"EmbeddingModel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmbeddingModel&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_embeddingGenerator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunks&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="n"&gt;rest&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;    
    &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="n"&gt;rest&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inserting the vectors
&lt;/h3&gt;

&lt;p&gt;Now that we have the embeddings, we need to insert them. This process involves a few steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mapping a .NET class to a vector store document&lt;/li&gt;
&lt;li&gt;Ensuring the collection exists (optionally recreating it)&lt;/li&gt;
&lt;li&gt;Using the correct dimensions for the collection, which depend on the embedding model we use&lt;/li&gt;
&lt;/ul&gt;
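
&lt;p&gt;Put together, these steps can be sketched as follows. This is a minimal, illustrative sketch based on the Semantic Kernel Qdrant connector's preview vector store API; the DocumentChunk type, the collection name and the recordDefinition variable are placeholders, and the repository code differs in detail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Illustrative sketch - assumes Microsoft.SemanticKernel.Connectors.Qdrant (preview).
var vectorStore = new QdrantVectorStore(new QdrantClient("localhost"));

// The record definition carries the embedding dimensions chosen at runtime.
var collection = vectorStore.GetCollection&amp;lt;Guid, DocumentChunk&amp;gt;(
    "aspire-docs", recordDefinition);
await collection.CreateCollectionIfNotExistsAsync();

// Upsert one record per chunk, pairing each chunk with its embedding.
for (var i = 0; i &amp;lt; fileChunk.Chunks.Count; i++)
{
    await collection.UpsertAsync(new DocumentChunk
    {
        Id = Guid.NewGuid(),
        Content = fileChunk.Chunks[i],
        Embedding = embeddings![i]
    });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;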

&lt;h4&gt;
  
  
  Mapping
&lt;/h4&gt;

&lt;p&gt;Microsoft has good documentation on &lt;a href="https://learn.microsoft.com/en-us/semantic-kernel/concepts/vector-store-connectors/how-to/vector-store-custom-mapper?pivots=programming-language-csharp" rel="noopener noreferrer"&gt;how to build custom mappers for Vector Store Connectors&lt;/a&gt;, so I will not repeat it here. However, it is worth covering a few aspects at a high level.&lt;/p&gt;

&lt;p&gt;We could use attributes for mapping, but this demo supports multiple embedding models, each producing embedding vectors of different dimensions, so using attributes would mean hardcoding those dimensions. &lt;/p&gt;

&lt;p&gt;We can, however, define our VectorStoreRecordDefinition in code so that we can choose the correct dimensions for the collection at runtime. &lt;/p&gt;

&lt;p&gt;So our mapping can be as simple as the following snippet from &lt;a href="https://github.com/syamaner/moonbeans/blob/performance_evaluation/src/AspireRagDemo.API/Infrastructure/QdrantCollectionFactory.cs" rel="noopener noreferrer"&gt;QdrantCollectionFactory.cs&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;Dictionary&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EmbeddingModels&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"mxbai-embed-large"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"nomic-embed-text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;768&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"granite-embedding:30m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;384&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;


    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;VectorStoreRecordDefinition&lt;/span&gt; &lt;span class="n"&gt;_faqRecordDefinition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Properties&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;VectorStoreRecordProperty&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;VectorStoreRecordKeyProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;VectorStoreRecordDataProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Content"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="k"&gt;typeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;IsFilterable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StoragePropertyName&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"page_content"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;VectorStoreRecordDataProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Metadata"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FileMetadata&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;IsFullTextSearchable&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StoragePropertyName&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"metadata"&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;VectorStoreRecordVectorProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Vector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;Dimensions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EmbeddingModels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ContainsKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddingModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;EmbeddingModels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;embeddingModel&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;DistanceFunction&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DistanceFunction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CosineSimilarity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IndexKind&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;IndexKind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hnsw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;StoragePropertyName&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"page_content_vector"&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When bootstrapping, we can then register our factory with .NET Semantic Kernel, so that whenever we inject an &lt;code&gt;IVectorStore&lt;/code&gt; our mappers are integrated into the pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        var options = new QdrantVectorStoreOptions
        {
            HasNamedVectors = true,
            VectorStoreCollectionFactory = new QdrantCollectionFactory(embeddingModelName)
        };
        kernelBuilder.AddQdrantVectorStore(options: options);
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Inserting vectors into our collection
&lt;/h4&gt;

&lt;p&gt;Once registration and configuration are handled, we are ready to consume &lt;code&gt;IVectorStore&lt;/code&gt; in our code and make use of it. In our &lt;a href="https://github.com/syamaner/moonbeans/blob/performance_evaluation/src/AspireRagDemo.API/Ingestion/IngestionPipeline.cs" rel="noopener noreferrer"&gt;IngestionPipeline.cs&lt;/a&gt; we need to perform the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure the collection exists:

&lt;ul&gt;
&lt;li&gt;Create it if it does not, or recreate it if required.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Insert the vectors as below:
&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// .NET Semantic Kernel is experimental so we need to opt in to use it.&lt;/span&gt;
&lt;span class="cp"&gt;#pragma warning disable SKEXP0001
&lt;/span&gt;&lt;span class="p"&gt;....&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="n"&gt;omitted&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IngestionPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;IVectorStore&lt;/span&gt; &lt;span class="n"&gt;vectorStore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;AspireRagDemoIngestionMetrics&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;IVectorStoreRecordCollection&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FaqRecord&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_faqCollection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetCollection&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FaqRecord&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VectorStoreCollectionName&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;IngestDataAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DocumentType&lt;/span&gt; &lt;span class="n"&gt;documentType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;EnsureCollectionExists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;documentsProcessed&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;....&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="n"&gt;omitted&lt;/span&gt;
        &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;ingestionTimer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;MetricTimer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;MetricNames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DocumentIngestion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;?&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"File"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;?&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"EmbeddingModel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmbeddingModel&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;fileChunk&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documentChunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;               &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;RecordProcessedChunkCount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;fileChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;++)&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;try&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;faqRecord&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;FaqRecord&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="n"&gt;Id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;NewGuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                            &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fileChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                            &lt;span class="n"&gt;Vector&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                            &lt;span class="n"&gt;Metadata&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;FileMetadata&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                            &lt;span class="p"&gt;{&lt;/span&gt;
                                &lt;span class="n"&gt;FileName&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;StringValue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fileChunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FileName&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
                            &lt;span class="p"&gt;}&lt;/span&gt;
                       &lt;span class="p"&gt;};&lt;/span&gt;
                        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_faqCollection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;UpsertAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;faqRecord&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;documentsProcessed&lt;/span&gt;&lt;span class="p"&gt;++;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;RecordProcessedDocumentCount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documentsProcessed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt; &lt;span class="nf"&gt;EnsureCollectionExists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;forceRecreate&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;collectionExists&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_faqCollection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CollectionExistsAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collectionExists&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt; &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="n"&gt;forceRecreate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_faqCollection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;DeleteCollectionAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_faqCollection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateCollectionAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this quick post we have covered using TextSplitters from LangChain .NET, Vector Stores and embedding models via .NET Semantic Kernel, and some custom metrics captured during ingestion.&lt;/p&gt;

&lt;p&gt;Without much code, we can get impressive results using what is available to us in the .NET world. If you would like to see the results, here is how:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clone the repository.&lt;/li&gt;
&lt;li&gt;Use the &lt;code&gt;http-ollama-local&lt;/code&gt; configuration in the AppHost project.&lt;/li&gt;
&lt;li&gt;Run the Aspire project.&lt;/li&gt;
&lt;li&gt;Wait for the models to be downloaded and started.&lt;/li&gt;
&lt;li&gt;Then use &lt;a href="https://github.com/syamaner/moonbeans/blob/performance_evaluation/src/AspireRagDemo.API/AspireRagDemo.API.http" rel="noopener noreferrer"&gt;src/AspireRagDemo.API/AspireRagDemo.API.http&lt;/a&gt; and execute the &lt;code&gt;http://localhost:5026/ingest?fileName=dotnet-docs-aspire.txt&lt;/code&gt; call. Depending on model size and CPU, this can take anywhere between 30 seconds and 15 minutes.&lt;/li&gt;
&lt;li&gt;Once ingestion has completed, access the UI from the Aspire Dashboard and run some Aspire-related queries.&lt;/li&gt;
&lt;/ul&gt;
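&lt;p&gt;Putting the steps above together, a terminal session might look roughly like this (the AppHost project path is an assumption based on the repository layout; the port and file name are taken from the steps above):&lt;/p&gt;

```shell
# clone the repository and move into it
git clone https://github.com/syamaner/moonbeans.git
cd moonbeans

# select the http-ollama-local configuration in the AppHost project first,
# then run the Aspire project (project path assumed; adjust to the repo layout)
dotnet run --project src/AspireRagDemo.AppHost

# once the models have been downloaded and started, trigger ingestion
curl "http://localhost:5026/ingest?fileName=dotnet-docs-aspire.txt"
```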

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F869uelhwvl65f50r58zc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F869uelhwvl65f50r58zc.png" alt="Rag query: Is .Net Aspire a replacement for Kubernetes?" width="800" height="686"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition, feel free to explore the metrics shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nr2mlqbq5isbzp87q2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nr2mlqbq5isbzp87q2f.png" alt="Custom metrics for the demo" width="450" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskuhn8ay2eiawnbyjoyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskuhn8ay2eiawnbyjoyu.png" alt="Embedding timings" width="706" height="742"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>aspire</category>
      <category>ai</category>
      <category>rag</category>
    </item>
    <item>
      <title>Building a simple Retrieval Augmented Generation system using .Net Aspire</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Sun, 02 Feb 2025 15:31:43 +0000</pubDate>
      <link>https://dev.to/syamaner/building-a-simple-retrieval-augmented-generation-system-using-net-aspire-4pdp</link>
      <guid>https://dev.to/syamaner/building-a-simple-retrieval-augmented-generation-system-using-net-aspire-4pdp</guid>
      <description>&lt;p&gt;In this post, we will look into building a simple Retrieval Augmented Generation (RAG) system where we use Jupyter Notebooks for ingestion and .NET Web API for retrieval and generation part using .NET Aspire and having telemetry from both Python and C# components of the system.&lt;/p&gt;

&lt;p&gt;We will be looking into the following components to build our system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector store: Qdrant with Aspire.Hosting.Qdrant package.&lt;/li&gt;
&lt;li&gt;Ingestion: Jupyter Notebooks

&lt;ul&gt;
&lt;li&gt;LangChain for ingestion and OpenTelemetry for instrumentation.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Experimental UI: Streamlit.&lt;/li&gt;

&lt;li&gt;Embeddings and Generative models: 

&lt;ul&gt;
&lt;li&gt;Ollama using CommunityToolkit.Aspire.Hosting.Ollama package. &lt;/li&gt;
&lt;li&gt;Ollama hosted on the development machine (without Docker)&lt;/li&gt;
&lt;li&gt;OpenAI&lt;/li&gt;
&lt;li&gt;HuggingFace &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;API: ASP.NET Web API with .NET 9

&lt;ul&gt;
&lt;li&gt;Microsoft Semantic Kernel.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;There are several posts about how to integrate Ollama, OpenAI, Semantic Kernel and emerging open source models. This post will focus on how the Aspire 9 networking enhancements help us build and debug systems that use multiple languages and frameworks, and on how we can switch our models and model providers with a few lines of configuration change. In addition, we will look into how to utilise hardware acceleration when it is not available via Docker (mainly on macOS devices).&lt;/p&gt;

&lt;p&gt;This post will focus on the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How we can use .Net Aspire for polyglot solutions where some components might be better off in different programming languages.&lt;/li&gt;
&lt;li&gt;How the improved Docker network support in Aspire 9 helps us.&lt;/li&gt;
&lt;li&gt;How to utilise the power of configuration in Aspire to run Ollama either as a container or as an application on the host machine without changing any code.

&lt;ul&gt;
&lt;li&gt;Likewise, how to swap Ollama with OpenAI or HuggingFace inference endpoints. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The use case in this post is ingesting the .Net Aspire documentation repository and using a RAG approach to answer questions about .Net Aspire. Building such a system is easy, but not necessarily helpful if we don't have any metrics to measure success; we are not covering evaluation in this post, as that will be the main subject of the next post on the topic. To achieve our use case, we will be utilising Gitingest, a Python library that scrapes GitHub repositories into a format that is easy to parse and ingest. The Python code can use the library directly or consume a text file produced by Gitingest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval Augmented Generation (RAG)
&lt;/h2&gt;

&lt;p&gt;There is almost universal awareness that the technology and architecture behind Large Language Models (LLMs) are prone to being creative and making things up. The likes of &lt;a href="https://garymarcus.substack.com/" rel="noopener noreferrer"&gt;Gary Marcus&lt;/a&gt; and &lt;a href="https://www.infoworld.com/article/2338107/the-philosopher-a-conversation-with-grady-booch.html" rel="noopener noreferrer"&gt;Grady Booch&lt;/a&gt; have been trying to raise awareness of what the architectures enabling LLMs are and what they are not.&lt;/p&gt;

&lt;p&gt;So if a given technology is good in some areas and has well-known limitations in others, such as being creative with facts, how can we utilise its strengths?&lt;/p&gt;

&lt;p&gt;One of the approaches is Retrieval Augmented Generation. One of the earlier papers using the term "Retrieval Augmented Generation" is &lt;a href="https://www.semanticscholar.org/paper/Retrieval-Augmented-Generation-for-NLP-Tasks-Lewis-Perez/58ed1fbaabe027345f7bb3a6312d41c5aac63e22#cited-papers" rel="noopener noreferrer"&gt;“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”&lt;/a&gt; (Lewis et al., 2020).&lt;/p&gt;

&lt;p&gt;A simple RAG system relies on in-context learning: a general purpose LLM summarises or extracts the answer to a question from related context retrieved via a vector store. In the next section we'll cover these building blocks.&lt;/p&gt;
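&lt;p&gt;As a rough sketch of what in-context learning means here, the retrieved chunks are simply concatenated into the prompt ahead of the user's question (the prompt wording below is illustrative, not the one used in the demo):&lt;/p&gt;

```python
def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a RAG prompt: retrieved context first, then the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "Is .NET Aspire a replacement for Kubernetes?",
    [
        "Aspire is an opinionated stack for building distributed applications.",
        "Aspire complements, rather than replaces, container orchestrators.",
    ],
)
```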

&lt;h3&gt;
  
  
  Embeddings and Vector Storage
&lt;/h3&gt;

&lt;p&gt;The R in RAG stands for Retrieval, and this is where the strength of the approach comes from. Given a repository of data, if we can retrieve relevant context from a vector store, we can then use a generative model to produce the answer we are looking for from a number of matches to our query.&lt;/p&gt;

&lt;h4&gt;
  
  
  The ingestion process and impact of chunking
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a8ldmyg6jamt6k0knvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a8ldmyg6jamt6k0knvf.png" alt="Ingestion for RAG" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ingestion process involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enumerating the documents (in this case we are dealing with text only, where multiple Markdown and YAML files are merged into a single text file).&lt;/li&gt;
&lt;li&gt;Breaking them down into chunks to make them manageable (chunking).

&lt;ul&gt;
&lt;li&gt;Typically, there is a lot to consider here: some documents have a hierarchy that can be utilised when chunking, while others are fine to break down into fixed-size pieces.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Getting the embeddings using an embedding model (the same model needs to be used for the retrieval stage later too).&lt;/li&gt;

&lt;li&gt;Adding them to our Vector Store.&lt;/li&gt;

&lt;/ul&gt;
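&lt;p&gt;As an illustration of the simplest strategy mentioned above, fixed-size chunking with overlap can be sketched as follows (the sizes are arbitrary; real splitters such as LangChain's also try to respect sentence and paragraph boundaries):&lt;/p&gt;

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking: slide a window of chunk_size characters,
    overlapping neighbouring chunks so context is not cut mid-thought."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=50)
# 1200 characters with a 450-character step yields 3 chunks
```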

&lt;p&gt;For more information on chunking, follow the links at the bottom of the post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval and Generation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcmnyvr24a8vei10ufcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcmnyvr24a8vei10ufcm.png" alt="Retrieval and Generation in RAG" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the data is ingested and the vector store is up to date, we can query our RAG system as illustrated in the diagram above.&lt;/p&gt;

&lt;p&gt;The steps are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get embeddings for the query using the same embedding model utilised for ingestion.&lt;/li&gt;
&lt;li&gt;Query the vector store for the n nearest results matching our input.&lt;/li&gt;
&lt;li&gt;Build the context and run our prompt against our generative model.&lt;/li&gt;
&lt;/ul&gt;
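&lt;p&gt;The retrieval step boils down to a nearest-neighbour search in embedding space. Here is a toy sketch using cosine similarity over a handful of vectors (a real system delegates this to Qdrant's index rather than scanning every vector):&lt;/p&gt;

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_n(query_vec: list[float],
          documents: list[tuple[str, list[float]]],
          n: int = 3) -> list[str]:
    """Return the texts of the n documents closest to the query vector."""
    ranked = sorted(documents,
                    key=lambda doc: cosine_similarity(query_vec, doc[1]),
                    reverse=True)
    return [text for text, _ in ranked[:n]]

# toy 2-dimensional "embeddings"; real models produce 384-1024 dimensions
docs = [
    ("about kubernetes", [0.9, 0.1]),
    ("about aspire", [0.1, 0.9]),
    ("about both", [0.7, 0.7]),
]
context = top_n([0.0, 1.0], docs, n=2)
# -> ["about aspire", "about both"]
```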

&lt;h2&gt;
  
  
  .Net Aspire, .NET and Python
&lt;/h2&gt;

&lt;p&gt;There is experimental support for running Python projects as executables in an Aspire Application Host. However, it is also possible to run containers from a Dockerfile, which can provide more flexibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Jupyter Notebooks
&lt;/h3&gt;

&lt;p&gt;We could run Jupyter directly using a prebuilt image. However, if we need additional modules or any customisation, using our own Dockerfile and requirements file ensures our notebook is available immediately (once built), so we don't have to install the same packages each time the container is recreated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; quay.io/jupyter/minimal-notebook:python-3.12.8&lt;/span&gt;

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libmagic-dev

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; /app
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt /app&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /app/requirements.txt

&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; ${NB_UID}&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["start-notebook.sh"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then run this as a container in our AppHost project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddDockerfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Constants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectionStringNames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JupyterService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"./Jupyter"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithBuildArg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PORT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;applicationPorts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Constants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectionStringNames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JupyterService&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;    
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithArgs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"--NotebookApp.token=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;jupyterLocalSecret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithBindMount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./Jupyter/Notebooks/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"/home/jovyan/work"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Streamlit UI
&lt;/h3&gt;

&lt;p&gt;Given this was intended as a quick experiment to understand how the pieces plug together, using Streamlit made sense.&lt;/p&gt;

&lt;p&gt;However, as Streamlit is started via its own CLI, it seemed easier to run it as a container too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.9-slim&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; PORT=8501&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; APP_PORT=$PORT&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt /app&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip3 &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; /app/requirements.txt

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; main.py /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; TraceSetup.py /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; entrypoint.sh /app&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x /app/entrypoint.sh

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; ${PORT}&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;groupadd &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; 65532 replitui &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; useradd &lt;span class="nt"&gt;--create-home&lt;/span&gt; &lt;span class="nt"&gt;--shell&lt;/span&gt; /bin/bash &lt;span class="nt"&gt;--uid&lt;/span&gt; 65532 &lt;span class="nt"&gt;-g&lt;/span&gt; replitui ui_user
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; 65532:65532&lt;/span&gt;

&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; CMD curl --fail http://localhost:${PORT}/_stcore/health&lt;/span&gt;

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; [ "bash", "/app/entrypoint.sh"] &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the framework, it takes only a few lines of code to build the basic components needed for our UI. The Python and bash code are linked from the corresponding branch for this article.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/syamaner/moonbeans/blob/44f8d354d5cde6d7aeebdab02adada612c631979/src/AspireRagDemo.UI/main.py#L24" rel="noopener noreferrer"&gt;UI set up&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/syamaner/moonbeans/blob/44f8d354d5cde6d7aeebdab02adada612c631979/src/AspireRagDemo.UI/entrypoint.sh#L3" rel="noopener noreferrer"&gt;Startup code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
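&lt;p&gt;At its core, the UI simply forwards the user's question to the Web API over HTTP. As a rough sketch of how such a request could be composed (the endpoint path and parameter names here are illustrative assumptions, not taken from the repository):&lt;/p&gt;

```python
from urllib.parse import urlencode, urljoin

def build_query_url(base_url: str, question: str, use_rag: bool) -> str:
    """Compose a query URL for the Web API.

    The /api/chat path and the parameter names are hypothetical placeholders.
    """
    params = urlencode({"question": question, "useRag": str(use_rag).lower()})
    return urljoin(base_url, "/api/chat") + "?" + params

# The real UI reads the API base address from injected configuration at runtime.
url = build_query_url("http://localhost:15062", "What is .NET Aspire?", True)
```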

&lt;h3&gt;
  
  
  Web API with Semantic Kernel
&lt;/h3&gt;

&lt;p&gt;To utilise Semantic Kernel, we need to define our prompt as well as its prompt template configuration. For the RAG query, they are defined below.&lt;/p&gt;

&lt;p&gt;In the prompt, we define input placeholders for the context and the question. Then, in the PromptTemplateConfig, we link the prompt and declare the two input variables to be supplied at runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;RagPromptTemplate&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
&lt;/span&gt;                                             &lt;span class="n"&gt;You&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;helpful&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt; &lt;span class="n"&gt;assistant&lt;/span&gt; &lt;span class="n"&gt;specialised&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;technical&lt;/span&gt; &lt;span class="n"&gt;questions&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;good&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;utilising&lt;/span&gt; &lt;span class="n"&gt;additional&lt;/span&gt; &lt;span class="n"&gt;technical&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="n"&gt;provided&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;additional&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
                                             &lt;span class="n"&gt;Use&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;following&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;You&lt;/span&gt; &lt;span class="n"&gt;always&lt;/span&gt; &lt;span class="n"&gt;bringing&lt;/span&gt; &lt;span class="n"&gt;necessary&lt;/span&gt; &lt;span class="n"&gt;references&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
                                             &lt;span class="n"&gt;You&lt;/span&gt; &lt;span class="n"&gt;prefer&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;good&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;explanation&lt;/span&gt; &lt;span class="n"&gt;but&lt;/span&gt; &lt;span class="n"&gt;also&lt;/span&gt; &lt;span class="n"&gt;provide&lt;/span&gt; &lt;span class="n"&gt;clear&lt;/span&gt; &lt;span class="n"&gt;justification&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
                                             &lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="n"&gt;has&lt;/span&gt; &lt;span class="n"&gt;absolutely&lt;/span&gt; &lt;span class="n"&gt;no&lt;/span&gt; &lt;span class="n"&gt;relevance&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;please&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="s"&gt;"I don't know the answer."&lt;/span&gt;
                                             &lt;span class="n"&gt;Please&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;You&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;sometimes&lt;/span&gt; &lt;span class="n"&gt;make&lt;/span&gt; &lt;span class="n"&gt;educated&lt;/span&gt; &lt;span class="n"&gt;guesses&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;imply&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

                                             &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                                             &lt;span class="p"&gt;{{&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;

                                             &lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                                             &lt;span class="p"&gt;{{&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;                                             
                                             &lt;span class="s"&gt;""";
&lt;/span&gt;
    &lt;span class="c1"&gt;/// &amp;lt;summary&amp;gt;&lt;/span&gt;
    &lt;span class="c1"&gt;/// To answer the question, the AI assistant will use the provided context.&lt;/span&gt;
    &lt;span class="c1"&gt;/// &amp;lt;/summary&amp;gt;&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;PromptTemplateConfig&lt;/span&gt; &lt;span class="n"&gt;RagPromptConfig&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Template&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RagPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;InputVariables&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;InputVariable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"context"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;InputVariable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"question"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
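&lt;p&gt;Rendering this template at runtime amounts to substituting the {{$context}} and {{$question}} placeholders with the supplied arguments. A minimal Python sketch of that idea (an illustration only, not Semantic Kernel's actual template engine):&lt;/p&gt;

```python
import re

def render_template(template: str, arguments: dict) -> str:
    """Substitute {{$name}} placeholders with values from `arguments`."""
    def substitute(match):
        name = match.group(1)
        if name not in arguments:
            raise KeyError(f"no argument supplied for placeholder ${name}")
        return str(arguments[name])
    # Match {{$name}}, allowing optional whitespace inside the braces.
    return re.sub(r"\{\{\s*\$(\w+)\s*\}\}", substitute, template)

prompt = render_template(
    "Context:\n{{$context}}\n\nQuestion:\n{{$question}}",
    {"context": "Aspire is an orchestration framework.",
     "question": "What is Aspire?"},
)
```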



&lt;p&gt;With the configuration out of the way, we can build a compact C# class that puts it all together for us as below. The notable sections are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GetContextFromVectorStore, where we query our vector store using embeddings generated from the user's question.&lt;/li&gt;
&lt;li&gt;In AnswerWithAdditionalContext, we then create a kernel function and execute it, passing arguments containing the user's question and the additional context retrieved from the vector store.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// omit using&lt;/span&gt;
&lt;span class="cp"&gt;#pragma warning disable SKEXP0001
&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Kernel&lt;/span&gt; &lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IVectorStore&lt;/span&gt; &lt;span class="n"&gt;vectorStore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;IOptions&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ModelConfiguration&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ILogger&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatClient&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IChatClient&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;short&lt;/span&gt; &lt;span class="n"&gt;TopSearchResults&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;ITextEmbeddingGenerationService&lt;/span&gt; &lt;span class="n"&gt;_embeddingGenerator&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;IVectorStoreRecordCollection&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Guid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FaqRecord&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_faqCollection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;....&lt;/span&gt;

    &lt;span class="c1"&gt;// additional methods omitted for brevity.&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;AnswerWithAdditionalContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KernelArguments&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;

        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;kernelFunction&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateFunctionFromPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PromptConstants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RagPromptConfig&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;kernelFunction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;InvokeAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;/// &amp;lt;summary&amp;gt;&lt;/span&gt;
    &lt;span class="c1"&gt;/// Get context from the vector store based on the question.&lt;/span&gt;
    &lt;span class="c1"&gt;///  This method uses the vector store to search for the most relevant context based on the question:&lt;/span&gt;
    &lt;span class="c1"&gt;///      1. Retrieve the embeddings using the embedding model&lt;/span&gt;
    &lt;span class="c1"&gt;///      2. Search the vector store for the most relevant context based on the embeddings.&lt;/span&gt;
    &lt;span class="c1"&gt;///      3. Return the context as a string.&lt;/span&gt;
    &lt;span class="c1"&gt;/// &amp;lt;/summary&amp;gt;&lt;/span&gt;
    &lt;span class="c1"&gt;/// &amp;lt;param name="question"&amp;gt;&amp;lt;/param&amp;gt;&lt;/span&gt;
    &lt;span class="c1"&gt;/// &amp;lt;returns&amp;gt;Vector Search Results.&amp;lt;/returns&amp;gt;&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;GetContextFromVectorStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;questionVectors&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_embeddingGenerator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GenerateEmbeddingsAsync&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;stbContext&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;StringBuilder&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;searchResults&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_faqCollection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;VectorizedSearchAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;questionVectors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;VectorSearchOptions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;Top&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TopSearchResults&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;searchResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;stbContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AppendLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stbContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
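&lt;p&gt;Conceptually, the flow above is: embed the question, find the nearest stored records, concatenate them into a context string, then generate. Stripped of the Semantic Kernel and Qdrant specifics, the retrieval step is a nearest-neighbour search over embeddings; a self-contained Python sketch, with toy vectors standing in for real embedding model output:&lt;/p&gt;

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two vectors; 0.0 for degenerate inputs."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def get_context(question_vector: list, records: list, top: int = 2) -> str:
    """Rank (vector, content) records by similarity and join the top hits."""
    ranked = sorted(records,
                    key=lambda r: cosine_similarity(question_vector, r[0]),
                    reverse=True)
    return "\n".join(content for _, content in ranked[:top])

# Toy 2-dimensional vectors in place of real embeddings.
records = [([1.0, 0.0], "Aspire orchestrates containers."),
           ([0.0, 1.0], "Coffee roasting has a first crack."),
           ([0.9, 0.1], "Aspire injects connection strings.")]
context = get_context([1.0, 0.05], records, top=2)
```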



&lt;p&gt;With little code, we have a functioning RAG system, complete with a barebones UI for local testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aspire - Docker Networking: Communication Between Components
&lt;/h2&gt;

&lt;p&gt;There are three different ways for application components to communicate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Container to container&lt;/li&gt;
&lt;li&gt;Container to host&lt;/li&gt;
&lt;li&gt;Host to container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aspire 9 creates a Docker network which supports all these communication options.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k1jcv9fvtbj12vudayi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k1jcv9fvtbj12vudayi.png" alt="Demo application networking" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Container to container
&lt;/h3&gt;

&lt;p&gt;As with Docker Compose, we can use service names to connect from one container to another on the same Docker network.&lt;/p&gt;

&lt;p&gt;In the demo application, the Jupyter Notebook container can connect to the Ollama container (if it is running as a container - more on this later) and the Qdrant container using their service names.&lt;/p&gt;

&lt;h3&gt;
  
  
  Container to host
&lt;/h3&gt;

&lt;p&gt;The Aspire Dashboard in our project runs as an executable as opposed to a container. This means that if the containers use OpenTelemetry, the dashboard's OTLP telemetry endpoint, running as an executable on our host machine, needs to be reachable from those containers.&lt;/p&gt;

&lt;p&gt;In this case it is not possible to use localhost as the destination from inside a container, so we use host.docker.internal as the OTLP collector URL instead. This way, containers can reach services running on the host machine too.&lt;/p&gt;
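&lt;p&gt;One way to express this is to resolve the collector hostname based on where the process runs. A small Python sketch (the RUNNING_IN_CONTAINER flag is an illustrative convention of this sketch, not something Docker sets for you):&lt;/p&gt;

```python
import os

def otlp_endpoint(port: int = 4317) -> str:
    """Resolve the OTLP collector endpoint for this process.

    Containers cannot reach the host via localhost, so an assumed
    RUNNING_IN_CONTAINER flag switches the hostname to host.docker.internal.
    """
    in_container = os.environ.get("RUNNING_IN_CONTAINER", "false") == "true"
    host = "host.docker.internal" if in_container else "localhost"
    return f"http://{host}:{port}"

os.environ["RUNNING_IN_CONTAINER"] = "true"
endpoint = otlp_endpoint()
```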

&lt;h3&gt;
  
  
  Host to container
&lt;/h3&gt;

&lt;p&gt;This is the case for our .NET Web API project, which runs as an executable process on our host machine and can access all containerised services using localhost and the corresponding published ports.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Ollama as a container or not?
&lt;/h2&gt;

&lt;p&gt;Just because we can run everything as containers does not mean we always should.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware accelerated Docker
&lt;/h3&gt;

&lt;p&gt;Currently, it is possible to utilise the GPU in Docker on hosts with NVIDIA container support. This setup requires a device running Linux (or Windows with WSL 2 configured correctly).&lt;/p&gt;

&lt;p&gt;When this is the case, running Ollama as a container makes sense. &lt;/p&gt;

&lt;p&gt;There are also cases where the host machine supports hardware acceleration for Ollama when it runs natively, but not when it runs in a container.&lt;/p&gt;

&lt;p&gt;For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ARM-based MacBook Pros and other macOS devices.

&lt;ul&gt;
&lt;li&gt;Ollama supports hardware acceleration natively and, depending on the specs, it can make a huge difference.&lt;/li&gt;
&lt;li&gt;However, as GPU acceleration is not available to Docker containers on macOS, running Ollama in Docker (with or without Aspire) will end up being much slower.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Similarly, on Windows devices with a dedicated NVIDIA GPU but no NVIDIA container support, running Ollama on the host OS will provide better performance.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Our example project also allows for the following setup with a configuration change:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjp52vzbjrxnxseje0o7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjp52vzbjrxnxseje0o7.png" alt="Ollama running on host" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Switching Models and Model providers
&lt;/h2&gt;

&lt;p&gt;Given that .NET Aspire shines as a development-time orchestration framework, it is no wonder its configuration system is both powerful and simple.&lt;/p&gt;

&lt;p&gt;In this project, we can conditionally spin up an Ollama container, or inject a connection string for Ollama running on the host, using launchSettings.json in the AppHost project. The models used for embeddings and generation can be swapped just as easily, and both the Python and .NET components will use whichever values are injected via configuration at runtime.&lt;/p&gt;

&lt;p&gt;It is also possible to use OpenAI for both embeddings and generation via configuration. In that case, we need to set up developer secrets containing a valid OpenAI key.&lt;/p&gt;

&lt;p&gt;The main driver for our solution is the launchSettings.json file included with the Aspire AppHost project. By modifying it, all our components will utilise the desired models and providers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://json.schemastore.org/launchsettings.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"profiles"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"commandName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Project"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"dotnetRunMessages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"launchBrowser"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"applicationUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:15062"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"environmentVariables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;....&lt;/span&gt;&lt;span class="w"&gt;        
        &lt;/span&gt;&lt;span class="nl"&gt;"EMBEDDING_MODEL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nomic-embed-text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"EMBEDDING_MODEL_PROVIDER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;        
        &lt;/span&gt;&lt;span class="nl"&gt;"CHAT_MODEL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mistral"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"CHAT_MODEL_PROVIDER"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;        
        &lt;/span&gt;&lt;span class="nl"&gt;"VECTOR_STORE_VECTOR_NAME"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"page_content_vector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this project, we can use the following values for EMBEDDING_MODEL_PROVIDER:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ollama: spin up an Ollama container using Aspire and inject its connection string.&lt;/li&gt;
&lt;li&gt;OllamaHost: do not spin up an Ollama container; instead inject host.docker.internal into containers, or localhost into executables running on the host.&lt;/li&gt;
&lt;li&gt;OpenAI: inject the API key from developer secrets and use the default OpenAI URLs.&lt;/li&gt;
&lt;li&gt;HuggingFace: inject the API key from developer secrets and use the default HuggingFace inference URLs.&lt;/li&gt;
&lt;/ul&gt;
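&lt;p&gt;The branching these values imply can be sketched as follows (the Ollama service name and port are assumptions for illustration; the OpenAI and HuggingFace URLs are those services' well-known defaults):&lt;/p&gt;

```python
def resolve_embedding_endpoint(provider: str, in_container: bool = False) -> str:
    """Map an EMBEDDING_MODEL_PROVIDER value to a base endpoint.

    Mirrors the options listed above; the exact URLs the demo injects may differ.
    """
    if provider == "Ollama":
        # Aspire-managed container, reachable via its service name.
        return "http://ollama:11434"
    if provider == "OllamaHost":
        host = "host.docker.internal" if in_container else "localhost"
        return f"http://{host}:11434"
    if provider == "OpenAI":
        return "https://api.openai.com/v1"  # default endpoint, key from secrets
    if provider == "HuggingFace":
        return "https://api-inference.huggingface.co"  # default inference endpoint
    raise ValueError(f"Unknown provider: {provider}")

endpoint = resolve_embedding_endpoint("OllamaHost", in_container=True)
```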

&lt;p&gt;To use OpenAI or HuggingFace, the following user secrets need to be set with valid keys.&lt;br&gt;
Please note that both the Python and .NET components will use the default endpoints for these services, so connection strings are not used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Parameters:OpenAIKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Parameters:HuggingFaceKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This was a warm-up for building a metadata-driven retrieval system to query photographs using multi-modal computer vision models, which I have been posting about.&lt;/p&gt;

&lt;p&gt;Even with smaller, quantised models running on CPU, we can get decent results asking about .NET Aspire based on the markdown and YAML files in the official documentation repository. In this case we did not utilise any metadata, but that will be part of the photo search project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Models used for testing
&lt;/h3&gt;

&lt;p&gt;Embedding model: &lt;a href="https://ollama.com/library/granite-embedding" rel="noopener noreferrer"&gt;granite-embedding&lt;/a&gt;&lt;br&gt;
Generative model: &lt;a href="https://ollama.com/library/qwen2.5" rel="noopener noreferrer"&gt;qwen2.5:1.5b&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a simple question with a relevant answer when using the RAG query: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6p3c6mhx9o63x9npm4kt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6p3c6mhx9o63x9npm4kt.png" alt="RAG Search" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And a made-up answer when the question is sent directly to the LLM: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jpsuzxpz9j195x9dgnp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jpsuzxpz9j195x9dgnp.png" alt="Search without context" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime performance
&lt;/h3&gt;

&lt;p&gt;In addition, if you are using Ollama on a laptop that has some level of hardware acceleration which is not available inside Docker, then running Ollama installed locally rather than as a container via Aspire gives much better runtime performance. Here is a comparison using a small model:&lt;/p&gt;

&lt;h3&gt;
  
  
  Running as a container
&lt;/h3&gt;

&lt;p&gt;We can see that &lt;strong&gt;ingestion took 1 minute&lt;/strong&gt; and running &lt;strong&gt;two questions took about 33 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4idocfdaw2u2dz06lyzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4idocfdaw2u2dz06lyzj.png" alt="Runtime performance when running Ollama in Docker" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running natively on a host machine with acceleration
&lt;/h3&gt;

&lt;p&gt;When running Ollama natively on the laptop and using host.docker.internal to connect to it from containers, we get around &lt;strong&gt;15 seconds for ingestion&lt;/strong&gt; and &lt;strong&gt;4 seconds for two queries&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkx6goev5sbx1rdrlotjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkx6goev5sbx1rdrlotjx.png" alt="Runtime Performance when running Ollama natively on a laptop with acceleration" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;With the available technologies, we can rapidly build question-and-answer solutions. However, if we don't define our performance metrics and a suitable evaluation approach, there is little value in building such systems.&lt;/p&gt;

&lt;p&gt;For instance, we can use different embedding and generative models. We can change our chunking method, or use additional metadata to query and extract relevant chunks more effectively. We can also change model parameters, and the list goes on.&lt;/p&gt;
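&lt;p&gt;As a sketch of just one of these variables, here is the simplest possible chunking method: fixed-size character chunks with overlap. This is illustrative only; a real pipeline would more likely split on markdown structure or sentence boundaries:&lt;/p&gt;

```python
# Fixed-size chunking with overlap: each chunk repeats the tail of the
# previous one so that sentences cut at a boundary still appear whole
# in at least one chunk.
def chunk_text(text, size=100, overlap=20):
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```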

&lt;p&gt;With so many variables, how do we compare the outcome? The next post on this topic will include the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating evaluation data using LLMs.&lt;/li&gt;
&lt;li&gt;Defining our metrics.&lt;/li&gt;
&lt;li&gt;Performing evaluation using the evaluation data and our target metrics.&lt;/li&gt;
&lt;li&gt;Collecting the results from our experiments.&lt;/li&gt;
&lt;li&gt;Visualising and comparing the performance of the evaluation process.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links and References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/syamaner/moonbeans/tree/aspire-rag-intro" rel="noopener noreferrer"&gt;Sample Repository - moonbeans&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://garymarcus.substack.com/" rel="noopener noreferrer"&gt;Gary Marcus&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.semanticscholar.org/paper/Retrieval-Augmented-Generation-for-NLP-Tasks-Lewis-Perez/58ed1fbaabe027345f7bb3a6312d41c5aac63e22#cited-papers" rel="noopener noreferrer"&gt;“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.”&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://learning.oreilly.com/library/view/learning-langchain/9781098167271/" rel="noopener noreferrer"&gt;Learning LangChain&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Chunking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/towards-data-science/rag-101-chunking-strategies-fdc6f6c2aaec" rel="noopener noreferrer"&gt;RAG 101: Chunking Strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/gkamradt/ChunkViz?tab=readme-ov-file" rel="noopener noreferrer"&gt;ChunkViz - Visualising Chunking methods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2024/10/chunking-techniques-to-build-exceptional-rag-systems/" rel="noopener noreferrer"&gt;15 Chunking Techniques  to Build Exceptional RAG Systems&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tools and Frameworks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gitingest.com/" rel="noopener noreferrer"&gt;Gitingest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://replit.com/" rel="noopener noreferrer"&gt;Replit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/introduction/" rel="noopener noreferrer"&gt;Langchain&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/semantic-kernel/overview/" rel="noopener noreferrer"&gt;Semantic Kernel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dotnet/docs-aspire" rel="noopener noreferrer"&gt;.NET Aspire Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/search" rel="noopener noreferrer"&gt;Ollama - Available Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aspire</category>
      <category>dotnet</category>
      <category>jupyter</category>
      <category>docker</category>
    </item>
    <item>
      <title>Comparing Open-Source Vision Models for Photo Description Tasks Using .NET Aspire</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Mon, 16 Dec 2024 07:00:00 +0000</pubDate>
      <link>https://dev.to/syamaner/comparing-open-source-vision-models-for-photo-description-tasks-using-net-aspire-2ebm</link>
      <guid>https://dev.to/syamaner/comparing-open-source-vision-models-for-photo-description-tasks-using-net-aspire-2ebm</guid>
      <description>&lt;p&gt;In our ongoing series about building a local image summarisation system, we have explored how to combine various open-source technologies to generate meaningful descriptions of photos. Today, we'll tackle a crucial question: How do we choose the best vision model for our needs?&lt;/p&gt;

&lt;p&gt;In this article we focus on a simple approach: using OpenAI's GPT-4o as an automated judge to evaluate the quality of summaries generated by different open-source models. In the next sections, we will explore the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up an evaluation pipeline with .NET Aspire&lt;/li&gt;
&lt;li&gt;Using GPT-4o to score model outputs&lt;/li&gt;
&lt;li&gt;Visualising and analysing the results using Jupyter notebooks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our evaluation covers six prominent open-source vision models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://ollama.com/library/llama3.2-vision" rel="noopener noreferrer"&gt;llama3.2-vision&lt;/a&gt;: Latest iteration of Meta's multimodal model&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ollama.com/library/llava-llama3" rel="noopener noreferrer"&gt;llava-llama3&lt;/a&gt;: Vision-language model built on LLaMA architecture&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ollama.com/library/llava" rel="noopener noreferrer"&gt;llava:7b&lt;/a&gt;: Compact vision-language model suitable for local deployment&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ollama.com/library/llava" rel="noopener noreferrer"&gt;llava:13b&lt;/a&gt;: Larger variant offering enhanced capabilities&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/microsoft/Florence-2-large-ft" rel="noopener noreferrer"&gt;Florence-2-large-ft&lt;/a&gt;: Microsoft's vision model known for detailed scene understanding&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ollama.com/library/llava-phi3" rel="noopener noreferrer"&gt;llava-phi3&lt;/a&gt;: Recent addition combining efficiency with strong performance&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These models run locally through our Aspire-based infrastructure, which handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model inference and serving&lt;/li&gt;
&lt;li&gt;Reverse geocoding for location context&lt;/li&gt;
&lt;li&gt;Experiment tracking and result storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we can generate summaries, which model should we use for summarising our photo library? This post covers a simple way to answer that question with the help of a commercial model. &lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Process
&lt;/h2&gt;

&lt;p&gt;As a weekend project, this post explores the following idea: "How about using a commercial model to judge the output generated by open-source models?" &lt;/p&gt;

&lt;h3&gt;
  
  
  Why GPT-4o?
&lt;/h3&gt;

&lt;p&gt;GPT-4o (released May 2024) offers several advantages as our evaluation model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multimodal capabilities for analysing both images and text.&lt;/li&gt;
&lt;li&gt;Consistent scoring methodology.&lt;/li&gt;
&lt;li&gt;Cost-effective solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Besides being a good fit for the task, pricing was also an advantage for a fun project with no budget at all. For instance, 300 evaluation requests (50 images x 6 open-source models) cost around $0.80. OpenAI API pricing is available &lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach
&lt;/h3&gt;

&lt;p&gt;Our approach can be summarised as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Input parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original photo (scaled to 256px width).&lt;/li&gt;
&lt;li&gt;Model-generated summary.&lt;/li&gt;
&lt;li&gt;Model used.&lt;/li&gt;
&lt;li&gt;Categorisation predictions.&lt;/li&gt;
&lt;li&gt;Top 10 detected objects.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Scoring Criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quality and accuracy of the summary (0-100).&lt;/li&gt;
&lt;li&gt;Accuracy of category predictions.&lt;/li&gt;
&lt;li&gt;Precision of object detection.&lt;/li&gt;
&lt;li&gt;Consistency with image content.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Result Collection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured score and justification storage.&lt;/li&gt;
&lt;li&gt;Integration with existing MongoDB database.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Setting up
&lt;/h2&gt;

&lt;p&gt;For this task, we use the &lt;code&gt;OpenAIClient&lt;/code&gt; from Aspire.OpenAI, as seen in the code sample below. &lt;/p&gt;

&lt;h3&gt;
  
  
  Key Implementation Decisions:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Temperature Setting (0.1f):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chosen for consistent, deterministic evaluations.&lt;/li&gt;
&lt;li&gt;Reduces random variation in scoring.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;JSON Schema Format:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensures structured, parseable responses.&lt;/li&gt;
&lt;li&gt;Simplifies result processing and storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Image Preprocessing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;256px width limitation balances detail and API costs.&lt;/li&gt;
&lt;li&gt;Consistent sizing ensures fair comparisons.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OpenAiPhotoSummaryEvaluationClient&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;FromKeyedServices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"openaiConnection"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="n"&gt;OpenAIClient&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IPhotoSummaryEvaluator&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;SystemPrompt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
        &lt;span class="s"&gt;"You are a highly accurate and fair image summarisation evaluation model. "&lt;/span&gt;
        &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="s"&gt;" Your job is to evaluate the quality of summaries generated from images by different computer vision models. \n\n"&lt;/span&gt;
        &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="s"&gt;" When evaluating a summary of the provided image:\n\n"&lt;/span&gt;
        &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="s"&gt;" - Provide a single score ranging between 0 and 100 combining the following properties: \n\n"&lt;/span&gt;
        &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"    - Quality and accuracyof the summary.\n\n"&lt;/span&gt;
        &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"    - Quality and accuracy of the categories predicted for the image.\n\n"&lt;/span&gt;
        &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"    - Quality and accuracy of the objects predicted to be in the image.\n\n"&lt;/span&gt;
        &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"  - Be fair and consistent when evaluating. \n\n"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;PromptSummary&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
        &lt;span class="s"&gt;"Please score the provided image summary based on the quality and accuracy of the summary, categories, and objects predicted in the image."&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;PhotoSummaryScore&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;EvaluatePhotoSummary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;base64Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ImageSummaryEvaluationRequest&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="p"&gt;..&lt;/span&gt; &lt;span class="n"&gt;omitted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;resize&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;max&lt;/span&gt; &lt;span class="m"&gt;256&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt; &lt;span class="n"&gt;wide&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatMessageContentPart&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateImagePart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;BinaryData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToArray&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="s"&gt;"image/jpeg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;ChatImageDetailLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Auto&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;UserChatMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PromptSummary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JsonSerializer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Serialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;SystemChatMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SystemPrompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;];&lt;/span&gt;

        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ChatCompletionOptions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Temperature&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0.1f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;ResponseFormat&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatResponseFormat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateJsonSchemaFormat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonSchemaFormatName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"image_summary_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;jsonSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BinaryData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;FromString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"""
&lt;/span&gt;                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="s"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="s"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="s"&gt;"Score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="s"&gt;"Justification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="s"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Score"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Justification"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="s"&gt;"additionalProperties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="s"&gt;"""),
&lt;/span&gt;            &lt;span class="n"&gt;jsonSchemaIsStrict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;CompleteChatAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;structuredJson&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;JsonDocument&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;structuredJson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RootElement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Score"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;GetDouble&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;justification&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;structuredJson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RootElement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetProperty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Justification"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;GetString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;PhotoSummaryScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;justification&lt;/span&gt;&lt;span class="p"&gt;!,&lt;/span&gt; &lt;span class="s"&gt;"OpenAI"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results from this process are then stored in the database, alongside the summaries, against the original image. The OpenAI API has rate limits, so it is important to manage how often these calls are made.&lt;/p&gt;
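&lt;p&gt;A minimal client-side throttle is one way to stay under those limits. This is an illustrative sketch, not the project's implementation; it simply caps requests per minute by sleeping between calls:&lt;/p&gt;

```python
# Sleep-based throttle: spaces successive calls at least
# 60 / requests_per_minute seconds apart.
import time

class Throttle:
    def __init__(self, requests_per_minute):
        self.interval = 60.0 / requests_per_minute
        self.last = 0.0

    def wait(self):
        # Block until enough time has passed since the previous call.
        now = time.monotonic()
        sleep_for = self.last + self.interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last = time.monotonic()
```

&lt;p&gt;In practice you would also want to honour 429 responses from the API and back off accordingly.&lt;/p&gt;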

&lt;h2&gt;
  
  
  Analysis and visualisation
&lt;/h2&gt;

&lt;p&gt;Our analysis notebook provides:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Data Collection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MongoDB query and result aggregation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Visualisation Components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model comparison table.&lt;/li&gt;
&lt;li&gt;Example evaluation cases&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Validating the outcome:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter results by best model.&lt;/li&gt;
&lt;li&gt;Visualise evaluation justifications.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using an Aspire command to download and upload the notebook between the development machine and the Jupyter server on the Docker host.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
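&lt;p&gt;The model comparison table boils down to averaging the judge's score per model. Here is a rough sketch of that aggregation, using an in-memory sample in place of the MongoDB documents the notebook actually queries:&lt;/p&gt;

```python
# Average the judge score per model and rank models best-first.
from statistics import mean
from collections import defaultdict

def rank_models(evaluations):
    by_model = defaultdict(list)
    for e in evaluations:
        by_model[e["model"]].append(e["score"])
    return sorted(
        ((model, mean(scores)) for model, scores in by_model.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
```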

&lt;p&gt;Below is an example of the evaluation process where GPT-4o correctly identifies the inaccuracies in the generated summaries. The results look fair and accurate, making it easier to introduce more open-source models and then use the notebook to evaluate their performance.&lt;br&gt;
This also allows us to tweak the prompts to get better results from the models. For example, wrong location information likely comes from including the address resolved from the photo's GPS tag, which leads some models to be more creative with their descriptions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F816tfgybkckminl02ltk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F816tfgybkckminl02ltk.png" alt="Evaluation result" width="680" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/syamaner/photo-search/blob/main/src/PhotoSearch.AppHost/Notebooks/comparison.ipynb" rel="noopener noreferrer"&gt;Further results can be seen on the notebook&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Results and Remarks
&lt;/h2&gt;

&lt;p&gt;Following the process outlined earlier, &lt;code&gt;llava:13b&lt;/code&gt; is on top with an average score of &lt;code&gt;85.6&lt;/code&gt;, with &lt;code&gt;Florence-2-large-ft&lt;/code&gt; second, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhpsgl7hqwgz8zkig8hv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhpsgl7hqwgz8zkig8hv.png" alt="Model Rankings" width="346" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Observations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Providing too much address detail can lead models to make up location information.&lt;/li&gt;
&lt;li&gt;Larger models provide more detailed summaries. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OpenAIClient&lt;/code&gt; from Aspire.OpenAI works well with Ollama Server as well. &lt;/li&gt;
&lt;li&gt;The Aspire command for the Jupyter notebook made it easy for me to pull and push the notebook between my machine and wherever Aspire is running the containers. 

&lt;ul&gt;
&lt;li&gt;As a next step, it makes sense to consider periodic downloading of the notebook. &lt;/li&gt;
&lt;li&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mcpp09csyphrlj87sbb.png" alt="Jupyter Notebook Command in Aspire Dashboard" width="800" height="265"&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion and what's next
&lt;/h2&gt;

&lt;p&gt;It is easy enough to use APIs that allow inference on image inputs. However, deciding which model to use is not so straightforward, given the need to test a large number of images against each model. This is what makes an evaluation process crucial to getting the most out of such technology. &lt;/p&gt;

&lt;p&gt;In this post, we have looked into using OpenAI's GPT-4o model to assess the quality of the image summaries generated by open-source models. &lt;/p&gt;

&lt;p&gt;Our evaluation framework using GPT-4o provides a systematic approach to comparing vision model performance. Key takeaways include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Automated Evaluation Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistent scoring methodology.&lt;/li&gt;
&lt;li&gt;Scalable to large image sets.&lt;/li&gt;
&lt;li&gt;Cost-effective solution.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Implementation Insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aspire.OpenAI simplifies integration.&lt;/li&gt;
&lt;li&gt;Jupyter notebooks enable flexible analysis.&lt;/li&gt;
&lt;li&gt;.NET Aspire makes local development orchestration a breeze.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Model Expansion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integration of newer vision models&lt;/li&gt;
&lt;li&gt;Prompt engineering optimisation&lt;/li&gt;
&lt;li&gt;Performance benchmarking&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Feature Development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Natural language image search implementation&lt;/li&gt;
&lt;li&gt;Enhanced evaluation metrics&lt;/li&gt;
&lt;li&gt;Automated testing pipeline&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The notebook can be accessed on &lt;a href="https://github.com/syamaner/photo-search/blob/main/src/PhotoSearch.AppHost/Notebooks/comparison.ipynb" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; with the rest of the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/syamaner/photo-search" rel="noopener noreferrer"&gt;Code Base&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ibm.com/think/topics/gpt-4o" rel="noopener noreferrer"&gt;What is GPT-4o&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2405.05253v1" rel="noopener noreferrer"&gt;Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/syamaner/photo-search/blob/main/src/PhotoSearch.AppHost/Notebooks/comparison.ipynb" rel="noopener noreferrer"&gt;The notebook with evaluation results&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dotnet</category>
      <category>docker</category>
      <category>aspire</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Welcoming .NET Aspire 9.0 : Photo Summary Project</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Mon, 18 Nov 2024 08:27:22 +0000</pubDate>
      <link>https://dev.to/syamaner/welcoming-net-aspire-90-photo-summary-project-dlj</link>
      <guid>https://dev.to/syamaner/welcoming-net-aspire-90-photo-summary-project-dlj</guid>
<description>&lt;p&gt;The project so far makes it possible to scan photos in a directory, run them through various vision models and store the details in a database. Using Aspire for local development, instead of Docker Compose as I usually would, has been fun so far. &lt;/p&gt;

&lt;p&gt;Aspire 9.0 ships handy new features that are relevant to the next stage of the project. This post summarises them with some examples.&lt;/p&gt;

&lt;p&gt;As the next stage of the project is evaluating the performance of the models I am using, I needed to make Jupyter Notebooks easily accessible inside my codebase and development environment. With .NET Aspire 9.0 this becomes a convenient process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Ready for Next Steps
&lt;/h2&gt;

&lt;p&gt;Given we have several open-source options when it comes to computer vision models to generate photo summaries, we need to be able to evaluate the results from these models so we can choose one that suits our domain.&lt;/p&gt;

&lt;p&gt;One workflow to do this effectively is using Jupyter Notebooks, where we can retrieve our results from the database and compare them with results obtained from commercial models.&lt;/p&gt;

&lt;p&gt;Introducing a Jupyter Notebook server that runs on a remote host to our project makes the following important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The containers run on a remote host, but we still need to be able to keep the notebooks in version control

&lt;ul&gt;
&lt;li&gt;Docker volumes live on the remote host, so there is no easy way to copy files out of them&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;We also need container-to-container networking: the Jupyter server and MongoDB both run on the remote host, and the Jupyter server must be able to reach MongoDB. &lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;As we will see below, .NET Aspire 9.0 takes care of these.&lt;/p&gt;

&lt;p&gt;Here is the list of features in .NET Aspire 9.0 that are relevant to this project and will be covered in this post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tooling

&lt;ul&gt;
&lt;li&gt;No longer relying on workloads: we can now set up .NET Aspire using packages and project templates.&lt;/li&gt;
&lt;li&gt;Templates can also be installed as follows: &lt;code&gt;dotnet new install Aspire.ProjectTemplates::9.0.0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Dashboard and UX

&lt;ul&gt;
&lt;li&gt;Managing Resource Lifecycles: Start, Stop, Restart from the dashboard.&lt;/li&gt;
&lt;li&gt;Browser Telemetry Support.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;App Host (Orchestration)

&lt;ul&gt;
&lt;li&gt;Waiting for dependencies&lt;/li&gt;
&lt;li&gt;Resource health checks&lt;/li&gt;
&lt;li&gt;Persistent containers&lt;/li&gt;
&lt;li&gt;Resource commands&lt;/li&gt;
&lt;li&gt;Container networking&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;One from .NET 9.0

&lt;ul&gt;
&lt;li&gt;Enabling DI registration of metrics using IMeterFactory &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Browser Telemetry Support
&lt;/h2&gt;

&lt;p&gt;Earlier on, I was curious how to integrate traces from the front end and see them in the distributed traces. .NET Aspire 9.0 brings an out-of-the-box way to do this, as below. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;It is important to remember that OpenTelemetry client instrumentation in the browser is experimental.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1d6rqnzud0pnid0jt648.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1d6rqnzud0pnid0jt648.png" alt="Open Telemetry experimental browser support warning." width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define the &lt;code&gt;DOTNET_DASHBOARD_OTLP_HTTP_ENDPOINT_URL&lt;/code&gt; environment variable in the AppHost launch settings:
&lt;code&gt;"DOTNET_DASHBOARD_OTLP_HTTP_ENDPOINT_URL": "http://localhost:16175"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;I still needed to inject &lt;code&gt;DOTNET_DASHBOARD_OTLP_ENDPOINT_URL&lt;/code&gt; into the .NET applications.&lt;/li&gt;
&lt;li&gt;The front end required the HTTP endpoint as well as the following environment variable:
&lt;code&gt;"OTEL_EXPORTER_OTLP_PROTOCOL","http/protobuf"&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
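&lt;p&gt;To make the first step concrete, the AppHost &lt;code&gt;launchSettings.json&lt;/code&gt; might look like the sketch below. The profile name is an assumption; only the HTTP OTLP endpoint value comes from the step above.&lt;/p&gt;

```json
{
  "profiles": {
    "https": {
      "commandName": "Project",
      "environmentVariables": {
        "DOTNET_DASHBOARD_OTLP_HTTP_ENDPOINT_URL": "http://localhost:16175"
      }
    }
  }
}
```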

&lt;p&gt;With these in place, I was able to follow the Microsoft examples to make it work in a StencilJS application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="c1"&gt;//https://www.honeycomb.io/blog/opentelemetry-browser-instrumentation&lt;/span&gt;
&lt;span class="c1"&gt;//https://github.com/open-telemetry/opentelemetry-js/tree/main/experimental/packages/opentelemetry-instrumentation-xml-http-request&lt;/span&gt;
&lt;span class="c1"&gt;//https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/plugins/web/opentelemetry-instrumentation-user-interaction&lt;/span&gt;
&lt;span class="c1"&gt;//https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/plugins/web/opentelemetry-instrumentation-long-task&lt;/span&gt;
&lt;span class="c1"&gt;//https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/plugins/web/opentelemetry-instrumentation-long-task&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;otlpOptions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;omitted&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;omitted&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;otlpOptions&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;contextManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StackContextManager&lt;/span&gt;&lt;span class="p"&gt;()});&lt;/span&gt;
  &lt;span class="nf"&gt;registerInstrumentations&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="nf"&gt;getWebAutoInstrumentations&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/instrumentation-xml-http-request&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;clearTimingResources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LongTaskInstrumentation&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;observerCallback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;longtaskEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;location.pathname&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pathname&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}),&lt;/span&gt;
      &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FetchInstrumentation&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;propagateTraceHeaderCorsUrls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;/api&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;/*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
        &lt;span class="na"&gt;ignoreUrls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;/tile&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;/*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
      &lt;span class="p"&gt;})],&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the changes above and the existing setup in our backend, we can see the end-to-end traces below. The use case is selecting a model and then requesting summaries for all 50 images in the database. We can see the durations of database calls, calls to inference endpoints, as well as transport and our backend components, all triggered from the UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvu1zvtweb7mj2zp7kfm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvu1zvtweb7mj2zp7kfm4.png" alt="Trace view starting from browser click." width="800" height="889"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Resource Health Checks
&lt;/h2&gt;

&lt;p&gt;If we would like to specify resource dependencies to control the startup process, it is important to be able to define what "healthy" means for the various components. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If no health checks are defined, a resource is considered healthy as long as it is in the running state.&lt;/li&gt;
&lt;li&gt;If the resource exposes an HTTP health endpoint, we can register it with a single call: &lt;code&gt;.WithHttpHealthCheck("/health")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sometimes external resources do not provide health checks that suit us; in that case we can define, register and use our own. This is the method discussed here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Checking whether the Nominatim resource is healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NominatimHealthCheck&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IHealthCheck&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="n"&gt;ignored&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;HealthCheckResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;CheckHealthAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HealthCheckContext&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CancellationToken&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; 
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;ready&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;IsServerReady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ready&lt;/span&gt;
            &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;HealthCheckResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Healthy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HealthCheckResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Unhealthy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;IsServerReady&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CancellationToken&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;searchUrl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/search.php?q=avenue%20pasteur"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="c1"&gt;// check if we have success result or not. Code omitted.&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// we also need to register this as below:&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddHealthChecks&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddTypeActivatedCheck&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;NominatimHealthCheck&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"nominatim-healthcheck"&lt;/span&gt;&lt;span class="p"&gt;,..);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;nominatimResourceBuilder&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nominatimResource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithHealthCheck&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"nominatim-healthcheck"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;....);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And once all registered, we can also see the health in the dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5m6i75byp5nt3k39v5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5m6i75byp5nt3k39v5e.png" alt="Resource health view in dashboard." width="404" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Resource Dependencies
&lt;/h2&gt;

&lt;p&gt;Once we have defined our health checks, we can declare dependencies so that our applications will not start before the services they depend on at startup.&lt;/p&gt;

&lt;p&gt;Prior to .NET Aspire 9.0, it was possible to achieve this by following an example provided by David Fowler in issue #921 of the Aspire repository, as linked below. &lt;/p&gt;

&lt;p&gt;Now, once all that additional code is deleted, all we need is the framework's &lt;code&gt;.WaitFor(resource)&lt;/code&gt; as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;apiService&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddProject&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Projects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PhotoSearch_API&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"apiservice"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ollamaContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mongodb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messaging&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ollamaContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mongodb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messaging&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures all dependencies spin up and become healthy first, and only then will the application start. It also helps in cases where containers need to download data on first run, which might take several minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Persistent Containers
&lt;/h2&gt;

&lt;p&gt;In this project we have some containers that take time to load and become ready, such as the OSM map tile server and the Nominatim container for reverse geocoding, as well as Ollama on first start, since it needs to download the model. &lt;/p&gt;

&lt;p&gt;So if we make code changes, the containers would stop, and we would need to wait for all the dependencies again. &lt;/p&gt;

&lt;p&gt;This is another area where Aspire 9.0 comes to the rescue with a single method call as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;nominatimResourceBuilder&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nominatimResource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithLifetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContainerLifetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Persistent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt; &lt;span class="n"&gt;omitted&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this in place, we can stop and restart debugging lightning fast, without having to wait for the containers. &lt;/p&gt;

&lt;h2&gt;
  
  
  Resource Commands
&lt;/h2&gt;

&lt;p&gt;Resource commands allow developers to register commands against a resource that can be invoked from the Aspire dashboard. &lt;/p&gt;

&lt;p&gt;As I added the Jupyter Notebooks container to the project this weekend, commands helped solve one problem with running the Jupyter server on a remote Docker host: how do we manage the notebook files? Ideally we manage them in the same repository. &lt;/p&gt;

&lt;p&gt;With a download and an upload command, we can upload the notebook from the local drive where our git repository lives, and download it back once it has been modified inside the Jupyter Notebook container. &lt;/p&gt;

&lt;p&gt;Downloading the notebook at regular intervals will be one of the next steps. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6a81g9n5os9nr8pgnua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6a81g9n5os9nr8pgnua.png" alt="Download / upload command menu items in dashboard." width="271" height="373"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;IResourceBuilder&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ContainerResource&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;WithUploadNoteBookCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="n"&gt;IResourceBuilder&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ContainerResource&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;jupyterToken&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;jupyterUrl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"upload-notebook"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;displayName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Upload Notebook"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;executeCommand&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;OnUploadNotebookCommandAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jupyterToken&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jupyterUrl&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;updateState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OnUpdateResourceState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;iconName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"ArrowUpload"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;iconVariant&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IconVariant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Filled&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ExecuteCommandResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;OnUploadNotebookCommandAsync&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt; &lt;span class="k"&gt;params&lt;/span&gt; &lt;span class="n"&gt;omitted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;notebookData&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;disk&lt;/span&gt;
       &lt;span class="c1"&gt;// setup omitted&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;httpclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SendAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uploadNoteBookHttpRequestMessage&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
       &lt;span class="c1"&gt;// handle the response&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// And command is registered as below:&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;jupyter&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithUploadNoteBookCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8888"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithDownloadNoteBookCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8888"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="n"&gt;other&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt; &lt;span class="n"&gt;omitted&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At a high level, the process looks as below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsz8nekc1m7b9l7or14h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsz8nekc1m7b9l7or14h.png" alt="Jupyter Notebook download Command Overview" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although the above illustrates the case of a remote Docker daemon, the experience is the same with a local Docker daemon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container Networking
&lt;/h2&gt;

&lt;p&gt;This is another feature that made life easy. With workloads running on other machines on the local network, container-to-container access had not been that important so far. &lt;/p&gt;

&lt;p&gt;With the Jupyter Notebooks container, as our data source is a MongoDB container on the same Docker host, it became necessary to be able to access the database from another container. &lt;/p&gt;

&lt;p&gt;This is another area where the improvements to Aspire are transparent to the developer: we gain the benefits without extra work, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sgv3uz9figr1mnv40y8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sgv3uz9figr1mnv40y8.png" alt="Accessing MogoDb container from a notebook in a container." width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is not much to do besides ensuring the connection string is injected as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddJupyter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"jupyter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dockerHost&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"secret"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;portMappings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"JupyterPort"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;PublicPort&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mongodb&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once that is done, we can read the connection string in Python as follows: &lt;code&gt;connection_string = os.environ.get('ConnectionStrings__photo-search')&lt;/code&gt;&lt;/p&gt;
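&lt;p&gt;As a concrete sketch of how a notebook cell might consume this, here is a minimal Python helper. The &lt;code&gt;photo-search&lt;/code&gt; resource name comes from the project, but the credentials and host below are invented placeholders for illustration, not real values:&lt;/p&gt;

```python
import os
from urllib.parse import urlparse

def get_connection_string(resource_name: str = "photo-search") -> str:
    """Aspire flattens ConnectionStrings:<name> into the
    ConnectionStrings__<name> environment variable inside the container."""
    key = f"ConnectionStrings__{resource_name}"
    value = os.environ.get(key)
    if value is None:
        raise KeyError(f"{key} is not set - is this running under Aspire?")
    return value

# Illustrative value only; in practice the real string is injected by Aspire.
os.environ["ConnectionStrings__photo-search"] = "mongodb://admin:secret@mongodb:27017"

conn = get_connection_string()
# The hostname is the MongoDB container name, resolvable on the shared network.
host = urlparse(conn).hostname
```

&lt;p&gt;From here a driver such as pymongo could be handed &lt;code&gt;conn&lt;/code&gt; directly, with no hard-coded host or port anywhere in the notebook.&lt;/p&gt;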

&lt;h2&gt;
  
  
  Custom Metrics via Dependency Injection
&lt;/h2&gt;

&lt;p&gt;This one is not Aspire-specific, but as a colleague pointed out last week, one of the features in .NET 9.0 is the out-of-the-box ability to use Dependency Injection (DI) for registering and consuming metrics. &lt;/p&gt;

&lt;p&gt;We can achieve this by injecting IMeterFactory into a utility class where we manage our meters.&lt;/p&gt;

&lt;p&gt;Here is an example from this project that counts the photos summarised per model and records the duration for each image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;System.Diagnostics.Metrics&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="nn"&gt;PhotoSearch.ServiceDefaults&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConsoleMetrics&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_photosSummariesCounter&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="n"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_photosSummaryHistogram&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;ConsoleMetrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IMeterFactory&lt;/span&gt; &lt;span class="n"&gt;meterFactory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meterFactory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"PhotoSummary.Worker"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;_photosSummariesCounter&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateCounter&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"photosummary.summary.generated"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;_photosSummaryHistogram&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;meter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateHistogram&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"photosummary.summary.durationseconds"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;PhotoSummarised&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;_photosSummariesCounter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;?&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"photosummary.summary.model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;PhotoSummaryTiming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;photo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;durationSeconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;_photosSummaryHistogram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;durationSeconds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;?&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"photosummary.summary.model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;KeyValuePair&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;object&lt;/span&gt;&lt;span class="p"&gt;?&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"photosummary.summary.photo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;photo&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Register:&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddSingleton&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ConsoleMetrics&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// Ensure our meter is added when configuring OpenTelemetry:&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddOpenTelemetry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithMetrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddMeter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;InstrumentationOptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MeterName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;// other calls&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// Inject and use: &lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddSingleton&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ConsoleMetrics&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see below, a total of 70 images have been summarised using two models so far. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfsjtmtkq7wv5l1qkyan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfsjtmtkq7wv5l1qkyan.png" alt="Image summary counter." width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, in the following screen capture we can see the timing of each photo summary request. While most take 5 - 10 seconds, there are some outliers taking around 5 minutes. We can dig into the metrics, use the traces to find out which photo / model combination causes the spikes, and then determine whether it is a random GPU issue or a consistent delay under certain circumstances.&lt;/p&gt;
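&lt;p&gt;To make that triage concrete, here is a rough Python sketch of the kind of filtering one could do on exported histogram samples. The model names, file names, and durations are invented for illustration, not real measurements from the project:&lt;/p&gt;

```python
import statistics

# Invented (model, photo, seconds) samples standing in for values exported
# from the photosummary.summary.durationseconds histogram.
samples = [
    ("llava", "IMG_001.jpg", 6.2),
    ("llava", "IMG_002.jpg", 7.8),
    ("moondream", "IMG_001.jpg", 5.4),
    ("moondream", "IMG_003.jpg", 267.0),  # the ~5 minute outlier
    ("llava", "IMG_004.jpg", 9.1),
]

median = statistics.median(d for (_, _, d) in samples)
# Flag anything an order of magnitude above the median; those are the
# photo/model combinations whose traces are worth opening first.
outliers = [(m, p, d) for (m, p, d) in samples if d > 10 * median]
```

&lt;p&gt;The surviving tuples tell us exactly which traces to open in the dashboard.&lt;/p&gt;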

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tclj78idvz4za7ut1lh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tclj78idvz4za7ut1lh.png" alt="Image summary duration metrics." width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we investigate the traces, the results are interesting. &lt;/p&gt;

&lt;p&gt;To summarise a photo, there are currently three calls to the Ollama container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get the overall summary&lt;/li&gt;
&lt;li&gt;Using the context, get a list of objects&lt;/li&gt;
&lt;li&gt;Again using the context so far, get a list of possible categories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the traces show that we spent 4 minutes 27 seconds waiting for the categories to be generated. &lt;/p&gt;

&lt;p&gt;This is worth investigating, and since we have a notebook, it is also easy to experiment with the same prompt / image combinations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6mfaq3lbt323dzbxw7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6mfaq3lbt323dzbxw7q.png" alt="Trace view for the slow summary operation." width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The .NET Aspire 9.0 changes make Aspire a great alternative to Docker Compose. Since it builds on standard container technologies, managing a local development environment with Aspire is well worth a try. &lt;/p&gt;

&lt;p&gt;It has also been great to watch Aspire 9.0 features being discussed in GitHub issues and then shipping, ready to consume, in the new release. The transparency and speed of improvement make it a great choice for development. &lt;/p&gt;

&lt;p&gt;Now that the upgrade is out of the way, the next step will be generating the summaries using OpenAI models, then comparing and ranking each locally generated summary against them and evaluating the results. I have also come across a paper that proposes a more systematic approach to evaluation using state-of-the-art models and will be experimenting with that too. &lt;/p&gt;

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/dotnet/aspire/whats-new/dotnet-aspire-9?tabs=unix#tooling-improvements" rel="noopener noreferrer"&gt;Aspire What's New&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/dotnet/aspire/get-started/upgrade-to-aspire-9?pivots=visual-studio" rel="noopener noreferrer"&gt;Upgrade to .NET Aspire 9.0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/languages/js/" rel="noopener noreferrer"&gt;Open Telemetry Javascript&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals/dashboard/enable-browser-telemetry?tabs=bash" rel="noopener noreferrer"&gt;Aspire - Enable Browser Telemetry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/aspnet/core/log-mon/metrics/metrics?view=aspnetcore-9.0#creating-metrics-in-aspnet-core-apps-with-imeterfactory" rel="noopener noreferrer"&gt;Creating metrics in ASP.NET Core apps with IMeterFactory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals/custom-resource-commands" rel="noopener noreferrer"&gt;Custom Resource Commands in Aspire&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/syamaner/photo-search/tree/main/src/PhotoSearch.AppHost" rel="noopener noreferrer"&gt;Project Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dotnet/aspire/issues/921#issuecomment-2074272361" rel="noopener noreferrer"&gt;Previous method to define container startup dependencies&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>dotnet</category>
      <category>aspire</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Adding Map Based Photo Viewer to .Net Aspire Project with Stencil and OpenStreetMap Tile Server</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Sun, 01 Sep 2024 13:50:44 +0000</pubDate>
      <link>https://dev.to/syamaner/adding-map-based-photo-viewer-to-net-aspire-project-with-stencil-and-openstreetmap-tile-server-1fni</link>
      <guid>https://dev.to/syamaner/adding-map-based-photo-viewer-to-net-aspire-project-with-stencil-and-openstreetmap-tile-server-1fni</guid>
      <description>&lt;p&gt;Here is the next post on my journey building a personal photo search application using open source technologies as a testbed for .Net Aspire.&lt;/p&gt;

&lt;p&gt;Please note that, while these posts cover some how-tos, they are not necessarily intended as tutorials on specific topics, but rather as an overview of how we can integrate these technologies to build something interesting. &lt;/p&gt;

&lt;p&gt;The following areas have been covered so far:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;.NET Aspire

&lt;ul&gt;
&lt;li&gt;Declaring resource dependencies so that our applications can wait until all referenced services have started successfully and fully initialised.&lt;/li&gt;
&lt;li&gt;How to use a remote Docker daemon over SSH for the containers we depend on, when we don't want to overload our current development machine.&lt;/li&gt;
&lt;li&gt;How we can use SSH port forwarding in our App Host.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Machine Learning

&lt;ul&gt;
&lt;li&gt;How to use MultiModal ML Models for summarising and extracting information from photos.&lt;/li&gt;
&lt;li&gt;How we could integrate local models using Ollama or a simple Python project using Hugging Face hosted models.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;How we incorporated reverse geocoding into our solution with .NET Aspire and OpenStreetMap (OSM) Nominatim containers. &lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In today's post the following topics will be explored:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building a simple web component using Stencil that will:

&lt;ul&gt;
&lt;li&gt;Provide a map Web Component that displays the photos in the database.&lt;/li&gt;
&lt;li&gt;Provide a summary Web Component that shows the summaries generated for the selected photo by multiple models, as well as its address, location, and predicted categories.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;A new .NET Aspire resource for an OpenStreetMap (OSM) Tile Server, so that our web component can render maps using local resources (or a remote Docker daemon).&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  What does it look like?
&lt;/h2&gt;

&lt;p&gt;So far, we have been able to import our photos into MongoDB, including geolocation and metadata. In addition, reverse geocoding is applied so that the geodata is converted to the nearest address based on OSM data, using the Nominatim resource.&lt;/p&gt;

&lt;p&gt;Once the images are imported, a background worker generates a summary, categories, and contents using a number of open-source multimodal models, and stores them in a dictionary against each photo.&lt;/p&gt;

&lt;p&gt;The recent changes to the project mean we can visualise these on the map and see what the models generated. We have not yet started looking into model evaluation, so for now we will see a number of accurate results as well as totally made-up information, which makes evaluation a critical part of this project. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljlipxaxhlfg9np2pow9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljlipxaxhlfg9np2pow9.png" alt="Map Component and Map Summary." width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the image is not clear enough, here is the text content: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The image captures a lively scene of a band performing on stage. The stage is bathed in warm yellow lights, creating an atmosphere of excitement and energy. At the heart of the stage, three musicians are immersed in their performance. On the left, a guitarist strums his instrument with passion, his fingers moving over the strings as he plays a melody that fills the room. In the center, a singer belts out a tune, her voice echoing off the walls of the auditorium. To the right, a drummer beats out a rhythm on his drum set, his hands striking the drums in a steady beat. In the background, a large screen displays an image of the band, amplifying their presence and engaging with the audience. The stage is surrounded by a sea of spectators, some of whom are captured in the foreground of the image, their faces turned towards the performers. The perspective of the photo suggests it was taken from the viewpoint of someone standing close to the stage, immersing themselves in the concert experience. The image is a snapshot of a moment filled with music and energy, encapsulating the spirit of live performance.&lt;/p&gt;

&lt;p&gt;Photo Categories&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concert, music, performance&lt;/li&gt;
&lt;li&gt;Audience, stage, band&lt;/li&gt;
&lt;li&gt;Instruments, lighting, yellow lights.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Photo Contents&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Guitarist, singer, drummer&lt;/li&gt;
&lt;li&gt;Guitar, drum set, drums&lt;/li&gt;
&lt;li&gt;Screen, audience, stage lights&lt;/li&gt;
&lt;li&gt;Yellow light bulbs, spectators.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Web Components and Stencil
&lt;/h2&gt;

&lt;p&gt;Web components are a set of standardised technologies that allow developers to create reusable custom elements with encapsulated functionality.&lt;/p&gt;

&lt;p&gt;The key parts of Web Components can be summarised as the following: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom Elements: Define new HTML elements.&lt;/li&gt;
&lt;li&gt;Shadow DOM: Encapsulates styles and markup to prevent them from affecting the rest of the page.&lt;/li&gt;
&lt;li&gt;HTML Templates: Define chunks of markup that can be reused without rendering immediately.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Use Web Components?
&lt;/h3&gt;

&lt;p&gt;Web components offer lightweight reusability, allowing developers to create components that can be used across multiple projects without being tied to specific frameworks. They provide encapsulation by isolating styles and scripts, which helps prevent conflicts and bugs in the global scope. &lt;/p&gt;

&lt;p&gt;Additionally, web components are highly interoperable, which means they can be integrated in applications built with frameworks such as React, Vue.js, Angular as well as vanilla HTML / JavaScript, making them a future-proof solution for developers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Web Components vs. Frameworks
&lt;/h3&gt;

&lt;p&gt;Unlike traditional frameworks, web components are not dependent on any specific framework since they are built on browser-native APIs. This foundation on web standards guarantees their longevity and stability, ensuring long-term support without the need to adapt to changes in framework-specific updates. Web components are also known for their performance benefits, as they can be lightweight and optimised without the runtime overhead that frameworks often introduce. &lt;/p&gt;

&lt;p&gt;For developers already familiar with HTML, CSS, and JavaScript, web components offer a more straightforward learning curve compared to adopting an entire new framework.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stencil
&lt;/h3&gt;

&lt;p&gt;Stencil is a web component compiler that simplifies the process of building scalable, performant, framework-agnostic web components. It is one of many options for simplifying the Web Component build process.&lt;/p&gt;

&lt;p&gt;The web components in this project are built using Stencil.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding a Mapping UI to our Aspire Project as a Web Component.
&lt;/h2&gt;

&lt;p&gt;.NET Aspire already has support for NPM applications, so adding a Stencil starter application is as simple as the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a hello world application as outlined in &lt;a href="https://stenciljs.com/docs/getting-started" rel="noopener noreferrer"&gt;Stencil documentation&lt;/a&gt; in the same repository as our Aspire application.&lt;/li&gt;
&lt;li&gt;Restore the packages and add your components.&lt;/li&gt;
&lt;li&gt;Then use AddNpmApp to register the application in our Aspire App Host.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddNpmApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"stencil"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"../photosearch-frontend"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;apiService&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;osmTileService&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithHttpEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;portMappings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"FEPort"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;PublicPort&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;targetPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;portMappings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"FEPort"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;PrivatePort&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"PORT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;isProxied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;PublishAsDockerFile&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This even supports auto-reloading, so we can keep modifying the UI source code and see the changes reflected quickly. I have been using Rider for the Aspire project and VS Code for the Stencil and Python code.  &lt;/p&gt;

&lt;p&gt;The map component is simple and responsible for the following: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calling our .NET API endpoint to get the photos.&lt;/li&gt;
&lt;li&gt;Rendering the photos on the map. &lt;/li&gt;
&lt;li&gt;Raising an event when the user selects and views a photo on the map, so that the summary component can display the details and summaries of the selected photo. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The component encapsulates an open-source map rendering library called "MapLibre GL JS" and uses our OSM Tile Server container to render the tiles. As we can see below, service discovery is handled by Aspire, so we do not have to worry about manually updating URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="nx"&gt;imports&lt;/span&gt; &lt;span class="nx"&gt;omitted&lt;/span&gt; 

&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Component&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;map-component&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;styleUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;map-component.css&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;shadow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MapComponent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="nl"&gt;mapElement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;HTMLElement&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;photoSummaries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;PhotoSummary&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nl"&gt;map&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nl"&gt;markers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="nx"&gt;Marker&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;

  &lt;span class="nx"&gt;loadPhotos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;API_BASE_URL&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/photos&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;photoSummaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;photoSummaries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;photo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;marker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Marker&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;draggable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;setLngLat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;photo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Longitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;photo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Latitude&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imgUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;Env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;API_BASE_URL&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/image/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;photo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/1280/1280`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nx"&gt;marker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setPopup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Popup&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;className&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;apple-popup&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setHTML&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&amp;lt;img src='&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;imgUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;' data-id="&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;photo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" loading="lazy"&amp;gt;&amp;lt;/img&amp;gt;`&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
      &lt;span class="nx"&gt;marker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getPopup&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;setMaxWidth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;300px&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;popupElem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;marker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElement&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="nx"&gt;popupElem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;click&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;PubSub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;EventNames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PhotoSelected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;photo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;markers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;photo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;marker&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="nx"&gt;componentWillLoad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loadPhotos&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;disconnectCallback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;markers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;componentDidLoad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;StyleSpecification&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;osm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;raster&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;tiles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;Env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MAP_TILE_SERVER&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/tile/{z}/{x}/{y}.png`&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="na"&gt;tileSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;attribution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.....&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;osm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;raster&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;osm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mapElement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;center&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;photoSummaries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;Longitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;photoSummaries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;Latitude&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;zoom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;photoSummaries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;photo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;markers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;photo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;addTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;render&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;map&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="nx"&gt;ref&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mapElement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;HTMLElement&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
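&lt;p&gt;The click handler above decouples the map from the photo list: the marker publishes a PhotoSelected event, and any interested component (such as photo-summary-view) subscribes to it. A minimal sketch of that publish / subscribe pattern (the small event bus below is an illustrative stand-in for the PubSub library used in the component):&lt;/p&gt;

```typescript
// Minimal typed publish / subscribe bus; an illustrative stand-in for the
// PubSub library the map component uses to announce a selected photo.
type Handler<T> = (data: T) => void;

class EventBus {
  private handlers = new Map<string, Handler<unknown>[]>();

  subscribe<T>(topic: string, handler: Handler<T>): void {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler as Handler<unknown>);
    this.handlers.set(topic, list);
  }

  publish<T>(topic: string, data: T): void {
    for (const handler of this.handlers.get(topic) ?? []) {
      handler(data);
    }
  }
}

// The map component publishes the clicked photo's summary; a sibling
// component such as photo-summary-view subscribes to react to it.
const bus = new EventBus();
const PhotoSelected = "PhotoSelected";

bus.subscribe<{ Id: string }>(PhotoSelected, (photo) => {
  // e.g. load and display the full-size photo here
  void photo.Id;
});
bus.publish(PhotoSelected, { Id: "42" });
```

&lt;p&gt;Because neither component holds a reference to the other, the map and the summary view can be developed and tested independently.&lt;/p&gt;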



&lt;h3&gt;
  
  
  Service discovery for the Stencil Application
&lt;/h3&gt;

&lt;p&gt;Given that Aspire injects the connection strings of the services we depend on, the Stencil configuration can read these variables and inject them into components as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;  &lt;span class="p"&gt;...,&lt;/span&gt;
  &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;API_BASE_URL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;services__apiservice__http__0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;MAP_TILE_SERVER&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ConnectionStrings__OSMMapTileServer&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
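&lt;p&gt;Since these variables only exist when the Aspire AppHost orchestrates the build, it can help to fail fast when one is missing. A small sketch of that idea (the requireEnv helper and the sample values are hypothetical, not part of the project):&lt;/p&gt;

```typescript
// Hypothetical helper (not part of the project): read an Aspire-injected
// variable and fail fast when it is missing, so a misconfigured
// orchestration surfaces immediately instead of producing requests
// against an "undefined" base URL.
function requireEnv(env: Record<string, string | undefined>, name: string): string {
  const value = env[name];
  if (!value) {
    throw new Error(`Missing Aspire-injected variable: ${name}`);
  }
  return value;
}

// Sample values standing in for process.env; the variable names mirror
// the mapping in the Stencil config above.
const env = {
  services__apiservice__http__0: "http://localhost:5000",
  ConnectionStrings__OSMMapTileServer: "http://localhost:8080",
};

const API_BASE_URL = requireEnv(env, "services__apiservice__http__0");
const MAP_TILE_SERVER = requireEnv(env, "ConnectionStrings__OSMMapTileServer");
```

&lt;p&gt;With such a guard, a missing service reference in the AppHost fails the front-end build with a descriptive message rather than a broken map at runtime.&lt;/p&gt;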



&lt;h3&gt;
  
  
  How to use the component?
&lt;/h3&gt;

&lt;p&gt;Once the components are ready, we can use them just like any other HTML elements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;        &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;flex mb-4 map-container&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;w-1/2 h-120&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;component&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/map-component&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;          &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;          &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;div&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;w-1/2   h-120&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;photo&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;view&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/photo-summary-view&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;          &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;        &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/div&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Open Street Map (OSM) Tile Server .NET Aspire Resource
&lt;/h2&gt;

&lt;p&gt;We have a mapping web component, but without a map tile server we will not be able to render the maps. Although there are free tile servers for demos, it is better not to overload those shared resources and to consume our own computing power instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hub.docker.com/r/overv/openstreetmap-tile-server/" rel="noopener noreferrer"&gt;OSM Tile Server Container&lt;/a&gt; makes this a simple job. And once we integrate with Aspire, we have a Tile Server running on demand without a complicated setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adjustments to the Tile Server Container
&lt;/h3&gt;

&lt;p&gt;As per the documentation, we would need to run the container once to download the maps into a volume, and then reuse the same volume and run the container again to host the map. The changes made in this post allow both actions to be performed at startup.&lt;/p&gt;

&lt;p&gt;The image built in this project will execute the following startup script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nb"&gt;cd&lt;/span&gt; /
./run.sh import
./run.sh run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Besides the above, there is no change to the original OSM Tile Server image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating an OSM Tile Server Resource
&lt;/h3&gt;

&lt;p&gt;The process for defining the resource is similar to the Nominatim resource covered in a previous post, so it will not be repeated here; as always, the source code is available in the &lt;a href="https://github.com/syamaner/photo-search" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The sample images were all taken across London, so we use London maps only; this keeps the download small and lets the map database be set up quickly on the container's first start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where We Are and What's Next
&lt;/h2&gt;

&lt;p&gt;So far we have a means of importing photos and then processing them using various multi-modal machine learning models. Generative models will usually produce something when asked, but the results are often not what we want. Before deciding how to use these models, it is important to have an evaluation approach.&lt;/p&gt;

&lt;p&gt;Now that we have a basic UI, we can focus on evaluating these models to find the best prompt / model combination for our search application.&lt;/p&gt;

&lt;p&gt;This means we will need to version our results to include the prompt and model used, and then work out metrics that help us choose the combinations with the highest success rate.&lt;/p&gt;

&lt;p&gt;The initial method will be to compare results against those generated by models such as GPT-4 and see how our small / local models measure up.&lt;/p&gt;
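&lt;p&gt;To make that comparison concrete, here is a sketch of what such an evaluation could look like (the token-overlap metric and the result shape below are illustrative assumptions, not the final approach):&lt;/p&gt;

```typescript
// Illustrative sketch of a simple evaluation metric: score a local
// model's caption against a reference caption (e.g. one from GPT-4)
// by token overlap (Jaccard index). The metric and result shape are
// assumptions for illustration, not the article's final approach.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
}

function jaccard(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  const intersection = Array.from(ta).filter((t) => tb.has(t)).length;
  const union = new Set(Array.from(ta).concat(Array.from(tb))).size;
  return union === 0 ? 0 : intersection / union;
}

// Each result is versioned with the model and prompt that produced it,
// so prompt / model combinations can be ranked by mean score.
interface CaptionResult {
  model: string;
  prompt: string;
  caption: string;
}

function meanScore(results: CaptionResult[], reference: string): number {
  if (results.length === 0) return 0;
  const total = results.reduce((sum, r) => sum + jaccard(r.caption, reference), 0);
  return total / results.length;
}
```

&lt;p&gt;Because each result carries its model and prompt, the combinations can be ranked by their mean score against the reference captions.&lt;/p&gt;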

&lt;p&gt;This will be the focus of the next post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;Web Components&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://web.dev/articles/ps-on-the-web" rel="noopener noreferrer"&gt;Photoshop's journey to the web&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://eisenbergeffect.medium.com/2023-state-of-web-components-c8feb21d4f16" rel="noopener noreferrer"&gt;2023 State of Web Components&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arewebcomponentsathingyet.com/" rel="noopener noreferrer"&gt;Are Web Components a Thing Yet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://eisenbergeffect.medium.com/libraries-and-frameworks-and-platforms-oh-my-f77a0ec3d57d" rel="noopener noreferrer"&gt;Libraries and Frameworks and Platforms, Oh My!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://eisenbergeffect.medium.com/debunking-web-component-myths-and-misconceptions-ea9bb13daf61" rel="noopener noreferrer"&gt;Debunking Web Component Myths and Misconceptions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://eisenbergeffect.medium.com/the-many-faces-of-a-web-component-fd974e2b1ee6" rel="noopener noreferrer"&gt;The Many Faces of a Web Component&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Web Component Tooling&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://lit.dev/" rel="noopener noreferrer"&gt;Lit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stenciljs.com/" rel="noopener noreferrer"&gt;Stencil&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fast.design/" rel="noopener noreferrer"&gt;FAST&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.salesforce.com/docs/platform/lwc/guide" rel="noopener noreferrer"&gt;Salesforce - Lightning Web Components (LWC)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open Street Maps&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://switch2osm.org/serving-tiles/using-a-docker-container/" rel="noopener noreferrer"&gt;Serving OSM Tiles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Overv/openstreetmap-tile-server/tree/master" rel="noopener noreferrer"&gt;OSM Tile Server Image&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://switch2osm.org/" rel="noopener noreferrer"&gt;Switch2OSM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stencil&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ionic.io/blog/advanced-stencil-component-styling" rel="noopener noreferrer"&gt;Advanced Stencil Component Styling&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dotnet</category>
      <category>webcomponents</category>
      <category>aspire</category>
      <category>openstreetmap</category>
    </item>
    <item>
      <title>Simplifying Remote Docker Container Connections in .NET Aspire with SSH.Net</title>
      <dc:creator>syamaner</dc:creator>
      <pubDate>Sun, 04 Aug 2024 18:44:27 +0000</pubDate>
      <link>https://dev.to/syamaner/simplifying-remote-docker-container-connections-in-net-aspire-with-sshnet-207</link>
      <guid>https://dev.to/syamaner/simplifying-remote-docker-container-connections-in-net-aspire-with-sshnet-207</guid>
      <description>&lt;p&gt;Previously, I have shared my experience so far in my journey towards .NET Aspire using Photo Search use case by the help of multi-modal models. The key part for me was being able to use remote docker host which has worked well out of the box for most part. My focus so far is on the local development environments and therefore exploring different ways to achieve the original goal of running containers remotely on local network using SSH.&lt;/p&gt;

&lt;p&gt;As a recap, it all started with the question of whether the scenario illustrated in the following image could be supported out of the box:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kqexteonfw9rz95ehn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kqexteonfw9rz95ehn9.png" alt="Components overview" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this brief post, we will start with some of the challenges of the approach used so far, and then introduce the SSH Port forwarding method and how it improves on the shortcomings of the initial approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Previous Approach: Overriding Connection Strings When Using a Remote Host
&lt;/h2&gt;

&lt;p&gt;The approach so far has been as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the containers:

&lt;ul&gt;
&lt;li&gt;Ensure the container port binding uses 0.0.0.0 to listen on all interfaces so that it can be reached from the local network.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;container.WithContainerRuntimeArgs("-p", $"0.0.0.0:{publicPort}:{publicPort}");&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;For Aspire AppHost

&lt;ul&gt;
&lt;li&gt;Ensure the DOCKER_HOST environment variable on the development machine is set.

&lt;ul&gt;
&lt;li&gt;The remote Docker host allows SSH connections from the development machine (e.g. ssh-copy-id has been run to add our public key to the remote server).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;If creating a custom resource, ensure the connection strings and exposed endpoints use the IP of the Docker Host instead of localhost.&lt;/li&gt;

&lt;li&gt;For built-in resources:

&lt;ul&gt;
&lt;li&gt;Either override the injected connection string environment variables.&lt;/li&gt;
&lt;li&gt;Or use connection string redirection.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The challenges with this approach are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Having to modify connection strings at runtime when using a remote Docker host can get messy as we add more resources.&lt;/li&gt;
&lt;li&gt;Connection string redirection succeeded for PostgreSQL but not for RabbitMQ, so we had to fall back to overriding environment variables.&lt;/li&gt;
&lt;li&gt;When we update the connection strings, the exposed endpoints in the Aspire Dashboard can still point to localhost, or be removed because they are no longer valid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, this has worked for redirecting the PostgreSQL connection string to the remote host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;pgConnectionStringRedirection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CustomPostgresConnectionStringRedirection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;publicPort&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;pgUsername&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pgPassword&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;postgresContainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithConnectionStringRedirection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pgConnectionStringRedirection&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SSH.Net to the rescue
&lt;/h2&gt;

&lt;p&gt;As the Docker CLI uses SSH to run containers on the remote Docker daemon, instead of making containers listen on all interfaces we can do SSH port forwarding from our development machine to the Docker host machine. &lt;/p&gt;

&lt;p&gt;Wouldn't it be cool to orchestrate this from the Aspire AppHost project, so that forwarding starts when the development orchestration starts and stops when we spin down our development environment?&lt;/p&gt;

&lt;p&gt;The following diagram is a simplified illustration of this concept. The ports used by the containers are forwarded over SSH from the development machine to the Docker host. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc20cw56bt6veyzrfikd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc20cw56bt6veyzrfikd.png" alt="Using SSH Port Forwarding" width="786" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;Given that we already have a successful SSH connection that allows managing containers on a remote host, forwarding the container ports via SSH not only makes sense but also simplifies the setup. &lt;/p&gt;

&lt;p&gt;When the AppHost starts, we need to forward the ports as illustrated in the code snippet below. Instead of calling &lt;code&gt;WithContainerRuntimeArgs("-p", $"0.0.0.0:{publicPort}:{publicPort}");&lt;/code&gt; on the resources, we need to ensure the endpoints are not proxied, as SSH port forwarding takes care of this instead. &lt;/p&gt;

&lt;p&gt;Some advantages are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connection strings and endpoints can be used as-is, without having to transform or inject connection strings.&lt;/li&gt;
&lt;li&gt;No more calling &lt;code&gt;WithContainerRuntimeArgs("-p", $"0.0.0.0:{publicPort}:{publicPort}");&lt;/code&gt;: the containers listen on localhost, and there is no need to worry about exposing them on other network interfaces.&lt;/li&gt;
&lt;li&gt;Able to use the links on the Aspire Dashboard to access container resources.

&lt;ul&gt;
&lt;li&gt;Localhost on the local machine is forwarded to the right host / port, so this works transparently for us.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;
&lt;span class="c1"&gt;// ... usings&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DistributedApplication&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dockerHost&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StartupHelper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetDockerHostValue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;enableNvidiaDocker&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StartupHelper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;NvidiaDockerEnabled&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// ... resources&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;apiService&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddProject&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Projects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PhotoSearch_API&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"apiservice"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ollamaContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postgresDb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messaging&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messaging&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;backgroundWorker&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddProject&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Projects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PhotoSearch_Worker&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"backgroundservice"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ollamaContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postgresDb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;florence3Api&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nominatimContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messaging&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ollamaContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nominatimContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WaitFor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messaging&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// add ssh_user and ssh_key_file (path to the key file) for ser secrets.&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="nn"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sshUtility&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;SShUtility&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dockerHost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"ssh_user"&lt;/span&gt;&lt;span class="p"&gt;]!,&lt;/span&gt;  &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"ssh_key_file"&lt;/span&gt;&lt;span class="p"&gt;]!);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(!&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IsNullOrWhiteSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dockerHost&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Forwards the ports to the docker host machine&lt;/span&gt;
    &lt;span class="n"&gt;sshUtility&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Connect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="c1"&gt;// PgAdmin&lt;/span&gt;
    &lt;span class="n"&gt;sshUtility&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddForwardedPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;8081&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;8081&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Postgres&lt;/span&gt;
    &lt;span class="n"&gt;sshUtility&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddForwardedPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// RabbitMQ&lt;/span&gt;
    &lt;span class="n"&gt;sshUtility&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddForwardedPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5672&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;5672&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// RabbitMQ Management&lt;/span&gt;
    &lt;span class="n"&gt;sshUtility&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddForwardedPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;15672&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;15672&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Nominatim&lt;/span&gt;
    &lt;span class="n"&gt;sshUtility&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddForwardedPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;8180&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;8180&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Ollama&lt;/span&gt;
    &lt;span class="n"&gt;sshUtility&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddForwardedPort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;11438&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;11438&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Build&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see below, the dashboard URLs point to localhost and are fully functional even though the containers are running remotely. Previously, the links in the dashboard were either blank (the endpoints had been removed because they were not valid) or incorrectly listed as localhost, even though the container services actually had to be reached via the IP address of the Docker host.&lt;/p&gt;

&lt;p&gt;With SSH port forwarding in place, this issue is resolved.&lt;/p&gt;
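&lt;p&gt;For reference, the kind of forwarding that &lt;code&gt;SShUtility&lt;/code&gt; presumably wraps can be sketched directly with SSH.NET's &lt;code&gt;ForwardedPortLocal&lt;/code&gt;. This is a minimal sketch, not the utility's actual implementation; the host name, user name and key path are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;using Renci.SshNet;

// Authenticate to the remote Docker host with a private key (placeholder values).
using var client = new SshClient("docker-host", "user", new PrivateKeyFile("/path/to/key"));
client.Connect();

// Listen on localhost:8081 and tunnel traffic to port 8081 on the remote host (PgAdmin).
var forwarded = new ForwardedPortLocal("127.0.0.1", 8081, "127.0.0.1", 8081);
client.AddForwardedPort(forwarded);
forwarded.Start();

// The tunnel stays open as long as the client is connected;
// disposing the client tears down all forwarded ports.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One such tunnel per service is enough for the dashboard links to resolve against localhost.&lt;/p&gt;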

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vi3yrl7j2dafdhwqscx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vi3yrl7j2dafdhwqscx.png" alt="aspire Dashboard" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;When using a remote Docker host, we can also forward ports via SSH, which simplifies the setup and requires less code than overriding connection strings for each application. When the containers run locally, we skip the port forwarding and let Aspire endpoints do the job as intended.&lt;/p&gt;

&lt;p&gt;This also means the setup supports GPU hosting providers such as &lt;a href="https://lambdalabs.com/service/gpu-cloud" rel="noopener noreferrer"&gt;Lambda Labs GPU Cloud&lt;/a&gt;: we can run containers for a few hours to experiment with large models that require GPUs with large VRAM, and pay only for what we use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/syamaner/photo-search" rel="noopener noreferrer"&gt;Current code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sshnet/SSH.NET" rel="noopener noreferrer"&gt;SSH .Net&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dotnet</category>
      <category>aspire</category>
      <category>docker</category>
    </item>
  </channel>
</rss>
