I could not find a publicly available audio dataset for coffee roasting first crack detection. Hugging Face, Kaggle, and the academic literature I searched returned nothing directly usable for this task. First crack is a sparse acoustic event: short, irregular pops embedded in continuous drum and airflow noise, which makes both the recordings and the label schema non-trivial. Before the model in Post 1 could train a single epoch, I had to build one from scratch: recording sessions, annotating audio in Label Studio, and building a pipeline that mitigates a specific source of metric inflation in time-series audio models.
The result is a publicly available coffee roasting audio dataset: 973 annotated 10-second segments from 15 roasting sessions, published on Hugging Face. Segments are disjoint (non-overlapping) 10-second chunks; each is labelled first_crack when at least 50% of its window overlaps with the annotated first crack region, and no_first_crack otherwise. Two design choices contributed to the model achieving 100% precision and zero false positives on the held-out test set: splitting data at the recording level, and weighting the loss function to account for a 20/80 class imbalance.
My contribution was the domain judgment: what to record, how to annotate, and which constraints to enforce. Oz implemented the data pipeline (convert_labelstudio_export.py, chunk_audio.py, dataset_splitter.py) in PR #27 from a spec I iterated on before any code was written. This post focuses on those decisions, delivered across a two-night sprint; the addendum covers the follow-up infrastructure work.
In this post
- The Gap
- Recording & Annotation
- Data Leakage
- Class Imbalance
- Dataset by the Numbers
- Addendum: After the First Release
- References
The Gap
When there is no dataset, the options are limited: reuse adjacent data and accept a mismatch, or collect new data. Coffee roasting audio is specific enough that generic kitchen or industrial recordings are not a good substitute. A roaster has a consistent acoustic background (drum rotation and airflow), followed by short, irregular pops during first crack that vary with bean and roast profile.
A search across Hugging Face at the time of writing returns only text datasets related to coffee. Kaggle and Papers with Code do not surface annotated roasting audio in their public indexes. No widely used baseline appears in the public literature I searched.
This makes the dataset a primary contribution of this work. The model can be reproduced. The dataset allows other approaches to be explored without repeating the data collection work.
Recording & Annotation
Nine recordings were captured with a FIFINE K669B USB condenser microphone pointed at the roaster during normal roasting sessions. Six additional recordings (S17) were captured with an Audio-Technica ATR2100x dynamic microphone in dedicated data-collection sessions. Both microphones record over USB into Audacity (mono, 16-bit PCM at 44.1 kHz); audio is resampled to 16 kHz for the feature extractor.
The microphone choice affects the data. Condenser and dynamic microphones differ in sensitivity and frequency response. With more condenser recordings in the training set, the model primarily learned that acoustic profile. This imbalance later showed up as delayed detection (27.4 seconds) on recordings from the second microphone.
Each recording is a full roast (8–12 minutes) containing one first crack phase.
Annotation v1: The Fragmented Approach
The initial approach for the prototype treated first crack as a set of discrete events. Each audible burst was annotated as a short region. This produced fragmented labels: multiple short segments with gaps between them. The Brazil recordings from October 2025 show this clearly: 18 separate FC regions across a 90-second first crack window, ranging from 1.3 seconds (chunk_011: 455.75–457.09s) to 5.8 seconds, with gaps of up to 9 seconds between them.
The model trained and achieved 91.1% accuracy, but produced false positives. The issue was not data volume but label consistency.
The actual problem was in how chunk_audio.py assigns labels. Every 10-second window gets a binary label based on whether its overlap with annotated first_crack regions meets a ≥50% threshold:
```python
# src/coffee_first_crack/data_prep/chunk_audio.py
def label_window(
    window_start: float,
    window_end: float,
    regions: list[dict[str, Any]],
    overlap_threshold: float = 0.5,
) -> str:
    window_duration = window_end - window_start
    overlap = compute_overlap(window_start, window_end, regions)
    if overlap >= overlap_threshold * window_duration:
        return "first_crack"
    return "no_first_crack"
```
With fragmented annotations, many windows containing real first crack audio fell below the threshold. The first crack onset in brazil-3, for example, is annotated as a 1.3-second region starting at 455.75s. A 10-second training window spanning 455–465s overlaps that annotation for only 1.3 seconds (13% overlap), so it is labelled no_first_crack even though it contains a real first crack pop. The same logic produces mislabelled windows at every inter-region gap throughout the recording, introducing systematic noise into the training data.
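The helper `compute_overlap` is not shown above. A minimal sketch of what it has to do (hypothetical reimplementation for illustration; the real helper lives in chunk_audio.py alongside label_window) reproduces the brazil-3 failure exactly:

```python
from typing import Any

def compute_overlap(window_start: float, window_end: float,
                    regions: list[dict[str, Any]]) -> float:
    """Total seconds of overlap between one window and all annotated regions."""
    total = 0.0
    for region in regions:
        lo = max(window_start, region["start"])
        hi = min(window_end, region["end"])
        total += max(0.0, hi - lo)  # regions outside the window contribute 0
    return total

# The brazil-3 case from above: a 1.3 s annotation inside a 10 s window.
regions = [{"start": 455.75, "end": 457.09}]
overlap = compute_overlap(455.0, 465.0, regions)  # ~1.34 s, only 13% of the window
label = "first_crack" if overlap >= 0.5 * 10.0 else "no_first_crack"
# label is "no_first_crack" despite a real pop inside the window
```

The same arithmetic, applied with one continuous phase-level region, is what makes the v2 labels deterministic.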
Annotation v2: Phase-Level
The correction was to treat first crack as a continuous phase. All 15 recordings were re-annotated with a single region per roast, spanning from the first audible pop to the end of consistent cracking. This aligns the annotation with the physical process rather than subjective segmentation.
Everything outside the annotated region is implicitly no_first_crack. One decision, enforced by the schema.
With one continuous region, the ≥50% threshold in label_window becomes deterministic. Every 10-second window inside the first crack period gets clean overlap. There are no inter-region gaps to corrupt the training signal. Baseline_v2, trained on these re-annotated labels, produced 0 false positives on a 191-sample test set. That is 100% precision.
The Pipeline Oz Built
The Label Studio export is a single JSON file. Before any training can start, that file needs to be parsed per recording, each recording chunked into 10-second windows, and the chunks split into train/val/test. I specified the constraints (fixed window size, ≥50% overlap threshold for labels, and splitting at the recording level), and Oz implemented the three scripts that handle it, working from inside the Warp terminal as part of PR #27:
- convert_labelstudio_export.py parses annotations into per-recording files
- chunk_audio.py generates fixed 10-second windows and assigns labels
- dataset_splitter.py performs the train/validation/test split
In a single session, Oz executed the full pipeline (Label Studio conversion, chunking all 973 segments, and the recording-level split) and then immediately invoked the /train-model skill.
Data Leakage
Audio data introduces a specific form of leakage. A recording's acoustic fingerprint is constant throughout: the background drum hum, the extractor hood (home roasting), street noise from outside, the room resonance, the microphone's frequency response.
If chunks from the same recording appear in both training and test sets, the model can rely on background characteristics rather than the target signal.
The earlier prototype split at the chunk level, which allowed near-identical segments from the same recording to appear in both sets. Reported accuracy was therefore optimistic. Its train/test split looked as follows:
```
# coffee-roasting prototype — splits/
test/first_crack/roast-1-costarica-hermosa-hp-a_chunk_018.wav
train/first_crack/roast-1-costarica-hermosa-hp-a_chunk_019.wav
train/first_crack/roast-1-costarica-hermosa-hp-a_chunk_020.wav
```
chunk_018 and chunk_019 are consecutive windows from the same roasting session, milliseconds apart in time, sharing identical background characteristics. The prototype's 91.1% accuracy on that test set was real arithmetic on real predictions. The generalisation it implied was not: the model had acoustically seen those recordings before.
The revised pipeline groups chunks by their source recording and assigns entire recordings to a split. This ensures that evaluation is performed on unseen sessions.
Once recordings are grouped, a split assigns them 70/15/15, stratified by whether each recording contains any first crack chunks, so FC-containing recordings are distributed across all three sets rather than clustering in one. The result is 15 recordings split 9/3/3, producing 587/195/191 chunks.
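The mechanics of that split can be sketched roughly as follows (hypothetical code illustrating the idea behind dataset_splitter.py; the ratios match the post, but the filename parsing, function names, and seed are assumptions):

```python
import random
import re
from collections import defaultdict

def recording_id(chunk_filename: str) -> str:
    """'roast-1-costarica-hermosa-hp-a_chunk_018.wav' -> 'roast-1-costarica-hermosa-hp-a'."""
    return re.sub(r"_chunk_\d+\.wav$", "", chunk_filename)

def split_by_recording(chunk_labels: dict[str, str], seed: int = 42) -> dict[str, set[str]]:
    """Assign whole recordings to train/val/test (~70/15/15), stratified by
    whether the recording contains any first_crack chunk."""
    by_rec: dict[str, list[str]] = defaultdict(list)
    for fname, label in chunk_labels.items():
        by_rec[recording_id(fname)].append(label)

    fc_set = {r for r, labels in by_rec.items() if "first_crack" in labels}
    with_fc, without_fc = sorted(fc_set), sorted(set(by_rec) - fc_set)

    rng = random.Random(seed)
    splits: dict[str, set[str]] = {"train": set(), "val": set(), "test": set()}
    for group in (with_fc, without_fc):          # stratify over both strata
        rng.shuffle(group)
        n_train = round(len(group) * 0.7)
        n_val = round(len(group) * 0.15)
        splits["train"] |= set(group[:n_train])
        splits["val"] |= set(group[n_train:n_train + n_val])
        splits["test"] |= set(group[n_train + n_val:])
    return splits
```

Because the unit of assignment is the recording, no chunk can ever land in a different split than its acoustic siblings.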
Every chunk in the 191-sample test set comes from a recording the model had never encountered in any form during training.
Class Imbalance
Fixed non-overlapping 10-second chunking reveals the true distribution: about 20% of windows contain first crack.
Without correction, the loss function is dominated by the majority class. A model predicting only "no first crack" would achieve high accuracy without detecting anything useful. Standard CrossEntropyLoss treats all samples equally, so with a 20/80 split the majority class contributes 4× more gradient signal per epoch.
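The accuracy trap is easy to reproduce with the dataset's own class counts:

```python
# A degenerate classifier that never predicts first_crack still scores ~80%
labels = ["no_first_crack"] * 776 + ["first_crack"] * 197
preds = ["no_first_crack"] * len(labels)

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
recall = 0 / 197                     # it never detects a single first crack window
print(f"{accuracy:.1%}")             # 79.8% accuracy, 0% recall
```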
The fix is a small subclass of the HuggingFace Trainer that overrides compute_loss to apply class-weighted CrossEntropyLoss:
```python
# src/coffee_first_crack/train.py
class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        weights = self.class_weights.to(logits.device)
        loss_fn = nn.CrossEntropyLoss(weight=weights)
        loss = loss_fn(logits, labels)
        return (loss, outputs) if return_outputs else loss
```
The weights come from inverse class frequency on the training set, which increases the contribution of minority class samples during training without resampling or augmentation.
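The post does not pin down the exact weighting formula; one common inverse-frequency recipe looks like this (pure Python for illustration — in train.py the result would be a torch tensor assigned to self.class_weights):

```python
from collections import Counter

def inverse_frequency_weights(labels: list[str]) -> dict[str, float]:
    """weight_c = n_total / (n_classes * n_c): rarer classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Approximate training-set distribution (~20% first_crack of 587 chunks):
labels = ["first_crack"] * 117 + ["no_first_crack"] * 470
w = inverse_frequency_weights(labels)
# w["first_crack"] is ~4x w["no_first_crack"], mirroring the 20/80 imbalance
```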
There is a trade-off. Increasing recall can increase false positives. In this system, a false positive is more problematic than a delayed detection. The resulting model favours precision: no false positives on the test set, with reduced recall.
Dataset by the Numbers
The full dataset is published at huggingface.co/datasets/syamaner/coffee-first-crack-audio. The source recordings, annotation JSONs, and pipeline code are in the GitHub repository.
| Field | Value |
| --- | --- |
| Total chunks | 973 (10-second WAV, 16 kHz mono) |
| first_crack | 197 (~20%) |
| no_first_crack | 776 (~80%) |
| Recordings | 15 (9 on mic 1, FIFINE K669B; 6 on mic 2, ATR2100x) |
| Bean origins | Costa Rica Hermosa, Brazil, Brazil Santos |
| Split (recordings) | 9 train / 3 val / 3 test |
| Split (chunks) | 587 train / 195 val / 191 test |
| Annotation tool | Label Studio, one first_crack region per recording |
| Chunking | Fixed 10 s window, ≥50% overlap threshold |
| Split strategy | Recording-level (prevents data leakage) |
| License | CC-BY-4.0 |
All four engineering decisions described in this post (the annotation redesign, fixed non-overlapping chunking, recording-level splits, and class-weighted training) are encoded in the pipeline and reproducible from the published data.
Addendum: After the First Release
The sections above describe what shipped at the end of a two-night sprint: the problem, the annotation redesign, recording-level splits, class-weighted training, and the published dataset. This addendum covers what came after the artefact was in production: reframing the 27.4-second mic-2 delay as a data problem, the macOS audio work needed for paired recording, the two scripts Oz built from a spec we iterated on in the Warp terminal, the MCP conflict that broke the first real dual-mic roast, and a retrospective on six rounds of Copilot review on PR #48.
Reframing: A Data Problem, Not a Model Problem
The 27.4-second detection delay on the mic-2 test recording is reproducible and it does not move under hyperparameter tuning. Nine of the 15 recordings used the FIFINE K669B condenser; only 6 used the ATR2100x dynamic. The model learned first crack primarily through the FIFINE's acoustic lens: its sensitivity curve, its noise floor, its handling of attack transients. When the ATR2100x presents a different acoustic signature for the same physical event, the model stalls until it has accumulated enough evidence to override its prior.
Hyperparameter changes will not fix a distribution shift that lives in the training data. The fix is more paired data: recordings that capture the same first crack event through both microphones simultaneously, so every new roast adds one mic-1 and one mic-2 sample to the training set at a 1:1 ratio. That reframes the problem from tuning around a sensor bias to recording two USB microphones in lockstep for ten minutes, which is a solvable macOS audio problem, covered next.
Sample-Locking Two USB Mics
Each USB microphone ships with its own internal clock crystal, and opening two independent sounddevice.InputStream instances lets each run on its own clock. Over a 10-minute roast the waveforms drift apart: the offset is invisible for the first few minutes, but large enough by the end to make cross-mic timestamps unreliable. That breaks the "annotate once, propagate to both files" workflow I wanted next.
The fix is a macOS CoreAudio Aggregate Device. In Audio MIDI Setup I created a virtual multi-channel input called RoastMics combining the FIFINE K669B and the ATR2100x, with the FIFINE set as Primary Clock and Drift Correction enabled on the ATR2100x. CoreAudio continuously resamples the secondary mic so the two channels stay sample-locked: if a first crack pop arrives at sample index N on channel 0, it arrives at index N on channel 1.
That single property, physical sample-locking, is what makes paired annotation trivial. A dual-mic roast needs one Label Studio pass on the FIFINE track; the ATR2100x annotation is a straight copy because the timestamps are identical. Full Audio MIDI Setup steps and the calibrated gain values (FIFINE 23.00 dB, ATR2100x Front Left/Right 20.1 dB, validated on the Panama Hortigal Estate roasts) are in docs/multi_mic_setup.md.
What the Spec Asked For, What Oz Built
Before any code for the recorder was written, I drafted S19 as a GitHub issue and a Warp Plan artifact, and we iterated on it together. The spec went through five rounds in a single conversation:
1. Initial dual-mic design with a fixed --duration parameter.
2. Drop --duration in favour of indefinite recording until Ctrl-C.
3. Generalise from dual-mic to 1–N mics via a --mics list.
4. Add --labels with config-backed defaults under recording.mic_labels in configs/default.yaml.
5. Close three open questions: pin channel mapping, add a _partial suffix for sessions under --min-duration (default 60s), and add a --quiet flag.
Every round updated the GitHub issue body and the Warp Plan in lockstep. By the time implementation began the acceptance criteria, the argument table, and the session JSON schema were fixed. No design decisions were made while writing code.
Oz then implemented two scripts. scripts/record_mics.py opens RoastMics via sounddevice.InputStream, accumulates blocks, and on Ctrl-C splits channels, applies per-mic gain with a clipping guard, and writes one WAV per mic plus a {origin}-roast{n}-session.json containing hardware labels, gains, duration, and an ISO timestamp. scripts/propagate_annotations.py slots between convert_labelstudio_export.py and chunk_audio.py: it reads each *-session.json, finds the primary mic's annotation JSON, and writes an identical annotation for each paired mic with only audio_file and duration updated. The 15 pre-existing recordings have no session JSON, so the propagator does not see them (backward-compatible by construction).
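The propagation step is conceptually tiny precisely because sample-locking guarantees identical timestamps. A sketch of the idea (field names here are assumptions; the real schema lives in the session JSON and the Label Studio export):

```python
import copy

def propagate_annotation(primary_ann: dict, paired_audio_file: str,
                         paired_duration: float) -> dict:
    """Clone the primary mic's annotation for a sample-locked paired mic.

    Only the file reference and duration change; every region timestamp
    is reused verbatim because the two channels share sample indices.
    """
    ann = copy.deepcopy(primary_ann)
    ann["audio_file"] = paired_audio_file
    ann["duration"] = paired_duration
    return ann

primary = {
    "audio_file": "panama-hortigal-estate-roast1-mic1.wav",
    "duration": 768.0,
    "regions": [{"label": "first_crack", "start": 455.75, "end": 545.0}],
}
paired = propagate_annotation(primary, "panama-hortigal-estate-roast1-mic2.wav", 768.0)
# paired["regions"] == primary["regions"]; only the file reference points at mic 2
```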
The resulting end-to-end flow is:
```shell
python scripts/record_mics.py record --origin panama-hortigal-estate --roast-num 1
python -m coffee_first_crack.data_prep.convert_labelstudio_export ...
python scripts/propagate_annotations.py  # mic1 annotation → mic2 annotation (automatic)
python -m coffee_first_crack.data_prep.chunk_audio ...
python -m coffee_first_crack.data_prep.dataset_splitter ...
```
Three of those five steps existed before PR #48. The two new ones sit at positions 1 and 3. The rest of the pipeline is untouched.
Live Debugging: The MCP Exclusive-Lock
The first real dual-mic roast attempt failed. With the first crack detection MCP server running (the same detector I rely on during roasts), the RoastMics Aggregate Device dropped to "Offline device" with 0 input channels, and record_mics.py had nothing to open. Nothing in the spec had anticipated this.
Root cause took one turn. Oz read audio_devices.py in the coffee-roasting repo, traced find_usb_microphone() to the line returning the raw FIFINE device index, and identified the conflict: opening the raw subdevice holds an exclusive CoreAudio stream claim, which then makes the Aggregate Device unable to enumerate that same subdevice as a member. The fix was four lines:
```python
# src/mcp_servers/first_crack_detection/audio_devices.py
for dev in devices:
    if "roastmics" in dev["name"].lower():
        return dev["index"]
# ... falls back to raw USB device
```
If RoastMics exists, the MCP detector now opens the aggregate at channel 0 (the FIFINE channel); CoreAudio multiplexes the hardware internally, and both the live detector and record_mics.py run in the same session. The diagnosis and the patch happened inside the same Warp conversation that reported the failure. I never left the terminal to open a separate editor or debugger.
Copilot Review Retrospective
Copilot reviewed PR #48 across multiple passes. Of the resulting comments, 23 warranted classification. Classified by actual impact rather than by what sounded plausible in the comment body, the breakdown was 10 essential bugs, 5 real-but-lower-blast-radius catches, 4 marginal style or documentation points, and 4 pure noise (PR title wording, stale epic-doc phrasing). The full per-comment table lives in the session summary; two findings stand out for this post.
The highest-impact catch was in batch 2. The --mics list argument translates mic numbers into NumPy channel indices via ch_idx = m - 1. Passing --mics 0 2 silently computes ch_idx = -1 for mic 0, and NumPy negative indexing reads the last column of the buffer without raising. The mic-1 output file would be filled with the wrong channel and labelled as FIFINE, with no warning, no error, and wrong training data. The recorder now rejects non-positive mic numbers at startup.
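NumPy shares Python's wrap-around indexing, so the failure mode reproduces with plain lists; the guard below is a sketch of the kind of check the fix adds (exact names in record_mics.py may differ):

```python
# Two channels per frame: index 0 = mic 1 (FIFINE), index 1 = mic 2 (ATR2100x).
frames = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7)]

mic_num = 0
ch_idx = mic_num - 1                       # --mics 0 silently becomes -1
wrong = [frame[ch_idx] for frame in frames]
assert wrong == [0.9, 0.8, 0.7]            # wrap-around: mic 2's samples, labelled mic 1

def mic_to_channel(mic_num: int) -> int:
    """Guard sketch: mic numbers are 1-based; reject anything else loudly."""
    if mic_num < 1:
        raise ValueError(f"mic numbers are 1-based, got {mic_num}")
    return mic_num - 1
```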
The quietest catch was in batch 5. recording[:, ch_idx] * gain with gain as a Python float promotes the sounddevice float32 buffer to float64. soundfile.write() picks up that dtype and writes 64-bit float WAVs, twice the intended file size, with no warning and no error. The fix is one np.float32(gain), one .astype(np.float32, copy=False), and an explicit subtype="FLOAT" on write.
Signal density from Copilot dropped sharply once the structural bugs were gone. Batches 1–3 produced 10 essential-or-good catches across 12 comments (83%); batches 4–6 produced 5 across 11 (45%), with 4 of those being PR metadata or stale doc phrasing. Six of the 10 essential bugs were in record_mics.py; only four were in propagate_annotations.py. CLI tools with real hardware side effects accumulate silent-failure bugs exactly because their test surface is narrow, which is the argument for running automated review on them in the first place.
What This Buys the Dataset
Two Panama Hortigal Estate roasts are already captured with the new setup: 12.8 and 15.1 minutes, sample-locked on both channels, pending Label Studio annotation. The prototype detected first crack correctly on roast 1 without any model changes. Those two recordings alone add ~28 minutes of paired audio, and going forward each roast contributes one mic-1 and one mic-2 sample at a 1:1 ratio: the fastest path to closing the 9:6 imbalance behind the 27.4-second delay.
What hasn't shipped yet is tracked as S21 (#49). The recorder currently buffers audio in RAM and writes on Ctrl-C, which is fine for the M3 Max target hardware but loses everything on SIGTERM. S21 replaces the in-memory buffer with a streaming writer thread and adds a --verify flag that runs post-session peak, balance, and sample-lock checks before files enter the training pipeline.
Post 3 will cover the training story: two hyperparameter attempts, the oscillating loss from a learning rate that was too aggressive for 587 samples, and the diagnosis that got from 87.5% to 100% precision.
References
1. Data Leakage in Time-Series & Audio ML
2. Class Imbalance with PyTorch and Hugging Face
- PyTorch nn.CrossEntropyLoss. The weight argument is documented for unbalanced training sets.
- Hugging Face Trainer customization. Subclass Trainer.compute_loss without rewriting the training loop.
3. Audio Annotation & Label Studio
4. Condenser vs. Dynamic Mics
- Dynamic vs. Condenser Mics, Shure Insights. Covers the sensitivity and attack-transient differences that explain the 27.4-second delay on the dynamic-mic test recording.


