Sleep is a pillar of health, yet millions of people live with undiagnosed sleep apnea and chronic snoring. Wearable trackers are popular, but they can be uncomfortable to keep on overnight. What if you could use real-time audio streaming and edge computing to monitor your respiratory health without ever touching a wearable?
In this tutorial, we'll build Whisper-Sleep, a lightweight, privacy-focused architecture that leverages Whisper.cpp, Librosa, and ONNX Runtime to detect snoring patterns and potential apnea events directly on a Raspberry Pi or mobile device. By keeping the processing on the edge, we ensure that your most private audio data never leaves your room.
Why Edge AI for Sleep Monitoring?
Real-time audio processing for health monitoring requires low latency and strong privacy guarantees. Traditional cloud-based AI models are too expensive for 8-hour continuous streams and raise significant data sovereignty concerns. By using Whisper.cpp (the high-performance C++ port of OpenAI's Whisper), we can distinguish between ambient noise, sleep talking, and breathing patterns with minimal power consumption.
The Architecture
The system follows a "Stream-Filter-Classify" logic. We capture audio in chunks, use Whisper to filter out speech/noise, and pass suspicious respiratory segments to a specialized ONNX model for snoring frequency analysis.
```mermaid
graph TD
    A[Microphone Stream] --> B{Audio Buffer}
    B --> C[VAD - Voice Activity Detection]
    C -->|Speech/Mumble| D[Whisper.cpp Engine]
    C -->|Breathing/Snore| E[Librosa Feature Extraction]
    D --> F[Transcribed Sleep Talk]
    E --> G[ONNX Snore Classifier]
    G --> H[Apnea Risk Scoring]
    H --> I[Local Dashboard/Alert]
    F --> I
```
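The routing decision in the diagram (speech goes to Whisper, breathing goes to the feature extractor) never shows up in the code later in this post, so here is a minimal sketch of that branch. It uses a simple short-term-energy plus zero-crossing-rate heuristic as a stand-in VAD; the thresholds and the `route_segment` name are illustrative assumptions, not part of any published API, and you'd likely swap in a proper VAD in production.

```python
import numpy as np

def route_segment(segment: np.ndarray, sr: int = 16000) -> str:
    """Rough stand-in for the VAD box: decide where a 5 s chunk should go.

    Heuristic assumption: voiced speech tends to have higher energy AND a
    higher zero-crossing rate than slow, periodic snoring or breathing.
    The thresholds below are illustrative and need per-room calibration.
    """
    energy = float(np.mean(segment ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(segment)))) / 2)

    if energy < 1e-5:
        return "silence"        # candidate apnea gap, logged for AHI scoring
    if zcr > 0.1:
        return "speech"         # hand off to Whisper.cpp for transcription
    return "respiratory"        # hand off to Librosa + ONNX snore classifier
```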
Prerequisites
Before we dive into the code, ensure you have the following tech stack ready:
- Whisper.cpp: For efficient transcription and audio tagging.
- Librosa: For Mel-frequency cepstral coefficients (MFCC) extraction.
- ONNX Runtime: To run our custom snore detection model.
- Python 3.9+: The glue for our pipeline.
Step 1: Setting up the Audio Stream
We need to capture audio in 30-second windows (the standard for Whisper) but analyze 5-second sub-windows for snoring frequency.
```python
import numpy as np
import librosa
import sounddevice as sd

def audio_callback(indata, frames, time, status):
    """Callback invoked by sounddevice for every incoming audio block."""
    if status:
        print(status)
    # Convert to mono and flatten to a 1-D array
    audio_data = np.squeeze(indata)
    # Process audio_data through our pipeline...

# Start streaming at 16 kHz (required by Whisper)
stream = sd.InputStream(samplerate=16000, channels=1, callback=audio_callback)
stream.start()
```
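The callback above doesn't show the 30-second / 5-second windowing we just described, so here is a minimal buffering sketch. The chunk sizes come straight from this step; the `buffer_audio` and `iter_subwindows` names are illustrative glue I'm adding for clarity, not part of any library.

```python
import numpy as np

SR = 16000
WINDOW_S = 30      # Whisper's native window length
SUBWINDOW_S = 5    # granularity for snore analysis

_buffer = np.zeros(0, dtype=np.float32)

def buffer_audio(chunk: np.ndarray) -> list[np.ndarray]:
    """Append incoming samples; return any complete 30 s windows."""
    global _buffer
    _buffer = np.concatenate([_buffer, chunk.astype(np.float32)])
    windows = []
    while len(_buffer) >= WINDOW_S * SR:
        windows.append(_buffer[:WINDOW_S * SR])
        _buffer = _buffer[WINDOW_S * SR:]
    return windows

def iter_subwindows(window: np.ndarray):
    """Split a 30 s window into non-overlapping 5 s sub-windows."""
    step = SUBWINDOW_S * SR
    for start in range(0, len(window), step):
        yield window[start:start + step]
```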
Step 2: Feature Extraction with Librosa
Snoring has distinct spectral signatures. We use Librosa to extract features that our ONNX model can understand.
```python
def extract_features(audio_segment, sr=16000):
    # Extract 40 MFCCs and average them over time
    mfccs = librosa.feature.mfcc(y=audio_segment, sr=sr, n_mfcc=40)
    mfccs_scaled = np.mean(mfccs.T, axis=0)
    return mfccs_scaled.reshape(1, 1, 40, 1)  # Reshape for CNN input
```
Step 3: Real-time Classification with ONNX
Once we have the features, we pass them to a pre-trained CNN (trained on the AudioSet or ESC-50 datasets) converted to ONNX for maximum performance on edge CPUs.
```python
import onnxruntime as ort

# Load the optimized model
session = ort.InferenceSession("snore_detector.onnx")

def predict_snore(features):
    input_name = session.get_inputs()[0].name
    # session.run returns a list of outputs; the first entry holds the class scores
    prediction = session.run(None, {input_name: features.astype(np.float32)})[0]
    return int(np.argmax(prediction))  # 1 for Snore, 0 for Quiet
```
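To tie Steps 2 and 3 together, here is a sketch of how each 5-second sub-window might flow through feature extraction and classification. It reuses the `iter_subwindows` helper and `SUBWINDOW_S` constant from the buffering sketch above; the `classify_window` name and the label strings are illustrative, not a published API.

```python
def classify_window(window: np.ndarray, sr: int = 16000) -> list[str]:
    """Label every 5 s sub-window of a 30 s buffer as 'snore' or 'quiet'."""
    labels = []
    for segment in iter_subwindows(window):
        if len(segment) < SUBWINDOW_S * sr:
            continue  # skip the trailing partial chunk
        features = extract_features(segment, sr=sr)
        labels.append("snore" if predict_snore(features) == 1 else "quiet")
    return labels
```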
The "Official" Way to Scale
Building a prototype on a Raspberry Pi is a great start, but when you move toward a production-grade health application, you need to handle device orchestration, secure data synchronization, and advanced signal filtering.
For more production-ready examples and advanced patterns in Edge AI architecture, I highly recommend checking out the technical deep-dives at WellAlly Blog. They cover how to optimize ONNX models for specialized hardware and how to manage real-time data pipelines at scale, concepts that were the primary source of inspiration for this Whisper-Sleep project.
Step 4: Integrating Whisper.cpp
Whisper.cpp acts as our "Context Engine." It tells us if the user is sleep-talking or if there's significant background noise (like a fan), which helps reduce false positives in apnea detection.
```bash
# Example command to run whisper.cpp in "Tiny" mode for edge devices
./main -m models/ggml-tiny.bin -f input_audio.wav -otxt
```
In our Python wrapper, we can call the shared library:
```python
from pywhispercpp.model import Model

model = Model('tiny.en', n_threads=4)

def get_context(audio_path):
    # Transcribe to see if the 'noise' is actually speech
    segments = model.transcribe(audio_path)
    return [s.text for s in segments]
```
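One way to use this context signal is sketched below, under the assumption that a window dominated by intelligible speech shouldn't count toward apnea scoring. The `is_sleep_talking` helper and the word-count threshold are illustrative choices, not something prescribed by Whisper.cpp.

```python
def is_sleep_talking(audio_path: str, min_words: int = 3) -> bool:
    """Treat a window as sleep talk if Whisper finds a few coherent words."""
    text = " ".join(get_context(audio_path)).strip()
    return len(text.split()) >= min_words

# Usage idea: drop 'snore' labels from windows that are really mumbled speech,
# so a restless talker (or a TV left on) doesn't inflate the apnea risk score.
```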
Calculating the Apnea Index
The Apnea-Hypopnea Index (AHI) is calculated by counting the number of pauses in breathing (lasting at least 10 seconds) per hour. Our system tracks these "silent gaps" immediately following heavy snoring periods.
```python
def analyze_sleep_session(event_logs, total_hours):
    """Estimate AHI from a chronological list of per-window event labels."""
    snore_count = sum(1 for e in event_logs if e == 'snore')
    # find_long_silence returns the silent gaps lasting at least 10 s (sketched below)
    silent_gaps = find_long_silence(event_logs, threshold_seconds=10)
    ahi_estimate = len(silent_gaps) / total_hours
    return f"Snore events: {snore_count}, Estimated AHI: {ahi_estimate:.2f}"
```
Conclusion: Privacy-First Health Tech
By combining Whisper.cpp and ONNX Runtime, we've created a system that is powerful enough to monitor sleep health yet light enough to run on a $35 computer. No cloud, no subscription, just pure local AI.
Next Steps:
- Calibration: Adjust the Librosa thresholds based on your room's acoustics.
- Visualization: Pipe the results into a local Grafana dashboard.
- Community: Check out WellAlly for more insights on building robust AI-driven health solutions.
Happy hacking, and sleep well!
Found this useful? Star the repo and let me know in the comments how you're using Whisper on the edge!