Sleep is a pillar of health, yet millions of people live with undiagnosed sleep apnea and chronic snoring. Wearable trackers are popular, but they can be uncomfortable to keep on overnight. What if you could use real-time audio streaming and edge computing to monitor your respiratory health without ever touching a wearable?
In this tutorial, we'll build Whisper-Sleep, a lightweight, privacy-focused architecture that leverages Whisper.cpp, Librosa, and ONNX Runtime to detect snoring patterns and potential apnea events directly on a Raspberry Pi or mobile device. By keeping the processing on the edge, we ensure that your most private audio data never leaves your room.
Why Edge AI for Sleep Monitoring?
Real-time audio processing for health monitoring requires low latency and strong privacy guarantees. Traditional cloud-based AI models are too expensive for 8-hour continuous streams and raise significant data sovereignty concerns. By using Whisper.cpp (the high-performance C++ port of OpenAI's Whisper), we can distinguish between ambient noise, sleep talking, and breathing patterns with minimal power consumption.
The Architecture
The system follows a "Stream-Filter-Classify" logic. We capture audio in chunks, use Whisper to filter out speech/noise, and pass suspicious respiratory segments to a specialized ONNX model for snoring frequency analysis.
```mermaid
graph TD
    A[Microphone Stream] --> B{Audio Buffer}
    B --> C[VAD - Voice Activity Detection]
    C -->|Speech/Mumble| D[Whisper.cpp Engine]
    C -->|Breathing/Snore| E[Librosa Feature Extraction]
    D --> F[Transcribed Sleep Talk]
    E --> G[ONNX Snore Classifier]
    G --> H[Apnea Risk Scoring]
    H --> I[Local Dashboard/Alert]
    F --> I
```
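The routing decision in the diagram (speech goes to Whisper, breathing goes to the feature extractor) never shows up in the code later in this post, so here is a minimal sketch of that branch. It uses a simple short-term-energy plus zero-crossing-rate heuristic as a stand-in VAD; the thresholds and the `route_segment` name are illustrative assumptions, not part of any published API, and you'd likely swap in a proper VAD in production.

```python
import numpy as np

def route_segment(segment: np.ndarray, sr: int = 16000) -> str:
    """Rough stand-in for the VAD box: decide where a 5 s chunk should go.

    Heuristic assumption: voiced speech tends to have higher energy AND a
    higher zero-crossing rate than slow, periodic snoring or breathing.
    The thresholds below are illustrative and need per-room calibration.
    """
    energy = float(np.mean(segment ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(segment)))) / 2)

    if energy < 1e-5:
        return "silence"        # candidate apnea gap, logged for AHI scoring
    if zcr > 0.1:
        return "speech"         # hand off to Whisper.cpp for transcription
    return "respiratory"        # hand off to Librosa + ONNX snore classifier
```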
Prerequisites
Before we dive into the code, ensure you have the following tech stack ready:
- Whisper.cpp: For efficient transcription and audio tagging.
- Librosa: For Mel-frequency cepstral coefficients (MFCC) extraction.
- ONNX Runtime: To run our custom snore detection model.
- Python 3.9+: The glue for our pipeline.
Step 1: Setting up the Audio Stream
We need to capture audio in 30-second windows (the standard for Whisper) but analyze 5-second sub-windows for snoring frequency.
```python
import numpy as np
import librosa
import sounddevice as sd

def audio_callback(indata, frames, time, status):
    """Callback invoked by sounddevice for every incoming audio block."""
    if status:
        print(status)
    # Convert to mono and flatten to a 1-D array
    audio_data = np.squeeze(indata)
    # Process audio_data through our pipeline...

# Start streaming at 16 kHz (required by Whisper)
stream = sd.InputStream(samplerate=16000, channels=1, callback=audio_callback)
stream.start()
```
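The callback above doesn't show the 30-second / 5-second windowing we just described, so here is a minimal buffering sketch. The chunk sizes come straight from this step; the `buffer_audio` and `iter_subwindows` names are illustrative glue I'm adding for clarity, not part of any library.

```python
import numpy as np

SR = 16000
WINDOW_S = 30      # Whisper's native window length
SUBWINDOW_S = 5    # granularity for snore analysis

_buffer = np.zeros(0, dtype=np.float32)

def buffer_audio(chunk: np.ndarray) -> list[np.ndarray]:
    """Append incoming samples; return any complete 30 s windows."""
    global _buffer
    _buffer = np.concatenate([_buffer, chunk.astype(np.float32)])
    windows = []
    while len(_buffer) >= WINDOW_S * SR:
        windows.append(_buffer[:WINDOW_S * SR])
        _buffer = _buffer[WINDOW_S * SR:]
    return windows

def iter_subwindows(window: np.ndarray):
    """Split a 30 s window into non-overlapping 5 s sub-windows."""
    step = SUBWINDOW_S * SR
    for start in range(0, len(window), step):
        yield window[start:start + step]
```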
Step 2: Feature Extraction with Librosa
Snoring has distinct spectral signatures. We use Librosa to extract features that our ONNX model can understand.
```python
def extract_features(audio_segment, sr=16000):
    # Extract 40 MFCCs and average them over time
    mfccs = librosa.feature.mfcc(y=audio_segment, sr=sr, n_mfcc=40)
    mfccs_scaled = np.mean(mfccs.T, axis=0)
    return mfccs_scaled.reshape(1, 1, 40, 1)  # Reshape for CNN input
```
Step 3: Real-time Classification with ONNX
Once we have the features, we pass them to a pre-trained CNN (trained on the AudioSet or ESC-50 datasets) converted to ONNX for maximum performance on edge CPUs.
```python
import onnxruntime as ort

# Load the optimized model
session = ort.InferenceSession("snore_detector.onnx")

def predict_snore(features):
    input_name = session.get_inputs()[0].name
    # session.run returns a list of outputs; the first entry holds the class scores
    prediction = session.run(None, {input_name: features.astype(np.float32)})[0]
    return int(np.argmax(prediction))  # 1 for Snore, 0 for Quiet
```
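To tie Steps 2 and 3 together, here is a sketch of how each 5-second sub-window might flow through feature extraction and classification. It reuses the `iter_subwindows` helper and `SUBWINDOW_S` constant from the buffering sketch above; the `classify_window` name and the label strings are illustrative, not a published API.

```python
def classify_window(window: np.ndarray, sr: int = 16000) -> list[str]:
    """Label every 5 s sub-window of a 30 s buffer as 'snore' or 'quiet'."""
    labels = []
    for segment in iter_subwindows(window):
        if len(segment) < SUBWINDOW_S * sr:
            continue  # skip the trailing partial chunk
        features = extract_features(segment, sr=sr)
        labels.append("snore" if predict_snore(features) == 1 else "quiet")
    return labels
```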
The "Official" Way to Scale
Building a prototype on a Raspberry Pi is a great start, but when you move toward a production-grade health application, you need to handle device orchestration, secure data synchronization, and advanced signal filtering.
For more production-ready examples and advanced patterns in Edge AI architecture, I highly recommend checking out the technical deep-dives at WellAlly Blog. They cover how to optimize ONNX models for specialized hardware and how to manage real-time data pipelines at scale, concepts that were the primary source of inspiration for this Whisper-Sleep project.
Step 4: Integrating Whisper.cpp
Whisper.cpp acts as our "Context Engine." It tells us if the user is sleep-talking or if there's significant background noise (like a fan), which helps reduce false positives in apnea detection.
```bash
# Example command to run whisper.cpp in "Tiny" mode for edge devices
./main -m models/ggml-tiny.bin -f input_audio.wav -otxt
```
In our Python wrapper, we can call the shared library:
```python
from pywhispercpp.model import Model

model = Model('tiny.en', n_threads=4)

def get_context(audio_path):
    # Transcribe to see if the 'noise' is actually speech
    segments = model.transcribe(audio_path)
    return [s.text for s in segments]
```
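One way to use this context signal is sketched below, under the assumption that a window dominated by intelligible speech shouldn't count toward apnea scoring. The `is_sleep_talking` helper and the word-count threshold are illustrative choices, not something prescribed by Whisper.cpp.

```python
def is_sleep_talking(audio_path: str, min_words: int = 3) -> bool:
    """Treat a window as sleep talk if Whisper finds a few coherent words."""
    text = " ".join(get_context(audio_path)).strip()
    return len(text.split()) >= min_words

# Usage idea: drop 'snore' labels from windows that are really mumbled speech,
# so a restless talker (or a TV left on) doesn't inflate the apnea risk score.
```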
Calculating the Apnea Index
The Apnea-Hypopnea Index (AHI) is calculated by counting the number of pauses in breathing (lasting at least 10 seconds) per hour. Our system tracks these "silent gaps" immediately following heavy snoring periods.
```python
def analyze_sleep_session(event_logs, total_hours):
    """Estimate AHI from a chronological list of per-window event labels."""
    snore_count = sum(1 for e in event_logs if e == 'snore')
    # find_long_silence returns the silent gaps lasting at least 10 s (sketched below)
    silent_gaps = find_long_silence(event_logs, threshold_seconds=10)
    ahi_estimate = len(silent_gaps) / total_hours
    return f"Snore events: {snore_count}, Estimated AHI: {ahi_estimate:.2f}"
```
Conclusion: Privacy-First Health Tech
By combining Whisper.cpp and ONNX Runtime, we've created a system that is powerful enough to monitor sleep health yet light enough to run on a $35 computer. No cloud, no subscription, just pure local AI.
Next Steps:
- Calibration: Adjust the Librosa thresholds based on your room's acoustics.
- Visualization: Pipe the results into a local Grafana dashboard.
- Community: Check out WellAlly for more insights on building robust AI-driven health solutions.
Happy hacking, and sleep well!
Found this useful? Star the repo and let me know in the comments how you're using Whisper on the edge!