DEV Community

Beck_Moulton

SleepSentry: Privacy-First Sleep Apnea Detection on Raspberry Pi using Whisper and Librosa

Sleep apnea is often called a "silent killer," affecting millions of people worldwide who remain undiagnosed. While many mobile apps claim to track sleep, they often rely on uploading sensitive bedroom audio to the cloud—a massive privacy nightmare.

In this tutorial, we are building SleepSentry, an edge-computing solution that performs Sleep Apnea Detection and snoring classification locally on a Raspberry Pi. By leveraging Audio Signal Processing with Librosa and efficient inference with TensorFlow Lite, we ensure that your raw audio never leaves the device. We only process features, not recordings, keeping your data 100% private.

The Architecture: From Sound Waves to Insights

To achieve real-time classification on low-power hardware, we separate the pipeline into feature extraction and lightweight inference. We use Faster-Whisper for contextual audio analysis (like identifying sleep talking) and Librosa for the heavy lifting of Fast Fourier Transforms (FFT).
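To make the FFT step less of a black box, here is a minimal numpy-only sketch of a windowed FFT (the core of Librosa's STFT). The frame length, hop size, and Hann window are illustrative assumptions, not SleepSentry's exact settings:

```python
import numpy as np

def stft_magnitude(y, n_fft=512, hop=256):
    """Windowed FFT over overlapping frames -- the core idea behind an STFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (bins, frames)

# 1 second of a 100 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
spec = stft_magnitude(np.sin(2 * np.pi * 100 * t))
print(spec.shape)  # (257, 61)
```

Librosa wraps this pattern (with padding, caching, and better windows) and builds the Mel-spectrogram and MFCCs on top of it.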

```mermaid
graph TD
    A[USB Microphone] -->|Raw PCM Audio| B(Librosa Feature Extraction)
    B -->|MFCCs / Mel-Spectrogram| C{Privacy Filter}
    C -->|Feature Vectors Only| D[TFLite CNN Classifier]
    D -->|Snore/Apnea/Normal| E[Local Dashboard]
    A -->|Intermittent Context| F[Faster-Whisper]
    F -->|Transcribed Sleep Talk| E
    E -->|Alert| G[User Notification]
```
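Before diving into each stage, here is a rough sketch of the control flow the diagram implies. The chunk size and confidence floor are assumptions, and it presumes the classifier returns a `(label, confidence)` pair (a small extension of Step 2's `classify_event`); the real Step 1–3 functions are stubbed out so the logic runs standalone:

```python
import numpy as np

CHUNK_SECONDS = 10          # analysis window (assumption)
SAMPLE_RATE = 16000
CONFIDENCE_FLOOR = 0.6      # below this, hand the chunk to Whisper

def process_chunk(chunk, extract, classify, transcribe):
    """One pass of the pipeline: features -> classifier -> optional context."""
    features = extract(chunk)
    label, confidence = classify(features)
    if confidence < CONFIDENCE_FLOOR:
        # The CNN is unsure: fall back to Whisper for context
        return ("uncertain", transcribe(chunk))
    return (label, None)

# Stubs standing in for the real Step 1-3 functions:
extract = lambda chunk: np.zeros(47)
classify = lambda features: ("Snoring", 0.9)
transcribe = lambda chunk: "(sleep talk)"

chunk = np.zeros(CHUNK_SECONDS * SAMPLE_RATE, dtype=np.float32)
print(process_chunk(chunk, extract, classify, transcribe))  # ('Snoring', None)
```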

Prerequisites

Before we dive into the code, ensure you have the following:

  • Hardware: Raspberry Pi 4 (4GB+) or Raspberry Pi 5.
  • Audio: A high-quality USB condenser microphone.
  • Tech Stack:
    • Librosa: For Short-Time Fourier Transform (STFT) and MFCC extraction.
    • TensorFlow Lite: For running our pre-trained CNN.
    • Faster-Whisper: For optimized local transcription.
    • NumPy: For high-speed matrix operations.

Step 1: Privacy-First Feature Extraction

Instead of saving .wav files, we immediately convert audio into the frequency domain. MFCCs (Mel-frequency cepstral coefficients) are perfect for this because they capture the "texture" of the sound without retaining enough information to easily reconstruct intelligible speech.

```python
import librosa
import numpy as np

def extract_features(y, sr=16000):
    """
    Extracts MFCCs and spectral contrast from an in-memory audio buffer.
    Crucial for identifying Obstructive Sleep Apnea (OSA) patterns.

    y: mono float waveform (kept in memory, never written to disk)
    """
    # Extract Mel-frequency cepstral coefficients (40-dim summary over time)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    mfccs_scaled = np.mean(mfccs.T, axis=0)

    # Extract Spectral Contrast to differentiate between snoring and gasping
    spectral_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)

    # 40 MFCCs + 7 contrast bands = 47-dim feature vector
    return np.hstack([mfccs_scaled, np.mean(spectral_contrast.T, axis=0)])

# Example usage with a chunk captured from the microphone stream:
# features = extract_features(audio_chunk)          # audio_chunk: np.ndarray
# print(f"Feature vector shape: {features.shape}")  # (47,)
```

Step 2: Edge Inference with TensorFlow Lite

On a Raspberry Pi, running a full TensorFlow model is overkill and slow. We use TFLite to classify the extracted features into three categories: Normal, Snoring, and Apnea Event.

```python
import numpy as np
import tensorflow as tf

# Load the TFLite model and allocate tensors once at startup,
# not on every call -- interpreter setup is expensive on a Pi.
interpreter = tf.lite.Interpreter(model_path="sleep_sentry_model.tflite")
interpreter.allocate_tensors()

# Get input and output tensor details.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

CLASSES = ["Normal", "Snoring", "Apnea"]

def classify_event(feature_vector):
    # Prepare input data as a (1, 47) float32 batch
    input_data = np.array([feature_vector], dtype=np.float32)
    interpreter.set_tensor(input_details[0]['index'], input_data)

    # Run inference
    interpreter.invoke()

    # Get prediction
    prediction = interpreter.get_tensor(output_details[0]['index'])
    return CLASSES[np.argmax(prediction)]
```

Step 3: Adding Context with Faster-Whisper

Sometimes, "noises" are actually sleep-talking or environmental sounds. To provide better context without heavy CPU usage, we use Faster-Whisper to transcribe specific segments where the energy levels are high but the CNN is uncertain.

```python
from faster_whisper import WhisperModel

# Use 'tiny' or 'base' for Raspberry Pi performance
model_size = "tiny.en"
model = WhisperModel(model_size, device="cpu", compute_type="int8")

def transcribe_context(audio_segment):
    segments, info = model.transcribe(audio_segment, beam_size=5)
    for segment in segments:
        print(f"[Context]: {segment.text}")
```
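The "high energy but uncertain" trigger can be sketched with plain numpy: RMS energy gates out silence, and softmax confidence over the CNN's class scores decides whether Whisper is worth the CPU. Both thresholds below are illustrative assumptions:

```python
import numpy as np

ENERGY_THRESHOLD = 0.01       # RMS floor: ignore near-silence (assumption)
CONFIDENCE_THRESHOLD = 0.6    # classifier certainty cutoff (assumption)

def should_transcribe(chunk, logits):
    """Transcribe only loud chunks the classifier is unsure about."""
    rms = np.sqrt(np.mean(chunk ** 2))
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()      # softmax over class scores
    return bool(rms > ENERGY_THRESHOLD and probs.max() < CONFIDENCE_THRESHOLD)

# Loud noise with three nearly-equal class scores -> worth transcribing
loud_unclear = 0.5 * np.random.default_rng(0).standard_normal(16000)
print(should_transcribe(loud_unclear, np.array([1.0, 1.1, 0.9])))  # True
```

This keeps Whisper invocations rare, which matters on a Pi where even the `tiny` model takes noticeable CPU time.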

The "Official" Way: Advanced Patterns

While this DIY setup is great for hobbyists, building medical-grade or enterprise-level health monitoring requires rigorous validation and more robust data pipelines.

For more production-ready examples, advanced signal processing patterns, and deep dives into AI safety, I highly recommend checking out the official WellAlly Tech Blog. It's an incredible resource for developers looking to move from prototypes to scalable, high-performance AI applications.

Conclusion

By combining Librosa for feature extraction and TFLite for classification, we've created a powerful, privacy-respecting health monitor. SleepSentry proves that you don't need a massive GPU cluster to solve real-world problems—just a Raspberry Pi and some smart signal processing.

What's next?

  1. Dashboarding: Hook the output up to a Grafana dashboard using InfluxDB.
  2. Alerts: Integrate with Home Assistant to toggle a smart light if an apnea event is detected.
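For the Grafana/InfluxDB idea, each classification can be written as an InfluxDB line-protocol record. The measurement, tag, and field names below are hypothetical, not a fixed schema:

```python
def to_line_protocol(event, confidence, timestamp_ns, device="sleepsentry"):
    """Format one classification as an InfluxDB line-protocol record."""
    return (f"sleep_events,device={device},event={event} "
            f"confidence={confidence} {timestamp_ns}")

line = to_line_protocol("Apnea", 0.92, 1700000000000000000)
print(line)
# sleep_events,device=sleepsentry,event=Apnea confidence=0.92 1700000000000000000
```

From there, a Grafana panel querying `sleep_events` can chart events per hour overnight.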

Got questions about audio classification at the edge? Drop a comment below!
