Have you ever been told your snoring sounds like a freight train suddenly running out of fuel? That "gasping" silence might be more than just an annoyance; it could be Obstructive Sleep Apnea (OSA).
While clinical polysomnography (PSG) is the gold standard, we can use modern AI to build a non-invasive, offline preliminary screening tool. In this tutorial, we are going to dive deep into audio signal processing, Whisper model feature extraction, and PyTorch classification to turn raw bedroom recordings into actionable health insights.
By the end of this post, you'll understand how to leverage Whisper fine-tuning, Mel-spectrogram analysis, and audio pattern recognition to detect breathing irregularities.
The Challenge: Why Whisper for Audio Classification?
Traditional OSA detection relies on handcrafted features like Zero-Crossing Rate or Energy Entropy. However, OpenAI's Whisper has been trained on 680,000 hours of multilingual and multitask supervised data. While it's famous for transcription, its Encoder is a world-class feature extractor for robust audio representation, even in noisy bedroom environments.
The Architecture
Here is how the data flows from a smartphone microphone to a classification result:
```mermaid
graph TD
    A[Raw Audio .m4a/.wav] --> B[FFmpeg Normalization]
    B --> C[Librosa: Mel-Spectrogram Extraction]
    C --> D[Whisper Encoder: Hidden States]
    D --> E[Custom PyTorch Linear Head]
    E --> F{Output: Normal vs. Apnea}
    F -->|Result| G[Local Dashboard/Alert]
    style D fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#bbf,stroke:#333,stroke-width:2px
```
Prerequisites
Ensure you have the following tech stack ready:
- Python 3.9+
- PyTorch (Deep learning backbone)
- OpenAI-Whisper (Feature extraction)
- Librosa (Audio manipulation)
- FFmpeg (Format conversion)
```bash
pip install torch librosa openai-whisper ffmpeg-python
```

Note that `ffmpeg-python` is only a Python wrapper; the FFmpeg binary itself must be installed separately (e.g., via your system package manager).
Step 1: Preprocessing Audio with Librosa & FFmpeg
Sleep recordings are often long and filled with silence. We need to chunk the audio and convert it to the format Whisper expects: 16 kHz mono.
```python
import librosa
import numpy as np

def preprocess_audio(file_path, duration=30, sr=16000):
    # Load audio and resample to 16 kHz mono
    audio, _ = librosa.load(file_path, sr=sr, duration=duration)

    # Pad or trim to exactly 'duration' seconds
    target_length = sr * duration
    if len(audio) < target_length:
        audio = np.pad(audio, (0, target_length - len(audio)))
    else:
        audio = audio[:target_length]

    # Log-Mel spectrogram (Whisper's bread and butter)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
    log_mel = librosa.power_to_db(mel)
    return log_mel
```
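The function above handles a single 30-second window, but a full night of audio is hours long. Here is a minimal sketch of slicing a long recording into fixed-length chunks before classification; the non-overlapping 30 s windowing is my own simplification (a production screener might use overlapping windows):

```python
import numpy as np

def chunk_audio(audio, sr=16000, chunk_seconds=30):
    """Split a long 1-D audio array into non-overlapping 30 s segments.

    The final partial segment is zero-padded so every chunk has the same
    length, matching what preprocess_audio expects.
    """
    chunk_len = sr * chunk_seconds
    n_chunks = int(np.ceil(len(audio) / chunk_len))
    padded = np.zeros(n_chunks * chunk_len, dtype=audio.dtype)
    padded[:len(audio)] = audio
    return padded.reshape(n_chunks, chunk_len)

# Example: 95 seconds of fake audio -> 4 chunks of 30 s each
fake_night = np.random.randn(16000 * 95).astype(np.float32)
chunks = chunk_audio(fake_night)
print(chunks.shape)  # (4, 480000)
```

Each row of the result can then be fed through the pipeline independently.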
Step 2: Extracting "Deep" Features with Whisper
We aren't interested in the text (transcription); we want the Encoder's hidden states. These states contain rich temporal and spectral information about the "texture" of the sound.
```python
import whisper
import torch

# Load the base model (small enough for offline edge devices)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base").to(device)

def get_whisper_features(audio_tensor):
    with torch.no_grad():
        # Whisper's encoder expects exactly 30 s of audio; pad or trim first
        audio = whisper.pad_or_trim(audio_tensor)
        # Convert audio to a log-Mel spectrogram in Whisper's format
        mel = whisper.log_mel_spectrogram(audio).to(device)
        # Get hidden states from the encoder
        features = model.encoder(mel.unsqueeze(0))
    return features  # Shape: [1, 1500, 512] for the "base" model
```
Step 3: The Custom OSA Classifier
Now, we build a "Head" on top of Whisper to classify the segment as Normal Breathing, Snoring, or Apnea Event.
```python
import torch.nn as nn

class OSAClassifier(nn.Module):
    def __init__(self, input_dim=512, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim * 2, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 3)  # 3 classes: Normal, Snore, Apnea
        )

    def forward(self, x):
        # x is the output from the Whisper encoder: [batch, 1500, 512]
        lstm_out, _ = self.lstm(x)
        # Use the last time step for classification
        last_state = lstm_out[:, -1, :]
        return self.fc(last_state)
```
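A design note on the `last_state` choice: with a bidirectional LSTM, the final time step contains the full forward pass but only one step of the backward pass, so mean pooling over time is a common alternative. A self-contained sketch comparing the two reductions on a dummy LSTM output (the [1, 1500, 512] shape mirrors 1500 frames of 2 × 256 bidirectional hidden states):

```python
import torch

# Dummy bidirectional-LSTM output: [batch, time, 2 * hidden_dim]
lstm_out = torch.randn(1, 1500, 512)

# Reduction used above: take the final time step
last_state = lstm_out[:, -1, :]    # shape [1, 512]

# Common alternative: average over the time axis
mean_state = lstm_out.mean(dim=1)  # shape [1, 512]

print(tuple(last_state.shape), tuple(mean_state.shape))  # (1, 512) (1, 512)
```

Either vector can be fed to the same `fc` head; which works better is an empirical question for your dataset.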
The "Official" Way: Advanced Patterns
While building a DIY screener is a great "Learning in Public" project, implementing this in a production environment requires handling clinical data privacy (HIPAA compliance) and dealing with edge-case acoustic noise (like a fan or a partner's snoring).
For more production-ready examples and advanced architectural patterns regarding health-tech AI, I highly recommend checking out the engineering deep dives at WellAlly Blog. They cover how to move from a local Python script to a scalable, secure health monitoring infrastructure.
Step 4: Training and Offline Inference
To train this, you would need a labeled sleep-audio dataset; a handful of public snore/apnea audio corpora exist, or you can annotate your own recordings against a clinical sleep study. During inference, the system runs locally to ensure maximum privacy; after all, nobody wants their sleep audio sent to a random cloud server!
```python
# Assumes a trained classifier; in practice, load your saved weights here
classifier = OSAClassifier().to(device)

def classify_breathing(audio_path):
    # 1. Preprocess: load and resample to 16 kHz mono
    audio_data = whisper.load_audio(audio_path)

    # 2. Extract Whisper features
    features = get_whisper_features(torch.from_numpy(audio_data))

    # 3. Predict
    classifier.eval()
    with torch.no_grad():
        prediction = classifier(features)

    classes = ["Normal", "Snoring", "Apnea Event Detected"]
    result = torch.argmax(prediction, dim=1).item()
    print(f"Analysis Result: {classes[result]}")

# Example usage
classify_breathing("my_sleep_night_1.wav")
```
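Running `classify_breathing` on one file yields a single label; for a whole night you would classify each 30 s chunk and aggregate the results. A minimal sketch of that aggregation in pure Python (the events-per-hour figure is a simple screening statistic, not a clinical AHI, and any threshold you apply to it is an illustrative placeholder):

```python
from collections import Counter

def summarize_night(chunk_labels):
    """Aggregate per-chunk labels ("Normal", "Snoring", "Apnea")
    into a simple screening summary with an events-per-hour rate."""
    counts = Counter(chunk_labels)
    hours = len(chunk_labels) * 30 / 3600  # each chunk covers 30 s
    apnea_per_hour = counts.get("Apnea", 0) / hours if hours else 0.0
    return {"counts": dict(counts), "apnea_events_per_hour": round(apnea_per_hour, 1)}

# Example: one hour of predictions (120 chunks of 30 s)
night = ["Normal"] * 100 + ["Snoring"] * 15 + ["Apnea"] * 5
print(summarize_night(night))
# {'counts': {'Normal': 100, 'Snoring': 15, 'Apnea': 5}, 'apnea_events_per_hour': 5.0}
```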
Conclusion & Ethical Considerations
Building an OSA screener with Whisper and PyTorch demonstrates the incredible power of transfer learning. We took a model designed for speech and repurposed it for life-saving health tech.
Important Disclaimer: This is a screening tool, not a medical diagnosis. If your "Learning in Public" project flags an issue, always consult a medical professional.
What's next?
- Try fine-tuning the Whisper Encoder weights specifically on snore datasets.
- Use FFmpeg to filter out background white noise before processing.
- Check out WellAlly.tech for more insights on building robust AI for wellness.
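For the first follow-up above, a common recipe is partial fine-tuning: freeze most of the encoder and unfreeze only its last few transformer blocks. A minimal sketch, tested here on a stand-in module; Whisper's audio encoder exposes its transformer layers as a `blocks` `ModuleList`, so the same helper should apply, but treat that as an assumption to verify against your installed version:

```python
import torch.nn as nn

def unfreeze_last_blocks(encoder, n=1):
    """Freeze every encoder parameter, then re-enable gradients
    for the last `n` transformer blocks."""
    for p in encoder.parameters():
        p.requires_grad = False
    for block in encoder.blocks[-n:]:
        for p in block.parameters():
            p.requires_grad = True

# Stand-in encoder with a `blocks` ModuleList, mimicking Whisper's layout
class TinyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(4, 4) for _ in range(6)])

enc = TinyEncoder()
unfreeze_last_blocks(enc, n=2)
trainable = sum(p.requires_grad for p in enc.parameters())
print(trainable)  # 4 tensors: weight + bias for each of the last 2 blocks
```

Only the unfrozen parameters (plus your classifier head) then go into the optimizer.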
Are you working on AI for health? Let's chat in the comments!