DEV Community

wellallyTech

Invisible Breathing Monitoring: Building an AI-Powered Sleep Apnea Detector with Whisper and DFT

Sleep is the ultimate productivity hack, but for millions, it’s a silent battleground. Sleep apnea—a condition where breathing repeatedly stops and starts—often goes undiagnosed because nobody likes sleeping in a lab covered in wires. 😴

In this "Learning in Public" session, we are going to build an Invisible Breathing Monitor. By combining Audio Signal Processing, OpenAI Whisper, and the Discrete Fourier Transform (DFT), we’ll create an edge-ready system that distinguishes between normal rhythmic breathing and the erratic stop-start signatures of obstructive sleep apnea.

If you’ve been looking to dive deep into Edge AI and Real-time Audio Analysis, you’re in the right place! 🚀


The Architecture: From Soundwaves to Insights

Capturing snoring is easy; understanding the pathology behind the sound is hard. Our pipeline uses a dual-track approach: Statistical Signal Processing (to catch the physics of the sound) and Deep Learning (to understand the context).

graph TD
    A[Ambient Audio] -->|PyAudio| B(Circular Buffer)
    B --> C{Feature Extraction}
    C -->|DFT/FFT| D[Librosa: Spectral Analysis]
    C -->|STT/Features| E[OpenAI Whisper: Acoustic Embedding]
    D --> F[Feature Fusion Layer]
    E --> F
    F --> G[TensorFlow Lite Classification]
    G -->|Normal| H[Logged]
    G -->|Apnea Event| I[Alert/Trigger]

Prerequisites 🛠️

To follow along, you’ll need a Python environment with the following tech stack:

  • PyAudio: For real-time stream handling.
  • Librosa: The gold standard for audio math.
  • OpenAI Whisper: To extract robust acoustic features.
  • TensorFlow Lite: For low-latency inference on the edge.

Step 1: Real-Time Audio Capture with PyAudio

First, we need to "listen." We’ll set up a non-blocking stream to capture audio chunks.

import pyaudio
import numpy as np

CHUNK = 1024 * 4  # 4096 frames
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000  # Whisper expects 16 kHz audio

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

def get_audio_chunk():
    data = stream.read(CHUNK, exception_on_overflow=False)
    return np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
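The circular buffer from the architecture diagram can be sketched with `collections.deque`, which drops the oldest chunks automatically once the window is full. A minimal sketch — the 30-second window size is an assumption chosen to match Whisper's analysis window:

```python
import collections

import numpy as np

SR = 16000          # sample rate, matching the PyAudio stream above
CHUNK = 1024 * 4    # frames per chunk, matching the capture code
WINDOW_SEC = 30     # assumed analysis window (Whisper works on 30 s)

# A deque with maxlen acts as a circular buffer: old chunks fall off the left
MAX_CHUNKS = (SR * WINDOW_SEC) // CHUNK
ring = collections.deque(maxlen=MAX_CHUNKS)

def push_chunk(chunk: np.ndarray) -> None:
    """Append the latest audio chunk, evicting the oldest if full."""
    ring.append(chunk)

def current_window() -> np.ndarray:
    """Concatenate everything buffered so far into one float32 array."""
    if not ring:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(ring)
```

In the capture loop you would call `push_chunk(get_audio_chunk())` and hand `current_window()` to the feature extractors below.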

Step 2: The Physics of a Snore (DFT & Librosa)

Snoring isn't just noise; it has a specific frequency footprint. Using a Discrete Fourier Transform (DFT), we can move from the time domain to the frequency domain to calculate the Spectral Centroid and Zero-Crossing Rate.

import librosa
import numpy as np

def extract_signal_features(y, sr=16000):
    # Magnitude of the Short-Time Fourier Transform (a windowed DFT)
    S = np.abs(librosa.stft(y))

    # Spectral Centroid: where the "center of mass" of the spectrum sits
    centroid = librosa.feature.spectral_centroid(S=S, sr=sr)

    # Zero-Crossing Rate: helps identify percussive/choking sounds
    zcr = librosa.feature.zero_crossing_rate(y)

    return np.mean(centroid), np.mean(zcr)
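The same DFT machinery can also estimate the breathing rate itself: take the amplitude envelope of the audio and look for its dominant low frequency (resting respiration sits roughly around 0.2–0.3 Hz). A numpy-only sketch on a synthetic signal — the 0.3 Hz modulation rate and the 0.1 s smoothing window are illustrative assumptions:

```python
import numpy as np

def dominant_envelope_freq(y: np.ndarray, sr: int) -> float:
    """Estimate the dominant frequency of a signal's amplitude envelope."""
    # Rough envelope: rectify, then smooth with a 0.1 s moving average
    win = max(1, sr // 10)
    env = np.convolve(np.abs(y), np.ones(win) / win, mode="same")
    env = env - env.mean()  # drop the DC component before the FFT

    spectrum = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / sr)
    return float(freqs[np.argmax(spectrum)])

# Synthetic test signal: a 200 Hz tone amplitude-modulated at 0.3 Hz,
# mimicking a sound whose loudness rises and falls with each breath
sr = 1000
t = np.arange(0, 20, 1 / sr)
y = (0.5 + 0.5 * np.sin(2 * np.pi * 0.3 * t)) * np.sin(2 * np.pi * 200 * t)

print(dominant_envelope_freq(y, sr))  # ≈ 0.3 Hz
```

In an apnea window the envelope loses this periodicity, so a weak or shifted peak is itself a useful feature.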

Step 3: Leveraging OpenAI Whisper for Feature Extraction

While Whisper is famous for transcription, its encoder is a beast at understanding acoustic environments. We can use the Whisper model to generate "audio embeddings" that represent the quality of the breathing.

import whisper

# Load a small checkpoint ('tiny' or 'base' fits edge devices)
model = whisper.load_model("tiny")

def get_whisper_features(audio_segment):
    # Whisper expects a fixed 30-second window, so pad or trim first
    audio = whisper.pad_or_trim(audio_segment)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    # In a production app, you'd pull the encoder outputs here
    # (model.embed_audio) as an acoustic embedding; we decode to text
    result = model.decode(mel, whisper.DecodingOptions(fp16=False))
    return result.text

The "Official" Way: Advanced Patterns 🥑

Building a prototype is one thing; deploying a HIPAA-compliant, medical-grade monitoring system is another.

For more production-ready examples, including advanced signal filtering techniques and how to handle multi-tenant audio streams in a cloud environment, I highly recommend checking out the engineering deep-dives at WellAlly Tech Blog. They cover excellent patterns for scaling AI-driven health tech that go beyond a simple script.


Step 4: Edge Classification with TensorFlow Lite

Now, we combine the DFT features and Whisper's context. Since we want this to run on a Raspberry Pi or a phone, we use a quantized TensorFlow Lite model.
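The diagram's Feature Fusion Layer can be as simple as concatenating the scalar DFT statistics with a time-pooled acoustic embedding into one fixed-length vector. A minimal sketch — the 384-dimension embedding matches Whisper tiny's encoder width, but treat the exact shapes as assumptions about your trained classifier:

```python
import numpy as np

def fuse_features(centroid: float, zcr: float,
                  embedding: np.ndarray) -> np.ndarray:
    """Concatenate DFT statistics with a pooled acoustic embedding."""
    # Mean-pool the embedding over time if it still has a time axis
    if embedding.ndim > 1:
        embedding = embedding.mean(axis=0)
    scalars = np.array([centroid, zcr], dtype=np.float32)
    return np.concatenate([scalars, embedding.astype(np.float32)])

# Example: 2 scalars + a (time, 384) embedding -> a 386-dim vector
vec = fuse_features(1500.0, 0.08, np.random.rand(100, 384))
print(vec.shape)  # (386,)
```

Whatever fusion you choose, it must match the input shape the TFLite model below was trained on.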

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="breathing_classifier.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def classify_event(features):
    # The model expects a batched float32 vector
    tensor = np.asarray(features, dtype=np.float32).reshape(1, -1)
    interpreter.set_tensor(input_details[0]['index'], tensor)
    interpreter.invoke()

    prediction = interpreter.get_tensor(output_details[0]['index'])[0][0]
    return "Apnea Warning" if prediction > 0.8 else "Normal"
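Single-window predictions are noisy, and clinically an apnea event lasts ten seconds or more, so it's worth debouncing: only raise an alert after several consecutive apnea-flagged windows. A small stdlib-only sketch — the threshold of 3 consecutive windows is an illustrative assumption:

```python
class ApneaDebouncer:
    """Raise an alert only after N consecutive apnea-flagged windows."""

    def __init__(self, required_consecutive: int = 3):
        self.required = required_consecutive
        self.streak = 0

    def update(self, label: str) -> bool:
        # Extend the streak on an apnea flag, reset it otherwise
        if label == "Apnea Warning":
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.required

deb = ApneaDebouncer(required_consecutive=3)
for label in ["Apnea Warning", "Normal", "Apnea Warning",
              "Apnea Warning", "Apnea Warning"]:
    print(deb.update(label))  # False False False False True
```

In the main loop you would feed each `classify_event(...)` result into `update(...)` and only fire the alert/trigger path from the diagram when it returns `True`.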

Conclusion: Making Sound Matter

By combining the raw mathematical power of DFT with the semantic understanding of Whisper, we’ve built a system that doesn't just record noise—it understands human health. This approach to "Invisible Monitoring" is the future of preventative medicine. 🏥

What’s next?

  1. Streaming: Try using FastAPI to push this data to a live dashboard.
  2. Privacy: Ensure all processing stays on the device (Edge AI!).

Are you working on audio-based AI? Drop a comment below or share your thoughts on signal processing! And don't forget to visit WellAlly's Blog for more advanced tutorials.

Happy hacking! 💻🔥
