We’ve all been there: your alarm goes off, and you feel like you’ve been hit by a freight train. Is it just a bad night's sleep, or is your body fighting off underlying inflammation? In the world of high-performance athletics and biohacking, Heart Rate Variability (HRV) is often treated as the "canary in the coal mine" for your autonomic nervous system.
In this tutorial, we are going to build a personalized anomaly detection pipeline using the Isolation Forest algorithm. By applying machine learning to wearable data from sources like the Oura Ring or Apple Watch, we can move beyond static thresholds and identify "physiological outliers" that may signal over-training or impending illness. We'll use scikit-learn for modeling and Polars for fast DataFrame manipulation.
The Architecture: From Raw Signals to Health Alerts
Before we dive into the code, let's look at how the data flows from your finger (or wrist) to a meaningful health insight.
graph TD
    A[Wearable Device: Oura/Apple Watch] -->|Raw HRV/R-R Intervals| B(Oura Cloud API / HealthKit)
    B --> C{Data Processing}
    C -->|Polars| D[Feature Engineering: RMSSD, SDNN, Rolling Windows]
    D --> E[Isolation Forest Model]
    E --> F{Anomaly Score}
    F -->|Outlier Detected| G[Alert: Rest & Recovery Recommended]
    F -->|Normal| H[Status: All Systems Go]
Prerequisites
To follow along, you'll need:
- Python 3.9+
- polars: The blazing-fast DataFrame library.
- scikit-learn: For the Isolation Forest implementation.
- matplotlib: For visualizing our "crash days."
- An API Key from Oura Cloud (Optional, but recommended for real data).
Step 1: Fetching and Preprocessing with Polars 🐻‍❄️
pandas is great, but Polars is typically faster on large or high-frequency signal data, so we'll use it here. We'll start by simulating our HRV data (swap in a real export if you have one). The metric we care about is RMSSD (Root Mean Square of Successive Differences), the most widely used time-domain measure of short-term HRV, stored here in the hrv_rmssd column.
import polars as pl
import numpy as np
from datetime import datetime, timedelta

# Mocking some HRV data: 100 days of readings
def generate_hrv_data(days=100, seed=42):
    # Seeded generator so the mock data is the same on every run
    rng = np.random.default_rng(seed)
    # Index i means "i days ago"
    date_range = [datetime.now() - timedelta(days=i) for i in range(days)]
    # Simulated baseline around 55 ms; real RMSSD varies widely between people
    base_hrv = rng.normal(55, 10, days)
    # Injecting "Inflammation/Over-training" anomalies (sudden drops)
    base_hrv[10] = 15
    base_hrv[45] = 18
    return pl.DataFrame({
        "timestamp": date_range,
        "hrv_rmssd": base_hrv
    }).sort("timestamp")

df = generate_hrv_data()
print(df.head())
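For context on where an hrv_rmssd value comes from, here is a minimal sketch of computing RMSSD (and SDNN, which appears in the architecture diagram) from a list of R-R intervals. The function names and the sample intervals are made up for illustration; a real wearable API may already hand you the daily RMSSD, in which case you would skip this step.

```python
import math

def rmssd(rr_ms):
    """Root mean square of successive differences between R-R intervals (ms)."""
    if len(rr_ms) < 2:
        raise ValueError("need at least two R-R intervals")
    # Differences between each interval and the one before it
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def sdnn(rr_ms):
    """Sample standard deviation of the R-R intervals (ms)."""
    if len(rr_ms) < 2:
        raise ValueError("need at least two R-R intervals")
    mean = sum(rr_ms) / len(rr_ms)
    return math.sqrt(sum((x - mean) ** 2 for x in rr_ms) / (len(rr_ms) - 1))

# Hypothetical R-R intervals in milliseconds (illustrative values only)
rr_intervals = [800, 810, 790, 805, 795]
print(f"RMSSD: {rmssd(rr_intervals):.2f} ms")
print(f"SDNN:  {sdnn(rr_intervals):.2f} ms")
```

RMSSD looks only at beat-to-beat changes, while SDNN measures overall spread, which is why the two can disagree on the same recording.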
Step 2: Feature Engineering
A single low HRV reading might just be a glass of wine from the night before. To detect true over-training, we need context. We'll calculate rolling averages and volatility metrics.
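def engineer_features(df: pl.DataFrame) -> pl.DataFrame:
    return df.with_columns([
        # 7-day rolling baseline and volatility (each window ends on the current day)
        pl.col("hrv_rmssd").rolling_mean(window_size=7).alias("hrv_7day_avg"),
        pl.col("hrv_rmssd").rolling_std(window_size=7).alias("hrv_7day_std"),
        # Day-over-day change in RMSSD
        (pl.col("hrv_rmssd") - pl.col("hrv_rmssd").shift(1)).alias("hrv_velocity")
    ]).drop_nulls()  # the first rows lack a full 7-day window, so they are dropped

df_features = engineer_features(df)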
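One thing to keep in mind: the 7-day windows in engineer_features end on the current day, so a crash day drags its own baseline down. A possible complementary feature is a z-score of each reading against the preceding days only. The sketch below is written in plain Python to keep the arithmetic visible; the function name and the sample readings are hypothetical, and in Polars the same idea could likely be expressed by applying shift(1) before the rolling call.

```python
from statistics import mean, stdev

def trailing_zscores(values, window=7):
    """Z-score of each reading against the `window` readings before it.

    Returns None where there is not enough history or the baseline is flat.
    """
    scores = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]  # excludes values[i] itself
        if len(history) < window:
            scores.append(None)
            continue
        spread = stdev(history)
        scores.append((value - mean(history)) / spread if spread > 0 else None)
    return scores

# Made-up RMSSD readings (ms) with one sharp drop on the last day
readings = [55, 58, 52, 57, 54, 56, 53, 20]
z = trailing_zscores(readings)
print(z[-1])  # strongly negative: the drop is far below the prior week
```

Because the crash day is not part of its own baseline, the drop stands out more clearly than it would against a window that includes it.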
Step 3: Detecting Anomalies with Isolation Forest 🌲
Why Isolation Forest? Instead of modelling what "normal" looks like and flagging whatever doesn't fit, Isolation Forest explicitly identifies anomalies by "isolating" observations. Because outliers are few and different, random splits separate them from the rest of the data in fewer steps, so they end up with shorter paths in the random trees.
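To make the "few and different" intuition concrete, here is a toy one-dimensional sketch that counts how many random splits it takes to isolate a value. This is a simplification for illustration, not scikit-learn's implementation (the real algorithm also picks random features and subsamples the data); the helper names and readings are made up.

```python
import random

def isolation_depth(values, target, rng, max_depth=50):
    """Count random splits until `target` is alone in its partition."""
    depth = 0
    current = list(values)
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # only duplicates left, nothing more to split
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def mean_depth(values, target, n_trees=200, seed=0):
    """Average isolation depth over many random 'trees'."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(n_trees)) / n_trees

# Made-up RMSSD readings (ms): eleven ordinary days and one crash day (15)
readings = [52, 55, 58, 49, 61, 54, 57, 50, 56, 53, 59, 15]
print("crash day (15):", mean_depth(readings, 15))
print("typical day (55):", mean_depth(readings, 55))
```

The crash day is usually cut off within the first split or two, while a typical day sits among close neighbours and needs several more. Isolation Forest turns that average path length into its anomaly score.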
from sklearn.ensemble import IsolationForest

# Prepare features for the model
features = ["hrv_rmssd", "hrv_7day_avg", "hrv_7day_std", "hrv_velocity"]
X = df_features.select(features).to_numpy()

# Initialize Isolation Forest
# contamination=0.05 sets the cut-off so that roughly 5% of the training days
# are flagged, whether or not that many are truly "anomalous"
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)

# Fit and predict. Despite the column name, fit_predict returns hard labels
# (-1 for anomaly, 1 for normal), not a continuous score
df_features = df_features.with_columns([
    pl.Series(model.fit_predict(X)).alias("anomaly_score")
])

# Keep only the "Danger" days
anomalies = df_features.filter(pl.col("anomaly_score") == -1)
print(f"Detected {len(anomalies)} potential health warnings!")
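Hard labels tell you which days crossed the cut-off, but not by how much. If you want to rank days by severity, scikit-learn's IsolationForest also exposes score_samples and decision_function. The sketch below uses a small made-up single-feature dataset rather than the df_features frame above, so the variable names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Made-up daily RMSSD values (ms): 99 ordinary days plus one sharp drop
hrv = np.append(rng.normal(55, 5, 99), 15.0)
X = hrv.reshape(-1, 1)

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(X)

labels = model.predict(X)            # -1 = anomaly, 1 = normal
scores = model.score_samples(X)      # lower = more anomalous
margin = model.decision_function(X)  # negative = below the contamination cut-off

# Rank days from most to least anomalous and show the top three
ranked = np.argsort(scores)
for idx in ranked[:3]:
    print(f"day {idx}: rmssd={hrv[idx]:.1f} ms, score={scores[idx]:.3f}, label={labels[idx]}")
```

Ranking by score makes it easier to separate a genuine crash day from days that were flagged only because the contamination setting requires roughly 5% of days to be labelled.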
The "Official" Way to Scale 🥑
While this script is a great start for personal use, moving medical and signal processing algorithms into production requires a more robust architecture. Handling noise in PPG sensors and managing real-time data streams is a complex engineering feat.
For more production-ready examples and advanced architectural patterns on bio-signal processing, I highly recommend checking out the WellAlly Tech Blog. They cover deep dives into health-tech infrastructure and how to build scalable wellness applications that go beyond simple scripts.
Step 4: Visualization 📈
What's a data science project without a pretty graph? Let's highlight our "danger zones."
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(df_features["timestamp"], df_features["hrv_rmssd"], label="Daily HRV", color="#2ecc71", alpha=0.6)
plt.scatter(
    anomalies["timestamp"],
    anomalies["hrv_rmssd"],
    color="#e74c3c",
    label="Anomaly (Over-training/Illness)",
    zorder=5
)
plt.title("Personalized HRV Anomaly Detection")
plt.xlabel("Date")
plt.ylabel("HRV RMSSD (ms)")
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
Conclusion
Using Isolation Forest allows us to create a personalized baseline for our health. Instead of following generic "optimal" ranges found on Google, we are letting the data define what is normal for our body.
Summary of what we did:
- Loaded (here, simulated) wearable HRV data with Polars.
- Created rolling window features to capture physiological trends.
- Trained an unsupervised model to flag unusual days, such as sharp drops in HRV that can accompany systemic stress.
Are you tracking your bio-signals? If you've tried implementing health algorithms with Scikit-learn, let me know in the comments! Don't forget to check out WellAlly for more advanced health-tech content. 🚀