Published: January 2025 | Reading Time: 12 minutes
Introduction
Can a machine learning model understand what you're doing just by looking at sensor data from your smartwatch? In this project, I built an end-to-end ML pipeline that processes 2.8 million sensor readings to accurately classify 18 different human activities with 94.2% accuracy.
This post walks through the complete journey: from raw sensor data to a deployed web application with real-time predictions.
What You'll Learn:
- Processing massive time-series datasets efficiently
- Feature engineering for sensor data
- Building ensemble models for activity classification
- Deploying ML models with Streamlit
Key Results:
- ✅ 94.2% accuracy across 18 activity classes
- ✅ Ensemble model beats both single-model baselines (+3.0 points over Random Forest, +5.5 over the LSTM)
- ✅ Real-time inference with sliding window approach
- ✅ Interactive web app with explainability features
The Dataset: PAMAP2
What is PAMAP2?
The Physical Activity Monitoring dataset (PAMAP2) contains sensor recordings from:
- 9 subjects wearing sensors while performing activities
- 3 IMU sensors (Inertial Measurement Units):
  - Hand sensor
  - Chest sensor
  - Ankle sensor
- Each IMU provides:
  - 3-axis acceleration (x, y, z)
  - 3-axis gyroscope
- A separate chest-worn heart-rate monitor
Activities include:
Walking, running, cycling, sitting, standing, watching TV, playing soccer, rope jumping, and 10 more.
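For reference, PAMAP2 encodes activities as integer IDs (the second column of each data file, right after the timestamp). A lookup like the one below keeps labels readable; the mapping follows the dataset's readme, so double-check it against your copy:
# Activity ID -> name, per the PAMAP2 readme. ID 0 marks transient
# periods between activities and is usually discarded before training.
ACTIVITY_NAMES = {
    1: 'lying', 2: 'sitting', 3: 'standing', 4: 'walking', 5: 'running',
    6: 'cycling', 7: 'Nordic walking', 9: 'watching TV', 10: 'computer work',
    11: 'car driving', 12: 'ascending stairs', 13: 'descending stairs',
    16: 'vacuum cleaning', 17: 'ironing', 18: 'folding laundry',
    19: 'house cleaning', 20: 'playing soccer', 24: 'rope jumping',
}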
Data Statistics
import pandas as pd
# Load data
df = pd.read_csv('PAMAP2_Dataset.txt', sep=' ', header=None)
print(f"Total samples: {len(df):,}")
print(f"Total features: {df.shape[1]}")
print(f"Missing values: {df.isnull().sum().sum()}")
Output:
Total samples: 2,872,533
Total features: 54
Missing values: 438,291 (15.3%)
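Most of those NaNs live in the heart-rate channel: the HR monitor samples at roughly 9 Hz while the IMUs run at 100 Hz, so the HR column is empty on most rows. A quick check confirms where the gaps are:
# Fraction of missing values per column, highest first. In PAMAP2 the
# heart-rate column dominates (HR is sampled at ~9 Hz vs. 100 Hz IMUs).
missing_frac = df.isnull().mean().sort_values(ascending=False)
print(missing_frac.head(10))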
Challenge: How do we handle 15% missing data in sensor readings?
Data Preprocessing
Step 1: Handling Missing Values
Sensors sometimes fail to record. We can't simply drop rows (we'd lose too much data) or fill with the column mean (which ignores the temporal structure of the signal).
Solution: Forward-fill + Interpolation
def handle_missing_values(df):
    """
    Forward fill for short gaps, interpolate for longer gaps
    """
    # Forward fill (carry last valid observation forward, at most 10 samples)
    df_filled = df.ffill(limit=10)
    # Linear interpolation for remaining gaps
    df_filled = df_filled.interpolate(method='linear', limit_direction='both')
    # Drop any remaining NaNs (start/end of sequences)
    df_filled = df_filled.dropna()
    return df_filled
# Apply preprocessing
df_clean = handle_missing_values(df)
print(f"Samples after cleaning: {len(df_clean):,}")
Step 2: Feature Engineering
Raw sensor values aren't directly useful. We need to extract meaningful features.
import numpy as np
from scipy import stats
from scipy.signal import find_peaks
def extract_features(window):
"""
Extract statistical features from a window of sensor data
Args:
window: DataFrame with shape (window_size, num_sensors)
Returns:
features: Dictionary of computed features
"""
features = {}
# For each sensor column
for col in window.columns:
signal = window[col].values
# Time domain features
features[f'{col}_mean'] = np.mean(signal)
features[f'{col}_std'] = np.std(signal)
features[f'{col}_min'] = np.min(signal)
features[f'{col}_max'] = np.max(signal)
features[f'{col}_range'] = np.ptp(signal) # Peak-to-peak
features[f'{col}_median'] = np.median(signal)
features[f'{col}_mad'] = np.median(np.abs(signal - np.median(signal)))
# Statistical moments
features[f'{col}_skewness'] = stats.skew(signal)
features[f'{col}_kurtosis'] = stats.kurtosis(signal)
# Percentiles
features[f'{col}_25percentile'] = np.percentile(signal, 25)
features[f'{col}_75percentile'] = np.percentile(signal, 75)
# Energy and power
features[f'{col}_energy'] = np.sum(signal ** 2) / len(signal)
features[f'{col}_power'] = np.mean(signal ** 2)
# Zero crossing rate
zero_crossings = np.where(np.diff(np.sign(signal)))[0]
features[f'{col}_zcr'] = len(zero_crossings) / len(signal)
# Peak detection
peaks, _ = find_peaks(signal, distance=5)
features[f'{col}_num_peaks'] = len(peaks)
# Frequency domain (simple)
fft_vals = np.abs(np.fft.rfft(signal))
features[f'{col}_fft_mean'] = np.mean(fft_vals)
features[f'{col}_fft_max'] = np.max(fft_vals)
features[f'{col}_dominant_freq'] = np.argmax(fft_vals)
return features
Result: 100+ features per time window!
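A quick sanity check on a synthetic window shows where that number comes from, 18 features per channel (the data here is random noise, just to exercise the function):
# Sanity check: run extract_features on one fake 100-sample window
rng = np.random.default_rng(0)
dummy = pd.DataFrame(rng.normal(size=(100, 6)),
                     columns=[f'sensor_{i}' for i in range(6)])
feats = extract_features(dummy)
print(f"{len(feats)} features from {dummy.shape[1]} channels")
# -> 108 features (18 per channel), matching the dimension reported below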
Step 3: Sliding Window Approach
Time-series data needs temporal context. We use sliding windows:
def create_windows(df, window_size=100, step_size=50):
    """
    Create overlapping windows from a continuous time-series
    Args:
        window_size: Number of samples per window (1 second at 100Hz)
        step_size: Stride between window starts (50 samples = 50% overlap)
    """
    windows = []
    labels = []
    for start_idx in range(0, len(df) - window_size, step_size):
        end_idx = start_idx + window_size
        window = df.iloc[start_idx:end_idx]
        # Extract features from the sensor channels only; keeping the
        # label column out of the feature set avoids target leakage
        features = extract_features(window.drop(columns=['activity']))
        windows.append(features)
        # Label the window with its most frequent activity
        label = window['activity'].mode()[0]
        labels.append(label)
    return pd.DataFrame(windows), np.array(labels)
# Create dataset
X, y = create_windows(df_clean, window_size=100, step_size=50)
print(f"Created {len(X):,} windows")
print(f"Feature dimension: {X.shape[1]}")
Output:
Created 57,220 windows
Feature dimension: 108
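One detail worth flagging before modeling: PAMAP2's activity IDs run from 1 to 24 with gaps, but nn.CrossEntropyLoss (used for the LSTM below) expects contiguous class indices 0-17. A LabelEncoder handles the conversion; a minimal sketch:
from sklearn.preprocessing import LabelEncoder

# Map raw activity IDs (1..24 with gaps) to contiguous indices 0..17
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
print(label_encoder.classes_)  # original activity IDs, in sorted order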
Model Development
Approach 1: Random Forest (Baseline)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train Random Forest
rf_model = RandomForestClassifier(
n_estimators=200,
max_depth=30,
min_samples_split=5,
min_samples_leaf=2,
n_jobs=-1,
random_state=42
)
rf_model.fit(X_train, y_train)
# Evaluate
y_pred = rf_model.predict(X_test)
print(f"Random Forest Accuracy: {(y_pred == y_test).mean():.3f}")
Results:
Random Forest Accuracy: 0.912 (91.2%)
Approach 2: LSTM for Temporal Patterns
Random Forest treats each window independently. LSTMs can capture temporal dependencies.
import torch
import torch.nn as nn
class ActivityLSTM(nn.Module):
def __init__(self, input_size, hidden_size=128, num_layers=2, num_classes=18):
super().__init__()
self.lstm = nn.LSTM(
input_size=input_size,
hidden_size=hidden_size,
num_layers=num_layers,
batch_first=True,
dropout=0.3
)
self.fc = nn.Sequential(
nn.Linear(hidden_size, 64),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(64, num_classes)
)
def forward(self, x):
# x shape: (batch, seq_len, features)
lstm_out, _ = self.lstm(x)
# Take last timestep
last_output = lstm_out[:, -1, :]
# Classification
logits = self.fc(last_output)
return logits
# Training loop (train_loader is assumed here: a DataLoader yielding
# (batch, seq_len, 54) float tensors and integer labels; see the helper
# sketch after this loop)
model = ActivityLSTM(input_size=54, hidden_size=128)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
for epoch in range(50):
model.train()
for batch_X, batch_y in train_loader:
optimizer.zero_grad()
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
loss.backward()
optimizer.step()
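The loop above assumes a train_loader built from the windowed sequences, and the ensemble in the next section calls a train_lstm helper plus a predict_proba method that the snippets don't show. Minimal versions might look like this; it's a sketch under assumed tensor shapes, not the original code:
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def make_loader(X_sequences, y, batch_size=64, shuffle=True):
    """X_sequences: (N, seq_len, 54) floats; y: (N,) integer class indices."""
    dataset = TensorDataset(torch.as_tensor(X_sequences, dtype=torch.float32),
                            torch.as_tensor(y, dtype=torch.long))
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

def train_lstm(model, X_sequences, y, epochs=50, lr=1e-3):
    """Same training loop as above, wrapped for the ensemble's fit()."""
    loader = make_loader(X_sequences, y)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for batch_X, batch_y in loader:
            optimizer.zero_grad()
            loss = criterion(model(batch_X), batch_y)
            loss.backward()
            optimizer.step()

@torch.no_grad()
def lstm_predict_proba(model, X_sequences):
    """Softmax class probabilities, shaped like RandomForest's predict_proba."""
    model.eval()
    logits = model(torch.as_tensor(X_sequences, dtype=torch.float32))
    return F.softmax(logits, dim=1).numpy()

# Attach as a method so self.lstm_model.predict_proba(...) works below
ActivityLSTM.predict_proba = lstm_predict_proba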
LSTM Results:
LSTM Accuracy: 0.887 (88.7%)
Wait, LSTM performs worse than Random Forest? Why?
The Winning Approach: Ensemble Model
Insight: Combine Both Approaches
- Random Forest: Great at capturing complex feature interactions
- LSTM: Good at temporal patterns but struggles with high-dimensional features
Solution: Use both!
class HybridActivityClassifier:
def __init__(self):
self.rf_model = RandomForestClassifier(n_estimators=200)
self.lstm_model = ActivityLSTM(input_size=54)
def fit(self, X_features, X_sequences, y):
"""
X_features: Engineered features for Random Forest
X_sequences: Raw sequences for LSTM
"""
# Train Random Forest on engineered features
self.rf_model.fit(X_features, y)
        # Train LSTM on raw sequences (train_lstm: helper sketched above)
train_lstm(self.lstm_model, X_sequences, y)
def predict(self, X_features, X_sequences):
# Get predictions from both models
rf_probs = self.rf_model.predict_proba(X_features)
lstm_probs = self.lstm_model.predict_proba(X_sequences)
# Weighted average (RF gets more weight based on validation)
ensemble_probs = 0.7 * rf_probs + 0.3 * lstm_probs
return np.argmax(ensemble_probs, axis=1)
# Train ensemble
hybrid_model = HybridActivityClassifier()
hybrid_model.fit(X_train_features, X_train_sequences, y_train)
# Evaluate
y_pred = hybrid_model.predict(X_test_features, X_test_sequences)
print(f"Ensemble Accuracy: {(y_pred == y_test).mean():.3f}")
Ensemble Results:
Ensemble Accuracy: 0.942 (94.2%)
✅ +3.0 accuracy points over Random Forest alone, and +5.5 over the LSTM!
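The 0.7/0.3 weights weren't pulled from thin air; they were tuned on a validation split. To reproduce that, a one-dimensional sweep is enough (sketch: rf_val_probs, lstm_val_probs, and y_val are assumed validation-set arrays, not shown in the post):
# Sweep the RF weight w on a validation split and keep the best accuracy
best_w, best_acc = 0.5, 0.0
for w in np.linspace(0, 1, 21):
    probs = w * rf_val_probs + (1 - w) * lstm_val_probs
    acc = (np.argmax(probs, axis=1) == y_val).mean()
    if acc > best_acc:
        best_w, best_acc = w, acc
print(f"Best RF weight: {best_w:.2f} (val accuracy {best_acc:.3f})")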
Model Analysis
Confusion Matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Compute confusion matrix (activity_names below: the 18 class labels,
# in encoder order)
cm = confusion_matrix(y_test, y_pred)
# Plot
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=activity_names,
yticklabels=activity_names)
plt.title('Confusion Matrix - Hybrid Model')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=300)
Per-Class Performance
| Activity | Precision | Recall | F1-Score |
|---|---|---|---|
| Walking | 96% | 98% | 0.97 |
| Running | 99% | 97% | 0.98 |
| Cycling | 92% | 90% | 0.91 |
| Sitting | 94% | 96% | 0.95 |
| Standing | 89% | 87% | 0.88 |
| Watching TV | 91% | 93% | 0.92 |
| Average | 94% | 94% | 0.94 |
Observations:
- ✅ Dynamic activities (running, cycling) are easier to classify
- ⚠️ Static activities (sitting, standing) are more confused
- ⚠️ Similar activities (walking vs. Nordic walking) have lower precision
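A table like the one above falls straight out of the classification_report imported earlier (assuming activity_names lists the class names in encoder order):
# Per-class precision / recall / F1 for the ensemble predictions
print(classification_report(y_test, y_pred, target_names=activity_names))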
Feature Importance
# Get feature importance from Random Forest
importances = rf_model.feature_importances_
feature_names = X_train.columns
# Sort and plot top 20
indices = np.argsort(importances)[::-1][:20]
plt.figure(figsize=(10, 6))
plt.bar(range(20), importances[indices])
plt.xticks(range(20), [feature_names[i] for i in indices], rotation=45, ha='right')
plt.title('Top 20 Most Important Features')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300)
Key Insights:
- 🥇 Hand accelerometer features are most important
- 🥈 Heart rate is surprisingly useful
- 🥉 Gyroscope helps distinguish rotation-heavy activities
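To sanity-check that ranking, it helps to aggregate importance per sensor location rather than per individual feature. This sketch assumes columns follow a sensor-prefixed naming scheme like hand_acc_x_mean (adjust the split to your actual column names):
# Sum feature importances by sensor prefix ('hand', 'chest', 'ankle', ...)
importance_by_sensor = {}
for name, imp in zip(feature_names, importances):
    sensor = name.split('_')[0]
    importance_by_sensor[sensor] = importance_by_sensor.get(sensor, 0.0) + imp
print(sorted(importance_by_sensor.items(), key=lambda kv: -kv[1]))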
Deployment: Real-Time Activity Recognition
Building the Streamlit App
import streamlit as st
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle
# Load model
@st.cache_resource
def load_model():
with open('activity_model.pkl', 'rb') as f:
return pickle.load(f)
model = load_model()
# App title
st.title('🏃 Real-Time Activity Recognition')
st.write('Upload sensor data or connect to live sensor stream')
# File upload
uploaded_file = st.file_uploader("Upload sensor data (CSV)", type='csv')
if uploaded_file is not None:
# Read data
df = pd.read_csv(uploaded_file)
    # Preprocess (preprocess_sensor_data: project helper that windows the
    # signal and extracts features; see src/preprocessing.py in the repo)
    with st.spinner('Processing sensor data...'):
        X_features, X_sequences = preprocess_sensor_data(df)
    # Predict (predict_proba on the hybrid model is assumed to mirror
    # predict, returning the weighted ensemble probabilities)
    predictions = model.predict(X_features, X_sequences)
    probabilities = model.predict_proba(X_features, X_sequences)
# Display results
st.subheader('Detected Activities')
# Activity timeline
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(predictions)
ax.set_xlabel('Time Window')
ax.set_ylabel('Activity')
ax.set_title('Activity Over Time')
st.pyplot(fig)
# Activity distribution
st.subheader('Activity Distribution')
activity_counts = pd.Series(predictions).value_counts()
st.bar_chart(activity_counts)
# Confidence scores
st.subheader('Model Confidence')
avg_confidence = np.max(probabilities, axis=1).mean()
st.metric("Average Confidence", f"{avg_confidence:.1%}")
# SHAP explainability
st.subheader('Feature Importance (SHAP)')
import shap
explainer = shap.TreeExplainer(model.rf_model)
shap_values = explainer.shap_values(X_features[:100])
fig, ax = plt.subplots()
shap.summary_plot(shap_values, X_features[:100], show=False)
st.pyplot(fig)
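For completeness, the activity_model.pkl the app loads is just the trained hybrid model serialized after training. A minimal sketch (in practice, joblib for the Random Forest and torch.save for the LSTM's state_dict are the more robust choices):
# Serialize the trained hybrid model for the Streamlit app to load
import pickle

with open('activity_model.pkl', 'wb') as f:
    pickle.dump(hybrid_model, f)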
Demo
Try it live: https://data230.streamlit.app/
Heads up: the app goes to sleep when idle, so you may need to wake it up and give it a moment.
Lessons Learned
What Worked Well
- Feature Engineering is King: Hand-crafted features beat raw sequences
- Ensemble Methods: Combining different approaches gives best results
- Domain Knowledge: Understanding sensor physics helps feature design
Challenges
- Imbalanced Classes: Some activities had 10x more samples than others
  - Solution: Used stratified sampling and class weights (see the sketch below)
- Sensor Placement: Hand sensor was most informative, but not always practical
  - Future: Test with phone-only sensors
- Real-Time Processing: Sliding windows cause latency
  - Solution: Reduced window overlap for deployment
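For the class-imbalance fix, scikit-learn's built-in weighting is the lightest-touch option; the post doesn't show its exact settings, so treat this as a sketch:
# class_weight='balanced' reweights classes inversely to their frequency,
# so rare activities (e.g. rope jumping) aren't drowned out by walking
rf_balanced = RandomForestClassifier(
    n_estimators=200, class_weight='balanced', n_jobs=-1, random_state=42
)
rf_balanced.fit(X_train, y_train)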
Future Improvements
- [ ] Add more subjects for generalization
- [ ] Test with smartphone sensors (accelerometer + gyroscope only)
- [ ] Implement online learning for personalization
- [ ] Optimize for mobile deployment
Conclusion
This project demonstrates a complete ML pipeline from raw sensor data to deployed application. Key takeaways:
- Data quality matters: Proper preprocessing and feature engineering are crucial
- Model selection: Sometimes simpler models (Random Forest) beat deep learning
- Ensemble power: Combining approaches gives the best results
- Deployment: Real-world constraints (latency, resources) influence design
The complete code is available on GitHub: https://github.com/sriram2930/Physical-Activity-Prediction-Using-ML-methods-
Code Repository
activity-recognition/
├── data/
│ ├── PAMAP2_Dataset/
│ └── processed/
├── notebooks/
│ ├── 01_data_exploration.ipynb
│ ├── 02_feature_engineering.ipynb
│ ├── 03_model_training.ipynb
│ └── 04_model_evaluation.ipynb
├── src/
│ ├── preprocessing.py
│ ├── features.py
│ ├── models.py
│ └── utils.py
├── app/
│ └── streamlit_app.py
├── requirements.txt
└── README.md
Want to learn more? Check out my other posts:
- Building Real-Time Object Detection for Edge Devices
- Fine-Tuning BERT for Sentiment Analysis
- Hybrid Movie Recommendation Systems
Questions? Reach out at sreeramachutuni@gmail.com