Ever looked at your dog mid-zoom and thought: "Is this joy or a cry for help?"
I did. So I built something.
This is a walkthrough of how I trained a lightweight multimodal classifier to detect emotional states in pets using short video clips — and how I deployed it as a real web app at mypettherapist.com.
Spoiler: the hardest part wasn't the model. It was the data.
The Problem With Pet Emotion AI
Most pet AI is a parlor trick. "Oh look, the model says your cat is surprised." Cool. But surprise is not actionable.
What is actionable:
- Is my pet anxious right now?
- Is this behavior getting worse over time?
- Should I call a vet?
That's what I wanted to build. A system that gives pet owners behavioral signals, not meme labels.
Architecture Overview
Video Clip (5–15s)
│
▼
Frame Sampler (every 0.5s)
│
▼
┌──────────────────────────────┐
│ MobileNetV3 (vision) │ ← body posture
│ Whisper-tiny (audio) │ ← vocalizations
│ Pose Keypoints (MediaPipe) │ ← tail, ears, spine
└──────────────────────────────┘
│
▼
Late Fusion Layer (concat embeddings)
│
▼
Classifier → [Relaxed / Alert / Anxious / Playful / Distressed]
Three modalities, fused late at the embedding level. Overkill? Maybe. But pet behavior is genuinely multimodal — a wagging tail means nothing without posture context.
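Concretely, the "late fusion" step in the diagram is just concatenating the three encoder embeddings into one vector (dimensions match the encoder sizes used later in the post):

```python
import torch

v = torch.randn(1, 576)  # MobileNetV3-Small vision embedding
a = torch.randn(1, 384)  # Whisper-tiny audio encoder embedding
p = torch.randn(1, 32)   # hand-crafted pose feature vector
fused = torch.cat([v, a, p], dim=-1)  # late fusion = concat -> (1, 992)
```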
Step 1: Data Collection (The Unglamorous Part)
I scraped ~4,200 labeled clips from:
- YouTube with CC (pet training channels)
- Donated clips from pet owners (consent form + upload portal)
- A small dataset from a vet behaviorist partner
Each clip got 3 independent human labels. Clips where no two raters agreed were thrown out. Final dataset: ~3,100 usable clips.
import pandas as pd
from collections import Counter

def consensus_label(labels: list[str]) -> str | None:
    """Return majority label, or None if no consensus."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if top_count >= 2:  # majority of 3 raters
        return top_label
    return None  # discard ambiguous

df = pd.read_csv("raw_labels.csv")
df["label"] = df[["rater1", "rater2", "rater3"]].apply(
    lambda row: consensus_label(row.tolist()), axis=1
)
df = df.dropna(subset=["label"])
print(df["label"].value_counts())
Output:
Relaxed 1021
Playful 847
Alert 612
Anxious 398
Distressed 223
Class imbalance. Classic. We'll handle it with weighted sampling.
Step 2: Frame + Audio Extraction
import cv2
import subprocess
from pathlib import Path

def extract_frames(video_path: str, fps: float = 2.0) -> list:
    """Extract frames at given FPS."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS)
    # max(1, ...) guards against modulo-by-zero when the container
    # reports 0 FPS or the native FPS is below the target rate
    interval = max(1, int(native_fps / fps))
    frames = []
    i = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if i % interval == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

def extract_audio(video_path: str, out_path: str) -> None:
    """Extract mono 16kHz audio for Whisper."""
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-ar", "16000", "-ac", "1",
        "-f", "wav", out_path,
        "-y", "-loglevel", "quiet"
    ], check=True)
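The `i % interval == 0` rule in `extract_frames` keeps roughly `fps` frames per second of video. The arithmetic in isolation, as a pure function:

```python
def sampled_indices(total_frames: int, native_fps: float, target_fps: float = 2.0) -> list[int]:
    """Frame indices kept by the i % interval == 0 sampling rule."""
    interval = max(1, int(native_fps / target_fps))  # guard for low/unknown FPS
    return [i for i in range(total_frames) if i % interval == 0]

# a 2-second clip at 30 fps, sampled at 2 fps, keeps every 15th frame
sampled_indices(60, 30.0)  # [0, 15, 30, 45]
```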
Step 3: MediaPipe Pose Keypoints for Pets
This is where it gets fun. MediaPipe's standard pose model is human-only — pets have different joint topology. So I used a fine-tuned version from Google's Animal Pose research track, plus some open-source pet keypoint models.
Key features I extract per frame:
- Tail angle relative to spine
- Ear position (flattened vs. upright)
- Spine curvature (hunched = stress signal)
- Center of mass velocity (stillness vs. agitation)
import mediapipe as mp
import numpy as np
import cv2

def extract_pet_keypoints(frame) -> dict | None:
    # NOTE: mp.solutions.pose is the standard *human* pose model; the indices
    # below (11 = left shoulder, 23/24 = hips, 27 = left ankle) are human
    # landmarks used as stand-ins in this snippet — production swaps in the
    # pet-specific keypoint model described above.
    mp_pose = mp.solutions.pose
    with mp_pose.Pose(
        static_image_mode=True,
        model_complexity=2,
        min_detection_confidence=0.5
    ) as pose:
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if not results.pose_landmarks:
        return None
    lm = results.pose_landmarks.landmark
    return {
        "spine_top": (lm[11].x, lm[11].y),
        "spine_mid": (lm[23].x, lm[23].y),
        "spine_base": (lm[24].x, lm[24].y),
        "tail_base": (lm[27].x, lm[27].y),
    }

def spine_curvature(kp: dict) -> float:
    """Perpendicular deviation of the mid-spine from the top-to-base line,
    normalized by body length: 0 = straight, larger = more hunched."""
    top = np.array(kp["spine_top"])
    mid = np.array(kp["spine_mid"])
    base = np.array(kp["spine_base"])
    v1 = base - top
    v2 = mid - top
    proj = np.dot(v2, v1) / np.dot(v1, v1) * v1
    deviation = np.linalg.norm(v2 - proj)
    body_len = np.linalg.norm(v1)
    return float(deviation / body_len) if body_len > 0 else 0.0
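The tail-angle feature from the list above can be sketched the same way. This is a hypothetical helper: the `tail_tip` keypoint is assumed (the post doesn't show the full pet keypoint set), and the angle is just the arccos of the normalized dot product between the tail and spine vectors:

```python
import numpy as np

def tail_angle_deg(kp: dict) -> float:
    """Angle in degrees between tail direction and spine direction.
    Assumes a hypothetical 'tail_tip' keypoint alongside those above."""
    spine = np.array(kp["spine_base"]) - np.array(kp["spine_top"])
    tail = np.array(kp["tail_tip"]) - np.array(kp["tail_base"])
    denom = np.linalg.norm(spine) * np.linalg.norm(tail)
    if denom == 0:
        return 0.0
    cos = np.clip(np.dot(spine, tail) / denom, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))
```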
Step 4: Late Fusion Model
Instead of training one big end-to-end model (complex, brittle), I train each encoder separately, then fuse their embeddings.
import torch
import torch.nn as nn

class PetEmotionFusion(nn.Module):
    def __init__(
        self,
        vision_dim: int = 576,  # MobileNetV3 small
        audio_dim: int = 384,   # Whisper-tiny encoder
        pose_dim: int = 32,     # Keypoint features
        num_classes: int = 5
    ):
        super().__init__()
        total_dim = vision_dim + audio_dim + pose_dim
        self.fusion = nn.Sequential(
            nn.Linear(total_dim, 256),
            nn.LayerNorm(256),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(256, 64),
            nn.GELU(),
            nn.Linear(64, num_classes)
        )

    def forward(self, vision_emb, audio_emb, pose_emb):
        x = torch.cat([vision_emb, audio_emb, pose_emb], dim=-1)
        return self.fusion(x)
Training with class-weighted loss:
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

class_counts = [1021, 847, 612, 398, 223]
class_weights = 1.0 / torch.tensor(class_counts, dtype=torch.float)
sample_weights = class_weights[train_labels]  # train_labels: LongTensor of class indices

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True
)
criterion = nn.CrossEntropyLoss(weight=class_weights.cuda())
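Putting the sampler and weighted loss together, one training step on CPU might look like this. Toy random tensors stand in for the precomputed embeddings, and a plain `nn.Linear` stands in for the fusion head, so the snippet runs on its own:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)
# toy stand-ins: 64 samples of pre-concatenated (vision+audio+pose) embeddings
X = torch.randn(64, 992)
y = torch.randint(0, 5, (64,))

class_counts = torch.bincount(y, minlength=5).float()
class_weights = 1.0 / class_counts.clamp(min=1)   # avoid div-by-zero on empty classes
sample_weights = class_weights[y]                 # per-sample weight = inverse class freq
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights),
                                replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=16, sampler=sampler)

model = nn.Linear(992, 5)                         # stand-in for the fusion head
criterion = nn.CrossEntropyLoss(weight=class_weights)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for xb, yb in loader:                             # one epoch over the toy data
    loss = criterion(model(xb), yb)
    opt.zero_grad()
    loss.backward()
    opt.step()
```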
Step 5: FastAPI Inference Endpoint
from fastapi import FastAPI, UploadFile
from fastapi.responses import JSONResponse
import tempfile, shutil
from pathlib import Path

app = FastAPI()

@app.post("/analyze")
async def analyze_pet_video(file: UploadFile) -> JSONResponse:
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    try:
        result = inference_pipeline(tmp_path)
        return JSONResponse({
            "emotion": result["label"],
            "confidence": round(result["confidence"], 3),
            "signals": result["top_signals"],
            "recommendation": result["recommendation"]
        })
    finally:
        Path(tmp_path).unlink(missing_ok=True)
Sample response:
{
  "emotion": "Anxious",
  "confidence": 0.847,
  "signals": ["hunched spine", "low tail angle", "no vocalization"],
  "recommendation": "Reduce stimuli. Consider consulting a pet behaviorist."
}
Results
| Metric | Value |
|---|---|
| Validation Accuracy | 78.3% |
| Weighted F1 | 0.76 |
| Inference time (CPU) | ~1.8s per 10s clip |
| Model size | 42MB (quantized INT8) |
Not perfect — but a vet behaviorist reviewed 100 random predictions and rated 81% as "clinically reasonable." For a v1, that's good enough to ship.
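The 42MB INT8 figure implies post-training quantization. The post doesn't say which method; dynamic quantization of the Linear layers is one plausible route, sketched here on a stand-in for the fusion head:

```python
import torch
import torch.nn as nn

# stand-in for the trained fusion head (992 = 576 + 384 + 32)
model = nn.Sequential(nn.Linear(992, 256), nn.GELU(), nn.Linear(256, 5)).eval()

# dynamic quantization: weights stored as INT8, activations quantized on the fly
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
out = qmodel(torch.randn(1, 992))  # inference works the same as the FP32 model
```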
What I Learned
- Multimodal > unimodal — Adding pose keypoints boosted F1 by ~8 points over vision-only
- Label quality > label quantity — Throwing out ambiguous clips hurt dataset size but helped the model
- Late fusion is underrated — Easier to debug, easier to swap components, nearly as good as end-to-end
- Pets are hard — They don't hold still. Speed jitter + random crop augmentation was essential
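The speed-jitter augmentation mentioned above can be as simple as resampling the frame list to mimic a playback-speed change. A hypothetical sketch, not the post's exact implementation:

```python
import random

def speed_jitter(frames: list, low: float = 0.8, high: float = 1.25) -> list:
    """Resample a frame list to simulate a random playback-speed change."""
    factor = random.uniform(low, high)            # >1 = faster, <1 = slower
    n = max(1, round(len(frames) / factor))       # new clip length in frames
    last = len(frames) - 1
    # map new positions back onto original indices (nearest-frame resampling)
    idx = [min(last, round(i * last / max(1, n - 1))) for i in range(n)]
    return [frames[i] for i in idx]
```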
What's Next
- Longitudinal tracking — Not just "anxious now" but "more anxious than last week?"
- Breed-specific fine-tuning — A Basenji's "relaxed" looks like a Labrador's "alert"
- On-device inference — ONNX export + Core ML for iOS
If you want to try the live demo on your own pet's video, it's free to start at mypettherapist.com 🐾
Questions about the architecture, training, or vet partnership — drop them in the comments.
What multimodal fusion patterns have worked well for you?