DEV Community

Esther Studer

I Built a Multimodal AI to Detect Pet Emotions from Video — Full Python Breakdown

Ever looked at your dog mid-zoom and thought: "Is this joy or a cry for help?"

I did. So I built something.

This is a walkthrough of how I trained a lightweight multimodal classifier to detect emotional states in pets using short video clips — and how I deployed it as a real web app at mypettherapist.com.

Spoiler: the hardest part wasn't the model. It was the data.


The Problem With Pet Emotion AI

Most pet AI is a parlor trick. "Oh look, the model says your cat is surprised." Cool. But surprise is not actionable.

What is actionable:

  • Is my pet anxious right now?
  • Is this behavior getting worse over time?
  • Should I call a vet?

That's what I wanted to build. A system that gives pet owners behavioral signals, not meme labels.


Architecture Overview

Video Clip (5–15s)
      │
      ▼
Frame Sampler (every 0.5s)
      │
      ▼
  ┌──────────────────────────────┐
  │  MobileNetV3 (vision)        │  ← body posture
  │  Whisper-tiny (audio)        │  ← vocalizations
  │  Pose Keypoints (MediaPipe)  │  ← tail, ears, spine
  └──────────────────────────────┘
      │
      ▼
Late Fusion Layer (concat embeddings)
      │
      ▼
Classifier → [Relaxed / Alert / Anxious / Playful / Distressed]

Three modalities fused at inference time. Overkill? Maybe. But pet behavior is genuinely multimodal — a wagging tail means nothing without posture context.


Step 1: Data Collection (The Unglamorous Part)

I scraped ~4,200 labeled clips from:

  • YouTube with CC (pet training channels)
  • Donated clips from pet owners (consent form + upload portal)
  • A small dataset from a vet behaviorist partner

Each clip got 3 independent human labels. Disagreements (>1 label apart) were thrown out. Final dataset: ~3,100 usable clips.

import pandas as pd
from collections import Counter

def consensus_label(labels: list[str]) -> str | None:
    """Return majority label, or None if no consensus."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if top_count >= 2:  # majority of 3 raters
        return top_label
    return None  # discard ambiguous

df = pd.read_csv("raw_labels.csv")
df["label"] = df[["rater1", "rater2", "rater3"]].apply(
    lambda row: consensus_label(row.tolist()), axis=1
)
df = df.dropna(subset=["label"])
print(df["label"].value_counts())
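As a quick sanity check, the consensus rule behaves like this (the helper is repeated so the snippet runs standalone):

```python
from collections import Counter

def consensus_label(labels):
    """Return majority label, or None if no consensus."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if top_count >= 2:  # majority of 3 raters
        return top_label
    return None

# 2-of-3 agreement wins
print(consensus_label(["Anxious", "Anxious", "Alert"]))    # Anxious
# a three-way split is discarded
print(consensus_label(["Anxious", "Alert", "Playful"]))    # None
# unanimous agreement trivially passes
print(consensus_label(["Relaxed", "Relaxed", "Relaxed"]))  # Relaxed
```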

Output:

Relaxed      1021
Playful       847
Alert         612
Anxious       398
Distressed    223

Class imbalance. Classic. We'll handle it with weighted sampling.


Step 2: Frame + Audio Extraction

import cv2
import subprocess
from pathlib import Path

def extract_frames(video_path: str, fps: float = 2.0) -> list:
    """Extract frames at (approximately) the given FPS."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS)
    # Guard against missing/zero FPS metadata, which would make interval 0
    # and crash the modulo below
    interval = max(1, round(native_fps / fps)) if native_fps > 0 else 1
    frames = []
    i = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if i % interval == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

def extract_audio(video_path: str, out_path: str) -> None:
    """Extract mono 16kHz audio for Whisper."""
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-ar", "16000", "-ac", "1",
        "-f", "wav", out_path,
        "-y", "-loglevel", "quiet"
    ], check=True)

Step 3: MediaPipe Pose Keypoints for Pets

This is where it gets fun. MediaPipe's standard pose model is human-only — pets have different joint topology. So I used a fine-tuned version from Google's Animal Pose research track, plus some open-source pet keypoint models.

Key features I extract per frame:

  • Tail angle relative to spine
  • Ear position (flattened vs. upright)
  • Spine curvature (hunched = stress signal)
  • Center of mass velocity (stillness vs. agitation)

import mediapipe as mp
import numpy as np
import cv2

def extract_pet_keypoints(frame) -> dict | None:
    # NOTE: mp.solutions.pose is trained on humans; production swaps in an
    # animal-pose model. The human landmark indices below (shoulders/hips/
    # ankle) stand in for spine/tail points for illustration only.
    # (Re-creating Pose per frame is slow — hoist it out for batch runs.)
    mp_pose = mp.solutions.pose
    with mp_pose.Pose(
        static_image_mode=True,
        model_complexity=2,
        min_detection_confidence=0.5
    ) as pose:
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if not results.pose_landmarks:
            return None

        lm = results.pose_landmarks.landmark
        return {
            "spine_top": (lm[11].x, lm[11].y),   # left shoulder
            "spine_mid": (lm[23].x, lm[23].y),   # left hip
            "spine_base": (lm[24].x, lm[24].y),  # right hip
            "tail_base": (lm[27].x, lm[27].y),   # left ankle
        }

def spine_curvature(kp: dict) -> float:
    """Deviation of mid-spine from the top→base line, normalized by
    body length (0 = straight, higher = more hunched)."""
    top = np.array(kp["spine_top"])
    mid = np.array(kp["spine_mid"])
    base = np.array(kp["spine_base"])
    v1 = base - top
    v2 = mid - top
    proj = np.dot(v2, v1) / np.dot(v1, v1) * v1
    deviation = np.linalg.norm(v2 - proj)
    body_len = np.linalg.norm(v1)
    return float(deviation / body_len) if body_len > 0 else 0.0
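The tail-angle feature from the list above works the same way. Here's a sketch of the geometry — `tail_angle_deg` and the `tail_tip` keypoint are hypothetical (the extractor above only returns `tail_base`), shown just to illustrate the feature:

```python
import numpy as np

def tail_angle_deg(kp: dict) -> float:
    """Angle (degrees) between the spine axis and the tail segment.
    ~0 = tail in line with spine; large = tail raised or tucked."""
    spine = np.array(kp["spine_base"]) - np.array(kp["spine_top"])
    tail = np.array(kp["tail_tip"]) - np.array(kp["tail_base"])
    cos = np.dot(spine, tail) / (np.linalg.norm(spine) * np.linalg.norm(tail))
    # clip before arccos to survive floating-point drift at ±1
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# toy keypoints: spine pointing straight down, tail sticking out sideways
kp = {
    "spine_top": (0.5, 0.2),
    "spine_base": (0.5, 0.6),
    "tail_base": (0.5, 0.6),
    "tail_tip": (0.7, 0.6),
}
print(tail_angle_deg(kp))  # 90.0
```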

Step 4: Late Fusion Model

Instead of training one big end-to-end model (complex, brittle), I train each encoder separately, then fuse their embeddings.

import torch
import torch.nn as nn

class PetEmotionFusion(nn.Module):
    def __init__(
        self,
        vision_dim: int = 576,   # MobileNetV3 small
        audio_dim: int = 384,    # Whisper-tiny encoder
        pose_dim: int = 32,      # Keypoint features
        num_classes: int = 5
    ):
        super().__init__()
        total_dim = vision_dim + audio_dim + pose_dim
        self.fusion = nn.Sequential(
            nn.Linear(total_dim, 256),
            nn.LayerNorm(256),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(256, 64),
            nn.GELU(),
            nn.Linear(64, num_classes)
        )

    def forward(self, vision_emb, audio_emb, pose_emb):
        x = torch.cat([vision_emb, audio_emb, pose_emb], dim=-1)
        return self.fusion(x)
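To sanity-check the tensor shapes, here's the same module repeated as a self-contained snippet, fed random embeddings (batch of 4):

```python
import torch
import torch.nn as nn

class PetEmotionFusion(nn.Module):  # same module as above, repeated to run standalone
    def __init__(self, vision_dim=576, audio_dim=384, pose_dim=32, num_classes=5):
        super().__init__()
        total_dim = vision_dim + audio_dim + pose_dim
        self.fusion = nn.Sequential(
            nn.Linear(total_dim, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.GELU(), nn.Linear(64, num_classes),
        )

    def forward(self, vision_emb, audio_emb, pose_emb):
        return self.fusion(torch.cat([vision_emb, audio_emb, pose_emb], dim=-1))

model = PetEmotionFusion().eval()
with torch.no_grad():
    logits = model(torch.randn(4, 576), torch.randn(4, 384), torch.randn(4, 32))
print(logits.shape)  # torch.Size([4, 5])
```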

Training with class-weighted loss:

import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

class_counts = [1021, 847, 612, 398, 223]
class_weights = 1.0 / torch.tensor(class_counts, dtype=torch.float)
sample_weights = class_weights[train_labels]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True
)
# Note: the sampler already rebalances batches, so also weighting the loss
# double-corrects for imbalance — consider keeping only one of the two.
criterion = nn.CrossEntropyLoss(weight=class_weights.cuda())
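Wiring the sampler into an actual loop looks roughly like this. Everything here is a stand-in — synthetic embeddings, a toy model, and placeholder hyperparameters — just to show the plumbing:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# synthetic stand-ins for the real fused embeddings and labels
features = torch.randn(200, 992)            # 576 + 384 + 32
train_labels = torch.randint(0, 5, (200,))

class_counts = torch.bincount(train_labels, minlength=5).float()
class_weights = 1.0 / class_counts
sample_weights = class_weights[train_labels]

sampler = WeightedRandomSampler(
    weights=sample_weights, num_samples=len(sample_weights), replacement=True
)
loader = DataLoader(TensorDataset(features, train_labels),
                    batch_size=32, sampler=sampler)

model = nn.Sequential(nn.Linear(992, 64), nn.GELU(), nn.Linear(64, 5))
criterion = nn.CrossEntropyLoss()  # unweighted here: the sampler balances batches
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for epoch in range(2):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```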

Step 5: FastAPI Inference Endpoint

from fastapi import FastAPI, UploadFile
from fastapi.responses import JSONResponse
import tempfile, shutil
from pathlib import Path

app = FastAPI()

@app.post("/analyze")
async def analyze_pet_video(file: UploadFile) -> JSONResponse:
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    try:
        result = inference_pipeline(tmp_path)
        return JSONResponse({
            "emotion": result["label"],
            "confidence": round(result["confidence"], 3),
            "signals": result["top_signals"],
            "recommendation": result["recommendation"]
        })
    finally:
        Path(tmp_path).unlink(missing_ok=True)

Sample response:

{
  "emotion": "Anxious",
  "confidence": 0.847,
  "signals": ["hunched spine", "low tail angle", "no vocalization"],
  "recommendation": "Reduce stimuli. Consider consulting a pet behaviorist."
}
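On the client side, a response like the one above parses straightforwardly. The confidence threshold and output formatting below are illustrative choices, not part of the API contract:

```python
import json

raw = '''{
  "emotion": "Anxious",
  "confidence": 0.847,
  "signals": ["hunched spine", "low tail angle", "no vocalization"],
  "recommendation": "Reduce stimuli. Consider consulting a pet behaviorist."
}'''

result = json.loads(raw)
# surface low-confidence predictions as "Uncertain" instead of stating them as fact
verdict = result["emotion"] if result["confidence"] >= 0.6 else "Uncertain"
print(f"{verdict} ({result['confidence']:.0%}): {', '.join(result['signals'])}")
# Anxious (85%): hunched spine, low tail angle, no vocalization
```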

Results

Metric                 Value
Validation Accuracy    78.3%
Weighted F1            0.76
Inference time (CPU)   ~1.8s per 10s clip
Model size             42 MB (quantized INT8)

Not perfect — but a vet behaviorist reviewed 100 random predictions and rated 81% as "clinically reasonable." For a v1, that's good enough to ship.
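The 42 MB figure comes from post-training INT8 quantization. With PyTorch, dynamic quantization of the linear layers is nearly a one-liner — sketched here on a stand-in model, not the production checkpoint:

```python
import torch
import torch.nn as nn

# toy stand-in for the fusion head
model = nn.Sequential(nn.Linear(992, 256), nn.GELU(), nn.Linear(256, 5)).eval()

# convert Linear weights to int8; activations stay float (dynamic quantization)
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 992))
print(out.shape)  # torch.Size([1, 5])
```

Dynamic quantization only shrinks the weights of supported layers (here `nn.Linear`), so the size win depends on how much of the model is linear — the fusion head compresses well, while the vision backbone needs its own quantization path.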


What I Learned

  1. Multimodal > unimodal — Adding pose keypoints boosted F1 by ~8 points over vision-only
  2. Label quality > label quantity — Throwing out ambiguous clips hurt dataset size but helped the model
  3. Late fusion is underrated — Easier to debug, easier to swap components, nearly as good as end-to-end
  4. Pets are hard — They don't hold still. Speed jitter + random crop augmentation was essential

What's Next

  • Longitudinal tracking — Not just "anxious now" but "more anxious than last week?"
  • Breed-specific fine-tuning — A Basenji's "relaxed" looks like a Labrador's "alert"
  • On-device inference — ONNX export + Core ML for iOS

If you want to try the live demo on your own pet's video, it's free to start at mypettherapist.com 🐾

Questions about the architecture, training, or vet partnership — drop them in the comments.

What multimodal fusion patterns have worked well for you?
