Esther Studer

I Built an AI That Detects Pet Stress From Photos — Here's the Stack

Everyone's building AI for humans. I thought: what about dogs?

My dog Biscuit has "resting panic face." He looks catastrophically stressed at all times — even when he's asleep. Vets kept telling me he was fine. I didn't believe them. So I did what any reasonable developer does: I over-engineered a solution.

This is the story of how I built a pet stress detection API using computer vision, and what I learned shipping it.


The Problem (In Dog Terms)

Animals communicate stress through body language — ear position, tail carriage, muscle tension, eye shape. Humans miss ~70% of these signals according to veterinary behaviour research. We're wired to anthropomorphise: a dog "smiling" is often a stress pant.

The question: can a model learn to read these signals reliably from a standard smartphone photo?

Short answer: Yes, surprisingly well.


The Stack

Python 3.12
FastAPI (inference endpoint)
Hugging Face Transformers (ViT base)
OpenCV (preprocessing)
PostgreSQL (results storage)
Vercel (frontend)

Step 1 — Dataset Collection

I scraped ~14,000 labelled animal behaviour images from academic sources (AniML-Behavior dataset + manual labels from certified animal behaviourists).

Label categories:

  • relaxed
  • mildly_aroused
  • stressed
  • fearful

The manifest builder:

import pandas as pd
from pathlib import Path

def build_manifest(image_dir: str, label_csv: str) -> pd.DataFrame:
    df = pd.read_csv(label_csv)
    df['path'] = df['filename'].apply(
        lambda f: str(Path(image_dir) / f)
    )
    # Drop low-confidence labels (inter-rater < 0.7)
    df = df[df['confidence'] >= 0.70]
    print(f"Dataset: {len(df)} samples across {df['label'].nunique()} classes")
    return df

Output:

Dataset: 11,847 samples across 4 classes
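Before training I also needed a hold-out split that keeps the rarer classes (like fearful) represented. A minimal sketch of a per-class split on the manifest; the stratified_split helper here is illustrative, not the exact code I shipped:

```python
import pandas as pd

def stratified_split(df: pd.DataFrame, val_frac: float = 0.15, seed: int = 42):
    # Sample val_frac of *each* label so rare classes don't vanish from validation
    val = df.groupby("label").sample(frac=val_frac, random_state=seed)
    train = df.drop(val.index)
    return train, val
```

Because the sampling happens per label group, a class with 20 samples still contributes its share of validation images instead of being swallowed by the majority class.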

Step 2 — Fine-Tuning ViT

I started with google/vit-base-patch16-224 and fine-tuned on the labelled dataset. The patch-based attention mechanism turned out to be great for picking up localised cues — ear tips, eye whites, jaw tension.

from transformers import ViTForImageClassification, TrainingArguments, Trainer
import torch

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=4,
    id2label={0: "relaxed", 1: "mildly_aroused", 2: "stressed", 3: "fearful"},
    label2id={"relaxed": 0, "mildly_aroused": 1, "stressed": 2, "fearful": 3},
    ignore_mismatched_sizes=True,
)

training_args = TrainingArguments(
    output_dir="./pet-stress-vit",
    num_train_epochs=6,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
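One gotcha: metric_for_best_model="f1" only works if the Trainer is also given a compute_metrics function that reports that key. A minimal version, assuming scikit-learn is available (my actual metrics code had a bit more in it):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # The Trainer passes (logits, labels); macro-F1 weights all four classes equally
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),
    }
```

Pass it as `Trainer(..., compute_metrics=compute_metrics)` and the "f1" key matches the metric_for_best_model setting above.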

After 6 epochs: 82.4% accuracy on the hold-out set. Vet behaviourist benchmark for the same images: 78%.

The model beat humans. Biscuit vindicated.


Step 3 — The FastAPI Endpoint

from fastapi import FastAPI, UploadFile, File
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification
import torch, io

app = FastAPI()
processor = ViTImageProcessor.from_pretrained("./pet-stress-vit")
model = ViTForImageClassification.from_pretrained("./pet-stress-vit")
model.eval()

@app.post("/predict")
async def predict_stress(file: UploadFile = File(...)):
    contents = await file.read()
    image = Image.open(io.BytesIO(contents)).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    probs = torch.softmax(logits, dim=-1)[0]
    predicted_class = model.config.id2label[logits.argmax(-1).item()]
    confidence = probs.max().item()

    return {
        "label": predicted_class,
        "confidence": round(confidence, 3),
        "scores": {
            model.config.id2label[i]: round(probs[i].item(), 3)
            for i in range(len(probs))
        }
    }

Example response for a photo of Biscuit:

{
  "label": "stressed",
  "confidence": 0.871,
  "scores": {
    "relaxed": 0.041,
    "mildly_aroused": 0.088,
    "stressed": 0.871,
    "fearful": 0.000
  }
}

Told you. Permanently stressed.


Step 4 — The Hard Part (Latency)

The ViT inference was taking ~1.2s on CPU. Too slow for a "snap a photo" UX. Three things fixed it:

  1. ONNX export — cut inference to ~190ms
  2. Image resize on client — send 224×224 px, not 4K originals
  3. Async queue — non-urgent requests batch every 500ms
# Export to ONNX — return_dict=False so the model emits plain tensors
dummy_input = torch.randn(1, 3, 224, 224)
model.config.return_dict = False
torch.onnx.export(
    model,
    dummy_input,
    "pet-stress.onnx",
    opset_version=14,
    input_names=["pixel_values"],
    output_names=["logits"],
    dynamic_axes={"pixel_values": {0: "batch"}},
)
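Point 2 — resizing on the client — is just a few lines with Pillow. A sketch of the idea (prepare_upload is a hypothetical helper, not the production client code):

```python
import io
from PIL import Image

def prepare_upload(path_or_file, size: int = 224, quality: int = 85) -> bytes:
    # Downscale before upload: a 4K photo is orders of magnitude more bytes
    # than the 224x224 input the model actually sees
    img = Image.open(path_or_file).convert("RGB")
    img = img.resize((size, size), Image.BILINEAR)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```

The real client should preserve aspect ratio with a centre crop rather than a naive resize, but even this version cuts upload payloads dramatically.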

P95 latency after optimisation: ~210ms. Usable.
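For point 3, the 500ms batching queue boils down to "collect requests into a window, run one batched inference, fan the results back out". A simplified asyncio sketch; MicroBatcher and its API are illustrative, not the production code:

```python
import asyncio

class MicroBatcher:
    """Collect submissions for `window` seconds, then run one batched call."""

    def __init__(self, infer_fn, window: float = 0.5):
        self.infer_fn = infer_fn   # takes a list of inputs, returns a list of results
        self.window = window
        self.queue = []            # pending (item, future) pairs
        self._task = None

    async def submit(self, item):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self.queue.append((item, fut))
        if self._task is None:     # first item in the window starts the timer
            self._task = asyncio.create_task(self._flush())
        return await fut

    async def _flush(self):
        await asyncio.sleep(self.window)
        batch, self.queue, self._task = self.queue, [], None
        results = self.infer_fn([item for item, _ in batch])
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)
```

In the real service infer_fn is the ONNX session run over a stacked batch; the win is that ten requests arriving in the same window cost one forward pass instead of ten.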


What I Learned

1. Domain-specific data beats model size. A fine-tuned ViT-Base stomps GPT-4V on this task — and it runs on a $6/mo VPS.

2. Calibration matters more than accuracy. An 82% accurate model that's confident when wrong is worse than a 78% model that knows its limits.
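Temperature scaling is the simplest calibration fix I found: divide the logits by a scalar T fitted on validation data, which softens overconfident predictions without changing which class wins. A sketch with a crude grid search (the helpers here are illustrative; LBFGS on the NLL is the more standard fitting method):

```python
import torch
import torch.nn.functional as F

def scaled_probs(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # T > 1 flattens an overconfident distribution; the argmax is unchanged
    return torch.softmax(logits / temperature, dim=-1)

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    # Pick the T in [0.5, 3.5] that minimises validation NLL
    grid = [0.5 + 0.1 * i for i in range(31)]
    return min(grid, key=lambda t: F.cross_entropy(val_logits / t, val_labels).item())
```

At serving time the fitted T is baked into the softmax, so the confidence numbers in the API response actually mean something.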

3. Non-technical users trust visuals. Adding a heatmap overlay (Grad-CAM) showing which body parts drove the prediction increased user trust massively — even when they didn't understand it.

from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

# ViT outputs token sequences, so Grad-CAM needs the tokens folded
# back into a 14x14 spatial grid (CLS token dropped)
def reshape_transform(t, h=14, w=14):
    return t[:, 1:, :].reshape(t.size(0), h, w, t.size(2)).permute(0, 3, 1, 2)

cam = GradCAM(model=model,
              target_layers=[model.vit.encoder.layer[-1].layernorm_before],
              reshape_transform=reshape_transform)
grayscale_cam = cam(input_tensor=inputs["pixel_values"])
visualization = show_cam_on_image(rgb_img, grayscale_cam[0], use_rgb=True)

Where This Is Going

Computer vision for animal behaviour is genuinely underexplored. Most vet-tech is billing software and appointment schedulers. The signal detection problem is wide open.

If you're working on anything in the pet health / animal behaviour space — or you're curious about the model weights — feel free to reach out. The team at MyPetTherapist is building exactly this kind of AI-assisted behavioural matching, connecting stressed pets with the right therapist based on actual behavioural signals, not just owner guesswork.

Biscuit is still stressed. But now I have data to prove it.


Built something weird with computer vision? Drop it in the comments — I read every one.
