I Built an AI That Detects Pet Stress From Photos — Here's the Stack
Everyone's building AI for humans. I thought: what about dogs?
My dog Biscuit has "resting panic face." He looks catastrophically stressed at all times — even when he's asleep. Vets kept telling me he was fine. I didn't believe them. So I did what any reasonable developer does: I over-engineered a solution.
This is the story of how I built a pet stress detection API using computer vision, and what I learned shipping it.
The Problem (In Dog Terms)
Animals communicate stress through body language — ear position, tail carriage, muscle tension, eye shape. Humans miss ~70% of these signals according to veterinary behaviour research. We're wired to anthropomorphise: a dog "smiling" is often a stress pant.
The question: can a model learn to read these signals reliably from a standard smartphone photo?
Short answer: Yes, surprisingly well.
The Stack
Python 3.12
FastAPI (inference endpoint)
Hugging Face Transformers (ViT base)
OpenCV (preprocessing)
PostgreSQL (results storage)
Vercel (frontend)
Step 1 — Dataset Collection
I scraped ~14,000 labelled animal behaviour images from academic sources (AniML-Behavior dataset + manual labels from certified animal behaviourists).
Label categories:
- relaxed
- mildly_aroused
- stressed
- fearful
import pandas as pd
from pathlib import Path

def build_manifest(image_dir: str, label_csv: str) -> pd.DataFrame:
    df = pd.read_csv(label_csv)
    # Resolve each filename to a full path under image_dir
    df['path'] = df['filename'].apply(
        lambda f: str(Path(image_dir) / f)
    )
    # Drop low-confidence labels (inter-rater < 0.7)
    df = df[df['confidence'] >= 0.70]
    # `:,` adds the thousands separator seen in the output
    print(f"Dataset: {len(df):,} samples across {df['label'].nunique()} classes")
    return df
Output:
Dataset: 11,847 samples across 4 classes
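The post refers to a hold-out set later on but doesn't show how it was carved out. One plausible approach, sketched here with only the standard library, is a stratified split, so each of the four labels keeps roughly the same share in both subsets. The function name `stratified_split`, the 15% hold-out fraction, and the fixed seed are assumptions for illustration, not details from the project.

```python
import random
from collections import defaultdict

def stratified_split(labels, holdout_frac=0.15, seed=42):
    """Return (train_idx, holdout_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, holdout_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        if len(indices) > 1:
            # At least one hold-out sample per class, but never the whole class
            n_holdout = max(1, round(len(indices) * holdout_frac))
            n_holdout = min(n_holdout, len(indices) - 1)
        else:
            # A single-sample class stays in training
            n_holdout = 0
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)

# Toy example: 20 relaxed and 20 stressed samples
labels = ["relaxed"] * 20 + ["stressed"] * 20
train_idx, holdout_idx = stratified_split(labels)
```

With the manifest from `build_manifest`, this could be called as `stratified_split(df['label'].tolist())`, with the returned positions passed to `df.iloc[...]`.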
Step 2 — Fine-Tuning ViT
I started with google/vit-base-patch16-224 and fine-tuned on the labelled dataset. The patch-based attention mechanism turned out to be great for picking up localised cues — ear tips, eye whites, jaw tension.
from transformers import ViTForImageClassification, TrainingArguments, Trainer
import torch

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=4,
    id2label={0: "relaxed", 1: "mildly_aroused", 2: "stressed", 3: "fearful"},
    label2id={"relaxed": 0, "mildly_aroused": 1, "stressed": 2, "fearful": 3},
    # Replace the 1000-class ImageNet head with a fresh 4-class head
    ignore_mismatched_sizes=True,
)

training_args = TrainingArguments(
    output_dir="./pet-stress-vit",
    num_train_epochs=6,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    # Older transformers releases call this argument `evaluation_strategy`
    eval_strategy="epoch",
    # "best" needs a recent transformers release; use "epoch" on older ones
    save_strategy="best",
    load_best_model_at_end=True,
    # Requires a compute_metrics function that returns an "f1" key
    metric_for_best_model="f1",
)
After 6 epochs: 82.4% accuracy on the hold-out set. Vet behaviourist benchmark for the same images: 78%.
The model beat humans. Biscuit vindicated.
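`metric_for_best_model="f1"` only works if the `Trainer` is given a `compute_metrics` function that returns a key named `f1`, which the training snippet doesn't show. Here is a minimal sketch of what that could look like, assuming macro-averaged F1 over the four classes. The averaging mode is a guess, since the post doesn't say which one was used.

```python
def macro_f1(y_true, y_pred, num_classes=4):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples scores 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

def compute_metrics(eval_pred):
    # Trainer passes logits and label ids as NumPy arrays
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {"f1": macro_f1(labels.tolist(), preds.tolist())}
```

Passing `compute_metrics=compute_metrics` to `Trainer(...)` would make the score available as `eval_f1` for best-checkpoint selection.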
Step 3 — The FastAPI Endpoint
from fastapi import FastAPI, UploadFile, File
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification
import io
import torch

app = FastAPI()
processor = ViTImageProcessor.from_pretrained("./pet-stress-vit")
model = ViTForImageClassification.from_pretrained("./pet-stress-vit")
model.eval()

@app.post("/predict")
async def predict_stress(file: UploadFile = File(...)):
    contents = await file.read()
    image = Image.open(io.BytesIO(contents)).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    predicted_class = model.config.id2label[logits.argmax(-1).item()]
    confidence = probs.max().item()
    return {
        "label": predicted_class,
        "confidence": round(confidence, 3),
        "scores": {
            model.config.id2label[i]: round(probs[i].item(), 3)
            for i in range(len(probs))
        }
    }
Example response for a photo of Biscuit:
{
  "label": "stressed",
  "confidence": 0.871,
  "scores": {
    "relaxed": 0.041,
    "mildly_aroused": 0.088,
    "stressed": 0.871,
    "fearful": 0.000
  }
}
Told you. Permanently stressed.
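The `scores` in the response come from a softmax over the model's four logits, which is why they sum to 1. Here is a standalone sketch of that step in plain Python. The logit values are made up for illustration and are not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["relaxed", "mildly_aroused", "stressed", "fearful"]
logits = [-0.4, 0.3, 2.6, -5.0]  # hypothetical values, not real model output
probs = softmax(logits)
scores = {label: round(p, 3) for label, p in zip(labels, probs)}
top_label = max(scores, key=scores.get)
```

The largest logit always wins the argmax, but the gap between logits decides how confident the reported score looks.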
Step 4 — The Hard Part (Latency)
The ViT inference was taking ~1.2s on CPU. Too slow for a "snap a photo" UX. Three things fixed it:
- ONNX export — cut inference to ~190ms
- Image resize on client — send 224×224 px, not 4K originals
- Async queue — non-urgent requests batch every 500ms
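# Export to ONNX
model.eval()
# Dummy batch of one 224x224 RGB image; only the shape matters for tracing
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "pet-stress.onnx",
    opset_version=14,
    input_names=["pixel_values"],
    output_names=["logits"],
    # Let the batch dimension vary on both the input and the output
    dynamic_axes={"pixel_values": {0: "batch"}, "logits": {0: "batch"}},
)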
P95 latency after optimisation: ~210ms. Usable.
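The async queue from the list above isn't shown in code. Here is a rough sketch of the idea using only `asyncio`: requests wait in a queue, a worker collects whatever arrives within a time window, and the whole group goes through the model in one call. `MicroBatcher`, `run_batch`, and `max_batch` are hypothetical names. The 0.5 second default mirrors the 500ms figure, and everything else is an assumption about how such a queue might be built.

```python
import asyncio

class MicroBatcher:
    """Collects items for up to `window_s` seconds, then runs them as one batch."""

    def __init__(self, run_batch, window_s=0.5, max_batch=32):
        self.run_batch = run_batch  # callable: list of items -> list of results
        self.window_s = window_s
        self.max_batch = max_batch
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one item and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first item arrives, then open the batching window
            batch = [await self.queue.get()]
            deadline = loop.time() + self.window_s
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # One call for the whole batch, results returned in input order
            results = self.run_batch([item for item, _ in batch])
            for (_, future), result in zip(batch, results):
                if not future.cancelled():
                    future.set_result(result)

async def demo():
    calls = []

    def run_batch(items):
        calls.append(len(items))
        return [item * 2 for item in items]  # stand-in for model inference

    batcher = MicroBatcher(run_batch, window_s=0.05)
    worker = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(5)))
    worker.cancel()
    return results, calls
```

In this sketch `run_batch` is synchronous and blocks the event loop while it runs. A real service would likely hand it to a thread pool, for example with `loop.run_in_executor`.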
What I Learned
1. Domain-specific data beats model size. A fine-tuned ViT-Base stomps GPT-4V on this task — and it runs on a $6/mo VPS.
2. Calibration matters more than accuracy. An 82% accurate model that's confident when wrong is worse than a 78% model that knows its limits.
3. Non-technical users trust visuals. Adding a heatmap overlay (Grad-CAM) showing which body parts drove the prediction increased user trust massively — even when they didn't understand it.
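from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
# ViT emits tokens shaped (batch, 197, 768): drop [CLS] and fold the 196 patches into a 14x14 map
def reshape_transform(t):
    return t[:, 1:, :].reshape(t.size(0), 14, 14, t.size(2)).permute(0, 3, 1, 2)
# logits_model (not shown): a thin nn.Module wrapper whose forward returns model(x).logits, since Grad-CAM expects a plain tensor
cam = GradCAM(model=logits_model, target_layers=[model.vit.encoder.layer[-1].layernorm_before], reshape_transform=reshape_transform)
grayscale_cam = cam(input_tensor=inputs['pixel_values'])
# rgb_img (not shown): the input photo as a float32 array in [0, 1], shape (224, 224, 3)
visualization = show_cam_on_image(rgb_img, grayscale_cam[0], use_rgb=True)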
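Point 2 is easier to act on with a number attached. One common way to quantify calibration is expected calibration error (ECE): bucket predictions by confidence, then compare each bucket's average confidence with its actual accuracy. The post doesn't say which calibration measure was used, so this is a sketch under that assumption, with ten equal-width bins as an arbitrary choice.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, h in bucket if h) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Overconfident toy model: claims 0.95 every time, right half the time
overconfident = expected_calibration_error([0.95] * 4, [True, False, True, False])
# Better calibrated toy model: claims 0.75, right three times out of four
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
```

The first toy model has the higher confidence and the worse score (0.45 versus 0), which is the failure mode point 2 warns about.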
Where This Is Going
Computer vision for animal behaviour is genuinely underexplored. Most vet-tech is billing software and appointment schedulers. The signal detection problem is wide open.
If you're working on anything in the pet health / animal behaviour space — or you're curious about the model weights — feel free to reach out. The team at MyPetTherapist is building exactly this kind of AI-assisted behavioural matching, connecting stressed pets with the right therapist based on actual behavioural signals, not just owner guesswork.
Biscuit is still stressed. But now I have data to prove it.
Built something weird with computer vision? Drop it in the comments — I read every one.