I Built an AI That Reads Your Pet's Body Language — Here's the Exact Tech Stack

#ai #webdev #python #showdev

Your dog pins their ears back. Your cat flicks their tail. Most pet owners miss 80% of what their animals are telling them — not because they don't care, but because we were never taught the language.

I built MyPetTherapist to fix that. Here's the full technical breakdown of how we use AI to interpret pet body language from a single photo.

The Problem Worth Solving

Vets see your pet for maybe 20 minutes a year. In that window, anxiety, pain signals, and behavioral red flags often stay hidden — pets mask stress in unfamiliar environments. The real behavior happens at home, and it's invisible to professionals.

The question I kept asking: what if a phone camera could become a 24/7 behavioral observer?

Architecture Overview

User uploads photo
       ↓
  FastAPI backend
       ↓
  Vision model (GPT-4o) — keypoint extraction + behavioral inference
       ↓
  Structured JSON output
       ↓
  Report generator (species-specific templates)
       ↓
  PDF/HTML delivery to user

Simple on paper. The complexity lives in the prompt engineering and the output validation layer.

Step 1: Vision Analysis via GPT-4o

The core call looks like this:

import base64
import httpx
from openai import OpenAI

client = OpenAI()

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def analyze_pet_body_language(image_path: str, species: str) -> dict:
    base64_image = encode_image(image_path)

    system_prompt = f"""
    You are a certified animal behaviorist specializing in {species} body language.
    Analyze the image and return a structured JSON with:
    - posture: overall body posture assessment
    - ears: ear position and what it signals
    - tail: tail position/movement and interpretation
    - eyes: eye shape, dilation, gaze direction
    - muscle_tension: visible tension indicators
    - stress_level: 1-10 scale with reasoning
    - emotional_state: primary emotional label
    - confidence: your confidence score 0.0-1.0
    - recommendations: list of actionable advice
    """

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": "high"
                        }
                    },
                    {
                        "type": "text",
                        "text": f"Analyze the {species} body language in this image. Return valid JSON only."
                    }
                ]
            }
        ],
        response_format={"type": "json_object"},
        max_tokens=1000
    )

    return json.loads(response.choices[0].message.content)

The response_format: json_object parameter is doing a lot of heavy lifting here. Without it, GPT-4o will occasionally wrap the JSON in markdown fences or add a preamble sentence — and your parser silently fails.

Step 2: Output Validation with Pydantic

Raw LLM output needs a schema guard. We use Pydantic v2 for this:

from pydantic import BaseModel, Field, field_validator
from typing import Literal

class BodyLanguageReport(BaseModel):
    posture: str
    ears: str
    tail: str | None = None  # snakes don't have tails (well, they do, but)
    eyes: str
    muscle_tension: str
    stress_level: int = Field(ge=1, le=10)
    emotional_state: Literal[
        "relaxed", "anxious", "fearful", "playful",
        "aggressive", "neutral", "excited", "in_pain"
    ]
    confidence: float = Field(ge=0.0, le=1.0)
    recommendations: list[str]

    @field_validator("recommendations")
    @classmethod
    def limit_recommendations(cls, v):
        # Keep it actionable, not overwhelming
        return v[:5]

def parse_analysis(raw: dict) -> BodyLanguageReport:
    try:
        return BodyLanguageReport(**raw)
    except Exception as e:
        # Log and return a graceful degradation
        logger.warning(f"Validation failed: {e}. Raw: {raw}")
        raise ValueError("Analysis parsing failed — retrying with fallback prompt")

The emotional_state enum is controversial internally — we debated whether to allow free-form strings. We landed on a fixed vocabulary for two reasons:

Consistency across reports (users compare week-over-week)
Defensibility — we're not making up states, we're classifying into known behavioral categories

Step 3: Species-Specific Prompt Templates

This is where most pet AI projects get lazy — they treat all animals the same. Dog body language and cat body language are not the same problem.

A wagging tail in a dog = excitement. In a cat = irritation. Same signal, opposite meaning.

We maintain a species_config.yaml:

dog:
  tail_baseline: "neutral horizontal"
  stress_indicators:
    - "whale eye (visible sclera)"
    - "lip licking without food present"
    - "yawning in non-tired context"
    - "paw raise"
    - "tucked tail below hock"
  play_indicators:
    - "play bow (front down, rear up)"
    - "loose body wiggle"
    - "open relaxed mouth"

cat:
  tail_baseline: "upright with slight curve = confident/friendly"
  stress_indicators:
    - "dilated pupils in normal light"
    - "flattened ears (airplane ears)"
    - "tail lashing side to side"
    - "whiskers pulled back flat"
    - "crouched low body posture"
  play_indicators:
    - "tail up, slow blink"
    - "chirping at window"
    - "relaxed exposed belly (invitation, NOT submission)"

The prompt dynamically injects the relevant species block:

import yaml

with open("species_config.yaml") as f:
    SPECIES_CONFIG = yaml.safe_load(f)

def build_system_prompt(species: str) -> str:
    config = SPECIES_CONFIG.get(species, {})
    stress = "\n".join(f"- {s}" for s in config.get("stress_indicators", []))
    play = "\n".join(f"- {p}" for p in config.get("play_indicators", []))

    return f"""You are a certified animal behaviorist specializing in {species} behavior.

Known {species} stress signals to watch for:
{stress}

Known {species} positive/play signals:
{play}

Analyze the uploaded image with this species-specific context in mind.
Return structured JSON only."""

Step 4: The Retry + Confidence Threshold Layer

Low-confidence results (< 0.65) trigger a second-pass with a simplified prompt:

async def analyze_with_fallback(
    image_path: str,
    species: str,
    max_retries: int = 2
) -> BodyLanguageReport:

    for attempt in range(max_retries):
        result = await analyze_pet_body_language(image_path, species)
        report = parse_analysis(result)

        if report.confidence >= 0.65:
            return report

        if attempt == 0:
            # Retry with explicit "be conservative" instruction
            logger.info(f"Low confidence ({report.confidence}), retrying with conservative prompt")
            # Modify prompt to emphasize uncertainty is OK

    # Return best result even if below threshold, flagged for review
    report.flagged_for_review = True
    return report

We flag low-confidence analyses in our DB and use them as training signal — they're often edge cases (unusual breeds, bad lighting, multiple animals in frame).

What We Learned After 10,000+ Analyses

1. Image quality kills accuracy more than model choice. A blurry iPhone photo in bad lighting will defeat GPT-4o. We added a pre-flight blur detection step using OpenCV before even calling the API.

2. Context matters massively. A dog mid-yawn looks identical to a dog stress-panting. Adding contextual fields ("is this a vet visit photo?", "was the dog just playing?") in the analysis request improved accuracy by ~18%.

3. Users over-interpret stress scores. A stress_level of 6/10 sent some users into panic mode. We added a baseline comparison ("your dog's average is 4.2 — today is 6.0, slightly elevated but within normal range") which reduced support tickets by 40%.

4. Multi-frame analysis > single photo. We're now rolling out video clip analysis (5-second clips) using frame sampling. Early results show significantly better accuracy for dynamic signals like panting rate and ear micromovement.

What's Next

The v2 pipeline adds:

Longitudinal tracking — weekly stress trend graphs per pet
Multimodal audio — analyzing vocalizations alongside body language
Vet integration API — so pet owners can share reports directly with their vet

If you're working on anything similar in the pet-tech or animal behavior space, I'd love to compare notes in the comments.

And if you want to see the final product in action — try it free at mypettherapist.com. Upload a photo of your pet and get a full behavioral report in under 60 seconds.

Built with Python, FastAPI, GPT-4o, Pydantic v2, and a lot of research into animal behavior literature. The species config YAML currently covers dogs, cats, rabbits, and birds — horses coming soon.