Stephen McCullough

Posted on Apr 1 • Edited on Apr 6 • Originally published at swm.cc

Indexatron: Teaching Local LLMs to See Family Photos

#ollama #python #ai #llm

Status: ✅ SUCCESS
Hypothesis: Local LLMs can analyse family photos with useful metadata extraction

I've been building the-mcculloughs.org - a family photo sharing app. The Rails side handles uploads, galleries, and all the usual stuff. But I wanted semantic search - not just "photos from 2015" but "photos at the beach" or "pictures with grandma."

The cloud APIs exist. But uploading decades of family photos to someone else's servers? Hard pass.

Time for a science experiment - two apps working together.

The Experiment

I called it Indexatron 🤖

The goal: prove that Ollama running locally with LLaVA:7b and nomic-embed-text can:

Analyse photos - Extract descriptions, detect people/objects, estimate era
Generate embeddings - Create 768-dimensional vectors for similarity search
Process batches - Handle multiple images with progress tracking

Test Results

Metric	Value
Images Processed	3/3
Failed	0
Total Time	40.82s
Avg Time/Image	~13.6s

Sample Outputs

🐕 family_photo_03.jpg

Description: "A tan-coloured Labrador Retriever is sitting on a wooden floor indoors"
Categories: ["dog"]
Mood: calm
Processing Time: 14.73s

🍺 family_photo_02.jpg

Description: "A photo of a bottle of beer and a glass with frothy white head on top, placed on a table at a restaurant"
Location: Indoor restaurant
Objects Detected: Beer bottle (Kingfisher brand), glass with beer
Categories: ["beer", "restaurant"]
Processing Time: 14.2s

👔 family_photo_01.jpg

Description: "A man standing in an indoor conference room during a wedding reception"
Era Detected: 2010s (medium confidence)
Person: Male guest, 30s, wearing suit and tie
Categories: ["wedding"]
Processing Time: 11.89s

What Worked Well

LLaVA Vision Analysis

Correctly identified subjects (dog, beer, person)
Detected specific brands (Kingfisher)
Estimated era from visual cues
Provided useful mood/atmosphere descriptions

Embedding Generation

768-dimensional embeddings generated for all images
Based on analysis descriptions (semantic meaning)
Ready for similarity search when needed

Batch Processing

Progress bar with Rich library
Skip existing functionality
Combined JSON output

Quirks & Learnings

JSON Parsing Required Repair

LLaVA doesn't always output clean JSON. The analyser needed:

Code block stripping
Brace balancing
Type coercion for nested objects

Model Hallucinations

Some amusing observations:

The dog photo mentioned "clothing" and "fashion trends for pets" (the dog had no clothes)
Beer was classified under people array with estimated_age: "Beer is an alcoholic beverage"

These quirks don't break the system - robust parsing handles them.

Processing Time

~13.6 seconds per image is acceptable for batch processing. Real-time analysis would need:

Smaller model (llava:7b is the smallest)
GPU acceleration
Or async processing with user feedback

Development Approach

This was parallel development across two codebases - with very different approaches for each.

The Boring Bits: AI Agents for CRUD

The Rails API work? It's not exciting. Setting up API endpoints, adding pgvector, writing migrations, CRUD operations - I've done this hundreds of times. It's necessary scaffolding, but it's not where I want to spend my brain cycles.

So I let AI agents handle it. Claude Code with custom agents for code review, test writing, and documentation. The agents handled the boilerplate while I reviewed and approved. This is exactly what AI assistance is good for - augmenting the repetitive work so you can focus on what matters.

The Interesting Bits: README-Driven Development

Indexatron was different. This was an experiment - I needed to understand every piece, make deliberate choices, and document as I went. For this, I used README-driven development:

Write the README first - Document what the code should do before writing it
One branch per milestone - Each branch proves one thing works
Merge only when it works - No moving on until the milestone is complete
AI for documentation - Let agents help write up the results

README-driven development forces you to think through the design before coding. It's slower, but you end up with working code and documentation. Perfect for experiments where you need to prove something works.

Development Progress

Indexatron (Python) - The Experiment

README-driven development with one branch per milestone:

PR	Milestone	What It Proved
#5	Project Setup	Foundation ready
#1	Ollama Connection	Local LLM runtime accessible
#2	Image Analysis	LLaVA extracts useful metadata
#3	Embeddings	768-dim vectors for similarity
#4	Batch Processing	Scalable to many images

Each branch had to work before moving on. Prove it, merge it, move on.

Rails App - The Integration (Agent-Assisted)

While I focused on Indexatron, AI agents handled the Rails infrastructure:

PR	Feature
#60	AI Photo Analysis API with pgvector

Standard API endpoint, database migration, pgvector setup - all the CRUD that's been done a thousand times before. The agents wrote the code, I reviewed it, tests passed, merged. That's the right division of labour: agents handle the predictable, humans handle the novel.

Technical Stack

Ollama (local runtime)
├── llava:7b (~4.7GB) - Vision analysis
└── nomic-embed-text (~274MB) - Embeddings

Python 3.11+
├── ollama - API client
├── pydantic - Data validation
├── pillow - Image handling
└── rich - Console output

Next Steps

This proves the concept works. Future integration:

Rails API - Add endpoint for on-demand analysis
Database Storage - Save embeddings in PostgreSQL (pgvector)
Similarity Search - Find "photos like this one"
Face Recognition - Cluster photos by person (future model)

Conclusion

🤖 The robots can see our photos.

Local LLMs provide a privacy-preserving alternative to cloud APIs for photo analysis. The quality is good enough for family photo organisation, and the 768-dimensional embeddings enable future similarity search features.

Code

Repository	Description
swmcc/indexatron	Python service for local LLM photo analysis
swmcc/the-mcculloughs.org	Rails family photo sharing app

Full experiment results: RESULTS.md

Built with Ollama, LLaVA, and a healthy scepticism of cloud APIs.

DEV Community