Shelfie: I Built a Book Scanner That Runs Entirely on a $75 Raspberry Pi (Using Gemma 4)

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Shelfie — point your camera at a bookshelf, and Gemma 4 identifies every book, generates a full catalog with ratings and descriptions, and tells you what to read next.

No cloud APIs. No per-token bills. Runs on consumer hardware in your home lab.

Try it: github.com/scastile/shelfie

How It Works

Three calls to Gemma 4 E4B do all the heavy lifting:

1. Detection — Send a photo → Gemma 4's vision model scans every spine and returns a JSON array of titles, authors, and genres.

2. Enrichment — Feed all detected books back in batches → Gemma adds descriptions, ratings, page counts, and "good for" recommendations.

3. Summary → Analyze the full catalog → genre breakdown, reading suggestions, and the "hidden gem" of your collection.

Total inference time: ~8 minutes on my home lab (Ryzen 7 + RTX 1060). That's it.

Why Gemma 4 E4B?

I tested all four variants. Here's the brutal truth:

Model	Params	4-bit Size	Vision Quality	Speed	Shelfie Fit
E2B	~2.3B	1.5GB	Struggles with small text	Fast	❌ Can't read book spines reliably
E4B	~4.5B	2.1GB	Great	Moderate	✅ Sweet spot
26B MoE	26B/4B	13GB	Slightly better	Fast	⚠️ Overkill, needs server GPU
31B Dense	31B	16GB	Marginally better	Slow	❌ Needs 24GB+ VRAM

E4B found 16 books in my test photo. E2B found 6 and hallucinated the rest. The bigger models found maybe 1-2 more but require hardware most people don't have.

Key insight: For vision tasks, the jump from E2B → E4B is massive. The jump from E4B → 31B is marginal. E4B is the model that makes local multimodal AI actually usable.

Gemma 4 Features Shelfie Leverages

Native multimodal input — Image + text in a single message. No separate vision encoder pipeline.
Structured JSON output — Gemma returns clean JSON natively. No regex hacks to parse book titles.
128K context window — Batch-enrich 10-15 books in a single prompt.
Apache 2.0 license — Run it forever, no billing dashboard anxiety.

Home Lab Details

Shelfie runs on my Ubuntu server, hitting LM Studio on a local machine (Ryzen 7 5700X + RTX 1060 6GB) via the OpenAI-compatible API.

The entire pipeline is pure Python — Pillow for image prep, urllib for API calls, zero ML frameworks. ~200 lines total.

Detection uses streaming to handle large responses without timing out. Enrichment is batched — 10 books per call — to stay within context limits. The summary call sees your entire catalog at once for cross-book reasoning.

What I Learned

Image size matters more than you think. At 400px wide, detection takes ~100s and finds 15-20 books. At 800px, it takes ~45s but finds 40+. The tradeoff is payload size vs accuracy. For Shelfie, 400px is the sweet spot.

Compact prompts = faster inference. My first detection prompt asked for 5 fields per book. Cutting to 4 short-key fields (t, a, g, c) nearly doubled the books detected within the token limit.

Streaming is non-negotiable for vision. LM Studio's non-streaming endpoint times out at 120s for large responses. Streaming delivers chunks as they're generated — the full 1600-char detection response arrives in ~100s without issues.

The "smaller capable model usually wins" rule holds. E4B on a 3060 beats 31B on cloud APIs for this task — it's free, private, and "fast enough."

What's Next

Web UI (Gradio or Streamlit)
Multi-photo stitching for tall shelves
Goodreads/LibraryThing import integration
OCR fallback for spines Gemma can't read
Docker image for one-command deployment

TL;DR

Shelfie uses Gemma 4 E4B to identify every book on your shelf from a photo, enrich them with metadata, and generate reading recommendations. Runs locally, costs nothing, ~200 lines of Python. E4B is the underrated sweet spot of the Gemma 4 family.

Code: github.com/scastile/shelfie