This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
Shelfie — point your camera at a bookshelf, and Gemma 4 identifies every book, generates a full catalog with ratings and descriptions, and tells you what to read next.
No cloud APIs. No per-token bills. Runs on consumer hardware in your home lab.
Try it: github.com/scastile/shelfie
How It Works
Three calls to Gemma 4 E4B do all the heavy lifting:
1. Detection — Send a photo → Gemma 4's vision model scans every spine and returns a JSON array of titles, authors, and genres.
2. Enrichment — Feed all detected books back in batches → Gemma adds descriptions, ratings, page counts, and "good for" recommendations.
3. Summary → Analyze the full catalog → genre breakdown, reading suggestions, and the "hidden gem" of your collection.
Total inference time: ~8 minutes on my home lab (Ryzen 7 + RTX 1060). That's it.
Why Gemma 4 E4B?
I tested all four variants. Here's the brutal truth:
| Model | Params | 4-bit Size | Vision Quality | Speed | Shelfie Fit |
|---|---|---|---|---|---|
| E2B | ~2.3B | 1.5GB | Struggles with small text | Fast | ❌ Can't read book spines reliably |
| E4B | ~4.5B | 2.1GB | Great | Moderate | ✅ Sweet spot |
| 26B MoE | 26B/4B | 13GB | Slightly better | Fast | ⚠️ Overkill, needs server GPU |
| 31B Dense | 31B | 16GB | Marginally better | Slow | ❌ Needs 24GB+ VRAM |
E4B found 16 books in my test photo. E2B found 6 and hallucinated the rest. The bigger models found maybe 1-2 more but require hardware most people don't have.
Key insight: For vision tasks, the jump from E2B → E4B is massive. The jump from E4B → 31B is marginal. E4B is the model that makes local multimodal AI actually usable.
Gemma 4 Features Shelfie Leverages
- Native multimodal input — Image + text in a single message. No separate vision encoder pipeline.
- Structured JSON output — Gemma returns clean JSON natively. No regex hacks to parse book titles.
- 128K context window — Batch-enrich 10-15 books in a single prompt.
- Apache 2.0 license — Run it forever, no billing dashboard anxiety.
Home Lab Details
Shelfie runs on my Ubuntu server, hitting LM Studio on a local machine (Ryzen 7 5700X + RTX 1060 6GB) via the OpenAI-compatible API.
The entire pipeline is pure Python — Pillow for image prep, urllib for API calls, zero ML frameworks. ~200 lines total.
Detection uses streaming to handle large responses without timing out. Enrichment is batched — 10 books per call — to stay within context limits. The summary call sees your entire catalog at once for cross-book reasoning.
What I Learned
Image size matters more than you think. At 400px wide, detection takes ~100s and finds 15-20 books. At 800px, it takes ~45s but finds 40+. The tradeoff is payload size vs accuracy. For Shelfie, 400px is the sweet spot.
Compact prompts = faster inference. My first detection prompt asked for 5 fields per book. Cutting to 4 short-key fields (t, a, g, c) nearly doubled the books detected within the token limit.
Streaming is non-negotiable for vision. LM Studio's non-streaming endpoint times out at 120s for large responses. Streaming delivers chunks as they're generated — the full 1600-char detection response arrives in ~100s without issues.
The "smaller capable model usually wins" rule holds. E4B on a 3060 beats 31B on cloud APIs for this task — it's free, private, and "fast enough."
What's Next
- Web UI (Gradio or Streamlit)
- Multi-photo stitching for tall shelves
- Goodreads/LibraryThing import integration
- OCR fallback for spines Gemma can't read
- Docker image for one-command deployment
TL;DR
Shelfie uses Gemma 4 E4B to identify every book on your shelf from a photo, enrich them with metadata, and generate reading recommendations. Runs locally, costs nothing, ~200 lines of Python. E4B is the underrated sweet spot of the Gemma 4 family.
Top comments (0)