PRASAD TILLOO

Posted on May 22

L.E.N.S. — A private photography coach for blind and low-vision artisans

#a11y #devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Build With Gemma 4 Submission

L.E.N.S. (Local Edge Native Studio) is a voice-guided photography coach that runs Gemma 4 E4B locally through Ollama — so a maker can verify and improve product photos before listing, without sending images to the cloud and without asking someone sighted to “just check this one.”

Gemma 4’s native multimodal vision is the engine: each coaching turn sends a real product photo (base64-encoded in the Ollama chat request) and gets back structured JSON that the app validates before speaking.

🔗 Try it (no install): lens-app-gemma4.vercel.app

📹 Demo video: YouTube walkthrough

💻 Source: github.com/prasadt1/photography-coach-gemma4 (Apache 2.0)

What I Built

I built L.E.N.S. for someone like Mohan — a low-vision artisan who hand-knits sweaters to sell online. He can judge the knit by touch: tension, pattern, finish. What he cannot reliably judge is the photograph of the piece. Is it in focus? Is the light flat? Is the sweater cropped awkwardly or lost against the background? On a marketplace like Etsy, the photo is the product; a weak photo quietly costs the sale. Until now, that step has meant borrowing someone else’s eyes.

L.E.N.S. closes that gap.

The maker points their camera and takes a photo.
Gemma 4 E4B — on their own machine, via Ollama — assesses framing, lighting, focus, and composition from the image itself (multimodal input, not a text-only description).
L.E.N.S. speaks back one specific, actionable fix: not “this photo is bad,” but “move back about six inches” or “the light is behind the sweater — turn toward the window.”
They take a second photo; L.E.N.S. compares the two images out loud and says which is stronger and why.
It drafts copy-ready listing text — title, description, and alt-text — ready to paste into their store.

It is voice-first by design, not a visual UI with audio bolted on. I built and tested the flow with a screen reader on and the screen off, because that is how it will actually be used. Structured JSON is an accessibility choice too: the client validates a strict schema and surfaces discrete, ordered points, so coaching stays one fix at a time instead of a wall of feedback the user cannot skim.

I designed for the hardest case — a blind maker, fully offline — and by the curb-cut effect, the same coaching helps any maker without a photographer or a reliable connection.

Alt: infographic of five steps — artisan capture, on-device analysis, voice feedback loop, compare and iterate, then listing copy for Etsy or Shopify.

Demo

Full walkthrough: first photo, spoken coaching, stronger retake, comparison, generated listing.

Try it live

Link	What you get
lens-app-gemma4.vercel.app	Judge / no-install demo. Sample photos play back real E4B runs recorded locally; uploads use Gemma 4 31B on Ollama Cloud so reviewers can try a photo without pulling a model.
photography-coach-gemma4.vercel.app	Real product path, as shown in the submission video: E4B on your Mac via Ollama, reached from the PWA over the same Wi‑Fi or through a tunnel. Photos do not go to Ollama Cloud on this deploy.

No account. No tracking. Copy-ready output only — L.E.N.S. does not auto-publish to Etsy or Shopify.

Code

Source, README, architecture notes, and spike write-ups:

prasadt1 / photography-coach-gemma4

📷 L.E.N.S. — Local Edge Native Studio

The one step between a finished piece and a sale shouldn't depend on someone else's eyes.

A private, voice-guided photography coach for blind and low-vision artisans.

🔗 Live demos: Judge try-it (Ollama Cloud 31B) · Real product / video (local E4B) · Demo video · Built for the Gemma 4 Good Hackathon

Tracks: Digital Equity & Inclusivity · Ollama

What L.E.N.S. is

Mohan has low vision. He hand-knits sweaters and can finish a flawless cable pattern by touch. He can shape, price, and list a piece on his own — until the one step he cannot finish alone: photographing it well enough to sell online.

L.E.N.S. closes that gap. It is a voice-guided photography coach that helps blind and low-vision artisans verify and improve their product photos before listing their work. It runs Gemma 4 through Ollama, describes the photo in plain…


Stack: React 19 + TypeScript PWA, optional Electron desktop build, Ollama for local multimodal inference, Web Speech API for coach voice output.

Repo highlights:

Strict JSON contract — one schema drives description, colour check, single fix, alt-text, and listing copy.
Three honestly labelled inference modes (see below).
Spike docs: Spike 1 — E4B via Ollama, quantization study, LiteRT iOS spike.

This is original work I built for accessibility-first product photography coaching; the repo is not a repackaged template.

How I Used Gemma 4

Gemma 4 is the core of L.E.N.S.: multimodal photo assessment and coaching generation. Every model and runtime choice followed from local-first privacy and voice-loop latency.

Why Gemma 4 E4B (and what I ruled out)

The Gemma 4 family spans small edge models, 31B Dense, and 26B MoE. For this project:

Variant	Role in my decision
E2B (~2B)	Too small for consistent visual judgment on real product photos.
E4B (~4B)	Shipped. Small enough for consumer hardware + Ollama offline; capable enough for trustworthy multimodal coaching.
31B Dense	Ruled out for the product: too heavy for typical laptops, so it would have to run remotely, which breaks the “photo never leaves the machine” promise. Used only for judge demo uploads on Ollama Cloud.
26B MoE	Strong for throughput/reasoning, but overkill for a single-photo voice loop on modest hardware; E4B matched the edge + multimodal product path better.

E4B is the deliberate middle: the trade-off is the project.

What E4B unlocked for this project

Multimodal vision on-device — real product photos in, structured coaching out (framing, light, focus, colour), not text-only guesses.
Offline independence — the product path never requires sending photos to a remote API.
Usable voice-loop latency: a ~4B model, Q4_K_M quantization, and streaming TTS bring warm runs to about 20s (down from ~40s early on).
Strict JSON coaching — one spatial fix, two-photo compare, listing copy — all from schemas Ollama enforces at generation time.
Honest dual deploy — E4B for the real maker story; 31B only where judges need a zero-install upload path.
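The streaming in the latency point above depends on reading the response incrementally. With stream: true, Ollama's /api/chat returns newline-delimited JSON objects, each carrying a text fragment in message.content, with done set to true on the last one. Below is a minimal sketch of accumulating those fragments. The createStreamAccumulator helper is hypothetical and not taken from the repo, and it assumes each network chunk has already been decoded to a string.

```typescript
// Shape of one streamed line from /api/chat; only the fields used here.
interface ChatChunk {
  message?: { content?: string };
  done?: boolean;
}

// Hypothetical helper: feed it decoded network chunks, get back new text.
function createStreamAccumulator() {
  let pending = ''; // trailing partial line carried over between chunks
  let text = '';    // everything the model has produced so far
  let done = false;

  return {
    // Returns only the text added by this chunk.
    push(chunk: string): string {
      pending += chunk;
      const lines = pending.split('\n');
      pending = lines.pop() ?? ''; // last piece may be an incomplete line
      let added = '';
      for (const line of lines) {
        if (line.trim() === '') continue;
        const parsed = JSON.parse(line) as ChatChunk;
        added += parsed.message?.content ?? '';
        if (parsed.done) done = true;
      }
      text += added;
      return added;
    },
    get text() { return text; },
    get done() { return done; },
  };
}
```

A network chunk can end in the middle of a JSON line, which is why the incomplete tail is held back until the next chunk arrives.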

Multimodal + structured output (how it’s wired)

Each analyze call sends the image in Ollama’s messages[].images[] array and asks Gemma 4 E4B for JSON via Ollama’s format field (JSON Schema). The client validates before TTS speaks:

// services/ollamaService.ts — simplified
const messages = [
  { role: 'system', content: buildSystemPrompt(/* artisan coaching */) },
  { role: 'user', content: userPrompt, images: [base64ProductPhoto] },
];

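const response = await fetch(`${OLLAMA_BASE}/api/chat`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'gemma4:e4b',
    messages,
    format: ARTISAN_V3_OUTPUT_SCHEMA,        // JSON Schema; Ollama constrains output to this shape
    stream: true,                            // chunks arrive as generated, so TTS can start early
    options: { num_predict: cappedTokens },  // cap output tokens to bound latency
    keep_alive: '30m',                       // keep the model loaded between coaching turns
  }),
});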

The artisan schema drives fields like scene description, one priorityFix, alt-text, and listing title/description — so VoiceOver/TalkBack and the coach voice never drown the maker in a paragraph of fixes.
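The validate-before-speaking step could look roughly like the sketch below. The field names (sceneDescription, priorityFix, altText, listingTitle, listingDescription) are a hypothetical subset chosen for illustration; the real ARTISAN_V3_OUTPUT_SCHEMA in the repo may name and shape them differently.

```typescript
// Hypothetical subset of the coaching payload. Field names are illustrative
// and are not copied from ARTISAN_V3_OUTPUT_SCHEMA.
interface CoachingResult {
  sceneDescription: string;
  priorityFix: string;
  altText: string;
  listingTitle: string;
  listingDescription: string;
}

const REQUIRED_FIELDS: (keyof CoachingResult)[] = [
  'sceneDescription',
  'priorityFix',
  'altText',
  'listingTitle',
  'listingDescription',
];

// Returns a typed result, or null when the output should not be spoken.
function parseCoachingResult(raw: string): CoachingResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // truncated or non-JSON output
  }
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    return null;
  }
  const record = data as Record<string, unknown>;
  const result = {} as CoachingResult;
  for (const field of REQUIRED_FIELDS) {
    const value = record[field];
    // Every field must be a non-empty string; anything else fails closed.
    if (typeof value !== 'string' || value.trim() === '') return null;
    result[field] = value.trim();
  }
  return result;
}
```

Failing closed means a truncated response (for example, one cut off by the token cap) never reaches the voice output. When the function returns null, the app can ask for a retry instead of reading out partial JSON.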

Runtime: Ollama

I spiked Cactus and llama.cpp as well. Ollama won for the cleanest local multimodal serving and the simplest path to multiple inference modes without rebuilding the pipeline each time.

Quantization: Q4_K_M

On modest hardware, Q4_K_M keeps E4B runnable without meaningfully hurting visual assessment. Lower-bit quants started to cost coaching quality; higher-bit ones were not worth the extra memory for this use case.

Latency and voice

Early warm inference was ~40s — too long for a spoken coaching loop. Prompt tuning, a token cap, a warm-up call on startup, and streaming brought warm runs to roughly 20s.
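One way to do the warm-up call is sketched below. It assumes the behaviour described in the Ollama API documentation, where a chat request with an empty messages array loads the model into memory without generating text. The buildWarmUpRequest helper is hypothetical; the app's actual warm-up may work differently, for instance by sending a tiny prompt.

```typescript
interface WarmUpRequest {
  method: 'POST';
  headers: { 'Content-Type': string };
  body: string;
}

// Hypothetical helper: builds the fetch options for a load-only request.
// keep_alive controls how long the model stays in memory afterwards.
function buildWarmUpRequest(model: string, keepAlive = '30m'): WarmUpRequest {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // Empty messages: load the model, generate nothing.
    body: JSON.stringify({ model, messages: [], keep_alive: keepAlive }),
  };
}
```

On startup, the result can be passed to fetch against the /api/chat endpoint with the model tag gemma4:e4b, so the first real coaching turn does not pay the model-load cost.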

Three honest inference modes

Mode	Model	Network
Local (product)	Gemma 4 E4B via Ollama on the maker’s machine	Fully offline
Judge demo uploads	Gemma 4 31B on Ollama Cloud	Requires connection
Demo mode	Playback of real recorded E4B responses	None
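The three modes in the table above could be modelled as a discriminated union, so the app always knows whether a photo leaves the maker's hardware and can say so out loud. This is an illustrative sketch only: the type, the function names, and the spoken wording are assumptions, not the repo's actual code.

```typescript
// Illustrative model of the three inference modes.
type InferenceMode =
  | { kind: 'local'; model: string; baseUrl: string }
  | { kind: 'judge-upload'; model: string; baseUrl: string }
  | { kind: 'demo-playback' };

// True only when the photo is sent to a remote service.
function photoLeavesDevice(mode: InferenceMode): boolean {
  switch (mode.kind) {
    case 'local':
      return false;
    case 'judge-upload':
      return true;
    case 'demo-playback':
      return false; // recorded responses, no inference at all
    default: {
      // Compile-time check that every mode is handled.
      const unreachable: never = mode;
      return unreachable;
    }
  }
}

// Short label the coach voice can read before the first photo.
function describeMode(mode: InferenceMode): string {
  switch (mode.kind) {
    case 'local':
      return 'Local mode. Your photo stays on your own hardware.';
    case 'judge-upload':
      return 'Judge demo. Your photo is sent to a cloud model.';
    case 'demo-playback':
      return 'Demo mode. Playing back a recorded response.';
    default: {
      const unreachable: never = mode;
      return unreachable;
    }
  }
}
```

Keeping the privacy answer in one function makes the labelling hard to get wrong when a fourth mode, such as on-device iOS, is added later: the compiler flags every switch that does not handle it.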

I also spiked LiteRT for true on-device iOS inference (~25 tok/s in Google’s reference app). That is Phase 2 — documented as roadmap, not claimed as shipped. Today, iOS is covered by the installable PWA talking to Ollama on the Mac (same Wi‑Fi or tunnel).

Why local Gemma matters

Privacy here is not a bullet point — it is the mechanism of independence. A cloud coach swaps one dependency for another: instead of a sighted helper, you need connectivity, an account, and a server that receives your product photos. A capable Gemma 4 model on the maker’s own hardware is what makes “I can list this myself” real.

Accessibility (why the UX matches the model story)

Voice-first with an equivalent labelled control for every voice action.
Screen reader: landmarks, live regions, managed focus; coach TTS works alongside VoiceOver/TalkBack, not instead of it.
One fix at a time — same discipline in prompt design and UI.
Anti-hallucination — states uncertainty when the image does not support a claim.
Multilingual coaching paths in the prompt layer.

What’s next

Native on-device iOS via LiteRT (spike done; integration is post-hackathon).
More languages and tighter cold-start latency.
Deeper maker workflows (batch listing prep) — still local-first.
