Susant Swain

Posted on May 15

PhotoLens — A Fully Offline, On-Device Photo Gallery That Gives Blind and Low-Vision Users Independent Access to Their Own Memories

#gemmachallenge #devchallenge #a11y #opensource

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

[Image: PhotoLens app icon]

Let me start with the moment that made this app inevitable.

I am visually impaired. Last year, I went on a family trip to a remote, beautiful place — the kind of landscape people travel thousands of kilometres to stand in. My family was taking photographs, comparing shots, reliving moments as they happened. I had my phone. I pointed it in the direction of the excitement and pressed the button, not knowing what I was capturing.

Later, I opened every AI accessibility tool I had on my phone. Every single one failed the same way: they needed the internet, and there was no internet. No bars. No WiFi. Nothing. I put the phone in my pocket and listened to the birds and the wind — the only part of that scenery I could actually access.

I am a software engineer. The question that formed was not "why does this keep happening?" but "what would it actually take to fix it?"

PhotoLens is the answer.

PhotoLens is a fully offline, privacy-first photo gallery for Android built specifically for blind and low-vision users. It uses Gemma 4 running entirely on-device via the LiteRT-LM inference framework to generate rich, natural language descriptions of photographs — with zero internet requirement, zero cloud upload, and zero compromise on privacy.

The Problem It Solves

Most AI accessibility tools for image description share a critical, disqualifying flaw: they depend on cloud connectivity. When a user who is blind or has low vision is in a remote area, on a plane, in a location with poor signal, or simply using a limited data plan, every one of these tools silently becomes useless. The user is back to square one — dependent on asking a sighted person for help, or simply going without.

This is not a minor inconvenience. For users who depend on these tools as part of their daily independence, connectivity-gating is a structural accessibility failure. And it is entirely avoidable.

PhotoLens removes the dependency entirely. The AI is on the device. It always works. Wherever you are.

What It Does

On-device photo description — Tap any photo to get a natural language description of its subjects, composition, mood, and context, generated locally in seconds with no network connection.
Auto-generation mode — Enable in settings to have descriptions generated automatically as you browse your gallery.
Thinking Mode — Expose the model's chain-of-thought reasoning before the final description is delivered, giving users transparency into how a result was reached.
Agentic structured analysis — Using Gemma 4's function calling capability, the app extracts technical image quality, emotional tone, and categorical tags in a single inference pass.
Regenerate — If a description misses something, request a second pass with a single tap.
Full TalkBack compatibility — Every screen, every element, every status update is built for screen reader navigation first. Not as an afterthought. As the primary use case.
WCAG 2.1 AA design — High contrast, generous touch targets, linear navigation, semantic labeling, automatic focus management to description output.
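
For readers who want to check the "high contrast" item against the standard, here is a minimal sketch of the WCAG 2.x contrast-ratio calculation. It is not taken from the PhotoLens repository; the `Rgb` type and the function names are hypothetical and exist only for this illustration. The formula and the 4.5:1 threshold for normal-size text at level AA come from the WCAG definition of contrast ratio.

```kotlin
import kotlin.math.pow

// Hypothetical helper type for this sketch: one sRGB color, channels 0..255.
data class Rgb(val r: Int, val g: Int, val b: Int)

// Linearize a single sRGB channel as described in the WCAG 2.x
// relative luminance definition.
fun linearize(channel: Int): Double {
    val c = channel / 255.0
    return if (c <= 0.03928) c / 12.92 else ((c + 0.055) / 1.055).pow(2.4)
}

// Relative luminance: 0.0 for black, 1.0 for white.
fun relativeLuminance(color: Rgb): Double =
    0.2126 * linearize(color.r) +
        0.7152 * linearize(color.g) +
        0.0722 * linearize(color.b)

// Contrast ratio between two colors, from 1.0 (identical) up to 21.0.
fun contrastRatio(a: Rgb, b: Rgb): Double {
    val la = relativeLuminance(a)
    val lb = relativeLuminance(b)
    val lighter = maxOf(la, lb)
    val darker = minOf(la, lb)
    return (lighter + 0.05) / (darker + 0.05)
}

// WCAG level AA asks for at least 4.5:1 for normal-size text.
fun meetsAaForNormalText(foreground: Rgb, background: Rgb): Boolean =
    contrastRatio(foreground, background) >= 4.5

fun main() {
    val white = Rgb(255, 255, 255)
    val gray = Rgb(0x76, 0x76, 0x76)
    println("Contrast: %.2f".format(contrastRatio(gray, white)))
    println("Meets AA for normal text: ${meetsAaForNormalText(gray, white)}")
}
```

A check like this can run in a unit test over the app's color palette, so a theme change that drops below AA is caught before release.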

Why This Matters Beyond the App

PhotoLens demonstrates something the accessibility community needs demonstrated at scale: privacy and independence are not in tension. Users who are blind or have low vision should not have to choose between accessing their own photos and surrendering those photos to a cloud server they cannot audit. On-device AI collapses that false choice entirely.

Code

🔗 The PhotoLens source repository is linked below, along with a direct APK download for easy access.

→ GitHub: docwiser/photolens

The repository includes:

Full Jetpack Compose Android source (Kotlin)
LiteRT-LM integration layer and model loading pipeline
Gemma 4 inference wrapper with Thinking Mode and structured function-call support
TalkBack accessibility implementation (semantic labels, focus management, live region announcements)
On-device gallery provider and image preprocessing pipeline
Settings system (auto-generation toggle, Thinking Mode toggle, description verbosity)

Tech stack: Kotlin · Jetpack Compose · LiteRT-LM (MediaPipe LLM Inference) · Gemma 4 E2B / E4B · Coroutines + Flow
Download APK (41.2 MB)

sha256:c5bc5748252c7ef073d229e71e4a58328330b98e950b09076fe58af827603dd7

❤️ Hosted on GitHub Releases

How I Used Gemma 4

Model Selection: E2B and E4B — and Why Nothing Else Would Do

I chose the Gemma 4 E2B and E4B variants. Not the 26B MoE. Not the 31B Dense. The two smallest members of the family. And I chose them for reasons that are inseparable from the entire purpose of the project.

This app exists specifically for the scenario where there is no internet. A cloud-callable model — however powerful — is architecturally incompatible with the problem being solved. The 26B and 31B models require server infrastructure. They solve a different problem for a different deployment context. For PhotoLens, they are the wrong tool regardless of their capability ceiling.

The E2B and E4B variants are designed precisely for the deployment scenario that matters here: on-device, on Android hardware, with no external dependency. Google describes them as "built for ultra-mobile, edge, and browser deployment." That is exactly where blind and low-vision users need accessible AI to live — not in a data center they cannot reach when they are somewhere beautiful and without signal.

Why E4B as the Primary Target

Within the edge variants, I target E4B as the primary inference model for most Android devices:

At approximately 4.5B effective parameters, it produces noticeably richer, more contextually aware descriptions than E2B — capturing mood, relational context, and image atmosphere, not just labeling objects.
It fits within the memory envelope of mid-range to flagship Android devices (4–6 GB RAM) while leaving room for the operating system and TalkBack to run without competition.
Its multimodal capability is natively integrated, not bolted on. This is critical. Image understanding is not an API call to a separate vision encoder — it is part of the model's unified forward pass, which means the descriptions reflect holistic reasoning about the image, not a concatenation of extracted features.
On devices with an NPU (most flagships from 2023 onward), E4B generates descriptions in 3–7 seconds — fast enough for practical real-world use.

E2B is kept as a fallback for lower-spec devices (less than 4 GB RAM), where E4B's memory footprint would cause system pressure. The user experience degrades gracefully: slightly shorter descriptions, the same privacy guarantee, the same offline operation.
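
The fallback rule can be sketched as a small pure function. This is an illustration under stated assumptions, not the repository's code: `ModelVariant`, `selectVariant`, and the threshold constant are hypothetical names, and the 4 GB cut-off is simply the figure quoted above. On Android the input would plausibly come from `ActivityManager.MemoryInfo.totalMem`, which usually reports somewhat less than the marketed RAM size, so a real threshold would likely need some headroom below 4 GB.

```kotlin
// Hypothetical names for this sketch; the repository may structure this differently.
enum class ModelVariant { E4B, E2B }

// Assumed cut-off, taken from the "less than 4 GB RAM" rule in the text.
const val E4B_MIN_RAM_BYTES: Long = 4L * 1024 * 1024 * 1024

// Pick the richer model when the device has enough memory,
// otherwise fall back to the smaller one.
fun selectVariant(totalRamBytes: Long): ModelVariant =
    if (totalRamBytes >= E4B_MIN_RAM_BYTES) ModelVariant.E4B else ModelVariant.E2B

fun main() {
    val threeGb = 3L * 1024 * 1024 * 1024
    val sixGb = 6L * 1024 * 1024 * 1024
    println("3 GB device -> ${selectVariant(threeGb)}")
    println("6 GB device -> ${selectVariant(sixGb)}")
}
```

Keeping the decision in a pure function makes it easy to unit test without a device, and leaves the platform memory query at the edge of the app.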

What Gemma 4 Specifically Unlocked

Three capabilities in Gemma 4 are load-bearing for what PhotoLens does. None of them existed in Gemma 3 at the edge scale.

1. Native Multimodality

The smallest Gemma 3 model (1B) was text-only, so image understanding at that size meant moving to a larger model or to a cloud deployment. Gemma 4 E2B and E4B are natively multimodal: images and text are first-class inputs, processed together in a single unified forward pass.

This is not a minor architectural detail. It is the entire reason PhotoLens can exist as an on-device app. Without native multimodality at the E2B/E4B scale, there is no path to offline photo description on a phone. You are back to sending images to a server.

2. Thinking Mode / Chain-of-Thought Reasoning

Gemma 4 can expose its reasoning chain before producing a final answer. In PhotoLens, this becomes an explicit accessibility feature called Thinking Mode.

When a user who is blind asks for a description of a photograph, they are placing a significant degree of trust in the model's output. They often cannot independently verify the result. Thinking Mode gives them something cloud-based tools typically cannot: a transparent view of how the description was reached. They can hear the model observe: "I can see several people in an outdoor setting, the lighting appears to be late afternoon, there is foliage in the background suggesting a garden or park..." — and then make their own judgment about whether the final description reflects what they know about the photo.

This turns a limitation (AI can be wrong) into a feature (you can audit the reasoning). That is meaningful, especially for an accessibility tool.

3. Structured Function Calling

Gemma 4 supports function calling at the edge model scale. PhotoLens uses this to run what I call agentic structured analysis: in a single inference pass, a structured function call prompts the model to return technical quality metrics, emotional tone, scene category, subject identification, and a narrative description, all as one typed JSON structure.

This means the app can present different views of the same analysis (a brief summary for quick browsing, a detailed description for a photo the user wants to remember) without running multiple inference passes. It also means the output is predictable and parseable — no need to post-process natural language to extract structured information.
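
To make the "different views of the same analysis" idea concrete, here is a minimal sketch. It assumes the JSON has already been parsed into a typed object; the `PhotoAnalysis` class, its field names, the 0 to 100 quality scale, and the sample values are all hypothetical and chosen only to mirror the five fields listed above.

```kotlin
// Hypothetical typed result of one structured inference pass.
// Field names and the 0..100 quality scale are assumptions for this sketch.
data class PhotoAnalysis(
    val qualityScore: Int,
    val tone: String,
    val category: String,
    val subjects: List<String>,
    val narrative: String,
)

// Short view for quick gallery browsing.
fun PhotoAnalysis.briefSummary(): String {
    val who = if (subjects.isEmpty()) "no identified subjects" else subjects.joinToString(", ")
    return "$category photo, $tone tone, showing $who."
}

// Longer view for a photo the user wants to explore in detail.
fun PhotoAnalysis.detailedDescription(): String =
    "$narrative Category: $category. Tone: $tone. " +
        "Technical quality: $qualityScore out of 100."

fun main() {
    // Made-up sample data, not real model output.
    val analysis = PhotoAnalysis(
        qualityScore = 82,
        tone = "warm",
        category = "Family",
        subjects = listOf("three people", "a dog"),
        narrative = "Three people sit on a bench with a dog at sunset.",
    )
    println(analysis.briefSummary())
    println(analysis.detailedDescription())
}
```

Both views read from the same object, so switching between them costs a string format rather than another inference pass.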

The LiteRT-LM Integration

The inference pipeline is built on LiteRT-LM (formerly MediaPipe LLM Inference), Google's purpose-built runtime for on-device LLM execution on Android. LiteRT-LM handles GPU and NPU scheduling, memory management, and quantized model loading — all transparently to the application layer.

The integration is not a thin wrapper. The app manages:

Asynchronous model loading at startup with a progressive readiness indicator (TalkBack-announced)
Streaming token generation — description text streams in as it is generated, not all at once after a wait
Graceful thermal throttling detection — if the device overheats and the NPU clocks down, the app warns the user and adjusts inference parameters
Memory-aware model selection — the app checks available RAM at startup and loads E4B or falls back to E2B accordingly
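
The thermal item in the list above could look something like the following sketch. Everything here is an assumption made for illustration: the `ThermalLevel` enum, the `InferenceParams` fields, the token caps, and the warning strings are hypothetical, and the source does not say which parameters the app actually adjusts. On Android, a level like this could be derived from the platform's thermal status reporting.

```kotlin
// Hypothetical, simplified thermal levels for this sketch.
enum class ThermalLevel { NORMAL, WARM, HOT }

// Hypothetical inference settings; the real app may tune different knobs.
data class InferenceParams(val maxOutputTokens: Int, val thinkingEnabled: Boolean)

// Result of the decision: settings to use, plus an optional spoken warning.
data class ThermalDecision(val params: InferenceParams, val warning: String?)

// Reduce the workload as the device heats up, and tell the user why
// descriptions may become shorter. Caps of 256 and 128 are placeholders.
fun adjustForThermal(level: ThermalLevel, requested: InferenceParams): ThermalDecision =
    when (level) {
        ThermalLevel.NORMAL -> ThermalDecision(requested, warning = null)
        ThermalLevel.WARM -> ThermalDecision(
            requested.copy(maxOutputTokens = minOf(requested.maxOutputTokens, 256)),
            warning = "Device is warm. Descriptions may be shorter.",
        )
        ThermalLevel.HOT -> ThermalDecision(
            InferenceParams(
                maxOutputTokens = minOf(requested.maxOutputTokens, 128),
                thinkingEnabled = false,
            ),
            warning = "Device is hot. Thinking Mode is paused and descriptions will be brief.",
        )
    }

fun main() {
    val requested = InferenceParams(maxOutputTokens = 512, thinkingEnabled = true)
    for (level in ThermalLevel.values()) {
        println("$level -> ${adjustForThermal(level, requested)}")
    }
}
```

Returning the warning as data keeps the decision testable; the UI layer can then route the message to a TalkBack live region announcement.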

The Architecture in One View

[User opens photo]
        ↓
[Image read from local storage]
        ↓
[Preprocessing: resize, normalize → model input format]
        ↓
[Gemma 4 E4B / E2B — running on device GPU/NPU via LiteRT-LM]
        ↓
[Structured function call → JSON: quality, tone, category, subjects, narrative]
        ↓
[Thinking Mode stream (optional) → TalkBack live region announcement]
        ↓
[Final description displayed + read aloud]
        ↓
[Focus moved automatically to description output — TalkBack navigates to result]

No network call occurs at any step.
No data leaves the device at any step.
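
The preprocessing step in the diagram ("resize, normalize") can be sketched in a few lines. This is an illustration only: the `Size` type, the function names, and the choice of scaling channels to the 0 to 1 range are assumptions, and the target resolution a given model build expects is not stated in this post, so `maxSide` is left as a caller-supplied placeholder.

```kotlin
import kotlin.math.roundToInt

// Hypothetical helper type for this sketch.
data class Size(val width: Int, val height: Int)

// Scale an image down so its longest side is at most maxSide,
// preserving aspect ratio. Images that already fit are left unchanged.
fun fitWithin(source: Size, maxSide: Int): Size {
    require(source.width > 0 && source.height > 0) { "Image dimensions must be positive" }
    require(maxSide > 0) { "maxSide must be positive" }
    val longest = maxOf(source.width, source.height)
    if (longest <= maxSide) return source
    val scale = maxSide.toDouble() / longest
    return Size(
        width = maxOf(1, (source.width * scale).roundToInt()),
        height = maxOf(1, (source.height * scale).roundToInt()),
    )
}

// Map one 8-bit channel value to the 0.0..1.0 range (an assumed convention).
fun normalizeChannel(value: Int): Float = value.coerceIn(0, 255) / 255f

fun main() {
    val photo = Size(4000, 3000)
    println("Resized: ${fitWithin(photo, maxSide = 1000)}")
    println("Normalized mid-gray channel: ${normalizeChannel(128)}")
}
```

All of this runs on local pixel data, which is consistent with the two guarantees stated above.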

The Intersection of Model Choice and Mission

I want to be direct about something that I think is easy to miss in a technical submission.

The choice of E2B/E4B is not a capability compromise. It is an ethical position.

The users PhotoLens is built for are often in the exact situations where cloud AI fails: remote locations, limited data plans, older devices, unstable connectivity. Choosing a server-dependent model would mean building an accessibility tool that is least accessible precisely when accessibility matters most. That is a contradiction I am not willing to ship.

Gemma 4 at the edge scale — with native multimodality, on-device reasoning, and structured function calling — makes it possible to build something that works for these users fully, not partially. Not "when you have signal." Always.

That is what intentional model selection looks like when the use case is not a developer convenience but a real independence requirement for real people.

Built by Susant Swain — independent developer, accessibility engineer, and visually impaired person who took a family trip to a remote area, opened every AI tool on his phone, and watched all of them fail.

Bhubaneswar, Odisha, India · info@susantswain.com
