<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Susant Swain</title>
    <description>The latest articles on DEV Community by Susant Swain (@susantswain).</description>
    <link>https://dev.to/susantswain</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1595837%2Fe3447b66-5f4d-403e-a3d7-2b00c521483d.png</url>
      <title>DEV Community: Susant Swain</title>
      <link>https://dev.to/susantswain</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/susantswain"/>
    <language>en</language>
    <item>
      <title>PhotoLens — A Fully Offline, On-Device Photo Gallery That Gives Blind and Low-Vision Users Independent Access to Their Own Memories</title>
      <dc:creator>Susant Swain</dc:creator>
      <pubDate>Fri, 15 May 2026 11:49:41 +0000</pubDate>
      <link>https://dev.to/susantswain/photolens-a-fully-offline-on-device-photo-gallery-that-gives-blind-and-low-vision-users-4o2k</link>
      <guid>https://dev.to/susantswain/photolens-a-fully-offline-on-device-photo-gallery-that-gives-blind-and-low-vision-users-4o2k</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://storage.susantswain.com/photos/photolens.png" rel="noopener noreferrer"&gt;photolens app icon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me start with the moment that made this app inevitable.&lt;/p&gt;

&lt;p&gt;I am visually impaired. Last year, I went on a family trip to a remote, beautiful place — the kind of landscape people travel thousands of kilometres to stand in. My family was taking photographs, comparing shots, reliving moments as they happened. I had my phone. I pointed it in the direction of the excitement and pressed the button, not knowing what I was capturing.&lt;/p&gt;

&lt;p&gt;Later, I opened every AI accessibility tool I had on my phone. Every single one failed the same way: they needed the internet, and there was no internet. No bars. No WiFi. Nothing. I put the phone in my pocket and listened to the birds and the wind — the only part of that scenery I could actually access.&lt;/p&gt;

&lt;p&gt;I am a software engineer. The question that formed was not &lt;em&gt;why does this keep happening&lt;/em&gt; but &lt;em&gt;what would it actually take to fix it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PhotoLens is the answer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PhotoLens is a fully offline, privacy-first photo gallery for Android built specifically for blind and low-vision users. It uses Gemma 4 running entirely on-device via the LiteRT-LM inference framework to generate rich, natural language descriptions of photographs — with zero internet requirement, zero cloud upload, and zero compromise on privacy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem It Solves
&lt;/h3&gt;

&lt;p&gt;Most AI accessibility tools for image description share a critical, disqualifying flaw: they depend on cloud connectivity. When a user who is blind or has low vision is in a remote area, on a plane, in a location with poor signal, or simply using a limited data plan, every one of these tools silently becomes useless. The user is back to square one — dependent on asking a sighted person for help, or simply going without.&lt;/p&gt;

&lt;p&gt;This is not a minor inconvenience. For users who depend on these tools as part of their daily independence, connectivity-gating is a structural accessibility failure. And it is entirely avoidable.&lt;/p&gt;

&lt;p&gt;PhotoLens removes the dependency entirely. The AI is on the device. It always works. Wherever you are.&lt;/p&gt;

&lt;h3&gt;
  
  
  What It Does
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On-device photo description&lt;/strong&gt; — Tap any photo to get a natural language description of its subjects, composition, mood, and context, generated locally in seconds with no network connection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-generation mode&lt;/strong&gt; — Enable it in settings to have descriptions generated automatically as you browse your gallery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking Mode&lt;/strong&gt; — Expose the model's chain-of-thought reasoning before the final description is delivered, giving users transparency into &lt;em&gt;how&lt;/em&gt; a result was reached.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic structured analysis&lt;/strong&gt; — Using Gemma 4's function calling capability, the app extracts technical image quality, emotional tone, and categorical tags in a single inference pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regenerate&lt;/strong&gt; — If a description misses something, request a second pass with a single tap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full TalkBack compatibility&lt;/strong&gt; — Every screen, every element, every status update is built for screen reader navigation first. Not as an afterthought. As the primary use case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WCAG 2.1 AA design&lt;/strong&gt; — High contrast, generous touch targets, linear navigation, semantic labeling, automatic focus management to description output.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why This Matters Beyond the App
&lt;/h3&gt;

&lt;p&gt;PhotoLens demonstrates something the accessibility community needs demonstrated at scale: &lt;strong&gt;privacy and independence are not in tension&lt;/strong&gt;. Users who are blind or have low vision should not have to choose between accessing their own photos and surrendering those photos to a cloud server they cannot audit. On-device AI collapses that false choice entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;🔗 &lt;em&gt;The PhotoLens source repository is available at the link below. A direct APK download link is also provided for easy access.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/docwiser/photolens" rel="noopener noreferrer"&gt;→ GitHub: docwiser/photolens&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repository includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full Jetpack Compose Android source (Kotlin)&lt;/li&gt;
&lt;li&gt;LiteRT-LM integration layer and model loading pipeline&lt;/li&gt;
&lt;li&gt;Gemma 4 inference wrapper with Thinking Mode and structured function-call support&lt;/li&gt;
&lt;li&gt;TalkBack accessibility implementation (semantic labels, focus management, live region announcements; a minimal sketch follows this list)&lt;/li&gt;
&lt;li&gt;On-device gallery provider and image preprocessing pipeline&lt;/li&gt;
&lt;li&gt;Settings system (auto-generation toggle, Thinking Mode toggle, description verbosity)&lt;/li&gt;
&lt;/ul&gt;
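
&lt;p&gt;To make that accessibility work concrete, here is a minimal sketch of how a description output can be exposed to TalkBack in Jetpack Compose: a polite live region announces streamed text, and focus moves to the result once generation completes. The &lt;code&gt;DescriptionOutput&lt;/code&gt; name and its parameters are illustrative, not the app's exact API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import androidx.compose.foundation.focusable
import androidx.compose.material3.Text
import androidx.compose.runtime.*
import androidx.compose.ui.Modifier
import androidx.compose.ui.focus.FocusRequester
import androidx.compose.ui.focus.focusRequester
import androidx.compose.ui.semantics.LiveRegionMode
import androidx.compose.ui.semantics.liveRegion
import androidx.compose.ui.semantics.semantics

// Illustrative composable: exposes the streamed description to TalkBack.
@Composable
fun DescriptionOutput(description: String, isFinal: Boolean) {
    val focusRequester = remember { FocusRequester() }

    Text(
        text = description,
        modifier = Modifier
            .focusRequester(focusRequester)
            .focusable()
            .semantics {
                // Polite live region: TalkBack announces text changes without
                // interrupting whatever the user is currently doing.
                liveRegion = LiveRegionMode.Polite
            }
    )

    // Once generation completes, move focus to the result so the screen
    // reader lands on the description without manual navigation.
    LaunchedEffect(isFinal) {
        if (isFinal) focusRequester.requestFocus()
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A polite (rather than assertive) live region is used in the sketch so that streamed updates never interrupt the user's own TalkBack navigation.&lt;/p&gt;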

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tech stack:&lt;/strong&gt; Kotlin · Jetpack Compose · LiteRT-LM (MediaPipe LLM Inference) · Gemma 4 E2B / E4B · Coroutines + Flow&lt;br&gt;
&lt;a href="https://github.com/docwiser/photolens/releases/download/v1.2.0/photolens_v1.2.0.apk" rel="noopener noreferrer"&gt;Download APK (41.2MB)&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;APK SHA-256: c5bc5748252c7ef073d229e71e4a58328330b98e950b09076fe58af827603dd7&lt;/p&gt;

&lt;p&gt;❤️ Hosted on GitHub Releases&lt;/p&gt;




&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model Selection: E2B and E4B — and Why Nothing Else Would Do
&lt;/h3&gt;

&lt;p&gt;I chose the &lt;strong&gt;Gemma 4 E2B and E4B&lt;/strong&gt; variants. Not the 26B MoE. Not the 31B Dense. The two smallest members of the family. And I chose them for reasons that are inseparable from the entire purpose of the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This app exists specifically for the scenario where there is no internet.&lt;/strong&gt; A cloud-callable model — however powerful — is architecturally incompatible with the problem being solved. The 26B and 31B models require server infrastructure. They solve a different problem for a different deployment context. For PhotoLens, they are the wrong tool regardless of their capability ceiling.&lt;/p&gt;

&lt;p&gt;The E2B and E4B variants are designed precisely for the deployment scenario that matters here: &lt;strong&gt;on-device, on Android hardware, with no external dependency&lt;/strong&gt;. Google describes them as "built for ultra-mobile, edge, and browser deployment." That is exactly where blind and low-vision users need accessible AI to live — not in a data center they cannot reach when they are somewhere beautiful and without signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why E4B as the Primary Target
&lt;/h3&gt;

&lt;p&gt;Within the edge variants, I target &lt;strong&gt;E4B&lt;/strong&gt; as the primary inference model for most Android devices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At approximately 4.5B effective parameters, it produces noticeably richer, more contextually aware descriptions than E2B — capturing mood, relational context, and image atmosphere, not just labeling objects.&lt;/li&gt;
&lt;li&gt;It fits within the memory envelope of mid-range to flagship Android devices (4–6 GB RAM) while leaving room for the operating system and TalkBack to run without competition.&lt;/li&gt;
&lt;li&gt;Its multimodal capability is natively integrated, not bolted on. This is critical. Image understanding is not an API call to a separate vision encoder — it is part of the model's unified forward pass, which means the descriptions reflect holistic reasoning about the image, not a concatenation of extracted features.&lt;/li&gt;
&lt;li&gt;On devices with an NPU (most flagships from 2023 onward), E4B generates descriptions in 3–7 seconds — fast enough for practical real-world use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;E2B&lt;/strong&gt; is kept as a fallback for lower-spec devices (less than 4 GB RAM), where E4B's memory footprint would cause system pressure. The user experience degrades gracefully: slightly shorter descriptions, the same privacy guarantee, the same offline operation.&lt;/p&gt;
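
&lt;p&gt;As a rough illustration of how that fallback can be decided at startup, the sketch below reads total device RAM through &lt;code&gt;ActivityManager&lt;/code&gt;. The 4 GB threshold and the &lt;code&gt;ModelVariant&lt;/code&gt; names are assumptions for the example, not the exact values shipped in the app.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import android.app.ActivityManager
import android.content.Context

// Illustrative variants; the actual model files and paths are app-specific.
enum class ModelVariant { GEMMA_E4B, GEMMA_E2B }

fun selectModelVariant(context: Context): ModelVariant {
    val activityManager =
        context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memoryInfo = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(memoryInfo)

    // totalMem is reported in bytes; compare against an assumed 4 GB threshold.
    val totalRamGb = memoryInfo.totalMem / (1024.0 * 1024.0 * 1024.0)
    return if (totalRamGb &gt;= 4.0) ModelVariant.GEMMA_E4B else ModelVariant.GEMMA_E2B
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;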

&lt;h3&gt;
  
  
  What Gemma 4 Specifically Unlocked
&lt;/h3&gt;

&lt;p&gt;Three capabilities in Gemma 4 are load-bearing for what PhotoLens does. None of them existed in Gemma 3 at the edge scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Native Multimodality&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Previous Gemma generations at the edge scale were text-only. Multimodal capability meant cloud deployment. Gemma 4 E2B and E4B are natively multimodal — images and text are first-class inputs, processed together in a single unified forward pass.&lt;/p&gt;

&lt;p&gt;This is not a minor architectural detail. It is the entire reason PhotoLens can exist as an on-device app. Without native multimodality at the E2B/E4B scale, there is no path to offline photo description on a phone. You are back to sending images to a server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Thinking Mode / Chain-of-Thought Reasoning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4 can expose its reasoning chain before producing a final answer. In PhotoLens, this becomes an explicit accessibility feature called &lt;strong&gt;Thinking Mode&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a user who is blind asks for a description of a photograph, they are placing a significant degree of trust in the model's output. They often cannot independently verify the result. Thinking Mode gives them something cloud-based tools typically cannot: &lt;strong&gt;a transparent view of how the description was reached&lt;/strong&gt;. They can hear the model observe: &lt;em&gt;"I can see several people in an outdoor setting, the lighting appears to be late afternoon, there is foliage in the background suggesting a garden or park..."&lt;/em&gt; — and then make their own judgment about whether the final description reflects what they know about the photo.&lt;/p&gt;

&lt;p&gt;This turns a limitation (AI can be wrong) into a feature (you can audit the reasoning). That is meaningful, especially for an accessibility tool.&lt;/p&gt;
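
&lt;p&gt;As a sketch of how Thinking Mode can be surfaced, the snippet below splits a streamed response into a reasoning phase and a final description, assuming the model wraps its chain-of-thought in begin/end markers. The marker strings and the &lt;code&gt;DescriptionPhase&lt;/code&gt; type are hypothetical; the real delimiters depend on the Gemma 4 chat template in use.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Hypothetical markers; the real delimiters depend on the chat template.
private const val THINKING_START = "&lt;thinking&gt;"
private const val THINKING_END = "&lt;/thinking&gt;"

sealed interface DescriptionPhase {
    data class Reasoning(val text: String) : DescriptionPhase
    data class Final(val text: String) : DescriptionPhase
}

// Routes the accumulated streamed text either to the reasoning view
// (announced via a polite live region) or to the final description view.
fun classifyStream(accumulated: String): DescriptionPhase {
    val endIndex = accumulated.indexOf(THINKING_END)
    return if (endIndex == -1) {
        DescriptionPhase.Reasoning(accumulated.removePrefix(THINKING_START).trim())
    } else {
        DescriptionPhase.Final(accumulated.substring(endIndex + THINKING_END.length).trim())
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;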

&lt;p&gt;&lt;strong&gt;3. Structured Function Calling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4 supports function calling at the edge model scale. PhotoLens uses this to run what I call &lt;strong&gt;agentic structured analysis&lt;/strong&gt;: in a single inference pass, the model is prompted via a structured function call to return technical quality metrics, emotional tone, scene category, subject identification, and a narrative description — all as a typed JSON structure.&lt;/p&gt;

&lt;p&gt;This means the app can present different views of the same analysis (a brief summary for quick browsing, a detailed description for a photo the user wants to remember) without running multiple inference passes. It also means the output is predictable and parseable — no need to post-process natural language to extract structured information.&lt;/p&gt;
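
&lt;p&gt;To give a sense of what the single-pass result looks like, here is a sketch of the typed structure using kotlinx.serialization. The field names mirror the categories above but are illustrative; the exact schema registered with the function-call declaration lives in the repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json

// Illustrative schema for the single-pass structured analysis result.
@Serializable
data class PhotoAnalysis(
    val technicalQuality: String,     // e.g. "sharp, slightly underexposed"
    val emotionalTone: String,        // e.g. "joyful, relaxed"
    val sceneCategory: String,        // e.g. "outdoor / garden"
    val subjects: List&lt;String&gt;,       // e.g. ["two adults", "a dog"]
    val narrative: String             // full natural-language description
)

private val json = Json { ignoreUnknownKeys = true }

// The model's function-call arguments arrive as JSON and parse directly into
// the typed structure, so no free-text post-processing is needed.
fun parseAnalysis(jsonArguments: String): PhotoAnalysis =
    json.decodeFromString(PhotoAnalysis.serializer(), jsonArguments)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;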

&lt;h3&gt;
  
  
  The LiteRT-LM Integration
&lt;/h3&gt;

&lt;p&gt;The inference pipeline is built on &lt;strong&gt;LiteRT-LM&lt;/strong&gt; (formerly MediaPipe LLM Inference), Google's purpose-built runtime for on-device LLM execution on Android. LiteRT-LM handles GPU and NPU scheduling, memory management, and quantized model loading — all transparently to the application layer.&lt;/p&gt;

&lt;p&gt;The integration is not a thin wrapper. The app manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asynchronous model loading at startup with a progressive readiness indicator (TalkBack-announced)&lt;/li&gt;
&lt;li&gt;Streaming token generation — description text streams in as it is generated, not all at once after a wait (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Graceful thermal throttling detection — if the device overheats and the NPU clocks down, the app warns the user and adjusts inference parameters&lt;/li&gt;
&lt;li&gt;Memory-aware model selection — the app checks available RAM at startup and loads E4B or falls back to E2B accordingly&lt;/li&gt;
&lt;/ul&gt;
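
&lt;p&gt;The streaming path, sketched below, follows the shape of the MediaPipe LLM Inference API (&lt;code&gt;com.google.mediapipe.tasks.genai.llminference&lt;/code&gt;); exact class and option names in the LiteRT-LM packaging may differ, and the model path is a placeholder resolved by the app at runtime.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch of the streaming path. Option names follow the MediaPipe LLM
// Inference API and may differ in the LiteRT-LM packaging.
fun buildStreamingEngine(
    context: Context,
    modelPath: String,                  // placeholder path to the model bundle
    onPartial: (String) -&gt; Unit,        // drives the TalkBack live region as text streams in
    onDone: (String) -&gt; Unit            // receives the complete description
): LlmInference {
    val accumulated = StringBuilder()
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath(modelPath)
        .setMaxTokens(512)
        .setResultListener { partialResult, done -&gt;
            accumulated.append(partialResult)
            onPartial(accumulated.toString())
            if (done) onDone(accumulated.toString())
        }
        .build()
    return LlmInference.createFromOptions(context, options)
}

// Usage: buildStreamingEngine(context, path, ::showPartial, ::announceDone)
//            .generateResponseAsync(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;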

&lt;h3&gt;
  
  
  The Architecture in One View
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[User opens photo]
        ↓
[Image read from local storage]
        ↓
[Preprocessing: resize, normalize → model input format]
        ↓
[Gemma 4 E4B / E2B — running on device GPU/NPU via LiteRT-LM]
        ↓
[Structured function call → JSON: quality, tone, category, subjects, narrative]
        ↓
[Thinking Mode stream (optional) → TalkBack live region announcement]
        ↓
[Final description displayed + read aloud]
        ↓
[Focus moved automatically to description output — TalkBack navigates to result]

No network call occurs at any step.
No data leaves the device at any step.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
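
&lt;p&gt;As a small illustration of the preprocessing step in the diagram, here is a sketch of the downscaling pass. The 768 px target edge is an assumption for the example; the real input resolution is dictated by the model's vision encoder.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import android.graphics.Bitmap

// Downscale so the longest edge fits the assumed model input size, preserving
// aspect ratio. The 768 px default is illustrative only.
fun preprocessForModel(source: Bitmap, targetLongEdge: Int = 768): Bitmap {
    val longEdge = maxOf(source.width, source.height)
    if (longEdge &lt;= targetLongEdge) return source
    val scale = targetLongEdge.toFloat() / longEdge
    val scaledWidth = (source.width * scale).toInt()
    val scaledHeight = (source.height * scale).toInt()
    // Bilinear filtering (filter = true) gives a smoother downscale than nearest-neighbour.
    return Bitmap.createScaledBitmap(source, scaledWidth, scaledHeight, true)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;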



&lt;h3&gt;
  
  
  The Intersection of Model Choice and Mission
&lt;/h3&gt;

&lt;p&gt;I want to be direct about something that I think is easy to miss in a technical submission.&lt;/p&gt;

&lt;p&gt;The choice of E2B/E4B is not a capability compromise. It is an ethical position.&lt;/p&gt;

&lt;p&gt;The users PhotoLens is built for are often in the exact situations where cloud AI fails: remote locations, limited data plans, older devices, unstable connectivity. Choosing a server-dependent model would mean building an accessibility tool that is least accessible precisely when accessibility matters most. That is a contradiction I am not willing to ship.&lt;/p&gt;

&lt;p&gt;Gemma 4 at the edge scale — with native multimodality, on-device reasoning, and structured function calling — makes it possible to build something that works for these users fully, not partially. Not "when you have signal." Always.&lt;/p&gt;

&lt;p&gt;That is what intentional model selection looks like when the use case is not a developer convenience but a real independence requirement for real people.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by Susant Swain — independent developer, accessibility engineer, and visually impaired person who took a family trip to a remote area, opened every AI tool on his phone, and watched all of them fail.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Bhubaneswar, Odisha, India · &lt;a href="mailto:info@susantswain.com"&gt;info@susantswain.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemmachallenge</category>
      <category>devchallenge</category>
      <category>a11y</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
