<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dan Parii</title>
    <description>The latest articles on DEV Community by Dan Parii (@danparii).</description>
    <link>https://dev.to/danparii</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3921682%2Fd82548e0-7594-4185-a06d-df6edde1f42b.png</url>
      <title>DEV Community: Dan Parii</title>
      <link>https://dev.to/danparii</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danparii"/>
    <language>en</language>
    <item>
      <title>Gemma Guide - Real-Time Spatial Awareness for Blind Users</title>
      <dc:creator>Dan Parii</dc:creator>
      <pubDate>Sat, 23 May 2026 12:58:04 +0000</pubDate>
      <link>https://dev.to/danparii/gemma-guide-real-time-spatial-awareness-for-blind-users-2449</link>
      <guid>https://dev.to/danparii/gemma-guide-real-time-spatial-awareness-for-blind-users-2449</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;For a blind user, the important question is not just &lt;em&gt;what&lt;/em&gt; is in front of me, but &lt;em&gt;how far away it is&lt;/em&gt; and &lt;em&gt;how I should move safely&lt;/em&gt;. That gap between scene description and grounded spatial awareness is what Gemma Guide is built to close.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;Gemma Guide combines Gemma 4 with TIPSv2 into a routed multi-agent pipeline. The user speaks a question, the system interprets it visually and acoustically, and returns grounded guidance: not just &lt;em&gt;"there is a chair in front of you"&lt;/em&gt; but &lt;em&gt;"the chair is 1.4 meters ahead, slightly to your left."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The diagram below shows the flow end to end. A &lt;strong&gt;Scout&lt;/strong&gt; agent first decides whether the question needs spatial analysis at all. If it does, a &lt;strong&gt;Mapper&lt;/strong&gt; agent localizes the relevant objects and calls the TIPSv2 tool stack to measure distance and bearing for each one. A &lt;strong&gt;Navigator&lt;/strong&gt; agent then reasons over those grounded measurements and produces the spoken response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey9agip2bzzfe7ivd0w8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey9agip2bzzfe7ivd0w8.png" alt="Architecture Overview" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Spatial Grounding Layer
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://gdm-tipsv2.github.io/" rel="noopener noreferrer"&gt;TIPSv2&lt;/a&gt; (Google DeepMind) provides three capabilities the spatial grounding layer depends on: metric depth, semantic segmentation, and open-vocabulary matching:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Class heads (segmentation + metric depth):&lt;/strong&gt; Dense prediction transformer (DPT) heads produce per-pixel semantic segmentation and metric depth across 150 common object classes. When Gemma localizes a known object, the system intersects that region with the segmentation mask so depth is measured over the right pixels, not a coarse bounding box.&lt;/p&gt;
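&lt;p&gt;To make the mask intersection concrete, here is a minimal sketch under stated assumptions: the depth map and segmentation map are plain nested lists of the same size, the localized region is an axis-aligned box, and the distance is taken as the median of the surviving pixels. The function name &lt;code&gt;masked_depth_m&lt;/code&gt; and the choice of the median are illustrative, not taken from the project.&lt;/p&gt;

```python
from statistics import median
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def masked_depth_m(
    depth: List[List[float]],
    seg: List[List[int]],
    box: Box,
    class_id: int,
) -> Optional[float]:
    """Median metric depth over pixels inside `box` that belong to `class_id`.

    Returns None when the box and the class mask do not overlap.
    """
    x0, y0, x1, y1 = box
    samples = [
        depth[y][x]
        for y in range(y0, y1)
        for x in range(x0, x1)
        if seg[y][x] == class_id  # keep only pixels on the object itself
    ]
    return median(samples) if samples else None
```

&lt;p&gt;Background pixels inside the box (a far wall behind a chair, for instance) are excluded before the statistic is taken, which is the point of intersecting with the mask.&lt;/p&gt;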

&lt;p&gt;&lt;strong&gt;Open-vocabulary matching:&lt;/strong&gt; TIPSv2 is a vision-language encoder trained so that image patches and text live in the same embedding space. Gemma can pass any class name directly to the encoder and receive a per-patch similarity map in return, turning open-vocabulary understanding into open-vocabulary measurement. This extends grounding well beyond the 150-class limit without any fine-tuning.&lt;/p&gt;
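&lt;p&gt;The per-patch similarity map can be illustrated with plain cosine similarity. This sketch assumes the patch embeddings and the text embedding have already been produced by the encoder and are available as lists of floats; &lt;code&gt;similarity_map&lt;/code&gt; and &lt;code&gt;best_patch&lt;/code&gt; are hypothetical helpers, not part of the TIPSv2 API.&lt;/p&gt;

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_map(
    patches: List[List[Sequence[float]]], text: Sequence[float]
) -> List[List[float]]:
    """Score every patch embedding in the grid against one text embedding."""
    return [[cosine(patch, text) for patch in row] for row in patches]


def best_patch(sim_map: List[List[float]]) -> Tuple[int, int]:
    """(row, col) of the patch that best matches the text query."""
    cells = [(r, c) for r, row in enumerate(sim_map) for c in range(len(row))]
    return max(cells, key=lambda rc: sim_map[rc[0]][rc[1]])
```

&lt;p&gt;Thresholding the map instead of taking a single best patch would give a rough region for an arbitrary class name, which can then be measured like any other localized object.&lt;/p&gt;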

&lt;p&gt;I used the &lt;a href="https://huggingface.co/google/tipsv2-b14" rel="noopener noreferrer"&gt;B/14 variant&lt;/a&gt; (86M vision + 110M text params), the smallest of the four.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Interface
&lt;/h3&gt;

&lt;p&gt;Gemma Guide is designed blind-first: accessibility is not a feature added on top, it is the constraint every design decision starts from. The UI uses a two-zone tap-anywhere layout with no visual-first interaction patterns. Audio soundscapes and TTS guidance fill the wait while the model reasons, and the architectural complexity stays hidden from the user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/xtAuC4PHdzY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/pariidanDKE" rel="noopener noreferrer"&gt;
        pariidanDKE
      &lt;/a&gt; / &lt;a href="https://github.com/pariidanDKE/GemmaGuide" rel="noopener noreferrer"&gt;
        GemmaGuide
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Gemma Guide&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;Gemma Guide is a blind-first multimodal navigation assistant that combines Gemma 4 with TIPSv2 to answer grounded questions like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What object is in front of me?&lt;/li&gt;
&lt;li&gt;How far away is it?&lt;/li&gt;
&lt;li&gt;Where is it relative to me?&lt;/li&gt;
&lt;li&gt;How should I move safely?&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Motivation&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;For a blind user, the important question is not just what is in front of me, but how far away it is and how I should move safely. That is the gap between scene description and real navigation assistance. A useful system must do more than describe a scene in natural language; it must produce grounded spatial answers that guide movement in the real world.&lt;/p&gt;
&lt;p&gt;Language models are not reliable depth sensors, but with &lt;strong&gt;Gemma 4&lt;/strong&gt;, they can act as an agent that identifies an object, calls specialized spatial tools, and turns grounded distance estimates into practical guidance. &lt;strong&gt;Gemma Guide&lt;/strong&gt; is built to turn…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/pariidanDKE/GemmaGuide" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;I used &lt;strong&gt;Gemma 4 E4B&lt;/strong&gt; as the multimodal orchestrator across a three-agent pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Gemma 4 E4B specifically:&lt;/strong&gt; the interaction is voice-driven, the scene must be interpreted visually, and the model must decide in real time whether a conversational answer is sufficient or whether grounded spatial measurement is required. That demands audio understanding, image understanding, native function calling, and strong reasoning together, and E4B delivers all of this while remaining compact.&lt;/p&gt;

&lt;p&gt;That compactness is important for more than speed. It makes co-deployment alongside TIPSv2 feasible, and it directly advances the case for on-device deployment. A blind user should not need an internet connection to see, and running locally removes a meaningful privacy concern around continuous camera access. Keeping the language model in this size class is what makes that future realistic.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://ai.google.dev/gemma/docs/core/model_card_4" rel="noopener noreferrer"&gt;Gemma 4 model card&lt;/a&gt; also lists two capabilities that turned out to be directly relevant here: native pointing (the model can return spatial coordinates for objects within an image, not just describe them) and interleaved multimodal input (audio and visual content freely mixed in a single prompt). Both are described under image understanding in the model card, leveraging this gave us more reliable responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pipeline:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scout&lt;/strong&gt;: Classifies the request - direct answer or spatial analysis needed - and routes accordingly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mapper&lt;/strong&gt;: Interprets the question, localizes relevant objects, and calls the TIPSv2 spatial tool stack to build a grounded scene state with distance and bearing per object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6s4zn1ssm3f32f43nsaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6s4zn1ssm3f32f43nsaj.png" alt="Mapper overview" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Navigator&lt;/strong&gt;: Receives the annotated image and structured measurements and generates the final spoken guidance. Separating this from the Mapper matters: reasoning over a clean measured world model is more reliable than reasoning over raw tool outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qm363npjort7d21fzve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qm363npjort7d21fzve.png" alt="Mapper Output &amp;amp; Navigator Input - Annotated Scene" width="760" height="1351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Question:&lt;/strong&gt; &lt;em&gt;Can you tell me how to get to the grill in my backyard?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma Guide:&lt;/strong&gt; &lt;em&gt;The grill is 2.342 meters away, straight ahead. There are a few things in the way. You should first step slightly right to clear the table, which is 0.985 meters away about 25 degrees to your left. Then, the chair at 1.345 meters straight ahead will be in your path. Steer gently to your right to avoid it and step forward.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Findings&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distance alone is not enough.&lt;/strong&gt; Early versions reported metric depth but users had no sense of direction. Adding horizontal bearing from the object's position in the frame turned a distance reading into actionable spatial guidance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Whole-scene depth reasoning was too unreliable.&lt;/strong&gt; Asking the model to reason over a full depth map produced ambiguous results. The reliable path was having Gemma localize the object first, then feeding that into the measurement pipeline - leveraging what the model is actually built for: reasoning, localization, and tool calling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Separation of concerns made outputs consistent.&lt;/strong&gt; Combining scene description, tool orchestration, and navigation reasoning in one agent made outputs inconsistent. Splitting into Scout, Mapper, and Navigator fixed this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;vLLM was the right local inference choice for now.&lt;/strong&gt; Ollama lacks audio input entirely, which rules it out for a voice-driven pipeline, and its Gemma 4 tool-calling parser has had numerous bugs, whereas vLLM's proved robust. With quantization, a reduced max sequence length, and fewer image patches per call, the full stack - Gemma 4 E4B plus TIPSv2 - fits on a single 16GB GPU.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency is the dominant UX constraint.&lt;/strong&gt; Tool calls run in parallel, but the Mapper and Navigator still bottleneck on model reasoning itself, and that cost grows as conversation history lengthens. Complex scenes can push end-to-end response time past 20 seconds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reliability is the remaining work.&lt;/strong&gt; The navigation instructions are still sometimes off, and the next step is fine-tuning the grounding stack on task-specific data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
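&lt;p&gt;The first finding, deriving a horizontal bearing from the object's position in the frame, can be sketched with a simple pinhole camera model. This is an assumption-laden illustration: it presumes the camera's horizontal field of view is known and ignores lens distortion, and the function names and the 5 degree "straight ahead" tolerance are hypothetical rather than taken from the project.&lt;/p&gt;

```python
import math


def horizontal_bearing_deg(x_px: float, image_width: int, hfov_deg: float) -> float:
    """Bearing of pixel column `x_px`; negative is left of center, positive right."""
    cx = image_width / 2.0
    # Focal length in pixels implied by the horizontal field of view.
    focal_px = cx / math.tan(math.radians(hfov_deg) / 2.0)
    return math.degrees(math.atan((x_px - cx) / focal_px))


def describe_bearing(bearing_deg: float, straight_tolerance_deg: float = 5.0) -> str:
    """Turn a signed bearing into a short phrase suitable for speech output."""
    if abs(bearing_deg) > straight_tolerance_deg:
        side = "right" if bearing_deg > 0 else "left"
        return f"about {round(abs(bearing_deg))} degrees to your {side}"
    return "straight ahead"
```

&lt;p&gt;With a 60 degree field of view, an object centered on the right edge of the frame sits at roughly 30 degrees to the right, and anything within the tolerance is reported as straight ahead.&lt;/p&gt;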

&lt;p&gt;&lt;strong&gt;Toward on-device deployment:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;I explored Google AI Edge Gallery as a path to partial on-device deployment (Gemma locally, TIPS stack remote), but the current blocker is that the image Gemma sees in chat is not forwarded into the skill execution context, which breaks grounded measurement. A standalone mobile app with tighter camera and voice control is the stronger long-term path, and both models are compact enough to make fully offline deployment on edge hardware plausible.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
  </channel>
</rss>
