Marco Sbragi

Posted on May 17 • Edited on May 22

GemmaLink: Your Private Eye Assistant

#devchallenge #gemmachallenge #gemma

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Most local AI setups are currently a dependency nightmare, forcing users into heavy Python environments, massive CUDA toolkits, or complex Docker configurations. I built GemmaLink to achieve the exact opposite: a "Zero-Cloud", local-first assistant that turns your smartphone into a targeted vision sensor for local VLMs, running entirely on a standard PC with a single-file, plug-and-play binary.

GemmaLink allows you to open a web interface on your smartphone, point it at an object, capture a precise crop via an interactive on-screen viewfinder, and chat about what the camera sees with a local model running on your machine.

Unlike general-purpose tools like Google Lens, which index data on remote servers for commercial classification, GemmaLink is a strictly confidential sandbox. Because it streams data exclusively over your local network, it enables critical use cases where third-party data exposure is unacceptable:

Financial Confidentiality: Point your phone at bank statements or invoices to extract line items or analyze balances safely.
Private Medical Insights: Process the layout of localized medical data or blood test terminology without uploading your health history to the cloud.

Built-in Guardrails

Handling sensitive, real-world data requires architectural responsibility. GemmaLink shows explicit notices about the system's inherent fallibility and prompts the user to consult certified professionals before relying on critical financial or medical outputs.

Demo

I have recorded a comprehensive video walkthrough showcasing the complete lifecycle: the adaptive dual-mode interface initialization, the network handshake, the high-precision viewport cropping, and the real-time Server-Sent Events (SSE) token streaming.

Watch the full demo on YouTube: GemmaLink Walkthrough & Architecture Demo

Code

The project is fully modular, featuring a decoupled network layout where firewall rules and port-forwarding scripts (.ps1 and .sh) remain external for maximum user transparency and maintenance efficiency.

Source Code & Binary Assets (v1.0.0): GitHub Repository - eye-assistant

How I Used Gemma 4

GemmaLink is deliberately optimized for the lightweight vision models in the Gemma 4 family. I chose a small, efficient, vision-capable model for two main reasons:

Low-Latency Edge Performance: The primary objective was to guarantee a fast Time-To-First-Token in constrained local environments (pure CPU or Vulkan fallback) without demanding enterprise-grade hardware.
Contextual Token Efficiency: Sending full-resolution mobile snapshots slows local inference and spends attention on irrelevant pixels. The frontend computes the scale ratio between the native video frame and its on-screen size (videoWidth / videoRect.width) and uses it to map the CSS viewfinder rectangle onto the frame, cropping only the targeted pixels. This payload reduction fits the Gemma 4 vision input well, resulting in fast processing loops.
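The crop arithmetic can be sketched in a few lines. The real computation runs in the browser frontend; this version is written in Go only to keep the examples in one language, and it is a minimal sketch, not GemmaLink's actual code. The names Rect and cropInVideoPixels are hypothetical, and the sketch assumes the video fills its element without letterboxing.

```go
package main

import "math"

// Rect is an axis-aligned rectangle. Units depend on context:
// CSS pixels for the viewfinder, native pixels for the video frame.
type Rect struct {
	X, Y, W, H float64
}

// cropInVideoPixels maps a viewfinder rectangle, given in CSS pixels relative
// to the top-left corner of the displayed video element, onto the native frame.
// videoW and videoH play the role of videoWidth and videoHeight in the browser;
// dispW and dispH play the role of videoRect.width and videoRect.height.
// Assumption (not from the original post): the video is not letterboxed.
func cropInVideoPixels(view Rect, videoW, videoH, dispW, dispH float64) Rect {
	scaleX := videoW / dispW
	scaleY := videoH / dispH

	// Scale, round to whole pixels, and keep the origin inside the frame.
	x := math.Min(math.Max(math.Round(view.X*scaleX), 0), videoW)
	y := math.Min(math.Max(math.Round(view.Y*scaleY), 0), videoH)

	// Never let the crop extend past the right or bottom edge.
	w := math.Min(math.Round(view.W*scaleX), videoW-x)
	h := math.Min(math.Round(view.H*scaleY), videoH-y)

	return Rect{X: x, Y: y, W: w, H: h}
}
```

For example, a 160x120 CSS-pixel viewfinder at offset (100, 50) on a 1920x1080 frame displayed at 480x270 has a scale ratio of 4 on both axes, so the crop is 640x480 at (400, 200). That is 307,200 pixels instead of 2,073,600, a 6.75x smaller payload.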

Driving AI without "Vibecoding"

I have over 40 years of experience writing software, but Go was not part of my traditional stack. I chose it because its concurrency model and clean cross-compilation were what true single-binary deployment required.

While I utilized an AI assistant to accelerate the implementation of the Go backend, this was absolutely not "vibecoding". The AI served as a syntax compiler and fast writer, but the technical steering wheel remained firmly in my hands. The deterministic state machine (preview -> ask -> response), the memory management (explicit URL.revokeObjectURL cleanups to prevent mobile memory leaks), and the streaming chunk buffer that prevents incomplete Markdown strings from flickering during SSE delivery were entirely engineered under my tight architectural directives.
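One way a streaming chunk buffer of this kind could work is sketched below. It is an illustration under stated assumptions, not the code shipped in GemmaLink: the type ChunkBuffer, the balance heuristic, and the maxPending cap are all hypothetical. The idea is to hold back text while a bold marker or backtick span is still open, and release it once the markers balance.

```go
package main

import "strings"

// maxPending caps how much text may be held back, so a stray marker that is
// never closed cannot stall the stream forever. The value is an assumption.
const maxPending = 4096

// ChunkBuffer accumulates streamed tokens and releases them only when the
// pending text contains no half-open Markdown marker.
type ChunkBuffer struct {
	pending strings.Builder
}

// safeToRender reports whether s can be shown without visible flicker:
// bold markers and backticks are balanced, and s does not end in a lone "*"
// that might be the first half of a "**" split across two tokens.
func safeToRender(s string) bool {
	if strings.HasSuffix(s, "*") {
		return false
	}
	return strings.Count(s, "**")%2 == 0 && strings.Count(s, "`")%2 == 0
}

// Push appends a token and returns the text that can be rendered now,
// or "" when the buffer is still waiting for a closing marker.
func (b *ChunkBuffer) Push(token string) string {
	b.pending.WriteString(token)
	s := b.pending.String()
	if len(s) < maxPending && !safeToRender(s) {
		return ""
	}
	b.pending.Reset()
	return s
}

// Flush returns whatever is left when the stream ends.
func (b *ChunkBuffer) Flush() string {
	s := b.pending.String()
	b.pending.Reset()
	return s
}
```

Feeding the tokens "Hello ", "**bo", "ld", "** world" through Push yields "Hello " immediately, nothing for the next two tokens, and then "**bold** world" in one piece. The heuristic is deliberately simple: it does not handle every Markdown construct, and a production version would need a real tokenizer or a more careful state tracker.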

Troubleshooting: Windows + AMD Ryzen AI (Strix Halo) Edge-Case

While testing the application on an EVO X2 mini PC equipped with unified memory and Radeon 8060S Graphics, I identified a specific issue during image processing.

If you run the application on Windows using latest-generation AMD APUs (Strix Halo / Ryzen AI Max architecture) and experience an indefinite hang at the "processing image..." log during multimodal upload, the issue stems from a memory allocation lockup within the Vulkan/AMD driver stack.

Solution: Download a CPU-Only release of llama.cpp and place it in the bin directory where you extracted GemmaLink. Thanks to the processing power of Zen 5 and its native AVX512 / AVX512_VNNI instruction sets, vision token inference remains extremely fast while completely bypassing the graphics stack bug.

Swapping to the CPU-only release serves as a reliable workaround for most image processing issues encountered on this hardware configuration, ensuring everyone can test the application. A more streamlined solution for end users will be investigated.

Updated 2026/05/22: To fix the problem, you can download the gfx1151-specific release from the lemonade-sdk/llamacpp-rocm repository and replace the binaries in the bin folder.

What's Next? (Community-Driven Roadmap)

The core release (v1.0.0) is tagged and stable. I have a backlog of advanced features mapped out, which I will implement if the project gains traction:

Multilingual Smartphone UI: Dynamic localization driven by browser headers.
JSON-Driven Context Chips: Offloading the quick-question preset chips to an external, customizable chips.json for manual user tuning.
Automated Hardware Dispatching: Having the Go launcher detect the host's instruction sets at runtime and automatically select the matching specialized llama.cpp build.
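As a rough idea of what the dispatching item could look like, the sketch below separates detection from selection. Everything in it is an assumption for illustration: the Host struct, the pickBackend function, and the directory names it returns are hypothetical, and how the launcher would actually fill in Host (CPU feature probing, GPU queries) is left out. The only rule taken from this post is that gfx1151 hardware should avoid the Vulkan build.

```go
package main

// Host holds what a launcher might have detected about the machine.
// All fields are hypothetical; detection itself is out of scope here.
type Host struct {
	GPUArch   string // e.g. "gfx1151"; empty when unknown
	HasVulkan bool   // a usable Vulkan driver was found
	HasAVX512 bool   // the CPU reports AVX512 support
}

// pickBackend returns the name of a hypothetical subdirectory under bin
// that holds the llama.cpp build best suited to the host.
func pickBackend(h Host) string {
	switch {
	case h.GPUArch == "gfx1151":
		// Strix Halo: prefer the ROCm build over Vulkan, because the
		// Vulkan path hung during image processing in the tests above.
		return "rocm-gfx1151"
	case h.HasVulkan:
		return "vulkan"
	case h.HasAVX512:
		return "cpu-avx512"
	default:
		return "cpu"
	}
}
```

Keeping the selection logic a pure function of the detected facts makes it easy to test without the hardware in question, and lets the rules grow as new edge cases such as the Strix Halo one are reported.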

Top comments (3)

Andy Stewart • May 18

A single Go binary, local network handshakes, and pixel-perfect viewport cropping—turning complex, cross-device VLM pipelines into a deterministic, secure sandbox is exactly how it should be done!

This is true, hardcore systems engineering, keeping privacy and control firmly in your own hands. Forget the cloud AI hype; this kind of edge-to-edge synergy is the real future!

Marco Sbragi • May 18

Thank you, Andy. For me this is the only right way.
Privacy is everything to me.
