Prema Ananda

Posted on May 21 • Edited on May 24

NeuralPocket: Private On-Device AI with Gemma 4 — Android & Web

#devchallenge #gemmachallenge #gemma

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

NeuralPocket — a private multimodal AI assistant that runs entirely on your device. Available as both an Android app and a web app. No cloud, no subscription, no data leaving your hands.

Honest About My Motivation

I've participated in Google hackathons several times. Each time I built something real, put in the work — and each time walked away with just a participation badge 😄 This time I want to actually place, though I know there are plenty of strong projects out there!

So NeuralPocket is not a demo and not a proof-of-concept. It's a full-featured app with real architecture that solves a real problem.

The problem: modern AI assistants are brilliant — until you lose Wi-Fi. On a plane, in the mountains, roaming abroad, they become useless icons. And every message you type, every photo you send, flies off to someone else's servers.

Google gave me an extra push: the AI Edge Gallery app simply refused to install on my Android 9 phone, even though the phone runs a 64-bit OS (which matters, since LiteRT-LM only runs on 64-bit). Instead of giving up, I figured it out myself. That became the starting point for NeuralPocket.

I wanted an assistant that:

works fully offline — always, everywhere
never sends your data anywhere
understands text, photos, and audio — in one chat
runs on both Android and in the browser

What NeuralPocket Can Do

📷 Photo analysis — snap a menu in Japan → translation and context; photograph a broken part → repair advice; photograph a document → ask questions about it
🎤 Voice input — record up to 30 seconds, converted to WAV, processed on-device
💬 Multiple independent chats with different system prompts — "Translator", "Tech Assistant", "Personal Journal"
⚙️ Configurable context memory — 0–5 conversation pairs to balance coherence and context window
🎨 Markdown rendering — model responses display with full formatting: code, lists, emphasis
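
The context memory setting can be pictured as a small trimming step before each request. This is a minimal TypeScript sketch, not the app's actual code: the `ChatMessage` shape, the role names, and `trimHistory` are assumptions, and it assumes the history strictly alternates user and model messages.

```typescript
// One chat message. The role names are an assumption for this sketch.
type Role = "user" | "model";

interface ChatMessage {
  role: Role;
  text: string;
}

// Keep only the last `pairs` user/model exchanges.
// `pairs` is clamped to the 0..5 range the setting allows.
function trimHistory(history: ChatMessage[], pairs: number): ChatMessage[] {
  const clamped = Math.min(5, Math.max(0, Math.floor(pairs)));
  // slice(-0) would return the whole array, so handle 0 explicitly.
  if (clamped === 0) return [];
  return history.slice(-clamped * 2);
}

// Messages that would be sent for the next turn: trimmed history plus the new input.
function messagesForNextTurn(
  history: ChatMessage[],
  userInput: string,
  pairs: number
): ChatMessage[] {
  return [...trimHistory(history, pairs), { role: "user", text: userInput }];
}
```

With the setting at 0, every message is answered without any prior context. At 5, up to ten earlier messages travel with each request, which costs more of the context window.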

Demo

🎬 Android Demo Video

🌐 Web Version (live)

Code

Both projects are fully open source:

🤖 Android (Kotlin + LiteRT-LM) → github.com/premananda108/NeuralPocket · download APK
🌐 Web (React 19 + TypeScript + WebGPU) → github.com/premananda108/NeuralPocketWeb

How I Used Gemma 4

I chose Gemma 4 not just because it can run locally, but because the small Gemma 4 models are genuinely multimodal edge models. Across the family, Gemma 4 supports text and image input, and the E2B / E4B variants add native audio input. That matters for NeuralPocket, because the app is not a text chatbot with a camera button bolted on later — it is designed as one private on-device assistant that can handle typed prompts, photos, and short voice recordings inside the same local workflow, without cloud handoffs.

For the Android app, I centered the experience around Gemma 4 E2B IT. Google positions E2B and E4B as the Gemma 4 variants built for phones and edge devices, while the larger 26B Mixture-of-Experts and 31B Dense models are positioned for consumer GPUs and workstations rather than mainstream mobile hardware. In practice, that makes E2B the best “real phone” default for NeuralPocket: small enough to be deployable on-device, but still capable enough to power useful multimodal interactions.

I used Gemma 4 E2B as the primary mobile model because it hits the best balance between capability, storage footprint, and device reach. For stronger devices, I also expose Gemma 4 E4B as an upgrade path. The key point is that I did not pick Gemma 4 only for text generation — I picked it because the smaller Gemma 4 models are where on-device multimodality actually becomes practical: reading a photo, understanding a screenshot or document, and handling short spoken input without sending any of it to a server. That product direction matches exactly what I wanted NeuralPocket to be.

On Android, this multimodal flow runs through LiteRT-LM, with hardware acceleration where available and CPU fallback otherwise. LiteRT-LM supports CPU, GPU, and NPU backends on Android, and Google’s own edge stack is explicitly designed for vision- and audio-capable LLMs. That makes Gemma 4 a strong architectural fit for a private assistant that should remain useful when the network disappears.

For the web version, I also used Gemma 4 through MediaPipe Tasks GenAI, where inference runs entirely inside the browser via WebGPU rather than on a remote server. That matters because the web app keeps the same core product idea as the Android version: private, local AI that runs on the user’s own hardware instead of sending prompts to a backend. In NeuralPocket, generation runs inside a Web Worker so the interface stays responsive during streaming, while downloaded models are cached locally to make repeated launches much more practical. In practice, WebGPU is what makes the browser version viable at all — it turns the web app from a simple demo into a real on-device experience powered by the user’s GPU.
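
To illustrate the Web Worker split, here is a hedged sketch of how the UI thread might fold streamed worker messages into display state. The message names (`token`, `done`, `error`) and the `applyWorkerMessage` helper are hypothetical and not taken from the NeuralPocketWeb source.

```typescript
// Messages the worker posts back to the UI thread (hypothetical protocol).
type WorkerMessage =
  | { type: "token"; text: string }
  | { type: "done" }
  | { type: "error"; message: string };

// What the chat view needs to render one streaming answer.
interface StreamState {
  text: string;
  status: "streaming" | "done" | "error";
  error?: string;
}

const initialState: StreamState = { text: "", status: "streaming" };

// Pure function: takes the current state and one message, returns the next state.
function applyWorkerMessage(state: StreamState, msg: WorkerMessage): StreamState {
  if (msg.type === "token") {
    // Ignore late tokens once the stream has finished or failed.
    return state.status === "streaming"
      ? { ...state, text: state.text + msg.text }
      : state;
  }
  if (msg.type === "error") {
    return { ...state, status: "error", error: msg.message };
  }
  return { ...state, status: "done" };
}

// In the browser this would be wired up roughly as:
//   worker.onmessage = (e) => { state = applyWorkerMessage(state, e.data); render(state); };
```

Because the heavy work stays in the worker, the UI thread only does cheap string concatenation per token.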

I also looked at the larger Gemma 4 variants — especially 26B MoE and 31B Dense — as a possible path for tablets. But I do not want to overclaim what I have not personally verified. Officially, Google positions those models for consumer GPUs and workstations, and the published base inference memory requirements are far beyond typical mobile budgets: about 15.6 GB for 26B A4B and 17.4 GB for 31B at 4-bit quantization, before counting runtime overhead and KV cache. So for now, I treat them as desktop-class options, not realistic defaults for Android phones or tablets.

In other words, Gemma 4 was the right choice for NeuralPocket not because it is merely “small enough”, but because it gives me the combination I actually need: private local inference, multimodal input, edge-friendly latency, and one model family that scales from practical phone deployment to more capable hardware tiers.

Which devices can realistically run it?

In practice, the safest target for the Android version is a 64-bit Android phone or tablet with enough free storage and hardware acceleration support, especially for GPU-backed inference. Google’s LiteRT-LM stack supports CPU, GPU, and NPU on Android, and Google’s own AI Edge Gallery app currently targets Android 12+. In my own case, NeuralPocket also runs on compatible Android 9+ arm64 hardware, but actual performance depends heavily on RAM, storage, and available acceleration on each device.

For larger-screen Android devices, there is clearly growing momentum: Google’s current Gemini Nano / ML Kit GenAI support list already includes foldables and tablet-class hardware such as Pixel 9 Pro Fold / 10 Pro Fold, Galaxy Z Fold7, Lenovo Idea Tab Pro Gen 2, Lenovo Legion Tab Gen 5, and Xiaomi Pad Mini. I have not personally verified NeuralPocket on those devices yet, so I describe tablets as a promising expansion path, not a tested guarantee.

Architecture: Two Platforms, One Model

┌─────────────────────────────────────────────────┐
│                  NeuralPocket                   │
├──────────────────────┬──────────────────────────┤
│     Android App      │        Web App           │
│       Kotlin         │  React 19 + TypeScript   │
├──────────────────────┼──────────────────────────┤
│   LiteRT-LM SDK      │  MediaPipe Tasks GenAI   │
│   (native runtime)   │  Web Worker + WebGPU     │
├──────────────────────┴──────────────────────────┤
│              Gemma 4 E2B IT / E4B IT            │
│            (running locally on device)          │
└─────────────────────────────────────────────────┘

Android: LiteRT-LM

Stack: Kotlin + Google AI Edge LiteRT-LM + CameraX + MVVM

The engine automatically selects the best available backend: GPU via Vulkan or OpenCL, falling back to CPU via XNNPACK. Concurrent inference calls are serialized through a Mutex to prevent race conditions.
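
The Android app does this with a Kotlin Mutex. The same serialization idea can be sketched in TypeScript, the language of the web codebase, as a promise chain that lets only one inference call run at a time. `SerialQueue` is a hypothetical name for this illustration, not a class from either repository.

```typescript
// Runs async tasks strictly one after another, in the order they were submitted.
class SerialQueue {
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after everything queued before it has settled.
    const result = this.tail.then(() => task());
    // Swallow the outcome on the chain so one failed task does not block later ones.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

// Usage idea: wrap every call into the model in queue.run(...)
//   const answer = await queue.run(() => generate(prompt));
// where `generate` stands for whatever function talks to the engine.
```

The caller still gets each task's own result or error back. Only the ordering is enforced.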

Key architectural decisions:

A single StateFlow<ChatUiState> as the source of truth — the UI only observes, never mutates directly
Chat history is written atomically via a temp file, so a crash mid-write cannot corrupt the saved history
The vision encoder loads only when an image is present — saves RAM
Preflight check on first launch: RAM, ABI, free storage — the app warns if the device doesn't meet the minimum requirements
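
The preflight check from the list above can be sketched as a pure function. The real check lives in the Kotlin app. This TypeScript version only illustrates the logic, and the `DeviceInfo` shape, the function name, and the free-storage threshold are assumptions. The 4 GB RAM and arm64 requirements are the ones stated in this post.

```typescript
interface DeviceInfo {
  totalRamBytes: number;
  abis: string[]; // e.g. ["arm64-v8a", "armeabi-v7a"]
  freeStorageBytes: number;
}

interface Requirements {
  minRamBytes: number;
  requiredAbi: string;
  minFreeStorageBytes: number;
}

const GIB = 1024 * 1024 * 1024;

// RAM and ABI follow the post's stated requirements. The storage value is an
// assumption for this sketch. Real devices usually report a bit less than their
// nominal RAM, so a production check would likely use a lower RAM threshold.
const DEFAULT_REQUIREMENTS: Requirements = {
  minRamBytes: 4 * GIB,
  requiredAbi: "arm64-v8a",
  minFreeStorageBytes: 3 * GIB,
};

// Returns a list of human-readable warnings. An empty list means the device passes.
function preflight(
  device: DeviceInfo,
  req: Requirements = DEFAULT_REQUIREMENTS
): string[] {
  const warnings: string[] = [];
  if (device.totalRamBytes < req.minRamBytes) {
    warnings.push("Not enough RAM for the model");
  }
  if (!device.abis.includes(req.requiredAbi)) {
    warnings.push(`Unsupported CPU architecture, ${req.requiredAbi} is required`);
  }
  if (device.freeStorageBytes < req.minFreeStorageBytes) {
    warnings.push("Not enough free storage to download the model");
  }
  return warnings;
}
```

Returning warnings instead of a single boolean lets the app tell the user exactly which requirement is not met.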

Performance:

GPU (Vulkan/OpenCL): ~15–30 tokens/sec
CPU-only (XNNPACK): ~5–10 tokens/sec
Requirements: Android 9+, arm64, 4+ GB RAM

All three screenshots were taken in airplane mode — no network, everything running locally:

Web: WebGPU Right in the Browser

Stack: React 19 + TypeScript + Vite + Tailwind CSS v4 + MediaPipe Tasks GenAI

All inference runs inside a Web Worker, so generation never blocks the UI and the interface stays responsive during streaming. Models are cached in OPFS (Origin Private File System): the first launch downloads ~2.6 GB, and later launches load the model from the local cache instead of the network.
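
The OPFS caching idea boils down to "read from the private file system if the file is there, otherwise download and store it". Below is a minimal sketch under stated assumptions. `loadModelCached` and the `*Like` interfaces are hypothetical names, and the interfaces cover only the small subset of the OPFS handle API that the sketch uses, so it can also run against a test double.

```typescript
// Structural subset of the OPFS handle API used here.
interface WritableLike {
  write(data: Uint8Array): Promise<void>;
  close(): Promise<void>;
}
interface FileHandleLike {
  getFile(): Promise<{ size: number; arrayBuffer(): Promise<ArrayBufferLike> }>;
  createWritable(): Promise<WritableLike>;
}
interface DirectoryLike {
  getFileHandle(name: string, options?: { create?: boolean }): Promise<FileHandleLike>;
}

async function loadModelCached(
  dir: DirectoryLike,
  fileName: string,
  download: () => Promise<Uint8Array>
): Promise<{ bytes: Uint8Array; fromCache: boolean }> {
  try {
    // Without { create: true } this rejects when the file does not exist yet.
    const cached = await (await dir.getFileHandle(fileName)).getFile();
    // A size check is not a completeness check. A real app might also store a checksum.
    if (cached.size > 0) {
      return { bytes: new Uint8Array(await cached.arrayBuffer()), fromCache: true };
    }
  } catch {
    // Not cached yet: fall through to the download path.
  }
  const bytes = await download();
  const handle = await dir.getFileHandle(fileName, { create: true });
  const writable = await handle.createWritable();
  await writable.write(bytes);
  await writable.close();
  return { bytes, fromCache: false };
}
```

In the browser, `dir` would come from `await navigator.storage.getDirectory()`. For a multi-gigabyte model, a real implementation would stream the download into the file rather than hold it in one `Uint8Array`. The sketch keeps everything in memory only for brevity.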

Three model presets are supported: Gemma 4 E2B, Gemma 4 E4B, and a multimodal Gemma 3n model. You can also provide a custom model URL.

The web app is built as a PWA (Progressive Web App) — you can install it on your computer as a standalone app with one click from the browser, just like YouTube or other web services. Once installed, it appears in your app menu and opens in its own window without an address bar.

Web version in action (all computation happens locally in the browser via WebGPU):

Honest caveat about offline: after the first launch the app works without a network, but it's not fully autonomous out of the box. The MediaPipe runtime loads from jsDelivr and fonts load from Google Fonts, so for guaranteed offline use you'd need to self-host those dependencies.

Honest caveat about multimodal on the web: at the time of development I couldn't find web-optimized multimodal builds of Gemma 4; the available versions only support text. However, I found a fully multimodal model from the previous generation, gemma-3n-E2B-it-int4-Web.litertlm, which accepts text, images, and audio directly in the browser. That became the third preset in the web version.

A note on how fast things move. While building NeuralPocket, Google released Gemini 3.5 Flash — and first impressions suggest it's a notable step up from 3.1. It handles complex multi-step tasks confidently: for example, it wrote a full test suite for the web version of NeuralPocket on the first try, something that used to take several iterations. It's remarkable how fast this space evolves — the world changes while you're still writing the article.

At this pace, in a year you might just need to download the latest Gemma and ask it to build the whole app itself. Probably. Maybe. 😄

Privacy as Architecture, Not Marketing

NeuralPocket sends nothing anywhere — not messages, not photos, not chat history, not analytics. This isn't a setting you toggle. It's a consequence of the architecture: there's no server that could receive anything. Works in airplane mode. No account, no subscription.

Honest caveat about stopping generation: Neither LiteRT-LM nor MediaPipe Tasks GenAI currently provides a way to cleanly interrupt an in-progress generation. In NeuralPocket, pressing Stop hides new tokens from the UI, but inference keeps running in the background until the model finishes on its own. On Android I tried force-restarting the engine on cancel — the app crashed. Hopefully future versions of both libraries will add proper cancellation support.
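
The Stop workaround described above amounts to a gate between the token stream and the UI. Here is a minimal sketch of that idea. `TokenSink` and its methods are hypothetical names, and the sketch does not touch the engine at all, which matches the limitation: inference keeps running, and only the display stops updating.

```typescript
// Sits between the token callback and the UI. After stop(), tokens are dropped.
class TokenSink {
  private stopped = false;
  private visible = "";

  constructor(private readonly render: (text: string) => void) {}

  // Called for every token the engine produces.
  onToken(token: string): void {
    if (this.stopped) return; // the model is still generating, but the UI ignores it
    this.visible += token;
    this.render(this.visible);
  }

  // Called when the user presses Stop.
  stop(): void {
    this.stopped = true;
  }

  get text(): string {
    return this.visible;
  }
}
```

Creating a fresh sink per generation keeps late tokens from an earlier, stopped answer from leaking into the next one.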

Summary: Android vs Web

Two apps, one idea — but different trade-offs:

	🤖 Android	🌐 Web
Installation	APK (~36 MB)	None — just open in browser
Install as app	✅ native	✅ PWA
Model	Gemma 4 E2B / E4B	Gemma 4 E2B / E4B, Gemma 3n
Text chat	✅	✅
Photo input	✅	⚠️ Gemma 3n only
Audio input	✅	⚠️ Gemma 3n only
Offline	✅ after downloading models	⚠️ after first launch and model download
Performance	~15–30 tok/s (GPU)	depends on browser WebGPU
Requirements	Android 9+, arm64	Chrome / Edge with WebGPU
Multiple chats	✅	✅
Custom model	❌	✅ by URL