DEV Community: Dimitar Hadzhiradev

Bringing Gemma 4 E2B to the Edge: Building a Privacy-First Dream Analyzer with Flutter & LiteRT

Dimitar Hadzhiradev — Sat, 23 May 2026 17:42:19 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Every AI app says it respects your privacy.

Then it uploads your most personal data to the cloud.

When we started building Remora — a dream journaling and psychological interpretation app — we faced a difficult question:

How do you analyze deeply personal subconscious experiences without sending them to a remote server?

We wanted users to wake up, record a dream, and receive rich AI-powered analysis directly on their phone.

No cloud inference.
No persistent uploads.
No centralized storage of emotional or psychological data.

That requirement immediately ruled out most modern AI architectures.

Then we discovered Gemma 4.

Its compact E2B footprint, multimodal support, and mobile-first optimization made it uniquely suited for true on-device inference.

But integrating cutting-edge local AI into a production Flutter app turned out to be far more challenging than expected.

This is the engineering story behind making it work.

Why Gemma 4 Changed the Architecture

Most mobile AI today still relies on a thin-client model:

Capture user data
Upload to cloud APIs
Run inference remotely
Return results

That approach breaks down completely for sensitive psychological analysis.

Dream journals often contain:

trauma,
fears,
relationships,
emotional states,
deeply personal memories.

We needed:

offline capability,
low latency,
multimodal understanding,
and strict data locality.

Gemma 4 E2B gave us a realistic path toward all four.

Running directly on-device also unlocked:

instant responses,
airplane-mode support,
reduced infrastructure cost,
and dramatically improved user trust.

Challenge 1: The Model Format Wars (GGUF vs LiteRT)

Our first instinct was straightforward:

Download a .gguf quantization from Hugging Face and wire it into Flutter.

That assumption lasted about five minutes.

The moment the engine initialized on Android, the app crashed with:

IllegalArgumentException:
Unsupported model format: .gguf

What We Learned

The open-source ecosystem heavily favors .gguf because of tools like llama.cpp.

But Android hardware acceleration operates in a very different ecosystem.

Google’s mobile AI stack relies on:

MediaPipe,
LiteRT,
LiteRT-LM delegates,
and NPU-optimized tensor layouts.

That means models must be packaged as:

.task
.bin
or .litertlm

—not GGUF.
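
A cheap way to avoid this class of crash is to check the file extension before handing a model to the runtime. The following is a minimal sketch in plain Dart. `supportedModelExtensions`, `modelExtension`, and `isLiteRtCompatible` are hypothetical helper names, not part of flutter_gemma or LiteRT, and the extension set simply mirrors the three formats listed above.

```dart
/// Extensions listed above as loadable by the Android LiteRT stack.
/// (Assumption: adjust this set to whatever your runtime version accepts.)
const supportedModelExtensions = {'.task', '.bin', '.litertlm'};

/// Returns the lowercase extension of [path], including the leading dot,
/// or an empty string when the file name has no extension.
String modelExtension(String path) {
  final fileName = path.split('/').last;
  final dot = fileName.lastIndexOf('.');
  return dot <= 0 ? '' : fileName.substring(dot).toLowerCase();
}

/// True when [path] looks like a model package the mobile runtime can load.
bool isLiteRtCompatible(String path) =>
    supportedModelExtensions.contains(modelExtension(path));
```

Running a check like this before initialization turns a native crash into an error message the app can show or log.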

Once we switched to the official LiteRT package, memory usage dropped significantly and inference stabilized immediately.

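// Download and register the LiteRT-LM package.
// flutter_gemma's builder only starts the download when install() is
// called, so the whole chain is awaited.
await FlutterGemma.installModel(
  modelType: ModelType.gemma4,
  fileType: ModelFileType.litertlm,
).fromNetwork(
  'https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it.litertlm',
).withProgress((progress) {
  print('Downloading 1.5 GB edge model: $progress%');
}).install();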

This was our first major realization:

Edge AI is not just “smaller cloud AI.”
It is an entirely different deployment architecture.

Challenge 2: The “Code 13” Audio Crash

One of the most exciting features of Gemma 4 is native multimodal capability.

Our goal was simple:

Users should be able to:

wake up,
tap record,
describe their dream verbally,
and receive private on-device analysis.

We recorded audio and passed it into the local model.

Immediate crash.

We switched encoders:

.m4a
PCM16 WAV
16kHz mono

Crash again.

Failed to start streaming (code: 13)

The Root Cause

After digging through Google’s AI Edge Gallery implementation, we discovered:

Current community LiteRT weights do not yet expose fully fused audio subgraphs
Qualcomm QNN delegates require certain audio operators to run on CPU
Current Flutter bindings don’t yet support backend splitting between CPU and NPU execution

In practice:

text generation worked perfectly,
audio tensor routing did not.

The Solution: A Secure Hybrid Pipeline

Instead of abandoning voice support, we built a privacy-preserving fallback architecture.

If local audio inference fails:

audio is sent to a transient speech-to-text endpoint,
no audio is persisted,
only transcription text is returned,
all psychological interpretation still happens locally via Gemma 4.

That preserved the most sensitive part of the workflow entirely on-device.

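import 'dart:io';

import 'package:dio/dio.dart';

/// Tries on-device audio analysis first. If that fails, the audio is
/// transcribed remotely and the transcript is interpreted locally.
Future<DreamAnalysisResult> analyzeAudio(String filePath) async {
  if (_localEngine.isReady) {
    try {
      return await _localEngine.analyzeAudio(filePath);
    } catch (e) {
      // Any failure lands here, not only the "code: 13" streaming error,
      // so log the actual exception.
      print('Local audio inference failed ($e). Engaging secure fallback.');
    }
  }

  final bytes = await File(filePath).readAsBytes();

  // Transient speech-to-text call: only the transcript comes back.
  final response = await dio.post(
    '/dreams/transcribe',
    data: bytes,
  );

  final text = response.data['transcription'] as String;

  // Interpretation always runs on-device, on text only.
  final result = await _localEngine.analyzeDream(text);

  return DreamAnalysisResult(
    title: result.title,
    interpretation: result.interpretation,
    tags: [...result.tags, '#voice_log'],
    transcribedText: text,
  );
}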

This ended up becoming one of the most important architectural decisions in the app.

Not because it was perfect —
but because it degraded gracefully while preserving privacy guarantees.

Challenge 3: Emulators Lie

During development we tested inference using the Android emulator.

Everything failed instantly.

Connection closed before full header was received

At first we suspected:

networking,
Flutter isolates,
or broken FFI bindings.

None of those were the problem.

The real issue was architecture mismatch.

LiteRT-LM delegates are optimized specifically for:

arm64-v8a
mobile NPUs
physical AI acceleration hardware

The x86 emulator environment simply could not execute the delegate stack correctly.

Once we moved testing onto a physical Pixel device:

binaries mapped correctly,
NPU acceleration activated,
inference latency dropped dramatically.

That moment changed how we approached mobile AI QA entirely.

Edge AI development without real hardware is basically guesswork.
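
One way to fail fast on an emulator is to check the process ABI before initializing the engine. This is a minimal sketch using `Abi` from `dart:ffi`. `supportsAcceleratedInference` and `describeTarget` are hypothetical helpers, and treating arm64 Android as the only supported target is an assumption taken from the delegate list above.

```dart
import 'dart:ffi' show Abi;

/// True when [abi] is arm64 Android, the only target this sketch treats
/// as able to run the accelerated delegate stack (assumption).
bool supportsAcceleratedInference(Abi abi) => abi == Abi.androidArm64;

/// Human-readable explanation suitable for a debug log or a QA banner.
String describeTarget(Abi abi) => supportsAcceleratedInference(abi)
    ? 'arm64 Android target: local inference can be attempted'
    : 'Unsupported ABI ($abi): skip local inference and test on hardware';
```

Calling `describeTarget(Abi.current())` at startup makes an x86 emulator run report the mismatch immediately instead of failing somewhere inside the native layer.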

Looking Forward: Android AI Core & Gemini Nano

Downloading a 1.5GB local model works —
but it is not the ideal long-term UX.

Large bundled models create:

storage pressure,
installation friction,
and slower onboarding.

To future-proof the architecture, we integrated Android AI Core support.

Before downloading Gemma 4 locally, Remora now checks whether:

Gemini Nano,
or another system-level model,
is already available through Android’s native AI layer.

If available:

inference becomes instant,
no model download is required,
and privacy remains intact.

This creates a hybrid architecture where:

OS-native models are preferred,
Gemma 4 acts as the portable fallback,
and all inference still remains local-first.
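
The preference order above can be sketched as a small selection routine. This is an illustration in plain Dart, not Remora's actual code. `InferenceBackend`, `selectBackend`, and `FakeBackend` are hypothetical names, and a real implementation would wrap whatever AI Core and LiteRT-LM bindings the app uses.

```dart
/// A local inference backend the app could use (hypothetical interface).
abstract class InferenceBackend {
  String get name;

  /// Whether the backend can run right now without further setup.
  Future<bool> isAvailable();
}

/// Returns the first available backend from [byPreference], or null if
/// none is ready (for example, before the Gemma model has been downloaded).
Future<InferenceBackend?> selectBackend(
  List<InferenceBackend> byPreference,
) async {
  for (final backend in byPreference) {
    if (await backend.isAvailable()) return backend;
  }
  return null;
}

/// Stand-in used to exercise the selection logic without real models.
class FakeBackend implements InferenceBackend {
  FakeBackend(this.name, {required this.available});

  @override
  final String name;
  final bool available;

  @override
  Future<bool> isAvailable() async => available;
}
```

Passing the system model first and Gemma 4 second keeps the download as a last resort, and a `null` result is the signal to offer the model download.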

What Building with Gemma 4 Taught Us

Working with Gemma 4 fundamentally changed how we think about mobile apps.

For years, mobile AI has largely meant:

“Call an API and wait.”

But local multimodal models enable something very different:

applications that function offline,
preserve privacy by default,
reduce infrastructure cost,
and feel dramatically more responsive.

The tooling ecosystem is still early.
The documentation is fragmented.
The hardware constraints are real.

But the direction is obvious.

Edge AI is becoming a first-class application platform.

And Gemma 4 is one of the first models that genuinely makes that future practical for mobile developers.

Final Thoughts

Remora started as an experiment:

Could we build a psychologically meaningful AI experience without compromising user privacy?

Thanks to Gemma 4, LiteRT, and Android’s emerging edge AI ecosystem, the answer is increasingly yes.

We still have challenges ahead:

audio graph support,
smaller quantizations,
memory optimization,
and broader device compatibility.

But for the first time, building truly private multimodal AI apps on smartphones feels achievable.

And that changes everything. What challenges you most in your Edge AI journey?

The Subconscious Powered by Edge AI

Dimitar Hadzhiradev — Sat, 23 May 2026 17:23:28 +0000

RemoraAI: The Subconscious Social Network Powered by Edge AI

What I Built

Dreams are our most private thoughts.

Yet most AI-powered journaling apps require users to upload deeply personal emotions, fears, and subconscious experiences directly to the cloud.

Remora was built to challenge that assumption.

Remora is a privacy-first “Subconscious Social Network” powered by Gemma 4 running directly on-device using LiteRT-LM and Flutter.

The app allows users to:

record dreams via voice,
receive AI-powered psychological interpretation,
detect recurring subconscious patterns over time,
generate surreal dream visuals,
and optionally publish anonymized dreams to a public community feed.

The key innovation is that the sensitive psychological analysis happens entirely on-device.

No raw dream data needs to leave the smartphone.

The Core Problem

Dream journaling has historically remained a private, offline activity because users are understandably uncomfortable uploading vulnerable psychological content to centralized servers.

We wanted to answer a difficult question:

Can modern multimodal AI deliver meaningful emotional analysis while preserving user privacy?

Remora demonstrates that the answer is yes.

Demo

Core Flow

1. User records a dream using voice input
2. Gemma 4 processes the narrative locally
3. The app generates:

   a dream title,
   emotional interpretation,
   thematic tags,
   and subconscious motif detection
4. User optionally generates AI dream artwork
5. User may privately store or anonymously publish the dream

Demo Content

Offline “Privacy Mode”

AI-generated dream art

Community feed scrolling

Code

Tech Stack

Flutter
LiteRT-LM
MediaPipe
Flutter FFI
FastAPI
Android AI Core
Gemini Nano
Imagen 4
Vector Embeddings + RAG

Architecture Highlights

Local AI Layer

Gemma 4 E2B via LiteRT-LM
On-device inference
NPU acceleration
Offline-capable “Privacy Mode”

Cloud Layer

Optional dream image generation
Anonymous community feed
Secure transient speech-to-text fallback

Memory Layer

Vector embeddings for recurring dream motifs
Retrieval-Augmented Generation (RAG)
Long-term subconscious pattern analysis

How I Used Gemma 4

We selected the Gemma 4 E2B model because it sits at the ideal intersection of:

mobile performance,
low memory footprint,
multimodal capability,
and meaningful reasoning quality.

Previous local models were either:

too large for mobile deployment,
too slow for real-time inference,
or incapable of nuanced psychological interpretation.

Gemma 4 E2B solved all three.

Using LiteRT-LM, the model runs directly on-device with NPU acceleration. Where Android AI Core already exposes a system model such as Gemini Nano, Remora can use that instead and skip the Gemma download.

This enables:

fully offline dream analysis,
dramatically reduced latency,
improved privacy,
and lower infrastructure cost.

Local Inference Pipeline

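// The builder only starts the download when install() is called,
// so the whole chain is awaited.
await FlutterGemma.installModel(
  modelType: ModelType.gemma4,
  fileType: ModelFileType.litertlm,
).fromNetwork(
  'https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it.litertlm',
).install();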

The Hardest Engineering Problem

One of the biggest challenges was multimodal audio processing.

Although Gemma 4 supports audio understanding conceptually, current LiteRT community weights lack fully fused audio execution graphs for mobile delegates.

Attempting native audio inference produced:

Failed to start streaming (code: 13)

After investigating Google’s AI Edge Gallery implementation, we discovered:

unsupported audio tensor routing,
delegate backend limitations,
and missing Flutter bindings for CPU/NPU graph splitting.

Instead of abandoning voice dreams entirely, we engineered a Secure Hybrid Loop:

Audio is transiently transcribed
No raw data is persisted
Transcription text returns immediately
Gemma 4 performs all psychological interpretation locally

This preserved the most sensitive part of the experience entirely on-device.

Subconscious RAG

Remora is not just a dream diary.

Over time, it becomes a semantic memory system for the user’s subconscious.

Dream entities are vectorized using embeddings:

characters,
emotions,
locations,
recurring symbols,
and narrative structures.

If a user repeatedly dreams about:

“A woman in a red coat”

…the system detects the recurring motif and surfaces psychological pattern insights over months or years.

This transforms dream logging from passive journaling into longitudinal subconscious analysis.
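
Motif detection of this kind usually comes down to comparing embedding vectors. The following is a minimal sketch of that comparison using cosine similarity in plain Dart. `cosineSimilarity` and `recurringMotifs` are hypothetical helpers, the 0.85 threshold is an illustrative default, and the sketch does not show how Remora produces or stores its embeddings.

```dart
import 'dart:math' as math;

/// Cosine similarity between two equal-length embedding vectors.
/// Returns 0.0 when either vector has zero length (norm).
double cosineSimilarity(List<double> a, List<double> b) {
  if (a.length != b.length) {
    throw ArgumentError('Vectors must have the same length');
  }
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}

/// Returns the labels of past entries whose embedding is at least
/// [threshold] similar to [query].
List<String> recurringMotifs(
  List<double> query,
  Map<String, List<double>> pastEntries, {
  double threshold = 0.85,
}) {
  return [
    for (final entry in pastEntries.entries)
      if (cosineSimilarity(query, entry.value) >= threshold) entry.key,
  ];
}
```

With real embeddings, a new dream mentioning a red coat would score close to earlier entries with the same motif, and the matching labels can then be passed to the local model as retrieved context.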

Dream Visualization

After local interpretation is complete, users can optionally generate dream artwork using Imagen 4.

The backend converts the interpreted dream into a surreal cinematic visual prompt and generates high-resolution dream imagery.

This creates a hybrid architecture:

Task	Location
Psychological analysis	On-device
Dream embeddings	On-device
Sensitive interpretation	On-device
Visual generation	Cloud
Community publishing	Cloud (opt-in)
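
The split in the table above can also be written down as an explicit policy that the app checks before any task touches the network. This is a hedged sketch in plain Dart. The enum names and `mayRun` are illustrative, and modelling community publishing as a cloud task is an assumption based on the feed living in the cloud layer.

```dart
/// Where a task is allowed to execute.
enum ExecutionSite { onDevice, cloud }

/// Tasks from the table above (names are illustrative).
enum DreamTask {
  psychologicalAnalysis,
  dreamEmbeddings,
  sensitiveInterpretation,
  visualGeneration,
  communityPublishing,
}

/// Routing policy mirroring the table.
const routing = <DreamTask, ExecutionSite>{
  DreamTask.psychologicalAnalysis: ExecutionSite.onDevice,
  DreamTask.dreamEmbeddings: ExecutionSite.onDevice,
  DreamTask.sensitiveInterpretation: ExecutionSite.onDevice,
  DreamTask.visualGeneration: ExecutionSite.cloud,
  DreamTask.communityPublishing: ExecutionSite.cloud,
};

/// On-device tasks always run. Cloud tasks run only after explicit opt-in.
bool mayRun(DreamTask task, {required bool userOptedIn}) =>
    routing[task] == ExecutionSite.onDevice || userOptedIn;
```

Keeping the policy in one place makes it harder for a new feature to send sensitive data off the device by accident.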

Community Layer

By default, every dream remains private.

Users may optionally anonymize and publish dreams to the Remora community feed, creating a surreal stream of humanity’s collective subconscious.

Other users can:

upvote bizarre dreams,
react to recurring themes,
or share dreams with therapists or friends.

This transforms deeply personal subconscious experiences into optional social storytelling.

Why Gemma 4 Matters

Before Gemma 4, building an app like Remora was largely impractical.

The model needed to be:

lightweight enough for smartphones,
capable of emotional nuance,
fast enough for real-time interaction,
and deployable through modern mobile inference stacks.

Gemma 4 E2B made that architecture possible.

It allowed us to move psychological AI away from centralized cloud systems and directly into the user’s pocket.

That shift fundamentally changes what privacy-first AI applications can become.

Future Work

We plan to expand Remora with:

native multimodal audio execution,
local image generation,
lucid dream detection,
and cross-dream narrative mapping.

As edge AI tooling matures, applications like Remora will increasingly blur the line between local software and personal AI companions.

Final Thoughts

Building Remora with Gemma 4 demonstrated something important:

Edge AI is no longer experimental.

For the first time, mobile devices are capable of delivering meaningful multimodal AI experiences while preserving user privacy by default.

That opens the door to an entirely new generation of personal AI applications.