ujja

Posted on Feb 28

Building a Privacy-First Mobile Speech Assistant Using Google Gemini

#devchallenge #geminireflections #gemini #llm

Built with Google Gemini: Writing Challenge

This is a submission for the Built with Google Gemini: Writing Challenge

What I Built with Google Gemini

I built a privacy-first mobile speech assistant designed to support people who stutter. The app focuses on fluency analysis, speech planning, roleplay practice, and pacing guidance, while keeping the architecture simple and user data under the user's control.

But this project did not begin with Gemini.

It began with a question:

Can I build a fully offline, LLM-powered mobile speech assistant?

Phase 1: The Offline Mobile LLM Experiment

Before Gemini entered the picture, I spent significant time exploring on-device inference for mobile.

I experimented with:

Quantized GGUF models
llama.cpp bridges in React Native
Native C++ integrations
On-device transcription pipelines
Fully offline speech + reasoning workflows

Technically, I got models running.

But in practice, I encountered serious constraints:

Model weights increased app size dramatically
Memory pressure on mid-tier Android devices
Latency spikes when speech recognition and LLM ran together
Complex EAS builds and JNI debugging
Difficult JS ↔ native boundary stability

Speech applications are highly latency-sensitive. Even a few seconds of delay breaks confidence during fluency practice.

At that point, I asked the community:

Has anyone shipped a production-ready offline LLM mobile app without heavy compromises?

The responses were honest:

Yes, but you compromise heavily
Quantization reduces quality
Great for demos, harder for real UX
Choose two: size, speed, or reasoning quality

Phase 2: Hackathon Build

After those discussions, I participated in a hackathon and built a web-based speech pacing assistant using:

Elm for predictable state management
Elixir + Phoenix Channels for real-time behavior
Local LLM inference
Rule-based fallback logic

That project, Pacemate, focused on:

Real-time speech analysis
AI-powered fluency guidance
Progress tracking

It was intentionally local-first and privacy-conscious.

The reception was strong. Developers appreciated:

The clarity of architecture
The hybrid fallback design
The focus on real user experience instead of AI novelty

That positive response made me revisit the earlier community feedback more seriously.

If offline mobile LLMs introduce friction, and speech apps demand low latency and stability, then maybe the right solution was not extreme offline purity.

That is when I began exploring Google Gemini 2.5 Flash.

Phase 3: Integrating Google Gemini

Instead of forcing everything on-device, I redesigned the mobile architecture to be hybrid and lightweight.

The current mobile stack:

React Native 0.81 + Expo SDK 54
Native speech recognition (Apple Speech Framework on iOS, Google Speech Recognizer on Android)
Google Generative AI SDK
Structured prompt architecture
Rule-based offline fallback

There is no mandatory backend.

The runtime flow:

User speaks.
Native speech recognition transcribes locally.
Transcript is sent directly to Gemini using a user-provided API key.
Gemini returns structured JSON feedback.
The app parses and renders categorized results.
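The last three steps of this flow can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the app's actual code: `ModelCall`, `analyzeTranscript`, `toSections`, and the section titles are hypothetical, and the model call is injected as a plain function so the example does not depend on any particular SDK signature.

```typescript
// Feedback shape that mirrors the JSON keys the prompt asks for.
interface FluencyFeedback {
  summary: string;
  strengths: string[];
  improvementAreas: string[];
  rephraseSuggestions: string[];
}

// One titled group of items, ready for a list component to render.
interface FeedbackSection {
  title: string;
  items: string[];
}

// Assumption: the SDK call is wrapped in a function that takes the prompt
// text and resolves to the model's raw text reply.
type ModelCall = (prompt: string) => Promise<string>;

// Turn parsed feedback into categorized sections, skipping empty ones.
function toSections(feedback: FluencyFeedback): FeedbackSection[] {
  const sections: FeedbackSection[] = [
    { title: "Summary", items: feedback.summary ? [feedback.summary] : [] },
    { title: "Strengths", items: feedback.strengths },
    { title: "Areas to improve", items: feedback.improvementAreas },
    { title: "Rephrase suggestions", items: feedback.rephraseSuggestions },
  ];
  return sections.filter((section) => section.items.length > 0);
}

// Send the transcript, read the JSON reply, and prepare it for rendering.
// Parsing is deliberately naive here; a malformed reply makes this reject.
async function analyzeTranscript(
  transcript: string,
  callModel: ModelCall
): Promise<FeedbackSection[]> {
  const prompt = `Analyze the following transcript for fluency.\n\n${transcript}`;
  const rawReply = await callModel(prompt);
  const feedback = JSON.parse(rawReply) as FluencyFeedback;
  return toSections(feedback);
}
```

Injecting `callModel` keeps the AI boundary to a single function, which also makes it easy to swap in a stub during tests.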

API keys are:

User-controlled
Stored locally via AsyncStorage
Never routed through my own server

This keeps the AI boundary thin: the only AI-related network call goes from the device straight to the Gemini API.

How Gemini Is Used Technically

Gemini is used as a structured reasoning engine, not a free-form chatbot.

Example fluency analysis prompt:

Analyze the following transcript for fluency.

Return strictly formatted JSON:

{
  "summary": "",
  "strengths": [],
  "improvementAreas": [],
  "rephraseSuggestions": []
}

Transcript:
"""
{userTranscript}
"""

Implementation safeguards:

Explicit schema constraints in prompt
Deterministic key naming
JSON-only instruction enforcement
Defensive TypeScript parsing
Automatic fallback if parsing fails
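The first three safeguards can be kept in one place by building the prompt from a small helper. The following is a minimal sketch, assuming a hypothetical `buildFluencyPrompt` function and `FEEDBACK_SCHEMA` constant. The triple-quote replacement is also an assumption about how to stop a transcript from closing the delimiter block early.

```typescript
// The JSON skeleton the model is asked to fill in.
// Key names are fixed here so they never vary between requests.
const FEEDBACK_SCHEMA = `{
  "summary": "",
  "strengths": [],
  "improvementAreas": [],
  "rephraseSuggestions": []
}`;

// Hypothetical helper: builds the fluency prompt for one transcript.
function buildFluencyPrompt(userTranscript: string): string {
  // Assumption: swap any triple quotes inside the transcript for single
  // quotes so the text cannot end the delimited block early.
  const safeTranscript = userTranscript.trim().replace(/"""/g, "'''");

  return [
    "Analyze the following transcript for fluency.",
    "",
    "Return strictly formatted JSON:",
    FEEDBACK_SCHEMA,
    "",
    "Transcript:",
    '"""',
    safeTranscript,
    '"""',
  ].join("\n");
}
```

Keeping the schema in a single constant means the prompt and the parser can be checked against the same key names.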

Gemini 2.5 Flash followed the requested structure consistently enough that I did not need a backend validation service.

That dramatically simplified the system.
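Without a backend validator, the defensive parsing safeguard carries most of the weight on the client. One way it might look is sketched below. The names `parseFluencyFeedback` and `toStringArray` are hypothetical, and the rule that a missing summary counts as a failed parse is an assumption.

```typescript
interface FluencyFeedback {
  summary: string;
  strengths: string[];
  improvementAreas: string[];
  rephraseSuggestions: string[];
}

// Keep only string entries; a non-array value becomes an empty list.
function toStringArray(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((item): item is string => typeof item === "string")
    : [];
}

// Returns null when the reply cannot be read, so the caller can fall back.
function parseFluencyFeedback(rawReply: string): FluencyFeedback | null {
  // Replies sometimes arrive wrapped in code fences or extra prose,
  // so take the text from the first "{" to the last "}".
  const start = rawReply.indexOf("{");
  const end = rawReply.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawReply.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return null;
  }

  const record = parsed as Record<string, unknown>;
  const summary = record["summary"];
  // Assumption: a reply without a string summary is treated as unusable.
  if (typeof summary !== "string") return null;

  return {
    summary,
    strengths: toStringArray(record["strengths"]),
    improvementAreas: toStringArray(record["improvementAreas"]),
    rephraseSuggestions: toStringArray(record["rephraseSuggestions"]),
  };
}
```

A `null` result is the signal to switch to the rule-based path, so a malformed reply never reaches the UI.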

Roleplay and Context Handling

For conversational practice scenarios:

Conversation history is stored locally.
Each turn is sent with prior context.
Gemini maintains continuity and supportive tone.

Example message structure:

[
  { role: "system", content: "You are supportive and concise." },
  { role: "user", content: "Hi, I’d like to order a coffee." }
]
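The structure above uses generic chat roles. The Gemini API labels conversation turns as "user" or "model" and takes the system prompt separately as a system instruction, so locally stored history needs a small mapping step before each request. Here is a minimal sketch of that step. `toGeminiRequest`, the `GeminiRequest` grouping, the "assistant" role for stored replies, and the 12-turn cap are all assumptions for illustration.

```typescript
// Generic chat message, matching the example structure above.
// Assumption: stored model replies use the role "assistant".
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Turn shape used by the Gemini API: roles are "user" or "model".
interface GeminiTurn {
  role: "user" | "model";
  parts: { text: string }[];
}

// Hypothetical grouping of the two pieces a request needs.
interface GeminiRequest {
  systemInstruction: string;
  contents: GeminiTurn[];
}

// Splits out system messages and keeps only the most recent turns so long
// practice sessions do not grow the request without bound.
function toGeminiRequest(history: ChatMessage[], maxTurns = 12): GeminiRequest {
  const systemInstruction = history
    .filter((message) => message.role === "system")
    .map((message) => message.content)
    .join("\n");

  const turns: GeminiTurn[] = [];
  for (const message of history) {
    if (message.role === "system") continue;
    turns.push({
      role: message.role === "user" ? "user" : "model",
      parts: [{ text: message.content }],
    });
  }

  // Assumption: keep the window starting on a user turn, since a
  // conversation that opens with a model turn may be rejected.
  let recent = turns.slice(-maxTurns);
  while (recent.length > 0 && recent[0]?.role === "model") {
    recent = recent.slice(1);
  }

  return { systemInstruction, contents: recent };
}
```

Capping the window trades some long-range context for predictable request size, which matters on free-tier quotas.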

The challenge was balancing realism with guidance. I constrained prompts to:

Avoid overly verbose responses
Maintain conversational authenticity
Provide subtle, not overwhelming, fluency feedback

Gemini handled tone surprisingly well for this sensitive use case.

Hybrid Architecture: Cloud + Fallback

Rather than choosing cloud or offline exclusively, I implemented both.

When:

Internet is available
API key is configured

→ Gemini powers:

Fluency analysis
Sentence rephrasing
Roleplay interactions

When:

Offline
No API key
Rate limit exceeded

→ Rule-based fallback activates.

Fallback includes:

Basic pacing heuristics
Template-based conversational responses
Simple word substitutions
Encouragement patterns
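A basic pacing heuristic of this kind might be as simple as a words-per-minute check. The sketch below is illustrative only: `assessPace`, the tip wording, and both thresholds are assumptions, not values taken from the app or from clinical guidance.

```typescript
type PaceLabel = "slow" | "comfortable" | "fast";

interface PaceResult {
  wordsPerMinute: number;
  label: PaceLabel;
  tip: string;
}

// Illustrative thresholds only (assumptions, not validated values).
const SLOW_BELOW_WPM = 100;
const FAST_ABOVE_WPM = 160;

// Estimates speaking rate from a transcript and its recording duration.
// Returns null when there is nothing to measure.
function assessPace(transcript: string, durationSeconds: number): PaceResult | null {
  const words = transcript.trim().split(/\s+/).filter((word) => word.length > 0);
  if (words.length === 0 || durationSeconds <= 0) return null;

  const wordsPerMinute = Math.round((words.length / durationSeconds) * 60);

  if (wordsPerMinute > FAST_ABOVE_WPM) {
    return { wordsPerMinute, label: "fast", tip: "Try a short pause between phrases." };
  }
  if (wordsPerMinute < SLOW_BELOW_WPM) {
    return { wordsPerMinute, label: "slow", tip: "Steady is fine. Keep a pace that feels comfortable." };
  }
  return { wordsPerMinute, label: "comfortable", tip: "This pace is easy to follow." };
}
```

Because it needs only the transcript and a timer, a check like this runs entirely on-device and returns instantly.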

The app never hard-fails.

Graceful degradation became a core design principle.
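The switching rule described above fits in a few lines. This is a minimal sketch, assuming hypothetical names (`AppStatus`, `selectMode`, `getFeedback`) and assuming that any error on the cloud path should also trigger the fallback.

```typescript
// Hypothetical snapshot of the conditions that decide which path runs.
interface AppStatus {
  isOnline: boolean;
  hasApiKey: boolean;
  isRateLimited: boolean;
}

type FeedbackMode = "gemini" | "fallback";

// Gemini is used only when every precondition holds.
function selectMode(status: AppStatus): FeedbackMode {
  return status.isOnline && status.hasApiKey && !status.isRateLimited
    ? "gemini"
    : "fallback";
}

// Runs the cloud path when allowed and drops to the rule-based path on any
// failure, so the caller always receives a result.
async function getFeedback<T>(
  status: AppStatus,
  cloudPath: () => Promise<T>,
  fallbackPath: () => T
): Promise<T> {
  if (selectMode(status) === "fallback") return fallbackPath();
  try {
    return await cloudPath();
  } catch {
    // Assumption: network errors, rate-limit errors, and parse failures
    // are all handled the same way.
    return fallbackPath();
  }
}
```

Routing every request through one function like this is what keeps the "never hard-fails" guarantee easy to reason about.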

Demo

Conceptually, the mobile experience works like this:

User speaks into the device.
Real-time transcription appears.
Within ~2 seconds, structured fluency feedback is displayed.
User can switch to roleplay mode for interactive practice.
Offline mode still provides guidance without AI.

Total app size remains around 50 MB because no model weights are bundled.

There is no backend deployment required.

As the app is still under development and currently running in a stealth environment, I’ve recorded a demo video instead.

What I Learned

This journey reshaped how I think about mobile AI systems.

Technical lessons:

Offline LLMs are viable but operationally heavy on mobile.
Quantization reduces size but impacts reasoning quality.
Latency matters more than locality in speech applications.
Careful prompt design can replace some backend services, such as a dedicated validation layer.
Structured schema prompts significantly improve reliability.

Architectural lessons:

Community insight can accelerate decision-making.
Hackathon validation can clarify real-world constraints.
Hybrid systems are often more resilient than extreme designs.
Engineering maturity is about choosing the right complexity level.

Google Gemini Feedback

What Worked Well

Strong instruction following with structured JSON outputs
Fast response times suitable for speech workflows
Supportive and adaptive tone
Clean SDK integration
No infrastructure management overhead

Gemini 2.5 Flash offered a practical balance between speed and reasoning depth.

Where I Experienced Friction

Occasional JSON formatting drift requiring defensive parsing
Free-tier rate limits needing UX consideration
Internet dependency for advanced features
User onboarding for API key setup

However, compared to embedding large LLMs directly into a mobile app, these trade-offs were manageable.

This project started with an attempt to push fully offline mobile LLMs.

After community feedback, a hackathon build, and real-world validation, it evolved into a privacy-first hybrid architecture powered by Google Gemini.

The biggest lesson was not about compression or inference tricks.

It was about architectural judgment.

Sometimes the most advanced system is the one with the fewest moving parts.

Top comments (14)

Benjamin Nguyen • Feb 28

Really nice! I really enjoyed reading about your project. Have you tried Gemini 3 yet?

ujja • Mar 1

Thanks, Benjamin. I am still using 2.5 Flash and have yet to try 3.

Benjamin Nguyen • Mar 1

Ok!

ujja • Mar 1

Yup, yet to try. Will try soon though. I am trying out different Gemini alternatives, to be honest.

Benjamin Nguyen • Mar 1

Nice! I find Gemini 2.5 Flash a bit better than Gemini 3 Flash because Gemini 2.5 Flash gives you more tokens, so you will not run out of them.

ujja • Mar 2

Yeah, heard this complaint from many.

Benjamin Nguyen • Mar 2 • Edited

Oh wow! Someone mentioned the problem with Gemini 3 Flash to me in the comment section of my latest project last week. I ran out of tokens with Gemini 3 and had to wait about a month before I could use Gemini 3 Flash again.

ujja • Mar 2

Yeah, that's definitely a downside. I've heard similar complaints from Claude users too, especially those who aren't on an enterprise plan.

Benjamin Nguyen • Mar 2

Oh wow! I did not know that about Claude either.

ujja • Mar 2

Yes, Claude can get expensive pretty quickly 😄

Benjamin Nguyen • Mar 3

ah!

Dejan • Mar 2

I sincerely congratulate you on your success. However, I believe that to be 100% secure, you need to eliminate third-party subscriptions, such as APIs, to utilize the privacy policy. This means using your own models. These models can be made lightweight using TRANSFORMER, and the parts that avoid the lengthy explanations you mention can be eliminated through custom learning. I believe this problem can be solved by clearly defining roles. Anyway, thank you for sharing your experience, and I wish you continued success.

ujja • Mar 2

Yes, definitely. And that is the goal. I just want to give users the option. I am thinking of keeping this configurable, so users are free to use whatever they like.

Matsync • Mar 4

Thanks!
