DEV Community

ujja

Building a Privacy-First Mobile Speech Assistant Using Google Gemini

This is a submission for the Built with Google Gemini: Writing Challenge

What I Built with Google Gemini

I built a privacy-first mobile speech assistant designed to support people who stutter. The app focuses on fluency analysis, speech planning, roleplay practice, and pacing guidance, while keeping the architecture simple and user data under the user's control.

But this project did not begin with Gemini.

It began with a question:

Can I build a fully offline, LLM-powered mobile speech assistant?

Phase 1: The Offline Mobile LLM Experiment

Before Gemini entered the picture, I spent significant time exploring on-device inference for mobile.

I experimented with:

  • Quantized GGUF models

  • llama.cpp bridges in React Native

  • Native C++ integrations

  • On-device transcription pipelines

  • Fully offline speech + reasoning workflows

Technically, I got models running.

But in practice, I encountered serious constraints:

  • Model weights increased app size dramatically

  • Memory pressure on mid-tier Android devices

  • Latency spikes when speech recognition and LLM ran together

  • Complex EAS builds and JNI debugging

  • Difficult JS ↔ native boundary stability

Speech applications are highly latency-sensitive. Even a few seconds of delay breaks confidence during fluency practice.

At that point, I asked the community:

Has anyone shipped a production-ready offline LLM mobile app without heavy compromises?

The responses were honest:

  • Yes, but you compromise heavily

  • Quantization reduces quality

  • Great for demos, harder for real UX

  • Choose two: size, speed, or reasoning quality

Phase 2: Hackathon Build

After those discussions, I participated in a hackathon and built a web-based speech pacing assistant using:

  • Elm for predictable state management

  • Elixir + Phoenix Channels for real-time behavior

  • Local LLM inference

  • Rule-based fallback logic

That project, Pacemate, focused on:

  • Real-time speech analysis

  • AI-powered fluency guidance

  • Progress tracking

It was intentionally local-first and privacy-conscious.

The reception was strong. Developers appreciated:

  • The clarity of architecture

  • The hybrid fallback design

  • The focus on real user experience instead of AI novelty

That positive response made me revisit the earlier community feedback more seriously.

If offline mobile LLMs introduce friction, and speech apps demand low latency and stability, then maybe the right solution was not a strictly offline design.

That is when I began exploring Google Gemini 2.5 Flash.

Phase 3: Integrating Google Gemini

Instead of forcing everything on-device, I redesigned the mobile architecture to be hybrid and lightweight.

The current mobile stack:

  • React Native 0.81 + Expo SDK 54

  • Native speech recognition (Apple Speech Framework on iOS, Google Speech Recognizer on Android)

  • Google Generative AI SDK

  • Structured prompt architecture

  • Rule-based offline fallback

There is no mandatory backend.

The runtime flow:

  1. User speaks.

  2. Native speech recognition transcribes locally.

  3. Transcript is sent directly to Gemini using a user-provided API key.

  4. Gemini returns structured JSON feedback.

  5. The app parses and renders categorized results.

API keys are:

  • User-controlled

  • Stored locally via AsyncStorage

  • Never routed through my own server
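The key handling described above can be sketched as a small wrapper around a key-value store. This is a minimal sketch, not the app's actual code: the `KeyValueStore` interface mirrors the promise-based `getItem`/`setItem`/`removeItem` methods of AsyncStorage, while the storage key name, the helper names, and the in-memory store are assumptions made for illustration.

```typescript
// Subset of the AsyncStorage API used here (promise-based get/set/remove).
interface KeyValueStore {
  getItem(key: string): Promise<string | null>;
  setItem(key: string, value: string): Promise<void>;
  removeItem(key: string): Promise<void>;
}

// Hypothetical storage key; the post does not name the real one.
const API_KEY_STORAGE_KEY = "gemini_api_key";

async function saveApiKey(store: KeyValueStore, rawKey: string): Promise<void> {
  const trimmed = rawKey.trim();
  if (trimmed.length === 0) {
    throw new Error("API key must not be empty");
  }
  await store.setItem(API_KEY_STORAGE_KEY, trimmed);
}

// Returns null when no usable key is stored, so callers can switch to fallback.
async function loadApiKey(store: KeyValueStore): Promise<string | null> {
  const value = await store.getItem(API_KEY_STORAGE_KEY);
  return value !== null && value.length > 0 ? value : null;
}

async function clearApiKey(store: KeyValueStore): Promise<void> {
  await store.removeItem(API_KEY_STORAGE_KEY);
}

// In-memory stand-in so the sketch runs outside React Native.
function createMemoryStore(): KeyValueStore {
  const data = new Map<string, string>();
  return {
    getItem: async (key) => data.get(key) ?? null,
    setItem: async (key, value) => {
      data.set(key, value);
    },
    removeItem: async (key) => {
      data.delete(key);
    },
  };
}
```

In the app, the AsyncStorage module itself would be passed as the `store` argument; the key only ever travels from local storage into the Gemini request.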

This keeps the AI boundary thin: the only network hop is from the device to the Gemini API.

How Gemini Is Used Technically

Gemini is used as a structured reasoning engine, not a free-form chatbot.

Example fluency analysis prompt:

Analyze the following transcript for fluency.

Return strictly formatted JSON:

{
  "summary": "",
  "strengths": [],
  "improvementAreas": [],
  "rephraseSuggestions": []
}

Transcript:

"""
{userTranscript}
"""
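One way to assemble that prompt in code is a small builder function. This is a sketch under assumptions: `buildFluencyPrompt` and `FLUENCY_SCHEMA` are hypothetical names, and collapsing runs of quotes in the transcript is an added precaution that the post does not describe.

```typescript
// Schema text copied from the prompt; the key names are the contract with the parser.
const FLUENCY_SCHEMA = `{
  "summary": "",
  "strengths": [],
  "improvementAreas": [],
  "rephraseSuggestions": []
}`;

function buildFluencyPrompt(userTranscript: string): string {
  // Collapse any run of three or more quotes so the transcript
  // cannot close the """ delimiter early (assumed precaution).
  const safeTranscript = userTranscript.trim().replace(/"{3,}/g, '"');

  return [
    "Analyze the following transcript for fluency.",
    "",
    "Return strictly formatted JSON:",
    "",
    FLUENCY_SCHEMA,
    "",
    "Transcript:",
    "",
    '"""',
    safeTranscript,
    '"""',
  ].join("\n");
}
```

Keeping the schema in a single constant means the prompt and the parsing code can be checked against the same key names.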

Implementation safeguards:

  • Explicit schema constraints in prompt

  • Deterministic key naming

  • JSON-only instruction enforcement

  • Defensive TypeScript parsing

  • Automatic fallback if parsing fails
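The last two safeguards can be sketched as a parser that returns `null` whenever the reply does not match the expected shape, which is the signal to use the fallback. The type and function names are hypothetical, and extracting the outermost `{...}` span is one possible way to tolerate code fences or stray prose around the JSON; the app's real parser may differ.

```typescript
interface FluencyFeedback {
  summary: string;
  strengths: string[];
  improvementAreas: string[];
  rephraseSuggestions: string[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Take the outermost {...} span, discarding anything around it.
function extractJsonObject(raw: string): string | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  return start !== -1 && end > start ? raw.slice(start, end + 1) : null;
}

// Returns null on any mismatch so the caller can fall back to rules.
function parseFluencyFeedback(raw: string): FluencyFeedback | null {
  const candidate = extractJsonObject(raw);
  if (candidate === null) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(candidate);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const { summary, strengths, improvementAreas, rephraseSuggestions } =
    parsed as Record<string, unknown>;
  if (
    typeof summary !== "string" ||
    !isStringArray(strengths) ||
    !isStringArray(improvementAreas) ||
    !isStringArray(rephraseSuggestions)
  ) {
    return null;
  }
  return { summary, strengths, improvementAreas, rephraseSuggestions };
}
```

A caller would treat `null` the same way as a network failure and render the rule-based result instead.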

Gemini 2.5 Flash followed the requested structure consistently enough that I did not need a backend validation service.

That dramatically simplified the system.

Roleplay and Context Handling

For conversational practice scenarios:

  • Conversation history is stored locally.

  • Each turn is sent with prior context.

  • Gemini maintains continuity and supportive tone.

Example message structure:

[
  { role: "system", content: "You are supportive and concise." },
  { role: "user", content: "Hi, I’d like to order a coffee." }
]
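The message list above reads like an app-level format. As far as I understand the Gemini API, it takes the system instruction separately from the conversation and labels turns as `user` or `model`, so a mapping step along these lines may be needed. This is a sketch only: the `assistant` role for stored replies, the type names, and `toGeminiPayload` are assumptions, and the payload is built as plain data rather than through the SDK.

```typescript
// Locally stored message; "assistant" is an assumed label for Gemini's replies.
interface LocalMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// One conversation turn in the shape the Gemini API expects.
interface GeminiContent {
  role: "user" | "model";
  parts: { text: string }[];
}

interface GeminiChatPayload {
  systemInstruction: string;
  contents: GeminiContent[];
}

function toGeminiPayload(history: LocalMessage[]): GeminiChatPayload {
  // System messages become a single instruction string.
  const systemInstruction = history
    .filter((message) => message.role === "system")
    .map((message) => message.content)
    .join("\n");

  // Every other message is sent as a turn, in its original order.
  const contents = history
    .filter((message) => message.role !== "system")
    .map((message): GeminiContent => ({
      role: message.role === "assistant" ? "model" : "user",
      parts: [{ text: message.content }],
    }));

  return { systemInstruction, contents };
}
```

Because the full local history is mapped on every turn, the model sees the prior context each time without any server-side session.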

The challenge was balancing realism with guidance. I constrained prompts to:

  • Avoid overly verbose responses

  • Maintain conversational authenticity

  • Provide subtle, not overwhelming, fluency feedback

Gemini handled tone surprisingly well for this sensitive use case.

Hybrid Architecture: Cloud + Fallback

Rather than choosing cloud or offline exclusively, I implemented both.

When both of these hold:

  • Internet is available

  • API key is configured

→ Gemini powers:

  • Fluency analysis

  • Sentence rephrasing

  • Roleplay interactions

When any of these applies:

  • Offline

  • No API key

  • Rate limit exceeded

→ Rule-based fallback activates.
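The routing rule above can be written as one pure function. The names `RuntimeStatus`, `ModeDecision`, and `selectMode` are hypothetical, and how the app actually detects connectivity or rate limiting is not covered in this sketch.

```typescript
interface RuntimeStatus {
  isOnline: boolean;
  apiKey: string | null;
  rateLimited: boolean;
}

// The fallback branch carries a reason so the UI can explain what happened.
type ModeDecision =
  | { mode: "gemini" }
  | { mode: "fallback"; reason: "offline" | "no_api_key" | "rate_limited" };

function selectMode(status: RuntimeStatus): ModeDecision {
  if (!status.isOnline) {
    return { mode: "fallback", reason: "offline" };
  }
  if (status.apiKey === null || status.apiKey.trim().length === 0) {
    return { mode: "fallback", reason: "no_api_key" };
  }
  if (status.rateLimited) {
    return { mode: "fallback", reason: "rate_limited" };
  }
  return { mode: "gemini" };
}
```

Returning a reason alongside the mode lets the interface show a different hint for a missing key than for a lost connection.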

Fallback includes:

  • Basic pacing heuristics

  • Template-based conversational responses

  • Simple word substitutions

  • Encouragement patterns
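A basic pacing heuristic of the kind listed above could be as simple as a words-per-minute estimate paired with an encouragement message. This is a minimal sketch, not the app's rule set: the thresholds, labels, and messages are placeholder assumptions chosen for illustration.

```typescript
type PaceLabel = "slow" | "comfortable" | "fast";

interface PacingResult {
  wordsPerMinute: number;
  label: PaceLabel;
  encouragement: string;
}

// Hypothetical thresholds; real values would need tuning with users.
const SLOW_BELOW_WPM = 90;
const FAST_ABOVE_WPM = 160;

// Placeholder messages standing in for the app's encouragement patterns.
const ENCOURAGEMENT: Record<PaceLabel, string> = {
  slow: "Steady pace. Take the time you need.",
  comfortable: "Nice, relaxed pace. Keep going.",
  fast: "Try a short pause between phrases.",
};

function countWords(transcript: string): number {
  const trimmed = transcript.trim();
  return trimmed.length === 0 ? 0 : trimmed.split(/\s+/).length;
}

// Returns null when there is nothing to measure.
function analyzePacing(transcript: string, durationSeconds: number): PacingResult | null {
  const words = countWords(transcript);
  if (words === 0 || !(durationSeconds > 0)) return null;

  const wordsPerMinute = Math.round((words / durationSeconds) * 60);
  let label: PaceLabel = "comfortable";
  if (wordsPerMinute < SLOW_BELOW_WPM) {
    label = "slow";
  } else if (wordsPerMinute > FAST_ABOVE_WPM) {
    label = "fast";
  }
  return { wordsPerMinute, label, encouragement: ENCOURAGEMENT[label] };
}
```

Since this needs only the transcript and the recording duration, it works with no network and no API key.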

The app never hard-fails.

Graceful degradation became a core design principle.

Demo

Conceptually, the mobile experience works like this:

  • User speaks into the device.

  • Real-time transcription appears.

  • Within ~2 seconds, structured fluency feedback is displayed.

  • User can switch to roleplay mode for interactive practice.

  • Offline mode still provides guidance without AI.

Total app size remains around 50 MB because no model weights are bundled.

There is no backend deployment required.

As the app is still under development and currently running in a stealth environment, I’ve recorded a demo video instead.

What I Learned

This journey reshaped how I think about mobile AI systems.

Technical lessons:

  • Offline LLMs are viable but operationally heavy on mobile.

  • Quantization reduces size but impacts reasoning quality.

  • Latency matters more than locality in speech applications.

  • Prompt engineering can replace some backend services, such as the validation service I no longer needed.

  • Structured schema prompts significantly improve reliability.

Architectural lessons:

  • Community insight can accelerate decision-making.

  • Hackathon validation can clarify real-world constraints.

  • Hybrid systems are often more resilient than extreme designs.

  • Engineering maturity is about choosing the right complexity level.

Google Gemini Feedback

What Worked Well

  • Strong instruction following with structured JSON outputs

  • Fast response times suitable for speech workflows

  • Supportive and adaptive tone

  • Clean SDK integration

  • No infrastructure management overhead

Gemini 2.5 Flash offered a practical balance between speed and reasoning depth.

Where I Experienced Friction

  • Occasional JSON formatting drift requiring defensive parsing

  • Free-tier rate limits needing UX consideration

  • Internet dependency for advanced features

  • User onboarding for API key setup

However, compared to embedding large LLMs directly into a mobile app, these trade-offs were manageable.

This project started with an attempt to push fully offline mobile LLMs.

After community feedback, a hackathon build, and real-world validation, it evolved into a privacy-first hybrid architecture powered by Google Gemini.

The biggest lesson was not about compression or inference tricks.

It was about architectural judgment.

Sometimes the most advanced system is the one with the fewest moving parts.
