This is a submission for the Built with Google Gemini: Writing Challenge
What I Built with Google Gemini
I built a privacy-first mobile speech assistant designed to support people who stutter. The app focuses on fluency analysis, speech planning, roleplay practice, and pacing guidance — while keeping architecture simple and user data controlled.
But this project did not begin with Gemini.
It began with a question:
Can I build a fully offline, LLM-powered mobile speech assistant?
Phase 1: The Offline Mobile LLM Experiment
Before Gemini entered the picture, I spent significant time exploring on-device inference for mobile.
I experimented with:
Quantized GGUF models
llama.cpp bridges in React Native
Native C++ integrations
On-device transcription pipelines
Fully offline speech + reasoning workflows
Technically, I got models running.
But in practice, I encountered serious constraints:
Model weights increased app size dramatically
Memory pressure on mid-tier Android devices
Latency spikes when speech recognition and LLM ran together
Complex EAS builds and JNI debugging
Instability at the JS ↔ native boundary
Speech applications are highly latency-sensitive. Even a few seconds of delay breaks confidence during fluency practice.
At that point, I asked the community:
Has anyone shipped a production-ready offline LLM mobile app without heavy compromises?
The responses were honest:
Yes, but you compromise heavily
Quantization reduces quality
Great for demos, harder for real UX
Choose two: size, speed, or reasoning quality
Phase 2: Hackathon Build
After those discussions, I participated in a hackathon and built a web-based speech pacing assistant using:
Elm for predictable state management
Elixir + Phoenix Channels for real-time behavior
Local LLM inference
Rule-based fallback logic
That project, Pacemate, focused on:
Real-time speech analysis
AI-powered fluency guidance
Progress tracking
It was intentionally local-first and privacy-conscious.
The reception was strong. Developers appreciated:
The clarity of architecture
The hybrid fallback design
The focus on real user experience instead of AI novelty
That positive response made me revisit the earlier community feedback more seriously.
If offline mobile LLMs introduce friction, and speech apps demand low latency and stability — maybe the right solution was not extreme offline purity.
That is when I began exploring Google Gemini 2.5 Flash.
Phase 3: Integrating Google Gemini
Instead of forcing everything on-device, I redesigned the mobile architecture to be hybrid and lightweight.
The current mobile stack:
React Native 0.81 + Expo SDK 54
Native speech recognition (Apple Speech Framework on iOS, Google Speech Recognizer on Android)
Google Generative AI SDK
Structured prompt architecture
Rule-based offline fallback
There is no mandatory backend.
The runtime flow:
User speaks.
Native speech recognition transcribes locally.
Transcript is sent directly to Gemini using a user-provided API key.
Gemini returns structured JSON feedback.
The app parses and renders categorized results.
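The request side of that flow can be sketched in TypeScript. In the real app the `generate` function would wrap the Google Generative AI SDK; here it is injected as a parameter so the sketch stays self-contained and testable without network access. All names are illustrative, not the app's actual code.

```typescript
// Sketch of step 3 of the flow: turn a local transcript into a
// structured prompt and hand it to a model-calling function.
type GenerateFn = (prompt: string) => Promise<string>;

// Build the schema-constrained fluency prompt for one transcript.
function buildFluencyPrompt(userTranscript: string): string {
  return [
    "Analyze the following transcript for fluency.",
    "Return strictly formatted JSON:",
    '{ "summary": "", "strengths": [], "improvementAreas": [], "rephraseSuggestions": [] }',
    "Transcript:",
    '"""',
    userTranscript,
    '"""',
  ].join("\n");
}

// In production, `generate` would call Gemini via the SDK; injecting
// it keeps the boundary thin and the logic easy to test offline.
async function analyzeFluency(
  transcript: string,
  generate: GenerateFn
): Promise<string> {
  return generate(buildFluencyPrompt(transcript));
}
```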
API keys are:
User-controlled
Stored locally via AsyncStorage
Never routed through my own server
This keeps the AI boundary extremely thin.
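A minimal sketch of that key handling, assuming a small wrapper around the subset of the AsyncStorage API the app needs (`getItem` / `setItem` / `removeItem`). The in-memory implementation stands in for AsyncStorage so the sketch runs anywhere; the storage key name is illustrative.

```typescript
// Interface mirroring the AsyncStorage methods used for key storage.
interface KeyValueStore {
  getItem(key: string): Promise<string | null>;
  setItem(key: string, value: string): Promise<void>;
  removeItem(key: string): Promise<void>;
}

const API_KEY_STORAGE_KEY = "gemini_api_key"; // illustrative name

// Save the user's key locally; it never touches any server of mine.
async function saveApiKey(store: KeyValueStore, key: string): Promise<void> {
  await store.setItem(API_KEY_STORAGE_KEY, key.trim());
}

async function loadApiKey(store: KeyValueStore): Promise<string | null> {
  return store.getItem(API_KEY_STORAGE_KEY);
}

// In-memory stand-in for AsyncStorage (useful for tests).
function memoryStore(): KeyValueStore {
  const map = new Map<string, string>();
  return {
    async getItem(k) { return map.has(k) ? map.get(k)! : null; },
    async setItem(k, v) { map.set(k, v); },
    async removeItem(k) { map.delete(k); },
  };
}
```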
How Gemini Is Used Technically
Gemini is used as a structured reasoning engine, not a free-form chatbot.
Example fluency analysis prompt:
```
Analyze the following transcript for fluency.
Return strictly formatted JSON:

{
  "summary": "",
  "strengths": [],
  "improvementAreas": [],
  "rephraseSuggestions": []
}

Transcript:
"""
{userTranscript}
"""
```
Implementation safeguards:
Explicit schema constraints in prompt
Deterministic key naming
JSON-only instruction enforcement
Defensive TypeScript parsing
Automatic fallback if parsing fails
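The parsing safeguards above can be sketched as a single defensive function: strip any markdown fences the model occasionally wraps around its JSON, parse, validate the deterministic key names, and return a safe default if anything is off. The fallback text and type names are illustrative.

```typescript
interface FluencyFeedback {
  summary: string;
  strengths: string[];
  improvementAreas: string[];
  rephraseSuggestions: string[];
}

// Safe default used whenever the response cannot be trusted.
const FALLBACK_FEEDBACK: FluencyFeedback = {
  summary: "We couldn't analyze this attempt. Keep practicing at your own pace.",
  strengths: [],
  improvementAreas: [],
  rephraseSuggestions: [],
};

function parseFluencyFeedback(raw: string): FluencyFeedback {
  // Handle occasional formatting drift: remove ```json ... ``` wrappers.
  const cleaned = raw
    .replace(/^\s*```(?:json)?/i, "")
    .replace(/```\s*$/, "")
    .trim();
  try {
    const data = JSON.parse(cleaned);
    // Enforce the deterministic schema before trusting the shape.
    const ok =
      typeof data.summary === "string" &&
      Array.isArray(data.strengths) &&
      Array.isArray(data.improvementAreas) &&
      Array.isArray(data.rephraseSuggestions);
    return ok ? (data as FluencyFeedback) : FALLBACK_FEEDBACK;
  } catch {
    return FALLBACK_FEEDBACK;
  }
}
```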
Gemini 2.5 Flash followed the structured format consistently enough that I eliminated the need for a backend validation service.
That dramatically simplified the system.
Roleplay and Context Handling
For conversational practice scenarios:
Conversation history is stored locally.
Each turn is sent with prior context.
Gemini maintains continuity and supportive tone.
Example message structure:
```
[
  { role: "system", content: "You are supportive and concise." },
  { role: "user", content: "Hi, I’d like to order a coffee." }
]
```
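Assembling each turn can be sketched as a pure function: keep the locally stored history, trim it to a recent window so prompts stay small, and prepend the system message. The role names mirror the example above; the window size and function names are assumptions for illustration.

```typescript
interface ChatMessage {
  role: "system" | "user" | "model";
  content: string;
}

const SYSTEM_MESSAGE: ChatMessage = {
  role: "system",
  content: "You are supportive and concise.",
};

// Build the message list for one turn: system message first, then the
// last `window` stored turns, then the user's new utterance.
function buildTurnMessages(
  history: ChatMessage[],
  userUtterance: string,
  window = 10
): ChatMessage[] {
  const recent = history.slice(-window);
  return [SYSTEM_MESSAGE, ...recent, { role: "user", content: userUtterance }];
}
```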
The challenge was balancing realism with guidance. I constrained prompts to:
Avoid overly verbose responses
Maintain conversational authenticity
Provide subtle, not overwhelming, fluency feedback
Gemini handled tone surprisingly well for this sensitive use case.
Hybrid Architecture: Cloud + Fallback
Rather than choosing cloud or offline exclusively, I implemented both.
When internet is available and an API key is configured, Gemini powers:
Fluency analysis
Sentence rephrasing
Roleplay interactions
When the device is offline, no API key is configured, or the rate limit is exceeded, the rule-based fallback activates.
Fallback includes:
Basic pacing heuristics
Template-based conversational responses
Simple word substitutions
Encouragement patterns
The app never hard-fails.
Graceful degradation became a core design principle.
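The routing rule and one of the fallback heuristics can be sketched like this. Gemini handles a request only when the device is online, a key is configured, and the rate limit has not been hit; otherwise rule-based guidance runs. The pacing threshold and all names here are illustrative assumptions.

```typescript
interface EngineState {
  online: boolean;
  hasApiKey: boolean;
  rateLimited: boolean;
}

// Decide which engine serves this request; the app never hard-fails.
function chooseEngine(state: EngineState): "gemini" | "fallback" {
  return state.online && state.hasApiKey && !state.rateLimited
    ? "gemini"
    : "fallback";
}

// Basic pacing heuristic from the fallback path: flag speech faster
// than a comfortable words-per-minute rate (threshold is illustrative).
function pacingHint(wordCount: number, seconds: number, maxWpm = 150): string {
  const wpm = (wordCount / seconds) * 60;
  return wpm > maxWpm
    ? "Try slowing down a little and adding short pauses."
    : "Nice, steady pace. Keep it up.";
}
```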
Demo
Conceptually, the mobile experience works like this:
User speaks into the device.
Real-time transcription appears.
Within ~2 seconds, structured fluency feedback is displayed.
User can switch to roleplay mode for interactive practice.
Offline mode still provides guidance without AI.
Total app size remains around 50MB because no model weights are bundled.
There is no backend deployment required.
As the app is still under development and currently in stealth, I’ve recorded a demo video instead.
What I Learned
This journey reshaped how I think about mobile AI systems.
Technical lessons:
Offline LLMs are viable but operationally heavy on mobile.
Quantization reduces size but impacts reasoning quality.
Latency matters more than locality in speech applications.
Prompt engineering can replace backend microservices.
Structured schema prompts significantly improve reliability.
Architectural lessons:
Community insight can accelerate decision-making.
Hackathon validation can clarify real-world constraints.
Hybrid systems are often more resilient than extreme designs.
Engineering maturity is about choosing the right complexity level.
Google Gemini Feedback
What Worked Well
Strong instruction following with structured JSON outputs
Fast response times suitable for speech workflows
Supportive and adaptive tone
Clean SDK integration
No infrastructure management overhead
Gemini 2.5 Flash offered a practical balance between speed and reasoning depth.
Where I Experienced Friction
Occasional JSON formatting drift requiring defensive parsing
Free-tier rate limits needing UX consideration
Internet dependency for advanced features
User onboarding for API key setup
However, compared to embedding large LLMs directly into a mobile app, these trade-offs were manageable.
This project started with an attempt to push fully offline mobile LLMs.
After community feedback, a hackathon build, and real-world validation, it evolved into a privacy-first hybrid architecture powered by Google Gemini.
The biggest lesson was not about compression or inference tricks.
It was about architectural judgment.
Sometimes the most advanced system is the one with the fewest moving parts.