So here’s the thing—I’m not a mobile dev by trade.
I mostly work on backend systems, build APIs, and mess around with servers and machine learning. One of my pet projects has been a web-based, AI-powered speech coaching assistant — it listens to your voice, analyses your pacing and fluency, rephrases hesitant speech into confident phrasing, and even simulates conversations.
The vision? To bring that same assistant to mobile: fully offline, no network, and no server to talk to. Just a pocket-sized fluency mentor that could listen to your voice, give you feedback, rephrase your sentences, and even roleplay a conversation, all on-device.
How hard could it be?
Well… let’s talk about that.
✅ The Web Version Was Easy
I already had a working version on the web: a FastAPI backend running Whisper for transcription and Ollama serving phi-3 for LLM tasks like pacing advice, rephrasing, and generating roleplay responses.
The frontend just sent API calls and got back results. Everything was modular, clean, and fast.
So naturally, I thought, “Let’s just reuse that backend and ship it with the mobile app.”
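For a sense of how thin the web client was, a feature like fluency checking was basically one fetch call. Something like this (the POST /fluency route is the real one; the host, request body, and response field are just illustrative):

// Roughly how the web frontend talked to the FastAPI backend.
// Only the /fluency path matches the real API; the rest is a sketch.
async function checkFluency(sentence: string): Promise<string> {
  const res = await fetch('https://my-backend.example.com/fluency', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: sentence }),
  });
  const data = await res.json();
  return data.feedback; // pacing and fluency advice generated by phi-3 via Ollama
}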
🚧 Attempt 1: Bundle the Backend and Ollama
This was my first dead end.
Turns out, you can’t just run a Python web server inside a mobile app. iOS and Android are not happy to host FastAPI, nor do they allow launching binaries like Ollama from inside the app bundle. Even trying to force it with tools like BeeWare was met with sandbox restrictions, resource constraints, and general disapproval from the mobile OS gods.
Okay, cool. Lesson learned.
🔄 Attempt 2: Rebuild Everything Natively for Mobile
With the backend out of the picture, I decided to take a new approach:
- Use Whisper locally for speech-to-text
- Use a quantized LLM (like phi-3 or TinyLlama) for inference
- Skip the server entirely and move all logic into the mobile app
This was where things got interesting.
🎤 Whisper.rn to the Rescue
I started with transcription, and honestly—it was a dream.
I used whisper.rn, which wraps whisper.cpp into a clean React Native module. A few lines of code and a bundled tiny Whisper model (ggml-tiny.en.bin), and I was transcribing audio directly on-device.
const whisper = await initWhisper({
filePath: require('../assets/ggml-tiny.en.bin'),
});
const { result } = await whisper.transcribe(audioUri, { language: 'en' });
No cloud calls. No API keys. Just a clean stream of local inference. ✅
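One note if you copy the snippet above: the whisper.rn README (at least the version I was reading) shows transcribe returning a handle with stop plus a promise, rather than the result directly. The longer form looks roughly like this:

import { initWhisper } from 'whisper.rn';

const whisper = await initWhisper({
  filePath: require('../assets/ggml-tiny.en.bin'), // bundled tiny English model
});

const audioUri = 'file:///path/to/recording.wav'; // wherever your recorder saved the clip
const { stop, promise } = whisper.transcribe(audioUri, { language: 'en' });
const { result } = await promise; // plain transcript text
// stop() is available if you need to abort a long transcription.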
💡 Turning Backend Endpoints into Prompts
Since there was no backend to call anymore, I had to rethink how to express all the logic—fluency checking, pacing suggestions, rephrasing, etc.
Instead of calling POST /fluency, I would now just send this prompt to the LLM:
Analyze the following sentence for fluency and pacing:
"I um... I went to the store and I was like... really anxious."
Same for rephrasing and roleplay:
Rephrase this to sound confident and fluent:
"Umm... I think I could maybe help?"
You are a friendly barista. The user says: "Can I get a cappuccino?"
Respond like a real barista.
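To keep those templates from getting scattered across the app, I'm collecting them in one small helper. A minimal sketch (the function name and task labels are just placeholders I'm using here):

// promptTemplates.ts - the old REST endpoints, reborn as prompt builders.
type Task = 'fluency' | 'rephrase' | 'roleplay';

export function buildPrompt(task: Task, userText: string, persona = 'friendly barista'): string {
  switch (task) {
    case 'fluency': // was POST /fluency on the web backend
      return `Analyze the following sentence for fluency and pacing:\n"${userText}"`;
    case 'rephrase':
      return `Rephrase this to sound confident and fluent:\n"${userText}"`;
    case 'roleplay':
      return `You are a ${persona}. The user says: "${userText}"\nRespond like a real ${persona}.`;
  }
}

// Example: buildPrompt('rephrase', "Umm... I think I could maybe help?")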
🤖 LLM Time: Cue the Big Guns (and the Big Problems)
With Whisper humming along, it was time to add the LLM that would handle rephrasing, analysis, and roleplay. I chose phi-3-mini-4k-instruct, quantized and all.
This was a mistake.
💥 Problem #1: The 7.1 GB Model That Killed Android Studio
I copied the model into:
android/app/src/main/assets/phi-3-mini-4k-instruct
As soon as I tried building the app, Gradle exploded with this error:
Required array size too large
Turns out, Android’s build system tries to compress assets during packaging. It attempted to allocate a massive array in memory to compress the model file—and just crashed.
Even if it hadn’t crashed, Google Play has size limits:
- 100 MB per asset file (Play Store)
- 2 GB internal asset max
- Anything bigger? Boom 💥
🔁 Problem #2: JNI + llama.cpp Setup Still Not Working
I wired up JNI to call llama.cpp from React Native, but the integration just... doesn't return anything.
const response = await Llm.runLLM("Rephrase this: 'I guess I might be able to help?'");
My C++ run_prompt() seems to be getting the input, but nothing makes it back to JavaScript. Possibly model path issues, maybe asset handling, maybe JNI bugs. It's on my list.
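While I dig into the native side, I've at least wrapped the bridge call so a silent hang shows up in the logs instead of leaving a promise pending forever. A rough sketch (Llm is my own native module, and the timeout value is arbitrary):

import { NativeModules } from 'react-native';

const { Llm } = NativeModules; // custom JNI-backed module exposing runLLM(prompt)

async function runLLMWithTimeout(prompt: string, ms = 30000): Promise<string> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`LLM call timed out after ${ms} ms`)), ms),
  );
  try {
    // Race the native call against the timer so a hang becomes a visible error.
    return await Promise.race([Llm.runLLM(prompt), timeout]);
  } catch (err) {
    console.warn('LLM bridge error:', err);
    throw err;
  }
}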
📦 Problem #3: Android Assets Aren’t Just Files
To use a .gguf model from native C++, you can't just fopen() it directly from assets.
You have to:
- Copy the asset at runtime to getFilesDir()
- Then load from disk via C++
Still need to implement that.
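I haven't written that part yet, but with react-native-fs (assuming I pull in that dependency) the copy step should look roughly like this; copyFileAssets is Android-only, and DocumentDirectoryPath maps to getFilesDir():

import RNFS from 'react-native-fs';

// Copy the bundled .gguf out of the APK assets into the app's files dir
// so the C++ side can open it with a normal, fopen()-able path.
async function ensureModelOnDisk(assetName: string): Promise<string> {
  const destPath = `${RNFS.DocumentDirectoryPath}/${assetName}`;
  if (!(await RNFS.exists(destPath))) {
    // Android-only: reads from android/app/src/main/assets/<assetName>
    await RNFS.copyFileAssets(assetName, destPath);
  }
  return destPath; // hand this absolute path to the JNI layer / llama.cpp
}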
📊 Where Things Stand Now
Feature | Status
---|---
Whisper transcription | ✅ Working via whisper.rn
Fluency & pacing logic | ✅ Prompt-based and solid
Roleplay prompts | ✅ Working
LLM model integration | ❌ Broken due to size
🛣️ What’s Next
- Copy the .gguf model to filesDir()
- Debug JNI return path
- Try smaller models (TinyLlama, Phi2-4bit)
- Better prompt chaining
- Fallback logic for model load errors (sketched below)
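For that last item, the plan is simply to try candidates from biggest to smallest until one loads. A rough sketch (the model file names are hypothetical, and loadModel stands in for whatever the JNI layer ends up exposing):

// Hypothetical fallback: walk a list of candidate models until one actually loads.
const MODEL_CANDIDATES = [
  'phi-3-mini-4k-instruct-q4.gguf', // preferred, if the size problems get solved
  'tinyllama-1.1b-q4.gguf',         // smaller fallback
];

async function loadFirstAvailableModel(
  ensureOnDisk: (name: string) => Promise<string>, // e.g. the copy-to-filesDir helper above
  loadModel: (path: string) => Promise<void>,      // native model init via JNI
): Promise<string> {
  for (const name of MODEL_CANDIDATES) {
    try {
      const path = await ensureOnDisk(name);
      await loadModel(path);
      return name; // this one works
    } catch (err) {
      console.warn(`Could not load ${name}, trying the next candidate`, err);
    }
  }
  throw new Error('No usable model could be loaded');
}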
🎓 What I’ve Learned
- Whisper works great on mobile
- LLMs bring size, build, and bridge headaches
- Prompt engineering replaces backend logic
- JNI is powerful but fragile
- Android assets ≠ normal files
✨ Final Thoughts
This isn’t done—but it’s real.
I’ve got offline Whisper working, I’ve shifted all backend logic into prompt templates, and I’ve scaffolded the native pieces to run an LLM offline.
Next stop? Making the LLM actually respond.
If you're building something similar, don’t be afraid to dig in. Just maybe… don’t start with a 7.1 GB model 😉