Offline AI on Mobile: Tackling Whisper, LLMs, and Size Roadblocks

So here’s the thing—I’m not a mobile dev by trade.

I mostly work on backend systems, build APIs, and mess around with servers and machine learning. One of my pet projects has been a web-based, AI-powered speech coaching assistant — it listens to your voice, analyses your pacing and fluency, rephrases hesitant speech into confident phrasing, and even simulates conversations.

The vision? To bring that same assistant to mobile — fully offline, no server required.

No network required. No server to talk to. Just a pocket-sized fluency mentor that could listen to your voice, give you feedback, rephrase your sentences, and even roleplay a conversation—all on-device.

How hard could it be?

Well… let’s talk about that.

✅ The Web Version Was Easy

I already had a working version on the web: a FastAPI backend running Whisper for transcription and Ollama serving phi-3 for LLM tasks like pacing advice, rephrasing, and generating roleplay responses.

The frontend just sent API calls and got back results. Everything was modular, clean, and fast.

So naturally, I thought, “Let’s just reuse that backend and ship it with the mobile app.”

🚧 Attempt 1: Bundle the Backend and Ollama

This was my first dead end.

Turns out, you can’t just run a Python web server inside a mobile app. iOS and Android are not happy to host FastAPI, nor do they allow launching binaries like Ollama from inside the app bundle. Even trying to force it with tools like BeeWare was met with sandbox restrictions, resource constraints, and general disapproval from the mobile OS gods.

Okay, cool. Lesson learned.

🔄 Attempt 2: Rebuild Everything Natively for Mobile

With the backend out of the picture, I decided to take a new approach:

  • Use Whisper locally for speech-to-text
  • Use a quantized LLM (like phi-3 or TinyLlama) for inference
  • Skip the server entirely and move all logic into the mobile app

This was where things got interesting.
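Roughly, the flow I was aiming for looks like the sketch below. It's only a sketch: the two function types are placeholders for the whisper.rn and llama.cpp pieces described in the rest of this post, not real module APIs.

// Sketch of the target on-device flow: audio -> Whisper -> prompts -> local LLM.
// Both function types are placeholders for the native pieces described below.
type Transcriber = (audioUri: string) => Promise<string>;
type LlmRunner = (prompt: string) => Promise<string>;

async function coachFromRecording(
  audioUri: string,
  transcribe: Transcriber,
  runLLM: LlmRunner,
) {
  // 1. Speech-to-text, fully offline
  const transcript = await transcribe(audioUri);

  // 2. Everything the backend used to do becomes a prompt to the local LLM
  const feedback = await runLLM(
    `Analyze the following sentence for fluency and pacing:\n"${transcript}"`,
  );
  const rephrased = await runLLM(
    `Rephrase this to sound confident and fluent:\n"${transcript}"`,
  );

  return { transcript, feedback, rephrased };
}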

🎤 Whisper.rn to the Rescue

I started with transcription, and honestly—it was a dream.

I used whisper.rn, which wraps whisper.cpp into a clean React Native module. A few lines of code and a bundled tiny Whisper model (ggml-tiny.en.bin), and I was transcribing audio directly on-device.

import { initWhisper } from 'whisper.rn';

// Initialise a Whisper context from the bundled tiny English model
const whisper = await initWhisper({
  filePath: require('../assets/ggml-tiny.en.bin'),
});

// Transcribe the recorded audio entirely on-device
const { result } = await whisper.transcribe(audioUri, { language: 'en' });

No cloud calls. No API keys. Just a clean stream of local inference. ✅

💡 Turning Backend Endpoints into Prompts

Since there was no backend to call anymore, I had to rethink how to express all the logic—fluency checking, pacing suggestions, rephrasing, etc.

Instead of calling POST /fluency, I would now just send this prompt to the LLM:

Analyze the following sentence for fluency and pacing:  
"I um... I went to the store and I was like... really anxious."

Same for rephrasing and roleplay:

Rephrase this to sound confident and fluent:  
"Umm... I think I could maybe help?"

You are a friendly barista. The user says: "Can I get a cappuccino?"  
Respond like a real barista.
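
One way to keep these organised is to treat the old endpoints as a small table of prompt templates. The object below is just my own sketch of that idea (the names are mine, not a library API):

// Each former backend endpoint becomes a prompt template for the local LLM.
const promptTemplates = {
  fluency: (text: string) =>
    `Analyze the following sentence for fluency and pacing:\n"${text}"`,
  rephrase: (text: string) =>
    `Rephrase this to sound confident and fluent:\n"${text}"`,
  roleplay: (persona: string, text: string) =>
    `You are a ${persona}. The user says: "${text}"\nRespond like a real ${persona}.`,
};

// What used to be POST /fluency is now just:
const fluencyPrompt = promptTemplates.fluency(
  "I um... I went to the store and I was like... really anxious.",
);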

🤖 LLM Time: Cue the Big Guns (and the Big Problems)

With Whisper humming along, it was time to add the LLM that would handle rephrasing, analysis, and roleplay. I chose phi-3-mini-4k-instruct, quantized and all.

This was a mistake.

💥 Problem #1: The 7.1 GB Model That Killed Android Studio

I copied the model into:

android/app/src/main/assets/phi-3-mini-4k-instruct

As soon as I tried building the app, Gradle exploded with this error:

Required array size too large

Turns out, Android’s build system tries to compress assets during packaging. It attempted to allocate a massive array in memory to compress the model file—and just crashed.

Even if it hadn’t crashed, Google Play has size limits:

  • 100 MB per asset file (Play Store)
  • 2 GB internal asset max
  • Anything bigger? Boom 💥

🔁 Problem #2: JNI + llama.cpp Setup Still Not Working

I wired up JNI to call llama.cpp from React Native, but the integration just... doesn’t return anything.

const response = await Llm.runLLM("Rephrase this: 'I guess I might be able to help?'");

My C++ run_prompt() seems to be getting the input, but nothing ever makes it back to JavaScript. Possibly model path issues, maybe asset handling, maybe JNI bugs. It's on my list.
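
One simple debugging aid would be wrapping the bridge call in a timeout, so a hung JNI call fails loudly instead of awaiting forever. Llm here is my own native module from above; the wrapper itself is just a sketch of the idea, not something already in the app:

import { NativeModules } from 'react-native';

// Llm is the custom native module wrapping llama.cpp, not a published library.
const { Llm } = NativeModules as {
  Llm: { runLLM: (prompt: string) => Promise<string> };
};

// Race the bridge call against a timer so a silently hanging JNI call
// surfaces as an error instead of a promise that never settles.
async function runLLMWithTimeout(prompt: string, ms = 30_000): Promise<string> {
  const timeout = new Promise<string>((_, reject) =>
    setTimeout(() => reject(new Error(`LLM call timed out after ${ms} ms`)), ms),
  );
  return Promise.race([Llm.runLLM(prompt), timeout]);
}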

📦 Problem #3: Android Assets Aren’t Just Files

To use a .gguf model from native C++, you can’t just fopen() it directly from assets.

You have to:

  1. Copy the asset at runtime to getFilesDir()
  2. Then load from disk via C++

Still need to implement that.
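
I haven't written that copy step yet, but on the JavaScript side it could look something like the sketch below, assuming react-native-fs is added to the project and the model ships as a single .gguf asset (the file name is illustrative):

import RNFS from 'react-native-fs';

// Copy the bundled model out of the APK assets into the app's files directory
// so native code can open it via a normal file path. Android only:
// copyFileAssets resolves names relative to android/app/src/main/assets/.
async function ensureModelOnDisk(assetName: string): Promise<string> {
  const destPath = `${RNFS.DocumentDirectoryPath}/${assetName}`;
  if (!(await RNFS.exists(destPath))) {
    await RNFS.copyFileAssets(assetName, destPath);
  }
  return destPath; // hand this absolute path to the C++ side instead of an asset URI
}

// e.g. const modelPath = await ensureModelOnDisk('phi-3-mini-4k-instruct.gguf');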

📊 Where Things Stand Now

Feature                     Status
Whisper transcription       ✅ Working via whisper.rn
Fluency & pacing logic      ✅ Prompt-based and solid
Roleplay prompts            ✅ Working
LLM model integration       ❌ Blocked by model size and JNI issues

🛣️ What’s Next

  • Copy .gguf to filesDir()
  • Debug JNI return path
  • Try smaller models (TinyLlama, Phi2-4bit)
  • Better prompt chaining
  • Fallback logic for model load errors (a rough sketch of these last two items is below)
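
The last two items go together in my head: chain the analysis prompt into the rephrase prompt, and degrade gracefully when the model can't be loaded. A rough sketch, reusing the hypothetical runLLMWithTimeout wrapper from the JNI section:

// Chain prompts (the analysis feeds the rephrase step) and fall back to the
// raw transcript if the local model can't be reached.
async function analyzeAndRephrase(transcript: string): Promise<string> {
  try {
    const analysis = await runLLMWithTimeout(
      `Analyze the following sentence for fluency and pacing:\n"${transcript}"`,
    );
    return await runLLMWithTimeout(
      `Given this analysis:\n${analysis}\n\n` +
        `Rephrase the original sentence to sound confident and fluent:\n"${transcript}"`,
    );
  } catch (err) {
    // Model missing, bridge hung, or load failed: keep the app usable.
    console.warn('Local LLM unavailable, returning transcript unchanged', err);
    return transcript;
  }
}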

🎓 What I’ve Learned

  • Whisper works great on mobile
  • LLMs bring size, build, and bridge headaches
  • Prompt engineering replaces backend logic
  • JNI is powerful but fragile
  • Android assets ≠ normal files

✨ Final Thoughts

This isn’t done—but it’s real.

I’ve got offline Whisper working, I’ve shifted all backend logic into prompt templates, and I’ve scaffolded the native pieces to run an LLM offline.

Next stop? Making the LLM actually respond.

If you're building something similar, don’t be afraid to dig in. Just maybe… don’t start with a 7.1 GB model 😉

Top comments (1)

Kbv Subrahmanyam

In fact a good read, and I'm getting to learn more things in this LLM space.