Offline AI on Mobile: Tackling Whisper, LLMs, and Size Roadblocks

So here’s the thing—I’m not a mobile dev by trade.

I mostly work on backend systems, build APIs, and mess around with servers and machine learning. One of my pet projects has been a web-based, AI-powered speech coaching assistant — it listens to your voice, analyses your pacing and fluency, rephrases hesitant speech into confident phrasing, and even simulates conversations.

The vision? To bring that same assistant to mobile — fully offline, no server required.

No network required. No server to talk to. Just a pocket-sized fluency mentor that could listen to your voice, give you feedback, rephrase your sentences, and even roleplay a conversation—all on-device.

How hard could it be?

Well… let’s talk about that.

✅ The Web Version Was Easy

I already had a working version on the web: a FastAPI backend running Whisper for transcription and Ollama serving phi-3 for LLM tasks like pacing advice, rephrasing, and generating roleplay responses.

The frontend just sent API calls and got back results. Everything was modular, clean, and fast.

So naturally, I thought, “Let’s just reuse that backend and ship it with the mobile app.”

🚧 Attempt 1: Bundle the Backend and Ollama

This was my first dead end.

Turns out, you can’t just run a Python web server inside a mobile app. iOS and Android are not happy to host FastAPI, nor do they allow launching binaries like Ollama from inside the app bundle. Even trying to force it with tools like BeeWare was met with sandbox restrictions, resource constraints, and general disapproval from the mobile OS gods.

Okay, cool. Lesson learned.

🔄 Attempt 2: Rebuild Everything Natively for Mobile

With the backend out of the picture, I decided to take a new approach:

  • Use Whisper locally for speech-to-text
  • Use a quantized LLM (like phi-3 or TinyLlama) for inference
  • Skip the server entirely and move all logic into the mobile app

This was where things got interesting.
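Roughly, the flow I was aiming for looks like the sketch below. It's only a sketch: the two function types are placeholders for the whisper.rn and llama.cpp pieces described in the rest of this post, not real module APIs.

// Sketch of the target on-device flow: audio -> Whisper -> prompts -> local LLM.
// Both function types are placeholders for the native pieces described below.
type Transcriber = (audioUri: string) => Promise<string>;
type LlmRunner = (prompt: string) => Promise<string>;

async function coachFromRecording(
  audioUri: string,
  transcribe: Transcriber,
  runLLM: LlmRunner,
) {
  // 1. Speech-to-text, fully offline
  const transcript = await transcribe(audioUri);

  // 2. Everything the backend used to do becomes a prompt to the local LLM
  const feedback = await runLLM(
    `Analyze the following sentence for fluency and pacing:\n"${transcript}"`,
  );
  const rephrased = await runLLM(
    `Rephrase this to sound confident and fluent:\n"${transcript}"`,
  );

  return { transcript, feedback, rephrased };
}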

🎤 Whisper.rn to the Rescue

I started with transcription, and honestly—it was a dream.

I used whisper.rn, which wraps whisper.cpp into a clean React Native module. A few lines of code and a bundled tiny Whisper model (ggml-tiny.en.bin), and I was transcribing audio directly on-device.

import { initWhisper } from 'whisper.rn';

// Initialise a Whisper context from the bundled tiny English model
const whisper = await initWhisper({
  filePath: require('../assets/ggml-tiny.en.bin'),
});

// Transcribe the recorded audio entirely on-device
const { result } = await whisper.transcribe(audioUri, { language: 'en' });

No cloud calls. No API keys. Just a clean stream of local inference. ✅

💡 Turning Backend Endpoints into Prompts

Since there was no backend to call anymore, I had to rethink how to express all the logic—fluency checking, pacing suggestions, rephrasing, etc.

Instead of calling POST /fluency, I would now just send this prompt to the LLM:

Analyze the following sentence for fluency and pacing:  
"I um... I went to the store and I was like... really anxious."

Same for rephrasing and roleplay:

Rephrase this to sound confident and fluent:  
"Umm... I think I could maybe help?"

You are a friendly barista. The user says: "Can I get a cappuccino?"  
Respond like a real barista.
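
One way to keep these organised is to treat the old endpoints as a small table of prompt templates. The object below is just my own sketch of that idea (the names are mine, not a library API):

// Each former backend endpoint becomes a prompt template for the local LLM.
const promptTemplates = {
  fluency: (text: string) =>
    `Analyze the following sentence for fluency and pacing:\n"${text}"`,
  rephrase: (text: string) =>
    `Rephrase this to sound confident and fluent:\n"${text}"`,
  roleplay: (persona: string, text: string) =>
    `You are a ${persona}. The user says: "${text}"\nRespond like a real ${persona}.`,
};

// What used to be POST /fluency is now just:
const fluencyPrompt = promptTemplates.fluency(
  "I um... I went to the store and I was like... really anxious.",
);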

🤖 LLM Time: Cue the Big Guns (and the Big Problems)

With Whisper humming along, it was time to add the LLM that would handle rephrasing, analysis, and roleplay. I chose phi-3-mini-4k-instruct, quantized and all.

This was a mistake.

💥 Problem #1: The 7.1 GB Model That Killed Android Studio

I copied the model into:

android/app/src/main/assets/phi-3-mini-4k-instruct

As soon as I tried building the app, Gradle exploded with this error:

Required array size too large

Turns out, Android’s build system tries to compress assets during packaging. It attempted to allocate a massive array in memory to compress the model file—and just crashed.

Even if it hadn’t crashed, Google Play has size limits:

  • 100 MB per asset file (Play Store)
  • 2 GB internal asset max
  • Anything bigger? Boom 💥

🔁 Problem #2: JNI + llama.cpp Setup Still Not Working

I wired up JNI to call llama.cpp from React Native, but the integration just... doesn’t return anything.

const response = await Llm.runLLM("Rephrase this: 'I guess I might be able to help?'");

My C++ run_prompt() seems to be getting the input, but nothing ever makes it back to JavaScript. Possibly model path issues, maybe asset handling, maybe JNI bugs. It's on my list.
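
One simple debugging aid would be wrapping the bridge call in a timeout, so a hung JNI call fails loudly instead of awaiting forever. Llm here is my own native module from above; the wrapper itself is just a sketch of the idea, not something already in the app:

import { NativeModules } from 'react-native';

// Llm is the custom native module wrapping llama.cpp, not a published library.
const { Llm } = NativeModules as {
  Llm: { runLLM: (prompt: string) => Promise<string> };
};

// Race the bridge call against a timer so a silently hanging JNI call
// surfaces as an error instead of a promise that never settles.
async function runLLMWithTimeout(prompt: string, ms = 30_000): Promise<string> {
  const timeout = new Promise<string>((_, reject) =>
    setTimeout(() => reject(new Error(`LLM call timed out after ${ms} ms`)), ms),
  );
  return Promise.race([Llm.runLLM(prompt), timeout]);
}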

📦 Problem #3: Android Assets Aren’t Just Files

To use a .gguf model from native C++, you can’t just fopen() it directly from assets.

You have to:

  1. Copy the asset at runtime to getFilesDir()
  2. Then load from disk via C++

Still need to implement that.
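
I haven't written that copy step yet, but on the JavaScript side it could look something like the sketch below, assuming react-native-fs is added to the project and the model ships as a single .gguf asset (the file name is illustrative):

import RNFS from 'react-native-fs';

// Copy the bundled model out of the APK assets into the app's files directory
// so native code can open it via a normal file path. Android only:
// copyFileAssets resolves names relative to android/app/src/main/assets/.
async function ensureModelOnDisk(assetName: string): Promise<string> {
  const destPath = `${RNFS.DocumentDirectoryPath}/${assetName}`;
  if (!(await RNFS.exists(destPath))) {
    await RNFS.copyFileAssets(assetName, destPath);
  }
  return destPath; // hand this absolute path to the C++ side instead of an asset URI
}

// e.g. const modelPath = await ensureModelOnDisk('phi-3-mini-4k-instruct.gguf');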

📊 Where Things Stand Now

Feature                     Status
Whisper transcription       ✅ Working via whisper.rn
Fluency & pacing logic      ✅ Prompt-based and solid
Roleplay prompts            ✅ Working
LLM model integration       ❌ Blocked by model size and JNI issues

🛣️ What’s Next

  • Copy .gguf to filesDir()
  • Debug JNI return path
  • Try smaller models (TinyLlama, Phi2-4bit)
  • Better prompt chaining
  • Fallback logic for model load errors (a rough sketch of these last two items is below)
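
The last two items go together in my head: chain the analysis prompt into the rephrase prompt, and degrade gracefully when the model can't be loaded. A rough sketch, reusing the hypothetical runLLMWithTimeout wrapper from the JNI section:

// Chain prompts (the analysis feeds the rephrase step) and fall back to the
// raw transcript if the local model can't be reached.
async function analyzeAndRephrase(transcript: string): Promise<string> {
  try {
    const analysis = await runLLMWithTimeout(
      `Analyze the following sentence for fluency and pacing:\n"${transcript}"`,
    );
    return await runLLMWithTimeout(
      `Given this analysis:\n${analysis}\n\n` +
        `Rephrase the original sentence to sound confident and fluent:\n"${transcript}"`,
    );
  } catch (err) {
    // Model missing, bridge hung, or load failed: keep the app usable.
    console.warn('Local LLM unavailable, returning transcript unchanged', err);
    return transcript;
  }
}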

🎓 What I’ve Learned

  • Whisper works great on mobile
  • LLMs bring size, build, and bridge headaches
  • Prompt engineering replaces backend logic
  • JNI is powerful but fragile
  • Android assets ≠ normal files

✨ Final Thoughts

This isn’t done—but it’s real.

I’ve got offline Whisper working, I’ve shifted all backend logic into prompt templates, and I’ve scaffolded the native pieces to run an LLM offline.

Next stop? Making the LLM actually respond.

If you're building something similar, don’t be afraid to dig in. Just maybe… don’t start with a 7.1 GB model 😉

Top comments (1)

Kbv Subrahmanyam

In fact a good read, and I'm getting to learn more things in this LLM space.