Running LLM + RAG Fully Offline on Android Using MNN (No Cloud, No API)
Most AI apps today depend completely on the cloud.
Upload your document → send to API → wait for response → pay per request.
And if internet is slow or unavailable? The AI stops working.
I wanted to explore something different:
Can we run a complete LLM + RAG pipeline fully offline on a mobile device?
After months of experimentation, optimization, and many failed attempts, I finally got a working offline document AI running entirely on-device. Here's what I learned.
Goal
Build a document assistant that:
- Runs fully offline
- Uses no external API
- Keeps documents local
- Works on mid-range Android devices
- Provides usable response speed
Not just a demo, but something actually practical.
Architecture Overview
The system is a fully local RAG pipeline running on-device:
Pipeline:
- User loads PDF/document
- Text extracted locally
- Converted into embeddings
- Stored in local vector index
- User asks question
- Relevant chunks retrieved
- Local LLM generates answer
Everything happens inside the device. No cloud calls.
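To make the pipeline shape concrete, here is a minimal sketch in Python. This is an illustration of the retrieval flow, not the app's actual code (which runs on Android via MNN): `embed` is a toy hash-based bag-of-words stand-in for the real on-device embedding model.

```python
import math

def embed(text: str, dim: int = 16) -> list[float]:
    # Toy embedder: hash tokens into a fixed-size vector, then L2-normalize.
    # Stand-in for the real on-device embedding model.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are pre-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

class LocalIndex:
    """Minimal in-memory vector index: add chunks, retrieve top-k by similarity."""

    def __init__(self) -> None:
        self.entries: list[tuple[list[float], str]] = []

    def add(self, chunk: str) -> None:
        self.entries.append((embed(chunk), chunk))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

The retrieved chunks are then prepended to the user's question as context for the local LLM; that last generation step is where the runtime (MNN) does the heavy lifting.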
Tech Stack
LLM (Quantized)
Smaller quantized models optimized for CPU inference.
Main challenge: balancing size vs response quality.
Embeddings (On-device)
Multilingual embeddings generated locally and stored for retrieval.
Vector Storage
Lightweight local vector index for fast retrieval without heavy RAM usage.
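One common way to keep a local vector index light on RAM is to store embeddings as int8 with a per-vector scale instead of float32, roughly quartering storage. A small sketch of that idea (illustrative; I'm not claiming this is the app's exact storage format):

```python
import array

def quantize_int8(vec: list[float]) -> tuple[array.array, float]:
    # Per-vector scale: map the largest magnitude to 127.
    scale = max(abs(v) for v in vec) / 127.0 or 1.0
    q = array.array('b', (round(v / scale) for v in vec))  # 1 byte per dim
    return q, scale

def dequantize(q: array.array, scale: float) -> list[float]:
    return [x * scale for x in q]
```

Similarity search can then dequantize on the fly (or score directly in int8), trading a tiny accuracy loss for a 4x smaller index.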
MNN (Mobile Neural Network)
The biggest unlock.
MNN provides:
- Efficient CPU inference
- Mobile-optimized runtime
- Low memory overhead
- Faster load vs some other runtimes I tested
For on-device AI, runtime efficiency matters more than raw model size.
Major Challenges
1. Memory limits on mid-range phones
High-end phones are easy.
Real challenge: 4–6 GB RAM devices.
Solutions:
- Aggressive quantization
- Model size tuning
- Streaming token generation
- Careful memory release
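Streaming token generation is the piece that is easiest to show in isolation: instead of building the full answer in memory and returning it at the end, the decoder yields each token as soon as it is sampled, so the UI updates immediately and peak memory stays flat. A sketch with a fake decode step standing in for one MNN inference call (the vocab and `fake_decode_step` are invented for illustration):

```python
from typing import Iterator

def generate_stream(prompt: str, max_tokens: int = 8) -> Iterator[str]:
    """Yield tokens one at a time instead of materializing the whole answer."""

    def fake_decode_step(step: int) -> str:
        # Stand-in for one real inference step of the quantized LLM.
        vocab = ["offline", "models", "can", "stream", "tokens", "to", "the", "ui"]
        return vocab[step % len(vocab)]

    for step in range(max_tokens):
        yield fake_decode_step(step)

# A consumer can render each token as it arrives:
# for token in generate_stream("What is RAG?"):
#     append_to_ui(token)
```

The generator pattern also makes cancellation cheap: if the user navigates away, you simply stop iterating and release the decode state.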
2. Model loading time
Large models = slow startup.
Fix:
- Preload strategy
- Lazy loading
- Smaller embedding models
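Lazy loading is straightforward to sketch: defer the expensive weight load until the first inference call, so app startup isn't blocked by a multi-second model load. `_load_weights` below is a placeholder for real runtime session creation, not an actual MNN API:

```python
class LazyModel:
    """Defer loading model weights until the first inference call."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._weights = None  # nothing loaded at construction time

    def _load_weights(self) -> dict:
        # Placeholder for the real (slow) runtime session creation.
        return {"path": self.path, "loaded": True}

    def infer(self, prompt: str) -> str:
        if self._weights is None:      # load on first use only
            self._weights = self._load_weights()
        return f"answer to: {prompt}"  # placeholder for real generation
```

A preload strategy is the complement: kick off `_load_weights` on a background thread right after app launch, so by the time the user asks a question the model is usually already warm.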
3. Embedding speed
Generating embeddings locally was initially slow.
Optimizations:
- Batch processing
- Lightweight embedding models
- Efficient tensor handling in MNN
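Batch processing is the main speed lever here: embedding chunks in fixed-size groups amortizes per-call overhead (tensor allocation, model dispatch) across many texts. A minimal sketch, where `embed_batch` is assumed to be a function that embeds a list of texts in one model call:

```python
from typing import Callable

def embed_in_batches(
    chunks: list[str],
    embed_batch: Callable[[list[str]], list],
    batch_size: int = 4,
) -> list:
    """Embed chunks in fixed-size batches to amortize per-call overhead."""
    results = []
    for i in range(0, len(chunks), batch_size):
        results.extend(embed_batch(chunks[i:i + batch_size]))
    return results
```

Picking `batch_size` is itself a memory trade-off on 4–6 GB devices: larger batches are faster per chunk but spike peak RAM during indexing.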
4. Response usability
Offline LLM must still feel usable.
Tradeoffs:
- Slightly slower than cloud
- But instant availability
- Zero latency from network
Current Performance (mid-range Android)
- Fully offline end-to-end
- No internet required
- Works on mid-range devices
- Private document processing
- No API cost
Still optimizing:
- response speed
- model quality
- memory usage
Why Offline AI Matters
Cloud AI is powerful, but comes with trade-offs:
- Privacy concerns
- Recurring API cost
- Internet dependency
- Latency
- Not usable in low-connectivity regions
Offline AI flips this model.
Use cases:
- Students with limited internet
- Journalists handling sensitive docs
- Developers
- Researchers
- Privacy-first users
Lessons Learned
Biggest insight:
Offline AI on mobile is no longer impractical.
With the right optimization:
- Quantized models
- Efficient runtime (like MNN)
- Lightweight RAG pipeline
…it becomes usable today.
Not perfect yet, but very real.
What's Next
Currently exploring:
- Faster token generation
- Better small models
- Multi-document knowledge base
- Offline voice input
- Cross-platform support
Long term: building a fully offline AI ecosystem.
Looking for Feedback
Curious if others here are experimenting with:
- On-device LLMs
- Offline RAG
- Mobile AI inference
- MNN / llama.cpp / other runtimes
What models or runtimes are you using?
Is there real demand for offline/private AI vs cloud?
Would love to hear thoughts from the community.
Demo App (for testing)
If anyone wants to try the current implementation:
https://play.google.com/store/apps/details?id=io.cyberfly.edgedox
Mainly looking for technical feedback and ideas from other builders working on local AI.