Syed

Running LLM + RAG Fully Offline on Android Using MNN (No Cloud, No API)

Most AI apps today depend completely on the cloud.

Upload your document → send to API → wait for response → pay per request.
And if the internet is slow or unavailable? The AI stops working.

I wanted to explore something different:
Can we run a complete LLM + RAG pipeline fully offline on a mobile device?

After months of experimentation, optimization, and many failed attempts, I finally got a working offline document AI running entirely on-device. Here's what I learned.


🎯 Goal

Build a document assistant that:

  • Runs fully offline
  • Uses no external API
  • Keeps documents local
  • Works on mid-range Android devices
  • Provides usable response speed

Not just a demo, but something actually practical.


🧠 Architecture Overview

The system is a fully local RAG pipeline running on-device:

Pipeline:

  1. User loads PDF/document
  2. Text extracted locally
  3. Converted into embeddings
  4. Stored in local vector index
  5. User asks question
  6. Relevant chunks retrieved
  7. Local LLM generates answer

Everything happens inside the device. No cloud calls.
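
The seven steps above can be sketched as a minimal offline RAG loop. This is illustrative Java, not the app's actual code: `EmbeddingModel` and `LocalLlm` are hypothetical stand-ins for whatever on-device models (e.g. MNN-backed) you wire in; only the retrieval glue is concrete.

```java
import java.util.*;

// Hypothetical stand-ins for the actual on-device models (e.g. MNN-backed).
interface EmbeddingModel { float[] embed(String text); }
interface LocalLlm { String generate(String prompt); }

class OfflineRag {
    private final EmbeddingModel embedder;
    private final LocalLlm llm;
    private final List<String> chunks = new ArrayList<>();
    private final List<float[]> vectors = new ArrayList<>();

    OfflineRag(EmbeddingModel embedder, LocalLlm llm) {
        this.embedder = embedder;
        this.llm = llm;
    }

    // Steps 1-4: take locally extracted text chunks, embed each, store vectors.
    void index(List<String> documentChunks) {
        for (String chunk : documentChunks) {
            chunks.add(chunk);
            vectors.add(embedder.embed(chunk));
        }
    }

    // Steps 5-7: embed the question, retrieve the closest chunk, prompt the LLM.
    String ask(String question) {
        float[] q = embedder.embed(question);
        int best = 0;
        double bestScore = -1;
        for (int i = 0; i < vectors.size(); i++) {
            double score = cosine(q, vectors.get(i));
            if (score > bestScore) { bestScore = score; best = i; }
        }
        String prompt = "Context:\n" + chunks.get(best) + "\n\nQuestion: " + question;
        return llm.generate(prompt);
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-9);
    }
}
```

In the real pipeline, steps 5-7 would retrieve the top few chunks rather than just one, but the flow is the same.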


⚙️ Tech Stack

LLM (Quantized)

Smaller quantized models optimized for CPU inference.
Main challenge: balancing size vs response quality.

Embeddings (On-device)

Multilingual embeddings generated locally and stored for retrieval.

Vector Storage

Lightweight local vector index for fast retrieval without heavy RAM usage.
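
As a rough sketch of what "lightweight" can mean here: keep all embeddings in one flat float array (a single allocation, low per-vector overhead) and do a brute-force top-k scan. `FlatVectorIndex` is a hypothetical example, not the app's implementation; at on-device scale (a few thousand chunks) a linear scan is usually fast enough.

```java
import java.util.*;

// Hypothetical lightweight vector index: all embeddings live in one flat
// float[] (cache-friendly, no per-vector object overhead).
class FlatVectorIndex {
    private final int dim;
    private float[] data = new float[0];
    private int count = 0;

    FlatVectorIndex(int dim) { this.dim = dim; }

    void add(float[] vector) {
        if ((count + 1) * dim > data.length) {
            data = Arrays.copyOf(data, Math.max((count + 1) * dim, data.length * 2));
        }
        System.arraycopy(vector, 0, data, count * dim, dim);
        count++;
    }

    // Indices of the k nearest vectors by dot product (assumes normalized
    // embeddings, so dot product equals cosine similarity).
    int[] topK(float[] query, int k) {
        final double[] scores = new double[count];
        for (int i = 0; i < count; i++) {
            double dot = 0;
            for (int j = 0; j < dim; j++) dot += data[i * dim + j] * query[j];
            scores[i] = dot;
        }
        Integer[] order = new Integer[count];
        for (int i = 0; i < count; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        int[] result = new int[Math.min(k, count)];
        for (int i = 0; i < result.length; i++) result[i] = order[i];
        return result;
    }
}
```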

MNN (Mobile Neural Network)

The biggest unlock.

MNN provides:

  • Efficient CPU inference
  • Mobile-optimized runtime
  • Low memory overhead
  • Faster load vs some other runtimes I tested

For on-device AI, runtime efficiency matters more than raw model size.


🚧 Major Challenges

1. Memory limits on mid-range phones

High-end phones are easy.
Real challenge: 4–6 GB RAM devices.

Solutions:

  • Aggressive quantization
  • Model size tuning
  • Streaming token generation
  • Careful memory release
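
To illustrate the quantization idea: symmetric int8 quantization maps float32 weights to bytes through a single scale factor, cutting weight memory roughly 4x. This toy per-tensor version is for intuition only; real runtimes (MNN included) quantize per channel or block, with calibration.

```java
// Toy symmetric int8 quantization: float32 weights -> bytes via one scale.
class Int8Quantizer {
    final float scale;

    Int8Quantizer(float[] weights) {
        float maxAbs = 1e-9f;
        for (float w : weights) maxAbs = Math.max(maxAbs, Math.abs(w));
        this.scale = maxAbs / 127f;  // map [-maxAbs, maxAbs] onto [-127, 127]
    }

    byte[] quantize(float[] weights) {
        byte[] q = new byte[weights.length];
        for (int i = 0; i < weights.length; i++) {
            q[i] = (byte) Math.max(-127, Math.min(127, Math.round(weights[i] / scale)));
        }
        return q;  // 1 byte per weight instead of 4
    }

    float dequantize(byte q) { return q * scale; }
}
```

The quality cost comes from the rounding step: outlier weights stretch the scale and crush small weights toward zero, which is why per-tensor quantization degrades quality more than per-channel.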

2. Model loading time

Large models = slow startup.

Fix:

  • Preload strategy
  • Lazy loading
  • Smaller embedding models
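
Lazy loading can be as simple as a thread-safe memoizing wrapper: the expensive model load runs on first use instead of at app startup. A minimal sketch (the `Supplier` would wrap the actual slow model initialization):

```java
import java.util.function.Supplier;

// Defers an expensive load until first get(); double-checked locking keeps
// it thread-safe so UI and background threads can share one instance.
class Lazy<T> {
    private final Supplier<T> loader;
    private volatile T value;

    Lazy(Supplier<T> loader) { this.loader = loader; }

    T get() {
        if (value == null) {
            synchronized (this) {
                if (value == null) value = loader.get();
            }
        }
        return value;
    }
}
```

The preload/lazy split then falls out naturally: preload the small embedding model (needed first, for indexing) and wrap the big LLM in `Lazy`, so it loads in the background or on the first question.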

3. Embedding speed

Generating embeddings locally was initially slow.

Optimizations:

  • Batch processing
  • Lightweight embedding models
  • Efficient tensor handling in MNN
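
Batch processing here just means feeding chunks to the embedder in fixed-size groups rather than one call per chunk, amortizing per-call setup cost. A sketch, assuming a hypothetical batch-aware `embed()`:

```java
import java.util.*;

// Hypothetical batch-aware embedder; one call processes several chunks.
interface BatchEmbedder { List<float[]> embed(List<String> batch); }

class BatchingPipeline {
    // Embed all chunks in groups of batchSize, preserving order.
    static List<float[]> embedAll(BatchEmbedder model, List<String> chunks, int batchSize) {
        List<float[]> out = new ArrayList<>(chunks.size());
        for (int i = 0; i < chunks.size(); i += batchSize) {
            List<String> batch = chunks.subList(i, Math.min(i + batchSize, chunks.size()));
            out.addAll(model.embed(batch));
        }
        return out;
    }
}
```

Batch size is a memory knob too: bigger batches are faster but allocate larger tensors, which matters on 4–6 GB devices.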

4. Response usability

Offline LLM must still feel usable.

Tradeoffs:

  • Slightly slower than cloud
  • But instant availability
  • Zero latency from network
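
Streaming is what makes "slightly slower" feel acceptable: tokens reach the UI as they are decoded instead of after the full answer. A sketch with a hypothetical `TokenDecoder` (with MNN you would run one decode step per iteration):

```java
import java.util.function.Consumer;

// Hypothetical one-step decoder; returns null when generation is done.
interface TokenDecoder { String nextToken(); }

class StreamingGenerator {
    // Push each token to the UI as soon as it exists; returns token count.
    static int stream(TokenDecoder decoder, Consumer<String> onToken) {
        int count = 0;
        String token;
        while ((token = decoder.nextToken()) != null) {
            onToken.accept(token);  // e.g. append to a TextView on Android
            count++;
        }
        return count;
    }
}
```

Perceived latency then becomes time-to-first-token rather than time-to-full-answer, which is a much easier target on mobile CPUs.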

📊 Current Performance (mid-range Android)

  • Fully offline end-to-end
  • No internet required
  • Works on mid-range devices
  • Private document processing
  • No API cost

Still optimizing:

  • response speed
  • model quality
  • memory usage

🔐 Why Offline AI Matters

Cloud AI is powerful, but comes with trade-offs:

  • Privacy concerns
  • Recurring API cost
  • Internet dependency
  • Latency
  • Not usable in low-connectivity regions

Offline AI flips this model.

Use cases:

  • Students with limited internet
  • Journalists handling sensitive docs
  • Developers
  • Researchers
  • Privacy-first users

🧪 Lessons Learned

Biggest insight:
Offline AI on mobile is no longer impractical.

With the right optimization:

  • Quantized models
  • Efficient runtime (like MNN)
  • Lightweight RAG pipeline

…it becomes usable today.

Not perfect yet, but very real.


🔮 What's Next

Currently exploring:

  • Faster token generation
  • Better small models
  • Multi-document knowledge base
  • Offline voice input
  • Cross-platform support

Long term: building a fully offline AI ecosystem.


🤝 Looking for Feedback

Curious if others here are experimenting with:

  • On-device LLMs
  • Offline RAG
  • Mobile AI inference
  • MNN / llama.cpp / other runtimes

What models or runtimes are you using?
Is there real demand for offline/private AI vs cloud?

Would love to hear thoughts from the community.


📱 Demo App (for testing)

If anyone wants to try the current implementation:
https://play.google.com/store/apps/details?id=io.cyberfly.edgedox

Mainly looking for technical feedback and ideas from other builders working on local AI.
