<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Umer</title>
    <description>The latest articles on DEV Community by Umer (@umarpazir11).</description>
    <link>https://dev.to/umarpazir11</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3920290%2F9e624fb5-3718-4c46-9f51-1ce924b59cbf.jpeg</url>
      <title>DEV Community: Umer</title>
      <link>https://dev.to/umarpazir11</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/umarpazir11"/>
    <language>en</language>
    <item>
      <title>Built a fully offline RAG app for Android using Gemma 4 + LiteRT-LM. Your PDFs never leave the device — GDPR compliance collapses to one bullet point. Here's how. 🔒📱</title>
      <dc:creator>Umer</dc:creator>
      <pubDate>Fri, 08 May 2026 22:02:07 +0000</pubDate>
      <link>https://dev.to/umarpazir11/built-a-fully-offline-rag-app-for-android-using-gemma-4-litert-lm-your-pdfs-never-leave-the-2kch</link>
      <guid>https://dev.to/umarpazir11/built-a-fully-offline-rag-app-for-android-using-gemma-4-litert-lm-your-pdfs-never-leave-the-2kch</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/umarpazir11/your-pdfs-never-leave-your-pocket-building-a-100-offline-rag-app-with-gemma-4-litert-lm-340" class="crayons-story__hidden-navigation-link"&gt;Your PDFs Never Leave Your Pocket: Building a 100% Offline RAG App with Gemma 4 + LiteRT-LM&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
      &lt;a href="https://dev.to/umarpazir11/your-pdfs-never-leave-your-pocket-building-a-100-offline-rag-app-with-gemma-4-litert-lm-340" class="crayons-article__context-note crayons-article__context-note__feed"&gt;&lt;p&gt;Gemma 4 Challenge: Write about Gemma 4 Submission&lt;/p&gt;

&lt;/a&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/umarpazir11" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3920290%2F9e624fb5-3718-4c46-9f51-1ce924b59cbf.jpeg" alt="umarpazir11 profile" class="crayons-avatar__image" width="460" height="460"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/umarpazir11" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Umer
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Umer
                
              
              &lt;div id="story-author-preview-content-3634603" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/umarpazir11" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3920290%2F9e624fb5-3718-4c46-9f51-1ce924b59cbf.jpeg" class="crayons-avatar__image" alt="" width="460" height="460"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Umer&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/umarpazir11/your-pdfs-never-leave-your-pocket-building-a-100-offline-rag-app-with-gemma-4-litert-lm-340" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 8&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/umarpazir11/your-pdfs-never-leave-your-pocket-building-a-100-offline-rag-app-with-gemma-4-litert-lm-340" id="article-link-3634603"&gt;
          Your PDFs Never Leave Your Pocket: Building a 100% Offline RAG App with Gemma 4 + LiteRT-LM
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devchallenge"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devchallenge&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/gemmachallenge"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;gemmachallenge&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/gemma"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;gemma&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/umarpazir11/your-pdfs-never-leave-your-pocket-building-a-100-offline-rag-app-with-gemma-4-litert-lm-340#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            11 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>ai</category>
      <category>android</category>
      <category>rag</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Your PDFs Never Leave Your Pocket: Building a 100% Offline RAG App with Gemma 4 + LiteRT-LM</title>
      <dc:creator>Umer</dc:creator>
      <pubDate>Fri, 08 May 2026 21:53:24 +0000</pubDate>
      <link>https://dev.to/umarpazir11/your-pdfs-never-leave-your-pocket-building-a-100-offline-rag-app-with-gemma-4-litert-lm-340</link>
      <guid>https://dev.to/umarpazir11/your-pdfs-never-leave-your-pocket-building-a-100-offline-rag-app-with-gemma-4-litert-lm-340</guid>
      <description>&lt;h1&gt;
  
  
  Your PDFs Never Leave Your Pocket: Building a 100% Offline RAG App with Gemma 4 + LiteRT-LM 🔒📱
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"We'd love to use AI on our internal documents… but legal said no."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you've ever worked with a German Mittelstand company — or honestly, any healthcare provider, law firm, or financial services team anywhere in the EU — you've heard a version of this sentence. And legal isn't being paranoid. They're being correct.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The moment an employee pastes a contract, a payslip, or a patient record into ChatGPT, that document becomes someone else's processing activity. Under &lt;strong&gt;GDPR Article 28&lt;/strong&gt;, the cloud AI provider becomes a data processor. You stay the controller. If those servers sit in the US, you've also tripped Chapter V transfer rules and the ghost of &lt;em&gt;Schrems II&lt;/em&gt;. Fines top out at &lt;strong&gt;€20 million or 4% of global turnover&lt;/strong&gt;, and the regulators are warming up.&lt;/p&gt;

&lt;p&gt;So here's the dilemma every European SME is sitting in right now: the productivity gains from "chat with your documents" are real and obvious, but the compliance surface is a minefield. Most teams resolve it the same way — they don't use AI on their sensitive stuff at all. The data just sits there, unsearchable.&lt;/p&gt;

&lt;p&gt;I wanted to fix that. Not by writing another DPA template, but by removing the cloud from the equation entirely.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;PocketSage&lt;/strong&gt; — a fully offline, on-device RAG assistant for Android. You import a PDF, ask it questions, and get streaming answers from &lt;strong&gt;Gemma 4 E2B running natively on your phone&lt;/strong&gt;. No network calls. No API keys. No "your data may be used to improve our services." The model weights live in your app sandbox; the embeddings live in a Room database; airplane mode works perfectly.&lt;/p&gt;

&lt;p&gt;Let me walk you through how it's built.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5sooqz0rp8ptbqtyokb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5sooqz0rp8ptbqtyokb.gif" alt="PocketSage demo" width="600" height="1337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔗 &lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/umarpazir11/pocketsage" rel="noopener noreferrer"&gt;github.com/umarpazir11/pocketsage&lt;/a&gt;&lt;br&gt;
📦 &lt;strong&gt;Pre-built APK:&lt;/strong&gt; Grab it from the &lt;a href="https://github.com/umarpazir11/pocketsage/releases" rel="noopener noreferrer"&gt;Releases page&lt;/a&gt; — no build setup required, just side-load the LLM and go.&lt;br&gt;
⭐ Star it if you find it useful — it genuinely helps others discover the project.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Privacy Argument, Stated Plainly 🇪🇺
&lt;/h2&gt;

&lt;p&gt;I want to spend one more paragraph here because this is the &lt;em&gt;whole point&lt;/em&gt; of the project, not a footnote.&lt;/p&gt;

&lt;p&gt;When you build a cloud RAG pipeline for a German enterprise, here's what your compliance checklist actually looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Sign a &lt;strong&gt;Data Processing Agreement&lt;/strong&gt; with your LLM provider (Article 28)&lt;/li&gt;
&lt;li&gt;✅ Conduct a &lt;strong&gt;Data Protection Impact Assessment&lt;/strong&gt; — DPIA — for high-risk processing (Article 35)&lt;/li&gt;
&lt;li&gt;✅ Document &lt;strong&gt;legal basis&lt;/strong&gt; under Article 6 for every category of data&lt;/li&gt;
&lt;li&gt;✅ Update your &lt;strong&gt;Record of Processing Activities&lt;/strong&gt; (Article 30)&lt;/li&gt;
&lt;li&gt;✅ Set up &lt;strong&gt;Standard Contractual Clauses&lt;/strong&gt; for any non-EU sub-processors&lt;/li&gt;
&lt;li&gt;✅ Implement &lt;strong&gt;PII redaction&lt;/strong&gt; before vectorization (because the prompt-and-document data hits a third-party server)&lt;/li&gt;
&lt;li&gt;✅ Build a "Right to be Forgotten" mechanism that can purge specific vectors from your store&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a six-month project before you write a line of feature code.&lt;/p&gt;

&lt;p&gt;Now here's PocketSage's compliance checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ The data doesn't leave the device.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. There is &lt;strong&gt;no processor&lt;/strong&gt; because there is no processing happening anywhere except on hardware the user already owns. Article 28 doesn't apply. Chapter V transfers don't apply. There's no DPA to sign because there's no third party. This is &lt;strong&gt;privacy by design&lt;/strong&gt; in the most literal sense the regulation could possibly mean — the architecture itself makes the violation impossible.&lt;/p&gt;

&lt;p&gt;For a German SME evaluating "chat with your contracts" tools, this is the difference between a six-month legal review and a one-week pilot.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tech Stack 🛠️
&lt;/h2&gt;

&lt;p&gt;PocketSage is a textbook &lt;strong&gt;Modern Android Development (MAD Skills)&lt;/strong&gt; app applied to a non-trivial ML problem. Three layers, clean separation, zero &lt;code&gt;android.*&lt;/code&gt; imports in the domain layer.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jetpack Compose + Material 3&lt;/td&gt;
&lt;td&gt;Single Activity, dynamic color, recruiter-recognisable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MVVM, Hilt DI, Navigation Compose&lt;/td&gt;
&lt;td&gt;Standard, testable, no surprises&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kotlin Coroutines + &lt;code&gt;Flow&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Streaming tokens map cleanly onto &lt;code&gt;callbackFlow&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Room (SQLite)&lt;/td&gt;
&lt;td&gt;384-dim embeddings stored as &lt;code&gt;BLOB&lt;/code&gt;, cosine in Kotlin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LiteRT (&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;22 MB, well-benchmarked, runs anywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF parsing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pdfbox-android&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mature port, handles most consumer PDFs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM Inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;LiteRT-LM&lt;/strong&gt; + &lt;code&gt;gemma-4-E2B-it-litert-lm&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Google's official on-device GenAI orchestration layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
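The "embeddings stored as BLOB" row is worth one concrete sketch: Room has no native vector column type, so the 384 floats are simply serialized to bytes. A minimal, framework-free version of that conversion (names are illustrative; PocketSage's actual converter may differ):

```kotlin
import java.nio.ByteBuffer

// A 384-dim float embedding is 384 * 4 = 1536 bytes per chunk,
// small enough that a plain SQLite BLOB column is a fine vector store.
fun floatArrayToBlob(vec: FloatArray): ByteArray {
    val buf = ByteBuffer.allocate(vec.size * Float.SIZE_BYTES)
    vec.forEach { buf.putFloat(it) }
    return buf.array()
}

fun blobToFloatArray(blob: ByteArray): FloatArray {
    val buf = ByteBuffer.wrap(blob)
    return FloatArray(blob.size / Float.SIZE_BYTES) { buf.getFloat() }
}
```

In the app this pair would typically sit behind a Room `@TypeConverter`, so the DAOs work with `FloatArray` directly.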

&lt;p&gt;The whole RAG pipeline is roughly 500 lines of Kotlin once you strip the boilerplate. Honestly, the hard part wasn't the code — it was choosing the right model file format. (More on that nightmare in a moment.)&lt;/p&gt;




&lt;h2&gt;
  
  
  How RAG Works Here, in Three Paragraphs 📚
&lt;/h2&gt;

&lt;p&gt;When you import a PDF, PocketSage extracts the text with PDFBox, splits it into ~800-character overlapping chunks, embeds each chunk with MiniLM (a tiny BERT-family model), and stores the resulting 384-dimensional vectors as raw bytes in a Room table. One-time per document, runs in the background, progress bar in the UI. Standard stuff.&lt;/p&gt;
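That chunking step is small enough to show in full. A minimal sketch in plain Kotlin, assuming ~800-character chunks with a 200-character overlap (the overlap value is my illustration; the text above only pins down the chunk size):

```kotlin
// Split text into overlapping windows so that a sentence cut at a
// chunk boundary still appears intact in the neighbouring chunk.
fun chunkText(text: String, chunkSize: Int = 800, overlap: Int = 200): List<String> {
    require(overlap < chunkSize) { "overlap must be smaller than chunkSize" }
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        val end = minOf(start + chunkSize, text.length)
        chunks += text.substring(start, end)
        if (end == text.length) break
        start = end - overlap // step back so consecutive chunks share `overlap` chars
    }
    return chunks
}
```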

&lt;p&gt;When you ask a question, the &lt;em&gt;same&lt;/em&gt; embedding model converts your question into a vector. The app computes cosine similarity between the question vector and every stored chunk, takes the top four matches, and stitches them into a prompt template that explicitly tells the LLM: &lt;em&gt;answer only from the supplied context&lt;/em&gt;.&lt;/p&gt;
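Because a handful of PDFs only produces a few thousand chunks, brute-force search is fine here; no ANN index needed on-device. A minimal sketch of the cosine-similarity retrieval described above (function names are mine, not necessarily PocketSage's):

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors of equal length.
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / (sqrt(na) * sqrt(nb))
}

// Brute-force top-k: score every stored chunk against the query
// vector and keep the k most similar chunk texts.
fun topK(query: FloatArray, chunks: List<Pair<String, FloatArray>>, k: Int = 4): List<String> =
    chunks.sortedByDescending { (_, emb) -> cosine(query, emb) }
        .take(k)
        .map { it.first }
```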

&lt;p&gt;The prompt is fed to Gemma 4 E2B running in LiteRT-LM's &lt;code&gt;Engine&lt;/code&gt; runtime, which streams tokens back through a callback. Each token is appended to a &lt;code&gt;StateFlow&amp;lt;String&amp;gt;&lt;/code&gt; that the chat screen renders in real time, with the retrieved chunks shown beneath each answer so you can verify the model isn't hallucinating. End-to-end, on a Pixel 8, you get first token in roughly 2-3 seconds and a full answer in 10-15. Not GPT-4, but very usable.&lt;/p&gt;
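The "answer only from the supplied context" template is plain string assembly. A hypothetical version (the exact wording PocketSage ships may differ):

```kotlin
// Stitch retrieved chunks and the user's question into a single
// grounded prompt that forbids answering from outside the context.
fun buildRagPrompt(question: String, contextChunks: List<String>): String = buildString {
    appendLine("You are a helpful assistant. Answer ONLY from the context below.")
    appendLine("If the answer is not in the context, say you don't know.")
    appendLine()
    contextChunks.forEachIndexed { i, chunk ->
        appendLine("[Context ${i + 1}]")
        appendLine(chunk.trim())
        appendLine()
    }
    appendLine("Question: $question")
    append("Answer:")
}
```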




&lt;h2&gt;
  
  
  Under the Hood: The &lt;code&gt;LiteRtLmRunner&lt;/code&gt; 🔧
&lt;/h2&gt;

&lt;p&gt;This is the piece I'm proudest of, and it's also the piece that took the longest to get right. LiteRT-LM is Google's new orchestration layer that sits on top of LiteRT (formerly TensorFlow Lite). It handles KV-cache management, prompt templating, and the streaming token API — all the GenAI-specific plumbing that you used to have to write yourself with raw TFLite.&lt;/p&gt;

&lt;p&gt;Here's the core of how PocketSage talks to Gemma 4:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Singleton&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LiteRtLmRunner&lt;/span&gt; &lt;span class="nd"&gt;@Inject&lt;/span&gt; &lt;span class="k"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nd"&gt;@ApplicationContext&lt;/span&gt; &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;modelRepo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ModelRepository&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;LlmRunner&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Engine&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="nf"&gt;lazy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;modelPath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;modelRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getModelPath&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;absolutePath&lt;/span&gt;
        &lt;span class="nc"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;i&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TAG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Initializing engine — model: $modelPath"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nc"&gt;Engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;EngineConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;modelPath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;modelPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;cacheDir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cacheDir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;absolutePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="c1"&gt;// maxNumTokens left at default — overriding it triggers a&lt;/span&gt;
                &lt;span class="c1"&gt;// DYNAMIC_UPDATE_SLICE tensor shape error. Painful lesson.&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;also&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;sessionConfig&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SessionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;SamplerConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;topK&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BuildConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LLM_TOP_K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;topP&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BuildConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LLM_TEMPERATURE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toDouble&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;activeSession&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;

    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;callbackFlow&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modelRepo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isModelReady&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"Model not ready"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Cancel any in-flight generation — the user has asked something new&lt;/span&gt;
        &lt;span class="n"&gt;activeSession&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;cancelProcess&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;activeSession&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;session&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sessionConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;activeSession&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;

        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContentStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;listOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;InputData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="kd"&gt;object&lt;/span&gt; &lt;span class="err"&gt;: &lt;/span&gt;&lt;span class="nc"&gt;ResponseCallback&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;onNext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nf"&gt;trySend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;onDone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;onError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;throwable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Throwable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;throwable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;awaitClose&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;activeSession&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
            &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;flowOn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Dispatchers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are three things worth pausing on here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The &lt;code&gt;Engine&lt;/code&gt; is lazy and singleton-scoped.&lt;/strong&gt; Initializing the engine means loading 1.5 GB of model weights into memory and warming the KV cache. You absolutely do not want to do that on every query. Hilt's &lt;code&gt;@Singleton&lt;/code&gt; plus Kotlin's &lt;code&gt;by lazy&lt;/code&gt; give you a clean "load once, on first use" pattern that Just Works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;callbackFlow&lt;/code&gt; is the bridge between LiteRT-LM's callback API and Kotlin's &lt;code&gt;Flow&lt;/code&gt;.&lt;/strong&gt; This is honestly one of the most elegant pieces of the coroutines library. The &lt;code&gt;ResponseCallback&lt;/code&gt; from LiteRT-LM gives you &lt;code&gt;onNext&lt;/code&gt; / &lt;code&gt;onDone&lt;/code&gt; / &lt;code&gt;onError&lt;/code&gt;, which maps perfectly onto &lt;code&gt;trySend&lt;/code&gt; / &lt;code&gt;close()&lt;/code&gt; / &lt;code&gt;close(throwable)&lt;/code&gt;. The &lt;code&gt;awaitClose&lt;/code&gt; block runs when the collector cancels — which means if the user navigates away from the chat screen mid-generation, the session gets cleaned up properly. No leaked native memory, no zombie inference threads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;flowOn(Dispatchers.IO)&lt;/code&gt; keeps the UI thread sacred.&lt;/strong&gt; Token generation is CPU-heavy (XNNPACK is using 4 threads under the hood, hammering your big cores). If any of this leaked onto the main thread, scrolling would jank instantly. The dispatcher switch is one line and saves the entire UX.&lt;/p&gt;

&lt;p&gt;The cancellation logic — &lt;code&gt;activeSession?.cancelProcess()&lt;/code&gt; before starting a new one — is the kind of detail you only learn by shipping. Without it, if a user types question A and then question B before A finishes, two inference jobs race for the same engine: the first session wins the engine lock, and the second one's tokens come out interleaved like a poorly shuffled deck of cards. It took me an evening of "is the model broken?" debugging to spot it.&lt;/p&gt;




&lt;h2&gt;
  
  
  A War Story: The &lt;code&gt;.task&lt;/code&gt; vs &lt;code&gt;.litertlm&lt;/code&gt; Saga 😤
&lt;/h2&gt;

&lt;p&gt;I'll be honest about the part of this project that ate the most time. The first version of PocketSage used &lt;strong&gt;MediaPipe&lt;/strong&gt; + Gemma 2B with a &lt;code&gt;.task&lt;/code&gt; file. That stack works, but it's the older path. With Gemma 4, Google bifurcated the recommendation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For experimentation:&lt;/strong&gt; Google AI Edge Gallery (you're a user, not a dev)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For production apps:&lt;/strong&gt; &lt;strong&gt;LiteRT-LM&lt;/strong&gt; with &lt;code&gt;.litertlm&lt;/code&gt; files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LiteRT-LM is the more modern API. It supports the new Gemma 4 features properly (per-layer embeddings, the mixed-precision quant scheme, longer context windows up to 32k via the Android runtime, with the model itself supporting 128k natively). But — and this caught me hard — &lt;strong&gt;LiteRT-LM only loads &lt;code&gt;.litertlm&lt;/code&gt; files&lt;/strong&gt;. If you grab the &lt;code&gt;.task&lt;/code&gt; file off Hugging Face and try to feed it to LiteRT-LM's &lt;code&gt;Engine&lt;/code&gt;, you'll get a cryptic init failure that does not mention the word "format" anywhere.&lt;/p&gt;

&lt;p&gt;The fix is finding the right artifact. For PocketSage that's &lt;a href="https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm" rel="noopener noreferrer"&gt;&lt;code&gt;litert-community/gemma-4-E2B-it-litert-lm&lt;/code&gt;&lt;/a&gt; on Hugging Face. It uses Gemma's mixed quantization scheme — a blend of 2-bit, 4-bit, and 8-bit weights — which is what gets the model down to ~1.5 GB of active RAM while keeping output quality solid.&lt;/p&gt;

&lt;p&gt;Lesson: when you're working with a brand-new runtime, &lt;strong&gt;always check that the model artifact and the runtime versions match before you write a single line of integration code.&lt;/strong&gt; Saved me a week of confused debugging the second time.&lt;/p&gt;
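One cheap way to apply that lesson in code is to fail fast on the wrong artifact before the Engine ever sees it. A hypothetical guard (LiteRT-LM itself doesn't expose a format check like this, as far as I know):

```kotlin
import java.io.File

// Fail with a readable message instead of LiteRT-LM's cryptic init
// error when someone downloads the MediaPipe `.task` artifact instead.
fun requireLitertLmArtifact(model: File): File {
    require(model.exists()) { "Model file not found: ${model.path}" }
    require(model.extension == "litertlm") {
        "Expected a .litertlm file for LiteRT-LM, got .${model.extension} " +
            "(a .task file is for the MediaPipe LLM Inference API, not LiteRT-LM)"
    }
    return model
}
```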




&lt;h2&gt;
  
  
  Performance &amp;amp; Hardware Constraints 📊
&lt;/h2&gt;

&lt;p&gt;Let me be the engineer who actually tells you the trade-offs instead of pretending it's magic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory:&lt;/strong&gt; ~1.5 GB of active RAM for the engine + KV cache. This rules out budget devices with 4 GB total RAM (Android itself wants a chunk, and you'll get killed by the OOM killer the moment you background the app). Realistic minimum is &lt;strong&gt;6 GB RAM&lt;/strong&gt;, comfortable is 8 GB+. A Pixel 7/8, Samsung Galaxy S23+, or any flagship from the last two years is the sweet spot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; On a Pixel 8 with the CPU backend (XNNPACK, 4 threads), I see roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time to first token:&lt;/strong&gt; 2-3 seconds (after the engine is warm)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decode speed:&lt;/strong&gt; ~12-18 tokens/second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold start:&lt;/strong&gt; 5-10 seconds to load the engine on first query&lt;/li&gt;
&lt;/ul&gt;
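&lt;p&gt;Those throughput numbers are easy to reproduce yourself: timestamp the first and last streamed token and divide. The function below is just the bare arithmetic (the engine callback wiring is app-specific and omitted):&lt;/p&gt;

```kotlin
// Tokens/second from a streamed generation: count tokens after the first
// one arrives, divide by elapsed wall time over that window.
fun decodeTokensPerSecond(tokenCount: Int, firstTokenAtMs: Long, lastTokenAtMs: Long): Double {
    val decodeWindowMs = lastTokenAtMs - firstTokenAtMs
    require(decodeWindowMs > 0) { "need at least two token timestamps" }
    // Subtract 1: the first token belongs to prefill (time-to-first-token),
    // not to decode throughput.
    return (tokenCount - 1) * 1000.0 / decodeWindowMs
}
```

&lt;p&gt;For example, 100 tokens with the first at 2.5 s and the last at 9.1 s works out to 15 tokens/second, squarely in the range above.&lt;/p&gt;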

&lt;p&gt;GPU acceleration via the Mali/Adreno backends or NPU acceleration via Qualcomm QNN gets you another 2-3x on supported chipsets. PocketSage v0.1 ships CPU-only for maximum compatibility; GPU is on the roadmap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why E2B and not E4B?&lt;/strong&gt; This is the question I get most often. The Gemma 4 family ships in &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;E2B (effective ~2B params), E4B, 26B MoE, and 31B Dense&lt;/a&gt; sizes. E4B is meaningfully smarter — better long-context reasoning, better at multi-step questions. But it needs ~3 GB of active RAM and pushes time-to-first-token past 5 seconds on most phones. For an interactive chat experience, &lt;strong&gt;E2B is the sweet spot&lt;/strong&gt;: it fits comfortably on mid-range hardware, generates fast enough that streaming feels alive, and for the &lt;em&gt;extractive&lt;/em&gt; QA task at the heart of RAG ("what does this contract say about termination?"), the marginal accuracy gain from E4B isn't worth the latency tax.&lt;/p&gt;

&lt;p&gt;If you're targeting tablets or high-end flagships only and you need stronger reasoning, swap to E4B — the runner code is identical, just point &lt;code&gt;EngineConfig.modelPath&lt;/code&gt; at the bigger file.&lt;/p&gt;
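&lt;p&gt;To make that swap concrete, here's a hedged sketch. &lt;code&gt;EngineConfig.modelPath&lt;/code&gt; is the knob named above, but the &lt;code&gt;EngineConfig&lt;/code&gt; class and &lt;code&gt;configFor&lt;/code&gt; helper below are illustrative stand-ins, not the real LiteRT-LM signatures:&lt;/p&gt;

```kotlin
// The point: E2B vs E4B is a config change, not a code change.
// This data class is a stand-in for the real LiteRT-LM config type.
data class EngineConfig(val modelPath: String, val maxTokens: Int = 1024)

fun configFor(highEndDevice: Boolean): EngineConfig {
    val model = if (highEndDevice) "gemma-4-E4B-it.litertlm" // ~3 GB RAM, stronger reasoning
                else "gemma-4-E2B-it.litertlm"               // ~1.5 GB RAM, faster streaming
    return EngineConfig(modelPath = model)
}
```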




&lt;h2&gt;
  
  
  What's on the Roadmap 🗺️
&lt;/h2&gt;

&lt;p&gt;v0.1 ships a working end-to-end RAG loop. The goal from here is to keep the codebase &lt;strong&gt;legible&lt;/strong&gt; rather than dense. Help wanted on any of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;good first issue&lt;/code&gt;&lt;/strong&gt; — Settings screen with sliders for chunk size, overlap, and top-K&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;good first issue&lt;/code&gt;&lt;/strong&gt; — Empty-state and error-state polish across both screens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;help wanted&lt;/code&gt;&lt;/strong&gt; — ANN index (Hnswlib JNI or ObjectBox vector search) once the chunk count exceeds a few thousand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;help wanted&lt;/code&gt;&lt;/strong&gt; — Cross-encoder re-ranker over the top 20 → top 4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;help wanted&lt;/code&gt;&lt;/strong&gt; — Multi-turn chat with rolling summarization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;help wanted&lt;/code&gt;&lt;/strong&gt; — OCR for scanned PDFs (ML Kit Text Recognition)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;research&lt;/code&gt;&lt;/strong&gt; — Multimodal: Gemma 4 natively handles images and audio. Imagine asking "what does this scanned invoice say?" without an OCR step.&lt;/li&gt;
&lt;/ul&gt;
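&lt;p&gt;For context on the ANN item: a v0.1-style retrieval pass is a brute-force cosine scan over every chunk embedding — fine at hundreds of chunks, painful at tens of thousands. A generic sketch of that baseline (not PocketSage's actual code; embeddings are laid out row-major in one array to keep it dependency-free):&lt;/p&gt;

```kotlin
import kotlin.math.sqrt

// Brute-force top-K cosine retrieval: score every chunk, sort, take K.
// O(n * dim) per query — exactly the cost an ANN index would cut down.
// chunks holds numChunks rows of dim floats each, row-major.
fun topK(query: FloatArray, chunks: FloatArray, dim: Int, k: Int): IntArray {
    val n = chunks.size / dim
    val scores = FloatArray(n)
    for (c in 0 until n) {
        var dot = 0f; var qNorm = 0f; var cNorm = 0f
        for (i in 0 until dim) {
            val q = query[i]; val x = chunks[c * dim + i]
            dot += q * x; qNorm += q * q; cNorm += x * x
        }
        scores[c] = dot / (sqrt(qNorm) * sqrt(cNorm))
    }
    return (0 until n).sortedByDescending { scores[it] }.take(k).toIntArray()
}
```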




&lt;h2&gt;
  
  
  Try It Yourself 🚀
&lt;/h2&gt;

&lt;p&gt;There are two paths depending on whether you want to build from source or just kick the tires.&lt;/p&gt;

&lt;h3&gt;
  
  
  Path A: Just want to test it (5 minutes) 📱
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Download the latest APK from &lt;a href="https://github.com/umarpazir11/pocketsage/releases" rel="noopener noreferrer"&gt;GitHub Releases&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Download &lt;code&gt;gemma-4-E2B-it.litertlm&lt;/code&gt; (~2.58 GB on disk, ~1.5 GB active RAM at runtime) from &lt;a href="https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Side-load the APK to your Android device (you'll need to allow "Install unknown apps").&lt;/li&gt;
&lt;li&gt;Launch the app, pick the &lt;code&gt;.litertlm&lt;/code&gt; file when prompted, and you're off.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ The APK is signed with a personal developer key, not a Play Store key — Android will warn you about the unknown source. The app itself doesn't transmit data anywhere (that's the whole point), but if you're deploying this in a real enterprise context, audit the source yourself or build it from scratch. Trust your own build, not mine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Path B: Build from source 🛠️
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/umarpazir11/pocketsage.git
&lt;span class="nb"&gt;cd &lt;/span&gt;pocketsage
./gradlew installDebug
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MiniLM embedding model and vocab are already in the repo — you only need to side-load the Gemma 4 LLM (~2.58 GB) from &lt;a href="https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;. The app's first-run screen walks you through picking it.&lt;/p&gt;

&lt;p&gt;Once it's running, open the app, tap &lt;strong&gt;+&lt;/strong&gt; to add a PDF, ask away. Toggle airplane mode mid-conversation just to feel the magic. ✈️&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture 🌍
&lt;/h2&gt;

&lt;p&gt;I think we're at an inflection point with on-device AI that mirrors where mobile photography was around 2014. For years, "real" photography meant a DSLR, and the cloud was just where you stored the photos. Then computational photography on the SoC got good enough that the phone &lt;em&gt;was&lt;/em&gt; the camera, and the cloud became optional.&lt;/p&gt;

&lt;p&gt;LLMs are following the same arc. For two years, "real" AI has meant a frontier model in someone else's data center. But Gemma 4 E2B running on a Pixel 8 is genuinely useful for a huge class of tasks — document QA, summarization, code explanation, light reasoning. And it's running on hardware your users already paid for, on data that never leaves their device, under regulatory regimes that suddenly become trivial.&lt;/p&gt;

&lt;p&gt;For German SMEs sitting on filing cabinets full of contracts, HR records, and ISO 27001 audit logs they'd love to make searchable but can't legally upload anywhere — this changes the economics. The compliance surface collapses to "the user owns the device." The tooling is open-source, the models are Apache 2.0, the runtime is free.&lt;/p&gt;

&lt;p&gt;If you build for this market, the next twelve months are going to be wild.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get In Touch 💬
&lt;/h2&gt;

&lt;p&gt;PocketSage is &lt;strong&gt;MIT licensed&lt;/strong&gt;, contributions are genuinely welcome, and I read every issue. Whether you're an Android dev curious about on-device ML, an ML engineer who wants to learn Compose, or a documentation hawk — there's something in the roadmap for you.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐙 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/umarpazir11/pocketsage" rel="noopener noreferrer"&gt;github.com/umarpazir11/pocketsage&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💼 &lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/umardilpazir/" rel="noopener noreferrer"&gt;linkedin.com/in/umardilpazir&lt;/a&gt; — I'm especially interested in talking with German SMEs and Mittelstand companies thinking about privacy-first AI&lt;/li&gt;
&lt;li&gt;📬 &lt;strong&gt;Issues / PRs:&lt;/strong&gt; Open one; I respond fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If PocketSage is useful to you, &lt;strong&gt;a star on the repo helps others find it.&lt;/strong&gt; ⭐&lt;/p&gt;

&lt;p&gt;And if you're building privacy-first AI for European enterprises, get in touch on LinkedIn.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Kotlin, Compose, and an unreasonable amount of respect for Article 28. 🇪🇺&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  #Gemma4 #AndroidDev #LiteRT #OnDeviceAI #GDPR #Privacy #Kotlin #JetpackCompose #RAG #OpenSource
&lt;/h1&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
  </channel>
</rss>
