Cloud-based LLMs are powerful, but they’re not always the right tool for mobile apps.
They introduce:
• Network dependency
• Latency
• Usage-based costs
• Privacy concerns
As Android developers, we already ship complex logic on-device.
So the real question is:
Can we run LLMs fully offline on Android, using Kotlin?
Yes — and it’s surprisingly practical today.
In this article, I’ll show how to run LLMs locally on Android using Kotlin, powered by llama.cpp and a Kotlin-first library called Llamatik.
Why run LLMs offline on Android?
Offline LLMs unlock use cases that cloud APIs struggle with:
• 📴 Offline-first apps
• 🔐 Privacy-preserving AI
• 📱 Predictable performance & cost
• ⚡ Tight UI integration
Modern Android devices have:
• ARM CPUs with NEON
• Plenty of RAM (on mid/high-end devices)
• Fast local storage
The challenge isn’t hardware — it’s tooling.
llama.cpp: the engine behind on-device LLMs
llama.cpp is a high-performance C++ runtime designed to run LLMs efficiently on CPUs.
Why it’s ideal for Android:
• CPU-first (no GPU required)
• Supports quantized GGUF models
• Battle-tested across platforms
The downside?
It’s C++, and integrating it directly into Android apps is painful.
That’s where Llamatik comes in.
What is Llamatik?
Llamatik is a Kotlin-first library that wraps llama.cpp behind a clean Kotlin API.
It’s designed for:
• Android
• Kotlin Multiplatform (iOS & Desktop)
• Fully offline inference
Key features:
• No JNI in your app code
• GGUF model support
• Streaming & non-streaming generation
• Embeddings for offline RAG
• Kotlin Multiplatform–friendly API
You write Kotlin — native complexity stays inside the library.
Add Llamatik to your Android project
Llamatik is published on Maven Central.
dependencies {
    implementation("com.llamatik:library:0.12.0")
}
No custom Gradle plugins.
No manual NDK setup.
Add a GGUF model
Download a quantized GGUF model (Q4 or Q5 recommended) and place it in:
androidMain/assets/
└── phi-2.Q4_0.gguf
Quantized models are essential for mobile performance.
Load the model
val modelPath = LlamaBridge.getModelPath("phi-2.Q4_0.gguf")
LlamaBridge.initGenerateModel(modelPath)
This copies the model from assets and loads it into native memory.
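Quantized model files are still large (hundreds of MB to a few GB), so loading is not instant. A minimal sketch of doing it once, off the main thread — the suspend wrapper here is my own, not part of Llamatik:
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

suspend fun loadModel() = withContext(Dispatchers.IO) {
    // Copying from assets and loading into native memory can take a while,
    // so keep it off the main thread and call it once (e.g. at app start).
    val modelPath = LlamaBridge.getModelPath("phi-2.Q4_0.gguf")
    LlamaBridge.initGenerateModel(modelPath)
}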
Generate text (fully offline)
val response = LlamaBridge.generate(
"Explain Kotlin Multiplatform in one sentence."
)
No network.
No API keys.
No cloud calls.
Everything runs on-device.
Streaming generation (for chat UIs)
Streaming is critical for a good chat UX: users see tokens as they are generated instead of waiting for the full response.
LlamaBridge.generateStreamWithContext(
    system = "You are a concise assistant.",
    context = "",
    user = "List three benefits of offline LLMs.",
    onDelta = { token ->
        // Append the token to your UI as it arrives
    },
    onDone = { /* generation finished */ },
    onError = { error -> /* surface the error to the user */ }
)
This works naturally with:
• Jetpack Compose
• ViewModels
• StateFlow
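As a rough sketch of that wiring — the ViewModel, its state, and the error handling are illustrative; only the generateStreamWithContext call comes from Llamatik:
import androidx.lifecycle.ViewModel
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.update

class ChatViewModel : ViewModel() {
    private val _reply = MutableStateFlow("")
    val reply: StateFlow<String> = _reply

    fun ask(question: String) {
        _reply.value = ""
        LlamaBridge.generateStreamWithContext(
            system = "You are a concise assistant.",
            context = "",
            user = question,
            onDelta = { token -> _reply.update { it + token } }, // UI recomposes as tokens arrive
            onDone = { /* e.g. re-enable the send button */ },
            onError = { error -> _reply.value = "Error: $error" }
        )
    }
}
A MutableStateFlow can be updated from any thread, so this keeps working regardless of which thread Llamatik delivers callbacks on, and a Composable can simply collect reply with collectAsState().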
Embeddings & offline RAG
Llamatik also supports embeddings, enabling offline search and RAG use cases.
LlamaBridge.initModel(modelPath)
val embedding = LlamaBridge.embed("On-device AI with Kotlin")
Store embeddings locally and build fully offline AI features.
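From there, a tiny offline retrieval step is just cosine similarity over locally stored vectors. A minimal sketch, assuming embed() returns a FloatArray (check the actual return type in the library); the helpers here are hypothetical:
import kotlin.math.sqrt

// Hypothetical helper: score two embedding vectors by cosine similarity.
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Hypothetical helper: rank locally stored chunks against a query.
fun topMatches(query: String, chunks: Map<String, FloatArray>, k: Int = 3): List<String> {
    val q = LlamaBridge.embed(query)
    return chunks.entries
        .sortedByDescending { cosine(q, it.value) }
        .take(k)
        .map { it.key } // The best-matching chunks of text
}
The top-ranked chunks can then be passed as the context argument of generateStreamWithContext, which is the core of a simple offline RAG loop.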
Performance expectations
On-device LLMs have limits — let’s be honest:
• Use small, quantized models
• Expect slower responses than cloud GPUs
• Manage memory carefully
• Always call shutdown() when done (see the cleanup sketch below)
That said, for:
• Assistive features
• Short prompts
• Domain-specific tasks
The performance is absolutely usable on modern devices.
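For the cleanup point above, tying the release to a lifecycle owner is usually enough. A minimal sketch, assuming the shutdown() call lives on LlamaBridge (the ViewModel itself is illustrative):
import androidx.lifecycle.ViewModel

class AssistantViewModel : ViewModel() {
    // ... generation calls as shown earlier ...

    override fun onCleared() {
        // Free the native model memory when this screen's ViewModel goes away
        LlamaBridge.shutdown()
        super.onCleared()
    }
}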
When does this approach make sense?
Llamatik is a great fit when you need:
• Offline support
• Strong privacy guarantees
• Predictable costs
• Tight UI integration
It’s not meant to replace large cloud models — it’s edge AI done right.
⸻
Try it yourself
• GitHub: https://github.com/ferranpons/llamatik
• Website & demo app: https://llamatik.com
• llama.cpp: https://github.com/ggml-org/llama.cpp
Final thoughts
Running LLMs offline on Android using Kotlin is no longer experimental.
With the right abstractions, Kotlin developers can build private, offline, on-device AI — without touching C++.
If you’re curious about pushing AI closer to the device, this is a great place to start.