The views and opinions expressed on this blog are my own and do not reflect those of my employer. Additionally, any solutions, APIs, or products mentioned are for informational and discussion purposes only and should not be considered an endorsement.
Running Gemma 3 27B with Ollama on a powerful laptop or a gaming PC delivers great performance. The model is very versatile, has good “world knowledge”, and is good at text transformation (summarization, rewriting, etc.). Ollama also supports system prompts, so you can give the model a persona and turn it into a fun chatbot.
Today you can run 8-billion-parameter models on a recent Android device (check out Gemma 3n E2B and Gemma 3n E4B through the Google AI Edge Gallery 😉). But no device has enough memory yet to run a 27-billion-parameter model.
So as a workaround, let’s run the model on your machine at home and tunnel a connection from your phone to your home network.
Building blocks
Ollama
Ollama is a local LLM runtime that makes running open-weight models dead simple. You download a model with one command and it runs locally even if your machine doesn’t have a GPU (it then falls back to the CPU). Ollama also packs a web server and provides a REST API for inference.
So no complex setup and no Python environments to manage. Recently, they even released a desktop app to make interacting with local models even easier.
In this blog post we are using Gemma 3 27B, but the setup will work with any model supported by Ollama that your computer can run.
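For reference, here is roughly what the server side looks like (the model tag follows Ollama’s standard library naming). One detail that matters for this setup: by default Ollama only listens on localhost, so it needs to be bound to all interfaces for your phone to reach it.

# Download the model and try it locally
ollama pull gemma3:27b
ollama run gemma3:27b

# Start the server so it's reachable from other devices (default port 11434)
OLLAMA_HOST=0.0.0.0 ollama serve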
A llama dressed as Google’s Android robot (cc)
(Image generated via Imagen 4)
Tailscale
Tailscale creates a secure mesh network between your devices. Install it on your laptop and phone, and they can talk to each other as if they were on the same local network – even when you’re thousands of miles apart. Tailscale is built on the open-source WireGuard protocol and takes care of the configuration for you, so you don’t have to worry about port forwarding, firewall rules, or router settings.
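Once Tailscale is running on both devices, your laptop gets a stable address in the 100.x.y.z range. A quick way to find it from a terminal (you can also read it from the Tailscale app):

tailscale ip -4
# e.g. 100.101.102.103

That address, plus Ollama’s port, is what the Android app will point at.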
Bespoke Ollama Android library
I implemented a bespoke library for Android designed to interact with an Ollama API server. I tried to use generic Kotlin APIs as much as possible so the library could eventually be reused in a Kotlin Multiplatform (KMP) project.
I used kotlinx.serialization for serialization/deserialization. All API request and response models are defined as @Serializable data classes, which allows for easy and type-safe conversion between JSON and Kotlin objects.
import kotlinx.serialization.Serializable
import kotlinx.serialization.SerialName

@Serializable
data class GenerateRequest(
    val model: String,
    val prompt: String,
    val stream: Boolean = false,
    val format: String? = null,
    val options: Options? = null,
    val system: String? = null,
    val template: String? = null,
    val context: List<Int>? = null, // conversation state returned by a previous GenerateResponse
)
@Serializable
data class GenerateResponse(
    val model: String,
    @SerialName("created_at") val createdAt: String,
    val response: String,
    val done: Boolean,
    @SerialName("done_reason") val doneReason: String? = null,
    @SerialName("total_duration") val totalDuration: Long? = null,
    @SerialName("load_duration") val loadDuration: Long? = null,
    @SerialName("prompt_eval_count") val promptEvalCount: Int? = null,
    @SerialName("prompt_eval_duration") val promptEvalDuration: Long? = null,
    @SerialName("eval_count") val evalCount: Int? = null,
    @SerialName("eval_duration") val evalDuration: Long? = null,
    val context: List<Int>? = null,
)
@Serializable
data class ChatRequest(
    val model: String,
    val messages: List<Message>,
    val stream: Boolean = false,
    val format: String? = null,
    val options: Options? = null,
)

@Serializable
data class ChatResponse(
    val model: String,
    @SerialName("created_at") val createdAt: String,
    val message: Message,
    @SerialName("done_reason") val doneReason: String = "",
    val done: Boolean,
    @SerialName("total_duration") val totalDuration: Long? = null,
    @SerialName("load_duration") val loadDuration: Long? = null,
    @SerialName("prompt_eval_count") val promptEvalCount: Int? = null,
    @SerialName("prompt_eval_duration") val promptEvalDuration: Long? = null,
    @SerialName("eval_count") val evalCount: Int? = null,
    @SerialName("eval_duration") val evalDuration: Long? = null,
)

@Serializable
data class Message(
    val role: String,
    val content: String,
    val images: List<String>? = null,
)

@Serializable
data class Model(
    val name: String,
    @SerialName("modified_at") val modifiedAt: String,
    val size: Long,
    val digest: String,
    val details: ModelDetails,
)

@Serializable
data class ModelDetails(
    @SerialName("parent_model") val parentModel: String,
    val format: String,
    val family: String,
    val families: List<String>? = null,
    @SerialName("parameter_size") val parameterSize: String,
    @SerialName("quantization_level") val quantizationLevel: String,
)

@Serializable
data class Options(
    val numa: Boolean? = null,
    @SerialName("num_ctx") val numCtx: Int? = null,
    @SerialName("num_batch") val numBatch: Int? = null,
    @SerialName("num_gqa") val numGqa: Int? = null,
    @SerialName("num_gpu") val numGpu: Int? = null,
    @SerialName("main_gpu") val mainGpu: Int? = null,
    @SerialName("low_vram") val lowVram: Boolean? = null,
    @SerialName("f16_kv") val f16Kv: Boolean? = null,
    @SerialName("logits_all") val logitsAll: Boolean? = null,
    @SerialName("vocab_only") val vocabOnly: Boolean? = null,
    @SerialName("use_mmap") val useMmap: Boolean? = null,
    @SerialName("use_mlock") val useMlock: Boolean? = null,
    @SerialName("embedding_only") val embeddingOnly: Boolean? = null,
    @SerialName("rope_frequency_base") val ropeFrequencyBase: Float? = null,
    @SerialName("rope_frequency_scale") val ropeFrequencyScale: Float? = null,
    @SerialName("num_thread") val numThread: Int? = null,
    @SerialName("num_keep") val numKeep: Int? = null,
    @SerialName("seed") val seed: Int? = null,
    @SerialName("num_predict") val numPredict: Int? = null,
    @SerialName("top_k") val topK: Int? = null,
    @SerialName("top_p") val topP: Float? = null,
    @SerialName("tfs_z") val tfsZ: Float? = null,
    @SerialName("typical_p") val typicalP: Float? = null,
    @SerialName("repeat_last_n") val repeatLastN: Int? = null,
    @SerialName("temperature") val temperature: Float? = null,
    @SerialName("repeat_penalty") val repeatPenalty: Float? = null,
    @SerialName("presence_penalty") val presencePenalty: Float? = null,
    @SerialName("frequency_penalty") val frequencyPenalty: Float? = null,
    @SerialName("mirostat") val mirostat: Int? = null,
    @SerialName("mirostat_tau") val mirostatTau: Float? = null,
    @SerialName("mirostat_eta") val mirostatEta: Float? = null,
    @SerialName("penalize_newline") val penalizeNewline: Boolean? = null,
    @SerialName("stop") val stop: List<String>? = null,
)

@Serializable
data class ListModelsResponse(
    val models: List<Model>
)
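To give an idea of how these models fit together, here is a quick sketch of a chat request with a system-prompt persona (the model tag and persona text are just examples):

val request = ChatRequest(
    model = "gemma3:27b",
    messages = listOf(
        // The system message gives the model its persona
        Message(role = "system", content = "You are a grumpy pirate. Always stay in character."),
        Message(role = "user", content = "Summarize the plot of Hamlet."),
    ),
    options = Options(temperature = 0.7f),
)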
I used the Ktor client for handling HTTP requests and picked the CIO (Coroutine-based I/O) engine, a pure-Kotlin, lightweight solution. However, it’s not compatible with iOS, so I would have to add the Darwin engine for a KMP project targeting iOS.
import io.ktor.client.HttpClient
import io.ktor.client.call.body
import io.ktor.client.engine.cio.CIO
import io.ktor.client.plugins.contentnegotiation.ContentNegotiation
import io.ktor.client.request.get
import io.ktor.client.request.post
import io.ktor.client.request.setBody
import io.ktor.client.statement.bodyAsText
import io.ktor.http.ContentType
import io.ktor.http.contentType
import io.ktor.serialization.kotlinx.json.json
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json
class OllamaClient {

    // Base URL of the Ollama server, e.g. "http://100.101.102.103:11434"
    // (the machine's Tailscale IP plus Ollama's default port).
    private lateinit var host: String

    // Lenient Json instance, shared between content negotiation and the manual
    // NDJSON decoding below, so unknown fields in API responses are ignored everywhere.
    private val json = Json { ignoreUnknownKeys = true }

    private val client = HttpClient(CIO) {
        install(ContentNegotiation) {
            json(json)
        }
        engine {
            requestTimeout = 60_000 // large models can take a while to answer
        }
    }

    fun setHost(ipAddress: String) {
        host = ipAddress
    }
    // Ollama replies with newline-delimited JSON: a single object when stream = false,
    // one object per generated chunk when stream = true. We split on newlines and decode
    // each line. Note that bodyAsText() buffers the entire response, so with streaming
    // enabled the chunks are only emitted once generation finishes.
    fun generate(request: GenerateRequest): Flow<GenerateResponse> = flow {
        val response = client.post("$host/api/generate") {
            contentType(ContentType.Application.Json)
            setBody(request)
        }
        response.bodyAsText().split("\n").forEach { line ->
            if (line.isNotEmpty()) {
                emit(json.decodeFromString(line))
            }
        }
    }

    fun chat(request: ChatRequest): Flow<ChatResponse> = flow {
        val response = client.post("$host/api/chat") {
            contentType(ContentType.Application.Json)
            setBody(request)
        }
        response.bodyAsText().split("\n").forEach { line ->
            if (line.isNotEmpty()) {
                emit(json.decodeFromString(line))
            }
        }
    }
    // Lists the models available on the server (the "ollama list" equivalent).
    suspend fun list(): List<Model> {
        val response = client.get("$host/api/tags").body<ListModelsResponse>()
        return response.models
    }

    [...]
    // Omitting model update and manipulation functions
}
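Putting it together, using the client from a coroutine looks something like this (the address is a hypothetical Tailscale IP, and request is the ChatRequest built earlier):

// Call from a coroutine scope, e.g. viewModelScope.launch { ... }
val ollama = OllamaClient()
ollama.setHost("http://100.101.102.103:11434") // hypothetical Tailscale IP + default Ollama port

// Check which models the server offers
val models = ollama.list()
println(models.map { it.name })

// Send the chat request and print the reply
ollama.chat(request).collect { reply ->
    println(reply.message.content)
}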
Finally, don’t forget to add the INTERNET permission to your manifest 😉 and enable usesCleartextTraffic. This setting allows the application to connect to your local Ollama server over plain HTTP without a custom network security configuration. HTTPS is usually recommended, but since the traffic between your phone and the Ollama server goes through the Tailscale tunnel, the communication is still end-to-end encrypted.
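For reference, the relevant manifest bits are minimal:

<!-- AndroidManifest.xml (excerpt) -->
<uses-permission android:name="android.permission.INTERNET" />

<application
    android:usesCleartextTraffic="true"
    ...>
</application>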
Conclusion
After testing it extensively, I can confirm that this setup performs really well. Latency is almost imperceptible over a 5G or even an LTE connection, which makes for a solid real-time chat experience.
In conclusion, as long as you are connected, private, personal mobile AI with a “mid-size” LLM is now possible!