---
title: "Structured Output from On-Device LLMs on Android with GBNF"
published: true
description: "Enforce JSON schema output and build offline agent loops with on-device LLMs on Android using GBNF grammars, llama.cpp, and Kotlin coroutines."
tags: kotlin, android, architecture, performance
canonical_url: https://blog.mvpfactory.co/structured-output-on-device-llms-android-gbnf
---
## What We Are Building
In this workshop, I will walk you through enforcing structured JSON output from on-device LLMs on Android using GBNF grammars in llama.cpp. By the end, you will have a coroutine-based agent loop that dispatches tool calls, runs entirely offline, and keeps your UI at 60fps under thermal constraints.
Here is the minimal setup to get this working — no server, no API keys, no crossed fingers hoping the model returns valid JSON.
## Prerequisites
- Android Studio with Kotlin 1.9+
- A device or emulator running API 28+ (Pixel 8 recommended for benchmarking)
- llama.cpp integrated via NDK ([llama.cpp repo](https://github.com/ggerganov/llama.cpp))
- A quantized model (Q4_K_M, 3B parameters — Llama 3.2 3B works well)
- Familiarity with Kotlin coroutines and `StateFlow`
## Step 1: Understand Why Prompting Alone Fails
Most teams try to prompt-engineer their way to valid JSON. That works 80-something percent of the time. The rest of the time, it crashes your app. The ReAct pattern ([Yao et al., 2023](https://arxiv.org/abs/2210.03629)) depends on a structured contract between reasoning traces and tool dispatch. On-device, there is no API-level JSON mode to lean on. You enforce the contract yourself.
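To make that failure mode concrete, here is the defensive dance that prompt-only output forces on you. This is a sketch: `withJsonRetries` and its `parse`/`generate` parameters are illustrative names of mine, not part of llama.cpp or any library.

```kotlin
// Sketch: without grammar enforcement, every structured call needs a
// parse-and-retry wrapper like this. GBNF makes the whole function unnecessary.
fun <T> withJsonRetries(
    maxRetries: Int = 3,
    parse: (String) -> T?,      // returns null on invalid JSON
    generate: () -> String      // one sampled completion from the model
): T? {
    repeat(maxRetries) {
        parse(generate())?.let { return it }  // success: hand back the parsed value
        // failure: fall through and burn another full decode
    }
    return null  // caller still has to handle total failure
}
```

Every retry is a full decode, so the 15-20% failure rate does not just risk crashes, it multiplies your latency and battery cost.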
## Step 2: Define a GBNF Grammar for Tool Calls
GBNF grammars constrain token sampling at decode time. Every sampled token is validated against a grammar state machine — if a token would violate the grammar, its logit is masked to negative infinity before softmax. The model literally cannot produce invalid output.
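That masking step is easy to picture in isolation. The sketch below applies it to a raw logit vector; the `allowed` set stands in for whatever the grammar state machine currently permits (in real llama.cpp this happens in C++ over the full vocabulary).

```kotlin
import kotlin.math.exp

// Sketch of grammar-constrained sampling: mask disallowed logits to -inf,
// then softmax. Disallowed tokens end up with exactly zero probability.
fun maskedSoftmax(logits: FloatArray, allowed: Set<Int>): FloatArray {
    val masked = FloatArray(logits.size) { i ->
        if (i in allowed) logits[i] else Float.NEGATIVE_INFINITY
    }
    val max = masked.max()  // subtract the max for numerical stability
    val exps = FloatArray(logits.size) { i ->
        if (masked[i] == Float.NEGATIVE_INFINITY) 0f else exp(masked[i] - max)
    }
    val sum = exps.sum()
    return FloatArray(logits.size) { i -> exps[i] / sum }
}
```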
Let me show you a pattern I use in every project:
```kotlin
val toolCallGrammar = """
    root ::= "{" ws "\"tool\"" ws ":" ws tool-name ws ","
             ws "\"args\"" ws ":" ws "{" ws args ws "}" ws "}"
    tool-name ::= "\"search\"" | "\"calculate\"" | "\"summarize\""
    args ::= pair ("," ws pair)*
    pair ::= string ws ":" ws value
    value ::= string | number | bool | object
    object ::= "{" ws (pair ("," ws pair)*)? ws "}"
    bool ::= "true" | "false"
    string ::= "\"" [^"]* "\""
    number ::= "-"? [0-9]+ ("." [0-9]+)?
    ws ::= [ \t\n]*
""".trimIndent()
```
This restricts tool names to your known set. The model cannot hallucinate a tool that does not exist. For the full GBNF specification, see the [GBNF guide in the llama.cpp repository](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).
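One practical tip: derive the `tool-name` rule from the same list that drives your dispatcher, so the grammar and the Kotlin side can never drift apart. A minimal sketch, where `toolNameRule` is a hypothetical helper of mine, not a llama.cpp API:

```kotlin
// Build the GBNF alternation for tool names from the registered tool list.
// Each name becomes an escaped JSON string literal: "\"search\"" and so on.
fun toolNameRule(tools: List<String>): String =
    "tool-name ::= " + tools.joinToString(" | ") { "\"\\\"$it\\\"\"" }
```

`toolNameRule(listOf("search", "calculate", "summarize"))` reproduces the rule from the grammar above, and adding a tool becomes a one-line change.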
## Step 3: Build the Coroutine-Based Agent Loop
With structured output guaranteed, wire up the dispatch:
```kotlin
sealed class AgentAction {
    data class ToolCall(val tool: String, val args: Map<String, Any>) : AgentAction()
    data class FinalAnswer(val text: String) : AgentAction()
}

suspend fun agentLoop(
    prompt: String,
    llamaEngine: LlamaEngine,
    maxSteps: Int = 5
): String = withContext(Dispatchers.Default) {
    var context = prompt
    repeat(maxSteps) {
        val output = llamaEngine.generate(
            prompt = context,
            grammar = toolCallGrammar,
            maxTokens = 256
        )
        when (val action = parseAction(output)) {
            is AgentAction.ToolCall -> {
                val result = dispatch(action.tool, action.args)
                context += "\nObservation: $result\nThought:"
            }
            is AgentAction.FinalAnswer -> return@withContext action.text
        }
        ensureActive()  // cooperate with cancellation between steps
    }
    "Max steps reached"
}
```
Note `Dispatchers.Default`, not `Dispatchers.IO` — inference is CPU-bound, so you want the thread pool sized to core count.
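The loop leaves `parseAction` and `dispatch` to you. Here is one way `parseAction` could look. Because the grammar guarantees the output shape, a targeted regex is defensible here, though a real JSON parser (Moshi, kotlinx.serialization) is the safer production choice. This sketch uses its own illustrative `ParsedAction` type and keeps `args` as a raw JSON string rather than the `Map<String, Any>` used above:

```kotlin
sealed class ParsedAction {
    data class ToolCall(val tool: String, val argsJson: String) : ParsedAction()
    data class FinalAnswer(val text: String) : ParsedAction()
}

// Grammar-constrained output always looks like {"tool": ..., "args": {...}},
// so two targeted regexes are enough for this sketch.
val toolRegex = Regex("\"tool\"\\s*:\\s*\"([^\"]+)\"")
val argsRegex = Regex("\"args\"\\s*:\\s*(\\{.*\\})\\s*\\}\\s*$", RegexOption.DOT_MATCHES_ALL)

fun parseAction(output: String): ParsedAction {
    val tool = toolRegex.find(output)?.groupValues?.get(1)
        ?: return ParsedAction.FinalAnswer(output.trim())  // no tool call: treat as answer
    val argsJson = argsRegex.find(output)?.groupValues?.get(1) ?: "{}"
    return ParsedAction.ToolCall(tool, argsJson)
}
```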
## Step 4: Respect Thermal Budgets
An unbounded agent loop on a mobile device is a thermal throttling event waiting to happen. Monitor and adapt:
```kotlin
// currentThermalStatus requires API 29+; getSystemService<T>() is the
// androidx.core.content KTX extension.
val thermalStatus = context.getSystemService<PowerManager>()
    ?.currentThermalStatus ?: PowerManager.THERMAL_STATUS_NONE
if (thermalStatus >= PowerManager.THERMAL_STATUS_MODERATE) {
    llamaEngine.setThreadCount(2)
    delay(200)  // give the SoC a breather between steps
}
```
| Thermal Status | Threads | Step Delay | Token Limit |
|---|---|---|---|
| NONE / LIGHT | 4 | 0ms | 256 |
| MODERATE | 2 | 200ms | 128 |
| SEVERE | 1 | 500ms | 64 |
| CRITICAL | Pause loop | — | — |
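The table maps directly onto a small policy function. The sketch below uses the raw `PowerManager.THERMAL_STATUS_*` integer values (NONE=0, LIGHT=1, MODERATE=2, SEVERE=3, CRITICAL=4) so it can be unit-tested off-device; `InferenceBudget` and `budgetFor` are names I made up:

```kotlin
// Thermal status ints mirror PowerManager.THERMAL_STATUS_*:
// NONE=0, LIGHT=1, MODERATE=2, SEVERE=3, CRITICAL=4 (and above).
data class InferenceBudget(
    val threads: Int,
    val stepDelayMs: Long,
    val maxTokens: Int,
    val paused: Boolean = false
)

fun budgetFor(thermalStatus: Int): InferenceBudget = when {
    thermalStatus >= 4 -> InferenceBudget(0, 0, 0, paused = true)  // CRITICAL+: pause the loop
    thermalStatus == 3 -> InferenceBudget(1, 500, 64)              // SEVERE
    thermalStatus == 2 -> InferenceBudget(2, 200, 128)             // MODERATE
    else -> InferenceBudget(4, 0, 256)                             // NONE / LIGHT
}
```

Re-query the budget at the top of each agent step, not once at startup: thermal state changes mid-conversation.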
Emit partial results via `StateFlow` and let Jetpack Compose recompose on collection. No `LiveData`, no callbacks, no frame drops.
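A sketch of that emission path, testable off-device (`AgentUiState` and the holder class are illustrative names, and in a real app the holder would be a `ViewModel`):

```kotlin
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.asStateFlow

// Stream partial tokens to the UI through a StateFlow. Compose collects
// `state` and recomposes; StateFlow conflates fast updates automatically.
data class AgentUiState(val partialText: String = "", val running: Boolean = false)

class AgentUiStateHolder {
    private val _state = MutableStateFlow(AgentUiState())
    val state = _state.asStateFlow()

    fun onToken(token: String) {
        _state.value = _state.value.let {
            it.copy(partialText = it.partialText + token, running = true)
        }
    }

    fun onDone() {
        _state.value = _state.value.copy(running = false)
    }
}
```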
## Gotchas
- **Prompt-only JSON is not reliable.** We tested on a Pixel 8 with Llama 3.2 3B (Q4_K_M) across ~1,000 structured extraction calls. Prompt-only hit ~80-85% valid JSON. GBNF grammar hit 100% with only ~5-8% decode time overhead. That overhead eliminates retries entirely.
- **Unbounded loops will throttle your device.** Always cap `maxSteps`. Always check `ensureActive()` so cancellation works when the user navigates away.
- **Do not use `Dispatchers.IO` for inference.** It is CPU-bound work. Wrong dispatcher means wrong thread pool sizing.
- **Know the limits.** Models in the 1-7B parameter range handle structured extraction, classification, and simple multi-step reasoning well. They struggle with complex planning or knowledge-intensive tasks. Choose on-device for offline capability, sub-500ms latency on short generations, or when user data must never leave the device.
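To see the cancellation gotcha in isolation: a CPU-bound loop with no suspension points keeps running after its Job is cancelled, and `ensureActive()` is what restores cooperation. A minimal sketch with plain coroutines, no Android:

```kotlin
import kotlin.coroutines.coroutineContext
import kotlinx.coroutines.ensureActive

// Each iteration stands in for one inference step. Without the ensureActive()
// call, cancelling the Job would not stop this loop until it finished on its own.
suspend fun cancellableSteps(steps: Int, onStep: () -> Unit) {
    repeat(steps) {
        onStep()
        coroutineContext.ensureActive()  // throws CancellationException once cancelled
    }
}
```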
## Wrapping Up
The docs do not mention this, but the real unlock with on-device LLMs is not text generation — it is treating the model as a structured-output engine. Define tight GBNF grammars matching your tool schemas, dispatch deterministically, bound your agent loop, and monitor thermal state. That is where small models actually shine, and that 5-8% grammar overhead is the best trade you will make all year.