In this tutorial, you'll learn how to run a large language model (LLM) directly on a user's device — no cloud, no server, no cost. We'll start from scratch, build a working chat interface, and progressively introduce more advanced features: tool calling, sampling, and RAG.
Each concept is explained before the code, so you can follow along whether you're new to on-device AI or just new to NobodyWho.
The example app for this article is available on GitHub if you want to jump straight to working code. It is kept up to date with the latest features — if you want the code that matches this tutorial exactly, check out this commit.
Why Run AI On-Device?
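Most AI features rely on a cloud API: you send a request to a remote server, it runs the model, and sends a response back. That works well, but it comes with tradeoffs: it requires a network connection, sends user data off the device, adds round-trip latency, and incurs per-request costs.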
Running the model directly on the device avoids all of them:
- Works offline — no internet connection required
- Privacy by design — user data never leaves the device
- Low latency — no network round-trip
- No cloud costs — inference is free
The tradeoff is raw capability: on-device models are smaller and less powerful than frontier cloud models. But for many use cases — summarization, chatbots, or local search — they're more than good enough.
About NobodyWho
We'll use the NobodyWho library throughout this tutorial. It wraps llama.cpp in Rust and exposes a clean Flutter API for running any model in .gguf format.
Install it with:
flutter pub add nobodywho
Then initialize the engine in main.dart before your app launches:
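import 'package:flutter/material.dart';
import 'package:nobodywho/nobodywho.dart' as nobodywho;

void main() async {
  // Make sure the Flutter binding is ready before initializing the engine.
  WidgetsFlutterBinding.ensureInitialized();
  await nobodywho.NobodyWho.init();
  runApp(const MyApp());
}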
Choosing and Loading a Model
Picking a Model
We'll use LFM2, a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. Models must be in .gguf format; most will work with NobodyWho, though some may fail due to chat template formatting issues. See the model selection guide for more details.
Getting the Model onto the Device
You have two options:
| Approach | Pros | Cons |
|---|---|---|
| Bundle in assets | Simple setup, great for development | Increases app size significantly |
| Download on demand | Keeps app size small | Requires more implementation work |
For this tutorial, we'll bundle the model in assets to keep things simple. In production, you'd want to use a download-on-demand approach with something like background_downloader.
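The download-on-demand option can be sketched with nothing but dart:io. This is a minimal illustration, not a substitute for background_downloader: it has no progress reporting, resume, or background support. The function name ensureModel and the URL you pass in are assumptions made for this example.

```dart
import 'dart:io';

/// Downloads the model to [targetPath] unless a file already exists there.
/// [url] is a placeholder for wherever you host the .gguf file.
Future<File> ensureModel(Uri url, String targetPath) async {
  final file = File(targetPath);
  if (await file.exists()) return file;

  // Write to a temporary file first so an interrupted download
  // is never mistaken for a complete model.
  final partial = File('$targetPath.part');
  final client = HttpClient();
  try {
    final request = await client.getUrl(url);
    final response = await request.close();
    if (response.statusCode != HttpStatus.ok) {
      await response.drain<void>();
      throw HttpException('Download failed: ${response.statusCode}', uri: url);
    }
    // Stream the body to disk instead of holding it all in memory.
    await response.pipe(partial.openWrite());
    return await partial.rename(targetPath);
  } finally {
    client.close();
  }
}
```

You would call it with a path inside the app's documents directory, then pass the returned file's path to the model loader.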
Steps:
- Create an assets/ folder at the root of your project (if it doesn't exist).
- Register it in pubspec.yaml:
  flutter:
    assets:
      - assets/
- Download this GGUF model, rename it model.gguf, and place it in the assets/ folder.
Accessing the Model at Runtime
NobodyWho reads the model from the filesystem, so we copy it from Flutter's asset bundle to the app's documents directory on first launch. Add path_provider to handle this:
flutter pub add path_provider
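import 'dart:io';
import 'package:flutter/services.dart';
import 'package:path_provider/path_provider.dart';

// Copy the bundled model to the documents directory on first launch.
final dir = await getApplicationDocumentsDirectory();
final model = File('${dir.path}/model.gguf');
if (!await model.exists()) {
  final data = await rootBundle.load('assets/model.gguf');
  // Respect the view's offset and length in case the ByteData
  // does not cover its whole underlying buffer.
  final bytes = data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes);
  await model.writeAsBytes(bytes, flush: true);
}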
Basic Chat
With the model in place, you're ready to start a conversation. Here's the simplest possible usage — good for testing or when you don't need a full chat UI:
final chat = await nobodywho.Chat.fromPath(modelPath: model.path);
final msg = await chat.ask('Is water wet?').completed();
print(msg);
Putting It Together
Here's the complete minimal app so far:
import 'dart:io';
import 'package:flutter/material.dart';
import 'package:flutter/services.dart';
import 'package:nobodywho/nobodywho.dart' as nobodywho;
import 'package:path_provider/path_provider.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await nobodywho.NobodyWho.init();
  runApp(const MainApp());
}

class MainApp extends StatelessWidget {
  const MainApp({super.key});

  Future<void> _onPressed() async {
    try {
      // Copy the bundled model to the documents directory on first launch.
      final dir = await getApplicationDocumentsDirectory();
      final model = File('${dir.path}/model.gguf');
      if (!await model.exists()) {
        final data = await rootBundle.load('assets/model.gguf');
        final bytes = data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes);
        await model.writeAsBytes(bytes, flush: true);
      }

      // Load the model and wait for the full answer.
      final chat = await nobodywho.Chat.fromPath(modelPath: model.path);
      final msg = await chat.ask('How do I code a button in Flutter?').completed();
      debugPrint(msg);
    } catch (err) {
      debugPrint("Error: $err");
    }
  }

  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      home: Scaffold(
        body: Center(
          child: ElevatedButton(
            onPressed: _onPressed,
            child: const Text("Ask - How do I code a button in Flutter?"),
          ),
        ),
      ),
    );
  }
}
For customization options like system prompts and context size, see the Chat documentation.
Building a Chat Interface
A one-shot ask().completed() call is fine for single questions, but a real chat interface needs to stream tokens as they arrive — otherwise users stare at a blank screen until the full response is ready.
Streaming Tokens
final response = chat.ask('How do I code a button in Flutter?');
await for (final token in response) {
  print(token); // Each token arrives as it's generated
}
A token is the smallest unit a model generates — typically a word fragment, punctuation mark, or whitespace character.
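To see how individual tokens add up to a full reply, here is a small pure-Dart sketch that needs no model. The fakeTokens stream and its contents are made up for illustration; a real token stream would take its place.

```dart
import 'dart:async';

/// Stand-in for a model's token stream; the tokens are invented.
Stream<String> fakeTokens() async* {
  const tokens = ['To', ' add', ' a', ' button', ',', ' use', ' Elevated', 'Button', '.'];
  for (final token in tokens) {
    // Simulate the delay between generated tokens.
    await Future<void>.delayed(const Duration(milliseconds: 50));
    yield token;
  }
}

/// Concatenates tokens as they arrive, reporting each partial result.
Future<String> collect(
  Stream<String> tokens,
  void Function(String partial) onUpdate,
) async {
  final buffer = StringBuffer();
  await for (final token in tokens) {
    buffer.write(token);
    onUpdate(buffer.toString());
  }
  return buffer.toString();
}

Future<void> main() async {
  // Prints the growing text after every token, then the final result.
  final full = await collect(fakeTokens(), print);
  print('Done: $full');
}
```

Note that word fragments such as ' Elevated' and 'Button' only become readable once concatenated, which is why a chat UI appends tokens to a running string rather than showing them one by one.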
Handling the Streaming Content
class _ChatScreenState extends State<ChatScreen> {
  final List<nobodywho.Message> _messages = [];
  final TextEditingController _textController = TextEditingController();
  String? _streamingContent; // Partial reply shown while tokens arrive
  bool _responding = false; // Blocks new questions during a reply

  Future<void> _ask() async {
    final userInput = _textController.text.trim();
    if (userInput.isEmpty || _responding) return;

    setState(() {
      _responding = true;
      _streamingContent = null;
    });

    // `chat` is the Chat instance created earlier with Chat.fromPath.
    final tokenStream = chat.ask(userInput);
    await for (final token in tokenStream) {
      setState(() {
        _streamingContent = (_streamingContent ?? '') + token;
      });
    }
    // ...continued below
  }
}
Updating the Message List
Once the stream completes, fetch the full chat history and update your UI state:
final history = await chat.getChatHistory();
final List<nobodywho.Message> messages = [];
for (var message in history) {
  messages.add(
    message.copyWith(
      content: message.content,
    ),
  );
}
setState(() {
  _messages.clear();
  _messages.addAll(messages);
  _streamingContent = null;
  _responding = false;
});
Wiring Up the UI
- Connect your TextField to call _ask() via onSubmitted
- Render _messages in a ListView
- Append _streamingContent at the bottom while streaming
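One possible shape for that wiring is sketched below. ChatView is a hypothetical widget written for this example, and it takes message texts as plain strings so that it makes no assumptions about the fields of nobodywho.Message.

```dart
import 'package:flutter/material.dart';

/// Hypothetical presentational widget: a message list plus an input field.
class ChatView extends StatelessWidget {
  const ChatView({
    super.key,
    required this.messages,
    required this.streamingContent,
    required this.controller,
    required this.onSubmit,
  });

  final List<String> messages;
  final String? streamingContent;
  final TextEditingController controller;
  final VoidCallback onSubmit;

  @override
  Widget build(BuildContext context) {
    // While a reply is streaming, show it as one extra row at the bottom.
    final partial = streamingContent;
    final rows = [...messages, if (partial != null) partial];

    return Column(
      children: [
        Expanded(
          child: ListView.builder(
            itemCount: rows.length,
            itemBuilder: (context, index) => ListTile(title: Text(rows[index])),
          ),
        ),
        Padding(
          padding: const EdgeInsets.all(8),
          child: TextField(
            controller: controller,
            onSubmitted: (_) => onSubmit(),
            decoration: const InputDecoration(hintText: 'Ask something'),
          ),
        ),
      ],
    );
  }
}
```

Placed inside a Scaffold body, this widget would receive the state's text controller, the current streaming content, the message texts, and _ask as the onSubmit callback.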
Tool Calling
Tool calling lets the model interact with the outside world. You define a set of functions — each with a name, a description, and an implementation — and the model decides when and how to call them based on the user's request.
import 'dart:math' as math;
import 'package:nobodywho/nobodywho.dart' as nobodywho;

// A synchronous tool: the model supplies the radius.
final circleAreaTool = nobodywho.Tool(
  name: "circle_area",
  description: "Calculates the area of a circle given its radius",
  function: ({required double radius}) {
    final area = math.pi * radius * radius;
    return "Circle with radius $radius has area ${area.toStringAsFixed(2)}";
  },
);

// An asynchronous tool. fetchWeather is your own function (not shown here).
final getWeatherTool = nobodywho.Tool(
  name: "get_weather",
  description: "Get the current weather for a given city",
  function: ({required String city}) async {
    return await fetchWeather(city);
  },
);

// Register the tools when creating the chat.
final chat = await nobodywho.Chat.fromPath(
  modelPath: model.path,
  tools: [circleAreaTool, getWeatherTool],
);

final response = await chat
    .ask('What is the area of a circle with a radius of 2?')
    .completed();
print(response);
The model reads each tool's description to decide when to call it, so writing clear, specific descriptions matters.
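Conceptually, a tool call boils down to the model emitting a tool name plus arguments, and the host application looking up and running the matching function. The sketch below illustrates that general idea in plain Dart. It is not how NobodyWho implements tool calling, and the JSON shape used here is an assumption made for the example.

```dart
import 'dart:convert';
import 'dart:math' as math;

typedef ToolFn = String Function(Map<String, dynamic> args);

/// Registry mapping tool names to implementations.
final Map<String, ToolFn> tools = {
  'circle_area': (args) {
    final radius = (args['radius'] as num).toDouble();
    return (math.pi * radius * radius).toStringAsFixed(2);
  },
};

/// Runs the tool named in a JSON tool call such as
/// {"name": "circle_area", "arguments": {"radius": 2}}.
/// The format is hypothetical; real models and libraries use their own.
String dispatch(String toolCallJson) {
  final call = jsonDecode(toolCallJson) as Map<String, dynamic>;
  final tool = tools[call['name']];
  if (tool == null) return 'Unknown tool: ${call['name']}';
  return tool(call['arguments'] as Map<String, dynamic>);
}

void main() {
  // The result would be fed back to the model so it can phrase an answer.
  print(dispatch('{"name": "circle_area", "arguments": {"radius": 2}}')); // prints 12.57
}
```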
See the Tool Calling documentation for more.
Sampling
When generating a token, the model produces a probability distribution over every possible next token. A sampler controls how the final token is chosen from that distribution.
The default behavior involves some randomness, which produces natural, varied output. But you can tune it:
- Lower temperature → more deterministic, predictable output
- Higher temperature → more creative, varied output
- Constrained sampling → force output into a specific format, such as JSON
final chat = await nobodywho.Chat.fromPath(
  modelPath: model.path,
  sampler: nobodywho.SamplerPresets.temperature(temperature: 0.2),
);
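To build intuition for what temperature does, the following pure-Dart sketch applies the standard temperature-scaled softmax to three made-up scores. It illustrates the general technique only; the scores are invented and nothing here reflects NobodyWho's internal sampler code.

```dart
import 'dart:math' as math;

/// Converts raw scores (logits) into probabilities, scaled by temperature.
List<double> softmax(List<double> logits, double temperature) {
  final scaled = logits.map((l) => l / temperature).toList();
  // Subtract the maximum before exponentiating for numerical stability.
  final maxLogit = scaled.reduce(math.max);
  final exps = scaled.map((l) => math.exp(l - maxLogit)).toList();
  final sum = exps.reduce((a, b) => a + b);
  return exps.map((e) => e / sum).toList();
}

void main() {
  const logits = [2.0, 1.0, 0.5]; // made-up scores for three candidate tokens
  for (final t in [0.2, 1.0, 2.0]) {
    final probs = softmax(logits, t).map((p) => p.toStringAsFixed(3)).join(', ');
    print('temperature $t -> $probs');
  }
}
```

At a low temperature almost all of the probability lands on the highest-scoring token, while at a high temperature the distribution flattens and the less likely tokens are picked more often.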
See the Sampling documentation for more.
RAG
Retrieval-Augmented Generation (RAG) combines document search with LLM generation. The model uses retrieved documents to ground its responses in your knowledge base.
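The retrieve-then-generate flow can be illustrated without any model at all. In this sketch a toy word-overlap score stands in for a real reranker, and the retrieved text is placed directly into a prompt, which is one common way to supply context. All function names and documents are invented for illustration.

```dart
/// Toy relevance score: how many query words appear in the document.
/// A real reranker model would replace this function.
int overlapScore(String query, String document) {
  final docWords = document.toLowerCase().split(RegExp(r'\W+')).toSet();
  return query
      .toLowerCase()
      .split(RegExp(r'\W+'))
      .where((w) => w.isNotEmpty && docWords.contains(w))
      .length;
}

/// Returns the [k] highest-scoring documents for [query].
List<String> retrieve(String query, List<String> documents, {int k = 2}) {
  final sorted = [...documents]
    ..sort((a, b) => overlapScore(query, b).compareTo(overlapScore(query, a)));
  return sorted.take(k).toList();
}

/// Builds the text the model would see: retrieved context, then the question.
String buildPrompt(String query, List<String> context) =>
    'Answer using only this context:\n${context.join('\n')}\n\nQuestion: $query';

void main() {
  const docs = [
    'Returns are accepted within 30 days',
    'Shipping is free on orders over 50 dollars',
    'Support is available by email',
  ];
  const question = 'Is shipping free?';
  print(buildPrompt(question, retrieve(question, docs)));
}
```

Word overlap is a crude signal that misses synonyms and paraphrases, which is the gap a reranking model is meant to close.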
Example: A Customer Service Assistant
import 'package:nobodywho/nobodywho.dart' as nobodywho;

Future<void> main() async {
  // The cross-encoder re-ranks retrieved documents by relevance.
  // Recommended model:
  // https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf
  // Follow the same approach as the chat model to import the reranker model.
  final crossencoder = await nobodywho.CrossEncoder.fromPath(
    modelPath: rerankerModel.path,
  );

  final knowledge = [
    "Our company offers a 30-day return policy for all products",
    "Free shipping is available on orders over \$50",
    "Customer support is available via email and phone",
    "We accept credit cards, PayPal, and bank transfers",
    "Order tracking is available through your account dashboard",
  ];

  final searchKnowledgeTool = nobodywho.Tool(
    name: "search_knowledge",
    description: "Search the knowledge base for relevant information",
    function: ({required String query}) async {
      final ranked = await crossencoder.rankAndSort(query: query, documents: knowledge);
      final topDocs = ranked.take(3).map((e) => e.$1).toList();
      return topDocs.join("\n");
    },
  );

  final chat = await nobodywho.Chat.fromPath(
    modelPath: model.path,
    systemPrompt:
        "You are a customer service assistant. Use the search_knowledge tool "
        "to find relevant information from our policies before answering.",
    tools: [searchKnowledgeTool],
  );

  final response = await chat.ask("What is your return policy?").completed();
  print(response);
}
See the Embeddings & RAG documentation for more.
What's Next?
You now have a complete foundation for building on-device AI features in Flutter:
- Load and run a GGUF model
- Build a streaming chat interface
- Extend the model with tool calling
- Control output style with sampling
- Ground responses in a knowledge base with RAG
From here, you can explore the full NobodyWho documentation or dig into the example app to see everything working end to end.