<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: pielouNW</title>
    <description>The latest articles on DEV Community by pielouNW (@pielounw).</description>
    <link>https://dev.to/pielounw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3829247%2F74724489-a661-4b45-9359-d82b4f28c1f3.png</url>
      <title>DEV Community: pielouNW</title>
      <link>https://dev.to/pielounw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pielounw"/>
    <language>en</language>
    <item>
      <title>Beginner's Guide to Essential Terms in Artificial Intelligence</title>
      <dc:creator>pielouNW</dc:creator>
      <pubDate>Thu, 07 May 2026 12:14:58 +0000</pubDate>
      <link>https://dev.to/pielounw/beginners-guide-to-essential-terms-in-artificial-intelligence-4ip1</link>
      <guid>https://dev.to/pielounw/beginners-guide-to-essential-terms-in-artificial-intelligence-4ip1</guid>
      <description>&lt;p&gt;AI has its own language, and if you're just getting started, it can feel like everyone else got the memo but you.&lt;/p&gt;

&lt;p&gt;Terms like &lt;em&gt;tokens&lt;/em&gt;, &lt;em&gt;inference&lt;/em&gt;, and &lt;em&gt;quantization&lt;/em&gt; get tossed around in articles, videos, and job descriptions as if they're common knowledge, but they're not.&lt;/p&gt;

&lt;p&gt;This guide helps you navigate the AI jungle: it covers the core AI vocabulary you'll encounter and defines each term simply. Whether you're building something, exploring the field, or just trying to follow the conversation, these are the terms worth knowing.&lt;/p&gt;

&lt;p&gt;You can also find this article on the &lt;a href="https://www.nobodywho.ai/" rel="noopener noreferrer"&gt;NobodyWho website&lt;/a&gt;, where you can also learn how to integrate LLMs into your applications.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The AI Stack
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The big picture.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Artificial Intelligence (AI)
&lt;/h3&gt;

&lt;p&gt;The field of computer science focused on building systems that can perform tasks normally requiring human intelligence, such as understanding text or audio, recognizing images, or making decisions.&lt;/p&gt;

&lt;p&gt;The term "AI" is ultimately a moving target. As the border between machine tasks and human tasks moves, the definition shifts too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Machine Learning (ML)
&lt;/h3&gt;

&lt;p&gt;A subset of AI where systems learn from data instead of being explicitly programmed. Rather than writing rules by hand, you feed the system examples and it figures out the patterns on its own.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep Learning (DL)
&lt;/h3&gt;

&lt;p&gt;A subset of machine learning that uses neural networks with many layers to learn from large amounts of data. It's the technology behind most modern AI breakthroughs. State-of-the-art systems for image recognition, speech synthesis, and large language models all rely on it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generative AI (Gen AI)
&lt;/h3&gt;

&lt;p&gt;Generative AI refers to Artificial Intelligence systems that are capable of creating new content such as text, images, audio, video, or code.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. How Models Are Built
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Behind the scenes.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;A structured collection of data used to train, test, or evaluate a model. Datasets can contain text, images, audio, or any other form of information. The quality and size of a dataset directly affect how well a model performs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;The process of exposing a model to data so it can learn patterns. During training, the model adjusts its internal parameters millions (or billions) of times to get better at its task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parameters / Weights
&lt;/h3&gt;

&lt;p&gt;The internal numerical values a model learns during training. Parameters are what the model actually "knows", before being fed any prompts. They encode the patterns extracted from training data. A model with 70 billion parameters has 70 billion of these numbers, all tuned to make its outputs as accurate as possible. Weights is another term for the same thing, often used when referring to the files you download for open-weight models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fine-tuning
&lt;/h3&gt;

&lt;p&gt;The process of taking a pre-trained model and continuing to train it on a smaller, specialized dataset to adapt it to a specific task or style. Fine-tuning is faster and cheaper than training from scratch, and it's how generic models get turned into domain-specific ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distillation
&lt;/h3&gt;

&lt;p&gt;A training technique where a smaller model (the student) is trained to mimic the behavior of a larger model (the teacher). The goal is to compress the capabilities of a large, expensive model into a smaller, faster one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quantization
&lt;/h3&gt;

&lt;p&gt;A technique for reducing a model's size by lowering the precision of its weights, for example going from 32-bit floats to 8-bit integers. Quantized models are faster and cheaper to run, with a trade-off in accuracy.&lt;/p&gt;
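&lt;p&gt;The core idea can be sketched in a few lines of Python. This is a toy symmetric int8 quantizer over a handful of made-up weights, not any particular library's implementation:&lt;/p&gt;

```python
weights = [0.12, -0.5, 0.33, 0.91, -0.07]      # 32-bit floats in a real model

scale = max(abs(w) for w in weights) / 127.0   # map the largest weight to +/-127
q = [round(w / scale) for w in weights]        # 8-bit integers: 1 byte each instead of 4
dequantized = [v * scale for v in q]           # approximate reconstruction at runtime

print(q)            # small integers
print(dequantized)  # close to, but not exactly, the original weights
```

&lt;p&gt;The reconstruction error is the accuracy trade-off mentioned above: each weight is off by at most half a quantization step.&lt;/p&gt;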




&lt;h2&gt;
  
  
  3. What a Model Is
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The different shapes a model can take.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Model
&lt;/h3&gt;

&lt;p&gt;The output of training: one or more files encoding what the system has learned about mapping inputs to outputs. Models can range from a few megabytes to several terabytes.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM (Large Language Model)
&lt;/h3&gt;

&lt;p&gt;A type of deep learning model trained on massive amounts of text data to understand and generate human language. LLMs like GPT, Claude, and Gemini predict the next most likely word/token given a context. They're the engine behind most modern AI chat and writing tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  SLM (Small Language Model)
&lt;/h3&gt;

&lt;p&gt;A language model trained with fewer parameters than a typical LLM, designed to run efficiently on limited hardware like laptops, phones, or even smartwatches. SLMs are not categorically different from LLMs, but simply smaller variants.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mixture of Experts (MoE)
&lt;/h3&gt;

&lt;p&gt;An architecture where only a fraction of the model's parameters are used for any given token, rather than all of them. This means MoE models still need a lot of memory to hold all the weights, but they spend less compute per token, so they run faster than a dense model of equivalent size. Recent examples include DeepSeek, Mixtral, and Qwen's MoE variants.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open-weights Models
&lt;/h3&gt;

&lt;p&gt;Models that are publicly released, allowing anyone to download, run, and fine-tune them. Popular examples include Llama 3, Mistral, Qwen, Gemma and DeepSeek.&lt;/p&gt;

&lt;p&gt;The term "open-weights" is used rather than "open-source" to specify exactly what is being released. "Open-source" refers to the publishing of source code, which is human-readable code used to produce a non-human-readable binary artifact (the compiled program). The model itself is a non-human-readable binary artifact, so the term "open-weights" is used to specify that it's the &lt;em&gt;weights&lt;/em&gt; of the model that are open, and not necessarily the training source code or dataset that was used to produce the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vision Model
&lt;/h3&gt;

&lt;p&gt;A model specialized in processing and understanding images. Vision models can classify what's in an image, detect objects, generate captions, or power visual search.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multimodal Model
&lt;/h3&gt;

&lt;p&gt;A model that can process or generate more than one type of data like text, images, audio or video. For example, GPT-4o and Gemini are multimodal: you can send them an image and ask a question about it, or have them describe what they hear in an audio file.&lt;/p&gt;

&lt;p&gt;Multimodal models aren't necessarily capable of ingesting and outputting the same types of data. Many multimodal models can receive image or text inputs, and only generate text outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reasoning Model vs. Thinking Model
&lt;/h3&gt;

&lt;p&gt;These terms are often used interchangeably, but there's a subtle distinction.&lt;br&gt;
A &lt;strong&gt;reasoning model&lt;/strong&gt; is explicitly trained or prompted to work through problems step by step before producing an answer, breaking complex tasks into logical stages.&lt;br&gt;
A &lt;strong&gt;thinking model&lt;/strong&gt; typically refers to models that have a dedicated internal "thinking" phase, where the model processes before responding.&lt;br&gt;
In practice, both aim to improve accuracy on complex tasks by slowing down the output process.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Language &amp;amp; Text Processing
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;How models read and represent text under the hood.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Token
&lt;/h3&gt;

&lt;p&gt;The basic unit an LLM processes. For text, a token is roughly a word fragment: "learning" might be one token, while "incredible" might be split into two tokens, "in" and "credible". Models don't read characters or full words; they read tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tokenization
&lt;/h3&gt;

&lt;p&gt;The process of converting some kind of input (text, image, audio, etc.) into tokens. All model inputs are converted to tokens before being fed into the model.&lt;/p&gt;
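&lt;p&gt;As a toy illustration in Python: real tokenizers (such as BPE) learn their vocabulary from data, but a greedy longest-match over a small fixed vocabulary shows the basic idea of splitting text into subword units:&lt;/p&gt;

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens = []
    i = 0
    while i != len(text):
        # try the longest substring first, backing off until a vocab entry matches
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character falls back to itself
            i += 1
    return tokens

vocab = {"in", "credible", "learn", "ing", " "}
print(tokenize("incredible", vocab))  # ['in', 'credible']
print(tokenize("learning", vocab))    # ['learn', 'ing']
```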

&lt;h3&gt;
  
  
  Embeddings
&lt;/h3&gt;

&lt;p&gt;A way of representing data (text, images, audio) as vectors (lists of numbers), in a high-dimensional space. Similar concepts end up close together in that space. Embeddings are what allow models to understand that "king" and "queen" are related, or that a photo of a cat is similar to the word "cat."&lt;/p&gt;

&lt;p&gt;Embeddings are particularly useful in RAG systems, to identify relevant sources of information to include.&lt;/p&gt;
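&lt;p&gt;"Close together" is usually measured with cosine similarity. Here is a minimal Python sketch, with hypothetical hand-picked 4-dimensional vectors; real embeddings have hundreds or thousands of dimensions and come from a trained model:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """1.0 means pointing the same way; near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-written for illustration only.
king  = [0.9, 0.8, 0.1, 0.2]
queen = [0.8, 0.9, 0.2, 0.1]
pizza = [0.1, 0.0, 0.9, 0.8]

print(cosine_similarity(king, queen))  # high: related concepts
print(cosine_similarity(king, pizza))  # low: unrelated
```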




&lt;h2&gt;
  
  
  5. Using a Model
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The controls and inputs that shape how a model behaves.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt
&lt;/h3&gt;

&lt;p&gt;The input you give to a model. For language models, a prompt is the text, like a question, instruction, or context that the model responds to. For multimodal models, the prompt could also contain an image or some audio. Prompt quality directly affects output quality. Small changes in wording can produce significantly different results.&lt;/p&gt;

&lt;h3&gt;
  
  
  System Prompt
&lt;/h3&gt;

&lt;p&gt;A special prompt, invisible to the end user, that sets the model's behavior, tone, and constraints before the conversation begins. Developers use system prompts to give a model its "personality" or restrict what it can and can't do. It's a configuration layer on top of the model. Most models are trained to prioritize following instructions in the system prompt over any subsequent instructions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Window
&lt;/h3&gt;

&lt;p&gt;The maximum amount of text a model can process at once, both input and output combined. If a model has a 128k token context window, it can "see" roughly 100,000 words at a time, since there are roughly 0.75 words per token. Anything outside the context window is invisible to the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;The time it takes for a model to respond after receiving input.&lt;br&gt;
In AI products, latency matters for user experience. It's influenced by model size, the device it's running on, and whether the output is streamed token by token or returned all at once.&lt;br&gt;
It's useful to measure both the time-to-first-token (TTFT) and the time to complete an entire response. In use cases where you can stream tokens to the user as they are generated, TTFT matters most.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token throughput
&lt;/h3&gt;

&lt;p&gt;Typically measured in &lt;em&gt;tokens per second&lt;/em&gt;, this is a measure of how quickly the model can process and generate tokens.&lt;br&gt;
Reading (processing input) and writing (generating output) happen at very different speeds: the token throughput for reading is often around 10x that of writing.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Sampling
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Controlling how tokens are selected.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Sampling
&lt;/h3&gt;

&lt;p&gt;Under the hood, generative language models output a probability distribution for the next token in a sequence. If given the sequence of tokens &lt;code&gt;["Once ", "upon ", "a "]&lt;/code&gt;, a model might output a distribution with a high probability for the token "time", a much lower probability for the token "hill", and an incredibly low probability for nonsense tokens like "13".&lt;/p&gt;

&lt;p&gt;In order to actually generate a sequence, we must select one of these tokens to accept and output to the user. This process of selecting a token from the probability distribution is known as sampling.&lt;/p&gt;

&lt;p&gt;While it's tempting to simply select the most probable token, it has been shown that language models generate much better outputs when some randomness is applied. The field of sampling in LLMs is about designing exactly how this random selection works.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Sampler Chain
&lt;/h3&gt;

&lt;p&gt;A sampler consists of two phases:&lt;/p&gt;

&lt;p&gt;First, any number of transformations are applied to the probability distribution. These steps might zero out the probability of certain tokens, or reshape the distribution across all tokens.&lt;/p&gt;

&lt;p&gt;You can play with the token probabilities on &lt;a href="https://artefact2.github.io/llm-sampling/" rel="noopener noreferrer"&gt;this website&lt;/a&gt;. If you drag and drop the sampling steps, you may notice that the order in which steps are applied can change the result.&lt;/p&gt;

&lt;p&gt;Once the sequence of transformations has been applied, the sampler chain finalizes by selecting a token from that distribution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Greedy Sampling
&lt;/h3&gt;

&lt;p&gt;Greedy sampling is the sampling technique where you always select the most probable token, sidestepping any randomness in the sampling process. Greedy sampling leads to very predictable and boring output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dist sampling
&lt;/h3&gt;

&lt;p&gt;Dist sampling is the practice of selecting a token randomly, weighted by each token's probability.&lt;/p&gt;
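&lt;p&gt;Both greedy and dist sampling fit in a few lines of Python, using a made-up next-token distribution rather than real model output:&lt;/p&gt;

```python
import random

# A toy distribution a model might output after the prompt "Once upon a "
probs = {"time": 0.92, "hill": 0.05, "midnight": 0.02, "13": 0.01}

# Greedy sampling: always take the most probable token
greedy = max(probs, key=probs.get)

# Dist sampling: pick randomly, weighted by each token's probability
token = random.choices(list(probs), weights=list(probs.values()))[0]

print(greedy)  # 'time', every time
print(token)   # usually 'time', occasionally something else
```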

&lt;h3&gt;
  
  
  Temperature
&lt;/h3&gt;

&lt;p&gt;A transformation applied to the token probability distribution to shift it towards preferring either the more probable or the less probable tokens.&lt;br&gt;
If a temperature greater than 1 is applied, the high-probability tokens are made less likely, and the low-probability tokens are made more likely.&lt;br&gt;
If a temperature less than 1 is applied, the high-probability tokens are made more likely, and the low-probability tokens are made less likely.&lt;br&gt;
A temperature of exactly 1 has no effect.&lt;/p&gt;

&lt;p&gt;Low temperature makes the model more focused and deterministic, making it feel measured and predictable.&lt;br&gt;
High temperature introduces more randomness and variation, making it feel creative and spontaneous.&lt;/p&gt;
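&lt;p&gt;Concretely, temperature divides the model's raw scores (logits) before they are turned into probabilities with a softmax. A minimal Python sketch over made-up logits:&lt;/p&gt;

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores (logits) into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]

print(softmax_with_temperature(logits, 1.0))  # baseline distribution
print(softmax_with_temperature(logits, 0.5))  # sharper: the top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more randomness
```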

&lt;h3&gt;
  
  
  Top-k
&lt;/h3&gt;

&lt;p&gt;Top-k limits the model to choosing from only the k most likely tokens at each step. For example, top-k of 40 means only the 40 most probable options are considered. &lt;/p&gt;

&lt;h3&gt;
  
  
  Top-p
&lt;/h3&gt;

&lt;p&gt;Top-p (also called nucleus sampling) is more dynamic: it picks from the smallest group of tokens whose combined probability adds up to p, so at top-p of 0.9, the model considers just enough tokens to cover 90% of the probability mass.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grammar
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://en.wikipedia.org/wiki/Formal_grammar" rel="noopener noreferrer"&gt;formal grammar&lt;/a&gt; can be applied as a transformation on token probabilities. It excludes tokens (by setting their probability to zero) that cannot possibly lead to a valid completion under the grammar. This guarantees that the output is always compatible with a well-defined language, so a given parser will always succeed. For example, you can apply a formal grammar to force the model to only output valid JSON.&lt;/p&gt;

&lt;h3&gt;
  
  
  DRY
&lt;/h3&gt;

&lt;p&gt;A transformation that reduces the likelihood of tokens if they have been used recently. This is useful for preventing models from repeating themselves.&lt;/p&gt;
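&lt;p&gt;As a simplified sketch: the actual DRY sampler penalizes repeated &lt;em&gt;sequences&lt;/em&gt; of tokens, but the basic idea of down-weighting recently used tokens looks like this in Python:&lt;/p&gt;

```python
def penalize_recent(probs, recent_tokens, penalty=0.5):
    """Scale down the probability of recently used tokens, then renormalize."""
    adjusted = {t: (p * penalty if t in recent_tokens else p) for t, p in probs.items()}
    total = sum(adjusted.values())
    return {t: p / total for t, p in adjusted.items()}

probs = {"the": 0.5, "a": 0.3, "this": 0.2}
recent = ["the", "the"]  # 'the' was just generated twice

print(penalize_recent(probs, recent))  # 'the' is now less likely than before
```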




&lt;h2&gt;
  
  
  7. How Models Think &amp;amp; Respond
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What's actually happening when a model generates an output.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Inference
&lt;/h3&gt;

&lt;p&gt;The act of running a trained model on a new input to get an output. Training is when a model learns; inference is when it's actually used. Most of what happens when you use an AI product (chatting, generating images, transcribing audio) is inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chain-of-Thought (CoT)
&lt;/h3&gt;

&lt;p&gt;A prompting technique where the model is encouraged to reason step by step before giving a final answer, rather than jumping straight to a conclusion. By writing out intermediate reasoning, like a person writing their thoughts on paper, the model tends to make fewer mistakes on complex tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hallucination
&lt;/h3&gt;

&lt;p&gt;When a model generates information that sounds confident but is factually wrong or completely made up. Hallucinations happen because models predict plausible-sounding text, not verified truth.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Advanced Techniques
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Methods that extend or enhance what models can do.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG (Retrieval-Augmented Generation)
&lt;/h3&gt;

&lt;p&gt;A technique where a model retrieves relevant external information before generating a response. Instead of relying solely on what it learned during training, the model pulls in fresh data from a database or document store at inference time. It's a practical way to keep responses accurate and up to date.&lt;/p&gt;
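&lt;p&gt;A minimal Python sketch of the retrieval step, using word overlap as the scoring function for simplicity (production systems typically score with embeddings instead):&lt;/p&gt;

```python
def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query; real RAG uses embeddings."""
    q = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q.intersection(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

documents = [
    "The Eiffel Tower is 330 metres tall.",
    "Python was created by Guido van Rossum.",
    "The Great Wall of China is visible from low orbit.",
]

# Retrieve the most relevant document, then prepend it to the prompt.
context = retrieve("How tall is the Eiffel Tower?", documents)[0]
prompt = f"Answer using this context: {context}\n\nQuestion: How tall is the Eiffel Tower?"
print(prompt)  # the model now sees the retrieved fact in its prompt
```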

&lt;h3&gt;
  
  
  Tool Calling
&lt;/h3&gt;

&lt;p&gt;The ability of a model to invoke external functions or APIs during a conversation, things like searching the web, running code, querying a database, or reading a file. Rather than generating a plain text answer, the model recognizes when a tool would help, calls it with the right inputs, receives the result, and incorporates it into its response. Tool calling is what bridges a language model and the real world, and it's the core mechanism behind most agentic systems.&lt;/p&gt;
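&lt;p&gt;The control flow can be sketched in Python without a real model. Here &lt;code&gt;fake_model&lt;/code&gt; and &lt;code&gt;get_weather&lt;/code&gt; are hypothetical stand-ins: the "model" emits a JSON tool call, the runtime executes it, and the result is fed back for the final answer:&lt;/p&gt;

```python
import json

def get_weather(city):
    """A hypothetical tool the model can invoke."""
    return {"city": city, "temp_c": 18}

TOOLS = {"get_weather": get_weather}

def fake_model(prompt):
    """Stand-in for an LLM: emits a tool call when the prompt mentions weather."""
    if "weather" in prompt:
        return json.dumps({"tool": "get_weather", "args": {"city": "Paris"}})
    return "The weather in Paris is 18 degrees."

def run(prompt):
    reply = fake_model(prompt)
    try:
        call = json.loads(reply)  # did the model ask for a tool?
    except json.JSONDecodeError:
        return reply              # plain-text answer, we're done
    result = TOOLS[call["tool"]](**call["args"])
    # feed the tool result back so the model can phrase the final answer
    return fake_model(f"tool result: {json.dumps(result)}")

print(run("What's the weather in Paris?"))  # final answer composed after the tool call
```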




&lt;h2&gt;
  
  
  9. AI Systems &amp;amp; Evaluation
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;How models are deployed, measured, and put to work.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent / Agentic
&lt;/h3&gt;

&lt;p&gt;An AI system that can take actions, use tools, and pursue a goal across multiple steps, rather than just responding once to a single prompt. An agentic system consists of a model, a suite of tools, and some logic for when and for how long to run it. An agent will often run in several steps, until it reaches some well-defined result.&lt;/p&gt;

&lt;h3&gt;
  
  
  Guardrails
&lt;/h3&gt;

&lt;p&gt;Rules and filters applied to a model's inputs or outputs to keep it within acceptable boundaries. Guardrails can block harmful content, enforce topic restrictions, prevent the model from impersonating real people, or ensure responses stay on-brand for a product.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alignment
&lt;/h3&gt;

&lt;p&gt;The challenge of making AI systems behave in ways that reflect human intentions, values, and goals. A misaligned model might be highly capable but pursue objectives in ways its creators didn't intend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Eval Benchmark
&lt;/h3&gt;

&lt;p&gt;A standardized test used to measure and compare response quality. Benchmarks like MMLU, HumanEval, or HellaSwag evaluate specific capabilities like reasoning, coding, language understanding, or maths. They're useful for comparing models, but a high benchmark score doesn't always translate to real-world usefulness.&lt;/p&gt;

&lt;p&gt;Visit &lt;a href="https://www.nobodywho.ai/" rel="noopener noreferrer"&gt;nobodywho.ai&lt;/a&gt; to start integrating AI into your applications!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
      <category>llm</category>
    </item>
    <item>
      <title>Run LLMs locally in Flutter apps</title>
      <dc:creator>pielouNW</dc:creator>
      <pubDate>Mon, 23 Mar 2026 14:16:40 +0000</pubDate>
      <link>https://dev.to/pielounw/run-llms-locally-in-flutter-apps-211p</link>
      <guid>https://dev.to/pielounw/run-llms-locally-in-flutter-apps-211p</guid>
      <description>&lt;p&gt;In this tutorial, you'll learn how to run a large language model (LLM) directly on a user's device — no cloud, no server, no cost. We'll start from scratch, build a working chat interface, and progressively introduce more advanced features: tool calling, sampling, and RAG.&lt;/p&gt;

&lt;p&gt;Each concept is explained before the code, so you can follow along whether you're new to on-device AI or just new to NobodyWho.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;a href="https://github.com/nobodywho-ooo/flutter-starter-example" rel="noopener noreferrer"&gt;example app&lt;/a&gt; for this article is available on GitHub if you want to jump straight to working code. It is kept up to date with the latest features — if you want the code that matches this tutorial exactly, check out &lt;a href="https://github.com/nobodywho-ooo/flutter-starter-example/tree/eba1ec3d3e75dd44e80a91db603c04dd21b47cf3" rel="noopener noreferrer"&gt;this commit&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why Run AI On-Device?
&lt;/h2&gt;

&lt;p&gt;Most AI features rely on a cloud API: you send a request to a remote server, it runs the model, and sends a response back. That works well, but it comes with tradeoffs.&lt;/p&gt;

&lt;p&gt;Running the model directly on the device avoids all of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Works offline&lt;/strong&gt; — no internet connection required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy by design&lt;/strong&gt; — user data never leaves the device&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low latency&lt;/strong&gt; — no network round-trip&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No cloud costs&lt;/strong&gt; — inference is free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff is raw capability: on-device models are smaller and less powerful than frontier cloud models. But for many use cases — summarization, chatbots, or local search — they're more than good enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  About NobodyWho
&lt;/h2&gt;

&lt;p&gt;We'll use the &lt;a href="https://github.com/nobodywho-ooo/nobodywho" rel="noopener noreferrer"&gt;NobodyWho&lt;/a&gt; library throughout this tutorial. It wraps &lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp&lt;/a&gt; in Rust and exposes a clean Flutter API for running any model in &lt;code&gt;.gguf&lt;/code&gt; format.&lt;/p&gt;

&lt;p&gt;Install it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;flutter pub add nobodywho
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then initialize the engine in &lt;code&gt;main.dart&lt;/code&gt; before your app launches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'package:nobodywho/nobodywho.dart'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;WidgetsFlutterBinding&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ensureInitialized&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NobodyWho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;init&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="n"&gt;runApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="n"&gt;MyApp&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Choosing and Loading a Model
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Picking a Model
&lt;/h3&gt;

&lt;p&gt;We'll use &lt;strong&gt;LFM2&lt;/strong&gt;, a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. Models must be in &lt;code&gt;.gguf&lt;/code&gt; format; most will work with NobodyWho, though some may fail due to chat template formatting issues. See the &lt;a href="https://docs.nobodywho.ooo/model-selection/" rel="noopener noreferrer"&gt;model selection guide&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting the Model onto the Device
&lt;/h3&gt;

&lt;p&gt;You have two options:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bundle in assets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple setup, great for development&lt;/td&gt;
&lt;td&gt;Increases app size significantly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Download on demand&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keeps app size small&lt;/td&gt;
&lt;td&gt;Requires more implementation work&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For this tutorial, we'll bundle the model in assets to keep things simple. In production, you'd want to use a download-on-demand approach with something like &lt;a href="https://pub.dev/packages/background_downloader" rel="noopener noreferrer"&gt;background_downloader&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an &lt;code&gt;assets/&lt;/code&gt; folder at the root of your project (if it doesn't exist).&lt;/li&gt;
&lt;li&gt;Register it in &lt;code&gt;pubspec.yaml&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;flutter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;assets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;assets/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Download &lt;a href="https://huggingface.co/unsloth/LFM2-700M-GGUF/resolve/main/LFM2-700M-Q4_K_M.gguf?download=true" rel="noopener noreferrer"&gt;this GGUF model&lt;/a&gt;, rename it &lt;code&gt;model.gguf&lt;/code&gt;, and place it in the &lt;code&gt;assets/&lt;/code&gt; folder.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Accessing the Model at Runtime
&lt;/h3&gt;

&lt;p&gt;NobodyWho reads the model from the filesystem, so we copy it from Flutter's asset bundle to the app's documents directory on first launch. Add &lt;code&gt;path_provider&lt;/code&gt; to handle this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;flutter pub add path_provider
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'dart:io'&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'package:flutter/services.dart'&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'package:path_provider/path_provider.dart'&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;getApplicationDocumentsDirectory&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;${dir.path}&lt;/span&gt;&lt;span class="s"&gt;/model.gguf'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;rootBundle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'assets/model.gguf'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeAsBytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asUint8List&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nl"&gt;flush:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Basic Chat
&lt;/h2&gt;

&lt;p&gt;With the model in place, you're ready to start a conversation. Here's the simplest possible usage — good for testing or when you don't need a full chat UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;modelPath:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Is water wet?'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Putting It Together
&lt;/h3&gt;

&lt;p&gt;Here's the complete minimal app so far:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'dart:io'&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'package:flutter/material.dart'&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'package:flutter/services.dart'&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'package:nobodywho/nobodywho.dart'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'package:path_provider/path_provider.dart'&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;WidgetsFlutterBinding&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ensureInitialized&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;NobodyWho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;init&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="n"&gt;runApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="n"&gt;MainApp&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MainApp&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="n"&gt;StatelessWidget&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="n"&gt;MainApp&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="n"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_onPressed&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;getApplicationDocumentsDirectory&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;${dir.path}&lt;/span&gt;&lt;span class="s"&gt;/model.gguf'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;rootBundle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'assets/model.gguf'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeAsBytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asUint8List&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nl"&gt;flush:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;modelPath:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'How do I code a button in Flutter?'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="n"&gt;debugPrint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;debugPrint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error: &lt;/span&gt;&lt;span class="si"&gt;$err&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nd"&gt;@override&lt;/span&gt;
  &lt;span class="n"&gt;Widget&lt;/span&gt; &lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BuildContext&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MaterialApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nl"&gt;home:&lt;/span&gt; &lt;span class="n"&gt;Scaffold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nl"&gt;body:&lt;/span&gt; &lt;span class="n"&gt;Center&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="nl"&gt;child:&lt;/span&gt; &lt;span class="n"&gt;ElevatedButton&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nl"&gt;onPressed:&lt;/span&gt; &lt;span class="n"&gt;_onPressed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nl"&gt;child:&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Ask - How do I code a button in Flutter?"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For customization options like system prompts and context size, see the &lt;a href="https://docs.nobodywho.ooo/flutter/chat/" rel="noopener noreferrer"&gt;Chat documentation&lt;/a&gt;.&lt;/p&gt;
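&lt;p&gt;As a quick sketch of those options (&lt;code&gt;systemPrompt&lt;/code&gt; also appears in the tool-calling example later in this post; the &lt;code&gt;contextSize&lt;/code&gt; name is assumed here, so check the linked docs for the exact parameter):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;final chat = await nobodywho.Chat.fromPath(
  modelPath: model.path,
  // A system prompt steers the assistant's tone and behavior for the whole chat.
  systemPrompt: 'You are a concise assistant. Keep answers short.',
  // NOTE: parameter name assumed; verify against the Chat documentation.
  contextSize: 4096,
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;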




&lt;h2&gt;
  
  
  Building a Chat Interface
&lt;/h2&gt;

&lt;p&gt;A one-shot &lt;code&gt;ask().completed()&lt;/code&gt; call is fine for single questions, but a real chat interface needs to stream tokens as they arrive — otherwise users stare at a blank screen until the full response is ready.&lt;/p&gt;

&lt;h3&gt;
  
  
  Streaming Tokens
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'How do I code a button in Flutter?'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Each token arrives as it's generated&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;em&gt;token&lt;/em&gt; is the smallest unit a model generates — typically a word fragment, punctuation mark, or whitespace character.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling the Streaming Content
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;_ChatScreenState&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatScreen&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;TextEditingController&lt;/span&gt; &lt;span class="n"&gt;_textController&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TextEditingController&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;_streamingContent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;_responding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="n"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_ask&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;userInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_textController&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userInput&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEmpty&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;_responding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;setState&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;_responding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;_streamingContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;tokenStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userInput&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokenStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;setState&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;_streamingContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_streamingContent&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// ...continued below&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Updating the Message List
&lt;/h3&gt;

&lt;p&gt;Once the stream completes, fetch the full chat history and update your UI state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getChatHistory&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;copyWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nl"&gt;content:&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;setState&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;_messages&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="n"&gt;_messages&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;_streamingContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;_responding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Wiring Up the UI
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Connect your &lt;code&gt;TextField&lt;/code&gt; to call &lt;code&gt;_ask()&lt;/code&gt; via &lt;code&gt;onSubmitted&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Render &lt;code&gt;_messages&lt;/code&gt; in a &lt;code&gt;ListView&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Append &lt;code&gt;_streamingContent&lt;/code&gt; at the bottom while streaming&lt;/li&gt;
&lt;/ul&gt;
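
&lt;p&gt;A minimal &lt;code&gt;build()&lt;/code&gt; that wires those three pieces together could look like this (widget choices are illustrative, and it assumes &lt;code&gt;Message.content&lt;/code&gt; is a &lt;code&gt;String&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;@override
Widget build(BuildContext context) {
  return Scaffold(
    body: Column(
      children: [
        Expanded(
          child: ListView.builder(
            // One extra row shows the partial response while streaming.
            itemCount: _messages.length + (_streamingContent == null ? 0 : 1),
            itemBuilder: (context, index) {
              if (index &amp;lt; _messages.length) {
                return ListTile(title: Text(_messages[index].content));
              }
              return ListTile(title: Text(_streamingContent!));
            },
          ),
        ),
        TextField(
          controller: _textController,
          onSubmitted: (_) =&amp;gt; _ask(),
        ),
      ],
    ),
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;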




&lt;h2&gt;
  
  
  Tool Calling
&lt;/h2&gt;

&lt;p&gt;Tool calling lets the model interact with the outside world. You define a set of functions — each with a name, a description, and an implementation — and the model decides when and how to call them based on the user's request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'dart:math'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'package:nobodywho/nobodywho.dart'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;circleAreaTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nl"&gt;name:&lt;/span&gt; &lt;span class="s"&gt;"circle_area"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nl"&gt;description:&lt;/span&gt; &lt;span class="s"&gt;"Calculates the area of a circle given its radius"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nl"&gt;function:&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="kd"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;area&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"Circle with radius &lt;/span&gt;&lt;span class="si"&gt;$radius&lt;/span&gt;&lt;span class="s"&gt; has area &lt;/span&gt;&lt;span class="si"&gt;${area.toStringAsFixed(2)}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;getWeatherTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nl"&gt;name:&lt;/span&gt; &lt;span class="s"&gt;"get_weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nl"&gt;description:&lt;/span&gt; &lt;span class="s"&gt;"Get the current weather for a given city"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nl"&gt;function:&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="kd"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="kd"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;fetchWeather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nl"&gt;modelPath:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nl"&gt;tools:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;circleAreaTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;getWeatherTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'What is the area of a circle with a radius of 2?'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model reads each tool's &lt;code&gt;description&lt;/code&gt; to decide when to call it, so writing clear, specific descriptions matters.&lt;/p&gt;
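
&lt;p&gt;For example, a description like &lt;code&gt;"Gets weather"&lt;/code&gt; gives the model little to go on; spelling out when the tool applies and what the argument looks like works much better (reusing the &lt;code&gt;fetchWeather&lt;/code&gt; placeholder from above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;final getWeatherTool = nobodywho.Tool(
  name: "get_weather",
  // A specific description states the trigger, the input, and its format.
  description: "Get the current weather for a city. "
      "Call this whenever the user asks about weather conditions. "
      "The city argument is a plain city name, e.g. 'Copenhagen'.",
  function: ({required String city}) async =&amp;gt; await fetchWeather(city),
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;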

&lt;p&gt;See the &lt;a href="https://docs.nobodywho.ooo/flutter/tool-calling/" rel="noopener noreferrer"&gt;Tool Calling documentation&lt;/a&gt; for more.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sampling
&lt;/h2&gt;

&lt;p&gt;When generating a token, the model produces a probability distribution over every possible next token. A &lt;em&gt;sampler&lt;/em&gt; controls how the final token is chosen from that distribution.&lt;/p&gt;

&lt;p&gt;The default behavior involves some randomness, which produces natural, varied output. But you can tune it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower temperature&lt;/strong&gt; → more deterministic, predictable output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher temperature&lt;/strong&gt; → more creative, varied output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constrained sampling&lt;/strong&gt; → force output into a specific format, such as JSON
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nl"&gt;modelPath:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nl"&gt;sampler:&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SamplerPresets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;temperature:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the &lt;a href="https://docs.nobodywho.ooo/flutter/sampling/" rel="noopener noreferrer"&gt;Sampling documentation&lt;/a&gt; for more.&lt;/p&gt;




&lt;h2&gt;
  
  
  RAG
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) combines document search with LLM generation. The model uses retrieved documents to ground its responses in your knowledge base.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: A Customer Service Assistant
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'package:nobodywho/nobodywho.dart'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// The cross-encoder re-ranks retrieved documents by relevance.&lt;/span&gt;
  &lt;span class="c1"&gt;// Recommended model:&lt;/span&gt;
  &lt;span class="c1"&gt;// https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf&lt;/span&gt;
  &lt;span class="c1"&gt;// Follow the same approach as the chat model to import the reranker model.&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;crossencoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;CrossEncoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nl"&gt;modelPath:&lt;/span&gt; &lt;span class="n"&gt;rerankerModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;knowledge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"Our company offers a 30-day return policy for all products"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Free shipping is available on orders over &lt;/span&gt;&lt;span class="err"&gt;\$&lt;/span&gt;&lt;span class="s"&gt;50"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Customer support is available via email and phone"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"We accept credit cards, PayPal, and bank transfers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Order tracking is available through your account dashboard"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;searchKnowledgeTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nl"&gt;name:&lt;/span&gt; &lt;span class="s"&gt;"search_knowledge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nl"&gt;description:&lt;/span&gt; &lt;span class="s"&gt;"Search the knowledge base for relevant information"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nl"&gt;function:&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="kd"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="kd"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crossencoder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;rankAndSort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;query:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;documents:&lt;/span&gt; &lt;span class="n"&gt;knowledge&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;topDocs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;topDocs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;nobodywho&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nl"&gt;modelPath:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nl"&gt;systemPrompt:&lt;/span&gt;
        &lt;span class="s"&gt;"You are a customer service assistant. Use the search_knowledge tool "&lt;/span&gt;
        &lt;span class="s"&gt;"to find relevant information from our policies before answering."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nl"&gt;tools:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;searchKnowledgeTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"What is your return policy?"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="n"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;See the &lt;a href="https://docs.nobodywho.ooo/flutter/embeddings-and-rag/" rel="noopener noreferrer"&gt;Embeddings &amp;amp; RAG documentation&lt;/a&gt; for more.&lt;/p&gt;




&lt;h2&gt;What's Next?&lt;/h2&gt;

&lt;p&gt;You now have a complete foundation for building on-device AI features in Flutter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load and run a GGUF model&lt;/li&gt;
&lt;li&gt;Build a streaming chat interface&lt;/li&gt;
&lt;li&gt;Extend the model with tool calling&lt;/li&gt;
&lt;li&gt;Control output style with sampling&lt;/li&gt;
&lt;li&gt;Ground responses in a knowledge base with RAG&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From here, you can explore the full &lt;a href="https://docs.nobodywho.ooo/" rel="noopener noreferrer"&gt;NobodyWho documentation&lt;/a&gt; or dig into the &lt;a href="https://github.com/nobodywho-ooo/flutter-starter-example" rel="noopener noreferrer"&gt;example app&lt;/a&gt; to see everything working end to end.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>flutter</category>
      <category>llm</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
