Lokeswaran Aruljothi

Posted on • Originally published at blog.lokes.dev

How Large Language Models Work: A Simple Overview for Beginners

TL;DR

When you give text to an LLM, it first splits it into tokens using tokenization, and those tokens are converted into embeddings. Each token compares itself with other tokens using attention to build meaningful context. The model then generates raw scores for all possible next tokens and converts them into a probability distribution. Finally, sampling selects the next token. This autoregressive loop continues until the LLM produces a complete response.

What is an LLM?

A large language model is, at its core, a next-token prediction system. Given some input text, it predicts the most likely next token, appends it to the input, and repeats this process in a loop to generate a response. At each step, the model effectively asks:

What is the most probable next token that will follow this input?

An LLM does not understand text or language in the way humans do. The text you provide is first converted into numbers, and the model operates entirely on those numerical representations. Rather than understanding meaning, the model learns statistical patterns in language and uses those patterns to predict what token is likely to come next.
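To make that loop concrete, here is a tiny Python sketch. The hand-written table of next-token probabilities is completely made up for illustration; a real LLM learns these patterns from huge amounts of data and works over tokens rather than whole words.

```python
import random

# Toy illustration of the next-token loop: a tiny hand-written "model"
# that maps the current token to possible next tokens with probabilities.
# Real LLMs learn these patterns from data; this table is invented.
toy_model = {
    "the": [("sky", 0.6), ("sun", 0.4)],
    "sky": [("is", 1.0)],
    "sun": [("is", 1.0)],
    "is":  [("blue", 0.5), ("bright", 0.5)],
}

def predict_next(token):
    choices = toy_model.get(token)
    if choices is None:
        return None                          # nothing learned for this token: stop
    words, probs = zip(*choices)
    return random.choices(words, weights=probs)[0]

tokens = ["the"]
for _ in range(5):                           # generate up to 5 more tokens
    nxt = predict_next(tokens[-1])
    if nxt is None:
        break
    tokens.append(nxt)                       # append and repeat: the autoregressive loop

print(" ".join(tokens))                      # e.g. "the sky is blue"
```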

Since an LLM operates only on numbers, the text we provide cannot be used directly. The first step is to break text into smaller units called tokens, which act as the building blocks for everything that follows. This process is known as tokenization.


Tokenization: Breaking text into model-friendly pieces

Each language model has a fixed vocabulary, which is a list of tokens it knows how to work with. Every token is mapped to a numeric value called a token ID. Tokens can represent full words, parts of words, punctuation, whitespace, or even common character sequences. Internally, the model never sees raw text; it only sees these token IDs.

When text is passed to the model, it is split into tokens based on the tokenizer and vocabulary used by that specific model.

Tokenization

For example, in the image above, the sentence is tokenized using the GPT‑4o tokenizer. Notice how the word “microservices” is split into multiple tokens rather than treated as a single word. This happens because tokenizers often use subword units to efficiently handle large and diverse vocabularies.

Because different models use different tokenizers and vocabularies, the same sentence can be split into tokens differently across models. This is why token counts vary and why the same input may consume more or fewer tokens depending on the model being used.
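You can try this yourself with OpenAI's tiktoken library (`pip install tiktoken`), assuming a recent version that knows about GPT-4o. The sentence below and the exact splits are only illustrative; different tokenizer versions can produce different results.

```python
import tiktoken

# GPT-4o's tokenizer (the o200k_base vocabulary in tiktoken).
enc = tiktoken.encoding_for_model("gpt-4o")

sentence = "Our microservices scale independently."
ids = enc.encode(sentence)
print(ids)                                   # the token IDs the model actually sees
print([enc.decode([i]) for i in ids])        # the text piece behind each ID

# A different vocabulary splits the same sentence differently:
older = tiktoken.get_encoding("cl100k_base") # used by older GPT-4 / GPT-3.5 models
print(len(ids), "vs", len(older.encode(sentence)))   # token counts vary per model
```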

On their own, tokens are still just numbers with no inherent meaning. To represent relationships and semantics, these tokens are next converted into embeddings.


Embeddings: Turning tokens into vectors

The tokenizer converts input text into tokens and maps them to token IDs. However, these token IDs are just numbers and carry no inherent meaning. The model cannot reason or operate using these raw IDs alone. This is where embeddings come into the picture.

An embedding is a vector representation of a token. Each token ID is mapped to a dense vector that captures semantic information. Tokens with similar meanings tend to have similar embeddings, which allows the model to reason about language in a more meaningful way.

Vector embedding

The image above shows a simplified visualization of embeddings. Words like “king” and “prince” appear close to each other, as do “father” and “son”, while “mother” and “daughter” form a separate group. This illustrates how embeddings can capture semantic similarities and relationships. In reality, embeddings exist in a multi-dimensional space, not just two or three dimensions, and the axes shown here are only for intuition.

You can think of an embedding as a point in a multi-dimensional space. Tokenization gives a token ID, and that ID is used to look up the corresponding embedding vector. At this stage, each token is represented independently. Tokens do not yet interact with one another, and no context has been applied.

Along with token embeddings, positional information is added so the model knows the order of tokens in the sequence. This ensures that the model can distinguish between different word orders, such as “the sky is blue” and “blue is the sky”.
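Here is a minimal sketch of this lookup step, with made-up sizes and random matrices standing in for the embedding tables a real model learns during training.

```python
import numpy as np

vocab_size, d_model, max_len = 50_000, 8, 128   # d_model is tiny here for readability

rng = np.random.default_rng(0)
token_embedding = rng.normal(size=(vocab_size, d_model))     # one vector per token ID
position_embedding = rng.normal(size=(max_len, d_model))     # one vector per position

token_ids = [312, 7, 15000, 42]                 # hypothetical IDs from the tokenizer

x = token_embedding[token_ids]                  # look up each token's vector
x = x + position_embedding[: len(token_ids)]    # add positional information

print(x.shape)                                  # (4, 8): one vector per input token
```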

At this point, tokens have semantic meaning, but they still do not understand the full context of the sentence. Each token is aware only of itself. To understand language, tokens must relate to one another and determine which other tokens are relevant. This is achieved using attention.


Attention: Relating tokens to each other

Attention allows each token to look at other tokens in the input and decide how important they are for understanding the current context. Instead of treating all tokens equally, the model learns which tokens should influence each other and by how much. This process enables the model to build context-aware representations of each token.

In the visualization above, each token sends connections to other tokens. Stronger connections indicate that one token considers another token more relevant when building its contextual meaning. This process happens for every token in the input sequence.

Queries, Keys, and Values

To compute attention, each token is transformed into three different representations called query, key, and value.

  • The query represents what the current token is looking for.

  • The key represents what other tokens offer.

  • The value represents the information carried by those tokens.

The query of one token is compared with the keys of all other tokens to measure relevance. Based on this relevance, the corresponding values are combined with different strengths to produce a new, context-aware representation for the token. This is the core idea behind attention. The goal is not to perform a lookup, but to determine which tokens matter most in a given context.
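Here is a stripped-down, single-head version of that computation in NumPy, with random weights standing in for learned ones. Real decoder models also apply a causal mask so a token cannot look at tokens that come after it; that detail is left out here to keep the core idea visible.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a sequence x of shape (seq_len, d_model)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how relevant each token is to each other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # blend values by relevance

# Tiny made-up example: 4 tokens, 8-dimensional embeddings, random weights.
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)    # (4, 8): a context-aware vector per token
```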

Multi-head self-attention

This attention process does not happen just once. Instead, it runs multiple times in parallel using different attention heads. Each head can focus on different patterns in the sentence, such as grammatical structure, relationships between words, or long-range dependencies. The outputs from all heads are then combined to form a richer representation of each token. This mechanism is known as multi-head self-attention.
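A compact sketch of the same idea with the vectors split across heads (random weights, tiny dimensions, and no causal mask, purely for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq_len, d_model). All weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project once, then split the result into n_heads smaller slices.
    Q = (x @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (x @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (x @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, seq_len, seq_len)
    out = softmax(scores) @ V                              # each head attends independently
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model) # concatenate the heads
    return out @ Wo                                        # final mixing projection

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads=2).shape)   # (4, 8)
```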

Role of the feedforward network

After attention mixes information across tokens, the resulting representations pass through a small feedforward neural network, often called a multi-layer perceptron. This step processes each token independently and helps refine its representation further. While attention handles relationships between tokens, this network helps transform and polish the information within each token.
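A rough sketch of that block, assuming the common pattern of expanding each token's vector, applying a non-linearity (GELU here), and projecting back down:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feedforward(x, W1, b1, W2, b2):
    """Applied to each token independently: (seq_len, d_model) -> (seq_len, d_model)."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32                     # real models typically expand by ~4x
x = rng.normal(size=(4, d_model))             # 4 token vectors coming out of attention
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)
print(feedforward(x, W1, b1, W2, b2).shape)   # (4, 8): same shape, refined per token
```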

At the end of this stage, each token is no longer isolated. It now carries information about its own meaning as well as its relationship to other tokens in the sentence. These context-aware token representations (new embeddings) are then used to compute scores for all possible next tokens in the vocabulary, which leads into probability estimation and token selection.


From representations to probabilities

Now that each token has a context-aware representation produced by attention, the model uses these representations to decide what should come next. For the current position, the model generates a score for every possible token in its vocabulary. These raw scores are called logits.

Logits represent how likely each token is to be the next token, relative to the others. A higher logit means the model considers that token more likely. However, logits are not probabilities. They are not normalized, do not fall within a fixed range, and are difficult to interpret directly.

To convert these logits into something meaningful, the model applies a function called softmax that transforms them into a probability distribution. After this step, each possible token is assigned a probability, and all probabilities sum to 1. This distribution represents how likely the model believes each token is to be chosen next.
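In code, softmax is just exponentiate-and-normalize. The logits below are made up for a pretend four-token vocabulary:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1, -1.5])   # raw, unnormalized scores

probs = np.exp(logits - logits.max())      # subtract max for numerical stability
probs = probs / probs.sum()

print(probs)                               # roughly [0.65, 0.24, 0.10, 0.02]
print(probs.sum())                         # 1.0: a valid probability distribution
```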

At this point, the model knows the likelihood of every possible next token. The remaining question is how to select one token from this distribution, which is handled by the sampling step.


Sampling: choosing the next token

After the model computes a probability distribution for the next token, it must select one token to generate. If the model always chooses the token with the highest probability, it will produce the same output every time. This often leads to repetitive and less natural output. Sampling is used to introduce controlled randomness into the generation process.

Greedy decoding (baseline)

In greedy decoding, the model always picks the token with the highest probability. This approach is fully deterministic and produces consistent outputs. While it can be useful for tasks that require precision, it is generally not suitable for open-ended tasks such as creative writing or conversation.
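In code, greedy decoding is just an argmax over the distribution (hypothetical numbers again):

```python
import numpy as np

probs = np.array([0.65, 0.24, 0.10, 0.01])   # made-up distribution over 4 tokens
next_token_id = int(np.argmax(probs))        # always index 0: fully deterministic
```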

Temperature

Temperature controls how sharp or flat the probability distribution is. It does not change which tokens are possible, only how likely they are relative to one another.

  • A lower temperature makes the distribution sharper, causing high-probability tokens to dominate. This results in more confident and less random output.

  • A higher temperature flattens the distribution, allowing lower-probability tokens to be selected more often. This increases diversity and randomness.

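Under the hood, temperature is just a division of the logits before softmax. A small sketch, reusing the made-up logits from earlier:

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Rescale logits before softmax; lower T sharpens, higher T flattens."""
    z = np.asarray(logits) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.1, -1.5]
print(apply_temperature(logits, 0.5))   # sharper: the top token dominates even more
print(apply_temperature(logits, 1.0))   # the plain softmax distribution
print(apply_temperature(logits, 1.5))   # flatter: lower-ranked tokens get more chance
```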

Top-k sampling

Top-k sampling further restricts the choice of the next token by considering only the k most probable tokens. All other tokens are ignored. This prevents extremely unlikely tokens from being selected while still allowing some diversity among the most likely options.
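A small sketch of top-k with a hypothetical five-token distribution:

```python
import numpy as np

def top_k_sample(probs, k, rng=np.random.default_rng()):
    """Keep only the k most probable tokens, renormalize, then sample."""
    probs = np.asarray(probs)
    top = np.argsort(probs)[-k:]           # indices of the k highest probabilities
    kept = np.zeros_like(probs)
    kept[top] = probs[top]
    kept = kept / kept.sum()               # renormalize so they sum to 1 again
    return int(rng.choice(len(probs), p=kept))

probs = [0.50, 0.25, 0.15, 0.07, 0.03]     # made-up distribution over 5 tokens
print(top_k_sample(probs, k=3))            # only indices 0, 1, or 2 can be returned
```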

Top-p (nucleus sampling)

Top-p sampling, also known as nucleus sampling, selects the smallest set of tokens whose cumulative probability exceeds a threshold p. Unlike Top-k, this set can change dynamically depending on how confident the model is. This makes Top-p more adaptive and often better at balancing focus and diversity.

In this example, only the most likely tokens whose probabilities sum to the 0.8 threshold are considered for sampling.
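And a matching sketch of top-p, using the same hypothetical distribution and the 0.8 threshold from the example:

```python
import numpy as np

def top_p_sample(probs, p, rng=np.random.default_rng()):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs)
    order = np.argsort(probs)[::-1]                     # most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1    # size of the nucleus
    keep = order[:cutoff]
    kept = np.zeros_like(probs)
    kept[keep] = probs[keep]
    kept = kept / kept.sum()                            # renormalize within the nucleus
    return int(rng.choice(len(probs), p=kept))

probs = [0.50, 0.25, 0.15, 0.07, 0.03]
print(top_p_sample(probs, p=0.8))   # nucleus is the top 3 tokens (0.50 + 0.25 + 0.15 >= 0.8)
```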

Putting it together

Temperature, Top-k, and Top-p are often used together. They control how randomness is applied, not what the model knows. The underlying probabilities remain the same, but different sampling settings lead to different generation behavior.

Once a token is selected, it is appended to the input, and the entire process repeats. This is how LLMs generate text one token at a time until a complete response is produced.


A simple mental model

At its core, a large language model is a system designed to predict what comes next. Everything you have seen in this blog, from tokenization and embeddings to attention, probabilities, and sampling, exists to support that single goal. The apparent intelligence of an LLM emerges not from understanding language like a human, but from repeatedly applying this prediction process at scale.

Once you internalize this mental model, many behaviors of LLMs start to make sense. This explains why phrasing matters, why responses can vary, and why models sometimes sound confident yet incorrect. They are not reasoning about truth, but generating the most likely continuation based on patterns learned from data.
