Most developers use LLMs every day.
We use them to write code, explain errors, summarize docs, brainstorm product ideas, and clean up text.
But most people don't realize the model has no idea what it's about to say. Not the full sentence, not even the next word.
So how does it still produce responses that feel coherent and intentional?
In this article, I am going to walk through the actual generation pipeline that runs every time you hit send. Once you see how it really works, a lot of LLM behavior starts to make more sense:
- why token count matters for cost
- why context windows feel strict
- why hallucinations happen
- why temperature changes the tone and style of responses
- why the same prompt can produce different answers
Alright, let’s get into it.
The surprising truth about LLMs
Let’s start with the biggest misconception.
When you send a prompt to an LLM, it does not plan the whole response in advance. It predicts the next token. Then the next one. Then the next.
At the moment the first word appears, the model genuinely does not know how the sentence will end. Future tokens do not exist yet.
This token-by-token process is why generations can drift halfway through a response, and why small changes in your prompt can lead to very different completions. The model is just following a probability path at each step.
The 5-Step Pipeline at a Glance
Every time you send a prompt, the model goes through these five steps:
- Tokenization: your text gets broken into pieces
- Embeddings: those pieces become meaningful vectors
- The Transformer: context gets processed through attention
- Probabilities: every possible next token receives a score
- Sampling: one token is selected based on the probability distribution
Each step builds on the last, then the selected token is appended to the input, and the whole process repeats until the model stops.
This loop is what creates the full response you see.
Step 1: Tokenization
Before the LLM starts generating output, your input goes through a tokenizer.
This is a preprocessing step. It happens before the model starts “thinking.”
A tokenizer splits text into smaller units called tokens. Tokens are not always words. They can be:
- full words
- parts of words
- punctuation
- spaces
- special symbols
For example, if you type "I love programming. It's awesome." into OpenAI's tokenizer, you get seven tokens. Most tokens correspond to whole words, but punctuation like the period gets its own separate token.
This isn't random — tokenizers are trained on massive text data to find the most efficient patterns for splitting text.
Why tokenization exists
Tokenization exists because language has patterns.
Some words appear all the time, so it is efficient to store them as single tokens. Other words are rare or long, so they get broken into smaller subword pieces that can be reused.
That is why a common word like "the" is typically a single token, while a longer word like "indistinguishable" gets split into four subword pieces.
Tokenization is basically a compression-friendly way to represent text.
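To make the subword idea concrete, here is a toy greedy longest-match tokenizer. The vocabulary below is invented for illustration; real tokenizers like BPE learn tens of thousands of entries from huge corpora, so these exact splits won't match OpenAI's tokenizer:

```python
# Toy greedy longest-match tokenizer, a simplified stand-in for BPE.
# The vocabulary is invented for illustration; real tokenizers learn
# tens of thousands of entries from massive text corpora.
VOCAB = ["the", "in", "dis", "tinguish", "able"] + [chr(c) for c in range(32, 127)]

def tokenize(text, vocab=VOCAB):
    pieces = sorted(vocab, key=len, reverse=True)  # try longest matches first
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary piece that matches at position i,
        # falling back to a single character if nothing matches.
        match = next((p for p in pieces if text.startswith(p, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("the"))                # ['the'] -> 1 token
print(tokenize("indistinguishable")) # ['in', 'dis', 'tinguish', 'able'] -> 4 tokens
```

The common word stays whole; the rare word falls apart into reusable pieces. That is the whole trade-off tokenizers are trained to optimize.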
Why this matters to you as a developer: when an API says the maximum context is 4,096 tokens, that's not 4,096 words. It's roughly 3,000 words of English, because tokens are smaller units than words, and every token costs you money on API calls.
After tokenization, each token gets assigned a token ID. So "I love programming. It's awesome." becomes a sequence of seven integers. That's what actually enters the model. But numbers alone don't carry meaning. That's where the next step comes in.
Step 2: Embeddings and Meaning Space
A token ID is just a number. The model needs to understand what it means. So every token gets converted into a vector (a long list of numbers representing its meaning).
These vectors have thousands of dimensions. GPT-3, for example, uses over 12,000 numbers per token. And these aren't random numbers. They're coordinates in what's called a "meaning space."
Think of it like this: words with similar meanings end up near each other in this space. "King" is near "queen." "Python" the programming language is near "JavaScript." But "Python" the snake is somewhere completely different.
There's a famous demonstration of how powerful these vectors are. If you take the vector for "king," subtract the vector for "man," and add the vector for "woman," you land near "queen." The model learned gender relationships purely from text patterns — nobody explicitly taught it.
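You can reproduce the spirit of that demonstration with hand-made vectors. The 3-D coordinates here are invented for illustration; real embeddings have thousands of learned dimensions:

```python
from math import sqrt

# Hand-made 3-D "meaning space" coordinates, purely illustrative;
# real embedding vectors have thousands of learned dimensions.
EMB = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.8, 0.2],
    "woman": [0.1, -0.8, 0.2],
    "apple": [0.0,  0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction in meaning space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# king - man + woman lands nearest to queen in this toy space
v = [k - m + w for k, m, w in zip(EMB["king"], EMB["man"], EMB["woman"])]
nearest = max(EMB, key=lambda word: cosine(v, EMB[word]))
print(nearest)  # queen
```

The toy vectors are rigged so the arithmetic works out, but the principle is the same one real models learn: directions in the space encode relationships like gender or royalty.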
These rich vectors now flow into the transformer.
Step 3: The Transformer and Attention
Your embedding vectors enter a neural network with billions of parameters. But the one mechanism that makes it all work is called attention.
Imagine a spotlight operator at a concert. The music shifts, and the operator decides which musician to highlight. During a guitar solo, the spotlight lands on the guitarist. During vocals, it shifts to the singer.
Attention works similarly. When processing each token, the model decides which other tokens in the sequence to focus on.
Let's look at this sentence in closer detail: "The cat sat on the mat because it was tired."
In this context, "it" refers to the cat, not the mat. This is exactly what attention does. When the model processes "it," it assigns high attention weight to "cat" and low weight to "mat."
Even though "mat" is closer in the sentence, the model learned from patterns across millions of examples that "was tired" matches with animals, not objects.
This attention calculation happens multiple times in parallel through what are called attention heads. Different heads can capture different types of relationships — some might track grammatical structure, others might track meaning or long-range dependencies.
What then comes out the other end? Vectors that encode not just individual token meanings, but rich contextual information about the entire input.
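The core calculation, scaled dot-product attention for a single query, can be sketched in a few lines. The 2-D vectors below are invented stand-ins for the projections a real model learns:

```python
from math import exp, sqrt

def softmax(xs):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over a token sequence."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / sqrt(d) for key in keys]
    weights = softmax(scores)  # attention weights: always sum to 1
    dim = len(values[0])
    # Output is a weighted mix of the value vectors
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Invented 2-D vectors: index 0 stands in for "cat", index 1 for "mat";
# the query stands in for the token "it".
keys = values = [[1.0, 0.0], [0.0, 1.0]]
weights, mixed = attention([2.0, 0.2], keys, values)
print(weights)  # higher weight on "cat" than on "mat"
```

Real transformers run this in parallel across many heads and billions of parameters, but the shape of the computation is exactly this: scores, weights, weighted mix.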
Step 4: Probabilities
Now, we need to predict the next token.
After the transformer has processed your input, the final layer produces a score for every single token in the vocabulary.
Llama 3 has 128,000 tokens in its vocabulary, and each one gets a score. These raw scores are called logits.
A function called softmax converts those logits into probabilities that sum to one. So for a given input, the model ends up with a likelihood for every candidate next token.
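A minimal sketch of that conversion, with a handful of invented candidate tokens standing in for a full vocabulary:

```python
from math import exp

# Invented logits for a few candidate next tokens; a real model scores
# its entire vocabulary (roughly 128,000 entries for Llama 3).
logits = {"is": 5.1, "was": 3.8, "can": 2.9, "banana": -1.2}

def softmax(scores):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok!r}: {p:.3f}")
```

High logits dominate after exponentiation, which is why a nonsense candidate like "banana" ends up with a probability near zero rather than exactly zero. Every token always keeps some chance.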
This is the core reality of LLM generation. The model doesn't decide what to say. It produces a probability distribution over all possible next tokens. Your final response is just one path through an enormous space of possibilities.
Step 5: Sampling, Temperature, and Top-P
This is where you, the developer, have direct control.
Greedy decoding is the simplest approach. It picks the highest-probability token every time. It's consistent, but it can also feel boring and repetitive.
That's where temperature comes in. Temperature adjusts how "confident" the distribution is.
With the same prompt, "What is Python?", different temperature settings produce very different behavior. A low temperature (like 0.2) sharpens the distribution, making safe, predictable choices dominate. A high temperature (like 1.5) flattens it, giving unlikely tokens a real chance. But push it too high and outputs become incoherent.
Then there's top-p, also called nucleus sampling. Top-p says: "only sample from the smallest set of tokens whose probabilities add up to p." If top-p is 0.9, you might be choosing from just 15 tokens or 500, depending on how confident the model is about the current position.
When you set these parameters in an API call, you're directly shaping the token selection process.
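Here is a sketch of how temperature and top-p interact during sampling. The logits are invented, but the mechanics mirror the process described above:

```python
import random
from math import exp

def softmax(scores):
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token: temperature scaling, then nucleus (top-p) filtering."""
    tokens = list(logits)
    # Dividing logits by temperature sharpens (<1) or flattens (>1) the distribution
    probs = softmax([logits[t] / temperature for t in tokens])
    # Keep the smallest set of top tokens whose probabilities reach top_p
    ranked = sorted(zip(tokens, probs), key=lambda tp: tp[1], reverse=True)
    kept, total = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        total += p
        if total >= top_p:
            break
    names, weights = zip(*kept)
    return rng.choices(names, weights=weights, k=1)[0]

logits = {"is": 5.0, "was": 3.0, "banana": 0.0}
print(sample(logits, temperature=0.2))           # almost always "is"
print(sample(logits, temperature=1.5, top_p=0.9))  # more variety
```

Note how the two knobs compose: temperature reshapes the probabilities first, then top-p trims the tail before anything is drawn.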
The Autoregressive Loop
We've walked through all five steps, but so far we've only generated a single token. How does the model produce an entire response?
Through a loop. Once one token is selected, it gets appended to the input, and the entire five-step pipeline runs again. Tokenize, embed, transform, compute probabilities, and sample. For every single token.
This continues token by token until the model produces a special end-of-sequence token or hits a length limit.
This is called autoregressive generation, and it has two important implications. First, generation slows down for longer outputs because every new token requires attention over all previous tokens. Second, the model genuinely doesn't know what it will say in advance. Each word is decided only when it's that word's turn, based on everything that came before.
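The whole loop can be sketched with a toy stand-in for the model. The transition table below is invented; a real model computes its distribution from the full context via steps 1 through 4, not just the last token:

```python
import random

# Toy stand-in for the model: given the tokens so far, return a
# probability distribution over the next token. The table is invented.
def toy_model(tokens):
    table = {
        "<start>": {"The": 1.0},
        "The":     {"cat": 0.6, "dog": 0.4},
        "cat":     {"sat": 1.0},
        "dog":     {"sat": 1.0},
        "sat":     {"<eos>": 1.0},
    }
    return table[tokens[-1]]

def generate(max_tokens=10, rng=random):
    tokens = ["<start>"]
    for _ in range(max_tokens):                        # hard length limit
        dist = toy_model(tokens)                       # score candidates
        names, weights = zip(*dist.items())
        nxt = rng.choices(names, weights=weights, k=1)[0]  # sample one token
        if nxt == "<eos>":                             # stop token ends the loop
            break
        tokens.append(nxt)                             # append and repeat
    return tokens[1:]

print(" ".join(generate()))  # e.g. "The cat sat"
```

Even this toy version shows both stopping conditions from above: the end-of-sequence token and the length limit.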
What This Means For You
Now that you understand the mechanics, here are three practical takeaways that will improve how you build with LLMs.
Hallucinations aren't lies; they're pattern matches. When an LLM hallucinates, it is not being dishonest in a human sense.
It is generating tokens that statistically fit the context and prompt style.
If your prompt asks for a citation and the model does not actually know one, it may still produce something that looks citation-like because that pattern is likely in similar contexts.
The takeaway? Always verify factual claims, especially when the model sounds confident. Confidence in tone tells you nothing about accuracy.
Temperature doesn't make models "more creative." It makes them more likely to select lower-probability tokens. What we call "creativity" is a human interpretation of that randomness. For deterministic tasks like coding, data extraction, or formatting, use low temperature. Don't leave the output to chance when precision matters.
Context limits aren't arbitrary product restrictions.
Every token needs to interact with every other token in context, so the cost of attention grows roughly quadratically as context grows.
This affects:
- latency
- cost
- memory usage
So, next time you hit a context limit, it's not the company being stingy with resources. It's a result of how transformer attention works.
Final Takeaway
The next time you use an LLM, remember what's actually happening.
- A tokenizer split your input.
- Embeddings turned tokens into meaning vectors.
- Attention connected the context.
- A probability distribution was computed.
- One token was sampled.
- Then the entire process repeated, token by token.
Once you understand this mechanism, you've taken the first steps to building better systems.