Nick Lucas

LLMs do know what they're going to say

The following is written to be as penetrable as possible for a semi-technical audience, and includes an app to explore the output of an LLM. As a result some technical explanations are necessarily imprecise though I've worked hard to not say anything outright wrong. It also comes in two halves, with the latter introducing more technical details and more formal terminology.

"LLMs are probabilistic models which generate a completion to a prompt, they do this one token at a time and produce non-deterministic results."

This is a well known explanation of how LLMs work, but much like Schrödinger's cat gives a lay-person a way into quantum physics while also seeding misconceptions about it, this explanation gives rise to a number of misconceptions about LLMs:

  • LLMs have no idea what they're going to say until they've said it
  • LLMs are just autocomplete for the next word

Statements to this effect are only myopically true when considering the generation process, but are repeated constantly on social media as if they represent the truth of the technology. In this post I am going to focus on why the more accurate statement would be:

  • LLMs know every response they could provide, along with a fixed probability for each option, and during generation we make them use a single path to a complete output

LLMs are just Maths

LLMs are input->output maths machines: they have internal parameters set by their training which don't change, and if you give the same input to an equation you will always get the same output. This means LLMs themselves are highly deterministic; it's actually the human-designed harnesses we use to produce output text (often known as Decoders or Samplers) which promote varied generation by choosing tokens which the LLM may not have reported as the highest probability.
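
To make that concrete, here's a minimal sketch. The function below is not a real model, just a stand-in with fixed "parameters", but it shows why the same input always produces the same scores:

```typescript
// A toy stand-in for an LLM forward pass: fixed "parameters", pure maths.
// This is not a real model - just an illustration of determinism.
const weights = [0.25, -1.5, 0.75, 2.0]; // set by "training", never change afterwards

function toyForwardPass(inputTokenIds: number[]): number[] {
  // One score per "vocabulary entry", computed purely from the input and the weights.
  return weights.map((w, i) =>
    inputTokenIds.reduce((sum, id) => sum + Math.sin(id * w + i), 0)
  );
}

const prompt = [101, 7, 42]; // some token ids
console.log(toyForwardPass(prompt)); // run it twice...
console.log(toyForwardPass(prompt)); // ...and the output is identical every time
```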

What follows is that, given a prompt (for instance "The capital of France"), if you generate the probabilities (formally: logits, but I'll cover that later) of every possible next token, then the probabilities of every token after each of those, and so on, you get an ever-exploding tree of generation paths with consistent probabilities for each future token. Additionally, if we take that prompt, suffix some generated tokens (for instance "The capital of France is Paris"), and do the same thing, we'll get the exact same subtree of probabilities. The LLM statically knows everything it wants to say, and the harness we wrap an LLM with gets final say in the path we take through this tree structure.

When I describe a tree structure of tokens, this is the kind of shape I mean:
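
Here's a minimal sketch of that structure in TypeScript; the tokens and probabilities are invented for illustration rather than taken from Gemma's actual output:

```typescript
// Each node is a token plus the probability the model assigned to it,
// given everything on the path above it. The values here are made up.
interface TokenNode {
  token: string;
  probability: number;
  children: TokenNode[];
}

const tree: TokenNode = {
  token: "The capital of France",
  probability: 1.0, // the prompt itself
  children: [
    {
      token: " is",
      probability: 0.82,
      children: [
        { token: " Paris", probability: 0.91, children: [] },
        { token: " the", probability: 0.04, children: [] },
      ],
    },
    {
      token: ",",
      probability: 0.09,
      children: [{ token: " Paris", probability: 0.77, children: [] }],
    },
  ],
};
```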

A demonstration

To demonstrate this, I've built a custom sampler which generates the tree structure of future tokens+probabilities instead of selecting a single path through the tree. You can explore a number of generations I've prepared here:

https://llm-probability-tree.me-62f.workers.dev/

The pre-generated outputs use Google's Gemma 3 1b, a small LLM which can run quickly on consumer hardware (in my case an M2 MacBook Pro), and the data you see expands the 5 highest probability tokens at each step, up to 6 tokens into the future. I've also made sure to keep the LLM settings constant for each call, disabling anything which could inject non-determinism.

Even with a fast model like Gemma, though, it becomes extremely slow to expand more steps into the future. The number of calls grows as roughly numTokens^depth: 5^6 is already 15625 LLM calls (which takes me ~3 minutes to run) and 5^7 would be 78125 calls (15+ minutes), so the defaults I've set are about choosing a balance which can be quickly reproduced but still demonstrates the consistency of output on subtrees.
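
The arithmetic behind that slowdown is simple enough to check for yourself:

```typescript
// Expanding the top K tokens at every step means the number of forward
// passes grows exponentially with depth: roughly K^depth.
function callsNeeded(topK: number, depth: number): number {
  return topK ** depth;
}

console.log(callsNeeded(5, 6)); // 15625
console.log(callsNeeded(5, 7)); // 78125 - five times as many again
```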

If you're inclined towards code, this is an open source nodejs application with a react-based UI, so you can explore the setup and customise it with your own experiments. The generated trees are stored as JSON files and committed to git, which means you can easily compare differences in the output between executions.

https://github.com/Nick-Lucas/llm-probability-tree

Take, for example, this comparison of "The capital of France" on the left and "The capital of France is Paris" on the right. Darker shades of green are higher probabilities, and purple is a completed response.

The right example goes 2 steps further into the future, but if you collapse those steps you can see the printed probabilities are identical. To come back to the point of this post: the LLM was always going to predict these tokens with these likelihoods, even with the shorter prompt. It knows what it wants to say, but it's our responsibility to plot a path to a completed response, and we choose not to always take the highest probability token at each step.

The lay-person's conclusion

So we can see that LLMs are deterministic, and that we wrap them with a bit of randomness to force exploration of responses you might otherwise not see. If you explore some of the output trees a bit more, you'll also see that the meaning of the responses very rarely changes significantly; it's often a choice between a short response and a more detailed one, and most paths lead to sensible responses in any case.

The final leg of this post will go into more depth on some of the low-level details. It's more technically minded, but if you're non-technical and interested I will still attempt to take you along on the journey!

"Latent Space"

The basic (simplified) workings of an LLM are like this:

  1. LLMs represent tokens as a list of numbers; we call these "Embeddings", and each token has a unique list of numbers assigned. We "embed" our prompt as a list of tokens (so a list of lists of numbers) and pass it into the LLM. This is where the maths begins
  2. The tokens are passed to a "Transformer", which uses its training to turn the tokens into a table of relationships between the tokens. For instance "Capital"/"France" might have a pretty strong relationship and "The"/"France" might not. This is what we call Attention, and it helps the LLM to understand the structure of an input.
  3. The Attention table is then passed to a Neural Network which uses its training to encode learned knowledge/meaning
  4. Steps 2+3 are repeated over many layers, each with their own trained parameters
  5. Finally, the output of the last layer is projected into a raw score for every token in the vocabulary, and a "softmax" step turns those scores into probabilities for the next token (there's a toy sketch of this whole pipeline in code below)

aha, we've introduced "scores" and "softmax" - we'll cover those in the next section
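
If it helps to see the shape of that pipeline, here's a toy, untrained sketch in TypeScript. The arithmetic is arbitrary and nothing like the real model's, but the data flow mirrors the numbered steps above, including the scores and softmax at the end:

```typescript
// A heavily simplified, toy-sized sketch of the data flow described above.
// Nothing here is trained or real, but the shape of the computation
// (embed -> attention-style mixing -> feed-forward -> scores -> softmax)
// mirrors the numbered steps.

type Vector = number[];

const VOCAB = ["The", " capital", " of", " France", " is", " Paris"];
const DIM = 4;    // real models use thousands of dimensions per token
const LAYERS = 2; // real models use dozens of layers

// 1. Every token gets a fixed list of numbers (its embedding).
function embed(tokenId: number): Vector {
  return Array.from({ length: DIM }, (_, i) => Math.sin(tokenId * 1.3 + i));
}

// 2. An "attention-like" step: mix each token's vector with the others,
//    weighted by how strongly they relate (here: a simple dot product).
function attentionMix(vectors: Vector[]): Vector[] {
  return vectors.map((v) => {
    const weights = vectors.map((other) =>
      Math.exp(v.reduce((s, x, i) => s + x * other[i], 0))
    );
    const total = weights.reduce((a, b) => a + b, 0);
    return v.map((_, i) =>
      vectors.reduce((s, other, j) => s + (weights[j] / total) * other[i], 0)
    );
  });
}

// 3. A "feed-forward-like" step: transform each vector independently.
function feedForward(v: Vector): Vector {
  return v.map((x, i) => Math.tanh(x * (1 + 0.1 * i)));
}

// 5. Softmax: turn raw scores (logits) into probabilities that sum to 1.
function softmax(scores: number[]): number[] {
  const exps = scores.map((s) => Math.exp(s));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function nextTokenProbabilities(promptTokenIds: number[]): number[] {
  let states = promptTokenIds.map((id) => embed(id));   // step 1
  for (let layer = 0; layer < LAYERS; layer++) {
    states = attentionMix(states).map(feedForward);     // steps 2-4
  }
  // 5. Project the last token's state onto every vocabulary entry for a score,
  //    then softmax into probabilities.
  const last = states[states.length - 1];
  const logits = VOCAB.map((_, tokenId) =>
    embed(tokenId).reduce((s, x, i) => s + x * last[i], 0)
  );
  return softmax(logits);
}

console.log(nextTokenProbabilities([0, 1, 2, 3])); // "The capital of France"
```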

As text flows through this process, each input token carries an internal representation the model keeps refining. Those values encode a unique position in the model’s latent space. The resulting coordinates drive the next-token odds, and any of the highest probability tokens could be a valid choice to continue generation.

If you were to reach in and change one value you're now in a different position in latent space - though still quite near to your starting position, given the huge number of coordinates we're talking about.
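
As a rough illustration (with made-up numbers and far fewer dimensions than a real model), nudging a single coordinate barely moves you:

```typescript
// A made-up "position in latent space" - purely to show the idea of a small nudge.
const original = Array.from({ length: 1000 }, (_, i) => Math.sin(i));
const nudged = [...original];
nudged[123] += 0.01; // reach in and change one value slightly

const distance = Math.sqrt(
  original.reduce((sum, x, i) => sum + (x - nudged[i]) ** 2, 0)
);
console.log(distance); // 0.01 - a tiny move within a very large space
```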

As demonstrated earlier, the token probabilities never change for the same input+settings, or an input suffixed with some number of next-tokens from the tree. It's just very complex maths which encodes and applies a learned understanding of the world.

Sampling: Non-determinism, Seeds, Randomness

Okay, so I've avoided how we inject non-determinism for a while now; it's only fair I cover where it comes from. After the forward pass through the LLM, we pass the results to our Decoder/Sampler. Sampling strategies are designed by humans, so they can and will vary significantly in how they work, but a historically common example uses the following process (sketched in code after the list):

  • "temperature" - a setting typically between 0-2 which is used to reshape the probabilities of possible next tokens, bringing lower and higher probabilities nearer to each other. 0 generally results in only the highest probability token being selected, whereas higher values will bring many tokens into contention
  • "softmax" - not a setting, but a normalisation step. The model actually outputs raw scores (called logits), not probabilities like I've been calling them for simplicity in the first half. Softmax turns the logits into proper probabilities that add up to 1.
  • "top_k" - this setting dictates how many tokens should be in contention, for top_k=5 the 5 highest probability tokens are kept and remainder are discarded
  • "top_p" - this setting is used to produce a shortlist of tokens whose probabilities sum to >= top_p, the shortlist is constructed in probability order, and all remaining tokens are discarded
  • "seed" - this is used to kick off a random number generator, and the generator is used to select the next token, the probabilities still matter for weighting but there is always a probabilistic chance that a non-highest token may be selected

Using this process the sampler can produce outputs which differ each time we run the same prompt. But that doesn't mean the LLM has no idea what it wants to say; we've just chosen a specific path through the tree structure, and the LLM already knows its output for that path too. Different technologies do this differently - for example, OpenAI's gpt-5 exposes less control than gpt-4 and so is harder to make deterministic.

I'll also give an honourable mention to differing GPU hardware and software, and differing compute kernels, when deployed on large-scale infrastructure or across multiple providers. Since we're dealing with floating point maths, which can be imprecise and sensitive to the order of operations, there can be subtle differences in the probability output. Additionally, in the extraordinarily rare case that two probabilities come out identical, we may need a tie-breaking strategy.
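
For a feel of the kind of floating point quirk involved, even simple addition depends on the order you do it in, and different hardware or kernels may accumulate sums in different orders:

```typescript
// Floating point addition is not associative: the grouping changes the result slightly.
console.log(0.1 + 0.2 + 0.3);   // 0.6000000000000001
console.log(0.1 + (0.2 + 0.3)); // 0.6
```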

All of this could be treated as a bug and made deterministic for a single application, though, assuming we wanted to. It just very often doesn't matter, as we're dealing with such small differences in the probability outputs.

Resources

If you want to learn more, then I would look no further than 3blue1brown's free Neural Networks course on YouTube. Grant is a master of visualising complex maths, which is a huge help when approaching LLMs at a low level. You can start from "Large Language Models explained briefly" fairly safely, though starting at the beginning will give the strongest foundation.
