The Essence

#ai #llm #machinelearning #rag

The last episode covered model training, micro context amendments, and the need for RAG.

Do you ever feel, as I used to, that there is something you don't really understand about AI? Once trained on data, a model has a north star in the form of frozen weights. It is tempting to think that no matter how many times the same query is made, the generated outcome will be the same. Hold on to that "same" for a while. If you run a database query n times, you get the same result, unless of course the data changes.

Now about that "same" we pinned.

1. The Fixed Map (Logits)

You must have understood by now that when you give AI a prompt, it passes through a frozen neural network. The base model doesn't "think" of a full response at once. Instead, it calculates a raw score called a logit for every single token (a word or part of a word) in the model's entire vocabulary. These logits are then converted into probabilities, which rank how likely each token is to be the very next one in the sequence.

Because the base model is static, if you give the exact same prompt and the exact same context window, those raw scores come out identical (setting aside tiny floating-point differences between hardware or batch configurations).

For example, if the prompt is "The cat sat on the", the static model might output these raw probabilities for the next word:

mat: 65%
floor: 20%
rug: 10%
dog: 4%
moon: 1%
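
To make the logits-to-probabilities step concrete, here is a minimal sketch of the softmax function, the standard way to turn raw scores into probabilities. The logit values are my own invention: I back-computed them from the percentages above and shifted them by an arbitrary constant, so treat them as illustrative numbers rather than anything a real model produced.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits.values())  # subtracting the max keeps exp() numerically stable
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical logits (assumption): log of the percentages above, shifted by 2.0.
# The shift shows that logits are arbitrary real numbers, not probabilities.
target = {"mat": 0.65, "floor": 0.20, "rug": 0.10, "dog": 0.04, "moon": 0.01}
logits = {tok: math.log(p) + 2.0 for tok, p in target.items()}

probs = softmax(logits)
for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok}: {p:.0%}")
```

Run it as often as you like: the same logits always give the same probabilities. That is the "fixed map" part.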

2. The Variable Engine (Sampling)

If the system were fully deterministic (like a standard database query), it would just take the top-ranked word (mat) every single time, an approach known as greedy decoding. If it did that, the output would be 100% predictable, repetitive, robotic, and not so interesting after a point.

Instead, the system introduces a variable selection process called Sampling. The model doesn't automatically pick the first word. Rather, a virtual die is rolled, weighted by those percentages.
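
Here is a small sketch contrasting the two behaviours, using the probabilities from the example above. The function names greedy_pick and sample_pick are mine, and the seeded random generator is only there so the run is repeatable; a real inference stack does this over the whole vocabulary.

```python
import random
from collections import Counter

probs = {"mat": 0.65, "floor": 0.20, "rug": 0.10, "dog": 0.04, "moon": 0.01}

def greedy_pick(probs):
    """Always take the single most likely token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Roll a weighted die: each token wins in proportion to its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(42)  # fixed seed (assumption) so the experiment is repeatable
greedy_counts = Counter(greedy_pick(probs) for _ in range(1000))
sample_counts = Counter(sample_pick(probs, rng) for _ in range(1000))

print("greedy :", dict(greedy_counts))
print("sampled:", dict(sample_counts))
```

The greedy line contains only mat. The sampled line is still dominated by mat, but the other words show up roughly in proportion to their percentages.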

There are many sampling algorithms that choose the next word. To name a few:

Top-k
Top-p
Min-p
Temperature scaling
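
To give a feel for two of these, here is a simplified sketch of Top-k and Top-p (also called nucleus sampling) as filters applied before the die is rolled. The function names and the threshold values are my own choices for illustration, and I have left out Min-p to keep it short.

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p_threshold):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p_threshold, then renormalize."""
    kept, cumulative = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cumulative += p
        if cumulative >= p_threshold:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

probs = {"mat": 0.65, "floor": 0.20, "rug": 0.10, "dog": 0.04, "moon": 0.01}

top2 = top_k_filter(probs, k=2)                 # only mat and floor survive
nucleus = top_p_filter(probs, p_threshold=0.9)  # mat, floor and rug survive

print(top2)
print(nucleus)
```

Both filters do the same job in different ways: they cut off the unlikely tail (dog, moon) so the die can never land there.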

Because the selection is probabilistic, the path the model takes through the sentence changes every time you run the query. Once the model picks floor, that becomes part of the new input, which changes the probabilities for the next word, cascading into a completely different sentence.

Let's choose Temperature scaling for this episode.

3. The "Temperature" Control

The main dial that controls this variance is a parameter called Temperature. It allows you to dynamically alter how strictly the model follows its probabilities without ever actually retraining the model itself.

Key insight: A low temperature (e.g., 0.1) forces the system to be highly deterministic, almost always picking the top word. A high temperature (e.g., 1.5) flattens the distribution, giving lower-ranked words a fighting chance, resulting in more "creative" (but potentially chaotic) text.
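
A minimal sketch of what the dial does, assuming the usual formulation where the logits are divided by the temperature before the softmax. Since I only have the probabilities from the example, I use their logarithms as stand-in logits, which gives the same result. The function name apply_temperature is mine.

```python
import math

def apply_temperature(probs, temperature):
    """Divide log-probabilities by the temperature, then renormalize (softmax)."""
    scaled = {tok: math.log(p) / temperature for tok, p in probs.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = {"mat": 0.65, "floor": 0.20, "rug": 0.10, "dog": 0.04, "moon": 0.01}

cold = apply_temperature(probs, 0.1)     # sharpened: mat takes almost everything
neutral = apply_temperature(probs, 1.0)  # unchanged
hot = apply_temperature(probs, 1.5)      # flattened: the tail gains ground

for name, dist in [("T=0.1", cold), ("T=1.0", neutral), ("T=1.5", hot)]:
    print(name, {tok: round(p, 4) for tok, p in dist.items()})
```

Notice that the ranking never changes. Temperature only stretches or squeezes the gaps between the words, and the weights of the model are never touched.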

How the "Essence" Survives

The "same" we pinned earlier is actually similar. It is similar in essence and similar in meaning.

The real question is how the essence stays the same even when the text varies.

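It stays the same because the trained model assigns extremely low probabilities to words that break logic or context. A higher temperature flattens the distribution but never changes the ranking, so at the temperatures people typically use, the probability of selecting moon in the example above stays low. Only at extreme temperatures does the distribution approach a coin toss across the vocabulary and the output drift into nonsense. In normal use the model might swap "architect" for "engineer," or "difficult" for "complex," because their probabilities are nearly tied. The math makes it very unlikely to swap "difficult" for "pineapple."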

The underlying model acts as the guardrails. The probabilistic sampling just decides which lane the model drives in on any given generation.

I cannot help but mention the mistake I made while learning about this for the first time. As soon as I reached the concept of temperature, I jumped to the conclusion that it sounded similar to Google's PageRank or a Markov chain with a random jump to a different node. On reflection, I realized that a purely random jump cannot preserve the essence. At a very high level, token generation is a probabilistic sequence process that can remind one of a Markov chain, but modern LLMs are far more expressive and use attention over a large context window.

4. The Diversity Engine (Repetition Penalty)

That PageRank epiphany does not go to waste. If temperature decides how strictly the model follows the rules, another parameter forces the model to actually be creative: the Repetition Penalty.

Some systems apply a repetition penalty or frequency penalty. These slightly reduce the probability of tokens that have already appeared, making repetitive loops less likely.

If the model just picked the highest-probability words every time, it would take the lazy route. It would repeat the exact same phrasing. To prevent this, the system tracks the words it has generated so far. Once a word is used, its probability may be reduced.

Imagine the model needs to explain that a problem is hard. The word "complex" might have the highest score. The model uses it. But because of the repetition penalty, the score for "complex" immediately drops. When the model needs to reference the difficulty again in the next paragraph, it is nudged toward the next best option, like "intricate" or "difficult."

The model is discouraged from taking the exact same path twice. It is forced to navigate a completely new route through its vocabulary, but the context window keeps it driving toward the exact same destination.
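
The "complex" example can be sketched like this. I am assuming a simple subtractive frequency penalty (lower a token's logit by a fixed amount for each time it has already appeared); real systems differ in the exact formula. The logit values and the penalty of 1.0 are made up for illustration.

```python
import math
from collections import Counter

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

def penalize(logits, generated_tokens, penalty):
    """Lower each token's logit by `penalty` for every time it was already used."""
    counts = Counter(generated_tokens)  # tokens not yet used count as 0
    return {tok: score - penalty * counts[tok] for tok, score in logits.items()}

# Hypothetical logits (assumption) for the next word describing a hard problem.
logits = {"complex": 2.0, "intricate": 1.6, "difficult": 1.5, "pineapple": -4.0}

before = softmax(logits)
after = softmax(penalize(logits, ["complex"], penalty=1.0))

print("before:", max(before, key=before.get))  # top choice before any penalty
print("after :", max(after, key=after.get))    # top choice once "complex" was used
```

The penalty reshuffles the near-synonyms at the top, while "pineapple" stays far out of reach. A new route, the same destination.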

The reflection and what's next

The engineering mind's mistake is to expect the same query to return the same text, the way a database would:

A trained model is a deterministic worldview encoded into weights. Therefore the same question tends to converge toward a similar essence, even when the exact wording differs.

In the next episode, let's ask what decides the tonality of the words.