Context First AI
The Autocomplete That Ate the World: What LLMs Actually Are (And How They Learn)

A large language model is a next-token prediction machine trained on hundreds of billions of words. It doesn't verify facts — it predicts plausible text. It has no memory between sessions. And the capabilities that make it remarkable weren't programmed in — they emerged from scale. This post explains what that means, why it matters, and what it changes about how you use these tools.

This is Part 1 of a five-part series from the Vectors pillar of Context First AI. Built for anyone starting their AI journey — developer or not. No prior knowledge assumed. Each part builds on the last.

Full series:

  • Part 1 — The Autocomplete That Ate the World
  • Part 2 — You're Not Reading Words, You're Reading Chunks
  • Part 3 — Meaning Has a Shape
  • Part 4 — You're Not Writing Prompts, You're Writing Instructions for a Very Particular Mind
  • Part 5 — What to Do When the Model Doesn't Know Enough

The Feeling Nobody Talks About

It started, for a lot of developers, somewhere around the third or fourth time they used an AI tool and couldn't explain why it behaved the way it did.

Not a junior developer on their first project — a mid-level engineer with five years of experience, someone comfortable with APIs, databases, asynchronous logic. They could call the OpenAI API. They could parse the response. They could wire it into a product. But the why behind what came back — why the same prompt produced different results, why the model sometimes confidently produced nonsense, why it seemed to forget everything from the last session — remained opaque.

That gap matters more than it might seem. Because when something breaks and you don't understand the underlying model, debugging becomes guesswork.

So. Let's close the gap.

What an LLM Actually Is

A large language model — an LLM — has one core mechanism: next-token prediction.

You give it a sequence of tokens (we'll cover what tokens are in Part 2 — for now, think of them as chunks of text). It produces a probability distribution over all possible next tokens. It samples from that distribution. That sampled token gets appended to the sequence. The process repeats.

In pseudocode, the core loop looks something like this:

def generate(prompt, model, max_tokens=100):
    tokens = tokenize(prompt)

    for _ in range(max_tokens):
        # Model produces probability distribution over vocabulary
        logits = model.forward(tokens)
        probabilities = softmax(logits)

        # Sample next token from distribution
        next_token = sample(probabilities)
        tokens.append(next_token)

        # Stop if end-of-sequence token is produced
        if next_token == EOS_TOKEN:
            break

    return detokenize(tokens)

This is a simplified sketch: real implementations involve batching, key-value caching, and various sampling strategies like temperature and top-p. But the loop itself is accurate. The model is not retrieving answers from a database. It is not searching the web. It is predicting what comes next, one token at a time, using patterns it absorbed during training.
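To make "sampling strategies" less abstract, here is a minimal pure-Python sketch of temperature scaling and top-p (nucleus) sampling. The function name and default values are illustrative, not from any particular library:

```python
import math
import random

def sample_with_temperature_and_top_p(logits, temperature=0.8, top_p=0.9):
    """Sample a token index from raw logits using temperature and top-p."""
    # Temperature scaling: <1 sharpens the distribution, >1 flattens it
    scaled = [l / temperature for l in logits]

    # Softmax (subtract the max for numerical stability)
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p filtering: keep the smallest set of highest-probability
    # tokens whose cumulative probability reaches top_p
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample from the renormalised kept set
    kept_mass = sum(probs[i] for i in kept)
    r = random.random() * kept_mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Low temperature makes sampling nearly deterministic (it collapses onto the highest-probability token); higher temperature and larger top-p widen the set of tokens the model will actually emit. This is why the same prompt can produce different results.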

That distinction (prediction versus retrieval) is foundational. Keep it close.

How Pre-Training Works

Before the model can predict anything usefully, it has to learn. This is called pre-training, and it is the process that creates everything downstream: the knowledge, the reasoning patterns, the stylistic range, all of it.

The training setup is deceptively simple.

Take an enormous corpus of text: hundreds of billions of words drawn from books, websites, code repositories, articles, and more. Feed it to the model. Ask the model to predict the next token at each position. Measure how wrong it was. Update the weights to be slightly less wrong. Repeat.

In practice, this is framed as minimising cross-entropy loss between the model's predicted distribution and the actual next token:

import torch
import torch.nn.functional as F

def compute_loss(logits, targets):
    # logits: (batch_size, sequence_length, vocab_size)
    # targets: (batch_size, sequence_length)

    # Flatten batch and sequence dimensions for cross-entropy
    logits = logits.view(-1, logits.size(-1))  # (batch * seq_len, vocab_size)
    targets = targets.view(-1)                 # (batch * seq_len,)

    return F.cross_entropy(logits, targets)
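To see what that loss actually measures, here is cross-entropy computed by hand for a single prediction over a toy three-token vocabulary. This is a pure-Python sketch with made-up probabilities, no framework required:

```python
import math

def cross_entropy(probabilities, target_index):
    """Negative log of the probability assigned to the correct next token."""
    return -math.log(probabilities[target_index])

# The model's predicted distribution over a 3-token vocabulary
probs = [0.7, 0.2, 0.1]

# If the actual next token is index 0, the model was fairly confident: low loss
confident_loss = cross_entropy(probs, 0)   # ≈ 0.357

# If the actual next token is index 2, the model gave it only 10%: high loss
surprised_loss = cross_entropy(probs, 2)   # ≈ 2.303
```

Training nudges the weights so that, averaged over trillions of such predictions, this number goes down. That is the entire objective.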

The model has no explicit labels, no curated Q&A pairs, no hand-crafted rules. Just text and the task of predicting the next piece of it. Across trillions of these prediction attempts, something emerges: internal representations of grammar, factual associations, reasoning structures, code syntax — and something that looks, in practice, remarkably like understanding.

We wouldn't call it understanding in the philosophical sense. But for the purposes of building on top of it, it behaves like understanding in most situations that matter.

Why Scale Changes Everything

Here is the part that surprised even the researchers building these systems.

When you increase model size — more parameters, more training data, more compute — the model does not simply get incrementally better at prediction. At certain scale thresholds, new capabilities appear that were not present in smaller versions of the same architecture.

The model can answer questions it was never directly trained on. It can write working code in languages it was not explicitly fine-tuned for. It can follow complex multi-step instructions. It can explain its own reasoning. These are called **emergent capabilities**, and their appearance at scale is one of the more genuinely surprising empirical findings in recent AI research.

A rough intuition: a smaller model learns surface patterns. A larger model, trained on enough varied data, is forced to develop something closer to a generalisable internal model of how language and ideas work — because that is the only way to keep improving at prediction across such varied input.

From a developer's perspective, the practical implication is this: the capabilities you can build on top of a frontier model are substantially different from what was possible two or three years ago. Not just better — qualitatively different.

Three Things That Change How You Build

Now that the mechanism is clear, here are three behavioural realities worth internalising before you write another line of code that calls an LLM.

1. Confidence is not accuracy

When an LLM produces a confident-sounding answer, it is not because the model has verified the claim. It is because confident-sounding text frequently follows prompts like yours in its training distribution.

# This prompt will get a confident-sounding answer regardless of accuracy
prompt = "What was the revenue of Acme Corp in Q3 2024?"

# The model has no access to this data. It will predict plausible-sounding text.
# If Acme Corp is not in its training data, it may fabricate a number.
# If it is, that data may be outdated or misremembered.

# Mitigation: supply the data in context, or use retrieval (covered in Part 5)
prompt_with_context = """
The following is Acme Corp's Q3 2024 financial report:
[PASTE ACTUAL DATA HERE]

Based only on the above, what was Acme Corp's revenue in Q3 2024?
"""

The fix is not to distrust the model — it is to supply the information you need it to reason over, rather than asking it to retrieve information from its weights.

2. There is no memory between sessions

Each API call, each new conversation, starts from the same base model state. Nothing persists.

import openai

client = openai.OpenAI()

# Session 1
response_1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "My name is Alex and I work in compliance."}
    ]
)
# Model now knows this — within this session

# Session 2 (new API call, no history passed)
response_2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What did I tell you about my role?"}
    ]
)
# Model has no idea. Previous session is completely gone.

# Correct approach: pass conversation history explicitly
conversation_history = [
    {"role": "user", "content": "My name is Alex and I work in compliance."},
    {"role": "assistant", "content": response_1.choices[0].message.content},
    {"role": "user", "content": "What did I tell you about my role?"}
]

response_3 = client.chat.completions.create(
    model="gpt-4o",
    messages=conversation_history
)
# Now the model has the context it needs

Managing conversation history is your responsibility as the developer. The model does not maintain state. You do.
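Because that state lives in your code, you also have to keep it within the model's context limit. Here is a minimal sketch of one common approach: preserve the system message and drop the oldest turns once an assumed token budget is exceeded. The `count_tokens` heuristic and the budget value are illustrative; a real implementation would use the model's actual tokenizer:

```python
def count_tokens(message):
    # Rough heuristic: ~4 characters per token, plus per-message overhead.
    # A real implementation would use the model's tokenizer (e.g. tiktoken).
    return len(message["content"]) // 4 + 4

def trim_history(messages, max_tokens=3000):
    """Keep the system message (if any) and the most recent turns that fit."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(count_tokens(m) for m in system)
    kept = []
    # Walk backwards from the newest turn, keeping whatever still fits
    for message in reversed(turns):
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    return system + list(reversed(kept))
```

You would call `trim_history(conversation_history)` before each API request. Dropping the oldest turns is the simplest policy; summarising them is a common refinement.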

3. Training distribution shapes output quality

The model performs best on topics, formats, and styles that were well-represented in its training data. Push it into territory that was sparse in training — highly specialised domains, obscure technical standards, your internal company knowledge — and quality degrades.

This is not a bug. It is a logical consequence of how learning works. The mitigation is to supply the relevant information in context, or to fine-tune on domain-specific data — both of which are topics for later in this series.

What This Foundation Unlocks

Understanding pre-training changes how you debug, how you architect, and how you set expectations.

When a model produces incorrect output, the first question is no longer "is the model broken?" It is "what did the model have to work with?" Was it relying on parametric knowledge that may be outdated or absent? Was the context window structured in a way that buried the important information? Was the prompt specific enough to narrow the prediction space toward what you actually needed?

These are tractable questions. And they are only askable once you understand what an LLM actually is.

What Comes Next

Part 2 covers tokens and context windows — the mechanics that determine what the model can see and process at any one time. For developers, this is where token counting, chunking strategies, and context management start to make concrete sense.

We'll see you there.

Created with AI assistance. Originally published at [Context First AI]
