Ever wondered how ChatGPT and other large language models actually work? Under the hood, an LLM does exactly one thing: predict the next token. Everything else (reasoning, code, conversation) emerges from doing that incredibly well after training on up to trillions of tokens of text.
Here's the whole pipeline, end to end.
1. Tokenization
The model never sees raw letters. A tokenizer splits text into subword tokens and maps each to an integer id.
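To make the text-to-ids step concrete, here is a toy sketch. It assumes a tiny hand-written vocabulary (`TOY_VOCAB`, invented for this example) and uses greedy longest-match splitting; production tokenizers typically learn their vocabulary with algorithms such as byte-pair encoding, so treat this as an illustration of the idea rather than how any real tokenizer is implemented.

```python
# Toy greedy longest-match subword tokenizer.
# TOY_VOCAB is a hypothetical vocabulary invented for this example;
# real tokenizers learn tens of thousands of subwords from data.
TOY_VOCAB = {"un": 0, "believ": 1, "able": 2, "token": 3, "s": 4, " ": 5, "<unk>": 6}


def tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, left to right."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No piece matched: emit the unknown id and skip one character.
            ids.append(vocab["<unk>"])
            i += 1
    return ids


token_ids = tokenize("unbelievable tokens")
print(token_ids)  # one integer id per subword piece
```

Note that "unbelievable" is not in the vocabulary as a whole word; it comes out as three pieces ("un", "believ", "able"), which is the point of subword tokenization.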
2. Embeddings
Each token id is used to look up a learned vector, its embedding. After training, tokens with related meanings tend to sit close together in that vector space.
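A minimal sketch of that lookup, assuming a three-word vocabulary and hand-picked 3-dimensional vectors (`TOKEN_TO_ID` and `EMBEDDING_TABLE` are made up for illustration; a real model learns its table during training and uses far more dimensions). Cosine similarity is one common way to measure how "close" two vectors are.

```python
import math

# Hypothetical, hand-picked vectors. A trained model would learn these values.
TOKEN_TO_ID = {"cat": 0, "dog": 1, "car": 2}
EMBEDDING_TABLE = [
    [0.9, 0.8, 0.1],  # id 0: "cat"
    [0.8, 0.9, 0.2],  # id 1: "dog"
    [0.1, 0.2, 0.9],  # id 2: "car"
]


def embed(token_id):
    """An embedding layer is a table lookup: one row per token id."""
    return EMBEDDING_TABLE[token_id]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat, dog, car = (embed(TOKEN_TO_ID[t]) for t in ("cat", "dog", "car"))
print("cat vs dog:", cosine_similarity(cat, dog))
print("cat vs car:", cosine_similarity(cat, car))
```

With these made-up vectors, "cat" scores as more similar to "dog" than to "car", which mirrors what trained embeddings tend to show for related words.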
3. Self-attention
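Every token scores its relevance to the other tokens in the context and decides how much to attend to each. In GPT-style models a token can attend only to itself and the tokens before it, never to later ones. This is how the model builds context.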
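Here is a small sketch of scaled dot-product attention with a causal mask, the variant used in GPT-style models where a position may not look at later positions. It makes one large simplification, labeled in the code: the same toy vectors serve as queries, keys, and values, whereas a real model derives each from the embeddings through separate learned projections and runs several attention heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees only positions 0..i."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score the query against each visible key, scaled by sqrt(d).
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is the weighted average of the visible value vectors.
        out = [
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Simplifying assumption: reuse the same toy vectors as Q, K, and V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens, tokens, tokens)
for i, w in enumerate(weights):
    print(f"position {i} attends with weights {w}")
```

The first position has nothing earlier to look at, so all of its attention goes to itself; later positions spread their weight over everything seen so far.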
4. The transformer stack
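Each transformer block pairs self-attention with a feed-forward network. Blocks are stacked dozens of times (the largest GPT-3 model used 96 layers), and each one refines the representation produced by the block before it.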
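The stacking pattern can be sketched as below. This is only the skeleton: `mix_with_past` and `feed_forward` are hypothetical stand-ins with no learned weights, and the sketch leaves out layer normalization. What it does show is the usual shape of a block: a sublayer that mixes information across positions, a sublayer that transforms each position on its own, each added back onto its input (a residual connection), with the whole block repeated.

```python
def mix_with_past(xs):
    """Stand-in for attention: each position averages itself and earlier ones."""
    out = []
    for i in range(len(xs)):
        window = xs[: i + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def feed_forward(x, scale=0.5):
    """Stand-in for the feed-forward network: ReLU, then a fixed scale."""
    return [scale * max(0.0, v) for v in x]


def block(xs):
    """One simplified transformer block with two residual sublayers."""
    # Sublayer 1: mix information across positions, added back residually.
    mixed = mix_with_past(xs)
    xs = [[a + b for a, b in zip(x, m)] for x, m in zip(xs, mixed)]
    # Sublayer 2: transform each position independently, again residual.
    xs = [[a + b for a, b in zip(x, feed_forward(x))] for x in xs]
    return xs


def run_stack(xs, num_layers):
    """Apply the block repeatedly; the output of one layer feeds the next."""
    for _ in range(num_layers):
        xs = block(xs)
    return xs


hidden = run_stack([[1.0, 2.0], [3.0, 4.0]], num_layers=3)
print(hidden)  # same shape as the input: one vector per position
```

In a real model every layer has its own learned weights, so each pass applies a different transformation rather than the same fixed one used here.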
5. Predict and sample
The final layer scores every token in the vocabulary, softmax turns those scores into probabilities, and the model samples the next token — then appends it and runs again. That loop is how text is generated.
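The generation loop can be sketched in a few lines. Everything model-like here is an assumption made for illustration: `VOCAB` is a four-token vocabulary and `toy_logits` is a hand-written score table that looks only at the last token, standing in for the full transformer. The softmax, the sampling step, and the append-and-repeat loop are the parts that mirror the description above.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "."]  # hypothetical 4-token vocabulary

# Hypothetical score table: row = last token id, column = candidate next id.
SCORE_TABLE = {
    0: [0.1, 3.0, 0.5, 0.1],
    1: [0.1, 0.1, 3.0, 0.5],
    2: [1.0, 0.1, 0.1, 3.0],
    3: [3.0, 0.5, 0.1, 0.1],
}


def toy_logits(token_ids):
    """Stand-in for the model: one score per vocabulary entry."""
    return SCORE_TABLE[token_ids[-1]]


def softmax(scores):
    """Convert scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, steps, rng):
    """Sample one token at a time, appending each to the context."""
    ids = list(prompt_ids)
    for _ in range(steps):
        probs = softmax(toy_logits(ids))
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        ids.append(next_id)  # the new token becomes part of the next input
    return ids


rng = random.Random(0)  # seeded so repeated runs give the same sample
generated = generate([0], steps=5, rng=rng)
print(" ".join(VOCAB[i] for i in generated))
```

Because the next token is sampled rather than always taken as the top-scoring one, different seeds can produce different continuations from the same prompt.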
Go deeper
The full written guide — with runnable Python (tiktoken, PyTorch), the math behind attention, and how training shapes the weights:
👉 LLM Foundations — How Large Language Models Actually Work (with Python)