Introduction
About a week ago, I finally figured out how the statistical machinery behind LLMs actually works. Most people only see the surface, but underneath there are transformers, attention mechanisms, training methods, alignment, and self-evaluation. In this post I want to give a minimal understanding of what hides behind machine learning, LLMs, and RAG systems. I deliberately skip the depth on topics like self-attention, cross-attention, image generation, and other more specialized areas. Here we will only skim the surface to understand how a model answers our questions.
Let’s break it down step by step so it’s clear even for someone who is hearing about neural networks for the first time.
How a Regular LLM Works
A Regular Model Is Simply a Next-Token Predictor
A regular large language model is essentially a generator of the next most probable token. At the core of LLMs is the transformer architecture, which uses attention mechanisms to determine which parts of the input are most relevant when generating each token.
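To make "generator of the next most probable token" concrete, here is a toy sketch. It is not a transformer, just simple bigram counting over a made-up corpus, but it shows the core idea: the model picks whichever token most often followed the current one in its training data.

```python
from collections import Counter, defaultdict

# Toy "training data" for our tiny next-token predictor.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each token follows each other token (bigram counts).
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent (most probable) next token."""
    return follow_counts[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often in the corpus
```

A real LLM does the same thing in spirit, but the probabilities come from a deep network conditioned on the entire context, not from raw counts.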
First, the model goes through pre-training. It is shown a huge amount of text from the internet, lectures, articles, documentation, and everything else. All of this is split into tokens — small pieces of text (words or parts of words).
Each token is turned into numbers — vectors called embeddings. The model analyzes these numbers and learns which words and phrases often appear together. For example, we humans intuitively understand the connection between the words “wall” and “hammer” — they are linked by the topic of repair. The model “understands” this too (it doesn’t truly understand meaning, but it captures statistical relationships between words based on patterns in data), because in thousands of articles about repairs these words frequently appear near each other. Through the weights of its connections, it sees how strongly one word depends on another.
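The “closeness” of embeddings is usually measured with cosine similarity. The sketch below uses tiny 3-dimensional vectors with made-up numbers (real models learn vectors with hundreds of dimensions); the point is only that related words end up with similar vectors.

```python
import math

# Hypothetical 3-dimensional embeddings. The numbers are invented for
# illustration; real embeddings are learned from data.
embeddings = {
    "wall":   [0.9, 0.1, 0.3],
    "hammer": [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, 0.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "wall" is far closer to "hammer" than to "banana".
print(cosine_similarity(embeddings["wall"], embeddings["hammer"]))
print(cosine_similarity(embeddings["wall"], embeddings["banana"]))
```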
After pre-training, the model is further fine-tuned using RLHF — Reinforcement Learning from Human Feedback. During RLHF, human evaluators rank or compare model outputs, and the model is optimized to prefer more helpful and safe responses.
In addition, many models are trained with further alignment techniques and safety guidelines that shape what they can and cannot say. For example, regular models avoid sexualized content and follow certain ethical boundaries.
How the Model Answers Our Question
When we ask a question to the finished model, it is first converted into embeddings — vector representations. Then the model starts generating the answer token by token, choosing the most probable continuation each time.
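The token-by-token loop can be sketched like this. The probability table is hard-coded and hypothetical — a real model recomputes the distribution with a transformer over the whole context at every step — but the greedy “pick the most probable continuation” loop is the same shape.

```python
# Hypothetical next-token probability tables, invented for illustration.
next_token_probs = {
    "the": {"cat": 0.6, "dog": 0.3, "mat": 0.1},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.6, "<end>": 0.4},
}

def generate(prompt_token, max_tokens=5):
    """Greedy decoding: repeatedly append the most probable next token."""
    output = [prompt_token]
    for _ in range(max_tokens):
        probs = next_token_probs.get(output[-1])
        if probs is None:          # no continuation known for this token
            break
        best = max(probs, key=probs.get)
        if best == "<end>":        # model chose to stop
            break
        output.append(best)
    return " ".join(output)

print(generate("the"))  # the cat sat down
```

Real systems usually sample from the distribution instead of always taking the maximum, which is why the same prompt can produce different answers.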
If we talk about regular models like Gemini, GPT, Kimi, or Grok, they usually rely on their internal knowledge. But nowadays many systems also add the ability to connect external sources.
What Is RAG and How Systems with Documents Work (for example, NotebookLM)
When a model can rely on external documents, this is called RAG — Retrieval-Augmented Generation, a system with search and augmentation.
Imagine you uploaded two or three documents with the best African cuisine recipes into NotebookLM. Then you ask: “What is the best recipe I can cook if I only have onions, garlic, potatoes, and soy sauce?”
Here’s what happens inside:
- When you upload the documents, the system reads them, splits them into small chunks (pieces of text), and turns each chunk into an embedding. All these embeddings are stored in a vector store (a database where all the vectors live).
- Your question is also converted into an embedding.
- The system compares the question’s embedding with the embeddings from the documents. The comparison is based on similarity (often cosine similarity): the closer the vectors are in direction, the more similar the meaning. This is how it finds the most relevant pieces of text.
- These relevant chunks are retrieved from the original documents and added to your question. The result is an augmented prompt: your question plus useful excerpts from the documents.
- Now the model itself receives this augmented prompt and generates the answer token by token, relying on the provided information.
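The steps above can be sketched end to end. The `embed` function here is a deliberately crude stand-in for a real embedding model — it just collects words, and `similarity` uses word overlap instead of cosine similarity — but it lets the retrieve-then-augment pipeline run as one piece.

```python
import re

def embed(text):
    # Toy "embedding": the set of lowercase words. Real systems use a
    # neural embedding model producing dense vectors.
    return set(re.findall(r"[a-z]+", text.lower()))

def similarity(a, b):
    # Word-overlap (Jaccard) as a toy stand-in for cosine similarity.
    return len(a & b) / len(a | b)

# Chunks as they might look after splitting the uploaded recipe documents.
chunks = [
    "Fry onions and garlic, add diced potatoes and soy sauce.",
    "Grill plantains and serve with spicy peanut sauce.",
    "Slow-cook beef with tomatoes and berbere spice.",
]

def retrieve(question, top_k=1):
    """Rank chunks by similarity to the question, return the best ones."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: similarity(q_vec, embed(c)),
                    reverse=True)
    return ranked[:top_k]

question = "What can I cook with onions, garlic, potatoes, and soy sauce?"
context = retrieve(question)[0]
augmented_prompt = (
    f"Use only this source:\n{context}\n\nQuestion: {question}"
)
print(augmented_prompt)  # the model receives question + retrieved chunk
```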
The system can also add instructions such as “Use only these pieces of information,” or include source details (document name, author, etc.) in the prompt. Because the model receives those details, the system can attach proper references to the original documents, so you can see where each piece of information came from.
The Main Thing to Remember
In the end, even the most advanced systems are still fundamentally probabilistic text generators. What makes them powerful is not true understanding, but the combination of large-scale training, attention mechanisms, and (optionally) access to external knowledge through systems like RAG.
This basic understanding is worth having before diving deeper into more advanced topics: how models work with images, video, agents, and so on.