Jacob_D
An Intro to Large Language Models and the Transformer Architecture: Talking to a calculator

All models are wrong, but some are useful.
— George E. P. Box

I want to tell you a story — or better yet, two stories.
One is about the support I received while feeling down; the other is about not understanding Databricks well enough. Both are connected to the GPT model.
Not so long ago, I went through a period in my life when I needed to restructure things. I started reading books and searching for information on YouTube, but that wasn’t enough. So I decided to talk to ChatGPT, which made me feel genuinely listened to by suggesting further reading and viewing. Just as I would with a person, I corrected it when it was wrong, allowing it to improve and guide me toward areas of self-development I hadn’t even dreamed about. In this case, the model was very useful.
Around the same time, I faced another challenge: I needed an easy way to compare two large DataFrames in Databricks. I asked the GPT model, dressed fancily in its Copilot attire, how to do it, and it suggested simply subtracting them. That did reveal that differences existed, but not what they were. The DataFrames I work with have not only many rows but usually many columns as well. I copied the differing rows into Excel to compare data types and values there, but that still didn't help. In the end, only good old Stack Overflow could lend me a helping, though, as so often, not the most welcoming, hand. (I even wanted to link the question here, but it got closed and removed in the meantime xD Ah, the culture of that place!)
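For what it's worth, the missing piece in my case was a cell-level diff. As a rough illustration (using pandas rather than Spark, purely so the sketch is self-contained and runnable; the frames and column names are made up), `DataFrame.compare` reports exactly which cells differ instead of just which rows do:

```python
import pandas as pd

# Two made-up frames that agree everywhere except one cell.
left = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
right = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 25.0, 30.0]})

# compare() keeps only the differing cells, side by side
# ("self" vs "other"), rather than the whole-row result
# a plain subtraction gives you.
diff = left.compare(right)
print(diff)
```

In Spark itself, a join on the key columns followed by column-wise comparison achieves something similar; the pandas version just shows the idea in a few lines.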
Why was GPT so helpful in the first situation, but not in the second? I believe the answer to this question lies in how the transformer architecture works.

What is a Large Language Model?

Despite their impressive capabilities, large language models are essentially structured sets of numerical parameters — often billions — whose interactions are governed by a configurable neural network architecture. These parameters are arranged into matrices and vectors that emerge from the training process. Through exposure to enormous datasets, the model learns statistical relationships between tokens, building an internal representation of language.
In the words of Andrej Karpathy,

A Large Language Model is simply a fancy autocomplete.

This humorous simplification highlights that, at a high level, LLMs simply predict the next piece of text. They do not reason; they do not think. Instead, they process vast amounts of numbers to produce human-like text. The largest and most capable models, those able to work through complex, PhD-level mathematical tasks, often contain tens or even hundreds of billions of unquantized parameters; because of their cost, however, they are not, and will not soon be, broadly used.
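That "fancy autocomplete" can be pictured as a probability distribution over a vocabulary: the model scores every possible next token and the most likely one wins. A toy sketch (the vocabulary and logit values here are invented for illustration):

```python
import numpy as np

# Made-up vocabulary and made-up logits for the context "The sky is".
vocab = ["blue", "green", "loud", "falling"]
logits = np.array([4.0, 1.5, 0.2, 0.8])

# Softmax turns raw scores into a probability distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# "Fancy autocomplete": pick the most probable next token.
next_token = vocab[int(np.argmax(probs))]
print(next_token)  # → blue
```

Real models sample from this distribution rather than always taking the argmax, which is why the same prompt can yield different completions.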

Inside the Transformer


*Image borrowed from: https://www.mygreatlearning.com/blog/understanding-transformer-architecture/

The true engine of modern LLMs is the transformer architecture. Depending on the specific model family, transformers may use decoder-only layers (as in GPT models), encoder-only layers, or a combination of encoder–decoder stacks.
As embedding vectors flow through these transformer blocks, they undergo numerous learned transformations. The model uses mechanisms such as self-attention to evaluate how each token relates to all others in context. In doing so, it develops a deep representation of meaning, structure, and intent.
Layer by layer, the vectors become more abstract and refined, capturing multiple semantic layers.
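To make self-attention a little less abstract, here is a minimal single-head, unmasked sketch in NumPy. The weights are random stand-ins, not a trained model; the point is only the shape of the computation:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (no masking)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Token-to-token affinities, scaled by sqrt of the key dimension.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax: each token's attention weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mix of all value vectors.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                 # 3 tokens, 4-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (3, 4): same shape in, same shape out
```

Each token's output vector now contains information from every other token in the sequence, which is exactly the "relating each token to all others" described above.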

Tokenization: Breaking Language Down

Before text enters a model, it must be converted into a format the model can understand. The text is split into chunks called tokens, which may represent characters, syllables, words, or subwords.


*Real-life prompt tokenized

Among various tokenization strategies, Byte-Pair Encoding (BPE) and its modern variants have become especially widespread, thanks to their efficiency and strong performance in popular models.
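The core of BPE is simple to sketch: repeatedly merge the most frequent adjacent pair of tokens into a new, longer token. A toy version (not the exact algorithm any particular tokenizer ships, and tie-breaking here is arbitrary):

```python
from collections import Counter

def bpe_merge_once(tokens):
    """One BPE step: merge the most frequent adjacent token pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)   # fuse the pair into one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")  # start from individual characters
for _ in range(4):                 # apply a few merge steps
    tokens = bpe_merge_once(tokens)
print(tokens)                      # common fragments like "low" emerge
```

After a few merges, frequent fragments such as "low" become single tokens, which is why common words usually cost one token while rare words get split into several.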

Embedding: Moving From Words to Numbers

Once tokenized, each discrete token is mapped to a continuous numerical vector through a process called embedding. These embeddings allow the model to work in a high-dimensional numerical space where patterns, relationships, and semantic meaning can be encoded. The resulting dense vector representations form the basis for all later processing inside the model.
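Mechanically, embedding is just a row lookup in a learned table. A toy sketch, with random values standing in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy embedding table: 10-token vocabulary, 4-dimensional vectors.
# In a real model these values are learned during training.
vocab_size, d_model = 10, 4
embedding_table = rng.normal(size=(vocab_size, d_model))

# Tokenized input: a sequence of token ids.
token_ids = [3, 7, 1]

# Embedding is simply indexing rows of the table.
embeddings = embedding_table[token_ids]
print(embeddings.shape)  # (3, 4): one vector per input token
```

In a real model the table has tens of thousands of rows and thousands of columns, but the lookup itself is exactly this simple.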

Output and Unembedding

Finally, these refined vectors are passed through an unembedding (or output projection) layer. Here, the model converts its internal numerical representation into tokens. The tokens form the words and sentences that appear in the model’s output.
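A minimal sketch of that step (random values in place of trained weights, and a made-up four-word vocabulary): project the final hidden vector onto the vocabulary, softmax the logits, and pick a token:

```python
import numpy as np

rng = np.random.default_rng(7)
vocab = ["the", "cat", "sat", "mat"]
d_model = 4

# Final hidden vector for the last position, plus the unembedding
# (output projection) matrix mapping it back to vocabulary logits.
hidden = rng.normal(size=(d_model,))
W_unembed = rng.normal(size=(d_model, len(vocab)))

logits = hidden @ W_unembed
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The chosen token becomes the next piece of output text.
print(vocab[int(np.argmax(probs))])
```

This closes the loop: text in, numbers throughout, text back out.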


*Real-life response tokenized

A Technology that Drives Technology

Large language models do not understand the world in the same way humans do. This limitation leads to what are commonly referred to as hallucinations. However, as many of you have witnessed, they can still be quite useful in certain contexts. As Professor Aleksander Mądry said:

AI is not just any technology; it is a technology that accelerates other technologies and science. It serves as a highway to faster progress. Ignoring it wouldn’t be wise.

Understanding how LLMs and transformers work is essential for making informed decisions about when and how to use them effectively.

Outro

As you can see, the model doesn’t perceive text — or its semantics — the way we do. It breaks words into smaller chunks and converts these chunks into numerical representations. It then performs transformations on these vectors and, based primarily on the results of these transformations, generates an output.
As I mentioned earlier, widely accessible models are often quantized, which essentially means they trade precision for affordability. This is why, in my experience, the GPT model was quite effective at the softer self-development task, but struggled with the one that required precise, detailed knowledge, where it could not give partial answers and let me guide it toward a final solution.
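To make the precision trade-off concrete, here is a crude symmetric int8 quantization of a few made-up weights (real quantization schemes are considerably more sophisticated, e.g. per-channel scales, but the basic loss of precision looks like this):

```python
import numpy as np

# A few float32 "weights" and a crude symmetric int8 quantization,
# purely to illustrate the precision/size trade-off.
weights = np.array([0.0123, -0.4567, 0.9891, -0.0004], dtype=np.float32)

scale = np.abs(weights).max() / 127.0          # map max |w| onto int8 range
quantized = np.round(weights / scale).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

error = np.abs(weights - dequantized)
print(quantized)      # each value now stored in 1 byte instead of 4
print(error.max())    # small, but nonzero: precision was lost
```

Multiplied across billions of parameters, those small rounding errors are the price paid for a model that fits in cheaper memory.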
