Understanding How LLMs Work: From Text to Tokens, Embeddings, Transformers, and Predictions

Klinsmann R — Wed, 01 Jul 2026 09:51:01 +0000

Artificial Intelligence is nothing new. It has been around since the early days of computing and has evolved slowly over time. Today, Generative AI, or GenAI, has become one of the most popular and widely adopted categories of AI.
At the simplest level, most people understand GenAI like this:
Prompt (text) → LLM engine → Answer (text, image, video, code, etc.)
This is the easiest way to look at it. You give the system an input, and it gives you an output.
But let’s go one step deeper.
This next level of understanding should not be limited only to IT professionals. In the coming years, almost everyone will interact with AI tools in some form, so it is useful to understand what actually happens behind the scenes when we send a prompt to an AI model.
To understand that, we need to look at the journey from human text to machine-readable numbers, then to embeddings, Transformers, predictions, and finally back to human-readable output.

Large Language Models, commonly called LLMs, are one of the most important technologies behind modern AI tools like ChatGPT, Claude, Gemini, Llama, and GitHub Copilot. They can answer questions, write code, summarize documents, translate languages, explain concepts, and generate human-like text.

But behind the simple chat interface, a lot happens mathematically. An LLM does not understand human language the same way humans do. It converts text into numbers, processes those numbers using a neural network, and then predicts the next piece of text step by step.

This article explains the flow clearly: text → tokens → embeddings → Transformer → prediction → text.

What is an LLM?

LLM stands for Large Language Model.

It is called “large” because it is trained on huge amounts of text and has a very large number of internal parameters. It is called a “language model” because its main job is to model language patterns.

In simple terms, an LLM learns how words, phrases, sentences, code, and ideas usually appear together. Then, when we give it a prompt, it generates a response by predicting what tokens should come next.

LLMs are used for many tasks, including writing emails, answering questions, summarizing documents, translating languages, helping with programming, generating ideas, and acting as chat assistants.

Why Text Must Become Numbers

A computer does not directly understand words like humans do.

For us, the word “cat” immediately brings meaning: an animal, a pet, fur, meowing, drinking milk, and so on. But for a computer, raw text is not directly useful. Machine learning models work using numbers, vectors, matrices, and probabilities.

So before an LLM can process language, the text must be converted into a numerical form.

The rough flow is:

Human text → tokens → token IDs → vectors → model processing

Tokenization: Breaking Text Into Pieces

When we send a prompt to an LLM, the first step is tokenization.

Tokenization means breaking text into smaller pieces called tokens.

A token can be:

a full word
part of a word
punctuation
a symbol
a space or formatting pattern

For example:

I like cats

may become:

I / like / cats

Then each token is converted into a token ID (the IDs below are illustrative; real values depend on the tokenizer):

I     → 40
like  → 892
cats  → 12075

So the prompt becomes:

[40, 892, 12075]

These numbers are not meanings by themselves. They are just IDs, like row numbers in a table.

A common mistake is thinking the token ID itself contains meaning. It does not. The meaning comes from the next stage: embeddings.
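
The text-to-ID step can be sketched in a few lines. This is a minimal sketch, assuming a toy tokenizer that splits on whitespace and a three-entry vocabulary that reuses the illustrative IDs above. Real LLM tokenizers use subword schemes and far larger vocabularies. The names tokenize, kToyVocab, and kUnknownId are hypothetical.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy vocabulary: token text -> token ID.
// The IDs reuse the illustrative values from the example above.
const std::unordered_map<std::string, int> kToyVocab = {
    {"I", 40}, {"like", 892}, {"cats", 12075}};

// Assumed ID for any piece that is missing from the toy vocabulary.
constexpr int kUnknownId = 0;

// Splits the text on whitespace and looks each piece up in the vocabulary.
std::vector<int> tokenize(const std::string& text) {
    std::vector<int> ids;
    std::istringstream stream(text);
    std::string piece;
    while (stream >> piece) {
        const auto it = kToyVocab.find(piece);
        ids.push_back(it != kToyVocab.end() ? it->second : kUnknownId);
    }
    return ids;
}
```

Calling tokenize("I like cats") returns the IDs 40, 892, 12075. Nothing in those numbers says anything about cats, which is the point of the paragraph above.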

Embeddings: Turning Token IDs Into Vectors

After tokenization, the model uses an embedding table.

The embedding table is a giant learned lookup table inside the model. Each token ID has a matching vector.

For example:

Token ID     Embedding vector

40           [0.21, -0.55, 0.18, ...]
892          [-0.04, 0.77, 0.31, ...]
12075        [0.88, -0.12, 0.45, ...]

So when the model sees:

[40, 892, 12075]

it looks up each token ID in the embedding table:

40     → vector for "I"
892    → vector for "like"
12075  → vector for "cats"

The tokenized prompt is not inserted into the embedding table. The table already exists after training. The prompt’s token IDs simply select rows from that table.

So the process is:

token ID → lookup in embedding table → embedding vector

The result is a sequence of vectors, one vector per token.

If the prompt has 3 tokens and each vector has 4096 dimensions, then the model now has something shaped like:

3 tokens × 4096 numbers

This vector sequence is what gets passed into the Transformer.
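
The lookup itself is just row selection. Here is a minimal sketch, assuming the table is stored as a vector of rows and that token IDs are small enough to index it directly. The aliases Vector and EmbeddingTable and the function embed are hypothetical names, and a real table has tens of thousands of rows with thousands of numbers each.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<float>;
using EmbeddingTable = std::vector<Vector>;  // row index = token ID

// Selects one row of the table per token ID. The table is only read,
// never modified, which matches how a trained model handles a prompt.
std::vector<Vector> embed(const EmbeddingTable& table,
                          const std::vector<int>& token_ids) {
    std::vector<Vector> sequence;
    sequence.reserve(token_ids.size());
    for (int id : token_ids) {
        assert(id >= 0 && static_cast<std::size_t>(id) < table.size());
        sequence.push_back(table[id]);
    }
    return sequence;
}
```

With a 4-row table of 3 numbers each, embed(table, {1, 2, 3}) returns 3 vectors of 3 numbers, the small-scale version of the 3 × 4096 shape described above.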

How Are Embeddings Learned?

The embedding table is learned during training.

At the beginning of training, the embeddings are almost random. The model does not know that “cat” and “dog” are related. It slowly learns this by seeing how words are used in many examples.

For example, it may see sentences like:

The cat drank milk.
The dog chased the ball.
The kitten slept on the sofa.

As the model tries to predict missing or next tokens, it makes mistakes. During training, the model adjusts its internal numbers to reduce those mistakes. This includes the embedding vectors.

Over time, tokens that appear in similar contexts develop related vector representations.

So words like:

cat, dog, kitten, horse

may end up closer in the model’s learned vector space than unrelated words like:

cat, engine, tax, electricity

The model is not manually told these relationships. It learns them from patterns in data.

Position Information: Why Order Matters

There is one more important detail.

The model also needs to know the order of tokens.

These two sentences use similar words but have different meanings:

Dog bites man.
Man bites dog.

So the model needs position information.

After token embeddings are created, the model adds positional information so it knows where each token appears in the sequence.

A simple way to imagine it is:

final input vector = token embedding + position information

So for:

I like cats

the model receives something like:

"I"    meaning + position 1
"like" meaning + position 2
"cats" meaning + position 3

Now the Transformer receives vectors that contain both token meaning and token order.
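
One way to make "token embedding + position information" concrete is the sinusoidal scheme from the original Transformer design. The sketch below assumes that scheme and counts positions from 0 rather than 1. Many current models use other approaches, such as learned position vectors or rotary embeddings applied inside attention, so treat this as one illustration rather than what any particular LLM does. The function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<float>;

// Sinusoidal position vector: even dimensions use sin, odd dimensions
// use cos, and each pair of dimensions shares one frequency.
Vector position_vector(std::size_t position, std::size_t dim) {
    Vector encoding(dim);
    for (std::size_t i = 0; i < dim; ++i) {
        const double exponent =
            static_cast<double>(2 * (i / 2)) / static_cast<double>(dim);
        const double angle =
            static_cast<double>(position) / std::pow(10000.0, exponent);
        encoding[i] = static_cast<float>(
            i % 2 == 0 ? std::sin(angle) : std::cos(angle));
    }
    return encoding;
}

// final input vector = token embedding + position information
std::vector<Vector> add_positions(std::vector<Vector> embeddings) {
    for (std::size_t pos = 0; pos < embeddings.size(); ++pos) {
        const Vector encoding = position_vector(pos, embeddings[pos].size());
        for (std::size_t i = 0; i < encoding.size(); ++i) {
            embeddings[pos][i] += encoding[i];
        }
    }
    return embeddings;
}
```

Two identical token embeddings come out different after add_positions, because each one receives a different position vector. That difference is what lets the model tell "Dog bites man" from "Man bites dog".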

What is a Transformer?

A Transformer is the main neural network architecture used in most modern LLMs.

It is a type of artificial neural network, but more specifically, it is designed to handle sequences like text, code, conversations, and documents.

The key idea inside a Transformer is attention.

Attention allows the model to decide which tokens are important to each other.

For example:

The bank near the river was flooded.

Here, “bank” probably means riverbank.

But in:

The bank approved my loan.

“bank” means a financial institution.

The same token can have different meanings depending on context. Attention helps the model update the meaning of each token based on the other tokens around it.
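
The mixing step can be sketched as a stripped-down self-attention function. This is a simplified sketch with loud assumptions: it uses the input vectors directly as queries, keys, and values, while a real Transformer layer first applies learned projection matrices, runs many attention heads in parallel, and adds further sub-layers. It also assumes the causal form used for text generation, where a token only looks at itself and earlier tokens. The names dot and self_attention are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<float>;

float dot(const Vector& a, const Vector& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Simplified causal self-attention: each output vector is a weighted
// average of the current token and the tokens before it.
std::vector<Vector> self_attention(const std::vector<Vector>& x) {
    std::vector<Vector> out;
    out.reserve(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(!x[i].empty());
        const float scale = 1.0f / std::sqrt(static_cast<float>(x[i].size()));

        // Score token i against itself and every earlier token.
        std::vector<float> weights(i + 1);
        for (std::size_t j = 0; j <= i; ++j) weights[j] = dot(x[i], x[j]) * scale;

        // Softmax turns the scores into weights that sum to 1.
        const float max_score = *std::max_element(weights.begin(), weights.end());
        float total = 0.0f;
        for (float& w : weights) {
            w = std::exp(w - max_score);
            total += w;
        }

        // Mix the attended vectors according to their weights.
        Vector mixed(x[i].size(), 0.0f);
        for (std::size_t j = 0; j <= i; ++j) {
            for (std::size_t k = 0; k < mixed.size(); ++k) {
                mixed[k] += (weights[j] / total) * x[j][k];
            }
        }
        out.push_back(mixed);
    }
    return out;
}
```

The first token can only attend to itself, so its vector comes out unchanged. Every later vector becomes a blend that leans toward the tokens it is most similar to, which is the basic idea behind "bank" drifting toward "river" or toward "loan".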

What Happens Inside the Transformer?

The Transformer receives the embedded prompt as a sequence of vectors.

For example:

I like cats

becomes:

vector for "I"
vector for "like"
vector for "cats"

After passing through the Transformer layers, each token still has a vector, but now those vectors are context-aware.

The original embedding for “cats” is general. But after the Transformer processes the full sentence, the vector for “cats” becomes specific to that sentence.

So there are two stages of meaning:

base embedding = general token representation
Transformer output = context-aware token representation

This is important because the base embedding for a word like “bank” is the same in both sentences, but after Transformer attention, its representation changes depending on whether the sentence is about a river or a loan.

How Prediction Happens

After the Transformer has processed the prompt, the model needs to predict the next token.

Suppose the prompt is:

The cat sat on the

The Transformer processes all the tokens and produces final context-aware vectors.

For next-token prediction, the model mainly uses the final vector at the last position. That final vector represents the context of the whole prompt up to that point.

Then this vector is passed into a final prediction layer, often called the prediction head or language modeling head.

This prediction layer gives a score for every possible token in the vocabulary.

For example:

mat       8.2
floor     6.7
chair     5.9
moon     -2.4
because  -3.1

These raw scores are called logits.
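
In its simplest form, the prediction head is one weight row per vocabulary token, and each score is the dot product of that row with the final vector. The sketch below assumes that plain linear form with no bias term. The name prediction_head and the tiny sizes in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<float>;

// One weight row per vocabulary token. Each logit is the dot product of
// that row with the final hidden vector. Bias terms are omitted here.
std::vector<float> prediction_head(const std::vector<Vector>& vocab_weights,
                                   const Vector& hidden) {
    std::vector<float> logits;
    logits.reserve(vocab_weights.size());
    for (const Vector& row : vocab_weights) {
        assert(row.size() == hidden.size());
        float score = 0.0f;
        for (std::size_t i = 0; i < hidden.size(); ++i) {
            score += row[i] * hidden[i];
        }
        logits.push_back(score);
    }
    return logits;
}
```

With weight rows {1, 0}, {0, 1}, {1, 1} and a final vector {2, 3}, the logits are 2, 3, and 5: one score per vocabulary entry, whatever the vocabulary size.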

The logits are then converted into probabilities using a function called softmax.

For example, if these five tokens were the entire vocabulary, softmax would give roughly:

mat       75.6%
floor     16.9%
chair      7.6%
moon       0.002%
because    0.001%
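
Softmax itself is short enough to write out. This is a minimal sketch of the standard formula (exponentiate each logit, then divide by the total), with the usual trick of subtracting the largest logit first for numerical safety. Restricted to just the five example scores, it gives about 0.756 for mat, 0.169 for floor, and 0.076 for chair. Over a full vocabulary the values would differ, because every other token also takes a share.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Converts raw scores (logits) into probabilities that sum to 1.
std::vector<double> softmax(const std::vector<double>& logits) {
    assert(!logits.empty());
    // Subtracting the largest logit keeps exp() from overflowing
    // and does not change the resulting probabilities.
    const double max_logit = *std::max_element(logits.begin(), logits.end());

    std::vector<double> probabilities;
    probabilities.reserve(logits.size());
    double total = 0.0;
    for (double logit : logits) {
        probabilities.push_back(std::exp(logit - max_logit));
        total += probabilities.back();
    }
    for (double& p : probabilities) p /= total;
    return probabilities;
}
```

Because of the exponential, a gap of a few points between logits turns into a very large gap between probabilities, which is why moon and because end up close to zero.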

Then the model selects one token, either by taking the most likely one or by sampling from these probabilities.

If it selects:

mat

the sentence becomes:

The cat sat on the mat

Then the process repeats.

The new token is added to the context, and the model predicts the next token again.

This continues token by token until the response is complete.
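
The repeat-until-done loop can be sketched as follows. This sketch makes several assumptions: the whole model is hidden behind a stand-in callback (LogitFn) that returns one score per vocabulary entry, the next token is always the highest-scoring one (greedy selection), and generation stops at an assumed end token or after a fixed number of steps. All names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for the whole model: maps the tokens seen so far to one
// score (logit) per vocabulary entry.
using LogitFn = std::function<std::vector<double>(const std::vector<int>&)>;

// Generates tokens one at a time, feeding each new token back in.
std::vector<int> generate(std::vector<int> context,
                          const LogitFn& next_logits,
                          int end_token,
                          std::size_t max_new_tokens) {
    for (std::size_t step = 0; step < max_new_tokens; ++step) {
        const std::vector<double> logits = next_logits(context);
        assert(!logits.empty());
        // Greedy choice: the index of the highest score is the next token ID.
        const int next = static_cast<int>(
            std::max_element(logits.begin(), logits.end()) - logits.begin());
        context.push_back(next);  // the new token becomes part of the context
        if (next == end_token) break;
    }
    return context;
}
```

The important detail is that context grows on every step, so each prediction sees everything generated so far, including the model's own earlier output.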

Is Prediction Inside the Transformer?

The clean way to understand it is:

Transformer body → prediction head

The Transformer body processes the prompt and creates context-aware vectors.

The prediction head converts the final vector into scores for all possible next tokens.

So prediction is part of the full LLM, but it is useful to separate the Transformer processing from the final prediction layer.

The full flow is:

Prompt
→ tokenization
→ token IDs
→ embedding lookup
→ add position information
→ Transformer layers with attention
→ final context-aware vector
→ prediction head
→ logits
→ probabilities
→ next token
→ text

Why ChatGPT Does Not Simply Copy From the Internet

When ChatGPT responds, it is usually not copying and pasting from a website.

Instead, the model has learned language patterns during training. When given a prompt, it generates a new sequence of tokens based on probabilities.

It predicts one token, then the next token, then the next, using the context it has so far.

This is why two answers to the same question can be slightly different. The model is generating, not retrieving a fixed paragraph.
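
The variation comes from sampling. Here is a minimal sketch, assuming the next token is drawn at random in proportion to its probability using the C++ standard library's std::discrete_distribution. Real systems usually add settings such as temperature on top, which are left out here. The name sample_token is hypothetical.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Draws a token index at random, weighted by its probability.
int sample_token(const std::vector<double>& probabilities, std::mt19937& rng) {
    assert(!probabilities.empty());
    std::discrete_distribution<int> pick(probabilities.begin(),
                                         probabilities.end());
    return pick(rng);
}
```

With probabilities such as {0.9, 0.1}, most draws return index 0 but some return index 1, so two runs over the same prompt can diverge from the first differing token onward.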

However, this also means it can sometimes make mistakes. It may generate something that sounds correct but is not actually true. That is why facts, dates, prices, laws, and current information should be checked carefully.

Simple Analogy

Imagine the model as a very advanced reader and predictor.

First, it converts your sentence into pieces:

I like cats → [40, 892, 12075]

Then it looks up learned meaning vectors for each piece:

40 → vector
892 → vector
12075 → vector

Then the Transformer reads the whole context and updates the meaning of each token.

Finally, the prediction head asks:

What token is most likely to come next?

It scores all possible tokens and chooses one.

Then it repeats.

Final Mental Model

The most accurate simple version is:

An LLM does not directly understand raw text. First, the prompt is tokenized into token IDs. Each token ID is used to look up a learned embedding vector from the model’s embedding table. Position information is added so the model knows token order. These vectors are passed through Transformer layers, where attention mixes information between tokens and creates context-aware representations. A prediction head then scores every possible next token, chooses one, converts it back into text, and repeats the process until the response is complete.

In one line:

Text becomes tokens, tokens become vectors, Transformers process the vectors, and the model predicts the next token repeatedly to generate text.

#chaicode

Sorting Without Comparisons? Index Placement Sort (IPS): A Simple Yet Powerful Sorting Trick I Developed

Klinsmann R — Sun, 30 Mar 2025 01:36:05 +0000

Sorting algorithms are the backbone of efficient data processing. While traditional sorting methods rely on comparisons (like QuickSort or MergeSort), I recently stumbled upon a different approach: one that places elements directly in their correct position without comparing elements to each other.

This inspired me to develop what I call Index Placement Sort (IPS), a sorting technique that is blazing fast for certain types of data. As the inventor of this method, I believe it offers a unique perspective on how sorting can be optimized for specific use cases.

How Index Placement Sort (IPS) Works
The idea is simple:

Create a vector large enough to hold the maximum value, initialized with -1 (a sentinel is safer than zero, because 0 can be a valid element).
Use each element as an index and place it directly in the vector.
Iterate over the vector and print only the slots that are not -1.

IPS in Action

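#include <iostream>
#include <vector>

int main() {
    const int MAX_VALUE = 20000; // Largest value the table can hold
    std::vector<int> arr = {5, 8, 2, 1, 3, 6}; // Distinct, non-negative integers
    std::vector<int> sorted_arr(MAX_VALUE + 1, -1); // Use -1 to mark empty slots

    for (int value : arr) {
        // Skip values that cannot be used as an index into the table
        if (value >= 0 && value <= MAX_VALUE)
            sorted_arr[value] = value;
    }

    // Slots are visited in index order, so values print in ascending order
    for (int x : sorted_arr) {
        if (x != -1)
            std::cout << x << " ";
    }
    std::cout << "\n";

    return 0;
}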
Output:
1 2 3 5 6 8

Why is IPS Special?
Unlike traditional sorting algorithms, IPS has unique properties:

✅ Time Complexity: O(n + k) (where n = number of elements, k = size of the value range, i.e. max value + 1)

✅ No Element Comparisons: No if(arr[i] > arr[j]) swap(arr[i], arr[j]) like in normal sorting; the only checks are against the empty-slot marker

✅ Super Fast for Small Ranges: Perfect when numbers are within a known range (e.g., 0–20,000)

✅ Ideal for Unique Integer Datasets: Works best when elements are distinct and non-negative

How IPS Compares to Other Sorting Algorithms
IPS has advantages over several traditional sorting algorithms:

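Can be faster than QuickSort (O(n log n) on average) when the value range k is small compared to n log n.
Can be more efficient than MergeSort (O(n log n)) for datasets with a known, small maximum value.
Similar to Counting Sort, but it stores the value itself in each slot instead of a count, which keeps the implementation simple.
Outperforms Bubble Sort, Selection Sort, and Insertion Sort (all O(n²) in the worst case) on large inputs with a bounded range.

However, Radix Sort or a comparison sort might be a better alternative when the value range is so large that IPS would consume too much space. Counting Sort needs a table of the same size, but it is the better choice when the data contains duplicates.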

Limitations of IPS
❌ Cannot Handle Negatives (Without Modification): Since it uses numbers as indices, a negative number would index out of bounds, so the code above skips negatives and they are missing from the output

❌ Wastes Space for Large Ranges: If the largest number is 1,000,000, we need an array of size 1,000,001

❌ Duplicates Get Overwritten: If arr = {5,5,5}, only one 5 remains in sorted_arr

When to Use IPS?
IPS is great for:

✅ Sorting IDs, Ranks, Unique Scores, Ages (when within a fixed range)

✅ Pre-sorted data storage (e.g., maintaining a sorted structure efficiently)

✅ Competitive Programming where a super-fast O(n + k) sort is needed for constrained values

Next Steps
IPS is a cool technique that can be enhanced:

We can modify it to handle negatives by offsetting indices.
We can support duplicates by storing a count per slot (or an array of lists when sorting records by key).
We can improve space efficiency using Radix Sort concepts.
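
The first two enhancements can be sketched together. This is a minimal sketch, assuming the offset is the smallest value in the input and that each slot stores a count instead of the value. With both changes the result is the classic Counting Sort, so it is shown here as an illustration of where IPS leads rather than as a new variant. The function name index_placement_sort is hypothetical, and the scan for the minimum and maximum does compare elements, which a caller with known bounds could avoid.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Index placement with an offset (for negatives) and a per-slot count
// (for duplicates). Time and space are O(n + k), where k is the range.
std::vector<int> index_placement_sort(const std::vector<int>& input) {
    if (input.empty()) return {};

    // Find the value range so the smallest value maps to slot 0.
    const auto bounds = std::minmax_element(input.begin(), input.end());
    const long long min_value = *bounds.first;
    const long long max_value = *bounds.second;
    const std::size_t range =
        static_cast<std::size_t>(max_value - min_value + 1);

    // Count how many times each value occurs.
    std::vector<std::size_t> counts(range, 0);
    for (int value : input) {
        ++counts[static_cast<std::size_t>(value - min_value)];
    }

    // Walk the slots in order and emit each value as often as it occurred.
    std::vector<int> sorted;
    sorted.reserve(input.size());
    for (std::size_t slot = 0; slot < range; ++slot) {
        const int value =
            static_cast<int>(min_value + static_cast<long long>(slot));
        sorted.insert(sorted.end(), counts[slot], value);
    }
    return sorted;
}
```

For the input {5, -2, 5, 0, -2, 3} this returns {-2, -2, 0, 3, 5, 5}. The space cost still grows with the range, so the large-range limitation above applies unchanged.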

If you found this useful, drop a like, share your thoughts, or suggest improvements!

DEV Community: Klinsmann R
