How LLMs Work

Suny Dutta — Wed, 01 Jul 2026 14:43:08 +0000

1. What is an LLM?

What Does LLM Stand For?

LLM stands for Large Language Model.

Let's unpack that:

Large → trained on enormous amounts of text data (think billions of web pages, books, articles)
Language → it works with human language — English, Hindi, code, you name it
Model → it's a mathematical system that has "learned" patterns from all that text

Think of an LLM as a very well-read assistant. It hasn't experienced the world, but it has read an unimaginable amount about it.

What Problems Do LLMs Solve?

Before LLMs, getting computers to understand human language was incredibly hard. Computers are great at structured commands (print("Hello")), but terrible at vague, context-heavy human requests like "Can you explain this in simpler terms?"

LLMs solve this by:

Understanding natural, conversational language
Generating human-like text responses
Summarizing, translating, and explaining content
Writing code, emails, essays, and more

Popular Examples of LLMs

LLM	Made By
GPT-4 / ChatGPT	OpenAI
Claude	Anthropic
Gemini	Google DeepMind
LLaMA	Meta
Mistral	Mistral AI

Common Applications in Daily Life

You're probably already using LLMs without realizing it:

Chatbots — Customer support bots that actually understand your problem
Writing assistants — Grammarly, Notion AI, Google Docs Smart Compose
Code helpers — GitHub Copilot suggests code as you type
Search engines — Google's AI Overviews summarize results for you
Email — Gmail's "Help me write" feature
Translation — DeepL and Google Translate

2. What Happens When You Send a Message to ChatGPT?

Ever typed a question into ChatGPT and wondered what happens in those few seconds before the response appears? Here's a simple walkthrough.

Step 1: You Type a Prompt

You write something like: "Explain black holes in simple terms."

That's your prompt — the input you're giving to the model.

Step 2: Your Message Gets Processed

The model doesn't read your text as-is. Before the model sees it, your message gets:

Broken into pieces (called tokens — more on this soon)
Converted into numbers (because computers only speak numbers)
Fed into the model along with any previous conversation context

Step 3: The Model Generates a Response

The LLM doesn't "look up" an answer. It predicts a likely next token (roughly, the next word or word piece), then the next, then the next, until it has built a complete response.

It's less like a search engine and more like a very sophisticated autocomplete.
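
The "sophisticated autocomplete" idea can be sketched with a toy model that only counts which word tends to follow which. This is a deliberately simplified stand-in, not how a real LLM works internally: the corpus is made up, and the names `build_bigram_counts` and `generate` are hypothetical. The loop, though, has the same shape as the one described above: pick a next word, append it, repeat.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real LLM learns from vastly more text.
CORPUS = (
    "black holes are very dense . "
    "black holes are regions of space . "
    "stars are very dense ."
)

def build_bigram_counts(text):
    """Count how often each word follows each other word."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, start, max_words=6):
    """Repeatedly append the most frequent next word (greedy decoding)."""
    output = [start]
    for _ in range(max_words):
        candidates = counts.get(output[-1])
        if not candidates:
            break  # nothing ever followed this word in the corpus
        next_word = candidates.most_common(1)[0][0]
        output.append(next_word)
        if next_word == ".":
            break  # treat the full stop as the end of the response
    return " ".join(output)

bigram_counts = build_bigram_counts(CORPUS)
print(generate(bigram_counts, "black"))  # black holes are very dense .
```

A real model replaces the word counts with a neural network that scores every token in its vocabulary, but the generate-one-piece-at-a-time loop is the same.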

Step 4: You See the Response

The numbers get converted back into human-readable text and streamed to your screen — often word by word, which is why responses appear gradually.

Why Responses Are NOT Copied From the Internet

This is a common misconception. ChatGPT doesn't Google things in real time (unless it has a browsing tool enabled). Instead, it generates responses from patterns it learned during training. It's more like a person who read a lot and is now answering from memory — not someone who's Googling the answer live.

This is also why LLMs can sometimes be confidently wrong — a phenomenon called hallucination.

Diagram 1: The high-level flow from your prompt to the model's response

3. Why Computers Don't Understand Human Language

Text vs Numbers

Here's the fundamental problem: computers only understand numbers.

Everything inside a computer — images, videos, music, and text — is stored as numbers (specifically, binary: 0s and 1s).

When you look at the letter "A", you see a character. Your computer sees 65 (its ASCII code). That works fine for storage, but it doesn't capture meaning.

The word "bank" could mean a riverbank or a financial institution. Storing it as a single number loses that nuance entirely.
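
A short Python sketch makes the point concrete. The two sentences are invented for illustration, and `codes_for_last_occurrence` is a hypothetical helper; the only built-in used is `ord`, which returns a character's code point.

```python
# Character codes capture spelling, not meaning.
print(ord("A"))  # 65

codes = [ord(ch) for ch in "bank"]
print(codes)  # [98, 97, 110, 107]

river_sentence = "She sat on the river bank"
money_sentence = "She went to the bank to deposit cash"

def codes_for_last_occurrence(sentence, word):
    """Return the character codes of `word` as it appears in `sentence`."""
    start = sentence.rindex(word)
    return [ord(ch) for ch in sentence[start:start + len(word)]]

# Both senses of "bank" produce exactly the same numbers.
same = (codes_for_last_occurrence(river_sentence, "bank")
        == codes_for_last_occurrence(money_sentence, "bank"))
print(same)  # True
```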

Why Computers Need Everything Converted to Numbers

For an AI to understand language, it needs numbers that capture meaning and context, not just character codes.

This is where a concept called embeddings comes in — words and phrases get converted into long lists of numbers (called vectors) that place similar concepts close together in mathematical space.

For example:

"king" and "queen" end up as nearby vectors
"apple" (fruit) and "apple" (company) get different representations depending on context

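To see what "close together in mathematical space" might mean, here is a minimal sketch using cosine similarity, one common way to compare vectors. The three-number vectors are hand-picked for illustration only; real embeddings are learned during training and typically have hundreds or thousands of dimensions.

```python
import math

# Hand-picked 3-dimensional vectors, purely illustrative (not real embeddings).
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.88, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Return 1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])
print(king_queen)  # close to 1: the vectors point in nearly the same direction
print(king_apple)  # much smaller: the vectors point in different directions
```
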
Introduction to Tokens

But before any of that meaning-capture can happen, the text needs to be split into manageable pieces first.

Those pieces are called tokens.

4. Tokenization

What Are Tokens?

Tokens are the small chunks that text gets split into before being fed to an LLM.

A token is roughly ¾ of an English word on average, but it can be:

A whole word: "cat"
Part of a word: "un", "believ", "able"
A punctuation mark: "."
A space or special character

Why Is Tokenization Needed?

LLMs can't process raw text. They need text broken into consistent, manageable units with known numerical IDs. Tokenization is the bridge between human text and the numerical world the model operates in.

Words vs Tokens — Simple Examples

Text	Tokens	Token Count
`"Hello"`	`["Hello"]`	1
`"ChatGPT"`	`["Chat", "G", "PT"]`	3
`"unbelievable"`	`["un", "believ", "able"]`	3
`"I love AI"`	`["I", " love", " AI"]`	3

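Notice that "ChatGPT" becomes three tokens in this example: the tokenizer splits unfamiliar or compound words into known subpieces. The exact split depends on the tokenizer, so different models may divide the same word differently.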
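
The splits in the table can be reproduced with a toy tokenizer. This sketch uses greedy longest-match against a small hypothetical vocabulary with made-up IDs; production tokenizers typically use learned schemes such as byte-pair encoding, so treat this as an illustration of the idea rather than any model's actual method.

```python
# Hypothetical vocabulary mapping subword pieces to made-up numerical IDs.
VOCAB = {
    "Hello": 0, "Chat": 1, "G": 2, "PT": 3, "un": 4,
    "believ": 5, "able": 6, "I": 7, " love": 8, " AI": 9,
}

def tokenize(text, vocab):
    """Greedy longest-match: at each position take the longest known piece."""
    tokens = []
    position = 0
    while position < len(text):
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in vocab:
                tokens.append(piece)
                position = end
                break
        else:
            # No piece in the vocabulary starts at this position.
            raise ValueError(f"no vocabulary entry matches at position {position}")
    return tokens

for sample in ["Hello", "ChatGPT", "unbelievable", "I love AI"]:
    pieces = tokenize(sample, VOCAB)
    ids = [VOCAB[piece] for piece in pieces]
    print(repr(sample), pieces, ids)
```

The list of IDs on the right is what actually gets fed to the model; the text pieces are only for human eyes.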

Why This Matters

LLMs have a context window — a maximum number of tokens they can process at once (like a short-term memory limit)
GPT-4 Turbo can handle about 128,000 tokens; earlier models such as the original GPT-3.5 were limited to roughly 4,000
Knowing this helps you understand why very long conversations sometimes make the model "forget" earlier parts
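
One simple way an application might keep a conversation inside the context window is to drop the oldest messages first. The sketch below assumes that strategy; real chat products may summarize or trim differently. `count_tokens` is a crude hypothetical stand-in that counts words instead of real tokens.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def fit_to_context(messages, max_tokens):
    """Keep the most recent messages whose combined size fits in max_tokens."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = [
    "My favorite color is green",       # 5 words
    "I am learning about black holes",  # 6 words
    "Can you explain event horizons",   # 5 words
]
trimmed = fit_to_context(history, max_tokens=12)
print(trimmed)
```

With a budget of 12, only the two newest messages fit, so the model would no longer "remember" the favorite color.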

Diagram 2: How your text travels through the tokenization pipeline

Diagram 3: The context window is the model's working memory — once full, earlier content gets dropped

5. Transformers

What Is a Transformer?

A Transformer is the neural network architecture that powers almost every modern LLM.

It was introduced in a landmark 2017 research paper titled "Attention Is All You Need" by researchers at Google. That paper changed AI forever.

Before Transformers, AI models struggled with long sequences of text — they'd "forget" earlier parts of a sentence by the time they reached the end.

Why It Changed AI

Transformers introduced a mechanism called self-attention, which lets the model look at all words in a sequence simultaneously and weigh which ones matter most for understanding each word.

Example: In the sentence "The animal didn't cross the street because it was too tired" — what does "it" refer to? The animal, not the street.

A Transformer figures this out by attending to all the other words at once and learning that "it" is more connected to "animal" than to "street".
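
The core arithmetic of self-attention can be sketched in a few lines of plain Python. This is a simplified version under stated assumptions: the two-number vectors are made up so that "it" resembles "animal", and queries, keys, and values are all the raw input vectors, whereas a real Transformer first passes them through learned projection matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = inputs."""
    dim = len(vectors[0])
    all_weights = []
    outputs = []
    for query in vectors:
        # Score this token against every token, scaled by sqrt(dimension).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted mix of all token vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        all_weights.append(weights)
        outputs.append(mixed)
    return all_weights, outputs

# Made-up vectors, chosen so that "it" points the same way as "animal".
tokens = ["animal", "street", "it"]
vectors = [[2.0, 0.1], [0.1, 2.0], [1.8, 0.3]]

weights, outputs = self_attention(vectors)
it_weights = dict(zip(tokens, weights[2]))
print(it_weights)
```

In the printed weights for "it", "animal" receives far more attention than "street", which is the behavior the example sentence describes.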

How It Helps Understand Language

The Transformer's key innovations:

Parallel processing — processes all tokens simultaneously (much faster than sequential models)
Self-attention — understands relationships between words regardless of distance
Scalability — works incredibly well as you throw more data and compute at it

Why Almost Every Modern LLM Uses Transformers

It's simple: nothing else has come close in performance.

GPT (the "T" literally stands for Transformer), Claude, Gemini, LLaMA, Mistral: they all use the Transformer architecture at their core. Some use variations or improvements, but the fundamental design remains the same.

At each step, the model produces a probability for every possible next token rather than a single answer. A setting called temperature rescales the model's raw scores before the softmax function turns them into probabilities. A low temperature makes the output focused and predictable, while a high temperature makes it more varied and surprising.
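
Here is a minimal sketch of how temperature could reshape next-token probabilities. It assumes the common formulation in which the raw scores (logits) are divided by the temperature before softmax is applied; the candidate tokens and their logits are invented for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate next tokens.
candidates = ["star", "planet", "banana"]
logits = [2.0, 1.0, 0.1]

for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sampling picks a token at random in proportion to its probability.
rng = random.Random(0)  # fixed seed so the run is repeatable
probs = softmax_with_temperature(logits, 1.0)
chosen = rng.choices(candidates, weights=probs, k=1)[0]
print(chosen)
```

At low temperature nearly all of the probability lands on the top-scoring token; at high temperature the three options move closer together, so unlikely tokens get picked more often.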

Diagram 4: Low Temperature vs High Temperature output comparison

Putting It All Together: The Complete Picture

Here's the full journey of your message from input to output:

Every time you chat with an AI, this entire pipeline runs in seconds.

Diagram 5: Complete high-level LLM workflow

DEV Community: Suny Dutta