DEV Community

Cover image for How ChatGPT Understands Your Questions?
Akash Kumar
Akash Kumar

Posted on

How ChatGPT Understands Your Questions?

You type a question into ChatGPT. Half a second later, a surprisingly relevant, well-formed answer starts streaming back.

It feels like magic. It isn't. It's math — a lot of math, wrapped in a system that's been engineered to feel conversational.

In this post, we'll pull back the curtain and walk through exactly what happens between the moment you hit "Enter" and the moment a response appears — from raw text, to numbers, to tokens, to Transformers.

You: "What's the capital of France?"
                 │
                 ▼
        [ ...this article... ]
                 │
                 ▼
ChatGPT: "The capital of France is Paris."
Enter fullscreen mode Exit fullscreen mode

Let's open the black box.


1. What is an LLM?

LLM stands for Large Language Model.

Strip away the buzzwords, and an LLM is a program trained on enormous amounts of text (books, articles, code, websites) to do one core thing really well: predict the next word in a sequence, given everything that came before it.

That sounds almost too simple to power something like ChatGPT — but predicting "the next most likely word" over and over, billions of times during training, is enough for a model to learn grammar, facts, reasoning patterns, coding syntax, and conversational tone.

What problems do LLMs solve?

Before LLMs, computers were great at structured tasks (math, sorting, database lookups) but terrible at anything involving human language — understanding context, tone, ambiguity, or intent. LLMs closed that gap. They let computers:

  • Understand loosely-worded, ambiguous human input
  • Generate fluent, context-aware text
  • Summarize, translate, and explain complex information
  • Hold a conversation instead of just returning search results

    Popular examples of LLMs

  • GPT-4 / GPT-4o / GPT-5 (OpenAI) — powers ChatGPT

  • Claude (Anthropic) — the model answering this very prompt

  • Gemini (Google DeepMind)

  • LLaMA (Meta) — open-weight models

  • Mistral, DeepSeek, and other open-source families

    Common applications in daily life

  • Chatbots and customer support assistants

  • Code autocomplete (GitHub Copilot, Claude Code)

  • Writing assistants and grammar tools

  • Search engines with AI-generated summaries

  • Voice assistants that sound less robotic

  • Translation apps

        ┌─────────────────────────────┐
        │           LLM USES          │
        ├─────────────────────────────┤
        │  Chatbots      │  Coding     │
        │  Translation   │  Summaries  │
        │  Search        │  Writing    │
        └─────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

2. What Happens When You Send a Message to ChatGPT?

Let's trace the journey of a single message.

Step 1 — Typing a prompt

You type something like:

"Explain how tokenization works, like I'm 12."
Enter fullscreen mode Exit fullscreen mode

This is your prompt — plain human text, full of the messiness humans use: casual phrasing, typos, incomplete sentences.

Step 2 — Processing your message

Before the model can "think" about anything, your text has to be transformed into a format it can actually work with — a sequence of numbers. (We'll go deep on this in Sections 3 and 4.)

The system also adds invisible context around your message: system instructions, earlier turns in the conversation, and formatting markers — all packaged together and fed into the model at once.

Step 3 — Generating a response

The model doesn't write the whole answer in one shot. It predicts one token at a time — a token being a small chunk of text, often a word or part of a word — feeding each new token back in as input for predicting the next one.

Prompt tokens  →  [ MODEL ]  →  "The"
Prompt + "The" →  [ MODEL ]  →  "capital"
Prompt + ...   →  [ MODEL ]  →  "of"
Prompt + ...   →  [ MODEL ]  →  "France"
Prompt + ...   →  [ MODEL ]  →  "is"
Prompt + ...   →  [ MODEL ]  →  "Paris."
Enter fullscreen mode Exit fullscreen mode

This is why responses "stream in" word by word — that's not a UI trick, that's literally the order in which the model generates them.

Step 4 — Why responses aren't copied from the internet

A common misconception: people assume ChatGPT is "searching the web" or "quoting a database" for its answers. It isn't (unless a browsing tool is explicitly turned on).

Instead, during training, the model adjusted billions of internal parameters based on patterns across huge amounts of text. By the time you're chatting with it, none of the original training text is being looked up — the model is generating new text based on learned patterns, one probable next-token at a time. That's also why it can occasionally get things confidently wrong: it's producing what's statistically likely, not retrieving a verified fact from a source.


3. Why Computers Don't Understand Human Language

Here's the uncomfortable truth: computers don't understand words at all.

Text vs numbers

At the hardware level, a computer only understands one thing: numbers, encoded as electrical signals — on or off, 1 or 0. Every piece of text, image, or sound you've ever seen on a screen is, underneath, a pile of numbers.

"Hi" → H = 72, i = 105  (in plain ASCII)
Enter fullscreen mode Exit fullscreen mode

That's fine for storing and displaying text. But it's nowhere near enough for a computer to grasp meaning, grammar, or intent.

Why computers need everything converted into numbers

Machine learning models are, at their core, mathematical functions — matrix multiplications, additions, and non-linear transformations. You can't multiply a matrix by the word "dog." You can multiply it by a list of numbers that represents "dog."

So before any language model can process your sentence, that sentence has to be converted into numerical form the math can operate on.

Introduction to tokens

This is where tokens come in — the bridge between human-readable text and machine-readable numbers. Instead of chopping text into individual characters (too granular) or whole words (too limiting for a language with typos, slang, and multiple languages), models break text into tokens: small, reusable chunks of text.

That's the perfect segue into the next section.


4. Tokenization

What tokens are

A token is a chunk of text — sometimes a whole word, sometimes part of a word, sometimes a single character or punctuation mark. Tokenization is the process of breaking your input into these chunks before feeding it to the model.

Why tokenization is needed

Tokenization solves a few problems at once:

  • It gives the model a manageable, fixed-size vocabulary (tens of thousands of tokens) instead of infinite possible words.
  • It lets the model handle words it's never seen before, by breaking them into familiar sub-pieces.
  • It keeps common words efficient (one token) while rare or complex words get split into multiple tokens. ### Words vs tokens

A common beginner assumption is "1 word = 1 token." That's often wrong.

Text Token Breakdown Token Count
"cat" cat 1
"cats" cat, s 2
"tokenization" token, ization 2
"unbelievable" un, believ, able 3
"ChatGPT" Chat, G, PT 3

As a rough rule of thumb, in English, 1 token ≈ 4 characters, or about ¾ of a word.

Simple example — full sentence

Text:    "Tokenization is cool!"

Tokens:  [ "Token" | "ization" | " is" | " cool" | "!" ]

Numbers: [ 15496   | 1634     | 318   | 3608   | 0 ]
                       (illustrative IDs)
Enter fullscreen mode Exit fullscreen mode

Each token gets mapped to a unique number (a token ID) from the model's vocabulary. That list of numbers is what actually gets fed into the model — not the words themselves.

   TEXT
    │
    ▼
 TOKENS   ["Token", "ization", " is", " cool", "!"]
    │
    ▼
 TOKEN IDs   [15496, 1634, 318, 3608, 0]
    │
    ▼
  MODEL INPUT
Enter fullscreen mode Exit fullscreen mode

5. Transformers

What a Transformer is

The Transformer is a neural network architecture introduced by Google researchers in 2017, in a paper with the memorable title "Attention Is All You Need." It's the architectural backbone behind GPT ("Generative Pre-trained Transformer"), Claude, Gemini, and virtually every major LLM today.

At its core, a Transformer takes a sequence of token numbers and passes them through many stacked layers that repeatedly ask: "Given every other token in this sequence, how relevant is each one to understanding this particular token?"

Why it changed AI

Before Transformers, models processed text mostly one word at a time, in order (like reading left to right, one step at a time, forgetting earlier context along the way). This made it hard to handle long sentences or connect ideas that were far apart in the text.

Transformers introduced a mechanism called self-attention, which lets the model look at all tokens in a sequence simultaneously and weigh how much each one matters to every other one — regardless of distance. This was faster to train (highly parallelizable on GPUs) and dramatically better at capturing long-range context.

How it helps understand language

Consider this sentence:

"The trophy didn't fit in the suitcase because it was too big."
Enter fullscreen mode Exit fullscreen mode

What does "it" refer to — the trophy or the suitcase? Humans resolve this instantly using context. Self-attention lets the model do something similar: when processing the token "it," it can assign high attention weight to "trophy" (and lower weight to "suitcase"), effectively learning which words matter most for interpreting which other words.

        ATTENTION FOR "it"

  The   trophy   didn't  fit  suitcase  because  it  was  too  big
   .      ███       .     .      ▓        .      ●   .    .    .

  ███ = strong attention   ▓ = some attention   ● = current token
Enter fullscreen mode Exit fullscreen mode

Stack dozens of these attention layers on top of each other, across billions of parameters, and the model builds up a rich internal representation of grammar, meaning, and even reasoning patterns.

Why almost every modern LLM uses Transformers

Transformers scale remarkably well — feed them more data and more compute, and they reliably keep getting better, a trend often called the "scaling laws" of deep learning. Combined with their ability to train in parallel (much faster than older sequential architectures) and their strength at capturing context over long passages, Transformers became the default choice for essentially every major LLM built since 2018 — GPT, Claude, Gemini, LLaMA, and beyond.


Putting It All Together — The Complete Workflow

┌──────────┐   ┌────────────┐   ┌─────────────┐   ┌──────────────┐   ┌──────────┐
│  YOUR    │──▶│ TOKENIZE   │──▶│  CONVERT TO │──▶│  TRANSFORMER │──▶│ RESPONSE │
│  PROMPT  │   │ (text →    │   │  NUMBERS    │   │  (predicts   │   │ (decoded │
│  "Hi!"   │   │  tokens)   │   │  (token IDs)│   │  next token, │   │  back to │
│          │   │            │   │             │   │  repeatedly) │   │  text)   │
└──────────┘   └────────────┘   └─────────────┘   └──────────────┘   └──────────┘
Enter fullscreen mode Exit fullscreen mode

Bonus concept: the context window

Every LLM has a context window — the maximum number of tokens (prompt + conversation history + response) it can "see" at once. Think of it as the model's short-term memory span for a single conversation.

┌──────────────────────── CONTEXT WINDOW (e.g. 128K tokens) ────────────────────────┐
│  [ System instructions ] [ Earlier chat turns ] [ Your prompt ] [ Response so far ]│
└──────────────────────────────────────────────────────────────────────────────────┘
        ▲                                                                    ▲
   older tokens                                                     model generates
   may get dropped                                                  new tokens here
   if window is full
Enter fullscreen mode Exit fullscreen mode

If a conversation grows longer than the context window, the oldest parts eventually fall outside the model's view — which is why very long chats can sometimes cause a model to "forget" something mentioned much earlier.

Bonus concept: temperature

Temperature is a setting that controls how "predictable" vs "creative" the model's next-token choices are.

LOW TEMPERATURE (e.g. 0.2)          HIGH TEMPERATURE (e.g. 1.2)
────────────────────────            ────────────────────────
Prompt: "The sky is"                Prompt: "The sky is"

  blue      ████████████ 92%          blue      ███████ 40%
  clear     ██ 5%                     falling    █████ 28%
  grey      █ 2%                      infinite   ████ 20%
  (other)   ▏ 1%                      (other)    ███ 12%

→ Picks "blue" almost every time    → More willing to pick unusual,
  (focused, deterministic,            surprising words (creative,
  repetitive on retries)              varied, less predictable)
Enter fullscreen mode Exit fullscreen mode

Lower temperature is useful for factual, consistent answers (like code or math). Higher temperature is useful for creative writing, brainstorming, or varied phrasing.


Wrapping Up

ChatGPT doesn't "understand" your question the way a human does. It doesn't have beliefs, memories of you between separate conversations, or a database of facts it looks up. What it has is:

  1. Your text, broken into tokens
  2. Those tokens converted into numbers
  3. A Transformer architecture that uses self-attention to figure out which parts of your input matter most to each other
  4. A generation loop that predicts the most probable next token, one at a time, until it forms a complete response It's pattern-matching at a staggering scale — trained on enough human-written text that the patterns it learned are genuinely useful for reasoning, writing, and conversation. Not magic. Just very, very good math.

If you found this useful, the next post in this series will dig into **embeddings* — how tokens actually get turned into the vectors that carry meaning inside a Transformer. Stay tuned.*

Top comments (0)