DEV Community

Shravan Chaudhari


What's Actually Happening When You Talk to an AI?

You type a question into ChatGPT, hit enter, and in a few seconds, a perfectly coherent, often brilliant answer shows up.

Feels like magic, right?

It's not magic. It's math, data, and one genuinely revolutionary idea from a 2017 research paper. Let's break it down.

So, What Exactly is GPT?

GPT stands for Generative Pre-trained Transformer.

Let's split that:

  • Generative — it generates new content (text, in this case), not copy-paste.
  • Pre-trained — it has already been trained on a massive amount of data before you ever talk to it.
  • Transformer — the architecture (the "engine") that makes it work. We'll get to this later, it's the real hero of this story.

In simple terms: GPT is a family of Large Language Models (LLMs) built by OpenAI, each trained on a huge amount of text.

And What is ChatGPT, Then?

If GPT is the brain, ChatGPT is the face.

ChatGPT is the application, the chat interface, that lets regular humans like you and me actually talk to the GPT model. The model doesn't have a mouth or a chat window on its own. ChatGPT is the product built around it so you can type a message and get a reply, instead of writing raw code to talk to it.

Think of it like this: GPT is the engine, ChatGPT is the car.

Wait, What is an LLM Though?

LLM = Large Language Model.

An LLM is a system that has read an enormous amount of text (basically a huge chunk of the internet, books, articles, code) and learned the patterns of how language works: which word tends to follow which, how ideas connect, how questions get answered.

It doesn't "know" facts the way a database does. It has learned patterns of language so well that it can predict, with impressive accuracy, what word should come next in a sentence.

Here's an example to make it click:

"The sky is ___"

Even without AI, your brain instantly says "blue." Why? Because you've seen that pattern a thousand times. An LLM does something very similar, except it has seen billions of sentences, and it makes this prediction one word at a time (technically, one token at a time; more on that soon), again and again, to build a full response.
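To make the "sky is ___" idea concrete, here is a minimal sketch of next-word prediction by simple counting. The four-sentence corpus and the `predict_next` helper are invented for illustration; a real LLM learns far richer patterns with a neural network, not a lookup table.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real LLM trains on billions of sentences.
corpus = [
    "the sky is blue",
    "the sky is blue today",
    "the sky is cloudy",
    "the grass is green",
]

# Count which word follows each two-word context.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 2):
        context = (words[i], words[i + 1])
        next_word_counts[context][words[i + 2]] += 1

def predict_next(context_words):
    """Return the most frequent next word and its estimated probability."""
    counts = next_word_counts[tuple(context_words)]
    word, count = counts.most_common(1)[0]
    return word, count / sum(counts.values())

# "blue" follows "sky is" in 2 of the 3 matching sentences.
print(predict_next(["sky", "is"]))
```

Even this crude counter picks "blue" because that is the pattern it has seen most often, which is the same intuition at a vastly smaller scale.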

A funnel diagram showing massive text data (books, articles, internet, code) pouring into a box.

The OpenAI Story: What Were They Actually Trying to Solve?

OpenAI was founded back in 2015 with one core mission: make sure AI benefits humanity as a whole, rather than becoming uncontrollable or ending up restricted to a few large corporations.

But the practical problem they were chasing was much simpler: can we teach a machine to understand and generate human language well enough to be genuinely useful?

For decades, computers were great at math and terrible at language. You could ask a calculator "what's 2+2" and get an instant answer. But ask an old-school computer "summarize this article for me" — it had no clue where to even start. Language is messy, filled with context, sarcasm, tone, and ambiguity. Machines just weren't built for that.

So the problem LLMs solve is this: bridging the gap between human communication and machine computation. Instead of you learning to "speak computer" (code, commands, rigid syntax), the computer learns to understand you.

That's the real "why" behind GPT.

The Paper That Changed Everything: "Attention Is All You Need"

In 2017, a group of researchers at Google published a paper with a slightly cocky, very confident title: "Attention Is All You Need."

This paper introduced the Transformer architecture — and it's not an exaggeration to say this single paper is the reason ChatGPT, Gemini, Claude, and basically every major AI model today exists in its current form.

Before this paper, models processed language sequentially — word by word, in order, like reading one word at a time and slowly building context. This was slow and struggled with long sentences (by the time the model reached the end of a paragraph, it had "forgotten" the beginning).

The Transformer's big idea was: what if the model could look at ALL words in a sentence at once, and figure out which words matter most to each other — regardless of their position?

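That mechanism is called "attention" (more precisely, self-attention). Attention itself had already been used alongside older sequential models; the paper's claim was that you could drop the sequential part and keep only attention. Hence the name: attention really was all they needed.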

A Quick Look at the Attention Mechanism

Let's take a sentence:

"The dog didn't cross the road because it was tired."

Who is "it" referring to? The dog, obviously. But a computer doesn't know that instantly — it has to figure out which earlier word "it" is most strongly connected to.

Attention is the mechanism that lets the model assign a "relevance score" between every word and every other word in the sentence. So when processing "it," the model gives high attention weight to "dog" and lower weight to "road," "cross," etc.

Do this for every word, across every sentence, at massive scale — and you get a model that actually understands context, not just word order.
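Here is a rough sketch of how those "relevance scores" can be computed, using the scaled dot-product form of attention. The three-number vectors below are hand-picked so that "it" lines up with "dog"; in a real model the query and key vectors come from learned weights and have many more dimensions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(scores)

# Hand-picked 3-dimensional vectors, invented for illustration only.
words = ["dog", "cross", "road", "tired"]
keys = [
    [2.0, 0.5, 0.0],  # dog
    [0.0, 1.0, 0.5],  # cross
    [0.1, 0.0, 2.0],  # road
    [1.0, 1.0, 0.0],  # tired
]
query_it = [2.0, 0.5, 0.0]  # the query vector for "it"

weights = attention_weights(query_it, keys)
for word, w in zip(words, weights):
    print(f"{word:>6}: {w:.2f}")
```

The weights always sum to 1, so you can read them as "how much of its attention does 'it' spend on each earlier word." With these toy vectors, "dog" gets the largest share.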


Okay, So What ACTUALLY Happens When You Send a Message to ChatGPT?

Let's walk through this step by step, like a behind-the-scenes tour.

Step 1: Input Processing (NLP)

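The moment you hit send, your message enters the territory of Natural Language Processing (NLP), the broader field concerned with helping machines process human language. In a modern LLM there is little old-style cleanup here: the application mostly packages your raw text, along with the earlier messages in the conversation, into the format the model expects.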

Step 2: Tokenization

Before the model can do anything, your sentence has to be broken into tokens (we'll dive deep into this in the next section).

Step 3: Converting to Numbers (Vectors)

Since the model is fundamentally math, your tokens get converted into numbers — specifically, vectors (more on this below too).

Step 4: Passing Through the Transformer

The vectors flow through the Transformer's layers, where the attention mechanism kicks in, weighing relationships between all the tokens.

Step 5: Predicting the Next Token

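Here's the core trick: the model isn't "answering" your question directly. At each step it assigns a probability to every token it knows, based on everything it has learned and on your input as context. The next token is then chosen from the likeliest candidates, usually with a little randomness, which is why the same question can produce differently worded answers.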
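Under the hood, the model produces a raw score for every token in its vocabulary, and those scores are turned into probabilities before one token is picked. The sketch below shows that step with a softmax; the four candidate tokens and their scores are made up, and the `temperature` parameter is a common knob rather than anything specific to ChatGPT.

```python
import math
import random

# Hypothetical raw scores (logits) for the next token after "The sky is".
logits = {"blue": 4.0, "clear": 2.5, "falling": 1.0, "banana": -2.0}

def to_probabilities(logits, temperature=1.0):
    """Softmax with temperature: lower values sharpen the distribution."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = to_probabilities(logits)

# Greedy decoding: always take the single most likely token.
greedy_choice = max(probs, key=probs.get)

# Sampling: draw a token in proportion to its probability.
rng = random.Random(0)  # fixed seed so the run is repeatable
sampled_choice = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(probs)
print("greedy:", greedy_choice, "| sampled:", sampled_choice)
```

Greedy decoding gives the same answer every time; sampling is what lets the same prompt produce slightly different responses.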

It predicts one token → adds it to the sequence → predicts the next token based on the updated sequence → repeats, again and again, until it forms a complete response.
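That predict, append, repeat loop is short enough to write out. In this sketch `TOY_MODEL` is a hypothetical stand-in for a trained model (it only looks at the last token, whereas a real LLM looks at the whole sequence), and `<start>` and `<end>` are invented marker tokens.

```python
# Hypothetical stand-in for a trained model: maps the last token to the next one.
TOY_MODEL = {
    "<start>": "the",
    "the": "sky",
    "sky": "is",
    "is": "blue",
    "blue": "<end>",
}

def predict_next_token(sequence):
    """Pretend model call: a real LLM would look at the whole sequence."""
    return TOY_MODEL.get(sequence[-1], "<end>")

def generate(prompt_tokens, max_new_tokens=10):
    sequence = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(sequence)  # predict one token
        if next_token == "<end>":                  # the model signals it is done
            break
        sequence.append(next_token)                # add it to the sequence
    return sequence

print(generate(["<start>", "the"]))  # ['<start>', 'the', 'sky', 'is', 'blue']
```

The `max_new_tokens` cap is there because nothing else guarantees the loop stops; real systems use a similar limit alongside an end-of-text token.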

This is exactly the "sky is blue" example from earlier, just happening at insane speed, hundreds or thousands of times per response, guided by billions of learned patterns.

An Important Myth-Buster

ChatGPT is not copy-pasting answers from the internet. It doesn't have a database of pre-written responses it searches through. Every response is generated fresh, token by token, based on patterns learned during training. That's why it can write a poem about your dog's birthday that has never existed anywhere on the internet before — because it's not retrieving it, it's generating it.

A six-step horizontal flowchart showing the journey of a message: typed input, tokenization, conversion to vectors, Transformer and attention processing, next-token prediction, and final generated response


Why Computers Need Numbers: Text vs Numbers

Here's something fundamental that's easy to overlook: computers don't actually understand language. At all.

A computer, at its core, only understands numbers: specifically, patterns of 1s and 0s. When you type the word "happy," the computer doesn't see happiness. It stores a string of character codes that say nothing about what the word means. To work with meaning, the word first has to be converted into a numerical representation the model can actually compute with.

Text    | What the Computer Sees Directly | What It Actually Needs
"happy" | Nothing meaningful              | A number/vector representing "happy"
"sad"   | Nothing meaningful              | A number/vector representing "sad"

So every word, sentence, and idea you type has to be translated into vectors — lists of numbers — before any processing can happen.
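You can see the "nothing meaningful" problem for yourself. This small sketch prints the numbers a computer already stores for "happy": they identify the characters, but nothing about them says "positive emotion."

```python
word = "happy"

# Unicode code point of each character.
code_points = [ord(ch) for ch in word]

# The same word as raw UTF-8 bytes, which is what actually gets stored.
raw_bytes = list(word.encode("utf-8"))

print(code_points)  # [104, 97, 112, 112, 121]
print(raw_bytes)    # identical here, because these are all ASCII letters
```

These numbers are labels for letters. The vectors described next are a different kind of number: ones chosen so that they carry information about meaning.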

Why Vectors, Specifically?

A single number isn't enough to capture the meaning of a word. So instead, each word is represented as a vector: a long list of numbers (hundreds, or in large models thousands) that together capture aspects of that word's meaning, such as its emotion, its usage context, and its relationship to other words. No single number has a neat human-readable meaning; the meaning lives in the overall pattern.

Here's the elegant part: words with similar meanings end up with similar vectors, positioned close to each other in this numerical "space."

Example:

  • "king" and "queen" → vectors close to each other
  • "king" and "banana" → vectors far apart
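"Close" and "far apart" are usually measured with cosine similarity, which compares the direction of two vectors. The four-number vectors below are invented for this example; real embeddings are learned during training and are much longer.

```python
import math

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 4-dimensional vectors, for illustration only.
embeddings = {
    "king":   [0.9, 0.8, 0.1, 0.0],
    "queen":  [0.9, 0.7, 0.2, 0.0],
    "banana": [0.0, 0.1, 0.9, 0.8],
}

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_banana = cosine_similarity(embeddings["king"], embeddings["banana"])

print(f"king vs queen:  {king_queen:.2f}")
print(f"king vs banana: {king_banana:.2f}")
```

With these toy numbers, "king" and "queen" score close to 1 while "king" and "banana" score close to 0, which is the pattern the bullets above describe.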

And here's the subtle but crucial concept: the same word can mean different things in different sentences — and the model has to figure out which meaning applies based on context.

"I went to the bank to withdraw cash."
"I sat by the river bank."

Same word, totally different meaning. This is exactly what the attention mechanism (from earlier) helps resolve — it looks at surrounding words to decide which "version" of the word's vector meaning applies here.

A 2D scatter diagram showing words positioned by meaning.

A dense, abstract scatter plot of hundreds of colored points representing real word embeddings projected into 2D space, resembling an actual word2vec or t-SNE visualization.

Introduction to Tokens

Now let's rewind to that word "tokenization" from earlier — because this is where the actual journey from your sentence to numbers begins.

What is a Token?

A token is the smallest chunk of text the model works with. It's not always a full word. Sometimes it's a whole word, sometimes just part of a word, sometimes even a single punctuation mark.

Why Not Just Use Whole Words?

Great question. If the model only worked with whole, complete words, it would need to memorize an almost infinite vocabulary — every possible word, every tense, every spelling variation, every made-up internet slang term. That's inefficient and impossible to scale.

By breaking words into smaller sub-word chunks (tokens), the model can:

  • Handle words it has never seen before by breaking them into familiar pieces
  • Work efficiently across multiple languages
  • Represent the entire vocabulary using a much smaller, manageable set of building blocks

Words vs Tokens — A Real Example

Let's tokenize this sentence: "Understanding tokenization isn't hard."

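Words (how we see it) | Tokens (how the model sees it)
Understanding         | Under + stand + ing
tokenization          | token + ization
isn't                 | is + n't
hard                  | hard
.                     | .

Notice how longer or less common words get split into multiple tokens, while short, common words like "hard" stay as a single token, and the final period is a token of its own. This split is illustrative; the exact pieces depend on the tokenizer a given model uses.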
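The split in the example can be reproduced with a very simple rule: keep taking the longest piece that appears in a known vocabulary. The `VOCAB` set and both helper functions below are hypothetical, and this greedy matching is a simplification; production tokenizers learn their vocabulary from data and will split real text differently.

```python
# Hypothetical sub-word vocabulary, chosen to match the example sentence.
VOCAB = {"under", "stand", "ing", "token", "ization", "is", "n't", "hard", "."}

def split_word(word, vocab):
    """Greedy longest-match: repeatedly take the longest piece found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

def tokenize(text, vocab=VOCAB):
    """Lowercase, split on spaces, then break each word into sub-word pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(split_word(word, vocab))
    return tokens

print(tokenize("Understanding tokenization isn't hard."))
```

The single-character fallback is what lets a tokenizer like this handle a word it has never seen: it can always spell the word out from smaller pieces.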

This is also why context limits and API pricing for models like GPT are measured in "tokens" rather than "words": internally, that's the actual unit of currency the model works with.

The Tokenization Process, Step by Step

  1. Your input sentence arrives as raw text.
  2. A tokenizer (a separate small algorithm) scans through it and breaks it into tokens based on a pre-built vocabulary of common sub-words.
  3. Each token gets mapped to a unique numerical ID.
  4. These IDs are then converted into vectors (as we discussed above).
  5. Now the model can actually process it.
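Steps 3 and 4 are both just lookups, which a short sketch can show. The vocabulary, the IDs, and the 4-number vectors here are all made up; in a real model the embedding table has one row per vocabulary entry and its values are learned during training, not drawn at random.

```python
import random

# Hypothetical vocabulary: token -> unique numerical ID.
TOKEN_TO_ID = {"under": 0, "stand": 1, "ing": 2, "token": 3, "ization": 4, "hard": 5}

EMBEDDING_DIM = 4
rng = random.Random(42)  # fixed seed so the run is repeatable

# One vector per token ID. Random here; learned during training in a real model.
EMBEDDING_TABLE = [
    [round(rng.uniform(-1, 1), 2) for _ in range(EMBEDDING_DIM)]
    for _ in TOKEN_TO_ID
]

tokens = ["under", "stand", "ing"]                  # step 2: tokenizer output
token_ids = [TOKEN_TO_ID[t] for t in tokens]        # step 3: token -> ID
vectors = [EMBEDDING_TABLE[i] for i in token_ids]   # step 4: ID -> vector

print(token_ids)  # [0, 1, 2]
for token, vector in zip(tokens, vectors):
    print(token, vector)
```

The ID is only an index. All of the "meaning" lives in the row of numbers that the ID points to.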



Transformers: The Architecture Behind It All

We touched on this earlier with the "Attention Is All You Need" paper — now let's zoom in properly.

What is a Transformer?

A Transformer is a type of neural network architecture designed to process sequences of data (like sentences) by looking at all parts of the sequence simultaneously and figuring out how each part relates to every other part — using the attention mechanism.

The name fits: it transforms input sequences into meaningful output sequences, layer by layer, refining its representation at each stage.

Why It Changed AI

Before Transformers, the standard architectures (RNNs, LSTMs) processed language one word at a time, in strict order. This created two big problems:

  1. Speed — sequential processing is slow, especially for long text, because you can't parallelize it easily.
  2. Memory/Context Loss — by the time these older models reached the 50th word in a long paragraph, they'd often "forget" what was said in the first 10 words.

Transformers solved both problems at once:

  • They process the entire input simultaneously (parallelizable → much faster training on modern GPUs).
  • The attention mechanism lets every word directly reference every other word, regardless of distance — no more "forgetting" the beginning of a long sentence.
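The "forgetting" problem can be illustrated with a deliberately crude toy, not an actual RNN or LSTM. Assume a sequential model keeps only a fixed fraction of its previous state at each step; the `decay` value of 0.8 is an arbitrary choice for this sketch.

```python
def recurrent_influence(num_words, decay=0.8):
    """Share of the first word's signal left after reading num_words words,
    when each step keeps only `decay` of the previous state."""
    influence = 1.0
    for _ in range(num_words - 1):
        influence *= decay
    return influence

for n in (5, 10, 50):
    print(f"after {n:>2} words: {recurrent_influence(n):.6f}")
```

Under this assumption the first word's signal shrinks with every step, and by the 50th word almost nothing is left. With attention, the 50th word can score its relevance to the 1st word directly, so distance does not dilute the signal in the same way.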

How It Helps Understand Language

Because attention calculates relevance between all words at once, the model builds a much richer, contextual understanding of meaning. It's not just reading word by word. It holds the entire sentence's relationships in view at the same time, much as you take in the meaning of a whole sentence rather than treating each word in isolation.

Why (Almost) Every Modern LLM Uses Transformers

GPT, Claude, Gemini, LLaMA — virtually every major LLM today is built on the Transformer architecture (with plenty of engineering tweaks and improvements layered on top). It became the industry standard because:

  • It scales well with more data and more compute (so far, bigger models trained on more data have tended to keep getting better).
  • It handles long-range context far better than anything before it.
  • It's flexible — the same core architecture works for text, code, images, and even audio, with modifications.

That's the "Attention Is All You Need" prophecy fulfilled: attention-based Transformers weren't just a good idea — they became the foundation of modern AI.

A side-by-side comparison — old sequential models (RNN/LSTM) processing words one at a time with fading memory, versus Transformers processing all words simultaneously through a connected web of attention.

A vertical layered diagram of a Transformer model, showing the flow from input tokens through embedding and positional encoding, stacked self-attention and feed-forward layers, up to output next-token probabilities.


Wrapping Up

So next time someone asks "what is ChatGPT, is it just a fancy search engine?" — you now know the real answer: it's a Transformer-based Large Language Model that has learned the deep patterns of human language by training on massive amounts of text, converts your words into numbers it can actually compute with, and generates its response one predicted token at a time — guided by an attention mechanism that lets it understand context the way we intuitively do.

It's not magic. It's one of the most elegant applications of math and pattern recognition we've built so far — and now you understand exactly how it works, step by step.
