Welcome back to AI From Scratch.
If you've made it to Day 5, you've already done more deep‑learning theory than most people who tweet about AI.
Quick rewind:
Day 1: AI is a next‑word prediction machine with a ton of weights.
Day 2: Those weights are trained like a kid practicing free throws.
Day 3: Inside, layers and neurons act like an assembly line of tiny reactions.
Day 4: Transformers + attention let each word decide which other words to care about.
Today's question is sneakier:
If AI doesn't actually "see" words the way we do… what does it see?
Because when you paste some long rant into a chatbot, it's not reading sentences. It's reading something more primitive: tokens and numbers.
So what is the AI actually looking at?
When you type:
"Explain AI to me like I'm sleep‑deprived but curious."
The model doesn't see that as one clean string.
Step one is tokenization, chopping your text into little units called tokens.
Tokens are:
- Sometimes full words (Apple, hello).
- Sometimes pieces of words (un, believ, able).
- Sometimes punctuation, spaces, even emojis.
Each of those tokens gets turned into an *ID* (a number like 42 or 18,307) from the model's vocabulary.
So what this means for you: when you talk to an AI, it's not thinking in "words and sentences" - it's thinking in IDs representing tiny chunks of text.
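To make that concrete, here's a toy sketch in Python. The vocabulary below is completely invented for illustration - real models learn vocabularies of tens of thousands of chunks from data - but the text → tokens → IDs flow is the same idea:

```python
# Toy illustration (NOT a real model's tokenizer): a tiny invented
# vocabulary mapping text chunks to IDs.
TOY_VOCAB = {
    "Explain": 101, " AI": 102, " to": 103, " me": 104,
}

def toy_tokenize(text):
    """Greedy longest-match split of `text` into known vocab chunks."""
    tokens = []
    while text:
        # Try the longest chunk first so "Explain" beats any shorter prefix.
        for size in range(len(text), 0, -1):
            chunk = text[:size]
            if chunk in TOY_VOCAB:
                tokens.append(chunk)
                text = text[size:]
                break
        else:
            raise ValueError(f"no token covers: {text!r}")
    return tokens

tokens = toy_tokenize("Explain AI to me")
ids = [TOY_VOCAB[t] for t in tokens]
print(tokens)  # ['Explain', ' AI', ' to', ' me']
print(ids)     # [101, 102, 103, 104]
```

Notice the leading spaces baked into chunks like `" AI"` - real tokenizers do something similar, which is why "AI" and " AI" can be different tokens.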
Chopping sentences into Lego pieces
Why bother with this token stuff? Why not just use whole words?
Because language is messy:
- New slang shows up daily.
- Names, hashtags, typos, random URLs…
- Some languages don't really use spaces.
If the model only understood full words, it would be completely lost the moment you typed something it hadn't seen before.
Enter subword tokenization: methods like BPE (Byte Pair Encoding) that learn common pieces of words and reuse them.
Think of it as Lego bricks:
- Common words get their own big brick (computer, football).
- Rare or weird words are built by snapping smaller bricks together (computational, micro‑SaaS).
So what this means for you: the reason AI can handle made‑up words, weird usernames, and half‑English‑half‑Hindi chaos is because it's secretly breaking everything into reusable Lego‑like pieces.
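Here's a minimal sketch of the core BPE idea: count which adjacent pairs of symbols show up most often across the corpus, then merge the winner into one bigger brick, and repeat. The tiny corpus and its word counts below are invented for illustration:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (BPE's core statistic)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# A tiny corpus: each word starts as a tuple of characters, with a count.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
for _ in range(2):  # learn two merges
    pair = most_frequent_pair(words)
    print("merging", pair)
    words = merge_pair(words, pair)
print(list(words))
```

After just two merges, "low" has become a single brick, and "lower" and "lowest" get built from it plus smaller pieces - which is exactly the Lego behavior described above.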
From token IDs to "meaning space"
Okay, so now we have a sequence of token IDs: 154, 892, 77, 301…
The model still can't do anything interesting with just those IDs. They're like jersey numbers with no skills attached.
Next step: embeddings.
An embedding is a big list of numbers that acts like coordinates in a strange "meaning space."
- Tokens with related meanings end up near each other.
- Tokens with very different roles drift far apart.
- Certain directions in this space line up with concepts like gender, tense, even "royalty."
This is where that classic example comes from:
- "king" and "queen" are close,
- "king" and "banana" are very far,
- "king" and "man" are related in a different way than "king" and "queen."
You don't need the math. Just hold this picture: every token becomes a dot in a high‑dimensional map where closeness roughly means "similar vibe or role."
So what this means for you: before any attention, layers, or predictions, your words have already been turned into points in a meaning map. The rest of the model is just nudging and combining those points.
Order matters: why "AI loves you" ≠ "you love AI"
One more problem: embeddings capture what each token is, but not where it appears.
"Cat bites dog" and "dog bites cat" have the same words - different meaning.
Transformers fix this by adding some notion of position on top of the token embeddings, so the model knows "this is the first word, this is the second, …".
You can think of it like:
- Each token gets its meaning coordinates.
- Then it also gets a little tag that says "I'm the 5th token in this sentence."
- The model blends those together so it knows both "what" and "where."
So what this means for you: the model doesn't just bag up your words and shake them - it knows the order they came in, which is why it can tell who did what to whom.
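The original Transformer paper did this with sinusoidal position tags: each position gets a fixed pattern of sines and cosines at different frequencies, which gets added to the token's embedding. Here's a small sketch of that recipe, applied to an invented 4‑dimensional embedding:

```python
import math

def positional_encoding(pos, dim):
    """Sinusoidal position tag (the original Transformer recipe):
    even indices get sin, odd indices get cos, at different frequencies."""
    pe = []
    for i in range(dim):
        angle = pos / (10000 ** ((i // 2) * 2 / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def add_position(token_embedding, pos):
    """Blend the 'what' (token embedding) with the 'where' (position tag)."""
    pe = positional_encoding(pos, len(token_embedding))
    return [t + p for t, p in zip(token_embedding, pe)]

emb = [0.5, -0.2, 0.1, 0.7]  # invented 4-d token embedding
# The same token ends up with different final vectors at different positions:
print(add_position(emb, 0))
print(add_position(emb, 5))
```

Many modern models use other position schemes, but the takeaway is the same: the same word at position 1 and position 5 enters the network as two different vectors.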
The context window: your AI's short‑term memory
Now for the sneaky bit that secretly controls a lot of your experience:
the context window.
Every model has a maximum number of tokens it can handle in one go - that includes your prompt + the model's reply.
Roughly:
- 1 token ≈ ¾ of an English word.
- A few thousand tokens ≈ a few pages of text, depending on the model.
If your conversation plus the model's answers goes beyond that limit, the model starts forgetting the oldest tokens - they literally fall out of its working memory.
So what this means for you: when a chatbot suddenly "forgets" something you said 40 messages ago, it's not being rude, that part of the conversation may have been pushed out of its token budget.
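In code, that forgetting is about as blunt as it sounds. Here's a sketch using plain integers as stand‑in token IDs (real chat systems are smarter about what to drop - for instance, they usually protect the system prompt):

```python
def fit_into_context(history_tokens, max_tokens):
    """Keep only the most recent tokens that fit the window.
    The oldest ones simply fall off the front."""
    if len(history_tokens) <= max_tokens:
        return history_tokens
    return history_tokens[-max_tokens:]

conversation = list(range(10))            # pretend these are 10 token IDs
print(fit_into_context(conversation, 6))  # [4, 5, 6, 7, 8, 9]
```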
How token limits shape your chats
Because everything is measured in tokens, a few non‑obvious things happen:
- Long prompts eat into the memory budget fast - big copy‑pasted docs, code, or transcripts can leave less room for the model's answer.
- Both input and output count - if you ask for a huge essay, there's less room left for past context.
- Different models have different context windows - newer ones can handle way more tokens than older ones, which is why some feel better at long, multi‑step tasks.
In practice, this is why people talk about "prompt engineering" and "chunking" documents: you're really just managing what fits into that sliding window of tokens the model can see at once.
So what this means for you: how you feed information to the model (shorter, focused chunks vs giant walls of text) directly affects how coherent and on‑track its answers feel.
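Here's a toy sketch of chunking under a token budget, using the rough "1 token ≈ ¾ of a word" rule from above, so one word costs about 4/3 tokens. That's a heuristic - real pipelines count tokens with the model's actual tokenizer:

```python
def chunk_by_budget(words, budget, tokens_per_word=4/3):
    """Split a document into pieces that each fit a token budget,
    estimating cost with the rough words-to-tokens heuristic."""
    chunks, current, used = [], [], 0.0
    for word in words:
        cost = tokens_per_word
        if current and used + cost > budget:
            chunks.append(" ".join(current))
            current, used = [], 0.0
        current.append(word)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = ("tokens are the currency of every chat " * 6).split()
for piece in chunk_by_budget(doc, budget=20):
    print(piece)
```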
Putting it all together: what your AI actually reads
Let's stitch the whole path from your keyboard to the model's "brain":
- You type a message.
- The tokenizer slices it into tokens (little text chunks).
- Each token becomes an ID from the model's vocabulary.
- Each ID becomes an embedding - a point in meaning space.
- Positional info gets mixed in so order isn't lost.
- All of this fits into a context window, a fixed‑size memory slot measured in tokens.
- Inside that window, transformers + attention from Day 4 do their thing and predict the next token.
So what this means for you: your AI is never "reading paragraphs" the way you see them. It's working with a long row of numeric Lego bricks inside a fixed‑size tray, and all the magic you see is built on top of that.
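The whole path can be stitched into one toy function. Everything here is invented - the three‑word vocabulary, the 2‑d vectors, the absurdly small window - and position is mixed in crudely by appending the index rather than adding a learned vector, just to keep the sketch readable:

```python
# Toy end-to-end pipeline: text -> tokens -> IDs -> (windowed) embeddings + position.
VOCAB = {"the": 0, " cat": 1, " sat": 2}
EMBEDDINGS = {0: [0.1, 0.2], 1: [0.5, -0.3], 2: [-0.4, 0.6]}  # invented 2-d vectors
CONTEXT_WINDOW = 2  # absurdly small, to show old tokens falling out

def prepare_input(text):
    # 1) tokenize: greedy longest-match against the toy vocab
    tokens, rest = [], text
    while rest:
        match = max((t for t in VOCAB if rest.startswith(t)), key=len)
        tokens.append(match)
        rest = rest[len(match):]
    # 2) token -> ID
    ids = [VOCAB[t] for t in tokens]
    # 3) keep only what fits the context window (oldest fall off the front)
    ids = ids[-CONTEXT_WINDOW:]
    # 4) ID -> embedding, 5) mix in position (crudely: append the index)
    return [EMBEDDINGS[i] + [pos] for pos, i in enumerate(ids)]

print(prepare_input("the cat sat"))
```

With a 2‑token window, "the" has already fallen out by the time the model "sees" anything - a miniature version of a chatbot forgetting message 1 of 40.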
What's coming on Day 6
Now you know:
- How text becomes tokens,
- How tokens become numbers in a meaning space, and
- How the context window limits what the AI can remember at once.
That sets up a very natural next question:
Why is a bigger AI "smarter"? (And where does that idea break down?)
On Day 6 - "Why Is a Bigger AI Smarter? (It's Not What You Think)" - we'll talk about going from 1M to 1T parameters, and what scaling actually buys you.
We'll look at what happens when you crank up parameter counts, why "just make it bigger" sometimes works weirdly well, and why size alone still doesn't guarantee good judgment.
What blew your mind most? Drop a comment!