<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rith Banerjee</title>
    <description>The latest articles on DEV Community by Rith Banerjee (@rithb898).</description>
    <link>https://dev.to/rithb898</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3283225%2F1a5fe1a6-c5f2-46f1-90ef-c03e3945f382.jpeg</url>
      <title>DEV Community: Rith Banerjee</title>
      <link>https://dev.to/rithb898</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rithb898"/>
    <language>en</language>
    <item>
      <title>Tokenization Made Simple: How AI Turns Words into Numbers</title>
      <dc:creator>Rith Banerjee</dc:creator>
      <pubDate>Wed, 13 Aug 2025 11:36:57 +0000</pubDate>
      <link>https://dev.to/rithb898/tokenization-made-simple-how-ai-turns-words-into-numbers-607</link>
      <guid>https://dev.to/rithb898/tokenization-made-simple-how-ai-turns-words-into-numbers-607</guid>
      <description>&lt;p&gt;If you’re starting out in &lt;strong&gt;AI&lt;/strong&gt;, &lt;strong&gt;Natural Language Processing (NLP)&lt;/strong&gt;, or just curious about how tools like ChatGPT understand text, you’ll run into a strange but important word: &lt;strong&gt;tokenization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It might sound like some secret coding spell, but really it’s just the AI version of &lt;strong&gt;chopping vegetables before cooking&lt;/strong&gt;. Let’s break it down.&lt;/p&gt;

&lt;h2&gt;What is Tokenization?&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tokenization&lt;/strong&gt; is the process of &lt;strong&gt;splitting text into smaller, meaningful pieces called tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These tokens can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Whole words&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Parts of words (sub-words)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Even punctuation marks or symbols&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point? AI can’t work directly with raw sentences. Tokens are the &lt;strong&gt;prepped ingredients&lt;/strong&gt; that models actually understand.&lt;/p&gt;

&lt;h2&gt;Why Do We Need Tokenization?&lt;/h2&gt;

&lt;p&gt;Computers don’t “read” like humans. They understand &lt;strong&gt;numbers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Tokenization is the bridge:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Break text into tokens.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assign each token a unique ID (a number).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feed those IDs into the model.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without tokenization, a sentence like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I love coding&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;is just a messy blob of characters to a computer.&lt;/p&gt;

&lt;h2&gt;Quick Example&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Text:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;I&lt;/span&gt; &lt;span class="nx"&gt;love&lt;/span&gt; &lt;span class="nx"&gt;coding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tokens:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;love&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;coding&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Token IDs (example):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;302&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact numbers depend on the tokenizer you’re using; each one has its own dictionary (also called a vocabulary).&lt;/p&gt;
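&lt;p&gt;The whole pipeline from the Quick Example can be sketched in a few lines of JavaScript. This is a toy tokenizer that splits on words and punctuation and invents IDs on the fly, purely for illustration; real tokenizers ship with a fixed, learned vocabulary:&lt;/p&gt;

```javascript
// Toy tokenizer: split text into word and punctuation tokens,
// then assign each new token the next free ID.
// The vocabulary here grows on the fly, purely for illustration.
const vocab = new Map();

function tokenize(text) {
  // Words (\w+) or single non-space, non-word characters like "." and "!".
  return text.match(/\w+|[^\s\w]/g) || [];
}

function encode(tokens) {
  return tokens.map(function (tok) {
    if (!vocab.has(tok)) vocab.set(tok, vocab.size + 1);
    return vocab.get(tok);
  });
}

const tokens = tokenize("I love coding.");
console.log(tokens); // ["I", "love", "coding", "."]

const ids = encode(tokens);
console.log(ids); // [1, 2, 3, 4] with this toy vocabulary
```

&lt;p&gt;Run it on a second sentence and the vocabulary keeps growing, which is one reason two different tokenizers almost never agree on IDs.&lt;/p&gt;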

&lt;h2&gt;Subword Tokenization (Used in GPT)&lt;/h2&gt;

&lt;p&gt;Large models like GPT often split rare words into &lt;strong&gt;subwords&lt;/strong&gt; for flexibility.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unbelievable&lt;/strong&gt; → &lt;code&gt;["Un", "believ", "able"]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This way, even if the model has never seen “unbelievable” as one word, it still understands it by piecing together familiar chunks.&lt;/p&gt;
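&lt;p&gt;Here’s a minimal sketch of that idea using greedy longest-match splitting. The tiny subword vocabulary is made up for this example; real BPE or WordPiece vocabularies are learned from data:&lt;/p&gt;

```javascript
// Greedy longest-match subword splitter with an invented vocabulary.
const subwords = new Set(["un", "believ", "able"]);

function splitIntoSubwords(word) {
  const pieces = [];
  let rest = word.toLowerCase();
  while (rest.length !== 0) {
    // Find the longest known subword at the start of the remainder.
    let match = null;
    for (let end = rest.length; end !== 0; end -= 1) {
      const candidate = rest.slice(0, end);
      if (subwords.has(candidate)) { match = candidate; break; }
    }
    // Unknown character: emit it alone (real tokenizers fall back
    // to bytes or a special "unknown" token here).
    if (match === null) match = rest[0];
    pieces.push(match);
    rest = rest.slice(match.length);
  }
  return pieces;
}

console.log(splitIntoSubwords("Unbelievable")); // ["un", "believ", "able"]
```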

&lt;h2&gt;Why It Matters in AI&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token limits&lt;/strong&gt;: Models have a maximum number of tokens they can process. More tokens = higher cost and slower response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;: Breaking words into subwords lets models handle rare or made-up words without storing them all in the dictionary.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
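&lt;p&gt;A quick way to build intuition for token limits is to count tokens yourself. The sketch below uses a plain whitespace split, which only approximates real tokenizers (GPT’s tokenizer usually produces more tokens than words):&lt;/p&gt;

```javascript
// Toy token counter: split on whitespace and count the pieces.
// Real models count with their own tokenizer, so the true number
// is usually higher than this simple word count.
function countTokens(text) {
  return (text.match(/\S+/g) || []).length;
}

const prompt = "Tokenization is the bridge between words and numbers";
console.log(countTokens(prompt)); // 8
```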

&lt;h2&gt;Try It Yourself&lt;/h2&gt;

&lt;p&gt;You can experiment with tokenization right now using my &lt;strong&gt;Custom Tokenizer Tool&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
🔗 &lt;a href="https://custom-tokenizer-rith.vercel.app/" rel="noopener noreferrer"&gt;https://custom-tokenizer-rith.vercel.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Type any text and see how it’s split into tokens and turned into IDs just like AI does before generating a response.&lt;/p&gt;

&lt;h2&gt;A Fun Analogy&lt;/h2&gt;

&lt;p&gt;Imagine you’re explaining your favorite recipe to a robot chef:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First, you chop the ingredients (&lt;strong&gt;tokenization&lt;/strong&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, you label each ingredient (&lt;strong&gt;token IDs&lt;/strong&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, the robot uses those labels to cook (&lt;strong&gt;model processing&lt;/strong&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No chopping = no cooking. That’s how important tokenization is in AI.&lt;/p&gt;

&lt;h2&gt;Key Takeaways&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tokenization = Text → Tokens → Numbers → Model&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It’s step one in almost every NLP task.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Different tokenizers split text differently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Subwords = flexibility, but more tokens to process.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;Tokenization is how AI breaks down human language into bite-sized chunks it can understand and process.&lt;br&gt;&lt;br&gt;
It’s the &lt;strong&gt;translation step&lt;/strong&gt; between words and the numerical world of machine learning. Whether splitting into whole words, subwords, or even punctuation, tokenization makes it possible for models to handle any text, from everyday phrases to rare, complex terms. Without it, AI would be staring blankly at a wall of letters.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>chaicode</category>
    </item>
    <item>
      <title>How I Explained Vector Embeddings to My Mom (And She Got It)</title>
      <dc:creator>Rith Banerjee</dc:creator>
      <pubDate>Wed, 13 Aug 2025 11:35:58 +0000</pubDate>
      <link>https://dev.to/rithb898/how-i-explained-vector-embeddings-to-my-mom-and-she-got-it-8a9</link>
      <guid>https://dev.to/rithb898/how-i-explained-vector-embeddings-to-my-mom-and-she-got-it-8a9</guid>
      <description>&lt;p&gt;&lt;strong&gt;Mom:&lt;/strong&gt; “Beta, what is this ‘vector embedding’ you keep talking about?”&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Me:&lt;/strong&gt; “It’s not as scary as it sounds, Mom. Let me explain.”&lt;/p&gt;

&lt;p&gt;We’re in the kitchen, chai is boiling, and I decide this is the perfect time to make it easy for her to understand.&lt;/p&gt;

&lt;h2&gt;Turning Ideas Into Numbers&lt;/h2&gt;

&lt;p&gt;Think about how you’d describe someone you know.&lt;br&gt;&lt;br&gt;
If I had to describe you, Mom:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sweetness level:&lt;/strong&gt; 9 out of 10&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cooking skill:&lt;/strong&gt; 10 out of 10&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ability to find things I’ve lost:&lt;/strong&gt; 11 out of 10&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I could do the same for everyone in our family.&lt;br&gt;&lt;br&gt;
Once I’ve written down everyone’s “scores,” I can compare them. If two people have similar scores, they’re probably similar in personality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mom:&lt;/strong&gt; “Oh, like my recipe book? Different dishes, but I can group the ones with the same spices.”&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Me:&lt;/strong&gt; “Exactly, Mom.”&lt;/p&gt;

&lt;h2&gt;Computers Do the Same Thing&lt;/h2&gt;

&lt;p&gt;Before I scare you with the term &lt;strong&gt;“vector embedding,”&lt;/strong&gt; let’s start simple:&lt;br&gt;&lt;br&gt;
It’s just a list of numbers that represents the meaning of something: a word, a sentence, a picture, or even a song.&lt;br&gt;&lt;br&gt;
These numbers are like the &lt;strong&gt;“scores”&lt;/strong&gt; in our family example, except instead of sweetness or cooking skill, a computer’s traits might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How romantic it feels (like a Bollywood love song)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How technical it feels (like the washing machine manual)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How much it’s about cricket&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And when you want to sound smart at a party, you can call these lists of numbers &lt;strong&gt;vector embeddings&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;Why Bother?&lt;/h2&gt;

&lt;p&gt;Because once things are turned into numbers, computers can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Find similar things&lt;/strong&gt; - “puppy” is close to “dog”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spot differences&lt;/strong&gt; - “banana” is far from “spaceship”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Make smart recommendations&lt;/strong&gt; - like suggesting a Shah Rukh Khan movie after you watch one&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
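&lt;p&gt;“Close” and “far” have a precise meaning here: cosine similarity between the number lists. A minimal sketch, with 3-number vectors invented purely for illustration (real embeddings have hundreds of dimensions):&lt;/p&gt;

```javascript
// Cosine similarity: 1 means "pointing the same way" (very similar),
// values near 0 mean unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  a.forEach(function (value, i) {
    dot += value * b[i];
    normA += value * value;
    normB += b[i] * b[i];
  });
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up 3-dimensional "embeddings" just for this demo.
const embeddings = {
  dog:       [0.9, 0.8, 0.1],
  puppy:     [0.85, 0.9, 0.15],
  spaceship: [0.1, 0.05, 0.95],
};

console.log(cosineSimilarity(embeddings.dog, embeddings.puppy));     // close to 1
console.log(cosineSimilarity(embeddings.dog, embeddings.spaceship)); // much lower
```

&lt;p&gt;The computer never compares the words “dog” and “puppy” at all; it only compares their numbers.&lt;/p&gt;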

&lt;h2&gt;A Simple Example&lt;/h2&gt;

&lt;p&gt;You search for &lt;strong&gt;“a story about a dog”.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Even if the text says &lt;strong&gt;“a tale about a puppy”&lt;/strong&gt;, the computer still finds it because their embeddings are close in meaning, even if the words are different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mom:&lt;/strong&gt; “Ohhh, so the computer isn’t just matching the exact words, it’s matching the idea?”&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Me:&lt;/strong&gt; “Exactly, Mom. You’ve cracked it.”&lt;/p&gt;

&lt;h2&gt;Wrapping It Up&lt;/h2&gt;

&lt;p&gt;So Mom, the next time you hear me say “embedding,” you’ll know I’m just giving the computer a recipe for understanding meaning.&lt;br&gt;&lt;br&gt;
And unlike my cooking, this recipe actually works every time.&lt;/p&gt;

</description>
      <category>chaicode</category>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>GPT Explained Like You're 5: The Kid-Friendly Guide to AI</title>
      <dc:creator>Rith Banerjee</dc:creator>
      <pubDate>Wed, 13 Aug 2025 11:34:59 +0000</pubDate>
      <link>https://dev.to/rithb898/gpt-explained-like-youre-5-the-kid-friendly-guide-to-ai-568d</link>
      <guid>https://dev.to/rithb898/gpt-explained-like-youre-5-the-kid-friendly-guide-to-ai-568d</guid>
      <description>&lt;h2&gt;What is GPT? (Explained for a 5-Year-Old)&lt;/h2&gt;

&lt;p&gt;Have you ever talked to a computer that talks back like a person? That’s kind of what GPT is. But to understand it, let’s imagine…&lt;/p&gt;

&lt;h2&gt;The Magical Parrot&lt;/h2&gt;

&lt;p&gt;Picture a talking parrot.&lt;br&gt;&lt;br&gt;
But this isn’t just any parrot: it has read every storybook, comic, newspaper, and joke book in the world. It’s heard every bedtime story, every “Knock Knock” joke, and even that weird poem your uncle made up last Christmas.&lt;br&gt;&lt;br&gt;
Now, if you ask the parrot:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Tell me a story about a dinosaur who loves pizza"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;it can make one up for you on the spot, using everything it’s read before.&lt;br&gt;&lt;br&gt;
That’s basically what GPT does… except GPT is not a bird. It’s a computer brain built by very smart people. It has read millions of pages from the internet, books, and articles. Then it learned how to guess the next word in a sentence really, really well.&lt;/p&gt;

&lt;h2&gt;How It Works&lt;/h2&gt;

&lt;p&gt;Think of GPT like Lego blocks, but instead of plastic blocks it has words. When you start a sentence, GPT picks the next “word block” that fits best. Then another. And another. Until you’ve built a whole castle made of words.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;&lt;br&gt;
You say: &lt;strong&gt;“Once upon a time, there was a…”&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
GPT might finish: &lt;strong&gt;“…little dragon who loved painting rainbows.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why? Because it’s seen lots of stories like that, and it knows dragons and rainbows often go together in stories.&lt;/p&gt;
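&lt;p&gt;You can build a toy version of “pick the word block that fits best” by counting which word follows which in a tiny corpus. GPT is vastly more sophisticated (it looks at whole contexts, not single words), but the core idea of next-word prediction looks like this:&lt;/p&gt;

```javascript
// Toy next-word predictor: count word pairs in a tiny invented corpus,
// then always pick the most frequent follower.
const corpus =
  "the dragon loved painting rainbows . the dragon loved flying . " +
  "the knight loved painting rainbows .";

const followers = {};
const words = corpus.split(" ");
words.slice(0, -1).forEach(function (cur, i) {
  const next = words[i + 1];
  followers[cur] = followers[cur] || {};
  followers[cur][next] = (followers[cur][next] || 0) + 1;
});

function nextWord(word) {
  const counts = followers[word] || {};
  // Pick the follower seen most often in the corpus.
  const ranked = Object.entries(counts).sort(function (a, b) {
    return b[1] - a[1];
  });
  return ranked.length === 0 ? undefined : ranked[0][0];
}

console.log(nextWord("dragon"));   // "loved"
console.log(nextWord("painting")); // "rainbows"
```

&lt;p&gt;Chain the function on its own output and you get a (very repetitive) story generator; GPT’s transformer does the same loop with far richer context.&lt;/p&gt;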

&lt;h2&gt;Why Is It Called GPT?&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;G – Generative:&lt;/strong&gt; It makes (generates) new text.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;P – Pre-trained:&lt;/strong&gt; It learned a ton before you even talked to it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;T – Transformer:&lt;/strong&gt; A fancy computer trick that helps it understand how words connect.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, &lt;strong&gt;GPT = “A computer that’s good at making sentences.”&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;What Can GPT Do?&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tell bedtime stories.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make up silly poems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Help with homework.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Answer questions like &lt;strong&gt;“Why is the sky blue?”&lt;/strong&gt; or &lt;strong&gt;“Do cats have eyebrows?”&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Yes, cats do have tiny eyebrow-like whiskers.)&lt;/p&gt;

&lt;h2&gt;Why It Feels Like Magic&lt;/h2&gt;

&lt;p&gt;GPT doesn’t just copy what it’s read, it mixes ideas together. Like blending chocolate cake and strawberry ice cream to make chocolate-strawberry cake!&lt;br&gt;&lt;br&gt;
So even if you ask it something it’s never seen before, it can still give you an answer.&lt;/p&gt;

&lt;h2&gt;The Big Secret&lt;/h2&gt;

&lt;p&gt;Even though GPT seems super smart, it doesn’t really understand things like humans do. It’s more like your friend who’s great at finishing your sentences but doesn’t actually know what you’re thinking. It’s amazing with words, but it doesn’t have feelings or thoughts.&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;GPT is like a magical parrot with a library for a brain, a super-talking computer that can make up stories, answer your questions, and play with words.&lt;/p&gt;

&lt;p&gt;So next time you chat with GPT, remember: it’s not magic… but it’s pretty close.&lt;/p&gt;

</description>
      <category>chaicode</category>
      <category>webdev</category>
      <category>genai</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
