Months after ChatGPT launched, I still could not have told you what a token was. I had been using it since the first public launch and was basically having novel-long conversations with it. I had no idea that every time I hit "enter," my text was being chopped into pieces before the model even looked at it.
It turns out, those pieces (tokens) determine your usage limits, how much the AI can remember, and why it sometimes seems to forget things you told it.
So. Tokens.
Here is what I wish I understood earlier.
They are not words
I assumed "one token = one word," but that is not the case. A token is a chunk of text: it may be a whole word, part of a word, or a punctuation mark. The word "hamburger" gets split into two tokens, h and amburger. Not "ham" and "burger." The splits are not based on syllables, as you might expect. (The exact pieces also vary from model to model, because each model ships with its own tokenizer.)
Here are a few more to make the point: "infrastructure" becomes inf and rastructure. "Unbelievable" becomes three tokens: un, belie, and vable. These splits look strange, but they are consistent. The same word always produces the same tokens. This isn't arbitrary; there is a method behind the madness...
The reason Large Language Models (LLMs) need to do this is that they don't actually work with text at all. They work with numbers. Tokenization is the step where human-readable text gets converted into a sequence of numbers the model can process. Each token maps to a number, and the model does all of its "thinking" in that numerical space. A "tokenizer" is basically a translation layer between your words and the model's math.
The splits themselves are not random either. Tokenizers are trained to find the most common patterns in language. A whole common word like "the" gets its own single token. Less common words get broken into reusable pieces that appear across many different words. That un in "unbelievable" is something the model has seen in hundreds of words: undo, unfair, unlikely, unusual. By splitting it out, the model learns what "un" means as a concept, not just as part of one specific word. The splits are chosen to maximize what the model can learn from the patterns in language.
So, essentially, a tokenizer's job is to convert each chunk into a number the model can work with, and it does this the same way every time. That consistency is what makes the math work.
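To make that concrete, here is a toy sketch of one training step of a BPE-style tokenizer. The four-word corpus is invented for illustration, and real tokenizers run thousands of merge steps over enormous corpora, but the core idea is the same: find the most common adjacent pair, fuse it into a reusable token, and give every token a fixed numeric ID.

```python
from collections import Counter

# Toy corpus: "un" appears in every word, so a BPE-style trainer
# will learn it as a reusable piece.
corpus = ["unfair", "undo", "unlikely", "unusual"]

# Start with each word as a sequence of single characters.
words = [list(w) for w in corpus]

def most_common_pair(words):
    """Count every adjacent symbol pair across the corpus."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    """Replace every occurrence of the pair with one fused symbol."""
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

# One training step: ("u", "n") is the most frequent adjacent pair,
# so it becomes the single token "un".
pair = most_common_pair(words)
words = merge(words, pair)

# Each distinct token gets a fixed numeric ID -- this lookup table is
# the "translation layer" between your words and the model's math.
vocab = {tok: i for i, tok in enumerate(sorted({t for w in words for t in w}))}

print(pair)       # ('u', 'n')
print(words[0])   # ['un', 'f', 'a', 'i', 'r']
print([vocab[t] for t in words[0]])
```

Run the merge step in a loop and the vocabulary fills up with progressively larger common chunks, which is exactly why "the" ends up as one token while rarer words stay in pieces.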
Why should you care?
Because tokens are what determine your usage limits.
Most people use AI through a free tier. Free tiers do not charge you, but they do limit how many messages you can send per day or per hour. When you hit that cap and get the "you have reached your limit" message, it is because you used too many tokens. The longer your conversations get, the faster you burn through your allowance.
Even on a paid plan, tokens are the unit of measurement. Services price by the token, and input tokens (what you send) and output tokens (what the AI generates) are counted separately. To give you a sense of scale: pasting a 2,000-word document uses roughly 2,700 tokens. A detailed response might be another 800. At typical rates, that entire exchange costs less than two cents. For casual use, the cost is negligible. But the usage limits are very real.
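Here is that back-of-the-envelope arithmetic as a sketch. The ~0.75 words-per-token ratio is a common rule of thumb for English, and the per-million-token prices are placeholder assumptions, not any provider's actual rate card:

```python
# Rough rule of thumb for English text: 1 token ~= 0.75 words.
# The real ratio varies by tokenizer and by the text itself.
WORDS_PER_TOKEN = 0.75

def estimate_tokens(word_count: int) -> int:
    return round(word_count / WORDS_PER_TOKEN)

# Hypothetical prices in dollars per 1M tokens -- placeholders only.
# Note that output tokens typically cost more than input tokens.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00

input_tokens = estimate_tokens(2_000)   # the pasted 2,000-word document
output_tokens = 800                     # a detailed response

cost = (input_tokens * INPUT_PRICE_PER_M
        + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

print(input_tokens)    # 2667
print(f"${cost:.4f}")  # about a cent and a half -- under two cents
```

Swap in your provider's real prices and your own word counts and the same three lines tell you what any exchange costs.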
The "context window" connection
You have probably seen numbers like "128K context" or "200K tokens" thrown around. That is the model's memory limit for a single conversation. It is measured in tokens because that is what the model actually works with.
If you have ever had an AI "forget" something you told it earlier in the conversation, there is a decent chance you hit the token limit. Everything past that boundary just falls off and is gone.
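Here is a minimal sketch of what "falling off" looks like, assuming the simplest possible strategy: drop the oldest tokens once the window is full. (Real services use smarter variants, like summarizing or trimming whole messages, but the effect is the same.)

```python
def fit_to_context(history_tokens: list[int], limit: int) -> list[int]:
    """Keep only the most recent `limit` tokens; anything older is gone."""
    return history_tokens[-limit:]

# Pretend the conversation so far produced these token IDs, oldest first.
conversation = list(range(300))   # 300 tokens of history
window = fit_to_context(conversation, limit=200)

print(len(window))   # 200 -- the window is full
print(window[0])     # 100 -- the first 100 tokens fell off the front
```

The tokens that fell off are not "forgotten" in any graceful sense; from the model's point of view, they were never there.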
(We will get into context windows properly in one of the next posts. For now, just know that tokens are the unit of measurement for everything.)
What this means for you
If you are just chatting with an AI casually, you probably do not need to worry about tokens too much. The free tiers are generous enough for most conversations.
But here is something worth understanding. Every message you send in a conversation includes the entire conversation history. The AI doesn't just receive your latest message; it receives everything back to the start of the conversation, plus your new message, every time you hit "enter". So a chat that starts at 500 tokens per exchange can quietly grow to 10,000 or 20,000 tokens per exchange by message 30, because the whole history is being sent every time. That is where usage caps and missing context usually come from.
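You can watch this compound with a quick sketch. The 500-tokens-per-exchange figure is invented for illustration:

```python
# Assume each user message plus the AI's reply adds ~500 new tokens.
NEW_TOKENS_PER_EXCHANGE = 500

history = 0
for message_number in range(1, 31):
    # Every request re-sends the ENTIRE history plus the new exchange.
    sent_this_time = history + NEW_TOKENS_PER_EXCHANGE
    history += NEW_TOKENS_PER_EXCHANGE
    if message_number in (1, 10, 30):
        print(message_number, sent_this_time)
# Prints:
# 1 500
# 10 5000
# 30 15000
```

By message 30 you are paying for 15,000 tokens per exchange even though you only typed a few dozen new words, which is exactly where surprise usage caps come from.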
Pro tip: start new conversations frequently, both to stay under your usage limits and to keep the focus on the task at hand. You will also get more helpful responses to your current questions: when you change topics mid-conversation, the LLM is still weighing everything you brought up earlier, even if it is unrelated. Understanding this is a prerequisite to good prompt engineering.
Where tokens really start to matter is when you are building things. Automating workflows, processing documents, or running agents that make multiple calls. That is when tokens stop being an abstract concept and start being a line item in your budget.
Next time: do you actually need to care which AI you use? Honestly, it depends, but probably not the way you think...
If there is anything I left out or could have explained better, tell me in the comments.