Sreeraj Sreenivasan

Posted on May 24

Demystifying Tokens: How AI Actually Reads Your Code and Prompts

#ai #beginners #machinelearning #productivity

If you’ve been building with Large Language Models (LLMs), integrating APIs, or just messing around with prompt engineering, you’ve hit the word token a million times.

You know it’s the unit you get billed for. You know it’s the thing that fills up your "context window." But how does it actually work under the hood?

If you think LLMs read text word-by-word like humans, or character-by-character like traditional code compilers, think again. Let's pull back the curtain on tokenization and see what’s really going on when you hit "Send."

What Exactly is a Token?

To an AI, a token is the fundamental building block of language.

LLMs don't understand English, Python, or JavaScript directly. Instead, they run raw text through a processing step called tokenization, which chops strings into smaller pieces. A token can be a single character, a part of a word (sub-word), an entire word, or a punctuation mark, and it often carries the space that precedes it.

Here is a quick rule of thumb for English text:

1 token ≈ 4 characters
1 token ≈ 0.75 words
100 words ≈ 130–140 tokens
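
If you want a ballpark figure without loading a tokenizer, the rules of thumb above can be turned into a tiny estimator. This is only a rough sketch: the function names are made up for illustration, and real counts vary by tokenizer and by language.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text


def estimate_tokens_from_chars(text: str) -> int:
    """Rough estimate: about one token per four characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Rough estimate: about one token per 0.75 words."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


sample = "Learning AI is fun!"
print(estimate_tokens_from_chars(sample))  # 19 characters -> 5
print(estimate_tokens_from_words(sample))  # 4 words -> 6
```

The two estimates disagree slightly, which is a good reminder that these are heuristics and not measurements.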

But things get weird when you look closer. Let's see how an AI tokenizer actually splits a sentence.

Tokenization in Action

Take a simple sentence like: "Learning AI is fun!"

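A typical LLM tokenizer (like OpenAI's cl100k_base used for GPT-4) won't necessarily see four distinct words. The exact split depends on the tokenizer's vocabulary, but an illustrative breakdown looks like this:

Fragment	Token Type	Reason
"Learn"	Sub-word	The root of the word
"ing"	Suffix	Common sub-word ending
" AI"	Space + Word	The space before a word is grouped with it
" is"	Space + Word	The leading space is part of the token, so no separate space token is needed
" fun"	Space + Word	The leading space is part of the token
"!"	Punctuation	Standard punctuation gets its own token

In this illustrative split, a 4-word sentence becomes 6 tokens. A real tokenizer may keep a common word like "Learning" as a single token, so treat the exact count as something to measure, not assume.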
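
To see how the space ends up glued to the front of a word, here is a minimal sketch of the kind of regex pre-splitting that GPT-style tokenizers run before any sub-word merging. The pattern below is a simplified, ASCII-only assumption written for this example, not the pattern any real tokenizer ships with, and a real tokenizer may split each piece further.

```python
import re

# Simplified, ASCII-only pattern (an assumption for illustration):
# an optional leading space followed by letters, digits, or punctuation,
# with any leftover whitespace as its own piece.
PIECE_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def split_into_pieces(text: str) -> list[str]:
    """Split text into word-like pieces, keeping each leading space attached."""
    return PIECE_PATTERN.findall(text)


print(split_into_pieces("Learning AI is fun!"))
# ['Learning', ' AI', ' is', ' fun', '!']
```

Notice that joining the pieces back together reproduces the original string exactly, which is why tokenization is lossless.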

The Developer's Gotcha: White Space and Code

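Because spaces are often baked into the tokens themselves, formatting matters. In programming languages like Python, where indentation defines scope, every level of indentation adds tokens. How many depends on the tokenizer: older vocabularies such as GPT-2's often spend a separate token on each indentation space, while newer ones such as cl100k_base include tokens for whole runs of spaces, so the overhead is smaller but still not zero.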

# This code block uses more tokens than you think because 
# indentation spaces are processed as distinct token fragments.
def hello_world():
    print("Hello, World!")
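
How much that indentation costs depends on how a tokenizer treats whitespace. The sketch below is not a real tokenizer: it only counts whitespace-related tokens under two toy policies (both invented for this example) to show why the same snippet can be cheap or expensive.

```python
import re

SNIPPET = 'def hello_world():\n    print("Hello, World!")\n'


def count_space_tokens_individually(code: str) -> int:
    """Toy policy A: every space character costs one token."""
    return code.count(" ")


def count_space_tokens_merged(code: str) -> int:
    """Toy policy B: each run of consecutive spaces costs one token."""
    return len(re.findall(r" +", code))


print(count_space_tokens_individually(SNIPPET))  # 6
print(count_space_tokens_merged(SNIPPET))        # 3
```

The gap widens quickly with deeper nesting, which is one reason it is worth measuring real code with the tokenizer your model actually uses.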

Why Don't We Just Use Whole Words?

It seems like an extra step, so why do AI researchers rely on sub-word tokenization instead of a massive dictionary of whole words?

1. The "Out of Vocabulary" (OOV) Problem

If an LLM only recognized whole words, what would happen when a user makes a typo, mentions a brand-new framework name, or uses internet slang (like rizz)? The model would have no entry for that word and would have to treat it as unknown. By using sub-words (like breaking ungettable into un + get + table), the model can represent words it has never seen as combinations of pieces it already knows, which gives it a much better chance of inferring their meaning.
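
Here is a minimal sketch of the idea using greedy longest-match splitting, which is similar in spirit to how WordPiece-style tokenizers segment words. It is not the BPE algorithm that tiktoken uses, and the vocabulary is a hypothetical toy one. The point is only that an unseen word never breaks the system: in the worst case it falls back to single characters.

```python
# Hypothetical toy vocabulary; real vocabularies are learned from data.
TOY_VOCAB = {"un", "get", "table", "learn", "ing"}


def split_into_subwords(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unknown stretches fall back to single characters."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocab entry starts here, so emit one character and move on.
            pieces.append(word[start])
            start += 1
    return pieces


print(split_into_subwords("ungettable", TOY_VOCAB))  # ['un', 'get', 'table']
print(split_into_subwords("rizz", TOY_VOCAB))        # ['r', 'i', 'z', 'z']
```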

2. Computational Efficiency

English has hundreds of thousands of dictionary words, and far more once you count names, technical terms, and every tense and plural form. Giving each one its own entry would make the model's embedding table and output layer enormous, since both grow with vocabulary size. With a fixed vocabulary of roughly 50,000 to 100,000 sub-word tokens (some newer tokenizers use more), the model can assemble virtually any word, like building from a bucket of Lego bricks.
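
Some quick arithmetic makes the cost concrete. An embedding table stores one vector per vocabulary entry, so its size is vocabulary size times vector width. The width of 4096 and the one-million-word vocabulary below are hypothetical numbers picked only for illustration.

```python
def embedding_parameters(vocab_size: int, embedding_dim: int) -> int:
    """Number of weights in an embedding table: one vector per vocabulary entry."""
    return vocab_size * embedding_dim


EMBEDDING_DIM = 4096  # hypothetical vector width, chosen only for illustration

subword_table = embedding_parameters(100_000, EMBEDDING_DIM)
whole_word_table = embedding_parameters(1_000_000, EMBEDDING_DIM)

print(f"{subword_table:,}")     # 409,600,000
print(f"{whole_word_table:,}")  # 4,096,000,000
```

Under these assumptions, a tenfold larger vocabulary means a tenfold larger table before the model has learned anything else.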

3. Turning Text into Vectors

Computers only process numbers. Tokenization is the bridge. Once text is split into tokens, each unique token is mapped to a specific integer ID.

Learn might be ID 4321
ing might be ID 128

These IDs are then converted into high-dimensional vectors (embeddings) so the LLM can run complex matrix multiplication to predict the next logical token.
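
The lookup itself is simple enough to sketch in plain Python. Everything here is a toy assumption: the vocabulary, the IDs, the 4-dimensional vectors, and the random values (a trained model learns its vectors instead of drawing them at random).

```python
import random

# Hypothetical toy vocabulary and IDs; real tokenizers assign their own IDs.
TOKEN_TO_ID = {"Learn": 0, "ing": 1, " AI": 2, " is": 3, " fun": 4, "!": 5}
EMBEDDING_DIM = 4  # tiny on purpose; real models use hundreds or thousands

rng = random.Random(0)  # fixed seed so the table is reproducible
# One row of random numbers per token ID (a trained model learns these values).
EMBEDDING_TABLE = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(len(TOKEN_TO_ID))
]


def embed(tokens: list[str]) -> list[list[float]]:
    """Map tokens to integer IDs, then look up one vector per ID."""
    ids = [TOKEN_TO_ID[token] for token in tokens]
    return [EMBEDDING_TABLE[token_id] for token_id in ids]


vectors = embed(["Learn", "ing"])
print(len(vectors), len(vectors[0]))  # 2 4
```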

The Context Window Budget

Every LLM has a Context Window (e.g., 8k, 32k, or even 1M+ tokens). Think of this as the model's short-term working memory. When you message a chatbot, the entire history of your conversation is bundled up and sent back to the API with every new prompt. The window has to hold both that history and the model's reply, so if your history reaches 4,000 tokens and the model's limit is 4,000, there is no room left to generate anything. The request either fails or the application has to drop or summarize older messages, which is why long chats seem to "forget" how they started.
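
Because the whole history is resent each time, input tokens add up faster than you might expect. Here is a small sketch of that arithmetic, assuming a hypothetical conversation where every message (yours and the model's) is 200 tokens long.

```python
def input_tokens_per_turn(message_sizes: list[int]) -> list[int]:
    """For each turn, the input size is the sum of every message sent so far."""
    totals = []
    running = 0
    for size in message_sizes:
        running += size
        totals.append(running)
    return totals


# Hypothetical conversation: five messages of 200 tokens each.
per_turn = input_tokens_per_turn([200] * 5)
print(per_turn)       # [200, 400, 600, 800, 1000]
print(sum(per_turn))  # 3000
```

Only 1,000 tokens of text were written, yet 3,000 input tokens were processed across the five requests.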

For developers, managing this budget is critical. Techniques like retrieval-augmented generation (RAG, often backed by a vector database), text summarization, and aggressive trimming of system prompts are all about keeping token costs low and preventing your application from hitting the context limit.
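
The simplest of these techniques is dropping the oldest messages. Below is a minimal sketch, assuming a stand-in counter based on the "about 4 characters per token" rule of thumb; the function names are hypothetical, and in a real app you would swap in your model's actual tokenizer.

```python
import math


def rough_token_count(text: str) -> int:
    """Stand-in counter using the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / 4)


def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = rough_token_count(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
print(len(trim_history(history, budget=25)))  # 2
```

Walking from newest to oldest keeps the most recent context, which is usually what the next reply depends on most.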

Want to Test It Yourself?

If you are writing backend code or optimizing prompts, don't guess your token counts. You can experiment with official tokenizer tools to see exactly how your text is being sliced:

OpenAI Tokenizer: An interactive web tool showing how text translates to token IDs.
tiktoken (Python): A fast BPE (byte pair encoding) tokenizer library you can integrate into your Python backends to count tokens locally before hitting an API.

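import tiktoken

# Load the encoding named in this article (used by GPT-4-era models).
encoding = tiktoken.get_encoding("cl100k_base")
# encode() returns a list of integer token IDs.
tokens = encoding.encode("Learning AI is fun!")
# The exact count depends on the encoding, so print it instead of assuming it.
print(f"Token Count: {len(tokens)}")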
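
In a backend you may not want a missing dependency to take the whole request down. Here is one hedged way to wrap the snippet above: a hypothetical count_tokens helper that uses tiktoken when it can be imported and its vocabulary can be loaded, and otherwise falls back to the rough 4-characters-per-token estimate.

```python
import math


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken when possible, else fall back to an estimate."""
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # tiktoken is missing or its vocabulary could not be loaded:
        # use the ~4 characters per token rule of thumb instead.
        return math.ceil(len(text) / 4)


print(count_tokens("Learning AI is fun!"))
```

The fallback is only an approximation, so it is worth logging which path was taken if the numbers feed into billing or hard limits.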

Understanding tokens is the first step toward writing more cost-efficient prompts, building better AI apps, and understanding why models behave the way they do.

Over to you: Have you run into any weird bugs or massive cloud bills because of unexpected token usage? Let's talk about it in the comments below!

DEV Community