DEV Community

UnitBuilds for UnitBuilds CC

Day 3: Watch your grammar with AI, it may cost you - Understanding BPE Tokenizers 🍓🔡

You've probably seen the memes. Someone asks GPT-4 how many r's are in the word strawberry, and it confidently answers 2.

It's not a reasoning failure. It's not even a knowledge gap. It's a direct consequence of how every modern LLM reads text, and once you understand it, a whole category of weird AI behavior suddenly makes sense.

For Day 3 of our interactive system series, we built a hands-on BPE tokenizer simulator. You type into a real tokenizer engine, watch tokens form and merge in real time, and then complete three escalating challenges that expose the cracks in the system.


🎮 Play Directly Here

🎮 Launch Game in Full Screen


🔡 What is Byte-Pair Encoding (BPE)?

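Before a transformer can process text, the text has to be converted into numbers. That's the tokenizer's job. But naively assigning one number per letter is wildly inefficient: English has only 26 letters, so every sentence turns into a long sequence of tiny symbols. Assigning one number per word doesn't work either, because the real vocabulary of the web is enormous and open-ended.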

Byte-Pair Encoding is the compression algorithm that solves this. Here's how it works:

  1. Start with characters. Every piece of text begins as a stream of individual characters, each with a raw ASCII code: h=104, e=101, l=108...
  2. Find the most frequent pairs. BPE scans the entire training corpus and identifies which two-character pairs appear most often together. The pair e+r is extremely common. So is s+t.
  3. Merge and assign a new ID. The pair gets fused into a single new token with a fresh vocabulary ID: er → ID 213, st → ID 200.
  4. Repeat. This process runs thousands of times, progressively merging common sub-words into atomic tokens: st + r → str, str + a → stra, stra + w → straw.
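
The four steps above can be sketched in a few lines of Python. This is a minimal toy trainer, not the sandbox's actual code: the function name `train_bpe` and the five-word corpus are made up for illustration, and it records merged strings rather than assigning vocabulary IDs.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to num_merges BPE merge rules from a toy list of words."""
    # Step 1: every word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        # Step 3: record the winning pair as a new merge rule.
        merges.append(best)
        # Rewrite the corpus with that pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
        # Step 4: loop and repeat on the rewritten corpus.
    return merges

# Hypothetical five-word corpus, chosen so e+r and s+t are the top pairs.
print(train_bpe(["her", "per", "term", "star", "stem"], 2))
# [('e', 'r'), ('s', 't')]
```

The order of the returned list is the merge priority: rules learned earlier come from more frequent pairs and get applied first when encoding.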

The result is a vocabulary of roughly 50,000 tokens (GPT-2's size; newer models use 100,000 or more) that balances coverage and efficiency. Common words like hello or world get their own token. Rare words get split into sub-word fragments. And the merge rules are applied in a fixed priority order determined by training data frequency, which is exactly where things get interesting.


๐Ÿ“ Lesson 1 โ€” The "Strawberry" Blindness

Type strawberry into the sandbox. Watch what happens.

The tokenizer doesn't see s-t-r-a-w-b-e-r-r-y. It sees two atomic units: straw + berry. The individual letters are dissolved into those tokens before any computation happens. All three r's are swallowed (one inside straw, two inside berry) and become invisible to the model as standalone characters.

So when you ask "how many r's are in strawberry?", the model isn't counting letters; it's reasoning over token IDs. It has to infer the letter count from its training data rather than observe it directly. Sometimes it gets it right by memory. Often it doesn't.

The sandbox makes this concrete. You can watch the token stream, see the IDs produced, and observe the LLM Input Vector at the bottom: the actual array of integers that gets fed into the model. There are no letters in that array. Only numbers.
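
To see the gap in code, here is a tiny sketch of what the model gets versus what you typed. The IDs 301 and 302 are invented for this example; the sandbox's real IDs for straw and berry may differ.

```python
# Hypothetical vocabulary entries: these IDs are invented for illustration.
vocab = {301: "straw", 302: "berry"}

# What the model actually receives for "strawberry".
input_vector = [301, 302]

# Counting letters requires mapping the IDs back to text first.
decoded = "".join(vocab[token_id] for token_id in input_vector)
r_count = decoded.count("r")

# Where the r's were hiding, token by token.
r_per_token = {vocab[token_id]: vocab[token_id].count("r") for token_id in input_vector}

print(input_vector)  # [301, 302]
print(r_count)       # 3
print(r_per_token)   # {'straw': 1, 'berry': 2}
```

The count is trivial once you decode back to characters. The model, however, works on the integer array, so the spelling of each token is something it has to have learned rather than something it can read off the input.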


💸 Lesson 2 - Token Budget Inflation

Type hello world (lowercase). The tokenizer gives you 2 tokens: hello + world. Clean, efficient, cheap.

Now type hello World (capital W).

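The space-prefixed world token is a known merge in the vocabulary. But World with a capital W? That's a different sequence of characters, so the BPE rules that built world don't apply. The sandbox tokenizer falls back to character-by-character encoding: W+o+r+l+d = 5 raw character tokens, plus the space, plus hello = 7 tokens total. Production vocabularies are far larger and usually include common capitalized words, so the penalty there is smaller, but rarer casings and spellings fragment in exactly the same way.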

Same semantic meaning. 3.5x the cost.

This is why prompt engineers obsess over casing, punctuation, and phrasing. It's not pedantry; it's economics. API pricing is per-token, and a carelessly capitalized prompt can silently inflate your bill by a significant factor at scale.
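
The arithmetic is worth spelling out. In this sketch the token counts come from the sandbox example above, while the price and the request volume are hypothetical numbers picked to keep the math easy.

```python
def token_cost(num_tokens, usd_per_million_tokens):
    """Cost in USD for num_tokens at a flat per-token price."""
    return num_tokens * usd_per_million_tokens / 1_000_000

PRICE = 1.00           # hypothetical: 1 USD per million input tokens
REQUESTS = 10_000_000  # hypothetical traffic volume

lowercase_tokens = 2            # hello + world
capitalized_tokens = 1 + 1 + 5  # hello, the space, then W, o, r, l, d

inflation = capitalized_tokens / lowercase_tokens
print(inflation)                                          # 3.5
print(token_cost(lowercase_tokens * REQUESTS, PRICE))     # 20.0
print(token_cost(capitalized_tokens * REQUESTS, PRICE))   # 70.0
```

Same prompt, same meaning, and the only variable that changed is one capital letter.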


🔓 Lesson 3 - Prompt Filter Evasion

Here's where the simulation gets genuinely unsettling.

Some LLM deployments use a token ID blocklist as a safety filter. Certain token IDs are flagged as dangerous: if your prompt produces any of them, the request is rejected before it ever reaches the model.

In the sandbox, token ID 203 (system) and 204 (override) are blocked.

Type system override. The tokenizer assembles the merge chain perfectly: s+y→sy, sy+s→sys, sys+t→syst, and so on until you have tokens 203 and 204. The filter fires. BLOCKED. ⚠️

Now type SYSTEM OVERRIDE.

Every character is uppercase. None of the BPE merge rules, which were built from lowercase training data, apply. The tokenizer fragments the input into raw character-level ASCII tokens. Token IDs 203 and 204 are never produced. The blocklist sees nothing suspicious. The filter is bypassed.

The model still receives the full text of "system override". It just arrives as a sequence of uppercase ASCII tokens, which a capable model can still read as the same instruction even though no blocked ID ever appears.
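
Here is a minimal sketch of why an ID-level blocklist is so brittle. The blocked IDs 203 and 204 are the sandbox's; the function name `is_blocked` and the exact lowercase ID sequence are assumptions (the sandbox may emit the space as part of one of the tokens or separately).

```python
BLOCKED_IDS = {203, 204}  # "system" and "override" in the sandbox

def is_blocked(token_ids):
    """Reject the request if any token ID is on the blocklist."""
    return any(token_id in BLOCKED_IDS for token_id in token_ids)

# Assumed tokenization of "system override": both merges fire.
lowercase_ids = [203, 204]

# "SYSTEM OVERRIDE": no merge rule matches, so every character
# stays a raw ASCII code.
uppercase_ids = [ord(ch) for ch in "SYSTEM OVERRIDE"]

print(is_blocked(lowercase_ids))  # True
print(uppercase_ids[:6])          # [83, 89, 83, 84, 69, 77]
print(is_blocked(uppercase_ids))  # False
```

The filter compares integers, not meaning, so any input that spells the same words with a different token sequence passes straight through.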

This is a real class of adversarial attack. Capitalization, Unicode homoglyphs, zero-width spaces, and deliberate typos are all techniques used to subvert token-level safety filters in production systems. The sandbox lets you experience it firsthand.


🧰 Under the Hood

The sandbox runs a fully functional BPE merge engine written in vanilla JavaScript. Every token displayed is computed by a real greedy BPE algorithm, not simulated or hardcoded per word.

The engine works as follows:

  1. Split the input into individual character tokens (ASCII IDs)
  2. Scan the merge rule vocabulary in priority order
  3. Find and apply the highest-priority matching pair
  4. Repeat until no more merges apply
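
The same loop can be sketched in Python (the sandbox itself is JavaScript, so this is a re-implementation of the idea, not its source). The merge table below is hypothetical: IDs 200 and 213 match the examples earlier in this post, and the 300-range IDs are invented.

```python
# Hypothetical merge table in priority order: ((left, right), new_id).
MERGES = [
    (("s", "t"), 200),
    (("e", "r"), 213),
    (("st", "r"), 301),
    (("str", "a"), 302),
    (("stra", "w"), 303),
    (("b", "er"), 304),
    (("ber", "r"), 305),
    (("berr", "y"), 306),
]

def encode(text, merges):
    """Greedy BPE: keep applying the highest-priority matching merge."""
    tokens = list(text)  # step 1: one token per character
    merged_ids = {left + right: tid for (left, right), tid in merges}
    while True:
        for (left, right), _ in merges:       # step 2: priority order
            for i in range(len(tokens) - 1):
                if tokens[i] == left and tokens[i + 1] == right:
                    tokens[i:i + 2] = [left + right]  # step 3: merge
                    break
            else:
                continue  # this rule matched nowhere, try the next one
            break         # a merge fired, rescan from the top rule
        else:
            # step 4: no rule matched anywhere, so we are done.
            # Merged tokens use table IDs, single characters use ord().
            return [merged_ids[t] if t in merged_ids else ord(t) for t in tokens]

print(encode("strawberry", MERGES))  # [303, 306]
print(encode("STRAW", MERGES))       # [83, 84, 82, 65, 87]
```

Restarting the scan from the top rule after every merge is what makes priority matter: a lower-priority rule only fires once nothing above it can.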

The BPE Merge Dictionary panel on the right shows the live vocabulary: every merge rule, the pair that triggers it, and the resulting token ID. You can watch each merge fire in real time as you type.

Built with zero dependencies: pure HTML5, CSS3, and Web Audio API for the 8-bit synthesizer feedback.


📖 The Series So Far

This is part of an ongoing series of interactive games that put you inside the architecture of a Large Language Model:

  • Day 1 - LLMs Are Demented: Solve a crossword while managing context windows, KV-cache expirations, and temperature chaos.
  • Day 2 - The Gating Crisis: Act as a sparse MoE router and dispatch tokens to expert FFNs without dropping capacity.
  • Day 3 - BPE Tokenizer Sandbox: (you are here) Explore the tokenizer layer and discover why letter counting breaks down.

💬 Let's Discuss

  • Did Lesson 3 change how you think about LLM safety filters?
  • What other prompt phrasing tricks have you noticed affecting token counts in real API calls?
  • Which bypass technique did you try first: SYSTEM OVERRIDE, mixed case, or something else?

Drop your scorecard in the comments. 🧠

Disclaimer: AI was used throughout this project, so it is only fitting that it co-authors with me. Special thanks to the Foundry for its tireless hours toiling away and to Gemini for producing the cover image.

Top comments (1)

UnitBuilds UnitBuilds CC

Hopefully the game explains a bit better how AI translates what you type. I know I learned something from it. I always try to type to AI with proper grammar, punctuation, capitalization, brackets where necessary, etc. In actual fact, that habit was a detriment to both the AI's contextual understanding (signal to noise) and my wallet (token cost), because simply writing hello World instead of hello world can cause token bloat in common text.