LLM Tokenization: GPT vs Claude vs Llama Edge Cases

#llm #tokenization #gpt4 #claude

The 🤗 Emoji Cost Me $47 in API Calls

I ran a batch job that sent 10,000 user-generated messages to GPT-4. The average message was about 200 characters. I budgeted for ~50 tokens per message based on the "~4 characters per token" rule everyone quotes.

Actual cost? 3.2x what I budgeted.

Turns out half the messages contained emojis, non-ASCII usernames, or code snippets. A single 🤗 hugging face emoji? It's 4 bytes of UTF-8, and GPT-4's byte-level tokenizer can split it into as many as 4 tokens, one per byte. The username "José" costs 3 tokens, where the 4-characters-per-token rule predicts about 1. And don't get me started on what happened when someone pasted a JSON blob with Unicode escape sequences.
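A quick way to see why the rule of thumb breaks is to compare character counts with UTF-8 byte counts. The sketch below uses only the standard library and does not call a real tokenizer. It assumes a byte-level BPE tokenizer, for which the byte count is an upper bound on the token count, and the helper names (`estimate_tokens_by_chars`, `utf8_byte_count`) are made up for illustration.

```python
def estimate_tokens_by_chars(text: str, chars_per_token: float = 4.0) -> float:
    """Rule-of-thumb estimate: number of characters divided by ~4."""
    return len(text) / chars_per_token


def utf8_byte_count(text: str) -> int:
    """Size of the text in UTF-8 bytes.

    Assumption: for a byte-level BPE tokenizer this is an upper bound on
    the token count, because every token covers at least one byte.
    """
    return len(text.encode("utf-8"))


# Escapes are used so the exact code points are unambiguous.
samples = {
    "plain ASCII": "hello",
    "accented name": "Jos\u00e9",          # José, with precomposed é
    "emoji": "\U0001F917",                 # the hugging face emoji
    "JSON unicode escape": "\\u00e9",      # six literal ASCII characters
}

for label, text in samples.items():
    print(
        f"{label}: chars={len(text)}, "
        f"utf8_bytes={utf8_byte_count(text)}, "
        f"rule_of_thumb={estimate_tokens_by_chars(text):.2f}"
    )
```

The emoji is one character but four bytes, so the character-based estimate says a quarter of a token while the tokenizer may spend up to four. For real counts, run the text through the model's own tokenizer (for GPT-4, `tiktoken.encoding_for_model("gpt-4")` and its `encode` method) rather than either of these proxies.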

This isn't academic trivia. Tokenization edge cases directly impact your API bill, context window usage, and whether your RAG pipeline fits the documents you need. Here's what actually breaks in production.