- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You ask Claude about the code you pasted five messages ago and it responds like it has never seen it. You did not do anything wrong. The model never forgot — it was never given the code in the first place.
That sentence is the whole post. The rest is why it is true and what to do about it.
If you are building on top of LLMs, three ideas do 90% of the debugging for you: what a token is, what a context window is, and the fact that the model has no memory between calls. Miss any of those and you will spend an afternoon convinced the API is broken when it is working exactly as specified.
A token is not a word
The model does not read English. It reads tokens. A token is a chunk of text that the model's tokenizer decided was frequent enough to get its own numeric id. The algorithm most providers use is called byte-pair encoding. Start with bytes, merge the most common adjacent pairs, repeat until you have a fixed-size vocabulary — typically 100k to 200k entries for modern models.
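The merge loop is easy to sketch. Below is a toy version of the idea, operating on characters rather than bytes and learning merges from a single string — real tokenizers like tiktoken ship large merge tables trained once over huge corpora, so this is an illustration of the algorithm, not of any production tokenizer:

```python
from collections import Counter

def bpe_merges(text: str, num_merges: int):
    """Toy byte-pair encoding on characters: repeatedly merge the most
    frequent adjacent pair of symbols into a single new symbol."""
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats; further merges gain nothing
        merges.append(a + b)
        # replace every occurrence of the pair with the merged symbol
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges
```

Run it on a string with repeated substrings and you can watch frequent pairs collapse into single symbols — the same reason common English words end up as one token while rare words stay split.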
The consequences are small but load-bearing:
- Common words are one token. Uncommon words split into two or three.
- Whitespace is part of the token. `" the"` with a leading space is a different token from `"the"`.
- Punctuation gets its own tokens.
- Non-English text is heavier — a Japanese sentence can use 2-3x more tokens than its English translation.
- Code is heavier than prose. Indentation, symbols, and identifiers all cost.
Easiest way to see it is to count. Install the OpenAI tokenizer:
```shell
pip install tiktoken
```
Then:
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

for s in [
    "Hello world",
    "antidisestablishmentarianism",
    "def add(a, b):\n return a + b",
    "東京は日本の首都です",
]:
    tokens = enc.encode(s)
    print(f"{len(tokens):>3} tokens | {s!r}")
```
Output:
```
  2 tokens | 'Hello world'
  6 tokens | 'antidisestablishmentarianism'
 11 tokens | 'def add(a, b):\n return a + b'
 10 tokens | '東京は日本の首都です'
```
Two observations. A long English word is not one token — the tokenizer breaks it apart. And a short Japanese sentence costs as much as a full line of Python.
A rough heuristic that holds up for English prose: 1 word ≈ 1.3 tokens, or about 750 words per 1,000 tokens. For code, closer to 1 line of typical Python ≈ 10-15 tokens. A 500-line file is 5,000 to 8,000 tokens before you add your question.
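When the real tokenizer is not at hand, the heuristic above can be wrapped in a throwaway estimator. The 1.3 ratio comes from the rule of thumb in this post, and the function name is mine — treat it as a ballpark, not an API:

```python
def estimate_tokens(text: str) -> int:
    """Ballpark only: ~1.3 tokens per whitespace-separated English word.
    Undercounts code and non-English text badly; use a real tokenizer
    (tiktoken, or your provider's counting endpoint) for anything that
    affects cost or truncation."""
    return round(len(text.split()) * 1.3)
```

Useful for quick sanity checks in a REPL; never for enforcing a context budget.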
You can confirm any claim in this post against the tokenizer. Do not trust your intuition on length — count.
A context window is a fixed-size budget
The context window is the maximum number of tokens the model can look at in a single call. That is input plus output combined, with some variation by provider on exactly how the budget is split.
In April 2026 the common tiers look roughly like this:
- GPT-4o: 128k tokens
- Claude Sonnet 4.5: 200k tokens, with a 1M-token tier for selected customers
- Gemini 2.5 Pro: 1M tokens, 2M in preview
- Most open-source Llama-family models: 8k to 128k
Numbers move. The shape does not. Every model has a limit, and every request is capped at that limit.
The window exists because the underlying attention mechanism has a compute cost that grows with the square of the sequence length. Attention is the operation that lets a transformer weigh every token against every other token. Double the context, quadruple the work. Providers make engineering choices to soften the curve, but the quadratic ceiling is still there under the optimizations.
Practical translation: a 128k context window is not a free 128k. It is a budget you are renting, priced by how much of it you actually use.
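One way to make the budget concrete is to treat it as simple arithmetic: window minus prompt minus the output allowance you reserve for the reply. The numbers in the example are illustrative:

```python
def remaining_budget(window: int, prompt_tokens: int, reserved_output: int) -> int:
    """Tokens left for extra context after the prompt and a reserved output
    allowance. A negative result means the request will be rejected or
    truncated by the provider."""
    return window - prompt_tokens - reserved_output

# e.g. a 128k window, a 110k-token prompt, 4k reserved for the reply:
# 14k tokens of headroom left for retrieval results or chat history
```

The useful habit is reserving output tokens up front; a prompt that "fits" with zero room for the reply does not actually fit.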
You are not chatting. You are resending a transcript.
Here is the part that throws most people.
When you have a "conversation" with an LLM through the API, the model does not remember the last message. It cannot. The API is a stateless HTTP endpoint. Between your messages, the weights of the model do not change, no session object sits on a server, no thread-local state survives. Every call is a fresh call.
What actually happens when you send turn five of a chat:
- Your client takes the full history (system prompt, turn 1, turn 2, turn 3, turn 4) and appends turn 5.
- That entire blob is tokenized.
- The whole thing goes into a single API request as `messages=[...]`.
- The model reads the full transcript from scratch, generates a reply, and returns it.
- Your client appends the reply to the local history for next time.
The pattern in code:
```python
from openai import OpenAI

client = OpenAI()

history = [
    {"role": "system", "content": "You are a code reviewer."},
]

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=history,  # the full transcript every time
    )
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```
The line that matters is messages=history. The whole thing goes out on every call. Memory lives in your process, not in OpenAI's.
This is why your Claude tab in the web UI sometimes "forgets" code you pasted earlier. The web UI is a chat wrapper. When the transcript grows past a threshold, the wrapper starts dropping or summarizing old turns before sending. The code you pasted in message 2 might not be in the payload for message 8. The model is not forgetful. The wrapper is pruning.
In a production API integration, this is your job. You decide what stays in the transcript. You decide what gets summarized. You decide what gets dropped. The model will not tell you it is missing context — it will confabulate.
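A minimal sketch of that pruning job: keep the system prompt, walk the turns newest-first, and drop the oldest ones that no longer fit. The `count_tokens` callable is supplied by the caller (e.g. a tiktoken wrapper), and keep-the-most-recent is one policy among many — summarizing old turns is another:

```python
def trim_history(history: list[dict], budget: int, count_tokens) -> list[dict]:
    """Keep the system prompt plus the most recent turns that fit the
    token budget. count_tokens(text) -> int is caller-supplied."""
    system, turns = history[0], history[1:]
    total = count_tokens(system["content"])
    kept = []
    for msg in reversed(turns):          # walk newest -> oldest
        cost = count_tokens(msg["content"])
        if total + cost > budget:
            break                        # this turn and everything older is dropped
        kept.append(msg)
        total += cost
    return [system] + kept[::-1]         # restore chronological order
```

Call it on `history` right before each API request. The important property is that the trimming is explicit and in your code, where you can log what was dropped.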
What 128k actually feels like
Numbers without a frame are noise. Rough reference points:
- A tweet: 30-40 tokens.
- This blog post so far: around 1,200 tokens.
- A typical README: 500-1,500 tokens.
- A medium source file (500 lines of Python): 5,000-8,000 tokens.
- The full text of The Great Gatsby: around 70,000 tokens.
- The HTTP/1.1 RFC: around 120,000 tokens.
So "128k context" means you can, in principle, paste the entirety of RFC 2616 and ask a question about it. You cannot paste two RFCs. You cannot paste one RFC plus 500 lines of code plus five turns of chat and expect everything to fit.
Long context is not free
Three reasons to be careful when reaching for the biggest window you can find.
It costs money. Most providers price per input token. A 100k-token request can be 20-30x the cost of a 4k-token request. Anthropic and Google both have tiered pricing that gets more expensive above certain thresholds — a 200k-token Claude call is not twice the price of a 100k one.
It costs latency. Prompt processing time scales with input length. A 100k-token prompt can take 5-10 seconds of time-to-first-token before streaming begins. Your users will notice.
Attention degrades in the middle. The lost-in-the-middle effect, documented by Liu et al. in 2023 and reproduced many times since, shows that models recall information at the beginning and end of a long context far better than information in the middle. Pasting a 100k-token document and asking about something in the middle of it is not the same request as asking about something at the edges, even though the prompt looks identical to you.
The engineering conclusion: a smaller, well-chosen context almost always beats a giant one. The skill is picking what to include.
What to do with this
Four moves that pay off immediately once you internalize the mental model:
- Count tokens, do not guess. Before you ship a prompt, run it through `tiktoken` or the Anthropic token-counting API and know your baseline.
- Decide explicitly what stays in the transcript. Treat conversation history as a cache you are managing, not a memory the model is maintaining. Summarize old turns once they are no longer load-bearing.
- Put the important thing near the edges. The user's current question goes last. The system prompt's critical instructions go first. Avoid burying the lede in a middle that the model will half-read.
- Log your token counts per request. Input tokens, output tokens, total. You will not catch cost or truncation problems without them, and you will not notice when a prompt template starts drifting from 2k to 40k tokens because a retrieval step changed.
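For the last move, a minimal logging hook is enough to start. The attribute names below match the `usage` object OpenAI returns on a chat completion response (`prompt_tokens`, `completion_tokens`, `total_tokens`); other providers report the same numbers under different names:

```python
import logging

logging.basicConfig(level=logging.INFO)

def log_usage(model: str, usage) -> None:
    """Log per-request token counts from a completion response's usage
    object. Wire this into whatever structured logging you already have."""
    logging.getLogger("llm.usage").info(
        "model=%s input_tokens=%d output_tokens=%d total_tokens=%d",
        model,
        usage.prompt_tokens,
        usage.completion_tokens,
        usage.total_tokens,
    )
```

Call it right after each request, e.g. `log_usage("gpt-4o", resp.usage)`. A week of these logs is what lets you notice a prompt template drifting from 2k to 40k tokens.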
The model is a token-in, token-out function that runs exactly once per call. Everything that feels like conversation, memory, or personality is an illusion built on top of that function by the code around it. Once you see it that way, the behavior stops looking mysterious.
If this was useful
I wrote a book about running LLM applications in production: Observability for LLM Applications. Chapter 16 is devoted to token accounting, context-window budgets, and the cost math that falls out of everything above. The paperback is live and the ebook drops Apr 22.
I am also building Hermes IDE, an IDE for developers who ship with Claude Code and other AI coding tools. The GitHub repo is where it lives.