Fazal Mansuri

Posted on Jul 5 • Edited on Jul 24

📦 AI Context Engineering (Part 2): Tokens, Context Windows & Memory - Why More Context Isn't Always Better

#ai #softwareengineering #llm #productivity

In Part 1, we learned that building great AI applications isn't just about writing better prompts.

It's about providing the right context.

But that naturally leads to another question:

How much context can an AI actually understand?

If you've used ChatGPT, Claude, Gemini, Cursor or any AI coding assistant for a while, you've probably experienced something like this.

🤔 "Didn't I Already Tell You That?"

Imagine you're debugging a production issue with an AI assistant.

You start by explaining the architecture.

Then you share the API flow.

Then database schema.

Then logs.

Then stack traces.

After 30 minutes of conversation, you ask:

"So what's causing the bug?"

Instead of giving the answer, the AI responds:

"Could you share your database schema?"

You stare at the screen.

"I already did..."

Sometimes it even forgets details from earlier in the same conversation.

Naturally, people assume:

The AI has bad memory.
The model is unreliable.
The conversation is broken.

In reality, something completely different is happening.

You're running into one of the most important concepts in modern AI systems:

The Context Window.

Understanding this concept changes how you interact with AI—and more importantly, how you build AI-powered applications.

🧠 Before We Talk About Context Windows…

We first need to understand something much smaller.

Tokens.

Almost every AI provider mentions them.

Pricing is based on them.

Context windows are measured using them.

Yet many developers still think:

1 token = 1 word

That isn't true.

🔤 What Exactly Is a Token?

A token is the basic unit of text that an AI model processes.

Humans naturally read text as:

Characters
Words
Sentences

Large Language Models don't.

Before text reaches the model, it's converted into smaller pieces called tokens by a tokenizer.

The model never sees your original sentence.

It only sees a sequence of tokens.

Think of a tokenizer as a translator between humans and AI.

You write English
        ↓
Tokenizer
        ↓
Sequence of Tokens
        ↓
LLM

The tokenizer decides how text is split.

And that's why tokens aren't the same as words.

Tokens ≠ Words

This is probably the biggest misconception about LLMs.

Consider this sentence:

Hello World

You might expect:

2 words = 2 tokens

Sometimes that's true.

Sometimes it isn't.

Now consider:

internationalization

One word.

But depending on the tokenizer, it may be split into multiple tokens because the model has learned common sub-word patterns rather than every possible English word.

Similarly:

authentication
authorization
microservices

may also become multiple tokens.

The tokenizer isn't trying to understand grammar.

It's trying to represent text efficiently based on patterns it learned during training.

📦 Different Types of Content Produce Different Numbers of Tokens

One reason token counting surprises developers is that not all text is equal.

Take these examples.

Example 1 — Simple English

The server is running normally.

This is relatively efficient because it contains common words.

Example 2 — JSON

{
  "userId": 123,
  "isActive": true
}

Even though this looks short, punctuation, quotes, braces and field names also become tokens.

Example 3 — Source Code

if err != nil {
    return err
}

Programming languages introduce symbols, keywords, indentation and identifiers, all of which contribute to the token count.

Example 4 — URLs

https://example.com/api/v1/users/123

URLs often tokenize differently because of slashes, punctuation and path segments.

Example 5 — Emojis

🚀🔥🎉

Even emojis are tokenized.

Different models may tokenize the same emoji sequence differently.

💡 The Important Lesson

Tokens are not based on what humans consider words.

They're based on how the tokenizer breaks text into pieces.

That's why two pieces of text with the same number of characters can consume very different numbers of tokens.

🎯 Why Should Developers Care About Tokens?

Many developers only think about tokens when they receive an API bill.

In reality, tokens influence almost everything about an AI system.

They affect:

Cost
Latency
Context size
Response quality
Memory
Scalability

Every AI request starts with a token budget.

The more tokens you consume, the more work the model has to perform.

💰 Everything Consumes Tokens

Here's something many people don't realize.

When you send a request to an LLM, you're not paying only for your latest prompt.

The model processes everything included in the request.

That may include:

Content	Uses Tokens?
System Instructions	✅
User Prompt	✅
Conversation History	✅
Uploaded Documents	✅
Retrieved RAG Results	✅
Tool Outputs	✅
Function Call Results	✅
AI Response	✅

Even if you only typed:

Summarize this.

The model might actually receive thousands of tokens.

Imagine asking an AI coding assistant:

Why is this function failing?

Behind the scenes, the application might automatically include:

System instructions
Current source file
Nearby functions
Package imports
Build configuration
Previous conversation
Compiler errors

Your four-word prompt becomes a request containing thousands of tokens.

This is another example of Context Engineering in action.

🔄 Every New Request Starts Again

This surprises many engineers.

Suppose your conversation looks like this:

User:
Explain JWT Authentication.

Assistant:
...

User:
Can you show a Go example?

Assistant:
...

User:
Can you make it use middleware?

Many people imagine the model simply remembers everything internally.

That's not how most stateless LLM API interactions work.

For each new request, the application typically sends the conversation history (or a selected portion of it) again so the model has the necessary context to continue the conversation coherently.

Conceptually, it looks something like this:

Request 1

System Prompt
User Message

↓

LLM

------------------------

Request 2

System Prompt
Conversation History
New User Message

↓

LLM

------------------------

Request 3

System Prompt
Conversation History
Latest User Message

↓

LLM

As the conversation grows longer, more tokens are often included with each request unless the application trims, summarizes or otherwise manages the context.

This is one reason token usage can increase significantly during long conversations.

🪟 What Is a Context Window?

Now we can finally answer the question from the beginning.

If tokens are the building blocks…

Then what is the Context Window?

A context window is the maximum amount of information (measured in tokens) that a model can consider at one time while generating a response.

Think of it like a whiteboard.

Imagine you're solving a difficult problem with a teammate.

Everything written on the whiteboard is available to both of you.

You can read it.

Reference it.

Connect ideas.

Reason about it.

But the whiteboard isn't infinite.

Eventually, it fills up.

To continue working, you have two choices:

Erase older notes.
Replace them with a summary.

Large Language Models work in a very similar way.

The context window is that whiteboard.

Everything currently inside it is visible to the model while it generates its next response.

Everything outside it is invisible unless your application decides to bring it back.

📖 What's Actually Inside the Context Window?

One common misconception is that the context window contains only your prompt.

In reality, modern AI applications often build a much richer context before sending the request.

A simplified request might look like this:

System Instructions

↓

Conversation History

↓

Retrieved Knowledge (RAG)

↓

Uploaded Files

↓

Tool Results

↓

Current User Prompt

↓

LLM

Notice something interesting.

Your latest message may only be a tiny fraction of everything the model receives.

The application has already assembled the rest of the context behind the scenes.

This is why two AI tools using the same underlying model can produce very different answers.

The difference often isn't the model itself.

It's how effectively each application constructs the context before making the request.

🤔 Does a Bigger Context Window Always Mean Better AI?

At first glance, it sounds obvious.

If an AI model can process more information, shouldn't the answers always improve?

Not necessarily.

Imagine you're preparing for an interview.

Someone gives you:

Your resume
The job description
Company documentation
Engineering handbook
API documentation
Meeting notes
Internal wiki
Product roadmap
Last year's planning documents
Random Slack conversations

Could you answer the interview questions?

Probably.

Would all that information actually help?

Probably not.

In fact, it might make things worse because you'd spend time filtering out irrelevant information before focusing on what matters.

Large Language Models face a similar challenge.

A larger context window allows the model to see more information, but it doesn't mean every piece of that information is equally useful.

Good Context Engineering isn't about maximizing context.

It's about optimizing context.

🎯 More Context ≠ Better Context

One of the biggest mistakes in AI applications is thinking:

If some context helps,
more context must help even more.

That's rarely true.

Imagine building an AI coding assistant.

A developer asks:

Why is this API returning 500?

Option A:

Provide only:

The API handler
Relevant service
Error log

Option B:

Provide:

Entire repository
All dependencies
Every README
CI/CD configuration
Kubernetes manifests
Previous conversations
All database schemas

Technically, the model has more information.

Practically, you've introduced noise.

Finding the relevant signal becomes harder.

This is why mature AI systems spend significant effort deciding what not to include.

Context Engineering is often about removing information rather than adding it.

🔍 The "Lost in the Middle" Problem

Researchers have observed an interesting behavior in long-context language models.

When a prompt becomes very large, information buried in the middle of the context may receive less attention than information closer to the beginning or end.

This phenomenon is commonly referred to as the "Lost in the Middle" effect.

Imagine reading a 300-page document.

You usually remember:

The introduction
The conclusion

But details from page 147?

Much harder.

Language models can exhibit similar behavior under certain conditions.

This doesn't mean long-context models are ineffective.

It means that how information is organized inside the context matters, not just how much information is included.

🧠 Context Is NOT Memory

This is probably the biggest misconception about modern AI.

People often say:

"ChatGPT remembered what I told it last week."

What actually happened depends on the application.

It's important to distinguish between context and memory.

Context

Context is the information available only for the current request.

Once the request is complete, the model itself doesn't retain that context for future API calls.

Think of it as a whiteboard used during a meeting.

Erase the whiteboard and the discussion is gone.

Memory

Memory is information that an application intentionally stores and later reintroduces into future conversations.

For example:

You tell an AI assistant:

I primarily write Go and React.

The application might choose to save that preference.

Next week, when you ask:

Generate an API example.

The application retrieves your stored preference and injects it into the new request.

The model appears to "remember."

In reality, the application supplied that information again.

The memory lives in the application—not inside the language model itself.

🏗️ A Simple Mental Model

Think of an AI application as three separate components:

                User

                  │
                  ▼

      ┌─────────────────────┐
      │   Memory Store      │
      └─────────────────────┘
                  │
                  ▼
      ┌─────────────────────┐
      │ Context Builder     │
      │ - System Prompt     │
      │ - History           │
      │ - RAG Results       │
      │ - Tool Outputs      │
      └─────────────────────┘
                  │
                  ▼
      ┌─────────────────────┐
      │        LLM          │
      └─────────────────────┘

The LLM only sees the final assembled context.

It doesn't know where each piece of information came from.

💰 Why Tokens Matter for Cost and Performance

We've already learned that everything consumes tokens.

But token usage affects much more than pricing.

It also impacts:

Request latency
Throughput
Infrastructure cost
User experience

Consider these two requests.

Efficient Request

System Prompt

Relevant API documentation

Current User Question

Inefficient Request

System Prompt

Entire conversation

Complete codebase

Multiple PDFs

Large JSON payloads

Old logs

Current User Question

The second request requires the model to process significantly more information before generating a response.

Even if the answer only depends on a small portion of that data.

More tokens generally mean:

More computation
Higher latency
Higher cost

This is why production AI systems invest heavily in efficient context management.

⚙️ How Production AI Systems Manage Context

Well-designed AI applications rarely send everything they know to the model.

Instead, they carefully construct the context.

Some common strategies include:

✅ Retrieve Only What's Relevant

Instead of sending an entire knowledge base,

retrieve only the documents related to the user's question.

✅ Summarize Long Conversations

Rather than sending hundreds of previous messages,

summarize earlier parts while preserving important facts.

For example:

Instead of:

150 previous messages

Send:

Conversation Summary:

• User uses Go
• Building REST APIs
• Uses PostgreSQL
• Currently debugging authentication

Far fewer tokens.

Much easier for the model.

✅ Keep System Prompts Focused

System instructions should define behavior,

not contain an entire company handbook.

✅ Separate Long-Term Memory from Current Context

Store reusable preferences separately.

Retrieve them only when needed.

✅ Measure Token Usage

Don't optimize blindly.

Most AI SDKs expose token usage statistics.

Monitoring them helps identify unnecessary context and control costs.

🚫 Common Context Engineering Mistakes

Here are some patterns that often reduce AI quality rather than improve it.

❌ Dumping entire documents into every request

More information isn't automatically better.

❌ Sending the full conversation forever

Long conversations should be summarized or trimmed when appropriate.

❌ Treating memory as conversation history

They're different concepts.

❌ Ignoring token usage

Unexpected token growth often leads to higher costs and slower responses.

❌ Assuming bigger context windows solve everything

Context quality matters far more than context quantity.

🎯 Best Practices

If you're building AI-powered applications, keep these principles in mind:

✅ Send only the information required for the current task.

✅ Prefer relevant context over larger context.

✅ Separate memory from conversation history.

✅ Summarize older interactions instead of continuously expanding prompts.

✅ Monitor token usage in production.

✅ Design context deliberately instead of simply appending more text.

🏁 Final Thoughts

When people first start working with AI, they often focus on prompts.

Then they discover Context Engineering.

Eventually, they realize another important truth:

Not all context is equally valuable.

Modern AI systems succeed because they carefully decide:

What to include
What to exclude
What to summarize
What to retrieve
What to remember

That decision-making process is often more important than the prompt itself.

The best AI applications aren't the ones with the biggest context windows.

They're the ones that provide the right information at the right time.

And that's exactly what Context Engineering is about.

📌 Key Takeaways

Tokens are the fundamental units processed by language models—they are not the same as words.
Every request consumes tokens, including system prompts, conversation history, retrieved documents, tool outputs and the model's response.
A context window defines how much information a model can consider at once.
Larger context windows are useful, but more context doesn't automatically lead to better answers.
Context and memory are different: context exists for the current request, while memory is information intentionally stored and reintroduced by the application.
Effective Context Engineering is about sending relevant information, not all available information.
Good context management improves response quality, reduces costs and lowers latency.

🚀 Coming Up in Part 3

So far, we've answered:

What is Context Engineering?
What are tokens?
How do context windows work?
Why isn't bigger context always better?
What's the difference between context and memory?

But another important question remains:

Where does all this context actually come from?

How do AI applications automatically search documentation, query databases, call APIs, execute tools and decide what information should be sent to the model?

That's where concepts like Retrieval-Augmented Generation (RAG), Tool Calling, Model Context Protocol (MCP) and AI Agents come into the picture.

We'll break each of them down in Part 3 with practical examples and real-world engineering scenarios.

Top comments (2)

Raju Dandigam • Jul 5

This is a useful reminder that context size and context quality are different engineering problems. Teams often respond to weak outputs by dumping more artifacts into the prompt, but once the context stops being curated, the model has to sort through stale, redundant, or weakly relevant evidence before it can do useful work. That usually shows up as slower runs and less reliable decisions, not better reasoning. I’ve seen the best results when context assembly is treated like a retrieval/debugging problem with explicit structure and traceability, especially for coding agents that mix files, tool output, and prior state. That’s also where agent-inspect has been useful for me: you can inspect which pieces of context actually reached the run instead of guessing after the fact. Are you planning to cover strategies for trimming or ranking context before it ever hits the model window?

Fazal Mansuri • Jul 7

Thanks for sharing such a thoughtful perspective! I completely agree that context size and context quality are two very different engineering challenges. Treating context assembly as a retrieval/debugging problem instead of just stuffing more information into the prompt is a great way to think about it. I haven't tried agent-inspect yet, but I'll definitely check it out - it sounds really useful for understanding what actually reaches the model. And yes, covering strategies for filtering, ranking and trimming context before it enters the model is definitely on my list for a future article. Thanks again! 🙌