Fazal Mansuri

📦 AI Context Engineering (Part 2): Tokens, Context Windows & Memory - Why More Context Isn't Always Better

In Part 1, we learned that building great AI applications isn't just about writing better prompts.

It's about providing the right context.

But that naturally leads to another question:

How much context can an AI actually understand?

If you've used ChatGPT, Claude, Gemini, Cursor or any AI coding assistant for a while, you've probably experienced something like this.


🤔 "Didn't I Already Tell You That?"

Imagine you're debugging a production issue with an AI assistant.

You start by explaining the architecture.

Then you share the API flow.

Then database schema.

Then logs.

Then stack traces.

After 30 minutes of conversation, you ask:

"So what's causing the bug?"

Instead of giving the answer, the AI responds:

"Could you share your database schema?"

You stare at the screen.

"I already did..."

Sometimes it even forgets details from earlier in the same conversation.

Naturally, people assume:

  • The AI has bad memory.
  • The model is unreliable.
  • The conversation is broken.

In reality, something completely different is happening.

You're running into one of the most important concepts in modern AI systems:

The Context Window.

Understanding this concept changes how you interact with AI and, more importantly, how you build AI-powered applications.


🧠 Before We Talk About Context Windows…

We first need to understand something much smaller.

Tokens.

Almost every AI provider mentions them.

Pricing is based on them.

Context windows are measured using them.

Yet many developers still think:

1 token = 1 word

That isn't true.


🔤 What Exactly Is a Token?

A token is the basic unit of text that an AI model processes.

Humans naturally read text as:

  • Characters
  • Words
  • Sentences

Large Language Models don't.

Before text reaches the model, it's converted into smaller pieces called tokens by a tokenizer.

The model never sees your original sentence.

It only sees a sequence of tokens.

Think of a tokenizer as a translator between humans and AI.

You write English
        ↓
Tokenizer
        ↓
Sequence of Tokens
        ↓
LLM

The tokenizer decides how text is split.

And that's why tokens aren't the same as words.


Tokens ≠ Words

This is probably the biggest misconception about LLMs.

Consider this sentence:

Hello World

You might expect:

2 words = 2 tokens

Sometimes that's true.

Sometimes it isn't.

Now consider:

internationalization

One word.

But depending on the tokenizer, it may be split into multiple tokens because the tokenizer's vocabulary is built from common sub-word patterns rather than every possible English word.

Similarly:

authentication
authorization
microservices

may also become multiple tokens.

The tokenizer isn't trying to understand grammar.

It's trying to represent text efficiently based on patterns it learned during training.
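
To make the splitting idea concrete, here is a toy sketch in Python. It assumes a tiny hand-written vocabulary (`VOCAB`) and a greedy longest-match rule. Both are invented for illustration; real tokenizers learn far larger vocabularies from data and use their own algorithms.

```python
# Toy vocabulary, invented for illustration only.
VOCAB = {"hello", "world", "inter", "national", "ization", "auth", "entic", "ation", " "}


def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unknown characters become one-character tokens."""
    text = text.lower()
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches here: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


print(toy_tokenize("Hello World", VOCAB))
print(toy_tokenize("internationalization", VOCAB))
print(toy_tokenize("authentication", VOCAB))
```

With this made-up vocabulary, one long word becomes three pieces. Swap in a different vocabulary and the same word splits differently, which is why token counts vary between models.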


📦 Different Types of Content Produce Different Numbers of Tokens

One reason token counting surprises developers is that not all text is equal.

Take these examples.

Example 1: Simple English

The server is running normally.

This is relatively efficient because it contains common words, which tokenizers can usually represent in very few tokens.


Example 2: JSON

{
  "userId": 123,
  "isActive": true
}

Even though this looks short, punctuation, quotes, braces and field names also become tokens.


Example 3: Source Code

if err != nil {
    return err
}

Programming languages introduce symbols, keywords, indentation and identifiers, all of which contribute to the token count.


Example 4: URLs

https://example.com/api/v1/users/123

URLs often break into more tokens than plain prose of the same length because of slashes, punctuation and path segments.


Example 5: Emojis

🚀🔥🎉

Even emojis are tokenized.

Different models may tokenize the same emoji sequence differently.


💡 The Important Lesson

Tokens are not based on what humans consider words.

They're based on how the tokenizer breaks text into pieces.

That's why two pieces of text with the same number of characters can consume very different numbers of tokens.
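
A rough way to see this without a real tokenizer is to split text into word runs and individual punctuation marks, then compare pieces per character. The regex split below is only a crude stand-in chosen for illustration, so treat the numbers as directional rather than as real token counts.

```python
import re


def count_pieces(text: str) -> int:
    """Count word runs and individual punctuation marks (a crude proxy, not real tokens)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


samples = {
    "english": "The server is running normally.",
    "json": '{"userId": 123, "isActive": true}',
    "go": "if err != nil { return err }",
    "url": "https://example.com/api/v1/users/123",
}

for name, text in samples.items():
    pieces = count_pieces(text)
    print(f"{name:8} {len(text):3} chars  {pieces:2} pieces  {pieces / len(text):.2f} pieces/char")
```

The JSON sample is about the same length as the English sentence, yet it breaks into roughly twice as many pieces because every quote, colon and brace counts.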


🎯 Why Should Developers Care About Tokens?

Many developers only think about tokens when they receive an API bill.

In reality, tokens influence almost everything about an AI system.

They affect:

  • Cost
  • Latency
  • Context size
  • Response quality
  • Memory
  • Scalability

Every AI request starts with a token budget.

The more tokens you consume, the more work the model has to perform.


💰 Everything Consumes Tokens

Here's something many people don't realize.

When you send a request to an LLM, you're not paying only for your latest prompt.

The model processes everything included in the request.

That may include:

Content                  Uses Tokens?
System Instructions      ✅
User Prompt              ✅
Conversation History     ✅
Uploaded Documents       ✅
Retrieved RAG Results    ✅
Tool Outputs             ✅
Function Call Results    ✅
AI Response              ✅

Even if you only typed:

Summarize this.

The model might actually receive thousands of tokens.

Imagine asking an AI coding assistant:

Why is this function failing?

Behind the scenes, the application might automatically include:

  • System instructions
  • Current source file
  • Nearby functions
  • Package imports
  • Build configuration
  • Previous conversation
  • Compiler errors

Your five-word prompt becomes a request containing thousands of tokens.

This is another example of Context Engineering in action.
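
Here is a small sketch of that effect. It assumes a rough rule of thumb of about four characters per token for English text (`estimate_tokens` is a hypothetical helper, not a real tokenizer), and the attached parts are placeholder strings rather than anything a specific assistant actually sends.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); a real tokenizer will differ."""
    return max(1, len(text) // 4)


# Hypothetical parts an assistant might attach behind the scenes.
request_parts = {
    "system_instructions": "You are a careful Go code reviewer. " * 20,
    "current_source_file": "func handler() error { return nil }\n" * 200,
    "compiler_errors": "undefined: userRepo\n" * 10,
    "user_prompt": "Why is this function failing?",
}

breakdown = {name: estimate_tokens(text) for name, text in request_parts.items()}
total = sum(breakdown.values())

for name, count in breakdown.items():
    print(f"{name:22} {count:6} tokens ({count / total:.1%})")
print(f"{'total':22} {total:6} tokens")
```

In this made-up request, the typed prompt is well under one percent of what the model receives.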


🔄 Every New Request Starts Again

This surprises many engineers.

Suppose your conversation looks like this:

User:
Explain JWT Authentication.

Assistant:
...

User:
Can you show a Go example?

Assistant:
...

User:
Can you make it use middleware?

Many people imagine the model simply remembers everything internally.

That's not how most stateless LLM API interactions work.

For each new request, the application typically sends the conversation history (or a selected portion of it) again so the model has the necessary context to continue the conversation coherently.

Conceptually, it looks something like this:

Request 1

System Prompt
User Message

↓

LLM

------------------------

Request 2

System Prompt
Conversation History
New User Message

↓

LLM

------------------------

Request 3

System Prompt
Conversation History
Latest User Message

↓

LLM

As the conversation grows longer, more tokens are often included with each request unless the application trims, summarizes or otherwise manages the context.

This is one reason token usage can increase significantly during long conversations.
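
The arithmetic is easy to sketch. The example below assumes the application resends the full history on every request, and it uses made-up message sizes (20 tokens per user message, 300 per reply, a 200-token system prompt).

```python
def input_tokens_per_request(turns: list[tuple[int, int]], system_tokens: int) -> list[int]:
    """Input size of each request when the whole history is resent every time.

    Each turn is (user_tokens, reply_tokens).
    """
    sent = []
    history = 0
    for user_tokens, reply_tokens in turns:
        # The request carries the system prompt, everything said so far, and the new message.
        sent.append(system_tokens + history + user_tokens)
        # Afterwards, both the message and the reply become part of the history.
        history += user_tokens + reply_tokens
    return sent


turns = [(20, 300)] * 5  # hypothetical sizes
per_request = input_tokens_per_request(turns, system_tokens=200)
print(per_request)       # [220, 540, 860, 1180, 1500]
print(sum(per_request))  # 4300
```

The user typed only 100 tokens in total, but the model processed 4,300 input tokens across five requests, and each new request is larger than the last.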


🪟 What Is a Context Window?

Now we can finally answer the question from the beginning.

If tokens are the building blocks…

Then what is the Context Window?

A context window is the maximum amount of information (measured in tokens) that a model can consider at one time while generating a response.

Think of it like a whiteboard.

Imagine you're solving a difficult problem with a teammate.

Everything written on the whiteboard is available to both of you.

You can read it.

Reference it.

Connect ideas.

Reason about it.

But the whiteboard isn't infinite.

Eventually, it fills up.

To continue working, you have two choices:

  • Erase older notes.
  • Replace them with a summary.

Large Language Models work in a very similar way.

The context window is that whiteboard.

Everything currently inside it is visible to the model while it generates its next response.

Everything outside it is invisible unless your application decides to bring it back.
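
The "erase older notes" option can be sketched as a simple trimming function. It assumes a hypothetical `estimate_tokens` helper (about four characters per token) and a made-up budget. Real applications would use the provider's tokenizer and the model's actual limit.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); a real tokenizer will differ."""
    return max(1, len(text) // 4)


def fit_to_window(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose estimated size fits within the budget."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget:
            break  # everything older than this point falls off the whiteboard
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


conversation = [
    "Database schema: " + "users(id, email, created_at) " * 20,
    "Stack trace: " + "at handler.go:42 " * 20,
    "So what's causing the bug?",
]
visible = fit_to_window(conversation, budget=120)
print(len(visible), "of", len(conversation), "messages still visible")
```

With this budget the schema message no longer fits, so the model never sees it. That is the "Could you share your database schema?" moment from the start of this post.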


📖 What's Actually Inside the Context Window?

One common misconception is that the context window contains only your prompt.

In reality, modern AI applications often build a much richer context before sending the request.

A simplified request might look like this:

System Instructions

↓

Conversation History

↓

Retrieved Knowledge (RAG)

↓

Uploaded Files

↓

Tool Results

↓

Current User Prompt

↓

LLM

Notice something interesting.

Your latest message may only be a tiny fraction of everything the model receives.

The application has already assembled the rest of the context behind the scenes.

This is why two AI tools using the same underlying model can produce very different answers.

The difference often isn't the model itself.

It's how effectively each application constructs the context before making the request.
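
As a minimal sketch of that assembly step, the function below joins the parts in the order shown above. The `ContextParts` type, the section labels and the single-string output are all assumptions made for illustration; real chat APIs usually take structured messages instead of one block of text.

```python
from dataclasses import dataclass, field


@dataclass
class ContextParts:
    """Hypothetical container for the pieces an application has gathered."""
    system_instructions: str
    history: list[str] = field(default_factory=list)
    retrieved: list[str] = field(default_factory=list)
    tool_results: list[str] = field(default_factory=list)


def build_context(parts: ContextParts, user_prompt: str) -> str:
    """Assemble the final text in a fixed order, skipping empty sections."""
    sections = [
        ("System Instructions", [parts.system_instructions]),
        ("Conversation History", parts.history),
        ("Retrieved Knowledge", parts.retrieved),
        ("Tool Results", parts.tool_results),
        ("Current User Prompt", [user_prompt]),
    ]
    blocks = []
    for title, items in sections:
        if items:
            blocks.append(f"[{title}]\n" + "\n".join(items))
    return "\n\n".join(blocks)


parts = ContextParts("Be concise.", retrieved=["JWT tokens expire after one hour."])
print(build_context(parts, "Why is this API returning 500?"))
```

Two applications calling the same model with different `build_context` logic would hand it very different inputs for the same user question.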


🤔 Does a Bigger Context Window Always Mean Better AI?

At first glance, it sounds obvious.

If an AI model can process more information, shouldn't the answers always improve?

Not necessarily.

Imagine you're preparing for an interview.

Someone gives you:

  • Your resume
  • The job description
  • Company documentation
  • Engineering handbook
  • API documentation
  • Meeting notes
  • Internal wiki
  • Product roadmap
  • Last year's planning documents
  • Random Slack conversations

Could you answer the interview questions?

Probably.

Would all that information actually help?

Probably not.

In fact, it might make things worse because you'd spend time filtering out irrelevant information before focusing on what matters.

Large Language Models face a similar challenge.

A larger context window allows the model to see more information, but it doesn't mean every piece of that information is equally useful.

Good Context Engineering isn't about maximizing context.

It's about optimizing context.


🎯 More Context ≠ Better Context

One of the biggest mistakes in AI applications is thinking:

If some context helps,
more context must help even more.

That often isn't true.

Imagine building an AI coding assistant.

A developer asks:

Why is this API returning 500?

Option A:

Provide only:

  • The API handler
  • Relevant service
  • Error log

Option B:

Provide:

  • Entire repository
  • All dependencies
  • Every README
  • CI/CD configuration
  • Kubernetes manifests
  • Previous conversations
  • All database schemas

Technically, the model has more information.

Practically, you've introduced noise.

Finding the relevant signal becomes harder.

This is why mature AI systems spend significant effort deciding what not to include.

Context Engineering is often about removing information rather than adding it.


🔍 The "Lost in the Middle" Problem

Researchers have observed an interesting behavior in long-context language models.

When a prompt becomes very large, the model may use information buried in the middle of the context less reliably than information closer to the beginning or end.

This phenomenon is commonly referred to as the "Lost in the Middle" effect.

Imagine reading a 300-page document.

You usually remember:

  • The introduction
  • The conclusion

But details from page 147?

Much harder.

Language models can exhibit similar behavior under certain conditions.

This doesn't mean long-context models are ineffective.

It means that how information is organized inside the context matters, not just how much information is included.
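
One ordering heuristic that follows from this is to place the most relevant chunks at the edges of the context and the least relevant in the middle. The sketch below is only an illustration of that idea, not something this article's sources prescribe, and whether it helps depends on the model and the task. It assumes the input list is already sorted from most to least relevant.

```python
def order_for_long_context(chunks_by_relevance: list[str]) -> list[str]:
    """Put the most relevant chunks at the start and end, the least relevant in the middle."""
    front, back = [], []
    for index, chunk in enumerate(chunks_by_relevance):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4, 6... fill the back.
        if index % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant
print(order_for_long_context(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

The two strongest chunks end up first and last, and the weakest one sits in the middle where it matters least.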


🧠 Context Is NOT Memory

This is another of the most common misconceptions about modern AI.

People often say:

"ChatGPT remembered what I told it last week."

What actually happened depends on the application.

It's important to distinguish between context and memory.

Context

Context is the information available only for the current request.

Once the request is complete, the model itself doesn't retain that context for future API calls.

Think of it as a whiteboard used during a meeting.

Erase the whiteboard and the discussion is gone.


Memory

Memory is information that an application intentionally stores and later reintroduces into future conversations.

For example:

You tell an AI assistant:

I primarily write Go and React.

The application might choose to save that preference.

Next week, when you ask:

Generate an API example.

The application retrieves your stored preference and injects it into the new request.

The model appears to "remember."

In reality, the application supplied that information again.

The memory lives in the application, not inside the language model itself.
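
A minimal sketch of that flow, assuming an in-memory dictionary as the store. `MemoryStore` and `build_request` are hypothetical names, and a real application would use a database and decide more carefully which facts are worth saving.

```python
class MemoryStore:
    """In-memory stand-in for whatever storage a real application would use."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = {}

    def remember(self, user_id: str, fact: str) -> None:
        self._facts.setdefault(user_id, []).append(fact)

    def recall(self, user_id: str) -> list[str]:
        return list(self._facts.get(user_id, []))


def build_request(store: MemoryStore, user_id: str, prompt: str) -> str:
    """Prepend stored facts to the prompt; the model itself stores nothing."""
    facts = store.recall(user_id)
    if not facts:
        return prompt
    known = "\n".join(f"- {fact}" for fact in facts)
    return f"Known about this user:\n{known}\n\n{prompt}"


store = MemoryStore()
store.remember("user-1", "Primarily writes Go and React")  # saved this week
print(build_request(store, "user-1", "Generate an API example."))  # reused next week
```

From the model's point of view, the preference is just more text in the request.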


🏗️ A Simple Mental Model

Think of an AI application as three separate components:

                User

                  │
                  ▼

      ┌─────────────────────┐
      │   Memory Store      │
      └─────────────────────┘
                  │
                  ▼
      ┌─────────────────────┐
      │ Context Builder     │
      │ - System Prompt     │
      │ - History           │
      │ - RAG Results       │
      │ - Tool Outputs      │
      └─────────────────────┘
                  │
                  ▼
      ┌─────────────────────┐
      │        LLM          │
      └─────────────────────┘

The LLM only sees the final assembled context.

It doesn't know where each piece of information came from.


💰 Why Tokens Matter for Cost and Performance

We've already learned that everything consumes tokens.

But token usage affects much more than pricing.

It also impacts:

  • Request latency
  • Throughput
  • Infrastructure cost
  • User experience

Consider these two requests.

Efficient Request

System Prompt

Relevant API documentation

Current User Question

Inefficient Request

System Prompt

Entire conversation

Complete codebase

Multiple PDFs

Large JSON payloads

Old logs

Current User Question

The second request requires the model to process significantly more information before generating a response.

Even if the answer only depends on a small portion of that data.

More tokens generally mean:

  • More computation
  • Higher latency
  • Higher cost

This is why production AI systems invest heavily in efficient context management.
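
A back-of-the-envelope comparison shows the gap. The prices and token counts below are hypothetical placeholders, not any provider's real rates, so plug in your own numbers.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request in dollars, given per-million-token prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
efficient = request_cost(2_000, 500, 3.0, 15.0)
inefficient = request_cost(150_000, 500, 3.0, 15.0)

print(f"efficient:   ${efficient:.4f}")
print(f"inefficient: ${inefficient:.4f}")
print(f"ratio:       {inefficient / efficient:.0f}x")
```

Both requests produce the same 500-token answer, but under these assumed numbers the bloated one costs roughly 34 times more per call.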


⚙️ How Production AI Systems Manage Context

Well-designed AI applications rarely send everything they know to the model.

Instead, they carefully construct the context.

Some common strategies include:

✅ Retrieve Only What's Relevant

Instead of sending an entire knowledge base,

retrieve only the documents related to the user's question.
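
As a rough sketch, the retrieval step can be as simple as scoring documents by word overlap with the question. This keyword scoring is a deliberately naive stand-in chosen to keep the example self-contained; production systems typically use embeddings or a search index instead.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k documents that share the most words with the question."""
    question_words = words(question)
    scored = [(len(question_words & words(doc)), doc) for doc in documents]
    relevant = [pair for pair in scored if pair[0] > 0]  # drop documents with no overlap
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in relevant[:top_k]]


documents = [
    "JWT tokens expire after one hour",
    "Kubernetes manifests live in the deploy folder",
    "Refresh a JWT before it expires",
]
print(retrieve("Why did my JWT expire?", documents))
```

The Kubernetes document never reaches the model, so it costs no tokens and adds no noise.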


✅ Summarize Long Conversations

Rather than sending hundreds of previous messages,

summarize earlier parts while preserving important facts.

For example:

Instead of:

150 previous messages

Send:

Conversation Summary:

• User uses Go
• Building REST APIs
• Uses PostgreSQL
• Currently debugging authentication

Far fewer tokens.

Much easier for the model.
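
The pattern can be sketched as "keep the last few messages, replace the rest with a summary". In the sketch below, `summarize` is a caller-supplied function. In practice it might be another model call, but here it is a placeholder so the example runs on its own.

```python
from typing import Callable


def compact_history(
    messages: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace everything except the most recent messages with a single summary entry."""
    if len(messages) <= keep_recent:
        return list(messages)
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return ["Conversation Summary:\n" + summarize(older)] + recent


history = [f"message {i}" for i in range(150)]


def placeholder_summary(older: list[str]) -> str:
    # Stand-in for a real summarizer.
    return f"({len(older)} earlier messages condensed)"


compacted = compact_history(history, keep_recent=4, summarize=placeholder_summary)
print(len(history), "->", len(compacted), "entries")
```

The trade-off is that anything the summary leaves out is gone, so the summarizer needs to preserve the facts later questions will depend on.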


✅ Keep System Prompts Focused

System instructions should define behavior,

not contain an entire company handbook.


✅ Separate Long-Term Memory from Current Context

Store reusable preferences separately.

Retrieve them only when needed.


✅ Measure Token Usage

Don't optimize blindly.

Most AI SDKs expose token usage statistics.

Monitoring them helps identify unnecessary context and control costs.
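
A small tracker is often enough to start. The sketch below assumes you can read input and output token counts from each response; the `UsageTracker` class and its field names are hypothetical, so map them to whatever your SDK actually reports.

```python
from collections import defaultdict


class UsageTracker:
    """Accumulates token counts per feature so unusual growth is easy to spot."""

    def __init__(self) -> None:
        self._totals = defaultdict(lambda: {"requests": 0, "input": 0, "output": 0})

    def record(self, feature: str, input_tokens: int, output_tokens: int) -> None:
        entry = self._totals[feature]
        entry["requests"] += 1
        entry["input"] += input_tokens
        entry["output"] += output_tokens

    def average_input(self, feature: str) -> float:
        entry = self._totals[feature]
        return entry["input"] / entry["requests"] if entry["requests"] else 0.0


tracker = UsageTracker()
tracker.record("chat", input_tokens=1_200, output_tokens=300)
tracker.record("chat", input_tokens=4_800, output_tokens=350)
print(tracker.average_input("chat"))  # 3000.0
```

A rising average input size for one feature is usually the first sign that its context is growing unchecked.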


🚫 Common Context Engineering Mistakes

Here are some patterns that often reduce AI quality rather than improve it.

❌ Dumping entire documents into every request

More information isn't automatically better.


❌ Sending the full conversation forever

Long conversations should be summarized or trimmed when appropriate.


❌ Treating memory as conversation history

They're different concepts.


❌ Ignoring token usage

Unexpected token growth often leads to higher costs and slower responses.


❌ Assuming bigger context windows solve everything

Context quality matters far more than context quantity.


🎯 Best Practices

If you're building AI-powered applications, keep these principles in mind:

✅ Send only the information required for the current task.

✅ Prefer relevant context over larger context.

✅ Separate memory from conversation history.

✅ Summarize older interactions instead of continuously expanding prompts.

✅ Monitor token usage in production.

✅ Design context deliberately instead of simply appending more text.


🏁 Final Thoughts

When people first start working with AI, they often focus on prompts.

Then they discover Context Engineering.

Eventually, they realize another important truth:

Not all context is equally valuable.

Modern AI systems succeed because they carefully decide:

  • What to include
  • What to exclude
  • What to summarize
  • What to retrieve
  • What to remember

That decision-making process is often more important than the prompt itself.

The best AI applications aren't the ones with the biggest context windows.

They're the ones that provide the right information at the right time.

And that's exactly what Context Engineering is about.


📌 Key Takeaways

  • Tokens are the fundamental units processed by language models; they are not the same as words.
  • Every request consumes tokens, including system prompts, conversation history, retrieved documents, tool outputs and the model's response.
  • A context window defines how much information a model can consider at once.
  • Larger context windows are useful, but more context doesn't automatically lead to better answers.
  • Context and memory are different: context exists for the current request, while memory is information intentionally stored and reintroduced by the application.
  • Effective Context Engineering is about sending relevant information, not all available information.
  • Good context management improves response quality, reduces costs and lowers latency.

🚀 Coming Up in Part 3

So far, we've answered:

  • What is Context Engineering?
  • What are tokens?
  • How do context windows work?
  • Why isn't bigger context always better?
  • What's the difference between context and memory?

But another important question remains:

Where does all this context actually come from?

How do AI applications automatically search documentation, query databases, call APIs, execute tools and decide what information should be sent to the model?

That's where concepts like Retrieval-Augmented Generation (RAG), Tool Calling, Model Context Protocol (MCP) and AI Agents come into the picture.

We'll break each of them down in Part 3 with practical examples and real-world engineering scenarios.
