Your AI agent needs to remember things. User preferences, project details, conversation history, tool results—all of it matters for providing intelligent responses. But every token you send costs money and consumes your context window.
Send too little context, and your agent gives generic, unhelpful responses. Send too much, and you hit token limits, rack up costs, and slow down responses.
I've built agents that manage context for manufacturing operations, sales workflows, and productivity systems. Here's how to give agents the right memory at the right time—without exploding your budget.
The Context Budget Problem
Every LLM has a context window—a maximum number of tokens it can process. Claude Sonnet 4.5 gives you 200K tokens; GPT-4 Turbo gives you 128K. Sounds like a lot, right?
Here's what actually fits in that budget:
A typical agent's baseline context:
- System prompt: 1,500 tokens
- Tool definitions: 2,000 tokens
- Agent instructions: 1,000 tokens
- Subtotal: 4,500 tokens
For a project-based agent, add:
- Project machines list: 800 tokens
- Materials and specifications: 600 tokens
- Historical data: 1,200 tokens
- Subtotal: 2,600 tokens
For a 20-turn conversation:
- User messages (avg 100 tokens): 2,000 tokens
- Agent responses (avg 300 tokens): 6,000 tokens
- Subtotal: 8,000 tokens
Total: 15,100 tokens for a moderate conversation.
And you haven't even added:
- Retrieved documents
- Search results
- Previous workflow outputs
- Related task context
That's already around 12% of a 128K context window before the agent has done anything interesting. By turn 50, you're hitting limits. By turn 100, you're out of space.
The naive solution—"send everything always"—fails fast.
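Before optimizing anything, measure. Here's a minimal sketch using the tiktoken library to audit where a request's tokens go; the counts are approximations for non-OpenAI models, and the component strings are stand-ins for your real prompt pieces:

import tiktoken

# cl100k_base is a reasonable proxy encoding; exact counts vary by model
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def audit_context(components: dict, budget: int = 100_000) -> None:
    """Print a per-component token breakdown against a budget."""
    total = 0
    for name, text in components.items():
        tokens = count_tokens(text)
        total += tokens
        print(f"{name:<20} {tokens:>6} tokens")
    print(f"{'TOTAL':<20} {total:>6} tokens ({total / budget:.1%} of budget)")

# Illustrative usage; swap in your actual prompt pieces
audit_context({
    "system_prompt": "You are a manufacturing operations assistant...",
    "tool_definitions": "...",
    "project_data": "...",
    "conversation": "...",
})

Run this on a few real requests and you'll usually find one or two components eating most of the budget. Those are the targets for the patterns below.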
Pattern 1: Lazy Context Loading (Just-In-Time)
The Problem
Most agents load all available context upfront:
# ❌ Eager loading - wasteful
def build_context(project_id):
    return {
        'machines': get_all_machines(project_id),    # 50 machines
        'materials': get_all_materials(project_id),  # 200 materials
        'specs': get_all_specs(project_id),          # 100 specs
        'history': get_full_history(project_id),     # 1000 records
        'users': get_all_users(project_id),          # 30 users
        'schedules': get_all_schedules(project_id)   # 500 schedules
    }
# Result: 8,000+ tokens, most unused
The agent receives data it never uses. You pay for tokens that don't contribute to the response.
The Solution: Load Only What's Needed
Give the agent tools to request context when needed:
# ✅ Lazy loading - efficient
def build_minimal_context(project_id):
    return {
        'project_name': get_project_name(project_id),
        'project_type': get_project_type(project_id)
    }

# Agent has tools to fetch more (built per project, so each lambda
# closes over the correct project_id)
def build_tools(project_id):
    return [
        Tool(
            name='get_machines',
            description='Fetch machines for this project',
            function=lambda: get_machines(project_id)
        ),
        Tool(
            name='get_materials',
            description='Fetch materials for this project',
            function=lambda: get_materials(project_id)
        ),
        Tool(
            name='search_history',
            description='Search historical records by keyword',
            function=lambda query: search_history(project_id, query)
        )
    ]
When the Agent Needs Data
Turn 1:
- User: "Create a quality plan for automotive parts"
- Agent: receives minimal context
- Agent: calls get_machines() and get_materials()
- Agent: "I see you have 3 machines available..."
Turn 5:
- User: "What maintenance was done on Machine A last month?"
- Agent: calls search_history(query="Machine A maintenance")
- Agent: "Machine A had preventive maintenance on..."
Why This Works
✅ Reduced baseline: Start with ~500 tokens instead of 8,000
✅ On-demand loading: Only fetch what's relevant to the current task
✅ Token efficiency: Pay for what you use, not what you might use
✅ Better relevance: Agent gets focused, pertinent data
Real-World Impact
Before lazy loading:
- Average context: 12,000 tokens per request
- Cost per conversation: $0.48
- Irrelevant data: 60-70%
After lazy loading:
- Average context: 4,500 tokens per request
- Cost per conversation: $0.18
- Irrelevant data: <10%
62% cost reduction with better response quality.
Pattern 2: Task-Specific Context Windows
The Problem
Different tasks need different context. A quality planning agent needs machines and materials. A maintenance agent needs service history. An SOP agent needs workstations and TAKT times.
Loading everything for every task wastes tokens and confuses the agent with irrelevant data.
The Solution: Context Profiles per Agent Type
Define exactly what each agent needs:
class ContextManager:
    CONTEXT_PROFILES = {
        'quality_planning': {
            'required': ['machines', 'materials', 'specifications'],
            'optional': ['previous_plans', 'quality_metrics'],
            'exclude': ['maintenance_history', 'sop_data']
        },
        'maintenance_scheduling': {
            'required': ['machines', 'maintenance_history'],
            'optional': ['upcoming_schedules', 'parts_inventory'],
            'exclude': ['materials', 'quality_specs']
        },
        'sop_creation': {
            'required': ['workstations', 'resources', 'takt_time'],
            'optional': ['existing_sops', 'process_flow'],
            'exclude': ['quality_specs', 'maintenance_history']
        },
        'issue_tracking': {
            'required': ['machines', 'materials'],
            'optional': ['recent_issues', 'resolution_history'],
            'exclude': ['sop_data', 'schedules']
        }
    }

    def build_context(self, agent_type: str, project_id: str) -> dict:
        """Build context based on the agent's specific needs."""
        profile = self.CONTEXT_PROFILES[agent_type]
        context = {}

        # Load required data
        for key in profile['required']:
            context[key] = self._load_data(key, project_id)

        # Load optional data if available (don't fail if missing)
        for key in profile['optional']:
            try:
                context[key] = self._load_data(key, project_id)
            except DataNotFoundError:  # custom exception raised by _load_data
                pass  # Optional, skip if unavailable

        # Anything in profile['exclude'] is simply never loaded
        return context
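Usage then stays the same per request regardless of agent type. A quick sketch (the project id is illustrative, and _load_data is assumed to be backed by your own database layer):

manager = ContextManager()

# The quality planning agent gets machines, materials, and specifications
quality_ctx = manager.build_context('quality_planning', project_id='proj_123')

# The maintenance agent gets machines and service history instead
maintenance_ctx = manager.build_context('maintenance_scheduling', project_id='proj_123')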
Context Isolation Benefits
✅ Clarity: Agent sees only relevant information
✅ Speed: Less data to load and process
✅ Accuracy: No confusion from unrelated data
✅ Cost: Fewer tokens per request
✅ Debugging: Easy to see what context each agent receives
Example: Quality Planning vs. Maintenance
Quality Planning Context:
{
  "project_name": "Automotive Assembly Line",
  "machines": [
    {"id": "M1", "name": "CNC Mill", "specs": "..."},
    {"id": "M2", "name": "Lathe", "specs": "..."}
  ],
  "materials": [
    {"id": "MAT1", "name": "Steel", "grade": "304"}
  ],
  "specifications": {
    "tolerance": "±0.01mm",
    "surface_finish": "Ra 1.6"
  }
}
Maintenance Scheduling Context:
{
  "project_name": "Automotive Assembly Line",
  "machines": [
    {"id": "M1", "name": "CNC Mill", "last_service": "2024-01-15"},
    {"id": "M2", "name": "Lathe", "last_service": "2024-02-01"}
  ],
  "maintenance_history": [
    {"machine": "M1", "date": "2024-01-15", "type": "preventive"},
    {"machine": "M2", "date": "2024-02-01", "type": "repair"}
  ],
  "upcoming_schedules": [
    {"machine": "M1", "due": "2024-04-15", "type": "preventive"}
  ]
}
Notice: apart from the shared machine list, nothing bleeds across. The maintenance context carries no quality specifications, and the quality context carries no service history. Each agent gets exactly what it needs.
Pattern 3: Conversation History Windowing
The Problem
LLM conversations grow unbounded. By turn 50, you have:
- 50 user messages
- 50 agent responses
- Tool calls and results
- System messages
This exceeds context limits and makes responses slower and more expensive.
The Solution: Smart Windowing with Summarization
Keep recent messages in full, summarize older ones:
class ConversationWindow:
    def __init__(self, max_full_messages=10):
        self.max_full_messages = max_full_messages
        self.summary_cache = {}

    async def prepare_history(self, session_id: str) -> list:
        """Prepare conversation history within token budget."""
        full_history = await self.get_full_history(session_id)

        if len(full_history) <= self.max_full_messages:
            return full_history

        # Split into recent and old
        recent_messages = full_history[-self.max_full_messages:]
        old_messages = full_history[:-self.max_full_messages]

        # Check if we already have a summary
        summary_key = f"{session_id}:{len(old_messages)}"
        if summary_key in self.summary_cache:
            summary = self.summary_cache[summary_key]
        else:
            summary = await self._create_summary(old_messages)
            self.summary_cache[summary_key] = summary

        # Combine summary + recent messages
        return [
            {
                'role': 'system',
                'content': f'Previous conversation summary: {summary}'
            },
            *recent_messages
        ]

    async def _create_summary(self, messages: list) -> str:
        """Create concise summary focusing on key information."""
        # Extract key information to preserve
        decisions = self._extract_decisions(messages)
        data_collected = self._extract_data(messages)
        progress = self._extract_progress(messages)

        summary_prompt = f"""
        Summarize this conversation segment in 3-4 sentences:

        Focus on:
        - Key decisions: {decisions}
        - Data collected: {data_collected}
        - Progress made: {progress}

        Messages:
        {self._format_messages(messages)}

        Create a concise summary that preserves essential context.
        """

        summary = await self.llm.complete(summary_prompt)
        return summary.strip()
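Wiring the window in is one call before each model invocation. A minimal sketch; build_prompt and llm are placeholders for your own stack, and get_full_history is assumed to read from your session store:

window = ConversationWindow(max_full_messages=10)

async def respond(session_id: str, user_message: str) -> str:
    history = await window.prepare_history(session_id)  # summary + last 10 full messages
    prompt = build_prompt(history=history, user_message=user_message)
    return await llm.complete(prompt)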
What to Preserve in Summaries
Always preserve:
- User decisions and choices
- Specific data provided (numbers, names, IDs)
- Task progress and completion status
- Error messages or issues encountered
- Tool call results that affect future actions
Can compress:
- Clarifying questions and answers
- Explanations of concepts
- Confirmation messages
- General chitchat
- Repetitive information
Example Summary
Original (10 messages, 2,000 tokens):
User: "I need to create a quality plan"
Agent: "What product are you manufacturing?"
User: "Automotive brake pads"
Agent: "What materials are you using?"
User: "Steel alloy, grade 304"
Agent: "What machines will you use?"
User: "CNC Mill M1 and Lathe M2"
Agent: "What tolerances are required?"
User: "±0.01mm"
Agent: "Got it. Let me create the plan..."
Summary (150 tokens):
User requested quality plan for automotive brake pads.
Materials: Steel alloy grade 304.
Machines: CNC Mill M1, Lathe M2.
Tolerance requirement: ±0.01mm.
Plan creation initiated.
Token Savings
- Original: 2,000 tokens
- Summary: 150 tokens
- Savings: 92.5%
Multiply this across a 50-turn conversation and you save thousands of tokens per request.
Pattern 4: RAG (Retrieval-Augmented Generation) for Large Knowledge Bases
The Problem
Some agents need access to large knowledge bases:
- 500+ product specifications
- 1,000+ historical maintenance records
- 200+ standard operating procedures
- Complete company documentation
You can't fit this in context. Even with a 200K token window, it's inefficient.
The Solution: Vector Search + Selective Retrieval
Store knowledge in a vector database, retrieve only relevant chunks:
class KnowledgeRetriever:
    def __init__(self, vector_db):
        self.db = vector_db

    async def retrieve_relevant(self, query: str, top_k: int = 3) -> list:
        """Retrieve most relevant knowledge chunks for query."""
        # Embed the query
        query_embedding = await self.embed(query)

        # Search vector database
        results = await self.db.search(
            vector=query_embedding,
            limit=top_k,
            threshold=0.7  # Similarity threshold
        )

        # Return relevant chunks
        return [
            {
                'content': result.content,
                'source': result.metadata['source'],
                'relevance': result.score
            }
            for result in results
        ]

    async def build_rag_context(self, user_message: str, base_context: dict) -> dict:
        """Augment base context with retrieved knowledge."""
        # Retrieve relevant documents
        relevant_docs = await self.retrieve_relevant(user_message)

        # Add to context
        augmented_context = {
            **base_context,
            'retrieved_knowledge': relevant_docs
        }
        return augmented_context
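The embed call above is left abstract. One possible implementation runs sentence-transformers locally; the model name is an assumption, and in practice you'd inject the embedder into KnowledgeRetriever rather than relying on a bare self.embed:

from sentence_transformers import SentenceTransformer

class LocalEmbedder:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    async def embed(self, text: str) -> list:
        # encode() is synchronous; fine for a sketch, offload to a thread pool in production
        return self.model.encode(text).tolist()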
When to Use RAG vs. Direct Context
Use direct context when:
- Data is small (<2,000 tokens)
- Data is frequently needed
- Data is structured and predictable
- Fast response time is critical
Use RAG when:
- Knowledge base is large (>10,000 tokens)
- Data is accessed occasionally
- Relevance varies by query
- Full text search is needed
RAG Implementation Example
Scenario: Agent needs to reference maintenance procedures.
Without RAG (fails):
# Can't fit 200 procedures in context
procedures = load_all_procedures() # 50,000 tokens
# Context limit exceeded!
With RAG (works):
# User asks about specific machine
user_query = "How do I service Machine A?"
# Retrieve only relevant procedures
relevant = await retriever.retrieve_relevant(user_query, top_k=3)
# Result: 3 procedures, ~1,500 tokens
# Agent sees only what's needed
agent_context = {
    'project': project_data,
    'relevant_procedures': relevant  # Just 3, not 200
}
RAG Architecture
User Query
↓
Query Embedding
↓
Vector Search
↓
Top K Results (by similarity)
↓
Relevance Filtering (threshold)
↓
Context Augmentation
↓
Agent Processing
Vector Database Options
Weaviate:
- Good for production scale
- Rich filtering capabilities
- Self-hosted or cloud
Pinecone:
- Managed service
- Fast and reliable
- Easy to get started
pgvector (PostgreSQL):
- Use existing PostgreSQL
- Good for moderate scale
- No additional infrastructure
When to use each:
- pgvector: <100K vectors, already using PostgreSQL
- Weaviate: 100K-10M vectors, need rich filtering
- Pinecone: Any scale, want managed solution
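If you go the pgvector route, retrieval is plain SQL. A minimal sketch assuming the pgvector-python helper for asyncpg, and a hypothetical procedures table with an embedding column and a local DSN:

import numpy as np
import asyncpg
from pgvector.asyncpg import register_vector

async def search_pgvector(query_embedding: list, top_k: int = 3) -> list:
    conn = await asyncpg.connect("postgresql://localhost/manufacturing")  # assumed DSN
    await register_vector(conn)  # teaches asyncpg how to send/receive vector values
    try:
        # <=> is pgvector's cosine-distance operator; smaller distance = more similar
        rows = await conn.fetch(
            """
            SELECT content, source, 1 - (embedding <=> $1) AS similarity
            FROM procedures
            ORDER BY embedding <=> $1
            LIMIT $2
            """,
            np.array(query_embedding, dtype=np.float32),
            top_k,
        )
        return [dict(r) for r in rows]
    finally:
        await conn.close()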
Pattern 5: Session State vs. Long-Term Memory
The Problem
Agents need two types of memory:
- Session memory: Current conversation, temporary
- Long-term memory: User preferences, historical decisions, persistent
Treating them the same leads to:
- Session data polluting long-term memory
- Long-term memory cluttering sessions
- Difficulty clearing temporary data
- Privacy and data retention issues
The Solution: Separate Storage with Clear Boundaries
class MemoryManager:
    def __init__(self, session_store, long_term_store):
        self.session = session_store      # Redis, TTL: 1 hour
        self.long_term = long_term_store  # PostgreSQL, permanent

    async def get_session_context(self, session_id: str) -> dict:
        """
        Get temporary session data.
        Auto-expires after inactivity.
        """
        return await self.session.get(f"session:{session_id}")

    async def get_long_term_context(self, user_id: str) -> dict:
        """
        Get persistent user data.
        Requires explicit deletion.
        """
        return await self.long_term.query(
            "SELECT preferences, history FROM user_memory WHERE user_id = $1",
            user_id
        )

    async def build_complete_context(self, session_id: str, user_id: str) -> dict:
        """Combine session and long-term memory."""
        session_data = await self.get_session_context(session_id)
        long_term_data = await self.get_long_term_context(user_id)

        return {
            'current_session': session_data,
            'user_memory': long_term_data
        }

    async def save_to_long_term(self, user_id: str, key: str, value) -> None:
        """Explicitly save important information for future sessions."""
        await self.long_term.execute(
            "INSERT INTO user_memory (user_id, key, value) VALUES ($1, $2, $3)",
            user_id, key, value
        )
What Goes Where
Session Storage (Temporary):
- Current conversation history
- Active task state
- Temporary tool results
- Draft outputs
- Workflow progress
Long-Term Storage (Permanent):
- User preferences (language, style)
- Project associations
- Historical decisions
- Learned patterns
- Completed task outcomes
Example: Quality Planning
Session memory:
{
  "session_id": "sess_123",
  "task": "quality_planning",
  "current_step": 5,
  "collected_data": {
    "product": "brake pads",
    "materials": ["steel 304"],
    "machines": ["M1", "M2"]
  },
  "draft_plan": {...}
}
Long-term memory:
{
  "user_id": "user_456",
  "preferences": {
    "default_tolerance": "±0.01mm",
    "preferred_machines": ["M1", "M2"],
    "notification_style": "summary"
  },
  "completed_plans": [
    {"project": "Project A", "date": "2024-01-15"},
    {"project": "Project B", "date": "2024-02-20"}
  ]
}
Memory Lifecycle
Session memory:
- Created on first message
- Updated each turn
- Auto-expires after 1 hour of inactivity
- Can be explicitly cleared
Long-term memory:
- Created on user signup
- Updated on explicit events (preferences changed, task completed)
- Never expires (except for data retention policies)
- Requires user action to delete
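Here's a minimal sketch of the session side using redis-py's asyncio client. The one-hour TTL and the session: key prefix mirror the lifecycle above; the Redis URL is an assumption:

import json
import redis.asyncio as redis

SESSION_TTL_SECONDS = 3600  # auto-expire after 1 hour of inactivity

class SessionStore:
    def __init__(self, url: str = "redis://localhost:6379/0"):
        self.client = redis.from_url(url)

    async def get(self, session_id: str) -> dict:
        raw = await self.client.get(f"session:{session_id}")
        return json.loads(raw) if raw else {}

    async def save(self, session_id: str, data: dict) -> None:
        # setex writes the value and resets the TTL in one call,
        # so every update keeps the session alive for another hour
        await self.client.setex(
            f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(data)
        )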
Pattern 6: Context Compression Techniques
The Problem
Sometimes you need to reference large documents but can't fit them in context. User uploads a 20-page PDF. You need key information but not everything.
The Solution: Multi-Level Compression
class ContextCompressor:
    async def compress_document(self, document: str, target_tokens: int) -> str:
        """Compress document to fit within token budget."""
        current_tokens = self.count_tokens(document)

        if current_tokens <= target_tokens:
            return document  # Already fits

        # Level 1: Extract key sections
        if current_tokens < target_tokens * 2:
            return await self._extract_key_sections(document, target_tokens)

        # Level 2: Summarize sections
        if current_tokens < target_tokens * 5:
            return await self._summarize_sections(document, target_tokens)

        # Level 3: Create hierarchical summary
        return await self._hierarchical_summary(document, target_tokens)

    async def _extract_key_sections(self, document: str, target: int) -> str:
        """Extract most relevant sections based on headings and keywords."""
        sections = self._split_by_headings(document)
        scored_sections = []

        for section in sections:
            score = self._relevance_score(section)
            scored_sections.append((score, section))

        # Take top sections until we hit the token limit
        sorted_sections = sorted(scored_sections, reverse=True)
        result = []
        tokens_used = 0

        for score, section in sorted_sections:
            section_tokens = self.count_tokens(section)
            if tokens_used + section_tokens <= target:
                result.append(section)
                tokens_used += section_tokens
            else:
                break

        return '\n\n'.join(result)

    async def _summarize_sections(self, document: str, target: int) -> str:
        """Summarize each section independently."""
        sections = self._split_by_headings(document)
        summaries = []

        for section in sections:
            heading = section.splitlines()[0]  # first line of each section is its heading
            summary = await self.llm.complete(
                f"Summarize this section in 2-3 sentences:\n{section}"
            )
            summaries.append(f"**{heading}:** {summary}")

        return '\n\n'.join(summaries)

    async def _hierarchical_summary(self, document: str, target: int) -> str:
        """Create multi-level summary for very large documents."""
        # Split into chunks
        chunks = self._split_into_chunks(document, chunk_size=2000)

        # Summarize each chunk
        chunk_summaries = []
        for chunk in chunks:
            summary = await self.llm.complete(
                f"Summarize key points from this text:\n{chunk}"
            )
            chunk_summaries.append(summary)

        # Summarize the summaries
        combined_summaries = '\n'.join(chunk_summaries)
        final_summary = await self.llm.complete(
            f"Create a comprehensive summary from these section summaries:\n{combined_summaries}"
        )

        return final_summary
Compression Strategies by Document Type
Code files:
- Extract function signatures
- Keep docstrings
- Summarize implementation
- Preserve key logic
Reports/Documents:
- Keep executive summary
- Extract headings and key points
- Compress body paragraphs
- Preserve conclusions
Data files:
- Show schema/structure
- Provide sample rows
- Summarize statistics
- List unique values
Conversations:
- Keep decisions and actions
- Compress explanations
- Preserve outcomes
- Remove redundancy
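For the code-file strategy at the top of this list, you don't even need an LLM call: Python's standard ast module can extract signatures and docstrings directly. A sketch (the formatting choices are mine, not a fixed recipe):

import ast

def compress_python_source(source: str) -> str:
    """Reduce a Python file to class/function signatures plus first docstring lines."""
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)  # ignores *args/**kwargs for brevity
            prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            lines.append(f"{prefix} {node.name}({args}):")
        else:
            continue
        doc = ast.get_docstring(node)
        if doc:
            # Keep only the first line of each docstring
            lines.append(f'    """{doc.splitlines()[0]}"""')
    return "\n".join(lines)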
Putting It All Together: The Context Stack
Here's how all these patterns combine in a production system:
       User Message
            ↓
┌─────────────────────────┐
│  Base Context Builder   │
│   (Minimal required)    │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│  Task-Specific Context  │
│ (Profile-based loading) │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│   Conversation Window   │
│   (Recent + Summary)    │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│      RAG Retrieval      │
│  (If knowledge needed)  │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│   Memory Integration    │
│  (Session + Long-term)  │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Final Context Assembly  │
│     (Within budget)     │
└───────────┬─────────────┘
            ↓
     Agent Processing
Example: Quality Planning Request
User: "Create a quality plan for brake pads"
Context assembly:
- Base context (500 tokens):

  {
    "user_id": "user_123",
    "session_id": "sess_456",
    "task": "quality_planning"
  }

- Task-specific context (2,000 tokens):

  {
    "machines": [...],
    "materials": [...],
    "specifications": [...]
  }

- Conversation window (1,500 tokens):

  {
    "summary": "User requested quality plan. Product: brake pads.",
    "recent_messages": [last 5 messages]
  }

- RAG retrieval (1,000 tokens):

  {
    "retrieved_procedures": [
      "Quality planning procedure for automotive parts",
      "Brake pad inspection guidelines",
      "Material specification standards"
    ]
  }

- Memory (500 tokens):

  {
    "user_preferences": {
      "default_tolerance": "±0.01mm"
    },
    "previous_plans": ["Project A", "Project B"]
  }
Total context: 5,500 tokens (within 10K budget, leaving room for response)
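Stitched together, the whole stack is one assembly function. Here's a sketch composing the classes from the earlier patterns; context_manager, conversation_window, retriever, memory_manager, and compressor are assumed instances of those classes, count_tokens comes from the tiktoken sketch earlier, and the 10K budget and trimming order are assumptions you'd tune:

import json

CONTEXT_BUDGET = 10_000  # tokens reserved for context; the rest is left for the response

async def assemble_context(user_id: str, session_id: str, project_id: str,
                           agent_type: str, user_message: str) -> dict:
    context = {"task": agent_type, "user_id": user_id, "session_id": session_id}

    # Pattern 2: profile-based project data
    context.update(context_manager.build_context(agent_type, project_id))

    # Pattern 3: recent messages plus a summary of older ones
    context["history"] = await conversation_window.prepare_history(session_id)

    # Pattern 4: retrieve only knowledge relevant to this message
    context["retrieved_knowledge"] = await retriever.retrieve_relevant(user_message, top_k=3)

    # Pattern 5: merge session state and long-term preferences
    context.update(await memory_manager.build_complete_context(session_id, user_id))

    # Pattern 6: if the total is over budget, compress the largest piece and re-check
    # (in production you'd cap the number of passes)
    while count_tokens(json.dumps(context, default=str)) > CONTEXT_BUDGET:
        largest = max(context, key=lambda k: count_tokens(str(context[k])))
        context[largest] = await compressor.compress_document(
            str(context[largest]), target_tokens=CONTEXT_BUDGET // 4
        )

    return context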
Key Takeaways
Effective context engineering requires:
✅ 1. Lazy Loading
- Start minimal, load on-demand
- Use tools for dynamic retrieval
- Pay only for what you use
✅ 2. Task-Specific Profiles
- Define context needs per agent type
- Load only relevant data
- Isolate contexts between tasks
✅ 3. Smart Windowing
- Keep recent messages in full
- Summarize older messages
- Preserve critical information
✅ 4. RAG for Large Knowledge
- Vector search for relevant chunks
- Don't fit everything in context
- Retrieve top-k similar items
✅ 5. Separate Memory Types
- Session: temporary, auto-expires
- Long-term: persistent, explicit
- Clear boundaries between them
✅ 6. Compression Techniques
- Extract key sections
- Summarize large documents
- Hierarchical summarization
Common Anti-Patterns to Avoid
❌ Loading everything upfront → Wasted tokens, high costs, context limit errors
❌ No conversation history limits → Unbounded growth, eventual failure
❌ Treating all memory as permanent → Cluttered context, privacy issues
❌ No context compression → Can't handle large documents
❌ Same context for all tasks → Irrelevant data confuses agents
The Bottom Line
Context is your agent's memory—and memory is expensive. The key is giving agents the right information at the right time without exceeding token budgets.
What works:
- Lazy loading with on-demand tools
- Task-specific context profiles
- Smart conversation windowing
- RAG for large knowledge bases
- Separate session and long-term memory
- Context compression for large docs
What fails:
- Eager loading of all data
- Unlimited conversation history
- Mixed memory types
- No compression strategy
- Universal context for all tasks
Context engineering isn't about cramming everything into the window—it's about strategic selection, smart summarization, and ruthless prioritization.
Get this right, and your agents have perfect memory at sustainable cost.
About the Author
I build production-grade multi-agent systems with optimized context management strategies. My implementations achieve 60% cost reduction while improving response relevance through lazy loading, RAG, and smart windowing.
Specialized in context engineering, token optimization, and cost-effective agent architectures using CrewAI, Agno, and vector databases.
Open to consulting on context management challenges!
📧 Contact: gupta.akshay1996@gmail.com
Found this helpful? Share it with other AI builders! 🚀
What context management challenges are you facing? Drop a comment below!