Context Engineering: Giving AI Agents Memory Without Breaking the Token Budget

Your AI agent needs to remember things. User preferences, project details, conversation history, tool results—all of it matters for providing intelligent responses. But every token you send costs money and consumes your context window.

Send too little context, and your agent gives generic, unhelpful responses. Send too much, and you hit token limits, rack up costs, and slow down responses.

I've built agents that manage context for manufacturing operations, sales workflows, and productivity systems. Here's how to give agents the right memory at the right time—without exploding your budget.


The Context Budget Problem

Every LLM has a context window—a maximum number of tokens it can process. Claude Sonnet 4.5 has 200K tokens. GPT-4 Turbo has 128K. Sounds like a lot, right?

Here's what actually fits in that budget:

A typical agent's baseline context:

  • System prompt: 1,500 tokens
  • Tool definitions: 2,000 tokens
  • Agent instructions: 1,000 tokens
  • Subtotal: 4,500 tokens

For a project-based agent, add:

  • Project machines list: 800 tokens
  • Materials and specifications: 600 tokens
  • Historical data: 1,200 tokens
  • Subtotal: 2,600 tokens

For a 20-turn conversation:

  • User messages (avg 100 tokens): 2,000 tokens
  • Agent responses (avg 300 tokens): 6,000 tokens
  • Subtotal: 8,000 tokens

Total: 15,100 tokens for a moderate conversation.

And you haven't even added:

  • Retrieved documents
  • Search results
  • Previous workflow outputs
  • Related task context

You're already at 15% of a 100K context window. By turn 50, you're hitting limits. By turn 100, you're out of space.

The naive solution—"send everything always"—fails fast.
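
Before optimizing anything, it helps to measure where your budget actually goes. Here's a minimal sketch using tiktoken as the tokenizer (counts are exact for OpenAI models and a reasonable approximation for others; the file paths are placeholders):

# Rough token accounting for the pieces of an agent's context.
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

def budget_report(parts: dict) -> dict:
    """Token count per context component, plus a total."""
    report = {name: count_tokens(text) for name, text in parts.items()}
    report['total'] = sum(report.values())
    return report

report = budget_report({
    'system_prompt': open('system_prompt.txt').read(),      # placeholder paths
    'tool_definitions': open('tools.json').read(),
    'agent_instructions': open('instructions.md').read(),
})
print(report)  # e.g. {'system_prompt': 1500, ..., 'total': 4500}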


Pattern 1: Lazy Context Loading (Just-In-Time)

The Problem

Most agents load all available context upfront:

# ❌ Eager loading - wasteful
def build_context(project_id):
    return {
        'machines': get_all_machines(project_id),      # 50 machines
        'materials': get_all_materials(project_id),    # 200 materials
        'specs': get_all_specs(project_id),            # 100 specs
        'history': get_full_history(project_id),       # 1000 records
        'users': get_all_users(project_id),            # 30 users
        'schedules': get_all_schedules(project_id)     # 500 schedules
    }

# Result: 8,000+ tokens, most unused

The agent receives data it never uses. You pay for tokens that don't contribute to the response.

The Solution: Load Only What's Needed

Give the agent tools to request context when needed:

# ✅ Lazy loading - efficient
def build_minimal_context(project_id):
    return {
        'project_name': get_project_name(project_id),
        'project_type': get_project_type(project_id)
    }

# Agent gets tools to fetch more, bound to the current project
def build_tools(project_id):
    return [
        Tool(
            name='get_machines',
            description='Fetch machines for this project',
            function=lambda: get_machines(project_id)
        ),
        Tool(
            name='get_materials',
            description='Fetch materials for this project',
            function=lambda: get_materials(project_id)
        ),
        Tool(
            name='search_history',
            description='Search historical records by keyword',
            function=lambda query: search_history(project_id, query)
        )
    ]

When the Agent Needs Data

Turn 1:

  • User: "Create a quality plan for automotive parts"
  • Agent: receives minimal context
  • Agent: calls get_machines() and get_materials()
  • Agent: "I see you have 3 machines available..."

Turn 5:

  • User: "What maintenance was done on Machine A last month?"
  • Agent: calls search_history(query="Machine A maintenance")
  • Agent: "Machine A had preventive maintenance on..."
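
Wiring this up doesn't require a specific framework. Here's a minimal sketch of the turn loop: the model is first asked which tools (if any) it needs, and only those are executed. The llm_decide_tool_calls and llm_respond helpers are hypothetical stand-ins for whatever agent framework or LLM client you use:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    function: Callable

def run_turn(user_message: str, context: dict, tools: list) -> dict:
    """One agent turn: start minimal, fetch extra context only when requested."""
    # Hypothetical helper: ask the LLM which tools it wants for this message
    requested = llm_decide_tool_calls(user_message, context, tools)

    # Execute only the requested tools and merge their results into the context
    tool_index = {tool.name: tool for tool in tools}
    for call in requested:  # e.g. [{'name': 'get_machines', 'args': {}}]
        tool = tool_index[call['name']]
        context[call['name']] = tool.function(**call.get('args', {}))

    # Hypothetical helper: produce the final reply with the enriched context
    return llm_respond(user_message, context)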

Why This Works

Reduced baseline: Start with ~500 tokens instead of 8,000

On-demand loading: Only fetch what's relevant to the current task

Token efficiency: Pay for what you use, not what you might use

Better relevance: Agent gets focused, pertinent data

Real-World Impact

Before lazy loading:

  • Average context: 12,000 tokens per request
  • Cost per conversation: $0.48
  • Irrelevant data: 60-70%

After lazy loading:

  • Average context: 4,500 tokens per request
  • Cost per conversation: $0.18
  • Irrelevant data: <10%

62% cost reduction with better response quality.


Pattern 2: Task-Specific Context Windows

The Problem

Different tasks need different context. A quality planning agent needs machines and materials. A maintenance agent needs service history. An SOP agent needs workstations and takt times.

Loading everything for every task wastes tokens and confuses the agent with irrelevant data.

The Solution: Context Profiles per Agent Type

Define exactly what each agent needs:

class ContextManager:

    CONTEXT_PROFILES = {
        'quality_planning': {
            'required': ['machines', 'materials', 'specifications'],
            'optional': ['previous_plans', 'quality_metrics'],
            'exclude': ['maintenance_history', 'sop_data']
        },

        'maintenance_scheduling': {
            'required': ['machines', 'maintenance_history'],
            'optional': ['upcoming_schedules', 'parts_inventory'],
            'exclude': ['materials', 'quality_specs']
        },

        'sop_creation': {
            'required': ['workstations', 'resources', 'takt_time'],
            'optional': ['existing_sops', 'process_flow'],
            'exclude': ['quality_specs', 'maintenance_history']
        },

        'issue_tracking': {
            'required': ['machines', 'materials'],
            'optional': ['recent_issues', 'resolution_history'],
            'exclude': ['sop_data', 'schedules']
        }
    }

    def build_context(self, agent_type: str, project_id: str) -> dict:
        """
        Build context based on agent's specific needs.
        """
        profile = self.CONTEXT_PROFILES[agent_type]
        context = {}

        # Load required data
        for key in profile['required']:
            context[key] = self._load_data(key, project_id)

        # Load optional data if available (don't fail if missing)
        for key in profile['optional']:
            try:
                context[key] = self._load_data(key, project_id)
            except DataNotFoundError:
                pass  # Optional, skip if unavailable

        # Keys in profile['exclude'] (and anything else not listed) are simply
        # never loaded, so irrelevant data stays out of this agent's context
        return context
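
Usage is then one call per request, and the profile decides what gets loaded. A quick sketch (assuming _load_data knows how to fetch each key for the given project):

manager = ContextManager()

# Quality planning: machines, materials, specifications (+ optional extras if present)
qp_context = manager.build_context('quality_planning', project_id='proj_001')

# Maintenance scheduling: machines and maintenance history instead
maint_context = manager.build_context('maintenance_scheduling', project_id='proj_001')

print(sorted(qp_context.keys()))
# e.g. ['machines', 'materials', 'previous_plans', 'specifications']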

Context Isolation Benefits

Clarity: Agent sees only relevant information

Speed: Less data to load and process

Accuracy: No confusion from unrelated data

Cost: Fewer tokens per request

Debugging: Easy to see what context each agent receives

Example: Quality Planning vs. Maintenance

Quality Planning Context:

{
    "project_name": "Automotive Assembly Line",
    "machines": [
        {"id": "M1", "name": "CNC Mill", "specs": "..."},
        {"id": "M2", "name": "Lathe", "specs": "..."}
    ],
    "materials": [
        {"id": "MAT1", "name": "Steel", "grade": "304"}
    ],
    "specifications": {
        "tolerance": "±0.01mm",
        "surface_finish": "Ra 1.6"
    }
}

Maintenance Scheduling Context:

{
    "project_name": "Automotive Assembly Line",
    "machines": [
        {"id": "M1", "name": "CNC Mill", "last_service": "2024-01-15"},
        {"id": "M2", "name": "Lathe", "last_service": "2024-02-01"}
    ],
    "maintenance_history": [
        {"machine": "M1", "date": "2024-01-15", "type": "preventive"},
        {"machine": "M2", "date": "2024-02-01", "type": "repair"}
    ],
    "upcoming_schedules": [
        {"machine": "M1", "due": "2024-04-15", "type": "preventive"}
    ]
}

Notice: the maintenance context carries no quality specifications, and the quality planning context carries no service history or schedules. Each agent gets exactly what it needs and nothing more.


Pattern 3: Conversation History Windowing

The Problem

LLM conversations grow unbounded. By turn 50, you have:

  • 50 user messages
  • 50 agent responses
  • Tool calls and results
  • System messages

This exceeds context limits and makes responses slower and more expensive.

The Solution: Smart Windowing with Summarization

Keep recent messages in full, summarize older ones:

class ConversationWindow:
    def __init__(self, llm, max_full_messages=10):
        self.llm = llm                      # LLM client used for summarization
        self.max_full_messages = max_full_messages
        self.summary_cache = {}

    async def prepare_history(self, session_id: str) -> list:
        """
        Prepare conversation history within token budget.
        """
        full_history = await self.get_full_history(session_id)

        if len(full_history) <= self.max_full_messages:
            return full_history

        # Split into recent and old
        recent_messages = full_history[-self.max_full_messages:]
        old_messages = full_history[:-self.max_full_messages]

        # Check if we already have a summary
        summary_key = f"{session_id}:{len(old_messages)}"
        if summary_key in self.summary_cache:
            summary = self.summary_cache[summary_key]
        else:
            summary = await self._create_summary(old_messages)
            self.summary_cache[summary_key] = summary

        # Combine summary + recent messages
        return [
            {
                'role': 'system',
                'content': f'Previous conversation summary: {summary}'
            },
            *recent_messages
        ]

    async def _create_summary(self, messages: list) -> str:
        """
        Create concise summary focusing on key information.
        """

        # Extract key information to preserve
        decisions = self._extract_decisions(messages)
        data_collected = self._extract_data(messages)
        progress = self._extract_progress(messages)

        summary_prompt = f"""
        Summarize this conversation segment in 3-4 sentences:

        Focus on:
        - Key decisions: {decisions}
        - Data collected: {data_collected}
        - Progress made: {progress}

        Messages:
        {self._format_messages(messages)}

        Create a concise summary that preserves essential context.
        """

        summary = await self.llm.complete(summary_prompt)
        return summary.strip()
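
The extraction helpers referenced above can start out very simple. Here's one possible sketch of _extract_decisions based on keyword matching (a production version might use an LLM call or structured tool-call metadata instead):

# One possible implementation of ConversationWindow._extract_decisions
# (DECISION_MARKERS lives at module level)
DECISION_MARKERS = ('i want', "let's use", 'go with', 'choose', 'confirmed', 'approved')

def _extract_decisions(self, messages: list) -> list:
    """Collect user messages that look like explicit decisions or choices."""
    decisions = []
    for message in messages:
        if message.get('role') != 'user':
            continue
        text = message.get('content', '').strip()
        if any(marker in text.lower() for marker in DECISION_MARKERS):
            decisions.append(text)
    return decisions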

What to Preserve in Summaries

Always preserve:

  • User decisions and choices
  • Specific data provided (numbers, names, IDs)
  • Task progress and completion status
  • Error messages or issues encountered
  • Tool call results that affect future actions

Can compress:

  • Clarifying questions and answers
  • Explanations of concepts
  • Confirmation messages
  • General chitchat
  • Repetitive information

Example Summary

Original (10 messages, 2,000 tokens):

User: "I need to create a quality plan"
Agent: "What product are you manufacturing?"
User: "Automotive brake pads"
Agent: "What materials are you using?"
User: "Steel alloy, grade 304"
Agent: "What machines will you use?"
User: "CNC Mill M1 and Lathe M2"
Agent: "What tolerances are required?"
User: "±0.01mm"
Agent: "Got it. Let me create the plan..."

Summary (150 tokens):

User requested quality plan for automotive brake pads. 
Materials: Steel alloy grade 304. 
Machines: CNC Mill M1, Lathe M2. 
Tolerance requirement: ±0.01mm. 
Plan creation initiated.

Token Savings

  • Original: 2,000 tokens
  • Summary: 150 tokens
  • Savings: 92.5%

Multiply this across a 50-turn conversation and you save thousands of tokens per request.


Pattern 4: RAG (Retrieval-Augmented Generation) for Large Knowledge Bases

The Problem

Some agents need access to large knowledge bases:

  • 500+ product specifications
  • 1,000+ historical maintenance records
  • 200+ standard operating procedures
  • Complete company documentation

You can't fit this in context. Even with a 200K token window, it's inefficient.

The Solution: Vector Search + Selective Retrieval

Store knowledge in a vector database, retrieve only relevant chunks:

class KnowledgeRetriever:
    def __init__(self, vector_db, embedder):
        self.db = vector_db
        self.embed = embedder  # async callable: text -> embedding vector

    async def retrieve_relevant(self, query: str, top_k: int = 3) -> list:
        """
        Retrieve most relevant knowledge chunks for query.
        """

        # Embed the query
        query_embedding = await self.embed(query)

        # Search vector database
        results = await self.db.search(
            vector=query_embedding,
            limit=top_k,
            threshold=0.7  # Similarity threshold
        )

        # Return relevant chunks
        return [
            {
                'content': result.content,
                'source': result.metadata['source'],
                'relevance': result.score
            }
            for result in results
        ]

    async def build_rag_context(self, user_message: str, base_context: dict) -> dict:
        """
        Augment base context with retrieved knowledge.
        """

        # Retrieve relevant documents
        relevant_docs = await self.retrieve_relevant(user_message)

        # Add to context
        augmented_context = {
            **base_context,
            'retrieved_knowledge': relevant_docs
        }

        return augmented_context

When to Use RAG vs. Direct Context

Use direct context when:

  • Data is small (<2,000 tokens)
  • Data is frequently needed
  • Data is structured and predictable
  • Fast response time is critical

Use RAG when:

  • Knowledge base is large (>10,000 tokens)
  • Data is accessed occasionally
  • Relevance varies by query
  • Full text search is needed

RAG Implementation Example

Scenario: Agent needs to reference maintenance procedures.

Without RAG (fails):

# Can't fit 200 procedures in context
procedures = load_all_procedures()  # 50,000 tokens
# Context limit exceeded!

With RAG (works):

# User asks about specific machine
user_query = "How do I service Machine A?"

# Retrieve only relevant procedures
relevant = await retriever.retrieve_relevant(user_query, top_k=3)
# Result: 3 procedures, ~1,500 tokens

# Agent sees only what's needed
agent_context = {
    'project': project_data,
    'relevant_procedures': relevant  # Just 3, not 200
}

RAG Architecture

User Query
    ↓
Query Embedding
    ↓
Vector Search
    ↓
Top K Results (by similarity)
    ↓
Relevance Filtering (threshold)
    ↓
Context Augmentation
    ↓
Agent Processing

Vector Database Options

Weaviate:

  • Good for production scale
  • Rich filtering capabilities
  • Self-hosted or cloud

Pinecone:

  • Managed service
  • Fast and reliable
  • Easy to get started

pgvector (PostgreSQL):

  • Use existing PostgreSQL
  • Good for moderate scale
  • No additional infrastructure

When to use each:

  • pgvector: <100K vectors, already using PostgreSQL
  • Weaviate: 100K-10M vectors, need rich filtering
  • Pinecone: Any scale, want managed solution
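
If you go the pgvector route, retrieval is plain SQL. A minimal sketch with psycopg2 (the table, its dimensions, and the connection string are illustrative; pgvector's <=> operator is cosine distance):

import psycopg2

conn = psycopg2.connect("dbname=agents user=postgres")  # illustrative DSN

def search_chunks(query_embedding: list, top_k: int = 3):
    """Return the top_k most similar knowledge chunks from a pgvector table."""
    # Assumes: CREATE EXTENSION vector;
    #          CREATE TABLE knowledge_chunks (id serial PRIMARY KEY, content text,
    #                                         source text, embedding vector(1536));
    vector_literal = '[' + ','.join(str(x) for x in query_embedding) + ']'
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, source, 1 - (embedding <=> %s::vector) AS similarity
            FROM knowledge_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vector_literal, vector_literal, top_k),
        )
        return cur.fetchall()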

Pattern 5: Session State vs. Long-Term Memory

The Problem

Agents need two types of memory:

  1. Session memory: Current conversation, temporary
  2. Long-term memory: User preferences, historical decisions, persistent

Treating them the same leads to:

  • Session data polluting long-term memory
  • Long-term memory cluttering sessions
  • Difficulty clearing temporary data
  • Privacy and data retention issues

The Solution: Separate Storage with Clear Boundaries

class MemoryManager:
    def __init__(self, session_store, long_term_store):
        self.session = session_store      # Redis, TTL: 1 hour
        self.long_term = long_term_store  # PostgreSQL, permanent

    async def get_session_context(self, session_id: str) -> dict:
        """
        Get temporary session data.
        Auto-expires after inactivity.
        """
        return await self.session.get(f"session:{session_id}")

    async def get_long_term_context(self, user_id: str) -> dict:
        """
        Get persistent user data.
        Requires explicit deletion.
        """
        return await self.long_term.query(
            "SELECT preferences, history FROM user_memory WHERE user_id = $1",
            user_id
        )

    async def build_complete_context(self, session_id: str, user_id: str) -> dict:
        """
        Combine session and long-term memory.
        """
        session_data = await self.get_session_context(session_id)
        long_term_data = await self.get_long_term_context(user_id)

        return {
            'current_session': session_data,
            'user_memory': long_term_data
        }

    async def save_to_long_term(self, user_id: str, key: str, value):
        """
        Explicitly save important information for future sessions.
        """
        await self.long_term.execute(
            "INSERT INTO user_memory (user_id, key, value) VALUES ($1, $2, $3)",
            user_id, key, value
        )

What Goes Where

Session Storage (Temporary):

  • Current conversation history
  • Active task state
  • Temporary tool results
  • Draft outputs
  • Workflow progress

Long-Term Storage (Permanent):

  • User preferences (language, style)
  • Project associations
  • Historical decisions
  • Learned patterns
  • Completed task outcomes

Example: Quality Planning

Session memory:

{
    "session_id": "sess_123",
    "task": "quality_planning",
    "current_step": 5,
    "collected_data": {
        "product": "brake pads",
        "materials": ["steel 304"],
        "machines": ["M1", "M2"]
    },
    "draft_plan": {...}
}

Long-term memory:

{
    "user_id": "user_456",
    "preferences": {
        "default_tolerance": "±0.01mm",
        "preferred_machines": ["M1", "M2"],
        "notification_style": "summary"
    },
    "completed_plans": [
        {"project": "Project A", "date": "2024-01-15"},
        {"project": "Project B", "date": "2024-02-20"}
    ]
}

Memory Lifecycle

Session memory:

  1. Created on first message
  2. Updated each turn
  3. Auto-expires after 1 hour of inactivity
  4. Can be explicitly cleared

Long-term memory:

  1. Created on user signup
  2. Updated on explicit events (preferences changed, task completed)
  3. Never expires (except for data retention policies)
  4. Requires user action to delete
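
The auto-expiry behavior maps directly onto Redis TTLs. A minimal sketch with redis-py (the one-hour window and key naming match the setup above; host and database are illustrative):

import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)
SESSION_TTL_SECONDS = 3600  # 1 hour of inactivity

def save_session(session_id: str, data: dict) -> None:
    """Write session state and (re)start its expiry clock."""
    r.set(f"session:{session_id}", json.dumps(data), ex=SESSION_TTL_SECONDS)

def load_session(session_id: str):
    """Return session state, refreshing the TTL on access; None if expired."""
    key = f"session:{session_id}"
    raw = r.get(key)
    if raw is None:
        return None  # expired or never existed
    r.expire(key, SESSION_TTL_SECONDS)  # sliding expiry on each read
    return json.loads(raw)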

Pattern 6: Context Compression Techniques

The Problem

Sometimes you need to reference large documents but can't fit them in context. User uploads a 20-page PDF. You need key information but not everything.

The Solution: Multi-Level Compression

class ContextCompressor:
    def __init__(self, llm, count_tokens):
        self.llm = llm                    # LLM client with an async complete() method
        self.count_tokens = count_tokens  # callable: str -> token count

    async def compress_document(self, document: str, target_tokens: int) -> str:
        """
        Compress document to fit within token budget.
        """

        current_tokens = self.count_tokens(document)

        if current_tokens <= target_tokens:
            return document  # Already fits

        # Level 1: Extract key sections
        if current_tokens < target_tokens * 2:
            return await self._extract_key_sections(document, target_tokens)

        # Level 2: Summarize sections
        if current_tokens < target_tokens * 5:
            return await self._summarize_sections(document, target_tokens)

        # Level 3: Create hierarchical summary
        return await self._hierarchical_summary(document, target_tokens)

    async def _extract_key_sections(self, document: str, target: int) -> str:
        """
        Extract most relevant sections based on headings and keywords.
        """
        sections = self._split_by_headings(document)
        scored_sections = []

        for section in sections:
            score = self._relevance_score(section)
            scored_sections.append((score, section))

        # Take top sections until we hit token limit
        sorted_sections = sorted(scored_sections, reverse=True)
        result = []
        tokens_used = 0

        for score, section in sorted_sections:
            section_tokens = self.count_tokens(section)
            if tokens_used + section_tokens <= target:
                result.append(section)
                tokens_used += section_tokens
            else:
                break

        return '\n\n'.join(result)

    async def _summarize_sections(self, document: str, target: int) -> str:
        """
        Summarize each section independently.
        """
        sections = self._split_by_headings(document)
        summaries = []

        for section in sections:
            heading = section.splitlines()[0]  # first line of each split section is its heading
            summary = await self.llm.complete(
                f"Summarize this section in 2-3 sentences:\n{section}"
            )
            summaries.append(f"**{heading}:** {summary}")

        return '\n\n'.join(summaries)

    async def _hierarchical_summary(self, document: str, target: int) -> str:
        """
        Create multi-level summary for very large documents.
        """
        # Split into chunks
        chunks = self._split_into_chunks(document, chunk_size=2000)

        # Summarize each chunk
        chunk_summaries = []
        for chunk in chunks:
            summary = await self.llm.complete(
                f"Summarize key points from this text:\n{chunk}"
            )
            chunk_summaries.append(summary)

        # Summarize the summaries
        combined_summaries = '\n'.join(chunk_summaries)
        final_summary = await self.llm.complete(
            f"Create a comprehensive summary from these section summaries:\n{combined_summaries}"
        )

        return final_summary

Compression Strategies by Document Type

Code files:

  • Extract function signatures
  • Keep docstrings
  • Summarize implementation
  • Preserve key logic

Reports/Documents:

  • Keep executive summary
  • Extract headings and key points
  • Compress body paragraphs
  • Preserve conclusions

Data files:

  • Show schema/structure
  • Provide sample rows
  • Summarize statistics
  • List unique values

Conversations:

  • Keep decisions and actions
  • Compress explanations
  • Preserve outcomes
  • Remove redundancy
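
For the code-file strategy, Python's ast module gets you most of the way: keep signatures and docstrings, drop implementations. A rough sketch (works for Python source only; the output format is up to you):

import ast

def compress_code_file(source: str) -> str:
    """Reduce a Python file to class/function signatures plus first docstring lines."""
    tree = ast.parse(source)
    lines = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = 'async def' if isinstance(node, ast.AsyncFunctionDef) else 'def'
            args = ', '.join(arg.arg for arg in node.args.args)
            lines.append(f"{prefix} {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
        else:
            continue

        docstring = ast.get_docstring(node)
        if docstring:
            lines.append(f'    """{docstring.strip().splitlines()[0]}"""')

    return '\n'.join(lines)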

Putting It All Together: The Context Stack

Here's how all these patterns combine in a production system:

User Message
    ↓
┌─────────────────────────┐
│  Base Context Builder   │
│  (Minimal required)     │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Task-Specific Context   │
│ (Profile-based loading) │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Conversation Window     │
│ (Recent + Summary)      │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ RAG Retrieval           │
│ (If knowledge needed)   │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Memory Integration      │
│ (Session + Long-term)   │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Final Context Assembly  │
│ (Within budget)         │
└───────────┬─────────────┘
            ↓
Agent Processing

Example: Quality Planning Request

User: "Create a quality plan for brake pads"

Context assembly:

  1. Base context (500 tokens):

{
    "user_id": "user_123",
    "session_id": "sess_456",
    "task": "quality_planning"
}

  2. Task-specific context (2,000 tokens):

{
    "machines": [...],
    "materials": [...],
    "specifications": [...]
}

  3. Conversation window (1,500 tokens):

{
    "summary": "User requested quality plan. Product: brake pads.",
    "recent_messages": [last 5 messages]
}

  4. RAG retrieval (1,000 tokens):

{
    "retrieved_procedures": [
        "Quality planning procedure for automotive parts",
        "Brake pad inspection guidelines",
        "Material specification standards"
    ]
}

  5. Memory (500 tokens):

{
    "user_preferences": {
        "default_tolerance": "±0.01mm"
    },
    "previous_plans": ["Project A", "Project B"]
}

Total context: 5,500 tokens (within 10K budget, leaving room for response)
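
Tied together, the stack is one assembly function that runs each stage in order and enforces the budget at the end. A simplified sketch reusing the classes from the earlier patterns (the module-level instances, the count_tokens helper, and the 10K budget are illustrative):

async def assemble_context(user_message, session_id, user_id, project_id,
                           agent_type, token_budget=10_000):
    """Run the context stack top to bottom, staying inside the token budget."""
    # 1-2. Base + task-specific context (Patterns 1 and 2)
    context = ContextManager().build_context(agent_type, project_id)

    # 3. Conversation window: recent messages plus a summary of the rest (Pattern 3)
    context['history'] = await conversation_window.prepare_history(session_id)

    # 4. RAG retrieval for relevant knowledge chunks (Pattern 4)
    context['retrieved_knowledge'] = await retriever.retrieve_relevant(user_message)

    # 5. Session + long-term memory (Pattern 5)
    context['memory'] = await memory_manager.build_complete_context(session_id, user_id)

    # 6. Enforce the budget; compress the bulkiest piece if we overshoot (Pattern 6)
    if count_tokens(str(context)) > token_budget:
        context['history'] = await compressor.compress_document(
            str(context['history']), target_tokens=2_000
        )

    return context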


Key Takeaways

Effective context engineering requires:

1. Lazy Loading

  • Start minimal, load on-demand
  • Use tools for dynamic retrieval
  • Pay only for what you use

2. Task-Specific Profiles

  • Define context needs per agent type
  • Load only relevant data
  • Isolate contexts between tasks

3. Smart Windowing

  • Keep recent messages in full
  • Summarize older messages
  • Preserve critical information

4. RAG for Large Knowledge

  • Vector search for relevant chunks
  • Don't fit everything in context
  • Retrieve top-k similar items

5. Separate Memory Types

  • Session: temporary, auto-expires
  • Long-term: persistent, explicit
  • Clear boundaries between them

6. Compression Techniques

  • Extract key sections
  • Summarize large documents
  • Hierarchical summarization

Common Anti-Patterns to Avoid

Loading everything upfront → Wasted tokens, high costs, context limit errors

No conversation history limits → Unbounded growth, eventual failure

Treating all memory as permanent → Cluttered context, privacy issues

No context compression → Can't handle large documents

Same context for all tasks → Irrelevant data confuses agents


The Bottom Line

Context is your agent's memory—and memory is expensive. The key is giving agents the right information at the right time without exceeding token budgets.

What works:

  • Lazy loading with on-demand tools
  • Task-specific context profiles
  • Smart conversation windowing
  • RAG for large knowledge bases
  • Separate session and long-term memory
  • Context compression for large docs

What fails:

  • Eager loading of all data
  • Unlimited conversation history
  • Mixed memory types
  • No compression strategy
  • Universal context for all tasks

Context engineering isn't about cramming everything into the window—it's about strategic selection, smart summarization, and ruthless prioritization.

Get this right, and your agents have perfect memory at sustainable cost.

About the Author

I build production-grade multi-agent systems with optimized context management strategies. My implementations achieve 60% cost reduction while improving response relevance through lazy loading, RAG, and smart windowing.

Specialized in context engineering, token optimization, and cost-effective agent architectures using CrewAI, Agno, and vector databases.

Open to consulting on context management challenges!
📧 Contact: gupta.akshay1996@gmail.com


Found this helpful? Share it with other AI builders! 🚀

What context management challenges are you facing? Drop a comment below!
