Context Engineering: Giving AI Agents Memory Without Breaking the Token Budget

Your AI agent needs to remember things. User preferences, project details, conversation history, tool results—all of it matters for providing intelligent responses. But every token you send costs money and consumes your context window.

Send too little context, and your agent gives generic, unhelpful responses. Send too much, and you hit token limits, rack up costs, and slow down responses.

I've built agents that manage context for manufacturing operations, sales workflows, and productivity systems. Here's how to give agents the right memory at the right time—without exploding your budget.


The Context Budget Problem

Every LLM has a context window—a maximum number of tokens it can process. Claude Sonnet 4.5 has 200K tokens. GPT-4 Turbo has 128K. Sounds like a lot, right?

Here's what actually fits in that budget:

A typical agent's baseline context:

  • System prompt: 1,500 tokens
  • Tool definitions: 2,000 tokens
  • Agent instructions: 1,000 tokens
  • Subtotal: 4,500 tokens

For a project-based agent, add:

  • Project machines list: 800 tokens
  • Materials and specifications: 600 tokens
  • Historical data: 1,200 tokens
  • Subtotal: 2,600 tokens

For a 20-turn conversation:

  • User messages (avg 100 tokens): 2,000 tokens
  • Agent responses (avg 300 tokens): 6,000 tokens
  • Subtotal: 8,000 tokens

Total: 15,100 tokens for a moderate conversation.

And you haven't even added:

  • Retrieved documents
  • Search results
  • Previous workflow outputs
  • Related task context

You're already at 15% of a 100K context window. By turn 50, you're hitting limits. By turn 100, you're out of space.

The naive solution—"send everything always"—fails fast.
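
Before optimizing anything, it helps to measure where your budget actually goes. Here's a minimal sketch using tiktoken as the tokenizer (counts are exact for OpenAI models and a reasonable approximation for others; the file paths are placeholders):

# Rough token accounting for the pieces of an agent's context.
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

def budget_report(parts: dict) -> dict:
    """Token count per context component, plus a total."""
    report = {name: count_tokens(text) for name, text in parts.items()}
    report['total'] = sum(report.values())
    return report

report = budget_report({
    'system_prompt': open('system_prompt.txt').read(),      # placeholder paths
    'tool_definitions': open('tools.json').read(),
    'agent_instructions': open('instructions.md').read(),
})
print(report)  # e.g. {'system_prompt': 1500, ..., 'total': 4500}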


Pattern 1: Lazy Context Loading (Just-In-Time)

The Problem

Most agents load all available context upfront:

# ❌ Eager loading - wasteful
def build_context(project_id):
    return {
        'machines': get_all_machines(project_id),      # 50 machines
        'materials': get_all_materials(project_id),    # 200 materials
        'specs': get_all_specs(project_id),            # 100 specs
        'history': get_full_history(project_id),       # 1000 records
        'users': get_all_users(project_id),            # 30 users
        'schedules': get_all_schedules(project_id)     # 500 schedules
    }

# Result: 8,000+ tokens, most unused

The agent receives data it never uses. You pay for tokens that don't contribute to the response.

The Solution: Load Only What's Needed

Give the agent tools to request context when needed:

# ✅ Lazy loading - efficient
def build_minimal_context(project_id):
    return {
        'project_name': get_project_name(project_id),
        'project_type': get_project_type(project_id)
    }

# Agent gets tools to fetch more, bound to the current project
def build_tools(project_id):
    return [
        Tool(
            name='get_machines',
            description='Fetch machines for this project',
            function=lambda: get_machines(project_id)
        ),
        Tool(
            name='get_materials',
            description='Fetch materials for this project',
            function=lambda: get_materials(project_id)
        ),
        Tool(
            name='search_history',
            description='Search historical records by keyword',
            function=lambda query: search_history(project_id, query)
        )
    ]

When the Agent Needs Data

Turn 1:

  • User: "Create a quality plan for automotive parts"
  • Agent: receives minimal context
  • Agent: calls get_machines() and get_materials()
  • Agent: "I see you have 3 machines available..."

Turn 5:

  • User: "What maintenance was done on Machine A last month?"
  • Agent: calls search_history(query="Machine A maintenance")
  • Agent: "Machine A had preventive maintenance on..."
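
Wiring this up doesn't require a specific framework. Here's a minimal sketch of the turn loop: the model is first asked which tools (if any) it needs, and only those are executed. The llm_decide_tool_calls and llm_respond helpers are hypothetical stand-ins for whatever agent framework or LLM client you use:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    function: Callable

def run_turn(user_message: str, context: dict, tools: list) -> dict:
    """One agent turn: start minimal, fetch extra context only when requested."""
    # Hypothetical helper: ask the LLM which tools it wants for this message
    requested = llm_decide_tool_calls(user_message, context, tools)

    # Execute only the requested tools and merge their results into the context
    tool_index = {tool.name: tool for tool in tools}
    for call in requested:  # e.g. [{'name': 'get_machines', 'args': {}}]
        tool = tool_index[call['name']]
        context[call['name']] = tool.function(**call.get('args', {}))

    # Hypothetical helper: produce the final reply with the enriched context
    return llm_respond(user_message, context)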

Why This Works

Reduced baseline: Start with ~500 tokens instead of 8,000

On-demand loading: Only fetch what's relevant to the current task

Token efficiency: Pay for what you use, not what you might use

Better relevance: Agent gets focused, pertinent data

Real-World Impact

Before lazy loading:

  • Average context: 12,000 tokens per request
  • Cost per conversation: $0.48
  • Irrelevant data: 60-70%

After lazy loading:

  • Average context: 4,500 tokens per request
  • Cost per conversation: $0.18
  • Irrelevant data: <10%

62% cost reduction with better response quality.


Pattern 2: Task-Specific Context Windows

The Problem

Different tasks need different context. A quality planning agent needs machines and materials. A maintenance agent needs service history. An SOP agent needs workstations and takt times.

Loading everything for every task wastes tokens and confuses the agent with irrelevant data.

The Solution: Context Profiles per Agent Type

Define exactly what each agent needs:

class ContextManager:

    CONTEXT_PROFILES = {
        'quality_planning': {
            'required': ['machines', 'materials', 'specifications'],
            'optional': ['previous_plans', 'quality_metrics'],
            'exclude': ['maintenance_history', 'sop_data']
        },

        'maintenance_scheduling': {
            'required': ['machines', 'maintenance_history'],
            'optional': ['upcoming_schedules', 'parts_inventory'],
            'exclude': ['materials', 'quality_specs']
        },

        'sop_creation': {
            'required': ['workstations', 'resources', 'takt_time'],
            'optional': ['existing_sops', 'process_flow'],
            'exclude': ['quality_specs', 'maintenance_history']
        },

        'issue_tracking': {
            'required': ['machines', 'materials'],
            'optional': ['recent_issues', 'resolution_history'],
            'exclude': ['sop_data', 'schedules']
        }
    }

    def build_context(self, agent_type: str, project_id: str) -> dict:
        """
        Build context based on agent's specific needs.
        """
        profile = self.CONTEXT_PROFILES[agent_type]
        context = {}

        # Load required data
        for key in profile['required']:
            context[key] = self._load_data(key, project_id)

        # Load optional data if available (don't fail if missing)
        for key in profile['optional']:
            try:
                context[key] = self._load_data(key, project_id)
            except DataNotFoundError:
                pass  # Optional, skip if unavailable

        # Keys in profile['exclude'] (and anything else not listed) are simply
        # never loaded, so irrelevant data stays out of this agent's context
        return context
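
Usage is then one call per request, and the profile decides what gets loaded. A quick sketch (assuming _load_data knows how to fetch each key for the given project):

manager = ContextManager()

# Quality planning: machines, materials, specifications (+ optional extras if present)
qp_context = manager.build_context('quality_planning', project_id='proj_001')

# Maintenance scheduling: machines and maintenance history instead
maint_context = manager.build_context('maintenance_scheduling', project_id='proj_001')

print(sorted(qp_context.keys()))
# e.g. ['machines', 'materials', 'previous_plans', 'specifications']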

Context Isolation Benefits

Clarity: Agent sees only relevant information

Speed: Less data to load and process

Accuracy: No confusion from unrelated data

Cost: Fewer tokens per request

Debugging: Easy to see what context each agent receives

Example: Quality Planning vs. Maintenance

Quality Planning Context:

{
    "project_name": "Automotive Assembly Line",
    "machines": [
        {"id": "M1", "name": "CNC Mill", "specs": "..."},
        {"id": "M2", "name": "Lathe", "specs": "..."}
    ],
    "materials": [
        {"id": "MAT1", "name": "Steel", "grade": "304"}
    ],
    "specifications": {
        "tolerance": "±0.01mm",
        "surface_finish": "Ra 1.6"
    }
}

Maintenance Scheduling Context:

{
    "project_name": "Automotive Assembly Line",
    "machines": [
        {"id": "M1", "name": "CNC Mill", "last_service": "2024-01-15"},
        {"id": "M2", "name": "Lathe", "last_service": "2024-02-01"}
    ],
    "maintenance_history": [
        {"machine": "M1", "date": "2024-01-15", "type": "preventive"},
        {"machine": "M2", "date": "2024-02-01", "type": "repair"}
    ],
    "upcoming_schedules": [
        {"machine": "M1", "due": "2024-04-15", "type": "preventive"}
    ]
}

Notice: the maintenance context carries no quality specifications, and the quality planning context carries no service history or schedules. Each agent gets exactly what it needs and nothing more.


Pattern 3: Conversation History Windowing

The Problem

LLM conversations grow unbounded. By turn 50, you have:

  • 50 user messages
  • 50 agent responses
  • Tool calls and results
  • System messages

This exceeds context limits and makes responses slower and more expensive.

The Solution: Smart Windowing with Summarization

Keep recent messages in full, summarize older ones:

class ConversationWindow:
    def __init__(self, llm, max_full_messages=10):
        self.llm = llm                      # LLM client used for summarization
        self.max_full_messages = max_full_messages
        self.summary_cache = {}

    async def prepare_history(self, session_id: str) -> list:
        """
        Prepare conversation history within token budget.
        """
        full_history = await self.get_full_history(session_id)

        if len(full_history) <= self.max_full_messages:
            return full_history

        # Split into recent and old
        recent_messages = full_history[-self.max_full_messages:]
        old_messages = full_history[:-self.max_full_messages]

        # Check if we already have a summary
        summary_key = f"{session_id}:{len(old_messages)}"
        if summary_key in self.summary_cache:
            summary = self.summary_cache[summary_key]
        else:
            summary = await self._create_summary(old_messages)
            self.summary_cache[summary_key] = summary

        # Combine summary + recent messages
        return [
            {
                'role': 'system',
                'content': f'Previous conversation summary: {summary}'
            },
            *recent_messages
        ]

    async def _create_summary(self, messages: list) -> str:
        """
        Create concise summary focusing on key information.
        """

        # Extract key information to preserve
        decisions = self._extract_decisions(messages)
        data_collected = self._extract_data(messages)
        progress = self._extract_progress(messages)

        summary_prompt = f"""
        Summarize this conversation segment in 3-4 sentences:

        Focus on:
        - Key decisions: {decisions}
        - Data collected: {data_collected}
        - Progress made: {progress}

        Messages:
        {self._format_messages(messages)}

        Create a concise summary that preserves essential context.
        """

        summary = await self.llm.complete(summary_prompt)
        return summary.strip()
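
The extraction helpers referenced above can start out very simple. Here's one possible sketch of _extract_decisions based on keyword matching (a production version might use an LLM call or structured tool-call metadata instead):

# One possible implementation of ConversationWindow._extract_decisions
# (DECISION_MARKERS lives at module level)
DECISION_MARKERS = ('i want', "let's use", 'go with', 'choose', 'confirmed', 'approved')

def _extract_decisions(self, messages: list) -> list:
    """Collect user messages that look like explicit decisions or choices."""
    decisions = []
    for message in messages:
        if message.get('role') != 'user':
            continue
        text = message.get('content', '').strip()
        if any(marker in text.lower() for marker in DECISION_MARKERS):
            decisions.append(text)
    return decisions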

What to Preserve in Summaries

Always preserve:

  • User decisions and choices
  • Specific data provided (numbers, names, IDs)
  • Task progress and completion status
  • Error messages or issues encountered
  • Tool call results that affect future actions

Can compress:

  • Clarifying questions and answers
  • Explanations of concepts
  • Confirmation messages
  • General chitchat
  • Repetitive information

Example Summary

Original (10 messages, 2,000 tokens):

User: "I need to create a quality plan"
Agent: "What product are you manufacturing?"
User: "Automotive brake pads"
Agent: "What materials are you using?"
User: "Steel alloy, grade 304"
Agent: "What machines will you use?"
User: "CNC Mill M1 and Lathe M2"
Agent: "What tolerances are required?"
User: "±0.01mm"
Agent: "Got it. Let me create the plan..."

Summary (150 tokens):

User requested quality plan for automotive brake pads. 
Materials: Steel alloy grade 304. 
Machines: CNC Mill M1, Lathe M2. 
Tolerance requirement: ±0.01mm. 
Plan creation initiated.

Token Savings

  • Original: 2,000 tokens
  • Summary: 150 tokens
  • Savings: 92.5%

Multiply this across a 50-turn conversation and you save thousands of tokens per request.


Pattern 4: RAG (Retrieval-Augmented Generation) for Large Knowledge Bases

The Problem

Some agents need access to large knowledge bases:

  • 500+ product specifications
  • 1,000+ historical maintenance records
  • 200+ standard operating procedures
  • Complete company documentation

You can't fit this in context. Even with a 200K token window, it's inefficient.

The Solution: Vector Search + Selective Retrieval

Store knowledge in a vector database, retrieve only relevant chunks:

class KnowledgeRetriever:
    def __init__(self, vector_db, embedder):
        self.db = vector_db
        self.embed = embedder  # async callable: text -> embedding vector

    async def retrieve_relevant(self, query: str, top_k: int = 3) -> list:
        """
        Retrieve most relevant knowledge chunks for query.
        """

        # Embed the query
        query_embedding = await self.embed(query)

        # Search vector database
        results = await self.db.search(
            vector=query_embedding,
            limit=top_k,
            threshold=0.7  # Similarity threshold
        )

        # Return relevant chunks
        return [
            {
                'content': result.content,
                'source': result.metadata['source'],
                'relevance': result.score
            }
            for result in results
        ]

    async def build_rag_context(self, user_message: str, base_context: dict) -> dict:
        """
        Augment base context with retrieved knowledge.
        """

        # Retrieve relevant documents
        relevant_docs = await self.retrieve_relevant(user_message)

        # Add to context
        augmented_context = {
            **base_context,
            'retrieved_knowledge': relevant_docs
        }

        return augmented_context

When to Use RAG vs. Direct Context

Use direct context when:

  • Data is small (<2,000 tokens)
  • Data is frequently needed
  • Data is structured and predictable
  • Fast response time is critical

Use RAG when:

  • Knowledge base is large (>10,000 tokens)
  • Data is accessed occasionally
  • Relevance varies by query
  • Full text search is needed

RAG Implementation Example

Scenario: Agent needs to reference maintenance procedures.

Without RAG (fails):

# Can't fit 200 procedures in context
procedures = load_all_procedures()  # 50,000 tokens
# Context limit exceeded!

With RAG (works):

# User asks about specific machine
user_query = "How do I service Machine A?"

# Retrieve only relevant procedures
relevant = await retriever.retrieve_relevant(user_query, top_k=3)
# Result: 3 procedures, ~1,500 tokens

# Agent sees only what's needed
agent_context = {
    'project': project_data,
    'relevant_procedures': relevant  # Just 3, not 200
}

RAG Architecture

User Query
    ↓
Query Embedding
    ↓
Vector Search
    ↓
Top K Results (by similarity)
    ↓
Relevance Filtering (threshold)
    ↓
Context Augmentation
    ↓
Agent Processing

Vector Database Options

Weaviate:

  • Good for production scale
  • Rich filtering capabilities
  • Self-hosted or cloud

Pinecone:

  • Managed service
  • Fast and reliable
  • Easy to get started

pgvector (PostgreSQL):

  • Use existing PostgreSQL
  • Good for moderate scale
  • No additional infrastructure

When to use each:

  • pgvector: <100K vectors, already using PostgreSQL
  • Weaviate: 100K-10M vectors, need rich filtering
  • Pinecone: Any scale, want managed solution
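
If you go the pgvector route, retrieval is plain SQL. A minimal sketch with psycopg2 (the table, its dimensions, and the connection string are illustrative; pgvector's <=> operator is cosine distance):

import psycopg2

conn = psycopg2.connect("dbname=agents user=postgres")  # illustrative DSN

def search_chunks(query_embedding: list, top_k: int = 3):
    """Return the top_k most similar knowledge chunks from a pgvector table."""
    # Assumes: CREATE EXTENSION vector;
    #          CREATE TABLE knowledge_chunks (id serial PRIMARY KEY, content text,
    #                                         source text, embedding vector(1536));
    vector_literal = '[' + ','.join(str(x) for x in query_embedding) + ']'
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, source, 1 - (embedding <=> %s::vector) AS similarity
            FROM knowledge_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vector_literal, vector_literal, top_k),
        )
        return cur.fetchall()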

Pattern 5: Session State vs. Long-Term Memory

The Problem

Agents need two types of memory:

  1. Session memory: Current conversation, temporary
  2. Long-term memory: User preferences, historical decisions, persistent

Treating them the same leads to:

  • Session data polluting long-term memory
  • Long-term memory cluttering sessions
  • Difficulty clearing temporary data
  • Privacy and data retention issues

The Solution: Separate Storage with Clear Boundaries

class MemoryManager:
    def __init__(self, session_store, long_term_store):
        self.session = session_store      # Redis, TTL: 1 hour
        self.long_term = long_term_store  # PostgreSQL, permanent

    async def get_session_context(self, session_id: str) -> dict:
        """
        Get temporary session data.
        Auto-expires after inactivity.
        """
        return await self.session.get(f"session:{session_id}")

    async def get_long_term_context(self, user_id: str) -> dict:
        """
        Get persistent user data.
        Requires explicit deletion.
        """
        return await self.long_term.query(
            "SELECT preferences, history FROM user_memory WHERE user_id = $1",
            user_id
        )

    async def build_complete_context(self, session_id: str, user_id: str) -> dict:
        """
        Combine session and long-term memory.
        """
        session_data = await self.get_session_context(session_id)
        long_term_data = await self.get_long_term_context(user_id)

        return {
            'current_session': session_data,
            'user_memory': long_term_data
        }

    async def save_to_long_term(self, user_id: str, key: str, value):
        """
        Explicitly save important information for future sessions.
        """
        await self.long_term.execute(
            "INSERT INTO user_memory (user_id, key, value) VALUES ($1, $2, $3)",
            user_id, key, value
        )

What Goes Where

Session Storage (Temporary):

  • Current conversation history
  • Active task state
  • Temporary tool results
  • Draft outputs
  • Workflow progress

Long-Term Storage (Permanent):

  • User preferences (language, style)
  • Project associations
  • Historical decisions
  • Learned patterns
  • Completed task outcomes

Example: Quality Planning

Session memory:

{
    "session_id": "sess_123",
    "task": "quality_planning",
    "current_step": 5,
    "collected_data": {
        "product": "brake pads",
        "materials": ["steel 304"],
        "machines": ["M1", "M2"]
    },
    "draft_plan": {...}
}

Long-term memory:

{
    "user_id": "user_456",
    "preferences": {
        "default_tolerance": "±0.01mm",
        "preferred_machines": ["M1", "M2"],
        "notification_style": "summary"
    },
    "completed_plans": [
        {"project": "Project A", "date": "2024-01-15"},
        {"project": "Project B", "date": "2024-02-20"}
    ]
}

Memory Lifecycle

Session memory:

  1. Created on first message
  2. Updated each turn
  3. Auto-expires after 1 hour of inactivity
  4. Can be explicitly cleared

Long-term memory:

  1. Created on user signup
  2. Updated on explicit events (preferences changed, task completed)
  3. Never expires (except for data retention policies)
  4. Requires user action to delete
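
The auto-expiry behavior maps directly onto Redis TTLs. A minimal sketch with redis-py (the one-hour window and key naming match the setup above; host and database are illustrative):

import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)
SESSION_TTL_SECONDS = 3600  # 1 hour of inactivity

def save_session(session_id: str, data: dict) -> None:
    """Write session state and (re)start its expiry clock."""
    r.set(f"session:{session_id}", json.dumps(data), ex=SESSION_TTL_SECONDS)

def load_session(session_id: str):
    """Return session state, refreshing the TTL on access; None if expired."""
    key = f"session:{session_id}"
    raw = r.get(key)
    if raw is None:
        return None  # expired or never existed
    r.expire(key, SESSION_TTL_SECONDS)  # sliding expiry on each read
    return json.loads(raw)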

Pattern 6: Context Compression Techniques

The Problem

Sometimes you need to reference large documents but can't fit them in context. User uploads a 20-page PDF. You need key information but not everything.

The Solution: Multi-Level Compression

class ContextCompressor:
    def __init__(self, llm, count_tokens):
        self.llm = llm                    # LLM client with an async complete() method
        self.count_tokens = count_tokens  # callable: str -> token count

    async def compress_document(self, document: str, target_tokens: int) -> str:
        """
        Compress document to fit within token budget.
        """

        current_tokens = self.count_tokens(document)

        if current_tokens <= target_tokens:
            return document  # Already fits

        # Level 1: Extract key sections
        if current_tokens < target_tokens * 2:
            return await self._extract_key_sections(document, target_tokens)

        # Level 2: Summarize sections
        if current_tokens < target_tokens * 5:
            return await self._summarize_sections(document, target_tokens)

        # Level 3: Create hierarchical summary
        return await self._hierarchical_summary(document, target_tokens)

    async def _extract_key_sections(self, document: str, target: int) -> str:
        """
        Extract most relevant sections based on headings and keywords.
        """
        sections = self._split_by_headings(document)
        scored_sections = []

        for section in sections:
            score = self._relevance_score(section)
            scored_sections.append((score, section))

        # Take top sections until we hit token limit
        sorted_sections = sorted(scored_sections, reverse=True)
        result = []
        tokens_used = 0

        for score, section in sorted_sections:
            section_tokens = self.count_tokens(section)
            if tokens_used + section_tokens <= target:
                result.append(section)
                tokens_used += section_tokens
            else:
                break

        return '\n\n'.join(result)

    async def _summarize_sections(self, document: str, target: int) -> str:
        """
        Summarize each section independently.
        """
        sections = self._split_by_headings(document)
        summaries = []

        for section in sections:
            heading = section.splitlines()[0]  # first line of each split section is its heading
            summary = await self.llm.complete(
                f"Summarize this section in 2-3 sentences:\n{section}"
            )
            summaries.append(f"**{heading}:** {summary}")

        return '\n\n'.join(summaries)

    async def _hierarchical_summary(self, document: str, target: int) -> str:
        """
        Create multi-level summary for very large documents.
        """
        # Split into chunks
        chunks = self._split_into_chunks(document, chunk_size=2000)

        # Summarize each chunk
        chunk_summaries = []
        for chunk in chunks:
            summary = await self.llm.complete(
                f"Summarize key points from this text:\n{chunk}"
            )
            chunk_summaries.append(summary)

        # Summarize the summaries
        combined_summaries = '\n'.join(chunk_summaries)
        final_summary = await self.llm.complete(
            f"Create a comprehensive summary from these section summaries:\n{combined_summaries}"
        )

        return final_summary

Compression Strategies by Document Type

Code files:

  • Extract function signatures
  • Keep docstrings
  • Summarize implementation
  • Preserve key logic

Reports/Documents:

  • Keep executive summary
  • Extract headings and key points
  • Compress body paragraphs
  • Preserve conclusions

Data files:

  • Show schema/structure
  • Provide sample rows
  • Summarize statistics
  • List unique values

Conversations:

  • Keep decisions and actions
  • Compress explanations
  • Preserve outcomes
  • Remove redundancy
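
For the code-file strategy, Python's ast module gets you most of the way: keep signatures and docstrings, drop implementations. A rough sketch (works for Python source only; the output format is up to you):

import ast

def compress_code_file(source: str) -> str:
    """Reduce a Python file to class/function signatures plus first docstring lines."""
    tree = ast.parse(source)
    lines = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = 'async def' if isinstance(node, ast.AsyncFunctionDef) else 'def'
            args = ', '.join(arg.arg for arg in node.args.args)
            lines.append(f"{prefix} {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
        else:
            continue

        docstring = ast.get_docstring(node)
        if docstring:
            lines.append(f'    """{docstring.strip().splitlines()[0]}"""')

    return '\n'.join(lines)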

Putting It All Together: The Context Stack

Here's how all these patterns combine in a production system:

User Message
    ↓
┌─────────────────────────┐
│  Base Context Builder   │
│  (Minimal required)     │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Task-Specific Context   │
│ (Profile-based loading) │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Conversation Window     │
│ (Recent + Summary)      │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ RAG Retrieval           │
│ (If knowledge needed)   │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Memory Integration      │
│ (Session + Long-term)   │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Final Context Assembly  │
│ (Within budget)         │
└───────────┬─────────────┘
            ↓
Agent Processing

Example: Quality Planning Request

User: "Create a quality plan for brake pads"

Context assembly:

  1. Base context (500 tokens):

{
    "user_id": "user_123",
    "session_id": "sess_456",
    "task": "quality_planning"
}

  2. Task-specific context (2,000 tokens):

{
    "machines": [...],
    "materials": [...],
    "specifications": [...]
}

  3. Conversation window (1,500 tokens):

{
    "summary": "User requested quality plan. Product: brake pads.",
    "recent_messages": [last 5 messages]
}

  4. RAG retrieval (1,000 tokens):

{
    "retrieved_procedures": [
        "Quality planning procedure for automotive parts",
        "Brake pad inspection guidelines",
        "Material specification standards"
    ]
}

  5. Memory (500 tokens):

{
    "user_preferences": {
        "default_tolerance": "±0.01mm"
    },
    "previous_plans": ["Project A", "Project B"]
}

Total context: 5,500 tokens (within 10K budget, leaving room for response)
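
Tied together, the stack is one assembly function that runs each stage in order and enforces the budget at the end. A simplified sketch reusing the classes from the earlier patterns (the module-level instances, the count_tokens helper, and the 10K budget are illustrative):

async def assemble_context(user_message, session_id, user_id, project_id,
                           agent_type, token_budget=10_000):
    """Run the context stack top to bottom, staying inside the token budget."""
    # 1-2. Base + task-specific context (Patterns 1 and 2)
    context = ContextManager().build_context(agent_type, project_id)

    # 3. Conversation window: recent messages plus a summary of the rest (Pattern 3)
    context['history'] = await conversation_window.prepare_history(session_id)

    # 4. RAG retrieval for relevant knowledge chunks (Pattern 4)
    context['retrieved_knowledge'] = await retriever.retrieve_relevant(user_message)

    # 5. Session + long-term memory (Pattern 5)
    context['memory'] = await memory_manager.build_complete_context(session_id, user_id)

    # 6. Enforce the budget; compress the bulkiest piece if we overshoot (Pattern 6)
    if count_tokens(str(context)) > token_budget:
        context['history'] = await compressor.compress_document(
            str(context['history']), target_tokens=2_000
        )

    return context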


Key Takeaways

Effective context engineering requires:

1. Lazy Loading

  • Start minimal, load on-demand
  • Use tools for dynamic retrieval
  • Pay only for what you use

2. Task-Specific Profiles

  • Define context needs per agent type
  • Load only relevant data
  • Isolate contexts between tasks

3. Smart Windowing

  • Keep recent messages in full
  • Summarize older messages
  • Preserve critical information

4. RAG for Large Knowledge

  • Vector search for relevant chunks
  • Don't fit everything in context
  • Retrieve top-k similar items

5. Separate Memory Types

  • Session: temporary, auto-expires
  • Long-term: persistent, explicit
  • Clear boundaries between them

6. Compression Techniques

  • Extract key sections
  • Summarize large documents
  • Hierarchical summarization

Common Anti-Patterns to Avoid

Loading everything upfront → Wasted tokens, high costs, context limit errors

No conversation history limits → Unbounded growth, eventual failure

Treating all memory as permanent → Cluttered context, privacy issues

No context compression → Can't handle large documents

Same context for all tasks → Irrelevant data confuses agents


The Bottom Line

Context is your agent's memory—and memory is expensive. The key is giving agents the right information at the right time without exceeding token budgets.

What works:

  • Lazy loading with on-demand tools
  • Task-specific context profiles
  • Smart conversation windowing
  • RAG for large knowledge bases
  • Separate session and long-term memory
  • Context compression for large docs

What fails:

  • Eager loading of all data
  • Unlimited conversation history
  • Mixed memory types
  • No compression strategy
  • Universal context for all tasks

Context engineering isn't about cramming everything into the window—it's about strategic selection, smart summarization, and ruthless prioritization.

Get this right, and your agents have perfect memory at sustainable cost.

About the Author

I build production-grade multi-agent systems with optimized context management strategies. My implementations achieve 60% cost reduction while improving response relevance through lazy loading, RAG, and smart windowing.

Specialized in context engineering, token optimization, and cost-effective agent architectures using CrewAI, Agno, and vector databases.

Open to consulting on context management challenges!
📧 Contact: gupta.akshay1996@gmail.com


Found this helpful? Share it with other AI builders! 🚀

What context management challenges are you facing? Drop a comment below!
