DEV Community: Akshay Gupta

I scraped 25 AI/tech communities for 6 months. Here's what the data actually says.

Akshay Gupta — Thu, 26 Mar 2026 11:32:47 +0000

There's a funny thing that happens when you track what developers actually say across 25 platforms simultaneously: you discover that most of what passes for "market intelligence" is vibes.

Press says a technology is the next big thing. VCs pour money in. Twitter gets excited. But what are the people actually building things saying? Often something completely different.

I built a platform to find out — and then open-sourced it.

GitHub: ai-community-intelligence

The question that started this

I kept noticing a pattern: the technologies that developers on Reddit were raving about often had lukewarm reception on Hacker News. The tools getting VC funding sometimes had GitHub repos with declining velocity. Job boards were hiring for skills that community sentiment said were already peaking.

No single source tells you the truth. But when you cross-reference 25 of them — communities, code repos, research papers, job postings, news — patterns emerge that are impossible to see otherwise.

So I built Community Mind Mirror, a platform that scrapes 25 data sources, processes them through statistical + LLM analysis, and runs 10 cross-source intelligence agents that surface signals no individual platform can show you.

Then I open-sourced the whole thing.

Some things the data revealed

Here are some of the more interesting findings after running this across 200K+ records from Reddit (55 subreddits), Hacker News, GitHub (675+ repos), ArXiv, YouTube (29 channels), ProductHunt, Y Combinator, Stack Overflow, 10+ job boards, and news feeds.

1. Hype vs Reality is measurable

You can actually quantify the gap between what press/VCs say about a sector and what builders think. I call this the Hype vs Reality Index — it compares builder sentiment (from Reddit, HN, GitHub discussions) against press/VC sentiment (from news, funding announcements, ProductHunt) for each sector.

For some sectors, this gap is enormous — meaning either the money is wrong or the builders are. Historically? The builders are right more often.

The Traction Scorer was built specifically to cut through hype. A technology trending on Twitter means nothing. But if it has GitHub velocity AND package downloads AND organic community mentions AND companies are hiring for it — that's real traction. The scoring weights:

GitHub stars + commit velocity: 30%
Package downloads (PyPI/npm): 20%
Organic community mentions: 15%
Job listings: 10%
Recommendation rate: 10%
Remaining signals: 15%

2. When Reddit and Hacker News disagree, pay attention

This was one of the most surprising findings. When builders on Reddit are bullish on a technology but HN engineers are skeptical (or vice versa), it's often an early warning signal.

The Platform Divergence agent tracks this in real time. It compares sentiment scores across Reddit, HN, YouTube, and ProductHunt for the same topic. In the data, these disagreements tend to resolve within 3-6 months — and predicting the direction is genuinely valuable.

The agent classifies each divergence into one of four statuses: correction_expected, genuine_adoption, hype_bubble, or early_signal.

3. "Switched from X to Y" is an underrated signal

People publicly announcing they switched from one tool to another is one of the most honest data points you can find. Nobody has an incentive to lie about it.

The system extracts these migration patterns automatically across all community sources. Phrases like "switched from X to Y", "replaced X with Y", "migrated from X to Y" get parsed and aggregated. When you see 50+ people independently making the same switch over a month — that's a competitive signal no press release will tell you.

4. Every community frustration is a product opportunity

People complaining is data. The Pain Point Processor clusters frustrations from across Reddit, HN, and Stack Overflow by topic, scores them by intensity, and checks whether any existing product solves the problem.

When the Market Gap Detector agent combines these pain points with job market data, it finds opportunities where high pain + zero solutions + active hiring = something worth building.

The formula: gap_score = pain_score × (1 / existing_products) × (1 + job_postings/100)

Some of the gaps it's surfaced are surprisingly specific and actionable.

5. You can track a paper's journey from research to production

ArXiv papers don't stay academic forever. Some of them become GitHub repos within weeks, get HuggingFace model uploads within months, and show up in community discussions shortly after. Then they appear on ProductHunt. Then companies start hiring for the underlying skill.

The Research Pipeline agent tracks this entire journey: ArXiv → GitHub → HuggingFace → Community → ProductHunt → Jobs. The metric it produces is "days to commercialization" — and it's getting shorter every quarter.

6. Opinion leaders shift their stances — and that's a leading indicator

The system profiles 3,400+ community leaders across platforms — their core beliefs, communication style, expertise, and influence type. When an opinion leader changes their stance on a topic, the Leader Shift Detection processor catches it.

Why does this matter? Because when 5 influential developers independently go from skeptical to enthusiastic about a technology within the same month, that's a signal the broader community usually follows 2-3 months later.

7. The job market tells you what's real

Job postings are one of the most honest signals in the dataset. Companies don't hire for technologies they're not serious about.

The system pulls from 10+ job boards plus ATS feeds from 57 companies (including OpenAI, Anthropic, Figma, Notion, Vercel, Databricks) via Greenhouse, Lever, and Ashby APIs. The Job Intelligence Processor extracts structured data: role category, seniority, salary (normalized to annual USD), tech stack, company stage, and culture signals.

The Talent Flow agent then maps skill supply vs demand with salary pressure indicators. When a skill has high demand but low supply, salaries rise — and that tells you where the market is heading.

8. Where the smart money converges

When YC companies cluster around a sector, VCs write about it, builders create repos for it, and community volume spikes simultaneously — something is happening.

The Smart Money Tracker watches for this convergence. It combines YC batch composition, VC-focused news articles, builder GitHub activity, and community discussion volume to identify sectors where capital and talent are flowing simultaneously.

9. Narratives shift before markets do

Every technology has a "story" the community tells about it. The AI narrative went from "this will take all our jobs" to "this is a productivity tool" to "this is overhyped" to "this is quietly useful" — all within 18 months.

The Narrative Shift agent detects these transitions by comparing older discussion frames with recent ones. When the story changes, markets follow — but there's usually a lag where the old pricing/valuation hasn't caught up to the new sentiment.

How it works (the short version)

The platform has a 3-layer processing pipeline:

Layer 1 — No LLM, fast and free. VADER sentiment scoring on every post. Regex-based product mention detection. Migration pattern extraction. Complaint clustering. This runs on everything at near-zero cost.

Layer 2 — Statistical analysis. Topic velocity (24h mentions vs 6-day average). Hype vs Reality Index. Influence scoring. Platform divergence measurement. All computed, no LLM tokens burned.

Layer 3 — LLM-powered deep analysis. Topic extraction with opinion camps. Persona profiling. Pain point synthesis. Gig classification. Product review synthesis. This is where gpt-4o-mini earns its keep — but the spending tracker keeps costs under control.

Then 10 cross-source agents combine signals across all the processed data to produce intelligence that no single source can provide.

A full pipeline run costs $0.50 to $2.00.

Who finds this useful

I've found different people care about very different parts of the data:

If you're a founder — the market gap detector and competitive threat analysis are gold. Knowing where people are frustrated and nobody's solving it is literally product-market fit detection.

If you're a VC or investor — the traction scorer and hype vs reality index help cut through noise. Is this company actually gaining users, or just getting press? The community reaction to funding rounds is also telling.

If you're a product manager — technology lifecycle mapping and platform divergence help with timing. Is this technology too early to bet on? Already commoditized? And what are users of competing products actually complaining about?

If you're hiring — the talent flow agent and gig board (2,600+ classified opportunities from 21 subreddits) show where the market is going. Which skills are in shortage? Where are salaries under pressure?

Try it yourself

The whole thing is open-source, MIT licensed.

git clone https://github.com/akshayturtle/ai-community-intelligence.git
cd community-mind-mirror/community-mind-mirror

cp .env.example .env
# Set DATABASE_URL and your OpenAI-compatible API key

docker-compose up -d          # Postgres + Redis
pip install -r requirements.txt
python init_db.py             # Create 45 tables

uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
cd dashboard && npm install && npm run dev

# Run the full pipeline
python run_scrapers_bg.py

Most scrapers work without any API keys — Reddit (RSS), Hacker News, ArXiv, all job boards, PyPI, npm, HuggingFace, Papers with Code — all public endpoints. You just need Postgres and an LLM API key.

You can also run individual pieces:

python main.py --scraper reddit        # Just Reddit
python main.py --processor pain_points # Just pain point analysis
python main.py --agent market_gaps     # Just the market gap detector
python main.py --summary               # See all table counts

What I'd love feedback on

The signal agents are the part I think has the most potential — but also the most room to improve. Some questions I'm thinking about:

What other cross-source signals would be useful? I have 10 agents, but the pattern of combining data from community + code + jobs + research can probably surface more.
How would you validate the traction scorer? The weights were set based on intuition and iteration. There's probably a more rigorous way to calibrate them.
Should platform divergence be weighted by platform? Right now Reddit and HN are treated equally. But maybe one is a better leading indicator for certain technology categories.

If any of these questions interest you, the codebase is designed to make it easy to add new agents. The pattern is: query across tables in Python, structure the data, send to LLM for synthesis.

GitHub: github.com/akshayturtle/ai-community-intelligence

Built by Turtle Techsai — we build AI-powered intelligence tools. If you're interested in a custom deployment or have a use case in mind, happy to chat: akshay.gupta@turtletechsai.com

What cross-source signals would you find most useful? Drop your ideas in the comments — I'm genuinely looking for what to build next.

Rethinking API Design for AI Agents: From Data Plumbing to Intelligent Interfaces

Akshay Gupta — Wed, 10 Dec 2025 16:50:41 +0000

The Problem: APIs Built for Humans, Not Machines That Think

Your APIs were designed for a different era. They were built for:

Developers writing if statements
React components fetching data
Dashboards displaying records

But today's autonomous AI agents don't just fetch data—they need to understand, reason, and decide. When you feed them raw database records and HTTP status codes, you're asking them to do the job your APIs should have already done.

The hard truth: If your APIs only expose data, you're feeding intelligence with noise.

What Makes an API "Agent-Ready"?

The shift from traditional to agentic API design isn't about adding more endpoints. It's about fundamentally rethinking what your APIs should provide.

Traditional APIs	Agent-Ready APIs
Return raw records	Return interpreted insights
Expose CRUD operations	Expose business capabilities
"Here's the data"	"Here's what it means"
Fine-grained (20+ calls/task)	Goal-oriented (2-3 calls/task)
Developer-friendly	Reasoning-friendly

Example: Equipment Maintenance Status

Traditional API Response:

{
  "equipment_id": "CNC-001",
  "last_maintenance": "2024-11-01",
  "maintenance_interval_days": 30,
  "maintenance_type": "preventive"
}

The agent must now:

Calculate if it's overdue (date math)
Assess the risk level (business logic)
Determine priority (domain knowledge)
Generate a recommendation (reasoning)

Agent-Ready API Response:

{
  "equipment_id": "CNC-001",
  "last_maintenance": "2024-11-01",
  "maintenance_interval_days": 30,
  "maintenance_type": "preventive",

  // 🔥 Semantic enrichment
  "status": "overdue",
  "overdue_by_days": 8,
  "risk_level": "high",
  "priority": 1,
  "next_due_date": "2024-12-01",
  "recommendation": "Schedule immediately - approaching critical threshold",
  "impact_if_delayed": "Estimated 4-hour production loss, $12,000 cost"
}

Now the agent can act instead of calculate.

The Three Pillars of Agent-Ready APIs

1. Clarity: Speak Business Intent, Not Database Schema

Agents shouldn't need to reverse-engineer your domain logic from 15 microservice calls.

❌ Bad: Fragmented Microservices

GET /users/{id}
GET /accounts/{id}
GET /transactions/{id}
GET /credit_scores/{id}
GET /eligibility_rules
[Agent orchestrates all of this]

✅ Good: Intent-Based API

GET /customer_loan_eligibility/{id}

Returns:
{
  "eligible": true,
  "confidence": 0.94,
  "pre_approved_amount": 50000,
  "rationale": "Strong credit history, stable income, low debt ratio",
  "next_steps": ["Submit income verification", "Review terms"]
}

The difference: One call vs. five. Clear intent vs. scattered logic.

2. Context: Add Meaning, Not Just Data

Raw numbers are useless without interpretation. Agents need semantic context to understand what data means.

Example: Risk Assessment

Without Context:

{
  "risk_score": 67,
  "incidents_last_90_days": 3
}

Questions the agent must answer:

Is 67 high or low?
Are 3 incidents concerning?
What should happen next?

With Context:

{
  "risk_score": 67,
  "risk_category": "moderate",
  "percentile": "82nd", // Worse than 82% of peers

  "incidents_last_90_days": 3,
  "trend": "increasing", // Was 1 per month, now 1 per week
  "comparison": "2x industry average",

  "interpretation": "Risk level elevated due to incident frequency increase",
  "suggested_actions": [
    "Implement additional safety protocols",
    "Schedule equipment inspection",
    "Review operator training records"
  ]
}

Now the agent understands not just what but why it matters and what to do.

3. Consistency: Predictable Contracts Agents Can Trust

Agents need stable, well-governed APIs. Inconsistent schemas break reasoning chains.

❌ Inconsistent (Agent nightmare):

Endpoint A returns: {"created_at": "2024-11-01"}
Endpoint B returns: {"createdDate": "2024-11-01T00:00:00Z"}
Endpoint C returns: {"timestamp": 1698796800}

✅ Consistent (Agent friendly):

All timestamps: ISO 8601 format
All IDs: UUID v4
All currencies: ISO 4217 codes
All status fields: Enum with defined values

When agents can trust your contracts, they can reason at scale.

The Microservices Trap: When Modularity Becomes Fragmentation

Microservices gave us scalability and independence. But they created a new problem: cognitive fragmentation for AI agents.

Scenario: Determining Expedited Shipping Eligibility

Microservices Architecture (8 API calls):

1. GET /customer/{id}/profile
2. GET /customer/{id}/membership_status
3. GET /customer/{id}/order_history
4. GET /orders/{order_id}/items
5. GET /inventory/availability
6. GET /shipping/zones/{zip}
7. GET /shipping/rates
8. [Agent does complex orchestration logic]

Issues:

8 network round-trips (latency)
8 potential failure points (reliability)
Complex business logic in agent code (maintainability)
Agent must know your entire service topology (coupling)

Agent-Ready Architecture (1 API call):

GET /shipping/eligibility/{order_id}

Returns:
{
  "expedited_eligible": true,
  "confidence": 0.97,
  "reason": "Premium member + in-stock inventory + metro area",
  "estimated_delivery": "2024-12-11",
  "cost": 12.99,
  "alternatives": [
    {
      "type": "standard",
      "estimated_delivery": "2024-12-14",
      "cost": 5.99
    }
  ]
}

The key insight: Microservices are perfect for your internal architecture. But your external API layer should hide that complexity behind goal-oriented endpoints.

Two Approaches to Adding Intelligence

When transforming APIs to be agent-ready, you have two options:

Approach 1: Pure Logic (Deterministic Intelligence) ✅

Add semantic enrichment using business rules and calculations—no LLMs required.

def enrich_maintenance_record(record):
    # Calculate semantic fields
    days_overdue = (today - record['last_done']).days - record['interval']

    # Apply business rules
    if days_overdue > 14:
        risk = "critical"
        priority = 1
        recommendation = "URGENT: Schedule immediately"
    elif days_overdue > 7:
        risk = "high"
        priority = 2
        recommendation = "Schedule within 48 hours"
    elif days_overdue > 0:
        risk = "moderate"
        priority = 3
        recommendation = "Schedule this week"
    else:
        risk = "low"
        priority = 4
        recommendation = f"Due in {abs(days_overdue)} days"

    return {
        **record,
        "status": "overdue" if days_overdue > 0 else "current",
        "days_overdue": max(0, days_overdue),
        "risk_level": risk,
        "priority": priority,
        "recommendation": recommendation
    }

Advantages:

⚡ Fast (50-100ms response time)
💰 Free (no LLM costs)
🎯 Reliable (deterministic outputs)
🔍 Debuggable (easy to trace logic)

Best for:

Status calculations
Risk assessments
Priority scoring
Date/time computations
Aggregations and summaries
Rule-based recommendations

Approach 2: Hybrid (Selective LLM Enhancement) 🎯

Use deterministic logic for 80% of intelligence, LLMs for the remaining 20% where flexibility is needed.

async def create_maintenance_plan_with_insights(equipment_id):
    # ✅ Deterministic: Get and calculate data
    equipment = get_equipment_details(equipment_id)
    history = get_maintenance_history(equipment_id)

    suggested_interval = calculate_optimal_interval(history)
    base_instructions = get_template_instructions(equipment['type'])

    # 🤖 LLM: Generate contextual insights
    if len(history) > 5:  # Only if we have data to learn from
        context = {
            "equipment_type": equipment['type'],
            "recent_failures": history[-5:],
            "pattern": analyze_failure_pattern(history)
        }

        insights = await generate_llm_insights(context)
    else:
        insights = None

    return {
        "suggested_interval": suggested_interval,  # ✅ Calculated
        "work_instructions": base_instructions,    # ✅ Template
        "contextual_recommendations": insights     # 🤖 LLM-generated
    }

LLM Use Cases (the 20%):

Contextual recommendations from historical patterns
Natural language explanations of complex scenarios
Learning from similar cases
Adaptive suggestions based on user behavior

Cost Comparison

Approach	Latency	Cost/1000 calls	Reliability
Pure Logic	50-100ms	$0	99.9%
Hybrid	200-500ms	$1-10	95%+
Full LLM	1-3 sec	$50-200	90%

Recommendation: Start with pure logic. Add selective LLM calls only where deterministic rules can't capture the nuance.

Architecture Pattern: The Three-Layer Stack

Here's how to structure your API ecosystem for agentic AI:

┌─────────────────────────────────────────────────┐
│           AI AGENT LAYER                         │
│  (ChatGPT, Claude, Custom Agents)                │
│  - Handles conversation flow                     │
│  - Makes decisions based on enriched APIs        │
└────────────────┬────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────┐
│         SEMANTIC API LAYER                       │
│  (Agent-Ready Endpoints)                         │
│                                                  │
│  ┌─────────────────────────────────────────┐   │
│  │ Goal-Oriented APIs                       │   │
│  │ - Business intent endpoints              │   │
│  │ - Semantic enrichment                    │   │
│  │ - Contextual responses                   │   │
│  └─────────────────────────────────────────┘   │
│                                                  │
│  Intelligence Layer:                             │
│  • 80% deterministic rules                       │
│  • 20% selective LLM calls                       │
└────────────────┬────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────┐
│       MICROSERVICES LAYER                        │
│  (Internal Architecture)                         │
│  - Fine-grained services                         │
│  - Database access                               │
│  - Business logic                                │
└─────────────────────────────────────────────────┘

Key principle: Keep microservices for internal modularity. Expose agent-ready APIs externally.

Real-World Benefits: Before & After

Manufacturing Operations Example

Before (Traditional APIs):

Agent makes 12 API calls to create a maintenance plan
Response time: 25 seconds
Agent must calculate risk, priority, and recommendations
No validation or suggestions
User experience: Slow, basic Q&A

After (Agent-Ready APIs):

Agent makes 3 API calls (75% reduction)
Response time: 8 seconds (68% faster)
Gets pre-calculated insights and recommendations
Proactive validation and suggestions
User experience: Intelligent assistant

ROI:

3x faster task completion
60% reduction in errors
40% increase in user satisfaction
$0 in additional LLM costs (pure logic approach)

Key Takeaways

APIs must evolve from data endpoints to reasoning frameworks if you want to support autonomous agents effectively.
The microservices trap is real: Fine-grained services are great internally, but agents need goal-oriented external APIs.
Intelligence doesn't always require LLMs: 80% of semantic enrichment can be achieved with deterministic business logic.
Start small: Add semantic fields to existing responses before building composite endpoints.
Agents need three things:
- Clarity (business intent, not CRUD operations)
- Context (meaning, not just data)
- Consistency (predictable contracts)

Production-Grade AI Agents: Architecture Patterns That Actually Work

Akshay Gupta — Wed, 05 Nov 2025 21:03:03 +0000

Your AI agent works beautifully in development. Responses are quick, conversations flow naturally, and everything feels magical. Then you deploy to production with real users, and suddenly everything breaks.

Response times spike to 5+ seconds. Agents lose conversation context mid-workflow. Memory usage explodes. Users report inconsistent behavior. Your costs skyrocket.

I've built AI agent systems that handle 100+ concurrent users with sub-2-second response times. Here's what actually works in production—and what fails spectacularly.

The Development vs. Production Gap

In development, you have:

One user (you)
Clean test data
No concurrent requests
Unlimited time to respond
Generous error margins

In production, you face:

Hundreds of simultaneous users
Messy, unpredictable inputs
Race conditions everywhere
Users expect <2s responses
Every error costs trust (and money)

The patterns that work in development often collapse under production load. Here's how to build agents that scale.

Pattern 1: Goal-Oriented Agents with Explicit Completion

The Problem

Most agents don't know when they're done. They keep talking, asking questions, or offering help even after achieving their goal. This creates confused users and wasted tokens.

Consider an agent building a quality plan:

User: "Create a quality plan for Project Alpha"
Agent: asks 8 clarifying questions, gathers data, generates plan
Agent: "I've created your plan. Would you like me to explain each section? Should I also create an SOP? How about maintenance schedules?"

The agent succeeded but doesn't know it. The conversation drifts instead of completing cleanly.

The Solution: Explicit Completion Signals

Design agents with clear goals and completion markers:

SYSTEM_PROMPT = """
You are a Quality Planning Agent.

YOUR GOAL: Create ONE quality plan for the user's project.

WORKFLOW:
1. Gather project requirements
2. Identify quality checkpoints
3. Map inspection criteria
4. Generate the plan using create_quality_plan()
5. Output: [TASK_COMPLETE]

CRITICAL: After successfully creating the plan, you MUST output [TASK_COMPLETE]
This signals that your work is finished.

Do not:
- Offer additional services
- Start new tasks
- Continue the conversation after completion
"""

The orchestrator watches for this signal:

def check_completion(agent_response: str) -> bool:
    return '[TASK_COMPLETE]' in agent_response

def extract_clean_response(agent_response: str) -> str:
    # Remove marker before showing to user
    return agent_response.replace('[TASK_COMPLETE]', '').strip()

Why This Works

✅ Agents know their scope: Each agent has ONE job, not infinite capabilities

✅ Clear boundaries: The agent completes its task and returns control to the orchestrator

✅ Better UX: Users get what they asked for without unnecessary follow-ups

✅ Composability: Completed agents can trigger suggested next actions

Real-World Impact

Before explicit completion:

Average conversation: 18 turns
Task completion rate: 73%
Users confused about status

After explicit completion:

Average conversation: 8-12 turns
Task completion rate: 94%
Clear status for users and system

Pattern 2: Context Isolation by Task

The Problem

Agents accumulate context that becomes noise for future tasks. Consider this scenario:

User creates a quality plan (agent loads machines, materials, specs)
User switches to maintenance scheduling (agent still has quality plan context)
Agent confuses quality checkpoints with maintenance tasks
Results are mixed and incorrect

The context from Task A pollutes Task B. As conversations grow, this gets worse.

The Solution: Project-Based Context Windows

Isolate context to what's relevant for the current task:

class ContextManager:
    def build_agent_context(self, task_type: str, project_id: str) -> dict:
        """
        Load only the context needed for this specific task.
        """
        base_context = {
            'project_name': self.get_project_name(project_id),
            'timestamp': datetime.now()
        }

        # Task-specific context
        if task_type == 'quality_planning':
            return {
                **base_context,
                'machines': self.get_machines(project_id),
                'materials': self.get_materials(project_id),
                'specs': self.get_specifications(project_id)
            }

        elif task_type == 'maintenance_scheduling':
            return {
                **base_context,
                'machines': self.get_machines(project_id),
                'maintenance_history': self.get_history(project_id),
                'upcoming_schedules': self.get_schedules(project_id)
            }

        elif task_type == 'sop_creation':
            return {
                **base_context,
                'workstations': self.get_workstations(project_id),
                'resources': self.get_resources(project_id),
                'takt_time': self.get_takt_time(project_id)
            }

        # Only load what you need
        return base_context

Context Boundaries

Within a session: Agent remembers conversation history for current task only.

Between tasks: Fresh context window when switching tasks.

Cross-task references: Explicit handoffs with minimal context transfer.

Why This Works

✅ Reduced noise: Agent sees only relevant information

✅ Faster responses: Smaller context = lower latency

✅ Lower costs: Fewer tokens per request

✅ Better accuracy: No confusion from irrelevant data

Pattern 3: LLM-Based Intent Routing

The Problem

Users don't announce which agent they need. They just describe their problem:

"I need to plan quality checkpoints" → Quality Planning Agent
"When was Machine A last serviced?" → Maintenance Agent
"Create work instructions for Station 3" → SOP Agent

Keyword matching fails because users phrase things differently. ML classifiers require training data and struggle with new variations.

The Solution: LLM as Router

Use an LLM to understand intent and route to the appropriate agent:

class IntentRouter:
    def __init__(self, llm_client):
        self.llm = llm_client

    async def route(self, user_message: str, context: dict) -> str:
        """
        Analyze user intent and return appropriate agent key.
        """

        routing_prompt = f"""
        Analyze this user message and determine which specialized agent should handle it.

        AVAILABLE AGENTS:

        1. quality_planning - Creates quality plans, inspection checklists, PM plans
           Examples: "create quality plan", "plan inspections", "quality checkpoints"

        2. maintenance_scheduling - Manages preventive maintenance schedules
           Examples: "maintenance schedule", "when to service machines", "PM tracking"

        3. sop_creation - Generates standard operating procedures
           Examples: "create SOP", "work instructions", "procedure for assembly"

        4. issue_tracking - Handles problem reporting and resolution
           Examples: "report issue", "quality problem", "defect tracking"

        5. general - Unclear intent, chitchat, or requests outside scope

        USER MESSAGE: "{user_message}"

        PROJECT SELECTED: {context.get('project_id') is not None}

        Respond with ONLY the agent key (quality_planning, maintenance_scheduling, etc.)
        """

        response = await self.llm.complete(routing_prompt)
        agent_key = response.strip().lower()

        # Validate response
        valid_agents = ['quality_planning', 'maintenance_scheduling', 
                       'sop_creation', 'issue_tracking', 'general']

        if agent_key not in valid_agents:
            return 'general'  # Safe fallback

        return agent_key

Why LLM Routing Works

✅ Zero-shot learning: No training data required

✅ Natural language understanding: Handles variations and synonyms naturally

✅ Easy to extend: Add new agents by updating the prompt

✅ Context-aware: Can consider project state, user history, etc.

✅ Fast enough: 300-500ms routing decision is acceptable

Routing Performance

In production:

Accuracy: 95%+ correct routing
Latency: 400-600ms average
False positives: <3%
Ambiguous handling: Routes to general agent for clarification

The 5% errors are usually genuinely ambiguous requests that need clarification anyway.

Pattern 4: The Orchestrator Pattern

The Problem

Who coordinates multiple specialized agents? If agents call each other directly, you get spaghetti architecture. If they're independent, you can't compose workflows.

The Solution: Central Orchestrator

One orchestrator manages all agents and workflow transitions:

class Orchestrator:
    def __init__(self, session_manager, router, agent_registry):
        self.sessions = session_manager
        self.router = router
        self.agents = agent_registry

    async def handle_message(self, session_id: str, user_message: str, context: dict):
        """
        Main entry point. Routes and coordinates agent execution.
        """

        # Get session state
        session = await self.sessions.get(session_id)

        # Check current mode
        if session['mode'] == 'orchestrator':
            # No active task - route to appropriate agent
            agent_key = await self.router.route(user_message, context)

            if agent_key == 'general':
                return await self.handle_general(user_message)

            # Start new task with specialized agent
            session['mode'] = 'task_active'
            session['active_agent'] = agent_key
            await self.sessions.update(session)

        # Task is active - continue with current agent
        agent = await self.get_agent(session['active_agent'], context)
        response = await agent.process(user_message)

        # Check if task completed
        if self.is_complete(response):
            # Return to orchestrator mode
            session['mode'] = 'orchestrator'
            session['active_agent'] = None
            await self.sessions.update(session)

            # Suggest next actions
            suggestions = self.get_suggestions(session['active_agent'])

            return {
                'response': self.clean_response(response),
                'suggestions': suggestions,
                'task_complete': True
            }

        # Task ongoing
        return {
            'response': response,
            'task_complete': False
        }

Orchestrator Responsibilities

1. Intent Routing

Analyzes user message
Selects appropriate agent
Handles ambiguity

2. State Management

Tracks orchestrator vs. task-active mode
Manages active agent per session
Persists conversation history

3. Task Completion

Detects completion signals
Returns control to orchestrator
Suggests next actions

4. Error Handling

Catches agent failures
Provides graceful degradation
Maintains system stability

State Transitions

[Orchestrator Mode]
       ↓
   User Message
       ↓
   Intent Routing
       ↓
[Task Active Mode] → Agent Processing
       ↓                    ↑
   Task Complete?          |
       ↓ (No)──────────────┘
       ↓ (Yes)
   Suggested Actions
       ↓
[Orchestrator Mode]

Why This Works

✅ Single source of truth: Orchestrator owns session state

✅ Clean agent APIs: Agents only handle domain logic, not coordination

✅ Composability: Easy to add new agents to the registry

✅ Testability: Each component can be tested independently

✅ Debuggability: All routing decisions go through one place

Pattern 5: Off-Topic Detection with Context Preservation

The Problem

Users naturally drift during conversations:

User: "Create a quality plan for Project X"
Agent: "What product are you manufacturing?"
User: "Automotive parts. By the way, when is lunch?"
Agent: "I don't have information about lunch schedules..."

Should the agent:

Stay rigid? (Poor UX)
Answer everything? (Loses focus)
Redirect immediately? (Feels robotic)

The Solution: Conservative Off-Topic Detection

Detect genuine topic switches while allowing natural conversation flow:

class OffTopicDetector:
    async def check(self, user_message: str, active_agent: str, 
                   conversation_history: list) -> tuple[bool, str]:
        """
        Returns: (is_off_topic, suggested_new_agent)
        """

        agent_goals = {
            'quality_planning': 'creating a quality plan or PM plan',
            'maintenance_scheduling': 'scheduling preventive maintenance',
            'sop_creation': 'creating standard operating procedures',
            'issue_tracking': 'reporting and tracking quality issues'
        }

        current_goal = agent_goals.get(active_agent)

        detection_prompt = f"""
        Current Task: {current_goal}

        Recent Conversation:
        {self._format_history(conversation_history[-3:])}

        New User Message: "{user_message}"

        Question: Is this message clearly switching to a DIFFERENT, UNRELATED task?

        Guidelines:
        - Clarifying questions about current task = ON TOPIC
        - Requesting changes to current work = ON TOPIC
        - Small tangents that relate back = ON TOPIC
        - Starting entirely new unrelated task = OFF TOPIC

        Examples:
        ON TOPIC:
        - "Can you explain what you mean by checkpoint?"
        - "Actually, use Machine B instead of Machine A"
        - "Wait, I need to add one more material"

        OFF TOPIC:
        - "Actually, let's work on maintenance scheduling instead"
        - "I need to report a quality issue"
        - "Create an SOP for me"

        Respond: ON_TOPIC or OFF_TOPIC|suggested_agent_key
        """

        response = await self.llm.complete(detection_prompt)

        if response.startswith('OFF_TOPIC'):
            parts = response.split('|')
            suggested_agent = parts[1] if len(parts) > 1 else 'general'
            return True, suggested_agent

        return False, None

Graceful Topic Switching

When off-topic detected, give users choice:

if is_off_topic and suggested_agent:
    return {
        'response': (
            f"I notice you want to switch to {suggested_agent}. "
            f"Would you like to:\n"
            f"1. Complete the current task first\n"
            f"2. Switch now (we can return to this later)\n"
            f"3. Cancel current task"
        ),
        'requires_choice': True
    }

Why Conservative Detection Works

✅ Few false positives: Natural conversation continues smoothly

✅ Clear boundaries: Genuine topic switches are caught

✅ User control: Let users decide how to handle switches

✅ Context preservation: Can return to incomplete tasks later

In testing:

91% of clarifications correctly allowed
97% of topic switches correctly detected
User satisfaction significantly higher than rigid systems

Pattern 6: Tool Call Orchestration and Validation

The Problem

Agents call tools, but tools can fail:

Rate limits
Invalid parameters
Missing data
Timeout errors
Unexpected responses

Poor tool orchestration leads to:

Agent hallucinating tool results
Incomplete workflows
User confusion
Data inconsistencies

The Solution: MCP (Model Context Protocol) Pattern

Create a controlled tool layer between agents and APIs:

class ToolOrchestrator:
    def __init__(self, api_client):
        self.api = api_client
        self.validators = self._setup_validators()

    async def execute_tool(self, tool_name: str, parameters: dict) -> dict:
        """
        Validate, execute, and handle tool calls with proper error recovery.
        """

        # Pre-execution validation
        validation_result = self.validators[tool_name](parameters)
        if not validation_result.valid:
            return {
                'success': False,
                'error': f"Invalid parameters: {validation_result.error}",
                'suggestion': validation_result.fix_suggestion
            }

        # Execute with retry logic
        for attempt in range(3):
            try:
                result = await self.api.call(tool_name, parameters)

                # Post-execution validation
                if self._validate_result(tool_name, result):
                    return {
                        'success': True,
                        'data': result
                    }

            except RateLimitError:
                if attempt < 2:
                    await asyncio.sleep(2 ** attempt)
                    continue
                return {
                    'success': False,
                    'error': 'Rate limit exceeded. Please try again in a moment.'
                }

            except TimeoutError:
                if attempt < 2:
                    continue
                return {
                    'success': False,
                    'error': 'Request timed out. The operation may still complete.'
                }

            except InvalidDataError as e:
                return {
                    'success': False,
                    'error': f'Data validation failed: {str(e)}',
                    'suggestion': 'Please check your input parameters'
                }

        return {
            'success': False,
            'error': 'Maximum retry attempts reached'
        }

Tool Validation Strategy

Pre-execution checks:

Required parameters present
Parameter types correct
Values within expected ranges
Dependencies available

Post-execution checks:

Response structure matches expected format
Data integrity validated
Side effects confirmed
Error conditions handled

Agent Tool Error Handling

Agents receive tool results and adapt:

# In agent system prompt
"""
When using tools:

1. Check tool result success status
2. If failure, read the error message
3. Follow any suggestions provided
4. Retry with corrected parameters if applicable
5. If unable to proceed, explain to user what went wrong

Example:
Tool result: {'success': False, 'error': 'Machine X not found in project'}
Your response: "I couldn't find Machine X in this project. Could you verify 
the machine name or select from: [list available machines]"
"""

Why This Pattern Works

✅ Controlled access: Tools can't be misused by agents

✅ Graceful degradation: Errors don't crash the agent

✅ Clear feedback: Agents understand what went wrong

✅ Retry logic: Transient failures resolved automatically

✅ Security: Input validation prevents injection attacks

Pattern 7: Conversation History Management

The Problem

LLMs have token limits. Long conversations exceed context windows:

20-turn conversation = 8,000+ tokens
System prompt = 1,500 tokens
Tool definitions = 2,000 tokens
Project context = 1,000 tokens
Total: 12,500 tokens (near limit for many models)

What happens at message 21?

The Solution: Smart History Windowing

Keep recent context + summarize old messages:

class ConversationManager:
    def __init__(self, max_full_messages=8):
        self.max_full_messages = max_full_messages

    async def prepare_context(self, session_id: str) -> list:
        """
        Prepare conversation history for agent, managing token budget.
        """

        full_history = await self.get_history(session_id)

        if len(full_history) <= self.max_full_messages:
            return full_history

        # Keep recent messages
        recent = full_history[-self.max_full_messages:]

        # Summarize older messages
        older = full_history[:-self.max_full_messages]
        summary = await self._create_summary(older)

        return [
            {
                'role': 'system',
                'content': f'Previous conversation summary: {summary}'
            },
            *recent
        ]

    async def _create_summary(self, messages: list) -> str:
        """
        Create concise summary of older messages.
        """

        conversation_text = '\n'.join([
            f"{msg['role']}: {msg['content']}" 
            for msg in messages
        ])

        summary_prompt = f"""
        Summarize this conversation in 2-3 sentences, focusing on:
        - Key decisions made
        - Data collected
        - Current progress toward goal

        Conversation:
        {conversation_text}

        Summary:
        """

        summary = await self.llm.complete(summary_prompt)
        return summary.strip()

When to Summarize

Option 1: Fixed window

Keep last N messages (e.g., 8-10)
Summarize everything before that
Simple and predictable

Option 2: Token-aware

Count tokens in current context
Summarize when approaching 80% of limit
More efficient but complex

Option 3: Task-based

Full history during active task
Summarize on task completion
Keeps task context intact

What to Keep vs. Summarize

Always keep:

System prompt
Tool definitions
Last 3-5 messages (current context)
Active task data

Can summarize:

Old clarifying questions
Resolved issues
Completed sub-tasks
General chitchat

Never summarize:

Critical data user provided
Tool call results needed for current task
Error messages that might recur

Real-World Architecture: Putting It Together

Here's how these patterns combine in production:

User Message
     ↓
┌────────────────────┐
│   Orchestrator     │
│  (Entry Point)     │
└─────────┬──────────┘
          │
    Session State?
    ┌─────┴─────┐
    │           │
Orchestrator  Task Active
   Mode         Mode
    │           │
    ↓           ↓
┌─────────┐  ┌──────────┐
│ Intent  │  │ Current  │
│ Router  │  │ Agent    │
│ (LLM)   │  │          │
└────┬────┘  └────┬─────┘
     │            │
     ↓            ↓
┌────────────────────┐
│  Agent Registry    │
│  - Quality Agent   │
│  - Maintenance     │
│  - SOP Agent       │
│  - Issue Tracker   │
└─────────┬──────────┘
          │
          ↓
┌───────────────────┐
│  Context Manager  │
│  (Task-specific)  │
└─────────┬─────────┘
          │
          ↓
┌───────────────────┐
│  Tool Orchestrator│
│  (MCP Pattern)    │
└─────────┬─────────┘
          │
          ↓
┌───────────────────┐
│  Completion Check │
│  [TASK_COMPLETE]  │
└─────────┬─────────┘
          │
    Complete?
    ┌─────┴──────┐
   Yes          No
    │            │
    ↓            ↓
Suggestions   Continue
Return to     with Agent
Orchestrator

Flow Example: Quality Planning

User: "Create a quality plan"
Orchestrator: Routes to Intent Router
Router: Returns 'quality_planning' agent
Orchestrator: Activates Quality Planning Agent
Context Manager: Loads machines, materials, specs
Agent: "What product are you manufacturing?"
User: "Automotive parts"
Agent: Processes, calls tools, generates plan
Agent: "Plan created. [TASK_COMPLETE]"
Orchestrator: Detects completion, returns to orchestrator mode
System: Suggests: "Create SOP?" "Schedule maintenance?"

Key Takeaways

Production-grade agents require structured patterns:

✅ 1. Goal-Oriented Design

Each agent has ONE clear objective
Explicit completion signals
No scope creep

✅ 2. Context Isolation

Task-specific context loading
No cross-contamination
Fresh starts for new tasks

✅ 3. Intelligent Routing

LLM-based intent understanding
95%+ accuracy in production
Handles natural language variations

✅ 4. Central Orchestration

One coordinator for all agents
Clear state management
Composable workflow design

✅ 5. Conservative Topic Detection

Allow natural conversation flow
Catch genuine topic switches
User control over transitions

✅ 6. Validated Tool Execution

MCP pattern for controlled access
Pre and post-execution validation
Graceful error recovery

✅ 7. Smart History Management

Token-aware windowing
Summarization of old context
Preserve critical information

Common Anti-Patterns to Avoid

❌ Autonomous agents with no structure → Agents wander, lose focus, never complete

❌ Shared context across all tasks → Confusion, mixed data, poor accuracy

❌ Keyword-based routing → Brittle, can't handle variations, high error rate

❌ Direct agent-to-agent communication → Spaghetti architecture, hard to debug

❌ Ignoring off-topic detection → Agents follow users down rabbit holes

❌ Trusting tool calls blindly → Cascading failures, poor error messages

❌ Unlimited conversation history → Token limit errors, high costs, crashes

The Bottom Line

Building production-grade AI agents isn't about autonomy—it's about architecture.

What works:

Specialized agents with clear goals
Explicit completion signals
Task-isolated context
LLM-based routing
Central orchestration
Validated tool execution
Managed conversation history

What fails:

Generic autonomous agents
Implicit task completion
Shared global context
Rule-based routing
Direct agent coupling
Unvalidated tool calls
Unlimited history

The agents that work in production have structure. They know their goals, understand their boundaries, and complete tasks reliably.

That's what production-grade means.

About the Author

I build production-grade multi-agent systems for manufacturing, sales, and productivity automation. My agents follow structured workflows with 94% task completion rates, achieving 75% reduction in manual work time.

Specialized in orchestration patterns, context management, and LLM-based routing using CrewAI, Agno, and custom architectures.

Open to consulting and technical partnerships. Let's discuss your agent architecture challenges!

📧 Contact: gupta.akshay1996@gmail.com

Found this helpful? Share it with other AI builders! 🚀

What production challenges are you facing with AI agents? Drop a comment below!

The Orchestrator Pattern: Routing Conversations to Specialized AI Agents

Akshay Gupta — Wed, 05 Nov 2025 20:54:47 +0000

Building one AI agent to handle everything sounds simple. One conversation, one context, one set of instructions. But in practice, generalist agents fail at complex workflows.

They lose focus. They confuse tasks. They can't decide when they're done. They try to be everything and end up being mediocre at most things.

The solution isn't a smarter single agent—it's specialized agents with intelligent orchestration. Each agent does one thing exceptionally well, and an orchestrator routes conversations to the right specialist.

I've built multi-agent systems where 4-6 specialized agents handle distinct workflows, coordinated by a central orchestrator. Here's how to architect orchestration that actually works in production.

The Problem with Single-Agent Systems

Consider a business operations platform that needs to:

Schedule appointments and manage calendars
Generate reports from data
Handle customer support inquiries
Process document requests
Manage task workflows

A single agent handling all of this faces impossible challenges:

Context confusion:
"Schedule a meeting" vs "Schedule a report generation" vs "Schedule a follow-up task"—same verb, completely different actions.

No clear completion:
When is the agent "done"? After scheduling? After confirming? After sending confirmation emails?

Scope creep:
User asks for a report, agent offers to schedule a meeting about the report, then suggests creating tasks based on the report findings. The conversation never ends.

Degraded performance:
The system prompt grows to 5,000+ tokens trying to handle every case. The agent becomes slow and expensive.

Impossible to debug:
When something breaks, you can't isolate which part of the mega-prompt is failing.

The Orchestrator Pattern

Instead of one generalist, build specialized agents:

Scheduling Agent: Handles calendar management only
Reporting Agent: Generates and formats reports only
Support Agent: Answers questions from knowledge base only
Document Agent: Processes document requests only
Task Agent: Manages task creation and tracking only

Each agent has:

One clear goal
Focused system prompt
Specific tools
Explicit completion criteria

The orchestrator sits above all agents and:

Routes user messages to the appropriate agent
Manages conversation state
Detects task completion
Suggests next actions
Handles transitions between agents

Pattern 1: Intent-Based Routing

The Problem

Users don't tell you which agent they need:

"Set up a meeting for next Tuesday" → Scheduling Agent
"Show me last month's numbers" → Reporting Agent
"How do I reset my password?" → Support Agent
"I need the contract from Project Alpha" → Document Agent

You need to understand intent from natural language.

❌ Solution 1: Keyword Matching (Don't Do This)

# Brittle and fails on variations
def route_by_keywords(message):
    message_lower = message.lower()

    if 'meeting' in message_lower or 'schedule' in message_lower:
        return 'scheduling_agent'
    elif 'report' in message_lower or 'numbers' in message_lower:
        return 'reporting_agent'
    elif 'how do i' in message_lower or 'help' in message_lower:
        return 'support_agent'

    return 'general_agent'

Why this fails:

"Can you generate a schedule?" contains "schedule" but needs reporting, not scheduling
"Meeting notes from last quarter" contains "meeting" but needs documents, not scheduling
Doesn't handle synonyms, typos, or context
Brittle and requires constant updates

✅ Solution 2: LLM-Based Router (Recommended)

Use an LLM to understand intent and route appropriately:

class IntentRouter:
    def __init__(self, llm_client):
        self.llm = llm_client

    async def route(self, user_message: str, context: dict) -> str:
        """
        Analyze user intent and return appropriate agent.
        """

        routing_prompt = f"""
        Analyze this user message and determine which specialized agent should handle it.

        AVAILABLE AGENTS:

        1. scheduling_agent - Calendar management, appointments, meetings
           Keywords: schedule, meeting, appointment, calendar, book, available times
           Examples: "Schedule a call", "When am I free?", "Book a demo"

        2. reporting_agent - Data analysis, report generation, metrics
           Keywords: report, data, analytics, numbers, metrics, dashboard
           Examples: "Show me sales data", "Generate quarterly report"

        3. support_agent - Help, troubleshooting, how-to questions
           Keywords: how do I, help, problem, issue, troubleshoot, question
           Examples: "How do I reset password?", "Need help with setup"

        4. document_agent - Document retrieval, file management
           Keywords: document, file, contract, agreement, download, upload
           Examples: "Get me the contract", "Upload the proposal"

        5. task_agent - Task creation, project management, to-dos
           Keywords: task, todo, project, reminder, deadline, assign
           Examples: "Create a task", "Remind me to follow up"

        6. general - Unclear intent, greeting, chitchat, or outside scope
           Use when intent doesn't match any specialized agent

        USER MESSAGE: "{user_message}"

        CONTEXT:
        - Active project: {context.get('project_name', 'None')}
        - User role: {context.get('user_role', 'Unknown')}

        Respond with ONLY the agent key (scheduling_agent, reporting_agent, etc.)
        If truly ambiguous, respond with 'general' and the system will ask for clarification.
        """

        response = await self.llm.complete(routing_prompt)
        agent_key = response.strip().lower()

        # Validate response
        valid_agents = [
            'scheduling_agent', 
            'reporting_agent', 
            'support_agent',
            'document_agent', 
            'task_agent', 
            'general'
        ]

        if agent_key not in valid_agents:
            return 'general'

        return agent_key

Why LLM Routing Works

✅ Zero-shot understanding: No training data needed, works immediately

✅ Natural language processing: Handles variations, synonyms, context naturally

✅ Easy to extend: Add new agents by updating the prompt, no retraining

✅ Context-aware: Can consider user role, active project, conversation history

✅ Fast enough: 300-600ms routing decision is acceptable for most applications

✅ High accuracy: 95%+ correct routing in production with good prompt design

Handling Ambiguous Requests

When intent is unclear, route to general handler that clarifies:

# User: "I need to see something"
# Router returns: 'general'

general_response = """
I'd be happy to help! To direct you to the right specialist, could you clarify what you need?

- 📅 Schedule or view appointments
- 📊 View reports or data
- 📄 Access documents or files
- ✅ Create or manage tasks
- ❓ Get help or support
"""

This prevents wrong routing and improves user experience.

Pattern 2: The Orchestrator State Machine

The Problem

The orchestrator needs to track multiple states:

Is a task currently active?
Which agent is handling it?
Is the user mid-conversation with an agent?
Did the agent complete its work?

Without proper state management, you get:

Messages routed to wrong agents
Tasks interrupted unexpectedly
Completion never detected
User confusion about what's happening

The Solution: Two-Mode State Machine

class Orchestrator:
    def __init__(self, session_manager, router, agent_registry):
        self.sessions = session_manager
        self.router = router
        self.agents = agent_registry

    async def handle_message(
        self, 
        session_id: str, 
        user_message: str, 
        context: dict
    ) -> dict:
        """
        Main entry point. Routes based on current state.
        """

        # Get current session state
        session = await self.sessions.get(session_id)

        # State machine: orchestrator mode or task active mode
        if session['mode'] == 'orchestrator':
            return await self._handle_orchestrator_mode(
                session_id, user_message, context
            )
        else:  # mode == 'task_active'
            return await self._handle_task_active_mode(
                session_id, user_message, context, session
            )

    async def _handle_orchestrator_mode(
        self, 
        session_id: str, 
        user_message: str, 
        context: dict
    ) -> dict:
        """
        No active task. Route to appropriate agent.
        """

        # Route to agent
        agent_key = await self.router.route(user_message, context)

        if agent_key == 'general':
            return await self._handle_general(user_message)

        # Start new task with specialized agent
        session = await self.sessions.get(session_id)
        session['mode'] = 'task_active'
        session['active_agent'] = agent_key
        session['task_started_at'] = datetime.now()
        await self.sessions.update(session_id, session)

        # Forward to agent
        agent = self.agents.get(agent_key)
        response = await agent.process(user_message, context)

        return {
            'response': response,
            'mode': 'task_active',
            'active_agent': agent_key
        }

    async def _handle_task_active_mode(
        self, 
        session_id: str, 
        user_message: str, 
        context: dict,
        session: dict
    ) -> dict:
        """
        Task is active. Continue with current agent.
        """

        agent_key = session['active_agent']
        agent = self.agents.get(agent_key)

        # Forward message to active agent
        response = await agent.process(user_message, context)

        # Check if task completed
        if self._is_complete(response):
            # Task done, return to orchestrator mode
            session['mode'] = 'orchestrator'
            session['active_agent'] = None
            session['last_completed_agent'] = agent_key
            session['task_completed_at'] = datetime.now()
            await self.sessions.update(session_id, session)

            # Clean response and suggest next actions
            clean_response = self._remove_completion_marker(response)
            suggestions = self._get_next_actions(agent_key)

            return {
                'response': clean_response,
                'mode': 'orchestrator',
                'task_complete': True,
                'suggestions': suggestions
            }

        # Task ongoing
        return {
            'response': response,
            'mode': 'task_active',
            'active_agent': agent_key
        }

State Transitions

┌─────────────────────┐
│  ORCHESTRATOR MODE  │
│  (No active task)   │
└──────────┬──────────┘
           │
      User Message
           │
           ↓
    ┌──────────────┐
    │ Intent Router│
    └──────┬───────┘
           │
           ↓
    Agent Selected
           │
           ↓
┌──────────────────────┐
│   TASK ACTIVE MODE   │
│ (Agent processing)   │
└──────────┬───────────┘
           │
      User Message
           │
           ↓
    Forward to Agent
           │
           ↓
    Task Complete?
       ┌───┴───┐
      No      Yes
       │       │
       │       ↓
       │  Return to
       │  Orchestrator
       │
       └──→ Continue

Why This Works

✅ Clear state boundaries: System always knows what mode it's in

✅ No routing confusion: Orchestrator mode routes, task mode forwards

✅ Explicit completion: Agent signals when done, orchestrator detects it

✅ Session persistence: State survives across messages

✅ Debuggable: Can inspect state at any point to understand system behavior

Pattern 3: Explicit Task Completion Signals

The Problem

How does the orchestrator know when an agent finished its work?

Implicit detection fails:

Tool call detection: Agent might call tools in any order
Turn counting: Some tasks need more turns than others
Silence: Agent stops responding, but is it done or stuck?

You need explicit, unambiguous completion signals.

The Solution: Completion Markers

Each agent's system prompt includes explicit completion instructions:

AGENT_SYSTEM_PROMPT = """
You are a {agent_type} specialist.

YOUR GOAL: {goal_description}

WORKFLOW:
{workflow_steps}

CRITICAL: When you have successfully completed your goal, you MUST output:
[TASK_COMPLETE]

This signals to the orchestrator that your work is finished.

Guidelines for completion:
- Confirm user satisfaction before marking complete
- Ensure all required information is collected
- Verify the deliverable meets requirements
- Do NOT continue conversation after completion

Example:
User: "Yes, that looks perfect!"
You: "Great! I've {completed_action}. [TASK_COMPLETE]"
"""

Orchestrator Detection

def _is_complete(self, agent_response: str) -> bool:
    """Check if agent has completed its task."""
    return '[TASK_COMPLETE]' in agent_response

def _remove_completion_marker(self, response: str) -> str:
    """Remove marker before showing to user."""
    return response.replace('[TASK_COMPLETE]', '').strip()

Example: Scheduling Agent Completion

User: "Schedule a meeting with John next Tuesday at 2pm"
Agent: "I'll schedule that meeting. Let me check availability..."
Agent: [calls check_availability tool]
Agent: "Tuesday 2pm works. Should I send the invite?"
User: "Yes, please"
Agent: [calls create_meeting tool]
Agent: "Meeting scheduled! I've sent calendar invites to you and John 
       for Tuesday, March 19th at 2:00 PM. [TASK_COMPLETE]"

[Orchestrator detects completion]
[Returns to orchestrator mode]
[Suggests: "Would you like to create a reminder? 
           Generate a meeting agenda?"]

Why Explicit Markers Work

✅ Unambiguous: No guessing or inference needed

✅ Agent-controlled: Agent decides when it's truly done

✅ Allows confirmation: Agent can ask for user approval before completing

✅ Easy to implement: Simple string matching, no complex logic

✅ Testable: Can verify completion detection in tests

Pattern 4: Off-Topic Detection

The Problem

Users naturally drift during multi-turn workflows:

User: "Schedule a meeting for next week"
Agent: "What day works best for you?"
User: "Actually, can you show me last month's sales report first?"

Should the orchestrator:

Let the agent handle it? (Agent will be confused)
Switch immediately? (Abrupt, might lose context)
Ask the user? (Best approach)

The Solution: Conservative Off-Topic Detection

class OffTopicDetector:
    def __init__(self, llm_client):
        self.llm = llm_client

    async def check_off_topic(
        self, 
        user_message: str, 
        active_agent: str,
        recent_history: list
    ) -> tuple[bool, str]:
        """
        Returns: (is_off_topic, suggested_agent)
        """

        agent_goals = {
            'scheduling_agent': 'scheduling appointments or managing calendar',
            'reporting_agent': 'generating reports or analyzing data',
            'support_agent': 'answering questions or troubleshooting',
            'document_agent': 'retrieving or managing documents',
            'task_agent': 'creating or managing tasks'
        }

        current_goal = agent_goals.get(active_agent, 'general assistance')

        detection_prompt = f"""
        Current Task: {current_goal}

        Recent Conversation:
        {self._format_history(recent_history[-3:])}

        New User Message: "{user_message}"

        Question: Is this message clearly switching to a DIFFERENT, UNRELATED task?

        Guidelines:
        - Clarifying questions about current task = ON TOPIC
        - Requesting changes to current task = ON TOPIC
        - Small tangents that relate back = ON TOPIC
        - Starting entirely new unrelated task = OFF TOPIC

        Examples for scheduling agent:
        ON TOPIC:
        - "Actually, make it 3pm instead of 2pm"
        - "Can you check if the conference room is available?"
        - "Add Sarah to the meeting too"

        OFF TOPIC:
        - "Show me last month's sales report"
        - "I need to retrieve a document"
        - "Create a task for follow-up"

        Respond: ON_TOPIC or OFF_TOPIC|suggested_agent_key
        """

        response = await self.llm.complete(detection_prompt)

        if response.startswith('OFF_TOPIC'):
            parts = response.split('|')
            suggested_agent = parts[1] if len(parts) > 1 else 'general'
            return True, suggested_agent

        return False, None

Handling Off-Topic Requests

When detected, give the user control:

if is_off_topic and new_agent:
    return {
        'response': f"""
I notice you're asking about something different from our current task.

Would you like to:
1. Complete the current task first
2. Switch to {new_agent.replace('_', ' ')} now (we can return to this later)
3. Cancel the current task

Which would you prefer?
        """,
        'requires_user_choice': True,
        'options': ['complete_current', 'switch_now', 'cancel']
    }

Why Conservative Detection Works

✅ Few false positives: Legitimate workflow continues smoothly

✅ User control: User decides how to handle topic switches

✅ Context preservation: Can return to incomplete tasks later

✅ Better UX: No jarring interruptions or rigid boundaries

In production:

92% of clarifications correctly allowed to continue
96% of true topic switches correctly detected
User satisfaction higher than strict or no detection

Pattern 5: Suggested Next Actions

The Problem

Agent completes a task. Now what? Users often need related follow-up actions but don't know what's available.

Poor experience:

Agent: "Meeting scheduled! [TASK_COMPLETE]"
Orchestrator: "Anything else I can help with?"
User: "Um... I guess that's it?"

Better experience:

Agent: "Meeting scheduled! [TASK_COMPLETE]"
Orchestrator: "Meeting scheduled! What would you like to do next?
- 📋 Create agenda for this meeting
- ✅ Set reminder before meeting
- 📧 Draft follow-up email
- 📊 View your full calendar"

The Solution: Context-Aware Suggestions

class NextActionSuggester:

    SUGGESTIONS = {
        'scheduling_agent': [
            'Create agenda for meeting',
            'Set reminder before meeting',
            'View full calendar',
            'Schedule another meeting',
            'Create follow-up task'
        ],

        'reporting_agent': [
            'Schedule review meeting for report',
            'Export report to document',
            'Create tasks based on findings',
            'Schedule automated report updates',
            'Share report with team'
        ],

        'support_agent': [
            'Create task for follow-up',
            'Save solution to knowledge base',
            'Schedule training session',
            'Contact support team directly',
            'View related documentation'
        ],

        'document_agent': [
            'Create task for document review',
            'Schedule discussion about document',
            'Share document with others',
            'Set reminder to update document',
            'Generate report from document'
        ],

        'task_agent': [
            'Schedule time to work on task',
            'Create sub-tasks',
            'Set reminder for deadline',
            'Generate status report',
            'View all active tasks'
        ]
    }

    def get_suggestions(
        self, 
        completed_agent: str, 
        task_context: dict = None
    ) -> list:
        """Get contextual next action suggestions."""

        base_suggestions = self.SUGGESTIONS.get(completed_agent, [])

        # Can further customize based on task_context
        # For example, if meeting scheduled with >5 people, 
        # suggest "Create shared agenda"

        return base_suggestions[:4]  # Return top 4 suggestions

Why This Works

✅ Discoverability: Users learn what's possible

✅ Productivity: Easy to chain related actions

✅ Engagement: Keeps users in the flow

✅ Contextual: Suggestions relevant to what just happened

✅ Optional: Users can ignore if not needed

Pattern 6: Agent Registry and Dynamic Loading

The Problem

Hard-coding agent instances doesn't scale. Adding new agents requires code changes. Can't enable/disable agents per user or deployment.

The Solution: Agent Registry Pattern

class AgentRegistry:
    def __init__(self):
        self.agents = {}
        self.agent_configs = {}

    def register(
        self, 
        agent_key: str, 
        agent_class: type, 
        config: dict
    ):
        """Register an agent with configuration."""
        self.agent_configs[agent_key] = {
            'class': agent_class,
            'config': config,
            'enabled': config.get('enabled', True)
        }

    def get(self, agent_key: str, context: dict = None):
        """Get or create agent instance."""

        # Check if agent exists and is enabled
        if agent_key not in self.agent_configs:
            raise ValueError(f"Agent {agent_key} not registered")

        agent_config = self.agent_configs[agent_key]

        if not agent_config['enabled']:
            raise ValueError(f"Agent {agent_key} is disabled")

        # Check if instance already exists
        if agent_key not in self.agents:
            # Create new instance
            agent_class = agent_config['class']
            config = agent_config['config']

            # Initialize with context if provided
            if context:
                self.agents[agent_key] = agent_class(config, context)
            else:
                self.agents[agent_key] = agent_class(config)

        return self.agents[agent_key]

    def list_available(self) -> list:
        """List all enabled agents."""
        return [
            {
                'key': key,
                'name': config['config'].get('name'),
                'description': config['config'].get('description')
            }
            for key, config in self.agent_configs.items()
            if config['enabled']
        ]

# Usage
registry = AgentRegistry()

registry.register('scheduling_agent', SchedulingAgent, {
    'name': 'Scheduling Assistant',
    'description': 'Manages appointments and calendar',
    'enabled': True
})

registry.register('reporting_agent', ReportingAgent, {
    'name': 'Reporting Assistant',
    'description': 'Generates reports and analyzes data',
    'enabled': True
})

# Get agent when needed
scheduling_agent = registry.get('scheduling_agent', context)

Why Registry Pattern Works

✅ Decoupled: Orchestrator doesn't need to know about agent implementation

✅ Dynamic: Can enable/disable agents at runtime

✅ Configurable: Each agent can have different configuration

✅ Testable: Easy to swap in mock agents for testing

✅ Extensible: Add new agents without modifying orchestrator

Putting It All Together: Complete Architecture

Here's how all patterns combine:

User Message
     ↓
┌──────────────────────────────┐
│      Orchestrator            │
│   (Entry Point)              │
└─────────────┬────────────────┘
              │
         Get Session
              ↓
    ┌─────────────────────┐
    │   Session Manager   │
    │  (State: mode,      │
    │   active_agent)     │
    └─────────┬───────────┘
              │
    Check Current Mode
    ┌─────────┴─────────┐
    │                   │
Orchestrator       Task Active
   Mode               Mode
    │                   │
    ↓                   ↓
┌────────────┐    ┌──────────────┐
│   Intent   │    │  Off-Topic   │
│   Router   │    │  Detector    │
│   (LLM)    │    │              │
└─────┬──────┘    └───────┬──────┘
      │                   │
      ↓                   ↓
   Agent Key        Off-Topic?
      │              ┌────┴────┐
      │             No        Yes
      │              │          │
      │              │    User Choice
      ↓              ↓
┌──────────────────────────────┐
│      Agent Registry          │
│  (Get appropriate agent)     │
└────────────┬─────────────────┘
             │
             ↓
    ┌────────────────┐
    │  Agent Process │
    │  (Handle task) │
    └────────┬───────┘
             │
             ↓
    ┌────────────────────┐
    │ Completion Check   │
    │  [TASK_COMPLETE]?  │
    └────────┬───────────┘
             │
        Complete?
      ┌──────┴──────┐
     Yes           No
      │             │
      ↓             ↓
  Suggestions   Continue
  & Return      with Agent
  Orchestrator

Key Takeaways

Building production orchestration requires:

✅ LLM-Based Intent Routing

Zero-shot understanding of user intent
95%+ accuracy with good prompt design
Easy to extend with new agents
Context-aware routing decisions

✅ State Machine Architecture

Two modes: orchestrator and task active
Clear state transitions
Session persistence
Debuggable behavior

✅ Explicit Completion Signals

Agents signal when done with markers
Orchestrator detects unambiguously
User confirmation before completion
Clean handoff back to orchestrator

✅ Conservative Off-Topic Detection

Allow natural conversation flow
Detect genuine topic switches
Give users control over transitions
Preserve context for return

✅ Contextual Next Actions

Suggest relevant follow-ups
Improve discoverability
Keep users in flow
Optional but valuable

✅ Agent Registry Pattern

Decouple orchestrator from agents
Dynamic enable/disable
Easy to add new agents
Testable and maintainable

Common Anti-Patterns to Avoid

❌ Keyword-based routing → Brittle, high error rate, constant maintenance

❌ No state management → Lost context, routing confusion, poor UX

❌ Implicit completion detection → False positives, tasks never end

❌ No off-topic handling → Agents confused, conversations derail

❌ Hard-coded agent references → Difficult to extend, tightly coupled

❌ No suggested next actions → Dead-end conversations, poor discoverability

❌ Aggressive off-topic detection → Interrupts natural flow, frustrates users

The Bottom Line

Orchestration isn't about building one smart agent—it's about coordinating specialized agents effectively.

What works:

LLM-based intent routing
Clear state machine (two modes)
Explicit completion signals
Conservative off-topic detection
Contextual suggestions
Agent registry pattern

What fails:

Keyword routing
No state tracking
Implicit completion
No off-topic handling
Hard-coded agents
Dead-end conversations

The orchestrator's job is simple: route to the right specialist, detect when they're done, and suggest what's next.

Get this architecture right, and your multi-agent system scales effortlessly.

About the Author

I build production-grade multi-agent systems with intelligent orchestration. My implementations achieve 95%+ routing accuracy and 94%+ task completion rates through LLM-based intent understanding and explicit state management.

Specialized in orchestrator patterns, agent coordination, and scalable multi-agent architectures using CrewAI, Agno, and custom frameworks.

Open to consulting on multi-agent architecture challenges. Let's connect!

📧 Contact: gupta.akshay1996@gmail.com

Found this helpful? Share it with other AI builders! 🚀

What orchestration challenges are you facing? Drop a comment below!

Context Engineering: Giving AI Agents Memory Without Breaking the Token Budget

Akshay Gupta — Wed, 05 Nov 2025 20:52:32 +0000

Your AI agent needs to remember things. User preferences, project details, conversation history, tool results—all of it matters for providing intelligent responses. But every token you send costs money and consumes your context window.

Send too little context, and your agent gives generic, unhelpful responses. Send too much, and you hit token limits, rack up costs, and slow down responses.

I've built agents that manage context for manufacturing operations, sales workflows, and productivity systems. Here's how to give agents the right memory at the right time—without exploding your budget.

The Context Budget Problem

Every LLM has a context window—a maximum number of tokens it can process. Claude Sonnet 4.5 has 200K tokens. GPT-4 has 128K tokens. Sounds like a lot, right?

Here's what actually fits in that budget:

A typical agent's baseline context:

System prompt: 1,500 tokens
Tool definitions: 2,000 tokens
Agent instructions: 1,000 tokens
Subtotal: 4,500 tokens

For a project-based agent, add:

Project machines list: 800 tokens
Materials and specifications: 600 tokens
Historical data: 1,200 tokens
Subtotal: 2,600 tokens

For a 20-turn conversation:

User messages (avg 100 tokens): 2,000 tokens
Agent responses (avg 300 tokens): 6,000 tokens
Subtotal: 8,000 tokens

Total: 15,100 tokens for a moderate conversation.

And you haven't even added:

Retrieved documents
Search results
Previous workflow outputs
Related task context

You're already at 15% of a 100K context window. By turn 50, you're hitting limits. By turn 100, you're out of space.

The naive solution—"send everything always"—fails fast.

Pattern 1: Lazy Context Loading (Just-In-Time)

The Problem

Most agents load all available context upfront:

# ❌ Eager loading - wasteful
def build_context(project_id):
    return {
        'machines': get_all_machines(project_id),      # 50 machines
        'materials': get_all_materials(project_id),    # 200 materials
        'specs': get_all_specs(project_id),            # 100 specs
        'history': get_full_history(project_id),       # 1000 records
        'users': get_all_users(project_id),            # 30 users
        'schedules': get_all_schedules(project_id)     # 500 schedules
    }

# Result: 8,000+ tokens, most unused

The agent receives data it never uses. You pay for tokens that don't contribute to the response.

The Solution: Load Only What's Needed

Give the agent tools to request context when needed:

# ✅ Lazy loading - efficient
def build_minimal_context(project_id):
    return {
        'project_name': get_project_name(project_id),
        'project_type': get_project_type(project_id)
    }

# Agent has tools to fetch more
tools = [
    Tool(
        name='get_machines',
        description='Fetch machines for this project',
        function=lambda: get_machines(project_id)
    ),
    Tool(
        name='get_materials', 
        description='Fetch materials for this project',
        function=lambda: get_materials(project_id)
    ),
    Tool(
        name='search_history',
        description='Search historical records by keyword',
        function=lambda query: search_history(project_id, query)
    )
]

When the Agent Needs Data

Turn 1:

User: "Create a quality plan for automotive parts"
Agent: receives minimal context
Agent: calls get_machines() and get_materials()
Agent: "I see you have 3 machines available..."

Turn 5:

User: "What maintenance was done on Machine A last month?"
Agent: calls search_history(query="Machine A maintenance")
Agent: "Machine A had preventive maintenance on..."

Why This Works

✅ Reduced baseline: Start with ~500 tokens instead of 8,000

✅ On-demand loading: Only fetch what's relevant to the current task

✅ Token efficiency: Pay for what you use, not what you might use

✅ Better relevance: Agent gets focused, pertinent data

Real-World Impact

Before lazy loading:

Average context: 12,000 tokens per request
Cost per conversation: $0.48
Irrelevant data: 60-70%

After lazy loading:

Average context: 4,500 tokens per request
Cost per conversation: $0.18
Irrelevant data: <10%

62% cost reduction with better response quality.

Pattern 2: Task-Specific Context Windows

The Problem

Different tasks need different context. A quality planning agent needs machines and materials. A maintenance agent needs service history. An SOP agent needs workstations and TAKT times.

Loading everything for every task wastes tokens and confuses the agent with irrelevant data.

The Solution: Context Profiles per Agent Type

Define exactly what each agent needs:

class ContextManager:

    CONTEXT_PROFILES = {
        'quality_planning': {
            'required': ['machines', 'materials', 'specifications'],
            'optional': ['previous_plans', 'quality_metrics'],
            'exclude': ['maintenance_history', 'sop_data']
        },

        'maintenance_scheduling': {
            'required': ['machines', 'maintenance_history'],
            'optional': ['upcoming_schedules', 'parts_inventory'],
            'exclude': ['materials', 'quality_specs']
        },

        'sop_creation': {
            'required': ['workstations', 'resources', 'takt_time'],
            'optional': ['existing_sops', 'process_flow'],
            'exclude': ['quality_specs', 'maintenance_history']
        },

        'issue_tracking': {
            'required': ['machines', 'materials'],
            'optional': ['recent_issues', 'resolution_history'],
            'exclude': ['sop_data', 'schedules']
        }
    }

    def build_context(self, agent_type: str, project_id: str) -> dict:
        """
        Build context based on agent's specific needs.
        """
        profile = self.CONTEXT_PROFILES[agent_type]
        context = {}

        # Load required data
        for key in profile['required']:
            context[key] = self._load_data(key, project_id)

        # Load optional data if available (don't fail if missing)
        for key in profile['optional']:
            try:
                context[key] = self._load_data(key, project_id)
            except DataNotFoundError:
                pass  # Optional, skip if unavailable

        # Explicitly exclude irrelevant data

        return context

Context Isolation Benefits

✅ Clarity: Agent sees only relevant information

✅ Speed: Less data to load and process

✅ Accuracy: No confusion from unrelated data

✅ Cost: Fewer tokens per request

✅ Debugging: Easy to see what context each agent receives

Example: Quality Planning vs. Maintenance

Quality Planning Context:

{
    "project_name": "Automotive Assembly Line",
    "machines": [
        {"id": "M1", "name": "CNC Mill", "specs": "..."},
        {"id": "M2", "name": "Lathe", "specs": "..."}
    ],
    "materials": [
        {"id": "MAT1", "name": "Steel", "grade": "304"}
    ],
    "specifications": {
        "tolerance": "±0.01mm",
        "surface_finish": "Ra 1.6"
    }
}

Maintenance Scheduling Context:

{
    "project_name": "Automotive Assembly Line",
    "machines": [
        {"id": "M1", "name": "CNC Mill", "last_service": "2024-01-15"},
        {"id": "M2", "name": "Lathe", "last_service": "2024-02-01"}
    ],
    "maintenance_history": [
        {"machine": "M1", "date": "2024-01-15", "type": "preventive"},
        {"machine": "M2", "date": "2024-02-01", "type": "repair"}
    ],
    "upcoming_schedules": [
        {"machine": "M1", "due": "2024-04-15", "type": "preventive"}
    ]
}

Notice: No overlap in specifications, quality metrics, or workstation data. Each agent gets exactly what it needs.

Pattern 3: Conversation History Windowing

The Problem

LLM conversations grow unbounded. By turn 50, you have:

50 user messages
50 agent responses
Tool calls and results
System messages

This exceeds context limits and makes responses slower and more expensive.

The Solution: Smart Windowing with Summarization

Keep recent messages in full, summarize older ones:

class ConversationWindow:
    def __init__(self, max_full_messages=10):
        self.max_full_messages = max_full_messages
        self.summary_cache = {}

    async def prepare_history(self, session_id: str) -> list:
        """
        Prepare conversation history within token budget.
        """
        full_history = await self.get_full_history(session_id)

        if len(full_history) <= self.max_full_messages:
            return full_history

        # Split into recent and old
        recent_messages = full_history[-self.max_full_messages:]
        old_messages = full_history[:-self.max_full_messages]

        # Check if we already have a summary
        summary_key = f"{session_id}:{len(old_messages)}"
        if summary_key in self.summary_cache:
            summary = self.summary_cache[summary_key]
        else:
            summary = await self._create_summary(old_messages)
            self.summary_cache[summary_key] = summary

        # Combine summary + recent messages
        return [
            {
                'role': 'system',
                'content': f'Previous conversation summary: {summary}'
            },
            *recent_messages
        ]

    async def _create_summary(self, messages: list) -> str:
        """
        Create concise summary focusing on key information.
        """

        # Extract key information to preserve
        decisions = self._extract_decisions(messages)
        data_collected = self._extract_data(messages)
        progress = self._extract_progress(messages)

        summary_prompt = f"""
        Summarize this conversation segment in 3-4 sentences:

        Focus on:
        - Key decisions: {decisions}
        - Data collected: {data_collected}
        - Progress made: {progress}

        Messages:
        {self._format_messages(messages)}

        Create a concise summary that preserves essential context.
        """

        summary = await self.llm.complete(summary_prompt)
        return summary.strip()

What to Preserve in Summaries

Always preserve:

User decisions and choices
Specific data provided (numbers, names, IDs)
Task progress and completion status
Error messages or issues encountered
Tool call results that affect future actions

Can compress:

Clarifying questions and answers
Explanations of concepts
Confirmation messages
General chitchat
Repetitive information

Example Summary

Original (10 messages, 2,000 tokens):

User: "I need to create a quality plan"
Agent: "What product are you manufacturing?"
User: "Automotive brake pads"
Agent: "What materials are you using?"
User: "Steel alloy, grade 304"
Agent: "What machines will you use?"
User: "CNC Mill M1 and Lathe M2"
Agent: "What tolerances are required?"
User: "±0.01mm"
Agent: "Got it. Let me create the plan..."

Summary (150 tokens):

User requested quality plan for automotive brake pads. 
Materials: Steel alloy grade 304. 
Machines: CNC Mill M1, Lathe M2. 
Tolerance requirement: ±0.01mm. 
Plan creation initiated.

Token Savings

Original: 2,000 tokens
Summary: 150 tokens
Savings: 92.5%

Multiply this across a 50-turn conversation and you save thousands of tokens per request.

Pattern 4: RAG (Retrieval-Augmented Generation) for Large Knowledge Bases

The Problem

Some agents need access to large knowledge bases:

500+ product specifications
1,000+ historical maintenance records
200+ standard operating procedures
Complete company documentation

You can't fit this in context. Even with a 200K token window, it's inefficient.

The Solution: Vector Search + Selective Retrieval

Store knowledge in a vector database, retrieve only relevant chunks:

class KnowledgeRetriever:
    def __init__(self, vector_db):
        self.db = vector_db

    async def retrieve_relevant(self, query: str, top_k: int = 3) -> list:
        """
        Retrieve most relevant knowledge chunks for query.
        """

        # Embed the query
        query_embedding = await self.embed(query)

        # Search vector database
        results = await self.db.search(
            vector=query_embedding,
            limit=top_k,
            threshold=0.7  # Similarity threshold
        )

        # Return relevant chunks
        return [
            {
                'content': result.content,
                'source': result.metadata['source'],
                'relevance': result.score
            }
            for result in results
        ]

    async def build_rag_context(self, user_message: str, base_context: dict) -> dict:
        """
        Augment base context with retrieved knowledge.
        """

        # Retrieve relevant documents
        relevant_docs = await self.retrieve_relevant(user_message)

        # Add to context
        augmented_context = {
            **base_context,
            'retrieved_knowledge': relevant_docs
        }

        return augmented_context

When to Use RAG vs. Direct Context

Use direct context when:

Data is small (<2,000 tokens)
Data is frequently needed
Data is structured and predictable
Fast response time is critical

Use RAG when:

Knowledge base is large (>10,000 tokens)
Data is accessed occasionally
Relevance varies by query
Full text search is needed

RAG Implementation Example

Scenario: Agent needs to reference maintenance procedures.

Without RAG (fails):

# Can't fit 200 procedures in context
procedures = load_all_procedures()  # 50,000 tokens
# Context limit exceeded!

With RAG (works):

# User asks about specific machine
user_query = "How do I service Machine A?"

# Retrieve only relevant procedures
relevant = await retriever.retrieve_relevant(user_query, top_k=3)
# Result: 3 procedures, ~1,500 tokens

# Agent sees only what's needed
agent_context = {
    'project': project_data,
    'relevant_procedures': relevant  # Just 3, not 200
}

RAG Architecture

User Query
    ↓
Query Embedding
    ↓
Vector Search
    ↓
Top K Results (by similarity)
    ↓
Relevance Filtering (threshold)
    ↓
Context Augmentation
    ↓
Agent Processing

Vector Database Options

Weaviate:

Good for production scale
Rich filtering capabilities
Self-hosted or cloud

Pinecone:

Managed service
Fast and reliable
Easy to get started

pgvector (PostgreSQL):

Use existing PostgreSQL
Good for moderate scale
No additional infrastructure

When to use each:

pgvector: <100K vectors, already using PostgreSQL
Weaviate: 100K-10M vectors, need rich filtering
Pinecone: Any scale, want managed solution

Pattern 5: Session State vs. Long-Term Memory

The Problem

Agents need two types of memory:

Session memory: Current conversation, temporary
Long-term memory: User preferences, historical decisions, persistent

Treating them the same leads to:

Session data polluting long-term memory
Long-term memory cluttering sessions
Difficulty clearing temporary data
Privacy and data retention issues

The Solution: Separate Storage with Clear Boundaries

class MemoryManager:
    def __init__(self, session_store, long_term_store):
        self.session = session_store      # Redis, TTL: 1 hour
        self.long_term = long_term_store  # PostgreSQL, permanent

    async def get_session_context(self, session_id: str) -> dict:
        """
        Get temporary session data.
        Auto-expires after inactivity.
        """
        return await self.session.get(f"session:{session_id}")

    async def get_long_term_context(self, user_id: str) -> dict:
        """
        Get persistent user data.
        Requires explicit deletion.
        """
        return await self.long_term.query(
            "SELECT preferences, history FROM user_memory WHERE user_id = $1",
            user_id
        )

    async def build_complete_context(self, session_id: str, user_id: str) -> dict:
        """
        Combine session and long-term memory.
        """
        session_data = await self.get_session_context(session_id)
        long_term_data = await self.get_long_term_context(user_id)

        return {
            'current_session': session_data,
            'user_memory': long_term_data
        }

    async def save_to_long_term(self, user_id: str, key: str, value: any):
        """
        Explicitly save important information for future sessions.
        """
        await self.long_term.execute(
            "INSERT INTO user_memory (user_id, key, value) VALUES ($1, $2, $3)",
            user_id, key, value
        )

What Goes Where

Session Storage (Temporary):

Current conversation history
Active task state
Temporary tool results
Draft outputs
Workflow progress

Long-Term Storage (Permanent):

User preferences (language, style)
Project associations
Historical decisions
Learned patterns
Completed task outcomes

Example: Quality Planning

Session memory:

{
    "session_id": "sess_123",
    "task": "quality_planning",
    "current_step": 5,
    "collected_data": {
        "product": "brake pads",
        "materials": ["steel 304"],
        "machines": ["M1", "M2"]
    },
    "draft_plan": {...}
}

Long-term memory:

{
    "user_id": "user_456",
    "preferences": {
        "default_tolerance": "±0.01mm",
        "preferred_machines": ["M1", "M2"],
        "notification_style": "summary"
    },
    "completed_plans": [
        {"project": "Project A", "date": "2024-01-15"},
        {"project": "Project B", "date": "2024-02-20"}
    ]
}

Memory Lifecycle

Session memory:

Created on first message
Updated each turn
Auto-expires after 1 hour of inactivity
Can be explicitly cleared

Long-term memory:

Created on user signup
Updated on explicit events (preferences changed, task completed)
Never expires (except for data retention policies)
Requires user action to delete

Pattern 6: Context Compression Techniques

The Problem

Sometimes you need to reference large documents but can't fit them in context. User uploads a 20-page PDF. You need key information but not everything.

The Solution: Multi-Level Compression

class ContextCompressor:

    async def compress_document(self, document: str, target_tokens: int) -> str:
        """
        Compress document to fit within token budget.
        """

        current_tokens = self.count_tokens(document)

        if current_tokens <= target_tokens:
            return document  # Already fits

        # Level 1: Extract key sections
        if current_tokens < target_tokens * 2:
            return await self._extract_key_sections(document, target_tokens)

        # Level 2: Summarize sections
        if current_tokens < target_tokens * 5:
            return await self._summarize_sections(document, target_tokens)

        # Level 3: Create hierarchical summary
        return await self._hierarchical_summary(document, target_tokens)

    async def _extract_key_sections(self, document: str, target: int) -> str:
        """
        Extract most relevant sections based on headings and keywords.
        """
        sections = self._split_by_headings(document)
        scored_sections = []

        for section in sections:
            score = self._relevance_score(section)
            scored_sections.append((score, section))

        # Take top sections until we hit token limit
        sorted_sections = sorted(scored_sections, reverse=True)
        result = []
        tokens_used = 0

        for score, section in sorted_sections:
            section_tokens = self.count_tokens(section)
            if tokens_used + section_tokens <= target:
                result.append(section)
                tokens_used += section_tokens
            else:
                break

        return '\n\n'.join(result)

    async def _summarize_sections(self, document: str, target: int) -> str:
        """
        Summarize each section independently.
        """
        sections = self._split_by_headings(document)
        summaries = []

        for section in sections:
            summary = await self.llm.complete(
                f"Summarize this section in 2-3 sentences:\n{section}"
            )
            summaries.append(f"**{section.heading}:** {summary}")

        return '\n\n'.join(summaries)

    async def _hierarchical_summary(self, document: str, target: int) -> str:
        """
        Create multi-level summary for very large documents.
        """
        # Split into chunks
        chunks = self._split_into_chunks(document, chunk_size=2000)

        # Summarize each chunk
        chunk_summaries = []
        for chunk in chunks:
            summary = await self.llm.complete(
                f"Summarize key points from this text:\n{chunk}"
            )
            chunk_summaries.append(summary)

        # Summarize the summaries
        combined_summaries = '\n'.join(chunk_summaries)
        final_summary = await self.llm.complete(
            f"Create a comprehensive summary from these section summaries:\n{combined_summaries}"
        )

        return final_summary

Compression Strategies by Document Type

Code files:

Extract function signatures
Keep docstrings
Summarize implementation
Preserve key logic

Reports/Documents:

Keep executive summary
Extract headings and key points
Compress body paragraphs
Preserve conclusions

Data files:

Show schema/structure
Provide sample rows
Summarize statistics
List unique values

Conversations:

Keep decisions and actions
Compress explanations
Preserve outcomes
Remove redundancy

Putting It All Together: The Context Stack

Here's how all these patterns combine in a production system:

User Message
    ↓
┌─────────────────────────┐
│  Base Context Builder   │
│  (Minimal required)     │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Task-Specific Context   │
│ (Profile-based loading) │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Conversation Window     │
│ (Recent + Summary)      │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ RAG Retrieval           │
│ (If knowledge needed)   │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Memory Integration      │
│ (Session + Long-term)   │
└───────────┬─────────────┘
            ↓
┌─────────────────────────┐
│ Final Context Assembly  │
│ (Within budget)         │
└───────────┬─────────────┘
            ↓
Agent Processing

Example: Quality Planning Request

User: "Create a quality plan for brake pads"

Context assembly:

Base context (500 tokens):

{
    "user_id": "user_123",
    "session_id": "sess_456",
    "task": "quality_planning"
}

Task-specific context (2,000 tokens):

{
    "machines": [...],
    "materials": [...],
    "specifications": [...]
}

Conversation window (1,500 tokens):

{
    "summary": "User requested quality plan. Product: brake pads.",
    "recent_messages": [last 5 messages]
}

RAG retrieval (1,000 tokens):

{
    "retrieved_procedures": [
        "Quality planning procedure for automotive parts",
        "Brake pad inspection guidelines",
        "Material specification standards"
    ]
}

Memory (500 tokens):

{
    "user_preferences": {
        "default_tolerance": "±0.01mm"
    },
    "previous_plans": ["Project A", "Project B"]
}

Total context: 5,500 tokens (within 10K budget, leaving room for response)

Key Takeaways

Effective context engineering requires:

✅ 1. Lazy Loading

Start minimal, load on-demand
Use tools for dynamic retrieval
Pay only for what you use

✅ 2. Task-Specific Profiles

Define context needs per agent type
Load only relevant data
Isolate contexts between tasks

✅ 3. Smart Windowing

Keep recent messages in full
Summarize older messages
Preserve critical information

✅ 4. RAG for Large Knowledge

Vector search for relevant chunks
Don't fit everything in context
Retrieve top-k similar items

✅ 5. Separate Memory Types

Session: temporary, auto-expires
Long-term: persistent, explicit
Clear boundaries between them

✅ 6. Compression Techniques

Extract key sections
Summarize large documents
Hierarchical summarization

Common Anti-Patterns to Avoid

❌ Loading everything upfront → Wasted tokens, high costs, context limit errors

❌ No conversation history limits → Unbounded growth, eventual failure

❌ Treating all memory as permanent → Cluttered context, privacy issues

❌ No context compression → Can't handle large documents

❌ Same context for all tasks → Irrelevant data confuses agents

The Bottom Line

Context is your agent's memory—and memory is expensive. The key is giving agents the right information at the right time without exceeding token budgets.

What works:

Lazy loading with on-demand tools
Task-specific context profiles
Smart conversation windowing
RAG for large knowledge bases
Separate session and long-term memory
Context compression for large docs

What fails:

Eager loading of all data
Unlimited conversation history
Mixed memory types
No compression strategy
Universal context for all tasks

Context engineering isn't about cramming everything into the window—it's about strategic selection, smart summarization, and ruthless prioritization.

Get this right, and your agents have perfect memory at sustainable cost.

About the Author

I build production-grade multi-agent systems with optimized context management strategies. My implementations achieve 60% cost reduction while improving response relevance through lazy loading, RAG, and smart windowing.

Specialized in context engineering, token optimization, and cost-effective agent architectures using CrewAI, Agno, and vector databases.

Open to consulting on context management challenges!
📧 Contact: gupta.akshay1996@gmail.com

Found this helpful? Share it with other AI builders! 🚀

What context management challenges are you facing? Drop a comment below!

DEV Community: Akshay Gupta

I scraped 25 AI/tech communities for 6 months. Here's what the data actually says.

The question that started this

Some things the data revealed

1. Hype vs Reality is measurable

2. When Reddit and Hacker News disagree, pay attention

3. "Switched from X to Y" is an underrated signal

4. Every community frustration is a product opportunity

5. You can track a paper's journey from research to production

6. Opinion leaders shift their stances — and that's a leading indicator

7. The job market tells you what's real

8. Where the smart money converges

9. Narratives shift before markets do

How it works (the short version)

Who finds this useful

Try it yourself

What I'd love feedback on

Rethinking API Design for AI Agents: From Data Plumbing to Intelligent Interfaces

The Problem: APIs Built for Humans, Not Machines That Think

What Makes an API "Agent-Ready"?

Example: Equipment Maintenance Status

The Three Pillars of Agent-Ready APIs

1. Clarity: Speak Business Intent, Not Database Schema

2. Context: Add Meaning, Not Just Data

Example: Risk Assessment

3. Consistency: Predictable Contracts Agents Can Trust

The Microservices Trap: When Modularity Becomes Fragmentation

Scenario: Determining Expedited Shipping Eligibility

Two Approaches to Adding Intelligence

Approach 1: Pure Logic (Deterministic Intelligence) ✅

Approach 2: Hybrid (Selective LLM Enhancement) 🎯

Cost Comparison

Architecture Pattern: The Three-Layer Stack

Real-World Benefits: Before & After

Manufacturing Operations Example

Key Takeaways

Further Reading

Production-Grade AI Agents: Architecture Patterns That Actually Work

The Development vs. Production Gap

Pattern 1: Goal-Oriented Agents with Explicit Completion

The Problem

The Solution: Explicit Completion Signals

Why This Works

Real-World Impact

Pattern 2: Context Isolation by Task

The Problem

The Solution: Project-Based Context Windows

Context Boundaries

Why This Works

Pattern 3: LLM-Based Intent Routing

The Problem

The Solution: LLM as Router

Why LLM Routing Works

Routing Performance

Pattern 4: The Orchestrator Pattern

The Problem

The Solution: Central Orchestrator

Orchestrator Responsibilities

State Transitions

Why This Works

Pattern 5: Off-Topic Detection with Context Preservation

The Problem

The Solution: Conservative Off-Topic Detection

Graceful Topic Switching

Why Conservative Detection Works

Pattern 6: Tool Call Orchestration and Validation

The Problem

The Solution: MCP (Model Context Protocol) Pattern

Tool Validation Strategy

Agent Tool Error Handling

Why This Pattern Works

Pattern 7: Conversation History Management

The Problem

The Solution: Smart History Windowing

When to Summarize

What to Keep vs. Summarize

Real-World Architecture: Putting It Together

Flow Example: Quality Planning

Key Takeaways

✅ 1. Goal-Oriented Design