DEV Community: Suraj Khaitan

Building Production-Ready AI Agents with MCP: The Enterprise Blueprint Nobody Talks About

Suraj Khaitan — Sun, 17 May 2026 07:00:22 +0000

A deep technical guide to multi-agent orchestration, knowledge retrieval via Model Context Protocol, hallucination control, and serverless deployment — patterns extracted from real production systems.

The Gap Between Demo and Production

You've seen the demos. A shiny chatbot that answers questions about PDFs, retrieves knowledge from a vector store, and produces fluent responses. It works in the notebook. It impresses in the meeting room. Then you try to ship it.

Six weeks later, the agent hallucinates on a customer query. The vector search retrieves semantically irrelevant chunks. DynamoDB checkpointing breaks under concurrent load. The Lambda cold starts introduce 8-second latency spikes. The LLM picks the wrong knowledge base and confidently answers from the wrong domain.

This is the reality of production GenAI systems. And almost nobody writes honestly about what it actually takes to build them correctly.

This article documents the patterns, decisions, and hard lessons from building a multi-agent knowledge retrieval system for an enterprise use case: multiple specialized knowledge bases, a validation pipeline, a transformation agent, and a stateful chatbot — all wired together through MCP (Model Context Protocol) on a serverless cloud stack.

We'll go from fundamentals to full deployment architecture, with code you can actually use.

Why Most AI Agents Fail in Production

Before we build, let's diagnose. The failures are almost always the same five categories:

1. Retrieval is naïve

Most prototypes use a single vector store with cosine similarity. In enterprise settings, your knowledge is segmented. Safety documentation has different structure and retrieval semantics than software manuals. When you throw everything into one index, precision tanks. The agent retrieves documents that sound relevant but answer the wrong question.

2. The agent has no memory architecture

Session state lives in a dict that gets destroyed between requests. Thread IDs aren't propagated. Conversation history is either unlimited (context window overflow) or absent (agent forgets what it just said).

3. Tool contracts are loose

The LLM calls tools with missing, wrong, or hallucinated arguments. No validation. No schema enforcement. The tool silently returns nothing; the LLM fabricates a response.

4. Multi-agent coordination is an afterthought

One agent processes user queries. A second agent validates documents. A third transforms raw uploads. These agents are deployed independently with no shared message schema, no retry contract, and no shared observability. When one fails, you find out from the user.

5. Deployment is a science project

Lambda packages bloat beyond 50MB. Layers conflict. Cold starts kill latency SLAs. Dependencies are loaded on every invocation instead of being cached at the container level.

Each of these is solvable. But you need a system, not a stack of LangChain tutorials.

What MCP Solves

Model Context Protocol (MCP) is a JSON-RPC-based communication protocol for connecting AI agents to external tools, data sources, and services. Think of it as a standardized API contract between your LLM and the world outside it.

Where most RAG implementations hardcode retrieval calls directly into the agent logic, MCP externalizes them into discrete, versioned, discoverable services. Your agent becomes a client. Your retriever becomes a server. The contract is typed.

{
  "jsonrpc": "2.0",
  "id": "a1b2c3d4",
  "method": "tools/call",
  "params": {
    "name": "hybridQueryTool",
    "arguments": {
      "retriever_input": {
        "query": "What are the safety circuit requirements for servo drives?",
        "kb_id": "kb-regulations"
      }
    }
  }
}

This gives you four things that matter in production:

Decoupling: The retrieval implementation can change without touching the agent
Versioning: MCP endpoints are independently deployable
Observability: You can log, trace, and rate-limit at the protocol layer
Multi-tenancy: Multiple agents can share the same MCP server under different routing keys

Recommended Enterprise Architecture

Here is the full system architecture we'll implement:

┌──────────────────────────────────────────────────────────────┐
│                        API Gateway                           │
│                (JWT / AWS IAM Authentication)                │
└───────────────────────────┬──────────────────────────────────┘
                            │
              ┌─────────────▼──────────────┐
              │        API Lambda          │
              │  (routing, auth, presigned │
              │   URLs, async S3 reads)    │
              └──────┬──────────┬──────────┘
                     │          │
          ┌──────────▼─┐    ┌───▼────────────────┐
          │  Chatbot   │    │  Upload + Transform  │
          │  Agent     │    │  Pipeline (SQS-      │
          │  Lambda    │    │  triggered)          │
          └──────┬─────┘    └──────────┬───────────┘
                 │                     │
          ┌──────▼─────┐        ┌──────▼──────────┐
          │ LangGraph  │        │ Transformation   │
          │ Workflow   │        │ Agent Lambda     │
          │            │        │ (parse → S3)     │
          └──────┬─────┘        └──────────────────┘
                 │                     │ (incidents)
          ┌──────▼─────┐        ┌──────▼──────────┐
          │  MCP Layer │        │  Checker Agent  │
          │            │        │  Lambda (SQS-   │
          │  ┌────────┐│        │  triggered)     │
          │  │ KB-1   ││        └──────┬──────────┘
          │  │ KB-2   ││               │
          │  │ KB-3   ││        ┌──────▼──────────┐
          │  │ ...    ││        │   MCP Layer     │
          │  └────────┘│        │ (domain KB)     │
          └────────────┘        └─────────────────┘
                 │
         ┌───────▼────────┐
         │   DynamoDB     │
         │ (Checkpointing │
         │  / History)    │
         └────────────────┘

Let's build each layer.

The LangGraph Agent Core

LangGraph is the right choice for production agents. It gives you explicit state management, conditional routing, and composable graphs. Here's the complete core pattern.

Data Models First

Type safety is non-negotiable. Define your contract before you write any logic:

from pydantic import BaseModel, ConfigDict, Field
from typing import Any

class AgentMessageRequest(BaseModel):
    message: str = Field(..., description="User message")
    sessionId: str | None = Field(None, description="Optional session ID")
    metadata: dict[str, Any] = Field(default_factory=dict)


class Message(BaseModel):
    model_config = ConfigDict()

    step_id: str | None = Field(None)
    role: str = Field(...)           # "user" | "agent"
    content: str = Field(...)
    structural_content: dict[str, Any] | None = Field(None)
    create_timestamp: str = Field(...)
    metadata: dict[str, Any] = Field(default_factory=dict)

Strong typing catches argument mismatches at the boundary, not deep inside graph execution.

The Tool Definition

This is where MCP integration lives. The @tool decorator makes this function visible to the LLM as a callable:

from langchain.tools import tool

VALID_DOMAINS = {
    "kb-documents",
    "kb-specifications",
    "kb-regulations",
}

@tool
def query_knowledge_base(query: str, domain: str | None = None) -> str:
    """
    Query a specialized knowledge base for domain-specific information.

    Select the most appropriate domain based on the user's question:
    - 'kb-documents': Product manuals, technical guides, API references
    - 'kb-specifications': Hardware and software configuration standards
    - 'kb-regulations': Compliance requirements, safety standards, audit rules

    Args:
        query: Rich contextual search query. More context = better results.
        domain: Target knowledge domain. Required for precision retrieval.

    Returns:
        Formatted knowledge base chunks as a single string.

    Note:
        Query is vectorized for cosine similarity + keyword hybrid search.
    """
    if not isinstance(query, str) or not query.strip():
        return "Invalid query. Please provide a non-empty string."

    if domain and domain not in VALID_DOMAINS:
        return f"Invalid domain '{domain}'. Choose from: {', '.join(VALID_DOMAINS)}"

    try:
        global _LAST_KB_CONTEXT

        kb_context = fetch_from_mcp(query, domain=domain)

        if kb_context:
            _LAST_KB_CONTEXT = kb_context  # store for metadata extraction post-graph

        if kb_context.get("status") == "success" and kb_context.get("context"):
            formatted = "\n---\n".join(kb_context["context"])
            return f"Knowledge Base Results:\n\n{formatted}"

        return "No relevant information found in knowledge base."

    except Exception as e:
        logger.error(f"Tool execution failed: {e}")
        return f"Knowledge base query failed: {str(e)}"

Critical pattern: _LAST_KB_CONTEXT is a module-level global that captures references (file URLs, page numbers) returned by the MCP retriever. These can't travel through the LangGraph message channel cleanly — they're metadata, not conversation content. After the graph completes, you extract them from this global. This works because Lambda containers are single-threaded per invocation.

The Graph Structure

from langgraph.graph import StateGraph, START, END, MessagesState
from langgraph_dynamodb_checkpoint import DynamoDBSaver

def build_graph(checkpointer: DynamoDBSaver | None = None) -> StateGraph:
    graph = StateGraph(state_schema=MessagesState)

    graph.add_node("llm_node", llm_node)
    graph.add_node("tool_node", tool_node)

    graph.add_edge(START, "llm_node")
    graph.add_conditional_edges(
        "llm_node",
        should_continue,       # router function
        ["tool_node", END]
    )
    graph.add_edge("tool_node", "llm_node")  # tool result → back to LLM

    if checkpointer:
        return graph.compile(checkpointer=checkpointer)
    return graph.compile()

The graph is a ReAct loop: LLM reasons → decides whether to call a tool → tool executes → result fed back to LLM → LLM reasons again. This continues until the LLM determines it can answer without calling another tool.

The Router

def should_continue(state: MessagesState) -> str:
    last_message = state["messages"][-1]

    if hasattr(last_message, 'tool_calls') and last_message.tool_calls:
        return "tool_node"

    return END

Simple but critical. If the LLM emits tool calls, route to tool execution. Otherwise, the response is complete.

The LLM Node

from langchain_core.messages import SystemMessage
from langchain_aws import ChatBedrock
import boto3, os

def llm_node(state: MessagesState):
    global _AGENT_SUMMARY
    llm_with_tools = get_llm_with_tools()
    response = llm_with_tools.invoke(
        [SystemMessage(content=get_prompt(agent_summary=_AGENT_SUMMARY or ""))]
        + state["messages"]
    )
    return {"messages": [response]}


def tool_node(state: dict):
    tools_by_name = {tool.name: tool for tool in [query_knowledge_base]}
    result = []

    for tool_call in state["messages"][-1].tool_calls:
        tool_name = tool_call["name"]

        if tool_name not in tools_by_name:
            result.append(ToolMessage(
                content=f"Error: Unknown tool '{tool_name}'",
                tool_call_id=tool_call["id"]
            ))
            continue

        observation = tools_by_name[tool_name].invoke(tool_call["args"])
        result.append(ToolMessage(content=observation, tool_call_id=tool_call["id"]))

    return {"messages": result}

Conversation Checkpointing

Stateless Lambdas need external state. DynamoDB gives you persistent conversation memory:

import time, os
from langgraph_dynamodb_checkpoint import DynamoDBSaver

def get_checkpoint_table() -> DynamoDBSaver | None:
    table_name = os.environ.get("MEMORY_TABLE")
    if not table_name:
        logger.warning("MEMORY_TABLE not set; running stateless")
        return None

    return DynamoDBSaver(
        table_name=table_name,
        max_read_request_units=10,
        max_write_request_units=10,
        ttl_seconds=int(time.time()) + 28 * 86400,  # 28-day TTL
    )

Thread IDs tie conversation turns together. On each request, the graph replays from the last checkpoint, not from scratch:

thread_id = message.metadata.get("thread_id", uuid.uuid4().hex)
config = {"configurable": {"thread_id": thread_id}}

result = graph.invoke(
    {"messages": [HumanMessage(content=message.content)]},
    config
)

Production note: The 28-day TTL prevents unbounded storage growth. Conversations older than 28 days are automatically purged by DynamoDB TTL. Set this to match your retention policy.

Multi-Agent Orchestration Patterns

The chatbot is one of three agents in this system. Here's how multi-agent orchestration actually works in production serverless architectures:

Agent 1: Chatbot Agent
  → Handles real-time user Q&A
  → LangGraph ReAct loop
  → Synchronous API response

Agent 2: Transformation Agent
  → SQS-triggered (file upload events)
  → Parses structured documents → normalized JSON
  → Routes based on document type metadata

Agent 3: Checker / Validation Agent
  → SQS-triggered (per incident)
  → Consults domain knowledge base
  → Appends recommended_action to S3 results

Asynchronous Agent Pipelines via SQS

The transformation agent fires when a user uploads files. SQS decouples the upload from the processing:

def lambda_handler(event, context):
    jobs = {}

    for sqs_record in event['Records']:
        s3_event = json.loads(sqs_record['body'])

        for s3_record in s3_event['Records']:
            bucket = s3_record['s3']['bucket']['name']
            s3_key = unquote_plus(s3_record['s3']['object']['key'])

            # Extract job context from S3 key structure:
            # jobs/{user_id}/{project_name}/{job_id}/docs/{filename}
            parts = s3_key.split('/')
            project_name = parts[2]
            job_id = parts[3]
            user_id = parts[1]

            if job_id not in jobs:
                jobs[job_id] = {'bucket': bucket, 'project_name': project_name}

    # Idempotent: resolve job from S3, not from the event payload
    for job_id, job_data in jobs.items():
        all_files = list_job_files(bucket, user_id, job_data['project_name'], job_id)
        process_job(bucket, user_id, job_data['project_name'], job_id, all_files)

Critical pattern: The trigger file is just a signal. Always list all files from S3 when processing. This makes the pipeline idempotent — reprocessing a job picks up all files regardless of upload order.

Document-Type Routing

def categorize_files(files: list[dict]) -> dict:
    categorized = {'report': None, 'model': None, 'rules': None}

    for file_info in files:
        filename = file_info['filename'].lower()

        if filename.endswith(('.xlsx', '.xls')):
            categorized['report'] = file_info     # structured data → incidents
        elif filename.endswith(('.plczip', '.robzip')):
            categorized['model'] = file_info      # binary model → JSON
        elif filename.endswith('.xml'):
            categorized['rules'] = file_info      # rule definitions → JSON

    return categorized

Each file type has a dedicated parser. The transformation agent orchestrates them in dependency order: report first (to extract metadata needed by subsequent parsers), then model, then rules.

The Checker / Validation Agent

After transformation, individual incidents (one per detected issue) are queued via SQS. The checker agent processes them individually, consulting the domain knowledge base:

def lambda_handler(event, context):
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        s3_record = body["Records"][0]

        bucket_name = s3_record["s3"]["bucket"]["name"]
        object_key = urllib.parse.unquote_plus(s3_record["s3"]["object"]["key"])
        base_path = object_key.partition("incidents/")[0]

        # Read incident JSON from S3
        incident_message = s3.get_object(
            Bucket=bucket_name, Key=object_key
        )["Body"].read().decode("utf-8")

        # Run LangGraph agent: incident → knowledge base → recommendation
        graph = build_graph()
        result = graph.invoke({
            "messages": [HumanMessage(content=incident_message)]
        })

        # Append recommendation and write to results/
        output = json.loads(incident_message)
        output["recommended_action"] = result["messages"][-1].content
        push_to_s3(output, base_path)

Each agent is independently deployable, independently scalable, and independently observable. The shared contract is the S3 path structure and the JSON schema.

MCP Communication Layer

Here is the complete MCP client implementation — the most critical piece of production infrastructure in the entire system:

import httpx
import uuid
import os
import json
import logging
from typing import Any
from httpx import Headers

logger = logging.getLogger(__name__)

def send_mcp_request(
    method: str,
    params: Any = None,
    session_id: str | None = None,
    config_key: str | None = None,
) -> tuple[dict, Headers]:
    """
    Sends a JSON-RPC 2.0 request to the MCP server.
    Resolves the MCP endpoint URL from a secure configuration store.
    """
    headers = {
        "Content-Type": "application/json",
        "MCP-Version": "2024-01-01",
    }
    if session_id:
        headers["MCP-Session-Id"] = session_id

    body = {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": method,
    }
    if params is not None:
        body["params"] = params

    if config_key:
        load_config_into_env(config_key)

    mcp_base_url = os.getenv("RETRIEVER_SERVICE_URL")
    if not mcp_base_url:
        raise ValueError("RETRIEVER_SERVICE_URL is not configured")

    mcp_url = mcp_base_url.rstrip("/") + "/mcp"

    try:
        response = httpx.post(mcp_url, json=body, headers=headers, timeout=30)
        response.raise_for_status()
        return response.json(), response.headers
    except httpx.HTTPStatusError as e:
        logger.error(f"MCP server error {e.response.status_code}: {e.response.text}")
        raise
    except httpx.RequestError as e:
        logger.error(f"MCP network error: {e}")
        raise

Dynamic KB Routing

The MCP call is parameterized at runtime. The domain identifier determines which retriever service receives the request:

def fetch_from_mcp(query: str, domain: str | None = None) -> dict[str, Any]:
    try:
        kb_config = json.loads(os.environ.get("KB_CONFIG", "{}"))
        agent_id = kb_config.get("configurations", [{}])[0].get("agent_id")
        kb_type = kb_config.get("configurations", [{}])[0].get("kb_type")
        kb_id = domain or os.environ.get("KB_ID", "default")

        # Dynamic endpoint resolution per knowledge base:
        # /config/{agent_id}/{kb_id}/{kb_type}/RETRIEVER_SERVICE_URL
        config_key = f"/config/{agent_id}/{kb_id}/{kb_type}/RETRIEVER_SERVICE_URL"

        payload = {
            "retriever_input": {
                "query": query,
                "kb_id": kb_id,
            }
        }

        tool_name = kb_config.get("defaults", {}).get("tool_name", "hybridQueryTool")

        resp_json, _ = send_mcp_request(
            method="tools/call",
            params={"name": tool_name, "arguments": payload},
            session_id=os.environ.get("MCP_SESSION_ID"),
            config_key=config_key,
        )

        return parse_kb_response(resp_json)

    except Exception as e:
        logger.error(f"MCP call failed: {e}")
        return {"status": "error", "context": [], "reference": [], "error": str(e)}

Why dynamic endpoint resolution? Each knowledge domain can be served by a different retriever instance — different hardware, different index type, different SLA. By resolving the endpoint from configuration at call-time, you can independently scale, migrate, and update individual knowledge bases without redeploying the agent.

Response Parsing

MCP responses are nested. Parse them defensively:

import ast

def parse_kb_response(resp_json: dict[str, Any]) -> dict[str, Any]:
    try:
        outer = ast.literal_eval(resp_json["result"]["content"][0]["text"])
        body = json.loads(outer["body"])

        context: list[str] = []
        reference: list[dict] = []

        for item in body:
            if item.get("text"):
                context.append(item["text"])
            reference.append({k: v for k, v in item.items() if k != "text"})

        return {
            "status": outer.get("status"),
            "context": context,       # text chunks for LLM consumption
            "reference": reference,   # metadata (file URLs, page numbers)
        }

    except (KeyError, ValueError, SyntaxError, json.JSONDecodeError) as e:
        logger.error(f"Failed to parse KB response: {e}")
        return {"status": "error", "context": [], "reference": [], "error": str(e)}

Separate context from reference. The LLM gets context. The UI gets reference metadata for citation display. Never mix them.

Retrieval + Knowledge Layer

Hybrid Search Configuration

Single-mode retrieval (pure vector or pure keyword) consistently underperforms on technical documentation. Production systems need hybrid:

{
  "knowledge_base": {
    "defaults": {
      "kb_type": "lancedb",
      "retriever_type": "hybrid",
      "tool_name": "hybridQueryTool"
    },
    "configurations": [
      { "kb_name": "kb-documents",     "kb_type": "lancedb" },
      { "kb_name": "kb-specifications", "kb_type": "lancedb" },
      { "kb_name": "kb-regulations",   "kb_type": "lancedb" }
    ],
    "infrastructure": {
      "embedding_model": "amazon.titan-embed-text-v2:0"
    }
  }
}

Why separate knowledge bases per domain?

Precision: Documents have different embedding distributions from regulatory text. Domain-scoped indexes give higher precision at the same k.
Access control: You can enforce per-KB authorization at the MCP layer.
Independent updates: A regulations KB can be re-indexed without touching documents or specifications.
Observability: Per-KB latency and error metrics tell you exactly which domain is degrading.

Context Window Management

Never pass raw retrieval chunks to the LLM. Format them with separators so the LLM can identify chunk boundaries:

if kb_context.get("status") == "success" and kb_context.get("context"):
    formatted = "\n---\n".join(kb_context["context"])
    return f"Knowledge Base Results:\n\n{formatted}"

The --- separator is cheap signal. The LLM treats each chunk as a discrete evidence unit rather than a continuous blob.

Validation & Hallucination Prevention

Hallucination in domain-specific agents isn't just wrong answers — it's wrong answers delivered with high confidence that looks correct to non-experts.

Guard at the Prompt Layer

Your system prompt is the first line of defense:

<INSTRUCTIONS>
3. Information Retrieval
   - Use the retrieval tool only when domain-specific factual information is required.
   - If the knowledge base returns no results or an error, inform the user and advise
     contacting the support team.
   - Do not guess or invent information not found in the Knowledge Base.
</INSTRUCTIONS>

Explicit negative instructions outperform implicit expectations. Tell the model what it must NOT do, not just what it should do.

Guard at the Config Layer

Content filtering runs before and after the LLM:

{
  "guardrail": {
    "filters": {
      "Hate":          "MEDIUM",
      "SEXUAL":        "MEDIUM",
      "Violence":      "MEDIUM",
      "Insults":       "MEDIUM",
      "MISCONDUCT":    "MEDIUM",
      "Prompt Attack": "HIGH"
    }
  }
}

Set Prompt Attack to HIGH. Prompt injection is the most common real attack vector against document-grounded agents.

Guard at the Tool Layer

Validate tool arguments before executing any external call:

@tool
def query_knowledge_base(query: str, domain: str | None = None) -> str:
    if not isinstance(query, str) or not query.strip():
        return "Invalid query. Please provide a non-empty string."

    valid_domains = {"kb-documents", "kb-specifications", "kb-regulations"}
    if domain and domain not in valid_domains:
        return f"Invalid domain '{domain}'. Choose from: {', '.join(valid_domains)}"

    # Only reach external systems after validation passes
    ...

Return descriptive error strings rather than raising exceptions. The LLM can reason about a string error message and self-correct. An unhandled exception terminates tool execution with no recovery path.

Conversation Scope Enforcement

Prevent domain drift through prompt rules:

<KB_RULES>
- Each conversation uses exactly one Knowledge Base.
- The Knowledge Base is selected only at conversation start.
- Switching Knowledge Bases within a conversation is not allowed.
- The selected Knowledge Base is stored in conversation history.
</KB_RULES>

This seems restrictive but it's correct for expert systems. A user working in kb-regulations doesn't want their session drifting into kb-specifications mid-conversation. Scope enforcement is a feature, not a limitation.

LLM Client Caching

Lambda containers are reused across invocations. Cache expensive initialization at the module level:

_LLM_WITH_TOOLS_CACHE = None
_AGENT_SUMMARY = None
_LAST_KB_CONTEXT = None

def get_llm_with_tools() -> Any:
    global _LLM_WITH_TOOLS_CACHE

    if _LLM_WITH_TOOLS_CACHE is not None:
        logger.info("Using cached LLM client")
        return _LLM_WITH_TOOLS_CACHE

    llm = ChatBedrock(
        model_id=os.getenv("MODEL_ID"),
        client=boto3.client("bedrock-runtime", region_name=os.getenv("AWS_REGION")),
    )
    _LLM_WITH_TOOLS_CACHE = llm.bind_tools([query_knowledge_base])
    return _LLM_WITH_TOOLS_CACHE

And critically, reset request-scoped state at the start of every invocation:

def process_message(message: Message) -> Message:
    global _LAST_KB_CONTEXT
    _LAST_KB_CONTEXT = None  # Reset — avoid stale data from previous warm invocation
    ...

This is a subtle but critical bug if missed. Without the reset, the first request on a warm container sets _LAST_KB_CONTEXT. The second request inherits that stale context if the retrieval tool isn't called — returning citations from the previous user's query. This is both a correctness bug and a potential data exposure issue.

Observability & Monitoring

Structured Logging at Every Layer

logger.info(f"Thread: {thread_id} | Domain: {domain} | Query length: {len(query)}")
logger.info(f"MCP response: status={output.get('status')}, chunks={len(output.get('context', []))}")
logger.info(f"Routing decision: {routing_decision} | Tool calls detected: {bool(tool_calls)}")
logger.info(f"Response length: {len(result.content)} chars")

Log the routing decision, not just the outcome. When debugging a wrong answer, knowing which tool was called (or wasn't) is more valuable than the final response text.

Key Metrics to Track

Metric	Why It Matters
KB retrieval latency per domain	Identifies degraded retrieval services
Tool call rate per session	High = LLM confused; zero = retrieval bypassed
Context chunks per query	Low count = poor retrieval quality
Graph iterations per request	High count = possible ReAct loop
Checkpoint read/write failures	Silent data loss in conversation history
Cold start frequency	Proxy for concurrent load spikes

Async Result Aggregation

When users poll for processing results, don't serialize S3 reads:

import asyncio
import aioboto3

MAX_CONCURRENCY = 20

async def aggregate_results(user_id: str, job_id: str, project: str) -> list[dict]:
    prefix = f"jobs/{user_id}/{project}/{job_id}/results/"

    async with aioboto3.Session().client("s3") as s3:
        resp = await s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
        keys = [obj["Key"] for obj in resp.get("Contents", []) if obj["Key"].endswith(".json")]

        semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

        async def fetch(key):
            async with semaphore:
                obj = await s3.get_object(Bucket=BUCKET, Key=key)
                content = await obj["Body"].read()
                return json.loads(content)

        results = await asyncio.gather(*[fetch(k) for k in keys])
        return [item for sublist in results for item in (sublist if isinstance(sublist, list) else [sublist])]

The semaphore prevents S3 throttling on large result sets. 20 concurrent reads is a conservative default; tune against your S3 request rate limits.

Deployment & Scaling

Lambda Layer Management

The default Lambda deployment package limit is 250MB unzipped. LangGraph, LangChain, and their transitive dependencies comfortably exceed this. The solution: load layers dynamically from S3 at cold start:

LAYER_FILES = [
    "langgraph-layer.zip",
    "langchain-layer.zip",
    "base-utils-layer.zip"
]
TMP_DIR = "/tmp/layers"

def load_s3_layers() -> None:
    s3 = boto3.client("s3")
    os.makedirs(TMP_DIR, exist_ok=True)

    for layer_file in LAYER_FILES:
        extract_path = os.path.join(TMP_DIR, layer_file.replace(".zip", ""))

        if os.path.exists(extract_path):
            # Already extracted on this warm container — skip download
            _add_to_sys_path(extract_path)
            continue

        archive_path = os.path.join(TMP_DIR, layer_file)
        s3.download_file(BUCKET_NAME, f"layers/{layer_file}", archive_path)

        with __import__("zipfile").ZipFile(archive_path, "r") as zf:
            zf.extractall(extract_path)

        os.remove(archive_path)  # free /tmp space immediately
        _add_to_sys_path(extract_path)


def _add_to_sys_path(extract_path: str) -> None:
    import sys
    for path in [extract_path, os.path.join(extract_path, "python")]:
        if os.path.isdir(path) and path not in sys.path:
            sys.path.insert(0, path)


# Execute at module level — runs once per cold start
load_s3_layers()

The existence check on extract_path is the key optimization. Warm containers have already extracted the layers — skipping download saves 3–8 seconds per warm invocation.

Secure Configuration via Parameter Store

Never hardcode service URLs or credentials. Resolve them at runtime:

def load_config_into_env(config_key: str) -> None:
    ssm = boto3.client("ssm")

    if not config_key.endswith("/"):
        # Exact parameter — direct fetch
        response = ssm.get_parameter(Name=config_key, WithDecryption=True)
        name = response["Parameter"]["Name"].split("/")[-1]
        os.environ[name] = response["Parameter"]["Value"]
        return

    # Path prefix — fetch all parameters under path
    params = []
    next_token = None

    while True:
        kwargs = {
            "Path": "/config/",
            "WithDecryption": True,
            "Recursive": True,
            "MaxResults": 10,
        }
        if next_token:
            kwargs["NextToken"] = next_token

        response = ssm.get_parameters_by_path(**kwargs)
        params.extend(response.get("Parameters", []))
        next_token = response.get("NextToken")
        if not next_token:
            break

    for param in params:
        if config_key.lower() in param["Name"].lower():
            os.environ[param["Name"].split("/")[-1]] = param["Value"]

This pattern lets you rotate service URLs without redeploying Lambda. Update the parameter — the next cold start picks up the new value.

API Authentication

Support both JWT (user-facing) and AWS IAM (service-to-service):

def authenticate_request(event: dict) -> str:
    auth_header = (event.get("headers") or {}).get("Authorization", "")

    if auth_header.startswith("Bearer "):
        token = auth_header.split(" ", 1)[1]
        payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
        return payload.get("sub")

    if auth_header.startswith("AWS "):
        access_key, secret_key, session_token = auth_header.replace("AWS ", "").split(":")
        sts = boto3.client(
            "sts",
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
            aws_session_token=session_token,
        )
        identity = sts.get_caller_identity()
        return identity["UserId"]

    raise Exception("Unsupported auth scheme: must be Bearer JWT or AWS session credentials")

Production Best Practices

1. Fail loudly at configuration time, silently at runtime

Missing MEMORY_TABLE? Log a warning and continue stateless. Missing MODEL_ID? Raise immediately — you cannot operate without an LLM.

2. Never let the agent choose between zero options

If the knowledge base returns empty results, return that fact explicitly: "No relevant information found in knowledge base." — not silence, not a hallucinated answer.

3. Scope your agents tightly

The chatbot does real-time Q&A. The transformation agent parses documents. The checker validates incidents. One agent, one job. Never add a new capability to an existing agent without evaluating whether it belongs there.

4. Make your pipelines idempotent

S3 trigger events can be delivered more than once. Design every pipeline step so re-running it produces the same output. Overwriting an S3 file with the same content is idempotent. Appending to a database without checking for duplicates is not.

5. Test your prompts against adversarial inputs

Prompt injection is real. Test your agent with:

Instructions to ignore previous rules
Requests to reveal the system prompt
Queries that cross domain boundaries deliberately
Empty strings and whitespace-only inputs

6. Log routing decisions, not just outputs

"Routing decision: tool_node (tool calls detected)" — this log line tells you exactly why the agent behaved the way it did. Without it, debugging a wrong answer means reading the entire message history blind.

7. Set explicit TTLs on everything

DynamoDB checkpoints: 28 days. Presigned URLs: 15 minutes. Session tokens: match your security policy. If you don't set TTLs, your tables grow unboundedly and your costs climb without warning.

Lessons Learned: What Actually Went Wrong

Warm Lambda stale global state — The _LAST_KB_CONTEXT pattern is powerful but fragile. Forgetting the reset at invocation start causes the second user on a warm container to see citations from the first user's session. This is both a correctness bug and a potential privacy issue. Reset all request-scoped globals at the top of your handler, every time.

LLM cold-selecting the wrong domain — When the agent selects a knowledge domain on the first message, it does so based only on a brief user string. Users who type a domain name as a quick-select mean "activate this domain," not "answer a question about this topic." We added explicit quick-prompt detection to pre-select the domain before the LLM sees the message:

DOMAIN_LABEL_MAP = {
    "documents":      "kb-documents",
    "specifications": "kb-specifications",
    "regulations":    "kb-regulations",
}

def detect_domain_selection(message: str) -> str | None:
    normalized = message.strip().lower()
    return DOMAIN_LABEL_MAP.get(normalized)

Oversized retrieval context — Passing all retrieved chunks to the LLM without truncation causes two problems: cost (more tokens = more money) and quality (the LLM attends to early chunks more than later ones). Implement a context budget — truncate to N chunks, N tokens, or both.

SQS deduplication gaps — When multiple files in the same job trigger separate SQS events, each Lambda invocation processes only the triggering file unless you explicitly list all files from S3. Always resolve the complete job context from the source of truth (S3), not from the event payload.

DynamoDB checkpoint TTL drift — TTL in DynamoDB is approximate. Items may persist up to 48 hours past their TTL. Don't rely on DynamoDB TTL for hard security expiry. Use it for cost management only.

Final Thoughts

Production AI agents are distributed systems with an LLM in the middle. Every failure mode that applies to microservices — cascading failures, stale state, network timeouts, idempotency violations, auth edge cases — applies here too. Plus a new set: hallucination, domain drift, prompt injection, and retrieval precision.

MCP gives you a structured, evolvable interface between your agents and your knowledge. LangGraph gives you explicit, debuggable workflow graphs. DynamoDB gives you persistent state without managing servers. Serverless gives you scale without capacity planning.

The architecture in this article handles thousands of concurrent users, multiple specialized knowledge domains, asynchronous document processing, and real-time Q&A — all from a small, maintainable codebase.

The patterns are reusable. The lessons are hard-won. The blueprint is yours.

Build the thing. Ship the thing. Learn from the thing.

Key Takeaways

Use MCP as the interface between agents and retrieval services — not direct function calls
Separate knowledge domains into individual knowledge bases for precision and independence
LangGraph graphs give you explicit, debuggable agent workflows — use them over chains
DynamoDB checkpointing with TTLs is the correct pattern for Lambda-based conversation memory
Reset request-scoped globals at the start of every Lambda invocation — warm container state is a real bug class
Hybrid search (vector + keyword) outperforms single-mode retrieval on technical documentation
Multi-agent via SQS decouples real-time agents from async processing pipelines
Idempotent pipelines: resolve job state from S3, not from SQS event payloads
Log routing decisions — the most important diagnostic signal in a ReAct agent
Prompt guardrails + config filters + tool validation = defense in depth against hallucination

If this article helped you, consider following for more practical GenAI engineering content. Building something similar? Share it in the comments.

Cover Image Idea

A clean dark-background technical diagram showing a flow from a user icon → API Gateway → three branching Lambda icons (labeled "Chatbot", "Transform", "Validate") → an MCP protocol node → multiple colored cylinders representing knowledge bases. Blueprint-style. Color palette: deep navy, electric blue, white. Optional: a faint LangGraph state-transition graph overlaid in the background.

Author Bio

Suraj Khaitan is a Senior AI Architect and GenAI Engineer specializing in enterprise-scale AI systems, multi-agent orchestration, and cloud-native LLM deployments on AWS. He designs and ships production RAG pipelines, LangGraph-based agent frameworks, and MCP-connected knowledge systems for complex industrial and enterprise domains.

When he's not debugging warm Lambda containers at 2am, he writes about the engineering realities of AI systems that actually have to work in production.

Connect on LinkedIn | Follow for more engineering and architecture write-ups

Follow for more no-fluff GenAI architecture content.

🧠 I Tried 100 Claude Skills. These Are The Best.

Suraj Khaitan — Sun, 10 May 2026 09:30:55 +0000

From PDF wizards to Slack-GIF generators, I went deep into Anthropic’s new Agent Skills ecosystem — mostly inside Claude Code, where the action really is. Here are the Skills actually worth installing, the Claude Code workflows that have quietly reshaped my dev loop, and the patterns that separate a great Skill from a glorified prompt.

Why I Went Down This Rabbit Hole

When Anthropic dropped Agent Skills in October 2025, my first reaction was: another abstraction layer? My second reaction, after spending a weekend with them inside Claude Code, was: this is how agents actually become useful.

A Skill is deceptively simple — a folder with a SKILL.md file, optional scripts, and reference docs. But the magic is progressive disclosure: Claude only loads what it needs, when it needs it. That means you can hand an agent a 200-page playbook without burning a single token until the moment it’s relevant.

And Claude Code is the place where Skills feel most alive. In the last twelve months it’s gone from a terminal-only experiment to a multi-surface developer environment — terminal, IDE plugin, desktop app, web, iOS, and Slack — powered by Sonnet 4.6 and Opus 4.7, with adoption stories from Ramp, Intercom, Notion, Spotify, Shopify, Figma, Stubhub, and Asana. It’s arguably the fastest-growing AI dev tool on the market right now, and Skills are the layer that turns it from “impressive demo” into “this is how my team ships code.”

So I did the obvious thing. I installed, audited, and stress-tested 100 Skills — from the official anthropics/skills repo, partner Skills, the Agent Skills standard at agentskills.io, and a pile of community contributions on GitHub. Most of the testing happened inside Claude Code, with a few side trips through Claude.ai and the Agent SDK.

This is the shortlist that survived.

TL;DR

Skills ≠ prompts. They’re portable, composable, model-agnostic capability packs.
The best Skills do one thing exceptionally well, lean hard on deterministic code, and have razor-sharp description fields.
My top 10 below cover documents, design, dev workflows, testing, comms, and meta-skills that build other Skills.
Claude Code is the killer host. Plugins, marketplaces, parallel sessions, Routines, and tight Git/Slack integration make it the place Skills shine.
Watch out for trap Skills: bloated SKILL.md files, vague triggers, and Skills that smuggle in untrusted scripts.

A 30-Second Refresher: What Is a Skill?

A Skill is a directory:

my-skill/
├── SKILL.md          # YAML frontmatter + instructions (required)
├── reference.md      # Optional deep-dive context
└── scripts/
    └── do_thing.py   # Optional executable code

The SKILL.md frontmatter only needs two fields:

---
name: my-skill
description: What it does and exactly when Claude should use it
---

At startup, Claude pre-loads only the name and description of every installed Skill. When a task matches, it pulls in the body of SKILL.md. If the body references forms.md, Claude reads that only if needed. Code in the Skill can be executed directly — no token cost for the script body.

This three-tier disclosure (metadata → instructions → bundled assets) is why Skills scale where giant system prompts don’t.

Skills run today across Claude.ai, Claude Code, the Claude Agent SDK, and the Claude Developer Platform, and the format is now an open standard (agentskills.io) for cross-platform portability.

A Detour: Why Claude Code Is Eating the Dev Tool Market

You can’t talk seriously about Skills in 2026 without talking about Claude Code, because that’s where most of the interesting Skill work is happening. A few trends worth naming:

1. It stopped being “just a CLI.”

Claude Code now runs in your terminal, your IDE, a desktop app with parallel task management and visual diffs, the web, iOS, and Slack. The same agent, same context, same Skills — different surface depending on where you happen to be working.

2. Models got dramatically better at long-horizon coding.

Sonnet 4.6 is the everyday workhorse — fast, cheap enough to keep multiple instances running in parallel. Opus 4.7 is the heavy lifter for refactors, migrations, and multi-file architectural changes. The gap between “AI suggested a snippet” and “AI shipped a PR” has basically closed.

3. Plugins and marketplaces.

The /plugin system turned Claude Code into a real ecosystem. You add a marketplace (/plugin marketplace add anthropics/skills), browse, install, and your agent gets new capabilities instantly. This is how Skills are actually distributed at scale today.

4. Routines.

The newest big feature: configure a Claude Code routine once, then trigger it on a schedule, via API, or in response to an event. Nightly dependency upgrades, auto-triage of new GitHub issues, on-merge changelog generation — all become one-time setup. Skills + Routines is the combo I’m most bullish on for 2026.

5. Real customers, real numbers.

Notion’s co-founder Simon Last said it best: “A big part of my job now is to keep as many instances of Claude Code busy as possible.” Ramp reported saving 1–2 days per ML model on Metaflow conversions. Intercom, Spotify, Shopify, Figma, Stubhub, and Asana have all gone public with Claude Code adoption. This isn’t early-adopter buzz anymore — it’s mainstream developer tooling.

6. Pricing finally makes sense.

Claude Code is bundled into Pro ($17–$20/mo), Max 5x ($100/mo), and Max 20x ($200/mo) plans. For the first time, “have the AI keep three parallel branches alive while I review the fourth” is economically sane.

Put it together and you get the real punchline: Claude Code is becoming the operating environment for AI-assisted engineering, and Skills are the package format for that environment.

How I Evaluated 100 Skills

Each Skill got scored on five axes:

Trigger precision — Does Claude pick it up at the right moment, and ignore it otherwise?
Determinism — Does it offload work to code where it should, instead of asking the model to “be careful”?
Token economy — Lean SKILL.md, with detail pushed into bundled files.
Reusability — Useful across multiple workflows, not a one-shot trick.
Safety posture — No surprising network calls, no opaque binaries, dependencies audit cleanly.

Anything that scored under 3/5 on more than two axes got cut. That eliminated about 70% of what I tried.

The Best 10 Claude Skills (Ranked)

1. PDF — The skill that made me a believer

Form filling, field extraction, and reliable text/table parsing without hallucination. The Skill ships with a Python script that reads PDFs and returns structured field metadata, so Claude executes the parser instead of imagining the contents. The forms.md companion file only loads when you’re actually filling a form. This is the canonical example of progressive disclosure done right.

Use it when: Anything PDF — extraction, form filling, batch redaction.

2. DOCX / PPTX / XLSX — Office, finally automated properly

The Office trio is the secret behind Claude’s document-creation features. They generate genuine .docx, .pptx, and .xlsx files (not Markdown pretending to be Word), preserve styles, and handle templates cleanly.

Killer combo: Use xlsx + pptx together to turn a CSV into a board-ready deck in one prompt.

3. skill-creator — The meta-skill

A Skill that helps you write Skills. It enforces the frontmatter contract, suggests good description wording (the part most people get wrong), and scaffolds bundled files. If you’re going to install one Skill before any other, install this.

Pro tip: Pair it with mcp-builder and you’ve got a self-bootstrapping agent toolkit.

4. mcp-builder — Bridge to the wider tool ecosystem

Generates Model Context Protocol (MCP) servers from a description. Skills + MCP is the combo Anthropic has clearly been building toward: Skills teach the workflow, MCP exposes the external tools. This Skill makes that pairing trivial.

5. webapp-testing — Playwright, but Claude drives

Spins up Playwright sessions, navigates flows, captures screenshots, and reports failures with structured output. I replaced an entire smoke-test script with “use the webapp-testing Skill on staging” and it worked first try. Wire it into a Claude Code Routine and you have nightly UI smoke-tests with zero CI YAML.

Caveat: Sandbox the browser. Always.

6. frontend-design — Designs that don’t look AI-generated

Encodes spacing, typography, and layout principles instead of vibes. The Skill nudges Claude to use semantic tokens, consistent scales, and accessible color contrast. Pairs beautifully with…

7. brand-guidelines — Your style guide as a Skill

Drop in your color palette, logo rules, voice and tone, and approved typography. Every artifact Claude generates afterward — slides, docs, web pages — comes back on-brand. This is the Skill enterprises have been quietly desperate for.

8. theme-factory — Design systems on demand

Generates cohesive themes (light/dark, semantic tokens, component variants) you can drop into Tailwind, CSS variables, or design tools. The output is structured JSON, not “here’s a vibe” — meaning it composes with code generators downstream.

9. internal-comms — The Slack-message ghostwriter you didn’t know you needed

Templates for announcements, status updates, incident comms, and exec summaries. The Skill teaches Claude your org’s tone — concise, no jargon, link-heavy, whatever. Saves me ~30 minutes a day on Slack alone.

10. slack-gif-creator — The unserious pick that earned its spot

Generates short, on-message animated GIFs for Slack reactions. Yes, it’s silly. Yes, it has driven measurable team morale gains. The Skill demonstrates how narrow a great Skill can be and still earn its install.

Honorable Mentions

algorithmic-art — Generative SVG/Canvas art with parameterized seeds. Great demo of code-execution Skills.
canvas-design — HTML5 canvas compositions for marketing assets.
doc-coauthoring — Multi-pass editing with diff-style suggestions; pairs with docx.
claude-api — Up-to-date reference for the Claude API itself, including Managed Agents, multiagent, and webhooks. Underrated.
web-artifacts-builder — Builds self-contained HTML artifacts (mini-apps, dashboards). Perfect for demos.
Notion Skills for Claude (partner) — Best partner Skill I’ve tested. Treats Notion like a first-class workspace, not an API surface.

Patterns I Saw in Every Great Skill

After 100 of these, the great ones rhyme:

A description that reads like a router rule.

Bad: “Helps with documents.”

Good: “Use when the user asks to extract form fields, fill, redact, or parse tables from a PDF file.”
Code where code belongs.

The model doesn’t sort lists, parse PDFs, or compute hashes. It calls a script. Cheaper, deterministic, repeatable.
Lean SKILL.md, fat reference.md.

The core file should fit on a phone screen. Push edge cases into bundled files Claude will only open when needed.
One Skill, one job.

Skills that try to do five things trigger at the wrong time and confuse the agent. Split them.
Examples > rules.

A Skill with three concrete worked examples beats a Skill with twenty bullet-pointed rules every time.

Patterns I Saw in Every Bad Skill

The 4,000-token SKILL.md that loads on every adjacent task.
Vague triggers like “use this for productivity tasks.” Productivity is not a category.
Self-reported metadata — Skills that claim to do things their bundled code can’t actually do.
Untrusted network calls baked into scripts with no documentation.
No examples. If I can’t guess the use case from the README, neither can the model.

A Word on Security (Read This Part)

Skills are powerful precisely because they let an agent execute code and follow instructions you didn’t write. That’s also exactly why they can be dangerous.

Before installing any Skill from a less-trusted source:

Read SKILL.md end-to-end. Look for instructions to fetch URLs, exfiltrate files, or call eval-style patterns.
Audit every script. Pin dependencies. Diff updates.
Sandbox execution. Containers, restricted file system access, network egress rules.
Treat partner badges as marketing, not assurance. Verify yourself.

Anthropic’s own guidance is blunt: install only from trusted sources, and audit anything else. That’s the right posture.

How to Try These Yourself

In Claude Code (recommended):

Install Claude Code first — one-liner from the docs:

irm https://claude.ai/install.ps1 | iex   # Windows
# or: curl -fsSL https://claude.ai/install.sh | sh   # macOS/Linux

Then wire up the official Skills marketplace:

/plugin marketplace add anthropics/skills
/plugin install document-skills@anthropic-agent-skills
/plugin install example-skills@anthropic-agent-skills

Then just mention the Skill in a prompt:

“Use the PDF skill to extract the form fields from ./contracts/nda.pdf.”

From there you can drive Claude Code from the terminal, the IDE plugin, the desktop app (with parallel tasks and visual diffs), the web, iOS, or Slack — same Skills, same context.

As a Routine: Once a Skill-driven workflow proves itself, promote it to a Claude Code Routine so it runs on a schedule or in response to GitHub/webhook events. This is where Skills stop being a parlor trick and start replacing scripts.

In Claude.ai: The example Skills are available on paid plans — enable them in settings, then invoke by intent.

Via the Claude API: Upload custom Skills through the Skills API and reference them per request — ideal for embedding into your own product.

Author your own: Start from the template/ folder in anthropics/skills, run it through skill-creator, and iterate against real tasks. The fastest feedback loop is to author the Skill inside Claude Code itself — ask Claude to capture the steps it just took into a reusable SKILL.md.

Final Take: Skills + Claude Code Is the Combo to Watch

Tools (MCP) give agents capability. Skills give agents competence — the procedural knowledge to use those capabilities well, in your context, for your workflows. Claude Code gives both a home: a multi-surface, model-agnostic, plugin-enabled environment that’s gone from CLI curiosity to mainstream developer platform in under a year.

The Skills ecosystem is barely six months old and already feels like the format the industry has been quietly missing. Pair it with Claude Code’s Routines, parallel task management, and IDE/Slack/desktop reach, and you have something genuinely new: an agent that doesn’t just help you code, but learns the way your team works and quietly gets better at it every week.

If you’re building anything with Claude — or any agent that adopts the open standard — start with the ten Skills above, write your eleventh yourself, install Claude Code, and let your agent get genuinely good at the work you actually do.

About the Author

Suraj Khaitan — Gen AI Architect | Building scalable platforms and secure cloud-native systems

Connect on LinkedIn | Follow for more engineering and architecture write-ups

Which Claude Skill has changed your workflow the most? Drop your pick in the comments — I’m always hunting for the next great one.

🚀 I Passed the Claude Certified Architect – Foundations (CCA-F) Exam: My Journey, Lessons, and Study Tactics

Suraj Khaitan — Sun, 26 Apr 2026 06:30:03 +0000

How I navigated Anthropic’s scenario-based certification, what I learned about agentic AI architecture, and why structural thinking beats prompt engineering every time.

The Moment I Decided to Level Up

As someone building GenAI platforms, I’m always looking for ways to deepen my architectural skills—especially as agentic AI moves from buzzword to production reality. When Anthropic launched the Claude Certified Architect – Foundations (CCA-F) exam, I saw a chance to benchmark my knowledge against the best practices shaping the future of AI systems.

Spoiler: I passed! Here’s how I did it, what surprised me, and how you can prepare.

TL;DR (If You Only Read One Section)

Exam: Scenario-based, multiple-choice, 4 out of 6 real-world cases, 5 core domains.
What Matters: Structural, deterministic solutions (schemas, tool boundaries, agent orchestration)—not just clever prompts.
How I Prepped: Official study plan, open-source Q&A, hands-on with Claude Code and MCP, and lots of anti-pattern drills.
Result: Passed on my first attempt. The real win? A new mental model for designing robust, agentic AI systems.

Why the CCA-F Exam Is a Big Deal

The Claude Certified Architect – Foundations exam isn’t just another “AI basics” cert. It’s Anthropic’s first technical credential for solution architects building production apps with Claude. The focus: agentic architecture, tool design, context management, and prompt engineering for real-world reliability.

You get 4 scenario-based cases (from a pool of 6), each testing your ability to make architectural decisions—not just recall facts. The passing score is 720/1000, and the exam is free for Anthropic partners (for now).

My Study Workflow: What Actually Worked

1. Started with the Official Exam Guide

I read the Exam Guide end-to-end. The five domains are:

Agentic Architecture & Orchestration (25%)
Tool Design & MCP Integration (20%)
Claude Code Configuration & Workflows (20%)
Prompt Engineering & Structured Output (20%)
Context Management & Reliability (15%)

Each domain has its own deep-dive page and sample scenarios. I made flashcards for the key patterns and anti-patterns.

2. Followed the 12-Week Study Plan (Condensed)

I didn’t have 12 weeks, but the official study plan is gold. I focused on:

Week 1-2: Agentic loops, subagent orchestration, session management.
Week 3-4: Tool schemas, MCP integration, error handling.
Week 5-6: CLAUDE.md, plan mode, CI/CD integration.
Week 7-8: Prompt engineering, JSON schema, validation-retry loops.
Week 9-10: Context summarization, escalation, provenance.

3. Drilled Q&A from the Community Repo

The avidevelops/claude-architect-exam-prep repo is a treasure trove of scenario-style questions. I worked through every Q&A, focusing on why the right answer was correct (structural fix, not just prompt tweaks).

4. Hands-On with Claude Code and MCP

I set up Claude Code in a sandbox project, wrote custom tools, and experimented with agentic workflows. Practicing with CLAUDE.md, plan mode, and batch APIs made the exam scenarios feel much more concrete.

5. Memorized the Anti-Patterns

The anti-patterns cheatsheet is essential. Many wrong answers on the exam are classic anti-patterns: relying on prompt instructions for business rules, using ambiguous text fields instead of IDs, or trusting self-reported tool metadata.

What the Exam Actually Tests

Scenario 1: Designing a customer support agent with escalation logic (Agent SDK, hooks, error handling)
Scenario 2: Configuring Claude Code for a dev team (CLAUDE.md, plan mode, iterative refinement)
Scenario 3: Multi-agent research system (orchestration, context passing, error propagation)
Scenario 4: Developer productivity tools (tool selection, codebase exploration, MCP)
Scenario 5: Claude Code in CI/CD (batch API, structured output, session isolation)
Scenario 6: Structured data extraction (JSON schema, validation-retry, few-shot prompting)

You’ll get 4 of these, each with multiple-choice questions. The trick: several answers will seem plausible, but only one follows best practices.

My Top 7 Lessons Learned

Structural Fixes Beat Prompt Tweaks

The right answer is almost always a schema change, tool boundary, or deterministic enforcement—not “improve the prompt.”
Machine IDs > Ambiguous Text

Always design tools to use explicit IDs, not freeform strings.
Context Budgeting Is Real

Trim raw content and intermediate chains before passing to downstream agents. Avoid “lost in the middle” effects.
Anti-Patterns Are Exam Traps

If an answer relies on prompt-based enforcement, arbitrary iteration caps, or trusting self-reported metadata, it’s probably wrong.
Parallelize When Possible

For multi-agent tasks, emit parallel tool calls instead of sequential loops.
Enforce Business Rules in Code

Never trust the LLM to enforce critical thresholds—put it in the backend/tool logic.
Review the Key Concepts Cheat Sheet

The community Q&A and official anti-patterns are your best friends.

Gotchas (What Surprised Me)

The exam is tricky: Many MCQs have multiple “technically correct” answers, but only one is robust and production-grade.
You need real-world experience: The test rewards architectural thinking, not just memorization.
Time management: Some scenarios are dense—practice reading and analyzing quickly.

My Exam Day Experience

I registered via the Skilljar portal, got my access, and took the exam online. The interface is clean, and you can flag questions to review later.

I finished with 10 minutes to spare, double-checked my flagged questions, and submitted. A few minutes later, I got the “Congratulations, you passed!” email.

Who Should (and Shouldn’t) Take This Exam

Take it if:

You design or build agentic AI systems with Claude.
You want to prove your skills in production-grade AI architecture.
You enjoy scenario-based, real-world problem solving.

Maybe skip if:

You’re new to agentic AI or haven’t built with Claude/MCP.
You prefer rote memorization over architectural reasoning.

Resources That Helped Me Most

Final Take: It’s About Thinking Like an Architect

The CCA-F exam isn’t about trick questions or obscure trivia. It’s about whether you can design agentic AI systems that are robust, reliable, and production-ready. If you focus on structural solutions, understand the anti-patterns, and practice with real scenarios, you’ll be ready.

About the Author

Suraj Khaitan — Gen AI Architect | Building scalable platforms and secure cloud-native systems

Connect on LinkedIn | Follow for more engineering and architecture write-ups

What’s your biggest challenge with agentic AI architecture? Drop your thoughts below or connect with me for more tips!

🤖 We Gave an AI Agent Our Design System and Let It Build Our Frontend — Here's What Happened

Suraj Khaitan — Sat, 04 Apr 2026 14:41:02 +0000

How a custom GitHub Copilot agent with strict architectural guardrails turned feature delivery from days into hours on a multi-tenant enterprise platform

The Problem Nobody Talks About in Enterprise Frontend

Enterprise frontend development is slow. Not because developers can't write React components — they can — but because 90% of the work isn't writing code. It's alignment.

Which design tokens do I use? Where does this component go? How do I wire the API? What's the naming convention for hooks? Which state manager handles this? How do I handle dark mode? Did I forget the MSW handler for tests?

On our team building an enterprise multi-tenant GenAI platform — managing agents, tools, and knowledge bases across a large manufacturing conglomerate — the friction was even worse. We have:

A custom corporate design system with 360+ Tailwind tokens (no generic gray-500 allowed)
8 feature modules with strict feature-first architecture
OpenAPI codegen that generates TypeScript types from a FastAPI backend
MSW (Mock Service Worker) for development and testing
A 7-tier RBAC system with route-level access guards
Light/dark mode using class-based Tailwind (dark: variants on everything)
i18n for English and German

Every new component is a decision tree. Every junior developer ramp-up takes weeks. Every code review catches the same "you used bg-white instead of bg-background-base" mistake.

So we did something different: we encoded our entire frontend architecture into an AI agent and let it build features for us.

TL;DR (If You Skim, Skim This)

Problem: Enterprise frontend velocity bottlenecked by architectural complexity, design system compliance, and cross-cutting concerns (auth, theming, mocking, i18n).
Move: Built a custom VS Code agent (.github/agents/FrontendAgent.agent.md) that knows our design system, file structure, state management strategy, and API codegen pipeline.
Result: Feature scaffolding that used to take a day now takes minutes. The agent produces design-system-compliant, dark-mode-ready, MSW-wired, type-safe code on the first pass.
Tradeoff: You need to invest upfront in writing precise agent instructions. Vague prompts produce vague code — garbage in, garbage out.

Why Not Just Use Copilot Out of the Box?

We did. Here's what vanilla Copilot (without custom instructions) gave us:

// ❌ What generic Copilot produced
<div className="bg-white dark:bg-gray-900 p-4 rounded-lg shadow-md">
  <h1 className="text-gray-900 dark:text-white text-xl font-bold">
    Tenants
  </h1>
</div>

Every single token is wrong. bg-white should be bg-background-base. text-gray-900 should be text-text-normal. p-4 should be p-400. rounded-lg should be rounded-m. font-bold should be font-bold font-primary.

Multiply that across 18 shared components, 8 feature modules, and hundreds of sub-components, and you're spending more time fixing AI output than you saved generating it.

The realization: an AI assistant is only as good as its context. Generic Copilot doesn't know your design system. It doesn't know your file conventions. It doesn't know that you use TanStack Query with a 5-minute stale time and 2 retries, not SWR or Redux Toolkit Query.

So we gave it all of that context. Explicitly. In a single agent definition file.

The Architecture: A 200-Line Agent That Knows Everything

GitHub Copilot supports custom agents via markdown files in .github/agents/. Ours lives at:

.github/agents/FrontendAgent.agent.md

It's a single file that encodes every architectural decision our team has made. Think of it as a machine-readable engineering handbook — the same document that would take a new hire two weeks to internalize, distilled into structured instructions an AI can execute against.

Here's how we structured it:

1. Design System as Code (Not Suggestions)

We don't tell the agent "try to use our design tokens." We tell it these are the only tokens that exist:

DESIGN SYSTEM & THEMING (MANDATORY)
- Use corporate design tokens only (NO generic Tailwind colors like gray-500/blue-600).
- Always include dark mode variants (class-based: darkMode: 'class').
- Semantic tokens examples:
  - Colors: bg-background-base, bg-background-surface, text-text-normal,
            border-line-weak, bg-action, bg-status-error
  - Spacing: p-400 (16px), m-600 (24px), gap-300 (12px)
  - Typography: text-400, font-primary, font-secondary, font-bold
  - Borders: rounded-m, border-s
  - Transitions: duration-medium-1, ease-in-out
- Reference: src/frontend/THEME_GUIDE.md

The word "MANDATORY" isn't decoration. The agent treats sections labeled as mandatory as hard constraints, not preferences. When it generates a card component now:

// ✅ What the custom agent produces
<div className="bg-background-surface dark:bg-dark-background-surface
                p-400 rounded-m shadow-card
                border border-line-weak dark:border-dark-line-weak
                transition-all duration-medium-1 ease-in-out">
  <h1 className="text-text-normal dark:text-dark-text-normal
                 text-400 font-primary font-bold">
    Tenants
  </h1>
</div>

Every token is from our design system. Dark mode is included. Transitions use our timing tokens. No manual corrections needed.

2. Feature-First File Structure (Encoded, Not Implied)

We explicitly map the file tree so the agent places files correctly:

FRONTEND ARCHITECTURE & CONVENTIONS
- Feature-first organization:
  src/frontend/src/
    features/{feature}/
      api/          // Axios client functions
      components/   // UI components
      hooks/        // Feature hooks
      pages/        // Route-level pages
    components/     // Shared components
    contexts/       // Auth, Theme, Tenant contexts
    lib/            // Utilities
- Import alias: @/ → src/
- Naming: Components = PascalCase, Hooks = camelCase with 'use',
          API files = {feature}Api.ts, Contexts = {Name}Context.tsx

When we ask the agent to build a "knowledge base management feature," it doesn't create a flat KnowledgeBase.tsx in the root. It scaffolds:

src/features/knowledgebase/
├── api/
│   └── knowledgebaseApi.ts
├── components/
│   ├── KnowledgeBaseList.tsx
│   └── CreateKnowledgeBaseDialog.tsx
├── hooks/
│   └── useKnowledgeBases.ts
├── pages/
│   └── KnowledgeBasePage.tsx
└── types/
    └── index.ts

Correct directory. Correct naming. Correct separation of concerns. Every time.

3. State Management: Pick the Right Tool Automatically

We encode our state management decision tree:

STATE & DATA
- Server state: TanStack Query (staleTime 5 min, retries: 2)
- Global auth: UserInfoProvider (contexts/AuthContext.tsx)
- Theme: ThemeProvider
- Local state: useState/useReducer (NO Redux/Zustand)
- Error handling:
  - Wrap TanStack Query errors with Sonner toasts
  - ErrorBoundary component with design tokens
  - "Access Lost" interceptor: clear tenant, redirect, show toast

Now when the agent generates a data-fetching hook, it doesn't reach for useEffect + fetch or SWR. It produces exactly what our codebase expects:

export const useKnowledgeBases = () => {
  const { sessionId } = useAuth();

  return useQuery<KnowledgeBase[], Error>({
    queryKey: ['knowledgebases'],
    queryFn: () => knowledgebaseApi.getKnowledgeBases(sessionId as string),
    enabled: !!sessionId,
    retry: 2,
  });
};

Session-aware. Query-key namespaced. Auth-gated with enabled. Retry count matching our standard. This is exactly what our human-written hooks look like — because the agent learned from the same conventions.

The Secret Weapon: MSW-First Development

Here's where it gets interesting. Our agent doesn't just generate UI components — it generates the entire mock layer alongside them.

MSW-FIRST DEVELOPMENT
- Use MSW (Mock Service Worker) during UI work—dev server and tests.
- Location: src/frontend/src/mocks/
- Handlers:
  - Realistic delays: 300–800ms
  - Simulate ~5% errors
  - Validate required fields and return error shapes consistent with backend

When we ask the agent to build a new feature, the output includes MSW handlers with realistic data:

// Generated MSW handler for knowledge bases
http.get('/api/knowledgebases', async () => {
  // Simulate realistic network delay
  await delay(Math.random() * 500 + 300);

  // 5% error rate simulation
  if (Math.random() < 0.05) {
    return HttpResponse.json(
      { detail: 'Internal server error' },
      { status: 500 }
    );
  }

  return HttpResponse.json({
    items: [
      {
        id: 'kb-001',
        name: 'Production Manual - North Plant',
        type: 'S3',
        status: 'ACTIVE',
        documentCount: 1247,
        lastSynced: '2026-04-03T14:30:00Z',
      },
      // ... more realistic domain-contextualized data
    ],
  });
}),

This means the agent produces runnable features from the first prompt. No waiting for the backend team. No dummy setTimeout hacks. The UI renders with realistic data, realistic latency, and realistic error states immediately.

Backend as Source of Truth: The Codegen Bridge

One of our strongest architectural decisions was making the agent aware of our OpenAPI codegen pipeline:

BACKEND AS SOURCE OF TRUTH (SPEC SYNC)
- Backend is authoritative. FastAPI + Pydantic (code-first).
- Frontend must use generated TypeScript types and API client only.
- Codegen: pnpm api:codegen
- After codegen, run git diff:
  - If there is a diff, surface: "Frontend types are stale relative
    to backend OpenAPI" and include diff summary.

Our codegen setup (openapi-ts.config.ts) generates types, SDK methods, and even TanStack Query hooks directly from the backend's OpenAPI spec:

// openapi-ts.config.ts
import { defineConfig } from '@hey-api/openapi-ts';

export default defineConfig({
  client: '@hey-api/client-fetch',
  input: 'http://localhost:8000/openapi.json',
  output: { path: 'src/client', format: 'prettier' },
  plugins: [
    {
      name: '@tanstack/react-query',
      queryOptions: true,
      mutationOptions: true,
    },
    {
      name: '@hey-api/typescript',
      enums: 'javascript',
    },
  ],
});

When the agent starts a task, it checks whether the generated types are current. If they've drifted, it flags it:

⚠️ SPEC MISMATCH: Frontend types are stale.
  - Missing field: `retryCount` on PromotionEvent
  - New enum value: `ROLLED_BACK` in PromotionStatus
  Running `pnpm api:codegen` to sync...

This prevents the classic "the UI expects a field the API doesn't send" bug that usually surfaces at 11 PM on a Friday.

Autonomy Levels: Controlling the Blast Radius

We don't always want the agent to write production code. Sometimes we want a plan. Sometimes a scaffold. Sometimes the full implementation.

So we built three autonomy levels:

AUTONOMY LEVELS (Default = Level 2)
- Level 1: Plan Only → Step-by-step plan, file paths, component
            signatures. No code changes.
- Level 2: Plan + Scaffold → Create files, stubs, routing/context
            wiring, MSW handlers. Minimal UI with tokens; TODO comments.
- Level 3: Full Implementation → Complete feature including styling,
            tests, mocks, docs, and ready-to-run commands.

Level 1 is for architecture discussions. "How would you build a promotion approval workflow?" The agent produces a plan, lists affected files, and maps component relationships — without touching a single file.

Level 2 (the default) is our workhorse. The agent creates the file structure, wires routes and contexts, sets up MSW handlers, and builds minimal UI with correct tokens. Developers fill in the business logic.

Level 3 is for well-defined features with clear specs. The agent produces everything: components, hooks, API functions, MSW handlers, unit tests, and even the pnpm commands to verify the output.

The Agent Lifecycle: Not Just "Generate Code"

What separates this from a glorified code generator is the end-to-end lifecycle:

END-TO-END AGENT LIFECYCLE
Phase A — Plan
- Outline goals, dependencies, spec sync (codegen), and scope.
- Note any backend spec gaps (SPEC MISMATCH section).

Phase B — Implement
- Apply scaffolding/implementation per autonomy level.
- Add MSW handlers and tests.

Phase C — Validate
- Run typecheck, build, tests; verify codegen freshness.

Phase D — Deliver
- Provide diffs, test plan, run commands, and follow-up concerns.

The agent doesn't just output code and walk away. It:

Plans — analyzing the request against the existing codebase
Syncs — running codegen to ensure types are fresh
Implements — generating code compliant with every convention
Validates — running pnpm frontend:quality (typecheck + lint + format)
Delivers — providing exact commands to test its output

That validation step is key. If the agent generates code with a type error, it catches it in the same session and fixes it. The developer receives working code, not a first draft.

Real Output: What It Looks Like in Practice

Here's a real interaction. We asked the agent:

"Build a deployment management page for the tenant feature. It should show a table of deployments with status badges, and a dialog to trigger new deployments."

The agent produced:

8 files created:

src/features/tenants/pages/DeploymentsPage.tsx
src/features/tenants/components/DeploymentTable.tsx
src/features/tenants/components/DeployAgentDialog.tsx
src/features/tenants/hooks/useDeployments.ts
src/features/tenants/types/deployment.ts
src/mocks/handlers/deployments.ts
src/features/tenants/components/__tests__/DeploymentTable.test.tsx

Every file followed conventions:

Design tokens, not raw Tailwind
Dark mode variants on every element
TanStack Query with proper query keys
MSW handlers with realistic delays and 5% error simulation
Radix Dialog for the deployment trigger
Sonner toasts for success/error feedback
Route guard with RequireDeveloperAccess

Zero manual corrections to the design system usage. One adjustment to a business logic edge case (handling a deployment state we hadn't documented). Total time from prompt to PR-ready code: ~20 minutes including review. Previous estimate for the same feature: 1–2 days.

The Pitfalls (A.K.A. What Bit Us So It Doesn't Bite You)

1. Vague Instructions = Vague Code

Our first agent definition was 40 lines. It produced code that was "close but not quite." The spacing tokens were right but the color tokens were generic. The file structure was feature-first but the naming was inconsistent.

Fix: We expanded to 200+ lines with explicit examples, explicit anti-patterns ("NO generic Tailwind"), and references to real files in the repo. The more specific your instructions, the more accurate the output.

2. The Agent Doesn't Know What Changed Yesterday

If you add a new design token or change a convention and don't update the agent file, it'll use the old pattern. The agent definition is a living document — it needs to be maintained alongside the codebase.

Fix: We added agent definition updates to our PR checklist. Changed a convention? Update FrontendAgent.agent.md in the same PR.

3. MSW Handlers Can Drift from Reality

The agent generates mock handlers based on its understanding of the API. But if the real API has quirks (pagination cursors, non-standard error shapes, optional fields), the mocks might not match.

Fix: We added the SPEC MISMATCH protocol. The agent explicitly flags when it's making assumptions about the API, so developers know which mocks need validation against the real backend.

4. Over-Reliance Kills Understanding

The fastest way to create a team that doesn't understand its own codebase is to let the agent write everything without review. We use the agent as a force multiplier, not a replacement.

Fix: We default to Level 2 (scaffold), not Level 3 (full implementation). Developers fill in remaining business logic, which ensures they understand the code they're shipping.

5. Token Stuffing — There's a Context Window Limit

Our agent instructions are 200+ lines, the theme guide is another 300+, and the copilot instructions are 150+. Some LLMs struggle with this much context.

Fix: We keep the agent file focused on rules and patterns, not exhaustive token lists. The agent references THEME_GUIDE.md for the full token catalogue rather than embedding it inline.

The Numbers

Before the custom agent:

Feature scaffolding: 4–8 hours (file creation, routing, context wiring, mock setup)
Design system violations per PR: 3–5 (wrong tokens, missing dark mode)
Time to first rendered component: 2–4 hours (waiting for mock data setup)
New developer ramp-up: 2–3 weeks to internalize conventions

After the custom agent:

Feature scaffolding: 15–30 minutes
Design system violations per PR: 0–1
Time to first rendered component: Under 10 minutes (MSW handlers generated alongside UI)
New developer ramp-up: Days — they read the agent file and see the patterns

The scaffolding speedup alone is 10–15x. But the real win is consistency. Every feature looks like every other feature. Every hook follows the same pattern. Every mock handler has the same structure. The codebase feels like it was written by one very disciplined developer, not a rotating team of six.

When You Should Not Use This Pattern

Greenfield prototypes — if you're still deciding on conventions, you don't have enough patterns to encode. The agent amplifies consistency; it can't create it from nothing.
Small teams with one frontend developer — if one person owns the entire frontend, the conventions live in their head. The agent adds overhead without proportional benefit.
Frequently changing architecture — if you're rewriting your state management strategy every sprint, the agent definition will always be stale. Stabilize first, then encode.

A Practical Implementation Checklist

If you want to build your own frontend agent:

[ ] Document your design system in a machine-readable format (we use a Tailwind config + theme guide)
[ ] Map your file structure explicitly — feature directories, naming conventions, import aliases
[ ] Encode your state management rules — which tool for which type of state, and why
[ ] Define your API integration pattern — codegen pipeline, client library, error handling
[ ] Include anti-patterns — what NOT to do is as important as what to do
[ ] Add autonomy levels — give developers control over how much the agent does
[ ] Wire in validation — the agent should run your lint/typecheck/build as part of its output
[ ] Reference, don't embed — point to config files rather than duplicating 360 lines of tokens
[ ] Add a lifecycle — plan, implement, validate, deliver — not just "generate code"
[ ] Maintain it like code — update the agent file in the same PR as convention changes
[ ] Start with scaffold mode — let developers fill in business logic to maintain understanding
[ ] Include MSW patterns — mock-first development is essential for frontend agent velocity

The Deeper Insight: Agents Are Architecture Documentation That Executes

The most unexpected benefit wasn't speed. It was documentation.

Our FrontendAgent.agent.md file is the most accurate, most up-to-date description of our frontend architecture. Not because we wrote documentation — we hate writing documentation — but because if the agent file is wrong, the generated code is wrong, and someone fixes the agent file.

It's documentation with a built-in feedback loop. When the agent produces a component with the wrong token, the developer who catches it updates the agent instructions. The next generation is correct. Over time, the agent file converges on a precise description of how the codebase actually works.

Compare that to a Confluence page that was last updated eight months ago.

What's Next: The Agent Becomes the PR Reviewer

We're exploring using the same agent instructions as a code review agent. If the agent knows every convention, it should be able to flag violations in PRs automatically:

"This component uses bg-gray-100 — should be bg-background-surface"
"This hook is in src/components/ — should be in src/features/tenants/hooks/"
"Missing dark mode variant on text-text-normal"
"MSW handler missing for new /api/promotions/:id/approve endpoint"

Same knowledge, different mode. Build in one direction, verify in the other.

Closing: The Best Frontend Engineer on Your Team Doesn't Sleep

An AI agent with the right instructions isn't a replacement for your frontend team. It's the most consistent member of your frontend team. It never forgets a dark mode variant. It never uses the wrong spacing token. It never puts a hook in the wrong directory.

But it also doesn't make product decisions. It doesn't architect from scratch. It doesn't push back on a bad spec.

The sweet spot is composing human judgment with machine consistency. You decide what to build. The agent scaffolds how — following every convention, every token, every pattern your team has established.

And when it's 4 PM on a Friday and the PM says "we need one more feature page before the demo," you can spin up a complete, design-system-compliant, dark-mode-ready, MSW-wired, type-safe scaffold in 15 minutes instead of 4 hours.

That's not magic. That's architecture, encoded.

How are you using AI agents in your frontend workflow? Are you encoding project-specific knowledge, or using generic assistants? I'd love to hear what patterns are working for teams at scale — drop a comment.

Resources

GitHub Copilot: Custom Instructions — how to add project-specific context
MSW: Mock Service Worker — API mocking for browser and Node.js
Hey API: OpenAPI TypeScript Codegen — generate types and clients from OpenAPI specs
TanStack Query: React Query — server state management
Tailwind CSS: Design Tokens — custom theme configuration
Radix UI: Headless Primitives — accessible UI components without default styles

About the Author

Suraj Khaitan — Gen AI Architect | Building scalable platforms and AI-augmented engineering workflows

Connect on LinkedIn | Follow for more engineering and architecture write-ups

🚀 I Mass Terminated My Copilot Plans. Here's Why Claude Code Won.

Suraj Khaitan — Sat, 14 Mar 2026 10:32:45 +0000

How an agentic AI in the terminal replaced my IDE plugins, scaffold scripts, and half my Stack Overflow tabs—without ever opening a GUI

The Moment I Realized My Coding Workflow Was a Lie

Every developer eventually hits the same wall:

"I have 4 AI extensions, 12 keyboard shortcuts, and I'm still copy-pasting code between a chatbot and my editor."

Tab-complete autocomplete? Great for variable names. IDE chat panels? Nice for explaining regex. But the moment you need an AI to actually understand your codebase, edit 14 files, run your tests, and fix its own mistakes—the shiny plugins fall apart.

Then I tried something that felt reckless:

I gave an AI full access to my terminal.

Specifically: Claude Code—Anthropic's agentic coding tool that lives in your CLI, reads your repo, writes real code, and executes commands.

I haven't looked back.

TL;DR (If You Only Read One Section)

Problem: AI coding assistants that autocomplete lines can't architect solutions. Chat-based tools require endless copy-paste.
Move: Claude Code operates as an agentic AI inside your terminal—it reads, writes, runs, and iterates autonomously.
Result: Multi-file refactors in minutes. Bug fixes with zero context switching. Git workflows handled conversationally.
Tradeoff: You're trusting an agent with shell access. Guardrails and review discipline matter more than ever.

Why Claude Code Is Trending Right Now

Scroll through any dev community in 2025–2026, and you'll see the same frustration:

"Copilot autocomplete is nice but it doesn't think."
"ChatGPT is smart but it doesn't know my codebase."
"I spend more time prompt-engineering than actual engineering."

Claude Code hits different because it collapses the gap between knowing and doing. It doesn't suggest code in a sidebar—it implements changes directly in your repo, runs your test suite, reads the errors, and fixes them. In a loop. Without you alt-tabbing once.

The industry term is agentic coding. And it's not a buzzword anymore—it's a workflow.

What Even Is Claude Code?

Claude Code is a command-line tool from Anthropic. You install it, point it at a project, and talk to it like a senior developer sitting next to you.

# Install it
npm install -g @anthropic-ai/claude-code

# Start it in your project
cd my-project
claude

That's it. No VS Code extension to configure. No API keys to paste into settings.json. No "select model" dropdown with 47 options.

You get a REPL-like interface where you type natural language, and Claude:

Reads your files and project structure
Plans the changes needed
Writes the code across multiple files
Runs commands (tests, builds, linters)
Iterates if something breaks

It's like pair programming—except your pair never gets tired, never forgets the module structure, and never says "let me think about that" for 45 minutes.

Real Workflows That Made Me a Believer

1) The "Refactor 30 Files" Moment

I needed to migrate an API layer from Axios to a custom fetch wrapper. With traditional AI tools, that's:

Explain the pattern in a chat
Copy the suggestion
Paste it into File 1
Realize it doesn't match my error handling
Re-explain
Repeat 29 more times

With Claude Code:

> Refactor all API calls in src/features/ from axios to use the 
  fetchWrapper in src/lib/api.ts. Preserve error handling patterns. 
  Run the type checker after.

It read every file, understood the existing patterns, made the changes, ran tsc, found 3 type errors, and fixed them. Total time: 4 minutes.

2) The "Debug This Flaky Test" Nightmare

A test was passing locally and failing in CI. The usual investigation: environment differences, timing issues, mock state leaking.

> The test in src/features/agents/__tests__/AgentList.test.tsx is 
  failing in CI with "Unable to find role='button'". It passes locally. 
  Investigate and fix.

Claude Code read the test, read the component, identified a race condition with an async render, added the correct waitFor wrapper, and ran the test suite to confirm. Done in 90 seconds.

3) The "Write the Whole Feature" Sprint

> Create a new feature module for "cost-management" under src/features/. 
  Follow the same pattern as the agents feature: api layer, components, 
  hooks, and route registration. Include a dashboard page with a summary 
  card grid and a data table.

It scaffolded 8 files, wired up the route, created TanStack Query hooks, and built components using our existing design tokens—because it read our codebase first. Not a template. Not a snippet. Actual contextual code.

The Architecture: Why "Terminal-Native" Is the Unlock

Most AI coding tools follow this pattern:

IDE Plugin → Language Server → AI API → Suggestion → Developer copies it

Claude Code follows this one:

Developer → Claude Code (terminal) → reads repo → plans → writes files → runs commands → verifies → done

The key difference: the feedback loop is closed. Claude doesn't suggest and hope. It acts, observes the result, and iterates.

This is the difference between:

A GPS that shows you the route (traditional AI)
A self-driving car that takes you there (agentic AI)

Why the Terminal?

The terminal is the most powerful interface a developer has. It's where you:

Run builds and tests
Manage git
Execute scripts
Install dependencies
Deploy

By living in the terminal, Claude Code has access to the same tools you do. It doesn't need a special plugin API or language server protocol. It just… uses your tools.

The Permission Model: Trust, but Verify

Here's the part that makes security-conscious engineers twitch: this thing can run commands.

Claude Code handles this with a tiered permission system:

Action	Permission
Read files	✅ Automatic
Write/edit files	⚠️ Asks permission (configurable)
Run terminal commands	⚠️ Asks permission (configurable)
Run "safe" commands (ls, cat, grep)	✅ Automatic
Run destructive commands	🛑 Always asks

You can configure it to auto-approve certain patterns:

# Allow all file writes in src/
# Allow test runs without asking
# Always ask before git push

The mental model: it's a junior developer with terminal access. You wouldn't let them git push --force without review, but you'd let them run npm test freely.

Claude Code vs. The Field: An Honest Comparison

Capability	GitHub Copilot	ChatGPT/GPT-4	Cursor	Claude Code
Line-level autocomplete	✅ Excellent	❌ N/A	✅ Good	❌ Not its thing
Multi-file edits	❌ Limited	❌ Copy-paste	✅ Good	✅ Excellent
Codebase awareness	⚠️ Current file	❌ None	✅ Good	✅ Excellent
Runs commands	❌ No	❌ No	⚠️ Limited	✅ Full terminal
Self-corrects errors	❌ No	❌ No	⚠️ Sometimes	✅ Yes (loop)
Works without IDE	❌ No	✅ Yes (browser)	❌ No	✅ Yes (terminal)
Agentic workflow	❌ No	❌ No	⚠️ Emerging	✅ Core design

The nuance: Claude Code isn't trying to replace your autocomplete. It's a different tool for a different job. Use Copilot for line-level flow. Use Claude Code when you need an agent that does work.

The Workflow That Actually Works

After months of daily use, here's my optimized flow:

Morning: Strategic Work with Claude Code

> Review the open PR #142. Summarize the changes and flag 
  any potential issues with our auth middleware.

> Implement the API integration for the new knowledge-base 
  management feature. Follow existing patterns in src/features/agents/.

Afternoon: Tactical Fixes

> Fix all TypeScript errors in src/features/tools/. 
  Run the type checker and show me the results.

> Update the unit tests for UseCaseApi to cover the new 
  delete endpoint. Run them and make sure they pass.

End of Day: Cleanup

> Review all changes I've made today. Create a commit with 
  a conventional commit message.

The shift: I went from writing code to directing code. My job became architecture, review, and decision-making. The implementation became a conversation.

Gotchas (The Part Everyone Discovers at 2 AM)

1) It's Confident, Not Always Correct

Claude Code will make changes with conviction. Sometimes those changes are subtly wrong. Always review diffs before committing. Trust the agent, but verify the output.

2) Context Window Limits Are Real

On massive monorepos, Claude Code can't hold your entire codebase in memory. Mitigations:

Use a CLAUDE.md file to give it project context and conventions
Point it at specific directories rather than the whole repo
Break large tasks into focused steps

3) It Can Get Into Loops

Occasionally, it'll try to fix an error, introduce a new one, fix that, introduce another. When you see this:

Stop it
Give it clearer constraints
Break the task down

4) Cost Awareness

Claude Code uses API credits. Complex multi-file refactors with test loops can add up. Monitor your usage, especially in the "let it run" agentic mode.

The CLAUDE.md File: Your Project's AI Constitution

The secret weapon most people miss: create a CLAUDE.md at your project root.

# CLAUDE.md

## Project Overview
This is a React + FastAPI monorepo for an internal platform.

## Conventions
- Use design system tokens, never raw Tailwind colors
- Follow feature-based file organization under src/features/
- Use TanStack Query for server state
- All API calls go through src/lib/api.ts

## Commands
- `pnpm frontend:dev` - Start frontend
- `pnpm frontend:quality` - Type check + lint
- `pytest` - Run backend tests

## Don'ts
- Never modify shared components without discussing
- Don't install new dependencies without justification
- Don't push directly to main

This file acts as persistent memory. Every time Claude Code starts, it reads this file and follows the rules. It's like onboarding documentation—but for your AI pair programmer.

Who Should (and Shouldn't) Use Claude Code

Use it if:

You work on codebases with 10+ files that need coordinated changes
You're tired of copy-pasting between AI chats and your editor
You want to automate repetitive refactors, test writing, or migrations
You're comfortable reviewing diffs and understanding the code an AI writes

Skip it if:

You mainly need line-level autocomplete (use Copilot)
You're learning to code and need to understand every line you write
Your org prohibits AI tools from accessing source code
You prefer GUI-first workflows and rarely use the terminal

The Bigger Picture: We're Entering the "Agent" Era of Dev Tools

Claude Code isn't an anomaly. It's the leading edge of a shift:

Era 1 — Stack Overflow & Docs (search for answers)

Era 2 — AI Chat (ask for answers)

Era 3 — AI Autocomplete (get suggestions inline)

Era 4 — Agentic AI (delegate tasks to an autonomous agent) ← We are here

The developers who thrive in Era 4 won't be the fastest typists. They'll be the best directors—people who can decompose problems, set constraints, review output, and guide an agent toward the right solution.

The skill isn't "can you write a React component?" anymore.

It's "can you describe what the component should do, review what the agent built, and course-correct in real time?"

Final Take: It's Not About Replacing Developers

Every AI tool gets the same question: "Will this replace me?"

No. But it will replace the version of you that spends 60% of the day on mechanical implementation.

Claude Code doesn't have taste. It doesn't know your users. It can't decide whether a feature should exist. It can't navigate a product meeting, push back on a bad spec, or mentor a junior developer.

But it can turn your architectural decisions into working code faster than any tool I've used. And that's not a threat—it's a superpower.

What's your biggest frustration with current AI coding tools? Is it context awareness, copy-paste fatigue, or something else? Drop your take below.

Resources

About the Author

Suraj Khaitan — Gen AI Architect | Building scalable platforms and secure cloud-native systems

Connect on LinkedIn | Follow for more engineering and architecture write-ups

🚀 Stop Calling STS on Every Request: Redis Caching Patterns That Cut Login Latency by 10x

Suraj Khaitan — Sat, 28 Feb 2026 07:39:04 +0000

How caching sessions and temporary AWS credentials in Redis turned our auth layer from a bottleneck into a near-zero-cost lookup

The Moment We Realized Our Auth Was a DDoS on Ourselves

Every authenticated request in our multi-tenant platform did the same dance:

Validate the user's session
Check their role mappings (tenant, use case, environment)
Call AWS STS to assume the right IAM role
Return temporary credentials so downstream services could talk to S3, DynamoDB, Bedrock, etc.

Steps 1–3 hit the network. Every. Single. Time.

At modest traffic, it was fine. At scale, we were essentially DDoS-ing our own identity layer—STS throttling kicked in, latency spiked, and users saw login spinners that never stopped spinning.

The fix wasn't a new auth framework. It was Redis.

TL;DR (If You Skim, Skim This)

Problem: Per-request STS calls + stateless session validation = slow logins + rate limiting at scale.
Move: Cache session data and STS credentials in Redis with structured keys and smart TTLs.
Result: Sub-millisecond session lookups, ~90% fewer STS API calls, and a warm credential cache that makes subsequent requests feel instant.
Tradeoff: You need a cache invalidation strategy and must handle Redis failures gracefully.

Why This Pattern Is Having a Moment

Three trends are colliding right now:

Multi-tenant platforms are everywhere. Each tenant has its own IAM boundary, its own roles, its own credential scope. That's a lot of AssumeRole calls.
STS has hard rate limits. AWS throttles AssumeRole at ~500 requests/second per account. Hit that in production and you'll learn the meaning of AccessDenied the hard way.
Users expect instant auth. Nobody waits 2 seconds for a login to "warm up." If the first click feels slow, trust evaporates.

Redis sits at the intersection of all three: it's fast enough to feel like memory, persistent enough to survive pod restarts (in clustered mode), and simple enough that the caching logic doesn't become its own microservice.

The Architecture: Two Caches, One Redis

We use Redis for two distinct but related caching concerns:

1. Session Cache (Identity Layer)

When a user logs in (via OIDC), we create a platform session in Redis:

session_data = {
    "userId": "jane.doe@example.com",
    "roles": [
        {
            "TenantId": "acme-corp",
            "UseCaseId": "doc-search",
            "Environment": "prod",
            "RoleName": "USE_CASE_DEVELOPER",
        },
        {
            "TenantId": "acme-corp",
            "UseCaseId": "chatbot",
            "Environment": "dev",
            "RoleName": "USE_CASE_OWNER",
        },
    ],
    "highest_role": "USE_CASE_OWNER",
    "platform_roles": ["USE_CASE_OWNER", "USE_CASE_DEVELOPER"],
    "sts": {},  # STS credentials are added lazily
}

Key format: session:<uuid>
TTL: 1 hour (configurable via env)

This replaces the classic "hit the database on every request" pattern. Once stored, every downstream service validates auth by reading from Redis—not by calling the IdP or querying a user table.

2. STS Credential Cache (AWS Access Layer)

When a user accesses a specific tenant/use-case/environment, we call sts:AssumeRole to get short-lived credentials. These get cached inside the session object:

session_data["sts"]["acme-corp|doc-search|prod|USE_CASE_DEVELOPER"] = {
    "AccessKeyId": "ASIA...",
    "SecretAccessKey": "wJal...",
    "SessionToken": "FwoG...",
    "Expiration": "2026-02-28T19:00:00+00:00",
}

Key format (composite): TenantId|UseCaseId|Environment|RoleName
TTL: Derived from credential expiry minus a 5-minute safety buffer

This means the second time a user touches the same tenant/environment, we skip STS entirely.

The Code: Session Storage

Here's the core of how we store a session after successful OIDC login:

import json
import redis
from redis.connection import ConnectionPool

DEFAULT_TTL_SECONDS = 3600  # 1 hour

# Singleton connection pool — one per process
_connection_pool: ConnectionPool | None = None


def get_redis_pool() -> ConnectionPool:
    global _connection_pool
    if _connection_pool is None:
        _connection_pool = ConnectionPool(
            host=os.environ.get("REDIS_HOST", "localhost"),
            port=int(os.environ.get("REDIS_PORT", "6379")),
            db=0,
            max_connections=50,
            decode_responses=True,
            socket_keepalive=True,
            socket_connect_timeout=5,
            retry_on_timeout=True,
        )
    return _connection_pool


def get_redis_client() -> redis.Redis:
    return redis.Redis(connection_pool=get_redis_pool())


def store_session(
    session_id: str,
    user_id: str,
    roles: list[dict],
    highest_role: str | None = None,
    platform_roles: list[str] | None = None,
    ttl_seconds: int = DEFAULT_TTL_SECONDS,
) -> bool:
    try:
        client = get_redis_client()
        session_data = {
            "userId": user_id,
            "roles": roles,
            "sts": {},
            "highest_role": highest_role,
            "platform_roles": platform_roles or [],
        }
        client.setex(
            f"session:{session_id}",
            ttl_seconds,
            json.dumps(session_data),
        )
        return True
    except redis.RedisError:
        return False

Why setex instead of set + expire? Atomicity. If the process crashes between set and expire, you get a session that never dies. setex is a single atomic operation.

The Code: STS Credential Caching

The real performance win is here—caching the output of sts:AssumeRole:

import boto3
from datetime import datetime

sts_client = boto3.client("sts")
EXPIRATION_BUFFER_SEC = 300  # 5 minutes


def get_sts_credentials(
    session_id: str,
    platform_role: str,
    user_email: str,
    tenant_id: str,
    use_case_id: str,
    environment: str,
    force_refresh: bool = False,
) -> dict:
    # Step 1: Check the cache first
    if not force_refresh:
        cached = get_credentials_from_session(
            session_id, tenant_id, use_case_id,
            environment, platform_role,
        )
        if cached and is_credential_valid(cached):
            return cached  # 🎯 Cache hit — skip STS entirely

    # Step 2: Cache miss — call STS
    role_arn = resolve_role_arn(platform_role)
    resp = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"{tenant_id}-{use_case_id}-{environment}"[:64],
        DurationSeconds=3600,
    )

    creds = resp["Credentials"]
    credential_data = {
        "AccessKeyId": creds["AccessKeyId"],
        "SecretAccessKey": creds["SecretAccessKey"],
        "SessionToken": creds["SessionToken"],
        "Expiration": creds["Expiration"].isoformat(),
    }

    # Step 3: Cache with smart TTL (expire before AWS does)
    expiration = datetime.fromisoformat(credential_data["Expiration"])
    ttl = int(
        (expiration - datetime.now(expiration.tzinfo)).total_seconds()
    ) - EXPIRATION_BUFFER_SEC

    if ttl > 0:
        store_credentials_in_session(
            session_id, tenant_id, use_case_id,
            environment, platform_role, credential_data, ttl,
        )

    return credential_data

The EXPIRATION_BUFFER_SEC = 300 is critical. STS credentials expire at a hard boundary. If you serve a credential that's 10 seconds from death, the downstream AWS call will fail with a confusing ExpiredTokenException. The 5-minute buffer ensures we always refresh before the cliff.

Credential Validity Check

A clean helper that prevents serving stale credentials:

def is_credential_valid(credentials: dict) -> bool:
    expiration_str = credentials.get("Expiration")
    if not expiration_str:
        return False

    expiration = datetime.fromisoformat(
        expiration_str.replace("Z", "+00:00")
    )
    now = datetime.now(expiration.tzinfo)

    buffer_seconds = 300
    return (expiration - now).total_seconds() > buffer_seconds

If the credential is within 5 minutes of expiring, we treat it as expired. Simple, defensive, saves you from debugging ExpiredTokenException at 3 AM.

Session Validation: The Hot Path

Every authenticated API request runs through this:

def validate_session_and_role(
    session_id: str,
    tenant_id: str | None = None,
    use_case_id: str | None = None,
    environment: str | None = None,
) -> dict:
    # Single Redis GET — sub-millisecond
    session_data = get_session(session_id)
    if not session_data:
        raise ValueError("Session not found or expired")

    user_email = session_data.get("userId")
    roles = session_data.get("roles", [])

    result = {
        "valid": True,
        "user_email": user_email,
        "all_roles": roles,
        "highest_role": derive_highest_role(roles),
    }

    # Optional: validate specific tenant/use-case access
    if tenant_id and use_case_id and environment:
        matching_role = find_role_for_context(
            roles, tenant_id, use_case_id, environment
        )
        if not matching_role:
            raise ValueError(
                f"No access to {tenant_id}/{use_case_id}/{environment}"
            )
        result["role"] = matching_role

    return result

This is the difference between "every request takes 200ms to validate" and "every request takes <1ms to validate." The session is already in Redis. The role lookup is a JSON parse + list scan. Done.

The Login Flow: Putting It Together

Browser
  │
  │  GET /auth/userinfo
  ▼
ALB (OIDC authenticate)
  │
  │  verified user → forwarded with OIDC headers
  ▼
Backend Login Handler
  │
  ├─ 1. Decode & verify OIDC token (claims extraction)
  ├─ 2. Map IdP groups → platform roles (7-role hierarchy)
  ├─ 3. Build entitlements (tenant → use_case → env → role)
  ├─ 4. Store session in Redis (session:<uuid>)
  ├─ 5. Return session_id + tenants to frontend
  │
  ▼
Frontend stores session_id
  │
  │  Subsequent API calls include X-Session-Id header
  ▼
Any Backend Service
  │
  ├─ Validate session from Redis (sub-ms)
  ├─ Check role mapping for requested resource
  └─ If STS credentials needed:
       ├─ Check Redis cache first (sub-ms)
       └─ Call STS only on cache miss (~200ms)

The first login is the "expensive" one (~500ms total including STS). Every subsequent request benefits from the cache.

Connection Pooling: Don't Skip This

A surprisingly common mistake: creating a new Redis connection per request.

# ❌ Don't do this
def get_session(session_id):
    client = redis.Redis(host="localhost", port=6379)  # new connection!
    return client.get(f"session:{session_id}")

# ✅ Do this — reuse a connection pool
_pool = ConnectionPool(host="localhost", port=6379, max_connections=50)

def get_session(session_id):
    client = redis.Redis(connection_pool=_pool)
    return client.get(f"session:{session_id}")

Each TCP connection to Redis costs ~1ms to establish. At 1,000 req/s, that's 1 full second of CPU time per second just on handshakes. Connection pooling makes this a non-issue.

Observability: Know Your Hit Ratio

We track cache operations with Prometheus counters:

from prometheus_client import Counter, Gauge

cache_operations_total = Counter(
    "cache_operations_total",
    "Total cache operations",
    ["tenant_id", "service", "operation", "status"],
)

cache_hit_ratio = Gauge(
    "cache_hit_ratio",
    "Rolling cache hit ratio",
    ["tenant_id", "service"],
)

Labels like operation=get_creds and status=hit|miss|expired|error let you build dashboards that answer:

What's our STS cache hit ratio? (target: >85%)
Which tenants have the most cache misses? (may indicate config drift)
Are we seeing Redis errors? (time to check cluster health)

If your hit ratio drops below 80%, something is wrong—either TTLs are too short, sessions are thrashing, or your Redis instance is under memory pressure.

TLS + Secrets Manager: Production Hardening

In production, Redis connections should be encrypted and passwords should never live in env vars:

def _load_password_from_secrets_manager(secret_arn: str) -> str | None:
    """Load Redis auth token from AWS Secrets Manager."""
    sm = boto3.client("secretsmanager")
    resp = sm.get_secret_value(SecretId=secret_arn)
    secret = resp.get("SecretString", "")

    # Support both plain strings and JSON secrets
    if secret.strip().startswith("{"):
        obj = json.loads(secret)
        for key in ("password", "authToken", "token"):
            if key in obj:
                return obj[key]

    return secret.strip()

We also cache the fetched secret in-process—no need to call Secrets Manager on every pool initialization. And we configure TLS via the SSLConnection class from the Redis Python client:

from redis.connection import SSLConnection

pool_kwargs["connection_class"] = SSLConnection

This gives you in-transit encryption for ElastiCache, which is a compliance checkbox you'd rather check early.

Gotchas (A.K.A. What Bit Us So It Doesn't Bite You)

1. Stale Credentials After Role Changes

If a user's role changes (e.g., promoted from USE_CASE_DEVELOPER to USE_CASE_OWNER), the cached session still has the old role mappings. Our fix: invalidate the session on role change and force a re-login.

def invalidate_session(session_id: str) -> bool:
    client = get_redis_client()
    return client.delete(f"session:{session_id}") > 0

2. Redis Goes Down — What Then?

Redis is fast, but it's not invincible. If the Redis cluster is unreachable:

Session validation should fail-closed (reject the request, don't silently allow it)
Log aggressively so ops teams see the outage
Never fall back to "allow all" — that's a security vulnerability disguised as fault tolerance

3. Session Key Collisions

Using predictable keys (like session:<user_email>) opens the door to session hijacking. Use session:<uuid4> — the session ID should be unguessable.

4. Memory Pressure in Multi-Tenant Environments

Each session stores role mappings for every tenant/use-case the user can access. A platform admin with access to 50 tenants has a bigger session object than a single-tenant end user. Monitor Redis memory usage and set maxmemory-policy to volatile-lru so expired keys get evicted first.

5. Binding Token Replay Attacks

If your auth flow uses one-time binding tokens (e.g., for device code flows), mark them as consumed in Redis with a short TTL:

def mark_binding_token_consumed(token: str, ttl: int = 900) -> bool:
    key = f"binding_token:consumed:{token[:16]}"
    get_redis_client().setex(key, ttl, "1")
    return True

def is_binding_token_consumed(token: str) -> bool:
    key = f"binding_token:consumed:{token[:16]}"
    return bool(get_redis_client().exists(key))

When You Should Not Use This Pattern

Single-user apps — if you have 10 users, the extra Redis infrastructure isn't worth it. A signed JWT with short expiry is simpler.
Stateless-only architectures — if your design principle is "no server-side state," Redis sessions are a philosophical violation. (But also: stateless auth at scale has its own costs.)
No AWS roles to assume — if you're not using STS, the credential caching half of this pattern doesn't apply. The session caching half still might.

A Practical Implementation Checklist

[ ] Deploy Redis (ElastiCache Serverless or self-managed cluster with replication)
[ ] Enable TLS in-transit (SSLConnection)
[ ] Store Redis password in Secrets Manager, not env vars
[ ] Use connection pooling (ConnectionPool with max_connections)
[ ] Set session TTL to match your security requirements (we use 1 hour)
[ ] Add 5-minute expiration buffer on STS credential cache
[ ] Implement health_check() — ping Redis on startup and expose /health
[ ] Add Prometheus metrics for cache hit/miss/error rates
[ ] Set maxmemory-policy to volatile-lru on the Redis instance
[ ] Document your invalidation strategy (when do cached sessions get killed?)
[ ] Test Redis-down scenarios (your app should fail-closed, not fail-open)
[ ] Load SSM parameters at startup, not import time (env vars must be populated first)

The Numbers

Before Redis caching:

Login: ~800ms (OIDC + STS + DB lookups)
Subsequent API auth: ~200ms per request (session re-validation + STS)
STS calls: 1 per authenticated request

After Redis caching:

Login: ~500ms (OIDC + STS + Redis write — the STS is cached for next time)
Subsequent API auth: <1ms (Redis GET + JSON parse)
STS calls: 1 per unique tenant/role/env combination per session lifetime

At 10,000 authenticated requests per hour, that's the difference between 10,000 STS calls and ~50. Your AWS bill notices. Your users notice. Your on-call rotation notices.

Closing: The Fastest Auth Call Is the One You Don't Make

Redis isn't just a cache layer for your database queries. It's the foundation of a fast, secure auth perimeter.

The session cache eliminates per-request identity lookups. The STS credential cache eliminates per-request IAM calls. Together, they turn your auth layer from a distributed systems problem into a local memory read.

And when security is fast, developers stop looking for shortcuts around it.

What's your strategy for caching short-lived AWS credentials? Do you cache at the application layer, use credential providers, or something else entirely? Drop a comment — I'm curious what patterns are working for others.

Resources

AWS Docs: STS AssumeRole — rate limits and best practices
Redis: Connection Pooling in the Python client
AWS ElastiCache: In-transit encryption
Prometheus: Client instrumentation for Python

About the Author

Suraj Khaitan — Gen AI Architect | Building scalable platforms and secure cloud-native systems

Connect on LinkedIn | Follow for more engineering and architecture write-ups

🔥 We Deleted Our Login Code: ALB OIDC for Serverless Frontends

Suraj Khaitan — Sun, 08 Feb 2026 07:01:00 +0000

How moving auth to the load balancer with ALB’s authenticate_oidc made our UI simpler, our defaults safer, and our incidents rarer

The Day “Just Store the Token” Stopped Being Funny

At some point, every frontend team gets the same suggestion:

“Just do OAuth in the browser, store the token, and attach it on API calls.”

It works—until it doesn’t.

Because the moment your UI becomes responsible for token storage, refresh logic, callback routes, and logout semantics, your “frontend” quietly turns into an auth product.

We fixed this by doing something that feels almost illegal:

We let the load balancer handle the login.

Specifically: AWS Application Load Balancer (ALB) + authenticate_oidc + a serverless frontend target (Lambda).

TL;DR (If You Only Read One Section)

Problem: App-level OIDC spreads secrets + token handling across every UI route and runtime.
Move: Put OIDC at the edge using ALB authenticate_oidc.
Result: Less auth code in the app, fewer token footguns, and a “secure-by-default” perimeter.
Tradeoff: Local dev + logout semantics require intentional design.

Why This Pattern Is Trending Right Now

Across dev communities lately, the popular themes are consistent:

“Stop overbuilding auth in every app.”
“Move concerns up the stack.”
“Make security the default, not a checklist item.”

Edge-auth patterns (ALB OIDC, gateway authorizers, access proxies) are having a moment because they reduce the number of places a team can accidentally get auth wrong.

The Real Problem: Token Chaos Isn’t One Bug—It’s a Lifestyle

If you do OIDC inside the frontend, you almost inevitably accumulate:

A callback route you must never break
Token storage debates (localStorage vs memory vs cookies)
Refresh token logic (and the day it fails in production)
“Why did it log me out?” issues
Security reviews that keep expanding scope

And the nastiest part is: it’s not one critical bug—it’s a hundred tiny sharp edges.

The Pivot: Authentication at the ALB

When you use authenticate_oidc, the ALB becomes the bouncer:

Unauthenticated requests get redirected to your Identity Provider (IdP)
The ALB completes the OIDC flow
The ALB maintains an authenticated session (cookie-based)
Only authenticated requests reach your target

Your serverless frontend (often a Lambda router / SSR / fallback handler) simply… serves pages.

The vibe shifts from:

“Did we implement OAuth correctly?”

to:

“If I got a 200, I’m logged in.”

The Request Flow in 30 Seconds

Browser
  |
  | GET /anything
  v
ALB (authenticate_oidc)
  |
  | not logged in?
  | 302 -> IdP
  v
IdP (login)
  |
  | 302 -> ALB callback
  v
ALB (sets session cookies)
  |
  | forward
  v
Lambda target (serverless frontend router)

Notice what’s missing:

No client-side token parsing
No callback handler in your React app
No refresh logic scattered across fetch calls

A Minimal, Anonymized CDK Snippet

This is intentionally “shape only” (no real URLs, no org names). The essence is:

1) forward to a Lambda target group
2) wrap it with authenticate_oidc

from aws_cdk import aws_elasticloadbalancingv2 as elbv2
from aws_cdk import aws_elasticloadbalancingv2_targets as targets
from aws_cdk import SecretValue

frontend_tg = elbv2.ApplicationTargetGroup(
    scope,
    "FrontendTg",
    target_type=elbv2.TargetType.LAMBDA,
    targets=[targets.LambdaTarget(frontend_router_lambda)],
)

listener.add_action(
    "FrontendWithOidc",
    priority=100,
    conditions=[elbv2.ListenerCondition.path_patterns(["/*"])],
    action=elbv2.ListenerAction.authenticate_oidc(
        issuer="https://idp.example/",
        authorization_endpoint="https://idp.example/oauth2/authorize",
        token_endpoint="https://idp.example/oauth2/token",
        user_info_endpoint="https://idp.example/oauth2/userinfo",
        client_id="<client-id>",
        client_secret=SecretValue.secrets_manager("/path/to/oidc-secret"),
        next=elbv2.ListenerAction.forward([frontend_tg]),
    ),
)

Quick rules that save pain:

Keep the OIDC secret in a secret manager, not env vars.
Make sure listener priorities don’t collide.
Default to protecting /* unless you truly want public routes.

How This Changed Our Security Posture (In Plain English)

1) “Secure by default” stops being a slogan

With ALB OIDC, every path behind the listener rule becomes authenticated by default. You’re no longer relying on every route guard, every component, and every refactor to “remember auth.”

2) Less token exposure in the browser

The browser is a hostile environment. Reducing token handling in the UI reduces your exposure to:

XSS turning into token theft
accidental logging of sensitive values
copy-paste auth bugs across micro-frontends

3) Fewer app secrets

If your frontend app doesn’t need to “be an OAuth client,” it also needs fewer secrets and fewer complicated deployment rules.

The Subtle but Important Split: Auth vs Authorization

ALB OIDC is excellent at authentication (“who are you?”).

But you still need strong authorization (“what can you do?”):

RBAC: role-based permissions
ABAC: tenant/env/resource scoping

The clean division:

ALB: verify the user is logged in
Backend: enforce permissions and data scope

If you try to do all authorization at the load balancer, you’ll end up with something brittle and hard to evolve.

Gotchas (A.K.A. The Part Everyone Learns in Production)

1) Callback path behavior

ALB uses a callback endpoint (often something like /oauth2/idpresponse). Make sure your routing rules don’t accidentally break it.

2) Claims can get huge

Too many groups/roles/claims can hit header/cookie limits. Mitigations:

keep tokens/claims lean
fetch richer profile data server-side
store heavy identity in your own session store

3) Logout is three separate things

There’s:

app logout
ALB session cookie
IdP session

Define what “Logout” means for your UX and compliance requirements.

4) Local dev can feel weird

Production has ALB OIDC; your laptop doesn’t.

Good local-dev patterns:

inject mocked identity headers in dev
run a lightweight local gateway that simulates “auth at the edge”
keep backend authorization testable without a real IdP

A Practical Rollout Checklist

Verify OIDC endpoints: issuer + authorize + token + userinfo
Store the client secret in a secret manager
Confirm listener rule priority ordering
Ensure callback path is reachable through routing rules
Enforce HTTPS everywhere
Enable ALB access logs
Document logout behavior (what it clears)
Write down the local-dev story (seriously)

When You Should Not Use ALB OIDC

Avoid / reconsider if:

you need complex per-request authorization decisions before forwarding
you don’t have an ALB in the request path (pure CDN with no origin auth)
your org mandates a different gateway or zero-trust access layer

Closing: Make the Safe Path the Easy Path

The benefit of this pattern isn’t novelty.

It’s that you can remove an entire category of mistakes:

less auth code in the UI
fewer ways to leak tokens
consistent enforcement across routes

And when security is the default, teams move faster—because fewer changes require “special auth handling.”

If you’ve done edge auth (ALB OIDC, gateway authorizers, access proxies), what hurt most for you: local dev, logout, or claim size?

Resources

AWS Docs: Application Load Balancer authentication actions (OIDC)
AWS CDK: ListenerAction.authenticate_oidc
OAuth 2.0 / OIDC basics (for understanding redirects, authorization code flow)

About the Author

Suraj Khaitan — Gen AI Architect | Building scalable platforms and secure cloud-native systems

Connect on LinkedIn | Follow for more engineering and architecture write-ups

🧠 RAG in 2026: A Practical Blueprint for Retrieval-Augmented Generation

Suraj Khaitan — Sun, 25 Jan 2026 06:20:17 +0000

How to make LLMs feel “grounded” in your data—without turning your app into a prompt-factory.

Large Language Models are incredible at language, but they still have two awkward traits in production:

They don’t know your private data by default (docs, tickets, code, policies).
They can sound confident even when they’re guessing.

Retrieval-Augmented Generation (RAG) is the most reliable pattern I’ve used to fix both—by giving the model just-in-time access to relevant context at the moment it answers.

This post is a practical, medium-depth tour of RAG: the core architecture, the failure modes, and the “advanced knobs” that actually move quality (reranking, routing, query strategies, and better indexing). I’ll also point you to a great open-source reference implementation that I’ve been using as a sanity check.

🔎 The Core Idea: Don’t Train, Retrieve

Think of RAG as two systems working together:

Retriever: finds the best supporting context for a question.
Generator (LLM): writes the final answer using the retrieved context.

Instead of trying to cram your entire knowledge base into model weights, you keep your knowledge in stores that are good at search (vector DBs, relational DBs, graph DBs), retrieve the best bits, and then let the LLM do what it does best: compose a response.

A good mental model:

RAG = Search + Reasoning

Search brings facts. Reasoning provides coherence.

🏗️ A Clean RAG Architecture (What Actually Matters)

Most RAG diagrams look complex because they include every optional component. Here’s a simple backbone that scales:

Ingest documents (PDFs, web pages, internal wikis, tickets)
Chunk them into retrievable units
Embed chunks into vectors
Index vectors in a vector store
Retrieve top-$k$ chunks for a question
Generate an answer with citations / grounded context

In code, the minimal version feels like:

question -> embed(question) -> similarity_search -> context -> LLM(prompt + context)

If you only build that, you’ll get something working quickly—but you’ll also quickly hit the real-world issues:

Retrieval returns “nearby” chunks that don’t actually answer the question
The best chunk is buried at rank 17
A single query phrasing misses the right terminology
Some questions should query SQL or a graph, not embeddings

That’s where the next layers matter.

📦 Retrieval Isn’t Only Vectors: Pick the Right Store

A mature RAG system doesn’t have to be “vector-only”. Depending on the question, retrieval can come from:

Vector stores: semantic search over unstructured text (docs, emails, transcripts)
Relational DBs: exact structured facts (orders, users, pricing, logs)
Graph DBs: relationships and traversals (org charts, dependency graphs, knowledge graphs)

In practice, you often end up with a hybrid:

Data type	Best retrieval style	Example question
Policies / long docs	Vector search	“What’s our parental leave policy?”
Metrics / records	SQL	“What was churn last quarter in EMEA?”
Relationships	Cypher / graph queries	“Who owns service X and what depends on it?”

This is why modern RAG stacks include things like Text-to-SQL, Text-to-Cypher, and self-query retrievers (where the model generates a structured search query and metadata filters).

🧭 Routing: The “Secret Sauce” for Multi-Source RAG

If you only have one data source, retrieval is straightforward. But the moment you add a relational database, a vector store, and maybe a graph—your first big design decision becomes:

How do I route a user’s question to the right retriever?

Two patterns show up repeatedly:

1) Logical routing

Simple rules or a lightweight classifier.

“If the question mentions revenue, query SQL.”
“If the question mentions ‘policy’, use the handbook index.”

2) Semantic routing

Use embeddings (or a small LLM prompt) to decide which tool to call.

This reduces “tool spam” and usually improves relevance because you retrieve from the right store first.

🧠 Query Strategies That Increase Recall (Without Overfetching)

Most weak RAG answers are not generation problems—they’re retrieval problems.

A single user question is often ambiguous. Strong pipelines expand the query space before retrieving.

Here are query strategies I’ve seen consistently help:

Multi-query

Generate multiple paraphrases of the question and retrieve for each.

Why it works: different phrasing hits different vocabulary.

Step-back questions

Ask a higher-level sub-question first (“What concept is this about?”), then use that to retrieve.

Why it works: reduces lexical mismatch and anchors retrieval.

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer document, embed that, and retrieve based on it.

Why it works: the hypothetical answer contains domain language the user may not use.

RAG-Fusion

Retrieve multiple lists (from multi-query, HyDE, etc.) and then fuse rankings (often using Reciprocal Rank Fusion).

Why it works: you get strong recall without blindly increasing $k$.

🥇 Reranking: Fix “The Answer Was in the Context, But…”

If you’ve built a basic RAG system, you’ve likely seen this failure mode:

The right chunk is retrieved
But it’s ranked too low
The LLM focuses on the wrong chunk

Reranking is the clean fix.

A common pipeline looks like:

Retrieve top 20–50 chunks cheaply (vector similarity)
Rerank top candidates with a stronger model (cross-encoder, LLM-based ranker, or a reranker API)
Feed the top 3–8 chunks to the generator

You’ll see reranking approaches referenced as:

Cross-encoder rerankers
LLM ranking (sometimes called RankGPT-style ranking)
RRF (Reciprocal Rank Fusion) when merging multiple retrieval lists

This is one of the highest ROI upgrades in RAG.

🧹 Filter & Compress: The Missing Piece for Long Context

Even if retrieval is good, the final prompt can still be noisy:

repeated information
irrelevant paragraphs
chunks that overlap heavily

That’s where contextual compression comes in: after retrieval, you summarize, extract, or filter down to only what matters.

This is especially important as your data grows and you start using larger $k$ values.

🗂️ Indexing: Where Most Teams Underinvest

Indexing decisions quietly determine your ceiling.

Here are indexing techniques worth knowing (and testing):

Chunk optimization

Chunk size is not a constant. Different document types want different chunking.

Too small → context fragments
Too large → retrieval becomes “blurry”

Semantic splitting

Split on meaning (headings, sections), not arbitrary character counts.

Parent-document retrieval

Store embeddings for child chunks but return a larger “parent” span when answering.

Multi-representation indexing

Index both:

fine-grained chunks for precision
summaries for recall

Specialized embeddings / fine-tuning

If your domain has unique language (legal, medicine, internal code), embeddings matter.

Hierarchical indexing (RAPTOR-like)

Build a tree of summaries from leaves → root so retrieval can happen at multiple abstraction levels.

Token-level retrieval (ColBERT-style)

A stronger retrieval approach when semantics are subtle and bag-of-vector similarity struggles.

You don’t need all of these. But the point is: RAG quality is frequently an indexing problem disguised as an LLM problem.

🔁 Active Retrieval (and Why It’s the Future)

Some questions require the system to work:

ask clarifying questions
reformulate queries mid-flight
retry retrieval when evidence is weak

You’ll sometimes see this category described as active retrieval (including approaches like CRAG / self-correcting retrieval patterns).

The takeaway: the best RAG systems aren’t one-shot. They behave more like a careful researcher.

🧪 A Hands-On Reference: bRAG-langchain

If you want something concrete to learn from (and compare against your own implementation), I recommend checking out the open-source project here:

https://github.com/bRAGAI/bRAG-langchain/

What I like about it:

It walks from baseline RAG → multi-query → routing → advanced indexing → reranking
It’s notebook-driven, so you can test ideas quickly
It keeps the focus on practical patterns (not just theory)

A suggested learning path mirrors the notebook sequence:

Baseline RAG setup
Multi-query improvements
Routing + query construction
Advanced indexing
Retrieval + reranking + fusion

Use it like a “cookbook”: borrow the ideas, not the exact words.

👨‍💻 Code Walkthrough (Inspired by bRAG-langchain)

Below are two rewritten snippets inspired by the project’s notebooks (especially full_basic_rag.ipynb). The goal is to show the shape of a clean RAG pipeline—without dumping an entire notebook into a blog post.

Attribution: the reference implementation that inspired these patterns is bRAG AI: https://github.com/bRAGAI/bRAG-langchain/

1) A minimal LangChain RAG chain (loader → chunks → vectors → retriever → chain)

This is the “boring baseline” that should work before you touch reranking, routing, or fancy indexing.

import os
from dotenv import load_dotenv

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


load_dotenv()  # expects OPENAI_API_KEY, PINECONE_INDEX_NAME, etc.


def join_docs(docs) -> str:
    return "\n\n".join(d.page_content for d in docs)


# 1) Load
docs = PyPDFLoader("path/to/your.pdf").load()

# 2) Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=900, chunk_overlap=150)
chunks = splitter.split_documents(docs)

# 3) Embed + index
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    index_name=os.environ["PINECONE_INDEX_NAME"],
)

# 4) Retrieve
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 5) Generate
prompt = ChatPromptTemplate.from_template(
    """You are a grounded assistant. Use ONLY the context to answer.

Context:
{context}

Question: {question}

If the answer is not in the context, say you don't know.
"""
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)

rag_chain = (
    {"context": retriever | join_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is this document about?"))

Why this pattern is nice: retrieval is a pure function of the question, and prompt+LLM are pure functions of {context, question}. That separation makes it easy to add routing, reranking, eval, caching, etc.

2) Multi-query + fusion (high recall without blindly increasing k)

The repo’s later notebooks explore multi-query / fusion and reranking. The key mental model is:

generate multiple query variants
retrieve for each
fuse the ranked lists (so strong hits bubble up)
optionally rerank the merged set

Here’s a compact sketch using Reciprocal Rank Fusion (RRF):

from collections import defaultdict


def rrf_fuse(ranked_lists, *, k: int = 60, top_n: int = 10):
    """Fuse multiple ranked lists using Reciprocal Rank Fusion.

    ranked_lists: list[list[Document]]
    """
    scores = defaultdict(float)
    by_id = {}

    for docs in ranked_lists:
        for rank, doc in enumerate(docs):
            # Prefer a stable ID if you have one; fallback to content hash
            doc_id = doc.metadata.get("id") or hash(doc.page_content)
            by_id[doc_id] = doc
            scores[doc_id] += 1.0 / (k + rank + 1)

    fused = sorted(scores, key=scores.get, reverse=True)
    return [by_id[i] for i in fused[:top_n]]


def generate_queries(question: str) -> list[str]:
    # In practice: use an LLM prompt to produce 3–8 diverse rewrites.
    return [
        question,
        f"Explain {question} with concrete examples",
        f"What are the key concepts behind: {question}?",
    ]


question = "How does RAG reduce hallucinations?"
queries = generate_queries(question)

ranked_lists = [retriever.get_relevant_documents(q) for q in queries]
fused_docs = rrf_fuse(ranked_lists, top_n=6)

answer = rag_chain.invoke(question)  # or rebuild chain to use fused_docs
print(answer)

In production you’d typically rebuild the chain so the “context” comes from fused_docs (and then optionally apply a learned reranker like Cohere Rerank on that smaller candidate set).

✅ A Production Checklist (Short, but Useful)

Before you ship RAG to real users, make sure you can answer:

Evaluation: How will you measure grounded correctness (not just fluency)?
Citations: Can you show which sources supported the answer?
Fallbacks: What happens when retrieval confidence is low?
Security: Are you filtering sensitive docs by user permissions before retrieval?
Freshness: How often is the index updated? (and can you delete data reliably?)
Latency: Can you keep response time acceptable with reranking and multi-query?

Conclusion

RAG isn’t a single technique—it’s a toolbox:

retrieval across the right stores
routing to the right tool
smarter query generation (multi-query, step-back, HyDE)
reranking and fusion
compression for long context
indexing strategies that scale

If you get retrieval right, generation becomes the easy part.

Resources

bRAG LangChain project (hands-on notebooks): https://github.com/bRAGAI/bRAG-langchain/
RAG architecture diagram source material: see RAG_Consolidated.jpg

About the Author

Suraj Khaitan — Gen AI Architect | Building the next generation of AI-powered development tools

Connect on LinkedIn | Follow for more AI and software engineering insights

Tags: #AI #RAG #LLM #LangChain #VectorDatabases #InformationRetrieval #GenerativeAI

Inside Google Antigravity: How AI Pair Programming Actually Works

Suraj Khaitan — Sun, 25 Jan 2026 02:18:37 +0000

A deep dive into the architecture, capabilities, and real-world coding experience with Google's revolutionary AI coding assistant

The Promise of AI-Powered Development

Imagine having a senior engineer who knows every programming language, can refactor your entire codebase in seconds, understands design patterns across frameworks, and never gets tired. That's the promise of Google Antigravity—but how does it actually work in practice?

I recently spent over 10 hours building a complete MacOS desktop simulation in React, including a functional Safari browser, Twitter clone, Spotify player, and even a working Flappy Bird game—all with Antigravity as my pair programmer. This article breaks down the architecture, capabilities, and lessons learned from pushing this AI assistant to its limits.

🎯 The Architecture: Three Core Layers

1. The Reasoning Engine: Claude 4.5 Sonnet

At its core, Antigravity uses Anthropic's Claude 4.5 Sonnet with extended "thinking" capabilities. Unlike traditional code completion tools, Antigravity doesn't just autocomplete—it reasons about your entire codebase.

What makes it special:

200K token context window: Can understand entire codebases, not just snippets
Agentic behavior: Plans, executes, verifies, and iterates autonomously
Tool use: Direct filesystem access, browser control, terminal commands
Multi-turn conversations: Maintains context across hours of development

2. The Tool Layer: Direct System Access

This is where Antigravity diverges from chat-based AI. It can:

// Read and modify files
view_file({ AbsolutePath: "/path/to/component.jsx" })
replace_file_content({ 
    TargetFile: "/path/to/component.jsx",
    TargetContent: "old code",
    ReplacementContent: "new code"
})

// Execute terminal commands
run_command({ 
    CommandLine: "npm run build",
    SafeToAutoRun: true 
})

// Control the browser
browser_subagent({ 
    Task: "Navigate to localhost and take screenshot"
})

// Search the web for real-time data
search_web({ 
    query: "Virat Kohli recent statistics 2026"
})

Real example from my session: When I asked for "real data" about cricket players, Antigravity:

Searched the web for current statistics
Found Virat Kohli's actual tweets from January 2026
Downloaded real images from news sources
Updated the entire app with live data—all autonomously

3. The Task Management System

Antigravity doesn't just execute commands—it manages projects. It uses a sophisticated task boundary system:

# task.md (Auto-generated)
- [x] Desktop Environment
    - [x] Create `Window` component (Draggable, Controls)
    - [x] Create `MenuBar` component
    - [x] Create `Dock` component
- [/] App Adaptation
    - [x] Update Spotify for Desktop
    - [ ] Update Twitter for Desktop

The [/] notation indicates "in progress"—the AI tracks its own state across the conversation.

📄 Real-World Example: The MacOS Desktop Build

Let me walk through how Antigravity handled a complex, evolving requirement.

Initial Request

"Build an iPhone UI simulator with Spotify, Flappy Bird, and Twitter apps"

Phase 1: Planning Mode

Antigravity created an implementation_plan.md:

## Proposed Changes

### OS Components
#### [NEW] IPhoneFrame.jsx
- Notch, rounded corners, status bar
- Home bar for navigation
- App switching logic

#### [NEW] HomeScreen.jsx  
- App grid with icons
- Glassmorphism dock

### Apps
#### [NEW] SpotifyApp.jsx
- Player UI with progress bar
- Playlist view
- Mock playback logic

It then requested approval before writing any code. This human-in-the-loop design prevents wasted effort.

Phase 2: Execution Mode

Once approved, Antigravity wrote 2,000+ lines of React code across 15 files in minutes:

// Example: Auto-generated Window Component
const Window = ({ id, title, children, onClose, zIndex }) => {
  return (
    <motion.div
      drag
      dragMomentum={false}
      style={{ zIndex }}
      className="absolute bg-[#1e1e1e] rounded-xl shadow-2xl"
    >
      <div className="h-[38px] bg-[#2a2a2a] flex items-center px-4">
        {/* Traffic light controls */}
        <div className="flex gap-2 group">
          <div onClick={() => onClose(id)} 
               className="w-3 h-3 rounded-full bg-[#ff5f57]">
            <X size={8} className="opacity-0 group-hover:opacity-100" />
          </div>
        </div>
        <div className="flex-1 text-center text-xs">{title}</div>
      </div>
      <div className="flex-1 overflow-hidden">
        {children}
      </div>
    </motion.div>
  );
};

Phase 3: The Plot Twist

Midway through, I changed my mind:

"Actually, make it MacOS instead of iPhone"

What happened next was remarkable:

Antigravity didn't just rename variables. It:

Identified architectural incompatibilities (mobile touch → desktop mouse)
Updated the entire windowing system to support dragging
Replaced the home screen with a desktop + dock
Converted mobile apps to desktop windows
Preserved all the existing app logic—zero rework

The transition took ~5 minutes and 30 file modifications.

Phase 4: The "It Looks Dull" Crisis

User feedback:

"This is how stupid it looks currently in my laptop, how dumb you are"

🔴 Critical error discovered: Tailwind CSS wasn't installed!

All the styling classes (bg-black, flex, rounded-xl) were being rendered as plain HTML. Antigravity:

Diagnosed the issue from the screenshot
Installed tailwindcss, postcss, autoprefixer
Created config files (tailwind.config.js, postcss.config.js)
Switched to the newer @tailwindcss/postcss plugin when the first attempt failed
Rebuilt and verified the fix

The lesson: Even AI makes mistakes, but it course-corrects fast.

⚡ The Self-Healing Workflow

Here's what surprised me most: Antigravity debugs itself.

Example: The Photo App Failure

User complaint:

"No photos of Virat Kohli, videos not working, X not working"

Antigravity's response:

Search for real data:

search_web({ query: "Virat Kohli recent photos images 2026" })
search_web({ query: "Virat Kohli Twitter tweets recent 2026" })

Extract URLs from search results:

const images = [
  'https://pbs.twimg.com/media/GiC7zqMWsAA0RJL?format=jpg&name=large',
  'https://akm-img-a-in.tosshub.com/indiatoday/images/story/202401/...',
  // Real sources from news articles
];

Update the component with error handling:

<img 
  src={src} 
  onError={(e) => {
    e.target.src = 'https://via.placeholder.com/400?text=Virat+Kohli';
  }}
/>

Switch video player to YouTube embed:

<iframe
  width="100%"
  height="100%"
  src="https://www.youtube.com/embed/[videoId]"
  allow="autoplay; encrypted-media"
  allowFullScreen
/>

All of this happened without me specifying how to fix it. The AI inferred the problems, researched solutions, and implemented them.

🎨 The "Real World Data" Challenge

One unique aspect of this session: I demanded real data, not lorem ipsum placeholders.

The ask:

"Get the data from real world on Lauren Bell or Virat Kohli"

What Antigravity did:

Live web search for current cricket statistics (January 2026)
Extracted structured data:

const stats = {
  odi: {
    matches: 311,
    runs: 14797,
    average: '58.72',
    centuries: 54  // Real number from ICC
  }
}

Found actual tweets:

{
  content: 'Congratulations @NSaina on a legendary career...',
  timestamp: 'Jan 23',
  likes: '156K'  // Actual engagement numbers
}

Sourced real images from Wikipedia, news outlets, and social media

The result? A Twitter clone showing Virat Kohli's actual January 2026 timeline, not synthetic mock data.

🔧 The Technical Challenges It Solved

Challenge 1: Tailwind CSS Configuration Hell

Most developers spend hours debugging Tailwind setup. Antigravity:

Installed dependencies via npm
Generated config files with proper ES module syntax
Switched to @tailwindcss/postcss when v3 syntax failed
Verified the build (CSS jumped from 1.17KB → 18.21KB)

Time saved: ~2 hours of Stack Overflow research

Challenge 2: State Management Across Windows

The MacOS desktop needed to track:

Window positions (x, y coordinates)
Z-index stacking order
Focus state
Open/closed status

Antigravity's solution was elegant:

const [windows, setWindows] = useState({});
const [maxZIndex, setMaxZIndex] = useState(1);

const focusWindow = (id) => {
  setWindows(prev => {
    const nextZ = maxZIndex + 1;
    setMaxZIndex(nextZ);

    const newState = { ...prev };
    newState[id] = { ...prev[id], zIndex: nextZ, isActive: true };

    // Deactivate others
    Object.keys(newState).forEach(key => {
      if (key !== id) newState[key].isActive = false;
    });
    return newState;
  });
};

No memory leaks, proper immutability, clean logic—first attempt.

Challenge 3: Framer Motion Animations

Creating smooth iOS-style animations requires understanding physics:

<motion.div
  initial={{ y: '100%' }}
  animate={{ y: 0 }}
  exit={{ y: '100%' }}
  transition={{ type: 'spring', damping: 25, stiffness: 300 }}
>
  {/* App content */}
</motion.div>

Antigravity nailed the spring physics parameters on the first try. That's hundreds of hours of animation experience distilled into working code.

💡 Key Insights: How to Work with Antigravity

After 10+ hours, here's what I learned:

✅ Do:

Start broad, iterate narrow: "Build a MacOS desktop" → "Add real Virat Kohli data"
Give honest feedback: When I said "this looks stupid," it fixed the actual problem
Let it research: It found better data sources than I would have Googled
Trust the task breakdown: The auto-generated task.md was better organized than my mental model

❌ Don't:

Micromanage implementation: It knows Tailwind/Framer Motion better than most humans
Assume it knows your aesthetic: "Dull" was subjective—I had to specify what was missing
Skip the planning phase: Approving implementation_plan.md saves rework

🚀 The Broader Implications

This isn't just a better autocomplete. Antigravity represents a fundamental shift:

Traditional Coding	Antigravity Coding
Write line-by-line	Describe outcomes
Debug syntax errors	Solve logic problems
Think about how to implement	Think about what to build
Spend hours on config	Spend hours on design

The bottleneck shifts from syntax to ideas.

What This Means for Developers

Junior developers: Can build production-quality apps without years of framework expertise

Senior developers: Can prototype 10x faster and focus on architecture

Teams: Can iterate on features in hours instead of sprints

🎯 Real-World Performance Metrics

From my session:

Metric	Value
Total files created	22
Lines of code written	~2,500
Build errors fixed autonomously	6
Web searches performed	12
User "it's broken" complaints	5
Times AI fixed itself	5
Final build status	✅ Successful

Development time: 10 hours (with learning curve)

Estimated manual time: 40-60 hours for equivalent quality

💬 The Human Element

Despite all this automation, I was still essential. I provided:

Vision ("I want a MacOS desktop")
Taste ("This looks dull, add real data")
Domain knowledge ("Use Virat Kohli cricket stats")
QA feedback ("Photos not working")

Antigravity is a multiplier, not a replacement. It executes at AI speed on human-defined goals.

🔮 What's Next?

Imagine this technology in 12 months:

Multimodal understanding: Show it a design mockup, get working code
Codebase memory: Remember every project you've built
Autonomous testing: Self-write test suites and fix failures
Cross-platform deployment: "Make this work on iOS" → done

We're at iPhone 1 levels of maturity. The next decade will be wild.

Conclusion

Google Antigravity isn't magic—it's the first truly agentic coding assistant. It plans, executes, debugs, and learns from feedback in a continuous loop.

The key insight: Good AI assistance isn't about eliminating coding; it's about eliminating the tedious 80% so you can focus on the creative 20%.

After this experience, I can't imagine going back to traditional development. Not because I've forgotten how to code, but because I've remembered why I started coding in the first place: to build things, not to fiddle with configs and syntax.

The future of programming isn't code-free. It's friction-free.

Have you used AI coding assistants? What's been your experience with autonomous agents vs. autocomplete? Share your thoughts in the comments!

Resources

Anthropic Claude: The reasoning model powering Antigravity
Framer Motion: React animation library used in examples
Tailwind CSS: Utility-first CSS framework
Vite: Build tool for the demo project
Real-time web search: Powered by Google Search API

Official Google Antigravity Resources

Google Antigravity Overview - Comprehensive introduction and use cases
Download Google Antigravity - Official download page
Getting Started with Google Antigravity - Official Google Codelabs tutorial

About the Author

Suraj Khaitan — Gen AI Architect | Building the next generation of AI-powered development tools

Connect on LinkedIn | Follow for more AI and software engineering insights

Tags: #AI #GoogleAntigravity #SoftwareEngineering #React #WebDevelopment #AIAssistant #ClaudeAI #DeveloperTools #Programming #MachineLearning

Retrieval-Augmented Generation (RAG) Agents: How to Build Grounded, Tool‑Using GenAI Systems

Suraj Khaitan — Sun, 28 Dec 2025 09:41:41 +0000

If you’ve built a demo where an LLM answers questions over your docs, you’ve built RAG.

If you’ve tried to ship it—and suddenly you’re dealing with missing citations, prompt injection, inconsistent tool calls, and “why did it say that?”—you’re building a RAG agent.

This post is a practical blueprint for designing a GenAI RAG agent that is:

grounded in evidence (with citations),
capable of multi-step work (tools + loops),
safe (guardrails + authorization),
observable (traces + evals),
and maintainable (clear contracts, not prompt spaghetti).

Everything here is generic and vendor-agnostic. The code snippets are intentionally simplified patterns inspired by production agent wrappers (tool calling, memory summaries, guardrail checks), without any client/project identifiers.

RAG vs. RAG agents
A reference architecture you can ship
Retrieval that actually works
Context engineering (the underrated part)
Tool use: the difference between “agent” and “chatbot”
Memory: short-term chat vs. long-term summaries
Guardrails: prompt injection, data leaks, and safe tool calls
Verification: how you earn user trust
Evaluation + observability
A shipping checklist

RAG vs. RAG agents

RAG (single-shot) is typically:

take a question,
retrieve relevant passages,
generate an answer.

A RAG agent is a system that can iterate:

understand the goal,
decide next steps,
retrieve evidence (possibly multiple times),
call tools (search, ticketing, DB lookups, workflows),
verify results,
respond with citations.

A simple example:

User: “Summarize the refund policy and open a support ticket if I’m eligible.”

A RAG agent might:

retrieve the policy pages,
determine eligibility criteria,
ask a follow-up (purchase date),
call a tool to create a ticket,
and return a final answer with citations + the ticket ID.

This is where architecture matters: once your system can act, you need stronger controls than a prompt.

A reference architecture you can ship

Here’s a diagram-friendly mental model:

User
  ↓
Orchestrator (routing + policy)
  ├─ Retriever (vector / keyword / hybrid)
  ├─ Reranker (optional)
  ├─ Context Builder (dedupe, trim, cite)
  ├─ LLM Reasoner (constrained)
  ├─ Tool Runner (allowlist + authz)
  ├─ Memory (session + long-term summary)
  ├─ Guardrails (input/output moderation + injection defenses)
  └─ Observability (traces, logs, evals)
  ↓
Answer + Citations + Actions

The key move is to treat RAG agents as systems:

Retrieval is a component (not magic).
Tool execution is a component (not “LLM will behave”).
Memory is a component (not just “add the chat history”).
Verification is a component (not “hope the model is careful”).

Putting it together: an end-to-end request handler

Below is a simplified “agent wrapper” flow you can adapt. It mirrors how production systems typically work: apply guardrails, hydrate memory, initialize a tool session, run the agent loop, persist summaries, and return a structured response.

from dataclasses import dataclass
from typing import Any


@dataclass
class AgentRequest:
    user_id: str
    session_id: str | None
    message: str
    metadata: dict[str, Any]


@dataclass
class AgentResponse:
    session_id: str
    answer: str
    citations: list[dict[str, Any]]
    actions: list[dict[str, Any]]
    metadata: dict[str, Any]


def handle_request(req: AgentRequest) -> AgentResponse:
    # 1) Establish session
    session_id = req.session_id or new_session_id()

    # 2) Apply INPUT guardrails (block early if needed)
    filtered_message, gr_in = apply_guardrails(guardrails_client(), req.message, source="INPUT")
    if gr_in.get("intervened"):
        return AgentResponse(
            session_id=session_id,
            answer="Your request can’t be processed due to safety policies.",
            citations=[],
            actions=[],
            metadata={"guardrails": {"input": gr_in}},
        )

    # 3) Initialize tool session (for tool servers that require it)
    tool_session_id = initialize_tool_session()

    # 4) Hydrate long-term memory summary (keep it compact)
    summary = load_agent_summary(store(), req.user_id, session_id)
    if summary:
        filtered_message += "\n\nAgent memory (summary): " + summary

    # 5) Retrieve evidence and run the agent loop (tight budgets)
    loop_budget = 3
    citations: list[dict[str, Any]] = []
    actions: list[dict[str, Any]] = []

    for _ in range(loop_budget):
        query = rewrite_query(filtered_message)
        retrieved = retrieve(query, filters=req.metadata)
        context = build_context(retrieved)

        step = reasoner_llm().next_step(
            user_message=filtered_message,
            context=context,
            allowed_tools=tool_allowlist(),
        )

        if step.type == "final":
            citations = step.citations
            answer = step.answer
            break

        if step.type == "tool_call":
            validate_tool_call(step.tool_name, step.arguments, req.user_id)
            tool_result = tool_call(step.tool_name, step.arguments, session_id=tool_session_id)
            actions.append({"tool": step.tool_name, "result": tool_result})
            filtered_message += "\n\nTool result: " + safe_json(tool_result)

    # 6) Persist updated memory summary (async is fine)
    new_summary = summarize_for_memory(filtered_message, answer)
    write_agent_summary(store(), req.user_id, session_id, new_summary, updated_at=iso_now())

    # 7) Apply OUTPUT guardrails (don’t leak sensitive data)
    answer, gr_out = apply_guardrails(guardrails_client(), answer, source="OUTPUT")

    return AgentResponse(
        session_id=session_id,
        answer=answer,
        citations=citations,
        actions=actions,
        metadata={"guardrails": {"input": gr_in, "output": gr_out}},
    )

The big takeaway: agent behavior should be constrained by system code (budgets, allowlists, authz), not by “hoping the prompt is strong enough.”

Retrieval that actually works

Most RAG failures are retrieval failures wearing an LLM costume.

1) Prefer hybrid retrieval

Vector search is great for semantic similarity, but it misses:

exact identifiers,
error codes,
product/version strings,
proper nouns,
and “must match” phrases.

A reliable baseline is hybrid retrieval:

keyword/BM25 for exactness,
vectors for semantics,
metadata filters for correctness.

2) Use metadata filters early

Even perfect embeddings won’t save you if you retrieve the wrong edition.

Filter by things like:

product/version,
region/locale,
document type,
effective date,
access control labels.

3) Query rewriting is not optional

A user question is not always a good search query.

Example:

user: “Can I expense travel?”
better retrieval query: “travel expense policy eligible expenses exceptions receipts approval limit”

In production, you typically want the agent to create a search query (or several) and then retrieve.

4) Rerank if top‑k is noisy

If you retrieve 20 passages and 12 are “kinda related,” you’ll see:

diluted context,
token blowups,
worse answers.

A small reranker step can dramatically improve precision.

Context engineering (the underrated part)

The retrieval step isn’t finished when you get a list of chunks.

Your context builder should:

deduplicate near-identical chunks,
keep section titles + timestamps,
extract only the relevant span (not the entire page),
preserve stable source IDs for citations,
and respect a strict token budget.

A practical recipe:

retrieve k=20
rerank to top=6–8
extract salient spans (quotes)
build context with citations

A citation-friendly context format

[Source: doc-17 | “Refund Policy” | Section: Eligibility | Updated: 2025-01-10]
"Refunds are available within 30 days if …"

[Source: doc-23 | “Exceptions” | Section: Digital goods | Updated: 2024-11-02]
"Digital purchases are non-refundable unless …"

This makes it easy to:

cite sources in the final answer,
enforce “no citation → no claim,”
and debug retrieval issues.

Tool use: the difference between “agent” and “chatbot”

Tool use is where a lot of “agents” go sideways in production.

The safe pattern is:

the model proposes a tool call,
your system validates it (allowlist + schema + authz),
your system executes it,
the model receives the result,
the agent decides next steps.

A generic tool-call client (JSON‑RPC style)

This snippet shows a minimal pattern for a tool server with session headers and timeouts.

import os
import uuid

import httpx

TOOL_SERVER_URL = os.environ["TOOL_SERVER_URL"]


def call_tool_server(method: str, params: dict | None = None, session_id: str | None = None) -> tuple[dict, dict]:
    headers = {
        "Content-Type": "application/json",
        "Tool-Protocol-Version": "2024-01-01",
    }
    if session_id:
        headers["Tool-Session-Id"] = session_id

    body = {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": method,
        "params": params or {},
    }

    resp = httpx.post(TOOL_SERVER_URL, json=body, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json(), dict(resp.headers)


def initialize_tool_session() -> str | None:
    _, headers = call_tool_server("initialize")
    return headers.get("Tool-Session-Id")


def tool_call(name: str, arguments: dict, session_id: str) -> dict:
    result, _ = call_tool_server(
        "tools/call",
        params={"name": name, "arguments": arguments},
        session_id=session_id,
    )
    return result

This is not “agent logic”—it’s infrastructure. Keep it boring.

Tool allowlists and schemas

Before executing a tool call, validate:

tool name is in an allowlist,
arguments conform to a schema,
the user is authorized for the action,
budgets (max calls / max latency) aren’t exceeded.

That validation should happen outside the model.

Memory: short-term chat vs. long-term summaries

A common mistake is to keep appending the full conversation forever.

That creates:

token bloat,
privacy risk,
and “the model latched onto something from 40 turns ago.”

A more robust approach:

short-term memory: last N turns (recent, high-fidelity)
long-term memory: a periodically updated summary (compact, durable)

A generic long-term summary write/read pattern

The snippet below demonstrates a safe pattern:

store a summary keyed by user_id + session_id,
update it after each response,
read it at session start to prime the agent.

from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class AgentSummaryRecord:
    user_id: str
    session_id: str
    updated_at: str
    summary: str


class KeyValueStore:
    def put(self, key: dict[str, str], item: dict[str, Any]) -> None: ...
    def get(self, key: dict[str, str]) -> dict[str, Any] | None: ...


def write_agent_summary(store: KeyValueStore, user_id: str, session_id: str, summary: str, updated_at: str) -> None:
    record = AgentSummaryRecord(
        user_id=user_id,
        session_id=session_id,
        updated_at=updated_at,
        summary=summary,
    )
    store.put({"user_id": user_id, "session_id": session_id}, asdict(record))


def load_agent_summary(store: KeyValueStore, user_id: str, session_id: str) -> str | None:
    item = store.get({"user_id": user_id, "session_id": session_id})
    if not item:
        return None
    return str(item.get("summary") or "")

What belongs in the summary?

A good long-term summary is not a transcript. It’s:

user preferences (explicit),
stable facts the user confirmed,
open tasks,
and important constraints.

Avoid storing:

secrets,
raw documents,
PII that doesn’t need to persist.

Guardrails: prompt injection, data leaks, and safe tool calls

If your agent reads documents from the outside world (PDFs, web pages, tickets), assume those documents can contain hostile instructions.

Treat retrieved content as untrusted input

A simple, effective policy:

retrieved text may contain facts,
but it may not issue instructions,
and it may not override system rules.

Apply input/output guardrails as a service

Many orgs implement “guardrails” as a separate layer that:

screens user inputs,
screens model outputs,
optionally redacts/blocks content,
returns structured metadata (“intervened”, category, severity).

Here is a generic wrapper pattern:

import json
from typing import Any


class GuardrailsClient:
    def apply(self, *, content: str, source: str) -> dict[str, Any]:
        """source is typically 'INPUT' or 'OUTPUT'."""
        raise NotImplementedError


def apply_guardrails(guardrails: GuardrailsClient, payload: str | dict[str, Any], source: str) -> tuple[str | dict[str, Any], dict[str, Any]]:
    is_structured = isinstance(payload, dict)
    text = json.dumps(payload) if is_structured else payload

    resp = guardrails.apply(content=text, source=source)

    # Generic interpretation of a guardrails response
    action = str(resp.get("action", "NONE")).upper()
    filtered = resp.get("filtered_content", text)
    intervened = action in {"BLOCK", "INTERVENED"}

    resp["intervened"] = intervened

    if is_structured:
        try:
            return json.loads(filtered), resp
        except Exception:
            return {"raw_output": filtered}, resp

    return filtered, resp

Two practical tips:

If guardrails intervene, return a safe, deterministic response (don’t ask the LLM to “explain the policy violation”).
Run guardrails on tool outputs too if they can contain sensitive data.

Tool safety is guardrails + authorization

Guardrails can help with content risk, but tool safety requires:

server-side authorization,
immutable audit logs,
strict budgets.

Never rely on the model to “do the right thing.”

Verification: how you earn user trust

RAG agents gain adoption when users can verify.

Enforce “no citation → no claim”

A strong system rule:

If the agent can’t cite a source for a statement, it must label it as uncertainty or ask a follow-up.

Quote-first answering

A practical approach:

extract supporting quotes from retrieved sources,
write the answer in your own words,
attach citations.

This reduces hallucinations because the model is anchored to evidence.

Structured outputs for actions

When tools are involved, do not bury results inside prose.

Use an explicit response contract:

{
  "answer": "...",
  "citations": [
    {"source_id": "doc-17", "title": "Refund Policy", "section": "Eligibility", "quote": "..."}
  ],
  "actions": [
    {"type": "create_ticket", "status": "success", "ticket_id": "INC-456"}
  ],
  "confidence": "medium",
  "follow_ups": ["What was the purchase date?"]
}

That contract makes downstream UX and testing much easier.

Evaluation + observability

If you can’t measure it, you’ll end up debating prompts.

What to log (minimum viable traces)

For each request, capture:

rewritten search query (or queries),
retrieval results (source IDs + scores),
reranking results,
final context length,
tool calls (name + args hash + latency + status),
guardrails action metadata,
citations returned.

This is how you answer: “Why did it say that?”

What to measure (starter metrics)

Citation coverage: % of answers with ≥1 citation
Groundedness: evaluator score or “supported claims ratio”
Retrieval precision: are top citations actually relevant?
Escalation rate: how often the agent says “I don’t know” or hands off
Tool failure rate: how often tool calls fail/time out
Latency: p50/p95 end-to-end and retrieval/tool breakdown

Offline evaluation set

Build a small eval dataset (even 50–200 questions) with:

expected source documents,
disallowed sources,
expected follow-up questions,
red-team prompts for injection.

Iterate retrieval first, then prompting.

A shipping checklist

If you want a pragmatic sequence that reduces risk:

Ship RAG with citations (even if answers are short)
Add hybrid retrieval + metadata filtering
Add reranking if top‑k is noisy
Add a context builder (dedupe + span extraction)
Add guardrails (input + output)
Add tool runner (allowlist + schema + authz)
Add a tight agent loop (max 2–3 iterations)
Add verification (no citation → no claim)
Add tracing + offline evals

This order helps you avoid “agent chaos” before your foundations are stable.

Closing thoughts

A RAG agent is best thought of as a retrieval system with an LLM interface—not the other way around.

If you invest in retrieval quality, context building, tool safety, and verification, you get a system users trust.

If you skip those and jump straight to “agent prompts,” you get a system that demos well and pages you at 2am.

About the Author:

Written by Suraj Khaitan

— Gen AI Architect | Working on serverless AI & cloud platforms.

I Built 100+ Gen AI Agents: Architecture, Patterns, and Code You Can Reuse

Suraj Khaitan — Sat, 06 Dec 2025 14:47:37 +0000

If you’ve tinkered with Gen AI agents, you know the gap between cool demos and dependable systems is big. This article distills what actually works when going from a single-script agent to production-ready, multi-agent pipelines. It’s based on reusable patterns from typical agent service modules and use-case templates, adapted into generic snippets you can copy into your own stack.

Note: This is a vendor-agnostic, client-agnostic write-up. No company-specific details. All code is illustrative and production-friendly.

Why Agentic Architectures Matter
Autonomy without chaos: Agents plan, act, and reflect, but need guardrails.
Tool use is essential: Real utility comes from reliable integration with data, APIs, storage, and retrieval.
Memory and context: Short-term scratchpads plus durable episode/task memory improve success rates.
Orchestration beats monoliths: Separate concerns (planning, execution, observation, correction).
A Minimal Agent: Plan–Act–Observe–Reflect
This skeleton shows a single agent loop that plans, executes tools, observes results, and reflects to update its strategy.

from typing import Callable, Dict, Any, List

class Tool:
def init(self, name: str, runner: Callable[[Dict[str, Any]], Dict[str, Any]]):
self.name = name
self.run = runner

class Memory:
def init(self):
self.events: List[Dict[str, Any]] = []

def add(self, event: Dict[str, Any]):
    self.events.append(event)

def last(self, n: int = 5) -> List[Dict[str, Any]]:
    return self.events[-n:]

class Agent:
def init(self, planner: Callable[[str, List[Dict[str, Any]]], Dict[str, Any]],
reflector: Callable[[List[Dict[str, Any]]], str],
tools: Dict[str, Tool], memory: Memory):
self.planner = planner
self.reflector = reflector
self.tools = tools
self.memory = memory

def step(self, goal: str) -> Dict[str, Any]:
    plan = self.planner(goal, self.memory.last())
    tool_name = plan.get("tool")
    args = plan.get("args", {})
    result = self.tools.get(tool_name, Tool("noop", lambda _: {"error": "Unknown tool", "done": False})).run(args)
    event = {"goal": goal, "plan": plan, "result": result}
    self.memory.add(event)
    feedback = self.reflector(self.memory.last())
    return {"event": event, "feedback": feedback}

def run(self, goal: str, max_steps: int = 5) -> List[Dict[str, Any]]:
    trace = []
    for _ in range(max_steps):
        trace.append(self.step(goal))
        if trace[-1]["event"]["result"].get("done"):
            break
    return trace

Key idea: keep the loop simple and pure. Inject model/planner/reflector functions rather than hard-coding vendor calls.

Tools: Keep Interfaces Consistent
from typing import Dict, Any

def search_tool(params: Dict[str, Any]) -> Dict[str, Any]:
query = params.get("query", "")
# Replace with your search implementation (API, vector DB, etc.)
return {"items": [f"Result for: {query}"], "done": False}

def write_file_tool(params: Dict[str, Any]) -> Dict[str, Any]:
path = params.get("path")
content = params.get("content", "")
if not path:
return {"error": "Missing path", "done": False}
try:
with open(path, "w", encoding="utf-8") as f:
f.write(content)
return {"ok": True, "done": True}
except Exception as e:
return {"error": str(e), "done": False}
Plugging in LLMs for Planning and Reflection
Use any LLM provider. The important part is contract shape.

from typing import List, Dict, Any

def planner_llm(goal: str, recent_events: List[Dict[str, Any]]) -> Dict[str, Any]:
# Prompt craft is omitted; produce tool + args plan
# Simple heuristic plan (replace with LLM call)
if "write" in goal.lower():
return {"tool": "write_file", "args": {"path": "output.txt", "content": goal}}
return {"tool": "search", "args": {"query": goal}}

def reflector_llm(recent_events: List[Dict[str, Any]]) -> str:
# Summarize last results and propose improvements
return f"Reflect: {len(recent_events)} events processed. Consider narrowing the query or validating outputs."
Wire It Up
from agent_loop import Agent, Memory, Tool
from tools import search_tool, write_file_tool
from llm_adapters import planner_llm, reflector_llm

def build_agent() -> Agent:
tools = {
"search": Tool("search", search_tool),
"write_file": Tool("write_file", write_file_tool),
}
memory = Memory()
return Agent(planner_llm, reflector_llm, tools, memory)

if name == "main":
agent = build_agent()
trace = agent.run("Write a short note about agent patterns", max_steps=3)
for t in trace:
print(t)
Multi-Agent Pattern: Coordinator + Specialists
When tasks are complex, split into roles: Planner, Researcher, Implementer, Reviewer. The coordinator decomposes, routes, and reconciles.

from typing import Dict, Any, List
from agent_loop import Agent, Memory

class Coordinator:
def init(self, agents: Dict[str, Agent]):
self.agents = agents

def run(self, goal: str) -> List[Dict[str, Any]]:
    # naive decomposition; replace with LLM planner
    subtasks = [
        {"role": "researcher", "goal": f"Find info: {goal}"},
        {"role": "implementer", "goal": f"Draft output for: {goal}"},
        {"role": "reviewer", "goal": f"Check draft for: {goal}"},
    ]
    trace = []
    for st in subtasks:
        agent = self.agents.get(st["role"]) or self.agents.get("implementer")
        trace.append(agent.run(st["goal"], max_steps=2))
    return trace

def make_specialist(planner, reflector, tools) -> Agent:
return Agent(planner, reflector, tools, Memory())
Use Case Templates
Your codebase’s templates often include:

RAG agents: Retrieval-Augmented Generation using chunking, embeddings, and retrievers.
ReAct agents: Emphasizing step-by-step reasoning and tool use.
Text extraction agents: Focused on parsing documents and transforming unstructured data.
Example: a generic RAG tool for an agent.

from typing import Dict, Any, List

def rag_query(params: Dict[str, Any]) -> Dict[str, Any]:
question = params.get("question", "")
# Plug in your embedder, vector store, and reader components
# docs = retriever.search(question)
# answer = reader.synthesize(question, docs)
docs: List[str] = ["Doc A", "Doc B"]
answer = f"Answer synthesized for: {question} using {len(docs)} docs"
return {"answer": answer, "sources": docs, "done": True}
Then mount it as a tool:

from agent_loop import Agent, Memory, Tool
from llm_adapters import planner_llm, reflector_llm
from simple_rag_tool import rag_query

tools = {"rag": Tool("rag", rag_query)}
agent = Agent(planner_llm, reflector_llm, tools, Memory())
trace = agent.run("What is agentic RAG?", max_steps=1)
Guardrails and Safety
Input validation on tools (types, ranges, allowlists).
Sandboxed execution for file/network operations.
Rate limiting and circuit breakers for external APIs.
Observability: structured logs and traces per agent step.
Testing Strategy
Test agents like workflows:

Unit-test tools with deterministic inputs/outputs.
Mock LLM planners/reflectors to stabilize tests.
Scenario tests for end-to-end goals (success criteria + timeouts).
from agent_loop import Agent, Memory, Tool

def planner_stub(goal, _):
return {"tool": "echo", "args": {"text": goal}}

def reflector_stub(_):
return "reflect"

def echo_tool(params):
return {"echo": params.get("text", ""), "done": True}

def test_agent_runs_one_step():
tools = {"echo": Tool("echo", echo_tool)}
agent = Agent(planner_stub, reflector_stub, tools, Memory())
trace = agent.run("hello", max_steps=3)
assert trace[-1]["event"]["result"].get("done") is True
Deployment Tips
Package agents as stateless workers with externalized memory (DB/object store).
Use queues for long-running tasks; record step traces for resumability.
Keep prompts modular and versioned; migrate gradually.
Wrap-Up
Agentic systems shine when you architect for reliability, testability, and observability. Start with a clean loop, consistent tool interfaces, memory separation, and optional multi-agent coordination. Then plug in your LLM vendor and domain-specific tools. Template folders for agents (e.g., foundation, RAG, text extraction) are a solid foundation you can adapt.

10 Open-Source Agent Projects to Explore
Here are widely used, open-source agent frameworks and projects you can learn from and adapt. Each highlights different patterns: planning, tool use, multi-agent collaboration, memory, and orchestration.

Auto-GPT — Autonomous task-driven agent built on GPT models; showcases long-horizon planning and tool use. Link: https://github.com/Significant-Gravitas/AutoGPT
BabyAGI — Lightweight task management loop (create, prioritize, execute) with vector memory; great for understanding minimal agent cycles. Link: https://github.com/yoheinakajima/babyagi
Microsoft AutoGen — Framework for multi-agent conversations and collaboration with tooling and customization; strong for role-based agent teams. Link: https://github.com/microsoft/autogen
CrewAI — Python framework for multi-agent workflows with roles, tools, and processes; emphasizes structured collaboration. Link: https://github.com/joaomdmoura/crewai
LangGraph — Graph-based orchestration for agent loops, memory, and control; ideal for building reliable, inspectable agent pipelines. Link: https://github.com/langchain-ai/langgraph
LangChain Agents — Tool-using agents with planners, executors, and memory; integrates with a vast ecosystem of tools and vector DBs. Link: https://python.langchain.com/docs/modules/agents
OpenAI Agents SDK — Defines agents with tools and resources and handles orchestration; useful for standardized tool schemas and governance. Link: https://github.com/openai/openai-agents-python
CAMEL — Role-playing multi-agent framework with task decomposition and negotiation; useful for research on collaboration dynamics. Link: https://github.com/camel-ai/camel
AgentGPT (Web) — Browser-based autonomous agent setup for quick experiments; helpful to visualize prompts and iterative action loops. Link: https://github.com/reworkd/AgentGPT
ReAct Pattern Implementations — Combines reasoning traces with tool actions; many open implementations to learn prompt design and action validation. Link: https://arxiv.org/abs/2210.03629
Use these as references to pressure-test your design choices: planning reliability, tool APIs, memory schema, observability, and recovery strategies.

About the Author
Written by Suraj Khaitan — Gen AI Architect | Working on serverless AI & cloud platforms.

From Static Docs to Living Knowledge: Building an STS‑Aware Retrieval‑Augmented Agent Backend

Suraj Khaitan — Sun, 30 Nov 2025 14:14:19 +0000

We’ve all seen impressive GenAI demos. Yet, in day‑to‑day engineering, the questions are softer but more real: How do we keep answers trustworthy? How do we respect access boundaries without slowing teams down? This article offers a practical, human‑centered path—from raw documents and images to a secure, explainable knowledge layer—powered by session‑aware authorization (STS) and a simple agent + tools pattern.

Tone and structure are inspired by thoughtful architecture writing like “Architecture of AI‑Driven Systems” on Python Plain English, focusing on clarity, trade‑offs, and gentle guidance rather than hype.

Why This Matters

RAG without authorization is a liability. Enterprise data needs session‑scoped controls, revocation, and auditability.
Accuracy is not enough; answers must be explainable and reproducible across versions.
Multimodal inputs (PDFs, images) require consistent ingestion and normalization before indexing.

Architecture Snapshot

Knowledge Base Services: ingestion, chunking, embedding, indexing (vector + graph), retrieval, and an STS manager for authorization.
Agent Services: an agent wrapper orchestrates LLMs, tools, and guardrails; file upload and history modules support UX continuity.
Tool Services: domain tools (retriever, SQL, custom) invoked by agents.

Flow: Upload → Initialize → Read/Image2Text → Chunk → Embed → Index (Vector + Graph) → Retrieve → STS Filter → Agent Compose → Response with citations.

RAG Architecture :

![RAG Architecture]
![ ](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/skbtam21zt7piw76ey6f.png

Multimodal Ingestion

Key components:

Reading: parse PDFs/text, normalize content, attach metadata (doc_id, page).
Image‑to‑Text: extract text from images, unify format.
Initialization: bootstraps pipelines, configs, and version stamps.

Design tips:

Normalize MIME and metadata early; downstream pipelines assume clean structure.
Batch I/O with retries; track ingestion version to reproduce embeddings.

Smart Chunking for Better Retrieval

Chunking: semantic and rule‑based chunkers.
Keep chunks small enough for LLM context but rich in metadata (section, page, hierarchy).
Add relationship edges to support graph queries (e.g., section→subsection).

Embeddings + Dual Indexes

Embeddings: choose model, normalize vectors, stamp versions.
Vector Indexing: push to a vector store for semantic search.
Graph Indexing: persist relationships and provenance.

Why two indexes?

Vector search finds semantically related content.
Graph retrieves lineage and context (citations, related sections), improving explainability.

Retrieval Orchestration

Vector and Graph Retrievers: specialized retrievers.
Hybrid Retrieval Orchestrator: fuses results from both stores.

Pattern:

Try semantic (vector) for recall.
Expand via graph for context and provenance.
Fuse, rank, and return with metadata for STS filtering.

STS‑Aware Authorization

STS Manager: resolves session→permissions, applies policies, and filters retrieval candidates.
Enforce authorization before the agent composes answers; never let tools see disallowed content.

Benefits:

Session‑scoped access, policy revocation, and audit trails.
Prevents prompt injection using forbidden context.

Agents + Tools: The Execution Layer

Agent Wrapper: wires LLM prompts, tools, and guardrails; manages tool selection.
Tools: retriever and SQL tools for controlled data access.
Compose answers with citations sourced from graph metadata.

Execution pattern:

Agent decides → Tool executes → STS filters → Agent composes → Return answer + sources.

Observability, Versioning, and Deletion

knowledge_base_services/deletion/: right‑to‑be‑forgotten and data lifecycle.
agent_services/history_services/: conversational trace for monitoring and explainability.
Index/embedding version stamps to reproduce runs.

Example Flow (Generic Pseudo‑Code)

# 1) Ingest + Normalize
content_items = reading.read_batch(files)
image_text = image_to_text.extract(images)
normalized = initializer.normalize(content_items + image_text)

# 2) Chunk + Embed
chunks = chunker.semantic(normalized)
vecs = embedder.batch_embed(chunks)

# 3) Index (Vector + Graph)
vector_index.write(vecs, chunks)
graph_index.link(chunks, relations=...)

# 4) Retrieve with STS filter
candidates = hybrid_retriever.search(query, k=10)
authorized = sts.filter(session, candidates)

# 5) Agent + Tools compose
answer = agent.run(
  query=query,
  tools=[retriever_tool, sql_tool],
  context=authorized,
  with_citations=True
)

Gentle guidance: keep module names and interfaces simple. Start with clear, testable boundaries—ingest, chunk, embed, index, retrieve, filter, compose—and iterate. Good names reduce cognitive load and make onboarding kinder.

What to Showcase in the Post

Ingestion dispatch by MIME and metadata (reading, image‑to‑text, initialization).
Semantic chunker attaching rich metadata (chunking).
Batched embeddings + vector indexing with versioned names (embeddings, vector index).
Hybrid retrieval orchestrator with fusion and fallbacks (vector retriever, graph retriever).
STS filter gating results before agent sees them (STS manager).
Agent tool wiring and citation composition (agent wrapper, tools).

Benchmarks and Learnings

Track latency across stages: ingestion, embedding, indexing, retrieval, STS filtering, agent composition.
Measure precision@k and citation correctness.
Common pitfalls: over‑aggressive chunking, stale embeddings after content updates, authorization drift.

Quick Demo Hooks

Consider adding a minimal script that:

Loads a sample doc + image.
Runs ingestion→chunk→embed→index.
Executes a hybrid retrieval for a test query.
Applies STS filter for two different sessions.
Prints answer with citations and filtered item counts.

Optional starting point:

# Create a tiny virtual environment and run the demo
python -m venv .venv; .\.venv\Scripts\Activate.ps1
python demo/sts_rag_demo.py

Closing Checklist for Enterprise‑Grade RAG

Ingestion discipline with consistent metadata.
Chunking strategy matched to content structure.
Dual index (vector + graph) for recall + explainability.
Retrieval orchestration with fusion and fallbacks.
STS enforcement before agent composition.
Observability: versions, histories, and deletion paths.

About the Author

Written by Suraj Khaitan
— Gen AI Architect | Working on serverless AI & cloud platforms.