<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: WonderLab</title>
    <description>The latest articles on DEV Community by WonderLab (@wonderlab).</description>
    <link>https://dev.to/wonderlab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3797373%2F25beba30-d8d4-4d2e-9ec6-170356089350.jpg</url>
      <title>DEV Community: WonderLab</title>
      <link>https://dev.to/wonderlab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wonderlab"/>
    <language>en</language>
    <item>
      <title>RAG Series (4): Document Processing — From Raw Files to High-Quality Chunks</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Sat, 02 May 2026 03:43:58 +0000</pubDate>
      <link>https://dev.to/wonderlab/rag-series-4-document-processing-from-raw-files-to-high-quality-chunks-4ec0</link>
      <guid>https://dev.to/wonderlab/rag-series-4-document-processing-from-raw-files-to-high-quality-chunks-4ec0</guid>
      <description>&lt;h2&gt;
  
  
  Why "How You Cut" Matters as Much as "What You Cut"
&lt;/h2&gt;

&lt;p&gt;In the first three articles, we built a working RAG pipeline and tuned the core parameters. But if you look closely at the retrieval results, you may notice a strange phenomenon:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The answer is clearly in the document, yet the Retriever can't find it. Or it finds it, but the answer is cut in half — the LLM only sees the first half of the sentence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem usually lies in the &lt;strong&gt;chunking&lt;/strong&gt; step.&lt;/p&gt;

&lt;p&gt;Chunking is essentially an &lt;strong&gt;information splitting strategy&lt;/strong&gt; — how you divide a 500-page book, how large each piece is, and where you make the cuts directly determine whether the reader (here, the Retriever) can quickly find what they need.&lt;/p&gt;

&lt;p&gt;In this article, we'll process the &lt;strong&gt;same technical document&lt;/strong&gt; with four different strategies so you can see the dramatic differences that "how you cut" makes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📎 &lt;strong&gt;Source Code&lt;/strong&gt;: All experiment code is open-sourced at &lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/04-chunking-strategies" rel="noopener noreferrer"&gt;&lt;code&gt;llm-in-action/04-chunking-strategies&lt;/code&gt;&lt;/a&gt;. Clone it to reproduce the results.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Four Chunking Strategies at a Glance
&lt;/h2&gt;

&lt;p&gt;Before diving in, here's a quick reference table to build intuition:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Core Idea&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fixed Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cut at fixed character intervals, like scissors cutting paper&lt;/td&gt;
&lt;td&gt;Simple, uniform chunk sizes&lt;/td&gt;
&lt;td&gt;May cut through sentences, poor semantic integrity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recursive Character&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Try separators in priority order: paragraph → line → sentence → word&lt;/td&gt;
&lt;td&gt;Balances semantics and uniformity&lt;/td&gt;
&lt;td&gt;Limited Chinese support (uses English punctuation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic Chunking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compute semantic similarity between adjacent sentences, cut where similarity drops&lt;/td&gt;
&lt;td&gt;Highly semantically coherent chunks&lt;/td&gt;
&lt;td&gt;Requires Embedding API, higher cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Document Structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Split by Markdown/HTML heading hierarchy&lt;/td&gt;
&lt;td&gt;Preserves document structure, retrieved chunks carry chapter context&lt;/td&gt;
&lt;td&gt;Only works for structured documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Experimental Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test Document and Source Code
&lt;/h3&gt;

&lt;p&gt;The full runnable code is available at &lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/04-chunking-strategies" rel="noopener noreferrer"&gt;&lt;code&gt;llm-in-action/04-chunking-strategies&lt;/code&gt;&lt;/a&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;chunking_compare.py&lt;/code&gt; — The 4-strategy comparison script&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;data/sample-tech-doc.md&lt;/code&gt; — Sample Markdown technical document&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.env.example&lt;/code&gt; — Environment variable template (SemanticChunker requires an Embedding API)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Test Document
&lt;/h4&gt;

&lt;p&gt;We'll use a ~5,400-character Markdown technical document titled "Microservices Architecture Design Guide," containing 7 top-level chapters with multiple level-2 and level-3 headings, covering service decomposition, communication protocols, data consistency, observability, security, and deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy Configurations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Key Configuration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed Size&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CharacterTextSplitter(chunk_size=512, chunk_overlap=50)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recursive Character&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50, separators=["\n\n", "\n", ". ", " ", ""])&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SemanticChunker(embeddings, breakpoint_threshold_type="percentile", breakpoint_threshold_amount=85, sentence_split_regex=r"(?&amp;lt;=[。！？.?!])\s+", buffer_size=0)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Structure&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")])&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;About &lt;code&gt;buffer_size=0&lt;/code&gt;: SemanticChunker defaults to concatenating neighboring sentences before computing embeddings (&lt;code&gt;buffer_size=1&lt;/code&gt; means 1 sentence on each side). But SiliconFlow's BGE model limits single inputs to &amp;lt; 512 tokens, so concatenation often exceeds this. Setting it to 0 makes each sentence independent — we lose some context, but it runs stably.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Strategy 1: Fixed-Size Chunking
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Principle
&lt;/h3&gt;

&lt;p&gt;The most brute-force approach: cut at fixed-length intervals regardless of content.&lt;/p&gt;

&lt;p&gt;Imagine using scissors to snip every 512 characters. Simple and efficient, but you might cut right through the middle of a sentence.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CharacterTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Prefer line breaks; hard-cut if none exist
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk count&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average length&lt;/td&gt;
&lt;td&gt;453.5 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max length&lt;/td&gt;
&lt;td&gt;506 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min length&lt;/td&gt;
&lt;td&gt;128 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;First 3 chunks:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Chunk 1 (489 chars):
&lt;span class="gh"&gt;# Microservices Architecture Design Guide This article covers...&lt;/span&gt;

Chunk 2 (504 chars):
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Read Service vs Write Service**&lt;/span&gt;: In read-heavy scenarios...

Chunk 3 (457 chars):
&lt;span class="gs"&gt;**gRPC**&lt;/span&gt; is based on HTTP/2 and Protocol Buffers. Advantages:...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Notice how Chunk 2 starts: &lt;code&gt;- **Read Service vs Write Service**...&lt;/code&gt; — this is the middle of a list item. Fixed-size chunking brutally cut off the list at the end of the previous chunk, so Chunk 2 starts with an incomplete list item. If the user asks "What are the advantages of read-write separation?", the Retriever might return this chunk, but the LLM sees incomplete information.&lt;/p&gt;
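<br></br>
&lt;p&gt;You can spot such broken chunks programmatically with a rough heuristic. Here's a small sketch (our own check, not part of the comparison script) that flags chunks whose first character looks like a mid-sentence or mid-list continuation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough heuristic (ours): flag chunks whose first character suggests
# they start mid-sentence or mid-list item
def flag_broken_starts(chunks):
    suspects = []
    for i, chunk in enumerate(chunks):
        first = chunk.page_content.lstrip()[:1]
        # lowercase letters and continuation punctuation are red flags
        if first and (first.islower() or first in "-*,;)"):
            suspects.append(i)
    return suspects

print(flag_broken_starts(chunks))  # indices worth inspecting by eye
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;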




&lt;h2&gt;
  
  
  Strategy 2: Recursive Character Chunking
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Principle
&lt;/h3&gt;

&lt;p&gt;Slightly smarter than fixed-size: it has a priority list of separators and tries them in order — first by paragraph (&lt;code&gt;\n\n&lt;/code&gt;), then by line (&lt;code&gt;\n&lt;/code&gt;), then by sentence (&lt;code&gt;. &lt;/code&gt;), then by word (&lt;code&gt; &lt;/code&gt;), and finally character by character (&lt;code&gt;""&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Like an experienced editor: prefer cutting at paragraph boundaries, fall back to sentence boundaries if necessary, and never cut in the middle of a word.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk count&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average length&lt;/td&gt;
&lt;td&gt;431.5 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max length&lt;/td&gt;
&lt;td&gt;507 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min length&lt;/td&gt;
&lt;td&gt;88 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;First 3 chunks:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Chunk 1 (441 chars):
&lt;span class="gh"&gt;# Microservices Architecture Design Guide  This article covers...&lt;/span&gt;

Chunk 2 (452 chars):
&lt;span class="gu"&gt;### 1.2 Split by Technical Characteristics  Besides business boundaries...&lt;/span&gt;

Chunk 3 (457 chars):
The most common synchronous communication methods between microservices are...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Improvement over fixed-size:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chunk 2 now starts with &lt;code&gt;### 1.2 Split by Technical Characteristics&lt;/code&gt; — a complete heading. Recursive character chunking successfully cut at a heading boundary instead of slicing through a list item.&lt;/p&gt;

&lt;p&gt;But note that the &lt;code&gt;separators&lt;/code&gt; list uses &lt;code&gt;.&lt;/code&gt; (English period + space), so for Chinese documents, it won't split on Chinese periods (。). Its behavior on Chinese text is therefore close to fixed-size, relying mainly on &lt;code&gt;\n\n&lt;/code&gt; and &lt;code&gt;\n&lt;/code&gt;.&lt;/p&gt;
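<br></br>
&lt;p&gt;If your documents are Chinese, the usual fix is to put full-width Chinese punctuation ahead of the English period in the separator list. A minimal sketch (assuming the corpus uses 。！？ as sentence terminators):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import RecursiveCharacterTextSplitter

# Sketch: full-width Chinese punctuation added ahead of ". " so Chinese
# sentence endings also become valid cut points
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", "。", "！", "？", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;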




&lt;h2&gt;
  
  
  Strategy 3: Semantic Chunking
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Principle
&lt;/h3&gt;

&lt;p&gt;The previous two strategies cut by length. Semantic chunking cuts by meaning.&lt;/p&gt;

&lt;p&gt;Here's how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Split the document into sentences&lt;/li&gt;
&lt;li&gt;Compute each sentence's embedding (semantic vector)&lt;/li&gt;
&lt;li&gt;Compare semantic similarity between adjacent sentences&lt;/li&gt;
&lt;li&gt;If similarity suddenly drops (below the threshold), cut there&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Imagine watching a movie where the scene suddenly shifts from an office to a beach — that's a semantic boundary. Semantic chunking recognizes these "scene changes" and ensures each chunk discusses one coherent topic.&lt;/p&gt;
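<br></br>
&lt;p&gt;Before looking at LangChain's implementation below, here is the core computation in about ten lines (a sketch only: &lt;code&gt;embed&lt;/code&gt; stands for any function mapping sentences to vectors, and the real SemanticChunker adds buffering and several threshold types on top):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def semantic_breakpoints(sentences, embed, percentile=85):
    """Sketch: return indices where a new chunk should start."""
    vecs = np.asarray(embed(sentences), dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)  # neighbor cosine similarity
    dists = 1.0 - sims                         # semantic "distance"
    cut = np.percentile(dists, percentile)     # e.g. the 85th percentile
    return [i + 1 for i, d in enumerate(dists) if d &amp;gt; cut]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;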

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_experimental.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SemanticChunker&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;

&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-large-zh-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMBEDDING_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMBEDDING_API_BASE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.siliconflow.cn/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# SiliconFlow limits batch_size to 32
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Key: Custom Chinese sentence-splitting regex, or SemanticChunker defaults to English punctuation only
&lt;/span&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticChunker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;breakpoint_threshold_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;percentile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;breakpoint_threshold_amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sentence_split_regex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?&amp;lt;=[。！？.?!])\s+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Avoid exceeding 512-token limit when concatenating sentences
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pitfalls We Hit
&lt;/h3&gt;

&lt;p&gt;Implementing semantic chunking, we ran into three issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall 1: Batch size exceeded&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: input batch size 1000 &amp;gt; maximum allowed batch size 32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Fix: &lt;code&gt;OpenAIEmbeddings(chunk_size=32)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall 2: Single-input token limit exceeded&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error code: 413 - input must have less than 512 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Fix: Set &lt;code&gt;buffer_size=0&lt;/code&gt; to prevent SemanticChunker from concatenating neighboring sentences&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall 3: Empty strings cause 400 errors&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error code: 400 - The parameter is invalid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Fix: Subclass &lt;code&gt;SemanticChunker&lt;/code&gt; and override &lt;code&gt;_get_single_sentences_list&lt;/code&gt; to filter empty strings&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FilteredSemanticChunker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SemanticChunker&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_single_sentences_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sentence_split_regex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk count&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;9&lt;/strong&gt; (fewest)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average length&lt;/td&gt;
&lt;td&gt;590.9 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max length&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2047 chars&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min length&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17 chars&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key Finding:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Semantic chunking produces the fewest chunks (9), but with extreme size variation — smallest 17 chars, largest 2047 chars. This confirms it's truly grouping by semantic boundaries: semantically similar sentences are merged into large chunks, while topic transitions become tiny chunks.&lt;/p&gt;

&lt;p&gt;For example, the entire "Service Communication" chapter (REST vs gRPC vs message queues) was aggregated into one 1,189-character chunk — because it all discusses the same topic. Transition sentences between chapters became tiny fragments (like a 28-character decision tree snippet).&lt;/p&gt;




&lt;h2&gt;
  
  
  Strategy 4: Document Structure Chunking (Markdown Header)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Principle
&lt;/h3&gt;

&lt;p&gt;The first three strategies work "blindfolded" — they don't know the document structure and rely purely on text features. Document structure chunking, in contrast, keeps its eyes open: it recognizes Markdown &lt;code&gt;#&lt;/code&gt;, &lt;code&gt;##&lt;/code&gt;, &lt;code&gt;###&lt;/code&gt; headings and splits strictly by heading hierarchy.&lt;/p&gt;

&lt;p&gt;Each chunk's boundary is a heading boundary: starts at one heading, ends before the next heading at the same or higher level.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MarkdownHeaderTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MarkdownHeaderTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;headers_to_split_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;##&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;###&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;strip_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Keep headings inside chunk content
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk count&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;20&lt;/strong&gt; (most)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average length&lt;/td&gt;
&lt;td&gt;266.5 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max length&lt;/td&gt;
&lt;td&gt;402 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min length&lt;/td&gt;
&lt;td&gt;71 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key Finding:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Document structure chunking produces the most chunks (20), but each one carries an "ID card" — metadata recording the chain of headings it falls under:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Microservices Architecture Design Guide&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1. Service Decomposition Strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.1 Split by Business Boundary (DDD)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means during retrieval, you get not just the content but also its chapter origin. This is extremely valuable for &lt;strong&gt;citation tracing&lt;/strong&gt; ("The answer comes from Chapter X of the document").&lt;/p&gt;
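<br></br>
&lt;p&gt;A hypothetical helper (our own, not from the library) that turns this metadata into a citation string:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cite(chunk):
    """Hypothetical helper: build a citation from the header metadata."""
    headers = [v for k, v in sorted(chunk.metadata.items())
               if k.startswith("Header")]
    return "Source: " + " &amp;gt; ".join(headers)

# e.g. "Source: Microservices Architecture Design Guide &amp;gt; 1. Service
# Decomposition Strategy &amp;gt; 1.1 Split by Business Boundary (DDD)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;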




&lt;h2&gt;
  
  
  Side-by-Side Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Statistics Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Chunks&lt;/th&gt;
&lt;th&gt;Avg Length&lt;/th&gt;
&lt;th&gt;Median&lt;/th&gt;
&lt;th&gt;Max&lt;/th&gt;
&lt;th&gt;Min&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed Size&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;453.5&lt;/td&gt;
&lt;td&gt;476.5&lt;/td&gt;
&lt;td&gt;506&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recursive Character&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;431.5&lt;/td&gt;
&lt;td&gt;457.0&lt;/td&gt;
&lt;td&gt;507&lt;/td&gt;
&lt;td&gt;88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;590.9&lt;/td&gt;
&lt;td&gt;422.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2047&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Structure&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;266.5&lt;/td&gt;
&lt;td&gt;259.0&lt;/td&gt;
&lt;td&gt;402&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Retrieval Difference for the Same Query
&lt;/h3&gt;

&lt;p&gt;Suppose the user asks: &lt;strong&gt;"What are the anti-patterns of microservice decomposition?"&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Retrieved Chunk&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed Size&lt;/td&gt;
&lt;td&gt;Chunk 4 (contains partial anti-pattern content, but starts mid-sentence)&lt;/td&gt;
&lt;td&gt;List item starts in the middle; LLM lacks full context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recursive Character&lt;/td&gt;
&lt;td&gt;Chunk 5 (fully contains "1.3 Common Anti-patterns" section)&lt;/td&gt;
&lt;td&gt;Good, but may truncate if the section is long&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic&lt;/td&gt;
&lt;td&gt;Chunk 3 (aggregates anti-patterns + some following content)&lt;/td&gt;
&lt;td&gt;May include irrelevant content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Structure&lt;/td&gt;
&lt;td&gt;Chunk 6 (exactly matches "### 1.3 Common Anti-patterns")&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Best&lt;/strong&gt; — precise structural match&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Strategy Selection Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended Strategy&lt;/th&gt;
&lt;th&gt;Reasoning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;General technical docs (PDF/Word)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Recursive Character&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Most reliable baseline, no special formatting required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown / Papers / Books&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Document Structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Preserves chapter structure, retrievable with provenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminology-dense docs (legal/medical)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Semantic Chunking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semantically coherent chunks, reduces cross-topic noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ultra-high-speed chunking (real-time)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fixed Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero computation overhead, pure string operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code documentation&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Recursive Character&lt;/strong&gt; + custom separators&lt;/td&gt;
&lt;td&gt;Split by function/class boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Selection Advice
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: Start with recursive character chunking as your baseline
    ↓
Step 2: If documents are Markdown/HTML, try document structure chunking
    ↓
Step 3: If retrieval quality is unsatisfactory, upgrade to semantic chunking
         (highest cost but best quality)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
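<br></br>
&lt;p&gt;If you prefer the same flow as code, here's an illustrative dispatcher (&lt;code&gt;pick_splitter&lt;/code&gt; is our own name, not a library function):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

def pick_splitter(path: str):
    """Illustrative dispatcher for the steps above (hypothetical helper)."""
    if path.endswith((".md", ".markdown")):
        # Step 2: structure-aware splitting (note: this splitter exposes
        # split_text, not split_documents)
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"),
                                 ("###", "Header 3")],
            strip_headers=False,
        )
    # Step 1 baseline; upgrade to SemanticChunker only if retrieval
    # quality is still unsatisfactory (Step 3)
    return RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;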






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This article used the same document and four strategies to show you how "how you cut" affects RAG quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fixed Size&lt;/strong&gt;: Simple but brutal. Good for rapid prototyping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive Character&lt;/strong&gt;: The most universal baseline. Sufficient for 80% of scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Chunking&lt;/strong&gt;: Best quality but highest cost. Use when precision is critical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Structure&lt;/strong&gt;: Best choice for structured documents. Retrieved chunks carry built-in context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; There is no perfect chunking strategy — only the strategy that fits your document type and business scenario. In real projects, use the comparison script from this article, run it on your own documents, and let the data guide your decision.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>llm</category>
      <category>chunk</category>
    </item>
    <item>
      <title>RAG Series (3): Tuning These 4 Parameters to Go From 'It Works' to 'It Works Well'</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Sat, 02 May 2026 02:49:49 +0000</pubDate>
      <link>https://dev.to/wonderlab/rag-series-3-tuning-these-4-parameters-to-go-from-it-works-to-it-works-well-2kp6</link>
      <guid>https://dev.to/wonderlab/rag-series-3-tuning-these-4-parameters-to-go-from-it-works-to-it-works-well-2kp6</guid>
      <description>&lt;h2&gt;
  
  
  Why Does Your RAG Give Wrong Answers When Someone Else's Doesn't?
&lt;/h2&gt;

&lt;p&gt;In the first two articles, we built a RAG pipeline that runs. But many people find that while the code works, &lt;strong&gt;answer quality is inconsistent&lt;/strong&gt; — sometimes spot-on, sometimes missing information that's clearly in the document, sometimes drifting off-topic even when the right chunks were retrieved.&lt;/p&gt;

&lt;p&gt;The problem is usually not the code. It's the &lt;strong&gt;parameters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;RAG has four core parameters, like four knobs on a radio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chunk Size&lt;/strong&gt;: How long is each text chunk?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk Overlap&lt;/strong&gt;: How much do adjacent chunks overlap?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-K&lt;/strong&gt;: How many chunks does the retriever return?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding Model&lt;/strong&gt;: How is text converted into vectors?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of these four parameters directly determines whether the system can find relevant information and whether that information is enough to answer the question. In this article, we'll use a &lt;strong&gt;controlled-variable experiment&lt;/strong&gt; so you can see the effect of different parameters with your own eyes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parameter 1: Chunk Size — How Long Is Each Chunk?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is Chunk Size?
&lt;/h3&gt;

&lt;p&gt;Imagine you're organizing a 500-page technical manual. Chunk Size is how many pages you read at a time — 1 page, 5 pages, or 50 pages?&lt;/p&gt;

&lt;p&gt;In RAG, Chunk Size is the &lt;strong&gt;maximum number of characters&lt;/strong&gt; (or tokens) in each text chunk. The document is cut into many chunks, each no longer than this limit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Does It Matter?
&lt;/h3&gt;

&lt;p&gt;Chunk Size directly impacts two metrics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunk Size&lt;/th&gt;
&lt;th&gt;Retrieval Precision&lt;/th&gt;
&lt;th&gt;Context Completeness&lt;/th&gt;
&lt;th&gt;Analogy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Small (128)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;td&gt;Like reading dictionary entries — precise but isolated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium (512)&lt;/td&gt;
&lt;td&gt;Balanced&lt;/td&gt;
&lt;td&gt;Balanced&lt;/td&gt;
&lt;td&gt;Like reading a paragraph — enough context without bloat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large (2048)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Like reading an entire chapter — complete but noisy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What's wrong with too small?&lt;/strong&gt; Suppose the document says: "The system uses Redis for caching with a default TTL of 3600 seconds. If this timeout is exceeded, data is automatically purged." If Chunk Size=128, this sentence might be split into two chunks: "The system uses Redis for caching with a default TTL of 3600 seconds." and "If this timeout is exceeded, data is automatically purged." When the user asks "What happens when Redis cache expires?", the retriever might only return the first chunk. The LLM sees "3600 seconds" but doesn't know about "automatically purged" — the answer is incomplete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's wrong with too large?&lt;/strong&gt; Suppose Chunk Size=2048, and one chunk contains five unrelated topics. When the user asks a specific question, this chunk gets retrieved, but the LLM's attention is scattered by irrelevant content — like trying to hear one person speak in a noisy marketplace.&lt;/p&gt;
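<br></br>
&lt;p&gt;You can reproduce the "too small" failure in a few lines. A sketch (the sizes are chosen so the cut lands exactly between the two sentences):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import CharacterTextSplitter

text = ("The system uses Redis for caching with a default TTL of 3600 seconds. "
        "If this timeout is exceeded, data is automatically purged.")
splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0, separator=". ")
for c in splitter.split_text(text):
    print(repr(c))
# The TTL and the purge behavior land in different chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;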

&lt;h3&gt;
  
  
  How to Choose?
&lt;/h3&gt;

&lt;p&gt;There's no silver bullet, but there's a rule of thumb:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk Size ≈ 1.5 ~ 2 × the expected answer length
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Document Type&lt;/th&gt;
&lt;th&gt;Recommended Chunk Size&lt;/th&gt;
&lt;th&gt;Reasoning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FAQ / Q&amp;amp;A pairs&lt;/td&gt;
&lt;td&gt;256 ~ 384&lt;/td&gt;
&lt;td&gt;Short answers, precision matters more&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical docs / API manuals&lt;/td&gt;
&lt;td&gt;512 ~ 768&lt;/td&gt;
&lt;td&gt;Medium-length answers, need some context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Papers / book chapters&lt;/td&gt;
&lt;td&gt;1024 ~ 1536&lt;/td&gt;
&lt;td&gt;Argument-heavy, need large context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal contracts / medical records&lt;/td&gt;
&lt;td&gt;768 ~ 1024&lt;/td&gt;
&lt;td&gt;Dense terminology, need inference from context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heuristic&lt;/strong&gt;: Start with 512, then observe retrieval results. If you notice "answers are cut off", increase it. If you notice "retrieved chunks contain too much irrelevant content", decrease it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Parameter 2: Chunk Overlap — How Much Should Adjacent Chunks Overlap?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is Chunk Overlap?
&lt;/h3&gt;

&lt;p&gt;Back to that technical manual. If you read 5 pages at a time, Overlap is how many pages from the previous chunk you keep when starting the next one. For example, Overlap=1 means: first read pages 1-5, then read pages 5-9 (page 5 appears twice).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Is Overlap Needed?
&lt;/h3&gt;

&lt;p&gt;Without overlap, critical information can get "cut at the seam":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk A: "The system uses Redis for caching with a default TTL of 3600 seconds."
Chunk B: "If this timeout is exceeded, data is automatically purged."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the user asks "What happens when Redis cache expires?", the embedding model might score Chunk B as more relevant (its wording about the timeout being exceeded sits closest to "expires") and return only Chunk B. But Chunk B starts with "If this timeout is exceeded" — without Chunk A, the LLM doesn't know what "this timeout" refers to.&lt;/p&gt;

&lt;p&gt;With Overlap=50, Chunk B starts with the last 50 characters of Chunk A:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk B (with overlap): "...default TTL of 3600 seconds. If this timeout is exceeded, data is automatically purged."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now even if only Chunk B is retrieved, the LLM can infer "this timeout = 3600 seconds".&lt;/p&gt;
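<br></br>
&lt;p&gt;Here's a runnable sketch of the difference. With no newlines in the text, the splitter falls back to word-level splits, which makes the overlap window easy to see:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import RecursiveCharacterTextSplitter

text = ("The system uses Redis for caching with a default TTL of 3600 seconds. "
        "If this timeout is exceeded, data is automatically purged.")
for overlap in (0, 50):
    splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=overlap)
    print(f"overlap={overlap}:", splitter.split_text(text))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;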

&lt;h3&gt;
  
  
  How Much Overlap?
&lt;/h3&gt;

&lt;p&gt;Generally set to &lt;strong&gt;10% ~ 20%&lt;/strong&gt; of Chunk Size:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunk Size&lt;/th&gt;
&lt;th&gt;Recommended Overlap&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;25 ~ 50&lt;/td&gt;
&lt;td&gt;Short text, small overlap preserves context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;50 ~ 100&lt;/td&gt;
&lt;td&gt;The sweet spot for general use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;100 ~ 200&lt;/td&gt;
&lt;td&gt;Long text needs more overlap to preserve continuity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: More overlap is not always better. Too much overlap leads to storing massive amounts of duplicate content in the vector database, increasing storage cost and deduplication burden during retrieval.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Parameter 3: Top-K — How Many Chunks to Retrieve?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is Top-K?
&lt;/h3&gt;

&lt;p&gt;Top-K is the number of text chunks the retriever returns each time. K=4 means "give me the 4 most relevant chunks", K=10 means "give me the 10 most relevant chunks".&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Does It Matter?
&lt;/h3&gt;

&lt;p&gt;K too small = missing information. K too large = introducing noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario A: K=2, missing critical information&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;User asks: "How do I configure the database connection pool and log level?" This question involves two topics. If K=2, the retriever might only return two chunks about "database connection pool" and completely miss "log level" — the LLM can only answer half the question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario B: K=20, noise drowning out the answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;User asks: "What's the default timeout?" The document has a clear answer. But K=20 retrieves 20 chunks, 19 of which discuss unrelated topics. The LLM's context window is filled with irrelevant content, and it can't find that simple number.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Choose?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Top-K = Number of topics the answer is expected to cover × 2 ~ 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query Type&lt;/th&gt;
&lt;th&gt;Recommended K&lt;/th&gt;
&lt;th&gt;Reasoning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single-point fact ("What's the default port?")&lt;/td&gt;
&lt;td&gt;3 ~ 5&lt;/td&gt;
&lt;td&gt;Focused answer, fewer is better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-condition ("How do I configure A and B?")&lt;/td&gt;
&lt;td&gt;5 ~ 8&lt;/td&gt;
&lt;td&gt;Might involve multiple topics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Comprehensive summary ("Summarize Chapter 3")&lt;/td&gt;
&lt;td&gt;8 ~ 12&lt;/td&gt;
&lt;td&gt;Need to cover multiple points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heuristic&lt;/strong&gt;: Start with K=4. If you notice "the answer is missing a part", increase it. If you notice "the answer contains irrelevant content", decrease it.&lt;/p&gt;
&lt;/blockquote&gt;
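<br></br>
&lt;p&gt;In LangChain, K is just a retriever argument, so adjusting it is a one-line change. A sketch, assuming &lt;code&gt;vectorstore&lt;/code&gt; is the Chroma index built earlier in this series:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# K is passed via search_kwargs when turning a vector store into a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Re-ask the same question with a wider K to compare answers
retriever_wide = vectorstore.as_retriever(search_kwargs={"k": 8})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;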




&lt;h2&gt;
  
  
  Parameter 4: Embedding Model — Who Does the "Semantic Translation"?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Embedding Is RAG's "Translator"
&lt;/h3&gt;

&lt;p&gt;What an embedding model does is simple: convert text into a sequence of numbers (a vector). Semantically similar texts have vectors that are close together; semantically dissimilar texts have vectors that are far apart.&lt;/p&gt;

&lt;p&gt;The retriever relies on this — it converts the user's question into a vector, then finds the nearest vectors in the database.&lt;/p&gt;
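<br></br>
&lt;p&gt;To see this concretely, embed a question plus one related and one unrelated sentence, then compare cosine similarities. A sketch, where &lt;code&gt;embeddings&lt;/code&gt; is any LangChain embedding object (like the ones configured later in this article):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def cos(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v_q, v_hit, v_miss = embeddings.embed_documents([
    "What happens when the Redis cache expires?",
    "Data is automatically purged after the TTL elapses.",
    "The billing module supports monthly invoicing.",
])
print(cos(v_q, v_hit))   # expected: relatively high
print(cos(v_q, v_miss))  # expected: noticeably lower
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;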

&lt;h3&gt;
  
  
  How Big Is the Difference Between Models?
&lt;/h3&gt;

&lt;p&gt;Very big. For the same question, different models can return completely different results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Strong Language&lt;/th&gt;
&lt;th&gt;Dimensions&lt;/th&gt;
&lt;th&gt;Positioning&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;text-embedding-3-small&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;1536&lt;/td&gt;
&lt;td&gt;Cheap &amp;amp; fast&lt;/td&gt;
&lt;td&gt;English docs, budget-sensitive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;text-embedding-3-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;3072&lt;/td&gt;
&lt;td&gt;High precision&lt;/td&gt;
&lt;td&gt;English docs, precision-first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BAAI/bge-large-zh-v1.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chinese&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;Best for Chinese&lt;/td&gt;
&lt;td&gt;Chinese docs, China-first choice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BAAI/bge-m3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;Mixed Chinese-English, cross-lingual&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  A Real Comparison Experiment
&lt;/h3&gt;

&lt;p&gt;We use the same Chinese technical document (&lt;em&gt;Automotive SPICE PAM v4.0&lt;/em&gt;), the same question, and compare retrieval results between &lt;code&gt;text-embedding-3-small&lt;/code&gt; and &lt;code&gt;BAAI/bge-large-zh-v1.5&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question&lt;/strong&gt;: "What is process capability level 1?"&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;1st Retrieved Result&lt;/th&gt;
&lt;th&gt;2nd Retrieved Result&lt;/th&gt;
&lt;th&gt;Assessment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-small&lt;/td&gt;
&lt;td&gt;Page 12: paragraph about project management&lt;/td&gt;
&lt;td&gt;Page 89: paragraph about risk assessment&lt;/td&gt;
&lt;td&gt;❌ Neither mentions "process capability level"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BAAI/bge-large-zh-v1.5&lt;/td&gt;
&lt;td&gt;Page 45: &lt;strong&gt;Definition of process capability level 1&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Page 46: Example practices for level 1&lt;/td&gt;
&lt;td&gt;✅ Direct hit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Reason&lt;/strong&gt;: OpenAI's models are trained primarily on English corpora. Their grasp of Chinese technical terminology is weaker than that of BGE, which is fine-tuned specifically on Chinese corpora.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Choose an Embedding Model?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Decision tree:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What language is your document?
    ├─ Pure English → text-embedding-3-small (best value)
    │                  or text-embedding-3-large (best precision)
    ├─ Pure Chinese → BAAI/bge-large-zh-v1.5 (China-first choice)
    │                  or BAAI/bge-m3 (if mixed Chinese-English)
    └─ Mixed Chinese-English → BAAI/bge-m3 (best multilingual support)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Switching models is a one-line change&lt;/strong&gt;: Just change &lt;code&gt;model="..."&lt;/code&gt; in the &lt;code&gt;build_embeddings()&lt;/code&gt; function. Everything else stays the same — that's the beauty of LangChain.&lt;/p&gt;
&lt;/blockquote&gt;
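<br></br>
&lt;p&gt;For reference, a sketch of what &lt;code&gt;build_embeddings()&lt;/code&gt; can look like (the exact version lives in the series repo; this one follows the configuration used in this article):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from langchain_openai import OpenAIEmbeddings

def build_embeddings(model: str = "BAAI/bge-large-zh-v1.5") -&amp;gt; OpenAIEmbeddings:
    # Swap the model here; the rest of the pipeline stays untouched
    return OpenAIEmbeddings(
        model=model,
        api_key=os.getenv("EMBEDDING_API_KEY"),
        base_url=os.getenv("EMBEDDING_API_BASE", "https://api.siliconflow.cn/v1"),
        chunk_size=32,  # SiliconFlow caps batch size at 32
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;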




&lt;h2&gt;
  
  
  Hands-On: Controlled-Variable Experiment
&lt;/h2&gt;

&lt;p&gt;Let's run an experiment: keep the document and the question fixed, change only Chunk Size, and observe how answer quality changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experimental Design
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
RAG Parameter Controlled-Variable Experiment
Fixed: document, question, embedding model, Top-K, LLM
Variable: Chunk Size
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_chroma&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PyPDFLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.output_parsers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StrOutputParser&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;

&lt;span class="c1"&gt;# Load document
&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PyPDFLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/Automotive-SPICE-PAM-v40.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Embedding (fixed)
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-large-zh-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMBEDDING_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.siliconflow.cn/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# LLM (fixed)
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glm-4-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://open.bigmodel.cn/api/paas/v4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Test different Chunk Sizes
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_chunk_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chunk Size=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Overlap=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Split
&lt;/span&gt;    &lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Build vector store
&lt;/span&gt;    &lt;span class="n"&gt;persist_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./chroma_db_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_dir&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;
        &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rmtree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;persist_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Build RAG Chain (LCEL style)
&lt;/span&gt;    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_messages&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer based on reference content. Reference:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;{context}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{question}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;rag_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Ask question
&lt;/span&gt;    &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is process capability level 1?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Print retrieved sources
&lt;/span&gt;    &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Retrieved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; sources:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] Page &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run three experiments
&lt;/span&gt;&lt;span class="nf"&gt;test_chunk_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;test_chunk_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;test_chunk_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Expected Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunk Size&lt;/th&gt;
&lt;th&gt;Number of Chunks&lt;/th&gt;
&lt;th&gt;Retrieval Quality&lt;/th&gt;
&lt;th&gt;Typical Observation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;Many (~4000)&lt;/td&gt;
&lt;td&gt;High precision but broken context&lt;/td&gt;
&lt;td&gt;Retrieved chunks have the keyword "process capability level" but lack sufficient context; LLM answers are fragmented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;Medium (~1000)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Best balance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieved chunks contain complete definitions + examples; LLM answers are coherent and accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;Few (~500)&lt;/td&gt;
&lt;td&gt;Complete context but low precision&lt;/td&gt;
&lt;td&gt;Retrieved chunks contain lots of irrelevant content (e.g., descriptions of other levels); LLM answers are verbose&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: Chunk Size is neither "bigger is better" nor "smaller is better". &lt;strong&gt;512 characters&lt;/strong&gt; is a safe starting point for most Chinese technical documents.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The 5 Most Common Pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pitfall 1: Setting Chunk Size by Token Count, but length_function Uses Character Count
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Wrong: You think chunk_size=512 means 512 tokens
&lt;/span&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Actually the default length_function=len counts characters!
# 512 characters ≈ 256 tokens (Chinese), so chunks are half the size you expected
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: If you want to count by tokens, explicitly specify a tokenizer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;token_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encoding_for_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ✅ Count by tokens
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
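
&lt;p&gt;A quick sanity check makes the mismatch visible (illustrative; the exact ratio depends on the tokenizer and the language):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Character count vs token count for the same text
sample = "Process capability level 1 means the process is performed." * 8
print(len(sample))           # what the default length_function sees
print(token_length(sample))  # what the model sees; far fewer for English text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;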



&lt;h3&gt;
  
  
  Pitfall 2: Overlap Too Large, Causing 30% Duplicate Content in the Vector Store
&lt;/h3&gt;

&lt;p&gt;Overlap is not free. Every overlapping character requires an embedding computation and takes up storage space in the vector database. Overlap=100 with Chunk Size=200 means &lt;strong&gt;50% of your storage is redundant&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Set Overlap to &lt;strong&gt;10%~15%&lt;/strong&gt; of Chunk Size, and never exceed 20%.&lt;/p&gt;
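
&lt;p&gt;A minimal helper that encodes this rule (a sketch, not code from the repo):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import RecursiveCharacterTextSplitter

# Derive overlap from chunk_size instead of hardcoding two loose numbers
def make_splitter(chunk_size: int, overlap_ratio: float = 0.10):
    assert overlap_ratio &amp;lt;= 0.20, "overlap above 20% mostly buys redundant storage"
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * overlap_ratio),
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;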

&lt;h3&gt;
  
  
  Pitfall 3: Swapping the Embedding Model Without Clearing the Old Vector Store
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Wrong: Yesterday you built the index with BGE, today you switch to OpenAI
# and reuse the same chroma_db/ directory
&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Result: Query vectors and index vectors come from different models — completely mismatched
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: When switching embedding models, &lt;strong&gt;always delete the old vector store and re-index&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rmtree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ✅ Clear old data
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
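
&lt;p&gt;A habit that prevents the mismatch entirely: key the persist directory by the embedding model, so two models can never share an index (a naming convention, not something Chroma enforces):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# One directory per embedding model; switching models starts a fresh index
model_name = "BAAI/bge-large-zh-v1.5"
persist_directory = f"./chroma_db_{model_name.replace('/', '_')}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;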



&lt;h3&gt;
  
  
  Pitfall 4: Hardcoding Top-K Without Adjusting for Question Complexity
&lt;/h3&gt;

&lt;p&gt;Using K=4 for every question ignores how different the information needs are: "What's the default port?" (a simple fact) and "Summarize all key points from Chapter 3" (a comprehensive overview) require vastly different amounts of context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Simple questions use K=3~4, complex questions use K=8~10. A more advanced approach is to use an LLM to judge question complexity first, then dynamically decide K (covered in a later article).&lt;/p&gt;
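
&lt;p&gt;A rough heuristic version of the idea (a sketch; the marker list is invented for illustration, and the LLM-based router is the robust variant):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Pick K from surface features of the question: crude, but costs nothing
def choose_k(question: str) -&amp;gt; int:
    broad_markers = ("summarize", "compare", "overview", "all", "key points")
    if any(m in question.lower() for m in broad_markers):
        return 8  # comprehensive questions need more context
    return 4      # specific factual questions need less

retriever = vector_store.as_retriever(search_kwargs={"k": choose_k(question)})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;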

&lt;h3&gt;
  
  
  Pitfall 5: Not Monitoring "Zero Retrieval"
&lt;/h3&gt;

&lt;p&gt;Sometimes the retriever returns zero relevant chunks (e.g., the user asks about something completely absent from the document), and nothing in the pipeline surfaces this. The LLM has no choice but to hallucinate from memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Add a threshold filter after retrieval — if the similarity score of the most relevant chunk is below a threshold (e.g., 0.6), directly tell the user "The document doesn't contain relevant information" instead of feeding irrelevant chunks to the LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add a filter layer after retrieval
&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;max_similarity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sorry, I cannot answer this question based on the available documents.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
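
&lt;p&gt;One caveat: relevance scores are only comparable within a single embedding model and distance metric, so calibrate the threshold empirically. Run a few questions you know the document cannot answer and check what scores they produce; set the cutoff just above them.&lt;/p&gt;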






&lt;h2&gt;
  
  
  Parameter Selection Cheat Sheet
&lt;/h2&gt;

&lt;p&gt;Everything above, condensed into one table you can tape next to your monitor:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default for Beginners&lt;/th&gt;
&lt;th&gt;When to Increase&lt;/th&gt;
&lt;th&gt;When to Decrease&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chunk Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;Answer needs large context (books/papers)&lt;/td&gt;
&lt;td&gt;Answer is short (FAQ/config items)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chunk Overlap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50 (~10%)&lt;/td&gt;
&lt;td&gt;Sentences often span pages/paragraphs&lt;/td&gt;
&lt;td&gt;Document is highly structured with clear boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Question involves multiple topics&lt;/td&gt;
&lt;td&gt;Question is very specific with a unique answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BGE (Chinese) / OpenAI (English)&lt;/td&gt;
&lt;td&gt;Switch to BGE for Chinese professional documents&lt;/td&gt;
&lt;td&gt;Switch to OpenAI for English general-purpose documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
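
&lt;p&gt;If you prefer code to sticky notes, here are the same defaults as a starting config (values straight from the table; tune them per corpus):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Beginner defaults from the cheat sheet; adjust per document type
RAG_DEFAULTS = {
    "chunk_size": 512,
    "chunk_overlap": 50,  # ~10% of chunk_size
    "top_k": 4,
    "embedding_model": "BAAI/bge-large-zh-v1.5",  # BGE for Chinese; OpenAI for English
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;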




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we covered the four core parameters of RAG:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chunk Size&lt;/strong&gt;: Determines how long each text chunk is. Default 512. Use 256 for short answers, 1024 for long arguments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk Overlap&lt;/strong&gt;: Determines how much adjacent chunks overlap. Default 10% of Chunk Size. Prevents cross-chunk information from being severed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-K&lt;/strong&gt;: Determines how many chunks to retrieve. Default 4. Increase to 8 for complex questions, decrease to 3 for simple facts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding Model&lt;/strong&gt;: Chinese documents use BGE, English documents use OpenAI. Remember to clear the vector store when switching models.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Through the controlled experiment, we demonstrated that &lt;strong&gt;parameters are neither "bigger is better" nor "smaller is better" — the key is finding the balance that suits your document type and query patterns&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Next up, we enter Part 2: Core Components — a deep dive into &lt;strong&gt;4 chunking strategies&lt;/strong&gt; (Fixed Size, Recursive Character, Semantic Chunking, Document Structure), thoroughly unpacking the "how to cut" problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://python.langchain.com/docs/concepts/text_splitters/" rel="noopener noreferrer"&gt;LangChain Text Splitters Documentation&lt;/a&gt; — Official chunking strategy guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/FlagOpen/FlagEmbedding" rel="noopener noreferrer"&gt;BGE Embedding Models GitHub&lt;/a&gt; — Best practices for Chinese embeddings&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/spaces/mteb/leaderboard" rel="noopener noreferrer"&gt;MTEB Leaderboard&lt;/a&gt; — Authoritative embedding model ranking&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.trychroma.com/docs/collections/configure#configuring-the-hnsw-index" rel="noopener noreferrer"&gt;ChromaDB Distance Metrics&lt;/a&gt; — Cosine similarity vs Euclidean distance&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>chunk</category>
      <category>vectordatabase</category>
      <category>tuning</category>
    </item>
    <item>
      <title>RAG Series (1): Why LLMs Need External Memory</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Sat, 02 May 2026 02:47:17 +0000</pubDate>
      <link>https://dev.to/wonderlab/rag-series-1-why-llms-need-external-memory-1645</link>
      <guid>https://dev.to/wonderlab/rag-series-1-why-llms-need-external-memory-1645</guid>
      <description>&lt;h2&gt;
  
  
  Two Root Causes Behind LLM "Hallucinations"
&lt;/h2&gt;

&lt;p&gt;Anyone who has worked with large language models has run into these two situations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Situation 1: Knowledge Cutoff&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: What were our company's Q1 sales figures?
GPT: I'm sorry, my training data only goes up to early 2024 and I have
     no access to your company's internal data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Situation 2: Hallucination&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: How do I use LangChain's RunnablePassthrough?
GPT: RunnablePassthrough can be enabled by calling .with_config(pass_through=True)...
     (This parameter doesn't exist.)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both problems share the same root cause: &lt;strong&gt;an LLM's knowledge is frozen at training time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The moment training completes, the model's "memory" is locked in place. It has no idea what happened today, knows nothing about your internal documents, and won't go look things up—it can only answer from memory. When memory runs dry, it either admits ignorance or invents a plausible-sounding answer.&lt;/p&gt;

&lt;p&gt;That's where hallucinations come from: &lt;strong&gt;the model uses fluent language to fill in the gaps in its knowledge.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Solutions and How They Compare
&lt;/h2&gt;

&lt;p&gt;There are three engineering approaches to this problem:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrain on new data, "bake" knowledge into parameters&lt;/td&gt;
&lt;td&gt;Fixed-domain language style, output format&lt;/td&gt;
&lt;td&gt;Expensive, slow to update, limited factual recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Long Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stuff all documents into the prompt&lt;/td&gt;
&lt;td&gt;Small document sets, one-off queries&lt;/td&gt;
&lt;td&gt;Token cost scales with the entire document set on every query; quality degrades at extreme lengths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dynamically retrieve relevant content at query time, inject into prompt&lt;/td&gt;
&lt;td&gt;Large knowledge bases, continuously updated data&lt;/td&gt;
&lt;td&gt;Requires retrieval infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A common misconception: &lt;strong&gt;fine-tuning is not good at injecting new facts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fine-tuning changes a model's behavioral patterns and language style—it doesn't "store a book" inside the parameters. Experiments consistently show that fine-tuning on specific Q&amp;amp;A pairs produces limited accuracy gains on related questions, and if training data contains errors, the model confidently repeats those errors.&lt;/p&gt;

&lt;p&gt;RAG's core advantage is &lt;strong&gt;separating "what to know" from "how to say it"&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knowledge lives in an external database and can be updated anytime&lt;/li&gt;
&lt;li&gt;The model focuses purely on understanding and generation, not memorization&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;When should you use long context instead of RAG?&lt;/strong&gt;&lt;br&gt;
When the total document volume is under ~100K tokens, the query is one-off (not recurring), and API costs are acceptable, long context is often simpler. Claude and Gemini's extended context windows make "stuffing a whole book in" genuinely viable. But for enterprise knowledge bases—thousands of documents, continuous updates, multiple concurrent users—RAG remains the more sensible architecture.&lt;/p&gt;
&lt;/blockquote&gt;
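
&lt;p&gt;A back-of-the-envelope comparison (assuming a hypothetical price of $1 per million input tokens): stuffing a 100K-token corpus into every prompt costs about $0.10 per question, while retrieving ~2K tokens of context via RAG costs about $0.002, a 50x difference that compounds with every query and every user.&lt;/p&gt;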




&lt;h2&gt;
  
  
  What RAG Is: An Open-Book Exam Analogy
&lt;/h2&gt;

&lt;p&gt;RAG = &lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The most intuitive way to think about it: &lt;strong&gt;turning a closed-book exam into an open-book exam.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Closed-book (pure LLM): The student answers purely from memory. Anything not memorized gets guessed.&lt;/p&gt;

&lt;p&gt;Open-book (RAG): The student can consult reference materials, but still needs to understand the question, find the relevant content, and compose the answer. The reference materials are the external knowledge base; looking things up is the retrieval step.&lt;/p&gt;

&lt;p&gt;This analogy reveals two key properties of RAG:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge lives outside the model&lt;/strong&gt; — it can be swapped and updated independently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The model handles understanding and generation&lt;/strong&gt; — after retrieval, the model still needs to "read" the content and produce a coherent response&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Complete RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;RAG operates in two distinct phases: the &lt;strong&gt;indexing phase&lt;/strong&gt; (one-time, offline) and the &lt;strong&gt;query phase&lt;/strong&gt; (real-time, per request).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj13rs0wbpqh5dvf0hmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj13rs0wbpqh5dvf0hmp.png" alt="RAG Architecture Overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The two-phase RAG architecture — top: offline indexing pipeline; bottom: real-time query pipeline; both share the same Vector DB&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Indexing Phase
&lt;/h3&gt;

&lt;p&gt;This phase completes before any user query arrives. It's a one-time preprocessing step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw Documents → Document Loading → Text Splitting → Embedding → Vector Database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: Document Loading&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Convert raw content from various formats into plain text. PDFs, Word docs, Markdown, web pages, code—each format has its own parsing challenges (tables and images in PDFs are notoriously tricky).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Text Splitting (Chunking)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cut long documents into smaller chunks. This step has a significant impact on final quality—chunks too large reduce retrieval precision; chunks too small lose semantic coherence. (Chunking strategies are covered in depth in a later article; for now, just understand &lt;em&gt;why&lt;/em&gt; we split.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Embedding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use an embedding model to convert each text chunk into a high-dimensional vector. This vector captures the semantic meaning of the text—semantically similar texts produce vectors that are close together in the vector space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Store in Vector Database&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Store all vectors along with their original text in a vector database that supports similarity search (Chroma, Qdrant, Weaviate, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Phase
&lt;/h3&gt;

&lt;p&gt;Executed in real time for every user request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Question → Embedding → Similarity Search → Retrieved Chunks → Prompt Assembly → LLM → Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: Query Embedding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Convert the user's question into a vector using the same embedding model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Similarity Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the Top-K most similar text chunks in the vector database. Similarity is measured by distance in vector space (cosine similarity, etc.).&lt;/p&gt;
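
&lt;p&gt;For two embedding vectors a and b, cosine similarity is just the normalized dot product, which is exactly what the &lt;code&gt;cosine_similarity&lt;/code&gt; helper in the hands-on code below computes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;similarity(a, b) = (a · b) / (‖a‖ ‖b‖)    # range: -1 to 1, higher = more similar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;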

&lt;p&gt;&lt;strong&gt;Step 3: Prompt Assembly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combine the retrieved text chunks with the user's question into a complete prompt and send it to the LLM. A typical format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a knowledge assistant. Answer the user's question based solely on
the reference content provided below.

Reference content:
[Retrieved chunk 1]
[Retrieved chunk 2]
...

User question: [original question]

Please base your answer on the reference content. If the reference content
does not contain relevant information, say so clearly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: LLM Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM generates an answer grounded in the provided context, rather than from its internal parameters alone.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hands-On: A Minimal RAG in 100 Lines
&lt;/h2&gt;

&lt;p&gt;No frameworks—just the OpenAI API. Let's implement a working RAG from scratch. The goal is to &lt;strong&gt;see exactly what each step does&lt;/strong&gt;, without the abstraction layers of a framework hiding the details.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Minimal RAG implementation — no frameworks, OpenAI API only.
Demonstrates the complete RAG pipeline: indexing + querying.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# requires OPENAI_API_KEY environment variable
&lt;/span&gt;
&lt;span class="c1"&gt;# ─────────────────────────────────────────
# Simulated knowledge base: 5 technical docs
# ─────────────────────────────────────────
&lt;/span&gt;&lt;span class="n"&gt;DOCUMENTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LangChain is a framework for building LLM applications, providing chaining, memory management, and tool integration.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vector databases enable semantic search by converting text into high-dimensional vectors. Common options include Chroma, Qdrant, Weaviate, and Pinecone.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RAG (Retrieval-Augmented Generation) reduces LLM hallucinations by retrieving relevant documents before generation, improving answer accuracy.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Embedding models convert text into fixed-dimension vectors, where semantically similar texts are positioned closer together in vector space.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fine-tuning retrains a model on specific data to adjust its behavior — it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s best suited for changing output style, not injecting new factual knowledge.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="c1"&gt;# ─────────────────────────────────────────
# Indexing phase: convert documents to vectors
# ─────────────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Convert each document into a vector.
    Returns [{text, embedding}, ...]
    In production, store these in a vector database.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Indexing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; documents...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Index built successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec_a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;vec_b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Cosine similarity between two vectors. Range: -1 to 1, higher = more similar.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;


&lt;span class="c1"&gt;# ─────────────────────────────────────────
# Query phase: retrieve + generate
# ─────────────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Embed the query and find the top_k most similar documents.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;

    &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

    &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Assemble retrieved docs + user question into a prompt and call the LLM.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context_docs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a knowledge assistant. Answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question based
solely on the reference content below. If the reference content does not contain
relevant information, say so clearly — do not make anything up.

Reference content:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

User question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Full RAG query pipeline.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; relevant documents:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;


&lt;span class="c1"&gt;# ─────────────────────────────────────────
# Run the demo
# ─────────────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Build the index (only needs to be done once in a real system)
&lt;/span&gt;    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOCUMENTS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Test with a few questions
&lt;/span&gt;    &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is a vector database?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the difference between RAG and fine-tuning?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is Python&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s GIL?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Not in the knowledge base — testing refusal
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sample output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Indexing 5 documents...
Index built successfully.

Question: What is a vector database?
Retrieved 2 relevant documents:
  [1] Vector databases enable semantic search by converting text into high-dim...
  [2] Embedding models convert text into fixed-dimension vectors, where semant...

Answer: A vector database is a database system that enables semantic search by
converting text into high-dimensional vectors. Common examples include Chroma,
Qdrant, Weaviate, and Pinecone...

Question: What is Python's GIL?
Retrieved 2 relevant documents:
  [1] LangChain is a framework for building LLM applications...
  [2] Fine-tuning retrains a model on specific data...

Answer: Based on the provided reference content, I cannot answer your question
about Python's GIL — the reference material does not contain relevant information.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the last question: the knowledge base has nothing about Python's GIL, and the LLM explicitly says it can't answer rather than inventing a response. This is how RAG controls hallucinations: &lt;strong&gt;a constraint in the prompt instructs the model to answer only from retrieved content.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations of This Implementation
&lt;/h2&gt;

&lt;p&gt;The 100 lines above demonstrate the complete RAG pipeline, but there are obvious shortcomings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Engineering Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vectors live in memory, lost on restart&lt;/td&gt;
&lt;td&gt;No persistence&lt;/td&gt;
&lt;td&gt;Vector database (Chroma / Qdrant)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long documents passed in directly will exceed token limits&lt;/td&gt;
&lt;td&gt;No chunking&lt;/td&gt;
&lt;td&gt;Text Splitter strategies &lt;em&gt;(next article)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Poor keyword matching; pure vector retrieval only&lt;/td&gt;
&lt;td&gt;No hybrid search&lt;/td&gt;
&lt;td&gt;Hybrid search &lt;em&gt;(later in series)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No quality measurement&lt;/td&gt;
&lt;td&gt;No evaluation&lt;/td&gt;
&lt;td&gt;RAGAS evaluation framework &lt;em&gt;(later in series)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each limitation maps directly to a topic covered in the upcoming articles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This article addressed three core questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Why RAG?&lt;/strong&gt; — LLM knowledge cutoff and hallucinations both stem from knowledge being frozen in model parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What is RAG?&lt;/strong&gt; — Dynamically retrieve external knowledge at query time, inject it into the prompt, and let the LLM answer based on evidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG vs. alternatives&lt;/strong&gt; — Fine-tuning changes behavior, not knowledge; long context works for small documents; RAG is built for large-scale, continuously updated knowledge bases&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Next up: the first deep dive into RAG's core components — &lt;strong&gt;text chunking strategies&lt;/strong&gt;. Why does the chunking approach have such a dramatic impact on quality, and how do you choose between the four main strategies?&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2005.11401" rel="noopener noreferrer"&gt;Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&lt;/a&gt; — Original RAG paper (Lewis et al., 2020)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;OpenAI Embeddings Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/tutorials/rag/" rel="noopener noreferrer"&gt;LangChain RAG Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>ai</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No. 54): Warp - The AI-Native Rust Terminal</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Sat, 02 May 2026 02:42:27 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no-54-warp-the-ai-native-rust-terminal-2hlj</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no-54-warp-the-ai-native-rust-terminal-2hlj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"The terminal hasn't fundamentally changed in 40 years. It's time it did." — The Warp Team&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the No. 54 article in the "One Open Source Project a Day" series. Today, we are exploring &lt;strong&gt;Warp&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The terminal is perhaps the most ancient part of a developer's daily toolkit. While every other tool has evolved for the AI era, most of us still interact with our command line as a simple "stream of text" based on 40-year-old logic. &lt;strong&gt;Warp&lt;/strong&gt; is here to change that. Rebuilt from the ground up in Rust, Warp is not just a high-performance terminal emulator, but a collaborative, AI-native &lt;strong&gt;Agentic Development Environment (ADE)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Will Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;How Warp transforms terminal output into actionable "Blocks."&lt;/li&gt;
&lt;li&gt;How AI integration in the terminal has evolved from simple completion to "Agentic Mode."&lt;/li&gt;
&lt;li&gt;Warp Drive: Managing and sharing team workflows like code repositories.&lt;/li&gt;
&lt;li&gt;The tech behind its high-performance GPU-accelerated UI rendering in Rust.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic command-line experience.&lt;/li&gt;
&lt;li&gt;Familiarity with modern development tools like VS Code.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Project Introduction
&lt;/h3&gt;

&lt;p&gt;Warp is a modern, high-performance terminal emulator. It breaks away from the linear text-stream model of traditional terminals, reimagining the interface as a series of distinct "Blocks." Recently, Warp hit a major milestone by &lt;strong&gt;open-sourcing its client codebase&lt;/strong&gt; and fully embracing "Agentic Development," allowing AI agents to debug, refactor, and deploy directly from the command line.&lt;/p&gt;

&lt;h3&gt;
  
  
  Author/Team Introduction
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Team&lt;/strong&gt;: The Warp team, based in the USA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background&lt;/strong&gt;: Backed by top-tier firms like &lt;strong&gt;Sequoia Capital&lt;/strong&gt; and &lt;strong&gt;GV&lt;/strong&gt;, with investors including &lt;strong&gt;Sam Altman&lt;/strong&gt; (CEO of OpenAI) and &lt;strong&gt;Dylan Field&lt;/strong&gt; (CEO of Figma).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Founded&lt;/strong&gt;: 2020 (Client codebase open-sourced in 2024).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ GitHub Stars: 23,000+&lt;/li&gt;
&lt;li&gt;🍴 Forks: 1,200+&lt;/li&gt;
&lt;li&gt;📦 Core Language: &lt;strong&gt;Rust&lt;/strong&gt; (98.2%)&lt;/li&gt;
&lt;li&gt;📄 License: &lt;strong&gt;AGPL v3&lt;/strong&gt; (Client) / &lt;strong&gt;MIT&lt;/strong&gt; (UI Framework)&lt;/li&gt;
&lt;li&gt;🌐 Official Website: &lt;a href="https://www.warp.dev" rel="noopener noreferrer"&gt;warp.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Main Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Utility
&lt;/h3&gt;

&lt;p&gt;Warp upgrades the boring command-line interaction into an IDE-like collaborative experience. It lowers the barrier to entry for complex CLI tools using AI and eliminates organizational "knowledge silos" through cloud-based sharing features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;AI-Assisted Debugging&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;When a command fails, use the built-in AI to explain the error and generate a fix with one click.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Workflow Sharing (Warp Drive)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Save complex deployment scripts or operational commands as parameterized "Workflows" and share them with your team.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Immersive Command Editing&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Edit commands like you edit text: support for mouse positioning, undo, and standard editor keybindings.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;Warp is available for macOS, Linux, and Windows (v1).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS users can install via Homebrew&lt;/span&gt;
brew &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--cask&lt;/span&gt; warp

&lt;span class="c"&gt;# Linux users can download .deb, .rpm, or AppImage from the website&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, use &lt;code&gt;CMD+P&lt;/code&gt; to search, or start a command with &lt;code&gt;#&lt;/code&gt; to convert a natural-language request into a command.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Characteristics
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Block-based UI&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Every command and its output is encapsulated in a "Block." You can copy output, share a link to a specific block, or filter output with AI.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Warp AI&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Deeply integrated with models like Claude 3.5 Sonnet and GPT-4o. It acts as a "tech lead" capable of managing sub-agents for complex tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Warp Drive&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A cloud-based "safe" for storing, searching, and sharing common workflows and interactive notebooks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Exceptional Performance&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Fully GPU-accelerated UI rendering built in Rust, ensuring low latency even with high-volume log outputs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Modern Editor Experience&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The input bar features syntax highlighting, smart completions, and multi-cursor editing.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Project Advantages
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Warp&lt;/th&gt;
&lt;th&gt;iTerm2 / Alacritty&lt;/th&gt;
&lt;th&gt;VS Code Terminal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native &amp;amp; Deep&lt;/td&gt;
&lt;td&gt;Plugin-based / Minimal&lt;/td&gt;
&lt;td&gt;Basic Completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Collaboration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud Sync &amp;amp; Sharing&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Very Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interaction Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Block-level Objects&lt;/td&gt;
&lt;td&gt;Continuous Text Stream&lt;/td&gt;
&lt;td&gt;Text Stream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rendering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPU Accelerated (Rust)&lt;/td&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;td&gt;Generally Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why Choose Warp?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Eliminate Context Switching&lt;/strong&gt;: No need to leave the terminal to search for regex syntax or error details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Retention&lt;/strong&gt;: Turn scattered team knowledge into reusable digital assets via Warp Drive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Configuration&lt;/strong&gt;: Comes with most modern developer features out of the box.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Detailed Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. From Terminal to ADE
&lt;/h3&gt;

&lt;p&gt;Warp is positioning itself as an &lt;strong&gt;Agentic Development Environment (ADE)&lt;/strong&gt;. Its architecture leverages the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;, allowing it to connect terminal sessions to external data (GitHub, Jira, DBs) and AI agents seamlessly.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. GPU-Accelerated Rendering
&lt;/h3&gt;

&lt;p&gt;Warp built a custom Rust UI framework called &lt;code&gt;warpui&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rendering&lt;/strong&gt;: Fully rendered on the GPU using modern graphics APIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefit&lt;/strong&gt;: Extremely low CPU overhead and zero "flicker" or lag when handling heavy log processing.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Links &amp;amp; Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/warpdotdev/warp" rel="noopener noreferrer"&gt;https://github.com/warpdotdev/warp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📚 &lt;strong&gt;Documentation&lt;/strong&gt;: &lt;a href="https://docs.warp.dev/" rel="noopener noreferrer"&gt;https://docs.warp.dev/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;Open Source Roadmap&lt;/strong&gt;: &lt;a href="https://www.warp.dev/blog/why-warp-is-going-open-source" rel="noopener noreferrer"&gt;Why Warp is Going Open Source&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Community
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Discord&lt;/strong&gt;: Active community for developers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warp Drive&lt;/strong&gt;: Explore how to codify team workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Target Audience
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Professional developers seeking ultimate efficiency.&lt;/li&gt;
&lt;li&gt;DevOps engineers who need to share operational assets.&lt;/li&gt;
&lt;li&gt;Terminal veterans tired of the 80s-style command line experience.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Find more useful knowledge and interesting products on my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;Homepage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>llm</category>
      <category>ai</category>
      <category>cli</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No. 53): pi-mono - Minimalist &amp; High-Performance AI Coding Agent</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Sat, 02 May 2026 02:39:38 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no-53-pi-mono-minimalist-high-performance-ai-coding-agent-4d73</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no-53-pi-mono-minimalist-high-performance-ai-coding-agent-4d73</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"Simplicity is the ultimate sophistication." — Leonardo da Vinci&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the No. 53 article in the "One Open Source Project a Day" series. Today, we are looking at &lt;strong&gt;pi-mono&lt;/strong&gt; (pi).&lt;/p&gt;

&lt;p&gt;In an era where AI coding tools are becoming increasingly bloated—with massive binaries and complex sub-agent architectures—Mario Zechner, the author of libGDX, has taken a diametrically opposite path. &lt;strong&gt;pi-mono&lt;/strong&gt; is a TypeScript-based monorepo containing a lean yet powerful CLI coding assistant named &lt;code&gt;pi&lt;/code&gt;. It eschews flashy GUIs for a custom "differential rendering" TUI framework, delivering the smoothest AI collaboration experience directly in your terminal.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Will Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The minimalist design philosophy of pi-mono.&lt;/li&gt;
&lt;li&gt;How "differential rendering" achieves a flicker-free terminal UI.&lt;/li&gt;
&lt;li&gt;High-efficiency cross-provider LLM switching.&lt;/li&gt;
&lt;li&gt;Why the "YOLO mode" (no permission confirmation) dramatically boosts productivity.&lt;/li&gt;
&lt;li&gt;A deep comparison with heavy-duty agents like Claude Code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic Node.js/TypeScript environment setup.&lt;/li&gt;
&lt;li&gt;Basic understanding of LLM Tool Calling.&lt;/li&gt;
&lt;li&gt;API Keys for AI providers like Anthropic or OpenAI.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Project Introduction
&lt;/h3&gt;

&lt;p&gt;pi-mono is a suite of AI programming agent tools designed specifically for "power users." It consists of a core agent engine, a unified AI interface layer, and a terminal UI with its own self-contained rendering engine. Its core objective is to &lt;strong&gt;provide the fastest response times and the cleanest user experience without sacrificing context control.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Author/Team Introduction
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Author&lt;/strong&gt;: Mario Zechner&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background&lt;/strong&gt;: Founder of the popular open-source game framework &lt;strong&gt;libGDX&lt;/strong&gt; and former founder of RoboVM. He has deep expertise in high-performance cross-platform development and the open-source community.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project Status&lt;/strong&gt;: Under rapid iteration, currently performing exceptionally well on benchmarks like Terminal-Bench.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ GitHub Stars: 430+ (Early stage, growing fast)&lt;/li&gt;
&lt;li&gt;🍴 Forks: 30+&lt;/li&gt;
&lt;li&gt;📦 Package Manager: pnpm&lt;/li&gt;
&lt;li&gt;📄 License: MIT&lt;/li&gt;
&lt;li&gt;🌐 Repository: &lt;a href="https://github.com/badlogic/pi-mono" rel="noopener noreferrer"&gt;badlogic/pi-mono&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Main Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Utility
&lt;/h3&gt;

&lt;p&gt;As a "harness," pi-mono connects LLMs (like Claude 3.5 Sonnet) to your local development environment. It can autonomously read files, execute Bash commands, rewrite code, and provide feedback on results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Fast Refactoring&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;It understands the full context of a codebase and can perform cross-file interface renaming with a single prompt.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Hard Bug Fixing&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;By observing error logs, it autonomously executes search commands and applies targeted fixes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Minimalist Development&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;For developers who prefer working in Terminal/Vim, &lt;code&gt;pi&lt;/code&gt; provides IDE-like interactions while remaining lightweight.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the pi coding agent&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @mariozechner/pi-coding-agent

&lt;span class="c"&gt;# Set up API Key (e.g., Anthropic)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_key_here

&lt;span class="c"&gt;# Start in the project root&lt;/span&gt;
pi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Characteristics
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Differential Rendering TUI&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Frustrated by terminal flickering, the author developed a rendering engine inspired by React's diffing algorithm, making Markdown parsing and syntax highlighting extremely smooth.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Minimalist System Prompt&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Unlike other agents that spend thousands of tokens on instructions, &lt;code&gt;pi&lt;/code&gt; uses fewer than 1,000, saving context-window space and speeding up responses.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Seamless Multi-Model Switching&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;You can switch models mid-conversation (e.g., from Claude to GPT-4o), and it automatically migrates the conversation history.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;YOLO Mode&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The project embraces a "trust through execution" philosophy. It doesn't nag you with permission popups for running &lt;code&gt;ls&lt;/code&gt; or &lt;code&gt;read&lt;/code&gt; commands.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Project Advantages
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;pi-mono (pi)&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Cursor / Windsurf&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tiny (Node-based)&lt;/td&gt;
&lt;td&gt;Large (Multi-layer deps)&lt;/td&gt;
&lt;td&gt;Heavy (IDE level)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (Bash-based)&lt;/td&gt;
&lt;td&gt;Medium (MCP-constrained)&lt;/td&gt;
&lt;td&gt;Low (Closed-box)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Start Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100% Transparent&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why Choose This Project?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance Beast&lt;/strong&gt;: Extreme optimization of terminal rendering and network I/O makes it feel 2x faster than similar tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparency&lt;/strong&gt;: You can clearly see every character the AI generates and every simple tool call it executes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer-Friendly API&lt;/strong&gt;: If you want to build your own agent, its &lt;code&gt;pi-ai&lt;/code&gt; package is one of the best-wrapped cross-platform AI libraries available.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Detailed Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Differential Rendering Engine (pi-tui)
&lt;/h3&gt;

&lt;p&gt;This is the most technically impressive part of pi-mono. Standard terminal UIs usually perform full redraws, causing noticeable flickering during long text outputs. &lt;code&gt;pi-tui&lt;/code&gt; borrows from Virtual DOM concepts (a sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It maintains a buffer of the terminal state.&lt;/li&gt;
&lt;li&gt;It calculates the difference (Diff) between old and new states.&lt;/li&gt;
&lt;li&gt;It sends only the necessary escape sequences to stdout.&lt;/li&gt;
&lt;/ul&gt;
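
&lt;p&gt;Here is a minimal, hypothetical sketch of the line-level version of this idea (ours, not pi-tui's actual code): diff the old and new frames, then rewrite only the rows that changed, using ANSI escape codes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of diff-based terminal rendering (not pi-tui's real code).
import sys
from itertools import zip_longest

def render(old_frame: list[str], new_frame: list[str]) -&amp;gt; None:
    """Rewrite only the rows that differ between two frames."""
    for row, (old, new) in enumerate(zip_longest(old_frame, new_frame, fillvalue="")):
        if old != new:
            # \x1b[{row};1H moves the cursor; \x1b[K clears to end of line.
            sys.stdout.write(f"\x1b[{row + 1};1H\x1b[K{new}")
    sys.stdout.flush()

render(["$ pi", "thinking..."], ["$ pi", "done: 3 files edited"])  # only row 2 is redrawn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A full redraw costs work proportional to the whole screen; a diff costs work proportional to what changed, which is why streaming Markdown output stays smooth.&lt;/p&gt;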

&lt;h3&gt;
  
  
  2. Tool Calling Model (The "Bash-only" Philosophy)
&lt;/h3&gt;

&lt;p&gt;While other agents try to integrate dozens of APIs, &lt;code&gt;pi&lt;/code&gt; insists: &lt;strong&gt;If an AI can use Bash well, it can do anything.&lt;/strong&gt;&lt;br&gt;
Its toolbox consists of only four atomic tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;read(path, startLine, endLine)&lt;/code&gt;: Read file segments.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write(path, content)&lt;/code&gt;: Overwrite files.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;edit(path, oldStr, newStr)&lt;/code&gt;: Local search and replace (the most stable way to edit code).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bash(command)&lt;/code&gt;: Execute any shell command.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design ensures &lt;code&gt;pi&lt;/code&gt; remains incredibly robust in almost any environment.&lt;/p&gt;
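
&lt;p&gt;To show how small that surface really is, here is a hypothetical dispatcher over the four tools. The signatures follow the list above; the function itself is our illustration, not pi's internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical dispatcher for pi's four atomic tools (illustrative only).
import subprocess
from pathlib import Path

def run_tool(name: str, args: dict) -&amp;gt; str:
    if name == "read":
        lines = Path(args["path"]).read_text().splitlines()
        return "\n".join(lines[args["startLine"] - 1 : args["endLine"]])
    if name == "write":
        Path(args["path"]).write_text(args["content"])
        return "ok"
    if name == "edit":
        text = Path(args["path"]).read_text()
        Path(args["path"]).write_text(text.replace(args["oldStr"], args["newStr"], 1))
        return "ok"
    if name == "bash":
        result = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
        return result.stdout + result.stderr
    raise ValueError(f"unknown tool: {name}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Everything else (git, grep, test runners, deployments) rides on &lt;code&gt;bash&lt;/code&gt;, which is exactly the point.&lt;/p&gt;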




&lt;h2&gt;
  
  
  Project Links &amp;amp; Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/badlogic/pi-mono" rel="noopener noreferrer"&gt;https://github.com/badlogic/pi-mono&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;NPM&lt;/strong&gt;: &lt;a href="https://www.npmjs.com/package/@mariozechner/pi-coding-agent" rel="noopener noreferrer"&gt;@mariozechner/pi-coding-agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💬 &lt;strong&gt;Discord&lt;/strong&gt;: Visit GitHub for the invite link.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Related Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/answerdotai/terminal-bench" rel="noopener noreferrer"&gt;Terminal-Bench Leaderboard&lt;/a&gt; - &lt;code&gt;pi&lt;/code&gt; is consistently a top performer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Target Audience
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Terminal natives looking for a high-speed experience.&lt;/li&gt;
&lt;li&gt;Developers with high requirements for code privacy and agent transparency.&lt;/li&gt;
&lt;li&gt;Learners who want to understand how to build high-performance Agent TUIs from scratch.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Find more useful knowledge and interesting products on my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;Homepage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>agents</category>
      <category>ai</category>
      <category>typescript</category>
    </item>
    <item>
      <title>RAG Series (2): Building Your First RAG Pipeline with LangChain</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Fri, 01 May 2026 02:08:30 +0000</pubDate>
      <link>https://dev.to/wonderlab/rag-series-2-building-your-first-rag-pipeline-with-langchain-2kk7</link>
      <guid>https://dev.to/wonderlab/rag-series-2-building-your-first-rag-pipeline-with-langchain-2kk7</guid>
      <description>&lt;h2&gt;
  
  
  From 100 Lines to a Production Pipeline
&lt;/h2&gt;

&lt;p&gt;In the last article, we built a minimal RAG system in 100 lines of pure Python. It worked. It demonstrated the core idea. But if you tried to productionize that code, you'd quickly run into a wall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loading a PDF?&lt;/strong&gt; You need &lt;code&gt;PyPDF2&lt;/code&gt; or &lt;code&gt;pdfplumber&lt;/code&gt;, and you'll discover that tables, headers, and footers are a nightmare to parse cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Splitting text?&lt;/strong&gt; Your naive &lt;code&gt;text.split("\n\n")&lt;/code&gt; will cut sentences in half, break code blocks, or create chunks so large they blow past the token limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Swapping the vector database?&lt;/strong&gt; Say goodbye to your afternoon — every database has a different API, different distance metrics, different metadata handling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Switching LLM providers?&lt;/strong&gt; OpenAI's client, Anthropic's client, local models via &lt;code&gt;llama.cpp&lt;/code&gt; — each has its own message format, its own token counting, its own error handling.&lt;/p&gt;

&lt;p&gt;This is exactly why &lt;strong&gt;LangChain&lt;/strong&gt; exists. It doesn't do anything magical. It doesn't replace the underlying models or databases. What it does is simple and valuable: &lt;strong&gt;it gives you a uniform interface for plugging components together&lt;/strong&gt;, so you can focus on the logic of your RAG system instead of the plumbing.&lt;/p&gt;

&lt;p&gt;In this article, we'll rebuild our RAG pipeline using LangChain's modern API. By the end, you'll have a complete, runnable project that loads PDFs, splits them intelligently, stores them in ChromaDB, and answers questions using a multi-provider LLM — with about 30 lines of actual pipeline code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;LangChain Version Note:&lt;/strong&gt; The code in this article is based on &lt;code&gt;langchain 1.x&lt;/code&gt; (current stable). LangChain 1.x is a breaking reorganization of the &lt;code&gt;0.3.x&lt;/code&gt; line — high-level APIs like &lt;code&gt;create_retrieval_chain&lt;/code&gt; were removed. We use &lt;strong&gt;LCEL native syntax&lt;/strong&gt; (the &lt;code&gt;|&lt;/code&gt; pipe operator) instead, which is functionally equivalent and version-agnostic. Full source code: &lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic" rel="noopener noreferrer"&gt;https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Six Moving Parts of a RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;LangChain decomposes RAG into six components. Understanding what each one does — and where the quality risks hide — is the key to debugging RAG systems later.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;The Quality Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Document Loader&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads raw files (PDF, Word, Markdown, HTML) and extracts text&lt;/td&gt;
&lt;td&gt;Tables, images, and weird formatting get mangled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text Splitter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cuts long documents into semantically coherent chunks&lt;/td&gt;
&lt;td&gt;Chunks too large = low precision; too small = lost context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Converts text chunks into high-dimensional vectors&lt;/td&gt;
&lt;td&gt;Poor model choice = semantically unrelated texts cluster together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector Store&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Persists vectors and enables fast similarity search&lt;/td&gt;
&lt;td&gt;Wrong distance metric or no metadata filtering = bad retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retriever&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Accepts a query, searches the vector store, returns relevant chunks&lt;/td&gt;
&lt;td&gt;Top-K too low = missing information; too high = noisy context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Orchestrates the full flow: query → retrieve → prompt → LLM → answer&lt;/td&gt;
&lt;td&gt;Prompt design and context assembly determine answer quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Think of these six components as an assembly line in a factory. The Document Loader is the raw material intake. The Text Splitter is the precision cutting station. The Embedding Model and Vector Store form the warehouse and inventory system. The Retriever is the picker who fetches the right parts. The Chain is the foreman who coordinates everything and delivers the final product.&lt;/p&gt;

&lt;p&gt;If any station on the line is misconfigured, the final product suffers — and the tricky part is that &lt;strong&gt;the failure often looks like an LLM problem when it's actually a retrieval problem&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hands-On: A Complete LangChain RAG Project
&lt;/h2&gt;

&lt;p&gt;Let's build it. We'll create a RAG system that reads a directory of PDF files, indexes them, and lets you ask questions in natural language.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag-project/
├── requirements.txt
├── data/
│   └── sample.pdf          # Your PDF documents go here
└── rag_pipeline.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 0: Dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;langchain&amp;gt;=0.3.0
langchain-text-splitters&amp;gt;=0.3.0
langchain-openai&amp;gt;=0.2.0
langchain-chroma&amp;gt;=0.1.0
langchain-community&amp;gt;=0.3.0
pypdf&amp;gt;=4.0.0
python-dotenv&amp;gt;=1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Full source code (ready to run):&lt;/strong&gt; &lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic" rel="noopener noreferrer"&gt;https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic&lt;/a&gt;&lt;br&gt;
Supports Zhipu AI / OpenAI / Ollama for LLM, SiliconFlow and local Ollama for Embedding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Install them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure your API keys (copy &lt;code&gt;.env.example&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env to fill in LLM_API_KEY and EMBEDDING_API_KEY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
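
&lt;p&gt;For reference, a minimal &lt;code&gt;.env&lt;/code&gt; needs just the two keys mentioned above (placeholder values shown; check the repo's &lt;code&gt;.env.example&lt;/code&gt; for the full set):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .env (placeholders)
LLM_API_KEY=your_llm_key_here
EMBEDDING_API_KEY=your_embedding_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;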



&lt;p&gt;Supported Providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: Zhipu AI (default), OpenAI, SiliconFlow, Ollama, Azure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt;: SiliconFlow (default), OpenAI, Ollama&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Load Documents
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;PyPDFLoader&lt;/code&gt; handles PDF parsing for us. It extracts text page by page and returns a list of &lt;code&gt;Document&lt;/code&gt; objects, each containing the page content and metadata (page number, source file, etc.).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PyPDFLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_pdfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load all PDF files from the data directory.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;pdf_paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pdf_paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PyPDFLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; pages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total documents loaded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Real-world note:&lt;/strong&gt; PDFs are the wild west of document formats. If your PDFs contain scanned images, plain text extraction returns nothing; you'll need OCR (e.g., Tesseract, or a service like Azure Document Intelligence). If they contain tables, consider &lt;code&gt;UnstructuredPDFLoader&lt;/code&gt;, which preserves table structure better than raw text extraction.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 2: Split Documents into Chunks
&lt;/h3&gt;

&lt;p&gt;Remember the chunking problem from Part 1? LangChain's &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; is the industry default for good reason. It tries to split on natural boundaries — paragraphs first, then newlines, then sentences, then words — so it avoids cutting mid-sentence whenever possible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Split documents into overlapping chunks.
    chunk_overlap ensures context continuity between adjacent chunks.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Split into &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks (chunk_size=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, overlap=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;chunk_overlap&lt;/code&gt; matters:&lt;/strong&gt; If a key concept spans two chunks — say, "The API rate limit is 100 requests per minute. Exceeding this limit returns a 429 status code" — an overlap of 50 characters ensures the second chunk still contains "100 requests per minute" as context. Without overlap, the retriever might fetch only one chunk and miss the causal relationship.&lt;/p&gt;
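
&lt;p&gt;The mechanics are easiest to see with plain fixed windows. The sketch below is a conceptual illustration of overlapping windows, not &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;'s actual boundary-aware algorithm:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual sketch: fixed windows with overlap (not LangChain's exact algorithm).
text = "The API rate limit is 100 requests per minute. Exceeding it returns 429."
chunk_size, overlap = 40, 10
step = chunk_size - overlap

chunks = [text[i : i + chunk_size] for i in range(0, len(text), step)]
for c in chunks:
    print(repr(c))
# Each chunk repeats the last 10 characters of the previous one, so a fact
# straddling a boundary survives intact in at least one chunk.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;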

&lt;p&gt;&lt;strong&gt;Chunk size trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunk Size&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;256 tokens&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Fact lookup, Q&amp;amp;A over structured docs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512 tokens&lt;/td&gt;
&lt;td&gt;Balanced&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;General-purpose RAG (good default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024 tokens&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Rich&lt;/td&gt;
&lt;td&gt;Long-form summarization, narrative documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048+ tokens&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Very rich&lt;/td&gt;
&lt;td&gt;Only when the LLM context window is large and queries are broad&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most use cases, &lt;strong&gt;512 tokens with 50-token overlap&lt;/strong&gt; is a safe starting point.&lt;/p&gt;
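
&lt;p&gt;One caveat when applying this table: the splitter above measures &lt;em&gt;characters&lt;/em&gt; (&lt;code&gt;length_function=len&lt;/code&gt;), while the table talks in tokens. To budget in tokens, build the splitter from a tokenizer instead (requires the &lt;code&gt;tiktoken&lt;/code&gt; package):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import RecursiveCharacterTextSplitter

# Length is now measured in tokens, matching the table's units.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # the tokenizer used by recent OpenAI models
    chunk_size=512,
    chunk_overlap=50,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;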

&lt;h3&gt;
  
  
  Step 3: Embed and Store in ChromaDB
&lt;/h3&gt;

&lt;p&gt;Now we convert each chunk into a vector and store them. ChromaDB is our choice here because it's persistent (data survives restarts), supports metadata filtering, and requires zero setup — it runs locally as an embedded database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_chroma&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;persist_directory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./chroma_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_vector_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create embeddings and store in ChromaDB.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-large-zh-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# SiliconFlow Chinese model
&lt;/span&gt;        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMBEDDING_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.siliconflow.cn/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;  &lt;span class="c1"&gt;# SiliconFlow limit: max 32 per batch
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;
        &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rmtree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;persist_directory&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vector store built: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; vectors persisted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Embedding model note:&lt;/strong&gt; We use &lt;code&gt;BAAI/bge-large-zh-v1.5&lt;/code&gt; on SiliconFlow (excellent for Chinese). Access via the OpenAI-compatible interface in &lt;code&gt;langchain_openai&lt;/code&gt;. Switch to &lt;code&gt;text-embedding-3-small&lt;/code&gt; for OpenAI. The &lt;code&gt;chunk_size=32&lt;/code&gt; parameter is critical — it's SiliconFlow's batch limit (max 32 per request), while most other providers default to 1000.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Build the Retriever
&lt;/h3&gt;

&lt;p&gt;The retriever is a thin wrapper around the vector store that handles the search logic. By default, it performs similarity search and returns the top-K most relevant chunks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Configure retriever with similarity search.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;search_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_k&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Build the RAG Chain
&lt;/h3&gt;

&lt;p&gt;Here's where LangChain's modern API shines. Instead of the older &lt;code&gt;RetrievalQA&lt;/code&gt; class, we use &lt;strong&gt;LCEL&lt;/strong&gt; (LangChain Expression Language) to compose a chain that is explicit, debuggable, and easy to modify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.output_parsers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StrOutputParser&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_rag_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Build the full RAG chain using LCEL (langchain 1.x compatible).
    retrieve → format_docs → assemble prompt → LLM → StrOutputParser
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glm-4-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Zhipu AI, via SiliconFlow or direct
&lt;/span&gt;        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://open.bigmodel.cn/api/paas/v4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# System prompt. {context} filled by format_docs, {question} by the user's input.
&lt;/span&gt;    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a precise knowledge assistant. Answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;based solely on the provided reference content below. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If the reference content does not contain the answer, say so clearly &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;— do not make anything up.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reference content:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;{context}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_messages&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{question}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Helper: convert Document list to a single string for {context}
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# LCEL Chain: compose components with the pipe operator
&lt;/span&gt;    &lt;span class="c1"&gt;# 1. {"context": retriever | format_docs, "question": RunnablePassthrough()}
&lt;/span&gt;    &lt;span class="c1"&gt;#    → retriever fetches docs, format_docs converts them to a string
&lt;/span&gt;    &lt;span class="c1"&gt;# 2. | prompt → assembles into a full prompt
&lt;/span&gt;    &lt;span class="c1"&gt;# 3. | llm → generates the answer
&lt;/span&gt;    &lt;span class="c1"&gt;# 4. | StrOutputParser() → returns plain text (not an AIMessage object)
&lt;/span&gt;    &lt;span class="n"&gt;rag_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;format_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rag_chain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's happening here?&lt;/strong&gt; We're using &lt;strong&gt;LCEL&lt;/strong&gt; (LangChain Expression Language) native syntax — the pipe &lt;code&gt;|&lt;/code&gt; operator — instead of the high-level &lt;code&gt;create_retrieval_chain&lt;/code&gt; (which was removed in langchain 1.x). The key line is &lt;code&gt;retriever | format_docs&lt;/code&gt;: the retriever outputs a list of &lt;code&gt;Document&lt;/code&gt; objects, and &lt;code&gt;format_docs&lt;/code&gt; converts them to a string that fills the &lt;code&gt;{context}&lt;/code&gt; placeholder. &lt;code&gt;RunnablePassthrough()&lt;/code&gt; passes the user's raw question through to the &lt;code&gt;{question}&lt;/code&gt; placeholder. The result is a few lines of declarative code that are functionally equivalent to the ~50 lines of imperative Python from Part 1.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Query the Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rag_chain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run a question through the RAG pipeline, print answer and sources.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# LCEL chain returns plain text directly
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Answer:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Retrieve docs separately to show sources (rag_chain doesn't expose them)
&lt;/span&gt;    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Retrieved sources:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;preview&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (page &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;preview&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Putting It All Together
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Load
&lt;/span&gt;    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_pdfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Split (smaller chunk_size for short PDF pages)
&lt;/span&gt;    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Embed &amp;amp; Store
&lt;/span&gt;    &lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_vector_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Retrieve
&lt;/span&gt;    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. Build Chain (LCEL)
&lt;/span&gt;    &lt;span class="n"&gt;rag_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_rag_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 6. Interactive Q&amp;amp;A
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Your question (quit to exit): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rag_chain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Running the Pipeline
&lt;/h2&gt;

&lt;p&gt;Place a PDF in &lt;code&gt;data/sample.pdf&lt;/code&gt; and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python rag_pipeline.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sample output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RAG Pipeline starting
  LLM Provider    : zhipu
  LLM Model       : glm-4-flash
  Embedding       : openai / BAAI/bge-large-zh-v1.5
  Data directory  : ./data
  Vector store    : ./chroma_db
==================================================
Loaded 'Automotive-SPICE-PAM-v40.pdf': 153 pages
Loaded 153 document fragments in total
Split into 2000 chunks (chunk_size=200, overlap=30)
[Embedding] Provider: openai | Model: BAAI/bge-large-zh-v1.5 | Base: https://api.siliconflow.cn/v1
Cleared old vector store: ./chroma_db
Vector store built: 2000 vectors persisted
==================================================
RAG Pipeline ready! Type a question to begin (enter 'quit' to exit)
==================================================

Your question: What is Automotive SPICE?

==================================================
Question: What is Automotive SPICE?
==================================================

Answer:
Automotive SPICE (Automotive Software Process Improvement and
Capability Determination) is a framework for assessing and improving
the capability of automotive software development processes. It
defines the key process areas of the software development lifecycle
and establishes capability-level assessment criteria...

Retrieved sources:
  [1] ./data/Automotive-SPICE-PAM-v40.pdf (page 5): Automotive
      SPICE Process Assessment Model The Process Assessment Model
      (PAM) defines the processes...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how the answer includes &lt;strong&gt;citations&lt;/strong&gt; — we know exactly which pages the information came from. This traceability is critical for production RAG systems where users need to verify claims.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed From Our 100-Line Version?
&lt;/h2&gt;

&lt;p&gt;Let's compare the hand-written RAG from Part 1 with our LangChain pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Hand-written (Part 1)&lt;/th&gt;
&lt;th&gt;LangChain (Part 2)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF loading&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;One-line &lt;code&gt;PyPDFLoader&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text splitting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None (whole docs)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; with smart boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In-memory only, lost on restart&lt;/td&gt;
&lt;td&gt;ChromaDB persists to disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding model swap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rewrite API calls&lt;/td&gt;
&lt;td&gt;One-line parameter change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM swap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rewrite client code&lt;/td&gt;
&lt;td&gt;One-line parameter change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector DB swap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rewrite storage + search&lt;/td&gt;
&lt;td&gt;Swap &lt;code&gt;Chroma&lt;/code&gt; for &lt;code&gt;Qdrant&lt;/code&gt; / &lt;code&gt;Pinecone&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt engineering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raw string formatting&lt;/td&gt;
&lt;td&gt;Templated prompts with &lt;code&gt;ChatPromptTemplate&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Source citations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual tracking&lt;/td&gt;
&lt;td&gt;Automatic metadata propagation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chain construction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual &lt;code&gt;retrieve + generate&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;LCEL &lt;code&gt;|&lt;/code&gt; pipe composition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lines of pipeline code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~80&lt;/td&gt;
&lt;td&gt;~25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The abstraction doesn't hide complexity — it &lt;strong&gt;isolates&lt;/strong&gt; it. When you need to debug retrieval quality, you know exactly which component to tune. When you need to swap the embedding model for a cheaper alternative, you change one line. When your data grows beyond ChromaDB's capabilities, you switch to Qdrant without touching the rest of the pipeline.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;LangChain Version Compatibility:&lt;/strong&gt; The code in this article targets &lt;code&gt;langchain 1.x&lt;/code&gt; (current stable), which introduced breaking changes relative to &lt;code&gt;0.3.x&lt;/code&gt;: &lt;code&gt;create_retrieval_chain&lt;/code&gt; and &lt;code&gt;create_stuff_documents_chain&lt;/code&gt; were removed. We use LCEL native syntax (the &lt;code&gt;|&lt;/code&gt; pipe operator) instead, which is functionally equivalent and version-agnostic.&lt;/p&gt;
&lt;/blockquote&gt;
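
&lt;p&gt;To make the "one line" claim concrete, here is a sketch of the three swaps. The model names and the Qdrant import are illustrative placeholders, not part of this article's project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical swaps; each one touches a single construction site.

# 1. Swap the embedding model (e.g., to OpenAI's hosted model):
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # was BAAI/bge-large-zh-v1.5

# 2. Swap the LLM:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # was glm-4-flash

# 3. Swap the vector store (assumes the langchain-qdrant package):
# from langchain_qdrant import QdrantVectorStore
# vector_store = QdrantVectorStore.from_documents(
#     chunks, embeddings, url="http://localhost:6333", collection_name="rag"
# )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;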




&lt;h2&gt;
  
  
  Common Pitfalls at Each Stage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Loader Pitfall: "My PDF has tables and they come out garbled"
&lt;/h3&gt;

&lt;p&gt;Raw PDF text extraction flattens tables into a stream of numbers. For table-heavy documents, use &lt;code&gt;UnstructuredPDFLoader&lt;/code&gt; or &lt;code&gt;AzureAIDocumentIntelligenceLoader&lt;/code&gt;, which preserve structural relationships.&lt;/p&gt;
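
&lt;p&gt;A minimal sketch of the element-aware loader, assuming the &lt;code&gt;unstructured&lt;/code&gt; extras are installed (the file path is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_community.document_loaders import UnstructuredPDFLoader

# mode="elements" keeps tables, titles, and list items as separate
# Document objects instead of one flattened text stream
loader = UnstructuredPDFLoader("./data/sample.pdf", mode="elements")
docs = loader.load()
print(docs[0].metadata.get("category"))  # e.g. "Table" or "NarrativeText"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;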

&lt;h3&gt;
  
  
  Splitter Pitfall: "The answer is split across two chunks and the model only sees half"
&lt;/h3&gt;

&lt;p&gt;Increase &lt;code&gt;chunk_overlap&lt;/code&gt; to 100-150 characters, or reduce &lt;code&gt;chunk_size&lt;/code&gt; so that key concepts fit within a single chunk. Better yet, use &lt;strong&gt;Parent-Document Retrieval&lt;/strong&gt; (covered in a later article), which retrieves small chunks but returns the full parent document for context.&lt;/p&gt;
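
&lt;p&gt;As a sketch (the numbers are starting points to tune, not universal constants):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import RecursiveCharacterTextSplitter

# A larger overlap means a sentence cut at one chunk boundary is
# repeated at the start of the next chunk, so at least one copy is whole
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=120,  # up from 30; trades index size for integrity
)
chunks = splitter.split_documents(docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;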

&lt;h3&gt;
  
  
  Embedding Pitfall: "Questions and documents don't match even though they should"
&lt;/h3&gt;

&lt;p&gt;This is the "asymmetric retrieval" problem. A user asks "How do I reset my password?" but the document says "To reset your password, navigate to Settings → Security." The question and the answer embed to different vectors because their surface text differs. Solutions: use a model fine-tuned for Q&amp;amp;A retrieval (like BGE-M3), or generate hypothetical answers for retrieval (HyDE — also covered later).&lt;/p&gt;
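
&lt;p&gt;Here's a hand-rolled HyDE sketch that reuses the &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;vector_store&lt;/code&gt; built earlier; the prompt wording is our own, not a fixed recipe:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def hyde_retrieve(question: str, k: int = 4):
    """Embed a hypothetical answer instead of the raw question (HyDE)."""
    # 1. Draft a plausible answer. It may be factually wrong; only its
    #    wording matters, because it reads like document text.
    hypothetical = llm.invoke(
        f"Write a short passage that answers this question:\n{question}"
    ).content
    # 2. Retrieve with the hypothetical passage, which embeds closer
    #    to answer-shaped chunks than the bare question does.
    return vector_store.similarity_search(hypothetical, k=k)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;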

&lt;h3&gt;
  
  
  Retriever Pitfall: "Top-K=4 isn't enough for complex questions"
&lt;/h3&gt;

&lt;p&gt;If a question requires synthesizing information from five different sections of a document, &lt;code&gt;k=4&lt;/code&gt; will miss one. But increasing &lt;code&gt;k&lt;/code&gt; blindly adds noise. A better approach: use &lt;strong&gt;Multi-Query Retrieval&lt;/strong&gt; (generate 3 variants of the question, retrieve for each, deduplicate) or &lt;strong&gt;Reranking&lt;/strong&gt; (retrieve 20, then use a cross-encoder to pick the best 5).&lt;/p&gt;
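
&lt;p&gt;LangChain ships a built-in version of the multi-query option. A sketch, reusing &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;vector_store&lt;/code&gt; (note: the import path shown is the 0.3.x location; in 1.x it may have moved to the &lt;code&gt;langchain-classic&lt;/code&gt; package):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain.retrievers.multi_query import MultiQueryRetriever

# Generates several rephrasings of the question with the LLM,
# retrieves for each variant, and returns the deduplicated union
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    llm=llm,
)
docs = multi_retriever.invoke("What process areas does Automotive SPICE define?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;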

&lt;h3&gt;
  
  
  Chain Pitfall: "The model ignores the context and hallucinates"
&lt;/h3&gt;

&lt;p&gt;Your prompt matters. The system prompt must explicitly instruct the model to use only the provided context. Adding "If the reference content does not contain the answer, say so clearly — do not make anything up" dramatically improves faithfulness. We'll measure this quantitatively with RAGAS in the evaluation articles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we took the raw RAG concept from Part 1 and wrapped it in a production-ready framework. Here's what we covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The six components&lt;/strong&gt; of a LangChain RAG pipeline — Loader, Splitter, Embedding, Vector Store, Retriever, and Chain — and what quality risk hides in each one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A complete, runnable project&lt;/strong&gt; that loads PDFs, splits them with &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;, embeds them through the OpenAI-compatible embeddings interface, stores them in ChromaDB, and answers questions via a LangChain LCEL chain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The chunk size trade-off&lt;/strong&gt; — in real projects, PDF pages may be very short (e.g., 200 characters), so a &lt;code&gt;chunk_size=512&lt;/code&gt; default may never actually split anything; &lt;code&gt;chunk_size=200&lt;/code&gt; with an overlap of 30 is a safer starting point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common pitfalls&lt;/strong&gt; at each pipeline stage, from mangled PDF tables to asymmetric retrieval mismatches.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The code in this article is a solid foundation. It handles real PDFs, persists data, and gives you source citations. But it's still a &lt;strong&gt;naive RAG&lt;/strong&gt; pipeline — one query, one retrieval pass, one answer. In the next articles, we'll add the components that separate toy demos from production systems: hybrid search, reranking, query optimization, and evaluation frameworks.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://python.langchain.com/docs/tutorials/rag/" rel="noopener noreferrer"&gt;LangChain RAG Tutorial&lt;/a&gt; — Official LangChain RAG quickstart&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://python.langchain.com/docs/concepts/lcel/" rel="noopener noreferrer"&gt;LangChain Expression Language (LCEL)&lt;/a&gt; — Why and how to use LCEL for composable chains&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.trychroma.com/" rel="noopener noreferrer"&gt;ChromaDB Documentation&lt;/a&gt; — Vector store setup, persistence, and querying&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>langchain</category>
      <category>opensource</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No.52): Tank-OS - A Red Hat Engineer Baked an AI Agent Into a Bootable Linux Image Over a Weekend</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Thu, 30 Apr 2026 11:59:02 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no52-tank-os-a-red-hat-engineer-baked-an-ai-agent-into-a-1262</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no52-tank-os-a-red-hat-engineer-baked-an-ai-agent-into-a-1262</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"When an AI Agent starts deleting emails, accessing databases, and calling external APIs, are you certain it can't go out of bounds?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is article No.52 in the "One Open Source Project a Day" series. Today's project is &lt;strong&gt;Tank-OS&lt;/strong&gt; (&lt;a href="https://github.com/LobsterTrap/tank-os" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In April 2026, TechCrunch reported on a project that one engineer built over a single weekend. The engineer is Sally O'Malley, Principal Software Engineer at Red Hat's Office of the CTO and a core maintainer of OpenClaw. The project answers a question that becomes more pressing as AI Agents get more capable: &lt;strong&gt;when you need to deploy a fleet of AI Agents across a company, how do you ensure every machine is isolated, secure, and consistently updatable?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tank-OS's answer: &lt;strong&gt;pack the Agent, its runtime, the OS, Systemd units, and the upgrade mechanism into a single OCI container image, then boot entire machines directly from that image.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In cloud-native circles, this pattern (called bootc — Boot Container) isn't new. But applying it specifically to solve AI Agent enterprise deployment security? Tank-OS is the most concrete, complete open-source reference implementation available.&lt;/p&gt;

&lt;p&gt;TechCrunch headline: &lt;em&gt;"Red Hat's OpenClaw maintainer just made enterprise Claw deployments a lot safer."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;What bootc is: why "OS as a container" is the next Linux deployment paradigm&lt;/li&gt;
&lt;li&gt;rootless Podman Quadlet: how to run containers with no root privileges and no daemon&lt;/li&gt;
&lt;li&gt;Tank-OS's layered architecture: how the immutable OS layer and mutable state layer stay separate&lt;/li&gt;
&lt;li&gt;Transactional system updates: rolling back an entire machine's OS state with one command&lt;/li&gt;
&lt;li&gt;Credential management: how API keys never touch the filesystem in plaintext&lt;/li&gt;
&lt;li&gt;Tank-OS vs. NVIDIA NemoClaw: two enterprise AI Agent security approaches compared&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic understanding of Linux containers (images, OCI spec)&lt;/li&gt;
&lt;li&gt;Familiarity with Podman or Docker&lt;/li&gt;
&lt;li&gt;Basic Systemd service management knowledge&lt;/li&gt;
&lt;li&gt;Familiarity with OpenClaw (AI Agent framework, 40,000+ GitHub Stars)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is It?
&lt;/h3&gt;

&lt;p&gt;The core problem Tank-OS addresses can be stated in one sentence: &lt;strong&gt;AI Agents are powerful enough to be dangerous, and most deployment approaches haven't taken that seriously.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sally O'Malley's project documentation describes real incidents: OpenClaw Agents in production have deleted emails and accidentally modified user data. When an enterprise deploys dozens or hundreds of Agent-running machines, the traditional "apt install + config files" approach has four fundamental problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Credential security&lt;/strong&gt;: API Keys live in config files, readable by other processes or accidentally leaked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update consistency&lt;/strong&gt;: Different OS versions and library versions across machines cause inconsistent Agent behavior
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure isolation&lt;/strong&gt;: A crashed or compromised Agent process can affect the host OS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fleet management&lt;/strong&gt;: No unified mechanism to push updates across the whole fleet&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tank-OS uses a bootc + rootless Podman Quadlet stack to systematically address all four.&lt;/p&gt;

&lt;h3&gt;
  
  
  About the Author
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sally Ann O'Malley&lt;/strong&gt; (GitHub: &lt;code&gt;sallyom&lt;/code&gt;; &lt;code&gt;LobsterTrap&lt;/code&gt; is the account she created for this project)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Role&lt;/strong&gt;: Red Hat Principal Software Engineer, Emerging Technologies, Office of the CTO&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw connection&lt;/strong&gt;: Core maintainer of OpenClaw, collaborating with founder Peter Steinberger on enterprise use cases and Red Hat Linux ecosystem integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background&lt;/strong&gt;: Long-time contributor to Red Hat container technologies; deep user and contributor to Podman, bootc, and related Red Hat open-source projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Origin of Tank-OS&lt;/strong&gt;: Built in a single weekend. O'Malley anticipated "what happens when OpenClaw enters the enterprise" and wanted a reference architecture ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Red Hat's official blog also published a companion article: &lt;em&gt;"Building a hardened, image-based foundation for AI agents"&lt;/em&gt; — signaling this is not just a personal project but Red Hat's formal exploration of AI Agent infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Stats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;GitHub Stars&lt;/strong&gt;: 104&lt;/li&gt;
&lt;li&gt;🍴 &lt;strong&gt;Forks&lt;/strong&gt;: 11&lt;/li&gt;
&lt;li&gt;🔤 &lt;strong&gt;Language&lt;/strong&gt;: Shell (81.7%), Dockerfile (18.3%)&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;Pre-built image&lt;/strong&gt;: &lt;code&gt;quay.io/sallyom/tank-os:latest&lt;/code&gt; (amd64 / arm64)&lt;/li&gt;
&lt;li&gt;📰 &lt;strong&gt;Press&lt;/strong&gt;: TechCrunch (2026-04-28), Decrypt, Yahoo Tech&lt;/li&gt;
&lt;li&gt;📅 &lt;strong&gt;Released&lt;/strong&gt;: April 2026&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Architecture Shift
&lt;/h3&gt;

&lt;p&gt;Tank-OS elevates "how to safely run an AI Agent" from a software configuration problem to an &lt;strong&gt;operating system architecture problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Traditional deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host OS (mutable)
  └── Userspace (mutable)
        └── OpenClaw process (can access host filesystem)
              └── API Keys (plaintext config file)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tank-OS deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bootc image (immutable OS layer, read-only)
  └── openclaw user (no root privileges)
        └── rootless Podman Quadlet (container, no daemon)
              └── OpenClaw process (isolated from host)
                    └── API Keys (Podman secret store, encrypted)
  └── ~/.openclaw/ (mutable state layer, persisted but isolated from OS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Enterprise AI Agent fleet management&lt;/strong&gt;&lt;br&gt;
Dozens or hundreds of machines running the same OpenClaw version, kept consistent via unified image update — no "configuration drift" causing behavioral differences across machines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dev/prod environment parity&lt;/strong&gt;&lt;br&gt;
Developers run the exact same Tank-OS image locally in a VM, eliminating "works on my machine" problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Read-only OS security hardening&lt;/strong&gt;&lt;br&gt;
Even if an Agent process is compromised or has a bug, it cannot modify the host OS filesystem — the OS layer is immutable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Cloud instances and edge devices&lt;/strong&gt;&lt;br&gt;
SSH key injection via cloud-init enables fast Agent instance startup on AWS EC2, GCP VMs, or Raspberry Pi devices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Fast version rollback&lt;/strong&gt;&lt;br&gt;
One command to roll back to the previous OS + Agent version — no manual uninstall/reinstall of dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Use the pre-built image (recommended)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No local compilation needed. Use Podman Desktop's BootC extension or &lt;code&gt;bootc-image-builder&lt;/code&gt; to generate a virtual disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create output directory&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; out-tank-os

&lt;span class="c"&gt;# (Optional) Prepare SSH key injection config&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; config.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "blueprint": {
    "customizations": {
      "user": [
        {
          "name": "openclaw",
          "groups": ["wheel"],
          "key": "ssh-ed25519 AAAA...your-public-key..."
        }
      ]
    }
  }
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Build QCOW2 disk image (requires rootful Podman)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;podman run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--privileged&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--pull&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;newer &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ./config.json:/config.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ./out-tank-os:/output &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /var/lib/containers/storage:/var/lib/containers/storage &lt;span class="se"&gt;\&lt;/span&gt;
  quay.io/centos-bootc/bootc-image-builder:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; qcow2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; /config.json &lt;span class="se"&gt;\&lt;/span&gt;
  quay.io/sallyom/tank-os:latest

&lt;span class="c"&gt;# Result: out-tank-os/qcow2/disk.qcow2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SSH in and interact with OpenClaw:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Log in as openclaw user (not root)&lt;/span&gt;
ssh openclaw@&amp;lt;vm-ip&amp;gt;

&lt;span class="c"&gt;# Check Agent status&lt;/span&gt;
openclaw gateway status &lt;span class="nt"&gt;--deep&lt;/span&gt;

&lt;span class="c"&gt;# Run health check&lt;/span&gt;
openclaw doctor

&lt;span class="c"&gt;# Get Dashboard URL&lt;/span&gt;
openclaw dashboard &lt;span class="nt"&gt;--no-open&lt;/span&gt;

&lt;span class="c"&gt;# List connected devices&lt;/span&gt;
openclaw devices list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Update OS + Agent (transactional):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# First update: switch to latest image&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bootc switch &lt;span class="nt"&gt;--apply&lt;/span&gt; quay.io/sallyom/tank-os:latest

&lt;span class="c"&gt;# Subsequent updates: pull new version and restart&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bootc upgrade &lt;span class="nt"&gt;--apply&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Features
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Immutable OS&lt;/strong&gt; — Root filesystem mounted read-only; Agent processes and userspace programs cannot modify system files. Configuration drift is structurally impossible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;rootless Podman Quadlet&lt;/strong&gt; — OpenClaw runs in a Podman container with no root privileges, its lifecycle managed by Systemd Quadlet units. A compromised container process cannot escalate to the host OS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transactional updates&lt;/strong&gt; — bootc's atomic update mechanism: updates either succeed completely or roll back completely. No "half-updated broken machine" state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Secure credential management&lt;/strong&gt; — API Keys stored in rootless Podman's secret store (encrypted), never appearing in plaintext config files or environment variables. SSH keys injected via cloud-init at first boot, not baked into the image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-instance support&lt;/strong&gt; — A single machine can run multiple OpenClaw instances (e.g., one for work, one for research), fully isolated with separate containers, ports, data directories, and secret namespaces.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-architecture&lt;/strong&gt; — Pre-built images support both &lt;code&gt;linux/amd64&lt;/code&gt; and &lt;code&gt;linux/arm64&lt;/code&gt;, covering x86 servers and Apple Silicon/ARM devices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cloud-init native integration&lt;/strong&gt; — Standard cloud-init support for AWS, GCP, Azure instance initialization.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  How It Compares
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Tank-OS&lt;/th&gt;
&lt;th&gt;Bare metal OpenClaw&lt;/th&gt;
&lt;th&gt;Docker compose&lt;/th&gt;
&lt;th&gt;NVIDIA NemoClaw&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OS immutability&lt;/td&gt;
&lt;td&gt;Complete (bootc read-only root)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None (host OS mutable)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container privileges&lt;/td&gt;
&lt;td&gt;rootless (no daemon)&lt;/td&gt;
&lt;td&gt;No containers&lt;/td&gt;
&lt;td&gt;Root Docker daemon&lt;/td&gt;
&lt;td&gt;K3s + Docker daemon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transactional updates&lt;/td&gt;
&lt;td&gt;Atomic rollback&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credential storage&lt;/td&gt;
&lt;td&gt;Podman secret store&lt;/td&gt;
&lt;td&gt;Plaintext files&lt;/td&gt;
&lt;td&gt;Docker secrets or plaintext&lt;/td&gt;
&lt;td&gt;L7 proxy injection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fleet management&lt;/td&gt;
&lt;td&gt;Suitable (unified image)&lt;/td&gt;
&lt;td&gt;Difficult&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Suitable (more complex)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment complexity&lt;/td&gt;
&lt;td&gt;Low (single image boot)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High (K3s cluster)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security policy granularity&lt;/td&gt;
&lt;td&gt;Medium (OS-level isolation)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High (multi-layer policy engine)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Technology 1: bootc (Boot Container)
&lt;/h3&gt;

&lt;p&gt;bootc is a Red Hat-led open-source project that implements a counterintuitive idea: &lt;strong&gt;manage the entire operating system as an OCI container image&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Traditional Linux update model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System running
  → apt upgrade / dnf update (replacing files one by one)
  → partial success → system in intermediate state → rollback is hard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;bootc update model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Currently running: image v1.0 (read-only mounted)
  → bootc switch quay.io/sallyom/tank-os:v1.1
  → Download new image layers, write to separate partition
  → Reboot → atomic switch to v1.1
  → Error → bootc rollback → back to v1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is analogous to the A/B partition mechanism in phone OTA updates: the new version is written to the standby partition, the device reboots to switch over, and the original partition is kept as a fallback in case anything fails.&lt;/p&gt;

&lt;p&gt;Tank-OS's &lt;code&gt;bootc/Containerfile&lt;/code&gt; builds from &lt;code&gt;quay.io/fedora/fedora-bootc:latest&lt;/code&gt;, Fedora's official bootc base image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; quay.io/fedora/fedora-bootc:latest&lt;/span&gt;

&lt;span class="c"&gt;# Install required packages&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;dnf &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    podman &lt;span class="se"&gt;\
&lt;/span&gt;    openssh-server &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-init &lt;span class="se"&gt;\
&lt;/span&gt;    python3 &lt;span class="se"&gt;\
&lt;/span&gt;    shadow-utils &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    vim

&lt;span class="c"&gt;# Create openclaw user (UID/GID 1000)&lt;/span&gt;
&lt;span class="c"&gt;# Configure subuid/subgid range (100000-165535) for rootless Podman&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;useradd &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; 1000 &lt;span class="nt"&gt;-g&lt;/span&gt; 1000 openclaw &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; wheel openclaw &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"100000:65536"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/subuid &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"100000:65536"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/subgid

&lt;span class="c"&gt;# Enable cloud-init service family&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-init-local.service &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-init-network.service &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-init.service &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-config.service &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-final.service &lt;span class="se"&gt;\
&lt;/span&gt;    sshd.service

&lt;span class="c"&gt;# Inject tank-os scripts and Quadlet units&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; bootc/ /&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Technology 2: rootless Podman Quadlet
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Podman instead of Docker?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Podman is Red Hat's daemonless container tool. The core security advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No central daemon&lt;/strong&gt;: Docker requires a root-privileged &lt;code&gt;dockerd&lt;/code&gt; running continuously; Podman forks container processes directly, no daemon required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rootless mode&lt;/strong&gt;: Regular users can run full containers; container "root" is an unprivileged user on the host&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User namespace isolation&lt;/strong&gt;: Via subuid/subgid mapping — container UID 1000 maps to host UID 101000&lt;/li&gt;
&lt;/ul&gt;
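
&lt;p&gt;You can inspect the mapping yourself with stock Podman; the exact numbers depend on your subuid range:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Enter the rootless user namespace and print its UID map
podman unshare cat /proc/self/uid_map
#        0       1000          1    (container root = your own user)
#        1     100000      65536    (everything else = subuid range)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;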

&lt;p&gt;OpenClaw in Tank-OS runs as the &lt;code&gt;openclaw&lt;/code&gt; user (UID 1000). Even if the OpenClaw process is exploited:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It cannot access host OS system files (OS layer is read-only)&lt;/li&gt;
&lt;li&gt;It cannot escalate to root (rootless Podman namespace isolation)&lt;/li&gt;
&lt;li&gt;It cannot affect other users' data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What is Quadlet?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Quadlet (introduced in Podman 4.4+) lets you declare container runtime configuration using Systemd unit file syntax, with Systemd managing the container lifecycle directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/containers/systemd/users/1000/openclaw.container
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;OpenClaw AI Agent Service&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network-online.target&lt;/span&gt;

&lt;span class="nn"&gt;[Container]&lt;/span&gt;
&lt;span class="py"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;ghcr.io/openclaw/openclaw:latest&lt;/span&gt;
&lt;span class="py"&gt;ContainerName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;openclaw&lt;/span&gt;
&lt;span class="py"&gt;UserNS&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;keep-id:uid=1000,gid=1000&lt;/span&gt;
&lt;span class="py"&gt;Volume&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;%h/.openclaw:/home/openclaw/.openclaw:z&lt;/span&gt;
&lt;span class="py"&gt;Secret&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;openclaw-api-key,type=env,target=OPENCLAW_API_KEY&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;default.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Systemd reads this file and automatically generates the corresponding &lt;code&gt;.service&lt;/code&gt; unit — pulling the image, creating the container, managing restart policy. OpenClaw becomes a standard Systemd service: &lt;code&gt;systemctl --user status openclaw&lt;/code&gt; for status, &lt;code&gt;journalctl --user -u openclaw&lt;/code&gt; for logs.&lt;/p&gt;
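
&lt;p&gt;For illustration, the day-to-day lifecycle uses nothing beyond standard Systemd commands (the unit name comes from the Quadlet file above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Regenerate services from Quadlet files after adding or editing one
systemctl --user daemon-reload

# Start and inspect the container like any other service
systemctl --user start openclaw.service
systemctl --user status openclaw

# Follow the Agent's logs
journalctl --user -u openclaw -f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;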

&lt;h3&gt;
  
  
  State Layer Architecture
&lt;/h3&gt;

&lt;p&gt;Tank-OS strictly separates immutable and mutable layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Immutable layer (OS image, read-only)
├── /usr/           ← System software
├── /etc/           ← System config (baked into image)
├── /opt/tank-os/   ← tank-os scripts and tools
└── Quadlet units   ← Container declaration files

Mutable layer (runtime state, persisted)
├── ~/.openclaw/            ← OpenClaw session state, plugins, history
├── ~/.config/containers/   ← Podman user config
└── Podman secret store     ← Encrypted API keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run &lt;code&gt;bootc upgrade&lt;/code&gt;, only the immutable layer is replaced. OpenClaw's state, conversation history, and API keys in the mutable layer are fully preserved: you upgrade the Agent version while keeping all session history and configuration intact.&lt;/p&gt;
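
&lt;p&gt;From the operator's seat, the whole mechanism comes down to two standard bootc commands (sketch; output omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show the booted image, any staged update, and the rollback target
sudo bootc status

# A bad upgrade? Flip back to the previous image and reboot into it
sudo bootc rollback
sudo systemctl reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;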

&lt;h3&gt;
  
  
  Credential Flow
&lt;/h3&gt;

&lt;p&gt;The API key injection flow is a security highlight:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. First boot: cloud-init executes
   ↓
2. Reads user-data (from cloud provider metadata or local config)
   ↓
3. Runs tank-os bootstrap script
   ↓
4. Writes API key into rootless Podman secret store (encrypted)
   ↓
5. Quadlet unit starts OpenClaw container
   ↓
6. Secret= directive injects secret as environment variable into container
   ↓
7. API key never appears in plaintext on the filesystem or in process environment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare this with a traditional &lt;code&gt;.env&lt;/code&gt; file: Podman's secret store uses system keychain encryption, so an ordinary file read cannot retrieve the contents.&lt;/p&gt;
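
&lt;p&gt;To make step 4 concrete, here's a minimal sketch of seeding the secret store. The secret name matches the &lt;code&gt;Secret=&lt;/code&gt; line in the Quadlet unit above; the function and the interactive prompt are illustrative, and in Tank-OS this runs from the cloud-init bootstrap script, not by hand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: seed the rootless Podman secret store.
# "podman secret create NAME -" reads the secret body from stdin, so the
# key never appears in a file or in a process argument list.
import getpass
import subprocess

def store_api_key(api_key: str, name: str = "openclaw-api-key") -&gt; None:
    subprocess.run(
        ["podman", "secret", "create", name, "-"],
        input=api_key.encode(),
        check=True,
    )

if __name__ == "__main__":
    # Illustrative prompt; the real flow reads the key from user-data
    store_api_key(getpass.getpass("API key: "))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, the &lt;code&gt;Secret=&lt;/code&gt; directive in the Quadlet unit injects the stored value as &lt;code&gt;OPENCLAW_API_KEY&lt;/code&gt; at container start.&lt;/p&gt;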

&lt;h3&gt;
  
  
  Tank-OS vs. NVIDIA NemoClaw
&lt;/h3&gt;

&lt;p&gt;The most common technical question after Tank-OS launched: "How does this differ from NVIDIA's NemoClaw reference architecture?"&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Tank-OS&lt;/th&gt;
&lt;th&gt;NVIDIA NemoClaw&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Core technology&lt;/td&gt;
&lt;td&gt;Fedora bootc + rootless Podman&lt;/td&gt;
&lt;td&gt;Docker + embedded K3s cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS immutability&lt;/td&gt;
&lt;td&gt;Complete (bootc partition-level)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credential isolation&lt;/td&gt;
&lt;td&gt;Podman secret store&lt;/td&gt;
&lt;td&gt;L7 proxy injection (key never touches Agent filesystem)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security policy&lt;/td&gt;
&lt;td&gt;OS-level isolation&lt;/td&gt;
&lt;td&gt;seccomp + Landlock + network namespace multi-layer policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update mechanism&lt;/td&gt;
&lt;td&gt;Atomic image replacement (rollback capable)&lt;/td&gt;
&lt;td&gt;Standard container update&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target scenario&lt;/td&gt;
&lt;td&gt;Enterprise fleet (standard IT ops toolchain)&lt;/td&gt;
&lt;td&gt;Research environments, fine-grained policy control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment complexity&lt;/td&gt;
&lt;td&gt;Low (single image, standard VM tools)&lt;/td&gt;
&lt;td&gt;High (K3s cluster operations)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;: Tank-OS fits enterprise scenarios that treat the AI Agent as standard IT infrastructure. NemoClaw fits research or high-security environments that need multi-layer, fine-grained security policy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/LobsterTrap/tank-os" rel="noopener noreferrer"&gt;https://github.com/LobsterTrap/tank-os&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;Pre-built image&lt;/strong&gt;: &lt;a href="https://quay.io/sallyom/tank-os" rel="noopener noreferrer"&gt;quay.io/sallyom/tank-os&lt;/a&gt; (amd64 / arm64)&lt;/li&gt;
&lt;li&gt;📚 &lt;strong&gt;Build docs&lt;/strong&gt;: &lt;a href="https://github.com/LobsterTrap/tank-os/blob/main/docs/build.md" rel="noopener noreferrer"&gt;docs/build.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📚 &lt;strong&gt;CLI docs&lt;/strong&gt;: &lt;a href="https://github.com/LobsterTrap/tank-os/blob/main/docs/cli.md" rel="noopener noreferrer"&gt;docs/cli.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Press &amp;amp; Related
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;📰 &lt;strong&gt;TechCrunch&lt;/strong&gt;: &lt;a href="https://techcrunch.com/2026/04/28/red-hats-openclaw-maintainer-just-made-enterprise-claw-deployments-a-lot-safer/" rel="noopener noreferrer"&gt;Red Hat's OpenClaw maintainer just made enterprise Claw deployments a lot safer&lt;/a&gt; (Julie Bort, 2026-04-28)&lt;/li&gt;
&lt;li&gt;📝 &lt;strong&gt;Red Hat blog&lt;/strong&gt;: &lt;a href="https://www.redhat.com/en/blog/building-hardened-image-based-foundation-ai-agents" rel="noopener noreferrer"&gt;Building a hardened, image-based foundation for AI agents&lt;/a&gt; (Sally O'Malley)&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Fedora bootc docs&lt;/strong&gt;: &lt;a href="https://docs.fedoraproject.org/en-US/bootc/" rel="noopener noreferrer"&gt;docs.fedoraproject.org/en-US/bootc&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;OpenClaw project&lt;/strong&gt;: &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;github.com/openclaw/openclaw&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clear problem definition&lt;/strong&gt;: Tank-OS targets real enterprise AI Agent deployment security problems — credential leakage, configuration drift, update consistency, failure isolation — not stacking technologies for their own sake&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mature technology choices&lt;/strong&gt;: bootc + rootless Podman + Systemd Quadlet are all Red Hat production-grade tools with full enterprise support and documentation. Not experimental.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The layered design is the insight&lt;/strong&gt;: Strict separation of immutable OS layer and mutable state layer is what lets Tank-OS simultaneously achieve "security isolation" and "state persistence." This design principle is worth borrowing for any stateful AI Agent deployment scenario.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transactional updates change the ops paradigm&lt;/strong&gt;: A single &lt;code&gt;bootc upgrade&lt;/code&gt; command updates the entire machine's OS + Agent, with atomic rollback on failure. For IT teams managing AI Agent fleets, this is a qualitative change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The weekend project lesson&lt;/strong&gt;: Tank-OS uses ~500 lines of shell script and a Containerfile to solve a real enterprise pain point. A reminder that many apparently complex infrastructure problems need nothing more than existing mature tools, composed correctly.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Who Should Use This
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise IT administrators&lt;/strong&gt;: Deploying OpenClaw Agent fleets internally with unified management and security compliance requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevOps/SRE engineers&lt;/strong&gt;: Interested in bootc and immutable infrastructure; looking for an AI Agent deployment reference architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red Hat/Fedora users&lt;/strong&gt;: Want to integrate AI Agents seamlessly into existing RHEL infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI security researchers&lt;/strong&gt;: Studying AI Agent isolation, credential management, and enterprise security hardening&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where to Start
&lt;/h3&gt;

&lt;p&gt;Start by understanding bootc's core concepts (&lt;a href="https://docs.fedoraproject.org/en-US/bootc/" rel="noopener noreferrer"&gt;Fedora bootc docs&lt;/a&gt;), then read Tank-OS's &lt;code&gt;bootc/Containerfile&lt;/code&gt; and &lt;code&gt;docs/build.md&lt;/code&gt; and follow the documentation to run an instance in a local VM. The whole process takes under 2 hours, and you'll come away with a very concrete feel for the "OS as container" paradigm.&lt;/p&gt;

&lt;p&gt;If you're already deploying OpenClaw in an enterprise environment, Tank-OS can be used almost directly as a production reference architecture. Sally O'Malley is OpenClaw's enterprise maintainer — every design decision in this project comes from real enterprise requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Question Worth Sitting With
&lt;/h3&gt;

&lt;p&gt;Tank-OS's central premise is worth sitting with: &lt;strong&gt;when should AI Agent security be solved at the OS architecture level rather than the application configuration level?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most current AI Agent deployments treat security as a software problem — use the right API key management library, set the right file permissions, configure the right network rules. Tank-OS argues that once an Agent is powerful enough to delete emails, modify databases, and call external APIs, software configuration isn't enough. The OS itself needs to be part of the security model.&lt;/p&gt;

&lt;p&gt;That's a significant architectural claim. The fact that it came from a Red Hat engineer on a weekend — not a multi-year research project — suggests the underlying tools were already ready. Someone just needed to put them together.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Visit my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;personal site&lt;/a&gt; for more useful knowledge and interesting products&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>openclaw</category>
      <category>linux</category>
      <category>podman</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No.51): VibeVoice - Microsoft's Speech AI That Processes 90 Minutes of Audio in a Single Pass</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Wed, 29 Apr 2026 02:32:29 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no51-vibevoice-microsofts-speech-ai-that-processes-90-minutes-3k6p</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no51-vibevoice-microsofts-speech-ai-that-processes-90-minutes-3k6p</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"The fundamental limit of traditional speech AI isn't model quality — it's architecture. They were never designed for long audio."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is article No.51 in the "One Open Source Project a Day" series. Today's project is &lt;strong&gt;VibeVoice&lt;/strong&gt; (&lt;a href="https://github.com/microsoft/VibeVoice" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In August 2025, Microsoft Research quietly pushed a repository to GitHub. No launch event. No press release.&lt;/p&gt;

&lt;p&gt;The capability it demonstrated: &lt;strong&gt;synthesizing 90 minutes of natural multi-speaker conversation — 4 speakers, consistent voices throughout — in a single model pass&lt;/strong&gt;. For context, ElevenLabs tops out around 5 minutes per call. OpenAI's TTS has similar constraints. The open-source alternatives before this couldn't touch an hour of audio without stitching together segments.&lt;/p&gt;

&lt;p&gt;The mechanism behind this is a single architectural decision: a &lt;strong&gt;7.5 Hz ultra-low framerate tokenizer&lt;/strong&gt; that compresses 90 minutes of audio into ~40,500 tokens — small enough to fit inside an LLM's context window. That's a 3,200x compression ratio compared to the raw audio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;44,900+ Stars, 5,000+ Forks. ICLR 2026 Oral Presentation&lt;/strong&gt; (top ~2% acceptance rate at the field's premier venue).&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Why 7.5 Hz tokenization is the key architectural insight: the math behind 3,200x compression&lt;/li&gt;
&lt;li&gt;The LLM + diffusion head hybrid architecture: how semantic understanding and acoustic generation divide responsibilities&lt;/li&gt;
&lt;li&gt;Three model comparison: ASR-7B, TTS-1.5B, Realtime-0.5B — what each is for&lt;/li&gt;
&lt;li&gt;Benchmark numbers: how VibeVoice ASR compares to Gemini 2.5 Pro and Whisper&lt;/li&gt;
&lt;li&gt;The misuse incident: why Microsoft pulled TTS access in September 2025 and current availability status&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic familiarity with speech AI concepts (TTS, ASR, WER)&lt;/li&gt;
&lt;li&gt;Python and Hugging Face Transformers experience&lt;/li&gt;
&lt;li&gt;Basic understanding of LLM inference and diffusion models (optional but helpful for architecture sections)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is It?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;VibeVoice&lt;/strong&gt; is a family of speech AI models from Microsoft Research that uses a unified architecture — 7.5 Hz continuous speech tokenizer + LLM backbone + diffusion head — to address the fundamental limitation of existing speech AI systems: they were built for short audio.&lt;/p&gt;

&lt;p&gt;Traditional TTS synthesizes a few minutes at most. Traditional ASR (like Whisper) splits long audio into 30-second chunks and processes them sequentially — meaning speaker tracking breaks at every boundary and global semantic context is lost. Try to process a one-hour podcast or generate a 45-minute audiobook and these systems essentially give up.&lt;/p&gt;

&lt;p&gt;VibeVoice's answer: &lt;strong&gt;redesign at the architecture level, so the LLM operates directly on compressed audio tokens over the full audio length&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  About the Team
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Organization&lt;/strong&gt;: Microsoft Research&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Academic recognition&lt;/strong&gt;: VibeVoice-TTS paper "Expressive Podcast Generation with Next-Token Diffusion" accepted as ICLR 2026 Oral Presentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical report&lt;/strong&gt;: arXiv 2508.19205&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First published&lt;/strong&gt;: August 2025&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Stats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;GitHub Stars&lt;/strong&gt;: 44,900+&lt;/li&gt;
&lt;li&gt;🍴 &lt;strong&gt;Forks&lt;/strong&gt;: 5,000+&lt;/li&gt;
&lt;li&gt;📝 &lt;strong&gt;Commits&lt;/strong&gt;: 134&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;🌐 &lt;strong&gt;Language&lt;/strong&gt;: Python (100%)&lt;/li&gt;
&lt;li&gt;🤗 &lt;strong&gt;HuggingFace&lt;/strong&gt;: &lt;a href="https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f" rel="noopener noreferrer"&gt;microsoft/vibevoice collection&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Three Models, One Architecture
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Core capability&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VibeVoice-ASR&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;60-min long audio recognition + speaker diarization&lt;/td&gt;
&lt;td&gt;Meeting transcription, podcast-to-text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VibeVoice-TTS&lt;/td&gt;
&lt;td&gt;1.5B&lt;/td&gt;
&lt;td&gt;90-min 4-speaker speech synthesis&lt;/td&gt;
&lt;td&gt;Podcast generation, audiobook production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VibeVoice-Realtime&lt;/td&gt;
&lt;td&gt;0.5B&lt;/td&gt;
&lt;td&gt;~300ms first-chunk latency streaming&lt;/td&gt;
&lt;td&gt;Real-time voice assistants, dialogue systems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Core Technical Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Single-pass 60-minute ASR&lt;/strong&gt;&lt;br&gt;
No segmentation, no concatenation — one forward pass over the entire audio. Speaker tracking doesn't break, global semantic context is maintained end-to-end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. 90-minute multi-speaker TTS&lt;/strong&gt;&lt;br&gt;
Up to 4 speakers. Single synthesis of a complete podcast. Voice consistency is maintained for each speaker across the full duration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. 7.5 Hz ultra-low framerate tokenizer&lt;/strong&gt;&lt;br&gt;
This is the key innovation. At 7.5 Hz versus the 25-50 Hz frame rates of codecs like Encodec, each frame of which carries multiple residual-codebook tokens, the token count drops by roughly 80x. 90 minutes of audio becomes ~40,500 tokens — fitting comfortably inside a 64K context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Custom hotword injection&lt;/strong&gt;&lt;br&gt;
ASR supports dynamic domain vocabulary injection at inference time (medical, legal, product names) without retraining.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Expressive synthesis with natural spontaneous speech&lt;/strong&gt;&lt;br&gt;
TTS generates contextually appropriate emotional variation — laughter, pauses, interjections, natural dialogue prosody.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. LoRA fine-tuning support&lt;/strong&gt;&lt;br&gt;
ASR supports LoRA fine-tuning for accent adaptation or domain specialization with minimal labeled data (see the sketch just after feature 7).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. vLLM inference backend&lt;/strong&gt;&lt;br&gt;
ASR supports vLLM acceleration for high-throughput production deployment.&lt;/p&gt;
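
&lt;p&gt;For a feel of feature 6 (LoRA fine-tuning) in practice, here's a hedged sketch using the &lt;code&gt;peft&lt;/code&gt; library. The target module names are placeholders; the right ones depend on VibeVoice-ASR's actual layer naming, so treat this as the pattern rather than a recipe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: LoRA adapters on the ASR backbone via peft.
# target_modules entries are illustrative; check the model card for the
# real attention projection names before running this.
from transformers import AutoModelForSpeechSeq2Seq
from peft import LoraConfig, get_peft_model

model = AutoModelForSpeechSeq2Seq.from_pretrained("microsoft/VibeVoice-ASR")

lora = LoraConfig(
    r=16,                  # adapter rank
    lora_alpha=32,         # adapter scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative module names
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapters train
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
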
&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ASR model (integrated into HuggingFace Transformers since March 2026):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;transformers torch soundfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSpeechSeq2Seq&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;soundfile&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;

&lt;span class="c1"&gt;# Load model (GPU required, float16 inference)
&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microsoft/VibeVoice-ASR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSpeechSeq2Seq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microsoft/VibeVoice-ASR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read audio (supports up to 60 minutes)
&lt;/span&gt;&lt;span class="n"&gt;audio_array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sampling_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meeting.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Single-pass inference over the full audio
&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;audio_array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sampling_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Structured output with speaker labels and timestamps
&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Realtime TTS model:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/microsoft/VibeVoice.git
&lt;span class="nb"&gt;cd &lt;/span&gt;VibeVoice
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="c"&gt;# Optional: install Flash Attention for speedup&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;flash-attn &lt;span class="nt"&gt;--no-build-isolation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vibevoice&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VibeVoiceRealtime&lt;/span&gt;

&lt;span class="n"&gt;tts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VibeVoiceRealtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microsoft/VibeVoice-Realtime-0.5B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Streaming synthesis — audio starts playing ~300ms after first input
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;audio_chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Welcome to VibeVoice. This is a real-time synthesis demo.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;play_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How It Compares
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;VibeVoice&lt;/th&gt;
&lt;th&gt;ElevenLabs&lt;/th&gt;
&lt;th&gt;OpenAI TTS&lt;/th&gt;
&lt;th&gt;Whisper (ASR)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max single-pass length&lt;/td&gt;
&lt;td&gt;TTS 90 min / ASR 60 min&lt;/td&gt;
&lt;td&gt;~5 min&lt;/td&gt;
&lt;td&gt;Heavy limits&lt;/td&gt;
&lt;td&gt;30s chunks stitched&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-speaker&lt;/td&gt;
&lt;td&gt;Up to 4 speakers&lt;/td&gt;
&lt;td&gt;Multiple calls required&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Post-processing only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First-chunk latency (realtime)&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;td&gt;~75ms&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local deployment&lt;/td&gt;
&lt;td&gt;Fully supported&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Free, open-source&lt;/td&gt;
&lt;td&gt;$5-330/month&lt;/td&gt;
&lt;td&gt;$15/1M chars&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware requirement&lt;/td&gt;
&lt;td&gt;8GB+ VRAM&lt;/td&gt;
&lt;td&gt;None (API)&lt;/td&gt;
&lt;td&gt;None (API)&lt;/td&gt;
&lt;td&gt;4GB+ VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ICLR academic recognition&lt;/td&gt;
&lt;td&gt;Oral 2026&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why 7.5 Hz? The Token Math
&lt;/h3&gt;

&lt;p&gt;To understand VibeVoice's core innovation, start with the fundamental tension it resolves: &lt;strong&gt;audio data is enormous vs. LLM context windows are finite&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;90 minutes of 24kHz audio encoded with a traditional codec at 50 Hz (like Encodec) produces approximately &lt;strong&gt;270,000 tokens&lt;/strong&gt;. That is far more than mainstream LLM context windows can realistically attend over, so end-to-end long-audio understanding is effectively impossible.&lt;/p&gt;

&lt;p&gt;VibeVoice compresses the framerate from 50 Hz to 7.5 Hz. Token count drops to ~&lt;strong&gt;40,500&lt;/strong&gt;; the roughly 80x figure relative to Encodec also counts the multiple codebook tokens Encodec emits per frame. 90 minutes of audio now fits inside a 64K context window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;90 min audio × 60 s/min × 7.5 frames/s = 40,500 tokens ✓ (fits in 64K context)
90 min audio × 60 s/min × 50 frames/s = 270,000 tokens ✗ (over 4x a 64K context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real question: can 7.5 Hz preserve enough acoustic information for high-quality synthesis? That's what the σ-VAE tokenizer architecture is designed to answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dual Tokenizer Architecture (σ-VAE)
&lt;/h3&gt;

&lt;p&gt;VibeVoice uses two parallel continuous speech tokenizers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw audio (24kHz)
    │
    ├── Acoustic Tokenizer
    │   └── σ-VAE architecture
    │   └── Captures timbre, prosody, pitch, voice quality
    │   └── Output: acoustic latent vectors @ 7.5 Hz
    │
    └── Semantic Tokenizer
        └── Same σ-VAE, but trained with ASR surrogate task
        └── Captures linguistic content, word boundaries
        └── Output: semantic latent vectors @ 7.5 Hz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two tokenizers describe the same audio from different angles: the semantic tokenizer "knows what was said," the acoustic tokenizer "knows how it was said." This disentanglement is what makes the hybrid LLM + diffusion architecture possible downstream.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM + Diffusion Head Hybrid Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Text input / audio token input
          │
    ┌─────▼──────────────────────────────────────────┐
    │   Qwen2.5 LLM backbone                         │
    │   - Understands dialogue semantics             │
    │   - Manages speaker identity changes           │
    │   - Outputs semantic token sequence            │
    └─────────────┬──────────────────────────────────┘
                  │ semantic tokens
    ┌─────────────▼──────────────────────────────────┐
    │   Diffusion Head                               │
    │   - Conditioned on semantic tokens             │
    │   - Generates high-fidelity acoustic latents   │
    │   - Handles timbre detail, emotional variation │
    └─────────────┬──────────────────────────────────┘
                  │ acoustic latents
    ┌─────────────▼──────────────────────────────────┐
    │   Vocoder                                      │
    │   - Decodes acoustic latents to waveform       │
    │   - Outputs final audio                        │
    └────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight from the ICLR 2026 paper: &lt;strong&gt;"Hybrid architecture proves necessary: by explicitly disentangling semantic structure from acoustic realization."&lt;/strong&gt; Pure LLM approaches underperform on acoustic detail. Pure diffusion approaches drift semantically over long sequences. The hybrid gets both.&lt;/p&gt;

&lt;h3&gt;
  
  
  ASR: End-to-End Long Audio Understanding
&lt;/h3&gt;

&lt;p&gt;VibeVoice-ASR (7B parameters) doesn't slice long audio. The architectural contrast is stark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional Whisper:
60 min audio → split into 120 × 30s segments → transcribe each → stitch → speaker tracking breaks ×120

VibeVoice-ASR:
60 min audio → compress to ~27,000 tokens → single LLM inference
             → structured output with speaker labels + timestamps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model sees the entire conversation's global context. If Speaker A at minute 5 says something relevant to minute 55, the model maintains consistent semantic understanding throughout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark performance:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Medical audio WER&lt;/th&gt;
&lt;th&gt;LibriSpeech test-clean WER&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VibeVoice-ASR 7B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.34%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;8.15%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElevenLabs Scribe v2&lt;/td&gt;
&lt;td&gt;9.72%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Whisper large-v3&lt;/td&gt;
&lt;td&gt;~11%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VibeVoice-Realtime 0.5B&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.00%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Realtime Streaming: How 300ms First-Chunk Works
&lt;/h3&gt;

&lt;p&gt;VibeVoice-Realtime (0.5B) is optimized separately for low-latency scenarios using &lt;strong&gt;incremental text encoding + parallel acoustic generation&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Text stream input:   "Hello" → "Hello, how" → "Hello, how are" → ...
                       ↓             ↓               ↓
Parallel generation: [chunk 1]    [chunk 2]       [chunk 3]
                       ↓
First audio output: ~200-300ms after first input

Specs:
- 8K context window (supports extended conversation history)
- English only (current version)
- Runs on T4 GPU
- Slight instability on very short inputs (≤3 words)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Frozen Tokenizer: Long-Term Engineering Value
&lt;/h3&gt;

&lt;p&gt;VibeVoice's tokenizer weights are frozen — only the LLM backbone and diffusion head require training.&lt;/p&gt;

&lt;p&gt;This means as Qwen, LLaMA, and other base LLMs continue to improve, VibeVoice can swap in a stronger LLM backbone without retraining the tokenizer. &lt;strong&gt;The entire framework upgrades automatically as the LLM ecosystem improves&lt;/strong&gt;. This is one reason researchers see long-term architectural durability in this design.&lt;/p&gt;
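
&lt;p&gt;The pattern itself is ordinary PyTorch. A toy sketch (the module names are stand-ins, not VibeVoice's real internals); the point is that only parameters left with &lt;code&gt;requires_grad=True&lt;/code&gt; ever reach the optimizer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
from torch import nn

class SpeechModel(nn.Module):
    """Toy stand-in; the real modules are far larger."""
    def __init__(self):
        super().__init__()
        self.tokenizer = nn.Linear(64, 32)       # stands in for the 7.5 Hz tokenizer
        self.backbone = nn.Linear(32, 32)        # stands in for the LLM backbone
        self.diffusion_head = nn.Linear(32, 64)  # stands in for the diffusion head

model = SpeechModel()

# Freeze the tokenizer: its latent space is the stable interface contract
for p in model.tokenizer.parameters():
    p.requires_grad = False

# Only the backbone and diffusion head train, so a stronger backbone can be
# swapped in later without invalidating the tokenizer's latent space
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;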

&lt;h3&gt;
  
  
  The Misuse Incident
&lt;/h3&gt;

&lt;p&gt;On September 5, 2025, Microsoft temporarily pulled access to the primary TTS model, citing "use patterns inconsistent with stated intent." Documented misuse included: synthesizing voices of non-consenting individuals, deepfake audio production, and fraudulent voice content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current availability (April 2026):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VibeVoice-ASR: Available, integrated into HuggingFace Transformers&lt;/li&gt;
&lt;li&gt;VibeVoice-TTS-1.5B: Restricted access; related calls disabled in public codebase&lt;/li&gt;
&lt;li&gt;VibeVoice-Realtime-0.5B: Available for download on HuggingFace&lt;/li&gt;
&lt;li&gt;Source code: Fully open-source, MIT license&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Microsoft stated they are implementing protective mechanisms (watermarking, access auditing) before restoring full TTS access.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/microsoft/VibeVoice" rel="noopener noreferrer"&gt;https://github.com/microsoft/VibeVoice&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🏠 &lt;strong&gt;Project page&lt;/strong&gt;: &lt;a href="https://microsoft.github.io/VibeVoice/" rel="noopener noreferrer"&gt;https://microsoft.github.io/VibeVoice/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🤗 &lt;strong&gt;HuggingFace models&lt;/strong&gt;: &lt;a href="https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f" rel="noopener noreferrer"&gt;microsoft/vibevoice collection&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;Technical report&lt;/strong&gt;: &lt;a href="https://arxiv.org/abs/2508.19205" rel="noopener noreferrer"&gt;arXiv 2508.19205&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📜 &lt;strong&gt;ICLR 2026 paper&lt;/strong&gt;: &lt;a href="https://openreview.net/pdf?id=FihSkzyxdv" rel="noopener noreferrer"&gt;Expressive Podcast Generation with Next-Token Diffusion&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Demos &amp;amp; Tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🧪 &lt;strong&gt;Colab demo (Realtime)&lt;/strong&gt;: &lt;a href="https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/vibevoice_realtime_colab.ipynb" rel="noopener noreferrer"&gt;vibevoice_realtime_colab.ipynb&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔁 &lt;strong&gt;Replicate online trial&lt;/strong&gt;: &lt;a href="https://replicate.com/microsoft/vibevoice" rel="noopener noreferrer"&gt;replicate.com/microsoft/vibevoice&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The tokenizer is the breakthrough&lt;/strong&gt;: 7.5 Hz framerate tokenization compresses 90 minutes to 40,500 tokens. Without this, none of VibeVoice's long-audio capabilities are possible — everything else follows from this single decision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid architecture is necessary, not elegant&lt;/strong&gt;: The ICLR paper's conclusion is unambiguous: pure LLM fails on acoustic detail, pure diffusion fails on semantic coherence over long sequences. The hybrid solves both&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three models for three real scenarios&lt;/strong&gt;: ASR-7B for enterprise meeting transcription, TTS-1.5B for podcast/audiobook production, Realtime-0.5B for conversational AI — same architecture, different tradeoffs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ASR quality is production-ready&lt;/strong&gt;: VibeVoice-ASR 7B at 8.34% medical WER — approaching Gemini 2.5 Pro at 8.15%, beating ElevenLabs and Whisper — is a real result in a demanding domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The misuse incident is a signal&lt;/strong&gt;: When TTS capability reaches the level where it needs to be pulled from open access over safety concerns, that tells you something about the capability ceiling it's hitting&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Who Should Use This
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Podcast creators and content producers&lt;/strong&gt;: Automate multi-speaker podcast generation from scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise IT teams&lt;/strong&gt;: Local deployment for meeting transcription with data privacy requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI application developers&lt;/strong&gt;: Building real-time voice assistants and conversational AI products&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech AI researchers&lt;/strong&gt;: Studying long-audio processing, multi-speaker synthesis, diffusion head architectures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM enthusiasts&lt;/strong&gt;: Looking for a self-hostable, free alternative to ElevenLabs or OpenAI TTS&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where to Start
&lt;/h3&gt;

&lt;p&gt;Start with VibeVoice-ASR — it's fully integrated into HuggingFace Transformers, has the most complete documentation, and was officially released with Transformers in March 2026. If you need real-time TTS, Realtime-0.5B is the currently available version; the official Colab demo runs out of the box.&lt;/p&gt;

&lt;p&gt;For those wanting to understand the architecture deeply, the arXiv technical report (2508.19205) and the ICLR 2026 paper both explain in detail why the hybrid architecture is necessary and why continuous latent space outperforms discrete tokens for high-fidelity acoustic generation.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Question Worth Sitting With
&lt;/h3&gt;

&lt;p&gt;The misuse incident raises a question that the speech AI field hasn't fully answered: &lt;strong&gt;at what capability level does open-source speech synthesis become a dual-use risk that requires gating?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Text generation has largely been treated as freely distributable. Images crossed a threshold where safety concerns became prominent. Voice synthesis — because it can impersonate specific, identifiable individuals — may be in a different category. VibeVoice's temporary takedown isn't a story about one bad actor; it's a data point about where the field is heading.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Visit my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;personal site&lt;/a&gt; for more useful knowledge and interesting products&lt;/em&gt;&lt;/p&gt;

</description>
      <category>asr</category>
      <category>tts</category>
      <category>microsoft</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Spent Thousands of Dollars in Tokens Building an AI-Driven End-to-End Bug Fix Pipeline</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Wed, 29 Apr 2026 02:05:26 +0000</pubDate>
      <link>https://dev.to/wonderlab/i-spent-thousands-of-dollars-in-tokens-building-an-ai-driven-end-to-end-bug-fix-pipeline-36p0</link>
      <guid>https://dev.to/wonderlab/i-spent-thousands-of-dollars-in-tokens-building-an-ai-driven-end-to-end-bug-fix-pipeline-36p0</guid>
      <description>&lt;h2&gt;
  
  
  Before We Start
&lt;/h2&gt;

&lt;p&gt;Let me lead with the cost: this system burned thousands of dollars in API tokens during development and debugging.&lt;/p&gt;

&lt;p&gt;I still think it's worth writing about. Because this isn't a demo — it's an end-to-end pipeline running against real enterprise systems. A bug ticket in Jira goes in. The AI reads the logs, diagnoses the root cause, writes the fix, runs a Code Review, executes a SonarQube scan, runs unit tests, submits to Gerrit, polls CI/CD for results, adds a human reviewer, and writes back a comment to Jira.&lt;/p&gt;

&lt;p&gt;AI drives the whole thing. Humans only step in at critical decision gates.&lt;/p&gt;

&lt;p&gt;This article is a full retrospective: how the system is designed, what worked, what failed, what engineering problems I ran into, and how I solved them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Build This
&lt;/h2&gt;

&lt;p&gt;There's a category of software engineering work that consumes enormous human capacity every day but follows a highly standardized process: &lt;strong&gt;bug fixing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A typical bug workflow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Receive bug ticket → Read logs → Find root cause → Fix code → Self-test
→ Code Review → Static scan → Unit tests → Submit → Wait for CI
→ Add reviewer → Wait for CR → Merge → Update Jira
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every step involves fixed actions, tool interactions, and a certain amount of accumulated experience. This is exactly the kind of work AI agents are built for.&lt;/p&gt;

&lt;p&gt;The challenge: &lt;strong&gt;this is a 12+ node long-chain pipeline spanning Jira, log systems, code repositories, review standards, Gerrit, and CI/CD infrastructure. A tool failure at any single node can break the entire flow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent a significant amount of time designing and debugging this workflow on the OpenClaw platform, turning a whiteboard design into a system that actually runs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: Three Layers
&lt;/h2&gt;

&lt;p&gt;The system has three layers: &lt;strong&gt;Skills&lt;/strong&gt;, &lt;strong&gt;Workflow&lt;/strong&gt;, and &lt;strong&gt;Platform&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────┐
│              OpenClaw Platform                  │
│  (Enterprise AI Coding Assistant, Claude-based) │
└──────────────────┬─────────────────────────────┘
                   │
┌──────────────────▼─────────────────────────────┐
│              Bug E2E Workflow                   │
│  Node sequence, branching logic, retry policy   │
│  Human gates: checkpoints A / B / C             │
└──────────────────┬─────────────────────────────┘
                   │
┌──────────────────▼─────────────────────────────┐
│                  Skills                         │
│  One independently runnable AI skill per node   │
│  jira-communication / rnd-automotive-issue-     │
│  analyzer / write-code / ph-code-review /       │
│  ph-sonar-scan / ph-junit-ut / commit-format /  │
│  gerrit-verify / ...                            │
└────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Skills&lt;/strong&gt; are atomic units. Each skill maps to a single capability and has its own &lt;code&gt;SKILL.md&lt;/code&gt; that defines context, input/output contracts, and execution steps. When the agent runs a node, it reads the corresponding skill document and follows the spec.&lt;/p&gt;
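
&lt;p&gt;For intuition, the rough shape of such a skill document (the fields here are illustrative, not the real &lt;code&gt;rnd-automotive-issue-analyzer&lt;/code&gt; spec):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# SKILL.md (illustrative shape)

## Context
Diagnoses crashes, black screens, and stability issues in automotive
Android logs.

## Input contract
- log_dir: path to the extracted log attachment
- bug_summary: Jira title + description

## Output contract
- root_cause: one-paragraph judgment
- affected_modules: list of module names
- fix_directions: suggested approaches

## Execution steps
1. Locate fatal signals / ANR traces in log_dir
2. Correlate timestamps with the bug description
3. Emit the structured result as JSON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;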

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt; is the orchestration layer. Defined in OpenClaw: node execution order, conditional branches (e.g., retry paths when Code Review fails), human gate trigger conditions, and cross-session state management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platform&lt;/strong&gt; is OpenClaw — a Claude-based enterprise AI coding assistant that supports multi-agent concurrency, sub-agent invocation, workspace persistence, and workflow orchestration.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Workflow: 12 Nodes
&lt;/h2&gt;

&lt;p&gt;The workflow starts when a bug ticket arrives and ends when Jira is updated. In between:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Node&lt;/th&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Fetch bug info &amp;amp; logs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;jira-communication&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Root cause analysis&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rnd-automotive-issue-analyzer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Fetch source code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;code-fetch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Fix code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;write-android-code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Code Review&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ph-code-review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Static analysis (SonarQube)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ph-sonar-scan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Generate &amp;amp; run unit tests&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ph-junit-ut&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Commit code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;commit-format&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Poll CI/CD verify result&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gerrit-verify&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Add Gerrit reviewer&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Automated regression tests&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Write back Jira comment&lt;/td&gt;
&lt;td&gt;&lt;code&gt;jira-communication&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each node is much more than "call an API." Take &lt;strong&gt;root cause analysis&lt;/strong&gt; as an example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download the log attachment from Jira (usually a &lt;code&gt;.zip&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Extract it and locate the relevant log files&lt;/li&gt;
&lt;li&gt;Invoke &lt;code&gt;rnd-automotive-issue-analyzer&lt;/code&gt; — a skill built specifically for diagnosing crashes, black screens, and system stability issues in automotive Android systems&lt;/li&gt;
&lt;li&gt;Output: root cause judgment, affected modules, and suggested fix directions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Only then does code modification begin.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
&lt;strong&gt;Why is "Fetch Source Code" the hardest node?&lt;/strong&gt;&lt;br&gt;
To automatically map a bug description to the correct repository, you need a maintained "module → repo" mapping table and bug tickets that have accurate module fields filled in at creation time. On top of that, some repos are massive — pulling fresh every time is too slow, but caching creates workspace storage and staleness problems. This node is currently in cross-team evaluation (DevOps + IT + Development + QA).&lt;br&gt;
&lt;/p&gt;
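
&lt;p&gt;A sketch of what this node needs, with an illustrative mapping table and a crude cache policy (the real table and policy are exactly what's under evaluation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import time
from pathlib import Path

# Maintained "module to repo" mapping table (entries illustrative)
MODULE_TO_REPO = {
    "performance-monitor": "ssh://gerrit.example.com/platform/perf-tools",
    "system-ui": "ssh://gerrit.example.com/apps/system-ui",
}

CACHE_ROOT = Path("/workspace/repo-cache")
MAX_AGE_S = 6 * 3600  # refresh cached checkouts older than 6 hours

def checkout(module: str) -&gt; Path:
    repo = MODULE_TO_REPO[module]  # fails fast on a mis-filled module field
    dest = CACHE_ROOT / module
    marker = dest / ".last_sync"
    if not dest.exists():
        subprocess.run(["git", "clone", repo, str(dest)], check=True)
    elif not marker.exists() or time.time() - marker.stat().st_mtime &gt; MAX_AGE_S:
        subprocess.run(["git", "-C", str(dest), "fetch", "--prune"], check=True)
    marker.touch()
    return dest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;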




&lt;h2&gt;
  
  
  Real Tests: 6 Scenarios
&lt;/h2&gt;

&lt;p&gt;The design only matters if it holds up in practice. Here are the 6 most representative cases from actual test runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 1: The Closest to the Happy Path
&lt;/h3&gt;

&lt;p&gt;This was the most satisfying run. The workflow completed end-to-end as designed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fetch bug info
  → Download logs → Extract → Analyze root cause
  → Fix code
  → Code Review (passed)
  → SonarQube scan
  → Unit tests
  → Commit to Gerrit
  → Poll verify status (scheduled)
  → Add Gerrit reviewer
  → Add Jira comment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 12 nodes, fully AI-driven, zero human intervention. The Gerrit MR was submitted successfully; the Jira ticket was automatically updated with the analysis summary and action log.&lt;/p&gt;

&lt;p&gt;One minor glitch: the Commit step was supposed to run automatically but triggered a human confirmation prompt. This was Claude Code's Permission system requiring a second confirmation for certain Git operations. Fixed later by adjusting the Hooks configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 2: Tool Failure → Wait for Human
&lt;/h3&gt;

&lt;p&gt;The UT step failed due to a JDK version mismatch in the environment.&lt;/p&gt;

&lt;p&gt;When the workflow detected the failure, &lt;strong&gt;it didn't crash — it triggered the human notification path&lt;/strong&gt;, paused, and waited. This is exactly the fallback route designed from the start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tool failure
  → Can it be auto-retried?
  → No → Push notification → Wait for human → Resume after confirmation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This case validated that the fault tolerance mechanism works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 3: Session Interrupted → Resume in New Session
&lt;/h3&gt;

&lt;p&gt;A system notification interrupted the agent mid-execution. I opened a new session, sent the same task description, and the agent automatically read the previous &lt;code&gt;workflow_state.json&lt;/code&gt; file and resumed from the checkpoint — no restart from scratch.&lt;/p&gt;

&lt;p&gt;This relies on the workflow persisting a state file after every node completes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bug_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AE-33995"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_phase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_step"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"completed_steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.2"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"artifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"log_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"review_result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"review_r1.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sessions can be interrupted. Tasks don't get lost.&lt;/strong&gt; For long-running automation workflows, this is non-negotiable.&lt;/p&gt;
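
&lt;p&gt;The resume logic on top of that file is small. A sketch, assuming the state file shape above (the step list is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from pathlib import Path

STATE = Path("workflow_state.json")
ALL_STEPS = ["0", "1", "2", "3", "4.1", "4.2", "4.3"]  # illustrative plan

def next_step():
    """Return the first step not yet recorded as completed, or None."""
    if not STATE.exists():
        return ALL_STEPS[0]  # fresh run
    done = set(json.loads(STATE.read_text())["completed_steps"])
    remaining = [s for s in ALL_STEPS if s not in done]
    return remaining[0] if remaining else None

def mark_done(step, **artifacts):
    """Persist after every node so an interruption loses nothing."""
    state = (json.loads(STATE.read_text()) if STATE.exists()
             else {"completed_steps": [], "artifacts": {}})
    state["completed_steps"].append(step)
    state["current_step"] = step
    state["artifacts"].update(artifacts)
    STATE.write_text(json.dumps(state, indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;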

&lt;h3&gt;
  
  
  Case 4: Code Review Fails → Multi-Round Retry Loop
&lt;/h3&gt;

&lt;p&gt;This is the most interesting part of the whole pipeline.&lt;/p&gt;

&lt;p&gt;Code Review failed on the first round (score: 57, with 8 mandatory violations). Instead of immediately escalating to a human, the workflow launched &lt;strong&gt;up to 3 automatic fix-and-retry rounds&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Round 1: CR → Failed (57 pts, 8 mandatory violations)
  → Launch sub-agent to fix all 8 violations
  → Round 2 CR → Failed (83 pts, 2 violations)
  → Launch Round 3 fix
  → Round 3 CR → Failed (74 pts, 4 violations)
  → Max retries reached → Trigger human gate B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each round, the agent reads the previous CR result and makes targeted fixes — not a full rewrite. Round 1 fix summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue ID&lt;/th&gt;
&lt;th&gt;Violation&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;B2-01&lt;/td&gt;
&lt;td&gt;Raw Thread → CoroutineScope&lt;/td&gt;
&lt;td&gt;PerformanceInfoMonitor.kt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C1-01&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;binding!!&lt;/code&gt; → safe call&lt;/td&gt;
&lt;td&gt;PerformanceFloatingView.kt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C1-02&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;windowManager!!&lt;/code&gt; (5x) → &lt;code&gt;?.let&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;PerformanceFloatingView.kt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C5-01&lt;/td&gt;
&lt;td&gt;InterruptedException caught separately&lt;/td&gt;
&lt;td&gt;PerformanceInfoMonitor.kt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D1-01&lt;/td&gt;
&lt;td&gt;Interface renamed → ICloseClickCallback&lt;/td&gt;
&lt;td&gt;PerformanceFloatingView.kt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
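
&lt;p&gt;The control flow behind Case 4 is a bounded loop. A sketch, where &lt;code&gt;review()&lt;/code&gt; and &lt;code&gt;fix()&lt;/code&gt; stand in for the &lt;code&gt;ph-code-review&lt;/code&gt; skill and the fixing sub-agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MAX_ROUNDS = 3

def passes(result):
    # Illustrative gate: no mandatory violations left
    return not result["mandatory_violations"]

def review_fix_loop(review, fix, human_gate_b):
    result = None
    for round_no in range(1, MAX_ROUNDS + 1):
        result = review()  # fresh CR result: score + violation list
        if passes(result):
            return result  # converged: proceed to submission
        if round_no == MAX_ROUNDS:
            break  # don't fix again without a review to follow
        fix(result["mandatory_violations"])  # targeted fixes, not a rewrite
    human_gate_b(result)  # max retries reached: a human decides
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;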

&lt;h3&gt;
  
  
  Case 5: Max Retries Exceeded → Human Gate
&lt;/h3&gt;

&lt;p&gt;Continuing from Case 4. After 3 rounds of fixes, Code Review still failed. But this exposed a deeper and more interesting problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Two of the 4 "mandatory violations" in Round 3 were classified as "recommended" in Round 2, then upgraded to "mandatory" the following round.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;When an LLM acts as a code review judge, its scoring criteria drift between rounds.&lt;/strong&gt; The same issue gets rated differently depending on how much history has accumulated in the prompt context. The loop can't converge.&lt;/p&gt;

&lt;p&gt;This is a fundamental engineering problem, not something a Prompt tweak can fix.&lt;/p&gt;

&lt;p&gt;When human gate B fired, the agent presented a three-round comparison report and offered two options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A: Accept current code and proceed to submission
   (core bug fixed; remaining issues are style, not functional)

B: Human fixes the remaining 4 items, then notifies agent to re-run quality checks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point the human's role is &lt;strong&gt;judge&lt;/strong&gt;, not executor. The AI did everything it could; a person makes the call.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 6: CI Pipeline Verify Failed → Human Gate
&lt;/h3&gt;

&lt;p&gt;After submitting to Gerrit, the CI pipeline returned failure votes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reviewer&lt;/th&gt;
&lt;th&gt;Verified&lt;/th&gt;
&lt;th&gt;Compile&lt;/th&gt;
&lt;th&gt;UT&lt;/th&gt;
&lt;th&gt;Code-Check&lt;/th&gt;
&lt;th&gt;Smoke-Test&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;icvsbgci&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;td&gt;+1&lt;/td&gt;
&lt;td&gt;+1&lt;/td&gt;
&lt;td&gt;+1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;jenkins.dl&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent read the Gerrit vote state, detected pipeline failures, and entered human gate C with three options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A — Pipeline issue is known/acceptable; proceed with adding reviewers and updating Jira
B — Needs pipeline re-trigger (manually rebase in Gerrit, then agent re-polls)
C — Abandon this submission and re-evaluate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
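
&lt;p&gt;For context, here's a sketch of what reading that vote state can look like against Gerrit's standard REST API (the host, credentials, and change number are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import requests

GERRIT = "https://gerrit.example.com"   # placeholder host
AUTH = ("bot-user", "http-password")    # placeholder credentials

def read_votes(change_id):
    # The /a/ prefix is Gerrit's authenticated REST path; /detail
    # includes per-label votes for the change.
    resp = requests.get(f"{GERRIT}/a/changes/{change_id}/detail",
                        auth=AUTH, timeout=30)
    resp.raise_for_status()
    # Gerrit prepends )]}' to JSON bodies to defeat XSSI; drop that line.
    detail = json.loads(resp.text.split("\n", 1)[1])
    return detail.get("labels", {})

labels = read_votes("12345")  # placeholder change number
verified = [v.get("value", 0) for v in labels.get("Verified", {}).get("all", [])]
if min(verified, default=0) &amp;lt; 0:
    print("pipeline voted Verified -1 → enter human gate C")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;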






&lt;h2&gt;
  
  
  Engineering War Stories: Three Core Problems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem 1: Context Explosion on Long Pipelines
&lt;/h3&gt;

&lt;p&gt;This is the biggest technical challenge the system faces.&lt;/p&gt;

&lt;p&gt;A single bug-fix run accumulates: Jira reads, log downloads and extractions, parsing large log files, reading multiple source files, Code Review output, Sonar scan reports… all within one agent turn. In one real test run, a single turn executed &lt;strong&gt;117 tool calls&lt;/strong&gt; — and then, right after the Sonar scan completed and just as the agent was about to proceed to the next step, the API request was aborted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;seq&lt;/span&gt;=&lt;span class="m"&gt;115&lt;/span&gt;: &lt;span class="n"&gt;sonar&lt;/span&gt; &lt;span class="n"&gt;poll&lt;/span&gt; &lt;span class="n"&gt;returned&lt;/span&gt;  ← &lt;span class="n"&gt;scan&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="n"&gt;received&lt;/span&gt;
&lt;span class="n"&gt;seq&lt;/span&gt;=&lt;span class="m"&gt;117&lt;/span&gt;: &lt;span class="n"&gt;stopReason&lt;/span&gt;=&lt;span class="s2"&gt;"aborted"&lt;/span&gt;, &lt;span class="n"&gt;errorMessage&lt;/span&gt;=&lt;span class="s2"&gt;"Request was aborted."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent knew exactly what to do next. The turn's accumulated context was simply too large and the server rejected the request outright.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution: Sub-Agent Architecture.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Refactor each heavyweight phase into an independent sub-agent call. The main agent only orchestrates; sub-agents execute specific tasks and return structured results. After each sub-agent completes, the main agent receives only a summary — not the full execution trace. Each phase's context stays isolated and doesn't accumulate linearly across the whole pipeline.&lt;/p&gt;
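
&lt;p&gt;A minimal sketch of the pattern (all names hypothetical): the orchestrator keeps one summary line per phase in its own context, while each sub-agent's full tool-call trace lives and dies inside its own session:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, field

@dataclass
class PhaseResult:
    phase: str
    ok: bool
    summary: str          # the only text the orchestrator keeps in context
    artifact_paths: list = field(default_factory=list)  # big outputs stay on disk

def run_sub_agent(phase, task):
    # Stand-in for spawning an isolated agent session: its dozens of tool
    # calls happen in its own context, which is discarded when it finishes;
    # only this compact result crosses back to the orchestrator.
    return PhaseResult(phase=phase, ok=True, summary=f"{phase}: ok")

def orchestrate(ticket):
    for phase in ["analyze_logs", "fix_code", "code_review", "sonar_scan"]:
        result = run_sub_agent(phase, f"{phase} for {ticket}")
        print(result.summary)  # context grows by one line, not one trace
        if not result.ok:
            break              # escalate to the relevant human gate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;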

&lt;h3&gt;
  
  
  Problem 2: LLM Judge Score Drift
&lt;/h3&gt;

&lt;p&gt;Already described in Case 5. The core issue: &lt;strong&gt;an LLM's judgment is influenced by its accumulated context&lt;/strong&gt;. As fix rounds increase, the prompt history grows, and the model's evaluation baseline shifts.&lt;/p&gt;

&lt;p&gt;No clean solution exists yet. Directions being explored:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start a fresh session for each Code Review round (clear all history context)&lt;/li&gt;
&lt;li&gt;Decouple the scoring logic from the LLM — use AST-based static rule checking for pass/fail decisions, with the LLM only providing human-readable explanations (sketched below)&lt;/li&gt;
&lt;li&gt;Add a consistency validation layer on top of the "mandatory/recommended" classification&lt;/li&gt;
&lt;/ol&gt;
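
&lt;p&gt;Direction 2 in miniature: a deterministic checker owns the verdict, so the same input produces the same verdict in every round. In this sketch a regex stands in for a real AST-based check, and the rule IDs are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Deterministic rules decide pass/fail; an LLM would only explain findings.
RULES = [
    ("C1", r"\w+!!", "non-null assertion (!!) is forbidden"),
    ("B2", r"\bThread\s*\(", "raw Thread; use a CoroutineScope"),
]

def static_review(source):
    violations = []
    for rule_id, pattern, message in RULES:
        if re.search(pattern, source):
            violations.append((rule_id, message))
    return violations

code = 'binding!!.textView.text = "fps"'
result = static_review(code)
# Same input → same verdict, no matter how many rounds have run.
print("PASS" if not result else f"FAIL: {result}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;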

&lt;h3&gt;
  
  
  Problem 3: Cross-Session Task Recovery
&lt;/h3&gt;

&lt;p&gt;An AI session is fundamentally stateless. In enterprise environments, a bug fix might run for 20 minutes or be interrupted for hours while waiting for CI. You can't rely on a single session staying alive.&lt;/p&gt;

&lt;p&gt;The solution is &lt;strong&gt;externalizing state&lt;/strong&gt;. After every node completes, the workflow serializes the current state to the filesystem (&lt;code&gt;workflow_state.json&lt;/code&gt;), recording: completed nodes, key artifact paths, and the current phase. When a new session starts, it reads this file first and resumes from the checkpoint.&lt;/p&gt;

&lt;p&gt;This is essentially simulating a persistent task queue, with the filesystem as the state store.&lt;/p&gt;
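
&lt;p&gt;The mechanism is small. A sketch of the checkpoint/resume cycle (the field names inside &lt;code&gt;workflow_state.json&lt;/code&gt; are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from pathlib import Path

STATE_FILE = Path("workflow_state.json")

def load_state():
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"completed_nodes": [], "artifacts": {}, "current_phase": None}

def checkpoint(node, artifacts):
    # Called after every node completes, so any later session can resume.
    state = load_state()
    state["completed_nodes"].append(node)
    state["artifacts"].update(artifacts)
    state["current_phase"] = node
    STATE_FILE.write_text(json.dumps(state, indent=2))

# A fresh session reads the file first and picks up at the checkpoint:
state = load_state()
print("resuming after:", state["current_phase"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;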




&lt;h2&gt;
  
  
  What Works, What's Still Missing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's Running
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Jira info retrieval + log download / extraction / analysis&lt;/li&gt;
&lt;li&gt;Bug root cause analysis for automotive Android systems (via &lt;code&gt;rnd-automotive-issue-analyzer&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Code fixes (Kotlin, Android application layer)&lt;/li&gt;
&lt;li&gt;Code Review with custom standards (multi-round retry + human gate)&lt;/li&gt;
&lt;li&gt;SonarQube static analysis (requires SonarQube 10.x — 9.9 doesn't work)&lt;/li&gt;
&lt;li&gt;Code submission to Gerrit (with standardized commit messages)&lt;/li&gt;
&lt;li&gt;CI/CD result polling + vote status interpretation&lt;/li&gt;
&lt;li&gt;Automated Gerrit reviewer assignment&lt;/li&gt;
&lt;li&gt;Jira comment write-back&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What's Still Being Built
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Automatic source code matching&lt;/strong&gt; is the biggest open problem. Inferring the right repository from a bug description requires a maintained "module → repo" mapping and accurate module fields in bug tickets. This needs cross-team coordination and is currently handled by manually specifying the repo path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated regression testing&lt;/strong&gt; requires the QA team to co-design the smoke test trigger, execution environment, and result write-back — all of which involve pipeline changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Was It Worth It?
&lt;/h2&gt;

&lt;p&gt;Thousands of dollars in tokens, for a pipeline that's still being refined. Worth it?&lt;/p&gt;

&lt;p&gt;My answer: &lt;strong&gt;depends what you're measuring.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From a pure cost standpoint, no — the per-run token cost is still high and needs optimization.&lt;/p&gt;

&lt;p&gt;But from another angle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;This is working proof on real enterprise systems&lt;/strong&gt;, not a toy demo. It's connected to real production toolchains: Jira, Gerrit, SonarQube, Jenkins.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It reveals the actual boundary of AI agents in enterprise engineering&lt;/strong&gt;: what can be automated, what needs human judgment, and what's an engineering problem rather than an AI capability limitation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sub-Agent architecture, externalized state, and human gates&lt;/strong&gt; — these three engineering patterns were forged through real failures. They're applicable to any team trying to deploy AI agents in enterprise environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Most importantly: the path is viable.&lt;/strong&gt; Many nodes are still being refined, but "a bug ticket goes in, a Gerrit MR comes out" has been demonstrated.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This isn't something you knock out in a month. Every skill requires deep familiarity with enterprise tooling. Every workflow node debugged reveals another edge case in how AI agents behave in complex real-world environments.&lt;/p&gt;

&lt;p&gt;But the direction is right. AI isn't here to replace engineers — it's here to replace the part of engineering that engineers hate doing but that still has to get done.&lt;/p&gt;

&lt;p&gt;If you're working on similar problems, I'd love to compare notes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions or thoughts on your own AI automation experience? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>agents</category>
      <category>enterprise</category>
      <category>bugfix</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No.50): The TypeScript Wizard Pushed His .claude Directory to GitHub and Hit #1 Worldwide Overnight</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Tue, 28 Apr 2026 06:06:26 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no50-the-typescript-wizard-pushed-his-claude-directory-to-github-41jj</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no50-the-typescript-wizard-pushed-his-claude-directory-to-github-41jj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"My agent skills that I use every day to do real engineering — not vibe coding."&lt;br&gt;
— Matt Pocock, README first line&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is article No.50 in the "One Open Source Project a Day" series. Today's project is &lt;strong&gt;skills&lt;/strong&gt; (&lt;a href="https://github.com/mattpocock/skills" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Let's start with what makes this repository unusual.&lt;/p&gt;

&lt;p&gt;It's not a new framework. It's not a big company's open-source release. It's not even a runnable program. It's just one engineer's &lt;code&gt;.claude&lt;/code&gt; working directory — 21 Markdown files, each telling Claude Code how to behave in a specific engineering scenario.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt Pocock&lt;/strong&gt; pushed this directory to GitHub. No promotion. No blog post explaining it. No YouTube demo. No Hacker News submission.&lt;/p&gt;

&lt;p&gt;In 24 hours: &lt;strong&gt;22,000 Stars&lt;/strong&gt;. &lt;strong&gt;GitHub Trending #1 globally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Final count: &lt;strong&gt;30,800+ Stars&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This wasn't random. The same day, &lt;code&gt;free-claude-code&lt;/code&gt; pulled 1,701 stars and &lt;code&gt;awesome-codex-skills&lt;/code&gt; pulled 517. Three repositories dominated the Trending page with one shared theme: &lt;strong&gt;how to configure your AI to work the way you want it to&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That day, GitHub's community voted: we need this.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Why "configuring how your AI works" is becoming a serious engineering practice&lt;/li&gt;
&lt;li&gt;The core skills broken down: grill-me, tdd, triage-issue, and more&lt;/li&gt;
&lt;li&gt;The "vertical slice" and "anti-vibe-coding" philosophy&lt;/li&gt;
&lt;li&gt;How to install and adapt these skills to your own workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Have used Claude Code or a similar AI coding assistant&lt;/li&gt;
&lt;li&gt;Familiar with basic software engineering concepts (TDD, refactoring, PRDs)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is It?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;skills&lt;/strong&gt; is the public version of Matt Pocock's personal Claude Code skill library. Each "skill" is a standalone folder with a &lt;code&gt;SKILL.md&lt;/code&gt; as the main file, describing how an Agent should work in a specific engineering scenario — goal, steps, constraints, output format.&lt;/p&gt;

&lt;p&gt;Installation is frictionless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills@latest add mattpocock/skills/grill-me
&lt;span class="c"&gt;# Copies the SKILL.md into your .claude/ directory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  About the Author: Matt Pocock
&lt;/h3&gt;

&lt;p&gt;Matt Pocock's position in the TypeScript community is something like its chief evangelist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: 10,300+ followers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity&lt;/strong&gt;: "TypeScript wizard" (GitHub bio), "I teach devs for a living" (X/Twitter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notable projects&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total TypeScript&lt;/strong&gt;: Comprehensive production-grade TypeScript course — his core business&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ts-reset&lt;/strong&gt; (8,400 ⭐): Called "the CSS Reset for TypeScript"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;evalite&lt;/strong&gt; (1,500 ⭐): LLM application evaluation in TypeScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sandcastle&lt;/strong&gt; (1,000 ⭐): TypeScript sandbox coding agent orchestration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;AI Hero Newsletter&lt;/strong&gt;: ~60,000 subscribers&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Background&lt;/strong&gt;: Former Vercel engineer, former Stately engineer; used to be a vocal coach (yes, seriously)&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;He isn't primarily someone who builds tools for others — he's first and foremost an engineer who uses AI seriously every day. This repository is a direct output of his actual work process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Stats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;GitHub Stars&lt;/strong&gt;: 30,800+ (22,000 in first 24 hours)&lt;/li&gt;
&lt;li&gt;🍴 &lt;strong&gt;Forks&lt;/strong&gt;: 2,400+&lt;/li&gt;
&lt;li&gt;📝 &lt;strong&gt;Commits&lt;/strong&gt;: 34&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;Install tool&lt;/strong&gt;: &lt;code&gt;npx skills@latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;📰 &lt;strong&gt;Newsletter&lt;/strong&gt;: AI Hero (60,000 subscribers)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Philosophy: Deliberately Slow AI Down
&lt;/h3&gt;

&lt;p&gt;Most people's mental model for AI-assisted coding is "generate" — have it write functions, fill boilerplate, produce whole pages, faster the better. The industry calls this &lt;strong&gt;Vibe Coding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Matt's approach is the opposite. Most of what he teaches Claude to do isn't generating code — it's &lt;strong&gt;thinking the problem through before writing a single line&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's what that looks like:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;grill-me&lt;/code&gt; — His Most Useful Skill
&lt;/h3&gt;

&lt;p&gt;Matt has called this his most useful skill.&lt;/p&gt;

&lt;p&gt;What it does: &lt;strong&gt;Turn Claude into a relentless technical interviewer who interrogates your design until you run out of answers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not the AI that gently suggests two improvements and calls it done. This one hunts — every branch, every edge case, every assumption you haven't made explicit — &lt;strong&gt;one question at a time&lt;/strong&gt;, until you find yourself thinking "damn, I hadn't thought that through."&lt;/p&gt;

&lt;p&gt;Key design choices in the skill:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One question at a time (no question barrages — forces systematic thinking)&lt;/li&gt;
&lt;li&gt;Gives recommended answers (doesn't just ask, also helps you think)&lt;/li&gt;
&lt;li&gt;Actively explores the codebase to answer questions itself, rather than always asking you&lt;/li&gt;
&lt;li&gt;Goal: &lt;strong&gt;Resolve every decision branch before a single line of code is written&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: You want to add a caching layer to your project
After grill-me:
  "What's your cache granularity — function-level or request-level?"
  "If cache keys collide, what's your invalidation strategy?"
  "Do you have race conditions? Multiple requests missing cache simultaneously?"
  "If the cache service goes down, what's your fallback?"
  "How do your tests mock the cache?"
  ...
After the interrogation, your design is either much tighter,
or you've realized you shouldn't build this at all.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;tdd&lt;/code&gt; — Enforced Red-Green-Refactor
&lt;/h3&gt;

&lt;p&gt;This skill doesn't let Claude write out a full feature at once. It enforces a strict TDD workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tracer Bullet (build minimal working path first)
    ↓
Increment loop:
  Write one failing test
    ↓
  Write the minimum code to make it pass
    ↓
  Refactor (code cleaner, tests still green)
    ↓
  Next test...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key constraints from the SKILL.md:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No horizontal slicing&lt;/strong&gt;: Writing all tests first before any implementation is forbidden (leads to over-engineering)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests verify behavior, not implementation&lt;/strong&gt;: Test through public interfaces, not internals&lt;/li&gt;
&lt;li&gt;Ships with five reference documents: &lt;code&gt;deep-modules.md&lt;/code&gt;, &lt;code&gt;interface-design.md&lt;/code&gt;, &lt;code&gt;mocking.md&lt;/code&gt;, &lt;code&gt;refactoring.md&lt;/code&gt;, &lt;code&gt;tests.md&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Slow, deliberate, old-fashioned — and exactly how serious engineers actually write code.&lt;/p&gt;
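
&lt;p&gt;One increment of that loop in miniature (an illustrative Python example, not taken from the skill itself):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Red: the test exists before the code and states observable behavior.
def test_slugify_replaces_spaces():
    assert slugify("hello world") == "hello-world"

# Green: the minimum code that makes it pass; no speculative options yet.
def slugify(title):
    return title.lower().replace(" ", "-")

# The test goes through the public interface only, so refactoring the
# internals (say, swapping in re.sub) keeps it green.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;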

&lt;h3&gt;
  
  
  &lt;code&gt;triage-issue&lt;/code&gt; — Diagnose First, Fix Second
&lt;/h3&gt;

&lt;p&gt;When a bug report comes in, most people (and most AI assistants) react the same way: find the line, change it, open a PR.&lt;/p&gt;

&lt;p&gt;Matt's &lt;code&gt;triage-issue&lt;/code&gt; skill adds a step before fixing: &lt;strong&gt;thorough diagnosis&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Five-stage pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Capture the problem (understand the symptom)
2. Explore and diagnose (comb the codebase, find root cause)
3. Determine the fix (not the first viable fix — the best fix)
4. Design a TDD fix plan (write the test first, then implement)
5. Create a GitHub Issue (diagnosis + fix plan + acceptance criteria)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output isn't code. It's a GitHub Issue. When you (or another Agent) go to actually fix it, there's already a clear root-cause analysis and a test-driven roadmap waiting.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;design-an-interface&lt;/code&gt; — Force Real Alternatives
&lt;/h3&gt;

&lt;p&gt;This skill implements John Ousterhout's "&lt;strong&gt;Design It Twice&lt;/strong&gt;" principle from &lt;em&gt;A Philosophy of Software Design&lt;/em&gt;: for any important decision, generate at least two genuinely different approaches before choosing.&lt;/p&gt;

&lt;p&gt;Implementation: &lt;strong&gt;Launch multiple parallel sub-Agents&lt;/strong&gt;, each constrained to a different dimension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sub-Agent A: Minimum method count (minimize interface surface)
Sub-Agent B: Maximum flexibility (maximize extensibility)
Sub-Agent C: Optimize common cases (minimize usage friction)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the three proposals, it evaluates them across four dimensions: simplicity, generalizability, implementation efficiency, and "depth" (how much complexity does the interface hide). Then it helps you decide.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;to-issues&lt;/code&gt; — Vertical Slice Decomposition
&lt;/h3&gt;

&lt;p&gt;Breaking a feature plan into GitHub Issues sounds ordinary. The key is &lt;em&gt;how&lt;/em&gt; it slices: &lt;strong&gt;vertical slices&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Horizontal slicing (don't do this):
   Issue 1: Database Schema
   Issue 2: API layer
   Issue 3: Frontend UI
   Issue 4: Tests

✅ Vertical slicing (Matt's way):
   Issue 1: Users can create a basic draft (schema + API + UI + tests, all layers)
   Issue 2: Drafts can contain rich text content
   Issue 3: Drafts can be published
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Issue is classified as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HITL&lt;/strong&gt; (Human In The Loop): Requires human decision before proceeding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AFK&lt;/strong&gt; (Away From Keyboard): Safe to run unattended&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;git-guardrails-claude-code&lt;/code&gt; — Stop AI From Deleting Your Work
&lt;/h3&gt;

&lt;p&gt;Intercepts dangerous git commands via Claude Code's &lt;code&gt;PreToolUse&lt;/code&gt; hook. Blocked operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git push &lt;span class="o"&gt;(&lt;/span&gt;including &lt;span class="nt"&gt;--force&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
git reset &lt;span class="nt"&gt;--hard&lt;/span&gt;
git clean &lt;span class="nt"&gt;-f&lt;/span&gt; / &lt;span class="nt"&gt;-fd&lt;/span&gt;
git branch &lt;span class="nt"&gt;-D&lt;/span&gt;
git checkout &lt;span class="nb"&gt;.&lt;/span&gt;
git restore &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Claude tries to run one of these, the hook blocks the call and tells it that "it lacks authority to access these commands." The guardrails can be configured at project level or globally.&lt;/p&gt;
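
&lt;p&gt;A minimal sketch of how such a guard can work (not Matt's actual script): Claude Code pipes the pending tool call to a &lt;code&gt;PreToolUse&lt;/code&gt; hook as JSON on stdin, and exiting with status 2 blocks the call while feeding stderr back to Claude:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/bin/env python3
# Reads the pending Bash tool call from stdin and vetoes dangerous git.
import json
import re
import sys

BLOCKED = [
    r"git\s+push",
    r"git\s+reset\s+--hard",
    r"git\s+clean\s+-f",
    r"git\s+branch\s+-D",
    r"git\s+(checkout|restore)\s+\.",
]

event = json.load(sys.stdin)
command = event.get("tool_input", {}).get("command", "")

if any(re.search(p, command) for p in BLOCKED):
    print("it lacks authority to access these commands", file=sys.stderr)
    sys.exit(2)  # status 2 → block the tool call
sys.exit(0)      # anything else goes through
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;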

&lt;h3&gt;
  
  
  Full Skill Inventory
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;to-prd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Conversation context → structured PRD, submitted as GitHub Issue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;request-refactor-plan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Creates detailed refactor plans with small commits, refined through user interview&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;improve-codebase-architecture&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Finds architectural "deepening" opportunities using CONTEXT.md and ADR docs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;setup-pre-commit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Configures Husky + lint-staged + Prettier + type checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ubiquitous-language&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extracts DDD-style shared vocabulary from the current conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;write-a-skill&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Creates new skills following the standard structure (a skill that writes skills)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;edit-article&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Improves writing by restructuring sections and clarifying language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;obsidian-vault&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search, create, and manage notes in an Obsidian vault with wikilinks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What a Skill File Actually Looks Like
&lt;/h3&gt;

&lt;p&gt;Each skill is a standalone folder with &lt;code&gt;SKILL.md&lt;/code&gt; as the main file. The &lt;code&gt;tdd&lt;/code&gt; skill as an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;skills/tdd/
├── SKILL.md           ← Main skill definition
├── deep-modules.md    ← Reference: deep module design principles
├── interface-design.md
├── mocking.md
├── refactoring.md
└── tests.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SKILL.md&lt;/code&gt; structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# TDD&lt;/span&gt;

&lt;span class="gu"&gt;## Goal&lt;/span&gt;
Build features one vertical slice at a time using TDD...

&lt;span class="gu"&gt;## Steps&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Tracer Bullet: First, write the minimal end-to-end path...
&lt;span class="p"&gt;2.&lt;/span&gt; Increment: For each behavior to implement...
   a. Write a failing test
   b. Write minimal code to pass the test
   c. Refactor if needed

&lt;span class="gu"&gt;## Rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; NO horizontal slicing...
&lt;span class="p"&gt;-&lt;/span&gt; Tests should verify behavior, not implementation details...

&lt;span class="gu"&gt;## Resources&lt;/span&gt;
@deep-modules.md
@interface-design.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plain text, no special syntax, no runtime dependencies. It's configuration as documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why "Configuring Your AI" Is Becoming Engineering Practice
&lt;/h3&gt;

&lt;p&gt;Three AI-configuration repos dominating Trending on the same day isn't coincidence. It points at something structural:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Same models (GPT-4o, Claude Opus) — available to everyone
Same tools (Cursor, Claude Code) — download and install
Same subscription fees

Why do some people get 2x productivity from Claude Code
while others fight hallucinations and throwaway code all day?

The gap isn't the model. It's the configuration:
  • How you describe the boundaries of a problem
  • Which checkpoints require human sign-off
  • What engineering conventions to follow
  • Which mistakes you tolerate, which you don't

Nobody teaches you this. Nobody sells it.
You grind it out across weeks of daily use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Matt published what he ground out. The industry called this the &lt;strong&gt;"npm moment" for Claude Code Skills&lt;/strong&gt; — just as npm let the Node.js community share reusable packages, Skills is enabling the Claude Code community to share workflow recipes. JetBrains and other major vendors started publishing official skills packages shortly after this repo went viral.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Is Your .claude Directory Empty?"
&lt;/h3&gt;

&lt;p&gt;This is the most valuable question this project raises.&lt;/p&gt;

&lt;p&gt;If your &lt;code&gt;.claude&lt;/code&gt; directory (or &lt;code&gt;.cursorrules&lt;/code&gt;, or &lt;code&gt;AGENTS.md&lt;/code&gt;) is empty, you're starting from zero every time. Your AI doesn't remember the mistake you made last month, doesn't know your project's architecture conventions, doesn't understand your team's code standards. It's a new hire every single session.&lt;/p&gt;

&lt;p&gt;Matt's 21 skills aren't meant to be copied wholesale — he writes TypeScript, builds education products, his workflow probably looks different from yours. But he's provided a living sample of what it looks like when an engineer treats their AI configuration as a serious engineering artifact.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/mattpocock/skills" rel="noopener noreferrer"&gt;https://github.com/mattpocock/skills&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;Install&lt;/strong&gt;: &lt;code&gt;npx skills@latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;📰 &lt;strong&gt;AI Hero Newsletter&lt;/strong&gt;: &lt;a href="https://aihero.dev" rel="noopener noreferrer"&gt;https://aihero.dev&lt;/a&gt; (60,000 subscribers)&lt;/li&gt;
&lt;li&gt;🎓 &lt;strong&gt;Total TypeScript&lt;/strong&gt;: &lt;a href="https://totaltypescript.com" rel="noopener noreferrer"&gt;https://totaltypescript.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Related
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ts-reset&lt;/strong&gt; (8,400 ⭐): &lt;a href="https://github.com/total-typescript/ts-reset" rel="noopener noreferrer"&gt;https://github.com/total-typescript/ts-reset&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;evalite&lt;/strong&gt; (1,500 ⭐): LLM application evaluation tooling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simon Willison's LLM skills&lt;/strong&gt;: Another widely-referenced personal Claude skills repo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Protocol&lt;/strong&gt;: &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Anti-vibe-coding&lt;/strong&gt;: Most of Matt's skills aren't about generating code faster — they're about thinking through the problem more thoroughly before writing any. That's how senior engineers use AI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical slice first&lt;/strong&gt;: &lt;code&gt;tdd&lt;/code&gt;, &lt;code&gt;to-issues&lt;/code&gt;, and &lt;code&gt;triage-issue&lt;/code&gt; all enforce "complete one full path at a time" over "finish all of layer X first" — this is a genuine engineering philosophy, not just a style preference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill files are engineering artifacts&lt;/strong&gt;: Like Dockerfile, &lt;code&gt;.eslintrc&lt;/code&gt;, and &lt;code&gt;tsconfig.json&lt;/code&gt;, your AI configuration is infrastructure worth maintaining and versioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "npm moment" significance&lt;/strong&gt;: This repo's virality marked a shift — Claude Code Skills is becoming a community-shared ecosystem, not just personal configuration files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The gap is in configuration, not the model&lt;/strong&gt;: Same Claude Opus, two different engineers — the difference is what they've built around it&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Who Should Use This
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code power users&lt;/strong&gt;: Already using Claude Code and want a stricter, more reliable workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript engineers&lt;/strong&gt;: Some skills (&lt;code&gt;migrate-to-shoehorn&lt;/code&gt;, &lt;code&gt;scaffold-exercises&lt;/code&gt;) are TypeScript-ecosystem specific&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineers building their own skill library&lt;/strong&gt;: Matt's repo is the best reference sample available; &lt;code&gt;write-a-skill&lt;/code&gt; can even generate new skills for you&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering leads standardizing team AI practices&lt;/strong&gt;: &lt;code&gt;git-guardrails&lt;/code&gt;, &lt;code&gt;to-issues&lt;/code&gt;, &lt;code&gt;triage-issue&lt;/code&gt; can directly standardize how your team uses AI&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Question Worth Sitting With
&lt;/h3&gt;

&lt;p&gt;The line in Matt's README deserves to be read more than once: &lt;strong&gt;"Agent skills for real engineers, not vibe coding."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Behind that line is a judgment: there are two paths for AI-assisted engineering. One path uses AI to generate more code, faster — Vibe Coding. The other uses AI to make every decision before that code more rigorous — AI as a stricter thinking partner, not a faster typist.&lt;/p&gt;

&lt;p&gt;Both paths ship software. But they arrive at very different places.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Visit my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;personal site&lt;/a&gt; for more useful knowledge and interesting products&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>skill</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No.49): free-claude-code - Run Claude Code for Free with One Environment Variable</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Mon, 27 Apr 2026 06:10:47 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no49-free-claude-code-run-claude-code-for-free-with-one-ed6</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no49-free-claude-code-run-claude-code-for-free-with-one-ed6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"When a tool is too expensive, programmers build a cheaper key."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is article No.49 in the "One Open Source Project a Day" series. Today's project is &lt;strong&gt;free-claude-code&lt;/strong&gt; (&lt;a href="https://github.com/Alishahryar1/free-claude-code" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Claude Code is Anthropic's AI coding agent — deeply integrated into the terminal and VS Code, able to autonomously read files, edit code, and run commands. The catch: it requires a real Anthropic API key, and those API calls add up fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;free-claude-code&lt;/strong&gt; is built on a deceptively simple insight: Claude Code is just an HTTP client that calls the Anthropic Messages API. If you run a compatible proxy server locally that intercepts those requests and forwards them to any free or cheap backend, Claude Code can't tell the difference. Change one environment variable (&lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;), and suddenly your $20/month tool is running on NVIDIA's free GPU credits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;14,300+ Stars, 2,000+ Forks&lt;/strong&gt; — it surged to the top of GitHub Trending in April and stayed there for four consecutive days. The author, Ali Khokhar (Alishahryar1), was virtually unknown before this project. His other pinned repos have 3 stars each. This is a textbook "one-hit wonder" in the best sense.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The core proxy architecture: how &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; redirection works&lt;/li&gt;
&lt;li&gt;Multi-backend routing: NVIDIA NIM free tier, OpenRouter free models, local Ollama&lt;/li&gt;
&lt;li&gt;API format translation: adapting Anthropic Messages ↔ OpenAI Chat Completions&lt;/li&gt;
&lt;li&gt;Thinking Token conversion: mapping &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags to Claude's native thinking blocks&lt;/li&gt;
&lt;li&gt;Discord/Telegram bot mode: remote-control Claude Code from your phone&lt;/li&gt;
&lt;li&gt;Real limitations: what you actually lose when you swap out the model&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Comfortable with terminal operations and environment variables&lt;/li&gt;
&lt;li&gt;Basic familiarity with Claude Code&lt;/li&gt;
&lt;li&gt;Basic understanding of REST APIs (optional)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is It?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;free-claude-code&lt;/strong&gt; is a local HTTP proxy server built with FastAPI that emulates the Anthropic API interface. When Claude Code sends API requests, the proxy intercepts them, translates formats, routes to a configured free backend, converts the response back to Anthropic format, and returns it to Claude Code — completely transparently.&lt;/p&gt;

&lt;p&gt;The name is blunt and accurate: &lt;code&gt;free-claude-code&lt;/code&gt;. It runs Claude Code for free.&lt;/p&gt;

&lt;p&gt;The core architecture in one sentence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code → sends Anthropic API requests
           → proxy intercepts at localhost:8082
           → translates to OpenAI format
           → forwards to NVIDIA NIM / OpenRouter / Ollama
           → translates response back to Anthropic format
           → returns to Claude Code as if nothing happened
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  About the Author
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Author&lt;/strong&gt;: Ali Khokhar (GitHub: Alishahryar1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Location&lt;/strong&gt;: Sunnyvale, California&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bio&lt;/strong&gt;: "Writing easily understandable code..."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background&lt;/strong&gt;: Individual developer; before this project, essentially no significant GitHub presence. Classic "overnight" open-source success.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Stats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;GitHub Stars&lt;/strong&gt;: 14,300+ (10,700+ in April alone)&lt;/li&gt;
&lt;li&gt;🍴 &lt;strong&gt;Forks&lt;/strong&gt;: 2,000+&lt;/li&gt;
&lt;li&gt;👥 &lt;strong&gt;Contributors&lt;/strong&gt;: 22&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Trend&lt;/strong&gt;: Topped GitHub Trending (Python + global) April 24-27&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6 Free Backends
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NVIDIA NIM    → Free tier: 40 req/min, models: GLM-4, Llama 3, Mistral
OpenRouter    → 580+ models, many with daily free quotas
DeepSeek      → Ultra-low cost, native Anthropic Messages format support
LM Studio     → Local GUI for running quantized models
llama.cpp     → CPU/GPU inference, maximum control
Ollama        → One-line local model deployment, completely offline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Per-Model-Tier Routing
&lt;/h3&gt;

&lt;p&gt;Claude Code internally uses three model "tiers" (Opus, Sonnet, Haiku) for different task complexity levels. free-claude-code lets you route each tier to a different backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Route heavy reasoning tasks to a large model&lt;/span&gt;
&lt;span class="nv"&gt;MODEL_OPUS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/nvidia/llama-3.1-nemotron-ultra-253b-v1

&lt;span class="c"&gt;# Route standard coding to a mid-size model&lt;/span&gt;
&lt;span class="nv"&gt;MODEL_SONNET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/nvidia/llama-3.3-70b-instruct

&lt;span class="c"&gt;# Route simple tasks (status checks, classification) to a small fast model&lt;/span&gt;
&lt;span class="nv"&gt;MODEL_HAIKU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/meta/llama-3.1-8b-instruct
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Thinking Token Conversion
&lt;/h3&gt;

&lt;p&gt;Some open-source models (DeepSeek-R1, QwQ, GLM-Z1) output their reasoning wrapped in &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags. The proxy automatically extracts these and maps them to Claude's native &lt;code&gt;thinking&lt;/code&gt; content blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Proxy logic (simplified)
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;think&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;thinking_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_between_tags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;think&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/think&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;thinking_content&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means VS Code's Claude Code extension — which collapses and displays thinking blocks — works correctly with open-source reasoning models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intelligent Rate Limiting
&lt;/h3&gt;

&lt;p&gt;NVIDIA NIM's free tier caps at 40 requests/minute. Claude Code's aggressive request pattern (frequent context sends, tool calls) hits this quickly. The proxy implements three layers of protection (sketched after the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Proactive throttling&lt;/strong&gt;: Before sending a request, predicts whether it would exceed the rate limit and preemptively waits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reactive backoff&lt;/strong&gt;: On receiving &lt;code&gt;429 Too Many Requests&lt;/code&gt;, parses &lt;code&gt;retry-after&lt;/code&gt; headers or applies exponential backoff&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency control&lt;/strong&gt;: &lt;code&gt;PROVIDER_MAX_CONCURRENCY&lt;/code&gt; env var limits simultaneous in-flight requests&lt;/li&gt;
&lt;/ol&gt;
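
&lt;p&gt;A sketch of the first two layers (constants are illustrative; the real proxy wires this into its request path):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

RATE_LIMIT_PER_MIN = 40
MIN_INTERVAL = 60.0 / RATE_LIMIT_PER_MIN
_last_request = 0.0

def throttle():
    # Proactive: wait until the next request slot before sending.
    global _last_request
    wait = _last_request + MIN_INTERVAL - time.monotonic()
    if wait &amp;gt; 0:
        time.sleep(wait)
    _last_request = time.monotonic()

def backoff_on_429(attempt, retry_after=None):
    # Reactive: honor the server's retry-after header if present,
    # otherwise fall back to capped exponential backoff.
    delay = float(retry_after) if retry_after else min(2 ** attempt, 60)
    time.sleep(delay)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;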

&lt;h3&gt;
  
  
  Discord/Telegram Bot Mode
&lt;/h3&gt;

&lt;p&gt;Beyond pure proxying, the project manages Claude Code's full lifecycle in bot mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phone message: "refactor the auth module to use async/await"
      ↓
Telegram/Discord bot receives message
      ↓
Spawns Claude Code subprocess in CLAUDE_WORKSPACE directory
      ↓
Streams real-time output back to your chat window
      ↓
Session persisted for follow-up messages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
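
&lt;p&gt;The lifecycle management reduces to a few lines. In this sketch, &lt;code&gt;claude -p&lt;/code&gt; is Claude Code's non-interactive print mode, and &lt;code&gt;send_to_chat&lt;/code&gt; stands in for the bot framework's reply hook:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import subprocess

WORKSPACE = os.environ.get("CLAUDE_WORKSPACE", ".")

def handle_chat_message(text, send_to_chat):
    # Point the subprocess at the local proxy, run the task inside the
    # configured workspace, and stream output back to the chat.
    env = dict(os.environ, ANTHROPIC_BASE_URL="http://localhost:8082")
    proc = subprocess.Popen(
        ["claude", "-p", text],
        cwd=WORKSPACE, env=env,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    for line in proc.stdout:
        send_to_chat(line.rstrip())
    proc.wait()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;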



&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone the repo&lt;/span&gt;
git clone https://github.com/Alishahryar1/free-claude-code.git
&lt;span class="nb"&gt;cd &lt;/span&gt;free-claude-code

&lt;span class="c"&gt;# 2. Copy config&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env

&lt;span class="c"&gt;# 3. Edit .env — add your free NVIDIA NIM key&lt;/span&gt;
&lt;span class="c"&gt;# NVIDIA_NIM_API_KEY=nvapi-xxxx&lt;/span&gt;
&lt;span class="c"&gt;# MODEL_OPUS=nvidia_nim/nvidia/llama-3.1-nemotron-ultra-253b-v1&lt;/span&gt;
&lt;span class="c"&gt;# MODEL_SONNET=nvidia_nim/nvidia/llama-3.3-70b-instruct&lt;/span&gt;
&lt;span class="c"&gt;# MODEL_HAIKU=nvidia_nim/meta/llama-3.1-8b-instruct&lt;/span&gt;

&lt;span class="c"&gt;# 4. Start the proxy (requires uv: pip install uv)&lt;/span&gt;
uv run uvicorn server:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8082

&lt;span class="c"&gt;# 5. In another terminal, run Claude Code through the proxy&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8082"&lt;/span&gt; claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Claude Code is now running on NVIDIA's free GPU infrastructure. No Anthropic API key. No charges.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture: Transparent Proxy Pattern
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code CLI / VS Code Extension
              │
              │  POST /v1/messages (Anthropic format)
              ▼
┌─────────────────────────────────────┐
│        free-claude-code Proxy       │
│  ┌────────────────────────────────┐ │
│  │    FastAPI Router              │ │
│  │  ┌──────────────────────────┐  │ │
│  │  │ Request parsing &amp;amp;        │  │ │
│  │  │ model tier routing       │  │ │
│  │  └──────────┬───────────────┘  │ │
│  │             │                   │ │
│  │  ┌──────────▼───────────────┐  │ │
│  │  │ Format translation layer │  │ │
│  │  │  Anthropic → OpenAI      │  │ │
│  │  │  (or direct passthrough) │  │ │
│  │  └──────────┬───────────────┘  │ │
│  │             │                   │ │
│  │  ┌──────────▼───────────────┐  │ │
│  │  │ Rate limiting &amp;amp; retry    │  │ │
│  │  └──────────┬───────────────┘  │ │
│  └─────────────┼─────────────────┘ │
└────────────────┼────────────────────┘
                 │
    ┌────────────┼────────────┐
    ▼            ▼            ▼
NVIDIA NIM   OpenRouter    Ollama
(free tier)  (free models) (local)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Hard Part: API Format Translation
&lt;/h3&gt;

&lt;p&gt;This is the real engineering challenge. Anthropic Messages API and OpenAI Chat Completions API differ significantly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic format (what Claude Code sends):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-opus-4-5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Refactor this code"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool_use_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"xxx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"budget_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OpenAI format (what NVIDIA NIM / OpenRouter accepts):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nvidia/llama-3.1-nemotron-ultra-253b-v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Refactor this code"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stream"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The translation points the proxy handles (the request side is sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal content&lt;/strong&gt;: Flatten Anthropic's &lt;code&gt;content&lt;/code&gt; array (text, images, tool results) into OpenAI's string or object format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calls&lt;/strong&gt;: &lt;code&gt;tool_use&lt;/code&gt; blocks ↔ &lt;code&gt;function_call&lt;/code&gt; / &lt;code&gt;tool_calls&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming&lt;/strong&gt;: Convert backend SSE stream to Anthropic's &lt;code&gt;event: content_block_delta&lt;/code&gt; format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking blocks&lt;/strong&gt;: Extract &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags into separate &lt;code&gt;thinking&lt;/code&gt; content blocks&lt;/li&gt;
&lt;/ul&gt;
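
&lt;p&gt;A minimal sketch of the request-side conversion (illustrative only; the real proxy also handles images, tool schemas, and streaming):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def anthropic_to_openai(body, target_model):
    # Flatten Anthropic content-block arrays into plain strings and
    # rebuild the request in OpenAI Chat Completions shape.
    messages = []
    for msg in body["messages"]:
        content = msg["content"]
        if isinstance(content, list):
            parts = []
            for block in content:
                if block["type"] == "text":
                    parts.append(block["text"])
                elif block["type"] == "tool_result":
                    parts.append(f"[tool result] {block.get('content', '')}")
            content = "\n".join(parts)
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": target_model,   # e.g. the MODEL_SONNET mapping from .env
        "messages": messages,
        "max_tokens": body.get("max_tokens", 4096),
        "stream": True,
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;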

&lt;h3&gt;
  
  
  Full Environment Variable Reference
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# === Backend selection ===&lt;/span&gt;
&lt;span class="nv"&gt;NVIDIA_NIM_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvapi-xxxxx
&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-xxxxx
&lt;span class="nv"&gt;DEEPSEEK_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-xxxxx
&lt;span class="nv"&gt;OLLAMA_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:11434

&lt;span class="c"&gt;# === Model tier routing ===&lt;/span&gt;
&lt;span class="nv"&gt;MODEL_OPUS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/nvidia/llama-3.1-nemotron-ultra-253b-v1
&lt;span class="nv"&gt;MODEL_SONNET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/nvidia/llama-3.3-70b-instruct
&lt;span class="nv"&gt;MODEL_HAIKU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/meta/llama-3.1-8b-instruct

&lt;span class="c"&gt;# === Thinking support ===&lt;/span&gt;
&lt;span class="nv"&gt;ENABLE_SONNET_THINKING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true
&lt;/span&gt;&lt;span class="nv"&gt;ENABLE_OPUS_THINKING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# === Rate limiting ===&lt;/span&gt;
&lt;span class="nv"&gt;PROVIDER_RATE_LIMIT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1           &lt;span class="c"&gt;# max requests/second&lt;/span&gt;
&lt;span class="nv"&gt;PROVIDER_MAX_CONCURRENCY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5      &lt;span class="c"&gt;# max concurrent requests&lt;/span&gt;

&lt;span class="c"&gt;# === Bot config (optional) ===&lt;/span&gt;
&lt;span class="nv"&gt;MESSAGING_PLATFORM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;telegram
&lt;span class="nv"&gt;TELEGRAM_BOT_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;xxxxx
&lt;span class="nv"&gt;ALLOWED_TELEGRAM_USER_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456789
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Limitations Worth Knowing
&lt;/h2&gt;

&lt;p&gt;Before setting this up, these caveats matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Model quality gap is the fundamental trade-off&lt;/strong&gt;&lt;br&gt;
free-claude-code gives you the Claude Code &lt;em&gt;interface&lt;/em&gt;, not Claude &lt;em&gt;models&lt;/em&gt;. Open-source models on free tiers lag behind Claude Sonnet/Opus on complex multi-step reasoning, instruction following stability, and tool call reliability. One YouTube reviewer put it clearly: "Simply wrapping Claude's application layer around open LLMs will not produce the same quality output."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. NVIDIA NIM free tier exhausts quickly&lt;/strong&gt;&lt;br&gt;
40 req/min sounds like a lot until Claude Code starts sending its full context window on every turn. Rate-limited sessions introduce noticeable pauses. Real coding sessions will hit the ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Tool call compatibility is imperfect&lt;/strong&gt;&lt;br&gt;
Claude Code depends heavily on structured tool calls (file read/write, bash execution, search). Open-source models vary in their tool call formatting discipline. The proxy includes heuristic parsing as a fallback, but failures happen — especially with smaller models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Claude-specific features unavailable&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Computer Use (Anthropic's vision+interaction feature)&lt;/li&gt;
&lt;li&gt;True Extended Thinking (deep reasoning mode)&lt;/li&gt;
&lt;li&gt;Latest Claude training data and safety alignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Local models need real hardware&lt;/strong&gt;&lt;br&gt;
Running quality coding models locally (e.g., Qwen2.5-Coder-32B via Ollama) requires 20-24GB+ VRAM. Most consumer GPUs won't run the best models comfortably.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;vs. free-claude-code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;openclaw&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Alternative Claude Code CLI&lt;/td&gt;
&lt;td&gt;Independent implementation; doesn't use the official Claude Code client&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aider&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standalone AI coding CLI&lt;/td&gt;
&lt;td&gt;Mature and stable; native multi-model support; different UX entirely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenCode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Terminal AI coding tool&lt;/td&gt;
&lt;td&gt;Native multi-model design, not a proxy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;free-claude-code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API proxy layer&lt;/td&gt;
&lt;td&gt;Preserves the complete Claude Code UX; just swaps the backend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The unique value: users already fluent in Claude Code's workflow don't need to learn a new tool. Change one environment variable, and costs drop to zero.&lt;/p&gt;
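
&lt;p&gt;For concreteness, the switch looks roughly like this (a sketch; the proxy's listen address is an assumption, so check the project README for the actual port):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Point the official Claude Code client at the local proxy instead of Anthropic
export ANTHROPIC_BASE_URL=http://localhost:8082   # port is an assumption
claude   # launch as usual; requests now flow through the proxy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;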




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Alishahryar1/free-claude-code" rel="noopener noreferrer"&gt;https://github.com/Alishahryar1/free-claude-code&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📋 &lt;strong&gt;Issues&lt;/strong&gt;: &lt;a href="https://github.com/Alishahryar1/free-claude-code/issues" rel="noopener noreferrer"&gt;https://github.com/Alishahryar1/free-claude-code/issues&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Related
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://build.nvidia.com/" rel="noopener noreferrer"&gt;NVIDIA NIM free API signup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openrouter.ai/models?q=free" rel="noopener noreferrer"&gt;OpenRouter free model list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama — local model deployment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code official docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The insight is the trick&lt;/strong&gt;: Claude Code is just an Anthropic API client. &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; redirection is all it takes — the engineering complexity is in the format translation layer, not the core idea&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real technical work&lt;/strong&gt;: Anthropic ↔ OpenAI format conversion, streaming response handling, tool call adaptation, and Thinking Token mapping — these are non-trivial engineering challenges the project solves well&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Free" has a price&lt;/strong&gt;: You save on API costs but trade model quality. NVIDIA NIM's free tier rate limits will make sessions feel slow during heavy usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community momentum&lt;/strong&gt;: 14k+ stars, 22 contributors, active issues — the project is iterating fast on the rough edges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case clarity&lt;/strong&gt;: Best for learning/experimentation and offline private deployment; not a substitute for production-grade Claude in complex agentic tasks&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Who Should Use This
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Students and hobbyists&lt;/strong&gt;: Want to experience Claude Code's full terminal agent workflow without paying for an API subscription&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model researchers&lt;/strong&gt;: Want to compare open-source models using Claude Code's interface as a consistent test harness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise private deployment&lt;/strong&gt;: Using Ollama for a fully offline, air-gapped AI coding assistant on internal infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote coding enthusiasts&lt;/strong&gt;: Using the Telegram/Discord bot to control a remote server's Claude Code instance from a phone&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Question Worth Sitting With
&lt;/h3&gt;

&lt;p&gt;free-claude-code's virality is more than just "free stuff is popular." It reflects a specific tension: Anthropic built a genuinely excellent developer tool, then gated it behind a per-token billing model that makes sustained use expensive. The community's immediate response was to route around it.&lt;/p&gt;

&lt;p&gt;The question isn't whether this is "ethical" — it's clearly operating in a gray zone. The more interesting question is what it signals: &lt;strong&gt;when developers immediately build free alternatives to paid AI tools, it suggests the underlying capability is perceived as infrastructure, not a premium product&lt;/strong&gt;. Infrastructure wants to be free or at least flat-rate. The market is making that preference clear.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Visit my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;personal site&lt;/a&gt; for more useful knowledge and interesting products&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>claude</category>
      <category>nvidia</category>
      <category>llm</category>
    </item>
    <item>
      <title>Claude Code + SonarQube Static Analysis: The AI Quality Loop is Finally Closed</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Mon, 27 Apr 2026 05:44:14 +0000</pubDate>
      <link>https://dev.to/wonderlab/claude-code-sonarqube-static-analysis-the-ai-quality-loop-is-finally-closed-3gh0</link>
      <guid>https://dev.to/wonderlab/claude-code-sonarqube-static-analysis-the-ai-quality-loop-is-finally-closed-3gh0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Sound familiar? You finish writing some code, open a PR, and then CI blows up with a wall of static analysis findings. You spend the next hour tracking down issues that, honestly, you could have caught at the keyboard.&lt;/p&gt;

&lt;p&gt;SonarQube is one of the most widely adopted code quality platforms in the industry. It detects bugs, vulnerabilities, code smells, and security hotspots, and tracks coverage and duplication. Now it can be integrated directly into Claude Code — so the same AI session that helps you write code can also scan it, flag problems, and fix them on the spot.&lt;/p&gt;

&lt;p&gt;This article is a complete walkthrough: from first install to day-to-day usage. But before we get into the steps, there is &lt;strong&gt;one critical gotcha&lt;/strong&gt; you need to know about first — otherwise you'll burn hours troubleshooting something that has nothing to do with your configuration.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Gotcha: SonarQube Server Version
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key point&lt;/strong&gt;: sonarqube-cli (and the sonarqube-mcp-server behind it) &lt;strong&gt;does not support SonarQube Server 9.x&lt;/strong&gt;. You must be running &lt;strong&gt;10.x or later&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our team had SonarQube &lt;strong&gt;9.9 LTS&lt;/strong&gt; deployed — a rock-solid long-term support release that plenty of organizations still run. We followed the official docs to the letter, set up sonarqube-cli, ran authentication, and kept getting connection errors with no useful message. We spent a good chunk of time ruling out network issues, token problems, and firewall rules.&lt;/p&gt;

&lt;p&gt;The root cause turned out to be simple: sonarqube-mcp-server calls the &lt;strong&gt;next-generation SonarQube API&lt;/strong&gt; (&lt;code&gt;/api/v2/&lt;/code&gt; prefix). Those endpoints were introduced in version 10.x. They simply do not exist on a 9.9 instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: We spun up a fresh SonarQube Server &lt;strong&gt;10.x&lt;/strong&gt; instance, pointed the config at it, and everything worked immediately.&lt;/p&gt;
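
&lt;p&gt;If you just need a disposable 10.x+ instance to test against, the standard Docker quickstart is enough (a sketch, not a production setup; data is lost with the container unless you mount volumes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Current Community images are well past 10.0
docker run -d --name sonarqube -p 9000:9000 sonarqube:community

# First login is admin/admin; you'll be prompted to change it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;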

&lt;p&gt;&lt;br&gt;
&lt;strong&gt;Check your version before anything else.&lt;/strong&gt; Log into the SonarQube console and look in the bottom-right corner or under &lt;code&gt;Administration &amp;gt; System&lt;/code&gt;. If you see a 9.x version number, you need to upgrade or deploy a separate 10.x instance before continuing with any of the steps below.&lt;br&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Version Compatibility at a Glance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SonarQube Server Version&lt;/th&gt;
&lt;th&gt;Works with sonarqube-cli / MCP?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;9.9 LTS and below&lt;/td&gt;
&lt;td&gt;❌ Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10.0 – 10.x&lt;/td&gt;
&lt;td&gt;✅ Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SonarQube Cloud&lt;/td&gt;
&lt;td&gt;✅ Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Understanding the three-layer stack makes troubleshooting much easier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code
    │
    ├── sonarqube-agent-plugins    ← Plugin layer: slash commands and Skills
    │       └── /sonar-analyze, /sonar-integrate, etc.
    │
    ├── sonarqube-cli (sonar)      ← CLI layer: lightweight tool for auth and analysis
    │       └── ~/.local/share/sonarqube-cli/bin/sonar
    │
    └── sonarqube-mcp-server       ← MCP layer: containerized server for deep analysis
            └── Runs via Docker/Podman, calls SonarQube Server API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer has a distinct responsibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;sonarqube-agent-plugins&lt;/strong&gt;: The official plugin bundle that injects Sonar slash commands and Skills into Claude Code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sonarqube-cli&lt;/strong&gt;: A lightweight CLI tool that handles authentication and basic analysis — &lt;strong&gt;no container required&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sonarqube-mcp-server&lt;/strong&gt;: A Docker/Podman container that acts as the MCP server, powering advanced capabilities like coverage, quality gates, and duplication detection&lt;/li&gt;
&lt;/ul&gt;
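
&lt;p&gt;Once everything is installed (steps below), each layer can be probed independently when something misbehaves; a sketch using commands that appear later in this guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Plugin layer: type /sonar inside Claude Code and check the command list appears

# CLI layer: binary present and authenticated?
~/.local/share/sonarqube-cli/bin/sonar auth status

# MCP layer: container runtime alive and image pulled?
podman info                                  # or: docker info
podman images | grep sonarqube-mcp-server   # image name from the MCP config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;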




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Make sure the following are in place before you start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node.js 18+&lt;/strong&gt;: Required by the plugin's &lt;code&gt;SessionStart&lt;/code&gt; hook script (&lt;code&gt;scripts/setup.js&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker or Podman&lt;/strong&gt;: The MCP Server runs as a container

&lt;ul&gt;
&lt;li&gt;On macOS in corporate environments, Docker Desktop is often disallowed (licensing); use &lt;strong&gt;Podman&lt;/strong&gt; instead (see the macOS section below)&lt;/li&gt;
&lt;li&gt;Linux and Windows can use Docker directly&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SonarQube Server 10.x&lt;/strong&gt; (or SonarQube Cloud): Deployed and reachable over the network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser logged into SonarQube&lt;/strong&gt;: The OAuth authorization step requires a browser session&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Install the sonarqube-agent-plugins Plugin
&lt;/h3&gt;

&lt;p&gt;Open Claude Code and run these two slash commands in order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/plugin marketplace add SonarSource/sonarqube-agent-plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/plugin install sonarqube@sonar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then reload the plugins (or start a fresh Claude Code session):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/reload-plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify&lt;/strong&gt;: Type &lt;code&gt;/sonar&lt;/code&gt; in Claude Code. If you see a list of Sonar commands, the plugin installed correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Run the Integration Wizard
&lt;/h3&gt;

&lt;p&gt;In Claude Code, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/sonar-integrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This launches an interactive wizard that walks you through: installing sonarqube-cli → connecting to SonarQube Server → completing OAuth authorization → registering the MCP Server.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.1 Install sonarqube-cli
&lt;/h4&gt;

&lt;p&gt;The wizard installs &lt;code&gt;sonarqube-cli&lt;/code&gt; automatically in the first step. The CLI lands at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/.local/share/sonarqube-cli/bin/sonar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a "command not found" error when running &lt;code&gt;sonar&lt;/code&gt; later, add it to your PATH manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add to ~/.zshrc or ~/.bashrc&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PATH="$HOME/.local/share/sonarqube-cli/bin:$PATH"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.2 Connect to SonarQube Server
&lt;/h4&gt;

&lt;p&gt;The wizard prompts you to choose a connection method. Select &lt;strong&gt;option 4: "Type something"&lt;/strong&gt; and enter your SonarQube Server URL directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;http://your-sonarqube-server:9000/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;br&gt;
Reminder: this URL must point to a &lt;strong&gt;SonarQube 10.x or later&lt;/strong&gt; instance. Pointing at a 9.9 server here will cause auth and scan failures downstream.&lt;br&gt;
&lt;/p&gt;

&lt;h4&gt;
  
  
  2.3 Authenticate
&lt;/h4&gt;

&lt;p&gt;Once the wizard recognizes the server address, it tells you the next step. Run the auth command in the Claude Code terminal or a separate terminal window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sonar auth login &lt;span class="nt"&gt;-s&lt;/span&gt; http://your-sonarqube-server:9000/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;strong&gt;automatically opens a browser&lt;/strong&gt; and navigates to the SonarQube authorization page. Click &lt;strong&gt;Allow connection&lt;/strong&gt; to complete the flow.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
&lt;strong&gt;Order matters&lt;/strong&gt;: make sure your browser is already logged into SonarQube before running this command. If you're not logged in, the redirect will take you to the login screen first and the flow gets messier.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Once you've clicked Allow in the browser, return to Claude Code and tell it "Authorization complete." Claude Code will run a connection health check and, if everything passes, move on to the next step.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.4 Choose Integration Scope
&lt;/h4&gt;

&lt;p&gt;The wizard asks where to apply the SonarQube integration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Current project&lt;/strong&gt;: Takes effect only in the current working directory. Config is written to the project-level &lt;code&gt;.claude/&lt;/code&gt; directory. Recommended for shared codebases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global&lt;/strong&gt;: Applies to all projects. Config goes into &lt;code&gt;~/.claude/&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After you choose, Claude Code automatically registers the MCP Server in the appropriate config file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Once this is done, exit Claude Code completely.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Corporate Network: Replacing the Docker Image Mirror
&lt;/h2&gt;

&lt;p&gt;In corporate environments, Docker Hub (&lt;code&gt;registry-1.docker.io&lt;/code&gt;) is often blocked. You'll need to point the sonarqube-mcp-server image at an internal mirror.&lt;/p&gt;

&lt;h3&gt;
  
  
  Edit the MCP Config
&lt;/h3&gt;

&lt;p&gt;The MCP config lives in &lt;code&gt;~/.claude.json&lt;/code&gt; (global) or &lt;code&gt;.claude/claude.json&lt;/code&gt; (project-level). Find the sonarqube entry under &lt;code&gt;mcpServers&lt;/code&gt; and update the image reference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt; (using a JFrog Artifactory proxy):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Before&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sonarsource/sonarqube-mcp-server:latest"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;After&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;internal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;mirror&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;proxy&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jfrog.yourcompany.com/external-docker-public-virtual/sonarsource/sonarqube-mcp-server:latest"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;br&gt;
&lt;strong&gt;Mapping rule&lt;/strong&gt;: prepend your internal registry host to the original image name: &lt;code&gt;&amp;lt;registry&amp;gt;/&amp;lt;original-image&amp;gt;:&amp;lt;tag&amp;gt;&lt;/code&gt;. Check with your infrastructure team for the exact proxy URL and path prefix.&lt;br&gt;
&lt;/p&gt;
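
&lt;p&gt;Before relaunching Claude Code, it's worth confirming the remapped image actually resolves through the mirror (using the illustrative proxy path from the example above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Should complete without touching registry-1.docker.io
docker pull jfrog.yourcompany.com/external-docker-public-virtual/sonarsource/sonarqube-mcp-server:latest
# or, with Podman:
podman pull jfrog.yourcompany.com/external-docker-public-virtual/sonarsource/sonarqube-mcp-server:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;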




&lt;h2&gt;
  
  
  macOS: Using Podman Instead of Docker
&lt;/h2&gt;

&lt;p&gt;Many corporate macOS environments prohibit Docker Desktop due to licensing requirements. Podman is a fully open-source, Docker-compatible alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Podman
&lt;/h3&gt;

&lt;p&gt;Download the macOS installer (&lt;code&gt;.pkg&lt;/code&gt;) from &lt;a href="https://podman.io/" rel="noopener noreferrer"&gt;podman.io&lt;/a&gt; and double-click to install.&lt;/p&gt;

&lt;p&gt;Add Podman to your PATH:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PATH="/opt/podman/bin:$PATH"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;which podman
&lt;span class="c"&gt;# /opt/podman/bin/podman&lt;/span&gt;

podman &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# podman version 5.x.x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Initialize the Podman Machine
&lt;/h3&gt;

&lt;p&gt;On macOS, Podman needs a lightweight VM to run containers (similar to Docker Desktop's VM layer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# First-time setup (downloads ~500 MB base image — takes a while)&lt;/span&gt;
podman machine init

&lt;span class="c"&gt;# Start the VM&lt;/span&gt;
podman machine start

&lt;span class="c"&gt;# Check status&lt;/span&gt;
podman machine list
&lt;span class="c"&gt;# NAME                     VM TYPE  CREATED  LAST UP            CPUS  MEMORY  DISK SIZE&lt;/span&gt;
&lt;span class="c"&gt;# podman-machine-default*  applehv  ...      Currently running  5     2GiB    100GiB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;br&gt;
&lt;strong&gt;After every Mac restart&lt;/strong&gt;, Podman Machine does not start automatically. You need to run &lt;code&gt;podman machine start&lt;/code&gt; manually, or configure a launchd service to start it on login.&lt;br&gt;
&lt;/p&gt;
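
&lt;p&gt;For the launchd route, a minimal sketch looks like this (the label and plist file name are arbitrary choices; the podman path matches the install location above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch: a LaunchAgent that runs "podman machine start" at login
cat &amp;gt; ~/Library/LaunchAgents/local.podman.machine.plist &amp;lt;&amp;lt;'EOF'
&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"&amp;gt;
&amp;lt;plist version="1.0"&amp;gt;
&amp;lt;dict&amp;gt;
  &amp;lt;key&amp;gt;Label&amp;lt;/key&amp;gt;&amp;lt;string&amp;gt;local.podman.machine&amp;lt;/string&amp;gt;
  &amp;lt;key&amp;gt;ProgramArguments&amp;lt;/key&amp;gt;
  &amp;lt;array&amp;gt;
    &amp;lt;string&amp;gt;/opt/podman/bin/podman&amp;lt;/string&amp;gt;
    &amp;lt;string&amp;gt;machine&amp;lt;/string&amp;gt;
    &amp;lt;string&amp;gt;start&amp;lt;/string&amp;gt;
  &amp;lt;/array&amp;gt;
  &amp;lt;key&amp;gt;RunAtLoad&amp;lt;/key&amp;gt;&amp;lt;true/&amp;gt;
&amp;lt;/dict&amp;gt;
&amp;lt;/plist&amp;gt;
EOF

launchctl load ~/Library/LaunchAgents/local.podman.machine.plist
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;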




&lt;h2&gt;
  
  
  Start Up and Verify
&lt;/h2&gt;

&lt;p&gt;With all configuration done, &lt;strong&gt;quit Claude Code fully and relaunch it&lt;/strong&gt; so the MCP config takes effect.&lt;/p&gt;

&lt;p&gt;On the first launch, Claude Code will pull the sonarqube-mcp-server container image. &lt;strong&gt;This will be slow the first time&lt;/strong&gt; — give it a minute or two. Then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When sonarqube shows as &lt;strong&gt;connected&lt;/strong&gt;, the integration is live.&lt;/p&gt;
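
&lt;p&gt;If that first pull is slow or flaky, you can pre-pull the image outside Claude Code (swap in your mirror path if you remapped it in the corporate-network section):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman pull sonarsource/sonarqube-mcp-server:latest
# or: docker pull sonarsource/sonarqube-mcp-server:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;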




&lt;h2&gt;
  
  
  Optional: sonar-project.properties
&lt;/h2&gt;

&lt;p&gt;Create a &lt;code&gt;sonar-project.properties&lt;/code&gt; file in your project root to pre-declare project metadata. This lets the analysis commands auto-detect the project without needing you to pass the project key manually each time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;sonar.projectKey&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;my-project&lt;/span&gt;
&lt;span class="py"&gt;sonar.projectName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;My Project&lt;/span&gt;
&lt;span class="py"&gt;sonar.projectVersion&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1.0&lt;/span&gt;
&lt;span class="py"&gt;sonar.sources&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;src&lt;/span&gt;
&lt;span class="py"&gt;sonar.sourceEncoding&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;UTF-8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Command Reference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CLI Commands (no MCP required)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-integrate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Re-run setup: re-authenticate, re-register MCP, reinstall hooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-list-projects [keyword]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List accessible SonarQube projects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-list-issues [project] [--severity CRITICAL]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search and filter project issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-fix-issue &amp;lt;rule&amp;gt; &amp;lt;file&amp;gt;[:&amp;lt;line&amp;gt;]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fix a specific rule violation in a file&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
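
&lt;p&gt;For example, to triage only the most severe findings in one project (the project key here is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/sonar-list-issues my-project --severity CRITICAL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;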

&lt;h3&gt;
  
  
  MCP Commands (requires connected MCP Server)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-analyze [file path]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Analyze a single file and display issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-quality-gate [project] [--branch]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check Quality Gate status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-coverage [project] [--max N] [--file]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View code coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-duplication [project] [--pr N] [--file]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View code duplication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-dependency-risks [project] [--pr N]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View dependency risks (requires Advanced Security)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Example: Scan a Single File
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/sonar-analyze ./src/main/java/com/example/UserService.java
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code calls the MCP Server, runs the analysis, and returns a structured list of bugs, vulnerabilities, and code smells with fix suggestions. You can immediately ask Claude to act on the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fix all CRITICAL issues found in the last scan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Authentication Fails
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Re-authenticate (overwrites the old token)&lt;/span&gt;
sonar auth login &lt;span class="nt"&gt;-s&lt;/span&gt; http://your-sonarqube-server:9000/

&lt;span class="c"&gt;# Check auth status&lt;/span&gt;
sonar auth status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this any time you deploy a new SonarQube Server or switch instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;sonar&lt;/code&gt; Command Not Found
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PATH="$HOME/.local/share/sonarqube-cli/bin:$PATH"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  MCP Server Fails to Start
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Verify the container runtime is working: &lt;code&gt;docker info&lt;/code&gt; or &lt;code&gt;podman info&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Double-check the image path in &lt;code&gt;~/.claude.json&lt;/code&gt; (especially for corporate mirror setups)&lt;/li&gt;
&lt;li&gt;On macOS, confirm Podman Machine is running: &lt;code&gt;podman machine start&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  404 / API Errors When Connecting to SonarQube (the most common failure)
&lt;/h3&gt;

&lt;p&gt;If you see 404s or API errors during auth or scanning, the server version is almost certainly the culprit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check the server version via API&lt;/span&gt;
curl http://your-sonarqube-server:9000/api/server/version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the response is &lt;code&gt;9.9.x&lt;/code&gt;, you need to upgrade to &lt;code&gt;10.x&lt;/code&gt;. There is no workaround — the new API endpoints simply don't exist on 9.x.&lt;/p&gt;
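
&lt;p&gt;You can fold that check into a small gate before any setup work; a sketch building on the curl call above (newer date-based versions such as 2025.x also postdate 10.0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ver=$(curl -s http://your-sonarqube-server:9000/api/server/version)
case "$ver" in
  10.*|2025.*) echo "SonarQube $ver: supported" ;;
  *)           echo "SonarQube $ver: not supported, deploy 10.x or later first" ;;
esac
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;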




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Here's what we covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt;: Three layers — agent-plugins, sonarqube-cli, and mcp-server — each with a distinct role&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The version trap&lt;/strong&gt;: SonarQube Server must be &lt;strong&gt;10.x or later&lt;/strong&gt;; 9.9 LTS is not supported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full installation&lt;/strong&gt;: Plugin marketplace → integration wizard → auth → MCP registration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Corporate network setup&lt;/strong&gt;: Internal mirror proxy for Docker images + Podman as a Docker Desktop replacement on macOS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily workflow&lt;/strong&gt;: File scanning, quality gates, coverage, and fix commands&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this setup in place, Claude Code isn't just helping you write code faster — it's also helping you write code that passes quality gates before it ever hits CI. That's what AI-assisted development should actually look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/SonarSource/sonarqube-agent-plugins" rel="noopener noreferrer"&gt;SonarSource/sonarqube-agent-plugins&lt;/a&gt; — Official Claude Code plugin repository&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/SonarSource/sonarqube-mcp-server" rel="noopener noreferrer"&gt;SonarQube MCP Server&lt;/a&gt; — MCP Server implementation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://podman.io/getting-started/installation" rel="noopener noreferrer"&gt;Podman Installation Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Questions or issues? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>sonar</category>
      <category>vibecoding</category>
      <category>codequality</category>
    </item>
  </channel>
</rss>
