DEV Community: Chen Zhang

I Got Tired of Pasting Context Into Agents, So I Built MFS

Chen Zhang — Fri, 26 Jun 2026 03:03:10 +0000

Most of my failed agent runs do not fail because the model cannot reason. They
fail because I pasted the wrong context, missed the important thread, or forgot
which repo/doc/ticket had the real answer.

That is the part that started to bother me. Modern agents are getting very good
at reasoning, coding, and using tools, but in my day-to-day work the annoying
part is often much more basic: the agent still cannot reach the context it
needs.

A bug report might start in Slack or Feishu. The decision might be buried in a
Jira ticket. The reproduction data might live in Postgres. The fix is somewhere
in a repo. A useful note might sit in an old memory file or a skill I tuned a
month ago.

The agent is right there, ready to help. But I still end up doing the awkward
middle work: open five systems, search manually, copy snippets, paste them into
the chat window, and hope I included the right pieces.

That is the problem I wanted MFS to attack.

MFS stands for Multi-source File-like Search. It is an open-source context
harness for agents: a single searchable and browsable workspace over code,
memory, skills, docs, chat, tickets, databases, object stores, and other working
context.

Project: github.com/zilliztech/mfs

Make Every Source Look Like a Tree

Agents already know how to use shell-like tools. They can list files, inspect
paths, grep for strings, and read specific ranges. So instead of inventing a new
interface for every source, MFS turns each source into a file-like tree under a
stable URI.

A Postgres database can be browsed like a tree. A table exposes rows that can be
searched and reopened. A Slack channel becomes a tree of channels and threads.
A GitHub repo, an issue tracker, a PDF folder, an S3 bucket, or a Notion space
all become objects under paths the agent can inspect.

The same small command set works everywhere:

mfs tree github://acme/backend -L 1
# src/
# tests/
# README.md

mfs ls postgres://prod/public
# tickets/
# users/

mfs cat jira://acme/PLAT/issues.jsonl --locator '{"id":"PLAT-491"}'
# reopen the exact issue record

The point is not to replace the shell. If I know the exact file and exact string,
rg, grep, and cat are still the fastest tools in the room.

MFS earns its keep when the hard part is finding: the wording in my head does
not match the wording in the source, the answer could be in several systems, and
I need the agent to locate evidence instead of asking me to paste it in.

Search Gives Candidates. Browse Gives Evidence.

The most important design choice in MFS is pairing two moves that often get
treated separately:

Search to quickly find candidates across a large indexed corpus.
Browse to reopen the exact source and verify the evidence.

That sounds simple, but it matters a lot for agent reliability.

A search result is a hint. It is not yet proof. A ranked snippet can point the
agent to the right file, row, thread, or document, but the agent should reopen
the source before it quotes, edits, summarizes, or makes a decision.

This is how humans search too. Google narrows the world down to candidates. Then
you click a result and read the actual page. A library catalog gets you to the
right shelf. Then you open the book and check the page.

MFS gives agents the same loop:

mfs search "where do we retry failed webhook deliveries?" ./repo --json
mfs cat file://local/repos/payments/webhooks/deliver.go --range 80:110

Search narrows the field. Browse verifies the evidence.

Two Skills: One for Ingest, One for Find

MFS ships with two agent-facing skills:

Skill	What it handles
`mfs-ingest`	Register a source, configure connectors, sync, index, and troubleshoot ingestion.
`mfs-find`	Search, grep, list, browse, and reopen exact evidence from sources already in MFS.

The split is deliberate. Ingest can change state and trigger indexing work.
Find is read-only retrieval and browsing. Agents behave better when those
boundaries are explicit.

Installation is one command:

npx skills add zilliztech/mfs --all -g

After that, I can ask the agent in plain language:

Use mfs-ingest to index this repo, then use mfs-find to find where webhook
retry logic is implemented.

The agent does not need me to remember every command. It follows the skill,
runs the right mfs commands, and reopens the evidence before answering.

What Can It Connect To?

The current connector set covers the sources that tend to make up real working
context:

Local files: code, Markdown, PDFs, docs, screenshots
Object stores: S3, R2, GCS, MinIO
Cloud drives and docs: Google Drive, Notion, web pages
Databases and warehouses: Postgres, MySQL, MongoDB, BigQuery, Snowflake
Code and work tracking: GitHub, Jira, Linear
Support and CRM: Zendesk, HubSpot
Chat and mail: Slack, Discord, Gmail, Feishu/Lark

These sources do not look alike at all. A PDF, a database row, a Slack thread,
and a GitHub file have different shapes, sync rules, credentials, and failure
modes. MFS hides as much of that repetitive work as possible behind one model:
objects, paths, metadata, chunks, search hits, and locators.

That means a search across all registered sources can return results in the same
format:

mfs search "what do we already have about hybrid retrieval?" --all

Example shape:

postgres://prod/public/engineering_tickets/rows.jsonl   score=0.88
  #482 hybrid retrieval flaky on long queries ...

notion://workspace/design/retrieval-rfc.md              score=0.85
  Hybrid search: combine dense + sparse ...

web://milvus-tutorials/hybrid-search                    score=0.81
  Hybrid search runs an ANN search and a BM25 search ...

file://local/repo/src/milvus.py                         score=0.76
  def hybrid_search(self, query: str, top_k: int = 10):

The agent can then reopen each hit through cat, line ranges, or structured
locators instead of relying on snippets.

A Small Evaluation Signal

I do not want to oversell the numbers. They are not a universal guarantee. They
come from specific agents, corpora, prompts, and timeouts.

But they are still useful evidence for the search-plus-browse loop.

In one code-search run over 2,000 Python files from CodeSearchNet, the agent had
to identify target source files from natural-language tasks. The shell-only
baseline found 22 of 24 targets, using 962 average tokens. With MFS search plus
browse, it found 23 of 24 and used 460 average tokens.

The hard subset was clearer. For paraphrased queries with weak literal anchors,
shell tools found 7 of 8 targets with 1,734 average tokens. MFS search plus
browse found 8 of 8 with 692 average tokens.

The interesting part is not just better retrieval. It is that the agent can
spend fewer tokens wandering around because search gives it a good candidate set
and browse lets it verify narrowly.

Local First, But Not Local Only

MFS is a client/server system.

For a local demo, it can run with Milvus Lite for vectors, SQLite for metadata,
local cache, and an ONNX embedding model. That path does not require an API key,
a GPU, or a cloud account after setup.

For production, the same architecture can move to a server deployment: Zilliz
Cloud or Milvus for vector search, Postgres for metadata, object storage for
artifacts, and Docker Compose or Kubernetes for deployment.

The important boundary is where state and credentials live. The CLI and skills
are clients. The server owns connector config, credentials, indexing jobs,
metadata, cache, and search backends. Connector TOML files reference secrets
like env:SLACK_BOT_TOKEN or file:/path/to/key; they do not store raw secret
values.

That matters if you want agents to use real company context without handing
every secret to the client process.

Building on Top of It

MFS can be used directly as a daily search tool, but the larger goal is to make
it a foundation for agent applications.

If you are building an agent, plugin, MCP server, or skill system, you should not
have to rebuild connector sync, document conversion, chunking, embeddings,
vector indexing, metadata bookkeeping, caching, and evidence reopening every
time.

MFS tries to make that lower layer reusable.

The upper layer can focus on the product experience: what the agent should do
with the context once it has found it.

Final Thought

The next jump in agents will not come only from bigger models or longer context
windows. It will also come from better contact with the working environment:
repos, docs, chat, tickets, databases, memory, skills, and all the messy history
that surrounds real work.

MFS is my attempt at that context layer: search broadly, browse precisely, and
let the agent reopen the original evidence before it acts.

Links:

Understanding Agent Memory: Four Layers from Chat History to Skills

Chen Zhang — Mon, 22 Jun 2026 11:14:43 +0000

Agent memory has been everywhere lately.

I have been working on this problem from the engineering side for the past few months, and the more I build, the more I feel the field is messy for one simple reason: people are using the same word, "memory", to mean very different things.

Sometimes memory means chat history. Sometimes it means a vector database. Sometimes it means a Markdown file. Sometimes it means a skill the agent can use later. And sometimes it means the whole context layer around the agent.

That sounds confusing, but I think the confusion mostly comes from one missing question:

What exactly counts as memory for an agent?

Once that question is clearer, the taxonomy, architecture choices, and trade-offs become much easier to reason about.

Here is the four-layer model I use.

Layer 1: The Early View of Memory

The earliest and simplest split was short-term memory vs. long-term memory.

Short-term memory is whatever still fits in the model context window, plus small local notes that capture temporary state or user preferences. Many coding assistants have some version of this: a Markdown file in the workspace, a project instruction file, or a lightweight memory page that gets loaded into context.

Long-term memory usually means storing a much larger history of conversations and retrieving relevant pieces on demand.

This view is easy to understand, but it is narrow. In this framing, memory is mostly "what you and the agent talked about before."

That was useful as a starting point, but it does not explain what modern agents are becoming.

Layer 2: The CoALA Taxonomy

In September 2023, the CoALA paper, Cognitive Architectures for Language Agents, proposed a much broader way to classify memory. LangChain later used the same framing in its Memory for Agents post.

CoALA splits long-term memory into three important types:

Memory type	What it means for agents	Common implementation
Procedural memory	What the agent knows how to do	Skills, workflows, tools, SOPs
Semantic memory	Facts and knowledge the agent can use	RAG, knowledge bases, documents
Episodic memory	What the agent has experienced	Conversation history, task history, event logs

There is also working memory, which maps roughly to the short-term context we already talked about.

At first, these terms can feel academic. But looking back from 2026, they map surprisingly well to how agents actually evolved.

Procedural memory is skills. It is not just "knowledge about a task"; it is the repeatable procedure for getting the task done.

Semantic memory is RAG. It is the agent's access to facts, documents, manuals, product knowledge, codebase knowledge, or any external source of truth.

Episodic memory is the original "memory" people talked about: past conversations, previous sessions, and what happened during earlier work.

This is the first shift I think matters: agent memory is not only chat history. It includes skills and knowledge too.

The human analogy helps. What is a person's memory?

It is partly what they have experienced. That is episodic memory.

It is partly what they have learned from books, documents, and the world. That is semantic memory.

And it is partly what they know how to do: cooking, debugging, negotiating, writing, shipping software. That is procedural memory.

Agents are starting to need the same three categories.

Layer 3: Should Memory Live in Markdown?

OpenClaw made one design choice that I find especially important: memory is stored in plain Markdown files. The Markdown file is the source of truth. Any index built on top of it, such as a vector index, is only a derived structure.

This has obvious benefits.

Markdown is easy to read. It is easy to edit. It can be versioned with Git. It is transparent to the user. If the user wants to delete or correct a memory, they can do it directly instead of digging through a database table or a hidden service.

That was also one of the core ideas behind MemSearch: extracted memories should be visible, editable, portable, and searchable.

But I think there is a deeper reason Markdown works well here:

The natural shape of memory is language.

Most real memories are not clean rows in a table. They are conditional, messy, exception-heavy, and context-sensitive.

For example:

The user prefers shipping on Friday afternoon, but if the release is large, delay it to Monday unless the PM is pushing.

Try putting that into a rigid schema. What is the release day? Friday? Monday? Where do the conditions go? What about the exception?

This is what most useful memory looks like. It carries assumptions, edge cases, tone, and context. If you force it into a pre-designed structure too early, you often lose the part that made it useful.

Natural language is powerful precisely because it can hold these irregular shapes.

I like Geoffrey Hinton's analogy that language is like Lego. Tokens are the small blocks. With the same blocks, you can build a bridge, a spiral, a tower, or something strange that only makes sense in one specific situation.

That is why Markdown as a memory substrate feels so natural. It gives language a simple, inspectable container.

This does not mean structured memory is useless. I see at least three cases where structure helps:

Domain-specific workflows. Sales records, support tickets, medical events, and legal facts often benefit from explicit fields and labels.
Human-facing presentation. Tables, timelines, and dashboards can make memory easier to inspect.
General extraction layers. Triples, graphs, and entity relations can be useful derived views, especially in domains with complex dependencies.

So I do not think the answer is "Markdown only" or "database only." The better framing is:

Markdown is a good source of truth for memory content. Structured systems are often useful indexes, projections, or views.

Layer 4: Memory and Skills

The next layer is where things get more interesting.

If skills are procedural memory, then the agent memory problem is not only about recalling facts or conversations. It is also about how an agent learns reusable ways of doing work.

Current agent systems have mostly solved skill discovery and skill use.

A skill is usually described by a small metadata block: name, description, trigger conditions, and instructions. The agent harness can use progressive disclosure: first read the metadata, then load the full skill only when it seems relevant.

That means the lookup path is mostly there.

The missing part is skill writing and skill evolution.

Creating a good skill by hand is still clunky. You have to explain a lot of background, describe when it should trigger, list files to inspect, define verification steps, and then test it repeatedly.

Skill evolution is also hard. Some approaches try to create evaluation datasets, run skill variants, score them, and select the better version. That can work, but it is heavy. You need datasets, environments, tasks, and evaluation signals.

There is another path:

Use the agent's existing memory to create and improve skills.

If an agent has access to your task history, it can look for repeated patterns. Maybe you keep releasing the same package. Maybe you keep debugging the same service. Maybe you keep asking for the same publishing workflow.

Those repeated operations are not just history. They are raw material for procedural memory.

This is the idea behind MemSearch's memory-to-skill workflow.

How MemSearch Turns Memory into Skills

The design has two steps.

First, the system distills candidate skills from memory. It scans memory journals, finds repeated reusable workflows, and writes candidate skills into a separate candidate directory.

The candidate can keep evolving as more related memories arrive. If your release process changes, the candidate skill can be revised before it is ever installed.

Second, a human reviews and installs the skill.

This boundary matters. A bad note is usually just noise. A bad installed skill is worse: it can actively trigger at the wrong time and do the wrong thing.

So the system proposes, but the human approves.

Here is a real example from my own workflow. While developing MemSearch, I repeatedly went through the same release process. Eventually, the system recognized that this was reusable and distilled a release skill candidate.

Once installed, I no longer needed to explain the release process from scratch. I could invoke the skill directly.

That is the moment where memory and skills form a loop:

Daily agent use creates memory. Memory distills into skills. Skills improve future agent use. Future use produces better memory.

This is a small form of bootstrapping. The agent starts turning what happened before into reusable capability.

Layer 4.5: Memory Needs a Context Harness

After thinking through these four layers, I started to feel that memory itself is only one part of a larger context problem.

Conversation history is memory. Skills are memory. Knowledge bases are memory.

But what about repositories, design docs, Slack messages, issue trackers, tickets, meeting notes, and project specs?

They are also part of the working memory of a team. They are just scattered across many systems, formats, and access patterns.

For any real task, the hard part is rarely "does the information exist?" The hard part is:

Which tiny slice of all available context matters right now?

That is why I think agent systems need a context harness above memory: a unified layer that connects scattered sources and lets the agent search and browse them through one interface.

This is the problem we are exploring with MFS.

MFS represents each source as a file-like tree mounted at a stable address. Then agents can use skills to connect new sources, search across them, and inspect the relevant pieces.

For example, suppose you want to understand:

What happens when our webhook retry fails?

The answer may be split across three places:

the code repository, where the retry logic lives
an old Slack thread, where someone explained the design trade-off
your personal memory, where you recorded a previous incident

Without a context harness, you search these systems one by one. With one, the agent can search across code, messages, and memory in the same task flow.

At that point, memory stops being a standalone feature. It becomes one layer inside a broader context architecture.

Final Thoughts

Agent memory is still evolving. There may be a fifth layer, a sixth layer, and a better vocabulary than the one I am using here.

But the direction feels clear to me.

Memory is not just a chat log. It is a thread that connects conversations, knowledge, skills, context, and the agent's own ability to improve over time.

And the foundation is surprisingly simple:

Write down what happened. Write down what is true. Write down how to do the thing next time.

The rest is indexing, retrieval, review, and iteration.

References

CoALA paper, Cognitive Architectures for Language Agents
LangChain, Memory for Agents
MemSearch
MFS

Vector Graph RAG: Multi-Hop RAG Without a Graph Database

Chen Zhang — Fri, 03 Apr 2026 07:54:52 +0000

Standard RAG falls apart when the answer isn't in one chunk. Ask "What side effects should I watch for with the first-line diabetes medication?" and the system needs to first figure out that metformin is the first-line drug, then look up metformin's side effects. The query never mentions "metformin" — it's a bridge entity the system has to discover on its own. Naive vector search can't do this.

The industry answer has been knowledge graphs plus graph databases. That works, but it means deploying Neo4j or similar, learning a graph query language, and operating two separate storage systems. The complexity doubles for what's essentially one feature: following entity chains across passages.

I built Vector Graph RAG to get multi-hop reasoning without any of that overhead. The entire graph structure lives inside Milvus — entities, relations, and passages stored as three collections with ID cross-references. No graph database, no Cypher queries, just vector search and metadata lookups.

Building a Logical Graph in Milvus

The key insight is simple: a knowledge graph relation like (metformin, is_first_line_drug_for, type_2_diabetes) is just text. Text can be embedded into vectors. So why not store the entire graph structure in a vector database?

Vector Graph RAG uses three Milvus collections with ID cross-references:

Entities: Deduplicated entity names, embedded for semantic search. Each entity record stores the IDs of relations it participates in.
Relations: Triple-based relations (subject, predicate, object). Each record stores the subject and object entity IDs, plus the IDs of source passages. The relation text is embedded for vector search.
Passages: Original document chunks. Each record stores the IDs of entities and relations extracted from it.

These three collections form a logical graph through ID references. "Graph traversal" becomes a series of ID-based metadata queries in Milvus — no graph query language needed.

The extra ID lookups add maybe 2-3 primary key queries per hop. Each takes under 10ms. The real bottleneck in any RAG pipeline is the LLM call (1-3 seconds), so a few extra milliseconds of metadata lookup is invisible.

The Four-Step Retrieval Pipeline

Step 1: Seed Retrieval

An LLM extracts key entities from the user query. These entities are embedded and used to search the Entities and Relations collections. The results are the "seeds" — entry points into the logical graph.

Step 2: Subgraph Expansion

This is where multi-hop happens. From each seed entity, the system follows ID references one hop outward: find the entity's relation IDs, fetch those relations, then fetch the entities on the other end of those relations.

In the diabetes example, expanding from "type 2 diabetes" discovers the relation (metformin, is_first_line_drug_for, type_2_diabetes), which surfaces "metformin" — the bridge entity the original query never mentioned. From "metformin," another expansion finds relations about renal function monitoring and side effects.

Step 3: LLM Reranking

After expansion, we have a pool of candidate relations and passages. A single LLM call scores and filters them for relevance to the original query. This replaces what iterative approaches do with multiple rounds of LLM-guided search.

Step 4: Answer Generation

The top-ranked relations and their associated passages go to the LLM for final answer generation.

Two LLM Calls, Not Ten

Most multi-hop RAG approaches are iterative. IRCoT calls the LLM 3-5 times per query. Agentic RAG systems can make 10+ LLM calls.

Vector Graph RAG front-loads the discovery work into vector search and subgraph expansion. The LLM only gets called twice: once for reranking, once for generation. This cuts API costs by roughly 60% and makes the system 2-3x faster compared to iterative approaches.

Benchmark Results

Evaluated on three standard multi-hop QA benchmarks using Recall@5:

Dataset	Naive RAG	Vector Graph RAG
MuSiQue (2-4 hop)	65.2%	82.4%
HotpotQA (2 hop)	78.6%	91.2%
2WikiMultiHopQA (2 hop)	76.4%	89.8%
Average	73.4%	87.8%

Against SOTA methods, Vector Graph RAG achieves the highest average Recall@5 at 87.8%, beating HippoRAG 2 on average — while using only 2 LLM calls per query and requiring no graph database.

Getting Started

pip install vector-graph-rag

from vector_graph_rag import VectorGraphRAG

# Initialize - uses Milvus Lite (local .db file) by default
rag = VectorGraphRAG()

# Index your documents
rag.add_texts([
    "Metformin is the first-line medication for type 2 diabetes.",
    "Metformin requires regular monitoring of renal function.",
    "Type 2 diabetes affects insulin sensitivity in the body.",
])

# Query with multi-hop reasoning
result = rag.query(
    "What monitoring is needed for the first-line type 2 diabetes drug?"
)
print(result)

By default, it uses Milvus Lite with a local .db file — no server needed. For production, switch to Milvus standalone/cluster or Zilliz Cloud.

Wrapping Up

Vector Graph RAG shows that the "graph" in Graph RAG doesn't have to mean a graph database. Store the graph structure as cross-referenced collections in a vector database and you get the same reasoning power with half the infrastructure.

If your RAG system struggles with multi-hop questions, give Vector Graph RAG a try. It's open source, installs in one command, and runs locally out of the box.

zilliztech / vector-graph-rag

Graph RAG with pure vector search, achieving SOTA performance in multi-hop reasoning scenarios.

Vector Graph RAG

Graph RAG with pure vector search — no graph database needed.

💡 Encode entities and relations as vectors in Milvus, replace iterative LLM agents with a single reranking pass — achieve state-of-the-art multi-hop retrieval at a fraction of the operational and computational cost.

✨ Features

No Graph Database Required — Pure vector search with Milvus, no Neo4j or other graph databases needed
Single-Pass LLM Reranking — One LLM call to rerank, no iterative agent loops (unlike IRCoT or multi-step reflection)
Knowledge-Intensive Friendly — Optimized for domains with dense factual content: legal, finance, medical, literature, etc.
Zero Configuration — Uses Milvus Lite by default, works out of the box with a single file
Multi-hop Reasoning — Subgraph expansion enables complex multi-hop question answering
State-of-the-Art Performance — 87.8% avg Recall@5 on multi-hop QA benchmarks, outperforming HippoRAG

📦 Installation

pip install vector-graph-rag
# or
uv add vector-graph-rag

With document…

View on GitHub

I Built an AI Agent That Watches the Market While I Sleep

Chen Zhang — Fri, 03 Apr 2026 06:18:24 +0000

I have a full-time job and no time to watch the stock market all day. But I still trade — mostly US tech stocks. Last year I made at least three bad decisions because I was too tired or too rushed to think clearly. So I built an AI agent to do the watching for me.

The stack: OpenClaw as the agent framework, Exa for information gathering, and Milvus as a personal memory store. Total cost: about $20/month.

The NVIDIA Moment

On February 26th, NVIDIA reported Q4 earnings — revenue up 65% year-over-year. The stock dropped 5.5%. I didn't find out until the next morning.

But when I checked my phone, there was already a message from my agent, sent the previous evening:

NVDA earnings analysis: Revenue beat expectations, but the market is skeptical about AI capex sustainability. In similar past situations, the stock tended to drop short-term. You had a similar experience in September 2024 — you panic-sold and the stock recovered within three weeks. Recommendation: hold, don't panic-sell.

It didn't just analyze the earnings report. It pulled up my own past trading notes and reminded me not to repeat the same mistake.

Information Gathering with Exa

The first problem: where does the information come from?

Exa is a search API designed for AI agents. It does semantic search — describe what you're looking for in plain language, and it understands what you mean. The index refreshes every minute and filters out SEO spam.

from exa_py import Exa

exa = Exa(api_key="your-api-key")

result = exa.search(
    "Why did NVIDIA stock drop despite strong Q4 2026 earnings",
    type="neural",
    num_results=10,
    start_published_date="2026-02-25",
    contents={
        "text": {"max_characters": 3000},
        "highlights": {"num_sentences": 3},
        "summary": {"query": "What caused the stock drop?"}
    }
)

The contents parameter is the killer feature — it extracts full text, highlights key sentences, and generates a summary, all in one request. No need to click through links one by one.

Personal Memory with Milvus

Information gathering solves "what's happening out there." But making good decisions also requires knowing yourself — what you got right, what you got wrong, what your blind spots are.

Milvus is a vector database. You convert text into vectors and store them. When you search later, it finds results by meaning, not keywords. So "Middle East conflict tanks tech stocks" and "geopolitical tensions trigger semiconductor selloff" match each other.

I set up three collections: past decisions/lessons, personal preferences/biases, and observed market patterns.

from pymilvus import MilvusClient
from openai import OpenAI

milvus = MilvusClient("./my_investment_brain.db")
llm = OpenAI()

def embed(text: str) -> list[float]:
    return llm.embeddings.create(
        input=text, model="text-embedding-3-small"
    ).data[0].embedding

milvus.create_collection("decisions", dimension=1536, auto_id=True)
milvus.create_collection("preferences", dimension=1536, auto_id=True)
milvus.create_collection("patterns", dimension=1536, auto_id=True)

A memory extractor runs after every conversation — it pulls out decisions, preferences, patterns, and lessons, then stores them automatically with deduplication (similarity > 0.92 = skip).

When the agent analyzes a current situation, it searches all three collections for relevant past experience:

def recall_my_experience(situation: str) -> dict:
    query_vec = embed(situation)
    past = milvus.search("decisions", data=[query_vec], limit=3,
                         output_fields=["text", "date", "tag"])
    prefs = milvus.search("preferences", data=[query_vec], limit=2,
                          output_fields=["text", "type"])
    patterns = milvus.search("patterns", data=[query_vec], limit=2,
                             output_fields=["text"])
    return {
        "past_decisions": [h["entity"] for h in past[0]],
        "preferences": [h["entity"] for h in prefs[0]],
        "patterns": [h["entity"] for h in patterns[0]]
    }

That's why the NVIDIA alert referenced my trading history from a year ago — the agent found the lesson in my own notes.

Analysis Framework: Writing My Logic as a Skill

OpenClaw's Skills system lets you define situation-specific analysis rules in markdown. I wrote a post-earnings evaluation skill with my personal criteria:

---
name: post-earnings-eval
description: ">"
  Evaluate whether to buy, hold, or sell after an earnings report.
---

The most important line in the skill: "I have a tendency to let fear override data. If my Milvus history shows I regretted selling after a dip, say so explicitly." A personal bias-correction mechanism baked into the agent.

Making It Run Automatically

OpenClaw's Heartbeat mechanism handles scheduling. The Gateway sends a pulse every 30 minutes, and the agent acts based on a HEARTBEAT.md file:

Morning brief (6:30-7:30 AM): Search overnight news via Exa, query Milvus for positions and past experience, generate a personalized summary, push to phone.
Price alerts (market hours): Monitor watchlist stocks, alert on >3% moves with context from past decisions.
End of day summary: Recap the day, compare with morning expectations.

No cron jobs, no server. Just a markdown file.

Results

My weekly market-tracking time went from ~15 hours to ~2 hours. The agent runs 24/7, so nothing slips through. And because it uses my own past experiences in every analysis, the recommendations are personalized, not generic.

The NVIDIA situation was the proof point. Without the agent, I probably would have panic-sold again. Instead, I had complete information — including a reminder of my own past mistake — and made the right call.

Total monthly cost: ~$10 Exa API + ~$10 LLM calls. OpenClaw and Milvus run locally for free.

If you want to try it, start with the smallest goal: "get a market summary on my phone every morning." One weekend to set up, then iterate from there.

Disclaimer: This post shares a personal technical project. All market analysis examples are illustrative and do not constitute investment advice.

Claude Code's Memory: 4 Layers of Complexity, Still Just Grep and a 200-Line Cap

Chen Zhang — Thu, 02 Apr 2026 04:25:07 +0000

Claude Code's memory system? Grep search, a 200-line index cap, and zero cross-agent sharing. After studying the leaked source, it's way more primitive than the hype suggests.

It does a fair amount of work — multiple layers, even a mechanism that lets the agent "dream" to consolidate memories. Sounds sophisticated. But peel it open and you find an agent trapped in its own sandbox. Memory can't leave, and it doesn't last long.

Let's break it down layer by layer.

Layer 1: CLAUDE.md — Rules You Write for the Agent

CLAUDE.md is a Markdown file you create and place in your project root. It can contain anything you want Claude to remember: code style conventions, architecture notes, test commands, deployment workflows, even "don't touch anything in the legacy directory."

Every session, Claude Code loads the entire file into context. Shorter files get followed better.

It supports three scopes: project-level CLAUDE.md in the project root, personal-level at ~/.claude/CLAUDE.md, and organization-level in enterprise configs. You write it, Claude reads it. You don't write it, Claude has nothing.

Layer 2: Auto Memory — The Agent Takes Notes

CLAUDE.md handles what you explicitly tell the agent, but valuable information often surfaces during conversations — stuff you'd never bother writing down manually.

Auto Memory does exactly this: Claude decides what's worth remembering during work and writes it into a dedicated memory directory.

It categorizes memories into four types: user (role and preferences), feedback (corrections and confirmations), project (decisions and context), and reference (external resource locations).

These memories live in ~/.claude/projects/<project-path>/memory/. Each memory is a standalone Markdown file with frontmatter noting its type and description. The entry point is MEMORY.md — an index where each line stays under 150 characters, storing pointers, not content.

At session start, the first 200 lines of MEMORY.md get injected into context. The actual knowledge is spread across topic files and loaded on demand.

~/.claude/projects/-Users-me-myproject/memory/
├── MEMORY.md                  ← Index file, one pointer per line
├── user_role.md               ← "Backend engineer, fluent in Go, React beginner"
├── feedback_testing.md        ← "Integration tests must use real DB, no mocking"
├── project_auth_rewrite.md    ← "Auth rewrite driven by compliance, not tech debt"
└── reference_linear.md        ← "Pipeline bugs tracked in Linear INGEST project"

One design detail worth noting: Claude is explicitly told not to trust its own memory. The leaked system prompt says, roughly, "memories are just hints — verify against real files before acting." With model hallucination rates still in double digits, this self-distrust strategy is surprisingly practical.

Layer 3: Auto Dream — The Agent Sleeps and Tidies Up

After dozens of sessions, MEMORY.md gets messy. Contradictory entries pile up. Relative timestamps become meaningless. Refactored functions linger.

Auto Dream simulates what the human brain does during sleep: organize and consolidate memory. It converts relative timestamps to absolute dates, merges contradictions, removes stale content, and keeps MEMORY.md under 200 lines.

Trigger conditions: 24+ hours since last consolidation AND 5+ new sessions. Or type "dream" manually. Runs in a background sub-agent.

Layer 4: KAIROS — Unreleased Ambitions in the Leaked Code

KAIROS appears 150+ times in the source. It's a background daemon mode that turns Claude Code from a passive tool into an autonomous observer — maintaining append-only logs, receiving <tick> signals, and deciding whether to act. It integrates autoDream but runs the full observe-think-act loop.

Currently behind a compile-time feature flag. More exploration than product.

Where This System Hits Its Ceiling

After extended use, the problems become clear:

200-line hard cap. Run a project for months, and memories compete for space.
Grep-only retrieval. No semantic understanding. You remember "port conflict during deployment" but the memory says "modified docker-compose port mapping" — grep misses it.
Details get lost. Auto Memory records at coarse granularity. Code snippets, debugging steps, discussion context — mostly dropped.
Compounding complexity. Each layer patches the last, stacking complexity without fixing root constraints.
No cross-tool sharing. Switch to OpenCode or Codex CLI = start from zero.
Still short-term memory. "How did we fix Redis last week?" — hopeless for long-span recall.

This isn't a design failure. Single agent, session granularity, local storage — the ceiling is architectural.

memsearch: Memory Should Outlive Any Single Agent

This is the core idea behind memsearch. Agents change, tools change, but project knowledge shouldn't disappear with them. memsearch pulls memory out of the agent into an independent persistence layer.

┌─────────────────────────────────────────┐
│         Agent Plugins (User Layer)       │
│  Claude Code · OpenClaw · OpenCode · Codex│
└──────────────────┬──────────────────────┘
                   ↓
┌──────────────────┴──────────────────────┐
│       memsearch CLI / Python API         │
│            (Developer Layer)             │
└──────────────────┬──────────────────────┘
                   ↓
┌──────────────────┴──────────────────────┐
│    Core: Chunker → Embedder → Milvus     │
│            (Engine Layer)                │
└──────────────────┬──────────────────────┘
                   ↓
┌──────────────────┴──────────────────────┐
│     Markdown Files (Source of Truth)      │
│          (Persistent Storage)            │
└─────────────────────────────────────────┘

How It Works

Memory writing is automatic. Each conversation gets summarized by Haiku and appended to that day's Markdown file, then asynchronously vectorized. Background, invisible to users.

Recall triggers through skills. A memory-recall skill runs in a forked sub-agent (context: fork) — zero token overhead in the main session.

Hybrid retrieval. Semantic vectors + BM25 keyword matching + RRF fusion. Ask "how did we fix that Redis timeout" — semantic search gets it. Say "search handleTimeout" — BM25 nails it.

The sub-agent drills from L1 (semantic search, truncated previews) through L2 (full context expansion) to L3 (raw conversation transcripts) as needed.

L1: Semantic search → top results with scores and previews
L2: memsearch expand <hash> → full paragraph, all details
L3: memsearch transcript → raw user/agent conversation + tool calls

Cross-Agent Sharing

This is the fundamental difference. memsearch supports Claude Code, OpenClaw, OpenCode, and Codex CLI — all sharing the same Markdown memory format with collection names computed from project paths.

Memory written from any agent is searchable by all others.

Spend an afternoon in Claude Code debugging deployment, switch to OpenCode next day — it finds yesterday's memories and gives the right answer. Point Milvus at Zilliz Cloud for team collaboration, and new members get instant project context without digging through Slack.

Quick Start

# Claude Code
/plugin marketplace add zilliztech/memsearch
/plugin install memsearch

# Python
uv tool install "memsearch[onnx]"

CLI and Python API for custom integrations:

from memsearch import MemSearch

mem = MemSearch(paths=["./memory"])
await mem.index()
results = await mem.search("Redis config")

Milvus Lite for local (zero config), Zilliz Cloud for teams (free tier), or self-hosted Docker. Default ONNX embeddings run on CPU — no GPU, no API calls needed.

Claude Code's memory design has real value, and KAIROS is worth watching. But single-agent memory optimization can only go so far. Memory should outlive any single agent.

Full analysis with architecture diagrams: https://zc277584121.github.io/ai-coding/2026/04/01/claude-code-memory-vs-memsearch.html

zilliztech / memsearch

A Markdown-first memory system, a standalone library for any AI agent. Inspired by OpenClaw.

memsearch

Cross-platform semantic memory for AI coding agents.

Why memsearch?

🌐 All Platforms, One Memory — memories flow across Claude Code, OpenClaw, OpenCode, and Codex CLI. A conversation in one agent becomes searchable context in all others — no extra setup
👥 For Agent Users, install a plugin and get persistent memory with zero effort; for Agent Developers, use the full CLI and Python API to build memory and harness engineering into your own agents
📄 Markdown is the source of truth — inspired by OpenClaw. Your memories are just .md files — human-readable, editable, version-controllable. Milvus is a "shadow index": a derived, rebuildable cache
🔍 Progressive retrieval, hybrid search, smart dedup, live sync — 3-layer recall (search → expand → transcript); dense vector + BM25 sparse + RRF reranking; SHA-256 content hashing skips unchanged content; file watcher auto-indexes in real time

🧑‍💻

…

View on GitHub

Claude Code's Leaked Source: A Real-World Masterclass in Harness Engineering

Chen Zhang — Wed, 01 Apr 2026 08:42:08 +0000

Earlier this year, Mitchell Hashimoto coined the term "harness engineering" â€” the discipline of building everything around the model that makes an AI agent actually work in production. OpenAI wrote about it. Anthropic published guides. Martin Fowler analyzed it.

After studying Claude Code's leaked source â€” particularly its memory system, caching architecture, and security layers â€” the harness turns out to be far more interesting than the LLM calls themselves.

The Evolution: From Prompt to Harness

The AI engineering discipline has shifted rapidly:

2023-2024: Prompt Engineering    â†’ "How to ask the model"
2025:      Context Engineering   â†’ "What information to feed the model"
2026:      Harness Engineering   â†’ "How the entire system runs around the model"

Prompt engineering is the question. Context engineering is the blueprint. Harness engineering is the construction site â€” tools, permissions, safety checks, cost controls, feedback loops, and state management that let the agent operate reliably.

The leaked Claude Code source is a concrete case study for each of these harness layers.

Prompt Cache Economics: A Cost Center, Not an Optimization

One of the most revealing modules is promptCacheBreakDetection.ts. It tracks 14 distinct cache invalidation vectors and uses "sticky latches" â€” mechanisms that prevent mode switches from breaking cached prompt prefixes.

â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”
â”‚              Prompt Cache Layer                  â”‚
â”‚                                                  â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”         â”‚
â”‚  â”‚ Vector 1â”‚  â”‚ Vector 2â”‚  â”‚Vector 14â”‚  ...     â”‚
â”‚  â”‚ Mode    â”‚  â”‚ Tool    â”‚  â”‚ Context â”‚         â”‚
â”‚  â”‚ Switch  â”‚  â”‚ Change  â”‚  â”‚ Rotate  â”‚         â”‚
â”‚  â””â”€â”€â”€â”€â”¬â”€â”€â”€â”€â”˜  â””â”€â”€â”€â”€â”¬â”€â”€â”€â”€â”˜  â””â”€â”€â”€â”€â”¬â”€â”€â”€â”€â”˜         â”‚
â”‚       â”‚            â”‚            â”‚                â”‚
â”‚       â–¼            â–¼            â–¼                â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”       â”‚
â”‚  â”‚         Sticky Latch Layer           â”‚       â”‚
â”‚  â”‚  "Hold current prefix until forced"  â”‚       â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”¬â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜       â”‚
â”‚                         â”‚                        â”‚
â”‚                         â–¼                        â”‚
â”‚              â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”                 â”‚
â”‚              â”‚  Cache Decision  â”‚                â”‚
â”‚              â”‚  KEEP / BREAK   â”‚                 â”‚
â”‚              â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜                 â”‚
â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜

This reframes prompt caching from a performance trick into a billing optimization problem. At scale, each cache miss is real money. The code treats cache management with the same rigor as database query planning â€” monitoring invalidation patterns, measuring hit rates, and making explicit keep-or-break decisions.

The takeaway for agent builders: if your agent makes repeated API calls (and it does), prompt caching is not optional â€” it's a cost center that needs active management.

Multi-Agent Coordination: The Prompt IS the Harness

Claude Code's sub-agent system is internally called "swarms." The surprising part: coordination between agents is not handled by a state machine, a DAG executor, or an orchestration framework. It's done through natural language prompts.

â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”
â”‚            Main Agent (Orchestrator)       â”‚
â”‚                                           â”‚
â”‚  System prompt includes:                  â”‚
â”‚  - "Do not rubber-stamp weak work"        â”‚
â”‚  - Tool permission boundaries             â”‚
â”‚  - Task decomposition strategy            â”‚
â”‚                                           â”‚
â”‚         â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”¼â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”           â”‚
â”‚         â–¼          â–¼          â–¼           â”‚
â”‚    â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”   â”‚
â”‚    â”‚ Sub-    â”‚ â”‚ Sub-    â”‚ â”‚ Sub-    â”‚   â”‚
â”‚    â”‚ Agent A â”‚ â”‚ Agent B â”‚ â”‚ Agent C â”‚   â”‚
â”‚    â”‚         â”‚ â”‚         â”‚ â”‚         â”‚   â”‚
â”‚    â”‚ Isolatedâ”‚ â”‚ Isolatedâ”‚ â”‚ Isolatedâ”‚   â”‚
â”‚    â”‚ Context â”‚ â”‚ Context â”‚ â”‚ Context â”‚   â”‚
â”‚    â”‚ Scoped  â”‚ â”‚ Scoped  â”‚ â”‚ Scoped  â”‚   â”‚
â”‚    â”‚ Tools   â”‚ â”‚ Tools   â”‚ â”‚ Tools   â”‚   â”‚
â”‚    â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜   â”‚
â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜

Each sub-agent runs in an isolated context with specific tool permissions. The orchestrator coordinates them through instructions embedded in prompts â€” quality standards, scope boundaries, conflict resolution rules. All in natural language.

This is a strong signal: for LLM-based multi-agent systems, traditional orchestration frameworks may add unnecessary complexity. The model already understands natural language instructions. Why build a state machine when a well-written prompt can express the same coordination logic?

Memory and State: Progressive Disclosure at Every Layer

One of the more practical patterns in the codebase is the file-based memory system with progressive disclosure.

The design uses a two-tier structure:

â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”
â”‚             Memory Architecture               â”‚
â”‚                                               â”‚
â”‚  Tier 1: Index (always loaded)                â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”    â”‚
â”‚  â”‚ MEMORY.md                            â”‚    â”‚
â”‚  â”‚                                      â”‚    â”‚
â”‚  â”‚ - [User role](user_role.md) â€” ...    â”‚    â”‚
â”‚  â”‚ - [Testing](feedback_test.md) â€” ...  â”‚    â”‚
â”‚  â”‚ - [Auth rewrite](project_auth.md)    â”‚    â”‚
â”‚  â”‚                                      â”‚    â”‚
â”‚  â”‚ Cost: ~200 tokens (one-line hooks)   â”‚    â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜    â”‚
â”‚                                               â”‚
â”‚  Tier 2: Full content (loaded on demand)      â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”‚
â”‚  â”‚user_role.mdâ”‚ â”‚feedback_   â”‚ â”‚project_  â”‚ â”‚
â”‚  â”‚            â”‚ â”‚test.md     â”‚ â”‚auth.md   â”‚ â”‚
â”‚  â”‚ Full user  â”‚ â”‚ Full test  â”‚ â”‚ Full     â”‚ â”‚
â”‚  â”‚ context    â”‚ â”‚ guidelines â”‚ â”‚ context  â”‚ â”‚
â”‚  â”‚ (~500 tok) â”‚ â”‚ (~300 tok) â”‚ â”‚(~400 tok)â”‚ â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â”‚
â”‚                                               â”‚
â”‚  Only fetched when the index line matches     â”‚
â”‚  the current task context                     â”‚
â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜

The index (MEMORY.md) is always loaded into the context window â€” cheap, at ~200 tokens. Full memory files are only fetched when a one-line hook in the index matches the current task. This keeps the context window lean while still giving the agent access to rich historical context when needed.

This is essentially the same pattern as database indexing: maintain a small, fast lookup structure that points to larger data. Applied to LLM context windows, it's a practical solution to the "agents forget everything between sessions" problem without paying the token cost of loading everything upfront.

The memory system also categorizes memories by type â€” user preferences, feedback corrections, project context, external references â€” each with different update and retrieval patterns. This is more sophisticated than a flat memory store and mirrors how humans organize knowledge.

Security: 23 Checks and Adversarial Hardening

Every bash command execution passes through 23 security checks. This is not a theoretical threat model â€” it's the result of real-world adversarial usage.

The defenses include:

Zero-width character injection â€” invisible Unicode characters that can alter command semantics
Zsh expansion tricks â€” shell-specific syntax that can escape sandboxes
Native client authentication â€” a DRM-style mechanism where the Zig HTTP layer computes a hash (cch=56670 placeholder replaced at transport time) to verify client legitimacy

User Input
    â”‚
    â–¼
â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”
â”‚  23 Security Checks  â”‚
â”‚                      â”‚
â”‚  â€¢ Zero-width chars  â”‚
â”‚  â€¢ Zsh expansion     â”‚
â”‚  â€¢ Path traversal    â”‚
â”‚  â€¢ Injection patternsâ”‚
â”‚  â€¢ ...19 more        â”‚
â”‚                      â”‚
â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”¬â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜
           â”‚
     PASS? â”‚
     â”Œâ”€â”€â”€â”€â”€â”´â”€â”€â”€â”€â”€â”
     â”‚           â”‚
    YES         NO
     â”‚           â”‚
     â–¼           â–¼
  Execute     Block +
  Command     Log Event

The lesson: agent security is not just sandboxing. It's adversarial input hardening at every boundary. If an agent can execute shell commands, assume someone will try to make it execute the wrong ones â€” intentionally or not.

When NOT to Call the Model

Claude Code detects user frustration using regex, not LLM inference.

Patterns like "wtf", "so frustrating", "this is broken" are matched via simple pattern rules and trigger tone adjustments in subsequent responses. No API call needed.

This sounds almost trivially simple. But it embodies a core harness engineering principle: use the cheapest, fastest tool that solves the problem.

The codebase applies this principle consistently:

Task	Solution	Why not LLM?
Frustration detection	Regex	Fast, free, reliable enough
Terminal rendering	React + Ink with Int32Array buffers	Rendering is a solved problem
Cache invalidation tracking	Dedicated TS module	Deterministic logic, no ambiguity
Client auth	Zig HTTP layer hash	Security must be deterministic

LLM calls are reserved for tasks that genuinely require language understanding. Everything else uses conventional engineering. A $0 regex beats a $0.01 model call when accuracy is comparable.

The Rendering Layer: Game Engine Meets Terminal

An unexpected finding: the CLI terminal interface is built with React + Ink and uses game-engine-style rendering optimizations.

The implementation uses Int32Array buffers and patch-based updates â€” similar to how game engines minimize draw calls by only updating changed pixels. The team claims this achieves ~50x fewer stringWidth calls during token streaming.

This makes sense when you think about it. A terminal UI streaming LLM output has similar challenges to a game render loop: frequent partial updates, variable-length content, frame-rate sensitivity. The harness applies domain-appropriate engineering rather than treating the terminal as an afterthought.

The Big Picture: Model as Commodity, Harness as Moat

The leaked source paints a clear picture of where the real engineering effort lives in a production AI agent:

â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”
â”‚                 Agent Harness                    â”‚
â”‚                                                  â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”‚
â”‚  â”‚ Cache    â”‚ â”‚ Security â”‚ â”‚ Tool Orchestrationâ”‚ â”‚
â”‚  â”‚ Economicsâ”‚ â”‚ Hardeningâ”‚ â”‚ & Permissions     â”‚ â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”‚
â”‚  â”‚ Memory & â”‚ â”‚ State    â”‚ â”‚ Multi-Agent      â”‚ â”‚
â”‚  â”‚ Retrievalâ”‚ â”‚ Persist  â”‚ â”‚ Coordination     â”‚ â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”‚
â”‚  â”‚ Cost     â”‚ â”‚ UI/UX    â”‚ â”‚ Observability    â”‚ â”‚
â”‚  â”‚ Control  â”‚ â”‚ Renderingâ”‚ â”‚ & Logging        â”‚ â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â”‚
â”‚                                                  â”‚
â”‚              â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”                    â”‚
â”‚              â”‚   LLM API    â”‚                    â”‚
â”‚              â”‚  (the easy   â”‚                    â”‚
â”‚              â”‚    part)     â”‚                    â”‚
â”‚              â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜                    â”‚
â”‚                                                  â”‚
â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜

The LLM API call is the smallest box. Everything around it â€” caching, memory, security, cost control, rendering, coordination â€” is the actual product.

For anyone building AI agents: the model selection matters less than you think. The harness is where the engineering lives â€” and where the differentiation happens.

MCP Is Being Abandoned: How Fast Can a 'Standard' Die?

Chen Zhang — Thu, 26 Mar 2026 14:52:34 +0000

In mid-March, Perplexity's CTO Denis Yarats casually dropped a bombshell at the Ask 2026 conference: the company is moving away from MCP internally, going back to REST APIs and CLIs.

The audience barely reacted, but the statement exploded on social media. YC CEO Garry Tan retweeted it with a blunt "MCP sucks honestly" — it eats too much context window, authentication is broken, and he wrote a CLI wrapper in 30 minutes to replace it.

A year ago, this kind of pushback would have been unthinkable. MCP was hailed as the ultimate standard for AI tool integration, ecosystem growth was explosive, and server counts doubled weekly. Now it's been hyped, overused, and rejected.

So what actually went wrong with MCP?

MCP Is Too Heavy: Context Windows Can't Handle It

A standard MCP setup consumes roughly 72% of the context window. Someone measured it: three servers (GitHub, Playwright, and an IDE integration) burned through 143K tokens of tool definitions on a 200K-token model. Before the agent even starts working, less than 30% of the space remains.

The cost isn't just financial. The more noise packed into the context, the worse the model focuses on what matters. With 100 tool schemas sitting there, the agent has to wade through all of them for every single decision.

Researchers call this "context rot." The data is stark: tool selection accuracy drops from 43% to below 14%. The more tools you add, the worse the agent gets at picking the right one.

The root cause is MCP's loading strategy — it dumps all tool descriptions into the session at the start, regardless of whether they'll be used. This is a protocol-level design choice, not a bug, but the cost scales with the number of tools.

Skills take a different approach: progressive disclosure. At session start, the agent only sees metadata for each Skill — name, one-line description, and trigger conditions. A few dozen tokens total. Only when the agent determines a Skill is relevant does it load the full content.

MCP lines up every tool at the door and asks you to pick. Skills hand you an index and let you look things up as needed.

MCP Is Too Dumb: It Just Waits to Be Called

MCP is essentially a tool-calling protocol: how to discover tools, how to invoke them, how to get results. Clean for small-scale use, but the "cleanness" is also the limitation — it's too flat.

No Hierarchy, No Subcommands

In MCP, a tool is a function signature. No subcommands, no awareness of session lifecycle, no knowledge of where the agent is in its workflow.

CLI tools are fundamentally different. A CLI naturally has subcommands — git commit, git push, git log are completely different behavior paths. An agent can run --help first, explore what's available, and expand as needed.

No SOPs, No Guidance for the Agent

A Skill is essentially a markdown file containing an SOP: what to do first, what to do next, how to retry on failure, when to notify the user. The agent doesn't get an isolated tool — it gets a complete operational playbook.

MCP tools passively wait. Skills actively participate in the agent's workflow.

Can't Reuse the Agent's LLM

This one hits close to home. Six months ago, we built the claude-context MCP project to provide contextual code retrieval for Claude Code. When a user asked a question, MCP retrieved relevant conversation fragments from a Milvus vector database and fed them back into the context.

The problem: out of top 10 retrieved results, maybe 3 were useful. The other 7 were noise. I tried several approaches — dumping everything to the agent (too noisy), adding reranking inside the MCP server (small models weren't accurate enough), using a large model (but the MCP server is a separate process that can't access the outer agent's LLM).

Skills solve this. A retrieval Skill can be structured as: run vector search for top 10, then let the agent's own LLM judge relevance and filter noise. No extra model, no extra API key.

Our later project memsearch used this approach — three-layer progressive retrieval where the agent's LLM participates throughout the decision-making process, keeping only curated results visible to the main conversation.

But Does MCP Deserve to Die?

MCP has been donated to the Linux Foundation, has over 10,000 active servers, and SDK downloads hit 97 million per month. That ecosystem doesn't just die overnight.

Some developers take a moderate view: Skills are like a detailed recipe; MCP is the tool. Both have their place.

It depends on the scenario. MCP over stdio — the mode most developers run locally — has the most issues. But MCP over HTTP is different: enterprise tool platforms need unified permission management, centralized OAuth, standardized telemetry — things scattered CLI tools genuinely struggle to provide.

MCP probably won't disappear. It will retreat to where it fits — enterprise tool platforms, scenarios requiring centralized governance. But the claim that "every agent should use MCP" no longer holds up.

What's Next

The most pragmatic combination right now is CLI + RESTful + Skills: CLIs for everyday operations, RESTful APIs for external system integration, Skills for agent behavioral knowledge. MCP hasn't vanished, but its position has shifted — from "the standard" to "one option among several."

MCP left the industry with a valuable question: what kind of tool interface do agents actually need? The answer is slowly taking shape — it's just no longer something a single protocol can provide.

Which Embedding Model Should You Actually Use in 2026? I Benchmarked 10 Models to Find Out

Chen Zhang — Fri, 20 Mar 2026 08:53:57 +0000

Still using OpenAI's text-embedding-3-small without a second thought? If you're building RAG or vector search systems, you've probably noticed that new embedding models drop every few weeks, each claiming SOTA on some leaderboard. But when it comes to picking one for production, those MTEB scores don't always translate to real-world performance.

On March 10, 2026, Google released Gemini Embedding 2 Preview — a model that supports five modalities (text, image, video, audio, PDF) natively, 100+ languages, native MRL (Matryoshka Representation Learning), and 3072-dimensional output. On paper, it checks every box.

The official benchmarks look impressive too:

But official benchmarks tend to highlight the best scenarios. So I decided to test things myself: pick a batch of 2025-2026 models and run them through tasks that public benchmarks don't cover well.

The Contenders

I selected 10 models spanning API services and open-source local deployment, plus classic baselines like OpenAI text-embedding-3-large and CLIP ViT-L-14.

Model	From	Params	Dims	Modalities	Notes
Gemini Embedding 2	Google	Unknown	3072	Text/Image/Video/Audio/PDF	All-modality universal
Jina Embeddings v4	Jina AI	3.8B	2048	Text/Image/PDF	MRL + LoRA multi-task
Voyage Multimodal 3.5	Voyage AI (MongoDB)	Unknown	1024	Text/Image/Video	Balanced across the board
Qwen3-VL-Embedding-2B	Alibaba Qwen	2B	2048	Text/Image/Video	Open-source, lightweight multimodal
Jina CLIP v2	Jina AI	~1B	1024	Text/Image	Modern CLIP architecture
Cohere Embed v4	Cohere	Unknown	Fixed	Text	Enterprise retrieval
OpenAI 3-large	OpenAI	Unknown	3072	Text	Most widely used
BGE-M3	BAAI	568M	1024	Text	Open-source multilingual
mxbai-embed-large	Mixedbread AI	335M	1024	Text	Lightweight, English-focused
nomic-embed-text	Nomic AI	137M	768	Text	Ultra-lightweight
CLIP ViT-L-14	OpenAI (2021)	428M	768	Text/Image	Classic baseline

A quick rundown of the newer ones:

Gemini Embedding 2 is Google's first all-modality embedding model, released March 2026, supporting all five modalities.

Jina Embeddings v4 is built on Qwen2.5-VL-3B (3.8B params). It uses three LoRA adapters (retrieval.query / retrieval.passage / text-matching) to switch between retrieval scenarios. Supports text, images, and PDFs.

Jina CLIP v2 is Jina AI's modernized CLIP architecture focused on text-image cross-modal alignment with multilingual support.

Voyage Multimodal 3.5 comes from Voyage AI, acquired by MongoDB for $220M in February 2025. Supports text, images, and video.

Qwen3-VL-Embedding is Alibaba Qwen's open-source multimodal embedding series (2B and 8B variants). I tested the 2B version since it fits on a single 11GB consumer GPU — a good test of lightweight deployment viability.

Cohere Embed v4 and OpenAI 3-large are text-only stalwarts, regulars on MTEB leaderboards and the most common choices for RAG.

BGE-M3 from BAAI is an open-source multilingual model (568M params, 100+ languages) — the benchmark in Chinese open-source embeddings.

mxbai-embed-large (335M) and nomic-embed-text (137M) are lightweight open-source options. mxbai excels at English MRL, while nomic is the smallest model in this benchmark.

Why Existing Benchmarks Aren't Enough

Before designing my tests, I looked at what's already out there and found gaps.

MTEB (Massive Text Embedding Benchmark) is the gold standard, but it's text-only, doesn't test cross-lingual retrieval (e.g., Chinese query → English document), doesn't evaluate MRL dimension truncation, and has limited coverage of truly long documents (10K+ tokens).

MMEB (Massive Multimodal Embedding Benchmark) adds multimodal, but lacks hard negatives — distractors are too easy, making it hard to differentiate models on fine-grained understanding.

Neither tests cross-lingual retrieval, MRL compression quality, or long-document needle retrieval. These happen to be exactly the pain points developers face when building RAG / Agent / vector search systems. So I designed four evaluation tasks: cross-modal retrieval, cross-lingual retrieval, needle-in-a-haystack, and MRL dimension compression.

Evaluation Tasks and Results

Round 1: Cross-Modal Retrieval (Text ↔ Image)

Scenario: E-commerce visual search, multimodal knowledge bases, multimedia content understanding.

Task design: 200 image-text pairs from COCO val2017. Text descriptions generated by GPT-4o-mini, each image paired with 3 hard negatives — descriptions that differ from the correct one by just one or two details. Models must retrieve correctly from a pool of 200 images + 600 distractor descriptions.

Here's an actual sample from the dataset:

Correct description:
"The image features vintage brown leather suitcases with various travel stickers including 'California', 'Cuba', and 'New York', placed on a metal luggage rack against a clear blue sky."

Hard negatives (single keyword swaps):

leather suitcases → canvas backpacks

California → Florida

metal luggage rack → wooden shelf

The model must truly understand visual details to distinguish these hard negatives.

Scoring: Bidirectional R@1 — text-to-image and image-to-text, averaged as hard_avg_R@1.

Results

This one surprised me.

Qwen3-VL-2B took first with hard_avg_R@1 = 0.945, beating Gemini (0.928) and Voyage (0.900). A 2B open-source model outperformed closed-source APIs.

Why? Look at the Modality Gap — the L2 distance between the mean text embedding vector and the mean image embedding vector. A smaller gap means text and image vectors live closer together in the embedding space, making cross-modal retrieval easier.

Model	hard_avg_R@1	Modality Gap	Params
Qwen3-VL-2B	0.945	0.25	2B (open-source)
Gemini Embed 2	0.928	0.73	Unknown (closed)
Voyage MM-3.5	0.900	0.59	Unknown (closed)
Jina CLIP v2	0.873	0.87	~1B
CLIP ViT-L-14	0.768	0.83	428M

Qwen3-VL-2B's modality gap of 0.25 is far smaller than Gemini's 0.73. If you're building a mixed text-image collection in Milvus, a smaller modality gap means text and image vectors can coexist in the same index without extra alignment tricks.

Takeaway from Round 1: In cross-modal capability, open-source small models can already compete with closed-source APIs.

Round 2: Cross-Lingual Retrieval (Chinese ↔ English)

Scenario: Bilingual knowledge bases where users ask in Chinese but answers live in English documents, or vice versa.

Task design: 166 manually constructed Chinese-English parallel sentence pairs across three difficulty levels, plus 152 hard negative distractors per language.

Difficulty levels:

Level Chinese English Hard Negative

Easy 我爱你。 I love you. —

Medium 这道菜太咸了。 This dish is too salty. "This dish is too **sweet." / "This **soup* is too salty."*

Hard 画蛇添足 To gild the lily "To add fuel to the fire" / "To let the cat out of the bag"

Mapping "画蛇添足" (literally "drawing legs on a snake") to "To gild the lily" — this kind of cultural concept alignment is the hardest part.

Level	Chinese	English	Hard Negative
Easy	我爱你。	I love you.	—
Medium	这道菜太咸了。	This dish is too salty.	"This dish is too sweet." / "This soup* is too salty."*
Hard	画蛇添足	To gild the lily	"To add fuel to the fire" / "To let the cat out of the bag"

Results

Gemini dominated here with 0.997, near-perfect, nailing even idiomatic expressions. It was the only model with R@1 = 1.000 on the Hard subset.

Model	hard_avg_R@1	Easy	Medium	Hard (idioms)
Gemini Embed 2	0.997	1.000	1.000	1.000
Qwen3-VL-2B	0.988	1.000	1.000	0.969
Jina v4	0.985	1.000	1.000	0.969
Voyage MM-3.5	0.982	1.000	1.000	0.938
OpenAI 3-large	0.967	1.000	1.000	0.906
Cohere v4	0.955	1.000	0.980	0.875
BGE-M3 (568M)	0.940	1.000	0.960	0.844
nomic (137M)	0.154	0.300	0.120	0.031
mxbai (335M)	0.120	0.220	0.080	0.031

This task split models into two clear groups: the top 8 (R@1 > 0.93) have genuine multilingual capability, while nomic and mxbai (R@1 < 0.16) essentially only understand English. No middle ground.

Round 3: Needle-in-a-Haystack

Scenario: RAG systems processing lengthy legal contracts, research papers. Can the embedding model still find key information buried in tens of thousands of characters?

Task design: Wikipedia articles as the "haystack" (4K-32K characters), with a fabricated fact inserted at different positions (start / 25% / 50% / 75% / end) as the "needle." The model must correctly rank the needle-containing document higher than the needle-free version via embedding similarity.

Example:

Needle: "The Meridian Corporation reported quarterly revenue of $847.3 million in Q3 2025."

Query: "What was Meridian Corporation's quarterly revenue?"

Haystack: A 32,000-character Wikipedia article about photosynthesis, with the revenue fact hidden somewhere inside.

Results

The discrimination here was bigger than I expected.

Model	1K	4K	8K	16K	32K	Overall	Degradation
Gemini Embed 2	1.000	1.000	1.000	1.000	1.000	1.000	0%
OpenAI 3-large	1.000	1.000	1.000	—	—	1.000	0%
Jina v4	1.000	1.000	1.000	—	—	1.000	0%
Cohere v4	1.000	1.000	1.000	—	—	1.000	0%
Qwen3-VL-2B	1.000	1.000	—	—	—	1.000	0%
Voyage MM-3.5	1.000	1.000	—	—	—	1.000	0%
Jina CLIP v2	1.000	1.000	1.000	—	—	1.000	0%
BGE-M3 (568M)	1.000	1.000	0.920	—	—	0.973	8%
mxbai (335M)	0.980	0.600	0.400	—	—	0.660	58%
nomic (137M)	1.000	0.460	0.440	—	—	0.633	56%

"—" means the length exceeds the model's context window or wasn't tested.

Three tiers emerged. Gemini, OpenAI, Jina v4, and Cohere scored near-perfect within their context windows. BGE-M3 (568M) showed slight degradation at 8K (0.92). Models under 335M (mxbai, nomic) dropped significantly at 4K, hitting 0.40-0.44 accuracy at 8K.

Gemini was the only model that completed the full 4K-32K range with a perfect score. On the other end, sub-335M models fell to 0.46-0.60 at just 4K characters (~1000 tokens) — if your RAG documents average over 2000 words, keep this in mind.

Round 4: MRL Dimension Compression

What is MRL?

MRL (Matryoshka Representation Learning) is a training technique that makes the first N dimensions of an embedding vector form a meaningful low-dimensional representation on their own. For example, a 3072-dim vector truncated to its first 256 dimensions can still retain decent semantic quality. Half the dimensions = half the storage cost.

Task design: 150 sentence pairs from STS-B (Semantic Textual Similarity Benchmark), each with human-annotated similarity scores (0-5). Models generate embeddings at full dimensions, then truncated to 256 / 512 / 1024 dims, measuring Spearman rank correlation (ρ) with human scores at each dimension.

Results

If you're planning to reduce storage costs by truncating embedding dimensions in your vector database, pay attention here.

Model	ρ (Full dim)	ρ (256 dim)	Degradation
Voyage MM-3.5	0.880	0.874	0.7%
Jina v4	0.833	0.828	0.6%
mxbai (335M)	0.815	0.795	2.5%
nomic (137M)	0.781	0.774	0.8%
OpenAI 3-large	0.767	0.762	0.6%
Gemini Embed 2	0.683	0.689	-0.8%

Gemini ranked last in this round. mxbai-embed-large (just 335M params) placed third in MRL, beating OpenAI 3-large. Jina v4 and Voyage led because they were specifically trained with MRL objectives. Dimension compression ability has little to do with model size — what matters is whether it was explicitly trained for it.

Note: MRL rankings reflect dimension-compression resilience, which is different from full-dimension semantic quality. Gemini's full-dimension retrieval is strong (proven in cross-lingual and cross-modal rounds), but it scored low on this slimming test. If you don't need dimension compression, this round's results matter less.

Full Scorecard

Model	Params	Cross-Modal	Cross-Lingual	Needle	MRL ρ
Gemini Embed 2	Unknown	0.928	0.997	1.000	0.668
Voyage MM-3.5	Unknown	0.900	0.982	1.000	0.880
Jina v4	3.8B	—	0.985	1.000	0.833
Qwen3-VL-2B	2B	0.945	0.988	1.000	0.774
mxbai-embed-large	335M	—	0.120	0.660	0.815
OpenAI 3-large	Unknown	—	0.967	1.000	0.760
BGE-M3	568M	—	0.940	0.973	0.744
nomic-embed-text	137M	—	0.154	0.633	0.780
Cohere v4	Unknown	—	0.955	1.000	—
Jina CLIP v2	~1B	0.873	0.934	1.000	—
CLIP ViT-L-14	428M	0.768	0.030	—	—

"—" means the model doesn't support that capability or wasn't tested. CLIP included as a 2021 baseline.

One thing is clear: no single model wins every round. Gemini leads in cross-lingual and long documents but ranks last in MRL. Qwen3-VL-2B takes first in cross-modal but is mid-pack on MRL. Voyage is consistently strong but never first. Every model's scorecard has a different shape.

Conclusions and Selection Guide

Round-by-Round Summary

Cross-modal: Qwen3-VL-2B (0.945) took first, Gemini (0.928) second, Voyage (0.900) third. Open-source 2B model beat closed-source APIs — modality gap was the key differentiator.

Cross-lingual: Gemini (0.997) led by a wide margin, handling even idiom-level Chinese-English alignment perfectly. Top 8 models all scored above 0.93; English-only lightweight models essentially scored zero.

Needle-in-a-haystack: API and large open-source models scored perfectly within 8K; sub-335M models degraded starting at 4K. Gemini was the only model to achieve a perfect score across the full 32K range.

MRL compression: Voyage (0.880) and Jina v4 (0.833) led, with less than 1% degradation when truncated to 256 dims. Gemini (0.668) ranked last.

Gemini Embedding 2 Verdict

Back to the question I started with — how did Gemini Embedding 2 actually perform?

Strengths: Cross-lingual #1 (0.997), needle-in-a-haystack #1 (1.000), cross-modal #2 (0.928), broadest modality coverage (five modalities — other models max out at three).

Weaknesses: MRL compression ranked last (ρ=0.668), cross-modal accuracy beaten by open-source Qwen3-VL-2B.

If you don't need dimension compression, Gemini is currently unmatched for cross-lingual + long-document scenarios. But for cross-modal precision and dimension compression, specialized models do better.

Selection Decision Tree

Based on these benchmark results, here's a simple decision flow:

Limitations

A few models I didn't get to test: NVIDIA's NV-Embed-v2, Jina v5-text. I also didn't cover video, audio, or PDF/table modalities even though some models claim support, nor did I test domain-specific scenarios like code retrieval. The sample sizes are relatively small — ranking differences between some models may fall within statistical margin of error. More thorough testing is on my to-do list.

Final Thoughts

After running four rounds of benchmarks, a few things stood out to me.

Cross-lingual semantic alignment used to be a research topic in academic papers — now you can get it from an API call. Five years ago, text-image retrieval meant training a dedicated CLIP model; now a single general-purpose model handles text, images, video, audio, and PDFs. This field is moving faster than most people realize.

What impressed me most was how fast open-source is catching up. Qwen3-VL-2B has just 2B parameters yet beat every closed-source API in cross-modal accuracy. BGE-M3's cross-lingual performance rivals most commercial services. In the embedding space, data quality and training strategy matter more and more, while model size and compute are becoming less decisive. You don't need to worry about being locked into any single API — there's always an open-source alternative.

One last thought on model selection. The conclusions in this post will probably need updating in a year. Rather than agonizing over "which model is THE one," I'd invest in building an evaluation pipeline — understand your actual use case and data, set up a test workflow that can quickly validate new models when they drop. Public benchmarks like MTEB, MMTEB, and MMEB are useful references, but you ultimately need to validate on your own data. The evaluation code for this post is open-sourced on GitHub if you want to adapt it. In the long run, building this evaluation capability is more valuable than picking the right model at any single point in time.

Why Vercel's agent-browser Is Winning the Token Efficiency War for AI Browser Automation

Chen Zhang — Fri, 06 Mar 2026 15:11:49 +0000

The Problem With MCP-Based Browser Tools

If you've tried connecting an AI agent to a browser, you've probably used something like Playwright MCP or Chrome DevTools MCP. They work, but there's a hidden cost: tool definitions.

MCP tools describe themselves via JSON Schema, and those descriptions get loaded into the agent's context window at the start of every session. Playwright MCP costs roughly 13,700 tokens. Chrome DevTools MCP costs around 17,000. Before your agent has done a single thing, nearly 9% of a 200K context window is gone.

For short tasks, this is fine. For long multi-step automation workflows — the kind where an agent fills forms, navigates pages, extracts data, and interacts across multiple sites — it adds up fast and can push you right into context limits.

agent-browser: The CLI-First Alternative

Vercel's agent-browser takes a fundamentally different approach. Instead of exposing browser capabilities through MCP, it's a CLI tool. The AI agent interacts with the browser by executing shell commands:

agent-browser snapshot -i      # get interactive elements
agent-browser click @e1        # click an element
agent-browser fill @e2 "text"  # fill an input

No JSON Schema. No tool definitions. Zero token overhead for the tooling itself.

The responses are equally lean. A successful button click returns Done — six characters. Compare that with MCP-based tools that return full page state updates running into thousands of characters.

The Architecture Behind the Efficiency

agent-browser uses a three-tier architecture that's worth understanding:

Tier 1 — Rust CLI: A native binary that handles argument parsing and command routing in sub-millisecond time. This eliminates Node.js cold start overhead.
Tier 2 — Node.js Daemon: A long-running process that manages the Playwright browser instance. It stays warm between commands, so you don't pay the 2–5 second browser startup cost on each interaction.
Tier 3 — Browser: The actual browser, connected via CDP (Chrome DevTools Protocol). Supports local Chromium, remote Chrome instances, cloud browsers (Browserbase), and even iOS Safari.

The CLI and daemon communicate through Unix domain sockets, keeping IPC fast and lightweight.

For element interaction, agent-browser uses accessibility tree snapshots with compact refs. Running snapshot -i returns something like:

button "Sign In" [ref=e1]
textbox "Email" [ref=e2]
textbox "Password" [ref=e3]

Roughly 200–400 tokens for a typical page, versus significantly larger outputs from MCP-based alternatives.

The Numbers

Here's a side-by-side comparison I compiled from multiple independent benchmarks:

Metric	agent-browser	Chrome DevTools MCP	Playwright MCP
Tool definition overhead	0 tokens	~17,000 tokens	~13,700 tokens
Single page snapshot	~1,000 tokens	Varies (larger)	~15,000 tokens
Button click response	6 characters	Full state update	12,891 characters
10-step automation flow	~7,000 tokens	~50,000 tokens	~114,000 tokens

Vercel's internal testing showed that simplifying from 17 tools down to 2 produced dramatic improvements:

3.5x faster execution
37% fewer tokens consumed
Success rate from 80% to 100%
42% fewer steps needed

Under the same context budget, agent-browser can run approximately 5.7x more test cycles than Playwright MCP.

Where It Falls Short

It's not all upside. agent-browser is two months old and the rough edges show:

No deep debugging. There's no equivalent to Chrome DevTools MCP's heap snapshots, Lighthouse audits, or detailed performance profiling. If your use case is front-end debugging or performance analysis, this isn't the tool.

Windows is broken. Multiple open issues around socket files, daemon startup, Git Bash compatibility, and path handling. If your agents run on Windows, wait for these to be fixed.

Limited ecosystem compatibility. Because it's a CLI, it only works with tools that can execute shell commands. MCP-only clients like Cursor or GitHub Copilot can't use it directly.

Documentation is thin. Multiple GitHub issues mention incomplete or missing docs. The project moves fast, but the docs haven't kept up.

When to Use What

After spending a week with both tools, my recommendation:

Scenario	Best Choice
Long-running AI automation workflows	agent-browser
Token budget is tight	agent-browser
Front-end debugging and performance analysis	Chrome DevTools MCP
Need MCP compatibility (Cursor, Copilot, etc.)	Chrome DevTools MCP
Windows environment	Chrome DevTools MCP
Network interception and mocking	agent-browser

The two tools aren't really competing — they solve different problems. agent-browser is optimized for AI agents that need to use a browser efficiently. Chrome DevTools MCP is optimized for AI agents that need to debug a browser deeply.

The Signal Worth Watching

Perhaps the most telling development: Google's Chrome DevTools team is now building their own CLI tool. When the team behind the leading MCP server starts shipping a CLI interface, it validates the core thesis that CLI is a better interface than MCP for AI-driven browser automation.

17,000+ stars in two months. This one's worth paying attention to.

Building Code Retrieval for Claude Code from Scratch

Chen Zhang — Tue, 19 Aug 2025 02:05:33 +0000

The story begins with a bug hunt...

When I opened Claude Code and asked it to help me locate a bug, what happened? It repeatedly used grep + read file tools, guessing possible keywords and constantly searching through massive amounts of files. After a minute, it still hadn't found anything.

Then, through hints and guidance from me, after going back and forth for 5 minutes, it finally located the problem file. But I discovered that among all the files it read, only 10 lines of code were actually related to this issue - 99% of the code was irrelevant. Throughout this repetitive dialogue and reading of massive amounts of unrelated code, not only were tokens wasted, but precious time was also squandered.

This experience was clearly problematic. So what's the root cause of this inefficiency?

After some reflection and research, I identified the following key pain points from this scenario:

First Pain Point: Expensive 💸

Each query requires transmitting massive amounts of irrelevant code to the LLM. Token consumption is enormous, causing costs to skyrocket.

Second Pain Point: Slow ⏰

AI needs to repeatedly probe and search, causing excessive waiting time for users. Development efficiency is greatly reduced.

Third Pain Point: Limitations of Keyword Search 🔍

Traditional grep can only match literal meanings and cannot understand semantic relationships and contextual meanings of code. It's like finding a needle in a haystack, relying purely on luck.

Others have raised similar issues with Claude Code, such as issue1 and issue2. You can see that even Claude Code, as powerful as it is, cannot escape these pain points and problems.

How Does Cursor Handle This?

To address these pain points, Cursor's founders actually revealed their solution early on in a forum post - "Codebase Indexing".

The approach is straightforward: split the codebase into small chunks, send them to the server, and use embedding models to embed the code. This is the standard code RAG solution.

But this raises the question: why hasn't Claude Code adopted this approach? After analyzing this deeply, I discovered there are numerous engineering challenges to tackle:

How to split the code?
What to do when code changes?
How to ensure indexing and retrieval speed?
How to build indexes for massive amounts of code embeddings?

Obviously, all of these require substantial engineering support. Claude Code's design philosophy is to be simple and CLI-based without interfaces, so such a large engineering effort clearly doesn't align with its positioning.

More importantly, embedding models and vector databases aren't Anthropic's core strengths. As a result, Claude Code hasn't adopted this approach, leaving these frustrating pain points unresolved.

Why Not Implement It Myself?

Since Claude Code hasn't implemented it and Cursor is a closed-source paid product, why don't I build one myself?

I can integrate vector databases and embedding models to implement an open-source code retrieval MCP tool similar to Cursor. This would not only meet my own needs but also help other developers - wouldn't that be wonderful?

So I named this product: Claude Context. It's an open-source code retrieval MCP tool that can seamlessly integrate with Claude Code while also being compatible with other AI Coding IDEs. It enables LLMs to obtain higher quality and more accurate contextual information.

Next, it was time to design the solution and roll up my sleeves to get to work!

Technology Stack

Given that I was going to build this, choosing the right technology stack was crucial:

🔌 Interface Level: MCP is the First Choice
MCP is like the USB for LLM interactions with the outside world. I needed to expose the product's capabilities through an MCP server, enabling not just Claude Code but also other AI IDEs like Gemini CLI and Qwen Code to utilize it.

💾 Vector Database: Choosing Zilliz Cloud
Zilliz Cloud is a fully managed Milvus vector database service. With high-performance vector search, high QPS and low latency, cloud-native architecture bringing elastic scaling and unlimited storage, plus multi-replica enhanced availability - it's practically tailor-made for Codebase Indexing.

🧠 Embedding Models: Multiple Options

OpenAI embedding models: Widely used and validated, stable and reliable
Voyage embedding: Has specialized models for the Code domain with better performance
Ollama: Suitable for local deployment with stronger privacy
More embedding models to be supported later

⌨️ Programming Language: TypeScript Wins
I deliberated between Python and TypeScript, ultimately settling on TypeScript. The reasoning is straightforward: it offers better compatibility at the application layer, allowing developed modules to seamlessly integrate with higher-level TypeScript applications like VSCode plugins. Additionally, since Claude Code, Gemini CLI, and similar tools are all built with TypeScript, this choice provides better ecosystem alignment.

Architecture Design

Following decoupling and layered design principles, I structured the architecture into two distinct layers:

Core modules: Contains all core logic, with each block's logic also designed separately, such as code parsing, vector indexing, semantic retrieval, synchronization updates, etc.
Front-end modules: Contains MCP, VSCode plugins, and other integrations. Based on core modules, they contain more application-layer logic, especially MCP, which is the best way to interact with AI IDEs like Claude Code.

This design allows core modules to be reused by upper-layer modules, providing flexibility for both horizontal and vertical scaling in the future.

Core Modules

Core modules serve as the foundation - after all, you can only build a solid house on a strong foundation. These modules abstract vector databases, embedding models, and other components into composable modules that form a Context object, enabling the use of different vector databases and embedding models across various scenarios.

import { Context, MilvusVectorDatabase, OpenAIEmbedding } from '@zilliz/claude-context-core';

// Initialize embedding provider
const embedding = new OpenAIEmbedding(...);

// Initialize vector database
const vectorDatabase = new MilvusVectorDatabase(...);

// Create context instance
const context = new Context({embedding, vectorDatabase});

// Index your codebase with progress tracking
const stats = await context.indexCodebase('./your-project');

// Perform semantic search
const results = await context.semanticSearch('./your-project', 'vector database operations');

How to Split Code?

Code splitting can't be handled with a simple, brute-force approach of splitting by lines or characters. Such an approach would result in code blocks with either incomplete logic or lost context.

I designed two complementary splitting strategies:

1. AST Abstract Syntax Tree Splitting (Main Strategy) 🌳

This is the default and recommended strategy. Through the tree-sitter parser, it understands the syntactic structure of code and splits according to semantic units.

The advantages of AST splitting are obvious:

Syntactic Completeness: Each chunk is a complete syntactic unit, avoiding awkward situations where functions are split in half
Logical Coherence: Related code logic remains in the same chunk, allowing AI search to find more accurate context
Multi-language Support: Uses different tree-sitter parsers for different programming languages, accurately identifying and splitting JavaScript function declarations, Python class definitions, Java methods, Go function definitions, etc.

2. LangChain Text Splitting (Fallback Strategy) 🛡️

For languages that AST cannot handle or parsing failures, I use LangChain's RecursiveCharacterTextSplitter as a backup solution.

// Use recursive character splitting to maintain code structure
const splitter = RecursiveCharacterTextSplitter.fromLanguage(language, {
    chunkSize: 1000,
    chunkOverlap: 200,
});

While this strategy isn't as sophisticated as AST (essentially splitting by character count), it's stable and reliable. It ensures that any code can be properly split, providing a dependable fallback option.

Implementing the fallback ensures both semantic completeness and accommodates various use case requirements. Rock solid!

What About Code Changes?

Handling code changes has always been a core challenge for code indexing systems. Imagine if you had to re-index the entire project every time a file had a minor change - that would be a disaster.

I designed a Merkle Tree-based synchronization mechanism to solve this problem.

1. Merkle Tree: The Core of Change Detection

Merkle Tree is like a layered "fingerprint" system:

Each file has its own hash fingerprint
Folders have fingerprints based on their content files
Finally converging into a unique root node fingerprint for the entire codebase

As long as file content changes, its upper-layer hash fingerprints will change layer by layer up to the root node.

This approach allows me to traverse from the root node downward, comparing hash fingerprint changes layer by layer to rapidly detect and pinpoint file modifications. There's no need to re-index the entire project - the efficiency gains are remarkable!

2. Change Detection and Synchronization ⚡

By default, synchronization checks occur every 5 minutes. The synchronization mechanism is clean and efficient, operating in three phases:

🏃‍♂️ Phase One: Rapid Detection
Calculate the Merkle root hash of the entire codebase and compare it with the last saved snapshot. If the root hash matches? Great news - nothing has changed, so we can skip the update entirely! This check completes in milliseconds.

🔍 Phase Two: Precise Comparison
If the root hash differs, we enter precise comparison mode, performing detailed file-level analysis to identify exactly which files have changed:

Newly added files
Deleted files
Modified files

🔄 Phase Three: Incremental Update
Only recalculate vectors for changed files, then update them in the vector database. Time and effort saved!

3. Local Snapshot Management

All synchronization states are saved in the user's local ~/.context/merkle/ directory. Each codebase has its own independent snapshot file containing file hash tables and serialized data of the Merkle tree. This way, even if the program restarts, it can accurately restore the previous synchronization state.

The benefits of this design are substantial. Most of the time, it can detect no changes within milliseconds, and only genuinely modified files get reprocessed, eliminating unnecessary computations. Furthermore, the state persists perfectly even when the program is closed and reopened.

From a user experience perspective, this means when you modify a function, the system will only re-index that file, not the entire project, greatly improving development efficiency.

MCP Module

How to Design Tools? 🛠️

The MCP module is the facade, directly facing users. In this module, user experience is paramount.

First, consider tool design. Abstracting common codebase indexing/search behaviors, it's easy to think of two core tools:

index_codebase - Index codebase
search_code - Search code

But wait, do we need other tools?

There's a delicate balance to strike: too many tools would include numerous edge-case functionalities that burden MCP clients and complicate LLM decision-making, while too few tools might omit essential functionality.

Let me approach this by working backward from actual usage scenarios.

🤔 The Challenge of a Blocking Step

Some large codebases take a long time to index. It isn't the best user experience to block on this slow step.

So we need asynchronous background processing. But MCP doesn't natively support background operations - how do we solve this?

The solution: implement a background process within the MCP server to handle indexing, allowing the server to immediately return a "indexing started" message while users continue with other tasks.

Much better!

But this design brings new problems: how does the user know the indexing progress?

This naturally leads to needing a tool for querying indexing progress or status. The background indexing process asynchronously caches its progress, allowing users to check at any time: Is it 50% complete? Has it finished successfully? Did it fail?

We also need a tool for manually clearing the index. When users suspect the indexed codebase is inaccurate or want to re-index from scratch, they can manually clear everything and start fresh.

Final tool design:

index_codebase - Index codebase
search_code - Search code
get_indexing_status - Query indexing status
clear_index - Clear index

Four tools - clean and sufficient!

How to Manage Environment Variables? ⚙️

Environment variable management is often overlooked but represents a critical user experience consideration.

If every MCP client requires separate API key configuration, users switching between Claude Code and Gemini CLI would need to configure everything twice. That level of redundancy is simply unacceptable.

I designed a global configuration solution to completely solve this pain point.

The solution is simple: create a ~/.context/.env file in the user's home directory as global configuration:

# ~/.context/.env
OPENAI_API_KEY=your-api-key-here
MILVUS_TOKEN=your-milvus-token

The benefits of this approach are obvious:

Only need to configure once to use across all MCP clients
All configurations are centralized in one place, easy to maintain
Sensitive API keys won't be scattered across various configuration files

I then implemented a three-tier priority system:

Highest Priority: Process environment variables
Medium Priority: Global configuration file
Lowest Priority: Default values

This design is very flexible:

Developers can use environment variables to override configurations during temporary testing
In production environments, sensitive configurations can be injected through system environment variables to ensure security

Users configure once and can seamlessly use it across multiple tools like Claude Code and Gemini CLI, significantly lowering the barrier to entry!

With this, we've completed the core architecture design of the MCP server. From code parsing and vector storage to intelligent retrieval and configuration management, every component has been carefully designed and optimized. The resulting system is both powerful and user-friendly.

So how does this solution perform in actual use? Let's look at the final results.

Results Showcase 🎉

After building the architecture code, I open-sourced the project and published it to the npm registry.

Installation and usage couldn't be simpler - just run a single command before starting Claude Code:

claude mcp add claude-context -e OPENAI_API_KEY=your-openai-api-key -e MILVUS_TOKEN=your-zilliz-cloud-api-key -- npx @zilliz/claude-context-mcp@latest

Let's see it in action! I asked it to help me find that bug from earlier. First, I indexed the current codebase, then asked it to locate this bug based on a specific description.

The results were impressive! Using claude-context MCP tool calls, it successfully identified the exact file and line number of the bug and provided detailed explanations.

Perfect! 🎯

With claude-context MCP integrated, Claude Code can leverage it across numerous scenarios, gaining access to higher quality and more precise contextual information:

🐛 Issue fixing
🔧 Code refactoring
🔍 Duplicate code detection
🧪 Code testing

Open Source Community Feedback

https://github.com/zilliztech/claude-context

Since open-sourcing the code, I've received tremendous feedback and suggestions from the community. I'm incredibly grateful for everyone's support and valuable input. The project has now earned 2.6k+ stars.

Many users have requested benchmark evaluations to quantify claude-context's improvements. This area is currently under active testing and experimentation. We already have a preliminary finding: under equivalent recall rate conditions, claude-context can reduce token consumption by 40% compared to baseline approaches.

This translates to 40% savings in both time and cost.

Conversely, with equivalent limited token budgets, claude-context MCP delivers superior retrieval performance.

Additional testing details will be published in the GitHub repository as they become available.

Epilogue

This project evolved from a simple idea into a comprehensive code retrieval solution. The journey was challenging yet incredibly rewarding. By leveraging MCP architecture and Zilliz Cloud's vector database, we successfully addressed the core pain points of code retrieval in Claude Code.

Looking ahead, we plan to continue optimizing retrieval algorithms, expand support for additional programming languages, and continuously enhance the user experience. We welcome everyone to try it out, test it thoroughly, and share feedback. We're also excited to collaborate with more developers who want to contribute code and help make claude-context even more robust and powerful!