<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chen Zhang</title>
    <description>The latest articles on DEV Community by Chen Zhang (@chen_zhang_bac430bc7f6b95).</description>
    <link>https://dev.to/chen_zhang_bac430bc7f6b95</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3441168%2F69fca37c-e540-4b09-8999-3b295854369f.jpg</url>
      <title>DEV Community: Chen Zhang</title>
      <link>https://dev.to/chen_zhang_bac430bc7f6b95</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chen_zhang_bac430bc7f6b95"/>
    <language>en</language>
    <item>
      <title>Vector Graph RAG: Multi-Hop RAG Without a Graph Database</title>
      <dc:creator>Chen Zhang</dc:creator>
      <pubDate>Fri, 03 Apr 2026 07:54:52 +0000</pubDate>
      <link>https://dev.to/chen_zhang_bac430bc7f6b95/vector-graph-rag-multi-hop-rag-without-a-graph-database-3hgb</link>
      <guid>https://dev.to/chen_zhang_bac430bc7f6b95/vector-graph-rag-multi-hop-rag-without-a-graph-database-3hgb</guid>
      <description>&lt;p&gt;Standard RAG falls apart when the answer isn't in one chunk. Ask "What side effects should I watch for with the first-line diabetes medication?" and the system needs to first figure out that metformin is the first-line drug, then look up metformin's side effects. The query never mentions "metformin" — it's a bridge entity the system has to discover on its own. Naive vector search can't do this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzapylr7wow492blns8wr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzapylr7wow492blns8wr.png" alt="Multi-hop problem illustration" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The industry answer has been knowledge graphs plus graph databases. That works, but it means deploying Neo4j or similar, learning a graph query language, and operating two separate storage systems. The complexity doubles for what's essentially one feature: following entity chains across passages.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/zilliztech/vector-graph-rag" rel="noopener noreferrer"&gt;Vector Graph RAG&lt;/a&gt; to get multi-hop reasoning without any of that overhead. The entire graph structure lives inside Milvus — entities, relations, and passages stored as three collections with ID cross-references. No graph database, no Cypher queries, just vector search and metadata lookups.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp5nmt9w5r4l8udp535g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp5nmt9w5r4l8udp535g.png" alt="Architecture comparison" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Building a Logical Graph in Milvus&lt;/h2&gt;

&lt;p&gt;The key insight is simple: a knowledge graph relation like &lt;code&gt;(metformin, is_first_line_drug_for, type_2_diabetes)&lt;/code&gt; is just text. Text can be embedded into vectors. So why not store the entire graph structure in a vector database?&lt;/p&gt;

&lt;p&gt;Vector Graph RAG uses three Milvus collections with ID cross-references:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entities&lt;/strong&gt;: Deduplicated entity names, embedded for semantic search. Each entity record stores the IDs of relations it participates in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relations&lt;/strong&gt;: Triple-based relations (subject, predicate, object). Each record stores the subject and object entity IDs, plus the IDs of source passages. The relation text is embedded for vector search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Passages&lt;/strong&gt;: Original document chunks. Each record stores the IDs of entities and relations extracted from it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three collections form a logical graph through ID references. "Graph traversal" becomes a series of ID-based metadata queries in Milvus — no graph query language needed.&lt;/p&gt;
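&lt;p&gt;To make the cross-references concrete, here is a toy sketch with plain Python dicts standing in for the three collections. The field names are my own illustration, not the library's actual schema; in the real system each lookup is a primary-key metadata query in Milvus.&lt;/p&gt;

```python
# Minimal sketch: three "collections" as dicts keyed by ID, so the
# cross-references are visible. Field names are illustrative assumptions.
entities = {
    "e1": {"name": "metformin", "relation_ids": ["r1", "r2"]},
    "e2": {"name": "type 2 diabetes", "relation_ids": ["r1"]},
    "e3": {"name": "renal function monitoring", "relation_ids": ["r2"]},
}
relations = {
    "r1": {"text": "(metformin, is_first_line_drug_for, type_2_diabetes)",
           "subject_id": "e1", "object_id": "e2", "passage_ids": ["p1"]},
    "r2": {"text": "(metformin, requires, renal_function_monitoring)",
           "subject_id": "e1", "object_id": "e3", "passage_ids": ["p2"]},
}
passages = {
    "p1": {"text": "Metformin is the first-line medication for type 2 diabetes.",
           "relation_ids": ["r1"]},
    "p2": {"text": "Metformin requires regular monitoring of renal function.",
           "relation_ids": ["r2"]},
}

def passages_for_entity(entity_id):
    """One 'traversal': entity ID to relation IDs to source passages."""
    texts = []
    for rid in entities[entity_id]["relation_ids"]:
        for pid in relations[rid]["passage_ids"]:
            texts.append(passages[pid]["text"])
    return texts
```

&lt;p&gt;Every hop has the same shape: read an ID list from one collection, then do primary-key lookups in another.&lt;/p&gt;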

&lt;p&gt;The extra ID lookups add two or three primary-key queries per hop, each under 10 ms. The real bottleneck in any RAG pipeline is the LLM call (1-3 seconds), so a few extra milliseconds of metadata lookup are invisible.&lt;/p&gt;

&lt;h2&gt;The Four-Step Retrieval Pipeline&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4yj7p5p2k00j430rj8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4yj7p5p2k00j430rj8b.png" alt="4-step pipeline" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Step 1: Seed Retrieval&lt;/h3&gt;

&lt;p&gt;An LLM extracts key entities from the user query. These entities are embedded and used to search the Entities and Relations collections. The results are the "seeds" — entry points into the logical graph.&lt;/p&gt;
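&lt;p&gt;Under the hood, seed retrieval is just nearest-neighbor search over the entity embeddings. Here is a toy version with made-up 2-d vectors in place of real embedding-model output:&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 2-d "embeddings"; real ones come from an embedding model.
entity_vectors = {
    "type 2 diabetes": [0.9, 0.1],
    "metformin": [0.7, 0.6],
    "aspirin": [0.1, 0.9],
}

def seed_search(query_vector, top_k=2):
    # Rank entities by similarity to the query vector and keep the top k.
    scored = sorted(entity_vectors.items(),
                    key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_k]]
```

&lt;p&gt;The entities the LLM pulls out of the query get embedded and searched exactly like this, just against Milvus indexes instead of a dict.&lt;/p&gt;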

&lt;h3&gt;Step 2: Subgraph Expansion&lt;/h3&gt;

&lt;p&gt;This is where multi-hop happens. From each seed entity, the system follows ID references one hop outward: find the entity's relation IDs, fetch those relations, then fetch the entities on the other end of those relations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5boniwrizz38mi1lzyl6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5boniwrizz38mi1lzyl6.png" alt="Subgraph expansion" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diabetes example, expanding from "type 2 diabetes" discovers the relation &lt;code&gt;(metformin, is_first_line_drug_for, type_2_diabetes)&lt;/code&gt;, which surfaces "metformin" — the bridge entity the original query never mentioned. From "metformin," another expansion finds relations about renal function monitoring and side effects.&lt;/p&gt;
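&lt;p&gt;One hop of expansion can be sketched over the same kind of toy data (field names again illustrative, not the library's schema):&lt;/p&gt;

```python
# One-hop expansion over ID references: from seed entities, collect their
# relations, then the entities on the other end of those relations.
entities = {
    "e1": {"name": "metformin", "relation_ids": ["r1", "r2"]},
    "e2": {"name": "type 2 diabetes", "relation_ids": ["r1"]},
    "e3": {"name": "renal function monitoring", "relation_ids": ["r2"]},
}
relations = {
    "r1": {"subject_id": "e1", "object_id": "e2",
           "text": "(metformin, is_first_line_drug_for, type_2_diabetes)"},
    "r2": {"subject_id": "e1", "object_id": "e3",
           "text": "(metformin, requires, renal_function_monitoring)"},
}

def expand(seed_ids):
    found_relations, found_entities = set(), set()
    for eid in seed_ids:
        for rid in entities[eid]["relation_ids"]:
            found_relations.add(rid)
            rel = relations[rid]
            for other in (rel["subject_id"], rel["object_id"]):
                if other != eid:
                    found_entities.add(other)
    return found_relations, found_entities
```

&lt;p&gt;Expanding from the "type 2 diabetes" seed surfaces "metformin" without the query ever naming it; expanding again from "metformin" reaches the monitoring relation.&lt;/p&gt;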

&lt;h3&gt;Step 3: LLM Reranking&lt;/h3&gt;

&lt;p&gt;After expansion, we have a pool of candidate relations and passages. A single LLM call scores and filters them for relevance to the original query. This replaces what iterative approaches do with multiple rounds of LLM-guided search.&lt;/p&gt;
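&lt;p&gt;The single-pass trick is that all candidates fit in one prompt. A sketch of packing them (the prompt format here is my own illustration, not the project's actual template):&lt;/p&gt;

```python
# Pack every candidate relation into one reranking prompt, so a single
# LLM call can score all of them at once. Format is illustrative.
candidates = [
    "(metformin, is_first_line_drug_for, type_2_diabetes)",
    "(metformin, requires, renal_function_monitoring)",
    "(aspirin, treats, headache)",
]

def build_rerank_prompt(query, candidate_relations):
    lines = [f"Query: {query}",
             "Rate each relation's relevance to the query from 0 to 10:"]
    for i, rel in enumerate(candidate_relations, start=1):
        lines.append(f"{i}. {rel}")
    return "\n".join(lines)
```

&lt;p&gt;Iterative systems rebuild and resend context on every round; here the scoring happens once over the whole pool.&lt;/p&gt;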

&lt;h3&gt;Step 4: Answer Generation&lt;/h3&gt;

&lt;p&gt;The top-ranked relations and their associated passages go to the LLM for final answer generation.&lt;/p&gt;

&lt;h2&gt;Two LLM Calls, Not Ten&lt;/h2&gt;

&lt;p&gt;Most multi-hop RAG approaches are iterative. IRCoT calls the LLM 3-5 times per query. Agentic RAG systems can make 10+ LLM calls.&lt;/p&gt;

&lt;p&gt;Vector Graph RAG front-loads the discovery work into vector search and subgraph expansion. The LLM only gets called twice: once for reranking, once for generation. This cuts API costs by roughly 60% and makes the system 2-3x faster compared to iterative approaches.&lt;/p&gt;
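&lt;p&gt;Back-of-envelope for those numbers, assuming cost scales with the number of LLM calls and roughly 2 seconds per call:&lt;/p&gt;

```python
# Rough arithmetic behind the cost and latency claims. The per-call
# latency of 2.0 s is an assumption for illustration.
iterative_calls = 5        # IRCoT at the high end of its 3-5 calls
vector_graph_calls = 2     # one rerank, one generation

savings = 1 - vector_graph_calls / iterative_calls   # fraction of calls saved

per_call_seconds = 2.0
speedup = (iterative_calls * per_call_seconds) / (vector_graph_calls * per_call_seconds)
```

&lt;p&gt;That works out to a 60% reduction in calls and a 2.5x speedup, inside the claimed 2-3x range.&lt;/p&gt;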

&lt;h2&gt;Benchmark Results&lt;/h2&gt;

&lt;p&gt;Evaluated on three standard multi-hop QA benchmarks using Recall@5:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Naive RAG&lt;/th&gt;
&lt;th&gt;Vector Graph RAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MuSiQue (2-4 hop)&lt;/td&gt;
&lt;td&gt;65.2%&lt;/td&gt;
&lt;td&gt;82.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HotpotQA (2 hop)&lt;/td&gt;
&lt;td&gt;78.6%&lt;/td&gt;
&lt;td&gt;91.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2WikiMultiHopQA (2 hop)&lt;/td&gt;
&lt;td&gt;76.4%&lt;/td&gt;
&lt;td&gt;89.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;87.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlhzak3sc2yea8u1u6i3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqlhzak3sc2yea8u1u6i3.png" alt="Recall@5 vs Naive RAG" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Against SOTA methods, Vector Graph RAG achieves the highest average Recall@5 at 87.8%, edging out HippoRAG 2 — while using only 2 LLM calls per query and requiring no graph database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfzgzbslvc37c14ijeex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfzgzbslvc37c14ijeex.png" alt="SOTA comparison" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Getting Started&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;vector-graph-rag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vector_graph_rag&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorGraphRAG&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize - uses Milvus Lite (local .db file) by default
&lt;/span&gt;&lt;span class="n"&gt;rag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorGraphRAG&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Index your documents
&lt;/span&gt;&lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_texts&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Metformin is the first-line medication for type 2 diabetes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Metformin requires regular monitoring of renal function.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Type 2 diabetes affects insulin sensitivity in the body.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Query with multi-hop reasoning
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What monitoring is needed for the first-line type 2 diabetes drug?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;By default, it uses Milvus Lite with a local &lt;code&gt;.db&lt;/code&gt; file — no server needed. For production, switch to &lt;a href="https://milvus.io/" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; standalone/cluster or &lt;a href="https://zilliz.com/cloud" rel="noopener noreferrer"&gt;Zilliz Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6hx538zcz6an1dlm7fs.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6hx538zcz6an1dlm7fs.gif" alt="Interactive frontend demo" width="720" height="423"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Wrapping Up&lt;/h2&gt;

&lt;p&gt;Vector Graph RAG shows that the "graph" in Graph RAG doesn't have to mean a graph database. Store the graph structure as cross-referenced collections in a vector database and you get the same reasoning power with half the infrastructure.&lt;/p&gt;

&lt;p&gt;If your RAG system struggles with multi-hop questions, give &lt;a href="https://github.com/zilliztech/vector-graph-rag" rel="noopener noreferrer"&gt;Vector Graph RAG&lt;/a&gt; a try. It's open source, installs in one command, and runs locally out of the box.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/zilliztech" rel="noopener noreferrer"&gt;
        zilliztech
      &lt;/a&gt; / &lt;a href="https://github.com/zilliztech/vector-graph-rag" rel="noopener noreferrer"&gt;
        vector-graph-rag
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Graph RAG with pure vector search, achieving SOTA performance in multi-hop reasoning scenarios.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;
  &lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/17022025/569541915-60afcee1-049a-4d2c-845d-8953b4fae083.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzUyMDMxOTIsIm5iZiI6MTc3NTIwMjg5MiwicGF0aCI6Ii8xNzAyMjAyNS81Njk1NDE5MTUtNjBhZmNlZTEtMDQ5YS00ZDJjLTg0NWQtODk1M2I0ZmFlMDgzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNjA0MDMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjYwNDAzVDA3NTQ1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTZkMmEyYjkwY2M2NzljNGM2MTQwMDA4MTU3ODMyMDlkOGM1M2I1Mjk0NTFlYjFiMWFlMTBkMDc3ODA3MmFlYzcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.B5r-l_3_j_twMTQ9GZjxARVM1v-44vxe3bizunmGYWU"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F17022025%2F569541915-60afcee1-049a-4d2c-845d-8953b4fae083.png%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzUyMDMxOTIsIm5iZiI6MTc3NTIwMjg5MiwicGF0aCI6Ii8xNzAyMjAyNS81Njk1NDE5MTUtNjBhZmNlZTEtMDQ5YS00ZDJjLTg0NWQtODk1M2I0ZmFlMDgzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNjA0MDMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjYwNDAzVDA3NTQ1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTZkMmEyYjkwY2M2NzljNGM2MTQwMDA4MTU3ODMyMDlkOGM1M2I1Mjk0NTFlYjFiMWFlMTBkMDc3ODA3MmFlYzcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.B5r-l_3_j_twMTQ9GZjxARVM1v-44vxe3bizunmGYWU" alt="" width="120"&gt;&lt;/a&gt;
  &lt;br&gt;
  Vector Graph RAG
&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;
  &lt;strong&gt;Graph RAG with pure vector search — no graph database needed.&lt;/strong&gt;
&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://pypi.org/project/vector-graph-rag/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/ac9165ae44a97f768ede810c1978c4fabcf40dab6e9a00589263c4b1efc30cef/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f766563746f722d67726170682d7261673f7374796c653d666c61742d73717561726526636f6c6f723d626c7565" alt="PyPI"&gt;&lt;/a&gt;
  &lt;a href="https://pypi.org/project/vector-graph-rag/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/63a7563e96ce73fab13ca542e9ff27338cbeb85f4a8cf946541591b432269b4d/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d253345253344332e31302d626c75653f7374796c653d666c61742d737175617265266c6f676f3d707974686f6e266c6f676f436f6c6f723d7768697465" alt="Python"&gt;&lt;/a&gt;
  &lt;a href="https://github.com/zilliztech/vector-graph-rag/blob/main/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/181d5f02ad0ad71e99d20695a1882ecba6ce7fd99fa83a90f4ddd99e27e93b4d/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f7a696c6c697a746563682f766563746f722d67726170682d7261673f7374796c653d666c61742d737175617265" alt="License"&gt;&lt;/a&gt;
  &lt;a href="https://zilliztech.github.io/vector-graph-rag/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/04149deb92bb7cf33ec4cc18df0216de0860132c35cfeb6103b27e359ed1c768/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f646f63732d766563746f722d2d67726170682d2d7261672d626c75653f7374796c653d666c61742d737175617265" alt="Docs"&gt;&lt;/a&gt;
  &lt;a href="https://github.com/zilliztech/vector-graph-rag/stargazers" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/b8431c89a07fffba654e2fa5b709bff0fcd6e2dbe9fb53e5c88e3153d14313e0/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f7a696c6c697a746563682f766563746f722d67726170682d7261673f7374796c653d666c61742d737175617265" alt="Stars"&gt;&lt;/a&gt;
  &lt;a href="https://discord.com/invite/FG6hMJStWu" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/83be151409ef6198d98bba837d8fa015ae9b9f6f756f7b4b17f5e17e21edbed6/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d636861742d3732383964613f7374796c653d666c61742d737175617265266c6f676f3d646973636f7264266c6f676f436f6c6f723d7768697465" alt="Discord"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Encode entities and relations as vectors in &lt;a href="https://milvus.io/" rel="nofollow noopener noreferrer"&gt;Milvus&lt;/a&gt;, replace iterative LLM agents with a single reranking pass — achieve state-of-the-art multi-hop retrieval at a fraction of the operational and computational cost.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/17022025/569496071-1185b651-ed72-4408-9dcd-25a74b12835b.gif?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzUyMDMxOTIsIm5iZiI6MTc3NTIwMjg5MiwicGF0aCI6Ii8xNzAyMjAyNS81Njk0OTYwNzEtMTE4NWI2NTEtZWQ3Mi00NDA4LTlkY2QtMjVhNzRiMTI4MzViLmdpZj9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNjA0MDMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjYwNDAzVDA3NTQ1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTMzYWU4YzZlY2UxMjEwYjI1NGVjZjU5M2JjYTg3YTMyZjdjNzQxYzA5M2RjMTFhOWEyZDcyMDQyMGZhYWNiY2ImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.1TYQb2qiNNLGriuuatmPvsLrrykF-uDiwucWBBMNAMo"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F17022025%2F569496071-1185b651-ed72-4408-9dcd-25a74b12835b.gif%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzUyMDMxOTIsIm5iZiI6MTc3NTIwMjg5MiwicGF0aCI6Ii8xNzAyMjAyNS81Njk0OTYwNzEtMTE4NWI2NTEtZWQ3Mi00NDA4LTlkY2QtMjVhNzRiMTI4MzViLmdpZj9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNjA0MDMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjYwNDAzVDA3NTQ1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTMzYWU4YzZlY2UxMjEwYjI1NGVjZjU5M2JjYTg3YTMyZjdjNzQxYzA5M2RjMTFhOWEyZDcyMDQyMGZhYWNiY2ImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.1TYQb2qiNNLGriuuatmPvsLrrykF-uDiwucWBBMNAMo" alt="Vector Graph RAG Demo" width="800"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;✨ Features&lt;/h2&gt;
&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Graph Database Required&lt;/strong&gt; — Pure vector search with Milvus, no Neo4j or other graph databases needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-Pass LLM Reranking&lt;/strong&gt; — One LLM call to rerank, no iterative agent loops (unlike IRCoT or multi-step reflection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge-Intensive Friendly&lt;/strong&gt; — Optimized for domains with dense factual content: legal, finance, medical, literature, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Configuration&lt;/strong&gt; — Uses Milvus Lite by default, works out of the box with a single file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-hop Reasoning&lt;/strong&gt; — Subgraph expansion enables complex multi-hop question answering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State-of-the-Art Performance&lt;/strong&gt; — 87.8% avg Recall@5 on multi-hop QA benchmarks, outperforming HippoRAG&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;📦 Installation&lt;/h2&gt;
&lt;/div&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;pip install vector-graph-rag
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; or&lt;/span&gt;
uv add vector-graph-rag&lt;/pre&gt;

&lt;/div&gt;

&lt;b&gt;With document&lt;/b&gt;…&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/zilliztech/vector-graph-rag" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;



</description>
      <category>vectordatabase</category>
      <category>llm</category>
      <category>python</category>
      <category>rag</category>
    </item>
    <item>
      <title>I Built an AI Agent That Watches the Market While I Sleep</title>
      <dc:creator>Chen Zhang</dc:creator>
      <pubDate>Fri, 03 Apr 2026 06:18:24 +0000</pubDate>
      <link>https://dev.to/chen_zhang_bac430bc7f6b95/i-built-an-ai-agent-that-watches-the-market-while-i-sleep-3go4</link>
      <guid>https://dev.to/chen_zhang_bac430bc7f6b95/i-built-an-ai-agent-that-watches-the-market-while-i-sleep-3go4</guid>
      <description>&lt;p&gt;I have a full-time job and no time to watch the stock market all day. But I still trade — mostly US tech stocks. Last year I made at least three bad decisions because I was too tired or too rushed to think clearly. So I built an AI agent to do the watching for me.&lt;/p&gt;

&lt;p&gt;The stack: &lt;a href="https://github.com/anthropics/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; as the agent framework, &lt;a href="https://exa.ai" rel="noopener noreferrer"&gt;Exa&lt;/a&gt; for information gathering, and &lt;a href="https://milvus.io" rel="noopener noreferrer"&gt;Milvus&lt;/a&gt; as a personal memory store. Total cost: about $20/month.&lt;/p&gt;

&lt;h2&gt;The NVIDIA Moment&lt;/h2&gt;

&lt;p&gt;On February 26th, NVIDIA reported Q4 earnings — revenue up 65% year-over-year. The stock dropped 5.5%. I didn't find out until the next morning.&lt;/p&gt;

&lt;p&gt;But when I checked my phone, there was already a message from my agent, sent the previous evening:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;NVDA earnings analysis: Revenue beat expectations, but the market is skeptical about AI capex sustainability. In similar past situations, the stock tended to drop short-term. You had a similar experience in September 2024 — you panic-sold and the stock recovered within three weeks. Recommendation: hold, don't panic-sell.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It didn't just analyze the earnings report. It pulled up my own past trading notes and reminded me not to repeat the same mistake.&lt;/p&gt;

&lt;h2&gt;Information Gathering with Exa&lt;/h2&gt;

&lt;p&gt;The first problem: where does the information come from?&lt;/p&gt;

&lt;p&gt;Exa is a search API designed for AI agents. It does semantic search — describe what you're looking for in plain language, and it understands what you mean. The index refreshes every minute and filters out SEO spam.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;exa_py&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Exa&lt;/span&gt;

&lt;span class="n"&gt;exa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Exa&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Why did NVIDIA stock drop despite strong Q4 2026 earnings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neural&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_published_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-02-25&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_characters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;highlights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_sentences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What caused the stock drop?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;contents&lt;/code&gt; parameter is the killer feature — it extracts full text, highlights key sentences, and generates a summary, all in one request. No need to click through links one by one.&lt;/p&gt;
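&lt;p&gt;Once the results come back, the agent condenses them into a digest. Here is a sketch over a mocked response shaped like Exa's documented fields (title, summary, highlights); the URL and data are made up:&lt;/p&gt;

```python
# Turn search results into a compact digest the agent can reason over.
# mock_results stands in for an Exa response; its contents are fictional.
mock_results = [
    {"title": "NVDA drops despite earnings beat",
     "url": "https://example.com/nvda",
     "summary": "Investors question AI capex sustainability.",
     "highlights": ["Revenue rose 65% year-over-year.",
                    "Shares fell 5.5% after hours."]},
]

def digest(results):
    parts = []
    for r in results:
        bullet_lines = [f"- {h}" for h in r["highlights"]]
        parts.append("\n".join([r["title"], r["summary"]] + bullet_lines))
    return "\n\n".join(parts)
```

&lt;p&gt;The digest, not the raw pages, is what gets handed to the LLM, which keeps the prompt small and the signal dense.&lt;/p&gt;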

&lt;h2&gt;Personal Memory with Milvus&lt;/h2&gt;

&lt;p&gt;Information gathering solves "what's happening out there." But making good decisions also requires knowing yourself — what you got right, what you got wrong, what your blind spots are.&lt;/p&gt;

&lt;p&gt;Milvus is a vector database. You convert text into vectors and store them. When you search later, it finds results by meaning, not keywords. So "Middle East conflict tanks tech stocks" and "geopolitical tensions trigger semiconductor selloff" match each other.&lt;/p&gt;
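&lt;p&gt;Those two phrases are exactly where keyword search falls down; they share literally zero words, which a quick check confirms:&lt;/p&gt;

```python
# The two example phrases have no keyword overlap at all, so lexical
# search could never match them; vector similarity is what bridges them.
a = "Middle East conflict tanks tech stocks"
b = "geopolitical tensions trigger semiconductor selloff"
shared = set(a.lower().split()).intersection(b.lower().split())
```

&lt;p&gt;Embeddings map both sentences near each other anyway, because the model has seen that geopolitics and chip selloffs travel together.&lt;/p&gt;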

&lt;p&gt;I set up three collections: past decisions/lessons, personal preferences/biases, and observed market patterns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pymilvus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MilvusClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;milvus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MilvusClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./my_investment_brain.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;

&lt;span class="n"&gt;milvus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decisions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auto_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;milvus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preferences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auto_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;milvus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auto_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A memory extractor runs after every conversation — it pulls out decisions, preferences, patterns, and lessons, then stores them automatically, skipping any new entry whose similarity to an existing memory exceeds 0.92.&lt;/p&gt;
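The dedup rule can be sketched with plain cosine similarity. This is a minimal stand-in for the comparison that would actually run against vectors stored in Milvus; `should_store` and its signature are hypothetical, not part of the project's code:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def should_store(new_vec: list[float], existing: list[list[float]],
                 threshold: float = 0.92) -> bool:
    """Skip the write when any stored memory is a near-duplicate."""
    return not any(cosine(new_vec, v) > threshold for v in existing)
```

In practice the same check falls out of a `limit=1` Milvus search against the target collection before inserting.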

&lt;p&gt;When the agent analyzes a current situation, it searches all three collections for relevant past experience:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;recall_my_experience&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;situation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;query_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;situation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;past&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;milvus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decisions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query_vec&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;output_fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;prefs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;milvus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preferences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query_vec&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;output_fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;milvus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query_vec&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;output_fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;past_decisions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;past&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preferences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prefs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's why the NVIDIA alert referenced my trading history from a year ago — the agent found the lesson in my own notes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis Framework: Writing My Logic as a Skill
&lt;/h2&gt;

&lt;p&gt;OpenClaw's Skills system lets you define situation-specific analysis rules in markdown. I wrote a post-earnings evaluation skill with my personal criteria:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;post-earnings-eval&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;"&lt;/span&gt;
  &lt;span class="s"&gt;Evaluate whether to buy, hold, or sell after an earnings report.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important line in the skill: "I have a tendency to let fear override data. If my Milvus history shows I regretted selling after a dip, say so explicitly." A personal bias-correction mechanism baked into the agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making It Run Automatically
&lt;/h2&gt;

&lt;p&gt;OpenClaw's Heartbeat mechanism handles scheduling. The Gateway sends a pulse every 30 minutes, and the agent acts based on a &lt;code&gt;HEARTBEAT.md&lt;/code&gt; file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Morning brief (6:30-7:30 AM)&lt;/strong&gt;: Search overnight news via Exa, query Milvus for positions and past experience, generate a personalized summary, push to phone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Price alerts (market hours)&lt;/strong&gt;: Monitor watchlist stocks, alert on &amp;gt;3% moves with context from past decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End of day summary&lt;/strong&gt;: Recap the day, compare with morning expectations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No cron jobs, no server. Just a markdown file.&lt;/p&gt;
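The dispatch on each pulse amounts to a time-window check. The function below is illustrative (names and rules are mine, not OpenClaw's actual code), mapping a heartbeat pulse to the tasks listed above:

```python
from datetime import datetime, time

def tasks_for_pulse(now: datetime, market_open: bool) -> list[str]:
    """Map a heartbeat pulse to the tasks described in HEARTBEAT.md."""
    tasks = []
    t = now.time()
    # Morning brief window: 6:30-7:30 AM
    if time(6, 30) <= t <= time(7, 30):
        tasks.append("morning_brief")
    # Price alerts only while the market is open
    if market_open:
        tasks.append("price_alerts")
    # End-of-day summary after the close
    if not market_open and t >= time(16, 0):
        tasks.append("end_of_day_summary")
    return tasks
```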

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxg7tw0n57m0tp6nxnte.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxg7tw0n57m0tp6nxnte.png" alt="Architecture overview" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;My weekly market-tracking time went from ~15 hours to ~2 hours. The agent runs 24/7, so nothing slips through. And because it uses my own past experiences in every analysis, the recommendations are personalized, not generic.&lt;/p&gt;

&lt;p&gt;The NVIDIA situation was the proof point. Without the agent, I probably would have panic-sold again. Instead, I had complete information — including a reminder of my own past mistake — and made the right call.&lt;/p&gt;

&lt;p&gt;Total monthly cost: ~$10 Exa API + ~$10 LLM calls. OpenClaw and Milvus run locally for free.&lt;/p&gt;

&lt;p&gt;If you want to try it, start with the smallest goal: "get a market summary on my phone every morning." One weekend to set up, then iterate from there.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Disclaimer: This post shares a personal technical project. All market analysis examples are illustrative and do not constitute investment advice.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>milvus</category>
      <category>python</category>
    </item>
    <item>
      <title>Claude Code's Memory: 4 Layers of Complexity, Still Just Grep and a 200-Line Cap</title>
      <dc:creator>Chen Zhang</dc:creator>
      <pubDate>Thu, 02 Apr 2026 04:25:07 +0000</pubDate>
      <link>https://dev.to/chen_zhang_bac430bc7f6b95/claude-codes-memory-4-layers-of-complexity-still-just-grep-and-a-200-line-cap-2kn9</link>
      <guid>https://dev.to/chen_zhang_bac430bc7f6b95/claude-codes-memory-4-layers-of-complexity-still-just-grep-and-a-200-line-cap-2kn9</guid>
      <description>&lt;p&gt;Claude Code's memory system? Grep search, a 200-line index cap, and zero cross-agent sharing. After studying the leaked source, it's way more primitive than the hype suggests.&lt;/p&gt;

&lt;p&gt;It does a fair amount of work — multiple layers, even a mechanism that lets the agent "dream" to consolidate memories. Sounds sophisticated. But peel it open and you find an agent trapped in its own sandbox. Memory can't leave, and it doesn't last long.&lt;/p&gt;

&lt;p&gt;Let's break it down layer by layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: CLAUDE.md — Rules You Write for the Agent
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; is a Markdown file you create and place in your project root. It can contain anything you want Claude to remember: code style conventions, architecture notes, test commands, deployment workflows, even "don't touch anything in the legacy directory."&lt;/p&gt;

&lt;p&gt;Claude Code loads the entire file into context at the start of every session, so the shorter you keep it, the more reliably its rules are followed.&lt;/p&gt;

&lt;p&gt;It supports three scopes: project-level &lt;code&gt;CLAUDE.md&lt;/code&gt; in the project root, personal-level at &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;, and organization-level in enterprise configs. You write it, Claude reads it. You don't write it, Claude has nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: Auto Memory — The Agent Takes Notes
&lt;/h2&gt;

&lt;p&gt;CLAUDE.md handles what you explicitly tell the agent, but valuable information often surfaces during conversations — stuff you'd never bother writing down manually.&lt;/p&gt;

&lt;p&gt;Auto Memory does exactly this: Claude decides what's worth remembering during work and writes it into a dedicated memory directory.&lt;/p&gt;

&lt;p&gt;It categorizes memories into four types: &lt;code&gt;user&lt;/code&gt; (role and preferences), &lt;code&gt;feedback&lt;/code&gt; (corrections and confirmations), &lt;code&gt;project&lt;/code&gt; (decisions and context), and &lt;code&gt;reference&lt;/code&gt; (external resource locations).&lt;/p&gt;

&lt;p&gt;These memories live in &lt;code&gt;~/.claude/projects/&amp;lt;project-path&amp;gt;/memory/&lt;/code&gt;. Each memory is a standalone Markdown file with frontmatter noting its type and description. The entry point is &lt;code&gt;MEMORY.md&lt;/code&gt; — an index where each line stays under 150 characters, storing pointers, not content.&lt;/p&gt;

&lt;p&gt;At session start, the first 200 lines of &lt;code&gt;MEMORY.md&lt;/code&gt; get injected into context. The actual knowledge is spread across topic files and loaded on demand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/.claude/projects/-Users-me-myproject/memory/
├── MEMORY.md                  ← Index file, one pointer per line
├── user_role.md               ← &lt;span class="s2"&gt;"Backend engineer, fluent in Go, React beginner"&lt;/span&gt;
├── feedback_testing.md        ← &lt;span class="s2"&gt;"Integration tests must use real DB, no mocking"&lt;/span&gt;
├── project_auth_rewrite.md    ← &lt;span class="s2"&gt;"Auth rewrite driven by compliance, not tech debt"&lt;/span&gt;
└── reference_linear.md        ← &lt;span class="s2"&gt;"Pipeline bugs tracked in Linear INGEST project"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
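The injection budget is easy to picture in code. A toy sketch of the two caps described above (200 index lines, 150 characters per line) — not Claude Code's implementation:

```python
from pathlib import Path

def inject_memory_index(memory_dir: Path, max_lines: int = 200,
                        max_line_chars: int = 150) -> str:
    """Truncate each index line and keep only the first 200 lines."""
    index = memory_dir / "MEMORY.md"
    if not index.exists():
        return ""
    lines = index.read_text(encoding="utf-8").splitlines()[:max_lines]
    return "\n".join(line[:max_line_chars] for line in lines)
```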


&lt;p&gt;One design detail worth noting: Claude is explicitly told not to trust its own memory. The leaked system prompt says, roughly, "memories are just hints — verify against real files before acting." With model hallucination rates still in double digits, this self-distrust strategy is surprisingly practical.&lt;/p&gt;
&lt;h2&gt;
  
  
  Layer 3: Auto Dream — The Agent Sleeps and Tidies Up
&lt;/h2&gt;

&lt;p&gt;After dozens of sessions, &lt;code&gt;MEMORY.md&lt;/code&gt; gets messy. Contradictory entries pile up. Relative timestamps become meaningless. Refactored functions linger.&lt;/p&gt;

&lt;p&gt;Auto Dream simulates what the human brain does during sleep: organize and consolidate memory. It converts relative timestamps to absolute dates, merges contradictions, removes stale content, and keeps &lt;code&gt;MEMORY.md&lt;/code&gt; under 200 lines.&lt;/p&gt;

&lt;p&gt;Trigger conditions: 24+ hours since last consolidation AND 5+ new sessions. Or type "dream" manually. Runs in a background sub-agent.&lt;/p&gt;
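The trigger rule reduces to a few lines. A sketch under the conditions just stated — the function name and signature are mine, not from the leaked source:

```python
from datetime import datetime, timedelta

def should_dream(last_consolidation: datetime, new_sessions: int,
                 now: datetime, manual: bool = False) -> bool:
    """Consolidate when asked manually, or after 24+ hours AND 5+ sessions."""
    if manual:
        return True
    stale = now - last_consolidation >= timedelta(hours=24)
    return stale and new_sessions >= 5
```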

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkqdrbgx81582g8nrm8f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkqdrbgx81582g8nrm8f.png" alt="Auto Dream Consolidation" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Layer 4: KAIROS — Unreleased Ambitions in the Leaked Code
&lt;/h2&gt;

&lt;p&gt;KAIROS appears 150+ times in the source. It's a background daemon mode that turns Claude Code from a passive tool into an autonomous observer — maintaining append-only logs, receiving &lt;code&gt;&amp;lt;tick&amp;gt;&lt;/code&gt; signals, and deciding whether to act. It integrates &lt;code&gt;autoDream&lt;/code&gt; but runs the full observe-think-act loop.&lt;/p&gt;

&lt;p&gt;Currently behind a compile-time feature flag. More exploration than product.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where This System Hits Its Ceiling
&lt;/h2&gt;

&lt;p&gt;After extended use, the problems become clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;200-line hard cap.&lt;/strong&gt; Run a project for months, and memories compete for space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grep-only retrieval.&lt;/strong&gt; No semantic understanding. You remember "port conflict during deployment" but the memory says "modified docker-compose port mapping" — grep misses it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Details get lost.&lt;/strong&gt; Auto Memory records at coarse granularity. Code snippets, debugging steps, discussion context — mostly dropped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compounding complexity.&lt;/strong&gt; Each layer patches the last, stacking complexity without fixing root constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No cross-tool sharing.&lt;/strong&gt; Switch to OpenCode or Codex CLI = start from zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Still short-term memory.&lt;/strong&gt; "How did we fix Redis last week?" — hopeless for long-span recall.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf1qck8xq6uxgyqf5fsx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmf1qck8xq6uxgyqf5fsx.png" alt="Agent Memory Isolation" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This isn't a design failure. Single agent, session granularity, local storage — the ceiling is architectural.&lt;/p&gt;
&lt;h2&gt;
  
  
  memsearch: Memory Should Outlive Any Single Agent
&lt;/h2&gt;

&lt;p&gt;This is the core idea behind memsearch. Agents change, tools change, but project knowledge shouldn't disappear with them. memsearch pulls memory out of the agent into an independent persistence layer.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│         Agent Plugins (User Layer)       │
│  Claude Code · OpenClaw · OpenCode · Codex│
└──────────────────┬──────────────────────┘
                   ↓
┌──────────────────┴──────────────────────┐
│       memsearch CLI / Python API         │
│            (Developer Layer)             │
└──────────────────┬──────────────────────┘
                   ↓
┌──────────────────┴──────────────────────┐
│    Core: Chunker → Embedder → Milvus     │
│            (Engine Layer)                │
└──────────────────┬──────────────────────┘
                   ↓
┌──────────────────┴──────────────────────┐
│     Markdown Files (Source of Truth)      │
│          (Persistent Storage)            │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Memory writing is automatic.&lt;/strong&gt; Each conversation gets summarized by Haiku and appended to that day's Markdown file, then asynchronously vectorized. Background, invisible to users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recall triggers through skills.&lt;/strong&gt; A &lt;code&gt;memory-recall&lt;/code&gt; skill runs in a forked sub-agent (&lt;code&gt;context: fork&lt;/code&gt;) — zero token overhead in the main session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid retrieval.&lt;/strong&gt; Semantic vectors + BM25 keyword matching + RRF fusion. Ask "how did we fix that Redis timeout" — semantic search gets it. Say "search handleTimeout" — BM25 nails it.&lt;/p&gt;
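Reciprocal Rank Fusion itself is only a few lines. Here is a generic sketch (not memsearch's actual code) that merges a semantic ranking and a BM25 ranking into one list:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores the sum of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document that places well in both rankings beats one that tops only a single list, which is exactly the behavior you want when fusing semantic and keyword results.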

&lt;p&gt;The sub-agent drills from L1 (semantic search, truncated previews) through L2 (full context expansion) to L3 (raw conversation transcripts) as needed.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L1: Semantic search → top results with scores and previews
L2: memsearch expand &amp;lt;hash&amp;gt; → full paragraph, all details
L3: memsearch transcript → raw user/agent conversation + tool calls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Cross-Agent Sharing
&lt;/h3&gt;

&lt;p&gt;This is the fundamental difference. memsearch supports Claude Code, OpenClaw, OpenCode, and Codex CLI — all sharing the same Markdown memory format with collection names computed from project paths.&lt;/p&gt;

&lt;p&gt;Memory written from any agent is searchable by all others.&lt;/p&gt;
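A plausible way to compute a shared collection name from the project path — this is a hypothetical sketch, and memsearch's real scheme may differ:

```python
import hashlib
import re

def collection_name(project_path: str) -> str:
    """Derive a stable, Milvus-safe collection name from a project path."""
    # Hash the full path so distinct projects never collide...
    digest = hashlib.sha256(project_path.encode("utf-8")).hexdigest()[:8]
    # ...and keep a readable slug of the path for debugging.
    slug = re.sub(r"[^0-9A-Za-z]+", "_", project_path).strip("_").lower()
    return f"mem_{slug[-32:]}_{digest}"
```

Because the name is a pure function of the path, every agent pointed at the same project resolves to the same collection.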

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2edqsyyhedu77e48ecu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2edqsyyhedu77e48ecu.png" alt="Four Agents One Memory" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Spend an afternoon in Claude Code debugging a deployment, then switch to OpenCode the next day — it finds yesterday's memories and gives the right answer. Point Milvus at Zilliz Cloud for team collaboration, and new members get instant project context without digging through Slack.&lt;/p&gt;
&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Claude Code&lt;/span&gt;
/plugin marketplace add zilliztech/memsearch
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;memsearch

&lt;span class="c"&gt;# Python&lt;/span&gt;
uv tool &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"memsearch[onnx]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;CLI and Python API for custom integrations:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memsearch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MemSearch&lt;/span&gt;

&lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemSearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Redis config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Milvus Lite for local (zero config), Zilliz Cloud for teams (free tier), or self-hosted Docker. Default ONNX embeddings run on CPU — no GPU, no API calls needed.&lt;/p&gt;



&lt;p&gt;Claude Code's memory design has real value, and KAIROS is worth watching. But single-agent memory optimization can only go so far. Memory should outlive any single agent.&lt;/p&gt;

&lt;p&gt;Full analysis with architecture diagrams: &lt;a href="https://zc277584121.github.io/ai-coding/2026/04/01/claude-code-memory-vs-memsearch.html" rel="noopener noreferrer"&gt;https://zc277584121.github.io/ai-coding/2026/04/01/claude-code-memory-vs-memsearch.html&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/zilliztech" rel="noopener noreferrer"&gt;
        zilliztech
      &lt;/a&gt; / &lt;a href="https://github.com/zilliztech/memsearch" rel="noopener noreferrer"&gt;
        memsearch
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A Markdown-first memory system, a standalone library for any AI agent. Inspired by OpenClaw.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;
  &lt;a rel="noopener noreferrer" href="https://github.com/zilliztech/memsearch/assets/logo-icon.jpg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fzilliztech%2Fmemsearch%2FHEAD%2Fassets%2Flogo-icon.jpg" alt="" width="100"&gt;&lt;/a&gt;
   
  memsearch
&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;
  &lt;strong&gt;Cross-platform semantic memory for AI coding agents.&lt;/strong&gt;
&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://pypi.org/project/memsearch/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/7628d1b8b01de0290bffc19d387df0ecce9b15db945c2807b4cd808d850de2f5/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f6d656d7365617263683f7374796c653d666c61742d73717561726526636f6c6f723d626c7565" alt="PyPI"&gt;&lt;/a&gt;
  &lt;a href="https://zilliztech.github.io/memsearch/platforms/claude-code/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/92ac413ff856d71133f14644686eb14c7f1cb6824fa85a67a2da7a31df9e2e59/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c617564655f436f64652d706c7567696e2d6339373533393f7374796c653d666c61742d737175617265266c6f676f3d636c61756465266c6f676f436f6c6f723d7768697465" alt="Claude Code"&gt;&lt;/a&gt;
  &lt;a href="https://zilliztech.github.io/memsearch/platforms/openclaw/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/0baf12e20f6f37a4d94286248d503a3e9a4defab55dc4de3857d57d13c472606/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4f70656e436c61772d706c7567696e2d3461396566663f7374796c653d666c61742d737175617265" alt="OpenClaw"&gt;&lt;/a&gt;
  &lt;a href="https://zilliztech.github.io/memsearch/platforms/opencode/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/07a2aedbb58456dfbc3de8f1e389b1b2e793bd8fb3bcf45bd3e782b13a7b6dc8/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4f70656e436f64652d706c7567696e2d3232633535653f7374796c653d666c61742d737175617265" alt="OpenCode"&gt;&lt;/a&gt;
  &lt;a href="https://zilliztech.github.io/memsearch/platforms/codex/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/478052163166392ec2292ad6be76208d5a983f235d51c141a8af3119db300caf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436f6465785f434c492d706c7567696e2d6666366233353f7374796c653d666c61742d737175617265" alt="Codex CLI"&gt;&lt;/a&gt;
  &lt;a href="https://pypi.org/project/memsearch/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/63a7563e96ce73fab13ca542e9ff27338cbeb85f4a8cf946541591b432269b4d/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d253345253344332e31302d626c75653f7374796c653d666c61742d737175617265266c6f676f3d707974686f6e266c6f676f436f6c6f723d7768697465" alt="Python"&gt;&lt;/a&gt;
  &lt;a href="https://github.com/zilliztech/memsearch/blob/main/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/c6e4c5cb5b9f444da0b9044aa7e7de5f33d704d3b23c827255331db88fe5f05b/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f7a696c6c697a746563682f6d656d7365617263683f7374796c653d666c61742d737175617265" alt="License"&gt;&lt;/a&gt;
  &lt;a href="https://github.com/zilliztech/memsearch/actions/workflows/test.yml" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3e8327350f6a6cbb8388470922edc34cf63014fb3a740416c3fd29a86840b4fe/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f616374696f6e732f776f726b666c6f772f7374617475732f7a696c6c697a746563682f6d656d7365617263682f746573742e796d6c3f6272616e63683d6d61696e267374796c653d666c61742d737175617265" alt="Tests"&gt;&lt;/a&gt;
  &lt;a href="https://zilliztech.github.io/memsearch/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/539181fdc55ff80db540dcf956b016ebe52e6156d8a092fdf179769c743a2bdd/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f646f63732d6d656d7365617263682d626c75653f7374796c653d666c61742d737175617265" alt="Docs"&gt;&lt;/a&gt;
  &lt;a href="https://github.com/zilliztech/memsearch/stargazers" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/c37b9a35754f6f0d2b99d23cd673d007ad264d51f0782afc7ce162fd6c6e0c00/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f7a696c6c697a746563682f6d656d7365617263683f7374796c653d666c61742d737175617265" alt="Stars"&gt;&lt;/a&gt;
  &lt;a href="https://discord.com/invite/FG6hMJStWu" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/83be151409ef6198d98bba837d8fa015ae9b9f6f756f7b4b17f5e17e21edbed6/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446973636f72642d636861742d3732383964613f7374796c653d666c61742d737175617265266c6f676f3d646973636f7264266c6f676f436f6c6f723d7768697465" alt="Discord"&gt;&lt;/a&gt;
  &lt;a href="https://x.com/zilliz_universe" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/74562eb34e3695b6b198445771d8394cfbbd0ceaa4774bbb85658e8cd48f218b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f666f6c6c6f772d2534307a696c6c697a5f5f756e6976657273652d3030303030303f7374796c653d666c61742d737175617265266c6f676f3d78266c6f676f436f6c6f723d7768697465" alt="X (Twitter)"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
  &lt;a rel="noopener noreferrer" href="https://private-user-images.githubusercontent.com/17022025/572363076-427b7152-bc16-408c-a8b0-59a2b05fd1e0.gif?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzUxMDQyMDcsIm5iZiI6MTc3NTEwMzkwNywicGF0aCI6Ii8xNzAyMjAyNS81NzIzNjMwNzYtNDI3YjcxNTItYmMxNi00MDhjLWE4YjAtNTlhMmIwNWZkMWUwLmdpZj9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNjA0MDIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjYwNDAyVDA0MjUwN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWM0NzdiYzliMjY5M2UyNzQ4MWE1YzExZWJjZGJkNDEyOGJhNzc1NTczNGMwM2IzOGI2ZjllYWE1YTIwMzRlMGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.eS1L8szM2GuxzN2Y7bI-rJgIwpkmTp3oHr591ydvBps"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fprivate-user-images.githubusercontent.com%2F17022025%2F572363076-427b7152-bc16-408c-a8b0-59a2b05fd1e0.gif%3Fjwt%3DeyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzUxMDQyMDcsIm5iZiI6MTc3NTEwMzkwNywicGF0aCI6Ii8xNzAyMjAyNS81NzIzNjMwNzYtNDI3YjcxNTItYmMxNi00MDhjLWE4YjAtNTlhMmIwNWZkMWUwLmdpZj9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNjA0MDIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjYwNDAyVDA0MjUwN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWM0NzdiYzliMjY5M2UyNzQ4MWE1YzExZWJjZGJkNDEyOGJhNzc1NTczNGMwM2IzOGI2ZjllYWE1YTIwMzRlMGEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.eS1L8szM2GuxzN2Y7bI-rJgIwpkmTp3oHr591ydvBps" alt="memsearch demo" width="800"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Why memsearch?&lt;/h3&gt;
&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;🌐 &lt;strong&gt;All Platforms, One Memory&lt;/strong&gt; — memories flow across &lt;a href="https://github.com/zilliztech/memsearch/plugins/claude-code/README.md" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;, &lt;a href="https://github.com/zilliztech/memsearch/plugins/openclaw/README.md" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt;, &lt;a href="https://github.com/zilliztech/memsearch/plugins/opencode/README.md" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt;, and &lt;a href="https://github.com/zilliztech/memsearch/plugins/codex/README.md" rel="noopener noreferrer"&gt;Codex CLI&lt;/a&gt;. A conversation in one agent becomes searchable context in all others — no extra setup&lt;/li&gt;
&lt;li&gt;👥 &lt;strong&gt;For Agent Users&lt;/strong&gt;, install a plugin and get persistent memory with zero effort; &lt;strong&gt;for Agent Developers&lt;/strong&gt;, use the full &lt;a href="https://zilliztech.github.io/memsearch/cli/" rel="nofollow noopener noreferrer"&gt;CLI&lt;/a&gt; and &lt;a href="https://zilliztech.github.io/memsearch/python-api/" rel="nofollow noopener noreferrer"&gt;Python API&lt;/a&gt; to build memory and harness engineering into your own agents&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;Markdown is the source of truth&lt;/strong&gt; — inspired by &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt;. Your memories are just &lt;code&gt;.md&lt;/code&gt; files — human-readable, editable, version-controllable. Milvus is a "shadow index": a derived, rebuildable cache&lt;/li&gt;
&lt;li&gt;🔍 &lt;strong&gt;Progressive retrieval, hybrid search, smart dedup, live sync&lt;/strong&gt; — 3-layer recall (search → expand → transcript); dense vector + BM25 sparse + RRF reranking; SHA-256 content hashing skips unchanged content; file watcher auto-indexes in real time&lt;/li&gt;
&lt;/ul&gt;
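&lt;p&gt;The hybrid-search step in that list can be sketched in a few lines. This is an illustrative reimplementation of Reciprocal Rank Fusion (RRF), not memsearch's actual code; the function name is invented and &lt;code&gt;k=60&lt;/code&gt; is the damping constant commonly used in the RRF literature.&lt;/p&gt;

```python
# Illustrative Reciprocal Rank Fusion (RRF): merge a dense-vector ranking
# and a BM25 ranking into one list. Not memsearch's actual code.

def rrf_merge(dense_ranked, sparse_ranked, k=60):
    """Fuse two ranked lists of doc IDs; higher fused score ranks first."""
    scores = {}
    for ranking in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it ranks;
            # k damps the advantage of being at the very top of one list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf_merge(["a", "b", "c"], ["b", "d", "a"]))  # "b" wins: top-2 in both lists
```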




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🧑‍💻&lt;/h2&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/zilliztech/memsearch" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>coding</category>
      <category>devtools</category>
      <category>milvus</category>
    </item>
    <item>
      <title>Claude Code's Leaked Source: A Real-World Masterclass in Harness Engineering</title>
      <dc:creator>Chen Zhang</dc:creator>
      <pubDate>Wed, 01 Apr 2026 08:42:08 +0000</pubDate>
      <link>https://dev.to/chen_zhang_bac430bc7f6b95/claude-codes-leaked-source-a-real-world-masterclass-in-harness-engineering-2d9n</link>
      <guid>https://dev.to/chen_zhang_bac430bc7f6b95/claude-codes-leaked-source-a-real-world-masterclass-in-harness-engineering-2d9n</guid>
      <description>&lt;p&gt;Earlier this year, Mitchell Hashimoto coined the term "harness engineering" â€” the discipline of building everything &lt;em&gt;around&lt;/em&gt; the model that makes an AI agent actually work in production. OpenAI wrote about it. Anthropic published guides. Martin Fowler analyzed it.&lt;/p&gt;

&lt;p&gt;After studying Claude Code's leaked source — particularly its memory system, caching architecture, and security layers — the harness turns out to be far more interesting than the LLM calls themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution: From Prompt to Harness
&lt;/h2&gt;

&lt;p&gt;The AI engineering discipline has shifted rapidly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2023-2024: Prompt Engineering    â†’ "How to ask the model"
2025:      Context Engineering   â†’ "What information to feed the model"
2026:      Harness Engineering   â†’ "How the entire system runs around the model"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prompt engineering is the question. Context engineering is the blueprint. Harness engineering is the construction site — tools, permissions, safety checks, cost controls, feedback loops, and state management that let the agent operate reliably.&lt;/p&gt;

&lt;p&gt;The leaked Claude Code source is a concrete case study for each of these harness layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Cache Economics: A Cost Center, Not an Optimization
&lt;/h2&gt;

&lt;p&gt;One of the most revealing modules is &lt;code&gt;promptCacheBreakDetection.ts&lt;/code&gt;. It tracks &lt;strong&gt;14 distinct cache invalidation vectors&lt;/strong&gt; and uses "sticky latches" — mechanisms that prevent mode switches from breaking cached prompt prefixes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”
â”‚              Prompt Cache Layer                  â”‚
â”‚                                                  â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”         â”‚
â”‚  â”‚ Vector 1â”‚  â”‚ Vector 2â”‚  â”‚Vector 14â”‚  ...     â”‚
â”‚  â”‚ Mode    â”‚  â”‚ Tool    â”‚  â”‚ Context â”‚         â”‚
â”‚  â”‚ Switch  â”‚  â”‚ Change  â”‚  â”‚ Rotate  â”‚         â”‚
â”‚  â””â”€â”€â”€â”€â”¬â”€â”€â”€â”€â”˜  â””â”€â”€â”€â”€â”¬â”€â”€â”€â”€â”˜  â””â”€â”€â”€â”€â”¬â”€â”€â”€â”€â”˜         â”‚
â”‚       â”‚            â”‚            â”‚                â”‚
â”‚       â–¼            â–¼            â–¼                â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”       â”‚
â”‚  â”‚         Sticky Latch Layer           â”‚       â”‚
â”‚  â”‚  "Hold current prefix until forced"  â”‚       â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”¬â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜       â”‚
â”‚                         â”‚                        â”‚
â”‚                         â–¼                        â”‚
â”‚              â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”                 â”‚
â”‚              â”‚  Cache Decision  â”‚                â”‚
â”‚              â”‚  KEEP / BREAK   â”‚                 â”‚
â”‚              â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜                 â”‚
â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reframes prompt caching from a performance trick into a &lt;strong&gt;billing optimization problem&lt;/strong&gt;. At scale, each cache miss is real money. The code treats cache management with the same rigor as database query planning — monitoring invalidation patterns, measuring hit rates, and making explicit keep-or-break decisions.&lt;/p&gt;

&lt;p&gt;The takeaway for agent builders: if your agent makes repeated API calls (and it does), prompt caching is not optional — it's a cost center that needs active management.&lt;/p&gt;
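&lt;p&gt;The keep-or-break accounting can be sketched minimally. The real module is TypeScript (&lt;code&gt;promptCacheBreakDetection.ts&lt;/code&gt;) and tracks 14 separate vectors; this Python sketch collapses them into a single question — did the cacheable prefix change? — and its class and method names are invented for illustration.&lt;/p&gt;

```python
# Hedged sketch of cache-break accounting, not the actual implementation.
import hashlib

class PrefixCacheTracker:
    """Decide KEEP vs BREAK by hashing the prompt prefix sent to the API."""

    def __init__(self):
        self.last_hash = None
        self.hits = 0
        self.misses = 0

    def observe(self, system_prompt, tool_schemas):
        # Anything serialized into the prefix (mode, tool list, system text)
        # is an invalidation vector: change it and the cached prefix is lost.
        prefix = system_prompt + "\n" + tool_schemas
        h = hashlib.sha256(prefix.encode()).hexdigest()
        if h == self.last_hash:
            self.hits += 1
            return "KEEP"
        self.last_hash = h
        self.misses += 1
        return "BREAK"

t = PrefixCacheTracker()
print(t.observe("You are a coding agent.", "[bash, edit]"))  # BREAK (cold start)
print(t.observe("You are a coding agent.", "[bash, edit]"))  # KEEP  (prefix stable)
print(t.observe("You are a coding agent.", "[bash]"))        # BREAK (tool set changed)
```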

&lt;h2&gt;
  
  
  Multi-Agent Coordination: The Prompt IS the Harness
&lt;/h2&gt;

&lt;p&gt;Claude Code's sub-agent system is internally called "swarms." The surprising part: coordination between agents is not handled by a state machine, a DAG executor, or an orchestration framework. It's done through &lt;strong&gt;natural language prompts&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”
â”‚            Main Agent (Orchestrator)       â”‚
â”‚                                           â”‚
â”‚  System prompt includes:                  â”‚
â”‚  - "Do not rubber-stamp weak work"        â”‚
â”‚  - Tool permission boundaries             â”‚
â”‚  - Task decomposition strategy            â”‚
â”‚                                           â”‚
â”‚         â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”¼â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”           â”‚
â”‚         â–¼          â–¼          â–¼           â”‚
â”‚    â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”   â”‚
â”‚    â”‚ Sub-    â”‚ â”‚ Sub-    â”‚ â”‚ Sub-    â”‚   â”‚
â”‚    â”‚ Agent A â”‚ â”‚ Agent B â”‚ â”‚ Agent C â”‚   â”‚
â”‚    â”‚         â”‚ â”‚         â”‚ â”‚         â”‚   â”‚
â”‚    â”‚ Isolatedâ”‚ â”‚ Isolatedâ”‚ â”‚ Isolatedâ”‚   â”‚
â”‚    â”‚ Context â”‚ â”‚ Context â”‚ â”‚ Context â”‚   â”‚
â”‚    â”‚ Scoped  â”‚ â”‚ Scoped  â”‚ â”‚ Scoped  â”‚   â”‚
â”‚    â”‚ Tools   â”‚ â”‚ Tools   â”‚ â”‚ Tools   â”‚   â”‚
â”‚    â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜   â”‚
â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each sub-agent runs in an isolated context with specific tool permissions. The orchestrator coordinates them through instructions embedded in prompts — quality standards, scope boundaries, conflict resolution rules. All in natural language.&lt;/p&gt;

&lt;p&gt;This is a strong signal: for LLM-based multi-agent systems, traditional orchestration frameworks may add unnecessary complexity. The model already understands natural language instructions. Why build a state machine when a well-written prompt can express the same coordination logic?&lt;/p&gt;
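&lt;p&gt;The point is easy to see in code: the "orchestration layer" reduces to string assembly. This sketch is illustrative only — the wording and function name are invented, not Claude Code's actual prompts.&lt;/p&gt;

```python
# Minimal sketch of prompt-as-harness coordination: no state machine,
# no DAG executor, just assembling a scoped instruction block.

def build_subagent_prompt(task, allowed_tools, quality_bar):
    tools = ", ".join(allowed_tools)
    return (
        f"You are a sub-agent. Your task: {task}\n"
        f"You may use ONLY these tools: {tools}.\n"
        f"Quality bar: {quality_bar}\n"
        "Stay within scope. Return a summary, not your full transcript."
    )

prompt = build_subagent_prompt(
    task="write unit tests for the parser",
    allowed_tools=["read_file", "write_file", "run_tests"],
    quality_bar="Do not rubber-stamp weak work.",
)
print(prompt)
```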

&lt;h2&gt;
  
  
  Memory and State: Progressive Disclosure at Every Layer
&lt;/h2&gt;

&lt;p&gt;One of the more practical patterns in the codebase is the &lt;strong&gt;file-based memory system&lt;/strong&gt; with progressive disclosure.&lt;/p&gt;

&lt;p&gt;The design uses a two-tier structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”
â”‚             Memory Architecture               â”‚
â”‚                                               â”‚
â”‚  Tier 1: Index (always loaded)                â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”    â”‚
â”‚  â”‚ MEMORY.md                            â”‚    â”‚
â”‚  â”‚                                      â”‚    â”‚
â”‚  â”‚ - [User role](user_role.md) â€” ...    â”‚    â”‚
â”‚  â”‚ - [Testing](feedback_test.md) â€” ...  â”‚    â”‚
â”‚  â”‚ - [Auth rewrite](project_auth.md)    â”‚    â”‚
â”‚  â”‚                                      â”‚    â”‚
â”‚  â”‚ Cost: ~200 tokens (one-line hooks)   â”‚    â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜    â”‚
â”‚                                               â”‚
â”‚  Tier 2: Full content (loaded on demand)      â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”‚
â”‚  â”‚user_role.mdâ”‚ â”‚feedback_   â”‚ â”‚project_  â”‚ â”‚
â”‚  â”‚            â”‚ â”‚test.md     â”‚ â”‚auth.md   â”‚ â”‚
â”‚  â”‚ Full user  â”‚ â”‚ Full test  â”‚ â”‚ Full     â”‚ â”‚
â”‚  â”‚ context    â”‚ â”‚ guidelines â”‚ â”‚ context  â”‚ â”‚
â”‚  â”‚ (~500 tok) â”‚ â”‚ (~300 tok) â”‚ â”‚(~400 tok)â”‚ â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â”‚
â”‚                                               â”‚
â”‚  Only fetched when the index line matches     â”‚
â”‚  the current task context                     â”‚
â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The index (&lt;code&gt;MEMORY.md&lt;/code&gt;) is always loaded into the context window — cheap, at ~200 tokens. Full memory files are only fetched when a one-line hook in the index matches the current task. This keeps the context window lean while still giving the agent access to rich historical context when needed.&lt;/p&gt;

&lt;p&gt;This is essentially the same pattern as database indexing: maintain a small, fast lookup structure that points to larger data. Applied to LLM context windows, it's a practical solution to the "agents forget everything between sessions" problem without paying the token cost of loading everything upfront.&lt;/p&gt;

&lt;p&gt;The memory system also categorizes memories by type — user preferences, feedback corrections, project context, external references — each with different update and retrieval patterns. This is more sophisticated than a flat memory store and mirrors how humans organize knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security: 23 Checks and Adversarial Hardening
&lt;/h2&gt;

&lt;p&gt;Every bash command execution passes through &lt;strong&gt;23 security checks&lt;/strong&gt;. This is not a theoretical threat model — it's the result of real-world adversarial usage.&lt;/p&gt;

&lt;p&gt;The defenses include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-width character injection&lt;/strong&gt; — invisible Unicode characters that can alter command semantics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zsh expansion tricks&lt;/strong&gt; — shell-specific syntax that can escape sandboxes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native client authentication&lt;/strong&gt; — a DRM-style mechanism where the Zig HTTP layer computes a hash (&lt;code&gt;cch=56670&lt;/code&gt; placeholder replaced at transport time) to verify client legitimacy
&lt;/li&gt;
&lt;/ul&gt;
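&lt;p&gt;To make the first defense concrete, here is a hedged sketch of what a zero-width check might look like. The character set and function name are my own illustration, not the actual Claude Code implementation.&lt;/p&gt;

```python
# One check out of the 23: reject zero-width Unicode that can smuggle
# altered command semantics past a human reviewer. Illustrative only.

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def check_zero_width(command):
    found = [c for c in command if c in ZERO_WIDTH]
    if found:
        return False, f"blocked: {len(found)} zero-width character(s) detected"
    return True, "ok"

print(check_zero_width("rm -rf ./build"))         # passes
print(check_zero_width("rm -rf .\u200b/build"))   # blocked: hides a ZWSP
```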

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input
    â”‚
    â–¼
â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”
â”‚  23 Security Checks  â”‚
â”‚                      â”‚
â”‚  â€¢ Zero-width chars  â”‚
â”‚  â€¢ Zsh expansion     â”‚
â”‚  â€¢ Path traversal    â”‚
â”‚  â€¢ Injection patternsâ”‚
â”‚  â€¢ ...19 more        â”‚
â”‚                      â”‚
â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”¬â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜
           â”‚
     PASS? â”‚
     â”Œâ”€â”€â”€â”€â”€â”´â”€â”€â”€â”€â”€â”
     â”‚           â”‚
    YES         NO
     â”‚           â”‚
     â–¼           â–¼
  Execute     Block +
  Command     Log Event
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lesson: agent security is not just sandboxing. It's adversarial input hardening at every boundary. If an agent can execute shell commands, assume someone will try to make it execute the wrong ones — intentionally or not.&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to Call the Model
&lt;/h2&gt;

&lt;p&gt;Claude Code detects user frustration using &lt;strong&gt;regex&lt;/strong&gt;, not LLM inference.&lt;/p&gt;

&lt;p&gt;Patterns like &lt;code&gt;"wtf"&lt;/code&gt;, &lt;code&gt;"so frustrating"&lt;/code&gt;, &lt;code&gt;"this is broken"&lt;/code&gt; are matched via simple pattern rules and trigger tone adjustments in subsequent responses. No API call needed.&lt;/p&gt;
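&lt;p&gt;The whole mechanism fits in a few lines. The three patterns below come from the article; the real list is presumably longer.&lt;/p&gt;

```python
# Frustration detection as described: plain regex, no API call.
import re

FRUSTRATION = re.compile(r"\b(wtf|so frustrating|this is broken)\b", re.IGNORECASE)

def detect_frustration(message):
    return bool(FRUSTRATION.search(message))

print(detect_frustration("WTF, it deleted my branch"))  # True
print(detect_frustration("looks good, ship it"))        # False
```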

&lt;p&gt;This sounds almost trivially simple. But it embodies a core harness engineering principle: &lt;strong&gt;use the cheapest, fastest tool that solves the problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The codebase applies this principle consistently:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;th&gt;Why not LLM?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Frustration detection&lt;/td&gt;
&lt;td&gt;Regex&lt;/td&gt;
&lt;td&gt;Fast, free, reliable enough&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal rendering&lt;/td&gt;
&lt;td&gt;React + Ink with Int32Array buffers&lt;/td&gt;
&lt;td&gt;Rendering is a solved problem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache invalidation tracking&lt;/td&gt;
&lt;td&gt;Dedicated TS module&lt;/td&gt;
&lt;td&gt;Deterministic logic, no ambiguity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Client auth&lt;/td&gt;
&lt;td&gt;Zig HTTP layer hash&lt;/td&gt;
&lt;td&gt;Security must be deterministic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LLM calls are reserved for tasks that genuinely require language understanding. Everything else uses conventional engineering. A $0 regex beats a $0.01 model call when accuracy is comparable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rendering Layer: Game Engine Meets Terminal
&lt;/h2&gt;

&lt;p&gt;An unexpected finding: the CLI terminal interface is built with &lt;strong&gt;React + Ink&lt;/strong&gt; and uses game-engine-style rendering optimizations.&lt;/p&gt;

&lt;p&gt;The implementation uses &lt;code&gt;Int32Array&lt;/code&gt; buffers and patch-based updates — similar to how game engines minimize draw calls by only updating changed pixels. The team claims this achieves &lt;strong&gt;~50x fewer &lt;code&gt;stringWidth&lt;/code&gt; calls&lt;/strong&gt; during token streaming.&lt;/p&gt;

&lt;p&gt;This makes sense when you think about it. A terminal UI streaming LLM output has similar challenges to a game render loop: frequent partial updates, variable-length content, frame-rate sensitivity. The harness applies domain-appropriate engineering rather than treating the terminal as an afterthought.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Big Picture: Model as Commodity, Harness as Moat
&lt;/h2&gt;

&lt;p&gt;The leaked source paints a clear picture of where the real engineering effort lives in a production AI agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”
â”‚                 Agent Harness                    â”‚
â”‚                                                  â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”‚
â”‚  â”‚ Cache    â”‚ â”‚ Security â”‚ â”‚ Tool Orchestrationâ”‚ â”‚
â”‚  â”‚ Economicsâ”‚ â”‚ Hardeningâ”‚ â”‚ &amp;amp; Permissions     â”‚ â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”‚
â”‚  â”‚ Memory &amp;amp; â”‚ â”‚ State    â”‚ â”‚ Multi-Agent      â”‚ â”‚
â”‚  â”‚ Retrievalâ”‚ â”‚ Persist  â”‚ â”‚ Coordination     â”‚ â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â”‚
â”‚  â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â” â”‚
â”‚  â”‚ Cost     â”‚ â”‚ UI/UX    â”‚ â”‚ Observability    â”‚ â”‚
â”‚  â”‚ Control  â”‚ â”‚ Renderingâ”‚ â”‚ &amp;amp; Logging        â”‚ â”‚
â”‚  â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜ â”‚
â”‚                                                  â”‚
â”‚              â”Œâ”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”                    â”‚
â”‚              â”‚   LLM API    â”‚                    â”‚
â”‚              â”‚  (the easy   â”‚                    â”‚
â”‚              â”‚    part)     â”‚                    â”‚
â”‚              â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜                    â”‚
â”‚                                                  â”‚
â””â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”˜
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM API call is the smallest box. Everything around it — caching, memory, security, cost control, rendering, coordination — is the actual product.&lt;/p&gt;

&lt;p&gt;For anyone building AI agents: the model selection matters less than you think. The harness is where the engineering lives — and where the differentiation happens.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>architecture</category>
      <category>harnessengineering</category>
    </item>
    <item>
      <title>MCP Is Being Abandoned: How Fast Can a 'Standard' Die?</title>
      <dc:creator>Chen Zhang</dc:creator>
      <pubDate>Thu, 26 Mar 2026 14:52:34 +0000</pubDate>
      <link>https://dev.to/chen_zhang_bac430bc7f6b95/mcp-is-being-abandoned-how-fast-can-a-standard-die-2f7c</link>
      <guid>https://dev.to/chen_zhang_bac430bc7f6b95/mcp-is-being-abandoned-how-fast-can-a-standard-die-2f7c</guid>
      <description>&lt;p&gt;In mid-March, Perplexity's CTO Denis Yarats casually dropped a bombshell at the Ask 2026 conference: the company is moving away from MCP internally, going back to REST APIs and CLIs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fzc277584121.github.io%2Fima%2520ges%2Fmcp-vs-skills%2Fmorgan-perplexity-drops-mcp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fzc277584121.github.io%2Fima%2520ges%2Fmcp-vs-skills%2Fmorgan-perplexity-drops-mcp.png" alt="Morgan relaying Perplexity CTO's announcement to drop MCP" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The audience barely reacted, but the statement exploded on social media. YC CEO Garry Tan retweeted it with a blunt "MCP sucks honestly" — it eats too much context window, authentication is broken, and he wrote a CLI wrapper in 30 minutes to replace it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y5s86sm7hp5xcp0bgbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y5s86sm7hp5xcp0bgbs.png" alt="Garry Tan publicly criticizing MCP on X" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A year ago, this kind of pushback would have been unthinkable. MCP was hailed as the ultimate standard for AI tool integration, ecosystem growth was explosive, and server counts doubled weekly. Now it's been hyped, overused, and rejected.&lt;/p&gt;

&lt;p&gt;So what actually went wrong with MCP?&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP Is Too Heavy: Context Windows Can't Handle It
&lt;/h2&gt;

&lt;p&gt;A standard MCP setup consumes roughly 72% of the context window. Someone measured it: three servers (GitHub, Playwright, and an IDE integration) burned through 143K tokens of tool definitions on a 200K-token model. Before the agent even starts working, less than 30% of the space remains.&lt;/p&gt;

&lt;p&gt;The cost isn't just financial. The more noise packed into the context, the worse the model focuses on what matters. With 100 tool schemas sitting there, the agent has to wade through all of them for every single decision.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn46pyyaep510n21vf5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn46pyyaep510n21vf5n.png" alt="MCP vs Skills context window usage comparison" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Researchers call this "context rot." The data is stark: tool selection accuracy drops from 43% to below 14%. The more tools you add, the worse the agent gets at picking the right one.&lt;/p&gt;

&lt;p&gt;The root cause is MCP's loading strategy — it dumps all tool descriptions into the session at the start, regardless of whether they'll be used. This is a protocol-level design choice, not a bug, but the cost scales with the number of tools.&lt;/p&gt;

&lt;p&gt;Skills take a different approach: &lt;strong&gt;progressive disclosure&lt;/strong&gt;. At session start, the agent only sees metadata for each Skill — name, one-line description, and trigger conditions. A few dozen tokens total. Only when the agent determines a Skill is relevant does it load the full content.&lt;/p&gt;

&lt;p&gt;MCP lines up every tool at the door and asks you to pick. Skills hand you an index and let you look things up as needed.&lt;/p&gt;
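&lt;p&gt;A back-of-envelope comparison makes the difference concrete. All Skill names and token counts below are made up for illustration; only the shape of the arithmetic matters.&lt;/p&gt;

```python
# Contrast between the two loading strategies (numbers are invented).

skills = {
    "pdf-report":  {"trigger": "generating PDF reports",     "body_tokens": 1200},
    "db-migrate":  {"trigger": "database schema migrations", "body_tokens": 900},
    "git-release": {"trigger": "cutting a release",          "body_tokens": 700},
}

# MCP-style: every full tool definition enters the context at session start.
mcp_cost = sum(s["body_tokens"] for s in skills.values())

META_TOKENS = 15  # roughly a name plus a one-line trigger per Skill

def skills_cost(relevant):
    # Skills-style: metadata for everything, full body only for what's needed.
    return META_TOKENS * len(skills) + sum(skills[n]["body_tokens"] for n in relevant)

print(mcp_cost)                      # 2800 tokens before any work starts
print(skills_cost(["db-migrate"]))   # 945: three one-line hooks + one loaded Skill
```

The gap widens with every tool added: the MCP-style cost grows with the full schema sizes, the Skills-style cost grows only by one metadata line per Skill.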

&lt;h2&gt;
  
  
  MCP Is Too Dumb: It Just Waits to Be Called
&lt;/h2&gt;

&lt;p&gt;MCP is essentially a tool-calling protocol: how to discover tools, how to invoke them, how to get results. Clean for small-scale use, but the "cleanness" is also the limitation — it's too flat.&lt;/p&gt;

&lt;h3&gt;
  
  
  No Hierarchy, No Subcommands
&lt;/h3&gt;

&lt;p&gt;In MCP, a tool is a function signature. No subcommands, no awareness of session lifecycle, no knowledge of where the agent is in its workflow.&lt;/p&gt;

&lt;p&gt;CLI tools are fundamentally different. A CLI naturally has subcommands — &lt;code&gt;git commit&lt;/code&gt;, &lt;code&gt;git push&lt;/code&gt;, &lt;code&gt;git log&lt;/code&gt; are completely different behavior paths. An agent can run &lt;code&gt;--help&lt;/code&gt; first, explore what's available, and expand as needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6f3f2e3nab9pyhhpwul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6f3f2e3nab9pyhhpwul.png" alt="MCP flat tool space vs CLI + Skills hierarchical structure" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  No SOPs, No Guidance for the Agent
&lt;/h3&gt;

&lt;p&gt;A Skill is essentially a markdown file containing an SOP: what to do first, what to do next, how to retry on failure, when to notify the user. The agent doesn't get an isolated tool — it gets a complete operational playbook.&lt;/p&gt;

&lt;p&gt;MCP tools passively wait. Skills actively participate in the agent's workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lp0apshdefpxaahek3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lp0apshdefpxaahek3u.png" alt="Passive tools vs active skills" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Can't Reuse the Agent's LLM
&lt;/h3&gt;

&lt;p&gt;This one hits close to home. Six months ago, we built the claude-context MCP project to provide contextual code retrieval for Claude Code. When a user asked a question, MCP retrieved relevant conversation fragments from a Milvus vector database and fed them back into the context.&lt;/p&gt;

&lt;p&gt;The problem: out of the top 10 retrieved results, maybe 3 were useful. The other 7 were noise. We tried several approaches — dumping everything to the agent (too noisy), adding reranking inside the MCP server (small models weren't accurate enough), using a large model (but the MCP server is a separate process that can't access the outer agent's LLM).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3y1eos3mlsus4jk6z328.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3y1eos3mlsus4jk6z328.png" alt="MCP needs separate LLM vs Skills reuse the agent's own LLM" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Skills solve this. A retrieval Skill can be structured as: run vector search for top 10, then let the agent's own LLM judge relevance and filter noise. No extra model, no extra API key.&lt;/p&gt;

&lt;p&gt;Our later project memsearch used this approach — three-layer progressive retrieval where the agent's LLM participates throughout the decision-making process, keeping only curated results visible to the main conversation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9i111rod4wdblnb1515.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9i111rod4wdblnb1515.png" alt="memsearch three-layer progressive recall flow" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  But Does MCP Deserve to Die?
&lt;/h2&gt;

&lt;p&gt;MCP has been donated to the Linux Foundation, has over 10,000 active servers, and SDK downloads hit 97 million per month. That ecosystem doesn't just die overnight.&lt;/p&gt;

&lt;p&gt;Some developers take a moderate view: Skills are like a detailed recipe; MCP is the tool. Both have their place.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupw194mbrffaw5oghsbx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupw194mbrffaw5oghsbx.png" alt="Abhay Bhargav: Skills are recipes, MCP is tools" width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It depends on the scenario. MCP over stdio — the mode most developers run locally — has the most issues. But MCP over HTTP is different: enterprise tool platforms need unified permission management, centralized OAuth, standardized telemetry — things scattered CLI tools genuinely struggle to provide.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08q6dvaox9fse4cxqcbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08q6dvaox9fse4cxqcbp.png" alt="MCP's two faces: local stdio mode vs enterprise HTTP mode" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP probably won't disappear. It will retreat to where it fits — enterprise tool platforms, scenarios requiring centralized governance. But the claim that "every agent should use MCP" no longer holds up.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The most pragmatic combination right now is &lt;strong&gt;CLI + RESTful + Skills&lt;/strong&gt;: CLIs for everyday operations, RESTful APIs for external system integration, Skills for agent behavioral knowledge. MCP hasn't vanished, but its position has shifted — from "the standard" to "one option among several."&lt;/p&gt;

&lt;p&gt;MCP left the industry with a valuable question: what kind of tool interface do agents actually need? The answer is slowly taking shape — it's just no longer something a single protocol can provide.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>agents</category>
      <category>skills</category>
    </item>
    <item>
      <title>Which Embedding Model Should You Actually Use in 2026? I Benchmarked 10 Models to Find Out</title>
      <dc:creator>Chen Zhang</dc:creator>
      <pubDate>Fri, 20 Mar 2026 08:53:57 +0000</pubDate>
      <link>https://dev.to/chen_zhang_bac430bc7f6b95/which-embedding-model-should-you-actually-use-in-2026-i-benchmarked-10-models-to-find-out-58bc</link>
      <guid>https://dev.to/chen_zhang_bac430bc7f6b95/which-embedding-model-should-you-actually-use-in-2026-i-benchmarked-10-models-to-find-out-58bc</guid>
      <description>&lt;p&gt;Still using OpenAI's text-embedding-3-small without a second thought? If you're building RAG or vector search systems, you've probably noticed that new embedding models drop every few weeks, each claiming SOTA on some leaderboard. But when it comes to picking one for production, those MTEB scores don't always translate to real-world performance.&lt;/p&gt;

&lt;p&gt;On March 10, 2026, Google released &lt;strong&gt;Gemini Embedding 2 Preview&lt;/strong&gt; — a model that supports &lt;strong&gt;five modalities&lt;/strong&gt; (text, image, video, audio, PDF) natively, 100+ languages, native MRL (Matryoshka Representation Learning), and 3072-dimensional output. On paper, it checks every box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe2dsdxzju0uu9osi5gp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe2dsdxzju0uu9osi5gp.png" alt="Gemini Embedding 2 multimodal architecture: five modality inputs mapped to a unified embedding space" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The official benchmarks look impressive too:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk48zj0weyxa9fx7yqtza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk48zj0weyxa9fx7yqtza.png" alt="Gemini Embedding 2 official benchmark comparison table" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But official benchmarks tend to highlight the best scenarios. So I decided to test things myself: pick a batch of 2025-2026 models and run them through tasks that public benchmarks don't cover well.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Contenders
&lt;/h2&gt;

&lt;p&gt;I selected &lt;strong&gt;10 models&lt;/strong&gt; spanning API services and open-source local deployment, including the widely used OpenAI text-embedding-3-large, plus the 2021-era CLIP ViT-L-14 as a classic baseline.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;From&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;Modalities&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini Embedding 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;3072&lt;/td&gt;
&lt;td&gt;Text/Image/Video/Audio/PDF&lt;/td&gt;
&lt;td&gt;All-modality universal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jina Embeddings v4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jina AI&lt;/td&gt;
&lt;td&gt;3.8B&lt;/td&gt;
&lt;td&gt;2048&lt;/td&gt;
&lt;td&gt;Text/Image/PDF&lt;/td&gt;
&lt;td&gt;MRL + LoRA multi-task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voyage Multimodal 3.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Voyage AI (MongoDB)&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;Text/Image/Video&lt;/td&gt;
&lt;td&gt;Balanced across the board&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen3-VL-Embedding-2B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Alibaba Qwen&lt;/td&gt;
&lt;td&gt;2B&lt;/td&gt;
&lt;td&gt;2048&lt;/td&gt;
&lt;td&gt;Text/Image/Video&lt;/td&gt;
&lt;td&gt;Open-source, lightweight multimodal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jina CLIP v2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jina AI&lt;/td&gt;
&lt;td&gt;~1B&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;Text/Image&lt;/td&gt;
&lt;td&gt;Modern CLIP architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cohere Embed v4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cohere&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;Fixed&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;td&gt;Enterprise retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI 3-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;3072&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;td&gt;Most widely used&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BGE-M3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BAAI&lt;/td&gt;
&lt;td&gt;568M&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;td&gt;Open-source multilingual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mxbai-embed-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mixedbread AI&lt;/td&gt;
&lt;td&gt;335M&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;td&gt;Lightweight, English-focused&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;nomic-embed-text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nomic AI&lt;/td&gt;
&lt;td&gt;137M&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;td&gt;Ultra-lightweight&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;CLIP ViT-L-14&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;OpenAI (2021)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;428M&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;768&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Text/Image&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;Classic baseline&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A quick rundown of the newer ones:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini Embedding 2&lt;/strong&gt; is Google's first all-modality embedding model, released in March 2026. It embeds all five modalities (text, image, video, audio, and PDF) into a single vector space.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyotxcvubiieivv9i17f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyotxcvubiieivv9i17f.png" alt="Gemini Embedding 2 — Google AI docs page" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jina Embeddings v4&lt;/strong&gt; is built on Qwen2.5-VL-3B (3.8B params). It uses three LoRA adapters (retrieval.query / retrieval.passage / text-matching) to switch between retrieval scenarios. Supports text, images, and PDFs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6woxrro5ap5gk2qtcj47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6woxrro5ap5gk2qtcj47.png" alt="Jina Embeddings v4 — Jina AI product page" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jina CLIP v2&lt;/strong&gt; is Jina AI's modernized CLIP architecture focused on text-image cross-modal alignment with multilingual support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Voyage Multimodal 3.5&lt;/strong&gt; comes from Voyage AI, acquired by MongoDB for $220M in February 2025. Supports text, images, and video.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkw8jwwz074avegcw8xb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkw8jwwz074avegcw8xb.png" alt="Voyage AI homepage" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3-VL-Embedding&lt;/strong&gt; is Alibaba Qwen's open-source multimodal embedding series (2B and 8B variants). I tested the 2B version since it fits on a single 11GB consumer GPU — a good test of lightweight deployment viability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsir5nl3u3b4jvdf33h96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsir5nl3u3b4jvdf33h96.png" alt="Qwen3-VL-Embedding-2B — Hugging Face model card" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cohere Embed v4&lt;/strong&gt; and &lt;strong&gt;OpenAI 3-large&lt;/strong&gt; are text-only stalwarts, regulars on MTEB leaderboards and the most common choices for RAG.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpov3jfvrzaapkagwxtz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpov3jfvrzaapkagwxtz.png" alt="Cohere Embed v4" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BGE-M3&lt;/strong&gt; from BAAI is an open-source multilingual model (568M params, 100+ languages) — the benchmark in Chinese open-source embeddings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge5xdzwyompc5smgfipd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge5xdzwyompc5smgfipd.png" alt="BGE-M3 — Hugging Face model card" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;mxbai-embed-large&lt;/strong&gt; (335M) and &lt;strong&gt;nomic-embed-text&lt;/strong&gt; (137M) are lightweight open-source options: mxbai is English-focused with strong MRL support, while nomic is the smallest model in this benchmark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaucp8oy7ksx7l21cb7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaucp8oy7ksx7l21cb7t.png" alt="mxbai-embed-large — Hugging Face model card" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbivckj654csxaz5knk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbivckj654csxaz5knk6.png" alt="nomic-embed-text — Hugging Face model card" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Existing Benchmarks Aren't Enough
&lt;/h2&gt;

&lt;p&gt;Before designing my tests, I looked at what's already out there and found gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MTEB&lt;/strong&gt; (Massive Text Embedding Benchmark) is the gold standard, but it's text-only, doesn't test cross-lingual retrieval (e.g., Chinese query → English document), doesn't evaluate MRL dimension truncation, and has limited coverage of truly long documents (10K+ tokens).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MMEB&lt;/strong&gt; (Massive Multimodal Embedding Benchmark) adds multimodal, but lacks hard negatives — distractors are too easy, making it hard to differentiate models on fine-grained understanding.&lt;/p&gt;

&lt;p&gt;Neither tests cross-lingual retrieval, MRL compression quality, or long-document needle retrieval. These happen to be exactly the pain points developers face when building RAG / Agent / vector search systems. So I designed four evaluation tasks: &lt;strong&gt;cross-modal retrieval, cross-lingual retrieval, needle-in-a-haystack, and MRL dimension compression&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Tasks and Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Round 1: Cross-Modal Retrieval (Text ↔ Image)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: E-commerce visual search, multimodal knowledge bases, multimedia content understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task design&lt;/strong&gt;: 200 image-text pairs from COCO val2017. Text descriptions generated by GPT-4o-mini, each image paired with &lt;strong&gt;3 hard negatives&lt;/strong&gt; — descriptions that differ from the correct one by just one or two details. Models must retrieve correctly from a pool of 200 images and 800 descriptions (200 correct + 600 distractors).&lt;/p&gt;

&lt;p&gt;Here's an actual sample from the dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ovl7lvj2aus0d4hir1g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ovl7lvj2aus0d4hir1g.jpg" alt="COCO sample: travel suitcases with stickers" width="640" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Correct description&lt;/strong&gt;:&lt;br&gt;
&lt;em&gt;"The image features vintage brown leather suitcases with various travel stickers including 'California', 'Cuba', and 'New York', placed on a metal luggage rack against a clear blue sky."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard negatives (single keyword swaps)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;leather suitcases&lt;/em&gt; → &lt;em&gt;canvas backpacks&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;California&lt;/em&gt; → &lt;em&gt;Florida&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;metal luggage rack&lt;/em&gt; → &lt;em&gt;wooden shelf&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model must truly &lt;em&gt;understand&lt;/em&gt; visual details to distinguish these hard negatives.&lt;/p&gt;
&lt;/blockquote&gt;
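&lt;p&gt;Given a swap list like the one above, building each hard negative is a one-line edit (a hedged sketch; the function name is illustrative):&lt;/p&gt;

```python
def make_hard_negative(description: str, old: str, new: str) -> str:
    """Build a hard negative by swapping exactly one detail, mirroring the
    dataset's single-keyword edits (e.g. 'California' to 'Florida')."""
    assert old in description, "swap target must appear in the description"
    return description.replace(old, new, 1)
```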

&lt;p&gt;&lt;strong&gt;Scoring&lt;/strong&gt;: Bidirectional R@1 — text-to-image and image-to-text, averaged as &lt;code&gt;hard_avg_R@1&lt;/code&gt;.&lt;/p&gt;
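&lt;p&gt;As a sketch, the bidirectional score can be computed from two aligned embedding matrices, where row &lt;code&gt;i&lt;/code&gt; of each matrix belongs to the same pair (rows assumed L2-normalized; names illustrative):&lt;/p&gt;

```python
import numpy as np

def recall_at_1(query_vecs, doc_vecs):
    """Fraction of queries whose highest-scoring document is its pair (index i)."""
    sims = query_vecs @ doc_vecs.T  # cosine similarities for L2-normalized rows
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(query_vecs))))

def hard_avg_r1(text_vecs, image_vecs):
    """Average of text-to-image and image-to-text R@1."""
    return (recall_at_1(text_vecs, image_vecs) + recall_at_1(image_vecs, text_vecs)) / 2
```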

&lt;h4&gt;
  
  
  Results
&lt;/h4&gt;

&lt;p&gt;This one surprised me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffabx850ifhc1bwon791.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffabx850ifhc1bwon791.png" alt="Cross-modal retrieval ranking" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qwen3-VL-2B took first with hard_avg_R@1 = 0.945, beating Gemini (0.928) and Voyage (0.900). A 2B open-source model outperformed closed-source APIs.&lt;/p&gt;

&lt;p&gt;Why? Look at the &lt;strong&gt;Modality Gap&lt;/strong&gt; — the L2 distance between the mean text embedding vector and the mean image embedding vector. A smaller gap means text and image vectors live closer together in the embedding space, making cross-modal retrieval easier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyeuxmxa9bwabvbpqbqj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyeuxmxa9bwabvbpqbqj4.png" alt="Modality gap concept diagram" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;
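&lt;p&gt;Computing the gap is straightforward: it is just the distance between the two modality centroids (a sketch assuming L2-normalized rows):&lt;/p&gt;

```python
import numpy as np

def modality_gap(text_vecs, image_vecs):
    """L2 distance between the centroid of all text embeddings and the
    centroid of all image embeddings; smaller means the two modalities
    occupy more of the same region of the embedding space."""
    return float(np.linalg.norm(text_vecs.mean(axis=0) - image_vecs.mean(axis=0)))
```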

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;hard_avg_R@1&lt;/th&gt;
&lt;th&gt;Modality Gap&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-VL-2B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.945&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2B (open-source)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini Embed 2&lt;/td&gt;
&lt;td&gt;0.928&lt;/td&gt;
&lt;td&gt;0.73&lt;/td&gt;
&lt;td&gt;Unknown (closed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voyage MM-3.5&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;0.59&lt;/td&gt;
&lt;td&gt;Unknown (closed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jina CLIP v2&lt;/td&gt;
&lt;td&gt;0.873&lt;/td&gt;
&lt;td&gt;0.87&lt;/td&gt;
&lt;td&gt;~1B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLIP ViT-L-14&lt;/td&gt;
&lt;td&gt;0.768&lt;/td&gt;
&lt;td&gt;0.83&lt;/td&gt;
&lt;td&gt;428M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Qwen3-VL-2B's modality gap of 0.25 is far smaller than Gemini's 0.73. If you're building a mixed text-image collection in Milvus, a smaller modality gap means text and image vectors can coexist in the same index without extra alignment tricks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway from Round 1&lt;/strong&gt;: In cross-modal capability, open-source small models can already compete with closed-source APIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 2: Cross-Lingual Retrieval (Chinese ↔ English)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Bilingual knowledge bases where users ask in Chinese but answers live in English documents, or vice versa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task design&lt;/strong&gt;: 166 manually constructed Chinese-English parallel sentence pairs across three difficulty levels, plus 152 hard negative distractors per language.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Difficulty levels&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Chinese&lt;/th&gt;
&lt;th&gt;English&lt;/th&gt;
&lt;th&gt;Hard Negative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;我爱你。&lt;/td&gt;
&lt;td&gt;I love you.&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;这道菜太咸了。&lt;/td&gt;
&lt;td&gt;This dish is too salty.&lt;/td&gt;
&lt;td&gt;&lt;em&gt;"This dish is too &lt;strong&gt;sweet&lt;/strong&gt;."&lt;/em&gt; / &lt;em&gt;"This &lt;strong&gt;soup&lt;/strong&gt; is too salty."&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;画蛇添足&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;To gild the lily&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;"To add fuel to the fire"&lt;/em&gt; / &lt;em&gt;"To let the cat out of the bag"&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Mapping "画蛇添足" (literally "drawing legs on a snake") to "To gild the lily" — this kind of cultural concept alignment is the hardest part.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Results
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsnv4ujqcya3zvdd2b9q7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsnv4ujqcya3zvdd2b9q7.png" alt="Crosslingual retrieval ranking" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gemini dominated here with a near-perfect 0.997, nailing even idiomatic expressions. It was the only model with R@1 = 1.000 on the Hard subset.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;hard_avg_R@1&lt;/th&gt;
&lt;th&gt;Easy&lt;/th&gt;
&lt;th&gt;Medium&lt;/th&gt;
&lt;th&gt;Hard (idioms)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini Embed 2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.997&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-VL-2B&lt;/td&gt;
&lt;td&gt;0.988&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.969&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jina v4&lt;/td&gt;
&lt;td&gt;0.985&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.969&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voyage MM-3.5&lt;/td&gt;
&lt;td&gt;0.982&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.938&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI 3-large&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.906&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohere v4&lt;/td&gt;
&lt;td&gt;0.955&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.980&lt;/td&gt;
&lt;td&gt;0.875&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BGE-M3 (568M)&lt;/td&gt;
&lt;td&gt;0.940&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.960&lt;/td&gt;
&lt;td&gt;0.844&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic (137M)&lt;/td&gt;
&lt;td&gt;0.154&lt;/td&gt;
&lt;td&gt;0.300&lt;/td&gt;
&lt;td&gt;0.120&lt;/td&gt;
&lt;td&gt;0.031&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai (335M)&lt;/td&gt;
&lt;td&gt;0.120&lt;/td&gt;
&lt;td&gt;0.220&lt;/td&gt;
&lt;td&gt;0.080&lt;/td&gt;
&lt;td&gt;0.031&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This task split models into two clear groups: the top 8 (R@1 &amp;gt; 0.93) have genuine multilingual capability, while nomic and mxbai (R@1 &amp;lt; 0.16) essentially only understand English. No middle ground.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 3: Needle-in-a-Haystack
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: RAG systems processing lengthy legal contracts, research papers. Can the embedding model still find key information buried in tens of thousands of characters?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task design&lt;/strong&gt;: Wikipedia articles as the "haystack" (1K-32K characters), with a fabricated fact inserted at different positions (start / 25% / 50% / 75% / end) as the "needle." The model must correctly rank the needle-containing document higher than the needle-free version via embedding similarity.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Needle&lt;/em&gt;: &lt;em&gt;"The Meridian Corporation reported quarterly revenue of $847.3 million in Q3 2025."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Query&lt;/em&gt;: &lt;em&gt;"What was Meridian Corporation's quarterly revenue?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Haystack&lt;/em&gt;: A 32,000-character Wikipedia article about photosynthesis, with the revenue fact hidden somewhere inside.&lt;/p&gt;
&lt;/blockquote&gt;
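&lt;p&gt;A minimal sketch of the needle insertion, snapping to a sentence boundary so the surrounding text stays readable (the snapping rule is my assumption, not necessarily the exact procedure used):&lt;/p&gt;

```python
def insert_needle(haystack: str, needle: str, position: float) -> str:
    """Insert the needle sentence at a fractional depth of the document
    (0.0 = start, 1.0 = end), cutting at the next sentence boundary."""
    idx = int(len(haystack) * position)
    cut = haystack.find(". ", idx)  # next sentence end after the target depth
    cut = len(haystack) if cut == -1 else cut + 2
    return haystack[:cut] + needle + " " + haystack[cut:]
```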

&lt;h4&gt;
  
  
  Results
&lt;/h4&gt;

&lt;p&gt;The spread between models here was bigger than I expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnn52buhxcs5uakcxcuw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnn52buhxcs5uakcxcuw.png" alt="Needle-in-a-Haystack heatmap" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;1K&lt;/th&gt;
&lt;th&gt;4K&lt;/th&gt;
&lt;th&gt;8K&lt;/th&gt;
&lt;th&gt;16K&lt;/th&gt;
&lt;th&gt;32K&lt;/th&gt;
&lt;th&gt;Overall&lt;/th&gt;
&lt;th&gt;Degradation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini Embed 2&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI 3-large&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jina v4&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cohere v4&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-VL-2B&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voyage MM-3.5&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jina CLIP v2&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BGE-M3 (568M)&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.920&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.973&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai (335M)&lt;/td&gt;
&lt;td&gt;0.980&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.600&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.400&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.660&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic (137M)&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.460&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.440&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;td&gt;56%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;"—" means the length exceeds the model's context window or wasn't tested.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three tiers emerged. Gemini, OpenAI, Jina v4, and Cohere scored near-perfect within their context windows. BGE-M3 (568M) showed slight degradation at 8K (0.92). The smallest models, mxbai (335M) and nomic (137M), &lt;strong&gt;dropped significantly at 4K&lt;/strong&gt;, hitting 0.40-0.44 accuracy at 8K.&lt;/p&gt;

&lt;p&gt;Gemini was the only model that completed the full 4K-32K range with a perfect score. On the other end, the 335M-and-smaller models fell to 0.46-0.60 at just 4K characters (~1000 tokens) — if your RAG documents average over 2000 words, keep this in mind.&lt;/p&gt;
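
&lt;p&gt;This failure mode is easy to reproduce on your own corpus: plant one relevant sentence inside filler of a target length and check whether retrieval still surfaces it. A minimal sketch, where a toy word-overlap score stands in for a real embed-and-cosine step and the sentences are purely illustrative:&lt;/p&gt;

```python
import random

def build_haystack(needle: str, filler: str, target_chars: int) -> str:
    """Pad one relevant sentence with filler text up to roughly target_chars."""
    n = target_chars // len(filler) + 1
    chunks = [filler] * n
    pos = random.randrange(n + 1)      # bury the needle at a random depth
    chunks.insert(pos, needle)
    return " ".join(chunks)

def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words present in the text.
    Stand-in for real embedding similarity -- swap in your model here."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q.intersection(t)) / len(q)

needle = "The rollout flag was renamed to enable_v2_checkout in March."
filler = "Quarterly numbers were discussed at length in the meeting."
doc = build_haystack(needle, filler, target_chars=4000)

query = "when was the enable_v2_checkout flag renamed"
score = overlap_score(query, doc)
```

&lt;p&gt;Sweeping &lt;code&gt;target_chars&lt;/code&gt; from 4K to 32K and swapping in each candidate model's similarity score reproduces the table above on your own data.&lt;/p&gt;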

&lt;h3&gt;
  
  
  Round 4: MRL Dimension Compression
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is MRL?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MRL (Matryoshka Representation Learning) is a training technique that makes the first N dimensions of an embedding vector form a meaningful low-dimensional representation on their own. For example, a 3072-dim vector truncated to its first 256 dimensions can still retain decent semantic quality. Half the dimensions = half the storage cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyd1aouujy8dc3x6rpaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyd1aouujy8dc3x6rpaj.png" alt="MRL concept diagram" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Task design&lt;/strong&gt;: 150 sentence pairs from STS-B (Semantic Textual Similarity Benchmark), each with a human-annotated similarity score (0-5). Each model generates embeddings at full dimension, which are then truncated to 256 / 512 / 1024 dims; Spearman rank correlation (ρ) with the human scores is measured at each dimension.&lt;/p&gt;
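
&lt;p&gt;The mechanics are simple enough to show end to end. Below is a self-contained toy version of this evaluation: three hand-made 4-dim "embeddings" stand in for real model outputs and the 150 STS-B pairs. Truncation keeps the first k dimensions and re-normalizes, and Spearman ρ is computed against the human scores:&lt;/p&gt;

```python
import math

def truncate(vec, k):
    """MRL-style truncation: keep the first k dims, then re-normalize to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def spearman(xs, ys):
    """Spearman rho via rank distance (assumes no ties, true for this toy data)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Three hand-made sentence pairs: (embedding A, embedding B, human score 0-5).
# Pair 3 stores its discriminative signal in dimension 3, which truncation discards.
pairs = [
    ([1, 0, 0, 0], [1.0, 0.0, 0.0, 0.0], 5.0),
    ([1, 0, 0, 0], [0.6, 0.8, 0.0, 0.0], 3.0),
    ([1, 0, 0, 0], [0.2, 0.1, math.sqrt(0.95), 0.0], 1.0),
]

human = [h for _, _, h in pairs]
full = [cosine(a, b) for a, b, _ in pairs]
trunc = [cosine(truncate(a, 2), truncate(b, 2)) for a, b, _ in pairs]

rho_full = spearman(full, human)    # 1.0: full dims preserve the human ranking
rho_trunc = spearman(trunc, human)  # 0.5: truncation scrambles pair 3's rank
```

&lt;p&gt;Pair 3 keeps its discriminative signal in a later dimension, so truncation scrambles its rank and ρ drops from 1.0 to 0.5 — MRL training pushes models to front-load that signal so this doesn't happen.&lt;/p&gt;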

&lt;h4&gt;
  
  
  Results
&lt;/h4&gt;

&lt;p&gt;If you're planning to reduce storage costs by truncating embedding dimensions in your vector database, pay attention here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fan185r1kwtn0x8ol5zwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fan185r1kwtn0x8ol5zwd.png" alt="MRL: Full Dimension vs 256 Dimension Quality" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;ρ (Full dim)&lt;/th&gt;
&lt;th&gt;ρ (256 dim)&lt;/th&gt;
&lt;th&gt;Degradation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Voyage MM-3.5&lt;/td&gt;
&lt;td&gt;0.880&lt;/td&gt;
&lt;td&gt;0.874&lt;/td&gt;
&lt;td&gt;0.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jina v4&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;0.828&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai (335M)&lt;/td&gt;
&lt;td&gt;0.815&lt;/td&gt;
&lt;td&gt;0.795&lt;/td&gt;
&lt;td&gt;2.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic (137M)&lt;/td&gt;
&lt;td&gt;0.781&lt;/td&gt;
&lt;td&gt;0.774&lt;/td&gt;
&lt;td&gt;0.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI 3-large&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;0.762&lt;/td&gt;
&lt;td&gt;0.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini Embed 2&lt;/td&gt;
&lt;td&gt;0.683&lt;/td&gt;
&lt;td&gt;0.689&lt;/td&gt;
&lt;td&gt;-0.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemini ranked last in this round. mxbai-embed-large (just 335M params) placed third in MRL, beating OpenAI 3-large. Jina v4 and Voyage led because they were specifically trained with MRL objectives. Dimension compression ability has little to do with model size — what matters is whether it was explicitly trained for it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: MRL rankings reflect dimension-compression resilience, which is different from full-dimension semantic quality. Gemini's full-dimension retrieval is strong (proven in cross-lingual and cross-modal rounds), but it scored low on this slimming test. If you don't need dimension compression, this round's results matter less.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Full Scorecard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Cross-Modal&lt;/th&gt;
&lt;th&gt;Cross-Lingual&lt;/th&gt;
&lt;th&gt;Needle&lt;/th&gt;
&lt;th&gt;MRL ρ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini Embed 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;0.928&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.997&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.668&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voyage MM-3.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;0.982&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.880&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jina v4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.8B&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.985&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen3-VL-2B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.945&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.988&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.774&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mxbai-embed-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;335M&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.120&lt;/td&gt;
&lt;td&gt;0.660&lt;/td&gt;
&lt;td&gt;0.815&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI 3-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.760&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BGE-M3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;568M&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.940&lt;/td&gt;
&lt;td&gt;0.973&lt;/td&gt;
&lt;td&gt;0.744&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;nomic-embed-text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;137M&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.154&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;td&gt;0.780&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cohere v4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.955&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jina CLIP v2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1B&lt;/td&gt;
&lt;td&gt;0.873&lt;/td&gt;
&lt;td&gt;0.934&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;CLIP ViT-L-14&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;428M&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.768&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;0.030&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;"—" means the model doesn't support that capability or wasn't tested. CLIP included as a 2021 baseline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One thing is clear: &lt;strong&gt;no single model wins every round&lt;/strong&gt;. Gemini leads in cross-lingual and long documents but ranks last in MRL. Qwen3-VL-2B takes first in cross-modal but is mid-pack on MRL. Voyage is consistently strong but never first. Every model's scorecard has a different shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions and Selection Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Round-by-Round Summary
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cross-modal&lt;/strong&gt;: Qwen3-VL-2B (0.945) took first, Gemini (0.928) second, Voyage (0.900) third. An open-source 2B model beat the closed-source APIs — the modality gap was the key differentiator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-lingual&lt;/strong&gt;: Gemini (0.997) led by a wide margin, handling even idiom-level Chinese-English alignment perfectly. Top 8 models all scored above 0.93; English-only lightweight models essentially scored zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Needle-in-a-haystack&lt;/strong&gt;: API and large open-source models scored perfectly within 8K; the 335M-and-smaller models degraded starting at 4K. Gemini was the only model to achieve a perfect score across the full 32K range.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MRL compression&lt;/strong&gt;: Voyage (0.880) and Jina v4 (0.833) led, with less than 1% degradation when truncated to 256 dims. Gemini (0.668) ranked last.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini Embedding 2 Verdict
&lt;/h3&gt;

&lt;p&gt;Back to the question I started with — how did Gemini Embedding 2 actually perform?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths&lt;/strong&gt;: Cross-lingual #1 (0.997), needle-in-a-haystack #1 (1.000), cross-modal #2 (0.928), broadest modality coverage (five modalities — other models max out at three).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;: MRL compression ranked last (ρ=0.668), cross-modal accuracy beaten by open-source Qwen3-VL-2B.&lt;/p&gt;

&lt;p&gt;If you don't need dimension compression, Gemini is currently unmatched for cross-lingual + long-document scenarios. But for cross-modal precision and dimension compression, specialized models do better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Selection Decision Tree
&lt;/h3&gt;

&lt;p&gt;Based on these benchmark results, here's a simple decision flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyp4jpgcowbd7xvdw73j1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyp4jpgcowbd7xvdw73j1.png" alt="Embedding model selection decision tree" width="800" height="679"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;p&gt;A few models I didn't get to test: NVIDIA's NV-Embed-v2 and Jina v5-text. I also didn't cover video, audio, or PDF/table modalities even though some models claim support, nor did I test domain-specific scenarios like code retrieval. The sample sizes are relatively small — ranking differences between some models may fall within the statistical margin of error. More thorough testing is on my to-do list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;After running four rounds of benchmarks, a few things stood out to me.&lt;/p&gt;

&lt;p&gt;Cross-lingual semantic alignment used to be a research topic in academic papers — now you can get it from an API call. Five years ago, text-image retrieval meant training a dedicated CLIP model; now a single general-purpose model handles text, images, video, audio, and PDFs. This field is moving faster than most people realize.&lt;/p&gt;

&lt;p&gt;What impressed me most was how fast open-source is catching up. Qwen3-VL-2B has just 2B parameters yet beat every closed-source API in cross-modal accuracy. BGE-M3's cross-lingual performance rivals most commercial services. In the embedding space, data quality and training strategy matter more and more, while model size and compute are becoming less decisive. You don't need to worry about being locked into any single API — there's always an open-source alternative.&lt;/p&gt;

&lt;p&gt;One last thought on model selection. The conclusions in this post will probably need updating in a year. Rather than agonizing over "which model is THE one," I'd invest in building an evaluation pipeline — understand your actual use case and data, set up a test workflow that can quickly validate new models when they drop. Public benchmarks like &lt;a href="https://huggingface.co/spaces/mteb/leaderboard" rel="noopener noreferrer"&gt;MTEB&lt;/a&gt;, &lt;a href="https://huggingface.co/spaces/mteb/mmteb" rel="noopener noreferrer"&gt;MMTEB&lt;/a&gt;, and &lt;a href="https://mmeb.github.io/" rel="noopener noreferrer"&gt;MMEB&lt;/a&gt; are useful references, but you ultimately need to validate on your own data. The evaluation code for this post is open-sourced on &lt;a href="https://github.com/zc277584121/mm-embedding-bench" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; if you want to adapt it. In the long run, building this evaluation capability is more valuable than picking the right model at any single point in time.&lt;/p&gt;
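
&lt;p&gt;A minimal version of such a pipeline is just a pluggable embed function, a labeled case set, and a metric. A sketch (the bigram "embedder" and the two cases are placeholders for your real model call and your own data):&lt;/p&gt;

```python
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in embedder (character-bigram counts). Any real model works:
    the harness only needs an embed(text) callable that returns a vector."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def cos(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb)

def top1_accuracy(embed, cases) -> float:
    """cases: list of (query, correct_doc, [distractor_docs]) triples."""
    hits = 0
    for query, positive, distractors in cases:
        docs = [positive] + distractors
        vecs = [embed(d) for d in docs]
        q = embed(query)
        best = max(range(len(docs)), key=lambda i: cos(q, vecs[i]))
        hits += int(best == 0)          # index 0 is the labeled positive
    return hits / len(cases)

# Replace with (query, relevant chunk, distractors) triples from your corpus.
cases = [
    ("reset my account password",
     "How to reset a forgotten account password",
     ["Quarterly revenue grew by twelve percent",
      "Installing the desktop client on macOS"]),
    ("invoice download location",
     "Where to download past invoices from the billing page",
     ["The API rate limit is one hundred requests per minute",
      "Changing your display name and avatar"]),
]
accuracy = top1_accuracy(toy_embed, cases)
```

&lt;p&gt;Swap &lt;code&gt;toy_embed&lt;/code&gt; for a call to any new model and you get an apples-to-apples number on your own data within minutes of its release.&lt;/p&gt;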

</description>
      <category>embeddings</category>
      <category>rag</category>
      <category>machinelearning</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>Why Vercel's agent-browser Is Winning the Token Efficiency War for AI Browser Automation</title>
      <dc:creator>Chen Zhang</dc:creator>
      <pubDate>Fri, 06 Mar 2026 15:11:49 +0000</pubDate>
      <link>https://dev.to/chen_zhang_bac430bc7f6b95/why-vercels-agent-browser-is-winning-the-token-efficiency-war-for-ai-browser-automation-4p87</link>
      <guid>https://dev.to/chen_zhang_bac430bc7f6b95/why-vercels-agent-browser-is-winning-the-token-efficiency-war-for-ai-browser-automation-4p87</guid>
      <description>&lt;h2&gt;
  
  
  The Problem With MCP-Based Browser Tools
&lt;/h2&gt;

&lt;p&gt;If you've tried connecting an AI agent to a browser, you've probably used something like Playwright MCP or Chrome DevTools MCP. They work, but there's a hidden cost: tool definitions.&lt;/p&gt;

&lt;p&gt;MCP tools describe themselves via JSON Schema, and those descriptions get loaded into the agent's context window at the start of every session. Playwright MCP costs roughly 13,700 tokens. Chrome DevTools MCP costs around 17,000. Before your agent has done a single thing, nearly 9% of a 200K context window is gone.&lt;/p&gt;

&lt;p&gt;For short tasks, this is fine. For long multi-step automation workflows — the kind where an agent fills forms, navigates pages, extracts data, and interacts across multiple sites — it adds up fast and can push you right into context limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  agent-browser: The CLI-First Alternative
&lt;/h2&gt;

&lt;p&gt;Vercel's &lt;a href="https://github.com/vercel-labs/agent-browser" rel="noopener noreferrer"&gt;agent-browser&lt;/a&gt; takes a fundamentally different approach. Instead of exposing browser capabilities through MCP, it's a CLI tool. The AI agent interacts with the browser by executing shell commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-browser snapshot &lt;span class="nt"&gt;-i&lt;/span&gt;      &lt;span class="c"&gt;# get interactive elements&lt;/span&gt;
agent-browser click @e1        &lt;span class="c"&gt;# click an element&lt;/span&gt;
agent-browser fill @e2 &lt;span class="s2"&gt;"text"&lt;/span&gt;  &lt;span class="c"&gt;# fill an input&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No JSON Schema. No tool definitions. Zero token overhead for the tooling itself.&lt;/p&gt;

&lt;p&gt;The responses are equally lean. A successful button click returns &lt;code&gt;Done&lt;/code&gt; — six characters. Compare that with MCP-based tools that return full page state updates running into thousands of characters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Behind the Efficiency
&lt;/h2&gt;

&lt;p&gt;agent-browser uses a three-tier architecture that's worth understanding:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tier 1 — Rust CLI&lt;/strong&gt;: A native binary that handles argument parsing and command routing in sub-millisecond time. This eliminates Node.js cold start overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 2 — Node.js Daemon&lt;/strong&gt;: A long-running process that manages the Playwright browser instance. It stays warm between commands, so you don't pay the 2–5 second browser startup cost on each interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier 3 — Browser&lt;/strong&gt;: The actual browser, connected via CDP (Chrome DevTools Protocol). Supports local Chromium, remote Chrome instances, cloud browsers (Browserbase), and even iOS Safari.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The CLI and daemon communicate through Unix domain sockets, keeping IPC fast and lightweight.&lt;/p&gt;
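
&lt;p&gt;If you haven't worked with Unix domain sockets, the pattern is easy to picture: the daemon listens on a socket file, and each short-lived CLI invocation connects, writes one command, and reads a tiny reply. A POSIX-only Python sketch of that shape (illustrative; the command and reply strings are made up, and this is not agent-browser's actual wire protocol):&lt;/p&gt;

```python
import os, socket, tempfile, threading

# Toy stand-in for the CLI-to-daemon split over a Unix domain socket.
sock_path = os.path.join(tempfile.mkdtemp(), "daemon.sock")
ready = threading.Event()

def daemon():
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(sock_path)
    srv.listen(1)
    ready.set()                          # socket is now accepting connections
    conn, _ = srv.accept()
    cmd = conn.recv(1024).decode()       # e.g. "click @e1"
    conn.sendall(b"Done" if cmd.startswith("click") else b"Err")
    conn.close()
    srv.close()

t = threading.Thread(target=daemon)
t.start()
ready.wait()

# The short-lived CLI side: connect, send one command, read the tiny reply.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(sock_path)
cli.sendall(b"click @e1")
reply = cli.recv(1024).decode()          # "Done"
cli.close()
t.join()
```

&lt;p&gt;Because the daemon holds the warm browser and all the state, the per-command exchange stays this small — which is exactly why the agent's context sees so few tokens per step.&lt;/p&gt;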

&lt;p&gt;For element interaction, agent-browser uses accessibility tree snapshots with compact refs. Running &lt;code&gt;snapshot -i&lt;/code&gt; returns something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;button "Sign In" [ref=e1]
textbox "Email" [ref=e2]
textbox "Password" [ref=e3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Roughly 200–400 tokens for a typical page, versus significantly larger outputs from MCP-based alternatives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Here's a side-by-side comparison I compiled from multiple independent benchmarks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;agent-browser&lt;/th&gt;
&lt;th&gt;Chrome DevTools MCP&lt;/th&gt;
&lt;th&gt;Playwright MCP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool definition overhead&lt;/td&gt;
&lt;td&gt;0 tokens&lt;/td&gt;
&lt;td&gt;~17,000 tokens&lt;/td&gt;
&lt;td&gt;~13,700 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single page snapshot&lt;/td&gt;
&lt;td&gt;~1,000 tokens&lt;/td&gt;
&lt;td&gt;Varies (larger)&lt;/td&gt;
&lt;td&gt;~15,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Button click response&lt;/td&gt;
&lt;td&gt;6 characters&lt;/td&gt;
&lt;td&gt;Full state update&lt;/td&gt;
&lt;td&gt;12,891 characters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-step automation flow&lt;/td&gt;
&lt;td&gt;~7,000 tokens&lt;/td&gt;
&lt;td&gt;~50,000 tokens&lt;/td&gt;
&lt;td&gt;~114,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Vercel's internal testing showed that simplifying from 17 tools down to 2 produced dramatic improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3.5x&lt;/strong&gt; faster execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;37%&lt;/strong&gt; fewer tokens consumed&lt;/li&gt;
&lt;li&gt;Success rate from 80% to &lt;strong&gt;100%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;42%&lt;/strong&gt; fewer steps needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Under the same context budget, agent-browser can run approximately &lt;strong&gt;5.7x&lt;/strong&gt; more test cycles than Playwright MCP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Falls Short
&lt;/h2&gt;

&lt;p&gt;It's not all upside. agent-browser is two months old and the rough edges show:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No deep debugging.&lt;/strong&gt; There's no equivalent to Chrome DevTools MCP's heap snapshots, Lighthouse audits, or detailed performance profiling. If your use case is front-end debugging or performance analysis, this isn't the tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows is broken.&lt;/strong&gt; Multiple open issues around socket files, daemon startup, Git Bash compatibility, and path handling. If your agents run on Windows, wait for these to be fixed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limited ecosystem compatibility.&lt;/strong&gt; Because it's a CLI, it only works with tools that can execute shell commands. MCP-only clients like Cursor or GitHub Copilot can't use it directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Documentation is thin.&lt;/strong&gt; Multiple GitHub issues mention incomplete or missing docs. The project moves fast, but the docs haven't kept up.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;p&gt;After spending a week with both tools, my recommendation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Best Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Long-running AI automation workflows&lt;/td&gt;
&lt;td&gt;agent-browser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token budget is tight&lt;/td&gt;
&lt;td&gt;agent-browser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Front-end debugging and performance analysis&lt;/td&gt;
&lt;td&gt;Chrome DevTools MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need MCP compatibility (Cursor, Copilot, etc.)&lt;/td&gt;
&lt;td&gt;Chrome DevTools MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows environment&lt;/td&gt;
&lt;td&gt;Chrome DevTools MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network interception and mocking&lt;/td&gt;
&lt;td&gt;agent-browser&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two tools aren't really competing — they solve different problems. agent-browser is optimized for AI agents that need to &lt;em&gt;use&lt;/em&gt; a browser efficiently. Chrome DevTools MCP is optimized for AI agents that need to &lt;em&gt;debug&lt;/em&gt; a browser deeply.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Signal Worth Watching
&lt;/h2&gt;

&lt;p&gt;Perhaps the most telling development: Google's Chrome DevTools team is now &lt;a href="https://github.com/nicholasgasior/chrome-devtools-mcp/issues/1079" rel="noopener noreferrer"&gt;building their own CLI tool&lt;/a&gt;. When the team behind the leading MCP server starts shipping a CLI interface, it validates the core thesis that CLI is a better interface than MCP for AI-driven browser automation.&lt;/p&gt;

&lt;p&gt;17,000+ stars in two months. This one's worth paying attention to.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Building Code Retrieval for Claude Code from Scratch</title>
      <dc:creator>Chen Zhang</dc:creator>
      <pubDate>Tue, 19 Aug 2025 02:05:33 +0000</pubDate>
      <link>https://dev.to/chen_zhang_bac430bc7f6b95/building-code-retrieval-for-claude-code-from-scratch-3n8c</link>
      <guid>https://dev.to/chen_zhang_bac430bc7f6b95/building-code-retrieval-for-claude-code-from-scratch-3n8c</guid>
      <description>&lt;p&gt;The story begins with a bug hunt...&lt;/p&gt;

&lt;p&gt;When I opened Claude Code and asked it to help me locate a bug, it repeatedly ran grep and file-read tools, guessing possible keywords and churning through huge numbers of files. After a minute, it still hadn't found anything.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxgecv35t6fq1jrjj4xk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxgecv35t6fq1jrjj4xk.png" alt="Claude Code struggling to locate a bug" width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With hints and guidance from me, and five minutes of back and forth, it finally located the problem file. But of all the files it read, only 10 lines of code were actually related to the issue - 99% was irrelevant. All that repeated dialogue and reading of unrelated code wasted both tokens and precious time.&lt;/p&gt;

&lt;p&gt;This experience was clearly problematic. So what's the root cause of this inefficiency?&lt;/p&gt;

&lt;p&gt;After some reflection and research, I identified the following key pain points from this scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First Pain Point: Expensive&lt;/strong&gt; 💸&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each query requires transmitting massive amounts of irrelevant code to the LLM. Token consumption is enormous, causing costs to skyrocket.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Second Pain Point: Slow&lt;/strong&gt; ⏰&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI has to probe and search repeatedly, leaving the user waiting and dragging development efficiency way down.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Third Pain Point: Limitations of Keyword Search&lt;/strong&gt; 🔍&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional grep can only match literal meanings and cannot understand semantic relationships and contextual meanings of code. It's like finding a needle in a haystack, relying purely on luck.&lt;/p&gt;

&lt;p&gt;Others have raised similar issues with Claude Code, such as &lt;a href="https://github.com/anthropics/claude-code/issues/1315" rel="noopener noreferrer"&gt;issue #1315&lt;/a&gt; and &lt;a href="https://github.com/anthropics/claude-code/issues/4556" rel="noopener noreferrer"&gt;issue #4556&lt;/a&gt;. Even a tool as powerful as Claude Code cannot escape these pain points.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws0thsbqymvg22xhssxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws0thsbqymvg22xhssxj.png" alt="Issue in Claude Code repo" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does Cursor Handle This?
&lt;/h2&gt;

&lt;p&gt;To address these pain points, Cursor's founders actually revealed their solution early on in a &lt;a href="https://forum.cursor.com/t/codebase-indexing/36" rel="noopener noreferrer"&gt;forum post&lt;/a&gt; - "Codebase Indexing".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cwmhuk15s151xw97245.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cwmhuk15s151xw97245.png" alt="Cursor's Codebase Indexing introduction" width="800" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The approach is straightforward: split the codebase into small chunks, send them to the server, and use embedding models to embed the code. This is the standard code RAG solution.&lt;/p&gt;
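
&lt;p&gt;In miniature, that whole loop fits in a few dozen lines. Here's a toy sketch of chunk-embed-search, where a bag-of-words "embedding" and a blank-line chunker stand in for a real embedding model, a syntax-aware splitter, and a vector database:&lt;/p&gt;

```python
import re
from collections import Counter

def chunk(source: str):
    """Naive chunker: one chunk per blank-line-separated block (roughly one
    function each here). Real indexers split by syntax tree and size limits."""
    return [b.strip() for b in source.split("\n\n") if b.strip()]

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag of lowercase word fragments. A real pipeline
    calls an embedding model and gets a dense vector back."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

source = '''def connect(host, port):
    return Socket(host, port)

def parse_config(path):
    data = read_file(path)
    return json.loads(data)

def retry(fn, attempts):
    for i in range(attempts):
        pass
'''

index = [(c, embed(c)) for c in chunk(source)]   # built once, at indexing time

def search(query: str) -> str:
    q = embed(query)
    return max(index, key=lambda item: cosine(q, item[1]))[0]

hit = search("where is the config file parsed")
```

&lt;p&gt;Even this crude version answers a semantic query ("where is the config file parsed") without the query sharing an exact identifier with the code — the property grep fundamentally lacks.&lt;/p&gt;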

&lt;p&gt;But this raises the question: why hasn't Claude Code adopted this approach? After analyzing this deeply, I discovered there are numerous engineering challenges to tackle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to split the code?&lt;/li&gt;
&lt;li&gt;What to do when code changes?&lt;/li&gt;
&lt;li&gt;How to ensure indexing and retrieval speed?&lt;/li&gt;
&lt;li&gt;How to build indexes for massive amounts of code embeddings?&lt;/li&gt;
&lt;/ul&gt;
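
&lt;p&gt;To give a flavor of just the second question (what to do when code changes), one common answer is content hashing: store a hash per file at index time, and on the next run re-embed only files whose hash changed. A sketch of that diffing step (illustrative, not any particular tool's implementation):&lt;/p&gt;

```python
import hashlib

def digest(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def plan_reindex(previous: dict, current_files: dict):
    """Compare stored content hashes against the working tree and return
    which files need (re-)embedding and which stale index entries to drop.
    `previous` maps path to hash from the last indexing run."""
    current = {path: digest(text) for path, text in current_files.items()}
    changed = [p for p, h in current.items() if previous.get(p) != h]
    removed = [p for p in previous if p not in current]
    return changed, removed, current

# First run: no stored hashes, so everything is new.
files_v1 = {"src/a.py": "def a(): pass", "src/b.py": "def b(): pass"}
changed, removed, snapshot = plan_reindex({}, files_v1)

# Second run: one file edited, one deleted, one added.
files_v2 = {"src/a.py": "def a(): return 1", "src/c.py": "def c(): pass"}
changed2, removed2, _ = plan_reindex(snapshot, files_v2)
```

&lt;p&gt;Only &lt;code&gt;changed2&lt;/code&gt; goes back through the embedding model, which keeps incremental re-indexing cheap even on large repos.&lt;/p&gt;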

&lt;p&gt;Obviously, all of these require substantial engineering support. Claude Code's design philosophy is to be &lt;strong&gt;simple&lt;/strong&gt; and &lt;strong&gt;CLI-based without interfaces&lt;/strong&gt;, so such a large engineering effort clearly doesn't align with its positioning.&lt;/p&gt;

&lt;p&gt;More importantly, embedding models and vector databases aren't Anthropic's core strengths. As a result, Claude Code hasn't adopted this approach, leaving these frustrating pain points unresolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Implement It Myself?
&lt;/h2&gt;

&lt;p&gt;Since Claude Code hasn't implemented it and Cursor is a closed-source paid product, why don't I build one myself?&lt;/p&gt;

&lt;p&gt;I can integrate vector databases and embedding models to implement an open-source code retrieval MCP tool similar to Cursor. This would not only meet my own needs but also help other developers - wouldn't that be wonderful?&lt;/p&gt;

&lt;p&gt;So I named this product: &lt;strong&gt;Claude Context&lt;/strong&gt;. It's an open-source code retrieval MCP tool that can seamlessly integrate with Claude Code while also being compatible with other AI Coding IDEs. It enables LLMs to obtain higher quality and more accurate contextual information.&lt;/p&gt;

&lt;p&gt;Next, it was time to design the solution and roll up my sleeves to get to work!&lt;/p&gt;

&lt;h3&gt;
  
  
  Technology Stack
&lt;/h3&gt;

&lt;p&gt;Given that I was going to build this, choosing the right technology stack was crucial:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔌 Interface Level: MCP is the First Choice&lt;/strong&gt;&lt;br&gt;
MCP is like the &lt;strong&gt;USB&lt;/strong&gt; for LLM interactions with the outside world. I needed to expose the product's capabilities through an MCP server, enabling not just Claude Code but also other AI IDEs like Gemini CLI and Qwen Code to utilize it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💾 Vector Database: Choosing Zilliz Cloud&lt;/strong&gt;&lt;br&gt;
Zilliz Cloud is a fully managed Milvus vector database service. With high-performance vector search, high QPS and low latency, cloud-native architecture bringing elastic scaling and unlimited storage, plus multi-replica enhanced availability - it's practically tailor-made for Codebase Indexing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧠 Embedding Models: Multiple Options&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI embedding models: Widely used and validated, stable and reliable&lt;/li&gt;
&lt;li&gt;Voyage embedding: Has specialized models for the Code domain with better performance&lt;/li&gt;
&lt;li&gt;Ollama: Suitable for local deployment with stronger privacy&lt;/li&gt;
&lt;li&gt;More embedding models to be supported later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;⌨️ Programming Language: TypeScript Wins&lt;/strong&gt;&lt;br&gt;
I deliberated between Python and TypeScript, ultimately settling on TypeScript. The reasoning is straightforward: it offers better compatibility at the application layer, allowing developed modules to seamlessly integrate with higher-level TypeScript applications like VSCode plugins. Additionally, since Claude Code, Gemini CLI, and similar tools are all built with TypeScript, this choice provides better ecosystem alignment.&lt;/p&gt;
&lt;h3&gt;
  
  
  Architecture Design
&lt;/h3&gt;

&lt;p&gt;Following decoupling and layered design principles, I structured the architecture into two distinct layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core modules: Contains all core logic, with each block's logic also designed separately, such as code parsing, vector indexing, semantic retrieval, synchronization updates, etc.&lt;/li&gt;
&lt;li&gt;Front-end modules: Contains MCP, VSCode plugins, and other integrations. Based on core modules, they contain more application-layer logic, especially MCP, which is the best way to interact with AI IDEs like Claude Code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design allows core modules to be reused by upper-layer modules, providing flexibility for both horizontal and vertical scaling in the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghb2sdvb22yxdcfawzh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghb2sdvb22yxdcfawzh2.png" alt="Architecture" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Core Modules
&lt;/h3&gt;

&lt;p&gt;Core modules serve as the foundation - after all, you can only build a solid house on a strong foundation. These modules abstract vector databases, embedding models, and other components into composable modules that form a Context object, enabling the use of different vector databases and embedding models across various scenarios.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;MilvusVectorDatabase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;OpenAIEmbedding&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@zilliz/claude-context-core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize embedding provider&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbedding&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize vector database&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;vectorDatabase&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MilvusVectorDatabase&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;

&lt;span class="c1"&gt;// Create context instance&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;vectorDatabase&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Index your codebase with progress tracking&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;indexCodebase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./your-project&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Perform semantic search&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;semanticSearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./your-project&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;vector database operations&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to Split Code?
&lt;/h2&gt;

&lt;p&gt;Code splitting can't be handled with a simple, brute-force approach of splitting by lines or characters. Such an approach would result in code blocks with either incomplete logic or lost context.&lt;/p&gt;

&lt;p&gt;I designed two complementary splitting strategies:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. AST (Abstract Syntax Tree) Splitting (Main Strategy) 🌳
&lt;/h3&gt;

&lt;p&gt;This is the default and recommended strategy. Through the &lt;a href="https://github.com/tree-sitter/tree-sitter" rel="noopener noreferrer"&gt;tree-sitter&lt;/a&gt; parser, it understands the syntactic structure of code and splits according to semantic units.&lt;/p&gt;

&lt;p&gt;The advantages of AST splitting are obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Syntactic Completeness&lt;/strong&gt;: Each chunk is a complete syntactic unit, avoiding awkward situations where functions are split in half&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical Coherence&lt;/strong&gt;: Related code logic remains in the same chunk, allowing AI search to find more accurate context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language Support&lt;/strong&gt;: Uses different tree-sitter parsers for different programming languages, accurately identifying and splitting JavaScript function declarations, Python class definitions, Java methods, Go function definitions, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr27uv0cvaw7fifu58wa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr27uv0cvaw7fifu58wa.png" alt="AST example" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;
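
&lt;p&gt;The splitting logic can be sketched without pulling in tree-sitter itself: walk the tree and emit each splittable node as one chunk, otherwise descend and keep looking. The node shape and type names below are illustrative (modeled on tree-sitter's), not claude-context's actual implementation:&lt;/p&gt;

```typescript
// Illustrative node shape, a simplified stand-in for tree-sitter's SyntaxNode.
interface SyntaxNode {
  type: string;
  text: string;
  children: SyntaxNode[];
}

// Node types treated as standalone chunks; real parsers use
// per-language type names (these are tree-sitter-style examples).
const SPLITTABLE = [
  'function_declaration', // JavaScript/Go functions
  'class_definition',     // Python classes
  'method_declaration',   // Java methods
];

// Walk the tree: emit each splittable node as one complete chunk,
// otherwise recurse into its children.
function astChunks(node: SyntaxNode): string[] {
  if (SPLITTABLE.indexOf(node.type) !== -1) {
    return [node.text];
  }
  const chunks: string[] = [];
  for (const child of node.children) {
    for (const chunk of astChunks(child)) {
      chunks.push(chunk);
    }
  }
  return chunks;
}

// Toy tree standing in for a parsed source file.
const root: SyntaxNode = {
  type: 'program',
  text: '',
  children: [
    { type: 'function_declaration', text: 'function a() {}', children: [] },
    { type: 'class_definition', text: 'class B: pass', children: [] },
  ],
};

console.log(astChunks(root)); // two complete syntactic units, never half a function
```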

&lt;h3&gt;
  
  
  2. LangChain Text Splitting (Fallback Strategy) 🛡️
&lt;/h3&gt;

&lt;p&gt;For languages the AST parser cannot handle, or when parsing fails, I use LangChain's RecursiveCharacterTextSplitter as a backup solution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Use recursive character splitting to maintain code structure&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromLanguage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;chunkSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;chunkOverlap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this strategy isn't as sophisticated as AST (essentially splitting by character count), it's stable and reliable. It ensures that any code can be properly split, providing a dependable fallback option.&lt;/p&gt;

&lt;p&gt;With the fallback in place, we get semantic completeness where AST parsing works and dependable coverage everywhere else. Rock solid!&lt;/p&gt;

&lt;h2&gt;
  
  
  What About Code Changes?
&lt;/h2&gt;

&lt;p&gt;Handling code changes has always been a core challenge for code indexing systems. Imagine if you had to re-index the entire project every time a file had a minor change - that would be a disaster.&lt;/p&gt;

&lt;p&gt;I designed a Merkle Tree-based synchronization mechanism to solve this problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Merkle Tree: The Core of Change Detection
&lt;/h3&gt;

&lt;p&gt;A Merkle tree is like a layered "fingerprint" system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each file has its own hash fingerprint&lt;/li&gt;
&lt;li&gt;Each folder's fingerprint is derived from the fingerprints of the files it contains&lt;/li&gt;
&lt;li&gt;These converge into a single root fingerprint for the entire codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm78e99r3aaar36z34ww.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm78e99r3aaar36z34ww.jpeg" alt="Merkle tree" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whenever a file's content changes, the hash fingerprints above it change layer by layer, all the way up to the root node.&lt;/p&gt;

&lt;p&gt;This approach allows me to traverse from the root node downward, comparing hash fingerprint changes layer by layer to rapidly detect and pinpoint file modifications. There's no need to re-index the entire project - the efficiency gains are remarkable!&lt;/p&gt;
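
&lt;p&gt;The fingerprint idea can be sketched in a few lines. This is my own simplification - a real Merkle tree keeps intermediate folder nodes so changes can be localized, while here everything is folded straight into the root:&lt;/p&gt;

```typescript
import { createHash } from 'crypto';

// Hash one file's content: its leaf "fingerprint".
function fileHash(content: string): string {
  return createHash('sha256').update(content).digest('hex');
}

// Fold the per-file hashes into a single root fingerprint.
// Sorting by path keeps the root stable across traversal order.
function rootHash(files: { [path: string]: string }): string {
  const h = createHash('sha256');
  for (const p of Object.keys(files).sort()) {
    h.update(p).update(fileHash(files[p]));
  }
  return h.digest('hex');
}

const v1 = rootHash({ 'a.ts': 'export const x = 1;', 'b.ts': '// todo' });
const v2 = rootHash({ 'a.ts': 'export const x = 2;', 'b.ts': '// todo' });
console.log(v1 === v2); // false: one changed file changes the root
```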

&lt;h3&gt;
  
  
  2. Change Detection and Synchronization ⚡
&lt;/h3&gt;

&lt;p&gt;By default, synchronization checks occur every 5 minutes. The synchronization mechanism is clean and efficient, operating in three phases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🏃‍♂️ Phase One: Rapid Detection&lt;/strong&gt;&lt;br&gt;
Calculate the Merkle root hash of the entire codebase and compare it with the last saved snapshot. If the root hashes match, great news - nothing has changed, so we can skip the update entirely! This check completes in milliseconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔍 Phase Two: Precise Comparison&lt;/strong&gt;&lt;br&gt;
If the root hash differs, we enter precise comparison mode, performing detailed file-level analysis to identify exactly which files have changed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Newly added files&lt;/li&gt;
&lt;li&gt;Deleted files&lt;/li&gt;
&lt;li&gt;Modified files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🔄 Phase Three: Incremental Update&lt;/strong&gt;&lt;br&gt;
Only recalculate vectors for changed files, then update them in the vector database. Time and effort saved!&lt;/p&gt;
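
&lt;p&gt;Phase Two reduces to a set comparison over two path-to-hash tables. Here is a minimal sketch of that comparison (my own illustration of the idea, with hypothetical names):&lt;/p&gt;

```typescript
// A snapshot maps file paths to their content hashes.
type Snapshot = { [path: string]: string };

interface Changes {
  added: string[];
  removed: string[];
  modified: string[];
}

// Compare the previous snapshot with the current one to find
// exactly which files need re-indexing.
function diffSnapshots(prev: Snapshot, curr: Snapshot): Changes {
  const changes: Changes = { added: [], removed: [], modified: [] };
  for (const path of Object.keys(curr)) {
    if (!(path in prev)) {
      changes.added.push(path);       // newly added file
    } else if (prev[path] !== curr[path]) {
      changes.modified.push(path);    // content hash changed
    }
  }
  for (const path of Object.keys(prev)) {
    if (!(path in curr)) {
      changes.removed.push(path);     // deleted file
    }
  }
  return changes;
}

const before = { 'a.ts': 'h1', 'b.ts': 'h2' };
const after = { 'a.ts': 'h1x', 'c.ts': 'h3' };
console.log(diffSnapshots(before, after));
// added: c.ts, removed: b.ts, modified: a.ts
```

Only the files in `added` and `modified` get re-embedded; `removed` entries just have their vectors deleted.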
&lt;h3&gt;
  
  
  3. Local Snapshot Management
&lt;/h3&gt;

&lt;p&gt;All synchronization states are saved in the user's local &lt;code&gt;~/.context/merkle/&lt;/code&gt; directory. Each codebase has its own independent snapshot file containing file hash tables and serialized data of the Merkle tree. This way, even if the program restarts, it can accurately restore the previous synchronization state.&lt;/p&gt;

&lt;p&gt;The benefits of this design are substantial. Most of the time, it can detect no changes within milliseconds, and only genuinely modified files get reprocessed, eliminating unnecessary computations. Furthermore, the state persists perfectly even when the program is closed and reopened.&lt;/p&gt;

&lt;p&gt;From a user experience perspective, this means when you modify a function, the system will only re-index that file, not the entire project, greatly improving development efficiency.&lt;/p&gt;
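
&lt;p&gt;Persisting that state is plain JSON on disk. A hypothetical sketch - the per-codebase file naming here is my guess, not the project's exact scheme:&lt;/p&gt;

```typescript
import { createHash } from 'crypto';
import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';

type Snapshot = { [path: string]: string };

// One snapshot file per codebase; deriving the file name from the
// codebase path keeps snapshots from colliding. (Real layout may differ.)
function snapshotFile(baseDir: string, codebasePath: string): string {
  const id = createHash('md5').update(codebasePath).digest('hex');
  return join(baseDir, id + '.json');
}

function saveSnapshot(baseDir: string, codebasePath: string, snap: Snapshot): void {
  mkdirSync(baseDir, { recursive: true });
  writeFileSync(snapshotFile(baseDir, codebasePath), JSON.stringify(snap));
}

function loadSnapshot(baseDir: string, codebasePath: string): Snapshot {
  const file = snapshotFile(baseDir, codebasePath);
  if (!existsSync(file)) {
    return {}; // first run: nothing indexed yet
  }
  return JSON.parse(readFileSync(file, 'utf8'));
}

// The real tool stores this under ~/.context/merkle/; tmpdir keeps the demo safe.
const dir = join(tmpdir(), 'context-snapshot-demo');
saveSnapshot(dir, '/my/project', { 'a.ts': 'hash1' });
console.log(loadSnapshot(dir, '/my/project')); // state survives process restarts
```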
&lt;h2&gt;
  
  
  MCP Module
&lt;/h2&gt;
&lt;h3&gt;
  
  
  How to Design Tools? 🛠️
&lt;/h3&gt;

&lt;p&gt;The MCP module is the facade, directly facing users. In this module, &lt;strong&gt;user experience is paramount&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;First, consider tool design. If we abstract the common codebase behaviors of indexing and searching, two core tools immediately come to mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;index_codebase&lt;/code&gt; - Index codebase&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;search_code&lt;/code&gt; - Search code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But wait, do we need other tools?&lt;/p&gt;

&lt;p&gt;There's a delicate balance to strike: too many tools would include numerous edge-case functionalities that burden MCP clients and complicate LLM decision-making, while too few tools might omit essential functionality.&lt;/p&gt;

&lt;p&gt;Let me approach this by working backward from actual usage scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🤔 The Challenge of a Blocking Step&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some large codebases take a long time to index. It isn't the best user experience to block on this slow step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fui8m2fjfsz430ribpemh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fui8m2fjfsz430ribpemh.png" alt="Synchronous indexing workflow" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So we need asynchronous background processing. But MCP doesn't natively support background operations - how do we solve this?&lt;/p&gt;

&lt;p&gt;The solution: implement a background process within the MCP server to handle indexing, allowing the server to return an "indexing started" message immediately while users continue with other tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskqr7crqpbzhfmdgh6ev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskqr7crqpbzhfmdgh6ev.png" alt="Asynchronous background indexing workflow" width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Much better!&lt;/p&gt;

&lt;p&gt;But this design brings new problems: how does the user know the indexing progress?&lt;/p&gt;

&lt;p&gt;This naturally leads to needing a tool for querying indexing progress or status. The background indexing process asynchronously caches its progress, allowing users to check at any time: Is it 50% complete? Has it finished successfully? Did it fail?&lt;/p&gt;
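
&lt;p&gt;The shape of this non-blocking design can be sketched as follows: the indexing tool kicks off an async job and returns immediately, while the status tool reads shared progress state. The names and structure are illustrative, not claude-context's actual code:&lt;/p&gt;

```typescript
// Shared progress state that the status tool can read at any time.
const status = { state: 'idle', percent: 0 };

// Stand-in for the real per-file work (parse, embed, upsert vectors).
async function indexFile(_file: string) {
  await new Promise((resolve) => setTimeout(resolve, 1));
}

// Runs in the background; callers never await it.
async function runIndexing(files: string[]) {
  status.state = 'indexing';
  let done = 0;
  try {
    for (const file of files) {
      await indexFile(file);
      done += 1;
      status.percent = Math.round((done / files.length) * 100);
    }
    status.state = 'completed';
  } catch {
    status.state = 'failed';
  }
}

// Tool handler: fire and forget, respond immediately.
function indexCodebaseTool(files: string[]): string {
  void runIndexing(files); // not awaited: indexing continues in the background
  return 'indexing started';
}

// Tool handler: a cheap synchronous read of the cached progress.
function getIndexingStatusTool() {
  return { state: status.state, percent: status.percent };
}

console.log(indexCodebaseTool(['a.ts', 'b.ts', 'c.ts'])); // returns at once
```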

&lt;p&gt;We also need a tool for manually clearing the index. When users suspect the indexed codebase is inaccurate or want to re-index from scratch, they can manually clear everything and start fresh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final tool design:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;index_codebase&lt;/code&gt; - Index codebase&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;search_code&lt;/code&gt; - Search code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_indexing_status&lt;/code&gt; - Query indexing status&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clear_index&lt;/code&gt; - Clear index&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Four tools - clean and sufficient!&lt;/p&gt;
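
&lt;p&gt;To see how small this surface stays, the whole tool table fits in one dispatcher. The handler bodies below are stubs and the wiring to the MCP SDK is omitted - only the four tool names come from the design above:&lt;/p&gt;

```typescript
// The four tools mapped to handlers. In the real server these are
// registered with the MCP SDK; a plain dispatcher shows the surface.
const tools: { [name: string]: (args: { [k: string]: string }) => string } = {
  index_codebase: (args) => 'indexing started for ' + args.path,
  search_code: (args) => 'results for: ' + args.query,
  get_indexing_status: () => 'indexing in progress',
  clear_index: (args) => 'index cleared for ' + args.path,
};

function callTool(name: string, args: { [k: string]: string }): string {
  const handler = tools[name];
  if (!handler) {
    throw new Error('unknown tool: ' + name); // keep the surface closed
  }
  return handler(args);
}

console.log(callTool('search_code', { query: 'vector database operations' }));
```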
&lt;h3&gt;
  
  
  How to Manage Environment Variables? ⚙️
&lt;/h3&gt;

&lt;p&gt;Environment variable management is often overlooked but represents a critical user experience consideration.&lt;/p&gt;

&lt;p&gt;If every MCP client requires separate API key configuration, users switching between Claude Code and Gemini CLI would need to configure everything twice. That level of redundancy is simply unacceptable.&lt;/p&gt;

&lt;p&gt;I designed a global configuration solution to completely solve this pain point.&lt;/p&gt;

&lt;p&gt;The solution is simple: create a &lt;code&gt;~/.context/.env&lt;/code&gt; file in the user's home directory as global configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ~/.context/.env&lt;/span&gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-api-key-here
&lt;span class="nv"&gt;MILVUS_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-milvus-token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The benefits of this approach are obvious:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure once and use it across all MCP clients&lt;/li&gt;
&lt;li&gt;All configuration is centralized in one place, making it easy to maintain&lt;/li&gt;
&lt;li&gt;Sensitive API keys aren't scattered across various configuration files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I then implemented a three-tier priority system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Highest Priority&lt;/strong&gt;: Process environment variables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium Priority&lt;/strong&gt;: Global configuration file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lowest Priority&lt;/strong&gt;: Default values&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This design is very flexible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers can use environment variables to override configurations during temporary testing&lt;/li&gt;
&lt;li&gt;In production environments, sensitive configurations can be injected through system environment variables to ensure security&lt;/li&gt;
&lt;/ul&gt;
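
&lt;p&gt;The three tiers reduce to a single lookup function. A minimal sketch, with the .env parsing simplified and a hypothetical &lt;code&gt;DEMO_API_KEY&lt;/code&gt; used for illustration:&lt;/p&gt;

```typescript
// Parse KEY=value lines from a .env-style string (comments skipped).
function parseEnvFile(text: string): { [k: string]: string } {
  const out: { [k: string]: string } = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '') continue;
    if (trimmed[0] === '#') continue;
    const eq = trimmed.indexOf('=');
    if (eq > 0) {
      out[trimmed.slice(0, eq)] = trimmed.slice(eq + 1);
    }
  }
  return out;
}

// Three-tier lookup: process env beats the global file, which beats defaults.
function resolveConfig(
  key: string,
  globalFile: { [k: string]: string },
  fallback: string,
): string {
  const fromProcess = process.env[key];
  if (fromProcess) return fromProcess;         // 1. highest: env var
  if (globalFile[key]) return globalFile[key]; // 2. medium: ~/.context/.env
  return fallback;                             // 3. lowest: default
}

const globalCfg = parseEnvFile('# ~/.context/.env\nDEMO_API_KEY=key-from-file\n');
console.log(resolveConfig('DEMO_API_KEY', globalCfg, 'none'));
// 'key-from-file' when DEMO_API_KEY is not set in the process env
```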

&lt;p&gt;Users configure once and can seamlessly use it across multiple tools like Claude Code and Gemini CLI, significantly lowering the barrier to entry!&lt;/p&gt;

&lt;p&gt;With this, we've completed the core architecture design of the MCP server. From code parsing and vector storage to intelligent retrieval and configuration management, every component has been carefully designed and optimized. The resulting system is both powerful and user-friendly.&lt;/p&gt;

&lt;p&gt;So how does this solution perform in actual use? Let's look at the final results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results Showcase 🎉
&lt;/h2&gt;

&lt;p&gt;After building the architecture code, I &lt;a href="https://github.com/zilliztech/claude-context" rel="noopener noreferrer"&gt;open-sourced&lt;/a&gt; the project and published it to the &lt;a href="https://www.npmjs.com/package/@zilliz/claude-context-mcp" rel="noopener noreferrer"&gt;npm registry&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Installation and usage couldn't be simpler - just run a single command before starting Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add claude-context &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-openai-api-key &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MILVUS_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-zilliz-cloud-api-key &lt;span class="nt"&gt;--&lt;/span&gt; npx @zilliz/claude-context-mcp@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's see it in action! I asked it to help me find that bug from earlier. First, I indexed the current codebase, then asked it to locate this bug based on a specific description.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfauhgowqrz5ininyxim.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfauhgowqrz5ininyxim.gif" alt="Powered by claude-context MCP Claude Code found the bug" width="720" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results were impressive! Using claude-context MCP tool calls, it successfully identified the exact file and line number of the bug and provided detailed explanations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perfect!&lt;/strong&gt; 🎯&lt;/p&gt;

&lt;p&gt;With claude-context MCP integrated, Claude Code can leverage it across numerous scenarios, gaining access to higher quality and more precise contextual information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐛 Issue fixing&lt;/li&gt;
&lt;li&gt;🔧 Code refactoring&lt;/li&gt;
&lt;li&gt;🔍 Duplicate code detection&lt;/li&gt;
&lt;li&gt;🧪 Code testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Open Source Community Feedback
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxflij7ceikq20fpwjqpq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxflij7ceikq20fpwjqpq.png" alt="GitHub repo star history" width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/zilliztech/claude-context" rel="noopener noreferrer"&gt;https://github.com/zilliztech/claude-context&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since open-sourcing the code, I've received tremendous feedback and suggestions from the community. I'm incredibly grateful for everyone's support and valuable input. The project has now earned &lt;strong&gt;2.6k+&lt;/strong&gt; stars.&lt;/p&gt;

&lt;p&gt;Many users have requested benchmark evaluations to quantify claude-context's improvements. This area is currently under active testing and experimentation. We already have a preliminary finding: under equivalent recall rate conditions, claude-context can reduce token consumption by 40% compared to baseline approaches.&lt;/p&gt;

&lt;p&gt;This translates to 40% savings in both time and cost.&lt;/p&gt;

&lt;p&gt;Conversely, with equivalent limited token budgets, claude-context MCP delivers superior retrieval performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzm1ano9fha73m20sp26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzm1ano9fha73m20sp26.png" alt="Performance benchmark results" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additional testing details will be published in the GitHub repository as they become available.&lt;/p&gt;

&lt;h2&gt;
  
  
  Epilogue
&lt;/h2&gt;

&lt;p&gt;This project evolved from a simple idea into a comprehensive code retrieval solution. The journey was challenging yet incredibly rewarding. By leveraging MCP architecture and Zilliz Cloud's vector database, we successfully addressed the core pain points of code retrieval in Claude Code.&lt;/p&gt;

&lt;p&gt;Looking ahead, we plan to continue optimizing retrieval algorithms, expand support for additional programming languages, and continuously enhance the user experience. We welcome everyone to try it out, test it thoroughly, and share feedback. We're also excited to collaborate with more developers who want to contribute code and help make claude-context even more robust and powerful!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
