<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: WonderLab</title>
    <description>The latest articles on DEV Community by WonderLab (@wonderlab).</description>
    <link>https://dev.to/wonderlab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3797373%2F25beba30-d8d4-4d2e-9ec6-170356089350.jpg</url>
      <title>DEV Community: WonderLab</title>
      <link>https://dev.to/wonderlab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wonderlab"/>
    <language>en</language>
    <item>
      <title>RAG Series (4): Document Processing — From Raw Files to High-Quality Chunks</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Sat, 02 May 2026 03:43:58 +0000</pubDate>
      <link>https://dev.to/wonderlab/rag-series-4-document-processing-from-raw-files-to-high-quality-chunks-4ec0</link>
      <guid>https://dev.to/wonderlab/rag-series-4-document-processing-from-raw-files-to-high-quality-chunks-4ec0</guid>
      <description>&lt;h2&gt;
  
  
  Why "How You Cut" Matters as Much as "What You Cut"
&lt;/h2&gt;

&lt;p&gt;In the first three articles, we built a working RAG pipeline and tuned the core parameters. But if you look closely at the retrieval results, you may notice a strange phenomenon:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The answer is clearly in the document, yet the Retriever can't find it. Or it finds it, but the answer is cut in half — the LLM only sees the first half of the sentence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem usually lies in the &lt;strong&gt;chunking&lt;/strong&gt; step.&lt;/p&gt;

&lt;p&gt;Chunking is essentially an &lt;strong&gt;information splitting strategy&lt;/strong&gt; — how you divide a 500-page book, how large each piece is, and where you make the cuts directly determine whether the reader (here, the Retriever) can quickly find what they need.&lt;/p&gt;

&lt;p&gt;In this article, we'll process the &lt;strong&gt;same technical document&lt;/strong&gt; with four different strategies so you can see the dramatic differences that "how you cut" makes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📎 &lt;strong&gt;Source Code&lt;/strong&gt;: All experiment code is open-sourced at &lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/04-chunking-strategies" rel="noopener noreferrer"&gt;&lt;code&gt;llm-in-action/04-chunking-strategies&lt;/code&gt;&lt;/a&gt;. Clone it to reproduce the results.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Four Chunking Strategies at a Glance
&lt;/h2&gt;

&lt;p&gt;Before diving in, here's a quick reference table to build intuition:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Core Idea&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fixed Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cut at fixed character intervals, like scissors cutting paper&lt;/td&gt;
&lt;td&gt;Simple, uniform chunk sizes&lt;/td&gt;
&lt;td&gt;May cut through sentences, poor semantic integrity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recursive Character&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Try separators in priority order: paragraph → line → sentence → word&lt;/td&gt;
&lt;td&gt;Balances semantics and uniformity&lt;/td&gt;
&lt;td&gt;Limited Chinese support (uses English punctuation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic Chunking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compute semantic similarity between adjacent sentences, cut where similarity drops&lt;/td&gt;
&lt;td&gt;Highly semantically coherent chunks&lt;/td&gt;
&lt;td&gt;Requires Embedding API, higher cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Document Structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Split by Markdown/HTML heading hierarchy&lt;/td&gt;
&lt;td&gt;Preserves document structure, retrieved chunks carry chapter context&lt;/td&gt;
&lt;td&gt;Only works for structured documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Experimental Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test Document and Source Code
&lt;/h3&gt;

&lt;p&gt;The full runnable code is available at &lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/04-chunking-strategies" rel="noopener noreferrer"&gt;&lt;code&gt;llm-in-action/04-chunking-strategies&lt;/code&gt;&lt;/a&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;chunking_compare.py&lt;/code&gt; — The 4-strategy comparison script&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;data/sample-tech-doc.md&lt;/code&gt; — Sample Markdown technical document&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.env.example&lt;/code&gt; — Environment variable template (SemanticChunker requires an Embedding API)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Test Document
&lt;/h4&gt;

&lt;p&gt;We'll use a ~5,400-character Markdown technical document titled "Microservices Architecture Design Guide," containing 7 top-level chapters with multiple level-2 and level-3 headings, covering service decomposition, communication protocols, data consistency, observability, security, and deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy Configurations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Key Configuration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed Size&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CharacterTextSplitter(chunk_size=512, chunk_overlap=50)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recursive Character&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50, separators=["\n\n", "\n", ". ", " ", ""])&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SemanticChunker(embeddings, breakpoint_threshold_type="percentile", breakpoint_threshold_amount=85, sentence_split_regex=r"(?&amp;lt;=[。！？.?!])\s+", buffer_size=0)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Structure&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")])&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;About &lt;code&gt;buffer_size=0&lt;/code&gt;: SemanticChunker defaults to concatenating neighboring sentences before computing embeddings (&lt;code&gt;buffer_size=1&lt;/code&gt; means 1 sentence on each side). But SiliconFlow's BGE model limits single inputs to &amp;lt; 512 tokens, so concatenation often exceeds this. Setting it to 0 makes each sentence independent — we lose some context, but it runs stably.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Strategy 1: Fixed-Size Chunking
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Principle
&lt;/h3&gt;

&lt;p&gt;The most brute-force approach: cut at fixed-length intervals regardless of content.&lt;/p&gt;

&lt;p&gt;Imagine using scissors to snip every 512 characters. Simple and efficient, but you might cut right through the middle of a sentence.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CharacterTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Prefer line breaks; hard-cut if none exist
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk count&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average length&lt;/td&gt;
&lt;td&gt;453.5 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max length&lt;/td&gt;
&lt;td&gt;506 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min length&lt;/td&gt;
&lt;td&gt;128 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;First 3 chunks:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Chunk 1 (489 chars):
&lt;span class="gh"&gt;# Microservices Architecture Design Guide This article covers...&lt;/span&gt;

Chunk 2 (504 chars):
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Read Service vs Write Service**&lt;/span&gt;: In read-heavy scenarios...

Chunk 3 (457 chars):
&lt;span class="gs"&gt;**gRPC**&lt;/span&gt; is based on HTTP/2 and Protocol Buffers. Advantages:...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Notice how Chunk 2 starts: &lt;code&gt;- **Read Service vs Write Service**...&lt;/code&gt; — this is the middle of a list item. Fixed-size chunking brutally cut off the list at the end of the previous chunk, so Chunk 2 starts with an incomplete list item. If the user asks "What are the advantages of read-write separation?", the Retriever might return this chunk, but the LLM sees incomplete information.&lt;/p&gt;
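<br></br>
&lt;p&gt;You can spot such broken chunks programmatically with a rough heuristic. Here's a small sketch (our own check, not part of the comparison script) that flags chunks whose first character looks like a mid-sentence or mid-list continuation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough heuristic (ours): flag chunks whose first character suggests
# they start mid-sentence or mid-list item
def flag_broken_starts(chunks):
    suspects = []
    for i, chunk in enumerate(chunks):
        first = chunk.page_content.lstrip()[:1]
        # lowercase letters and continuation punctuation are red flags
        if first and (first.islower() or first in "-*,;)"):
            suspects.append(i)
    return suspects

print(flag_broken_starts(chunks))  # indices worth inspecting by eye
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;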




&lt;h2&gt;
  
  
  Strategy 2: Recursive Character Chunking
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Principle
&lt;/h3&gt;

&lt;p&gt;Slightly smarter than fixed-size: it has a priority list of separators and tries them in order — first by paragraph (&lt;code&gt;\n\n&lt;/code&gt;), then by line (&lt;code&gt;\n&lt;/code&gt;), then by sentence (&lt;code&gt;. &lt;/code&gt;), then by word (&lt;code&gt; &lt;/code&gt;), and finally character by character (&lt;code&gt;""&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Like an experienced editor: prefer cutting at paragraph boundaries, fall back to sentence boundaries if necessary, and never cut in the middle of a word.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk count&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average length&lt;/td&gt;
&lt;td&gt;431.5 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max length&lt;/td&gt;
&lt;td&gt;507 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min length&lt;/td&gt;
&lt;td&gt;88 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;First 3 chunks:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Chunk 1 (441 chars):
&lt;span class="gh"&gt;# Microservices Architecture Design Guide  This article covers...&lt;/span&gt;

Chunk 2 (452 chars):
&lt;span class="gu"&gt;### 1.2 Split by Technical Characteristics  Besides business boundaries...&lt;/span&gt;

Chunk 3 (457 chars):
The most common synchronous communication methods between microservices are...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Improvement over fixed-size:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chunk 2 now starts with &lt;code&gt;### 1.2 Split by Technical Characteristics&lt;/code&gt; — a complete heading. Recursive character chunking successfully cut at a heading boundary instead of slicing through a list item.&lt;/p&gt;

&lt;p&gt;But note that the &lt;code&gt;separators&lt;/code&gt; list uses &lt;code&gt;.&lt;/code&gt; (English period + space), so for Chinese documents, it won't split on Chinese periods (。). Its behavior on Chinese text is therefore close to fixed-size, relying mainly on &lt;code&gt;\n\n&lt;/code&gt; and &lt;code&gt;\n&lt;/code&gt;.&lt;/p&gt;
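<br></br>
&lt;p&gt;If your documents are Chinese, the usual fix is to put full-width Chinese punctuation ahead of the English period in the separator list. A minimal sketch (assuming the corpus uses 。！？ as sentence terminators):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import RecursiveCharacterTextSplitter

# Sketch: full-width Chinese punctuation added ahead of ". " so Chinese
# sentence endings also become valid cut points
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", "。", "！", "？", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;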




&lt;h2&gt;
  
  
  Strategy 3: Semantic Chunking
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Principle
&lt;/h3&gt;

&lt;p&gt;The previous two strategies cut by length. Semantic chunking cuts by meaning.&lt;/p&gt;

&lt;p&gt;Here's how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Split the document into sentences&lt;/li&gt;
&lt;li&gt;Compute each sentence's embedding (semantic vector)&lt;/li&gt;
&lt;li&gt;Compare semantic similarity between adjacent sentences&lt;/li&gt;
&lt;li&gt;If similarity suddenly drops (below the threshold), cut there&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Imagine watching a movie where the scene suddenly shifts from an office to a beach — that's a semantic boundary. Semantic chunking recognizes these "scene changes" and ensures each chunk discusses one coherent topic.&lt;/p&gt;
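<br></br>
&lt;p&gt;Before looking at LangChain's implementation below, here is the core computation in about ten lines (a sketch only: &lt;code&gt;embed&lt;/code&gt; stands for any function mapping sentences to vectors, and the real SemanticChunker adds buffering and several threshold types on top):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def semantic_breakpoints(sentences, embed, percentile=85):
    """Sketch: return indices where a new chunk should start."""
    vecs = np.asarray(embed(sentences), dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)  # neighbor cosine similarity
    dists = 1.0 - sims                         # semantic "distance"
    cut = np.percentile(dists, percentile)     # e.g. the 85th percentile
    return [i + 1 for i, d in enumerate(dists) if d &amp;gt; cut]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;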

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_experimental.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SemanticChunker&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;

&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-large-zh-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMBEDDING_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMBEDDING_API_BASE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.siliconflow.cn/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# SiliconFlow limits batch_size to 32
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Key: Custom Chinese sentence-splitting regex, or SemanticChunker defaults to English punctuation only
&lt;/span&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticChunker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;breakpoint_threshold_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;percentile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;breakpoint_threshold_amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sentence_split_regex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?&amp;lt;=[。！？.?!])\s+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Avoid exceeding 512-token limit when concatenating sentences
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pitfalls We Hit
&lt;/h3&gt;

&lt;p&gt;Implementing semantic chunking, we ran into three issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall 1: Batch size exceeded&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValueError: input batch size 1000 &amp;gt; maximum allowed batch size 32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Fix: &lt;code&gt;OpenAIEmbeddings(chunk_size=32)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall 2: Single-input token limit exceeded&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error code: 413 - input must have less than 512 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Fix: Set &lt;code&gt;buffer_size=0&lt;/code&gt; to prevent SemanticChunker from concatenating neighboring sentences&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitfall 3: Empty strings cause 400 errors&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error code: 400 - The parameter is invalid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ Fix: Subclass &lt;code&gt;SemanticChunker&lt;/code&gt; and override &lt;code&gt;_get_single_sentences_list&lt;/code&gt; to filter empty strings&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FilteredSemanticChunker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SemanticChunker&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_single_sentences_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sentence_split_regex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk count&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;9&lt;/strong&gt; (fewest)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average length&lt;/td&gt;
&lt;td&gt;590.9 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max length&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2047 chars&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min length&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17 chars&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key Finding:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Semantic chunking produces the fewest chunks (9), but with extreme size variation — smallest 17 chars, largest 2047 chars. This confirms it's truly grouping by semantic boundaries: semantically similar sentences are merged into large chunks, while topic transitions become tiny chunks.&lt;/p&gt;

&lt;p&gt;For example, the entire "Service Communication" chapter (REST vs gRPC vs message queues) was aggregated into one 1,189-character chunk — because it all discusses the same topic. Transition sentences between chapters became tiny fragments (like a 28-character decision tree snippet).&lt;/p&gt;




&lt;h2&gt;
  
  
  Strategy 4: Document Structure Chunking (Markdown Header)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Principle
&lt;/h3&gt;

&lt;p&gt;The first three strategies work "blindfolded" — they don't know the document structure and rely purely on text features. Document structure chunking, in contrast, keeps its eyes open: it recognizes Markdown &lt;code&gt;#&lt;/code&gt;, &lt;code&gt;##&lt;/code&gt;, &lt;code&gt;###&lt;/code&gt; headings and splits strictly by heading hierarchy.&lt;/p&gt;

&lt;p&gt;Each chunk's boundary is a heading boundary: starts at one heading, ends before the next heading at the same or higher level.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MarkdownHeaderTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MarkdownHeaderTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;headers_to_split_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;##&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;###&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;strip_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Keep headings inside chunk content
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk count&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;20&lt;/strong&gt; (most)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average length&lt;/td&gt;
&lt;td&gt;266.5 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max length&lt;/td&gt;
&lt;td&gt;402 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min length&lt;/td&gt;
&lt;td&gt;71 chars&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key Finding:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Document structure chunking produces the most chunks (20), but each one carries an "ID card" — metadata recording the chain of headings it falls under:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Microservices Architecture Design Guide&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1. Service Decomposition Strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.1 Split by Business Boundary (DDD)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means during retrieval, you get not just the content but also its chapter origin. This is extremely valuable for &lt;strong&gt;citation tracing&lt;/strong&gt; ("The answer comes from Chapter X of the document").&lt;/p&gt;
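<br></br>
&lt;p&gt;A hypothetical helper (our own, not from the library) that turns this metadata into a citation string:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cite(chunk):
    """Hypothetical helper: build a citation from the header metadata."""
    headers = [v for k, v in sorted(chunk.metadata.items())
               if k.startswith("Header")]
    return "Source: " + " &amp;gt; ".join(headers)

# e.g. "Source: Microservices Architecture Design Guide &amp;gt; 1. Service
# Decomposition Strategy &amp;gt; 1.1 Split by Business Boundary (DDD)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;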




&lt;h2&gt;
  
  
  Side-by-Side Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Statistics Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Chunks&lt;/th&gt;
&lt;th&gt;Avg Length&lt;/th&gt;
&lt;th&gt;Median&lt;/th&gt;
&lt;th&gt;Max&lt;/th&gt;
&lt;th&gt;Min&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed Size&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;453.5&lt;/td&gt;
&lt;td&gt;476.5&lt;/td&gt;
&lt;td&gt;506&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recursive Character&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;431.5&lt;/td&gt;
&lt;td&gt;457.0&lt;/td&gt;
&lt;td&gt;507&lt;/td&gt;
&lt;td&gt;88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;590.9&lt;/td&gt;
&lt;td&gt;422.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2047&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Structure&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;266.5&lt;/td&gt;
&lt;td&gt;259.0&lt;/td&gt;
&lt;td&gt;402&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Retrieval Difference for the Same Query
&lt;/h3&gt;

&lt;p&gt;Suppose the user asks: &lt;strong&gt;"What are the anti-patterns of microservice decomposition?"&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Retrieved Chunk&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed Size&lt;/td&gt;
&lt;td&gt;Chunk 4 (contains partial anti-pattern content, but starts mid-sentence)&lt;/td&gt;
&lt;td&gt;List item starts in the middle; LLM lacks full context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recursive Character&lt;/td&gt;
&lt;td&gt;Chunk 5 (fully contains "1.3 Common Anti-patterns" section)&lt;/td&gt;
&lt;td&gt;Good, but may truncate if the section is long&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic&lt;/td&gt;
&lt;td&gt;Chunk 3 (aggregates anti-patterns + some following content)&lt;/td&gt;
&lt;td&gt;May include irrelevant content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Structure&lt;/td&gt;
&lt;td&gt;Chunk 6 (exactly matches "### 1.3 Common Anti-patterns")&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Best&lt;/strong&gt; — precise structural match&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Strategy Selection Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended Strategy&lt;/th&gt;
&lt;th&gt;Reasoning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;General technical docs (PDF/Word)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Recursive Character&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Most reliable baseline, no special formatting required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown / Papers / Books&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Document Structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Preserves chapter structure, retrievable with provenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminology-dense docs (legal/medical)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Semantic Chunking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semantically coherent chunks, reduces cross-topic noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ultra-high-speed chunking (real-time)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fixed Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero computation overhead, pure string operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code documentation&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Recursive Character&lt;/strong&gt; + custom separators&lt;/td&gt;
&lt;td&gt;Split by function/class boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Selection Advice
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: Start with recursive character chunking as your baseline
    ↓
Step 2: If documents are Markdown/HTML, try document structure chunking
    ↓
Step 3: If retrieval quality is unsatisfactory, upgrade to semantic chunking
         (highest cost but best quality)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
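<br></br>
&lt;p&gt;If you prefer the same flow as code, here's an illustrative dispatcher (&lt;code&gt;pick_splitter&lt;/code&gt; is our own name, not a library function):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

def pick_splitter(path: str):
    """Illustrative dispatcher for the steps above (hypothetical helper)."""
    if path.endswith((".md", ".markdown")):
        # Step 2: structure-aware splitting (note: this splitter exposes
        # split_text, not split_documents)
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"),
                                 ("###", "Header 3")],
            strip_headers=False,
        )
    # Step 1 baseline; upgrade to SemanticChunker only if retrieval
    # quality is still unsatisfactory (Step 3)
    return RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;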






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This article used the same document and four strategies to show you how "how you cut" affects RAG quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fixed Size&lt;/strong&gt;: Simple but brutal. Good for rapid prototyping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive Character&lt;/strong&gt;: The most universal baseline. Sufficient for 80% of scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Chunking&lt;/strong&gt;: Best quality but highest cost. Use when precision is critical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Structure&lt;/strong&gt;: Best choice for structured documents. Retrieved chunks carry built-in context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaway:&lt;/strong&gt; There is no perfect chunking strategy — only the strategy that fits your document type and business scenario. In real projects, use the comparison script from this article, run it on your own documents, and let the data guide your decision.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>llm</category>
      <category>chunk</category>
    </item>
    <item>
      <title>RAG Series (3): Tuning These 4 Parameters to Go From 'It Works' to 'It Works Well'</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Sat, 02 May 2026 02:49:49 +0000</pubDate>
      <link>https://dev.to/wonderlab/rag-series-3-tuning-these-4-parameters-to-go-from-it-works-to-it-works-well-2kp6</link>
      <guid>https://dev.to/wonderlab/rag-series-3-tuning-these-4-parameters-to-go-from-it-works-to-it-works-well-2kp6</guid>
      <description>&lt;h2&gt;
  
  
  Why Does Your RAG Give Wrong Answers When Someone Else's Doesn't?
&lt;/h2&gt;

&lt;p&gt;In the first two articles, we built a RAG pipeline that runs. But many people find that while the code works, &lt;strong&gt;answer quality is inconsistent&lt;/strong&gt; — sometimes spot-on, sometimes missing information that's clearly in the document, sometimes drifting off-topic even when the right chunks were retrieved.&lt;/p&gt;

&lt;p&gt;The problem is usually not the code. It's the &lt;strong&gt;parameters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;RAG has four core parameters, like four knobs on a radio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chunk Size&lt;/strong&gt;: How long is each text chunk?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk Overlap&lt;/strong&gt;: How much do adjacent chunks overlap?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-K&lt;/strong&gt;: How many chunks does the retriever return?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding Model&lt;/strong&gt;: How is text converted into vectors?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of these four parameters directly determines whether the system can find relevant information and whether that information is enough to answer the question. In this article, we'll use a &lt;strong&gt;controlled-variable experiment&lt;/strong&gt; so you can see the effect of different parameters with your own eyes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parameter 1: Chunk Size — How Long Is Each Chunk?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is Chunk Size?
&lt;/h3&gt;

&lt;p&gt;Imagine you're organizing a 500-page technical manual. Chunk Size is how many pages you read at a time — 1 page, 5 pages, or 50 pages?&lt;/p&gt;

&lt;p&gt;In RAG, Chunk Size is the &lt;strong&gt;maximum number of characters&lt;/strong&gt; (or tokens) in each text chunk. The document is cut into many chunks, each no longer than this limit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Does It Matter?
&lt;/h3&gt;

&lt;p&gt;Chunk Size directly impacts two metrics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunk Size&lt;/th&gt;
&lt;th&gt;Retrieval Precision&lt;/th&gt;
&lt;th&gt;Context Completeness&lt;/th&gt;
&lt;th&gt;Analogy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Small (128)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;td&gt;Like reading dictionary entries — precise but isolated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium (512)&lt;/td&gt;
&lt;td&gt;Balanced&lt;/td&gt;
&lt;td&gt;Balanced&lt;/td&gt;
&lt;td&gt;Like reading a paragraph — enough context without bloat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large (2048)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Like reading an entire chapter — complete but noisy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What's wrong with too small?&lt;/strong&gt; Suppose the document says: "The system uses Redis for caching with a default TTL of 3600 seconds. If this timeout is exceeded, data is automatically purged." If Chunk Size=128, this sentence might be split into two chunks: "The system uses Redis for caching with a default TTL of 3600 seconds." and "If this timeout is exceeded, data is automatically purged." When the user asks "What happens when Redis cache expires?", the retriever might only return the first chunk. The LLM sees "3600 seconds" but doesn't know about "automatically purged" — the answer is incomplete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's wrong with too large?&lt;/strong&gt; Suppose Chunk Size=2048, and one chunk contains five unrelated topics. When the user asks a specific question, this chunk gets retrieved, but the LLM's attention is scattered by irrelevant content — like trying to hear one person speak in a noisy marketplace.&lt;/p&gt;
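<br></br>
&lt;p&gt;You can reproduce the "too small" failure in a few lines. A sketch (the sizes are chosen so the cut lands exactly between the two sentences):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import CharacterTextSplitter

text = ("The system uses Redis for caching with a default TTL of 3600 seconds. "
        "If this timeout is exceeded, data is automatically purged.")
splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0, separator=". ")
for c in splitter.split_text(text):
    print(repr(c))
# The TTL and the purge behavior land in different chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;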

&lt;h3&gt;
  
  
  How to Choose?
&lt;/h3&gt;

&lt;p&gt;There's no silver bullet, but there's a rule of thumb:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk Size ≈ 1.5 ~ 2 × the expected answer length
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Document Type&lt;/th&gt;
&lt;th&gt;Recommended Chunk Size&lt;/th&gt;
&lt;th&gt;Reasoning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FAQ / Q&amp;amp;A pairs&lt;/td&gt;
&lt;td&gt;256 ~ 384&lt;/td&gt;
&lt;td&gt;Short answers, precision matters more&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical docs / API manuals&lt;/td&gt;
&lt;td&gt;512 ~ 768&lt;/td&gt;
&lt;td&gt;Medium-length answers, need some context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Papers / book chapters&lt;/td&gt;
&lt;td&gt;1024 ~ 1536&lt;/td&gt;
&lt;td&gt;Argument-heavy, need large context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal contracts / medical records&lt;/td&gt;
&lt;td&gt;768 ~ 1024&lt;/td&gt;
&lt;td&gt;Dense terminology, need inference from context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heuristic&lt;/strong&gt;: Start with 512, then observe retrieval results. If you notice "answers are cut off", increase it. If you notice "retrieved chunks contain too much irrelevant content", decrease it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Parameter 2: Chunk Overlap — How Much Should Adjacent Chunks Overlap?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is Chunk Overlap?
&lt;/h3&gt;

&lt;p&gt;Back to that technical manual. If you read 5 pages at a time, Overlap is how many pages from the previous chunk you keep when starting the next one. For example, Overlap=1 means: first read pages 1-5, then read pages 5-9 (page 5 appears twice).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Is Overlap Needed?
&lt;/h3&gt;

&lt;p&gt;Without overlap, critical information can get "cut at the seam":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk A: "The system uses Redis for caching with a default TTL of 3600 seconds."
Chunk B: "If this timeout is exceeded, data is automatically purged."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the user asks "What happens when Redis cache expires?", the embedding model might score Chunk B as more relevant (its wording about the timeout being exceeded sits closest to "expires") and return only Chunk B. But Chunk B starts with "If this timeout is exceeded" — without Chunk A, the LLM doesn't know what "this timeout" refers to.&lt;/p&gt;

&lt;p&gt;With Overlap=50, Chunk B starts with the last 50 characters of Chunk A:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk B (with overlap): "...default TTL of 3600 seconds. If this timeout is exceeded, data is automatically purged."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now even if only Chunk B is retrieved, the LLM can infer "this timeout = 3600 seconds".&lt;/p&gt;
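<br></br>
&lt;p&gt;Here's a runnable sketch of the difference. With no newlines in the text, the splitter falls back to word-level splits, which makes the overlap window easy to see:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import RecursiveCharacterTextSplitter

text = ("The system uses Redis for caching with a default TTL of 3600 seconds. "
        "If this timeout is exceeded, data is automatically purged.")
for overlap in (0, 50):
    splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=overlap)
    print(f"overlap={overlap}:", splitter.split_text(text))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;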

&lt;h3&gt;
  
  
  How Much Overlap?
&lt;/h3&gt;

&lt;p&gt;Generally set to &lt;strong&gt;10% ~ 20%&lt;/strong&gt; of Chunk Size:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunk Size&lt;/th&gt;
&lt;th&gt;Recommended Overlap&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;25 ~ 50&lt;/td&gt;
&lt;td&gt;Short text, small overlap preserves context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;50 ~ 100&lt;/td&gt;
&lt;td&gt;The sweet spot for general use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;100 ~ 200&lt;/td&gt;
&lt;td&gt;Long text needs more overlap to preserve continuity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: More overlap is not always better. Too much overlap leads to storing massive amounts of duplicate content in the vector database, increasing storage cost and deduplication burden during retrieval.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Parameter 3: Top-K — How Many Chunks to Retrieve?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is Top-K?
&lt;/h3&gt;

&lt;p&gt;Top-K is the number of text chunks the retriever returns each time. K=4 means "give me the 4 most relevant chunks", K=10 means "give me the 10 most relevant chunks".&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Does It Matter?
&lt;/h3&gt;

&lt;p&gt;K too small = missing information. K too large = introducing noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario A: K=2, missing critical information&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;User asks: "How do I configure the database connection pool and log level?" This question involves two topics. If K=2, the retriever might only return two chunks about "database connection pool" and completely miss "log level" — the LLM can only answer half the question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario B: K=20, noise drowning out the answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;User asks: "What's the default timeout?" The document has a clear answer. But K=20 retrieves 20 chunks, 19 of which discuss unrelated topics. The LLM's context window is filled with irrelevant content, and it can't find that simple number.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Choose?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Top-K = Number of topics the answer is expected to cover × 2 ~ 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query Type&lt;/th&gt;
&lt;th&gt;Recommended K&lt;/th&gt;
&lt;th&gt;Reasoning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single-point fact ("What's the default port?")&lt;/td&gt;
&lt;td&gt;3 ~ 5&lt;/td&gt;
&lt;td&gt;Focused answer, fewer is better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-condition ("How do I configure A and B?")&lt;/td&gt;
&lt;td&gt;5 ~ 8&lt;/td&gt;
&lt;td&gt;Might involve multiple topics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Comprehensive summary ("Summarize Chapter 3")&lt;/td&gt;
&lt;td&gt;8 ~ 12&lt;/td&gt;
&lt;td&gt;Need to cover multiple points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Heuristic&lt;/strong&gt;: Start with K=4. If you notice "the answer is missing a part", increase it. If you notice "the answer contains irrelevant content", decrease it.&lt;/p&gt;
&lt;/blockquote&gt;
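<br></br>
&lt;p&gt;In LangChain, K is just a retriever argument, so adjusting it is a one-line change. A sketch, assuming &lt;code&gt;vectorstore&lt;/code&gt; is the Chroma index built earlier in this series:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# K is passed via search_kwargs when turning a vector store into a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Re-ask the same question with a wider K to compare answers
retriever_wide = vectorstore.as_retriever(search_kwargs={"k": 8})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;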




&lt;h2&gt;
  
  
  Parameter 4: Embedding Model — Who Does the "Semantic Translation"?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Embedding Is RAG's "Translator"
&lt;/h3&gt;

&lt;p&gt;What an embedding model does is simple: convert text into a sequence of numbers (a vector). Semantically similar texts have vectors that are close together; semantically dissimilar texts have vectors that are far apart.&lt;/p&gt;

&lt;p&gt;The retriever relies on this — it converts the user's question into a vector, then finds the nearest vectors in the database.&lt;/p&gt;
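<br></br>
&lt;p&gt;To see this concretely, embed a question plus one related and one unrelated sentence, then compare cosine similarities. A sketch, where &lt;code&gt;embeddings&lt;/code&gt; is any LangChain embedding object (like the ones configured later in this article):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def cos(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v_q, v_hit, v_miss = embeddings.embed_documents([
    "What happens when the Redis cache expires?",
    "Data is automatically purged after the TTL elapses.",
    "The billing module supports monthly invoicing.",
])
print(cos(v_q, v_hit))   # expected: relatively high
print(cos(v_q, v_miss))  # expected: noticeably lower
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;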

&lt;h3&gt;
  
  
  How Big Is the Difference Between Models?
&lt;/h3&gt;

&lt;p&gt;Very big. For the same question, different models can return completely different results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Strong Language&lt;/th&gt;
&lt;th&gt;Dimensions&lt;/th&gt;
&lt;th&gt;Positioning&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;text-embedding-3-small&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;1536&lt;/td&gt;
&lt;td&gt;Cheap &amp;amp; fast&lt;/td&gt;
&lt;td&gt;English docs, budget-sensitive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;text-embedding-3-large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;3072&lt;/td&gt;
&lt;td&gt;High precision&lt;/td&gt;
&lt;td&gt;English docs, precision-first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BAAI/bge-large-zh-v1.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chinese&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;Best for Chinese&lt;/td&gt;
&lt;td&gt;Chinese docs, China-first choice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BAAI/bge-m3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;Mixed Chinese-English, cross-lingual&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  A Real Comparison Experiment
&lt;/h3&gt;

&lt;p&gt;We use the same Chinese technical document (&lt;em&gt;Automotive SPICE PAM v4.0&lt;/em&gt;), the same question, and compare retrieval results between &lt;code&gt;text-embedding-3-small&lt;/code&gt; and &lt;code&gt;BAAI/bge-large-zh-v1.5&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question&lt;/strong&gt;: "What is process capability level 1?"&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;1st Retrieved Result&lt;/th&gt;
&lt;th&gt;2nd Retrieved Result&lt;/th&gt;
&lt;th&gt;Assessment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-small&lt;/td&gt;
&lt;td&gt;Page 12: paragraph about project management&lt;/td&gt;
&lt;td&gt;Page 89: paragraph about risk assessment&lt;/td&gt;
&lt;td&gt;❌ Neither mentions "process capability level"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BAAI/bge-large-zh-v1.5&lt;/td&gt;
&lt;td&gt;Page 45: &lt;strong&gt;Definition of process capability level 1&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Page 46: Example practices for level 1&lt;/td&gt;
&lt;td&gt;✅ Direct hit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Reason&lt;/strong&gt;: OpenAI's models are trained primarily on English corpora. Their grasp of Chinese technical terminology is weaker than that of BGE, which is fine-tuned specifically on Chinese corpora.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Choose an Embedding Model?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Decision tree:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What language is your document?
    ├─ Pure English → text-embedding-3-small (best value)
    │                  or text-embedding-3-large (best precision)
    ├─ Pure Chinese → BAAI/bge-large-zh-v1.5 (China-first choice)
    │                  or BAAI/bge-m3 (if mixed Chinese-English)
    └─ Mixed Chinese-English → BAAI/bge-m3 (best multilingual support)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Switching models is a one-line change&lt;/strong&gt;: Just change &lt;code&gt;model="..."&lt;/code&gt; in the &lt;code&gt;build_embeddings()&lt;/code&gt; function. Everything else stays the same — that's the beauty of LangChain.&lt;/p&gt;
&lt;/blockquote&gt;
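<br></br>
&lt;p&gt;For reference, a sketch of what &lt;code&gt;build_embeddings()&lt;/code&gt; can look like (the exact version lives in the series repo; this one follows the configuration used in this article):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
from langchain_openai import OpenAIEmbeddings

def build_embeddings(model: str = "BAAI/bge-large-zh-v1.5") -&amp;gt; OpenAIEmbeddings:
    # Swap the model here; the rest of the pipeline stays untouched
    return OpenAIEmbeddings(
        model=model,
        api_key=os.getenv("EMBEDDING_API_KEY"),
        base_url=os.getenv("EMBEDDING_API_BASE", "https://api.siliconflow.cn/v1"),
        chunk_size=32,  # SiliconFlow caps batch size at 32
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;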




&lt;h2&gt;
  
  
  Hands-On: Controlled-Variable Experiment
&lt;/h2&gt;

&lt;p&gt;Let's run an experiment: keep the document and the question fixed, change only Chunk Size, and observe how answer quality changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experimental Design
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
RAG Parameter Controlled-Variable Experiment
Fixed: document, question, embedding model, Top-K, LLM
Variable: Chunk Size
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_chroma&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PyPDFLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.output_parsers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StrOutputParser&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;

&lt;span class="c1"&gt;# Load document
&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PyPDFLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/Automotive-SPICE-PAM-v40.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Embedding (fixed)
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-large-zh-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMBEDDING_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.siliconflow.cn/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# LLM (fixed)
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glm-4-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://open.bigmodel.cn/api/paas/v4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Test different Chunk Sizes
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_chunk_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chunk Size=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, Overlap=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Split
&lt;/span&gt;    &lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Build vector store
&lt;/span&gt;    &lt;span class="n"&gt;persist_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./chroma_db_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_dir&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;
        &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rmtree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;persist_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Build RAG Chain (LCEL style)
&lt;/span&gt;    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_messages&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer based on reference content. Reference:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;{context}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{question}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;rag_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Ask question
&lt;/span&gt;    &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is process capability level 1?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Print retrieved sources
&lt;/span&gt;    &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Retrieved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; sources:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] Page &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run three experiments
&lt;/span&gt;&lt;span class="nf"&gt;test_chunk_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;test_chunk_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;test_chunk_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Expected Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunk Size&lt;/th&gt;
&lt;th&gt;Number of Chunks&lt;/th&gt;
&lt;th&gt;Retrieval Quality&lt;/th&gt;
&lt;th&gt;Typical Observation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;Many (~4000)&lt;/td&gt;
&lt;td&gt;High precision but broken context&lt;/td&gt;
&lt;td&gt;Retrieved chunks have the keyword "process capability level" but lack sufficient context; LLM answers are fragmented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;Medium (~1000)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Best balance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieved chunks contain complete definitions + examples; LLM answers are coherent and accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;Few (~500)&lt;/td&gt;
&lt;td&gt;Complete context but low precision&lt;/td&gt;
&lt;td&gt;Retrieved chunks contain lots of irrelevant content (e.g., descriptions of other levels); LLM answers are verbose&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: Chunk Size is neither "bigger is better" nor "smaller is better". &lt;strong&gt;512 characters&lt;/strong&gt; is a safe starting point for most Chinese technical documents.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The 5 Most Common Pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pitfall 1: Setting Chunk Size by Token Count, but length_function Uses Character Count
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Wrong: You think chunk_size=512 means 512 tokens
&lt;/span&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Actually the default length_function=len counts characters!
# 512 characters ≈ 256 tokens (Chinese), so chunks are half the size you expected
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: If you want to count by tokens, explicitly specify a tokenizer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;token_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encoding_for_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ✅ Count by tokens
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
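
&lt;p&gt;A quick sanity check makes the mismatch visible (illustrative; the exact ratio depends on the tokenizer and the language):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Character count vs token count for the same text
sample = "Process capability level 1 means the process is performed." * 8
print(len(sample))           # what the default length_function sees
print(token_length(sample))  # what the model sees; far fewer for English text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;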



&lt;h3&gt;
  
  
  Pitfall 2: Overlap Too Large, Causing 30% Duplicate Content in the Vector Store
&lt;/h3&gt;

&lt;p&gt;Overlap is not free. Every overlapping character requires an embedding computation and takes up storage space in the vector database. Overlap=100 with Chunk Size=200 means &lt;strong&gt;50% of your storage is redundant&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Set Overlap to &lt;strong&gt;10%~15%&lt;/strong&gt; of Chunk Size, and never exceed 20%.&lt;/p&gt;
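
&lt;p&gt;A minimal helper that encodes this rule (a sketch, not code from the repo):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import RecursiveCharacterTextSplitter

# Derive overlap from chunk_size instead of hardcoding two loose numbers
def make_splitter(chunk_size: int, overlap_ratio: float = 0.10):
    assert overlap_ratio &amp;lt;= 0.20, "overlap above 20% mostly buys redundant storage"
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * overlap_ratio),
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;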

&lt;h3&gt;
  
  
  Pitfall 3: Swapping the Embedding Model Without Clearing the Old Vector Store
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Wrong: Yesterday you built the index with BGE, today you switch to OpenAI
# and reuse the same chroma_db/ directory
&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Result: Query vectors and index vectors come from different models — completely mismatched
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: When switching embedding models, &lt;strong&gt;always delete the old vector store and re-index&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rmtree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ✅ Clear old data
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
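
&lt;p&gt;A habit that prevents the mismatch entirely: key the persist directory by the embedding model, so two models can never share an index (a naming convention, not something Chroma enforces):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# One directory per embedding model; switching models starts a fresh index
model_name = "BAAI/bge-large-zh-v1.5"
persist_directory = f"./chroma_db_{model_name.replace('/', '_')}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;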



&lt;h3&gt;
  
  
  Pitfall 4: Hardcoding Top-K Without Adjusting for Question Complexity
&lt;/h3&gt;

&lt;p&gt;Using K=4 for every question ignores how different the information needs are: "What's the default port?" (a simple fact) and "Summarize all key points from Chapter 3" (a comprehensive overview) require vastly different amounts of context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Simple questions use K=3~4, complex questions use K=8~10. A more advanced approach is to use an LLM to judge question complexity first, then dynamically decide K (covered in a later article).&lt;/p&gt;
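
&lt;p&gt;A rough heuristic version of the idea (a sketch; the marker list is invented for illustration, and the LLM-based router is the robust variant):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Pick K from surface features of the question: crude, but costs nothing
def choose_k(question: str) -&amp;gt; int:
    broad_markers = ("summarize", "compare", "overview", "all", "key points")
    if any(m in question.lower() for m in broad_markers):
        return 8  # comprehensive questions need more context
    return 4      # specific factual questions need less

retriever = vector_store.as_retriever(search_kwargs={"k": choose_k(question)})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;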

&lt;h3&gt;
  
  
  Pitfall 5: Not Monitoring "Zero Retrieval"
&lt;/h3&gt;

&lt;p&gt;Sometimes the retriever returns zero relevant chunks (e.g., the user asks about something completely absent from the document), and nothing in the pipeline surfaces this. The LLM has no choice but to hallucinate from memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Add a threshold filter after retrieval — if the similarity score of the most relevant chunk is below a threshold (e.g., 0.6), directly tell the user "The document doesn't contain relevant information" instead of feeding irrelevant chunks to the LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add a filter layer after retrieval
&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;max_similarity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sorry, I cannot answer this question based on the available documents.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
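
&lt;p&gt;One caveat: relevance scores are only comparable within a single embedding model and distance metric, so calibrate the threshold empirically. Run a few questions you know the document cannot answer and check what scores they produce; set the cutoff just above them.&lt;/p&gt;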






&lt;h2&gt;
  
  
  Parameter Selection Cheat Sheet
&lt;/h2&gt;

&lt;p&gt;Everything above, condensed into one table you can tape next to your monitor:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default for Beginners&lt;/th&gt;
&lt;th&gt;When to Increase&lt;/th&gt;
&lt;th&gt;When to Decrease&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chunk Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;Answer needs large context (books/papers)&lt;/td&gt;
&lt;td&gt;Answer is short (FAQ/config items)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chunk Overlap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50 (~10%)&lt;/td&gt;
&lt;td&gt;Sentences often span pages/paragraphs&lt;/td&gt;
&lt;td&gt;Document is highly structured with clear boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Question involves multiple topics&lt;/td&gt;
&lt;td&gt;Question is very specific with a unique answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BGE (Chinese) / OpenAI (English)&lt;/td&gt;
&lt;td&gt;Switch to BGE for Chinese professional documents&lt;/td&gt;
&lt;td&gt;Switch to OpenAI for English general-purpose documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
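
&lt;p&gt;If you prefer code to sticky notes, here are the same defaults as a starting config (values straight from the table; tune them per corpus):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Beginner defaults from the cheat sheet; adjust per document type
RAG_DEFAULTS = {
    "chunk_size": 512,
    "chunk_overlap": 50,  # ~10% of chunk_size
    "top_k": 4,
    "embedding_model": "BAAI/bge-large-zh-v1.5",  # BGE for Chinese; OpenAI for English
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;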




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we covered the four core parameters of RAG:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chunk Size&lt;/strong&gt;: Determines how long each text chunk is. Default 512. Use 256 for short answers, 1024 for long arguments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk Overlap&lt;/strong&gt;: Determines how much adjacent chunks overlap. Default 10% of Chunk Size. Prevents cross-chunk information from being severed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-K&lt;/strong&gt;: Determines how many chunks to retrieve. Default 4. Increase to 8 for complex questions, decrease to 3 for simple facts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding Model&lt;/strong&gt;: Chinese documents use BGE, English documents use OpenAI. Remember to clear the vector store when switching models.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Through the controlled experiment, we demonstrated that &lt;strong&gt;parameters are neither "bigger is better" nor "smaller is better" — the key is finding the balance that suits your document type and query patterns&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Next up, we enter Part 2: Core Components — a deep dive into &lt;strong&gt;4 chunking strategies&lt;/strong&gt; (Fixed Size, Recursive Character, Semantic Chunking, Document Structure), thoroughly unpacking the "how to cut" problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://python.langchain.com/docs/concepts/text_splitters/" rel="noopener noreferrer"&gt;LangChain Text Splitters Documentation&lt;/a&gt; — Official chunking strategy guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/FlagOpen/FlagEmbedding" rel="noopener noreferrer"&gt;BGE Embedding Models GitHub&lt;/a&gt; — Best practices for Chinese embeddings&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/spaces/mteb/leaderboard" rel="noopener noreferrer"&gt;MTEB Leaderboard&lt;/a&gt; — Authoritative embedding model ranking&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.trychroma.com/docs/collections/configure#configuring-the-hnsw-index" rel="noopener noreferrer"&gt;ChromaDB Distance Metrics&lt;/a&gt; — Cosine similarity vs Euclidean distance&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rag</category>
      <category>chunk</category>
      <category>vectordatabase</category>
      <category>tuning</category>
    </item>
    <item>
      <title>RAG Series (1): Why LLMs Need External Memory</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Sat, 02 May 2026 02:47:17 +0000</pubDate>
      <link>https://dev.to/wonderlab/rag-series-1-why-llms-need-external-memory-1645</link>
      <guid>https://dev.to/wonderlab/rag-series-1-why-llms-need-external-memory-1645</guid>
      <description>&lt;h2&gt;
  
  
  Two Root Causes Behind LLM "Hallucinations"
&lt;/h2&gt;

&lt;p&gt;Anyone who has worked with large language models has run into these two situations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Situation 1: Knowledge Cutoff&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: What were our company's Q1 sales figures?
GPT: I'm sorry, my training data only goes up to early 2024 and I have
     no access to your company's internal data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Situation 2: Hallucination&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: How do I use LangChain's RunnablePassthrough?
GPT: RunnablePassthrough can be enabled by calling .with_config(pass_through=True)...
     (This parameter doesn't exist.)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both problems share the same root cause: &lt;strong&gt;an LLM's knowledge is frozen at training time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The moment training completes, the model's "memory" is locked in place. It has no idea what happened today, knows nothing about your internal documents, and won't go look things up—it can only answer from memory. When memory runs dry, it either admits ignorance or invents a plausible-sounding answer.&lt;/p&gt;

&lt;p&gt;That's where hallucinations come from: &lt;strong&gt;the model uses fluent language to fill in the gaps in its knowledge.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Solutions and How They Compare
&lt;/h2&gt;

&lt;p&gt;There are three engineering approaches to this problem:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrain on new data, "bake" knowledge into parameters&lt;/td&gt;
&lt;td&gt;Fixed-domain language style, output format&lt;/td&gt;
&lt;td&gt;Expensive, slow to update, limited factual recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Long Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stuff all documents into the prompt&lt;/td&gt;
&lt;td&gt;Small document sets, one-off queries&lt;/td&gt;
&lt;td&gt;Token cost scales with the entire document set on every query; quality degrades at extreme lengths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dynamically retrieve relevant content at query time, inject into prompt&lt;/td&gt;
&lt;td&gt;Large knowledge bases, continuously updated data&lt;/td&gt;
&lt;td&gt;Requires retrieval infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A common misconception: &lt;strong&gt;fine-tuning is not good at injecting new facts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fine-tuning changes a model's behavioral patterns and language style—it doesn't "store a book" inside the parameters. Experiments consistently show that fine-tuning on specific Q&amp;amp;A pairs produces limited accuracy gains on related questions, and if training data contains errors, the model confidently repeats those errors.&lt;/p&gt;

&lt;p&gt;RAG's core advantage is &lt;strong&gt;separating "what to know" from "how to say it"&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knowledge lives in an external database and can be updated anytime&lt;/li&gt;
&lt;li&gt;The model focuses purely on understanding and generation, not memorization&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;When should you use long context instead of RAG?&lt;/strong&gt;&lt;br&gt;
When the total document volume is under ~100K tokens, the query is one-off (not recurring), and API costs are acceptable, long context is often simpler. Claude and Gemini's extended context windows make "stuffing a whole book in" genuinely viable. But for enterprise knowledge bases—thousands of documents, continuous updates, multiple concurrent users—RAG remains the more sensible architecture.&lt;/p&gt;
&lt;/blockquote&gt;
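
&lt;p&gt;A back-of-the-envelope comparison (assuming a hypothetical price of $1 per million input tokens): stuffing a 100K-token corpus into every prompt costs about $0.10 per question, while retrieving ~2K tokens of context via RAG costs about $0.002, a 50x difference that compounds with every query and every user.&lt;/p&gt;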




&lt;h2&gt;
  
  
  What RAG Is: An Open-Book Exam Analogy
&lt;/h2&gt;

&lt;p&gt;RAG = &lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The most intuitive way to think about it: &lt;strong&gt;turning a closed-book exam into an open-book exam.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Closed-book (pure LLM): The student answers purely from memory. Anything not memorized gets guessed.&lt;/p&gt;

&lt;p&gt;Open-book (RAG): The student can consult reference materials, but still needs to understand the question, find the relevant content, and compose the answer. The reference materials are the external knowledge base; looking things up is the retrieval step.&lt;/p&gt;

&lt;p&gt;This analogy reveals two key properties of RAG:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge lives outside the model&lt;/strong&gt; — it can be swapped and updated independently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The model handles understanding and generation&lt;/strong&gt; — after retrieval, the model still needs to "read" the content and produce a coherent response&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Complete RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;RAG operates in two distinct phases: the &lt;strong&gt;indexing phase&lt;/strong&gt; (one-time, offline) and the &lt;strong&gt;query phase&lt;/strong&gt; (real-time, per request).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj13rs0wbpqh5dvf0hmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxj13rs0wbpqh5dvf0hmp.png" alt="RAG Architecture Overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The two-phase RAG architecture — top: offline indexing pipeline; bottom: real-time query pipeline; both share the same Vector DB&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Indexing Phase
&lt;/h3&gt;

&lt;p&gt;This phase completes before any user query arrives. It's a one-time preprocessing step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw Documents → Document Loading → Text Splitting → Embedding → Vector Database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: Document Loading&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Convert raw content from various formats into plain text. PDFs, Word docs, Markdown, web pages, code—each format has its own parsing challenges (tables and images in PDFs are notoriously tricky).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Text Splitting (Chunking)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cut long documents into smaller chunks. This step has a significant impact on final quality—chunks too large reduce retrieval precision; chunks too small lose semantic coherence. (Chunking strategies are covered in depth in a later article; for now, just understand &lt;em&gt;why&lt;/em&gt; we split.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Embedding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use an embedding model to convert each text chunk into a high-dimensional vector. This vector captures the semantic meaning of the text—semantically similar texts produce vectors that are close together in the vector space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Store in Vector Database&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Store all vectors along with their original text in a vector database that supports similarity search (Chroma, Qdrant, Weaviate, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Phase
&lt;/h3&gt;

&lt;p&gt;Executed in real time for every user request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Question → Embedding → Similarity Search → Retrieved Chunks → Prompt Assembly → LLM → Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: Query Embedding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Convert the user's question into a vector using the same embedding model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Similarity Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Find the Top-K most similar text chunks in the vector database. Similarity is measured by distance in vector space (cosine similarity, etc.).&lt;/p&gt;
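
&lt;p&gt;For two embedding vectors a and b, cosine similarity is just the normalized dot product, which is exactly what the &lt;code&gt;cosine_similarity&lt;/code&gt; helper in the hands-on code below computes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;similarity(a, b) = (a · b) / (‖a‖ ‖b‖)    # range: -1 to 1, higher = more similar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;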

&lt;p&gt;&lt;strong&gt;Step 3: Prompt Assembly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combine the retrieved text chunks with the user's question into a complete prompt and send it to the LLM. A typical format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a knowledge assistant. Answer the user's question based solely on
the reference content provided below.

Reference content:
[Retrieved chunk 1]
[Retrieved chunk 2]
...

User question: [original question]

Please base your answer on the reference content. If the reference content
does not contain relevant information, say so clearly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: LLM Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM generates an answer grounded in the provided context, rather than from its internal parameters alone.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hands-On: A Minimal RAG in 100 Lines
&lt;/h2&gt;

&lt;p&gt;No frameworks—just the OpenAI API. Let's implement a working RAG from scratch. The goal is to &lt;strong&gt;see exactly what each step does&lt;/strong&gt;, without the abstraction layers of a framework hiding the details.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Minimal RAG implementation — no frameworks, OpenAI API only.
Demonstrates the complete RAG pipeline: indexing + querying.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# requires OPENAI_API_KEY environment variable
&lt;/span&gt;
&lt;span class="c1"&gt;# ─────────────────────────────────────────
# Simulated knowledge base: 5 technical docs
# ─────────────────────────────────────────
&lt;/span&gt;&lt;span class="n"&gt;DOCUMENTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LangChain is a framework for building LLM applications, providing chaining, memory management, and tool integration.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vector databases enable semantic search by converting text into high-dimensional vectors. Common options include Chroma, Qdrant, Weaviate, and Pinecone.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RAG (Retrieval-Augmented Generation) reduces LLM hallucinations by retrieving relevant documents before generation, improving answer accuracy.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Embedding models convert text into fixed-dimension vectors, where semantically similar texts are positioned closer together in vector space.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fine-tuning retrains a model on specific data to adjust its behavior — it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s best suited for changing output style, not injecting new factual knowledge.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="c1"&gt;# ─────────────────────────────────────────
# Indexing phase: convert documents to vectors
# ─────────────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Convert each document into a vector.
    Returns [{text, embedding}, ...]
    In production, store these in a vector database.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Indexing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; documents...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Index built successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec_a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;vec_b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Cosine similarity between two vectors. Range: -1 to 1, higher = more similar.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;


&lt;span class="c1"&gt;# ─────────────────────────────────────────
# Query phase: retrieve + generate
# ─────────────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Embed the query and find the top_k most similar documents.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;

    &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

    &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Assemble retrieved docs + user question into a prompt and call the LLM.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context_docs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a knowledge assistant. Answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question based
solely on the reference content below. If the reference content does not contain
relevant information, say so clearly — do not make anything up.

Reference content:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

User question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Full RAG query pipeline.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; relevant documents:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;


&lt;span class="c1"&gt;# ─────────────────────────────────────────
# Run the demo
# ─────────────────────────────────────────
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Build the index (only needs to be done once in a real system)
&lt;/span&gt;    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOCUMENTS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Test with a few questions
&lt;/span&gt;    &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is a vector database?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the difference between RAG and fine-tuning?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is Python&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s GIL?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Not in the knowledge base — testing refusal
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sample output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Indexing 5 documents...
Index built successfully.

Question: What is a vector database?
Retrieved 2 relevant documents:
  [1] Vector databases enable semantic search by converting text into high-dim...
  [2] Embedding models convert text into fixed-dimension vectors, where semant...

Answer: A vector database is a database system that enables semantic search by
converting text into high-dimensional vectors. Common examples include Chroma,
Qdrant, Weaviate, and Pinecone...

Question: What is Python's GIL?
Retrieved 2 relevant documents:
  [1] LangChain is a framework for building LLM applications...
  [2] Fine-tuning retrains a model on specific data...

Answer: Based on the provided reference content, I cannot answer your question
about Python's GIL — the reference material does not contain relevant information.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the last question: the knowledge base has nothing about Python's GIL, and the LLM explicitly says it can't answer rather than inventing a response. This is how RAG controls hallucinations: &lt;strong&gt;a constraint in the prompt instructs the model to answer only from retrieved content.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations of This Implementation
&lt;/h2&gt;

&lt;p&gt;The 100 lines above demonstrate the complete RAG pipeline, but there are obvious shortcomings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Engineering Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vectors live in memory, lost on restart&lt;/td&gt;
&lt;td&gt;No persistence&lt;/td&gt;
&lt;td&gt;Vector database (Chroma / Qdrant)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long documents passed in directly will exceed token limits&lt;/td&gt;
&lt;td&gt;No chunking&lt;/td&gt;
&lt;td&gt;Text Splitter strategies &lt;em&gt;(next article)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Poor keyword matching; pure vector retrieval only&lt;/td&gt;
&lt;td&gt;No hybrid search&lt;/td&gt;
&lt;td&gt;Hybrid search &lt;em&gt;(later in series)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No quality measurement&lt;/td&gt;
&lt;td&gt;No evaluation&lt;/td&gt;
&lt;td&gt;RAGAS evaluation framework &lt;em&gt;(later in series)&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each limitation maps directly to a topic covered in the upcoming articles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This article addressed three core questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Why RAG?&lt;/strong&gt; — LLM knowledge cutoff and hallucinations both stem from knowledge being frozen in model parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What is RAG?&lt;/strong&gt; — Dynamically retrieve external knowledge at query time, inject it into the prompt, and let the LLM answer based on evidence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG vs. alternatives&lt;/strong&gt; — Fine-tuning changes behavior, not knowledge; long context works for small documents; RAG is built for large-scale, continuously updated knowledge bases&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Next up: the first deep dive into RAG's core components — &lt;strong&gt;text chunking strategies&lt;/strong&gt;. Why does the chunking approach have such a dramatic impact on quality, and how do you choose between the four main strategies?&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2005.11401" rel="noopener noreferrer"&gt;Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&lt;/a&gt; — Original RAG paper (Lewis et al., 2020)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;OpenAI Embeddings Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/docs/tutorials/rag/" rel="noopener noreferrer"&gt;LangChain RAG Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>ai</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No. 54): Warp - The AI-Native Rust Terminal</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Sat, 02 May 2026 02:42:27 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no-54-warp-the-ai-native-rust-terminal-2hlj</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no-54-warp-the-ai-native-rust-terminal-2hlj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"The terminal hasn't fundamentally changed in 40 years. It's time it did." — The Warp Team&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the No. 54 article in the "One Open Source Project a Day" series. Today, we are exploring &lt;strong&gt;Warp&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The terminal is perhaps the most ancient part of a developer's daily toolkit. While every other tool has evolved for the AI era, most of us still interact with our command line as a simple "stream of text" based on 40-year-old logic. &lt;strong&gt;Warp&lt;/strong&gt; is here to change that. Rebuilt from the ground up in Rust, Warp is not just a high-performance terminal emulator, but a collaborative, AI-native &lt;strong&gt;Agentic Development Environment (ADE)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Will Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;How Warp transforms terminal output into actionable "Blocks."&lt;/li&gt;
&lt;li&gt;How AI integration in the terminal has evolved from simple completion to "Agentic Mode."&lt;/li&gt;
&lt;li&gt;Warp Drive: Managing and sharing team workflows like code repositories.&lt;/li&gt;
&lt;li&gt;The tech behind its high-performance GPU-accelerated UI rendering in Rust.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic command-line experience.&lt;/li&gt;
&lt;li&gt;Familiarity with modern development tools like VS Code.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Project Introduction
&lt;/h3&gt;

&lt;p&gt;Warp is a modern, high-performance terminal emulator. It breaks away from the linear text-stream model of traditional terminals, reimagining the interface as a series of distinct "Blocks." Recently, Warp hit a major milestone by &lt;strong&gt;open-sourcing its client codebase&lt;/strong&gt; and fully embracing "Agentic Development," allowing AI agents to debug, refactor, and deploy directly from the command line.&lt;/p&gt;

&lt;h3&gt;
  
  
  Author/Team Introduction
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Team&lt;/strong&gt;: The Warp team, based in the USA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background&lt;/strong&gt;: Backed by top-tier firms like &lt;strong&gt;Sequoia Capital&lt;/strong&gt; and &lt;strong&gt;GV&lt;/strong&gt;, with investors including &lt;strong&gt;Sam Altman&lt;/strong&gt; (CEO of OpenAI) and &lt;strong&gt;Dylan Field&lt;/strong&gt; (CEO of Figma).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Founded&lt;/strong&gt;: 2020 (Client codebase open-sourced in 2024).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ GitHub Stars: 23,000+&lt;/li&gt;
&lt;li&gt;🍴 Forks: 1,200+&lt;/li&gt;
&lt;li&gt;📦 Core Language: &lt;strong&gt;Rust&lt;/strong&gt; (98.2%)&lt;/li&gt;
&lt;li&gt;📄 License: &lt;strong&gt;AGPL v3&lt;/strong&gt; (Client) / &lt;strong&gt;MIT&lt;/strong&gt; (UI Framework)&lt;/li&gt;
&lt;li&gt;🌐 Official Website: &lt;a href="https://www.warp.dev" rel="noopener noreferrer"&gt;warp.dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Main Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Utility
&lt;/h3&gt;

&lt;p&gt;Warp upgrades the boring command-line interaction into an IDE-like collaborative experience. It lowers the barrier to entry for complex CLI tools using AI and eliminates organizational "knowledge silos" through cloud-based sharing features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;AI-Assisted Debugging&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;When a command fails, use the built-in AI to explain the error and generate a fix with one click.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Workflow Sharing (Warp Drive)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Save complex deployment scripts or operational commands as parameterized "Workflows" and share them with your team.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Immersive Command Editing&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Edit commands like you edit text: support for mouse positioning, undo, and standard editor keybindings.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;Warp is available for macOS, Linux, and Windows (v1).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS users can install via Homebrew&lt;/span&gt;
brew &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--cask&lt;/span&gt; warp

&lt;span class="c"&gt;# Linux users can download .deb, .rpm, or AppImage from the website&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, use &lt;code&gt;CMD+P&lt;/code&gt; to search, or start a command with &lt;code&gt;#&lt;/code&gt; to convert a natural-language request into a command.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Characteristics
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Block-based UI&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Every command and its output is encapsulated in a "Block." You can copy output, share a link to a specific block, or filter output with AI.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Warp AI&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Deeply integrated with models like Claude 3.5 Sonnet and GPT-4o. It acts as a "tech lead" capable of managing sub-agents for complex tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Warp Drive&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A cloud-based "safe" for storing, searching, and sharing common workflows and interactive notebooks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Exceptional Performance&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Fully GPU-accelerated UI rendering built in Rust, ensuring low latency even with high-volume log outputs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Modern Editor Experience&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The input bar features syntax highlighting, smart completions, and multi-cursor editing.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Project Advantages
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Warp&lt;/th&gt;
&lt;th&gt;iTerm2 / Alacritty&lt;/th&gt;
&lt;th&gt;VS Code Terminal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native &amp;amp; Deep&lt;/td&gt;
&lt;td&gt;Plugin-based / Minimal&lt;/td&gt;
&lt;td&gt;Basic Completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Collaboration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud Sync &amp;amp; Sharing&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Very Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interaction Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Block-level Objects&lt;/td&gt;
&lt;td&gt;Continuous Text Stream&lt;/td&gt;
&lt;td&gt;Text Stream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rendering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPU Accelerated (Rust)&lt;/td&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;td&gt;Generally Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why Choose Warp?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Eliminate Context Switching&lt;/strong&gt;: No need to leave the terminal to search for regex syntax or error details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Retention&lt;/strong&gt;: Turn scattered team knowledge into reusable digital assets via Warp Drive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Configuration&lt;/strong&gt;: Comes with most modern developer features out of the box.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Detailed Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. From Terminal to ADE
&lt;/h3&gt;

&lt;p&gt;Warp is positioning itself as an &lt;strong&gt;Agentic Development Environment (ADE)&lt;/strong&gt;. Its architecture leverages the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;, allowing it to connect terminal sessions to external data (GitHub, Jira, DBs) and AI agents seamlessly.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. GPU-Accelerated Rendering
&lt;/h3&gt;

&lt;p&gt;Warp built a custom Rust UI framework called &lt;code&gt;warpui&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rendering&lt;/strong&gt;: Fully rendered on the GPU using modern graphics APIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefit&lt;/strong&gt;: Extremely low CPU overhead and zero "flicker" or lag when handling heavy log processing.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Links &amp;amp; Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/warpdotdev/warp" rel="noopener noreferrer"&gt;https://github.com/warpdotdev/warp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📚 &lt;strong&gt;Documentation&lt;/strong&gt;: &lt;a href="https://docs.warp.dev/" rel="noopener noreferrer"&gt;https://docs.warp.dev/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;Open Source Roadmap&lt;/strong&gt;: &lt;a href="https://www.warp.dev/blog/why-warp-is-going-open-source" rel="noopener noreferrer"&gt;Why Warp is Going Open Source&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Community
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Discord&lt;/strong&gt;: Active community for developers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warp Drive&lt;/strong&gt;: Explore how to codify team workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Target Audience
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Professional developers seeking ultimate efficiency.&lt;/li&gt;
&lt;li&gt;DevOps engineers who need to share operational assets.&lt;/li&gt;
&lt;li&gt;Terminal veterans tired of the 80s-style command line experience.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Find more useful knowledge and interesting products on my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;Homepage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>llm</category>
      <category>ai</category>
      <category>cli</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No. 53): pi-mono - Minimalist &amp; High-Performance AI Coding Agent</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Sat, 02 May 2026 02:39:38 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no-53-pi-mono-minimalist-high-performance-ai-coding-agent-4d73</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no-53-pi-mono-minimalist-high-performance-ai-coding-agent-4d73</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"Simplicity is the ultimate sophistication." — Leonardo da Vinci&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the No. 53 article in the "One Open Source Project a Day" series. Today, we are looking at &lt;strong&gt;pi-mono&lt;/strong&gt; (pi).&lt;/p&gt;

&lt;p&gt;In an era where AI coding tools are becoming increasingly bloated—with massive binaries and complex sub-agent architectures—Mario Zechner, the author of libGDX, has taken a diametrically opposite path. &lt;strong&gt;pi-mono&lt;/strong&gt; is a TypeScript-based monorepo containing a lean yet powerful CLI coding assistant named &lt;code&gt;pi&lt;/code&gt;. It eschews flashy GUIs for a custom "differential rendering" TUI framework, delivering the smoothest AI collaboration experience directly in your terminal.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Will Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The minimalist design philosophy of pi-mono.&lt;/li&gt;
&lt;li&gt;How "differential rendering" achieves a flicker-free terminal UI.&lt;/li&gt;
&lt;li&gt;High-efficiency cross-provider LLM switching.&lt;/li&gt;
&lt;li&gt;Why the "YOLO mode" (no permission confirmation) dramatically boosts productivity.&lt;/li&gt;
&lt;li&gt;A deep comparison with heavy-duty agents like Claude Code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic Node.js/TypeScript environment setup.&lt;/li&gt;
&lt;li&gt;Basic understanding of LLM Tool Calling.&lt;/li&gt;
&lt;li&gt;API Keys for AI providers like Anthropic or OpenAI.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Project Introduction
&lt;/h3&gt;

&lt;p&gt;pi-mono is a suite of AI programming agent tools designed specifically for "power users." It consists of a core agent engine, a unified AI interface layer, and a terminal UI with its own self-contained rendering engine. Its core objective is to &lt;strong&gt;provide the fastest response times and the cleanest user experience without sacrificing context control.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Author/Team Introduction
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Author&lt;/strong&gt;: Mario Zechner&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background&lt;/strong&gt;: Founder of the popular open-source game framework &lt;strong&gt;libGDX&lt;/strong&gt; and former founder of RoboVM. He has deep expertise in high-performance cross-platform development and the open-source community.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project Status&lt;/strong&gt;: Under rapid iteration, currently performing exceptionally well on benchmarks like Terminal-Bench.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ GitHub Stars: 430+ (Early stage, growing fast)&lt;/li&gt;
&lt;li&gt;🍴 Forks: 30+&lt;/li&gt;
&lt;li&gt;📦 Package Manager: pnpm&lt;/li&gt;
&lt;li&gt;📄 License: MIT&lt;/li&gt;
&lt;li&gt;🌐 Repository: &lt;a href="https://github.com/badlogic/pi-mono" rel="noopener noreferrer"&gt;badlogic/pi-mono&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Main Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Utility
&lt;/h3&gt;

&lt;p&gt;As a "harness," pi-mono connects LLMs (like Claude 3.5 Sonnet) to your local development environment. It can autonomously read files, execute Bash commands, rewrite code, and provide feedback on results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Fast Refactoring&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;It understands the full context of a codebase and can perform cross-file interface renaming with a single prompt.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Hard Bug Fixing&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;By observing error logs, it autonomously executes search commands and applies targeted fixes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Minimalist Development&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;For developers who prefer working in Terminal/Vim, &lt;code&gt;pi&lt;/code&gt; provides IDE-like interactions while remaining lightweight.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the pi coding agent&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @mariozechner/pi-coding-agent

&lt;span class="c"&gt;# Set up API Key (e.g., Anthropic)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_key_here

&lt;span class="c"&gt;# Start in the project root&lt;/span&gt;
pi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Characteristics
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Differential Rendering TUI&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Frustrated by terminal flickering, the author developed a rendering engine inspired by React's diffing algorithm, making Markdown parsing and syntax highlighting extremely smooth.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Minimalist System Prompt&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Unlike other agents that spend thousands of tokens on instructions, &lt;code&gt;pi&lt;/code&gt; uses fewer than 1,000, saving context-window space and speeding up responses.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Seamless Multi-Model Switching&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;You can switch models mid-conversation (e.g., from Claude to GPT-4o), and it automatically migrates the conversation history.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;YOLO Mode&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The project embraces a "trust through execution" philosophy. It doesn't nag you with permission popups for running &lt;code&gt;ls&lt;/code&gt; or &lt;code&gt;read&lt;/code&gt; commands.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Project Advantages
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;pi-mono (pi)&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Cursor / Windsurf&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tiny (Node-based)&lt;/td&gt;
&lt;td&gt;Large (Multi-layer deps)&lt;/td&gt;
&lt;td&gt;Heavy (IDE level)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (Bash-based)&lt;/td&gt;
&lt;td&gt;Medium (MCP-constrained)&lt;/td&gt;
&lt;td&gt;Low (Closed-box)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Start Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Instant&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100% Transparent&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why Choose This Project?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance Beast&lt;/strong&gt;: Extreme optimization of terminal rendering and network I/O makes it feel 2x faster than similar tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparency&lt;/strong&gt;: You can clearly see every character the AI generates and every simple tool call it executes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer-Friendly API&lt;/strong&gt;: If you want to build your own agent, its &lt;code&gt;pi-ai&lt;/code&gt; package is one of the best-wrapped cross-platform AI libraries available.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Detailed Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Differential Rendering Engine (pi-tui)
&lt;/h3&gt;

&lt;p&gt;This is the most technically impressive part of pi-mono. Standard terminal UIs usually perform full redraws, causing noticeable flickering during long text outputs. &lt;code&gt;pi-tui&lt;/code&gt; borrows from Virtual DOM concepts (a sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It maintains a buffer of the terminal state.&lt;/li&gt;
&lt;li&gt;It calculates the difference (Diff) between old and new states.&lt;/li&gt;
&lt;li&gt;It sends only the necessary escape sequences to stdout.&lt;/li&gt;
&lt;/ul&gt;
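
&lt;p&gt;Here is a minimal, hypothetical sketch of the line-level version of this idea (ours, not pi-tui's actual code): diff the old and new frames, then rewrite only the rows that changed, using ANSI escape codes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of diff-based terminal rendering (not pi-tui's real code).
import sys
from itertools import zip_longest

def render(old_frame: list[str], new_frame: list[str]) -&amp;gt; None:
    """Rewrite only the rows that differ between two frames."""
    for row, (old, new) in enumerate(zip_longest(old_frame, new_frame, fillvalue="")):
        if old != new:
            # \x1b[{row};1H moves the cursor; \x1b[K clears to end of line.
            sys.stdout.write(f"\x1b[{row + 1};1H\x1b[K{new}")
    sys.stdout.flush()

render(["$ pi", "thinking..."], ["$ pi", "done: 3 files edited"])  # only row 2 is redrawn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A full redraw costs work proportional to the whole screen; a diff costs work proportional to what changed, which is why streaming Markdown output stays smooth.&lt;/p&gt;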

&lt;h3&gt;
  
  
  2. Tool Calling Model (The "Bash-only" Philosophy)
&lt;/h3&gt;

&lt;p&gt;While other agents try to integrate dozens of APIs, &lt;code&gt;pi&lt;/code&gt; insists: &lt;strong&gt;If an AI can use Bash well, it can do anything.&lt;/strong&gt;&lt;br&gt;
Its toolbox consists of only four atomic tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;read(path, startLine, endLine)&lt;/code&gt;: Read file segments.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write(path, content)&lt;/code&gt;: Overwrite files.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;edit(path, oldStr, newStr)&lt;/code&gt;: Local search and replace (the most stable way to edit code).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bash(command)&lt;/code&gt;: Execute any shell command.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design ensures &lt;code&gt;pi&lt;/code&gt; remains incredibly robust in almost any environment.&lt;/p&gt;
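
&lt;p&gt;To show how small that surface really is, here is a hypothetical dispatcher over the four tools. The signatures follow the list above; the function itself is our illustration, not pi's internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical dispatcher for pi's four atomic tools (illustrative only).
import subprocess
from pathlib import Path

def run_tool(name: str, args: dict) -&amp;gt; str:
    if name == "read":
        lines = Path(args["path"]).read_text().splitlines()
        return "\n".join(lines[args["startLine"] - 1 : args["endLine"]])
    if name == "write":
        Path(args["path"]).write_text(args["content"])
        return "ok"
    if name == "edit":
        text = Path(args["path"]).read_text()
        Path(args["path"]).write_text(text.replace(args["oldStr"], args["newStr"], 1))
        return "ok"
    if name == "bash":
        result = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
        return result.stdout + result.stderr
    raise ValueError(f"unknown tool: {name}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Everything else (git, grep, test runners, deployments) rides on &lt;code&gt;bash&lt;/code&gt;, which is exactly the point.&lt;/p&gt;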




&lt;h2&gt;
  
  
  Project Links &amp;amp; Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/badlogic/pi-mono" rel="noopener noreferrer"&gt;https://github.com/badlogic/pi-mono&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;NPM&lt;/strong&gt;: &lt;a href="https://www.npmjs.com/package/@mariozechner/pi-coding-agent" rel="noopener noreferrer"&gt;@mariozechner/pi-coding-agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💬 &lt;strong&gt;Discord&lt;/strong&gt;: Visit GitHub for the invite link.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Related Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/answerdotai/terminal-bench" rel="noopener noreferrer"&gt;Terminal-Bench Leaderboard&lt;/a&gt; - &lt;code&gt;pi&lt;/code&gt; is consistently a top performer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Target Audience
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Terminal natives looking for a high-speed experience.&lt;/li&gt;
&lt;li&gt;Developers with high requirements for code privacy and agent transparency.&lt;/li&gt;
&lt;li&gt;Learners who want to understand how to build high-performance Agent TUIs from scratch.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Find more useful knowledge and interesting products on my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;Homepage&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>agents</category>
      <category>ai</category>
      <category>typescript</category>
    </item>
    <item>
      <title>RAG Series (2): Building Your First RAG Pipeline with LangChain</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Fri, 01 May 2026 02:08:30 +0000</pubDate>
      <link>https://dev.to/wonderlab/rag-series-2-building-your-first-rag-pipeline-with-langchain-2kk7</link>
      <guid>https://dev.to/wonderlab/rag-series-2-building-your-first-rag-pipeline-with-langchain-2kk7</guid>
      <description>&lt;h2&gt;
  
  
  From 100 Lines to a Production Pipeline
&lt;/h2&gt;

&lt;p&gt;In the last article, we built a minimal RAG system in 100 lines of pure Python. It worked. It demonstrated the core idea. But if you tried to productionize that code, you'd quickly run into a wall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loading a PDF?&lt;/strong&gt; You need &lt;code&gt;PyPDF2&lt;/code&gt; or &lt;code&gt;pdfplumber&lt;/code&gt;, and you'll discover that tables, headers, and footers are a nightmare to parse cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Splitting text?&lt;/strong&gt; Your naive &lt;code&gt;text.split("\n\n")&lt;/code&gt; will cut sentences in half, break code blocks, or create chunks so large they blow past the token limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Swapping the vector database?&lt;/strong&gt; Say goodbye to your afternoon — every database has a different API, different distance metrics, different metadata handling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Switching LLM providers?&lt;/strong&gt; OpenAI's client, Anthropic's client, local models via &lt;code&gt;llama.cpp&lt;/code&gt; — each has its own message format, its own token counting, its own error handling.&lt;/p&gt;

&lt;p&gt;This is exactly why &lt;strong&gt;LangChain&lt;/strong&gt; exists. It doesn't do anything magical. It doesn't replace the underlying models or databases. What it does is simple and valuable: &lt;strong&gt;it gives you a uniform interface for plugging components together&lt;/strong&gt;, so you can focus on the logic of your RAG system instead of the plumbing.&lt;/p&gt;

&lt;p&gt;In this article, we'll rebuild our RAG pipeline using LangChain's modern API. By the end, you'll have a complete, runnable project that loads PDFs, splits them intelligently, stores them in ChromaDB, and answers questions using a multi-provider LLM — with about 30 lines of actual pipeline code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;LangChain Version Note:&lt;/strong&gt; The code in this article is based on &lt;code&gt;langchain 1.x&lt;/code&gt; (current stable). LangChain 1.x is a breaking reorganization of the &lt;code&gt;0.3.x&lt;/code&gt; line — high-level APIs like &lt;code&gt;create_retrieval_chain&lt;/code&gt; were removed. We use &lt;strong&gt;LCEL native syntax&lt;/strong&gt; (the &lt;code&gt;|&lt;/code&gt; pipe operator) instead, which is functionally equivalent and version-agnostic. Full source code: &lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic" rel="noopener noreferrer"&gt;https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Six Moving Parts of a RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;LangChain decomposes RAG into six components. Understanding what each one does — and where the quality risks hide — is the key to debugging RAG systems later.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;The Quality Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Document Loader&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads raw files (PDF, Word, Markdown, HTML) and extracts text&lt;/td&gt;
&lt;td&gt;Tables, images, and weird formatting get mangled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text Splitter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cuts long documents into semantically coherent chunks&lt;/td&gt;
&lt;td&gt;Chunks too large = low precision; too small = lost context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Converts text chunks into high-dimensional vectors&lt;/td&gt;
&lt;td&gt;Poor model choice = semantically unrelated texts cluster together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector Store&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Persists vectors and enables fast similarity search&lt;/td&gt;
&lt;td&gt;Wrong distance metric or no metadata filtering = bad retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retriever&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Accepts a query, searches the vector store, returns relevant chunks&lt;/td&gt;
&lt;td&gt;Top-K too low = missing information; too high = noisy context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Orchestrates the full flow: query → retrieve → prompt → LLM → answer&lt;/td&gt;
&lt;td&gt;Prompt design and context assembly determine answer quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Think of these six components as an assembly line in a factory. The Document Loader is the raw material intake. The Text Splitter is the precision cutting station. The Embedding Model and Vector Store form the warehouse and inventory system. The Retriever is the picker who fetches the right parts. The Chain is the foreman who coordinates everything and delivers the final product.&lt;/p&gt;

&lt;p&gt;If any station on the line is misconfigured, the final product suffers — and the tricky part is that &lt;strong&gt;the failure often looks like an LLM problem when it's actually a retrieval problem&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hands-On: A Complete LangChain RAG Project
&lt;/h2&gt;

&lt;p&gt;Let's build it. We'll create a RAG system that reads a directory of PDF files, indexes them, and lets you ask questions in natural language.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag-project/
├── requirements.txt
├── data/
│   └── sample.pdf          # Your PDF documents go here
└── rag_pipeline.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 0: Dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;langchain&amp;gt;=0.3.0
langchain-text-splitters&amp;gt;=0.3.0
langchain-openai&amp;gt;=0.2.0
langchain-chroma&amp;gt;=0.1.0
langchain-community&amp;gt;=0.3.0
pypdf&amp;gt;=4.0.0
python-dotenv&amp;gt;=1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Full source code (ready to run):&lt;/strong&gt; &lt;a href="https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic" rel="noopener noreferrer"&gt;https://github.com/chendongqi/llm-in-action/tree/main/02-langchain-basic&lt;/a&gt;&lt;br&gt;
Supports Zhipu AI / OpenAI / Ollama for LLM, SiliconFlow and local Ollama for Embedding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Install them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure your API keys (copy &lt;code&gt;.env.example&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env to fill in LLM_API_KEY and EMBEDDING_API_KEY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
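
&lt;p&gt;For reference, a minimal &lt;code&gt;.env&lt;/code&gt; needs just the two keys mentioned above (placeholder values shown; check the repo's &lt;code&gt;.env.example&lt;/code&gt; for the full set):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .env (placeholders)
LLM_API_KEY=your_llm_key_here
EMBEDDING_API_KEY=your_embedding_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;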



&lt;p&gt;Supported Providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: Zhipu AI (default), OpenAI, SiliconFlow, Ollama, Azure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt;: SiliconFlow (default), OpenAI, Ollama&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Load Documents
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;PyPDFLoader&lt;/code&gt; handles PDF parsing for us. It extracts text page by page and returns a list of &lt;code&gt;Document&lt;/code&gt; objects, each containing the page content and metadata (page number, source file, etc.).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PyPDFLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_pdfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load all PDF files from the data directory.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;pdf_paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dir&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pdf_path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pdf_paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PyPDFLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; pages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total documents loaded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Real-world note:&lt;/strong&gt; PDFs are the wild west of document formats. If your PDFs contain scanned images, plain text extraction returns nothing; you'll need OCR (e.g., Tesseract, or a service like Azure Document Intelligence). If they contain tables, consider &lt;code&gt;UnstructuredPDFLoader&lt;/code&gt;, which preserves table structure better than raw text extraction.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 2: Split Documents into Chunks
&lt;/h3&gt;

&lt;p&gt;Remember the chunking problem from Part 1? LangChain's &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; is the industry default for good reason. It tries to split on natural boundaries — paragraphs first, then newlines, then sentences, then words — so it avoids cutting mid-sentence whenever possible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Split documents into overlapping chunks.
    chunk_overlap ensures context continuity between adjacent chunks.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Split into &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks (chunk_size=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, overlap=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;chunk_overlap&lt;/code&gt; matters:&lt;/strong&gt; If a key concept spans two chunks — say, "The API rate limit is 100 requests per minute. Exceeding this limit returns a 429 status code" — an overlap of 50 characters ensures the second chunk still contains "100 requests per minute" as context. Without overlap, the retriever might fetch only one chunk and miss the causal relationship.&lt;/p&gt;
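
&lt;p&gt;The mechanics are easiest to see with plain fixed windows. The sketch below is a conceptual illustration of overlapping windows, not &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;'s actual boundary-aware algorithm:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual sketch: fixed windows with overlap (not LangChain's exact algorithm).
text = "The API rate limit is 100 requests per minute. Exceeding it returns 429."
chunk_size, overlap = 40, 10
step = chunk_size - overlap

chunks = [text[i : i + chunk_size] for i in range(0, len(text), step)]
for c in chunks:
    print(repr(c))
# Each chunk repeats the last 10 characters of the previous one, so a fact
# straddling a boundary survives intact in at least one chunk.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;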

&lt;p&gt;&lt;strong&gt;Chunk size trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunk Size&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;256 tokens&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Fact lookup, Q&amp;amp;A over structured docs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512 tokens&lt;/td&gt;
&lt;td&gt;Balanced&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;General-purpose RAG (good default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024 tokens&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Rich&lt;/td&gt;
&lt;td&gt;Long-form summarization, narrative documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2048+ tokens&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Very rich&lt;/td&gt;
&lt;td&gt;Only when the LLM context window is large and queries are broad&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most use cases, &lt;strong&gt;512 tokens with 50-token overlap&lt;/strong&gt; is a safe starting point.&lt;/p&gt;
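
&lt;p&gt;One caveat when applying this table: the splitter above measures &lt;em&gt;characters&lt;/em&gt; (&lt;code&gt;length_function=len&lt;/code&gt;), while the table talks in tokens. To budget in tokens, build the splitter from a tokenizer instead (requires the &lt;code&gt;tiktoken&lt;/code&gt; package):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import RecursiveCharacterTextSplitter

# Length is now measured in tokens, matching the table's units.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # the tokenizer used by recent OpenAI models
    chunk_size=512,
    chunk_overlap=50,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;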

&lt;h3&gt;
  
  
  Step 3: Embed and Store in ChromaDB
&lt;/h3&gt;

&lt;p&gt;Now we convert each chunk into a vector and store them. ChromaDB is our choice here because it's persistent (data survives restarts), supports metadata filtering, and requires zero setup — it runs locally as an embedded database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_chroma&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;persist_directory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./chroma_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_vector_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create embeddings and store in ChromaDB.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-large-zh-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# SiliconFlow Chinese model
&lt;/span&gt;        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMBEDDING_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.siliconflow.cn/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;  &lt;span class="c1"&gt;# SiliconFlow limit: max 32 per batch
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;
        &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rmtree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;persist_directory&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vector store built: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; vectors persisted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Embedding model note:&lt;/strong&gt; We use &lt;code&gt;BAAI/bge-large-zh-v1.5&lt;/code&gt; on SiliconFlow (excellent for Chinese). Access via the OpenAI-compatible interface in &lt;code&gt;langchain_openai&lt;/code&gt;. Switch to &lt;code&gt;text-embedding-3-small&lt;/code&gt; for OpenAI. The &lt;code&gt;chunk_size=32&lt;/code&gt; parameter is critical — it's SiliconFlow's batch limit (max 32 per request), while most other providers default to 1000.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Build the Retriever
&lt;/h3&gt;

&lt;p&gt;The retriever is a thin wrapper around the vector store that handles the search logic. By default, it performs similarity search and returns the top-K most relevant chunks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Configure retriever with similarity search.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;search_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;search_k&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Build the RAG Chain
&lt;/h3&gt;

&lt;p&gt;Here's where LangChain's modern API shines. Instead of the older &lt;code&gt;RetrievalQA&lt;/code&gt; class, we use &lt;strong&gt;LCEL&lt;/strong&gt; (LangChain Expression Language) to compose a chain that is explicit, debuggable, and easy to modify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.output_parsers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StrOutputParser&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_rag_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Build the full RAG chain using LCEL (langchain 1.x compatible).
    retrieve → format_docs → assemble prompt → LLM → StrOutputParser
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glm-4-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Zhipu AI, via SiliconFlow or direct
&lt;/span&gt;        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://open.bigmodel.cn/api/paas/v4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# System prompt. {context} filled by format_docs, {question} by the user's input.
&lt;/span&gt;    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a precise knowledge assistant. Answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;based solely on the provided reference content below. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If the reference content does not contain the answer, say so clearly &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;— do not make anything up.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reference content:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;{context}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_messages&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{question}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Helper: convert Document list to a single string for {context}
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# LCEL Chain: compose components with the pipe operator
&lt;/span&gt;    &lt;span class="c1"&gt;# 1. {"context": retriever | format_docs, "question": RunnablePassthrough()}
&lt;/span&gt;    &lt;span class="c1"&gt;#    → retriever fetches docs, format_docs converts them to a string
&lt;/span&gt;    &lt;span class="c1"&gt;# 2. | prompt → assembles into a full prompt
&lt;/span&gt;    &lt;span class="c1"&gt;# 3. | llm → generates the answer
&lt;/span&gt;    &lt;span class="c1"&gt;# 4. | StrOutputParser() → returns plain text (not an AIMessage object)
&lt;/span&gt;    &lt;span class="n"&gt;rag_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;format_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rag_chain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's happening here?&lt;/strong&gt; We're using &lt;strong&gt;LCEL&lt;/strong&gt; (LangChain Expression Language) native syntax — the pipe &lt;code&gt;|&lt;/code&gt; operator — instead of the high-level &lt;code&gt;create_retrieval_chain&lt;/code&gt; (which was removed in langchain 1.x). The key line is &lt;code&gt;retriever | format_docs&lt;/code&gt;: the retriever outputs a list of &lt;code&gt;Document&lt;/code&gt; objects, and &lt;code&gt;format_docs&lt;/code&gt; converts them to a string that fills the &lt;code&gt;{context}&lt;/code&gt; placeholder. &lt;code&gt;RunnablePassthrough()&lt;/code&gt; passes the user's raw question through to the &lt;code&gt;{question}&lt;/code&gt; placeholder. The result is a few lines of declarative code that are functionally equivalent to the ~50 lines of imperative Python from Part 1.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Query the Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rag_chain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run a question through the RAG pipeline, print answer and sources.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# LCEL chain returns plain text directly
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Answer:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Retrieve docs separately to show sources (rag_chain doesn't expose them)
&lt;/span&gt;    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Retrieved sources:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;preview&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (page &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;preview&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Putting It All Together
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Load
&lt;/span&gt;    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_pdfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Split (smaller chunk_size for short PDF pages)
&lt;/span&gt;    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Embed &amp;amp; Store
&lt;/span&gt;    &lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_vector_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Retrieve
&lt;/span&gt;    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. Build Chain (LCEL)
&lt;/span&gt;    &lt;span class="n"&gt;rag_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_rag_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 6. Interactive Q&amp;amp;A
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Your question (quit to exit): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rag_chain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Running the Pipeline
&lt;/h2&gt;

&lt;p&gt;Place a PDF in &lt;code&gt;data/sample.pdf&lt;/code&gt; and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python rag_pipeline.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sample output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RAG Pipeline starting
  LLM Provider    : zhipu
  LLM Model       : glm-4-flash
  Embedding       : openai / BAAI/bge-large-zh-v1.5
  Data directory  : ./data
  Vector store    : ./chroma_db
==================================================
Loaded 'Automotive-SPICE-PAM-v40.pdf': 153 pages
Loaded 153 document fragments in total
Split into 2000 chunks (chunk_size=200, overlap=30)
[Embedding] Provider: openai | Model: BAAI/bge-large-zh-v1.5 | Base: https://api.siliconflow.cn/v1
Cleared old vector store: ./chroma_db
Vector store built: 2000 vectors persisted
==================================================
RAG Pipeline ready! Type a question to begin (enter 'quit' to exit)
==================================================

Your question: What is Automotive SPICE?

==================================================
Question: What is Automotive SPICE?
==================================================

Answer:
Automotive SPICE (Automotive Software Process Improvement and
Capability Determination) is a framework for assessing and improving
the capability of automotive software development processes. It
defines the key process areas of the software development lifecycle
and establishes capability-level assessment criteria...

Retrieved sources:
  [1] ./data/Automotive-SPICE-PAM-v40.pdf (page 5): Automotive
      SPICE Process Assessment Model The Process Assessment Model
      (PAM) defines the processes...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how the answer includes &lt;strong&gt;citations&lt;/strong&gt; — we know exactly which pages the information came from. This traceability is critical for production RAG systems where users need to verify claims.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed From Our 100-Line Version?
&lt;/h2&gt;

&lt;p&gt;Let's compare the hand-written RAG from Part 1 with our LangChain pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Hand-written (Part 1)&lt;/th&gt;
&lt;th&gt;LangChain (Part 2)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF loading&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;One-line &lt;code&gt;PyPDFLoader&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text splitting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None (whole docs)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; with smart boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In-memory only, lost on restart&lt;/td&gt;
&lt;td&gt;ChromaDB persists to disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedding model swap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rewrite API calls&lt;/td&gt;
&lt;td&gt;One-line parameter change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM swap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rewrite client code&lt;/td&gt;
&lt;td&gt;One-line parameter change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector DB swap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rewrite storage + search&lt;/td&gt;
&lt;td&gt;Swap &lt;code&gt;Chroma&lt;/code&gt; for &lt;code&gt;Qdrant&lt;/code&gt; / &lt;code&gt;Pinecone&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt engineering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raw string formatting&lt;/td&gt;
&lt;td&gt;Templated prompts with &lt;code&gt;ChatPromptTemplate&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Source citations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual tracking&lt;/td&gt;
&lt;td&gt;Automatic metadata propagation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chain construction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual &lt;code&gt;retrieve + generate&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;LCEL &lt;code&gt;|&lt;/code&gt; pipe composition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lines of pipeline code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~80&lt;/td&gt;
&lt;td&gt;~25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The abstraction doesn't hide complexity — it &lt;strong&gt;isolates&lt;/strong&gt; it. When you need to debug retrieval quality, you know exactly which component to tune. When you need to swap the embedding model for a cheaper alternative, you change one line. When your data grows beyond ChromaDB's capabilities, you switch to Qdrant without touching the rest of the pipeline.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;LangChain Version Compatibility:&lt;/strong&gt; The code in this article targets &lt;code&gt;langchain 1.x&lt;/code&gt; (current stable), which introduced breaking changes relative to &lt;code&gt;0.3.x&lt;/code&gt;: &lt;code&gt;create_retrieval_chain&lt;/code&gt; and &lt;code&gt;create_stuff_documents_chain&lt;/code&gt; were removed. We use LCEL native syntax (the &lt;code&gt;|&lt;/code&gt; pipe operator) instead, which is functionally equivalent and version-agnostic.&lt;/p&gt;
&lt;/blockquote&gt;
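
&lt;p&gt;To make the "one line" claim concrete, here is a sketch of the three swaps. The model names and the Qdrant import are illustrative placeholders, not part of this article's project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical swaps; each one touches a single construction site.

# 1. Swap the embedding model (e.g., to OpenAI's hosted model):
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # was BAAI/bge-large-zh-v1.5

# 2. Swap the LLM:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # was glm-4-flash

# 3. Swap the vector store (assumes the langchain-qdrant package):
# from langchain_qdrant import QdrantVectorStore
# vector_store = QdrantVectorStore.from_documents(
#     chunks, embeddings, url="http://localhost:6333", collection_name="rag"
# )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;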




&lt;h2&gt;
  
  
  Common Pitfalls at Each Stage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Loader Pitfall: "My PDF has tables and they come out garbled"
&lt;/h3&gt;

&lt;p&gt;Raw PDF text extraction flattens tables into a stream of numbers. For table-heavy documents, use &lt;code&gt;UnstructuredPDFLoader&lt;/code&gt; or &lt;code&gt;AzureAIDocumentIntelligenceLoader&lt;/code&gt;, which preserve structural relationships.&lt;/p&gt;
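
&lt;p&gt;A minimal sketch of the element-aware loader, assuming the &lt;code&gt;unstructured&lt;/code&gt; extras are installed (the file path is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_community.document_loaders import UnstructuredPDFLoader

# mode="elements" keeps tables, titles, and list items as separate
# Document objects instead of one flattened text stream
loader = UnstructuredPDFLoader("./data/sample.pdf", mode="elements")
docs = loader.load()
print(docs[0].metadata.get("category"))  # e.g. "Table" or "NarrativeText"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;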

&lt;h3&gt;
  
  
  Splitter Pitfall: "The answer is split across two chunks and the model only sees half"
&lt;/h3&gt;

&lt;p&gt;Increase &lt;code&gt;chunk_overlap&lt;/code&gt; to 100-150 characters, or reduce &lt;code&gt;chunk_size&lt;/code&gt; so that key concepts fit within a single chunk. Better yet, use &lt;strong&gt;Parent-Document Retrieval&lt;/strong&gt; (covered in a later article), which retrieves small chunks but returns the full parent document for context.&lt;/p&gt;
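
&lt;p&gt;As a sketch (the numbers are starting points to tune, not universal constants):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_text_splitters import RecursiveCharacterTextSplitter

# A larger overlap means a sentence cut at one chunk boundary is
# repeated at the start of the next chunk, so at least one copy is whole
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=120,  # up from 30; trades index size for integrity
)
chunks = splitter.split_documents(docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;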

&lt;h3&gt;
  
  
  Embedding Pitfall: "Questions and documents don't match even though they should"
&lt;/h3&gt;

&lt;p&gt;This is the "asymmetric retrieval" problem. A user asks "How do I reset my password?" but the document says "To reset your password, navigate to Settings → Security." The question and the answer embed to different vectors because their surface text differs. Solutions: use a model fine-tuned for Q&amp;amp;A retrieval (like BGE-M3), or generate hypothetical answers for retrieval (HyDE — also covered later).&lt;/p&gt;
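
&lt;p&gt;Here's a hand-rolled HyDE sketch that reuses the &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;vector_store&lt;/code&gt; built earlier; the prompt wording is our own, not a fixed recipe:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def hyde_retrieve(question: str, k: int = 4):
    """Embed a hypothetical answer instead of the raw question (HyDE)."""
    # 1. Draft a plausible answer. It may be factually wrong; only its
    #    wording matters, because it reads like document text.
    hypothetical = llm.invoke(
        f"Write a short passage that answers this question:\n{question}"
    ).content
    # 2. Retrieve with the hypothetical passage, which embeds closer
    #    to answer-shaped chunks than the bare question does.
    return vector_store.similarity_search(hypothetical, k=k)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;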

&lt;h3&gt;
  
  
  Retriever Pitfall: "Top-K=4 isn't enough for complex questions"
&lt;/h3&gt;

&lt;p&gt;If a question requires synthesizing information from five different sections of a document, &lt;code&gt;k=4&lt;/code&gt; will miss one. But increasing &lt;code&gt;k&lt;/code&gt; blindly adds noise. A better approach: use &lt;strong&gt;Multi-Query Retrieval&lt;/strong&gt; (generate 3 variants of the question, retrieve for each, deduplicate) or &lt;strong&gt;Reranking&lt;/strong&gt; (retrieve 20, then use a cross-encoder to pick the best 5).&lt;/p&gt;
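
&lt;p&gt;LangChain ships a built-in version of the multi-query option. A sketch, reusing &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;vector_store&lt;/code&gt; (note: the import path shown is the 0.3.x location; in 1.x it may have moved to the &lt;code&gt;langchain-classic&lt;/code&gt; package):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain.retrievers.multi_query import MultiQueryRetriever

# Generates several rephrasings of the question with the LLM,
# retrieves for each variant, and returns the deduplicated union
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    llm=llm,
)
docs = multi_retriever.invoke("What process areas does Automotive SPICE define?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;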

&lt;h3&gt;
  
  
  Chain Pitfall: "The model ignores the context and hallucinates"
&lt;/h3&gt;

&lt;p&gt;Your prompt matters. The system prompt must explicitly instruct the model to use only the provided context. Adding "If the reference content does not contain the answer, say so clearly — do not make anything up" dramatically improves faithfulness. We'll measure this quantitatively with RAGAS in the evaluation articles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we took the raw RAG concept from Part 1 and wrapped it in a production-ready framework. Here's what we covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The six components&lt;/strong&gt; of a LangChain RAG pipeline — Loader, Splitter, Embedding, Vector Store, Retriever, and Chain — and what quality risk hides in each one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A complete, runnable project&lt;/strong&gt; that loads PDFs, splits them with &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;, embeds them through the OpenAI-compatible embeddings interface, stores them in ChromaDB, and answers questions via a LangChain LCEL chain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The chunk size trade-off&lt;/strong&gt; — in real projects, PDF pages may be very short (e.g., 200 characters), so a &lt;code&gt;chunk_size=512&lt;/code&gt; default may never actually split anything; &lt;code&gt;chunk_size=200&lt;/code&gt; with an overlap of 30 is a safer starting point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common pitfalls&lt;/strong&gt; at each pipeline stage, from mangled PDF tables to asymmetric retrieval mismatches.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The code in this article is a solid foundation. It handles real PDFs, persists data, and gives you source citations. But it's still a &lt;strong&gt;naive RAG&lt;/strong&gt; pipeline — one query, one retrieval pass, one answer. In the next articles, we'll add the components that separate toy demos from production systems: hybrid search, reranking, query optimization, and evaluation frameworks.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://python.langchain.com/docs/tutorials/rag/" rel="noopener noreferrer"&gt;LangChain RAG Tutorial&lt;/a&gt; — Official LangChain RAG quickstart&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://python.langchain.com/docs/concepts/lcel/" rel="noopener noreferrer"&gt;LangChain Expression Language (LCEL)&lt;/a&gt; — Why and how to use LCEL for composable chains&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.trychroma.com/" rel="noopener noreferrer"&gt;ChromaDB Documentation&lt;/a&gt; — Vector store setup, persistence, and querying&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>langchain</category>
      <category>opensource</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No.52): Tank-OS - A Red Hat Engineer Baked an AI Agent Into a Bootable Linux Image Over a Weekend</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Thu, 30 Apr 2026 11:59:02 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no52-tank-os-a-red-hat-engineer-baked-an-ai-agent-into-a-1262</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no52-tank-os-a-red-hat-engineer-baked-an-ai-agent-into-a-1262</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"When an AI Agent starts deleting emails, accessing databases, and calling external APIs, are you certain it can't go out of bounds?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is article No.52 in the "One Open Source Project a Day" series. Today's project is &lt;strong&gt;Tank-OS&lt;/strong&gt; (&lt;a href="https://github.com/LobsterTrap/tank-os" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In April 2026, TechCrunch reported on a project that one engineer built over a single weekend. The engineer is Sally O'Malley, Principal Software Engineer at Red Hat's Office of the CTO and a core maintainer of OpenClaw. The project answers a question that becomes more pressing as AI Agents get more capable: &lt;strong&gt;when you need to deploy a fleet of AI Agents across a company, how do you ensure every machine is isolated, secure, and consistently updatable?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tank-OS's answer: &lt;strong&gt;pack the Agent, its runtime, the OS, Systemd units, and the upgrade mechanism into a single OCI container image, then boot entire machines directly from that image.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In cloud-native circles, this pattern (called bootc — Boot Container) isn't new. But applying it specifically to solve AI Agent enterprise deployment security? Tank-OS is the most concrete, complete open-source reference implementation available.&lt;/p&gt;

&lt;p&gt;TechCrunch headline: &lt;em&gt;"Red Hat's OpenClaw maintainer just made enterprise Claw deployments a lot safer."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;What bootc is: why "OS as a container" is the next Linux deployment paradigm&lt;/li&gt;
&lt;li&gt;rootless Podman Quadlet: how to run containers with no root privileges and no daemon&lt;/li&gt;
&lt;li&gt;Tank-OS's layered architecture: how the immutable OS layer and mutable state layer stay separate&lt;/li&gt;
&lt;li&gt;Transactional system updates: rolling back an entire machine's OS state with one command&lt;/li&gt;
&lt;li&gt;Credential management: how API keys never touch the filesystem in plaintext&lt;/li&gt;
&lt;li&gt;Tank-OS vs. NVIDIA NemoClaw: two enterprise AI Agent security approaches compared&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic understanding of Linux containers (images, OCI spec)&lt;/li&gt;
&lt;li&gt;Familiarity with Podman or Docker&lt;/li&gt;
&lt;li&gt;Basic Systemd service management knowledge&lt;/li&gt;
&lt;li&gt;Familiarity with OpenClaw (AI Agent framework, 40,000+ GitHub Stars)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is It?
&lt;/h3&gt;

&lt;p&gt;The core problem Tank-OS addresses can be stated in one sentence: &lt;strong&gt;AI Agents are powerful enough to be dangerous, and most deployment approaches haven't taken that seriously.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sally O'Malley's project documentation describes real incidents: OpenClaw Agents in production have deleted emails and accidentally modified user data. When an enterprise deploys dozens or hundreds of Agent-running machines, the traditional "apt install + config files" approach has four fundamental problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Credential security&lt;/strong&gt;: API Keys live in config files, readable by other processes or accidentally leaked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update consistency&lt;/strong&gt;: Different OS versions and library versions across machines cause inconsistent Agent behavior
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure isolation&lt;/strong&gt;: A crashed or compromised Agent process can affect the host OS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fleet management&lt;/strong&gt;: No unified mechanism to push updates across the whole fleet&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tank-OS uses a bootc + rootless Podman Quadlet stack to systematically address all four.&lt;/p&gt;

&lt;h3&gt;
  
  
  About the Author
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sally Ann O'Malley&lt;/strong&gt; (GitHub: &lt;code&gt;sallyom&lt;/code&gt;; &lt;code&gt;LobsterTrap&lt;/code&gt; is the account she created for this project)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Role&lt;/strong&gt;: Red Hat Principal Software Engineer, Emerging Technologies, Office of the CTO&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw connection&lt;/strong&gt;: Core maintainer of OpenClaw, collaborating with founder Peter Steinberger on enterprise use cases and Red Hat Linux ecosystem integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background&lt;/strong&gt;: Long-time contributor to Red Hat container technologies; deep user and contributor to Podman, bootc, and related Red Hat open-source projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Origin of Tank-OS&lt;/strong&gt;: Built in a single weekend. O'Malley anticipated "what happens when OpenClaw enters the enterprise" and wanted a reference architecture ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Red Hat's official blog also published a companion article: &lt;em&gt;"Building a hardened, image-based foundation for AI agents"&lt;/em&gt; — signaling this is not just a personal project but Red Hat's formal exploration of AI Agent infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Stats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;GitHub Stars&lt;/strong&gt;: 104&lt;/li&gt;
&lt;li&gt;🍴 &lt;strong&gt;Forks&lt;/strong&gt;: 11&lt;/li&gt;
&lt;li&gt;🔤 &lt;strong&gt;Language&lt;/strong&gt;: Shell (81.7%), Dockerfile (18.3%)&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;Pre-built image&lt;/strong&gt;: &lt;code&gt;quay.io/sallyom/tank-os:latest&lt;/code&gt; (amd64 / arm64)&lt;/li&gt;
&lt;li&gt;📰 &lt;strong&gt;Press&lt;/strong&gt;: TechCrunch (2026-04-28), Decrypt, Yahoo Tech&lt;/li&gt;
&lt;li&gt;📅 &lt;strong&gt;Released&lt;/strong&gt;: April 2026&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Architecture Shift
&lt;/h3&gt;

&lt;p&gt;Tank-OS elevates "how to safely run an AI Agent" from a software configuration problem to an &lt;strong&gt;operating system architecture problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Traditional deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host OS (mutable)
  └── Userspace (mutable)
        └── OpenClaw process (can access host filesystem)
              └── API Keys (plaintext config file)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tank-OS deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bootc image (immutable OS layer, read-only)
  └── openclaw user (no root privileges)
        └── rootless Podman Quadlet (container, no daemon)
              └── OpenClaw process (isolated from host)
                    └── API Keys (Podman secret store, encrypted)
  └── ~/.openclaw/ (mutable state layer, persisted but isolated from OS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Enterprise AI Agent fleet management&lt;/strong&gt;&lt;br&gt;
Dozens or hundreds of machines running the same OpenClaw version, kept consistent via unified image update — no "configuration drift" causing behavioral differences across machines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dev/prod environment parity&lt;/strong&gt;&lt;br&gt;
Developers run the exact same Tank-OS image locally in a VM, eliminating "works on my machine" problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Read-only OS security hardening&lt;/strong&gt;&lt;br&gt;
Even if an Agent process is compromised or has a bug, it cannot modify the host OS filesystem — the OS layer is immutable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Cloud instances and edge devices&lt;/strong&gt;&lt;br&gt;
SSH key injection via cloud-init enables fast Agent instance startup on AWS EC2, GCP VMs, or Raspberry Pi devices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Fast version rollback&lt;/strong&gt;&lt;br&gt;
One command to roll back to the previous OS + Agent version — no manual uninstall/reinstall of dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Use the pre-built image (recommended)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No local compilation needed. Use Podman Desktop's BootC extension or &lt;code&gt;bootc-image-builder&lt;/code&gt; to generate a virtual disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create output directory&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; out-tank-os

&lt;span class="c"&gt;# (Optional) Prepare SSH key injection config&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; config.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "blueprint": {
    "customizations": {
      "user": [
        {
          "name": "openclaw",
          "groups": ["wheel"],
          "key": "ssh-ed25519 AAAA...your-public-key..."
        }
      ]
    }
  }
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Build QCOW2 disk image (requires rootful Podman)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;podman run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--privileged&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--pull&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;newer &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ./config.json:/config.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ./out-tank-os:/output &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /var/lib/containers/storage:/var/lib/containers/storage &lt;span class="se"&gt;\&lt;/span&gt;
  quay.io/centos-bootc/bootc-image-builder:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--type&lt;/span&gt; qcow2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; /config.json &lt;span class="se"&gt;\&lt;/span&gt;
  quay.io/sallyom/tank-os:latest

&lt;span class="c"&gt;# Result: out-tank-os/qcow2/disk.qcow2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SSH in and interact with OpenClaw:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Log in as openclaw user (not root)&lt;/span&gt;
ssh openclaw@&amp;lt;vm-ip&amp;gt;

&lt;span class="c"&gt;# Check Agent status&lt;/span&gt;
openclaw gateway status &lt;span class="nt"&gt;--deep&lt;/span&gt;

&lt;span class="c"&gt;# Run health check&lt;/span&gt;
openclaw doctor

&lt;span class="c"&gt;# Get Dashboard URL&lt;/span&gt;
openclaw dashboard &lt;span class="nt"&gt;--no-open&lt;/span&gt;

&lt;span class="c"&gt;# List connected devices&lt;/span&gt;
openclaw devices list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Update OS + Agent (transactional):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# First update: switch to latest image&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bootc switch &lt;span class="nt"&gt;--apply&lt;/span&gt; quay.io/sallyom/tank-os:latest

&lt;span class="c"&gt;# Subsequent updates: pull new version and restart&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bootc upgrade &lt;span class="nt"&gt;--apply&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Features
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Immutable OS&lt;/strong&gt; — Root filesystem mounted read-only; Agent processes and userspace programs cannot modify system files. Configuration drift is structurally impossible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;rootless Podman Quadlet&lt;/strong&gt; — OpenClaw runs in a Podman container with no root privileges, its lifecycle managed by Systemd Quadlet units. A compromised container process cannot escalate to the host OS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transactional updates&lt;/strong&gt; — bootc's atomic update mechanism: updates either succeed completely or roll back completely. No "half-updated broken machine" state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Secure credential management&lt;/strong&gt; — API Keys stored in rootless Podman's secret store (encrypted), never appearing in plaintext config files or environment variables. SSH keys injected via cloud-init at first boot, not baked into the image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-instance support&lt;/strong&gt; — A single machine can run multiple OpenClaw instances (e.g., one for work, one for research), fully isolated with separate containers, ports, data directories, and secret namespaces.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-architecture&lt;/strong&gt; — Pre-built images support both &lt;code&gt;linux/amd64&lt;/code&gt; and &lt;code&gt;linux/arm64&lt;/code&gt;, covering x86 servers and Apple Silicon/ARM devices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cloud-init native integration&lt;/strong&gt; — Standard cloud-init support for AWS, GCP, Azure instance initialization.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  How It Compares
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Tank-OS&lt;/th&gt;
&lt;th&gt;Bare metal OpenClaw&lt;/th&gt;
&lt;th&gt;Docker compose&lt;/th&gt;
&lt;th&gt;NVIDIA NemoClaw&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OS immutability&lt;/td&gt;
&lt;td&gt;Complete (bootc read-only root)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None (host OS mutable)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container privileges&lt;/td&gt;
&lt;td&gt;rootless (no daemon)&lt;/td&gt;
&lt;td&gt;No containers&lt;/td&gt;
&lt;td&gt;Root Docker daemon&lt;/td&gt;
&lt;td&gt;K3s + Docker daemon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transactional updates&lt;/td&gt;
&lt;td&gt;Atomic rollback&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credential storage&lt;/td&gt;
&lt;td&gt;Podman secret store&lt;/td&gt;
&lt;td&gt;Plaintext files&lt;/td&gt;
&lt;td&gt;Docker secrets or plaintext&lt;/td&gt;
&lt;td&gt;L7 proxy injection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fleet management&lt;/td&gt;
&lt;td&gt;Suitable (unified image)&lt;/td&gt;
&lt;td&gt;Difficult&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Suitable (more complex)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment complexity&lt;/td&gt;
&lt;td&gt;Low (single image boot)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High (K3s cluster)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security policy granularity&lt;/td&gt;
&lt;td&gt;Medium (OS-level isolation)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High (multi-layer policy engine)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Technology 1: bootc (Boot Container)
&lt;/h3&gt;

&lt;p&gt;bootc is a Red Hat-led open-source project that implements a counterintuitive idea: &lt;strong&gt;manage the entire operating system as an OCI container image&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Traditional Linux update model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System running
  → apt upgrade / dnf update (replacing files one by one)
  → partial success → system in intermediate state → rollback is hard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;bootc update model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Currently running: image v1.0 (read-only mounted)
  → bootc switch quay.io/sallyom/tank-os:v1.1
  → Download new image layers, write to separate partition
  → Reboot → atomic switch to v1.1
  → Error → bootc rollback → back to v1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is analogous to the A/B partition mechanism in phone OTA updates: the new version is written to the standby partition, the device reboots to switch over, and the original partition is kept as a fallback in case anything fails.&lt;/p&gt;

&lt;p&gt;Tank-OS's &lt;code&gt;bootc/Containerfile&lt;/code&gt; builds from &lt;code&gt;quay.io/fedora/fedora-bootc:latest&lt;/code&gt;, Fedora's official bootc base image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; quay.io/fedora/fedora-bootc:latest&lt;/span&gt;

&lt;span class="c"&gt;# Install required packages&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;dnf &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    podman &lt;span class="se"&gt;\
&lt;/span&gt;    openssh-server &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-init &lt;span class="se"&gt;\
&lt;/span&gt;    python3 &lt;span class="se"&gt;\
&lt;/span&gt;    shadow-utils &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    vim

&lt;span class="c"&gt;# Create openclaw user (UID/GID 1000)&lt;/span&gt;
&lt;span class="c"&gt;# Configure subuid/subgid range (100000-165535) for rootless Podman&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;useradd &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; 1000 &lt;span class="nt"&gt;-g&lt;/span&gt; 1000 openclaw &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; wheel openclaw &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"100000:65536"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/subuid &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"100000:65536"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/subgid

&lt;span class="c"&gt;# Enable cloud-init service family&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-init-local.service &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-init-network.service &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-init.service &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-config.service &lt;span class="se"&gt;\
&lt;/span&gt;    cloud-final.service &lt;span class="se"&gt;\
&lt;/span&gt;    sshd.service

&lt;span class="c"&gt;# Inject tank-os scripts and Quadlet units&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; bootc/ /&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Technology 2: rootless Podman Quadlet
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Podman instead of Docker?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Podman is Red Hat's daemonless container tool. The core security advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No central daemon&lt;/strong&gt;: Docker requires a root-privileged &lt;code&gt;dockerd&lt;/code&gt; running continuously; Podman forks container processes directly, no daemon required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rootless mode&lt;/strong&gt;: Regular users can run full containers; container "root" is an unprivileged user on the host&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User namespace isolation&lt;/strong&gt;: Via subuid/subgid mapping — container UID 1000 maps to host UID 101000&lt;/li&gt;
&lt;/ul&gt;
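
&lt;p&gt;You can inspect the mapping yourself with stock Podman; the exact numbers depend on your subuid range:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Enter the rootless user namespace and print its UID map
podman unshare cat /proc/self/uid_map
#        0       1000          1    (container root = your own user)
#        1     100000      65536    (everything else = subuid range)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;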

&lt;p&gt;OpenClaw in Tank-OS runs as the &lt;code&gt;openclaw&lt;/code&gt; user (UID 1000). Even if the OpenClaw process is exploited:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It cannot access host OS system files (OS layer is read-only)&lt;/li&gt;
&lt;li&gt;It cannot escalate to root (rootless Podman namespace isolation)&lt;/li&gt;
&lt;li&gt;It cannot affect other users' data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What is Quadlet?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Quadlet (introduced in Podman 4.4+) lets you declare container runtime configuration using Systemd unit file syntax, with Systemd managing the container lifecycle directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/containers/systemd/users/1000/openclaw.container
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;OpenClaw AI Agent Service&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network-online.target&lt;/span&gt;

&lt;span class="nn"&gt;[Container]&lt;/span&gt;
&lt;span class="py"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;ghcr.io/openclaw/openclaw:latest&lt;/span&gt;
&lt;span class="py"&gt;ContainerName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;openclaw&lt;/span&gt;
&lt;span class="py"&gt;UserNS&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;keep-id:uid=1000,gid=1000&lt;/span&gt;
&lt;span class="py"&gt;Volume&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;%h/.openclaw:/home/openclaw/.openclaw:z&lt;/span&gt;
&lt;span class="py"&gt;Secret&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;openclaw-api-key,type=env,target=OPENCLAW_API_KEY&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;default.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Systemd reads this file and automatically generates the corresponding &lt;code&gt;.service&lt;/code&gt; unit — pulling the image, creating the container, managing restart policy. OpenClaw becomes a standard Systemd service: &lt;code&gt;systemctl --user status openclaw&lt;/code&gt; for status, &lt;code&gt;journalctl --user -u openclaw&lt;/code&gt; for logs.&lt;/p&gt;
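
&lt;p&gt;For illustration, the day-to-day lifecycle uses nothing beyond standard Systemd commands (the unit name comes from the Quadlet file above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Regenerate services from Quadlet files after adding or editing one
systemctl --user daemon-reload

# Start and inspect the container like any other service
systemctl --user start openclaw.service
systemctl --user status openclaw

# Follow the Agent's logs
journalctl --user -u openclaw -f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;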

&lt;h3&gt;
  
  
  State Layer Architecture
&lt;/h3&gt;

&lt;p&gt;Tank-OS strictly separates immutable and mutable layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Immutable layer (OS image, read-only)
├── /usr/           ← System software
├── /etc/           ← System config (baked into image)
├── /opt/tank-os/   ← tank-os scripts and tools
└── Quadlet units   ← Container declaration files

Mutable layer (runtime state, persisted)
├── ~/.openclaw/            ← OpenClaw session state, plugins, history
├── ~/.config/containers/   ← Podman user config
└── Podman secret store     ← Encrypted API keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run &lt;code&gt;bootc upgrade&lt;/code&gt;, only the immutable layer is replaced. OpenClaw's state, conversation history, and API keys in the mutable layer are fully preserved: you upgrade the Agent version while keeping all session history and configuration intact.&lt;/p&gt;
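
&lt;p&gt;From the operator's seat, the whole mechanism comes down to two standard bootc commands (sketch; output omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show the booted image, any staged update, and the rollback target
sudo bootc status

# A bad upgrade? Flip back to the previous image and reboot into it
sudo bootc rollback
sudo systemctl reboot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;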

&lt;h3&gt;
  
  
  Credential Flow
&lt;/h3&gt;

&lt;p&gt;The API key injection flow is a security highlight:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. First boot: cloud-init executes
   ↓
2. Reads user-data (from cloud provider metadata or local config)
   ↓
3. Runs tank-os bootstrap script
   ↓
4. Writes API key into rootless Podman secret store (encrypted)
   ↓
5. Quadlet unit starts OpenClaw container
   ↓
6. Secret= directive injects secret as environment variable into container
   ↓
7. API key never appears in plaintext on the filesystem or in process environment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare this with a traditional &lt;code&gt;.env&lt;/code&gt; file: Podman's secret store uses system keychain encryption, so an ordinary file read cannot retrieve the contents.&lt;/p&gt;
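
&lt;p&gt;To make step 4 concrete, here's a minimal sketch of seeding the secret store. The secret name matches the &lt;code&gt;Secret=&lt;/code&gt; line in the Quadlet unit above; the function and the interactive prompt are illustrative, and in Tank-OS this runs from the cloud-init bootstrap script, not by hand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: seed the rootless Podman secret store.
# "podman secret create NAME -" reads the secret body from stdin, so the
# key never appears in a file or in a process argument list.
import getpass
import subprocess

def store_api_key(api_key: str, name: str = "openclaw-api-key") -&gt; None:
    subprocess.run(
        ["podman", "secret", "create", name, "-"],
        input=api_key.encode(),
        check=True,
    )

if __name__ == "__main__":
    # Illustrative prompt; the real flow reads the key from user-data
    store_api_key(getpass.getpass("API key: "))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, the &lt;code&gt;Secret=&lt;/code&gt; directive in the Quadlet unit injects the stored value as &lt;code&gt;OPENCLAW_API_KEY&lt;/code&gt; at container start.&lt;/p&gt;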

&lt;h3&gt;
  
  
  Tank-OS vs. NVIDIA NemoClaw
&lt;/h3&gt;

&lt;p&gt;The most common technical question after Tank-OS launched: "How does this differ from NVIDIA's NemoClaw reference architecture?"&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Tank-OS&lt;/th&gt;
&lt;th&gt;NVIDIA NemoClaw&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Core technology&lt;/td&gt;
&lt;td&gt;Fedora bootc + rootless Podman&lt;/td&gt;
&lt;td&gt;Docker + embedded K3s cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS immutability&lt;/td&gt;
&lt;td&gt;Complete (bootc partition-level)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credential isolation&lt;/td&gt;
&lt;td&gt;Podman secret store&lt;/td&gt;
&lt;td&gt;L7 proxy injection (key never touches Agent filesystem)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security policy&lt;/td&gt;
&lt;td&gt;OS-level isolation&lt;/td&gt;
&lt;td&gt;seccomp + Landlock + network namespace multi-layer policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update mechanism&lt;/td&gt;
&lt;td&gt;Atomic image replacement (rollback capable)&lt;/td&gt;
&lt;td&gt;Standard container update&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target scenario&lt;/td&gt;
&lt;td&gt;Enterprise fleet (standard IT ops toolchain)&lt;/td&gt;
&lt;td&gt;Research environments, fine-grained policy control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment complexity&lt;/td&gt;
&lt;td&gt;Low (single image, standard VM tools)&lt;/td&gt;
&lt;td&gt;High (K3s cluster operations)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;: Tank-OS fits enterprise scenarios that treat the AI Agent as standard IT infrastructure. NemoClaw fits research or high-security environments that need multi-layer, fine-grained security policy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/LobsterTrap/tank-os" rel="noopener noreferrer"&gt;https://github.com/LobsterTrap/tank-os&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;Pre-built image&lt;/strong&gt;: &lt;a href="https://quay.io/sallyom/tank-os" rel="noopener noreferrer"&gt;quay.io/sallyom/tank-os&lt;/a&gt; (amd64 / arm64)&lt;/li&gt;
&lt;li&gt;📚 &lt;strong&gt;Build docs&lt;/strong&gt;: &lt;a href="https://github.com/LobsterTrap/tank-os/blob/main/docs/build.md" rel="noopener noreferrer"&gt;docs/build.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📚 &lt;strong&gt;CLI docs&lt;/strong&gt;: &lt;a href="https://github.com/LobsterTrap/tank-os/blob/main/docs/cli.md" rel="noopener noreferrer"&gt;docs/cli.md&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Press &amp;amp; Related
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;📰 &lt;strong&gt;TechCrunch&lt;/strong&gt;: &lt;a href="https://techcrunch.com/2026/04/28/red-hats-openclaw-maintainer-just-made-enterprise-claw-deployments-a-lot-safer/" rel="noopener noreferrer"&gt;Red Hat's OpenClaw maintainer just made enterprise Claw deployments a lot safer&lt;/a&gt; (Julie Bort, 2026-04-28)&lt;/li&gt;
&lt;li&gt;📝 &lt;strong&gt;Red Hat blog&lt;/strong&gt;: &lt;a href="https://www.redhat.com/en/blog/building-hardened-image-based-foundation-ai-agents" rel="noopener noreferrer"&gt;Building a hardened, image-based foundation for AI agents&lt;/a&gt; (Sally O'Malley)&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Fedora bootc docs&lt;/strong&gt;: &lt;a href="https://docs.fedoraproject.org/en-US/bootc/" rel="noopener noreferrer"&gt;docs.fedoraproject.org/en-US/bootc&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;OpenClaw project&lt;/strong&gt;: &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;github.com/openclaw/openclaw&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clear problem definition&lt;/strong&gt;: Tank-OS targets real enterprise AI Agent deployment security problems — credential leakage, configuration drift, update consistency, failure isolation — not stacking technologies for their own sake&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mature technology choices&lt;/strong&gt;: bootc + rootless Podman + Systemd Quadlet are all Red Hat production-grade tools with full enterprise support and documentation. Not experimental.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The layered design is the insight&lt;/strong&gt;: Strict separation of immutable OS layer and mutable state layer is what lets Tank-OS simultaneously achieve "security isolation" and "state persistence." This design principle is worth borrowing for any stateful AI Agent deployment scenario.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transactional updates change the ops paradigm&lt;/strong&gt;: A single &lt;code&gt;bootc upgrade&lt;/code&gt; command updates the entire machine's OS + Agent, with atomic rollback on failure. For IT teams managing AI Agent fleets, this is a qualitative change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The weekend project lesson&lt;/strong&gt;: Tank-OS uses ~500 lines of shell script and a Containerfile to solve a real enterprise pain point. A reminder that many apparently complex infrastructure problems need nothing more than existing mature tools, composed correctly.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Who Should Use This
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise IT administrators&lt;/strong&gt;: Deploying OpenClaw Agent fleets internally with unified management and security compliance requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevOps/SRE engineers&lt;/strong&gt;: Interested in bootc and immutable infrastructure; looking for an AI Agent deployment reference architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Red Hat/Fedora users&lt;/strong&gt;: Want to integrate AI Agents seamlessly into existing RHEL infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI security researchers&lt;/strong&gt;: Studying AI Agent isolation, credential management, and enterprise security hardening&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where to Start
&lt;/h3&gt;

&lt;p&gt;Start by understanding bootc's core concepts (&lt;a href="https://docs.fedoraproject.org/en-US/bootc/" rel="noopener noreferrer"&gt;Fedora bootc docs&lt;/a&gt;), then read Tank-OS's &lt;code&gt;bootc/Containerfile&lt;/code&gt; and &lt;code&gt;docs/build.md&lt;/code&gt; and follow the documentation to run an instance in a local VM. The whole process takes under 2 hours, and you'll come away with a very concrete feel for the "OS as container" paradigm.&lt;/p&gt;

&lt;p&gt;If you're already deploying OpenClaw in an enterprise environment, Tank-OS can be used almost directly as a production reference architecture. Sally O'Malley is OpenClaw's enterprise maintainer — every design decision in this project comes from real enterprise requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Question Worth Sitting With
&lt;/h3&gt;

&lt;p&gt;Tank-OS's central premise is worth sitting with: &lt;strong&gt;when should AI Agent security be solved at the OS architecture level rather than the application configuration level?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most current AI Agent deployments treat security as a software problem — use the right API key management library, set the right file permissions, configure the right network rules. Tank-OS argues that once an Agent is powerful enough to delete emails, modify databases, and call external APIs, software configuration isn't enough. The OS itself needs to be part of the security model.&lt;/p&gt;

&lt;p&gt;That's a significant architectural claim. The fact that it came from a Red Hat engineer on a weekend — not a multi-year research project — suggests the underlying tools were already ready. Someone just needed to put them together.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Visit my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;personal site&lt;/a&gt; for more useful knowledge and interesting products&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>openclaw</category>
      <category>linux</category>
      <category>podman</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No.51): VibeVoice - Microsoft's Speech AI That Processes 90 Minutes of Audio in a Single Pass</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Wed, 29 Apr 2026 02:32:29 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no51-vibevoice-microsofts-speech-ai-that-processes-90-minutes-3k6p</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no51-vibevoice-microsofts-speech-ai-that-processes-90-minutes-3k6p</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"The fundamental limit of traditional speech AI isn't model quality — it's architecture. They were never designed for long audio."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is article No.51 in the "One Open Source Project a Day" series. Today's project is &lt;strong&gt;VibeVoice&lt;/strong&gt; (&lt;a href="https://github.com/microsoft/VibeVoice" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In August 2025, Microsoft Research quietly pushed a repository to GitHub. No launch event. No press release.&lt;/p&gt;

&lt;p&gt;The capability it demonstrated: &lt;strong&gt;synthesizing 90 minutes of natural multi-speaker conversation — 4 speakers, consistent voices throughout — in a single model pass&lt;/strong&gt;. For context, ElevenLabs tops out around 5 minutes per call. OpenAI's TTS has similar constraints. The open-source alternatives before this couldn't touch an hour of audio without stitching together segments.&lt;/p&gt;

&lt;p&gt;The mechanism behind this is a single architectural decision: a &lt;strong&gt;7.5 Hz ultra-low framerate tokenizer&lt;/strong&gt; that compresses 90 minutes of audio into ~40,500 tokens — small enough to fit inside an LLM's context window. That's a 3,200x compression ratio compared to the raw audio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;44,900+ Stars, 5,000+ Forks. ICLR 2026 Oral Presentation&lt;/strong&gt; (top ~2% acceptance rate at the field's premier venue).&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Why 7.5 Hz tokenization is the key architectural insight: the math behind 3,200x compression&lt;/li&gt;
&lt;li&gt;The LLM + diffusion head hybrid architecture: how semantic understanding and acoustic generation divide responsibilities&lt;/li&gt;
&lt;li&gt;Three model comparison: ASR-7B, TTS-1.5B, Realtime-0.5B — what each is for&lt;/li&gt;
&lt;li&gt;Benchmark numbers: how VibeVoice ASR compares to Gemini 2.5 Pro and Whisper&lt;/li&gt;
&lt;li&gt;The misuse incident: why Microsoft pulled TTS access in September 2025 and current availability status&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Basic familiarity with speech AI concepts (TTS, ASR, WER)&lt;/li&gt;
&lt;li&gt;Python and Hugging Face Transformers experience&lt;/li&gt;
&lt;li&gt;Basic understanding of LLM inference and diffusion models (optional but helpful for architecture sections)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is It?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;VibeVoice&lt;/strong&gt; is a family of speech AI models from Microsoft Research that uses a unified architecture — 7.5 Hz continuous speech tokenizer + LLM backbone + diffusion head — to address the fundamental limitation of existing speech AI systems: they were built for short audio.&lt;/p&gt;

&lt;p&gt;Traditional TTS synthesizes a few minutes at most. Traditional ASR (like Whisper) splits long audio into 30-second chunks and processes them sequentially — meaning speaker tracking breaks at every boundary and global semantic context is lost. Try to process a one-hour podcast or generate a 45-minute audiobook and these systems essentially give up.&lt;/p&gt;

&lt;p&gt;VibeVoice's answer: &lt;strong&gt;redesign at the architecture level, so the LLM operates directly on compressed audio tokens over the full audio length&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  About the Team
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Organization&lt;/strong&gt;: Microsoft Research&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Academic recognition&lt;/strong&gt;: VibeVoice-TTS paper "Expressive Podcast Generation with Next-Token Diffusion" accepted as ICLR 2026 Oral Presentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical report&lt;/strong&gt;: arXiv 2508.19205&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First published&lt;/strong&gt;: August 2025&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Stats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;GitHub Stars&lt;/strong&gt;: 44,900+&lt;/li&gt;
&lt;li&gt;🍴 &lt;strong&gt;Forks&lt;/strong&gt;: 5,000+&lt;/li&gt;
&lt;li&gt;📝 &lt;strong&gt;Commits&lt;/strong&gt;: 134&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;🌐 &lt;strong&gt;Language&lt;/strong&gt;: Python (100%)&lt;/li&gt;
&lt;li&gt;🤗 &lt;strong&gt;HuggingFace&lt;/strong&gt;: &lt;a href="https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f" rel="noopener noreferrer"&gt;microsoft/vibevoice collection&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Three Models, One Architecture
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Core capability&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VibeVoice-ASR&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;60-min long audio recognition + speaker diarization&lt;/td&gt;
&lt;td&gt;Meeting transcription, podcast-to-text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VibeVoice-TTS&lt;/td&gt;
&lt;td&gt;1.5B&lt;/td&gt;
&lt;td&gt;90-min 4-speaker speech synthesis&lt;/td&gt;
&lt;td&gt;Podcast generation, audiobook production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VibeVoice-Realtime&lt;/td&gt;
&lt;td&gt;0.5B&lt;/td&gt;
&lt;td&gt;~300ms first-chunk latency streaming&lt;/td&gt;
&lt;td&gt;Real-time voice assistants, dialogue systems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Core Technical Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Single-pass 60-minute ASR&lt;/strong&gt;&lt;br&gt;
No segmentation, no concatenation — one forward pass over the entire audio. Speaker tracking doesn't break, global semantic context is maintained end-to-end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. 90-minute multi-speaker TTS&lt;/strong&gt;&lt;br&gt;
Up to 4 speakers. Single synthesis of a complete podcast. Voice consistency is maintained for each speaker across the full duration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. 7.5 Hz ultra-low framerate tokenizer&lt;/strong&gt;&lt;br&gt;
This is the key innovation. At 7.5 Hz versus the 25-50 Hz frame rates of codecs like Encodec, each frame of which carries multiple residual-codebook tokens, the token count drops by roughly 80x. 90 minutes of audio becomes ~40,500 tokens — fitting comfortably inside a 64K context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Custom hotword injection&lt;/strong&gt;&lt;br&gt;
ASR supports dynamic domain vocabulary injection at inference time (medical, legal, product names) without retraining.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Expressive synthesis with natural spontaneous speech&lt;/strong&gt;&lt;br&gt;
TTS generates contextually appropriate emotional variation — laughter, pauses, interjections, natural dialogue prosody.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. LoRA fine-tuning support&lt;/strong&gt;&lt;br&gt;
ASR supports LoRA fine-tuning for accent adaptation or domain specialization with minimal labeled data (see the sketch just after feature 7).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. vLLM inference backend&lt;/strong&gt;&lt;br&gt;
ASR supports vLLM acceleration for high-throughput production deployment.&lt;/p&gt;
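
&lt;p&gt;For a feel of feature 6 (LoRA fine-tuning) in practice, here's a hedged sketch using the &lt;code&gt;peft&lt;/code&gt; library. The target module names are placeholders; the right ones depend on VibeVoice-ASR's actual layer naming, so treat this as the pattern rather than a recipe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: LoRA adapters on the ASR backbone via peft.
# target_modules entries are illustrative; check the model card for the
# real attention projection names before running this.
from transformers import AutoModelForSpeechSeq2Seq
from peft import LoraConfig, get_peft_model

model = AutoModelForSpeechSeq2Seq.from_pretrained("microsoft/VibeVoice-ASR")

lora = LoraConfig(
    r=16,                  # adapter rank
    lora_alpha=32,         # adapter scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative module names
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapters train
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
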
&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ASR model (integrated into HuggingFace Transformers since March 2026):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;transformers torch soundfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSpeechSeq2Seq&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;soundfile&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;

&lt;span class="c1"&gt;# Load model (GPU required, float16 inference)
&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microsoft/VibeVoice-ASR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSpeechSeq2Seq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microsoft/VibeVoice-ASR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read audio (supports up to 60 minutes)
&lt;/span&gt;&lt;span class="n"&gt;audio_array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sampling_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meeting.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Single-pass inference over the full audio
&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;audio_array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sampling_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Structured output with speaker labels and timestamps
&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Realtime TTS model:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/microsoft/VibeVoice.git
&lt;span class="nb"&gt;cd &lt;/span&gt;VibeVoice
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="c"&gt;# Optional: install Flash Attention for speedup&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;flash-attn &lt;span class="nt"&gt;--no-build-isolation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vibevoice&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VibeVoiceRealtime&lt;/span&gt;

&lt;span class="n"&gt;tts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VibeVoiceRealtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microsoft/VibeVoice-Realtime-0.5B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Streaming synthesis — audio starts playing ~300ms after first input
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;audio_chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Welcome to VibeVoice. This is a real-time synthesis demo.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;play_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How It Compares
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;VibeVoice&lt;/th&gt;
&lt;th&gt;ElevenLabs&lt;/th&gt;
&lt;th&gt;OpenAI TTS&lt;/th&gt;
&lt;th&gt;Whisper (ASR)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max single-pass length&lt;/td&gt;
&lt;td&gt;TTS 90 min / ASR 60 min&lt;/td&gt;
&lt;td&gt;~5 min&lt;/td&gt;
&lt;td&gt;Heavy limits&lt;/td&gt;
&lt;td&gt;30s chunks stitched&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-speaker&lt;/td&gt;
&lt;td&gt;Up to 4 speakers&lt;/td&gt;
&lt;td&gt;Multiple calls required&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Post-processing only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First-chunk latency (realtime)&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;td&gt;~75ms&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local deployment&lt;/td&gt;
&lt;td&gt;Fully supported&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Free, open-source&lt;/td&gt;
&lt;td&gt;$5-330/month&lt;/td&gt;
&lt;td&gt;$15/1M chars&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware requirement&lt;/td&gt;
&lt;td&gt;8GB+ VRAM&lt;/td&gt;
&lt;td&gt;None (API)&lt;/td&gt;
&lt;td&gt;None (API)&lt;/td&gt;
&lt;td&gt;4GB+ VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ICLR academic recognition&lt;/td&gt;
&lt;td&gt;Oral 2026&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why 7.5 Hz? The Token Math
&lt;/h3&gt;

&lt;p&gt;To understand VibeVoice's core innovation, start with the fundamental tension it resolves: &lt;strong&gt;audio data is enormous vs. LLM context windows are finite&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;90 minutes of 24kHz audio encoded with a traditional codec at 50 Hz (like Encodec) produces approximately &lt;strong&gt;270,000 tokens&lt;/strong&gt;. That is far more than mainstream LLM context windows can realistically attend over, so end-to-end long-audio understanding is effectively impossible.&lt;/p&gt;

&lt;p&gt;VibeVoice compresses the framerate from 50 Hz to 7.5 Hz. Token count drops to ~&lt;strong&gt;40,500&lt;/strong&gt;; the roughly 80x figure relative to Encodec also counts the multiple codebook tokens Encodec emits per frame. 90 minutes of audio now fits inside a 64K context window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;90 min audio × 60 s/min × 7.5 frames/s = 40,500 tokens ✓ (fits in 64K context)
90 min audio × 60 s/min × 50 frames/s = 270,000 tokens ✗ (over 4x a 64K context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real question: can 7.5 Hz preserve enough acoustic information for high-quality synthesis? That's what the σ-VAE tokenizer architecture is designed to answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dual Tokenizer Architecture (σ-VAE)
&lt;/h3&gt;

&lt;p&gt;VibeVoice uses two parallel continuous speech tokenizers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw audio (24kHz)
    │
    ├── Acoustic Tokenizer
    │   └── σ-VAE architecture
    │   └── Captures timbre, prosody, pitch, voice quality
    │   └── Output: acoustic latent vectors @ 7.5 Hz
    │
    └── Semantic Tokenizer
        └── Same σ-VAE, but trained with ASR surrogate task
        └── Captures linguistic content, word boundaries
        └── Output: semantic latent vectors @ 7.5 Hz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two tokenizers describe the same audio from different angles: the semantic tokenizer "knows what was said," the acoustic tokenizer "knows how it was said." This disentanglement is what makes the hybrid LLM + diffusion architecture possible downstream.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM + Diffusion Head Hybrid Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Text input / audio token input
          │
    ┌─────▼──────────────────────────────────────────┐
    │   Qwen2.5 LLM backbone                         │
    │   - Understands dialogue semantics             │
    │   - Manages speaker identity changes           │
    │   - Outputs semantic token sequence            │
    └─────────────┬──────────────────────────────────┘
                  │ semantic tokens
    ┌─────────────▼──────────────────────────────────┐
    │   Diffusion Head                               │
    │   - Conditioned on semantic tokens             │
    │   - Generates high-fidelity acoustic latents   │
    │   - Handles timbre detail, emotional variation │
    └─────────────┬──────────────────────────────────┘
                  │ acoustic latents
    ┌─────────────▼──────────────────────────────────┐
    │   Vocoder                                      │
    │   - Decodes acoustic latents to waveform       │
    │   - Outputs final audio                        │
    └────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight from the ICLR 2026 paper: &lt;strong&gt;"Hybrid architecture proves necessary: by explicitly disentangling semantic structure from acoustic realization."&lt;/strong&gt; Pure LLM approaches underperform on acoustic detail. Pure diffusion approaches drift semantically over long sequences. The hybrid gets both.&lt;/p&gt;

&lt;h3&gt;
  
  
  ASR: End-to-End Long Audio Understanding
&lt;/h3&gt;

&lt;p&gt;VibeVoice-ASR (7B parameters) doesn't slice long audio. The architectural contrast is stark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional Whisper:
60 min audio → split into 120 × 30s segments → transcribe each → stitch → speaker tracking breaks ×120

VibeVoice-ASR:
60 min audio → compress to ~27,000 tokens → single LLM inference
             → structured output with speaker labels + timestamps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model sees the entire conversation's global context. If Speaker A at minute 5 says something relevant to minute 55, the model maintains consistent semantic understanding throughout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark performance:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Medical audio WER&lt;/th&gt;
&lt;th&gt;LibriSpeech test-clean WER&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VibeVoice-ASR 7B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.34%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;8.15%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElevenLabs Scribe v2&lt;/td&gt;
&lt;td&gt;9.72%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Whisper large-v3&lt;/td&gt;
&lt;td&gt;~11%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VibeVoice-Realtime 0.5B&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.00%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Realtime Streaming: How 300ms First-Chunk Works
&lt;/h3&gt;

&lt;p&gt;VibeVoice-Realtime (0.5B) is optimized separately for low-latency scenarios using &lt;strong&gt;incremental text encoding + parallel acoustic generation&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Text stream input:   "Hello" → "Hello, how" → "Hello, how are" → ...
                       ↓             ↓               ↓
Parallel generation: [chunk 1]    [chunk 2]       [chunk 3]
                       ↓
First audio output: ~200-300ms after first input

Specs:
- 8K context window (supports extended conversation history)
- English only (current version)
- Runs on T4 GPU
- Slight instability on very short inputs (≤3 words)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Frozen Tokenizer: Long-Term Engineering Value
&lt;/h3&gt;

&lt;p&gt;VibeVoice's tokenizer weights are frozen — only the LLM backbone and diffusion head require training.&lt;/p&gt;

&lt;p&gt;This means as Qwen, LLaMA, and other base LLMs continue to improve, VibeVoice can swap in a stronger LLM backbone without retraining the tokenizer. &lt;strong&gt;The entire framework upgrades automatically as the LLM ecosystem improves&lt;/strong&gt;. This is one reason researchers see long-term architectural durability in this design.&lt;/p&gt;
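
&lt;p&gt;The pattern itself is ordinary PyTorch. A toy sketch (the module names are stand-ins, not VibeVoice's real internals); the point is that only parameters left with &lt;code&gt;requires_grad=True&lt;/code&gt; ever reach the optimizer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
from torch import nn

class SpeechModel(nn.Module):
    """Toy stand-in; the real modules are far larger."""
    def __init__(self):
        super().__init__()
        self.tokenizer = nn.Linear(64, 32)       # stands in for the 7.5 Hz tokenizer
        self.backbone = nn.Linear(32, 32)        # stands in for the LLM backbone
        self.diffusion_head = nn.Linear(32, 64)  # stands in for the diffusion head

model = SpeechModel()

# Freeze the tokenizer: its latent space is the stable interface contract
for p in model.tokenizer.parameters():
    p.requires_grad = False

# Only the backbone and diffusion head train, so a stronger backbone can be
# swapped in later without invalidating the tokenizer's latent space
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;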

&lt;h3&gt;
  
  
  The Misuse Incident
&lt;/h3&gt;

&lt;p&gt;On September 5, 2025, Microsoft temporarily pulled access to the primary TTS model, citing "use patterns inconsistent with stated intent." Documented misuse included: synthesizing voices of non-consenting individuals, deepfake audio production, and fraudulent voice content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current availability (April 2026):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VibeVoice-ASR: Available, integrated into HuggingFace Transformers&lt;/li&gt;
&lt;li&gt;VibeVoice-TTS-1.5B: Restricted access; related calls disabled in public codebase&lt;/li&gt;
&lt;li&gt;VibeVoice-Realtime-0.5B: Available for download on HuggingFace&lt;/li&gt;
&lt;li&gt;Source code: Fully open-source, MIT license&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Microsoft stated they are implementing protective mechanisms (watermarking, access auditing) before restoring full TTS access.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/microsoft/VibeVoice" rel="noopener noreferrer"&gt;https://github.com/microsoft/VibeVoice&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🏠 &lt;strong&gt;Project page&lt;/strong&gt;: &lt;a href="https://microsoft.github.io/VibeVoice/" rel="noopener noreferrer"&gt;https://microsoft.github.io/VibeVoice/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🤗 &lt;strong&gt;HuggingFace models&lt;/strong&gt;: &lt;a href="https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f" rel="noopener noreferrer"&gt;microsoft/vibevoice collection&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;Technical report&lt;/strong&gt;: &lt;a href="https://arxiv.org/abs/2508.19205" rel="noopener noreferrer"&gt;arXiv 2508.19205&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📜 &lt;strong&gt;ICLR 2026 paper&lt;/strong&gt;: &lt;a href="https://openreview.net/pdf?id=FihSkzyxdv" rel="noopener noreferrer"&gt;Expressive Podcast Generation with Next-Token Diffusion&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Demos &amp;amp; Tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🧪 &lt;strong&gt;Colab demo (Realtime)&lt;/strong&gt;: &lt;a href="https://colab.research.google.com/github/microsoft/VibeVoice/blob/main/demo/vibevoice_realtime_colab.ipynb" rel="noopener noreferrer"&gt;vibevoice_realtime_colab.ipynb&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔁 &lt;strong&gt;Replicate online trial&lt;/strong&gt;: &lt;a href="https://replicate.com/microsoft/vibevoice" rel="noopener noreferrer"&gt;replicate.com/microsoft/vibevoice&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The tokenizer is the breakthrough&lt;/strong&gt;: 7.5 Hz framerate tokenization compresses 90 minutes to 40,500 tokens. Without this, none of VibeVoice's long-audio capabilities are possible — everything else follows from this single decision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid architecture is necessary, not elegant&lt;/strong&gt;: The ICLR paper's conclusion is unambiguous: pure LLM fails on acoustic detail, pure diffusion fails on semantic coherence over long sequences. The hybrid solves both&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three models for three real scenarios&lt;/strong&gt;: ASR-7B for enterprise meeting transcription, TTS-1.5B for podcast/audiobook production, Realtime-0.5B for conversational AI — same architecture, different tradeoffs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ASR quality is production-ready&lt;/strong&gt;: VibeVoice-ASR 7B at 8.34% medical WER — approaching Gemini 2.5 Pro at 8.15%, beating ElevenLabs and Whisper — is a real result in a demanding domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The misuse incident is a signal&lt;/strong&gt;: When TTS capability reaches the level where it needs to be pulled from open access over safety concerns, that tells you something about the capability ceiling it's hitting&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Who Should Use This
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Podcast creators and content producers&lt;/strong&gt;: Automate multi-speaker podcast generation from scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise IT teams&lt;/strong&gt;: Local deployment for meeting transcription with data privacy requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI application developers&lt;/strong&gt;: Building real-time voice assistants and conversational AI products&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech AI researchers&lt;/strong&gt;: Studying long-audio processing, multi-speaker synthesis, diffusion head architectures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM enthusiasts&lt;/strong&gt;: Looking for a self-hostable, free alternative to ElevenLabs or OpenAI TTS&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where to Start
&lt;/h3&gt;

&lt;p&gt;Start with VibeVoice-ASR — it's fully integrated into HuggingFace Transformers, has the most complete documentation, and was officially released with Transformers in March 2026. If you need real-time TTS, Realtime-0.5B is the currently available version; the official Colab demo runs out of the box.&lt;/p&gt;

&lt;p&gt;For those wanting to understand the architecture deeply, the arXiv technical report (2508.19205) and the ICLR 2026 paper both explain in detail why the hybrid architecture is necessary and why continuous latent space outperforms discrete tokens for high-fidelity acoustic generation.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Question Worth Sitting With
&lt;/h3&gt;

&lt;p&gt;The misuse incident raises a question that the speech AI field hasn't fully answered: &lt;strong&gt;at what capability level does open-source speech synthesis become a dual-use risk that requires gating?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Text generation has largely been treated as freely distributable. Images crossed a threshold where safety concerns became prominent. Voice synthesis — because it can impersonate specific, identifiable individuals — may be in a different category. VibeVoice's temporary takedown isn't a story about one bad actor; it's a data point about where the field is heading.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Visit my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;personal site&lt;/a&gt; for more useful knowledge and interesting products&lt;/em&gt;&lt;/p&gt;

</description>
      <category>asr</category>
      <category>tts</category>
      <category>microsoft</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Spent Thousands of Dollars in Tokens Building an AI-Driven End-to-End Bug Fix Pipeline</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Wed, 29 Apr 2026 02:05:26 +0000</pubDate>
      <link>https://dev.to/wonderlab/i-spent-thousands-of-dollars-in-tokens-building-an-ai-driven-end-to-end-bug-fix-pipeline-36p0</link>
      <guid>https://dev.to/wonderlab/i-spent-thousands-of-dollars-in-tokens-building-an-ai-driven-end-to-end-bug-fix-pipeline-36p0</guid>
      <description>&lt;h2&gt;
  
  
  Before We Start
&lt;/h2&gt;

&lt;p&gt;Let me lead with the cost: this system burned thousands of dollars in API tokens during development and debugging.&lt;/p&gt;

&lt;p&gt;I still think it's worth writing about. Because this isn't a demo — it's an end-to-end pipeline running against real enterprise systems. A bug ticket in Jira goes in. The AI reads the logs, diagnoses the root cause, writes the fix, runs a Code Review, executes a SonarQube scan, runs unit tests, submits to Gerrit, polls CI/CD for results, adds a human reviewer, and writes back a comment to Jira.&lt;/p&gt;

&lt;p&gt;AI drives the whole thing. Humans only step in at critical decision gates.&lt;/p&gt;

&lt;p&gt;This article is a full retrospective: how the system is designed, what worked, what failed, what engineering problems I ran into, and how I solved them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Build This
&lt;/h2&gt;

&lt;p&gt;There's a category of software engineering work that consumes enormous human capacity every day but follows a highly standardized process: &lt;strong&gt;bug fixing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A typical bug workflow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Receive bug ticket → Read logs → Find root cause → Fix code → Self-test
→ Code Review → Static scan → Unit tests → Submit → Wait for CI
→ Add reviewer → Wait for CR → Merge → Update Jira
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every step involves fixed actions, tool interactions, and a certain amount of accumulated experience. This is exactly the kind of work AI agents are built for.&lt;/p&gt;

&lt;p&gt;The challenge: &lt;strong&gt;this is a 12+ node long-chain pipeline spanning Jira, log systems, code repositories, review standards, Gerrit, and CI/CD infrastructure. A tool failure at any single node can break the entire flow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent a significant amount of time designing and debugging this workflow on the OpenClaw platform, turning a whiteboard design into a system that actually runs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: Three Layers
&lt;/h2&gt;

&lt;p&gt;The system has three layers: &lt;strong&gt;Skills&lt;/strong&gt;, &lt;strong&gt;Workflow&lt;/strong&gt;, and &lt;strong&gt;Platform&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────┐
│              OpenClaw Platform                  │
│  (Enterprise AI Coding Assistant, Claude-based) │
└──────────────────┬─────────────────────────────┘
                   │
┌──────────────────▼─────────────────────────────┐
│              Bug E2E Workflow                   │
│  Node sequence, branching logic, retry policy   │
│  Human gates: checkpoints A / B / C             │
└──────────────────┬─────────────────────────────┘
                   │
┌──────────────────▼─────────────────────────────┐
│                  Skills                         │
│  One independently runnable AI skill per node   │
│  jira-communication / rnd-automotive-issue-     │
│  analyzer / write-code / ph-code-review /       │
│  ph-sonar-scan / ph-junit-ut / commit-format /  │
│  gerrit-verify / ...                            │
└────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Skills&lt;/strong&gt; are atomic units. Each skill maps to a single capability and has its own &lt;code&gt;SKILL.md&lt;/code&gt; that defines context, input/output contracts, and execution steps. When the agent runs a node, it reads the corresponding skill document and follows the spec.&lt;/p&gt;
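
&lt;p&gt;For intuition, the rough shape of such a skill document (the fields here are illustrative, not the real &lt;code&gt;rnd-automotive-issue-analyzer&lt;/code&gt; spec):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# SKILL.md (illustrative shape)

## Context
Diagnoses crashes, black screens, and stability issues in automotive
Android logs.

## Input contract
- log_dir: path to the extracted log attachment
- bug_summary: Jira title + description

## Output contract
- root_cause: one-paragraph judgment
- affected_modules: list of module names
- fix_directions: suggested approaches

## Execution steps
1. Locate fatal signals / ANR traces in log_dir
2. Correlate timestamps with the bug description
3. Emit the structured result as JSON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;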

&lt;p&gt;&lt;strong&gt;Workflow&lt;/strong&gt; is the orchestration layer. Defined in OpenClaw: node execution order, conditional branches (e.g., retry paths when Code Review fails), human gate trigger conditions, and cross-session state management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platform&lt;/strong&gt; is OpenClaw — a Claude-based enterprise AI coding assistant that supports multi-agent concurrency, sub-agent invocation, workspace persistence, and workflow orchestration.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Workflow: 12 Nodes
&lt;/h2&gt;

&lt;p&gt;The workflow starts when a bug ticket arrives and ends when Jira is updated. In between:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Node&lt;/th&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Fetch bug info &amp;amp; logs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;jira-communication&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Root cause analysis&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rnd-automotive-issue-analyzer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Fetch source code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;code-fetch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Fix code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;write-android-code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Code Review&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ph-code-review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Static analysis (SonarQube)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ph-sonar-scan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Generate &amp;amp; run unit tests&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ph-junit-ut&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Commit code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;commit-format&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Poll CI/CD verify result&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gerrit-verify&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Add Gerrit reviewer&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Automated regression tests&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Write back Jira comment&lt;/td&gt;
&lt;td&gt;&lt;code&gt;jira-communication&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each node is much more than "call an API." Take &lt;strong&gt;root cause analysis&lt;/strong&gt; as an example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download the log attachment from Jira (usually a &lt;code&gt;.zip&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Extract it and locate the relevant log files&lt;/li&gt;
&lt;li&gt;Invoke &lt;code&gt;rnd-automotive-issue-analyzer&lt;/code&gt; — a skill built specifically for diagnosing crashes, black screens, and system stability issues in automotive Android systems&lt;/li&gt;
&lt;li&gt;Output: root cause judgment, affected modules, and suggested fix directions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Only then does code modification begin.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
&lt;strong&gt;Why is "Fetch Source Code" the hardest node?&lt;/strong&gt;&lt;br&gt;
To automatically map a bug description to the correct repository, you need a maintained "module → repo" mapping table and bug tickets that have accurate module fields filled in at creation time. On top of that, some repos are massive — pulling fresh every time is too slow, but caching creates workspace storage and staleness problems. This node is currently in cross-team evaluation (DevOps + IT + Development + QA).&lt;br&gt;
&lt;/p&gt;
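
&lt;p&gt;A sketch of what this node needs, with an illustrative mapping table and a crude cache policy (the real table and policy are exactly what's under evaluation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import time
from pathlib import Path

# Maintained "module to repo" mapping table (entries illustrative)
MODULE_TO_REPO = {
    "performance-monitor": "ssh://gerrit.example.com/platform/perf-tools",
    "system-ui": "ssh://gerrit.example.com/apps/system-ui",
}

CACHE_ROOT = Path("/workspace/repo-cache")
MAX_AGE_S = 6 * 3600  # refresh cached checkouts older than 6 hours

def checkout(module: str) -&gt; Path:
    repo = MODULE_TO_REPO[module]  # fails fast on a mis-filled module field
    dest = CACHE_ROOT / module
    marker = dest / ".last_sync"
    if not dest.exists():
        subprocess.run(["git", "clone", repo, str(dest)], check=True)
    elif not marker.exists() or time.time() - marker.stat().st_mtime &gt; MAX_AGE_S:
        subprocess.run(["git", "-C", str(dest), "fetch", "--prune"], check=True)
    marker.touch()
    return dest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;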




&lt;h2&gt;
  
  
  Real Tests: 6 Scenarios
&lt;/h2&gt;

&lt;p&gt;The design only matters if it holds up in practice. Here are the 6 most representative cases from actual test runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 1: The Closest to the Happy Path
&lt;/h3&gt;

&lt;p&gt;This was the most satisfying run. The workflow completed end-to-end as designed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fetch bug info
  → Download logs → Extract → Analyze root cause
  → Fix code
  → Code Review (passed)
  → SonarQube scan
  → Unit tests
  → Commit to Gerrit
  → Poll verify status (scheduled)
  → Add Gerrit reviewer
  → Add Jira comment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 12 nodes, fully AI-driven, zero human intervention. The Gerrit MR was submitted successfully; the Jira ticket was automatically updated with the analysis summary and action log.&lt;/p&gt;

&lt;p&gt;One minor glitch: the Commit step was supposed to run automatically but triggered a human confirmation prompt. This was Claude Code's Permission system requiring a second confirmation for certain Git operations. Fixed later by adjusting the Hooks configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 2: Tool Failure → Wait for Human
&lt;/h3&gt;

&lt;p&gt;The UT step failed due to a JDK version mismatch in the environment.&lt;/p&gt;

&lt;p&gt;When the workflow detected the failure, &lt;strong&gt;it didn't crash — it triggered the human notification path&lt;/strong&gt;, paused, and waited. This is exactly the fallback route designed from the start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tool failure
  → Can it be auto-retried?
  → No → Push notification → Wait for human → Resume after confirmation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This case validated that the fault tolerance mechanism works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 3: Session Interrupted → Resume in New Session
&lt;/h3&gt;

&lt;p&gt;A system notification interrupted the agent mid-execution. I opened a new session, sent the same task description, and the agent automatically read the previous &lt;code&gt;workflow_state.json&lt;/code&gt; file and resumed from the checkpoint — no restart from scratch.&lt;/p&gt;

&lt;p&gt;This relies on the workflow persisting a state file after every node completes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bug_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AE-33995"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_phase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_step"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"completed_steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4.2"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"artifacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"log_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"review_result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"review_r1.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sessions can be interrupted. Tasks don't get lost.&lt;/strong&gt; For long-running automation workflows, this is non-negotiable.&lt;/p&gt;
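
&lt;p&gt;The resume logic on top of that file is small. A sketch, assuming the state file shape above (the step list is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from pathlib import Path

STATE = Path("workflow_state.json")
ALL_STEPS = ["0", "1", "2", "3", "4.1", "4.2", "4.3"]  # illustrative plan

def next_step():
    """Return the first step not yet recorded as completed, or None."""
    if not STATE.exists():
        return ALL_STEPS[0]  # fresh run
    done = set(json.loads(STATE.read_text())["completed_steps"])
    remaining = [s for s in ALL_STEPS if s not in done]
    return remaining[0] if remaining else None

def mark_done(step, **artifacts):
    """Persist after every node so an interruption loses nothing."""
    state = (json.loads(STATE.read_text()) if STATE.exists()
             else {"completed_steps": [], "artifacts": {}})
    state["completed_steps"].append(step)
    state["current_step"] = step
    state["artifacts"].update(artifacts)
    STATE.write_text(json.dumps(state, indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;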

&lt;h3&gt;
  
  
  Case 4: Code Review Fails → Multi-Round Retry Loop
&lt;/h3&gt;

&lt;p&gt;This is the most interesting part of the whole pipeline.&lt;/p&gt;

&lt;p&gt;Code Review failed on the first round (score: 57, with 8 mandatory violations). Instead of immediately escalating to a human, the workflow launched &lt;strong&gt;up to 3 automatic fix-and-retry rounds&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Round 1: CR → Failed (57 pts, 8 mandatory violations)
  → Launch sub-agent to fix all 8 violations
  → Round 2 CR → Failed (83 pts, 2 violations)
  → Launch Round 3 fix
  → Round 3 CR → Failed (74 pts, 4 violations)
  → Max retries reached → Trigger human gate B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each round, the agent reads the previous CR result and makes targeted fixes — not a full rewrite. Round 1 fix summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue ID&lt;/th&gt;
&lt;th&gt;Violation&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;B2-01&lt;/td&gt;
&lt;td&gt;Raw Thread → CoroutineScope&lt;/td&gt;
&lt;td&gt;PerformanceInfoMonitor.kt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C1-01&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;binding!!&lt;/code&gt; → safe call&lt;/td&gt;
&lt;td&gt;PerformanceFloatingView.kt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C1-02&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;windowManager!!&lt;/code&gt; (5x) → &lt;code&gt;?.let&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;PerformanceFloatingView.kt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C5-01&lt;/td&gt;
&lt;td&gt;InterruptedException caught separately&lt;/td&gt;
&lt;td&gt;PerformanceInfoMonitor.kt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D1-01&lt;/td&gt;
&lt;td&gt;Interface renamed → ICloseClickCallback&lt;/td&gt;
&lt;td&gt;PerformanceFloatingView.kt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
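
&lt;p&gt;The control flow behind Case 4 is a bounded loop. A sketch, where &lt;code&gt;review()&lt;/code&gt; and &lt;code&gt;fix()&lt;/code&gt; stand in for the &lt;code&gt;ph-code-review&lt;/code&gt; skill and the fixing sub-agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MAX_ROUNDS = 3

def passes(result):
    # Illustrative gate: no mandatory violations left
    return not result["mandatory_violations"]

def review_fix_loop(review, fix, human_gate_b):
    result = None
    for round_no in range(1, MAX_ROUNDS + 1):
        result = review()  # fresh CR result: score + violation list
        if passes(result):
            return result  # converged: proceed to submission
        if round_no == MAX_ROUNDS:
            break  # don't fix again without a review to follow
        fix(result["mandatory_violations"])  # targeted fixes, not a rewrite
    human_gate_b(result)  # max retries reached: a human decides
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;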

&lt;h3&gt;
  
  
  Case 5: Max Retries Exceeded → Human Gate
&lt;/h3&gt;

&lt;p&gt;Continuing from Case 4. After 3 rounds of fixes, Code Review still failed. But this exposed a deeper and more interesting problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Two of the 4 "mandatory violations" in Round 3 were classified as "recommended" in Round 2, then upgraded to "mandatory" the following round.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;When an LLM acts as a code review judge, its scoring criteria drift between rounds.&lt;/strong&gt; The same issue gets rated differently depending on how much history has accumulated in the prompt context. The loop can't converge.&lt;/p&gt;

&lt;p&gt;This is a fundamental engineering problem, not something a Prompt tweak can fix.&lt;/p&gt;

&lt;p&gt;When human gate B fired, the agent presented a three-round comparison report and offered two options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A: Accept current code and proceed to submission
   (core bug fixed; remaining issues are style, not functional)

B: Human fixes the remaining 4 items, then notifies agent to re-run quality checks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point the human's role is &lt;strong&gt;judge&lt;/strong&gt;, not executor. The AI did everything it could; a person makes the call.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case 6: CI Pipeline Verify Failed → Human Gate
&lt;/h3&gt;

&lt;p&gt;After submitting to Gerrit, the CI pipeline returned failure votes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reviewer&lt;/th&gt;
&lt;th&gt;Verified&lt;/th&gt;
&lt;th&gt;Compile&lt;/th&gt;
&lt;th&gt;UT&lt;/th&gt;
&lt;th&gt;Code-Check&lt;/th&gt;
&lt;th&gt;Smoke-Test&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;icvsbgci&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;td&gt;+1&lt;/td&gt;
&lt;td&gt;+1&lt;/td&gt;
&lt;td&gt;+1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;jenkins.dl&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent read the Gerrit vote state, detected pipeline failures, and entered human gate C with three options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A — Pipeline issue is known/acceptable; proceed with adding reviewers and updating Jira
B — Needs pipeline re-trigger (manually rebase in Gerrit, then agent re-polls)
C — Abandon this submission and re-evaluate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
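
&lt;p&gt;For context, here's a sketch of what reading that vote state can look like against Gerrit's standard REST API (the host, credentials, and change number are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import requests

GERRIT = "https://gerrit.example.com"   # placeholder host
AUTH = ("bot-user", "http-password")    # placeholder credentials

def read_votes(change_id):
    # The /a/ prefix is Gerrit's authenticated REST path; /detail
    # includes per-label votes for the change.
    resp = requests.get(f"{GERRIT}/a/changes/{change_id}/detail",
                        auth=AUTH, timeout=30)
    resp.raise_for_status()
    # Gerrit prepends )]}' to JSON bodies to defeat XSSI; drop that line.
    detail = json.loads(resp.text.split("\n", 1)[1])
    return detail.get("labels", {})

labels = read_votes("12345")  # placeholder change number
verified = [v.get("value", 0) for v in labels.get("Verified", {}).get("all", [])]
if min(verified, default=0) &amp;lt; 0:
    print("pipeline voted Verified -1 → enter human gate C")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;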






&lt;h2&gt;
  
  
  Engineering War Stories: Three Core Problems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem 1: Context Explosion on Long Pipelines
&lt;/h3&gt;

&lt;p&gt;This is the biggest technical challenge the system faces.&lt;/p&gt;

&lt;p&gt;A single bug-fix run accumulates: Jira reads, log downloads and extractions, parsing large log files, reading multiple source files, Code Review output, Sonar scan reports… all within one agent turn. In one real test run, a single turn executed &lt;strong&gt;117 tool calls&lt;/strong&gt; — and then, right after the Sonar scan completed and just as the agent was about to proceed to the next step, the API request was aborted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;seq&lt;/span&gt;=&lt;span class="m"&gt;115&lt;/span&gt;: &lt;span class="n"&gt;sonar&lt;/span&gt; &lt;span class="n"&gt;poll&lt;/span&gt; &lt;span class="n"&gt;returned&lt;/span&gt;  ← &lt;span class="n"&gt;scan&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="n"&gt;received&lt;/span&gt;
&lt;span class="n"&gt;seq&lt;/span&gt;=&lt;span class="m"&gt;117&lt;/span&gt;: &lt;span class="n"&gt;stopReason&lt;/span&gt;=&lt;span class="s2"&gt;"aborted"&lt;/span&gt;, &lt;span class="n"&gt;errorMessage&lt;/span&gt;=&lt;span class="s2"&gt;"Request was aborted."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent knew exactly what to do next. The turn's accumulated context was simply too large and the server rejected the request outright.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution: Sub-Agent Architecture.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Refactor each heavyweight phase into an independent sub-agent call. The main agent only orchestrates; sub-agents execute specific tasks and return structured results. After each sub-agent completes, the main agent receives only a summary — not the full execution trace. Each phase's context stays isolated and doesn't accumulate linearly across the whole pipeline.&lt;/p&gt;
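
&lt;p&gt;A minimal sketch of the pattern (all names hypothetical): the orchestrator keeps one summary line per phase in its own context, while each sub-agent's full tool-call trace lives and dies inside its own session:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, field

@dataclass
class PhaseResult:
    phase: str
    ok: bool
    summary: str          # the only text the orchestrator keeps in context
    artifact_paths: list = field(default_factory=list)  # big outputs stay on disk

def run_sub_agent(phase, task):
    # Stand-in for spawning an isolated agent session: its dozens of tool
    # calls happen in its own context, which is discarded when it finishes;
    # only this compact result crosses back to the orchestrator.
    return PhaseResult(phase=phase, ok=True, summary=f"{phase}: ok")

def orchestrate(ticket):
    for phase in ["analyze_logs", "fix_code", "code_review", "sonar_scan"]:
        result = run_sub_agent(phase, f"{phase} for {ticket}")
        print(result.summary)  # context grows by one line, not one trace
        if not result.ok:
            break              # escalate to the relevant human gate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;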

&lt;h3&gt;
  
  
  Problem 2: LLM Judge Score Drift
&lt;/h3&gt;

&lt;p&gt;Already described in Case 5. The core issue: &lt;strong&gt;an LLM's judgment is influenced by its accumulated context&lt;/strong&gt;. As fix rounds increase, the prompt history grows, and the model's evaluation baseline shifts.&lt;/p&gt;

&lt;p&gt;No clean solution exists yet. Directions being explored:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start a fresh session for each Code Review round (clear all history context)&lt;/li&gt;
&lt;li&gt;Decouple the scoring logic from the LLM — use AST-based static rule checking for pass/fail decisions, with the LLM only providing human-readable explanations (sketched below)&lt;/li&gt;
&lt;li&gt;Add a consistency validation layer on top of the "mandatory/recommended" classification&lt;/li&gt;
&lt;/ol&gt;
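
&lt;p&gt;Direction 2 in miniature: a deterministic checker owns the verdict, so the same input produces the same verdict in every round. In this sketch a regex stands in for a real AST-based check, and the rule IDs are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Deterministic rules decide pass/fail; an LLM would only explain findings.
RULES = [
    ("C1", r"\w+!!", "non-null assertion (!!) is forbidden"),
    ("B2", r"\bThread\s*\(", "raw Thread; use a CoroutineScope"),
]

def static_review(source):
    violations = []
    for rule_id, pattern, message in RULES:
        if re.search(pattern, source):
            violations.append((rule_id, message))
    return violations

code = 'binding!!.textView.text = "fps"'
result = static_review(code)
# Same input → same verdict, no matter how many rounds have run.
print("PASS" if not result else f"FAIL: {result}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;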

&lt;h3&gt;
  
  
  Problem 3: Cross-Session Task Recovery
&lt;/h3&gt;

&lt;p&gt;An AI session is fundamentally stateless. In enterprise environments, a bug fix might run for 20 minutes or be interrupted for hours while waiting for CI. You can't rely on a single session staying alive.&lt;/p&gt;

&lt;p&gt;The solution is &lt;strong&gt;externalizing state&lt;/strong&gt;. After every node completes, the workflow serializes the current state to the filesystem (&lt;code&gt;workflow_state.json&lt;/code&gt;), recording: completed nodes, key artifact paths, and the current phase. When a new session starts, it reads this file first and resumes from the checkpoint.&lt;/p&gt;

&lt;p&gt;This is essentially simulating a persistent task queue, with the filesystem as the state store.&lt;/p&gt;
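
&lt;p&gt;The mechanism is small. A sketch of the checkpoint/resume cycle (the field names inside &lt;code&gt;workflow_state.json&lt;/code&gt; are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from pathlib import Path

STATE_FILE = Path("workflow_state.json")

def load_state():
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"completed_nodes": [], "artifacts": {}, "current_phase": None}

def checkpoint(node, artifacts):
    # Called after every node completes, so any later session can resume.
    state = load_state()
    state["completed_nodes"].append(node)
    state["artifacts"].update(artifacts)
    state["current_phase"] = node
    STATE_FILE.write_text(json.dumps(state, indent=2))

# A fresh session reads the file first and picks up at the checkpoint:
state = load_state()
print("resuming after:", state["current_phase"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;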




&lt;h2&gt;
  
  
  What Works, What's Still Missing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's Running
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Jira info retrieval + log download / extraction / analysis&lt;/li&gt;
&lt;li&gt;Bug root cause analysis for automotive Android systems (via &lt;code&gt;rnd-automotive-issue-analyzer&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Code fixes (Kotlin, Android application layer)&lt;/li&gt;
&lt;li&gt;Code Review with custom standards (multi-round retry + human gate)&lt;/li&gt;
&lt;li&gt;SonarQube static analysis (requires SonarQube 10.x — 9.9 doesn't work)&lt;/li&gt;
&lt;li&gt;Code submission to Gerrit (with standardized commit messages)&lt;/li&gt;
&lt;li&gt;CI/CD result polling + vote status interpretation&lt;/li&gt;
&lt;li&gt;Automated Gerrit reviewer assignment&lt;/li&gt;
&lt;li&gt;Jira comment write-back&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What's Still Being Built
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Automatic source code matching&lt;/strong&gt; is the biggest open problem. Inferring the right repository from a bug description requires a maintained "module → repo" mapping and accurate module fields in bug tickets. This needs cross-team coordination and is currently handled by manually specifying the repo path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated regression testing&lt;/strong&gt; requires the QA team to co-design the smoke test trigger, execution environment, and result write-back — all of which involve pipeline changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Was It Worth It?
&lt;/h2&gt;

&lt;p&gt;Thousands of dollars in tokens, for a pipeline that's still being refined. Worth it?&lt;/p&gt;

&lt;p&gt;My answer: &lt;strong&gt;depends what you're measuring.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From a pure cost standpoint, no — the per-run token cost is still high and needs optimization.&lt;/p&gt;

&lt;p&gt;But from another angle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;This is working proof on real enterprise systems&lt;/strong&gt;, not a toy demo. It's connected to real production toolchains: Jira, Gerrit, SonarQube, Jenkins.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It reveals the actual boundary of AI agents in enterprise engineering&lt;/strong&gt;: what can be automated, what needs human judgment, and what's an engineering problem rather than an AI capability limitation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sub-Agent architecture, externalized state, and human gates&lt;/strong&gt; — these three engineering patterns were forged through real failures. They're applicable to any team trying to deploy AI agents in enterprise environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Most importantly: the path is viable.&lt;/strong&gt; Many nodes are still being refined, but "a bug ticket goes in, a Gerrit MR comes out" has been demonstrated.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This isn't something you knock out in a month. Every skill requires deep familiarity with enterprise tooling. Every workflow node debugged reveals another edge case in how AI agents behave in complex real-world environments.&lt;/p&gt;

&lt;p&gt;But the direction is right. AI isn't here to replace engineers — it's here to replace the part of engineering that engineers hate doing but that still has to get done.&lt;/p&gt;

&lt;p&gt;If you're working on similar problems, I'd love to compare notes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions or thoughts on your own AI automation experience? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>agents</category>
      <category>enterprise</category>
      <category>bugfix</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No.50): The TypeScript Wizard Pushed His .claude Directory to GitHub and Hit #1 Worldwide Overnight</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Tue, 28 Apr 2026 06:06:26 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no50-the-typescript-wizard-pushed-his-claude-directory-to-github-41jj</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no50-the-typescript-wizard-pushed-his-claude-directory-to-github-41jj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"My agent skills that I use every day to do real engineering — not vibe coding."&lt;br&gt;
— Matt Pocock, README first line&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is article No.50 in the "One Open Source Project a Day" series. Today's project is &lt;strong&gt;skills&lt;/strong&gt; (&lt;a href="https://github.com/mattpocock/skills" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Let's start with what makes this repository unusual.&lt;/p&gt;

&lt;p&gt;It's not a new framework. It's not a big company's open-source release. It's not even a runnable program. It's just one engineer's &lt;code&gt;.claude&lt;/code&gt; working directory — 21 Markdown files, each telling Claude Code how to behave in a specific engineering scenario.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Matt Pocock&lt;/strong&gt; pushed this directory to GitHub. No promotion. No blog post explaining it. No YouTube demo. No Hacker News submission.&lt;/p&gt;

&lt;p&gt;In 24 hours: &lt;strong&gt;22,000 Stars&lt;/strong&gt;. &lt;strong&gt;GitHub Trending #1 globally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Final count: &lt;strong&gt;30,800+ Stars&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This wasn't random. The same day, &lt;code&gt;free-claude-code&lt;/code&gt; pulled 1,701 stars and &lt;code&gt;awesome-codex-skills&lt;/code&gt; pulled 517. Three repositories dominated the Trending page with one shared theme: &lt;strong&gt;how to configure your AI to work the way you want it to&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That day, GitHub's community voted: we need this.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Why "configuring how your AI works" is becoming a serious engineering practice&lt;/li&gt;
&lt;li&gt;The core skills broken down: grill-me, tdd, triage-issue, and more&lt;/li&gt;
&lt;li&gt;The "vertical slice" and "anti-vibe-coding" philosophy&lt;/li&gt;
&lt;li&gt;How to install and adapt these skills to your own workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Have used Claude Code or a similar AI coding assistant&lt;/li&gt;
&lt;li&gt;Familiar with basic software engineering concepts (TDD, refactoring, PRDs)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is It?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;skills&lt;/strong&gt; is the public version of Matt Pocock's personal Claude Code skill library. Each "skill" is a standalone folder with a &lt;code&gt;SKILL.md&lt;/code&gt; as the main file, describing how an Agent should work in a specific engineering scenario — goal, steps, constraints, output format.&lt;/p&gt;

&lt;p&gt;Installation is frictionless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills@latest add mattpocock/skills/grill-me
&lt;span class="c"&gt;# Copies the SKILL.md into your .claude/ directory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  About the Author: Matt Pocock
&lt;/h3&gt;

&lt;p&gt;Matt Pocock's position in the TypeScript community is something like its chief evangelist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: 10,300+ followers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity&lt;/strong&gt;: "TypeScript wizard" (GitHub bio), "I teach devs for a living" (X/Twitter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notable projects&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total TypeScript&lt;/strong&gt;: Comprehensive production-grade TypeScript course — his core business&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ts-reset&lt;/strong&gt; (8,400 ⭐): Called "the CSS Reset for TypeScript"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;evalite&lt;/strong&gt; (1,500 ⭐): LLM application evaluation in TypeScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sandcastle&lt;/strong&gt; (1,000 ⭐): TypeScript sandbox coding agent orchestration&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;AI Hero Newsletter&lt;/strong&gt;: ~60,000 subscribers&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Background&lt;/strong&gt;: Former Vercel engineer, former Stately engineer; used to be a vocal coach (yes, seriously)&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;He isn't primarily someone who builds tools for others — he's first and foremost an engineer who uses AI seriously every day. This repository is a direct output of his actual work process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Stats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;GitHub Stars&lt;/strong&gt;: 30,800+ (22,000 in first 24 hours)&lt;/li&gt;
&lt;li&gt;🍴 &lt;strong&gt;Forks&lt;/strong&gt;: 2,400+&lt;/li&gt;
&lt;li&gt;📝 &lt;strong&gt;Commits&lt;/strong&gt;: 34&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;Install tool&lt;/strong&gt;: &lt;code&gt;npx skills@latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;📰 &lt;strong&gt;Newsletter&lt;/strong&gt;: AI Hero (60,000 subscribers)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Philosophy: Deliberately Slow AI Down
&lt;/h3&gt;

&lt;p&gt;Most people's mental model for AI-assisted coding is "generate" — have it write functions, fill boilerplate, produce whole pages, faster the better. The industry calls this &lt;strong&gt;Vibe Coding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Matt's approach is the opposite. Most of what he teaches Claude to do isn't generating code — it's &lt;strong&gt;thinking the problem through before writing a single line&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's what that looks like:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;grill-me&lt;/code&gt; — His Most Useful Skill
&lt;/h3&gt;

&lt;p&gt;Matt has called this his most useful skill.&lt;/p&gt;

&lt;p&gt;What it does: &lt;strong&gt;Turn Claude into a relentless technical interviewer who interrogates your design until you run out of answers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not the AI that gently suggests two improvements and calls it done. This one hunts — every branch, every edge case, every assumption you haven't made explicit — &lt;strong&gt;one question at a time&lt;/strong&gt;, until you find yourself thinking "damn, I hadn't thought that through."&lt;/p&gt;

&lt;p&gt;Key design choices in the skill:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One question at a time (no question barrages — forces systematic thinking)&lt;/li&gt;
&lt;li&gt;Gives recommended answers (doesn't just ask, also helps you think)&lt;/li&gt;
&lt;li&gt;Actively explores the codebase to answer questions itself, rather than always asking you&lt;/li&gt;
&lt;li&gt;Goal: &lt;strong&gt;Resolve every decision branch before a single line of code is written&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: You want to add a caching layer to your project
After grill-me:
  "What's your cache granularity — function-level or request-level?"
  "If cache keys collide, what's your invalidation strategy?"
  "Do you have race conditions? Multiple requests missing cache simultaneously?"
  "If the cache service goes down, what's your fallback?"
  "How do your tests mock the cache?"
  ...
After the interrogation, your design is either much tighter,
or you've realized you shouldn't build this at all.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;tdd&lt;/code&gt; — Enforced Red-Green-Refactor
&lt;/h3&gt;

&lt;p&gt;This skill doesn't let Claude write out a full feature at once. It enforces a strict TDD workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tracer Bullet (build minimal working path first)
    ↓
Increment loop:
  Write one failing test
    ↓
  Write the minimum code to make it pass
    ↓
  Refactor (code cleaner, tests still green)
    ↓
  Next test...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key constraints from the SKILL.md:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No horizontal slicing&lt;/strong&gt;: Writing all tests first before any implementation is forbidden (leads to over-engineering)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests verify behavior, not implementation&lt;/strong&gt;: Test through public interfaces, not internals&lt;/li&gt;
&lt;li&gt;Ships with five reference documents: &lt;code&gt;deep-modules.md&lt;/code&gt;, &lt;code&gt;interface-design.md&lt;/code&gt;, &lt;code&gt;mocking.md&lt;/code&gt;, &lt;code&gt;refactoring.md&lt;/code&gt;, &lt;code&gt;tests.md&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Slow, deliberate, old-fashioned — and exactly how serious engineers actually write code.&lt;/p&gt;
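
&lt;p&gt;One increment of that loop in miniature (an illustrative Python example, not taken from the skill itself):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Red: the test exists before the code and states observable behavior.
def test_slugify_replaces_spaces():
    assert slugify("hello world") == "hello-world"

# Green: the minimum code that makes it pass; no speculative options yet.
def slugify(title):
    return title.lower().replace(" ", "-")

# The test goes through the public interface only, so refactoring the
# internals (say, swapping in re.sub) keeps it green.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;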

&lt;h3&gt;
  
  
  &lt;code&gt;triage-issue&lt;/code&gt; — Diagnose First, Fix Second
&lt;/h3&gt;

&lt;p&gt;When a bug report comes in, most people (and most AI assistants) react the same way: find the line, change it, open a PR.&lt;/p&gt;

&lt;p&gt;Matt's &lt;code&gt;triage-issue&lt;/code&gt; skill adds a step before fixing: &lt;strong&gt;thorough diagnosis&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Five-stage pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Capture the problem (understand the symptom)
2. Explore and diagnose (comb the codebase, find root cause)
3. Determine the fix (not the first viable fix — the best fix)
4. Design a TDD fix plan (write the test first, then implement)
5. Create a GitHub Issue (diagnosis + fix plan + acceptance criteria)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output isn't code. It's a GitHub Issue. When you (or another Agent) go to actually fix it, there's already a clear root-cause analysis and a test-driven roadmap waiting.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;design-an-interface&lt;/code&gt; — Force Real Alternatives
&lt;/h3&gt;

&lt;p&gt;This skill implements John Ousterhout's "&lt;strong&gt;Design It Twice&lt;/strong&gt;" principle from &lt;em&gt;A Philosophy of Software Design&lt;/em&gt;: for any important decision, generate at least two genuinely different approaches before choosing.&lt;/p&gt;

&lt;p&gt;Implementation: &lt;strong&gt;Launch multiple parallel sub-Agents&lt;/strong&gt;, each constrained to a different dimension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sub-Agent A: Minimum method count (minimize interface surface)
Sub-Agent B: Maximum flexibility (maximize extensibility)
Sub-Agent C: Optimize common cases (minimize usage friction)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the three proposals, it evaluates them across four dimensions: simplicity, generalizability, implementation efficiency, and "depth" (how much complexity does the interface hide). Then it helps you decide.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;to-issues&lt;/code&gt; — Vertical Slice Decomposition
&lt;/h3&gt;

&lt;p&gt;Breaking a feature plan into GitHub Issues sounds ordinary. The key is &lt;em&gt;how&lt;/em&gt; it slices: &lt;strong&gt;vertical slices&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Horizontal slicing (don't do this):
   Issue 1: Database Schema
   Issue 2: API layer
   Issue 3: Frontend UI
   Issue 4: Tests

✅ Vertical slicing (Matt's way):
   Issue 1: Users can create a basic draft (schema + API + UI + tests, all layers)
   Issue 2: Drafts can contain rich text content
   Issue 3: Drafts can be published
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each Issue is classified as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HITL&lt;/strong&gt; (Human In The Loop): Requires human decision before proceeding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AFK&lt;/strong&gt; (Away From Keyboard): Safe to run unattended&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;git-guardrails-claude-code&lt;/code&gt; — Stop AI From Deleting Your Work
&lt;/h3&gt;

&lt;p&gt;Intercepts dangerous git commands via Claude Code's &lt;code&gt;PreToolUse&lt;/code&gt; hook. Blocked operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git push &lt;span class="o"&gt;(&lt;/span&gt;including &lt;span class="nt"&gt;--force&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
git reset &lt;span class="nt"&gt;--hard&lt;/span&gt;
git clean &lt;span class="nt"&gt;-f&lt;/span&gt; / &lt;span class="nt"&gt;-fd&lt;/span&gt;
git branch &lt;span class="nt"&gt;-D&lt;/span&gt;
git checkout &lt;span class="nb"&gt;.&lt;/span&gt;
git restore &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Claude tries to run one of these, the hook blocks the call and tells it that "it lacks authority to access these commands." The guardrails can be configured at project level or globally.&lt;/p&gt;
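
&lt;p&gt;A minimal sketch of how such a guard can work (not Matt's actual script): Claude Code pipes the pending tool call to a &lt;code&gt;PreToolUse&lt;/code&gt; hook as JSON on stdin, and exiting with status 2 blocks the call while feeding stderr back to Claude:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/bin/env python3
# Reads the pending Bash tool call from stdin and vetoes dangerous git.
import json
import re
import sys

BLOCKED = [
    r"git\s+push",
    r"git\s+reset\s+--hard",
    r"git\s+clean\s+-f",
    r"git\s+branch\s+-D",
    r"git\s+(checkout|restore)\s+\.",
]

event = json.load(sys.stdin)
command = event.get("tool_input", {}).get("command", "")

if any(re.search(p, command) for p in BLOCKED):
    print("it lacks authority to access these commands", file=sys.stderr)
    sys.exit(2)  # status 2 → block the tool call
sys.exit(0)      # anything else goes through
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;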

&lt;h3&gt;
  
  
  Full Skill Inventory
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;to-prd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Conversation context → structured PRD, submitted as GitHub Issue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;request-refactor-plan&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Creates detailed refactor plans with small commits, refined through user interview&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;improve-codebase-architecture&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Finds architectural "deepening" opportunities using CONTEXT.md and ADR docs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;setup-pre-commit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Configures Husky + lint-staged + Prettier + type checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ubiquitous-language&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extracts DDD-style shared vocabulary from the current conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;write-a-skill&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Creates new skills following the standard structure (a skill that writes skills)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;edit-article&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Improves writing by restructuring sections and clarifying language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;obsidian-vault&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search, create, and manage notes in an Obsidian vault with wikilinks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What a Skill File Actually Looks Like
&lt;/h3&gt;

&lt;p&gt;Each skill is a standalone folder with &lt;code&gt;SKILL.md&lt;/code&gt; as the main file. The &lt;code&gt;tdd&lt;/code&gt; skill as an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;skills/tdd/
├── SKILL.md           ← Main skill definition
├── deep-modules.md    ← Reference: deep module design principles
├── interface-design.md
├── mocking.md
├── refactoring.md
└── tests.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SKILL.md&lt;/code&gt; structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# TDD&lt;/span&gt;

&lt;span class="gu"&gt;## Goal&lt;/span&gt;
Build features one vertical slice at a time using TDD...

&lt;span class="gu"&gt;## Steps&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Tracer Bullet: First, write the minimal end-to-end path...
&lt;span class="p"&gt;2.&lt;/span&gt; Increment: For each behavior to implement...
   a. Write a failing test
   b. Write minimal code to pass the test
   c. Refactor if needed

&lt;span class="gu"&gt;## Rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; NO horizontal slicing...
&lt;span class="p"&gt;-&lt;/span&gt; Tests should verify behavior, not implementation details...

&lt;span class="gu"&gt;## Resources&lt;/span&gt;
@deep-modules.md
@interface-design.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plain text, no special syntax, no runtime dependencies. It's configuration as documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why "Configuring Your AI" Is Becoming Engineering Practice
&lt;/h3&gt;

&lt;p&gt;Three AI-configuration repos dominating Trending on the same day isn't coincidence. It points at something structural:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Same models (GPT-4o, Claude Opus) — available to everyone
Same tools (Cursor, Claude Code) — download and install
Same subscription fees

Why do some people get 2x productivity from Claude Code
while others fight hallucinations and throwaway code all day?

The gap isn't the model. It's the configuration:
  • How you describe the boundaries of a problem
  • Which checkpoints require human sign-off
  • What engineering conventions to follow
  • Which mistakes you tolerate, which you don't

Nobody teaches you this. Nobody sells it.
You grind it out across weeks of daily use.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Matt published what he ground out. The industry called this the &lt;strong&gt;"npm moment" for Claude Code Skills&lt;/strong&gt; — just as npm let the Node.js community share reusable packages, Skills is enabling the Claude Code community to share workflow recipes. JetBrains and other major vendors started publishing official skills packages shortly after this repo went viral.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Is Your .claude Directory Empty?"
&lt;/h3&gt;

&lt;p&gt;This is the most valuable question this project raises.&lt;/p&gt;

&lt;p&gt;If your &lt;code&gt;.claude&lt;/code&gt; directory (or &lt;code&gt;.cursorrules&lt;/code&gt;, or &lt;code&gt;AGENTS.md&lt;/code&gt;) is empty, you're starting from zero every time. Your AI doesn't remember the mistake you made last month, doesn't know your project's architecture conventions, doesn't understand your team's code standards. It's a new hire every single session.&lt;/p&gt;

&lt;p&gt;Matt's 21 skills aren't meant to be copied wholesale — he writes TypeScript, builds education products, his workflow probably looks different from yours. But he's provided a living sample of what it looks like when an engineer treats their AI configuration as a serious engineering artifact.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/mattpocock/skills" rel="noopener noreferrer"&gt;https://github.com/mattpocock/skills&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📦 &lt;strong&gt;Install&lt;/strong&gt;: &lt;code&gt;npx skills@latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;📰 &lt;strong&gt;AI Hero Newsletter&lt;/strong&gt;: &lt;a href="https://aihero.dev" rel="noopener noreferrer"&gt;https://aihero.dev&lt;/a&gt; (60,000 subscribers)&lt;/li&gt;
&lt;li&gt;🎓 &lt;strong&gt;Total TypeScript&lt;/strong&gt;: &lt;a href="https://totaltypescript.com" rel="noopener noreferrer"&gt;https://totaltypescript.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Related
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ts-reset&lt;/strong&gt; (8,400 ⭐): &lt;a href="https://github.com/total-typescript/ts-reset" rel="noopener noreferrer"&gt;https://github.com/total-typescript/ts-reset&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;evalite&lt;/strong&gt; (1,500 ⭐): LLM application evaluation tooling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simon Willison's LLM skills&lt;/strong&gt;: Another widely-referenced personal Claude skills repo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Protocol&lt;/strong&gt;: &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Anti-vibe-coding&lt;/strong&gt;: Most of Matt's skills aren't about generating code faster — they're about thinking through the problem more thoroughly before writing any. That's how senior engineers use AI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical slice first&lt;/strong&gt;: &lt;code&gt;tdd&lt;/code&gt;, &lt;code&gt;to-issues&lt;/code&gt;, and &lt;code&gt;triage-issue&lt;/code&gt; all enforce "complete one full path at a time" over "finish all of layer X first" — this is a genuine engineering philosophy, not just a style preference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill files are engineering artifacts&lt;/strong&gt;: Like Dockerfile, &lt;code&gt;.eslintrc&lt;/code&gt;, and &lt;code&gt;tsconfig.json&lt;/code&gt;, your AI configuration is infrastructure worth maintaining and versioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "npm moment" significance&lt;/strong&gt;: This repo's virality marked a shift — Claude Code Skills is becoming a community-shared ecosystem, not just personal configuration files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The gap is in configuration, not the model&lt;/strong&gt;: Same Claude Opus, two different engineers — the difference is what they've built around it&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Who Should Use This
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code power users&lt;/strong&gt;: Already using Claude Code and want a stricter, more reliable workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript engineers&lt;/strong&gt;: Some skills (&lt;code&gt;migrate-to-shoehorn&lt;/code&gt;, &lt;code&gt;scaffold-exercises&lt;/code&gt;) are TypeScript-ecosystem specific&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineers building their own skill library&lt;/strong&gt;: Matt's repo is the best reference sample available; &lt;code&gt;write-a-skill&lt;/code&gt; can even generate new skills for you&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering leads standardizing team AI practices&lt;/strong&gt;: &lt;code&gt;git-guardrails&lt;/code&gt;, &lt;code&gt;to-issues&lt;/code&gt;, &lt;code&gt;triage-issue&lt;/code&gt; can directly standardize how your team uses AI&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Question Worth Sitting With
&lt;/h3&gt;

&lt;p&gt;The line in Matt's README deserves to be read more than once: &lt;strong&gt;"Agent skills for real engineers, not vibe coding."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Behind that line is a judgment: there are two paths for AI-assisted engineering. One path uses AI to generate more code, faster — Vibe Coding. The other uses AI to make every decision before that code more rigorous — AI as a stricter thinking partner, not a faster typist.&lt;/p&gt;

&lt;p&gt;Both paths ship software. But they arrive at very different places.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Visit my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;personal site&lt;/a&gt; for more useful knowledge and interesting products&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>skill</category>
    </item>
    <item>
      <title>One Open Source Project a Day (No.49): free-claude-code - Run Claude Code for Free with One Environment Variable</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Mon, 27 Apr 2026 06:10:47 +0000</pubDate>
      <link>https://dev.to/wonderlab/one-open-source-project-a-day-no49-free-claude-code-run-claude-code-for-free-with-one-ed6</link>
      <guid>https://dev.to/wonderlab/one-open-source-project-a-day-no49-free-claude-code-run-claude-code-for-free-with-one-ed6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"When a tool is too expensive, programmers build a cheaper key."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is article No.49 in the "One Open Source Project a Day" series. Today's project is &lt;strong&gt;free-claude-code&lt;/strong&gt; (&lt;a href="https://github.com/Alishahryar1/free-claude-code" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Claude Code is Anthropic's AI coding agent — deeply integrated into the terminal and VS Code, able to autonomously read files, edit code, and run commands. The catch: it requires a real Anthropic API key, and those API calls add up fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;free-claude-code&lt;/strong&gt; is built on a deceptively simple insight: Claude Code is just an HTTP client that calls the Anthropic Messages API. If you run a compatible proxy server locally that intercepts those requests and forwards them to any free or cheap backend, Claude Code can't tell the difference. Change one environment variable (&lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;), and suddenly your $20/month tool is running on NVIDIA's free GPU credits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;14,300+ Stars, 2,000+ Forks&lt;/strong&gt; — it surged to the top of GitHub Trending in April and stayed there for four consecutive days. The author, Ali Khokhar (Alishahryar1), was virtually unknown before this project. His other pinned repos have 3 stars each. This is a textbook "one-hit wonder" in the best sense.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The core proxy architecture: how &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; redirection works&lt;/li&gt;
&lt;li&gt;Multi-backend routing: NVIDIA NIM free tier, OpenRouter free models, local Ollama&lt;/li&gt;
&lt;li&gt;API format translation: adapting Anthropic Messages ↔ OpenAI Chat Completions&lt;/li&gt;
&lt;li&gt;Thinking Token conversion: mapping &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags to Claude's native thinking blocks&lt;/li&gt;
&lt;li&gt;Discord/Telegram bot mode: remote-control Claude Code from your phone&lt;/li&gt;
&lt;li&gt;Real limitations: what you actually lose when you swap out the model&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Comfortable with terminal operations and environment variables&lt;/li&gt;
&lt;li&gt;Basic familiarity with Claude Code&lt;/li&gt;
&lt;li&gt;Basic understanding of REST APIs (optional)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Is It?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;free-claude-code&lt;/strong&gt; is a local HTTP proxy server built with FastAPI that emulates the Anthropic API interface. When Claude Code sends API requests, the proxy intercepts them, translates formats, routes to a configured free backend, converts the response back to Anthropic format, and returns it to Claude Code — completely transparently.&lt;/p&gt;

&lt;p&gt;The name is blunt and accurate: &lt;code&gt;free-claude-code&lt;/code&gt;. It runs Claude Code for free.&lt;/p&gt;

&lt;p&gt;The core architecture in one sentence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code → sends Anthropic API requests
           → proxy intercepts at localhost:8082
           → translates to OpenAI format
           → forwards to NVIDIA NIM / OpenRouter / Ollama
           → translates response back to Anthropic format
           → returns to Claude Code as if nothing happened
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  About the Author
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Author&lt;/strong&gt;: Ali Khokhar (GitHub: Alishahryar1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Location&lt;/strong&gt;: Sunnyvale, California&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bio&lt;/strong&gt;: "Writing easily understandable code..."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background&lt;/strong&gt;: Individual developer; before this project, essentially no significant GitHub presence. Classic "overnight" open-source success.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Stats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;GitHub Stars&lt;/strong&gt;: 14,300+ (10,700+ in April alone)&lt;/li&gt;
&lt;li&gt;🍴 &lt;strong&gt;Forks&lt;/strong&gt;: 2,000+&lt;/li&gt;
&lt;li&gt;👥 &lt;strong&gt;Contributors&lt;/strong&gt;: 22&lt;/li&gt;
&lt;li&gt;📄 &lt;strong&gt;License&lt;/strong&gt;: MIT&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Trend&lt;/strong&gt;: Topped GitHub Trending (Python + global) April 24-27&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6 Free Backends
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NVIDIA NIM    → Free tier: 40 req/min, models: GLM-4, Llama 3, Mistral
OpenRouter    → 580+ models, many with daily free quotas
DeepSeek      → Ultra-low cost, native Anthropic Messages format support
LM Studio     → Local GUI for running quantized models
llama.cpp     → CPU/GPU inference, maximum control
Ollama        → One-line local model deployment, completely offline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Per-Model-Tier Routing
&lt;/h3&gt;

&lt;p&gt;Claude Code internally uses three model "tiers" (Opus, Sonnet, Haiku) for different task complexity levels. free-claude-code lets you route each tier to a different backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Route heavy reasoning tasks to a large model&lt;/span&gt;
&lt;span class="nv"&gt;MODEL_OPUS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/nvidia/llama-3.1-nemotron-ultra-253b-v1

&lt;span class="c"&gt;# Route standard coding to a mid-size model&lt;/span&gt;
&lt;span class="nv"&gt;MODEL_SONNET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/nvidia/llama-3.3-70b-instruct

&lt;span class="c"&gt;# Route simple tasks (status checks, classification) to a small fast model&lt;/span&gt;
&lt;span class="nv"&gt;MODEL_HAIKU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/meta/llama-3.1-8b-instruct
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Thinking Token Conversion
&lt;/h3&gt;

&lt;p&gt;Some open-source models (DeepSeek-R1, QwQ, GLM-Z1) output their reasoning wrapped in &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags. The proxy automatically extracts these and maps them to Claude's native &lt;code&gt;thinking&lt;/code&gt; content blocks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Proxy logic (simplified)
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;think&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;thinking_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_between_tags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;think&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/think&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;thinking_content&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means VS Code's Claude Code extension — which collapses and displays thinking blocks — works correctly with open-source reasoning models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intelligent Rate Limiting
&lt;/h3&gt;

&lt;p&gt;NVIDIA NIM's free tier caps at 40 requests/minute. Claude Code's aggressive request pattern (frequent context sends, tool calls) hits this quickly. The proxy implements three layers of protection (sketched after the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Proactive throttling&lt;/strong&gt;: Before sending a request, predicts whether it would exceed the rate limit and preemptively waits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reactive backoff&lt;/strong&gt;: On receiving &lt;code&gt;429 Too Many Requests&lt;/code&gt;, parses &lt;code&gt;retry-after&lt;/code&gt; headers or applies exponential backoff&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency control&lt;/strong&gt;: &lt;code&gt;PROVIDER_MAX_CONCURRENCY&lt;/code&gt; env var limits simultaneous in-flight requests&lt;/li&gt;
&lt;/ol&gt;
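
&lt;p&gt;A sketch of the first two layers (constants are illustrative; the real proxy wires this into its request path):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

RATE_LIMIT_PER_MIN = 40
MIN_INTERVAL = 60.0 / RATE_LIMIT_PER_MIN
_last_request = 0.0

def throttle():
    # Proactive: wait until the next request slot before sending.
    global _last_request
    wait = _last_request + MIN_INTERVAL - time.monotonic()
    if wait &amp;gt; 0:
        time.sleep(wait)
    _last_request = time.monotonic()

def backoff_on_429(attempt, retry_after=None):
    # Reactive: honor the server's retry-after header if present,
    # otherwise fall back to capped exponential backoff.
    delay = float(retry_after) if retry_after else min(2 ** attempt, 60)
    time.sleep(delay)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;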

&lt;h3&gt;
  
  
  Discord/Telegram Bot Mode
&lt;/h3&gt;

&lt;p&gt;Beyond pure proxying, the project manages Claude Code's full lifecycle in bot mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phone message: "refactor the auth module to use async/await"
      ↓
Telegram/Discord bot receives message
      ↓
Spawns Claude Code subprocess in CLAUDE_WORKSPACE directory
      ↓
Streams real-time output back to your chat window
      ↓
Session persisted for follow-up messages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
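
&lt;p&gt;The lifecycle management reduces to a few lines. In this sketch, &lt;code&gt;claude -p&lt;/code&gt; is Claude Code's non-interactive print mode, and &lt;code&gt;send_to_chat&lt;/code&gt; stands in for the bot framework's reply hook:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import subprocess

WORKSPACE = os.environ.get("CLAUDE_WORKSPACE", ".")

def handle_chat_message(text, send_to_chat):
    # Point the subprocess at the local proxy, run the task inside the
    # configured workspace, and stream output back to the chat.
    env = dict(os.environ, ANTHROPIC_BASE_URL="http://localhost:8082")
    proc = subprocess.Popen(
        ["claude", "-p", text],
        cwd=WORKSPACE, env=env,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    for line in proc.stdout:
        send_to_chat(line.rstrip())
    proc.wait()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;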



&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone the repo&lt;/span&gt;
git clone https://github.com/Alishahryar1/free-claude-code.git
&lt;span class="nb"&gt;cd &lt;/span&gt;free-claude-code

&lt;span class="c"&gt;# 2. Copy config&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env

&lt;span class="c"&gt;# 3. Edit .env — add your free NVIDIA NIM key&lt;/span&gt;
&lt;span class="c"&gt;# NVIDIA_NIM_API_KEY=nvapi-xxxx&lt;/span&gt;
&lt;span class="c"&gt;# MODEL_OPUS=nvidia_nim/nvidia/llama-3.1-nemotron-ultra-253b-v1&lt;/span&gt;
&lt;span class="c"&gt;# MODEL_SONNET=nvidia_nim/nvidia/llama-3.3-70b-instruct&lt;/span&gt;
&lt;span class="c"&gt;# MODEL_HAIKU=nvidia_nim/meta/llama-3.1-8b-instruct&lt;/span&gt;

&lt;span class="c"&gt;# 4. Start the proxy (requires uv: pip install uv)&lt;/span&gt;
uv run uvicorn server:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8082

&lt;span class="c"&gt;# 5. In another terminal, run Claude Code through the proxy&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8082"&lt;/span&gt; claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Claude Code is now running on NVIDIA's free GPU infrastructure. No Anthropic API key. No charges.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture: Transparent Proxy Pattern
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code CLI / VS Code Extension
              │
              │  POST /v1/messages (Anthropic format)
              ▼
┌─────────────────────────────────────┐
│        free-claude-code Proxy       │
│  ┌────────────────────────────────┐ │
│  │    FastAPI Router              │ │
│  │  ┌──────────────────────────┐  │ │
│  │  │ Request parsing &amp;amp;        │  │ │
│  │  │ model tier routing       │  │ │
│  │  └──────────┬───────────────┘  │ │
│  │             │                   │ │
│  │  ┌──────────▼───────────────┐  │ │
│  │  │ Format translation layer │  │ │
│  │  │  Anthropic → OpenAI      │  │ │
│  │  │  (or direct passthrough) │  │ │
│  │  └──────────┬───────────────┘  │ │
│  │             │                   │ │
│  │  ┌──────────▼───────────────┐  │ │
│  │  │ Rate limiting &amp;amp; retry    │  │ │
│  │  └──────────┬───────────────┘  │ │
│  └─────────────┼─────────────────┘ │
└────────────────┼────────────────────┘
                 │
    ┌────────────┼────────────┐
    ▼            ▼            ▼
NVIDIA NIM   OpenRouter    Ollama
(free tier)  (free models) (local)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Hard Part: API Format Translation
&lt;/h3&gt;

&lt;p&gt;This is the real engineering challenge. Anthropic Messages API and OpenAI Chat Completions API differ significantly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic format (what Claude Code sends):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-opus-4-5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Refactor this code"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool_use_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"xxx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"thinking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"budget_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OpenAI format (what NVIDIA NIM / OpenRouter accepts):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"nvidia/llama-3.1-nemotron-ultra-253b-v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Refactor this code"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stream"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The translation points the proxy handles (the request side is sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal content&lt;/strong&gt;: Flatten Anthropic's &lt;code&gt;content&lt;/code&gt; array (text, images, tool results) into OpenAI's string or object format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calls&lt;/strong&gt;: &lt;code&gt;tool_use&lt;/code&gt; blocks ↔ &lt;code&gt;function_call&lt;/code&gt; / &lt;code&gt;tool_calls&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming&lt;/strong&gt;: Convert backend SSE stream to Anthropic's &lt;code&gt;event: content_block_delta&lt;/code&gt; format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking blocks&lt;/strong&gt;: Extract &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags into separate &lt;code&gt;thinking&lt;/code&gt; content blocks&lt;/li&gt;
&lt;/ul&gt;
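
&lt;p&gt;A minimal sketch of the request-side conversion (illustrative only; the real proxy also handles images, tool schemas, and streaming):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def anthropic_to_openai(body, target_model):
    # Flatten Anthropic content-block arrays into plain strings and
    # rebuild the request in OpenAI Chat Completions shape.
    messages = []
    for msg in body["messages"]:
        content = msg["content"]
        if isinstance(content, list):
            parts = []
            for block in content:
                if block["type"] == "text":
                    parts.append(block["text"])
                elif block["type"] == "tool_result":
                    parts.append(f"[tool result] {block.get('content', '')}")
            content = "\n".join(parts)
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": target_model,   # e.g. the MODEL_SONNET mapping from .env
        "messages": messages,
        "max_tokens": body.get("max_tokens", 4096),
        "stream": True,
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;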

&lt;h3&gt;
  
  
  Full Environment Variable Reference
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# === Backend selection ===&lt;/span&gt;
&lt;span class="nv"&gt;NVIDIA_NIM_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvapi-xxxxx
&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-xxxxx
&lt;span class="nv"&gt;DEEPSEEK_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-xxxxx
&lt;span class="nv"&gt;OLLAMA_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:11434

&lt;span class="c"&gt;# === Model tier routing ===&lt;/span&gt;
&lt;span class="nv"&gt;MODEL_OPUS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/nvidia/llama-3.1-nemotron-ultra-253b-v1
&lt;span class="nv"&gt;MODEL_SONNET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/nvidia/llama-3.3-70b-instruct
&lt;span class="nv"&gt;MODEL_HAIKU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_nim/meta/llama-3.1-8b-instruct

&lt;span class="c"&gt;# === Thinking support ===&lt;/span&gt;
&lt;span class="nv"&gt;ENABLE_SONNET_THINKING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true
&lt;/span&gt;&lt;span class="nv"&gt;ENABLE_OPUS_THINKING&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# === Rate limiting ===&lt;/span&gt;
&lt;span class="nv"&gt;PROVIDER_RATE_LIMIT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1           &lt;span class="c"&gt;# max requests/second&lt;/span&gt;
&lt;span class="nv"&gt;PROVIDER_MAX_CONCURRENCY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5      &lt;span class="c"&gt;# max concurrent requests&lt;/span&gt;

&lt;span class="c"&gt;# === Bot config (optional) ===&lt;/span&gt;
&lt;span class="nv"&gt;MESSAGING_PLATFORM&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;telegram
&lt;span class="nv"&gt;TELEGRAM_BOT_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;xxxxx
&lt;span class="nv"&gt;ALLOWED_TELEGRAM_USER_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;123456789
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Limitations Worth Knowing
&lt;/h2&gt;

&lt;p&gt;Before setting this up, these caveats matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Model quality gap is the fundamental trade-off&lt;/strong&gt;&lt;br&gt;
free-claude-code gives you the Claude Code &lt;em&gt;interface&lt;/em&gt;, not Claude &lt;em&gt;models&lt;/em&gt;. Open-source models on free tiers lag behind Claude Sonnet/Opus on complex multi-step reasoning, instruction following stability, and tool call reliability. One YouTube reviewer put it clearly: "Simply wrapping Claude's application layer around open LLMs will not produce the same quality output."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. NVIDIA NIM free tier exhausts quickly&lt;/strong&gt;&lt;br&gt;
40 req/min sounds like a lot until Claude Code starts sending its full context window on every turn. Rate-limited sessions introduce noticeable pauses. Real coding sessions will hit the ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Tool call compatibility is imperfect&lt;/strong&gt;&lt;br&gt;
Claude Code depends heavily on structured tool calls (file read/write, bash execution, search). Open-source models vary in their tool call formatting discipline. The proxy includes heuristic parsing as a fallback, but failures happen — especially with smaller models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Claude-specific features unavailable&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Computer Use (Anthropic's vision+interaction feature)&lt;/li&gt;
&lt;li&gt;True Extended Thinking (deep reasoning mode)&lt;/li&gt;
&lt;li&gt;Latest Claude training data and safety alignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Local models need real hardware&lt;/strong&gt;&lt;br&gt;
Running quality coding models locally (e.g., Qwen2.5-Coder-32B via Ollama) requires 20-24GB+ VRAM. Most consumer GPUs won't run the best models comfortably.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;vs. free-claude-code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;openclaw&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Alternative Claude Code CLI&lt;/td&gt;
&lt;td&gt;Independent implementation; doesn't use the official Claude Code client&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aider&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standalone AI coding CLI&lt;/td&gt;
&lt;td&gt;Mature and stable; native multi-model support; different UX entirely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenCode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Terminal AI coding tool&lt;/td&gt;
&lt;td&gt;Native multi-model design, not a proxy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;free-claude-code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API proxy layer&lt;/td&gt;
&lt;td&gt;Preserves the complete Claude Code UX; just swaps the backend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The unique value: users already fluent in Claude Code's workflow don't need to learn a new tool. Change one environment variable, and costs drop to zero.&lt;/p&gt;
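
&lt;p&gt;For concreteness, the switch looks roughly like this (a sketch; the proxy's listen address is an assumption, so check the project README for the actual port):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Point the official Claude Code client at the local proxy instead of Anthropic
export ANTHROPIC_BASE_URL=http://localhost:8082   # port is an assumption
claude   # launch as usual; requests now flow through the proxy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;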




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🌟 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Alishahryar1/free-claude-code" rel="noopener noreferrer"&gt;https://github.com/Alishahryar1/free-claude-code&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📋 &lt;strong&gt;Issues&lt;/strong&gt;: &lt;a href="https://github.com/Alishahryar1/free-claude-code/issues" rel="noopener noreferrer"&gt;https://github.com/Alishahryar1/free-claude-code/issues&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Related
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://build.nvidia.com/" rel="noopener noreferrer"&gt;NVIDIA NIM free API signup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openrouter.ai/models?q=free" rel="noopener noreferrer"&gt;OpenRouter free model list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama — local model deployment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code official docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The insight is the trick&lt;/strong&gt;: Claude Code is just an Anthropic API client. &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; redirection is all it takes — the engineering complexity is in the format translation layer, not the core idea&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real technical work&lt;/strong&gt;: Anthropic ↔ OpenAI format conversion, streaming response handling, tool call adaptation, and Thinking Token mapping — these are non-trivial engineering challenges the project solves well&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Free" has a price&lt;/strong&gt;: You save on API costs but trade model quality. NVIDIA NIM's free tier rate limits will make sessions feel slow during heavy usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community momentum&lt;/strong&gt;: 14k+ stars, 22 contributors, active issues — the project is iterating fast on the rough edges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case clarity&lt;/strong&gt;: Best for learning/experimentation and offline private deployment; not a substitute for production-grade Claude in complex agentic tasks&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Who Should Use This
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Students and hobbyists&lt;/strong&gt;: Want to experience Claude Code's full terminal agent workflow without paying for an API subscription&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model researchers&lt;/strong&gt;: Want to compare open-source models using Claude Code's interface as a consistent test harness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise private deployment&lt;/strong&gt;: Using Ollama for a fully offline, air-gapped AI coding assistant on internal infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote coding enthusiasts&lt;/strong&gt;: Using the Telegram/Discord bot to control a remote server's Claude Code instance from a phone&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Question Worth Sitting With
&lt;/h3&gt;

&lt;p&gt;free-claude-code's virality is more than just "free stuff is popular." It reflects a specific tension: Anthropic built a genuinely excellent developer tool, then gated it behind a per-token billing model that makes sustained use expensive. The community's immediate response was to route around it.&lt;/p&gt;

&lt;p&gt;The question isn't whether this is "ethical" — it's clearly operating in a gray zone. The more interesting question is what it signals: &lt;strong&gt;when developers immediately build free alternatives to paid AI tools, it suggests the underlying capability is perceived as infrastructure, not a premium product&lt;/strong&gt;. Infrastructure wants to be free or at least flat-rate. The market is making that preference clear.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Visit my &lt;a href="https://home.wonlab.top" rel="noopener noreferrer"&gt;personal site&lt;/a&gt; for more useful knowledge and interesting products&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>claude</category>
      <category>nvidia</category>
      <category>llm</category>
    </item>
    <item>
      <title>Claude Code + SonarQube Static Analysis: The AI Quality Loop is Finally Closed</title>
      <dc:creator>WonderLab</dc:creator>
      <pubDate>Mon, 27 Apr 2026 05:44:14 +0000</pubDate>
      <link>https://dev.to/wonderlab/claude-code-sonarqube-static-analysis-the-ai-quality-loop-is-finally-closed-3gh0</link>
      <guid>https://dev.to/wonderlab/claude-code-sonarqube-static-analysis-the-ai-quality-loop-is-finally-closed-3gh0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Sound familiar? You finish writing some code, open a PR, and then CI blows up with a wall of static analysis findings. You spend the next hour tracking down issues that, honestly, you could have caught at the keyboard.&lt;/p&gt;

&lt;p&gt;SonarQube is one of the most widely adopted code quality platforms in the industry. It detects bugs, vulnerabilities, code smells, and security hotspots, and tracks coverage and duplication. Now it can be integrated directly into Claude Code — so the same AI session that helps you write code can also scan it, flag problems, and fix them on the spot.&lt;/p&gt;

&lt;p&gt;This article is a complete walkthrough: from first install to day-to-day usage. But before we get into the steps, there is &lt;strong&gt;one critical gotcha&lt;/strong&gt; you need to know about first — otherwise you'll burn hours troubleshooting something that has nothing to do with your configuration.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Gotcha: SonarQube Server Version
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key point&lt;/strong&gt;: sonarqube-cli (and the sonarqube-mcp-server behind it) &lt;strong&gt;does not support SonarQube Server 9.x&lt;/strong&gt;. You must be running &lt;strong&gt;10.x or later&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our team had SonarQube &lt;strong&gt;9.9 LTS&lt;/strong&gt; deployed — a rock-solid long-term support release that plenty of organizations still run. We followed the official docs to the letter, set up sonarqube-cli, ran authentication, and kept getting connection errors with no useful message. We spent a good chunk of time ruling out network issues, token problems, and firewall rules.&lt;/p&gt;

&lt;p&gt;The root cause turned out to be simple: sonarqube-mcp-server calls the &lt;strong&gt;next-generation SonarQube API&lt;/strong&gt; (&lt;code&gt;/api/v2/&lt;/code&gt; prefix). Those endpoints were introduced in version 10.x. They simply do not exist on a 9.9 instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: We spun up a fresh SonarQube Server &lt;strong&gt;10.x&lt;/strong&gt; instance, pointed the config at it, and everything worked immediately.&lt;/p&gt;
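
&lt;p&gt;If you just need a disposable 10.x+ instance to test against, the standard Docker quickstart is enough (a sketch, not a production setup; data is lost with the container unless you mount volumes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Current Community images are well past 10.0
docker run -d --name sonarqube -p 9000:9000 sonarqube:community

# First login is admin/admin; you'll be prompted to change it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;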

&lt;p&gt;&lt;br&gt;
&lt;strong&gt;Check your version before anything else.&lt;/strong&gt; Log into the SonarQube console and look in the bottom-right corner or under &lt;code&gt;Administration &amp;gt; System&lt;/code&gt;. If you see a 9.x version number, you need to upgrade or deploy a separate 10.x instance before continuing with any of the steps below.&lt;br&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Version Compatibility at a Glance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SonarQube Server Version&lt;/th&gt;
&lt;th&gt;Works with sonarqube-cli / MCP?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;9.9 LTS and below&lt;/td&gt;
&lt;td&gt;❌ Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10.0 – 10.x&lt;/td&gt;
&lt;td&gt;✅ Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SonarQube Cloud&lt;/td&gt;
&lt;td&gt;✅ Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Understanding the three-layer stack makes troubleshooting much easier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code
    │
    ├── sonarqube-agent-plugins    ← Plugin layer: slash commands and Skills
    │       └── /sonar-analyze, /sonar-integrate, etc.
    │
    ├── sonarqube-cli (sonar)      ← CLI layer: lightweight tool for auth and analysis
    │       └── ~/.local/share/sonarqube-cli/bin/sonar
    │
    └── sonarqube-mcp-server       ← MCP layer: containerized server for deep analysis
            └── Runs via Docker/Podman, calls SonarQube Server API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer has a distinct responsibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;sonarqube-agent-plugins&lt;/strong&gt;: The official plugin bundle that injects Sonar slash commands and Skills into Claude Code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sonarqube-cli&lt;/strong&gt;: A lightweight CLI tool that handles authentication and basic analysis — &lt;strong&gt;no container required&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sonarqube-mcp-server&lt;/strong&gt;: A Docker/Podman container that acts as the MCP server, powering advanced capabilities like coverage, quality gates, and duplication detection&lt;/li&gt;
&lt;/ul&gt;
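
&lt;p&gt;Once everything is installed (steps below), each layer can be probed independently when something misbehaves; a sketch using commands that appear later in this guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Plugin layer: type /sonar inside Claude Code and check the command list appears

# CLI layer: binary present and authenticated?
~/.local/share/sonarqube-cli/bin/sonar auth status

# MCP layer: container runtime alive and image pulled?
podman info                                  # or: docker info
podman images | grep sonarqube-mcp-server   # image name from the MCP config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;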




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Make sure the following are in place before you start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node.js 18+&lt;/strong&gt;: Required by the plugin's &lt;code&gt;SessionStart&lt;/code&gt; hook script (&lt;code&gt;scripts/setup.js&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker or Podman&lt;/strong&gt;: The MCP Server runs as a container

&lt;ul&gt;
&lt;li&gt;On macOS in corporate environments, Docker Desktop is often disallowed (licensing); use &lt;strong&gt;Podman&lt;/strong&gt; instead (see the macOS section below)&lt;/li&gt;
&lt;li&gt;Linux and Windows can use Docker directly&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SonarQube Server 10.x&lt;/strong&gt; (or SonarQube Cloud): Deployed and reachable over the network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser logged into SonarQube&lt;/strong&gt;: The OAuth authorization step requires a browser session&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Install the sonarqube-agent-plugins Plugin
&lt;/h3&gt;

&lt;p&gt;Open Claude Code and run these two slash commands in order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/plugin marketplace add SonarSource/sonarqube-agent-plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/plugin install sonarqube@sonar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then reload the plugins (or start a fresh Claude Code session):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/reload-plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify&lt;/strong&gt;: Type &lt;code&gt;/sonar&lt;/code&gt; in Claude Code. If you see a list of Sonar commands, the plugin installed correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Run the Integration Wizard
&lt;/h3&gt;

&lt;p&gt;In Claude Code, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/sonar-integrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This launches an interactive wizard that walks you through: installing sonarqube-cli → connecting to SonarQube Server → completing OAuth authorization → registering the MCP Server.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.1 Install sonarqube-cli
&lt;/h4&gt;

&lt;p&gt;The wizard installs &lt;code&gt;sonarqube-cli&lt;/code&gt; automatically in the first step. The CLI lands at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~/.local/share/sonarqube-cli/bin/sonar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a "command not found" error when running &lt;code&gt;sonar&lt;/code&gt; later, add it to your PATH manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add to ~/.zshrc or ~/.bashrc&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PATH="$HOME/.local/share/sonarqube-cli/bin:$PATH"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2.2 Connect to SonarQube Server
&lt;/h4&gt;

&lt;p&gt;The wizard prompts you to choose a connection method. Select &lt;strong&gt;option 4: "Type something"&lt;/strong&gt; and enter your SonarQube Server URL directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;http://your-sonarqube-server:9000/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;br&gt;
Reminder: this URL must point to a &lt;strong&gt;SonarQube 10.x or later&lt;/strong&gt; instance. Pointing at a 9.9 server here will cause auth and scan failures downstream.&lt;br&gt;
&lt;/p&gt;

&lt;h4&gt;
  
  
  2.3 Authenticate
&lt;/h4&gt;

&lt;p&gt;Once the wizard recognizes the server address, it tells you the next step. Run the auth command in the Claude Code terminal or a separate terminal window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sonar auth login &lt;span class="nt"&gt;-s&lt;/span&gt; http://your-sonarqube-server:9000/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;strong&gt;automatically opens a browser&lt;/strong&gt; and navigates to the SonarQube authorization page. Click &lt;strong&gt;Allow connection&lt;/strong&gt; to complete the flow.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
&lt;strong&gt;Order matters&lt;/strong&gt;: make sure your browser is already logged into SonarQube before running this command. If you're not logged in, the redirect will take you to the login screen first and the flow gets messier.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Once you've clicked Allow in the browser, return to Claude Code and tell it "Authorization complete." Claude Code will run a connection health check and, if everything passes, move on to the next step.&lt;/p&gt;

&lt;h4&gt;
  
  
  2.4 Choose Integration Scope
&lt;/h4&gt;

&lt;p&gt;The wizard asks where to apply the SonarQube integration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Current project&lt;/strong&gt;: Takes effect only in the current working directory. Config is written to the project-level &lt;code&gt;.claude/&lt;/code&gt; directory. Recommended for shared codebases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global&lt;/strong&gt;: Applies to all projects. Config goes into &lt;code&gt;~/.claude/&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After you choose, Claude Code automatically registers the MCP Server in the appropriate config file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Once this is done, exit Claude Code completely.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Corporate Network: Replacing the Docker Image Mirror
&lt;/h2&gt;

&lt;p&gt;In corporate environments, Docker Hub (&lt;code&gt;registry-1.docker.io&lt;/code&gt;) is often blocked. You'll need to point the sonarqube-mcp-server image at an internal mirror.&lt;/p&gt;

&lt;h3&gt;
  
  
  Edit the MCP Config
&lt;/h3&gt;

&lt;p&gt;The MCP config lives in &lt;code&gt;~/.claude.json&lt;/code&gt; (global) or &lt;code&gt;.claude/claude.json&lt;/code&gt; (project-level). Find the sonarqube entry under &lt;code&gt;mcpServers&lt;/code&gt; and update the image reference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt; (using a JFrog Artifactory proxy):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Before&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sonarsource/sonarqube-mcp-server:latest"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;After&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;internal&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;mirror&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;proxy&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jfrog.yourcompany.com/external-docker-public-virtual/sonarsource/sonarqube-mcp-server:latest"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;br&gt;
&lt;strong&gt;Mapping rule&lt;/strong&gt;: prepend your internal registry host to the original image name: &lt;code&gt;&amp;lt;registry&amp;gt;/&amp;lt;original-image&amp;gt;:&amp;lt;tag&amp;gt;&lt;/code&gt;. Check with your infrastructure team for the exact proxy URL and path prefix.&lt;br&gt;
&lt;/p&gt;
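
&lt;p&gt;Before relaunching Claude Code, it's worth confirming the remapped image actually resolves through the mirror (using the illustrative proxy path from the example above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Should complete without touching registry-1.docker.io
docker pull jfrog.yourcompany.com/external-docker-public-virtual/sonarsource/sonarqube-mcp-server:latest
# or, with Podman:
podman pull jfrog.yourcompany.com/external-docker-public-virtual/sonarsource/sonarqube-mcp-server:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;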




&lt;h2&gt;
  
  
  macOS: Using Podman Instead of Docker
&lt;/h2&gt;

&lt;p&gt;Many corporate macOS environments prohibit Docker Desktop due to licensing requirements. Podman is a fully open-source, Docker-compatible alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Podman
&lt;/h3&gt;

&lt;p&gt;Download the macOS installer (&lt;code&gt;.pkg&lt;/code&gt;) from &lt;a href="https://podman.io/" rel="noopener noreferrer"&gt;podman.io&lt;/a&gt; and double-click to install.&lt;/p&gt;

&lt;p&gt;Add Podman to your PATH:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PATH="/opt/podman/bin:$PATH"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;which podman
&lt;span class="c"&gt;# /opt/podman/bin/podman&lt;/span&gt;

podman &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# podman version 5.x.x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Initialize the Podman Machine
&lt;/h3&gt;

&lt;p&gt;On macOS, Podman needs a lightweight VM to run containers (similar to Docker Desktop's VM layer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# First-time setup (downloads ~500 MB base image — takes a while)&lt;/span&gt;
podman machine init

&lt;span class="c"&gt;# Start the VM&lt;/span&gt;
podman machine start

&lt;span class="c"&gt;# Check status&lt;/span&gt;
podman machine list
&lt;span class="c"&gt;# NAME                     VM TYPE  CREATED  LAST UP            CPUS  MEMORY  DISK SIZE&lt;/span&gt;
&lt;span class="c"&gt;# podman-machine-default*  applehv  ...      Currently running  5     2GiB    100GiB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;br&gt;
&lt;strong&gt;After every Mac restart&lt;/strong&gt;, Podman Machine does not start automatically. You need to run &lt;code&gt;podman machine start&lt;/code&gt; manually, or configure a launchd service to start it on login.&lt;br&gt;
&lt;/p&gt;
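
&lt;p&gt;For the launchd route, a minimal sketch looks like this (the label and plist file name are arbitrary choices; the podman path matches the install location above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch: a LaunchAgent that runs "podman machine start" at login
cat &amp;gt; ~/Library/LaunchAgents/local.podman.machine.plist &amp;lt;&amp;lt;'EOF'
&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"&amp;gt;
&amp;lt;plist version="1.0"&amp;gt;
&amp;lt;dict&amp;gt;
  &amp;lt;key&amp;gt;Label&amp;lt;/key&amp;gt;&amp;lt;string&amp;gt;local.podman.machine&amp;lt;/string&amp;gt;
  &amp;lt;key&amp;gt;ProgramArguments&amp;lt;/key&amp;gt;
  &amp;lt;array&amp;gt;
    &amp;lt;string&amp;gt;/opt/podman/bin/podman&amp;lt;/string&amp;gt;
    &amp;lt;string&amp;gt;machine&amp;lt;/string&amp;gt;
    &amp;lt;string&amp;gt;start&amp;lt;/string&amp;gt;
  &amp;lt;/array&amp;gt;
  &amp;lt;key&amp;gt;RunAtLoad&amp;lt;/key&amp;gt;&amp;lt;true/&amp;gt;
&amp;lt;/dict&amp;gt;
&amp;lt;/plist&amp;gt;
EOF

launchctl load ~/Library/LaunchAgents/local.podman.machine.plist
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;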




&lt;h2&gt;
  
  
  Start Up and Verify
&lt;/h2&gt;

&lt;p&gt;With all configuration done, &lt;strong&gt;quit Claude Code fully and relaunch it&lt;/strong&gt; so the MCP config takes effect.&lt;/p&gt;

&lt;p&gt;On the first launch, Claude Code will pull the sonarqube-mcp-server container image. &lt;strong&gt;This will be slow the first time&lt;/strong&gt; — give it a minute or two. Then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When sonarqube shows as &lt;strong&gt;connected&lt;/strong&gt;, the integration is live.&lt;/p&gt;
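
&lt;p&gt;If that first pull is slow or flaky, you can pre-pull the image outside Claude Code (swap in your mirror path if you remapped it in the corporate-network section):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman pull sonarsource/sonarqube-mcp-server:latest
# or: docker pull sonarsource/sonarqube-mcp-server:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;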




&lt;h2&gt;
  
  
  Optional: sonar-project.properties
&lt;/h2&gt;

&lt;p&gt;Create a &lt;code&gt;sonar-project.properties&lt;/code&gt; file in your project root to pre-declare project metadata. This lets the analysis commands auto-detect the project without needing you to pass the project key manually each time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;sonar.projectKey&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;my-project&lt;/span&gt;
&lt;span class="py"&gt;sonar.projectName&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;My Project&lt;/span&gt;
&lt;span class="py"&gt;sonar.projectVersion&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1.0&lt;/span&gt;
&lt;span class="py"&gt;sonar.sources&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;src&lt;/span&gt;
&lt;span class="py"&gt;sonar.sourceEncoding&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;UTF-8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Command Reference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CLI Commands (no MCP required)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-integrate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Re-run setup: re-authenticate, re-register MCP, reinstall hooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-list-projects [keyword]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List accessible SonarQube projects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-list-issues [project] [--severity CRITICAL]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search and filter project issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-fix-issue &amp;lt;rule&amp;gt; &amp;lt;file&amp;gt;[:&amp;lt;line&amp;gt;]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fix a specific rule violation in a file&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
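
&lt;p&gt;For example, to triage only the most severe findings in one project (the project key here is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/sonar-list-issues my-project --severity CRITICAL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;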

&lt;h3&gt;
  
  
  MCP Commands (requires connected MCP Server)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-analyze [file path]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Analyze a single file and display issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-quality-gate [project] [--branch]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check Quality Gate status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-coverage [project] [--max N] [--file]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View code coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-duplication [project] [--pr N] [--file]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View code duplication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/sonar-dependency-risks [project] [--pr N]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View dependency risks (requires Advanced Security)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Example: Scan a Single File
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/sonar-analyze ./src/main/java/com/example/UserService.java
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code calls the MCP Server, runs the analysis, and returns a structured list of bugs, vulnerabilities, and code smells with fix suggestions. You can immediately ask Claude to act on the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fix all CRITICAL issues found in the last scan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Authentication Fails
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Re-authenticate (overwrites the old token)&lt;/span&gt;
sonar auth login &lt;span class="nt"&gt;-s&lt;/span&gt; http://your-sonarqube-server:9000/

&lt;span class="c"&gt;# Check auth status&lt;/span&gt;
sonar auth status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this any time you deploy a new SonarQube Server or switch instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;sonar&lt;/code&gt; Command Not Found
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PATH="$HOME/.local/share/sonarqube-cli/bin:$PATH"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  MCP Server Fails to Start
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Verify the container runtime is working: &lt;code&gt;docker info&lt;/code&gt; or &lt;code&gt;podman info&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Double-check the image path in &lt;code&gt;~/.claude.json&lt;/code&gt; (especially for corporate mirror setups)&lt;/li&gt;
&lt;li&gt;On macOS, confirm Podman Machine is running: &lt;code&gt;podman machine start&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  404 / API Errors When Connecting to SonarQube (the most common failure)
&lt;/h3&gt;

&lt;p&gt;If you see 404s or API errors during auth or scanning, the server version is almost certainly the culprit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check the server version via API&lt;/span&gt;
curl http://your-sonarqube-server:9000/api/server/version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the response is &lt;code&gt;9.9.x&lt;/code&gt;, you need to upgrade to &lt;code&gt;10.x&lt;/code&gt;. There is no workaround — the new API endpoints simply don't exist on 9.x.&lt;/p&gt;
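
&lt;p&gt;You can fold that check into a small gate before any setup work; a sketch building on the curl call above (newer date-based versions such as 2025.x also postdate 10.0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ver=$(curl -s http://your-sonarqube-server:9000/api/server/version)
case "$ver" in
  10.*|2025.*) echo "SonarQube $ver: supported" ;;
  *)           echo "SonarQube $ver: not supported, deploy 10.x or later first" ;;
esac
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;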




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Here's what we covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt;: Three layers — agent-plugins, sonarqube-cli, and mcp-server — each with a distinct role&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The version trap&lt;/strong&gt;: SonarQube Server must be &lt;strong&gt;10.x or later&lt;/strong&gt;; 9.9 LTS is not supported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full installation&lt;/strong&gt;: Plugin marketplace → integration wizard → auth → MCP registration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Corporate network setup&lt;/strong&gt;: Internal mirror proxy for Docker images + Podman as a Docker Desktop replacement on macOS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily workflow&lt;/strong&gt;: File scanning, quality gates, coverage, and fix commands&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this setup in place, Claude Code isn't just helping you write code faster — it's also helping you write code that passes quality gates before it ever hits CI. That's what AI-assisted development should actually look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/SonarSource/sonarqube-agent-plugins" rel="noopener noreferrer"&gt;SonarSource/sonarqube-agent-plugins&lt;/a&gt; — Official Claude Code plugin repository&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/SonarSource/sonarqube-mcp-server" rel="noopener noreferrer"&gt;SonarQube MCP Server&lt;/a&gt; — MCP Server implementation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://podman.io/getting-started/installation" rel="noopener noreferrer"&gt;Podman Installation Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Questions or issues? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>sonar</category>
      <category>vibecoding</category>
      <category>codequality</category>
    </item>
  </channel>
</rss>
