<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: 马国锦</title>
    <description>The latest articles on DEV Community by 马国锦 (@mgj).</description>
    <link>https://dev.to/mgj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3978634%2Ff850f006-c53c-4347-b0ae-54f643d01c12.png</url>
      <title>DEV Community: 马国锦</title>
      <link>https://dev.to/mgj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mgj"/>
    <language>en</language>
    <item>
      <title>How I Ship 10x Faster with Claude Code: The 5-Layer Workflow System</title>
      <dc:creator>马国锦</dc:creator>
      <pubDate>Thu, 11 Jun 2026 03:37:23 +0000</pubDate>
      <link>https://dev.to/mgj/how-i-ship-10x-faster-with-claude-code-the-5-layer-workflow-system-3120</link>
      <guid>https://dev.to/mgj/how-i-ship-10x-faster-with-claude-code-the-5-layer-workflow-system-3120</guid>
      <description>&lt;p&gt;After 8 months of daily Claude Code use, I've distilled my workflow into a 5-layer system. Each layer builds on the previous one. Skip one, and the whole thing falls apart.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with Most Claude Code Users
&lt;/h2&gt;

&lt;p&gt;Most people use Claude Code like ChatGPT — open terminal, ask a question, close, repeat. The next day, they explain their project from scratch. Again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The symptom:&lt;/strong&gt; Roughly 20% of every session goes to re-establishing context.&lt;br&gt;
&lt;strong&gt;The root cause:&lt;/strong&gt; No project memory, no workflow discipline.&lt;/p&gt;

&lt;p&gt;Here's the system that fixed it for me.&lt;/p&gt;


&lt;h2&gt;
  
  
  Layer 1: CLAUDE.md — Your Project's Memory Anchor
&lt;/h2&gt;

&lt;p&gt;This is the foundation. Without it, nothing else works.&lt;/p&gt;

&lt;p&gt;CLAUDE.md is a file at your project root. Claude reads it automatically at the start of every session. It tells Claude:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What this project is (one sentence)&lt;/li&gt;
&lt;li&gt;The tech stack (specific technologies, not "Python web framework")&lt;/li&gt;
&lt;li&gt;The architecture (the big picture you'd need 3 files to understand)&lt;/li&gt;
&lt;li&gt;Unique conventions (not generic advice like "write tests")&lt;/li&gt;
&lt;li&gt;Quality priorities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bad CLAUDE.md (you've probably written this):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# My Project&lt;/span&gt;
A web application built with Python and FastAPI.

&lt;span class="gu"&gt;## Development&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Write clean code
&lt;span class="p"&gt;-&lt;/span&gt; Add unit tests
&lt;span class="p"&gt;-&lt;/span&gt; Use Git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells Claude nothing it doesn't already know.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good CLAUDE.md:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# CLAUDE.md&lt;/span&gt;

&lt;span class="gu"&gt;## Project Overview&lt;/span&gt;
Internal RAG knowledge base serving 500+ employees.

&lt;span class="gu"&gt;## Tech Stack&lt;/span&gt;
FastAPI + LangChain + Milvus + PostgreSQL + Redis

&lt;span class="gu"&gt;## Commands&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Start: &lt;span class="sb"&gt;`uvicorn app.main:app --reload --port 8080`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Test: &lt;span class="sb"&gt;`pytest -x --cov=app --cov-report=term-missing`&lt;/span&gt;

&lt;span class="gu"&gt;## Architecture&lt;/span&gt;
Request flow: router → service → retriever → Milvus
                             → generator → LLM API
Key directories:
&lt;span class="p"&gt;-&lt;/span&gt; app/router/  - API layer
&lt;span class="p"&gt;-&lt;/span&gt; app/service/ - Business logic orchestration
&lt;span class="p"&gt;-&lt;/span&gt; app/retriever/ - Retrieval strategies (vector/BM25/hybrid)
&lt;span class="p"&gt;-&lt;/span&gt; app/generator/ - LLM calls and prompt management

&lt;span class="gu"&gt;## Key Conventions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; All APIs return &lt;span class="sb"&gt;`{"data": ..., "error": null}`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Retrieval results MUST include source field
&lt;span class="p"&gt;-&lt;/span&gt; Milvus collection naming: &lt;span class="sb"&gt;`{env}_{doc_type}`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; Only write what's unique to your project. Claude already knows to use Git.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 2: Plan Mode — Direction Before Speed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Trigger:&lt;/strong&gt; Any change touching 3+ files or architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude explores existing code (modules, data flows, API structures)&lt;/li&gt;
&lt;li&gt;Designs an implementation plan&lt;/li&gt;
&lt;li&gt;Shows you the plan for approval&lt;/li&gt;
&lt;li&gt;Only writes code after you confirm&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The anti-pattern:&lt;/strong&gt; "Rewrite the entire auth module" → Claude changes 12 files → everything breaks → you don't know where to start fixing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; "I want to add hybrid search. Let's plan first." → Claude explores → proposes architecture → you approve → step-by-step implementation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 3: Small Tasks — One Unit, One Commit
&lt;/h2&gt;

&lt;p&gt;Each task changes ONE logical unit. After each task, the project is still runnable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bad:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "Refactor the entire retriever module."
Claude: [changes 8 files] → total chaos
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Good:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "Step 1: Extract BM25 logic into retriever/bm25.py"
Claude: [changes 1 file, tests pass]
You: "Commit this step."
You: "Step 2: Add RRF fusion in retriever/fusion.py"
Claude: [changes 1 file, tests pass]
You: "Commit this step."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; Each commit is a checkpoint. Something breaks? You know exactly which commit introduced it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 4: Git — Your Undo Button
&lt;/h2&gt;

&lt;p&gt;Commit after every small task. This isn't just version control — it's your change log.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feat: extract BM25 retriever into standalone module
feat: add RRF fusion for hybrid search
fix: Chinese tokenization for BM25 queries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The commit history IS your project journal. Six months later, you can trace why a design decision was made.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 5: Worktree — Parallel Without Collision
&lt;/h2&gt;

&lt;p&gt;When you're working on feature A and need to fix bug B without context-switching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# While working on feature A...&lt;/span&gt;
git worktree add -b hotfix ../bugfix
&lt;span class="nb"&gt;cd&lt;/span&gt; ../bugfix
&lt;span class="c"&gt;# Fix bug B in complete isolation&lt;/span&gt;
&lt;span class="c"&gt;# Commit, return, clean up&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; -
git worktree remove ../bugfix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or just tell Claude: "Use a worktree for this." It handles the isolation automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bonus: Context Compounding — The Meta-Layer
&lt;/h2&gt;

&lt;p&gt;Every technical decision you make, every pattern you discover — write it down. It compounds.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Vehicle&lt;/th&gt;
&lt;th&gt;What Gets Stored&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Project&lt;/td&gt;
&lt;td&gt;CLAUDE.md&lt;/td&gt;
&lt;td&gt;Architecture, conventions, decisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session&lt;/td&gt;
&lt;td&gt;Memory system&lt;/td&gt;
&lt;td&gt;User preferences, feedback, corrections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge Base&lt;/td&gt;
&lt;td&gt;Linked notes&lt;/td&gt;
&lt;td&gt;Methodologies, patterns, abstractions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After 8 months, my AI assistant doesn't just understand my project — it understands how I think about projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here's what adding a "CSV export" feature looks like with this system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLAUDE.md      → Auto-loads project context. Claude knows where the API routes live.
Plan Mode      → "Let's plan the CSV export feature first" → Approval before code.
Small Tasks    → Step 1: Add export route. Step 2: Add frontend button.
Git            → Commit after each step. Rollback if needed.
Worktree       → Meanwhile, fix a login bug in isolation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; Features ship faster, bugs are traceable, and every new session starts with full context.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start Checklist
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Day 1: Write a proper CLAUDE.md (30 min — just write the unique stuff)
Day 2: Say "Let's plan first" before cross-file changes
Day 3: Break one big task into 3-5 small steps
Day 5: Try Worktree — "Use a worktree for this"
Day 7: After your first design decision, write a Memory
Day 14: Review two weeks of commits — you'll see patterns worth documenting
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two weeks from now, you won't be "chatting" with Claude anymore. You'll be producing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this helped you level up your Claude Code workflow, give it a unicorn. I share more practical AI engineering tips weekly — follow me here on Dev.to.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>Build Your RAG System Right the First Time: 6 Decisions That Make or Break It</title>
      <dc:creator>马国锦</dc:creator>
      <pubDate>Thu, 11 Jun 2026 03:36:21 +0000</pubDate>
      <link>https://dev.to/mgj/build-your-rag-system-right-the-first-time-6-decisions-that-make-or-break-it-68i</link>
      <guid>https://dev.to/mgj/build-your-rag-system-right-the-first-time-6-decisions-that-make-or-break-it-68i</guid>
      <description>&lt;p&gt;After debugging 20+ broken RAG systems, I've identified the 6 decisions that determine whether yours works. Here's how to get each one right.&lt;/p&gt;




&lt;h2&gt;
  
  
  The RAG Developer's Trap
&lt;/h2&gt;

&lt;p&gt;Every RAG developer falls into the same trap: you build the basic pipeline, it sort of works, and then you spend weeks tweaking prompt templates — while the real problem sits untouched in your indexing pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 80/20 rule: 80% of RAG problems come from indexing, not generation.&lt;/strong&gt; But 80% of debugging effort goes into generation.&lt;/p&gt;

&lt;p&gt;Let's fix that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision 1: Embedding Model — The Single Biggest Lever
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The mistake:&lt;/strong&gt; Using &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; for Chinese documents because it's the default in every tutorial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's wrong:&lt;/strong&gt; It was trained primarily on English data. On Chinese text it can lose an estimated 30-50% of semantic fidelity.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Use This&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chinese&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;BAAI/bge-large-zh-v1.5&lt;/code&gt; (1024-dim)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chinese + English&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;BAAI/bge-m3&lt;/code&gt; (multilingual + sparse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;&lt;code&gt;text-embedding-3-large&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;jina-embeddings-v3&lt;/code&gt; or &lt;code&gt;voyage-code-3&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

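&lt;p&gt;&lt;strong&gt;Non-negotiable:&lt;/strong&gt; Index and query with exactly the same model: same name, same version, same output dimension. Switching models means rebuilding the entire index, because vectors from different models are not comparable.&lt;/p&gt;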
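&lt;p&gt;One way to enforce this is to record the model next to the index and check it before every query. The sketch below is a minimal illustration, not part of any vector database API: the manifest file and the function names &lt;code&gt;write_index_manifest&lt;/code&gt; and &lt;code&gt;check_query_model&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import json
from pathlib import Path


def write_index_manifest(path, model_name, dim):
    """Record which embedding model built the index (hypothetical sidecar file)."""
    Path(path).write_text(json.dumps({"model": model_name, "dim": dim}))


def check_query_model(path, model_name, dim):
    """Raise if the query-time model differs from the one used for indexing."""
    manifest = json.loads(Path(path).read_text())
    if manifest["model"] != model_name or manifest["dim"] != dim:
        raise ValueError(
            f"Index built with {manifest['model']} ({manifest['dim']}-dim), "
            f"but query uses {model_name} ({dim}-dim): rebuild the index."
        )
    return True
```

&lt;p&gt;Call &lt;code&gt;write_index_manifest&lt;/code&gt; once at indexing time and &lt;code&gt;check_query_model&lt;/code&gt; at service startup, so a model swap fails loudly instead of silently degrading recall.&lt;/p&gt;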

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; +15-40% Recall@10 for Chinese RAG.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision 2: Chunk Size — Not a Magic Number
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The trade-off:&lt;/strong&gt; Chunks that are too small (&amp;lt; 100 tokens) fragment the meaning. Chunks that are too large (&amp;gt; 1000 tokens) inject noise.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Document Type&lt;/th&gt;
&lt;th&gt;Sweet Spot&lt;/th&gt;
&lt;th&gt;Overlap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FAQ / Short-form&lt;/td&gt;
&lt;td&gt;128-256&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Technical docs&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-form articles&lt;/td&gt;
&lt;td&gt;768-1024&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code&lt;/td&gt;
&lt;td&gt;Function boundaries&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The method matters more than the size.&lt;/strong&gt; Use recursive splitting, not fixed-length:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="c1"&gt;# chunk_size and chunk_overlap count characters by default, not tokens;&lt;/span&gt;
&lt;span class="c1"&gt;# use RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) to size chunks in tokens.&lt;/span&gt;
&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Split on paragraphs first, then lines, then sentences ("。" for Chinese), then words.&lt;/span&gt;
    &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;。&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; +5-15% Recall@10.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision 3: Index Type — HNSW vs IVF
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 1M vectors&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;HNSW&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Recall &amp;gt; 0.95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1-5M, RAM tight&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;IVF + PQ&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;75% memory savings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt; 5M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;IVF + PQ + Sharding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Horizontal scale&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key nuance:&lt;/strong&gt; HNSW has a high insertion cost. If documents stream in continuously, IVF may be the better choice even at small scale.&lt;/p&gt;
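&lt;p&gt;The rules of thumb above can be written down as a small decision helper. This is only a sketch of the table: the function name &lt;code&gt;choose_index&lt;/code&gt;, the returned labels, the choice of &lt;code&gt;IVF_FLAT&lt;/code&gt; for streaming inserts, and the fallback to HNSW for 1-5M vectors with ample RAM are all assumptions, not rules from a specific vector database.&lt;/p&gt;

```python
def choose_index(num_vectors, ram_tight=False, streaming_inserts=False):
    """Map the scale table to an index family name (labels are illustrative)."""
    if num_vectors > 5_000_000:
        return "IVF_PQ + sharding"
    if num_vectors >= 1_000_000 and ram_tight:
        return "IVF_PQ"
    if streaming_inserts:
        # HNSW inserts are comparatively expensive, so frequent writes may favor IVF.
        return "IVF_FLAT"
    return "HNSW"


print(choose_index(200_000))                     # HNSW
print(choose_index(2_000_000, ram_tight=True))   # IVF_PQ
print(choose_index(10_000_000))                  # IVF_PQ + sharding
```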

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; +5-15% recall or 2-5x latency improvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision 4: Metadata — Not Optional
&lt;/h2&gt;

&lt;p&gt;Without metadata filtering, every query searches the entire collection. Filter on &lt;code&gt;department=engineering AND date &amp;gt; 2024-01-01&lt;/code&gt; first and the search space can shrink from 5M to 50K vectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"internal_wiki"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"doc_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"design_doc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"publish_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-06-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"department"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"engineering"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
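&lt;p&gt;To make the filtering step concrete, here is a minimal in-memory sketch. The records and the &lt;code&gt;prefilter&lt;/code&gt; helper are hypothetical. A real vector store would apply the same condition server-side through its own filter expression, before or during the vector search.&lt;/p&gt;

```python
from datetime import date

# Hypothetical chunk metadata; in production this lives in the vector store.
chunks = [
    {"id": "a", "department": "engineering", "publish_date": date(2024, 6, 1)},
    {"id": "b", "department": "sales", "publish_date": date(2024, 7, 15)},
    {"id": "c", "department": "engineering", "publish_date": date(2023, 3, 9)},
]


def prefilter(records, department, published_after):
    """Keep only records in the department that are newer than the cutoff date."""
    return [
        r for r in records
        if r["department"] == department and r["publish_date"] > published_after
    ]


candidates = prefilter(chunks, "engineering", date(2024, 1, 1))
print([r["id"] for r in candidates])  # ['a']
```

&lt;p&gt;Only the surviving candidates need to be scored by vector similarity, which is where the precision gain comes from.&lt;/p&gt;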



&lt;p&gt;&lt;strong&gt;Bonus:&lt;/strong&gt; Inject metadata into prompts. LLMs weight credibility based on source and recency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; +10-25% precision from filtering alone.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision 5: Deduplication — Do It Twice
&lt;/h2&gt;

&lt;p&gt;Enterprise knowledge bases are full of duplicates. Without dedup, your top-10 results might be 7 copies of the same document.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Document-level&lt;/strong&gt; (before chunking): MinHash + LSH, threshold 0.85&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk-level&lt;/strong&gt; (after chunking): SimHash, threshold 0.95&lt;/li&gt;
&lt;/ol&gt;
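&lt;p&gt;As an illustration of the document-level pass, here is a minimal MinHash sketch using only the standard library. It compares one pair of documents directly and leaves out the LSH bucketing you would need at scale. The function names, the character 5-gram shingles, and the 64 hash functions are assumptions chosen for readability.&lt;/p&gt;

```python
import hashlib


def shingles(text, n=5):
    """Character n-grams; works for Chinese and English without a tokenizer."""
    text = "".join(text.split())  # ignore whitespace differences
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(text, num_perm=64):
    """One minimum hash value per seeded hash function."""
    grams = shingles(text)
    return [
        min(int(hashlib.md5(f"{seed}:{g}".encode("utf-8")).hexdigest(), 16) for g in grams)
        for seed in range(num_perm)
    ]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def is_near_duplicate(doc_a, doc_b, threshold=0.85):
    """Flag a pair whose estimated similarity reaches the threshold."""
    return estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)) >= threshold
```

&lt;p&gt;For production volumes, a library with a tested MinHash and LSH implementation is a safer choice than hand-rolled hashing.&lt;/p&gt;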

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; +10-20% effective recall.&lt;/p&gt;




&lt;h2&gt;
  
  
  Decision 6: Query Processing — The Other Half
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;When&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Query rewriting&lt;/td&gt;
&lt;td&gt;Short/fuzzy queries&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HyDE&lt;/td&gt;
&lt;td&gt;Factual Q&amp;amp;A, &amp;lt; 10 words&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-path recall + RRF&lt;/td&gt;
&lt;td&gt;Semantic + exact-match&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-Encoder rerank&lt;/td&gt;
&lt;td&gt;Post-retrieval refinement&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Minimum viable stack:&lt;/strong&gt; Query rewriting + Cross-Encoder rerank.&lt;/p&gt;
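&lt;p&gt;The RRF step in the table is small enough to sketch in full. Reciprocal Rank Fusion scores each document as the sum of &lt;code&gt;1 / (k + rank)&lt;/code&gt; over every ranked list it appears in. The function name &lt;code&gt;rrf_fuse&lt;/code&gt; and the sample document IDs are illustrative, and &lt;code&gt;k=60&lt;/code&gt; is the commonly used default rather than a tuned value.&lt;/p&gt;

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda d: scores[d], reverse=True)


vector_hits = ["d1", "d2", "d3"]   # semantic retrieval, best first
bm25_hits = ["d3", "d1", "d4"]     # exact-match retrieval, best first
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

&lt;p&gt;Because RRF uses ranks only, it needs no score normalization between the vector and BM25 paths.&lt;/p&gt;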




&lt;h2&gt;
  
  
  The Optimization Priority Stack
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;First (biggest impact):
  - Embedding model language-appropriate?
  - Chunk size reasonable (256-768)?
  - Deduplicating?

Second:
  - Query rewriting
  - Cross-Encoder reranking
  - Metadata filtering

Third:
  - Multi-path recall + RRF
  - HyDE for short queries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How to Know If You Fixed It
&lt;/h2&gt;

&lt;p&gt;Don't guess. Measure. Build an evaluation set of about 50 (query, ground_truth) pairs, then track Recall@10 and MRR averaged across all queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;recall_at_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ground_truth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ground_truth&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mrr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ground_truth&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ground_truth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
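&lt;p&gt;Both functions score a single query, so the reported numbers are their averages over the whole evaluation set. A minimal sketch of that loop follows. The &lt;code&gt;Hit&lt;/code&gt; type, the &lt;code&gt;evaluate&lt;/code&gt; helper, and the fake retriever are hypothetical stand-ins for your own retriever and result objects.&lt;/p&gt;

```python
from collections import namedtuple

Hit = namedtuple("Hit", ["id"])  # stand-in for a retriever result with an .id attribute


def recall_at_k(results, ground_truth, k=10):
    """1 if the ground-truth document is in the top k results, else 0."""
    return int(ground_truth in [r.id for r in results[:k]])


def reciprocal_rank(results, ground_truth):
    """1 / rank of the ground-truth document, or 0.0 if it was not retrieved."""
    for i, r in enumerate(results):
        if r.id == ground_truth:
            return 1.0 / (i + 1)
    return 0.0


def evaluate(retrieve, eval_set, k=10):
    """Average both per-query metrics over (query, ground_truth) pairs."""
    recalls, rrs = [], []
    for query, truth in eval_set:
        results = retrieve(query)
        recalls.append(recall_at_k(results, truth, k))
        rrs.append(reciprocal_rank(results, truth))
    n = len(eval_set)
    return {"recall@k": sum(recalls) / n, "mrr": sum(rrs) / n}


def fake_retrieve(query):
    """Hypothetical retriever that always returns the same three hits."""
    return [Hit("doc1"), Hit("doc2"), Hit("doc3")]


report = evaluate(fake_retrieve, [("q1", "doc1"), ("q2", "doc3"), ("q3", "doc9")])
print(report)
```

&lt;p&gt;Run the same evaluation set before and after each change, so every gain or regression is attributable to one decision.&lt;/p&gt;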






&lt;p&gt;&lt;em&gt;If this saved you an afternoon of debugging, give it a unicorn and share it with someone who's still tweaking chunk sizes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>tutorial</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
