<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ai developer</title>
    <description>The latest articles on DEV Community by Ai developer (@__2ddbae6bb7d).</description>
    <link>https://dev.to/__2ddbae6bb7d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3957324%2F72f42b43-34d9-4f58-8d91-d26e09bffd69.jpeg</url>
      <title>DEV Community: Ai developer</title>
      <link>https://dev.to/__2ddbae6bb7d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/__2ddbae6bb7d"/>
    <language>en</language>
    <item>
      <title>One Ruler to Measure Them All: How Language Affects LLM Quality</title>
      <dc:creator>Ai developer</dc:creator>
      <pubDate>Fri, 29 May 2026 09:00:18 +0000</pubDate>
      <link>https://dev.to/__2ddbae6bb7d/one-ruler-to-measure-them-all-how-language-affects-llm-quality-2pc6</link>
      <guid>https://dev.to/__2ddbae6bb7d/one-ruler-to-measure-them-all-how-language-affects-llm-quality-2pc6</guid>
      <description>&lt;h1&gt;
  
  
  One Ruler to Measure Them All: How Language Affects LLM Quality
&lt;/h1&gt;

&lt;p&gt;Most discussions about LLM performance focus on the model architecture and prompting. But there's a hidden factor: the tokenizer. It determines how much of your text fits in the context window.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tokenizer Problem
&lt;/h2&gt;

&lt;p&gt;Russian text consumes more tokens than English for the same information density. Some developers even switch to English prompts to save tokens and improve performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Surprising Result
&lt;/h2&gt;

&lt;p&gt;A recent arxiv study benchmarked multilingual long-context language models across different languages. The winner? &lt;strong&gt;Polish&lt;/strong&gt; — 88% accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Russian placed 5th at 84%&lt;/strong&gt; — ahead of English at 83.9%.&lt;/p&gt;

&lt;p&gt;The gap widens on long-context tasks. More tokens = more opportunities for the model to lose coherence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Caveat
&lt;/h2&gt;

&lt;p&gt;The test used "weaker" models by 2026 standards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini 1.5 Flash&lt;/li&gt;
&lt;li&gt;Qwen 2.5 72B&lt;/li&gt;
&lt;li&gt;Other mid-tier models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Top-tier models might show different patterns, but the tokenizer effect persists regardless of model quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications for Production
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Language choice matters for RAG.&lt;/strong&gt; If your knowledge base is multilingual, retrieval quality varies by language.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-context tasks favor compact languages.&lt;/strong&gt; English is more token-efficient than Russian, but Polish outperformed both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokenizer-agnostic metrics are needed.&lt;/strong&gt; BLEU and ROUGE don't capture tokenization bias.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What I'm Tracking
&lt;/h2&gt;

&lt;p&gt;I'm monitoring whether newer models (Kimi k2.5, GLM-5, GPT-5.2 series) show the same pattern. Early signs suggest top-tier models compress better across languages, but the gap doesn't fully disappear.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More multilingual LLM analysis and production AI notes from inside a bank — follow my Telegram channel:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/strong&gt; (Russian, technical)&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More AI engineering notes, RAG benchmarks, and production insights from inside a bank — follow my Telegram channel:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;&lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/strong&gt; (Russian, technical)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>One Ruler to Measure Them All: How Language Affects LLM Quality</title>
      <dc:creator>Ai developer</dc:creator>
      <pubDate>Fri, 29 May 2026 06:01:46 +0000</pubDate>
      <link>https://dev.to/__2ddbae6bb7d/one-ruler-to-measure-them-all-how-language-affects-llm-quality-5f54</link>
      <guid>https://dev.to/__2ddbae6bb7d/one-ruler-to-measure-them-all-how-language-affects-llm-quality-5f54</guid>
      <description>&lt;h1&gt;
  
  
  One Ruler to Measure Them All: How Language Affects LLM Quality
&lt;/h1&gt;

&lt;p&gt;Most discussions about LLM performance focus on the model architecture and prompting. But there's a hidden factor: the tokenizer. It determines how much of your text fits in the context window.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tokenizer Problem
&lt;/h2&gt;

&lt;p&gt;Russian text consumes more tokens than English for the same information density. Some developers even switch to English prompts to save tokens and improve performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Surprising Result
&lt;/h2&gt;

&lt;p&gt;A recent arxiv study benchmarked multilingual long-context language models across different languages. The winner? &lt;strong&gt;Polish&lt;/strong&gt; — 88% accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Russian placed 5th at 84%&lt;/strong&gt; — ahead of English at 83.9%.&lt;/p&gt;

&lt;p&gt;The gap widens on long-context tasks. More tokens = more opportunities for the model to lose coherence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Caveat
&lt;/h2&gt;

&lt;p&gt;The test used "weaker" models by 2026 standards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini 1.5 Flash&lt;/li&gt;
&lt;li&gt;Qwen 2.5 72B&lt;/li&gt;
&lt;li&gt;Other mid-tier models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Top-tier models might show different patterns, but the tokenizer effect persists regardless of model quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications for Production
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Language choice matters for RAG.&lt;/strong&gt; If your knowledge base is multilingual, retrieval quality varies by language.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-context tasks favor compact languages.&lt;/strong&gt; English is more token-efficient than Russian, but Polish outperformed both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokenizer-agnostic metrics are needed.&lt;/strong&gt; BLEU and ROUGE don't capture tokenization bias.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What I'm Tracking
&lt;/h2&gt;

&lt;p&gt;I'm monitoring whether newer models (Kimi k2.5, GLM-5, GPT-5.2 series) show the same pattern. Early signs suggest top-tier models compress better across languages, but the gap doesn't fully disappear.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More multilingual LLM analysis and production AI notes from inside a bank — follow my Telegram channel:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/strong&gt; (Russian, technical)&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More AI engineering notes, RAG benchmarks, and production insights from inside a bank — follow my Telegram channel:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;&lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/strong&gt; (Russian, technical)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>AI Conf 2026: Classic ML Is Dead, Everyone's Building Agents</title>
      <dc:creator>Ai developer</dc:creator>
      <pubDate>Fri, 29 May 2026 06:00:17 +0000</pubDate>
      <link>https://dev.to/__2ddbae6bb7d/ai-conf-2026-classic-ml-is-dead-everyones-building-agents-og6</link>
      <guid>https://dev.to/__2ddbae6bb7d/ai-conf-2026-classic-ml-is-dead-everyones-building-agents-og6</guid>
      <description>&lt;h1&gt;
  
  
  AI Conf 2026: Classic ML Is Dead, Everyone's Building Agents
&lt;/h1&gt;

&lt;p&gt;Spent two days at AI Conf in Moscow. The shift is complete: nobody talks about traditional ML anymore. It's all agents, RAG, and voice systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Academic Publication Pipeline Is Slow
&lt;/h2&gt;

&lt;p&gt;Average time from submission to publication at A-tier conferences: &lt;strong&gt;9 months.&lt;/strong&gt; Multiple review cycles, sequential improvements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What researchers actually use LLMs for now:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code generation&lt;/li&gt;
&lt;li&gt;Paper review assistance&lt;/li&gt;
&lt;li&gt;Literature synthesis&lt;/li&gt;
&lt;li&gt;(Not for original ideas — tried "let it think for 2 weeks," expensive and ineffective)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prediction:&lt;/strong&gt; Future papers will include zip archives of experimental code that AI can verify. Human value shifts to idea generation, not implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Search Agents Workshop
&lt;/h2&gt;

&lt;p&gt;Built a working ReAct search agent in the workshop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Groq API&lt;/strong&gt; — free tier, fast inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tavily&lt;/strong&gt; — 1000 free search queries/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Langfuse&lt;/strong&gt; monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stack cost: $0 for prototyping. Production cost: depends on scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Monitoring: Langfuse vs Arize Phoenix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Langfuse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual integration, detailed traces&lt;/td&gt;
&lt;td&gt;Custom setups, granular control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Arize Phoenix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto-instrumentation, wraps everything&lt;/td&gt;
&lt;td&gt;Quick setup, less configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both show traces, token counts, latency breakdowns. Phoenix wins if you want observability without wiring it yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Agent Harness vs Classic Agents
&lt;/h2&gt;

&lt;p&gt;The terminology evolved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2024:&lt;/strong&gt; "What's the difference between LLM and agent?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2026:&lt;/strong&gt; &lt;strong&gt;Agent Harness&lt;/strong&gt; — memory + skills instead of tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: Deep Agents framework. Skill creation costs 2M tokens. Single invocation: 100K tokens. But the abstraction is cleaner than manual tool orchestration.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Voice Agents for Telephony
&lt;/h2&gt;

&lt;p&gt;Voice-to-voice models exist but lack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool use integration&lt;/li&gt;
&lt;li&gt;Context management&lt;/li&gt;
&lt;li&gt;Reliability for long conversations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Current production stack:&lt;/strong&gt; Speech-to-Text → LLM → Text-to-Speech&lt;/p&gt;

&lt;p&gt;Voice-to-voice will replace this eventually, but not before tool calling and context compression catch up.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Didn't Hear
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Gradient boosting use cases&lt;/li&gt;
&lt;li&gt;Feature engineering debates&lt;/li&gt;
&lt;li&gt;Model interpretability discussions (except for RAG context windows)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The industry moved on. If you're still pitching Random Forest improvements, you're talking to the wrong audience.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Take
&lt;/h2&gt;

&lt;p&gt;The conference confirmed what I see in production: &lt;strong&gt;agent orchestration is the new infrastructure layer.&lt;/strong&gt; Not the models themselves — how you connect them, manage memory, route between skills, and monitor everything.&lt;/p&gt;

&lt;p&gt;The companies winning aren't those with the best single model. They're those with the best agent architecture.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More production AI insights and conference notes from a bank's DS lead — follow my Telegram channel:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/strong&gt; (Russian, technical)&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More AI engineering notes, RAG benchmarks, and production insights from inside a bank — follow my Telegram channel:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;&lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/strong&gt; (Russian, technical)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>I Self-Hosted an AI Assistant: Lessons from 48 Hours of Debugging</title>
      <dc:creator>Ai developer</dc:creator>
      <pubDate>Thu, 28 May 2026 21:39:25 +0000</pubDate>
      <link>https://dev.to/__2ddbae6bb7d/i-self-hosted-an-ai-assistant-lessons-from-48-hours-of-debugging-4okc</link>
      <guid>https://dev.to/__2ddbae6bb7d/i-self-hosted-an-ai-assistant-lessons-from-48-hours-of-debugging-4okc</guid>
      <description>&lt;h1&gt;
  
  
  I Self-Hosted an AI Assistant: Lessons from 48 Hours of Debugging
&lt;/h1&gt;

&lt;p&gt;I wanted a local AI assistant. Expected: 2 hours. Reality: 2 days of edge cases, broken dependencies, and discovering that "local" doesn't mean "free."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw&lt;/strong&gt; (open-source AI assistant framework)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPS with limited console access&lt;/strong&gt; (had to file tickets to enable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenRouter&lt;/strong&gt; for model access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Qwen&lt;/strong&gt; as fallback&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Broke
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Dependency Hell
&lt;/h3&gt;

&lt;p&gt;Pre-installed OpenClaw came with an outdated library. Updated manually. Then updated again. OpenRouter integration only worked after the second update.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Certificate Issues
&lt;/h3&gt;

&lt;p&gt;Self-hosted means self-managed certificates. Let's Encrypt, reverse proxy, CORS headers. Each layer adds a new failure mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. "Free" API Credits Aren't
&lt;/h3&gt;

&lt;p&gt;OpenRouter's "free" models have limits. Hit them within hours. The API key died silently — no error message, just empty responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Local Model Reality Check
&lt;/h3&gt;

&lt;p&gt;Qwen promised tool-use support. Reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Absolute paths broke tool calling (relative only)&lt;/li&gt;
&lt;li&gt;Model experienced "amnesia" — couldn't open .md files it created&lt;/li&gt;
&lt;li&gt;Larger models need more RAM but run slower&lt;/li&gt;
&lt;li&gt;200K context window sounds great until you hit memory limits&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. The Debugging Cascade
&lt;/h3&gt;

&lt;p&gt;Fix one thing → break another. Add skills for email and search. DuckDuckGo API rate-limits kill the search skill. Switch to alternative. New limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Worked
&lt;/h2&gt;

&lt;p&gt;Despite everything, the assistant is now running. Key insight:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Boxed solutions (Kimi, GLM native APIs) are more reliable.&lt;/strong&gt; But self-hosting teaches you how the pieces actually connect — tool calling, memory management, model routing, context windows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;th&gt;Actual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;2 hours&lt;/td&gt;
&lt;td&gt;2 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API costs&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$20+ before limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;16GB+ RAM for usable local models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Ongoing dependency updates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Should You Self-Host?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Yes if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to understand LLM infrastructure deeply&lt;/li&gt;
&lt;li&gt;Data privacy is non-negotiable&lt;/li&gt;
&lt;li&gt;You enjoy debugging more than using&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;No if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need reliability today&lt;/li&gt;
&lt;li&gt;Your time has a cost&lt;/li&gt;
&lt;li&gt;You're not ready to file support tickets for console access&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm keeping the local setup as a learning environment but routing production tasks to managed APIs. The hybrid approach: local for experimentation, cloud for reliability.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More self-hosting experiments and production AI infrastructure notes — follow my Telegram channel:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/strong&gt; (Russian, technical)&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More AI engineering notes, RAG benchmarks, and production insights from inside a bank — follow my Telegram channel:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;&lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/strong&gt; (Russian, technical)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>nlp</category>
    </item>
    <item>
      <title>RAG SOTA: I Tested 7 Pipelines and Built SEQUOIA (Open Source)</title>
      <dc:creator>Ai developer</dc:creator>
      <pubDate>Thu, 28 May 2026 21:35:37 +0000</pubDate>
      <link>https://dev.to/__2ddbae6bb7d/rag-sota-i-tested-7-pipelines-and-built-sequoia-open-source-223o</link>
      <guid>https://dev.to/__2ddbae6bb7d/rag-sota-i-tested-7-pipelines-and-built-sequoia-open-source-223o</guid>
      <description>&lt;h1&gt;
  
  
  RAG SOTA: I Tested 7 Pipelines and Built SEQUOIA (Open Source)
&lt;/h1&gt;

&lt;p&gt;After 20+ hours of compute time on local hardware, I benchmarked 7 RAG configurations against real-world tasks. SEQUOIA (RAPTOR tree + step-back prompting) consistently outperformed alternatives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Pipeline List
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Core Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No-RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct LLM generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Classical RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dense retrieval (BGE-small + FAISS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hybrid RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BM25 + Dense + RRF + reranker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LightRAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Key-value graph + dense hybrid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PageIndex&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Two-stage hierarchical retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GraphRAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Entity graph + dense fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agentic RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-step reasoning pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SEQUOIA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;RAPTOR tree + step-back prompting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SEQUOIA Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-query + rerank + compression&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why LightRAG Underperformed
&lt;/h2&gt;

&lt;p&gt;The hype suggested graph-based RAG would revolutionize retrieval. On real banking documents and technical manuals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Graph construction is expensive (entity extraction, relationship mapping)&lt;/li&gt;
&lt;li&gt;Retrieval quality did not justify the overhead&lt;/li&gt;
&lt;li&gt;Academic benchmarks do not equal production reality&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why RAPTOR Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Recursive Abstractive Processing for Tree-Organized Retrieval:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cluster leaf nodes (individual chunks)&lt;/li&gt;
&lt;li&gt;Summarize upward (hierarchical abstraction)&lt;/li&gt;
&lt;li&gt;Retrieve at multiple levels (specific details + high-level context)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This mirrors how humans organize knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-Back Prompting: Free Performance
&lt;/h2&gt;

&lt;p&gt;Before retrieving, generalize the query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User asks: "What's the error rate for Q3?"&lt;/li&gt;
&lt;li&gt;Step-back: "What metrics are tracked quarterly?"&lt;/li&gt;
&lt;li&gt;Retrieve broader context first, then narrow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: ~15% improvement in recall. Zero latency cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  SEQUOIA Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    Step-back Prompting (generalize)
    RAPTOR Tree Retrieval (multi-level)
    Context Compression (summarize long contexts)
    Re-ranking (cross-encoder)
    Local LLM Generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Local LLM Evaluation
&lt;/h2&gt;

&lt;p&gt;I used a local model weaker than GPT-4 for judging. Key finding: relative rankings between methods stayed consistent even with a weaker evaluator.&lt;/p&gt;

&lt;p&gt;You can prototype and compare approaches without burning API credits on GPT-4 evaluations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Recommendations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Start with Classical RAG — establish baseline, prove value&lt;/li&gt;
&lt;li&gt;Add step-back prompting — free performance gain&lt;/li&gt;
&lt;li&gt;Move to hierarchical retrieval when context complexity justifies it&lt;/li&gt;
&lt;li&gt;Avoid graph approaches unless you have specific graph-structured data&lt;/li&gt;
&lt;li&gt;Measure on YOUR data — academic benchmarks are misleading&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Open Source
&lt;/h2&gt;

&lt;p&gt;Everything is available:&lt;br&gt;
&lt;a href="https://github.com/Diyago/rag-benchmark/tree/main" rel="noopener noreferrer"&gt;https://github.com/Diyago/rag-benchmark/tree/main&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Includes all implementations, evaluation dataset (anonymized), and analysis notebooks.&lt;/p&gt;

&lt;h2&gt;
  
  
  More AI Engineering Notes
&lt;/h2&gt;

&lt;p&gt;I write about practical AI/ML from inside a bank — RAG systems, LLM deployment, team management, and what actually works versus what is just hype.&lt;/p&gt;

&lt;p&gt;Telegram channel (Russian, technical): &lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Have you benchmarked RAG on real data? What surprised you?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More AI engineering notes, RAG benchmarks, and production insights from inside a bank — follow my Telegram channel:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;&lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/strong&gt; (Russian, technical)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>Open Source LLM Spring 2026: What Changed in 2 Months</title>
      <dc:creator>Ai developer</dc:creator>
      <pubDate>Thu, 28 May 2026 21:25:20 +0000</pubDate>
      <link>https://dev.to/__2ddbae6bb7d/open-source-llm-spring-2026-what-changed-in-2-months-23c</link>
      <guid>https://dev.to/__2ddbae6bb7d/open-source-llm-spring-2026-what-changed-in-2-months-23c</guid>
      <description>&lt;h1&gt;
  
  
  Open Source LLM Spring 2026: What Changed in 2 Months
&lt;/h1&gt;

&lt;p&gt;After tracking open-weight LLM releases for the past two months, here's what's actually moving the needle. Not hype — architecture and data decisions that matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Sliding Window Attention Goes Mainstream
&lt;/h2&gt;

&lt;p&gt;Almost everyone switched to SWA. Context windows growing substantially without model bloat. The exception: MiniMax M2.5 still uses GQA (Grouped-Query Attention) but compensates purely through data quality on coding tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; You can now fit 200K+ context in models that previously handled 32K. Same parameter count, different attention mechanism.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. QK-Norm Spreading
&lt;/h2&gt;

&lt;p&gt;QK-Norm (query-key normalization) is emerging as an RMSNorm analogue. Traces back to Gemini 3 architecture. Stabilizes training at scale without adding compute.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Multimodal Pretraining Early
&lt;/h2&gt;

&lt;p&gt;Kimi k2.5 showed that pretraining on images at early stages (not just late-stage fine-tuning) significantly helps reasoning. The model learns visual concepts before language alignment, making downstream multimodal tasks more robust.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. GLM-5 (Z.ai) — Not a DeepSeek Clone
&lt;/h2&gt;

&lt;p&gt;On release, GLM-5 matched GPT-5.2 / Opus 4.5 / Gemini 3 Pro on key benchmarks. What's inside: heavily modified DeepSeek-V2 architecture with changed parameters, especially active expert count in the MoE layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key difference:&lt;/strong&gt; It's not "DeepSeek with a new name" — the routing and expert allocation is fundamentally redesigned.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Step 3.5 Flash — Efficiency King
&lt;/h2&gt;

&lt;p&gt;196B parameter MoE, architecturally similar to DeepSeek but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3x faster inference speed&lt;/li&gt;
&lt;li&gt;Multi-Token Prediction (generates 3 additional tokens per step instead of 1)&lt;/li&gt;
&lt;li&gt;Currently #2 on OpenRouter by token consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; Top benchmarks but Chatbot Arena tells a different story. Benchmarks ≠ real user preference.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;The pricing. Z.ai raised GLM-5 subscription prices 2x immediately — up to $160/month for max tier. Open-source models aren't free to run at scale, and providers are pricing accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Implications
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Watch Out&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5&lt;/td&gt;
&lt;td&gt;General reasoning, coding&lt;/td&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step 3.5 Flash&lt;/td&gt;
&lt;td&gt;High-throughput APIs&lt;/td&gt;
&lt;td&gt;Arena scores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;Coding tasks&lt;/td&gt;
&lt;td&gt;GQA limits context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi k2.5&lt;/td&gt;
&lt;td&gt;Multimodal apps&lt;/td&gt;
&lt;td&gt;Early pretraining specifics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I'm Watching Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Streaming KV-cache compression for longer contexts&lt;/li&gt;
&lt;li&gt;Whether anyone replicates the Kimi early-multimodal pretraining approach&lt;/li&gt;
&lt;li&gt;If Step 3.5's multi-token prediction becomes standard&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;More architecture deep-dives and production AI notes from inside a bank — follow my Telegram channel:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/strong&gt; (Russian, technical)&lt;/p&gt;




&lt;p&gt;&lt;em&gt;More AI engineering notes, RAG benchmarks, and production insights from inside a bank — follow my Telegram channel:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;&lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/strong&gt; (Russian, technical)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>RAG SOTA: I Tested 7 Pipelines and Built SEQUOIA (Open Source)</title>
      <dc:creator>Ai developer</dc:creator>
      <pubDate>Thu, 28 May 2026 21:13:57 +0000</pubDate>
      <link>https://dev.to/__2ddbae6bb7d/--5cec</link>
      <guid>https://dev.to/__2ddbae6bb7d/--5cec</guid>
      <description>&lt;h1&gt;
  
  
  RAG SOTA: I Tested 7 Pipelines and Built SEQUOIA (Open Source)
&lt;/h1&gt;

&lt;p&gt;After 20+ hours of compute time on local hardware, I benchmarked 7 RAG configurations against real-world tasks. The results surprised me — and changed how I think about retrieval architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;RAG is everywhere in 2026. Everyone claims their pipeline is "SOTA," but most benchmarks use toy datasets. I wanted to see what actually works when you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Messy real documents (not clean academic corpora)&lt;/li&gt;
&lt;li&gt;A local LLM (slightly weaker than GPT-4)&lt;/li&gt;
&lt;li&gt;Production constraints (latency, cost, accuracy tradeoffs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The 7 Configurations Tested
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No-RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct LLM generation&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Classical RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dense retrieval (BGE-small + FAISS)&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hybrid RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BM25 + Dense + RRF fusion + cross-encoder reranker&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LightRAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Key-value extraction graph + dense hybrid&lt;/td&gt;
&lt;td&gt;Disappointing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PageIndex&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Two-stage hierarchical retrieval&lt;/td&gt;
&lt;td&gt;Okay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GraphRAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Entity graph + dense fallback&lt;/td&gt;
&lt;td&gt;Complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agentic RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-step reasoning pipeline&lt;/td&gt;
&lt;td&gt;Slow, expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SEQUOIA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;RAPTOR tree + step-back prompting&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Best&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SEQUOIA Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-query + rerank + compression&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SOTA&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LightRAG underperformed
&lt;/h3&gt;

&lt;p&gt;The Twitter-hyped "graph RAG revolution" didn't hold up on real data. LightRAG produced what I call "procedural warming" — it looks sophisticated but retrieval quality was mediocre. Academic benchmarks ≠ production reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-back prompting is underrated
&lt;/h3&gt;

&lt;p&gt;Most RAG systems fail because they retrieve on the literal query. Step-back prompting (rewriting the query into a more general form before retrieval) improved recall by ~15% across the board. Combined with RAPTOR tree clustering, it creates a retrieval hierarchy that actually makes sense.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local LLMs can evaluate
&lt;/h3&gt;

&lt;p&gt;I used a local model for summarization and judging. Slightly weaker than GPT-4, yes, but the &lt;em&gt;relative&lt;/em&gt; rankings between methods stayed consistent. This means you can prototype and benchmark without burning API credits.&lt;/p&gt;

&lt;h2&gt;
  
  
  SEQUOIA Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    ↓
Step-back Prompting (generalize)
    ↓
RAPTOR Tree Retrieval (hierarchical clusters)
    ↓
Rerank + Context Compression
    ↓
Local LLM Generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RAPTOR&lt;/strong&gt; = Recursive Abstractive Processing for Tree-Organized Retrieval. Cluster leaf nodes, summarize upward, retrieve at multiple levels of abstraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-back&lt;/strong&gt; = Before searching, ask: "What is the general principle behind this specific question?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;On my test set (banking documents, technical manuals, internal wikis):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classical RAG&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;td&gt;0.58&lt;/td&gt;
&lt;td&gt;120ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid RAG&lt;/td&gt;
&lt;td&gt;0.71&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;340ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LightRAG&lt;/td&gt;
&lt;td&gt;0.59&lt;/td&gt;
&lt;td&gt;0.61&lt;/td&gt;
&lt;td&gt;890ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SEQUOIA&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.84&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.79&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;450ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SEQUOIA Pro&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.87&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.82&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;680ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SEQUOIA Pro trades some latency for accuracy. SEQUOIA (basic) is the sweet spot for production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code &amp;amp; Reproducibility
&lt;/h2&gt;

&lt;p&gt;Everything is open source:&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/Diyago/rag-benchmark/tree/main" rel="noopener noreferrer"&gt;github.com/Diyago/rag-benchmark&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All 7 implementations&lt;/li&gt;
&lt;li&gt;Evaluation dataset (anonymized)&lt;/li&gt;
&lt;li&gt;Configs for local LLM setup&lt;/li&gt;
&lt;li&gt;Notebooks for analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons for Production
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't trust academic benchmarks blindly.&lt;/strong&gt; Test on YOUR data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical retrieval beats flat.&lt;/strong&gt; RAPTOR's tree structure matches how humans actually organize knowledge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query rewriting is free performance.&lt;/strong&gt; Step-back prompting costs nothing in latency but improves retrieval significantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local evaluation is viable.&lt;/strong&gt; You don't need GPT-4 to compare methods relatively.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm extending SEQUOIA with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-modal retrieval (images + text)&lt;/li&gt;
&lt;li&gt;Streaming context compression&lt;/li&gt;
&lt;li&gt;Adaptive depth (shallow for simple queries, deep for complex)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  More AI Engineering Notes
&lt;/h2&gt;

&lt;p&gt;I write about practical AI/ML from inside a bank — RAG systems, LLM deployment, team management, and what actually works vs. what's just hype.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Telegram channel (Russian, technical):&lt;/strong&gt; &lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;AI.Insaf&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you benchmarked RAG on real data? What surprised you? Drop a comment or reach out on Telegram.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Эта статья также опубликована в Telegram-канале &lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;AI.Insaf&lt;/a&gt; — про AI/ML из банковской практики, бенчмарки и управление DS-командами.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Подписывайтесь на канал для оперативных разборов и практических кейсов:&lt;/strong&gt; &lt;a href="https://t.me/ai_tablet" rel="noopener noreferrer"&gt;https://t.me/ai_tablet&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>ai</category>
      <category>career</category>
    </item>
  </channel>
</rss>
