<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrew</title>
    <description>The latest articles on DEV Community by Andrew (@andrew-ooo).</description>
    <link>https://dev.to/andrew-ooo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3775252%2Ff6bbe8a2-ee0c-41f7-9468-c85f0b00ca95.png</url>
      <title>DEV Community: Andrew</title>
      <link>https://dev.to/andrew-ooo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/andrew-ooo"/>
    <language>en</language>
    <item>
      <title>LiteParse v2 Review: LlamaIndex's Rust PDF Parser for Agents</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Mon, 01 Jun 2026 11:48:18 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/liteparse-v2-review-llamaindexs-rust-pdf-parser-for-agents-3bc4</link>
      <guid>https://dev.to/andrew-ooo/liteparse-v2-review-llamaindexs-rust-pdf-parser-for-agents-3bc4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/liteparse-v2-llamaindex-rust-document-parser-review/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LiteParse v2&lt;/strong&gt; is LlamaIndex's June 2026 rewrite of their open-source document parser — the same spatial-layout extraction core that powers their LlamaParse cloud product, but now written entirely in Rust and shipped as native packages for Python, Node.js/TypeScript, and the browser via WASM. It's positioned as the answer to a real, specific problem: agents need to read PDFs &lt;em&gt;fast&lt;/em&gt; during a reasoning loop, and existing tools either choke on layout (pypdf, pdfplumber) or block on a VLM call (Docling, LlamaParse cloud).&lt;/p&gt;

&lt;p&gt;Key facts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rust core&lt;/strong&gt; built on top of the PDFium C library for native text extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language bindings&lt;/strong&gt;: Rust crate, Python (&lt;code&gt;pip install liteparse&lt;/code&gt;), Node.js (&lt;code&gt;@llamaindex/liteparse&lt;/code&gt;), browser WASM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One CLI&lt;/strong&gt; (&lt;code&gt;lit&lt;/code&gt;) that ships with every package — same flags whether you installed via cargo, npm, or pip&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Formats&lt;/strong&gt;: PDF natively; DOCX/XLSX/PPTX via LibreOffice; PNG/JPG/TIFF via ImageMagick&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective OCR&lt;/strong&gt;: bundled Tesseract.js, or plug in PaddleOCR/EasyOCR HTTP servers for higher accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spatial output&lt;/strong&gt;: text + bounding boxes, layout-preserved plain text, or rendered PNG page screenshots for multimodal LLM follow-up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8,557 GitHub stars&lt;/strong&gt;, &lt;strong&gt;~3,006 stars this week&lt;/strong&gt; — currently trending on the GitHub Rust board&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache-2.0&lt;/strong&gt; license, zero cloud calls, zero Python dependencies at the core&lt;/li&gt;
&lt;li&gt;The README claims &lt;strong&gt;up to 100x faster&lt;/strong&gt; than the v1 Python implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been writing one-off &lt;code&gt;pypdf&lt;/code&gt; + regex hacks every time an agent needs to skim a PDF, LiteParse is the first OSS parser that's clearly designed for agents first and humans second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install (pick one — they all give you the same `lit` CLI)&lt;/span&gt;
npm i &lt;span class="nt"&gt;-g&lt;/span&gt; @llamaindex/liteparse
pip &lt;span class="nb"&gt;install &lt;/span&gt;liteparse
cargo &lt;span class="nb"&gt;install &lt;/span&gt;liteparse

&lt;span class="c"&gt;# Parse a PDF, write layout-preserved text&lt;/span&gt;
lit parse report.pdf &lt;span class="nt"&gt;-o&lt;/span&gt; report.txt

&lt;span class="c"&gt;# Structured JSON with bounding boxes&lt;/span&gt;
lit parse report.pdf &lt;span class="nt"&gt;--format&lt;/span&gt; json &lt;span class="nt"&gt;-o&lt;/span&gt; report.json

&lt;span class="c"&gt;# Just pages 1–5 and page 10&lt;/span&gt;
lit parse report.pdf &lt;span class="nt"&gt;--target-pages&lt;/span&gt; &lt;span class="s2"&gt;"1-5,10"&lt;/span&gt;

&lt;span class="c"&gt;# Generate page screenshots for visual reasoning&lt;/span&gt;
lit screenshot report.pdf &lt;span class="nt"&gt;-o&lt;/span&gt; ./screens &lt;span class="nt"&gt;--pages&lt;/span&gt; &lt;span class="s2"&gt;"1-3"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why LiteParse Exists
&lt;/h2&gt;

&lt;p&gt;The pitch from LlamaIndex's &lt;a href="https://www.llamaindex.ai/blog/liteparse-local-document-parsing-for-ai-agents" rel="noopener noreferrer"&gt;launch post&lt;/a&gt; is unusually specific. They've spent years building &lt;a href="https://cloud.llamaindex.ai/" rel="noopener noreferrer"&gt;LlamaParse&lt;/a&gt; into a production document-intelligence cloud service, and along the way noticed that &lt;em&gt;most&lt;/em&gt; of the time, agents don't need the heavyweight VLM pipeline. They need text — quickly — to decide their next move.&lt;/p&gt;

&lt;p&gt;The current landscape forces a bad choice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast but inaccurate&lt;/strong&gt; — pypdf, pdfminer, Markitdown will hand you a string of text in milliseconds but mangle tables, lose column boundaries, and silently skip scanned pages with no OCR fallback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accurate but slow&lt;/strong&gt; — Docling, MarkItDown-pro, LlamaParse cloud all run a vision model. Quality is great, but a 50-page PDF can take 30–120 seconds, which is forever inside an agent loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local but ugly&lt;/strong&gt; — Tesseract alone, no spatial reconstruction, no screenshot fallback, no agent-friendly API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LiteParse picks a deliberate middle: native text extraction from PDFium with grid-based spatial projection (so columns and tables survive), selective OCR for scanned pages, and a screenshot command for the moments when the agent decides it wants to look at the page itself. The whole thing is a CLI subprocess away from any agent framework, no API key, no network.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"LiteParse is for coding agents and real-time pipelines where speed, simplicity, and local execution matter. It's the core processing from LlamaParse, open-sourced."&lt;/em&gt; — Logan Markewich, LlamaIndex&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;The smartest design choice is that the CLI is identical across runtimes. Pick whatever your project already has:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node.js / TypeScript:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; @llamaindex/liteparse
&lt;span class="c"&gt;# or as a library&lt;/span&gt;
npm i @llamaindex/liteparse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;liteparse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rust:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# CLI&lt;/span&gt;
cargo &lt;span class="nb"&gt;install &lt;/span&gt;liteparse

&lt;span class="c"&gt;# As a library&lt;/span&gt;
cargo add liteparse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Browser (WASM):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i @llamaindex/liteparse-wasm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first run will download the PDFium native library (≈10 MB) and the Tesseract.js OCR data files (~22 MB for English; more per added language). After that everything is offline.&lt;/p&gt;

&lt;p&gt;A nice detail: the Python package ships as a thin wrapper over the same Rust binary, so Python, Node, Rust, and the CLI all hit the same code path. Same trick &lt;code&gt;ruff&lt;/code&gt; and &lt;code&gt;uv&lt;/code&gt; use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Code Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Node.js / TypeScript
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;LiteParse&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@llamaindex/liteparse&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LiteParse&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;ocrEnabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./contract.pdf&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;          &lt;span class="c1"&gt;// layout-preserved plain text&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;bbox&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// bounding boxes per page&lt;/span&gt;

&lt;span class="c1"&gt;// Render specific pages to PNG for VLM follow-up&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;screenshots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./contract.pdf&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;outputDir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./screens&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;liteparse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LiteParse&lt;/span&gt;

&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LiteParse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ocr_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Plain text, layout preserved
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Per-page structured data with bounding boxes
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bbox&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inside an Agent Loop
&lt;/h3&gt;

&lt;p&gt;This is the pattern LlamaIndex actually built it for — text-first, screenshot-fallback. Pseudocode for any agent framework (LangGraph, LlamaIndex agents, OpenClaw, raw OpenAI tool-calling):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Fast pass: try text only
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LiteParse&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;model_can_answer_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="c1"&gt;# Slow path: render screenshots, send to a VLM
&lt;/span&gt;    &lt;span class="n"&gt;screenshots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LiteParse&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;vlm_describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshots&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The text pass on a 50-page PDF takes ~1.5 seconds on a 2023 MacBook Pro. That's a number you can put inside an agent's reasoning step without thinking about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layout Preservation in Practice
&lt;/h3&gt;

&lt;p&gt;The key idea is &lt;em&gt;"preserve layout rather than detect structure."&lt;/em&gt; Most parsers try to recognize "this is a table" and convert it to markdown — which adds failure modes. LiteParse projects text onto a spatial grid and emits something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name        Age     City
John        25      NYC
Jane        30      LA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Modern LLMs already read ASCII tables, code indentation, and READMEs natively. Skipping the table-detection step makes the parser faster, simpler, and — counterintuitively — often more accurate for downstream LLM reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture, Briefly
&lt;/h2&gt;

&lt;p&gt;The pipeline is short: PDFs pass through directly; DOCX/XLSX/PPTX convert via LibreOffice; images via ImageMagick — everything becomes a PDF internally. PDFium pulls native text and positions in one pass; pages without an extractable text layer get routed to Tesseract (or your configured external OCR server); results merge with positions preserved and project onto a 2D grid that reconstructs columns and tables as ASCII. Output is JSON with bounding boxes, plain text, or PNG screenshots. No model weights, no GPU, no Python interpreter for the Python package.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community Reactions
&lt;/h2&gt;

&lt;p&gt;The reception has been notably warm for an LLM-adjacent tool launch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://news.ycombinator.com/item?id=47436039" rel="noopener noreferrer"&gt;Show HN post from March 2026&lt;/a&gt; for LiteParse v1 stayed on the front page most of the day, with commenters comparing it favorably to pypdf and Markitdown for agent use cases. The recurring sentiment: &lt;em&gt;"finally a PDF parser that doesn't need to be a microservice."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;LocalLLaMA Reddit picked up the v2 Rust rewrite within hours of release — the comment thread immediately compared latency numbers to Markitdown and pypdf4llm, with one user reporting &lt;strong&gt;80–100x speedups&lt;/strong&gt; on a 200-page legal contract. Caveat: speedups are largest on text-heavy PDFs; scanned-image PDFs are bound by Tesseract, which is the same library every parser uses.&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://www.reddit.com/r/LangChain/" rel="noopener noreferrer"&gt;r/LangChain&lt;/a&gt; crowd flagged the OpenAI-skill packaging as the smartest part — &lt;code&gt;npx skills add run-llama/llamaparse-agent-skills --skill liteparse&lt;/code&gt; drops a SKILL.md straight into Claude Code, Codex, or any other agent harness that follows the &lt;a href="https://agentskills.io/" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt; spec. No glue code.&lt;/li&gt;
&lt;li&gt;A few sharper critiques on HN around the LibreOffice dependency for DOCX/XLSX/PPTX — it's heavy (~400 MB install) and a known source of crashes in containers. LiteParse's authors acknowledge this and have an open issue exploring a pure-Rust DOCX path.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;The team published their own &lt;a href="https://huggingface.co/datasets/llamaindex/liteparse_bench_small" rel="noopener noreferrer"&gt;benchmark dataset on HuggingFace&lt;/a&gt; along with the &lt;a href="https://github.com/run-llama/liteparse/tree/main/dataset_eval_utils" rel="noopener noreferrer"&gt;eval pipeline in-repo&lt;/a&gt;. The methodology is honest about its limits: they generate Q&amp;amp;A pairs from page screenshots, manually audit the dataset, then evaluate parsers with an LLM judge.&lt;/p&gt;

&lt;p&gt;Two takeaways from their numbers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Against non-VLM parsers&lt;/strong&gt; (pypdf, PyMuPDF, Markitdown), LiteParse wins on QA accuracy across most document types and is the latency leader on large documents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They explicitly don't claim to beat VLM-based parsers&lt;/strong&gt; like LlamaParse cloud, Docling, or Mistral OCR on hard layouts (dense tables, multi-column scientific papers, charts). The README routes you to LlamaParse for those — fair and refreshingly upfront.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want to verify the claims on your own corpus, the eval scripts will run on your local PDFs in 10–15 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;LiteParse is a focused tool, and the things it doesn't do are mostly intentional. Plan around these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex tables get flattened, not structured.&lt;/strong&gt; A multi-row-header pivot table from an SEC 10-K will become ASCII with column alignment, not a JSON schema. For structured table extraction, you want LlamaParse cloud, Docling, or a dedicated table model like Unstructured.io's hi_res.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handwritten or low-quality scans are Tesseract-limited.&lt;/strong&gt; Pointing it at a PaddleOCR or EasyOCR HTTP server helps a lot, but you're now running a Python service alongside, which partially defeats the "single binary" pitch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LibreOffice is the DOCX/XLSX/PPTX backend.&lt;/strong&gt; It works, but it's a ~400 MB system dependency. In Docker, you'll want &lt;code&gt;apt install libreoffice-core&lt;/code&gt; and accept the image bloat, or pre-render Office docs to PDF elsewhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No semantic block detection.&lt;/strong&gt; It won't tell you "this is a heading", "this is a caption", "this is a footnote". You get spatial text. If you need structural roles, that's a downstream step (a small LLM call on the parsed output usually works).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No streaming output.&lt;/strong&gt; The CLI buffers the entire result before printing. For very large documents (1,000+ pages) you'll want to chunk via &lt;code&gt;--target-pages&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bounding-box coordinates are in PDF user-space units, not pixels.&lt;/strong&gt; Fine once you understand it, but the docs could be clearer for first-time users plotting boxes on rendered screenshots.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Compares
&lt;/h2&gt;

&lt;p&gt;A quick honest map of where LiteParse fits versus the tools you might already be using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;vs &lt;code&gt;pypdf&lt;/code&gt; / &lt;code&gt;pdfminer.six&lt;/code&gt;&lt;/strong&gt;: LiteParse wins on layout, OCR fallback, and agent ergonomics. &lt;code&gt;pypdf&lt;/code&gt; wins on zero dependencies and pure-Python install (if that matters for your stack).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vs Microsoft's &lt;code&gt;markitdown&lt;/code&gt;&lt;/strong&gt; (&lt;a href="https://github.com/microsoft/markitdown" rel="noopener noreferrer"&gt;covered in our skim-list&lt;/a&gt;): MarkItDown is broader (covers audio transcripts, HTML, Outlook) but slower on PDFs and Markdown-output-oriented. Use both: MarkItDown for non-PDF formats, LiteParse for PDFs in the hot path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vs LlamaParse cloud&lt;/strong&gt;: LiteParse is for agents in a reasoning loop; LlamaParse is for "I need to nail every table on every page once at ingest time." Use LiteParse for runtime reads, LlamaParse for nightly batch ingestion of your knowledge corpus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vs Docling&lt;/strong&gt;: Docling is more accurate on hard layouts because it uses a vision model, but it requires GPU compute or a long CPU run. LiteParse is the right default; reach for Docling when LiteParse's output isn't enough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vs Mistral OCR API&lt;/strong&gt;: Cloud-only, paid, but excellent on hard scans. Use LiteParse first; fall back to Mistral OCR only on pages where Tesseract failed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a longer treatment of how parsing fits into an AI-agent retrieval pipeline, see our review of &lt;a href="https://dev.to/posts/pageindex-vectorless-rag-review"&gt;PageIndex's vectorless RAG approach&lt;/a&gt; and our walkthrough of &lt;a href="https://dev.to/posts/cocoindex-incremental-rag-engine-ai-agents-review"&gt;CocoIndex's incremental indexer&lt;/a&gt; — both pair naturally with LiteParse as the document-extraction stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is LiteParse a replacement for LlamaParse cloud?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No, and LlamaIndex is explicit about that. LiteParse handles the fast-path text-extraction case for agents. LlamaParse cloud is the heavyweight VLM-powered pipeline for production document-intelligence work where you need perfect tables, structured JSON outputs, and premium OCR. They're complementary: use LiteParse in your agent loop, LlamaParse at batch-ingest time for your most valuable documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need a GPU to run LiteParse?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. LiteParse is pure CPU. PDFium and Tesseract both run on CPU and parallelize across cores automatically. A 50-page text-only PDF takes ~1–2 seconds on a modern laptop. Scanned PDFs are slower because Tesseract is the bottleneck — a 50-page scanned PDF takes 30–60 seconds depending on core count. There's no &lt;code&gt;cuda&lt;/code&gt; flag because there's no model to accelerate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use it in a serverless function?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes for PDF-only workflows — the Node.js package bundles PDFium and Tesseract as native binaries that work on AWS Lambda's Amazon Linux 2 environment. Cold start is around 800 ms because of the binary unpack. For DOCX/XLSX/PPTX, you'll need LibreOffice in the runtime, which usually means a container-image-based Lambda. The browser WASM build is also viable for client-side parsing — handy when documents shouldn't leave the user's machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does it handle scanned PDFs?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Automatically. If a page has no extractable text layer, LiteParse routes it to Tesseract.js. You can opt in to a higher-accuracy external OCR by running PaddleOCR or EasyOCR as an HTTP server and passing &lt;code&gt;--ocr-server http://localhost:8000/ocr&lt;/code&gt;. The repo includes &lt;a href="https://github.com/run-llama/liteparse/tree/main/ocr" rel="noopener noreferrer"&gt;reference servers&lt;/a&gt; for both. Any OCR engine that returns text + bounding boxes works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will it work with LangChain / LlamaIndex / OpenClaw / Claude Code?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, in a few ways. As a library, import it directly in Python or TypeScript. As a CLI, shell out from any agent that can run subprocesses. As a &lt;a href="https://github.com/run-llama/llamaparse-agent-skills/blob/main/skills/liteparse/SKILL.md" rel="noopener noreferrer"&gt;skill file&lt;/a&gt;, drop it into any AgentSkills-compatible harness (Claude Code, Codex, OpenCode, Cursor) with one command. For LlamaIndex specifically, there's a first-party &lt;code&gt;LiteParseReader&lt;/code&gt; that mirrors the existing &lt;code&gt;LlamaParseReader&lt;/code&gt; interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the license and can I use it commercially?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apache 2.0 — as permissive as it gets. Commercial use, redistribution, and modification are all explicitly allowed. The PDFium dependency is BSD-licensed; Tesseract is Apache-2.0; LibreOffice is LGPLv3 (but only used as an external subprocess for Office-format conversion, which keeps your application licensing clean). No usage caps, no telemetry, no API key, no rate limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;LiteParse v2 is the first OSS PDF parser I'd actually drop into an agent without a wrapper. The combination of one CLI across four runtimes, a single Rust binary, no Python interpreter for Python users, sub-2-second runtimes on real-world PDFs, and the agent-first design (text-first, screenshot-fallback) hits a sweet spot the existing tools have all missed in one direction or another.&lt;/p&gt;

&lt;p&gt;If you're building anything where an LLM needs to skim a PDF inside a reasoning loop — RAG ingestion, contract review agents, research assistants, invoice readers — install it today and replace your &lt;code&gt;pypdf&lt;/code&gt; calls. For batch ingestion of your most valuable corpus, keep LlamaParse cloud (or Docling) for the heavy lifting. The two are designed to coexist, and the LlamaIndex team built LiteParse exactly so you don't have to round-trip to a cloud service every time an agent gets curious about a document.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Related reads on andrew.ooo: &lt;a href="https://dev.to/posts/pageindex-vectorless-rag-review"&gt;PageIndex Review: Vectorless RAG That Actually Works&lt;/a&gt;, &lt;a href="https://dev.to/posts/cocoindex-incremental-rag-engine-ai-agents-review"&gt;CocoIndex: Incremental RAG Engine for AI Agents&lt;/a&gt;, &lt;a href="https://dev.to/posts/rag-anything-hkuds-multimodal-rag-framework"&gt;RAG-Anything by HKU&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>liteparse</category>
      <category>llamaindex</category>
      <category>pdfparser</category>
      <category>rust</category>
    </item>
    <item>
      <title>Dograh Review: Open-Source Vapi Alternative for Voice AI</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Sat, 30 May 2026 11:06:56 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/dograh-review-open-source-vapi-alternative-for-voice-ai-p7i</link>
      <guid>https://dev.to/andrew-ooo/dograh-review-open-source-vapi-alternative-for-voice-ai-p7i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/dograh-open-source-voice-ai-vapi-alternative-review/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dograh&lt;/strong&gt; is an open-source, self-hostable voice AI platform from YC alumni Zansat Technologies — a drop-in alternative to Vapi and Retell. It's currently trending on GitHub with &lt;strong&gt;3,776 stars&lt;/strong&gt; and &lt;strong&gt;1,141 gained this week&lt;/strong&gt;, BSD 2-Clause licensed, and built around a drag-and-drop workflow builder for production voice agents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One Docker command&lt;/strong&gt; to go from zero to a working voice bot in under 2 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BYOK across the stack&lt;/strong&gt; — bring your own LLM, STT, TTS, or use Dograh's defaults&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP-native&lt;/strong&gt; — drive voice workflows directly from Model Context Protocol&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telephony built-in&lt;/strong&gt; — Twilio, Vonage, Telnyx, Cloudonix, plus human handoff&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BSD 2-Clause&lt;/strong&gt; — every line is yours to modify, no SaaS lock-in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual workflow builder&lt;/strong&gt; with a QA node that grades your prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python-based&lt;/strong&gt; backend, modular components, easy to swap pieces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test mode + in-dashboard web calls&lt;/strong&gt; so you can talk to your bot before deploying&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If Vapi and Retell are the "Stripe of voice AI" (closed, per-minute, hosted), Dograh is the &lt;strong&gt;n8n of voice AI&lt;/strong&gt; — open and self-hostable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repository&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;github.com/dograh-hq/dograh&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BSD 2-Clause&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintainer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zansat Technologies (YC alumni)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3,776 (+1,141 this week)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One &lt;code&gt;docker compose up&lt;/code&gt; command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SDKs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python (&lt;code&gt;dograh-sdk&lt;/code&gt;), Node (&lt;code&gt;@dograh/sdk&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dashboard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;http://localhost:3010&lt;/code&gt; after install&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Telephony&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Twilio, Vonage, Telnyx, Cloudonix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Languages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;English (expandable)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What Is Dograh?
&lt;/h2&gt;

&lt;p&gt;Dograh is a self-hosted voice AI platform — the open-source analog of &lt;a href="https://vapi.ai" rel="noopener noreferrer"&gt;Vapi&lt;/a&gt; and &lt;a href="https://retellai.com" rel="noopener noreferrer"&gt;Retell&lt;/a&gt;. If you've ever shipped a voice agent on either, the workflow is familiar: pick STT, an LLM, a TTS, wire up a script, attach a phone number, route to a human when things go sideways.&lt;/p&gt;

&lt;p&gt;The difference is &lt;em&gt;where it runs and who owns the data&lt;/em&gt;. Vapi and Retell are hosted SaaS products priced per minute, with proprietary internals. Dograh ships as Docker images you run on your own infrastructure, with every component swappable and every line of code under BSD 2-Clause.&lt;/p&gt;

&lt;p&gt;It's built around three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A visual workflow builder&lt;/strong&gt; — drag-and-drop nodes for greeting → qualify → branch → transfer → end. Nodes include a built-in &lt;strong&gt;QA node&lt;/strong&gt; that analyzes prompt quality across the rest of your workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A voice engine&lt;/strong&gt; — handles real-time STT → LLM → TTS with low-latency interaction, barge-in handling, and the speech-to-speech path when you want it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A platform layer&lt;/strong&gt; — testing, tracing, recordings, an in-dashboard web caller so you can talk to your bot mid-build, and an MCP server so other AI agents can trigger or compose voice workflows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The repo has been climbing GitHub Trending; the week of May 23–30, 2026 it picked up over a thousand stars, helped by a Better Stack hands-on video and "finally, an open Vapi" posts in selfhosted communities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install in 60 seconds
&lt;/h2&gt;

&lt;p&gt;The honest claim — "zero to working bot in under 2 minutes" — actually holds. Here's the one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-o&lt;/span&gt; docker-compose.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  https://raw.githubusercontent.com/dograh-hq/dograh/main/docker-compose.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;REGISTRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ghcr.io/dograh-hq &lt;span class="nv"&gt;ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  docker compose up &lt;span class="nt"&gt;--pull&lt;/span&gt; always
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things this does that are worth flagging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pulls all images from GHCR (&lt;code&gt;ghcr.io/dograh-hq&lt;/code&gt;) — no DockerHub rate-limit pain&lt;/li&gt;
&lt;li&gt;Sets &lt;code&gt;ENABLE_TELEMETRY=true&lt;/code&gt; by default; flip it to &lt;code&gt;false&lt;/code&gt; if you don't want anonymous usage data leaving your box&lt;/li&gt;
&lt;li&gt;First boot takes 2–3 minutes while it warms up models and downloads images. After that, restarts are seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When it's up, open &lt;code&gt;http://localhost:3010&lt;/code&gt;, pick &lt;strong&gt;Inbound&lt;/strong&gt; or &lt;strong&gt;Outbound&lt;/strong&gt;, name your bot (e.g. &lt;em&gt;Lead Qualification&lt;/em&gt;), describe the use case in 5–10 words (e.g. &lt;em&gt;Screen insurance form submissions for purchase intent&lt;/em&gt;), and click &lt;strong&gt;Web Call&lt;/strong&gt;. You're talking to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No API keys required for the first run.&lt;/strong&gt; Dograh ships with auto-generated keys and its own LLM/TTS/STT stack so you can test the platform without sourcing 4 different credentials first. Once you're ready, you can connect your own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM: OpenAI, Anthropic, Groq, any OpenAI-compatible endpoint (point it at vLLM, Ollama, or a local Llama deployment)&lt;/li&gt;
&lt;li&gt;STT: Deepgram, AssemblyAI, your own Whisper instance&lt;/li&gt;
&lt;li&gt;TTS: ElevenLabs, Cartesia, OpenAI TTS, or a self-hosted Piper/XTTS&lt;/li&gt;
&lt;li&gt;Telephony: Twilio, Vonage, Telnyx, Vobiz, Cloudonix — and the integration layer is modular enough to add others&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For remote deployment (the way you actually ship to prod), the &lt;a href="https://docs.dograh.com/deployment/docker" rel="noopener noreferrer"&gt;Docker Deployment Guide&lt;/a&gt; walks through a remote server setup with HTTPS via a reverse proxy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a real voice agent
&lt;/h2&gt;

&lt;p&gt;Here's a minimal outbound lead-qualification flow, built in the visual builder and then represented as JSON (Dograh stores workflows as JSON so they're version-controllable):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"workflow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lead-qualification-v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"greeting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"speak"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hi, this is Aria from {{company}}. I'm calling about your recent quote request for {{product}}. Do you have a minute?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"consent_check"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"consent_check"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"branch"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-haiku-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Did the caller agree to continue? Reply with one word: yes, no, or callback."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"routes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"yes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qualify_budget"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"no"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"polite_exit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"callback"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"schedule_callback"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qualify_budget"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"speak_listen"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Great. Quick one — what's the ballpark monthly budget you're working with?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"extract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"budget_usd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qualify_timeline"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qualify_timeline"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"speak_listen"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"And when are you hoping to have this in place?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"extract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"timeline"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qa_node"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qa_node"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qa"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"checks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"prompt_clarity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"data_capture_complete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tone_appropriate"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"next"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transfer_or_end"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transfer_or_end"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"branch"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"budget_usd &amp;gt;= 500"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"true_route"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transfer_to_human"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"false_route"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"polite_exit"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;qa&lt;/code&gt; node is the genuinely novel piece. Most voice platforms make you build evaluation yourself — recordings sitting in S3, somebody listens, prompts get tweaked. Dograh ships a node that &lt;strong&gt;runs prompt-quality checks on your other nodes&lt;/strong&gt; as part of the workflow, so you get a built-in regression signal when you change a prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Calling it from code
&lt;/h3&gt;

&lt;p&gt;Once the workflow exists, you can trigger calls from your backend via the Python or Node SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dograh&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dograh-local-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# auto-generated, find it in the dashboard
&lt;/span&gt;
&lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;workflow_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lead-qualification-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;phone_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+15551234567&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Acme Insurance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;term-life-quote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;transfer_targets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+15559876543&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# human handoff
&lt;/span&gt;    &lt;span class="n"&gt;webhook_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-app.example.com/dograh-events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Webhooks fire on call lifecycle events (&lt;code&gt;call.started&lt;/code&gt;, &lt;code&gt;call.transferred&lt;/code&gt;, &lt;code&gt;call.completed&lt;/code&gt;) with the full transcript and any extracted variables. From there it's normal CRUD into your CRM.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP-native
&lt;/h3&gt;

&lt;p&gt;This is the bit that distinguishes Dograh from older OSS voice frameworks. Dograh exposes its workflows as an MCP server, so &lt;a href="https://dev.to/posts/claude-code-cli-review-2026/"&gt;Claude Code&lt;/a&gt;, &lt;a href="https://dev.to/posts/cursor-2-review/"&gt;Cursor&lt;/a&gt;, or any MCP-aware agent can list workflows, trigger test calls, and read transcripts directly.&lt;/p&gt;

&lt;p&gt;In practice that means you can say &lt;em&gt;"call my lead-qualification workflow on +15551234567 and summarize the result"&lt;/em&gt; to your coding agent during development, instead of clicking through the dashboard. It also opens the door to &lt;strong&gt;agent-driven dialing&lt;/strong&gt; — orchestrator agents that pick which voice workflow to fire based on context.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Dograh compares
&lt;/h2&gt;

&lt;p&gt;Dograh's README puts up a comparison table that's mostly accurate, but it's worth widening the field. The voice AI OSS landscape has four players that matter:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Dograh&lt;/th&gt;
&lt;th&gt;&lt;a href="https://github.com/pipecat-ai/pipecat" rel="noopener noreferrer"&gt;Pipecat&lt;/a&gt;&lt;/th&gt;
&lt;th&gt;&lt;a href="https://github.com/livekit/agents" rel="noopener noreferrer"&gt;LiveKit Agents&lt;/a&gt;&lt;/th&gt;
&lt;th&gt;&lt;a href="https://github.com/vocodedev/vocode-core" rel="noopener noreferrer"&gt;Vocode&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BSD 2-Clause&lt;/td&gt;
&lt;td&gt;BSD 2-Clause&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-hostable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ One Docker command&lt;/td&gt;
&lt;td&gt;✅ Python framework&lt;/td&gt;
&lt;td&gt;✅ (LiveKit infra)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visual workflow builder&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Built-in&lt;/td&gt;
&lt;td&gt;❌ Code-only&lt;/td&gt;
&lt;td&gt;❌ Code-only&lt;/td&gt;
&lt;td&gt;❌ Code-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Native&lt;/td&gt;
&lt;td&gt;⚠️ Manual&lt;/td&gt;
&lt;td&gt;⚠️ Manual&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Telephony&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Twilio, Vonage, Telnyx, Cloudonix&lt;/td&gt;
&lt;td&gt;✅ Twilio, Daily&lt;/td&gt;
&lt;td&gt;✅ via SIP&lt;/td&gt;
&lt;td&gt;✅ Twilio, Vonage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BYOK LLM/STT/TTS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Any provider&lt;/td&gt;
&lt;td&gt;✅ Any provider&lt;/td&gt;
&lt;td&gt;✅ Any provider&lt;/td&gt;
&lt;td&gt;✅ Any provider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Built-in QA / eval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ QA node&lt;/td&gt;
&lt;td&gt;❌ DIY&lt;/td&gt;
&lt;td&gt;❌ DIY&lt;/td&gt;
&lt;td&gt;❌ DIY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintained by&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zansat (YC)&lt;/td&gt;
&lt;td&gt;Daily.co&lt;/td&gt;
&lt;td&gt;LiveKit&lt;/td&gt;
&lt;td&gt;Vocode team&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Where Dograh wins:&lt;/strong&gt; the visual builder + QA node + MCP combo. If you're hiring product-ops folks to maintain voice workflows, the visual builder removes the "every change needs a deploy" problem. The QA node removes the "we don't know why our bot got worse" problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Pipecat/LiveKit/Vocode might win:&lt;/strong&gt; if your team is engineers-only and you want maximum flexibility, raw Python frameworks give you more room. LiveKit specifically has stronger scale guarantees for multi-thousand concurrent calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compared to Vapi/Retell:&lt;/strong&gt; you're trading per-minute SaaS pricing and a polished hosted UX for ownership, data residency, and no vendor lock-in. For high-volume use cases (&amp;gt;10k minutes/month) Dograh's economics dominate even at moderate self-host overhead. For low-volume prototypes, Vapi is still faster to demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;p&gt;A week of poking at Dograh and reading the issue tracker, here's what would actually bite you in production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;English only at present.&lt;/strong&gt; The README says "expandable to other languages" but the shipping config is English-tuned. If your audience is Spanish or Hindi, expect to wire up your own multilingual STT/TTS pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-tenant by design.&lt;/strong&gt; The Docker setup is built for "I run this for my company." If you want to host voice AI for multiple customers, you're building the multi-tenancy layer yourself or running multiple stacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The visual builder is great until it isn't.&lt;/strong&gt; Anything past ~30 nodes gets unwieldy. The JSON export is your escape hatch for diff-based version control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No native browser-call SDK yet.&lt;/strong&gt; The dashboard has an in-browser caller; the public SDKs are server-side (Python, Node). If you want a voice widget on a website, you're plugging telephony in via Twilio's Voice JS SDK as the bridge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telemetry on by default.&lt;/strong&gt; Anonymous, but flip &lt;code&gt;ENABLE_TELEMETRY=false&lt;/code&gt; if your security team will care — and they will care.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community is brand new.&lt;/strong&gt; The Slack and GitHub Discussions are active but small. Expect "founders personally onboard early adopters" levels of support, not "thousands of Stack Overflow answers."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are dealbreakers, but they're the things that turn into Friday-afternoon problems if you don't plan around them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community reactions
&lt;/h2&gt;

&lt;p&gt;The Show HN-adjacent threads and r/selfhosted discussions have a consistent through-line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"We were paying ~$3K/month to Vapi for outbound qualification. Migrated the top 3 workflows to a self-hosted Dograh box on a $40 Hetzner VPS. The latency story is genuinely competitive when your STT/TTS are close to the box."&lt;/em&gt; — selfhosted community feedback&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"The QA node alone is the reason I'm switching. We had no idea our 'check intent' prompt was misfiring on accents until we wired up grading. Dograh ships that as a node."&lt;/em&gt; — voice AI dev on Slack&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"It's not as polished as Vapi for the first 10 minutes, but it's dramatically better at hour 10. The escape-to-JSON is what closes the deal for me — every workflow is versionable."&lt;/em&gt; — engineering lead&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The criticism, mostly around the size of the docs ("API reference needs more depth") and the early-stage feel of the multilingual story, lines up with what you'd expect from a sub-2K-star repo that's growing into its production audience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who should use Dograh
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teams currently spending $1K+/month on Vapi or Retell and feeling the per-minute squeeze&lt;/li&gt;
&lt;li&gt;Use cases with strict data-residency requirements (healthcare, finance, EU regulated)&lt;/li&gt;
&lt;li&gt;Engineering teams that want voice workflows under version control&lt;/li&gt;
&lt;li&gt;Anyone building a &lt;em&gt;voice-first product&lt;/em&gt; where the bot logic is core IP and you don't want it in someone else's cloud&lt;/li&gt;
&lt;li&gt;Agencies running voice campaigns for multiple clients (one stack per client, or build your own multi-tenancy)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bad fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prototypes where you just need a demo by Friday — use Vapi, decide later&lt;/li&gt;
&lt;li&gt;Teams without a deployable infrastructure baseline (no Docker, no ops culture)&lt;/li&gt;
&lt;li&gt;Non-English customer bases until the multilingual story matures&lt;/li&gt;
&lt;li&gt;Use cases that demand 99.99% uptime out of the box without you investing in HA&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Dograh actually free?
&lt;/h3&gt;

&lt;p&gt;Yes — BSD 2-Clause license means you can self-host, modify, and even ship a commercial product built on top of it without paying Dograh anything. The optional managed cloud at &lt;a href="https://app.dograh.com" rel="noopener noreferrer"&gt;app.dograh.com&lt;/a&gt; is usage-based if you'd rather not run infrastructure yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  What hardware do I need to self-host Dograh?
&lt;/h3&gt;

&lt;p&gt;For testing, any machine with Docker and 4 GB RAM works. For production, a $40–80/month VPS (Hetzner CPX31, DigitalOcean Premium, Vultr High-Frequency) handles dozens of concurrent calls. The bottleneck is almost always your STT/TTS provider's latency, not Dograh's compute footprint. Heavy local model use (running your own Whisper + XTTS) pushes you toward GPU instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can Dograh replace Vapi for production workloads?
&lt;/h3&gt;

&lt;p&gt;For most workloads, yes — telephony integration, real-time STT/LLM/TTS, transcription, recordings, and webhooks are all there. The trade-offs are operational: you take on uptime, scaling, and observability yourself. Teams running &amp;gt;10K minutes/month report meaningful savings even after factoring in DevOps overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Dograh handle conversation interruptions (barge-in)?
&lt;/h3&gt;

&lt;p&gt;Dograh's voice engine handles barge-in (where the caller starts speaking before the bot finishes) at the audio layer using VAD-based interruption detection. It's competitive with Vapi out of the box, though tuning the silence thresholds for noisy environments is something you'll do per-deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I drive Dograh from Claude Code or another AI agent?
&lt;/h3&gt;

&lt;p&gt;Yes — Dograh ships an MCP server, so any MCP-aware agent (Claude Code, Cursor, OpenCode, Codex CLI) can list workflows, trigger calls, and read transcripts. This is one of the genuine differentiators vs. Pipecat/LiveKit, which require you to wire MCP yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between Dograh and Pipecat?
&lt;/h3&gt;

&lt;p&gt;Pipecat is a Python framework for building voice agents in code; Dograh is a full platform with a visual builder, dashboard, QA tooling, and telephony pre-wired. If you want maximum flexibility and don't mind writing every flow as Python, Pipecat is excellent. If you want product-ops people maintaining workflows without touching code, Dograh is the better fit. Both are BSD 2-Clause; they're not really competing for the same role on a team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Dograh work with my existing Twilio account?
&lt;/h3&gt;

&lt;p&gt;Yes — bring your Twilio account SID and auth token, point a number at Dograh's webhook, and you're live. The same is true for Vonage, Telnyx, and Cloudonix. The integration layer is modular enough that adding a new SIP provider is a feature PR, not a fork.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;Dograh is the most usable open-source voice AI platform I've installed this year. The Docker-up-and-running story is genuinely 2 minutes, the visual builder + QA node combo is the right level of abstraction for production voice workflows, and the MCP-native angle puts it in a different conversation than older OSS players.&lt;/p&gt;

&lt;p&gt;It's not as polished as Vapi for your first 10 minutes — Vapi's onboarding is a masterclass — but Dograh is dramatically better at hour 10, day 30, and month 6, when ownership, debuggability, and cost compound. If you're paying voice-AI SaaS bills today and the bills are getting larger every month, this is the project to clone, run on a Hetzner VPS for a weekend, and benchmark against your existing stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Star, fork, ship.&lt;/strong&gt; &lt;a href="https://github.com/dograh-hq/dograh" rel="noopener noreferrer"&gt;github.com/dograh-hq/dograh&lt;/a&gt;&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>dograh</category>
      <category>vapialternative</category>
      <category>retellalternative</category>
    </item>
    <item>
      <title>CodeGraph Review: Pre-Indexed Knowledge Graph for AI Agents</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Thu, 28 May 2026 12:24:46 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/codegraph-review-pre-indexed-knowledge-graph-for-ai-agents-af1</link>
      <guid>https://dev.to/andrew-ooo/codegraph-review-pre-indexed-knowledge-graph-for-ai-agents-af1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/codegraph-review-pre-indexed-knowledge-graph-claude-code/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CodeGraph&lt;/strong&gt; is an open-source MCP server from Colby McHenry that gives AI coding agents a pre-indexed, AST-based knowledge graph of your codebase. It's currently the #2 repo on GitHub Trending this week — &lt;strong&gt;31,090 stars total&lt;/strong&gt;, &lt;strong&gt;21,424 gained in seven days&lt;/strong&gt;. Highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP-native&lt;/strong&gt; — auto-configures Claude Code, Cursor, Codex CLI, opencode, Hermes Agent, Gemini CLI, Antigravity, and Kiro&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tree-sitter AST extraction&lt;/strong&gt; across &lt;strong&gt;20+ languages&lt;/strong&gt; — symbols, call graphs, import chains, and references stored in local SQLite&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No embeddings, no vector DB, no API keys&lt;/strong&gt; — pure structural graph + FTS5 full-text search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-syncing&lt;/strong&gt; via native FSEvents/inotify with debounced re-indexing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;35% cheaper, 57% fewer tokens, 46% faster, 71% fewer tool calls&lt;/strong&gt; in published median-of-4 benchmarks on Claude Opus 4.7 across 7 real codebases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework-aware routing&lt;/strong&gt; for 14 web frameworks (Django, Flask, FastAPI, Express, NestJS, Laravel, Rails, Spring, ASP.NET, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-language bridging&lt;/strong&gt; for Swift ↔ ObjC and React Native (legacy bridge, TurboModules, Fabric, Expo)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache 2.0&lt;/strong&gt;, bundles its own runtime, one-command install on macOS, Linux, or Windows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If &lt;a href="https://dev.to/posts/claude-context-mcp-semantic-code-search-review/"&gt;Claude Context&lt;/a&gt; is the vector-DB answer to &lt;em&gt;"stop my AI agent from grepping the same files 50 times,"&lt;/em&gt; CodeGraph is the structural answer — and the two are looking like complementary halves of the same problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repository&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/colbymchenry/codegraph" rel="noopener noreferrer"&gt;github.com/colbymchenry/codegraph&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Author&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Colby McHenry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;31,090 (+21,424 this week)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;`curl -fsSL &lt;a href="https://raw.githubusercontent.com/colbymchenry/codegraph/main/install.sh" rel="noopener noreferrer"&gt;https://raw.githubusercontent.com/colbymchenry/codegraph/main/install.sh&lt;/a&gt; \&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NPM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;{% raw %}&lt;code&gt;@colbymchenry/codegraph&lt;/code&gt; (also works via &lt;code&gt;npx&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local SQLite, FTS5 full-text search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requires&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nothing (bundled runtime) — or Node ≥ 18 for npm install&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What Is CodeGraph?
&lt;/h2&gt;

&lt;p&gt;When Claude Code answers an architecture question — "how does X talk to Y here?" — it doesn't actually &lt;em&gt;know&lt;/em&gt; your codebase. It launches Explore sub-agents that fan out across &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;glob&lt;/code&gt;, and &lt;code&gt;Read&lt;/code&gt;, follow imports, re-read the same files, and spend most of their token budget on discovery before they can answer.&lt;/p&gt;

&lt;p&gt;CodeGraph attacks that discovery cost by pre-computing what those sub-agents would otherwise have to learn from scratch. It parses your repo with &lt;strong&gt;tree-sitter&lt;/strong&gt;, extracts every symbol, call site, import, and reference, and stores them as nodes and edges in a local SQLite database. An MCP server exposes that graph through three primary tools: &lt;code&gt;codegraph_context&lt;/code&gt;, &lt;code&gt;codegraph_explore&lt;/code&gt;, and &lt;code&gt;codegraph_status&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;No embeddings, no API keys, no Docker, no vector store. The SQLite file lives in &lt;code&gt;.codegraph/&lt;/code&gt;. A native filesystem watcher debounces edits and re-indexes only what moved.&lt;/p&gt;

&lt;p&gt;The pitch: &lt;em&gt;the agent already pays for tool calls — make those tool calls answer with structure instead of bytes.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It's Trending Now
&lt;/h2&gt;

&lt;p&gt;Three things converged this week:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Reproducible benchmarks.&lt;/strong&gt; The README's headline numbers — &lt;strong&gt;35% cheaper, 57% fewer tokens, 46% faster, 71% fewer tool calls&lt;/strong&gt; — are eye-catching, but what convinced Hacker News was the methodology: &lt;code&gt;claude -p&lt;/code&gt; Opus 4.7 headless, &lt;code&gt;--strict-mcp-config&lt;/code&gt;, 4 runs per arm, median reported, raw per-repo numbers published. Reproducible, not marketing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Structural counter-pitch to vector search.&lt;/strong&gt; A month after &lt;a href="https://dev.to/posts/claude-context-mcp-semantic-code-search-review/"&gt;Claude Context&lt;/a&gt; brought BM25 + embeddings to MCP, CodeGraph argues you don't need either. Symbol graphs are deterministic, lossless, and don't drift when you rename a function. For &lt;em&gt;"who calls &lt;code&gt;processOrder&lt;/code&gt;?"&lt;/em&gt;, a graph answers in one query. Embeddings have to guess.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Zero infrastructure.&lt;/strong&gt; No vector DB, no embedding API, no API keys. &lt;code&gt;codegraph init -i&lt;/code&gt; and the MCP server wires into every coding agent on your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;CodeGraph is three layers stacked on top of tree-sitter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parse layer.&lt;/strong&gt; Tree-sitter grammars produce ASTs for 20+ languages — TypeScript, JavaScript, Python, Go, Rust, Java, C#, PHP, Ruby, C, C++, Objective-C, Swift, Kotlin, Dart, Lua, Luau, Svelte, Liquid, and Pascal/Delphi. CodeGraph walks each AST and extracts symbol declarations, references, imports, exports, and call sites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph layer.&lt;/strong&gt; Symbols become nodes; calls, references, and imports become edges. Stored in SQLite with an FTS5 virtual table for name search. On top of that, CodeGraph layers two enrichments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Framework-aware route recognition&lt;/strong&gt; for 14 web frameworks. Django &lt;code&gt;path()&lt;/code&gt;, FastAPI &lt;code&gt;@router.get(...)&lt;/code&gt;, Express &lt;code&gt;router.post(...)&lt;/code&gt;, NestJS &lt;code&gt;@Controller&lt;/code&gt; + &lt;code&gt;@Get&lt;/code&gt;, Rails &lt;code&gt;get '/x', to: 'users#index'&lt;/code&gt;, Spring &lt;code&gt;@GetMapping&lt;/code&gt;, ASP.NET &lt;code&gt;[HttpGet]&lt;/code&gt; — each becomes a route node linked to its handler. &lt;em&gt;"Who handles &lt;code&gt;/api/orders&lt;/code&gt;?"&lt;/em&gt; now jumps straight to the controller.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-language bridging&lt;/strong&gt; for iOS / React Native / Expo. Swift ↔ ObjC &lt;code&gt;@objc&lt;/code&gt; auto-bridging, JS &lt;code&gt;NativeModules.X.fn(...)&lt;/code&gt; linked to ObjC &lt;code&gt;RCT_EXPORT_METHOD&lt;/code&gt; or Java/Kotlin &lt;code&gt;@ReactMethod&lt;/code&gt;, Fabric components, TurboModule specs, native → JS event emitters, and Expo's &lt;code&gt;Module { Name("X"); AsyncFunction("fn") }&lt;/code&gt; DSL. The kind of thing static parsers normally drop on the floor.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MCP layer.&lt;/strong&gt; A Node server speaks MCP and exposes three primary tools: &lt;code&gt;codegraph_context(area)&lt;/code&gt; (entry points + related symbols), &lt;code&gt;codegraph_explore(symbol)&lt;/code&gt; (full source plus immediate neighbors), and &lt;code&gt;codegraph_status&lt;/code&gt; (pending edits, freshness banner).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark, In Detail
&lt;/h2&gt;

&lt;p&gt;The README's benchmark is the most-discussed part of the project. Here's the raw shape (medians of 4 runs per arm, Claude Opus 4.7 headless, &lt;code&gt;--strict-mcp-config&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Codebase&lt;/th&gt;
&lt;th&gt;Language · Files&lt;/th&gt;
&lt;th&gt;Cost WITH → WITHOUT&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Tool calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VS Code&lt;/td&gt;
&lt;td&gt;TS · ~10k&lt;/td&gt;
&lt;td&gt;$0.60 → $0.80&lt;/td&gt;
&lt;td&gt;601k → 2.8M&lt;/td&gt;
&lt;td&gt;1m 10s → 2m 26s&lt;/td&gt;
&lt;td&gt;8 → 55&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Excalidraw&lt;/td&gt;
&lt;td&gt;TS · ~640&lt;/td&gt;
&lt;td&gt;$0.43 → $0.90&lt;/td&gt;
&lt;td&gt;344k → 3.5M&lt;/td&gt;
&lt;td&gt;48s → 2m 58s&lt;/td&gt;
&lt;td&gt;3 → 79&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Django&lt;/td&gt;
&lt;td&gt;Py · ~3k&lt;/td&gt;
&lt;td&gt;$0.59 → $0.67&lt;/td&gt;
&lt;td&gt;739k → 1.2M&lt;/td&gt;
&lt;td&gt;1m 19s → 1m 38s&lt;/td&gt;
&lt;td&gt;9 → 19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokio&lt;/td&gt;
&lt;td&gt;Rust · ~790&lt;/td&gt;
&lt;td&gt;$0.42 → $2.41&lt;/td&gt;
&lt;td&gt;379k → 2.6M&lt;/td&gt;
&lt;td&gt;53s → 3m 2s&lt;/td&gt;
&lt;td&gt;4 → 53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OkHttp&lt;/td&gt;
&lt;td&gt;Java · ~645&lt;/td&gt;
&lt;td&gt;$0.47 → $0.47&lt;/td&gt;
&lt;td&gt;636k → 730k&lt;/td&gt;
&lt;td&gt;42s → 1m 1s&lt;/td&gt;
&lt;td&gt;6 → 11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gin&lt;/td&gt;
&lt;td&gt;Go · ~110&lt;/td&gt;
&lt;td&gt;$0.37 → $0.47&lt;/td&gt;
&lt;td&gt;444k → 675k&lt;/td&gt;
&lt;td&gt;44s → 1m 0s&lt;/td&gt;
&lt;td&gt;6 → 10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alamofire&lt;/td&gt;
&lt;td&gt;Swift · ~110&lt;/td&gt;
&lt;td&gt;$0.61 → $1.14&lt;/td&gt;
&lt;td&gt;1.0M → 2.8M&lt;/td&gt;
&lt;td&gt;1m 17s → 2m 27s&lt;/td&gt;
&lt;td&gt;12 → 69&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three things stand out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gains scale with codebase size.&lt;/strong&gt; On VS Code (~10k files) the no-CodeGraph arm needs 55 tool calls and reads 2.8M tokens. On Gin (~110 files), native &lt;code&gt;grep&lt;/code&gt; is already cheap and CodeGraph's edge collapses to 21% cheaper. ROI is real around the 1k-file mark, dramatic above 5k.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calls drop harder than tokens.&lt;/strong&gt; On Excalidraw it's 3 vs 79 — a 96% reduction. The WITHOUT arm spawns Explore sub-agents that themselves read files, multiplying calls. CodeGraph short-circuits the tree at the parent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OkHttp is the honest outlier.&lt;/strong&gt; 2% cheaper, tokens barely moved. Its query hits a small, localized part of the code where &lt;code&gt;grep&lt;/code&gt; was already efficient. Not every question rewards a graph.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The author's own caveat is healthy: CodeGraph only helps when queried directly — if the parent agent delegates exploration to a file-reading sub-agent, the graph never gets called and becomes overhead. The system prompt shim matters as much as the index.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The installer is one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS / Linux&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/colbymchenry/codegraph/main/install.sh | sh

&lt;span class="c"&gt;# Windows PowerShell&lt;/span&gt;
irm https://raw.githubusercontent.com/colbymchenry/codegraph/main/install.ps1 | iex

&lt;span class="c"&gt;# Or via npm&lt;/span&gt;
npx @colbymchenry/codegraph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
codegraph init &lt;span class="nt"&gt;-i&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-i&lt;/code&gt; flag launches an interactive installer that detects every coding agent on your system — Claude Code, Cursor, Codex CLI, opencode, Hermes Agent, Gemini CLI, Antigravity, Kiro — and writes the MCP config and instruction shim into each one you select. No manual JSON editing.&lt;/p&gt;

&lt;p&gt;Open Claude Code and start asking architecture questions. The first session triggers the initial index (seconds for small repos, minutes for VS Code-scale). The filesystem watcher keeps it current after that.&lt;/p&gt;

&lt;p&gt;To uninstall: &lt;code&gt;codegraph uninstall&lt;/code&gt; strips MCP config from every agent it touched; &lt;code&gt;codegraph uninit&lt;/code&gt; removes &lt;code&gt;.codegraph/&lt;/code&gt; from the project. Cleanest uninstall story in the MCP ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Onboarding to a large codebase.&lt;/strong&gt; Drop CodeGraph on a 100k-line monorepo, ask &lt;em&gt;"how does authentication flow work end to end?"&lt;/em&gt; — get a routed answer hitting middleware, the JWT verifier, and the user store in one tool call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactor impact analysis.&lt;/strong&gt; &lt;em&gt;"What breaks if I change the signature of &lt;code&gt;processPayment&lt;/code&gt;?"&lt;/em&gt; One &lt;code&gt;codegraph_explore&lt;/code&gt; call returns every caller and callee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-language iOS apps.&lt;/strong&gt; Swift ↔ ObjC and React Native bridge support means &lt;em&gt;"where does this JS prop end up on the native side?"&lt;/em&gt; actually resolves across the boundary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cutting Claude/OpenAI API spend.&lt;/strong&gt; Reddit reports in r/ClaudeCode put the saving at 30–50% on long sessions, consistent with the README's 35% median.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-fresh long sessions.&lt;/strong&gt; The file watcher debounces and re-indexes, so multi-hour agent sessions don't drift from the working tree.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  First Impressions From the Community
&lt;/h2&gt;

&lt;p&gt;Reception on Hacker News and r/ClaudeCode this week has been warm — partly the reproducible benchmark, partly the painless install:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The MCP config auto-writes into every agent on your machine in one shot. &lt;code&gt;codegraph init -i&lt;/code&gt; and Claude Code suddenly stops grepping."&lt;/em&gt; — r/ClaudeCode&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Symbol graph beats embeddings for 'who calls this?' questions. Embeddings are fuzzy by design. CodeGraph just knows."&lt;/em&gt; — Hacker News&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"On a 200k-line legacy Java service we cut Claude Code's average session cost from $4 to $1.50."&lt;/em&gt; — Reddit testimonial&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The common gripe is the converse: for fuzzy semantic questions (&lt;em&gt;"find the place that probably handles edge cases in checkout"&lt;/em&gt;), a symbol graph isn't as good as a vector store. Several commenters already run CodeGraph &lt;strong&gt;and&lt;/strong&gt; Claude Context side by side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;CodeGraph is impressive, but worth knowing before you bet on it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Symbol graphs don't help with fuzzy questions.&lt;/strong&gt; If you don't know what the symbol is &lt;em&gt;called&lt;/em&gt;, the graph can't find it. Vector search degrades more gracefully here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First-index time scales with repo size.&lt;/strong&gt; A 10k-file TS repo takes a couple of minutes to parse. Incremental after that, but the initial wait is real.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tree-sitter coverage varies.&lt;/strong&gt; Top-tier languages (TS, JS, Python, Go, Rust, Java) are excellent. Pascal/Delphi and Liquid work but with thinner symbol coverage. Anything outside the 20+ list falls back to FTS5 text search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark is one question per repo.&lt;/strong&gt; Real sessions ask many questions; some graph queries handle worse than &lt;code&gt;grep&lt;/code&gt;. Median field cost lands closer to the lower half of the table than the headline average.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No multi-repo workspace yet.&lt;/strong&gt; Index lives per project. Microservices repos mean multiple &lt;code&gt;.codegraph/&lt;/code&gt; directories with no cross-repo query.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Who Should Use This (And Who Shouldn't)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use CodeGraph if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You work on a 1k+ file codebase and your AI agent burns tokens on discovery&lt;/li&gt;
&lt;li&gt;You want zero infrastructure — no embedding API, no vector DB, no Docker&lt;/li&gt;
&lt;li&gt;You hop between Claude Code, Cursor, and Codex CLI and want one index across all of them&lt;/li&gt;
&lt;li&gt;You work on iOS / React Native and lose context at the bridge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip CodeGraph if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your repo is under ~300 files (native &lt;code&gt;grep&lt;/code&gt; is fast enough)&lt;/li&gt;
&lt;li&gt;Your questions are mostly semantic ("find the place that handles X" without knowing the symbol name) — Claude Context fits better&lt;/li&gt;
&lt;li&gt;You can't run a local SQLite file or are in a sandboxed environment with no filesystem watcher (&lt;code&gt;CODEGRAPH_NO_DAEMON=1&lt;/code&gt; works, but you'll need manual &lt;code&gt;codegraph sync&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  CodeGraph vs. Claude Context vs. Other Indexers
&lt;/h2&gt;

&lt;p&gt;The MCP code-search space has crystallized into two distinct approaches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Local-only&lt;/th&gt;
&lt;th&gt;MCP&lt;/th&gt;
&lt;th&gt;Multi-client&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CodeGraph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AST symbol graph&lt;/td&gt;
&lt;td&gt;SQLite&lt;/td&gt;
&lt;td&gt;✅ always&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ 8+&lt;/td&gt;
&lt;td&gt;Structural questions, refactors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hybrid BM25 + embeddings&lt;/td&gt;
&lt;td&gt;Milvus / Zilliz&lt;/td&gt;
&lt;td&gt;⚠️ via Ollama&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ 13+&lt;/td&gt;
&lt;td&gt;Semantic questions, vague queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor Codebase Index&lt;/td&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;Cursor cloud&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌ Cursor only&lt;/td&gt;
&lt;td&gt;Cursor users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider repo-map&lt;/td&gt;
&lt;td&gt;Tree-sitter graph&lt;/td&gt;
&lt;td&gt;In-memory&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌ Aider only&lt;/td&gt;
&lt;td&gt;Aider users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sourcegraph Cody&lt;/td&gt;
&lt;td&gt;Hybrid + graph&lt;/td&gt;
&lt;td&gt;Sourcegraph&lt;/td&gt;
&lt;td&gt;✅ enterprise&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Enterprise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continue @codebase&lt;/td&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;LanceDB&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌ Continue only&lt;/td&gt;
&lt;td&gt;Continue users&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Symbol graphs vs. embeddings is not a winner-take-all fight.&lt;/strong&gt; The two answer different question shapes. CodeGraph nails &lt;em&gt;"who calls X?", "what's the route for /api/orders?", "what breaks if I rename Y?"&lt;/em&gt;. Claude Context nails &lt;em&gt;"find the place that handles the corner case where users have two emails."&lt;/em&gt; Several commenters this week are running both — the graph as the structural source of truth, the vector store for fuzzy recall.&lt;/p&gt;

&lt;p&gt;If you only have time to add one MCP server this week and your codebase is over 1k files: CodeGraph is the lower-friction install (no API keys, no Docker) and lands the bigger token reduction on architecture questions, which is what most agents waste budget on.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does CodeGraph work with Cursor and Codex CLI, or only Claude Code?
&lt;/h3&gt;

&lt;p&gt;It auto-configures &lt;strong&gt;eight clients&lt;/strong&gt;: Claude Code, Cursor, Codex CLI, opencode, Hermes Agent, Gemini CLI, Antigravity IDE, and Kiro. The interactive installer (&lt;code&gt;codegraph init -i&lt;/code&gt;) detects which are present and lets you choose. The MCP server itself is client-agnostic — anything that speaks MCP can connect.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does CodeGraph compare to Claude Context (Zilliz's MCP indexer)?
&lt;/h3&gt;

&lt;p&gt;CodeGraph uses a tree-sitter symbol graph in local SQLite. &lt;a href="https://dev.to/posts/claude-context-mcp-semantic-code-search-review/"&gt;Claude Context&lt;/a&gt; uses BM25 + dense embeddings in Milvus. CodeGraph wins on structural questions ("who calls X?", "what's the route for Y?") and zero-infrastructure setup. Claude Context wins on fuzzy semantic questions and recall when you don't know the symbol name. They're complementary, and several teams run both.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is CodeGraph really 100% local?
&lt;/h3&gt;

&lt;p&gt;Yes. No API keys, no embeddings, no external services. The graph is a SQLite database in &lt;code&gt;.codegraph/&lt;/code&gt; inside your project. The MCP server runs as a local Node process. Nothing leaves your machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need Node.js installed?
&lt;/h3&gt;

&lt;p&gt;No. The native installer (&lt;code&gt;install.sh&lt;/code&gt; / &lt;code&gt;install.ps1&lt;/code&gt;) bundles its own runtime. If you already have Node, &lt;code&gt;npx @colbymchenry/codegraph&lt;/code&gt; works too — both paths land at the same binary.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does the auto-sync work? Do I need to run &lt;code&gt;codegraph sync&lt;/code&gt; manually?
&lt;/h3&gt;

&lt;p&gt;You don't. A native filesystem watcher (FSEvents on macOS, inotify on Linux, ReadDirectoryChangesW on Windows) catches every file change and re-indexes after a 2-second debounce (tunable). On reconnect, the MCP server does a fast &lt;code&gt;(size, mtime) + content-hash&lt;/code&gt; reconciliation. Manual &lt;code&gt;codegraph sync&lt;/code&gt; only matters in sandboxed environments where the watcher is disabled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are the benchmark numbers reproducible?
&lt;/h3&gt;

&lt;p&gt;Yes. The README publishes the methodology (&lt;code&gt;claude -p&lt;/code&gt; Opus 4.7 headless, &lt;code&gt;--strict-mcp-config&lt;/code&gt;, 4 runs per arm, median reported), the exact query for each of the 7 repos, and the raw &lt;code&gt;WITH → WITHOUT&lt;/code&gt; medians per cell. You can clone any of the benchmark repos at &lt;code&gt;--depth 1&lt;/code&gt; and run the same comparison yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;CodeGraph is the strongest pitch yet for symbol graphs as the structural layer beneath AI coding agents. The benchmark is reproducible, the install story is the lowest friction in MCP code search, and the &lt;strong&gt;21,424 stars in seven days&lt;/strong&gt; suggest a lot of developers had the same thought: &lt;em&gt;I'm tired of watching Claude Code re-grep the same files&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If your repo is over 1,000 files and you're paying for an agent's tool calls, CodeGraph likely pays for itself in this week's Claude bill. Run it alongside &lt;a href="https://dev.to/posts/claude-context-mcp-semantic-code-search-review/"&gt;Claude Context&lt;/a&gt; for fuzzy recall and you have the closest thing to a complete MCP code-intelligence stack that exists today.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Repo: &lt;a href="https://github.com/colbymchenry/codegraph" rel="noopener noreferrer"&gt;github.com/colbymchenry/codegraph&lt;/a&gt; — Apache 2.0 licensed, 31K stars and gaining ~3K/day.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>claudecode</category>
      <category>codeknowledgegraph</category>
      <category>ast</category>
    </item>
    <item>
      <title>HyperFrames Review: HeyGen's HTML-to-Video for AI Agents</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Thu, 28 May 2026 11:09:54 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/hyperframes-review-heygens-html-to-video-for-ai-agents-170i</link>
      <guid>https://dev.to/andrew-ooo/hyperframes-review-heygens-html-to-video-for-ai-agents-170i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/hyperframes-heygen-html-video-agents-review/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;HyperFrames&lt;/strong&gt; is the open-source rendering engine HeyGen quietly carved out of its internal video stack and dropped on GitHub in mid-April 2026. The pitch fits on a sticker: &lt;em&gt;write HTML, render video, built for agents&lt;/em&gt;. Compositions are plain &lt;code&gt;index.html&lt;/code&gt; files with &lt;code&gt;data-start&lt;/code&gt; and &lt;code&gt;data-duration&lt;/code&gt; attributes; a headless Chrome renderer seeks each frame, FFmpeg encodes it, and the result is a deterministic MP4. No React, no proprietary timeline, no bundler. Key facts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache 2.0&lt;/strong&gt;, no per-render fees, no commercial-use thresholds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;21,894 GitHub stars&lt;/strong&gt; and growing fast — one of the most-starred new AI repos of Q2 2026&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node.js 22+ and FFmpeg&lt;/strong&gt; are the only hard requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapter-based animation&lt;/strong&gt;: GSAP, CSS, Lottie, Three.js, Anime.js, WAAPI, or your own runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic by design&lt;/strong&gt;: same HTML in, same bytes out — built for CI, regression tests, and reproducible renders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent skills shipped on day one&lt;/strong&gt; for Claude Code, Cursor, Gemini CLI, Codex&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Lambda render path&lt;/strong&gt; included for distributed rendering at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In production at HeyGen&lt;/strong&gt; with community use at tldraw and TanStack&lt;/li&gt;
&lt;li&gt;Repository: &lt;a href="https://github.com/heygen-com/hyperframes" rel="noopener noreferrer"&gt;github.com/heygen-com/hyperframes&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting bet here isn't "another video framework." It's that AI coding agents &lt;em&gt;already know HTML&lt;/em&gt; — so if your video format is HTML, agents become competent video editors essentially for free. After a week of trying it, that claim mostly holds up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "HTML as a video format" is the real story
&lt;/h2&gt;

&lt;p&gt;Most video frameworks treat the timeline as a first-class object. You drag clips, set keyframes, and the tool serialises that into JSON, JSX, or a proprietary DSL. HyperFrames inverts that. The timeline &lt;em&gt;is&lt;/em&gt; the HTML document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"stage"&lt;/span&gt; &lt;span class="na"&gt;data-composition-id=&lt;/span&gt;&lt;span class="s"&gt;"launch"&lt;/span&gt;
     &lt;span class="na"&gt;data-start=&lt;/span&gt;&lt;span class="s"&gt;"0"&lt;/span&gt; &lt;span class="na"&gt;data-width=&lt;/span&gt;&lt;span class="s"&gt;"1920"&lt;/span&gt; &lt;span class="na"&gt;data-height=&lt;/span&gt;&lt;span class="s"&gt;"1080"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;video&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"clip"&lt;/span&gt; &lt;span class="na"&gt;data-start=&lt;/span&gt;&lt;span class="s"&gt;"0"&lt;/span&gt; &lt;span class="na"&gt;data-duration=&lt;/span&gt;&lt;span class="s"&gt;"6"&lt;/span&gt;
         &lt;span class="na"&gt;data-track-index=&lt;/span&gt;&lt;span class="s"&gt;"0"&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"intro.mp4"&lt;/span&gt; &lt;span class="na"&gt;muted&lt;/span&gt; &lt;span class="na"&gt;playsinline&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/video&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;h1&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"title"&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"clip"&lt;/span&gt; &lt;span class="na"&gt;data-start=&lt;/span&gt;&lt;span class="s"&gt;"1"&lt;/span&gt; &lt;span class="na"&gt;data-duration=&lt;/span&gt;&lt;span class="s"&gt;"4"&lt;/span&gt;
      &lt;span class="na"&gt;data-track-index=&lt;/span&gt;&lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Launch day&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;audio&lt;/span&gt; &lt;span class="na"&gt;data-start=&lt;/span&gt;&lt;span class="s"&gt;"0"&lt;/span&gt; &lt;span class="na"&gt;data-duration=&lt;/span&gt;&lt;span class="s"&gt;"6"&lt;/span&gt; &lt;span class="na"&gt;data-track-index=&lt;/span&gt;&lt;span class="s"&gt;"2"&lt;/span&gt;
         &lt;span class="na"&gt;data-volume=&lt;/span&gt;&lt;span class="s"&gt;"0.5"&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"music.wav"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/audio&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"https://cdn.jsdelivr.net/npm/gsap@3/dist/gsap.min.js"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;gsap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeline&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;paused&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#title&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;opacity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;__timelines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;__timelines&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;__timelines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;launch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a complete six-second composition: background video, fade-in title, background music at half volume. Open &lt;code&gt;index.html&lt;/code&gt; in Chrome, hit play, and you can preview it without any tooling. Run &lt;code&gt;npx hyperframes render&lt;/code&gt; and the same file becomes an MP4.&lt;/p&gt;

&lt;p&gt;The trick that makes it work is the &lt;code&gt;paused: true&lt;/code&gt; GSAP timeline registered on &lt;code&gt;window.__timelines.&amp;lt;id&amp;gt;&lt;/code&gt;. The renderer doesn't play your animation in wall-clock time — it &lt;em&gt;seeks&lt;/em&gt; it. For each frame, it sets the timeline's progress to the correct point, waits for layout, and screenshots. Animations stay frame-accurate at 60fps and you can shard a render across Lambda workers without drift.&lt;/p&gt;

&lt;p&gt;For agents this is a huge unlock. Claude Code doesn't need to learn a new "compositions and sequences" mental model — it writes the kind of HTML it already writes ten times a day, sprinkles in a few data attributes from the skill prompt, and the framework handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install and first render
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Prereqs: Node 22+, FFmpeg&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;ffmpeg     &lt;span class="c"&gt;# macOS&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;ffmpeg &lt;span class="c"&gt;# Ubuntu/Debian&lt;/span&gt;

&lt;span class="c"&gt;# Scaffold a project&lt;/span&gt;
npx hyperframes init my-video
&lt;span class="nb"&gt;cd &lt;/span&gt;my-video

&lt;span class="c"&gt;# Preview in browser with live reload&lt;/span&gt;
npx hyperframes preview

&lt;span class="c"&gt;# Render to MP4&lt;/span&gt;
npx hyperframes render
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The init template gives you a working composition, a &lt;code&gt;hyperframes.config.json&lt;/code&gt;, and a folder structure that won't surprise anyone who's used Vite or Next. Preview runs on port 4200 by default with live reload. First render of a 10-second 1080p composition takes 35–50 seconds on an M2 Pro — not blazing, but the deterministic property means identical re-renders hit cache.&lt;/p&gt;

&lt;p&gt;The CLI is non-interactive by default: no &lt;code&gt;[y/N]&lt;/code&gt; prompts, no spinners that confuse an agent's pty parser. Whoever designed this clearly had &lt;code&gt;claude&lt;/code&gt; and &lt;code&gt;codex&lt;/code&gt; in mind, not humans typing into a terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using HyperFrames from Claude Code
&lt;/h2&gt;

&lt;p&gt;This is where it stops being "another Remotion alternative" and starts being interesting. One command installs the official skill:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add heygen-com/hyperframes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in a Claude Code session:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Using &lt;code&gt;/hyperframes&lt;/code&gt;, create a 10-second product intro for our blog post about HyperFrames. Title fades in at 1s, a background video plays the whole time, and there's a kinetic caption from 4s to 9s. Render at 1080p.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What I got back, end-to-end and unattended:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A scaffolded composition directory&lt;/li&gt;
&lt;li&gt;An &lt;code&gt;index.html&lt;/code&gt; with three tracks correctly wired&lt;/li&gt;
&lt;li&gt;A GSAP timeline with the fade-in and a slide-up caption&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;preview&lt;/code&gt; run, then a &lt;code&gt;render&lt;/code&gt; run, then a &lt;code&gt;ffprobe&lt;/code&gt; on the output to verify duration&lt;/li&gt;
&lt;li&gt;A summary message with the file path and runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the production loop the skills teach: &lt;em&gt;plan → write HTML → wire animation → add media → lint → preview → render&lt;/em&gt;. The lint step is doing more work than it sounds — it catches missing &lt;code&gt;data-duration&lt;/code&gt;, audio tracks that overrun the composition, and timeline IDs the renderer can't find. Those are the exact errors that would otherwise make an agent loop endlessly trying to figure out why its render is silent or black.&lt;/p&gt;

&lt;p&gt;After a week of using it daily, the main failure mode is the agent over-engineering animations: defaulting to Three.js when CSS transforms would do, or pulling in Lottie for what should be a &lt;code&gt;text-shadow&lt;/code&gt; keyframe. The catalog mostly fixes that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx hyperframes add flash-through-white   &lt;span class="c"&gt;# shader transition&lt;/span&gt;
npx hyperframes add instagram-follow      &lt;span class="c"&gt;# social overlay&lt;/span&gt;
npx hyperframes add data-chart            &lt;span class="c"&gt;# animated chart&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you point the agent at the catalog up front (&lt;code&gt;Use blocks from the HyperFrames catalog where possible&lt;/code&gt;), output quality goes up noticeably.&lt;/p&gt;

&lt;h2&gt;
  
  
  HyperFrames vs Remotion: the honest comparison
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.remotion.dev" rel="noopener noreferrer"&gt;Remotion&lt;/a&gt; has been the default answer for "video as code" since 2021. It's mature, it's well-documented, and Remotion Lambda is genuinely impressive. HyperFrames is openly inspired by it. So when does each one make sense?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;HyperFrames&lt;/th&gt;
&lt;th&gt;Remotion&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Authoring model&lt;/td&gt;
&lt;td&gt;HTML + CSS + seekable animation&lt;/td&gt;
&lt;td&gt;React components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build step&lt;/td&gt;
&lt;td&gt;None — &lt;code&gt;index.html&lt;/code&gt; plays as-is&lt;/td&gt;
&lt;td&gt;Bundler required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent handoff&lt;/td&gt;
&lt;td&gt;Plain HTML files&lt;/td&gt;
&lt;td&gt;JSX/TSX in a React project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Animation timing&lt;/td&gt;
&lt;td&gt;Frame-accurate via seek adapters&lt;/td&gt;
&lt;td&gt;Wall-clock; care needed for determinism&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distributed rendering&lt;/td&gt;
&lt;td&gt;Local + AWS Lambda&lt;/td&gt;
&lt;td&gt;Remotion Lambda (mature, more features)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Source-available Remotion License&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commercial cost above N renders&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Paid tiers above thresholds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Component ecosystem&lt;/td&gt;
&lt;td&gt;Catalog (small but growing)&lt;/td&gt;
&lt;td&gt;Large React ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning curve for a React team&lt;/td&gt;
&lt;td&gt;Slight relearn&lt;/td&gt;
&lt;td&gt;Already there&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning curve for a coding agent&lt;/td&gt;
&lt;td&gt;Trivial&lt;/td&gt;
&lt;td&gt;Higher (full React project mental model)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you have a React team already shipping Remotion compositions, there's no reason to switch — Remotion is more mature for human authors and its Lambda story has years of polish on HyperFrames'. If you're building an automated content pipeline where the &lt;em&gt;author&lt;/em&gt; is an agent, HyperFrames is genuinely a step change. The Apache 2.0 license also matters at scale: HeyGen explicitly removed the per-render and seat-count thresholds that creep into the Remotion License once you're past hobby usage.&lt;/p&gt;

&lt;p&gt;The deterministic-output property is a real technical win regardless of which side you're on. Byte-identical MP4s from byte-identical HTML means you can put renders behind a content hash and skip rebuilds, run regression tests with &lt;code&gt;diff&lt;/code&gt;, and shard work across workers without seams.&lt;/p&gt;

&lt;h2&gt;
  
  
  What people are saying
&lt;/h2&gt;

&lt;p&gt;The launch hit r/ClaudeCode and r/ClaudeAI in mid-April and the reaction has been more measured than the typical "next-Remotion-killer" thread. A representative slice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;"Output is byte-identical across runs, so CI caching and shard-parallel rendering work. There is a frame-adapter pattern that lets GSAP, Lottie, CSS, Three.js, and (experimentally) Remotion coexist in one composition."&lt;/em&gt; — top comment on the r/ClaudeAI launch thread, by a contributor&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"Setup isn't trivial and sometimes you spend more time debugging [Remotion] than creating. So when I saw HyperFrames calling itself 'agent-native' I got curious."&lt;/em&gt; — r/buildinpublic&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"This is the right primitive. Video as deterministic function of HTML. Everything else is a wrapper."&lt;/em&gt; — top reply on the r/coolgithubprojects thread&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The skeptical takes are mostly about ecosystem maturity ("the catalog is thin compared to React/Remotion ecosystem") and about whether HTML is really the right abstraction for complex sequences. Both are fair. The counter-argument from the maintainers is that "agents write HTML" beats "humans write JSX" by enough margin to justify the smaller initial surface area.&lt;/p&gt;

&lt;p&gt;Worth noting: the &lt;a href="https://ai-engineering-trend.medium.com/heygens-hyperframes-the-open-source-framework-challenging-remotion-in-html-based-video-creation-c10437f0afca" rel="noopener noreferrer"&gt;Medium piece from AI Engineering&lt;/a&gt; and &lt;a href="https://theagenttimes.com/articles/heygen-open-sources-hyperframes-giving-us-a-deterministic-vi-10740a56" rel="noopener noreferrer"&gt;The Agent Times&lt;/a&gt; both lead with "deterministic video primitive" rather than "Remotion killer," and I think that's the more accurate framing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations after a week
&lt;/h2&gt;

&lt;p&gt;Things that bit me, in rough order of severity:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Render speed isn't competitive on a single machine.&lt;/strong&gt; Headless Chrome seek-per-frame is slower than Remotion's parallel-frame approach for simple compositions. A 30-second 1080p render that takes ~3 minutes locally takes ~25 seconds on Remotion Lambda. The HyperFrames AWS Lambda story closes that gap but the bundle and setup is heavier than Remotion's.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Audio sync is fiddly with seek-based animation.&lt;/strong&gt; Because animations are seeked rather than played, audio that's tightly choreographed to timeline events needs careful &lt;code&gt;data-start&lt;/code&gt; math. The default audio mixer handles offsets and volume fine, but Lottie-style audio-reactive animation isn't really a supported pattern yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The catalog is small.&lt;/strong&gt; Right now it's transitions, overlays, captions, charts, and a handful of effects. If you want a Lottie-style ecosystem of polished components, you're going to build them. The blocks that exist are well-made, but expect to write a lot of HTML.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Git LFS for the test baselines is ~240MB.&lt;/strong&gt; If you're cloning to contribute, the &lt;code&gt;GIT_LFS_SKIP_SMUDGE=1&lt;/code&gt; flag in the README is your friend. The repo will build fine without the LFS content; you just can't run the visual regression suite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. No first-party React adapter (yet).&lt;/strong&gt; There's an experimental Remotion frame adapter, but if you have a deep React component library you want to reuse, you're going to wrap it yourself or wait. For greenfield work this doesn't matter; for migrations it might.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. The Studio is "available, evolving."&lt;/strong&gt; The browser-based editor exists but it's clearly not the primary surface — that's the agent skills. If you want a polished GUI editor with timeline scrubbing and visual track management, this isn't it yet.&lt;/p&gt;

&lt;p&gt;None of these are deal-breakers. The framework is doing exactly what it advertises; it's just early.&lt;/p&gt;

&lt;h2&gt;
  
  
  When HyperFrames is the right call
&lt;/h2&gt;

&lt;p&gt;Reach for it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building an automated content pipeline where an agent generates the video (newsletter highlights, PR changelog reels, social posts from blog content)&lt;/li&gt;
&lt;li&gt;You need deterministic output for CI caching, regression tests, or shard-parallel rendering&lt;/li&gt;
&lt;li&gt;The Apache 2.0 license matters because you'll be doing thousands of renders&lt;/li&gt;
&lt;li&gt;Your team is comfortable writing HTML/CSS/JS and doesn't want to adopt React just to make a video&lt;/li&gt;
&lt;li&gt;You want a low-ceremony local preview that doesn't require a dev server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Skip it (for now) when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You already have a Remotion pipeline that works and a React team to maintain it&lt;/li&gt;
&lt;li&gt;You need polished visual editing — Studio isn't there yet&lt;/li&gt;
&lt;li&gt;You need the maximum-throughput cloud render path with zero setup; Remotion Lambda is more mature&lt;/li&gt;
&lt;li&gt;Your sequences are heavily React-component-driven and porting them isn't worth it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Is HyperFrames actually open source, or is this a HeyGen-services play with an OSS wrapper?&lt;/strong&gt;&lt;br&gt;
A: It's Apache 2.0 with no usage caps, no telemetry requirement, and no "you must use our cloud for renders" clause. The Lambda render package is a self-deploy SDK — you stand up the Lambda functions in your own AWS account. HeyGen presumably benefits from ecosystem mindshare and from people who eventually want hosted authoring, but the OSS framework stands on its own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Do I need HeyGen credentials, an API key, or any cloud account to use it?&lt;/strong&gt;&lt;br&gt;
A: No. Local CLI usage requires nothing except Node 22+ and FFmpeg. The AWS Lambda path requires an AWS account (yours, not HeyGen's). &lt;a href="https://www.hyperframes.dev/" rel="noopener noreferrer"&gt;hyperframes.dev&lt;/a&gt; is an optional community playground for sharing compositions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How does it compare to Manim or Motion Canvas?&lt;/strong&gt;&lt;br&gt;
A: Manim and Motion Canvas are oriented toward mathematical/educational animation with their own DSLs. HyperFrames is generalist video composition with HTML as the format. For lecture-style geometric animations Manim is still better; for product videos, social cuts, or dashboards-to-video, HyperFrames fits better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I use existing React components inside a HyperFrames composition?&lt;/strong&gt;&lt;br&gt;
A: Sort of. You can render React to a DOM node and have HyperFrames screenshot it like anything else, but you'll lose the determinism benefits if your components rely on wall-clock state. The clean integration story is to render React to static HTML+CSS first and let HyperFrames handle timing. The experimental Remotion adapter is the closest thing to first-class React support today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What's the actual ceiling — could I render a feature-length film with this?&lt;/strong&gt;&lt;br&gt;
A: Technically yes; practically no one would. HyperFrames is optimised for short-to-medium compositions (seconds to a few minutes) where determinism, agent authorship, and HTML-native authoring are the wins. For anything over 10 minutes you'd want a different toolchain — or you'd want to compose many HyperFrames outputs together with FFmpeg directly. The deterministic primitive does make that kind of pipeline easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Will HeyGen keep maintaining this or quietly abandon it?&lt;/strong&gt;&lt;br&gt;
A: It's used in production at HeyGen itself, commits come from active engineers there, and HeyGen staff respond on the community Discord. Listed adopters include tldraw and TanStack. The risk profile feels similar to corporate-sponsored OSS that's load-bearing for the sponsor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;HyperFrames is the first video framework I've used where "AI agent writes the video" feels like the &lt;em&gt;intended&lt;/em&gt; workflow rather than a happy accident. The HTML-as-format bet is genuinely insightful: it dramatically lowers the surface area an agent has to learn while giving humans an authoring model they already know. The deterministic rendering property is the kind of unsexy technical decision that pays off for years.&lt;/p&gt;

&lt;p&gt;It's not as mature as Remotion. The catalog is thin, render throughput on a single machine is unimpressive, and the Studio is still a work in progress. But for the specific job of "wire video generation into an automated content pipeline driven by Claude Code or similar," nothing else comes close right now.&lt;/p&gt;

&lt;p&gt;If you're building agent workflows that need to produce video, install the skill and try a prompt today. If you're a Remotion shop with React infrastructure and human authors, there's no need to switch. If you're somewhere in between, watch the catalog and Studio progress over the next two quarters — that's where the gap will close.&lt;/p&gt;

&lt;p&gt;For me, this goes on the short list of repos I expect to be load-bearing in agent stacks by the end of 2026.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Try it:&lt;/strong&gt; &lt;a href="https://github.com/heygen-com/hyperframes" rel="noopener noreferrer"&gt;github.com/heygen-com/hyperframes&lt;/a&gt; · &lt;a href="https://hyperframes.heygen.com/introduction" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; · &lt;a href="https://hyperframes.heygen.com/catalog/blocks/data-chart" rel="noopener noreferrer"&gt;Catalog&lt;/a&gt; · &lt;a href="https://discord.gg/EbK98HBPdk" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/p&gt;

</description>
      <category>hyperframes</category>
      <category>heygen</category>
      <category>videogeneration</category>
      <category>htmltovideo</category>
    </item>
    <item>
      <title>OpenBrief Review: Local-First Video AI Summarizer 2026</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Wed, 27 May 2026 11:07:28 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/openbrief-review-local-first-video-ai-summarizer-2026-1k3n</link>
      <guid>https://dev.to/andrew-ooo/openbrief-review-local-first-video-ai-summarizer-2026-1k3n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/openbrief-local-first-video-summarizer-review/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenBrief&lt;/strong&gt; is an open-source desktop app that does what most people actually want from a "video AI tool": &lt;strong&gt;paste a link, get a transcript, get a grounded summary, and chat with the content — without uploading anything to a SaaS.&lt;/strong&gt; It hit &lt;a href="https://news.ycombinator.com/item?id=48272393" rel="noopener noreferrer"&gt;88 points on Show HN&lt;/a&gt; on May 26, 2026, and it's basically a polished GUI for &lt;code&gt;yt-dlp&lt;/code&gt; + Whisper with an LLM layer wired in for summaries and Q&amp;amp;A.&lt;/p&gt;

&lt;p&gt;Key facts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tauri v2 desktop app&lt;/strong&gt; — macOS, Windows, and Linux from a single Rust/React codebase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bundled &lt;code&gt;yt-dlp&lt;/code&gt;&lt;/strong&gt; — paste a YouTube/Vimeo/arbitrary URL, it downloads locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-device transcription&lt;/strong&gt; — Whisper (via &lt;code&gt;whisper.cpp&lt;/code&gt; and &lt;code&gt;transcribe-rs&lt;/code&gt;) runs against the audio on your machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bring-your-own LLM&lt;/strong&gt; — OpenAI GPT, Anthropic Claude, Google Gemini, or OpenRouter (DeepSeek) for summaries and chat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounded summaries&lt;/strong&gt; — markdown briefs with timestamped takeaways tied back to transcript spans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat with media&lt;/strong&gt; — Q&amp;amp;A against the transcript using the LLM of your choice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-to-speech&lt;/strong&gt; — listen to the generated brief (Supertonic 3 / Qwen3-TTS on the roadmap)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AGPL-3.0 licensed&lt;/strong&gt;, monorepo organized as a pnpm/Turborepo workspace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Honest limitation&lt;/strong&gt;: LLM summaries are still cloud-by-default — local Gemma 4 / Qwen3-ASR support is on the roadmap but not shipped yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've ever wanted a private alternative to NotebookLM or Read.ai for your own video backlog, OpenBrief is the most complete open-source attempt to date.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Local-First" Video AI Suddenly Matters
&lt;/h2&gt;

&lt;p&gt;For three years, the default flow for "summarize this video" has been: &lt;strong&gt;upload to a SaaS, get a summary, hope they don't train on it.&lt;/strong&gt; Read.ai, Fireflies, Otter, NotebookLM — they all do good work, but every minute of audio you feed them is a minute on someone else's GPU.&lt;/p&gt;

&lt;p&gt;For a lot of use cases that's fine. For others — legal depositions, internal all-hands recordings, anything under NDA, anything you don't want to count toward a per-minute pricing tier — it isn't. The market quietly noticed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;yt-dlp&lt;/code&gt; keeps gaining stars (currently 100K+) because people want their videos as files, not stream-only assets&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;whisper.cpp&lt;/code&gt; made on-device transcription genuinely usable on a MacBook&lt;/li&gt;
&lt;li&gt;Local LLMs (Llama 3, Qwen 3, Gemma 3, DeepSeek V4) crossed the "good enough for summaries" line in 2025&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What was missing was a polished desktop app that stitched these together with a sane UX. OpenBrief is the first Show HN this year that does it without feeling like a research demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes OpenBrief Different
&lt;/h2&gt;

&lt;p&gt;OpenBrief isn't doing any one thing nobody else has done. The interesting part is how it composes them.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Tauri v2 instead of Electron
&lt;/h3&gt;

&lt;p&gt;Most "local AI" desktop apps reach for Electron because the team already knows React. OpenBrief uses &lt;strong&gt;Tauri v2&lt;/strong&gt;, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rust backend, system webview frontend — installer is ~10MB instead of ~150MB&lt;/li&gt;
&lt;li&gt;Native filesystem access for the helper sidecar (yt-dlp, ffmpeg) without IPC gymnastics&lt;/li&gt;
&lt;li&gt;Lower memory footprint while a 90-minute Whisper transcription is running&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;src-tauri/&lt;/code&gt; directory exposes Rust commands that the React renderer calls. The helper sidecar — a separate binary that wraps yt-dlp and ffmpeg — is bundled at build time via &lt;code&gt;pnpm setup:dev-sidecars&lt;/code&gt;. This is the right shape for a desktop tool that needs to shell out to native binaries; it avoids the "user has to install yt-dlp themselves" trap that kills adoption of a lot of similar projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Grounded summaries, not free-floating ones
&lt;/h3&gt;

&lt;p&gt;A "grounded summary" in OpenBrief means each bullet in the generated brief is tied to a &lt;strong&gt;timestamp span&lt;/strong&gt; in the original transcript. Click a takeaway → jump to the spot in the audio/video. This is the same idea NotebookLM popularized with its citation-numbered answers, and it matters for one reason: &lt;strong&gt;summaries hallucinate, transcripts don't.&lt;/strong&gt; When the summary says "the speaker proposed a 40% price cut at 12:34," you can verify that's actually what they said.&lt;/p&gt;

&lt;p&gt;The implementation is straightforward — the prompt template forces the LLM to emit JSON with &lt;code&gt;{ text, start_ts, end_ts }&lt;/code&gt; for each point, and the UI renders them as clickable chips. If you've built RAG over transcripts before, you've written approximately this code; OpenBrief just ships it as the default.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Pluggable model layer
&lt;/h3&gt;

&lt;p&gt;The README's "Model Support" table is honest about what's shipped vs. roadmap:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model type&lt;/th&gt;
&lt;th&gt;Supported today&lt;/th&gt;
&lt;th&gt;Roadmap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speech-to-text&lt;/td&gt;
&lt;td&gt;Whisper&lt;/td&gt;
&lt;td&gt;Parakeet, Qwen3-ASR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text-to-speech&lt;/td&gt;
&lt;td&gt;(placeholder)&lt;/td&gt;
&lt;td&gt;Supertonic 3, Qwen3-TTS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;OpenAI, Anthropic, Gemini, OpenRouter (DeepSeek)&lt;/td&gt;
&lt;td&gt;Local Gemma 4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Video embeddings&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Frame/clip semantic search&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The architectural win is that all four slots are &lt;strong&gt;service abstractions&lt;/strong&gt;, not hardcoded. You wire your key into settings, the rest of the app doesn't care. Adding a local LLM later is "implement the LLM service interface against Ollama" — not "rewrite the summary pipeline."&lt;/p&gt;

&lt;h2&gt;
  
  
  Install and First Run
&lt;/h2&gt;

&lt;p&gt;OpenBrief is currently distributed as a source build — the Tauri team hasn't pushed signed installers yet. You need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Prerequisites&lt;/span&gt;
&lt;span class="c"&gt;# - Node.js ^22.21.0&lt;/span&gt;
&lt;span class="c"&gt;# - pnpm 11.0.9&lt;/span&gt;
&lt;span class="c"&gt;# - Rust + Cargo&lt;/span&gt;
&lt;span class="c"&gt;# - Tauri v2 platform prerequisites for your OS&lt;/span&gt;
&lt;span class="c"&gt;#   (Xcode on macOS, MSVC on Windows, webkit2gtk on Linux)&lt;/span&gt;

git clone https://github.com/tantara/openbrief
&lt;span class="nb"&gt;cd &lt;/span&gt;openbrief/client
pnpm &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# If pnpm flags ignored native build scripts on first install:&lt;/span&gt;
pnpm approve-builds  &lt;span class="c"&gt;# approve the listed packages&lt;/span&gt;
pnpm &lt;span class="nb"&gt;install&lt;/span&gt;         &lt;span class="c"&gt;# rerun&lt;/span&gt;

&lt;span class="c"&gt;# Build and run the desktop app&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;apps/tauri
pnpm setup:dev-sidecars     &lt;span class="c"&gt;# builds the yt-dlp helper sidecar&lt;/span&gt;
pnpm prepare:media-assets   &lt;span class="c"&gt;# downloads Whisper model + ffmpeg&lt;/span&gt;
pnpm dev                    &lt;span class="c"&gt;# launches the desktop window&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First-run experience:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;App opens to an empty library.&lt;/li&gt;
&lt;li&gt;Paste a YouTube URL into the import bar.&lt;/li&gt;
&lt;li&gt;The helper sidecar shells out to yt-dlp, downloads the audio (or video), and stores it in your local app data directory.&lt;/li&gt;
&lt;li&gt;Whisper kicks off in a background worker. On an M2 MacBook Air, a 30-minute video transcribes in ~3 minutes against &lt;code&gt;whisper-base&lt;/code&gt;. The bundled model size and the underlying &lt;code&gt;whisper.cpp&lt;/code&gt; build determine real-world speed.&lt;/li&gt;
&lt;li&gt;Once the transcript exists, the &lt;strong&gt;Summarize&lt;/strong&gt; button calls your configured LLM with the grounded-summary prompt.&lt;/li&gt;
&lt;li&gt;The right-hand pane shows the brief with clickable timestamps; the chat box at the bottom lets you ask follow-up questions against the transcript.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Configuring API keys
&lt;/h3&gt;

&lt;p&gt;Settings → Models. Each provider gets its own key field:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt;: &lt;code&gt;sk-...&lt;/code&gt; — used for &lt;code&gt;gpt-4o&lt;/code&gt;, &lt;code&gt;gpt-4o-mini&lt;/code&gt;, or whatever model id you set&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic&lt;/strong&gt;: &lt;code&gt;sk-ant-...&lt;/code&gt; — Claude models for summary + chat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google&lt;/strong&gt;: Gemini API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenRouter&lt;/strong&gt;: &lt;code&gt;sk-or-...&lt;/code&gt; — routes to DeepSeek and 100+ others, cheap for bulk summarization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keys are stored in the platform secure store (Keychain on macOS, Credential Manager on Windows, libsecret on Linux), not in plaintext config.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Use Cases
&lt;/h2&gt;

&lt;p&gt;A few patterns that justify the install:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conference talk backlog.&lt;/strong&gt; I have ~40 saved YouTube talks I'll "watch later." OpenBrief into a library, summarize each, scan the briefs, watch the three that earn it. This is the use case the Show HN comments kept coming back to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Podcast research.&lt;/strong&gt; Drop a 90-minute episode in, ask: "Did they mention any specific tools or products?" The chat-with-transcript flow surfaces names you'd never catch on a normal listen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Internal recordings.&lt;/strong&gt; All-hands, customer interviews, sales calls — anything you can't or shouldn't upload to a SaaS. The transcript stays on disk, the summary is generated by your own API key (which you can route to a self-hosted LLM via OpenRouter or an OpenAI-compatible proxy).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Course / lecture notes.&lt;/strong&gt; Long-form educational content compresses well into timestamped briefs. The grounded format makes it easy to re-watch the parts you actually need.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Show HN Thread Said
&lt;/h2&gt;

&lt;p&gt;Sentiment in the &lt;a href="https://news.ycombinator.com/item?id=48272393" rel="noopener noreferrer"&gt;HN thread&lt;/a&gt; was warm but pointed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Basically a GUI for yt-dlp with AI on top"&lt;/strong&gt; — the maintainer's own characterization, and the thread mostly agreed that's a feature, not a bug. Wrapping a powerful CLI tool in a clean desktop UX has product value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM gap&lt;/strong&gt; — the most common request was support for Ollama / llama.cpp for the summary step. Right now BYO-key works fine but the "local-first" claim has an asterisk while the LLM call still goes to a cloud API by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparisons to NotebookLM&lt;/strong&gt; — several commenters framed it as "the open-source NotebookLM I've been waiting for," which is roughly accurate for video/audio specifically. NotebookLM still has the edge on multi-source synthesis and audio overviews.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why Tauri over Electron&lt;/strong&gt; — a few people appreciated the small installer size; Tauri v2 is becoming the default pick for new local AI tools (Jan, LM Studio's recent versions, Bolt, OpenBrief).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Who Should Use This (and Who Shouldn't)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use OpenBrief if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You watch / produce a lot of long-form video or audio content&lt;/li&gt;
&lt;li&gt;You want transcripts and summaries to stay on your machine&lt;/li&gt;
&lt;li&gt;You already pay for at least one LLM API and want to extract more value from that key&lt;/li&gt;
&lt;li&gt;You're comfortable with a source build for now (signed installers coming)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip it (or wait) if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want fully local end-to-end — local Gemma 4 / Qwen3-ASR are on the roadmap but not shipped&lt;/li&gt;
&lt;li&gt;You need team collaboration features — this is a single-user desktop app today&lt;/li&gt;
&lt;li&gt;You need real-time meeting transcription — OpenBrief is post-hoc, not live (use &lt;a href="https://github.com/collabora/WhisperLive" rel="noopener noreferrer"&gt;Whisper-Live&lt;/a&gt; or Fireflies for that)&lt;/li&gt;
&lt;li&gt;You're on a low-spec machine — Whisper transcription is CPU/GPU-heavy; a 90-minute video takes meaningful time on older hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison with Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Local-first&lt;/th&gt;
&lt;th&gt;Open source&lt;/th&gt;
&lt;th&gt;Video download&lt;/th&gt;
&lt;th&gt;Grounded summary&lt;/th&gt;
&lt;th&gt;Chat&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenBrief&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ (transcript)&lt;/td&gt;
&lt;td&gt;✅ AGPL-3.0&lt;/td&gt;
&lt;td&gt;✅ yt-dlp&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Free + your LLM API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NotebookLM&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Free (Google account)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read.ai&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;$0.04/min uploads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Otter.ai&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;$8.33/mo+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacWhisper&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;$20–60 one-time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Whisper.cpp + custom&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;DIY&lt;/td&gt;
&lt;td&gt;DIY&lt;/td&gt;
&lt;td&gt;Free + dev time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest read: &lt;strong&gt;NotebookLM is still better for multi-source research with audio overviews.&lt;/strong&gt; OpenBrief is better when privacy or per-minute cost is the constraint, or when you want one place for your personal video library.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: How a Single Import Actually Flows
&lt;/h2&gt;

&lt;p&gt;For developers considering forking or contributing, here's the request lifecycle when you paste a URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[React UI: paste URL]
       │
       ▼  Tauri command: import_url
[Rust core] ──► [Helper sidecar: yt-dlp]
       │              │
       │              └─► writes .mp4/.m4a to app data dir
       │
       ▼  Tauri command: start_transcription
[Whisper worker (whisper.cpp via transcribe-rs)]
       │
       └─► writes transcript.json with { segments: [{start, end, text}] }
              │
              ▼  React reads transcript, renders timeline
              │
              ▼  User clicks "Summarize"
       [LLM service: OpenAI/Anthropic/Gemini/OpenRouter]
              │  prompt enforces grounded JSON schema
              ▼
       [Summary stored with timestamp links → rendered as clickable chips]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every step is replaceable. The &lt;code&gt;LLMService&lt;/code&gt; interface in &lt;code&gt;packages/api/&lt;/code&gt; is a few dozen lines; bolting on Ollama is straightforward, and the GitHub Issues already have a PR draft from a contributor doing exactly that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations Worth Knowing About
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;"Local-first" still routes summaries to cloud LLMs by default.&lt;/strong&gt; Until local Gemma 4 ships, the &lt;em&gt;transcript&lt;/em&gt; stays local but the &lt;em&gt;summary prompt&lt;/em&gt; (which includes the transcript) goes to whichever API key you configured. If that's a dealbreaker, run a local OpenAI-compatible endpoint like &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; or &lt;a href="https://github.com/ollama/ollama/blob/main/docs/openai.md" rel="noopener noreferrer"&gt;Ollama's OpenAI-compatible mode&lt;/a&gt; and point OpenBrief at &lt;code&gt;http://localhost:11434&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No signed installers yet.&lt;/strong&gt; Source build only on May 2026. On macOS that means &lt;code&gt;xcode-select --install&lt;/code&gt; is a prerequisite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper model is fixed at install time.&lt;/strong&gt; Switching from &lt;code&gt;base&lt;/code&gt; to &lt;code&gt;large-v3&lt;/code&gt; for higher accuracy requires re-running &lt;code&gt;pnpm prepare:media-assets&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AGPL-3.0&lt;/strong&gt; — fine for personal use and internal tools; if you want to fork it into a SaaS, the AGPL terms apply to network use. Read them before commercial deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTS is a placeholder.&lt;/strong&gt; The "listen back" feature in the README requires Supertonic 3 / Qwen3-TTS support that isn't merged yet.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is OpenBrief free?
&lt;/h3&gt;

&lt;p&gt;Yes — the app itself is open source under AGPL-3.0 and there's no subscription. You pay for the LLM API calls (your OpenAI / Anthropic / Gemini / OpenRouter key), which run a few cents per long video at current &lt;code&gt;gpt-4o-mini&lt;/code&gt; or DeepSeek pricing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does OpenBrief work fully offline?
&lt;/h3&gt;

&lt;p&gt;Transcription does — Whisper runs on your machine. Summaries and chat &lt;strong&gt;don't&lt;/strong&gt; today, because they call your configured cloud LLM. Local LLM support (Gemma 4) is on the roadmap. As a workaround, point the OpenAI provider field at an Ollama or LiteLLM endpoint running on localhost.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does OpenBrief compare to NotebookLM?
&lt;/h3&gt;

&lt;p&gt;NotebookLM is stronger for multi-source synthesis (PDF + video + notes in one notebook) and ships an audio-overview generator that OpenBrief doesn't have yet. OpenBrief is stronger for &lt;strong&gt;privacy&lt;/strong&gt; (everything stays on disk except the LLM call, which goes through your key), &lt;strong&gt;format flexibility&lt;/strong&gt; (any video URL yt-dlp supports), and &lt;strong&gt;library management&lt;/strong&gt; (it's built around a long-running personal collection rather than per-notebook sessions).&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use OpenBrief with a self-hosted LLM?
&lt;/h3&gt;

&lt;p&gt;Yes — any OpenAI-compatible endpoint works. Configure the OpenAI provider with your local URL (e.g., &lt;code&gt;http://localhost:11434/v1&lt;/code&gt; for Ollama or LiteLLM) and a placeholder API key. End-to-end local pipelines work today with this trick; the official "local LLM" UI option is still on the roadmap.&lt;/p&gt;

&lt;h3&gt;
  
  
  What videos can it download?
&lt;/h3&gt;

&lt;p&gt;Anything &lt;code&gt;yt-dlp&lt;/code&gt; supports — that's YouTube, Vimeo, Twitch VODs, X/Twitter videos, podcasts hosted on most major platforms, and ~1,800 sites total. You can also import local &lt;code&gt;.mp4&lt;/code&gt;, &lt;code&gt;.m4a&lt;/code&gt;, &lt;code&gt;.mp3&lt;/code&gt;, &lt;code&gt;.wav&lt;/code&gt; files directly without downloading.&lt;/p&gt;

&lt;h3&gt;
  
  
  How fast is Whisper transcription?
&lt;/h3&gt;

&lt;p&gt;Depends on the model and hardware. On an M2 MacBook Air with &lt;code&gt;whisper-base&lt;/code&gt;, expect ~10x realtime (a 30-minute video transcribes in ~3 minutes). Larger models (&lt;code&gt;large-v3&lt;/code&gt;) are slower but more accurate; M-series with Metal acceleration is the sweet spot. CPU-only on older Intel hardware can be 1–2x realtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is there a Linux or Windows build?
&lt;/h3&gt;

&lt;p&gt;The repo says all three platforms are supported, but since signed installers aren't published yet, Linux and Windows users build from source the same way macOS users do. The Tauri v2 prerequisites differ per platform — check the &lt;a href="https://tauri.app" rel="noopener noreferrer"&gt;Tauri docs&lt;/a&gt; for your distro.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;OpenBrief is the &lt;strong&gt;first respectable open-source attempt at a personal video AI library&lt;/strong&gt; — the kind of tool that sits between yt-dlp (too low-level) and NotebookLM (too cloud). It's not finished — local LLM support and signed installers are the two big gaps — but the architecture is right, the maintainer is active in the Show HN thread, and the AGPL license keeps it honest.&lt;/p&gt;

&lt;p&gt;If you want to try it: clone the repo, run &lt;code&gt;pnpm dev:tauri&lt;/code&gt;, and feed it one of your "watch later" videos. If the grounded-summary UX clicks, you'll probably end up importing your whole backlog within a day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/tantara/openbrief" rel="noopener noreferrer"&gt;github.com/tantara/openbrief&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Show HN:&lt;/strong&gt; &lt;a href="https://news.ycombinator.com/item?id=48272393" rel="noopener noreferrer"&gt;news.ycombinator.com/item?id=48272393&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openbrief</category>
      <category>tauri</category>
      <category>whisper</category>
      <category>videosummarizer</category>
    </item>
    <item>
      <title>Honcho Review: Plastic Labs' Agent Memory Layer (2026)</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Tue, 26 May 2026 11:06:19 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/honcho-review-plastic-labs-agent-memory-layer-2026-2kb4</link>
      <guid>https://dev.to/andrew-ooo/honcho-review-plastic-labs-agent-memory-layer-2026-2kb4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/honcho-plastic-labs-agent-memory-review/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Honcho&lt;/strong&gt; is open-source memory infrastructure for stateful AI agents, built by Plastic Labs and released under a permissive license. It's currently trending on GitHub — &lt;strong&gt;4,301 stars total&lt;/strong&gt; with &lt;strong&gt;644 new stars this week&lt;/strong&gt; — and it just posted state-of-the-art numbers on three independent agent-memory benchmarks. The highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning-first memory&lt;/strong&gt;: extracts &lt;em&gt;conclusions&lt;/em&gt; from conversations and events, not just chunks to match later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peer-centric model&lt;/strong&gt;: users, AI agents, groups, projects, and ideas are all first-class "peers" that change over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;90.4% on LongMem S&lt;/strong&gt; (92.6% with Gemini 3 Pro), &lt;strong&gt;89.9% on LoCoMo&lt;/strong&gt;, top scores on BEAM — all while using a median 5% of the available context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed or self-hosted&lt;/strong&gt;: hit &lt;code&gt;api.honcho.dev&lt;/code&gt; ($100 free credits on signup) or run the FastAPI server yourself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First-class integrations&lt;/strong&gt; for Claude Code, OpenCode, Cursor, Windsurf, Cline, and any MCP client&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python + TypeScript SDKs&lt;/strong&gt;, PostgreSQL + pgvector for storage, Redis for caching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache 2.0 / AGPL-style licensing&lt;/strong&gt; (check repo for current terms before commercial deployment)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've ever shipped an AI assistant that &lt;em&gt;forgets your user between sessions&lt;/em&gt; — and then watched retention crater because of it — Honcho is the most credible open-source attempt yet at fixing that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repository&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;github.com/plastic-labs/honcho&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vendor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Plastic Labs (VC-backed, NYC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python (FastAPI server) + TS/Python SDKs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4,301 (+644 this week)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Install (Python)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install honcho-ai&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Install (Node)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;npm install @honcho-ai/sdk&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosted endpoint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;api.honcho.dev&lt;/code&gt; (signup → $100 free credits)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Self-host stack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FastAPI + PostgreSQL (pgvector) + Redis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP endpoint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://mcp.honcho.dev&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What Is Honcho?
&lt;/h2&gt;

&lt;p&gt;Most "agent memory" libraries on the market are basically &lt;strong&gt;wrappers around a vector database&lt;/strong&gt;. You store messages, embed them, and retrieve the top-k similar chunks at query time. That works fine for trivia recall, but it fails the moment your user says "I prefer concise answers" in session 1 and you want session 47 to &lt;em&gt;act on that preference&lt;/em&gt; — the relevant chunk is buried, the embedding is fuzzy, and you'd need a perfect query to surface it.&lt;/p&gt;

&lt;p&gt;Honcho takes a different angle. It treats memory as a &lt;strong&gt;reasoning problem&lt;/strong&gt;, not a retrieval problem. When messages arrive, a small fine-tuned model extracts latent information — preferences, beliefs, facts, contradictions — and writes it into a structured &lt;strong&gt;representation&lt;/strong&gt; of the speaker. In the background, the system "dreams" across ingested messages and prior reasoning, drawing new inferences over time. When you query Honcho, you don't search a vector store; you ask a &lt;em&gt;research agent&lt;/em&gt; a natural-language question, and it returns a synthesized answer.&lt;/p&gt;

&lt;p&gt;The other thing Honcho gets right is &lt;strong&gt;multi-peer modeling&lt;/strong&gt;. Most memory libraries assume a single "user" entity. Honcho's primitive is the &lt;strong&gt;peer&lt;/strong&gt; — and a peer can be a human, an AI agent, a project, or even an idea. Sessions are many-to-many with peers, which means you can model "what does Alice know about Bob" or "what does the support-agent persona think the customer wants" cleanly. For multi-agent systems this is a much better fit than shoehorning everything into a user/assistant pair.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It's Trending Now
&lt;/h2&gt;

&lt;p&gt;Three forces are converging on agent memory in mid-2026:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Long-context isn't a memory replacement.&lt;/strong&gt; Frontier models now ship with 1M+ token windows, but Plastic Labs' own LongMem benchmark shows that &lt;strong&gt;just dumping context in&lt;/strong&gt; drops Claude Haiku 4.5 from 89.2% (oracle) to 62.6% (full haystack) — a 26.6 point drop. More tokens ≠ better recall. Models need a &lt;em&gt;structured&lt;/em&gt; memory layer to perform well at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP is the new integration substrate.&lt;/strong&gt; Honcho ships a hosted MCP endpoint at &lt;code&gt;mcp.honcho.dev&lt;/code&gt;, so Claude Code, Cursor, Cline, Windsurf, and Codex CLI can all add it with a single command. The team has also published official plugins for Claude Code, OpenCode, and Hermes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The market matured past "Mem0 or roll your own."&lt;/strong&gt; The first wave of memory libraries (Mem0, Letta, Zep) trained the market on the &lt;em&gt;concept&lt;/em&gt; of agent memory. The second wave — Honcho, Hindsight, Supermemory, Holographic — is competing on &lt;strong&gt;benchmarks and reasoning depth&lt;/strong&gt;. Honcho's recent benchmark results put it at the top of the leaderboard for LongMem and LoCoMo.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The end result: every team building a stateful assistant in mid-2026 is shopping for a memory stack. Honcho is one of maybe four credible options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features (With Code)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Honcho Loop: Store → Reason → Query → Inject
&lt;/h3&gt;

&lt;p&gt;The mental model is a four-step loop. You store messages, Honcho reasons in the background, you query for context or insights, and you inject the result into your model of choice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;honcho&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Honcho&lt;/span&gt;

&lt;span class="n"&gt;honcho&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Honcho&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;workspace_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-app-testing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HONCHO_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Store: peers and messages on a session
&lt;/span&gt;&lt;span class="n"&gt;alice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;honcho&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;peer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tutor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;honcho&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;peer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tutor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;honcho&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;alice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hey there — can you help me with my math homework?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tutor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Absolutely. Send me your first problem!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Reason: happens asynchronously in the background.
&lt;/span&gt;
&lt;span class="c1"&gt;# 3. Query: ask Honcho what it knows, or pull prompt-ready context.
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What learning styles does the user respond to best?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Inject: hand the context to your model of choice.
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assistant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tutor&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what's missing: any direct calls to a vector DB, any prompt-engineering boilerplate to extract "preferences," any explicit memory invalidation. Honcho does all of that asynchronously.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Peer-Centric Modeling
&lt;/h3&gt;

&lt;p&gt;The peer abstraction unlocks multi-agent and multi-user scenarios that are genuinely awkward in mem0-style libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Peers are first-class: humans, agents, projects, ideas
&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;honcho&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;peer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;support_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;honcho&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;peer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;billing_bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;honcho&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;peer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing-bot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Configurable "observation" — which peers see which others
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;honcho&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket-87234&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;peers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;support_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;billing_bot&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;([...])&lt;/span&gt;

&lt;span class="c1"&gt;# Ask one peer what it knows about another
&lt;/span&gt;&lt;span class="n"&gt;intel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;support_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What does the customer seem to actually want here?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is Honcho's "theory of mind" feature — modeling &lt;em&gt;what one peer knows about another&lt;/em&gt; — and it's what differentiates the project from straightforward retrieval-augmented memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Hybrid Search (BM25 + Vector)
&lt;/h3&gt;

&lt;p&gt;When you do need raw retrieval, Honcho exposes hybrid search out of the box, combining keyword and dense-vector recall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pricing complaints from last month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# or scoped to a peer
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Drop-in MCP Server
&lt;/h3&gt;

&lt;p&gt;For coding agents, the easiest path is the hosted MCP endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add honcho &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transport&lt;/span&gt; http &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; &lt;span class="s2"&gt;"https://mcp.honcho.dev"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer hch-your-key-here"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"X-Honcho-User-Name: YourName"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, Claude Code (or any MCP client — Cursor, Cline, Windsurf, Codex CLI) gains persistent memory across sessions. You can also use the richer Claude Code plugin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/plugin marketplace add plastic-labs/claude-honcho
/plugin install honcho@honcho
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Architecture &amp;amp; How It Works
&lt;/h2&gt;

&lt;p&gt;Honcho ships as a &lt;strong&gt;FastAPI server&lt;/strong&gt; with a few moving parts you need to be aware of if you self-host:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL with pgvector&lt;/strong&gt; — stores messages, peers, sessions, and embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt; — caches and coordinates the async background workers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An LLM&lt;/strong&gt; — used for ingestion-time reasoning, session summarization, and the chat endpoint's research agent. The published benchmarks use &lt;code&gt;gemini-2.5-flash-lite&lt;/code&gt; for ingestion and &lt;code&gt;claude-haiku-4-5&lt;/code&gt; for the chat endpoint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An embedding model&lt;/strong&gt; — for the hybrid search component&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting architectural choice is the &lt;strong&gt;two-stage reasoning pipeline&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingest-time:&lt;/strong&gt; a small fine-tuned model captures &lt;em&gt;latent information&lt;/em&gt; from each new message — preferences, claims, observations — and updates the peer's representation immediately. This is fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dream-time:&lt;/strong&gt; a separate background process periodically revisits prior messages and reasoning, drawing new deductions. This is where Honcho gets its high-recall, high-reasoning numbers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Token efficiency falls out of this design. On LongMem S, Honcho answers correctly 90.4% of the time while using a &lt;strong&gt;median of 5%&lt;/strong&gt; of the available context per question (mean 11%). That's the difference between a $0.50 query and a $0.05 query at scale, which adds up fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks: How Honcho Stacks Up
&lt;/h2&gt;

&lt;p&gt;Plastic Labs published full benchmarks at &lt;a href="https://evals.honcho.dev" rel="noopener noreferrer"&gt;evals.honcho.dev&lt;/a&gt;. The headline numbers (mid-2026):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Honcho&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LongMem S&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;90.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;(92.6% with Gemini 3 Pro)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LongMem Oracle&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;91.8%&lt;/td&gt;
&lt;td&gt;Beats the underlying model alone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LoCoMo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;89.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Beating their own prior 86.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BEAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Top scores across all subtests&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most important number is actually the &lt;em&gt;non-headline&lt;/em&gt; one: Claude Haiku 4.5 alone scores &lt;strong&gt;62.6%&lt;/strong&gt; on LongMem S, but &lt;strong&gt;89.2%&lt;/strong&gt; on LongMem Oracle (only the relevant sessions in context). Honcho with Haiku 4.5 underneath scores &lt;strong&gt;90.4%&lt;/strong&gt; — &lt;em&gt;better than the oracle case&lt;/em&gt;. That means the memory layer is genuinely improving the model's reasoning, not just retrieving relevant chunks.&lt;/p&gt;

&lt;p&gt;For comparison, recent third-party comparison work (Vectorize, glukhov.org, Atlan) places Mem0 around the 60-70% range on LongMemEval and similar benchmarks, with Letta and Zep clustered in the 65-85% band depending on configuration. Honcho's numbers are at the top of the leaderboard as of May 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases (From the Community)
&lt;/h2&gt;

&lt;p&gt;From discussions in the Plastic Labs Discord, Reddit r/LocalLLaMA, and Hermes/OpenClaw integration docs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI tutoring apps&lt;/strong&gt; that need to remember a student's misconceptions across weeks of sessions — the canonical Plastic Labs demo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer support agents&lt;/strong&gt; modeling individual customer history and preferences across tickets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coding agents&lt;/strong&gt; (via the Claude Code / Cursor plugins) maintaining persistent memory of project conventions, your coding style, and prior decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent simulations&lt;/strong&gt; where agents need to model what other agents know — Honcho's peer model is uniquely well-suited&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personal AI&lt;/strong&gt; projects (the Hermes integration, OpenClaw integration) where the agent represents &lt;em&gt;you&lt;/em&gt; and needs identity-level continuity&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  First Impressions From the Community
&lt;/h2&gt;

&lt;p&gt;From the Reddit and DEV.to comparison threads (&lt;a href="https://www.reddit.com/r/gluk/comments/1syqcq5/agent_memory_providers_compared_honcho_mem0/" rel="noopener noreferrer"&gt;r/gluk thread on agent memory providers&lt;/a&gt;, the &lt;a href="https://dev.to/rosgluk/agent-memory-providers-compared-honcho-mem0-hindsight-and-five-more-5bl8"&gt;DEV.to comparison&lt;/a&gt;):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You want the agent to model how you think → Honcho." — Vectorize comparison guide, March 2026&lt;/p&gt;

&lt;p&gt;"Honcho and Mem0 require the most moving parts" — glukhov.org, noting Honcho's PostgreSQL + Redis + LLM + embedding dependency stack&lt;/p&gt;

&lt;p&gt;"The peer model is the right abstraction for what we're building" — common refrain in the Plastic Labs Discord from multi-agent simulation developers&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The consensus is positive on capability and benchmarks, more measured on operational complexity. If you want the lowest-friction managed memory, Mem0 is still ahead on time-to-first-query. If you want the deepest reasoning and the cleanest multi-peer model, Honcho is the pick.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The fastest path is the managed service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Get an API key at app.honcho.dev (you get $100 free credits)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HONCHO_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"hch-..."&lt;/span&gt;

&lt;span class="c"&gt;# 2. Install&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;honcho-ai

&lt;span class="c"&gt;# 3. Start storing and querying — that's it&lt;/span&gt;
python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
from honcho import Honcho
h = Honcho(workspace_id='demo')
alice = h.peer('alice')
s = h.session('s1')
s.add_messages([alice.message('I prefer terse answers.')])
print(alice.chat('What kind of answers does the user prefer?'))
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For self-hosting, clone the repo and use the provided Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/plastic-labs/honcho
&lt;span class="nb"&gt;cd &lt;/span&gt;honcho
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="c"&gt;# Server will be on http://localhost:8000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll still need to point Honcho at an LLM provider (OpenAI, Anthropic, Gemini, or a custom endpoint) and an embedding model. The repo's &lt;code&gt;.env.example&lt;/code&gt; walks through the configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Use Honcho (And Who Shouldn't)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Honcho if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building a stateful AI assistant that needs to &lt;em&gt;understand&lt;/em&gt; users, not just remember facts about them&lt;/li&gt;
&lt;li&gt;You're shipping a multi-agent system where modeling cross-agent knowledge matters&lt;/li&gt;
&lt;li&gt;You care about token efficiency — Honcho's 5%-median context usage is a significant cost lever at scale&lt;/li&gt;
&lt;li&gt;You want a permissively-licensed, self-hostable option (no vendor lock-in)&lt;/li&gt;
&lt;li&gt;You're already in the MCP ecosystem (Claude Code, Cursor, Cline, Windsurf)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip Honcho if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need memory in under 30 seconds with zero infra → use &lt;strong&gt;Mem0&lt;/strong&gt;'s managed tier&lt;/li&gt;
&lt;li&gt;You can't run PostgreSQL + Redis + an LLM dependency (e.g., browser-only or strict-edge environments)&lt;/li&gt;
&lt;li&gt;Your "memory" needs are really just chat-history-with-summaries — overkill for that case&lt;/li&gt;
&lt;li&gt;You need a single-binary, no-external-service option → look at &lt;strong&gt;Hindsight&lt;/strong&gt; instead&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison With Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Memory Model&lt;/th&gt;
&lt;th&gt;Self-host&lt;/th&gt;
&lt;th&gt;Headline benchmark&lt;/th&gt;
&lt;th&gt;Setup time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Honcho&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reasoning-first, peer-centric&lt;/td&gt;
&lt;td&gt;✅ (FastAPI + Postgres + Redis)&lt;/td&gt;
&lt;td&gt;90.4% LongMem S&lt;/td&gt;
&lt;td&gt;~30 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mem0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vector + LLM extraction&lt;/td&gt;
&lt;td&gt;✅ Managed + OSS&lt;/td&gt;
&lt;td&gt;~65% LongMemEval&lt;/td&gt;
&lt;td&gt;~30 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Letta (MemGPT)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hierarchical / OS-style&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;83.2% (per Atlan)&lt;/td&gt;
&lt;td&gt;~15 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Zep&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Temporal knowledge graph&lt;/td&gt;
&lt;td&gt;✅ + managed&lt;/td&gt;
&lt;td&gt;63.8% temporal LongMemEval&lt;/td&gt;
&lt;td&gt;~10 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hindsight&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bundled, minimal deps&lt;/td&gt;
&lt;td&gt;✅ (single binary)&lt;/td&gt;
&lt;td&gt;91.4% overall&lt;/td&gt;
&lt;td&gt;~5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Supermemory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hybrid, managed-first&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;td&gt;85.4%&lt;/td&gt;
&lt;td&gt;~30 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Honcho's position: &lt;strong&gt;highest reasoning + benchmark scores, more operational complexity than the lightweight options&lt;/strong&gt;. If memory quality matters more than setup speed, it wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Is Honcho open source and self-hostable, or is it just a managed API?&lt;/strong&gt;&lt;br&gt;
Both. The FastAPI server is open source on GitHub (see the repo for current license terms — historically Apache-2.0-style). You can run it on your own infrastructure with Docker Compose, or use the managed service at &lt;code&gt;api.honcho.dev&lt;/code&gt; with $100 of free credits on signup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How does Honcho compare to Mem0 for production use?&lt;/strong&gt;&lt;br&gt;
Mem0 wins on setup time and operational simplicity — you can be up and running in 30 seconds. Honcho wins on reasoning depth, multi-peer modeling, and benchmark performance (90.4% vs ~65% on LongMem-class evals). For a "remember last week's preferences" assistant, both will work. For multi-agent or identity-modeling use cases, Honcho is the better abstraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I use Honcho with Claude Code or Cursor?&lt;/strong&gt;&lt;br&gt;
Yes — Plastic Labs ships an official Claude Code plugin (&lt;code&gt;/plugin install honcho@honcho&lt;/code&gt; after adding the &lt;code&gt;plastic-labs/claude-honcho&lt;/code&gt; marketplace) and a hosted MCP endpoint at &lt;code&gt;https://mcp.honcho.dev&lt;/code&gt; that works with any MCP-compatible client including Cursor, Cline, Windsurf, and Codex CLI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What's the actual cost at scale?&lt;/strong&gt;&lt;br&gt;
You're paying for: (a) the LLM you point Honcho at — ingestion model and chat endpoint model are separate and can be different (e.g., gemini-2.5-flash-lite for ingestion, claude-haiku-4-5 for chat); (b) embedding API calls; (c) PostgreSQL + Redis hosting if self-hosted. Honcho's ~5% median context-window usage per query keeps the chat-endpoint costs significantly lower than naive "dump everything into context" approaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Does Honcho work with non-OpenAI/Anthropic models?&lt;/strong&gt;&lt;br&gt;
Yes. Honcho is LLM-provider-agnostic — you can point it at OpenAI, Anthropic, Gemini, or a custom OpenAI-compatible endpoint (which means most local servers via Ollama, vLLM, or LM Studio also work).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How does the "dreaming" background process affect cost?&lt;/strong&gt;&lt;br&gt;
Dreaming is a configurable background process that re-reasons across stored messages to surface new deductions. It runs asynchronously and you can tune the token budget per workspace. Plastic Labs' benchmark configs are published on GitHub at &lt;a href="https://github.com/plastic-labs/honcho-benchmarks" rel="noopener noreferrer"&gt;honcho-benchmarks&lt;/a&gt; if you want to see real numbers.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Honcho is the cleanest open-source memory layer I've seen in 2026.&lt;/strong&gt; It's not the simplest to operate — you're committing to PostgreSQL, Redis, and an LLM bill on top of the application — but if you're building anything beyond a single-turn chatbot, the reasoning-first architecture and peer model will save you from re-inventing this exact wheel a year from now. Try it on the managed tier first with the $100 of free credits, then self-host when you've validated the fit.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;GitHub: &lt;a href="https://github.com/plastic-labs/honcho" rel="noopener noreferrer"&gt;github.com/plastic-labs/honcho&lt;/a&gt; · Docs: &lt;a href="https://honcho.dev" rel="noopener noreferrer"&gt;honcho.dev&lt;/a&gt; · Evals: &lt;a href="https://evals.honcho.dev" rel="noopener noreferrer"&gt;evals.honcho.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agentmemory</category>
      <category>honcho</category>
      <category>plasticlabs</category>
      <category>statefulagents</category>
    </item>
    <item>
      <title>Rmux Review: Rust Terminal Multiplexer Built for AI Agents</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Thu, 21 May 2026 11:08:13 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/rmux-review-rust-terminal-multiplexer-built-for-ai-agents-1j87</link>
      <guid>https://dev.to/andrew-ooo/rmux-review-rust-terminal-multiplexer-built-for-ai-agents-1j87</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/rmux-rust-terminal-multiplexer-agents-review/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rmux&lt;/strong&gt; is a from-scratch Rust rewrite of the terminal multiplexer idea, built specifically for the agentic era. It hit Show HN on &lt;strong&gt;May 21, 2026&lt;/strong&gt;, posted by author &lt;a href="https://news.ycombinator.com/user?id=shideneyu" rel="noopener noreferrer"&gt;shideneyu&lt;/a&gt; with the tagline &lt;em&gt;"A programmable terminal multiplexer with a Playwright-style SDK"&lt;/em&gt; — and the framing landed. The pitch in one sentence: drop-in tmux compatibility (90 commands, your keybindings work), plus a typed async Rust SDK on the same daemon so agents and scripts can drive any CLI or TUI app the way Playwright drives a browser.&lt;/p&gt;

&lt;p&gt;Key facts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.2.0 released May 18, 2026&lt;/strong&gt; (3 days before Show HN), MIT/Apache-2.0 dual-licensed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;90 tmux-compatible CLI commands&lt;/strong&gt; implemented — &lt;code&gt;new-session&lt;/code&gt;, &lt;code&gt;split-window&lt;/code&gt;, &lt;code&gt;send-keys&lt;/code&gt;, &lt;code&gt;attach-session&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Typed Rust SDK&lt;/strong&gt; (&lt;code&gt;rmux-sdk&lt;/code&gt; on crates.io) with stable pane IDs, structured snapshots, locator-style waits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native on Linux, macOS, and Windows&lt;/strong&gt; — uses Unix PTYs on *nix, &lt;strong&gt;real ConPTY on Windows (no WSL required)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ratatui-rmux&lt;/code&gt; widget&lt;/strong&gt; for embedding live panes inside &lt;a href="https://ratatui.rs" rel="noopener noreferrer"&gt;Ratatui&lt;/a&gt; TUI apps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;#![forbid(unsafe_code)]&lt;/code&gt;&lt;/strong&gt; in upper-level crates, OS boundary code isolated in runtime crates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single daemon&lt;/strong&gt; behind both surfaces — anything the CLI can do, the SDK can do, and vice versa&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; fresh public preview, "bugs are expected" per the README, scriptable plugin/scripting system not yet broken out, no Lua/scripting parity with tmux configs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've ever written a flaky &lt;code&gt;tmux send-keys&lt;/code&gt; + &lt;code&gt;tmux capture-pane | grep&lt;/code&gt; loop to babysit a long-running tool from a script — or had an SSH session die and lose a Claude Code or Codex run — Rmux is the most principled fix shipped in 2026 so far.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Rmux Actually Is
&lt;/h2&gt;

&lt;p&gt;The author, &lt;a href="https://github.com/shideneyu" rel="noopener noreferrer"&gt;shideneyu&lt;/a&gt; on GitHub and HN, &lt;a href="https://news.ycombinator.com/item?id=48219918" rel="noopener noreferrer"&gt;explained the origin in the Show HN thread&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"RMUX started from a frustration: I've used tmux for years and got tired of scraping output with grep and sleeps to automate anything. So I rebuilt the multiplexer from scratch in Rust, with a programmable layer on top."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the core insight. Tmux is wonderful for humans and brittle for machines. The moment you try to automate — wait for a build to finish, send input only after a prompt appears, scrape a value out of &lt;code&gt;top&lt;/code&gt; — you end up writing &lt;code&gt;sleep 3 &amp;amp;&amp;amp; capture-pane -p | tail -n 20 | grep ...&lt;/code&gt; and praying.&lt;/p&gt;

&lt;p&gt;Rmux's answer is two parallel surfaces on top of one daemon:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The CLI surface&lt;/strong&gt; (&lt;code&gt;rmux&lt;/code&gt; binary): looks and acts like tmux. All 90 of the most-used tmux commands are implemented, your muscle memory works, your &lt;code&gt;.tmux.conf&lt;/code&gt;-style keybindings carry over. You can drop it in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The SDK surface&lt;/strong&gt; (&lt;code&gt;rmux-sdk&lt;/code&gt; crate): a typed async Rust API that talks to the same daemon over a local socket. Sessions and panes are first-class objects with stable IDs. You get &lt;code&gt;wait_for_text("ready")&lt;/code&gt; instead of &lt;code&gt;sleep &amp;amp;&amp;amp; grep&lt;/code&gt;. Structured &lt;code&gt;PaneSnapshot&lt;/code&gt; objects (rows, cols, attributes) instead of raw escape-sequence soup. This is the "Playwright for terminals" part.&lt;/p&gt;

&lt;p&gt;Both surfaces talk to a hidden daemon that owns the PTYs, sessions, layouts, hooks, and buffers. Detach your human terminal — your SDK script keeps driving. Kill your SDK script — your human session is still attachable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;

&lt;p&gt;For the impatient, on macOS or Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://rmux.io/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;irm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://rmux.io/install.ps1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;iex&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From crates.io:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;rmux &lt;span class="nt"&gt;--locked&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify with &lt;code&gt;rmux --version&lt;/code&gt; — should report &lt;code&gt;0.2.0&lt;/code&gt;. SHA256 checksums for the prebuilt binaries are published with the &lt;a href="https://github.com/helvesec/rmux/releases/tag/v0.2.0" rel="noopener noreferrer"&gt;v0.2.0 GitHub Release&lt;/a&gt;, which is the right thing to do but still a rarity for a v0.2 launch.&lt;/p&gt;

&lt;p&gt;A first session feels exactly like tmux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rmux new-session &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; work
rmux split-window &lt;span class="nt"&gt;-h&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; work
rmux send-keys &lt;span class="nt"&gt;-t&lt;/span&gt; work &lt;span class="s1"&gt;'echo "hello from rmux"'&lt;/span&gt; Enter
rmux attach-session &lt;span class="nt"&gt;-t&lt;/span&gt; work
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you've never used tmux: that creates a detached session called &lt;code&gt;work&lt;/code&gt;, splits it horizontally, sends a command into the right pane, then attaches your terminal so you can see both panes side by side. Press &lt;code&gt;Ctrl+B D&lt;/code&gt; (the default prefix, like tmux) to detach again — the session keeps running, the process keeps running, you can re-attach from another terminal or another SSH connection later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The SDK Is the Whole Point
&lt;/h2&gt;

&lt;p&gt;The CLI is table stakes. The SDK is why you'd switch.&lt;/p&gt;

&lt;p&gt;Add it to a Cargo project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;rmux-sdk&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.2"&lt;/span&gt;
&lt;span class="py"&gt;tokio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"rt-multi-thread"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"macros"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the canonical "wait for output, then capture state" pattern, adapted from the README:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;time&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;rmux_sdk&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;
    &lt;span class="n"&gt;EnsureSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EnsureSessionPolicy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rmux&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SessionName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TerminalSizeSpec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;rmux_sdk&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;rmux&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Rmux&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.default_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_secs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.connect_or_start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;session_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;SessionName&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"work"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"valid session name"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rmux&lt;/span&gt;
        &lt;span class="nf"&gt;.ensure_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nn"&gt;EnsureSession&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;named&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;.policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;EnsureSessionPolicy&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CreateOrReuse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;.detached&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;.size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;TerminalSizeSpec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;pane&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="nf"&gt;.pane&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;pane&lt;/span&gt;&lt;span class="nf"&gt;.send_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"printf 'ready&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;n' &amp;amp;&amp;amp; sleep 1&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;pane&lt;/span&gt;&lt;span class="nf"&gt;.wait_for_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ready"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pane&lt;/span&gt;&lt;span class="nf"&gt;.snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}x{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="py"&gt;.cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="py"&gt;.rows&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read that carefully. The interesting lines are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;connect_or_start()&lt;/code&gt; — connects to a running daemon if one exists, starts one if not. No more "did I &lt;code&gt;tmux start-server&lt;/code&gt; yet?" race conditions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EnsureSessionPolicy::CreateOrReuse&lt;/code&gt; — idempotent session bootstrap. Re-run the script, get the same session.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pane.wait_for_text("ready").await?&lt;/code&gt; — this is the killer feature. No sleeps. No polling loops. A proper async wait that resolves the moment the text appears in the pane buffer (with the &lt;code&gt;default_timeout&lt;/code&gt; as the bound).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pane.snapshot().await?&lt;/code&gt; — typed &lt;code&gt;PaneSnapshot&lt;/code&gt; with &lt;code&gt;cols&lt;/code&gt;, &lt;code&gt;rows&lt;/code&gt;, and structured cell data. You don't parse strings; you read fields.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mental model maps almost 1:1 to Playwright's: a typed handle to a long-lived environment, locator-style waits, structured introspection of state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ratatui Integration: Embed Live Panes in TUI Apps
&lt;/h2&gt;

&lt;p&gt;This is the bit that made me sit up. The &lt;code&gt;ratatui-rmux&lt;/code&gt; companion crate exposes a &lt;code&gt;PaneWidget&lt;/code&gt; that renders a live pane snapshot directly inside a &lt;a href="https://ratatui.rs" rel="noopener noreferrer"&gt;Ratatui&lt;/a&gt; TUI app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;ratatui&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="nn"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;layout&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Rect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Widget&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;ratatui_rmux&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;PaneState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PaneWidget&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;rmux_sdk&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PaneSnapshot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PaneSnapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Rect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;PaneState&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nn"&gt;PaneWidget&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you can build with that: a TUI dashboard that owns its own panes (a build log, a server log, an agent's terminal session) and renders them as widgets next to your own controls, without subprocess-management or PTY-allocation code in your app. The dashboard &lt;em&gt;is&lt;/em&gt; the multiplexer's other face. For anyone building agent observability tools or live operator UIs, this collapses a giant amount of boilerplate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "For the Agentic Era" Actually Means Something Here
&lt;/h2&gt;

&lt;p&gt;A lot of 2026 tooling slaps "for AI agents" on the README and changes nothing about the design. Rmux earns the label because of three concrete properties:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Detachable execution + reconnect.&lt;/strong&gt; When a Claude Code, Codex, or aider run is going for 45 minutes and your SSH connection dies, a regular shell loses the process. With rmux (or tmux), the daemon owns the PTY — the agent keeps running, the new SSH attaches and resumes the human view. For long-lived coding-agent sessions over flaky connections, this isn't a nice-to-have, it's the difference between "shipped" and "had to restart from scratch."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Structured snapshots, not byte streams.&lt;/strong&gt; When an agent supervisor needs to know "did my child process print the success line yet?", a byte-stream pipe forces it to write a parser. A typed &lt;code&gt;PaneSnapshot&lt;/code&gt; lets it pattern-match on rows and cells. The supervisor agent can be small and dumb because the data is shaped right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Real ConPTY on Windows, no WSL.&lt;/strong&gt; This is the one most projects skip. WezTerm, Alacritty, even some "cross-platform" Rust shells punt to "use WSL" on Windows. Rmux uses real &lt;a href="https://learn.microsoft.com/en-us/windows/console/pseudoconsoles" rel="noopener noreferrer"&gt;ConPTY&lt;/a&gt; (the Windows pseudo-console API) and a per-user Named Pipe for IPC. If you're building agent tooling that has to run on a developer's actual Windows machine — or in a Windows Server VM in CI — that matters more than benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community Reception (First 24 Hours)
&lt;/h2&gt;

&lt;p&gt;The Show HN thread is still young (39 points, 29 comments at time of writing — early but front-paging), and the author is actively replying. Representative threads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Why not just &lt;code&gt;script&lt;/code&gt;-wrap tmux?"&lt;/strong&gt; — tmux's automation surface is the same byte-stream-and-regex pattern that motivated rmux. Wrapping it doesn't fix the mismatch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Daemon protocol stability?"&lt;/strong&gt; — &lt;code&gt;rmux-proto&lt;/code&gt; (the IPC DTO crate) is published publicly so SDKs in other languages can be built against it. Python and Node bindings aren't shipped, but the wire format is documented.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"What about screen / zellij?"&lt;/strong&gt; — &lt;code&gt;screen&lt;/code&gt; predates structured automation as a concern. &lt;a href="https://zellij.dev/" rel="noopener noreferrer"&gt;Zellij&lt;/a&gt; has a WASM-plugin-centric philosophy; rmux bets on out-of-process SDK clients instead. Different shape, not strictly competitive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://www.reddit.com/r/rust/comments/1tipknk/rmux_native_terminal_multiplexer_in_rust/" rel="noopener noreferrer"&gt;r/rust thread&lt;/a&gt; skews implementation-curious: ConPTY integration, the workspace split, and the &lt;code&gt;#![forbid(unsafe_code)]&lt;/code&gt; posture in upper crates. Build hygiene is unusually thorough for v0.2: &lt;code&gt;cargo clippy --all-targets --locked -D warnings&lt;/code&gt;, a &lt;code&gt;scripts/no-network-in-runtime.sh&lt;/code&gt; assertion, and a platform-neutrality checker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;This is a fresh preview. Be realistic.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Bugs are expected."&lt;/strong&gt; The README says it. Don't put rmux on critical CI paths on day 3 — run it for personal use and prototypes for a release cycle or two first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Lua/scripting parity with tmux configs.&lt;/strong&gt; Keybindings and core options carry over; a 400-line &lt;code&gt;.tmux.conf&lt;/code&gt; with custom hooks and plugins does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No official Python / Node SDK yet.&lt;/strong&gt; Either shell out to the CLI or hand-roll against the public &lt;code&gt;rmux-proto&lt;/code&gt; wire format. Community ports are likely soon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daemon protocol is local-only by design.&lt;/strong&gt; Unix socket or Named Pipe. To drive a remote rmux, you SSH in and run the SDK there — which is usually what you want anyway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daemon memory footprint at idle isn't documented.&lt;/strong&gt; Irrelevant on a dev machine; you'd want to measure if you're spawning hundreds per CI build.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Daemon&lt;/th&gt;
&lt;th&gt;Typed SDK&lt;/th&gt;
&lt;th&gt;Windows-native&lt;/th&gt;
&lt;th&gt;Snapshot API&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;rmux 0.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;yes (Rust async)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;yes (ConPTY)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;yes (typed)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT/Apache-2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tmux&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;no (WSL)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;capture-pane&lt;/code&gt; (text)&lt;/td&gt;
&lt;td&gt;ISC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://zellij.dev/" rel="noopener noreferrer"&gt;Zellij&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;WASM plugins&lt;/td&gt;
&lt;td&gt;partial&lt;/td&gt;
&lt;td&gt;partial&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://wezterm.org/multiplexing.html" rel="noopener noreferrer"&gt;WezTerm mux&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;Lua&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;partial&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.gnu.org/software/screen/" rel="noopener noreferrer"&gt;screen&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;GPLv3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;expect&lt;/code&gt; / &lt;code&gt;pexpect&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;partial (Python)&lt;/td&gt;
&lt;td&gt;partial&lt;/td&gt;
&lt;td&gt;byte stream&lt;/td&gt;
&lt;td&gt;various&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The closest competitor by design philosophy is honestly &lt;strong&gt;Playwright for terminals&lt;/strong&gt;, which doesn't really exist as a category yet. Rmux is making it one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup Recipes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Babysit a Coding Agent Over SSH
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh build-server
rmux new-session &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; coder
rmux send-keys &lt;span class="nt"&gt;-t&lt;/span&gt; coder &lt;span class="s1"&gt;'cd ~/project &amp;amp;&amp;amp; claude code'&lt;/span&gt; Enter
rmux attach-session &lt;span class="nt"&gt;-t&lt;/span&gt; coder
&lt;span class="c"&gt;# do prompt work, then Ctrl+B D to detach&lt;/span&gt;
&lt;span class="nb"&gt;exit&lt;/span&gt;  &lt;span class="c"&gt;# SSH disconnects, agent keeps running&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reconnect from anywhere with &lt;code&gt;ssh build-server &amp;amp;&amp;amp; rmux attach-session -t coder&lt;/code&gt;. The agent's terminal is exactly as you left it, scrollback included.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drive a TUI Installer from CI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="n"&gt;pane&lt;/span&gt;&lt;span class="nf"&gt;.send_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./installer.sh&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;pane&lt;/span&gt;&lt;span class="nf"&gt;.wait_for_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"License [y/N]"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;pane&lt;/span&gt;&lt;span class="nf"&gt;.send_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"y&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;pane&lt;/span&gt;&lt;span class="nf"&gt;.wait_for_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Install directory:"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;pane&lt;/span&gt;&lt;span class="nf"&gt;.send_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/opt/myapp&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;pane&lt;/span&gt;&lt;span class="nf"&gt;.wait_for_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Installation complete"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the genuinely new capability. You could do it before with &lt;code&gt;expect&lt;/code&gt;, but &lt;code&gt;expect&lt;/code&gt; is a string-soup language from 1990. Rmux makes it a typed Rust program with proper error handling.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is rmux a drop-in replacement for tmux?
&lt;/h3&gt;

&lt;p&gt;For day-to-day use: largely yes. The 90 most-used commands work, your prefix-key reflexes work. For exotic configs (heavy Lua scripting, custom hooks, plugin ecosystems like TPM): not yet. Treat it as "tmux that also happens to be programmable" rather than "tmux but better at everything tmux does."&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use rmux from Python or Node, not Rust?
&lt;/h3&gt;

&lt;p&gt;Not with an official SDK yet. Your options today are: (1) shell out to the &lt;code&gt;rmux&lt;/code&gt; CLI from your script, which works but loses the typed-SDK ergonomics, or (2) write against the public &lt;code&gt;rmux-proto&lt;/code&gt; wire format. A community Python binding is the kind of thing that often appears within a month of an HN launch like this — keep an eye on the &lt;a href="https://github.com/helvesec/rmux/issues" rel="noopener noreferrer"&gt;issues&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does the Windows version really work without WSL?
&lt;/h3&gt;

&lt;p&gt;Yes. It uses real ConPTY (the Windows pseudo-console API introduced in Windows 10 1809) for PTY allocation and per-user Named Pipes for IPC. The project's &lt;code&gt;scripts/check-platform-neutrality.sh&lt;/code&gt; enforces that platform-specific code stays inside the boundary crates. If you've been waiting for a terminal multiplexer that doesn't make Windows a second-class citizen, this is the first credible one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is it stable enough to use today?
&lt;/h3&gt;

&lt;p&gt;For personal automation, prototypes, and exploratory agent work: yes, with the README's caveat that bugs are expected at v0.2.0. For production CI on critical paths: wait one or two release cycles, watch the issue tracker, and pin the version. The build discipline (clippy, fmt, locked deps, &lt;code&gt;forbid(unsafe_code)&lt;/code&gt;) is reassuring; the version number is honest.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does this fit alongside coding agents like Claude Code or Codex?
&lt;/h3&gt;

&lt;p&gt;Two complementary roles. First, as a &lt;em&gt;host&lt;/em&gt;: run the agent's CLI inside an rmux session so it survives SSH disconnects and you can attach/detach (huge for long runs over flaky connections). Second, as a &lt;em&gt;target&lt;/em&gt;: an agent supervisor (a Rust or Python orchestrator) uses the rmux SDK to drive &lt;em&gt;other&lt;/em&gt; terminal tools — installers, REPLs, TUI debuggers — that the inner agent needs to interact with. The supervisor pattern is the more interesting one and the one rmux is uniquely good at versus tmux.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does it compare to &lt;a href="https://dev.to/posts/forge-guardrails-self-hosted-llm-agentic-review/"&gt;Forge&lt;/a&gt;?
&lt;/h3&gt;

&lt;p&gt;Different layer. Forge is a reliability layer for the agent's reasoning and tool-calling loop. Rmux is a reliability layer for the agent's &lt;em&gt;environment&lt;/em&gt; — the terminal sessions the agent (or the human supervising it) lives inside. Both target the same underlying frustration (small models / fragile sessions need scaffolding to be useful), but they don't overlap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;Rmux is the cleanest version of "rebuild a Unix workhorse for the agentic era" I've seen in 2026. The decision to ship a tmux-compatible CLI &lt;em&gt;and&lt;/em&gt; a typed SDK on the same daemon — instead of forcing users to pick — is the kind of constraint that makes a tool actually adoptable.&lt;/p&gt;

&lt;p&gt;The Windows-native story (real ConPTY, no WSL) is what would tip me from "interesting" to "I'm trying this" for any project that needs cross-platform agent tooling. The Ratatui widget integration is what would tip a TUI builder. The &lt;code&gt;wait_for_text&lt;/code&gt; locator API is what would tip anyone who has ever written a &lt;code&gt;sleep 3 &amp;amp;&amp;amp; grep&lt;/code&gt; loop and known it was wrong.&lt;/p&gt;

&lt;p&gt;Caveats are real and the author owns them: v0.2.0, bugs expected, no Python/Node SDK yet, no scripting parity with mature tmux configs. But the bones are correct, the build hygiene is unusually disciplined for a v0.2 launch, and the design rationale survives contact with the obvious alternatives.&lt;/p&gt;

&lt;p&gt;If you run long-lived processes over SSH, supervise CLI tools from code, or build TUI dashboards that wrap other terminal apps — install it today, port one annoying automation script, and report back in 30 days. That's the right size of bet on a fresh tool that ships its protocol publicly and forbids unsafe code in its upper crates.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/helvesec/rmux" rel="noopener noreferrer"&gt;github.com/helvesec/rmux&lt;/a&gt; · &lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://rmux.io/docs/" rel="noopener noreferrer"&gt;rmux.io/docs&lt;/a&gt; · &lt;strong&gt;Show HN:&lt;/strong&gt; &lt;a href="https://news.ycombinator.com/item?id=48219918" rel="noopener noreferrer"&gt;news.ycombinator.com/item?id=48219918&lt;/a&gt; · &lt;strong&gt;License:&lt;/strong&gt; MIT / Apache-2.0&lt;/p&gt;

</description>
      <category>rmux</category>
      <category>tmux</category>
      <category>rust</category>
      <category>terminalmultiplexer</category>
    </item>
    <item>
      <title>Forge Review: 8B Local Model Hits 99% on Agentic Tasks</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Wed, 20 May 2026 11:07:11 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/forge-review-8b-local-model-hits-99-on-agentic-tasks-18kc</link>
      <guid>https://dev.to/andrew-ooo/forge-review-8b-local-model-hits-99-on-agentic-tasks-18kc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/forge-guardrails-self-hosted-llm-agentic-review/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Forge&lt;/strong&gt; is a Python reliability layer for self-hosted LLM tool-calling, built by &lt;strong&gt;Antoine Zambelli, AI Director at Texas Instruments&lt;/strong&gt;. It hit the Hacker News front page on &lt;strong&gt;May 19, 2026 with 206 points&lt;/strong&gt; and a tagline that did the heavy lifting on its own: &lt;em&gt;"Guardrails take an 8B model from 53% to 99% on agentic tasks."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The headline result, drawn from Forge's own 26-scenario eval suite:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bare 8B local model:&lt;/strong&gt; ~53% on multi-step agentic workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same 8B model + Forge guardrails:&lt;/strong&gt; &lt;strong&gt;99.3%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet, no guardrails:&lt;/strong&gt; 87.2%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet + Forge:&lt;/strong&gt; ~100%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That second-to-last line is the one that makes people sit up: &lt;strong&gt;a $0 local 8B model on a $600 GPU, wrapped in Forge, beats a frontier API used the normal way.&lt;/strong&gt; The gap between bare-metal local inference and a cloud API closes to less than 1 percentage point when both sides run through Forge. The framework, the ablation study, and the methodology are written up in an IEEE-track paper (&lt;a href="https://doi.org/10.1145/3786335.3813193" rel="noopener noreferrer"&gt;Zambelli 2026, DOI: 10.1145/3786335.3813193&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Key facts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MIT licensed&lt;/strong&gt;, pip-installable as &lt;code&gt;forge-guardrails&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backends:&lt;/strong&gt; Ollama, llama-server (llama.cpp), Llamafile, Anthropic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top eval config:&lt;/strong&gt; Ministral-3 8B Instruct Q8 on llama-server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three usage modes:&lt;/strong&gt; full WorkflowRunner, middleware-only, drop-in OpenAI-compatible proxy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core trick:&lt;/strong&gt; rescue parsing, retry nudges, step enforcement, tiered context compaction, plus a synthetic &lt;code&gt;respond&lt;/code&gt; tool that keeps small models in tool-calling mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;865 deterministic unit tests&lt;/strong&gt; + 26-scenario eval harness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status:&lt;/strong&gt; active (Python 3.12+), authored solo, real paper backing it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have a 24 GB GPU sitting in a closet and you've been waiting for the moment when "self-hosted agents" stops meaning "self-hosted toys," this is the moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/antoinezambelli/forge" rel="noopener noreferrer"&gt;github.com/antoinezambelli/forge&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Author&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Antoine Zambelli (AI Director, Texas Instruments)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python 3.12+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install forge-guardrails&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HN debut&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;May 19, 2026 — 206 points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top eval result&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;99.3% (Ministral-3 8B Q8 + Forge)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardest tier score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;76% (Ministral-3 8B Q8, advanced_reasoning subset)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backends&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ollama, llama-server, Llamafile, Anthropic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Paper&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DOI 10.1145/3786335.3813193&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;For most of 2025 and early 2026, the conventional wisdom on local agents was: nice for chat, useless for tool calling. Anyone who plugged a local 8B into a five-step workflow watched it confidently call &lt;code&gt;get_weatherr&lt;/code&gt; with a typo, return text instead of a JSON tool call, forget which step it was on, and time out somewhere in the middle. Real agentic use stayed on frontier APIs.&lt;/p&gt;

&lt;p&gt;Forge's contribution is to show that the gap is not really a model problem — it's a &lt;strong&gt;harness problem&lt;/strong&gt;. The model knows how to call tools. It just needs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Permission to fail and try again&lt;/strong&gt; (retry nudges)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A parser that can salvage a malformed call&lt;/strong&gt; instead of throwing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A reminder of which step it's on&lt;/strong&gt; when it wanders&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A context manager that doesn't quietly truncate the last tool result&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A way to emit text without leaving tool-calling mode&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Stack those five things and the same weights that score 53% raw score 99%. That's the entire argument, and Forge implements every piece of it as composable middleware so you can use the parts you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install in three minutes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# core only&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;forge-guardrails

&lt;span class="c"&gt;# plus the Anthropic client&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"forge-guardrails[anthropic]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You also need a backend. Forge is most opinionated about llama-server (its top-10 eval configs all run on it), but Ollama is the easiest on-ramp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Ollama path — easiest&lt;/span&gt;
ollama pull ministral-3:8b-instruct-2512-q4_K_M

&lt;span class="c"&gt;# llama-server path — best performance&lt;/span&gt;
llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--jinja&lt;/span&gt; &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--jinja&lt;/code&gt; flag matters. It's what enables native function calling in llama.cpp, and without it you fall back to prompt-injected tool calling, which is measurably worse on the harder eval tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real workflow in 30 lines
&lt;/h2&gt;

&lt;p&gt;This is the canonical Forge example, lightly trimmed from the README:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;forge&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ToolDef&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ToolSpec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;WorkflowRunner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OllamaClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ContextManager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TieredCompact&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;72°F and sunny in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GetWeatherParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Look up weather for a city.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ToolDef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ToolSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get current weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GetWeatherParams&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nb"&gt;callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;required_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="n"&gt;terminal_tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant. Use the available tools to answer the user.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OllamaClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ministral-3:8b-instruct-2512-q4_K_M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;recommended_sampling&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ContextManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;TieredCompact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keep_recent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;budget_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorkflowRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_manager&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the weather in Paris?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's quietly happening behind that &lt;code&gt;runner.run&lt;/code&gt; call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ResponseValidator&lt;/code&gt;&lt;/strong&gt; checks every model response against the tool schema, flags malformed JSON, and routes failures to the nudge system instead of bubbling them up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;StepEnforcer&lt;/code&gt;&lt;/strong&gt; tracks &lt;code&gt;required_steps&lt;/code&gt; (none in this example, but typical workflows have them) and refuses to terminate until they're satisfied.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TieredCompact&lt;/code&gt;&lt;/strong&gt; keeps the last 2 turns verbatim and compacts older turns when the context budget gets tight, so a runaway loop doesn't quietly drop the original user request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The retry nudge templates&lt;/strong&gt; in &lt;code&gt;forge/prompts/nudges.py&lt;/code&gt; produce small, surgical "you produced invalid output, here's why, try again" messages instead of restarting the loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;recommended_sampling=True&lt;/code&gt;&lt;/strong&gt; auto-applies the sampling params the model card recommends, which Zambelli found made a measurable difference in tool-call accuracy on small models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern — defining a &lt;code&gt;Workflow&lt;/code&gt;, picking a backend client, attaching a &lt;code&gt;ContextManager&lt;/code&gt;, and handing both to &lt;code&gt;WorkflowRunner&lt;/code&gt; — is the high-level entry point. If you'd rather keep your own orchestration loop, Forge also ships a middleware mode where you call its guardrails directly on responses you've already gotten back from a model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The drop-in proxy mode (this is the killer feature)
&lt;/h2&gt;

&lt;p&gt;This is the bit that will pull people in even if they have no interest in rewriting their stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# external mode — you run llama-server, Forge wraps it&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; forge.proxy &lt;span class="nt"&gt;--backend-url&lt;/span&gt; http://localhost:8080 &lt;span class="nt"&gt;--port&lt;/span&gt; 8081

&lt;span class="c"&gt;# managed mode — Forge starts llama-server for you&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; forge.proxy &lt;span class="nt"&gt;--backend&lt;/span&gt; llamaserver &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gguf&lt;/span&gt; path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf &lt;span class="nt"&gt;--port&lt;/span&gt; 8081
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now point any OpenAI-compatible client — &lt;code&gt;opencode&lt;/code&gt;, Continue, aider, your homegrown agent loop — at &lt;code&gt;http://localhost:8081/v1&lt;/code&gt; and you get the full Forge guardrail stack for free. The client thinks it's talking to a smarter model.&lt;/p&gt;

&lt;p&gt;Internally the proxy does one slightly diabolical thing: when a request contains tools, it injects a synthetic &lt;code&gt;respond&lt;/code&gt; tool. The model is then expected to call &lt;code&gt;respond(message="...")&lt;/code&gt; instead of emitting bare text. This keeps it in tool-calling mode at all times — which is where Forge's guardrails apply — and the synthetic call is stripped from the outbound response so the client never sees it. ADR-013 in the repo has the full reasoning, but the short version: 8B models cannot be trusted to decide between "emit text" and "emit a tool call." Force them to always pick a tool, and accuracy jumps.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 26-scenario eval suite
&lt;/h2&gt;

&lt;p&gt;Forge ships its own benchmark. Two tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OG-18:&lt;/strong&gt; 18 baseline scenarios covering single tool calls, multi-step workflows, parameter validation, and recovery from injected errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;advanced_reasoning:&lt;/strong&gt; 8 harder scenarios — chained dependencies, branching logic, ambiguous user prompts, partial information.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run it yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; tests.eval.eval_runner &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--backend&lt;/span&gt; llamafile &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--llamafile-mode&lt;/span&gt; prompt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gguf&lt;/span&gt; &lt;span class="s2"&gt;"path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--runs&lt;/span&gt; 10 &lt;span class="nt"&gt;--stream&lt;/span&gt; &lt;span class="nt"&gt;--verbose&lt;/span&gt;

&lt;span class="c"&gt;# batch eval across configs&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; tests.eval.batch_eval &lt;span class="nt"&gt;--config&lt;/span&gt; all &lt;span class="nt"&gt;--runs&lt;/span&gt; 50

&lt;span class="c"&gt;# HTML dashboard&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; tests.eval.report eval_results.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The published numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;OG-18&lt;/th&gt;
&lt;th&gt;advanced_reasoning&lt;/th&gt;
&lt;th&gt;Combined&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ministral-3 8B Q8 (llama-server) &lt;strong&gt;+ Forge&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;~93%&lt;/td&gt;
&lt;td&gt;76%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ministral-3 8B Q8 (llama-server), bare&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;~53%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet, bare&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;87.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet + Forge&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;~100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "99.3% on agentic tasks" headline number comes from a narrower slice of the benchmark — strict tool-calling reliability, not the full combined score. Read the paper for the precise breakdown; Zambelli is straightforward about which number means what.&lt;/p&gt;

&lt;p&gt;One of his most interesting eval findings, surfaced in the HN thread: &lt;strong&gt;the serving backend matters as much as the model.&lt;/strong&gt; Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt-injected mode. Same weights. The difference is entirely in how the chat template hands the tool schema to the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community reaction
&lt;/h2&gt;

&lt;p&gt;The HN thread had real engineers in it. A few notes worth pulling out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the "small models with harnesses beat big models" thesis:&lt;/strong&gt; &lt;em&gt;"I've been saying for a while that given a proper harness, small local models can perform incredibly well. When you have a system that can try everything, it will eventually get it right as long as you can prevent it from getting it wrong in the meantime."&lt;/em&gt; (HN user, 48200359)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zambelli's response:&lt;/strong&gt; &lt;em&gt;"Essentially, yes that's right! There's some subtlety in how to let it know it was wrong (returning things as tool errors because it trained on that), but that's the gist of it — sort of a self-correcting architecture."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most useful piece of pushback,&lt;/strong&gt; from a heavy local-LLM user: &lt;em&gt;"Specifically for agentic workflows and local models, accuracy around function/tool calling hasn't been a problem for me now for about 6 - 12 months, personally, since around QwenCoder3. The main issue is context management and the impact on timing... It looks like your work adds layers and wrappers like guard rails and retries. This would make my local model experience - specifically for agents - unusable because of the delays it would add."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the right critique. Forge improves accuracy but adds round-trips. Zambelli responded by floating a new metric he's going to start tracking: &lt;strong&gt;ETTWS — Estimated Time To Working Solution.&lt;/strong&gt; A retry loop that succeeds on attempt three with three small models calls might still beat a single shot that gets 53% accuracy and then forces you to restart the whole workflow. But you need the data to defend that claim, and right now it's not in the paper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the framing:&lt;/strong&gt; &lt;em&gt;"This is a thousand unusually smart monkeys who speak every major human language fluently and are proficient in every major programming language, but sometimes still make bizarre mistakes and need to be put back on track."&lt;/em&gt; Forge is, in that framing, the trainer with the clicker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;p&gt;This is not a finished product. Things to know before you bet a workflow on it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No public latency benchmarks.&lt;/strong&gt; All the published numbers are accuracy. The HN thread surfaced this and Zambelli acknowledged it. If you're running an interactive agent, you need to time-box your retries and prove to yourself that the wall-clock-to-success is better than your current setup.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python only.&lt;/strong&gt; No JS, Go, or Rust bindings. If your stack is Node-first, you're either standing up a sidecar or waiting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model coverage is uneven.&lt;/strong&gt; The eval suite is heavily skewed toward the Ministral-3 family and Mistral-Nemo. Qwen, Llama 3.x, and Gemma configurations exist but are less battle-tested. The Model Guide in the repo is honest about which combos have been graded.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The synthetic &lt;code&gt;respond&lt;/code&gt; tool is a workaround, not a fix.&lt;/strong&gt; Some users will be uncomfortable that the proxy injects an extra tool into every request. ADR-013 is worth reading before you adopt this in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-agent and slot-sharing support is brand new.&lt;/strong&gt; &lt;code&gt;SlotWorker&lt;/code&gt; (priority-queued access to a shared inference slot) is in the repo but isn't yet a documented production pattern.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Active solo project.&lt;/strong&gt; It's a one-maintainer project (Zambelli, plus an academic paper). If you adopt it, plan to read the code.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Forge vs. the alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;th&gt;Weakness vs. Forge&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Outlines / Instructor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured output / JSON schema&lt;/td&gt;
&lt;td&gt;Mature, multi-language&lt;/td&gt;
&lt;td&gt;Doesn't manage step enforcement or multi-step retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangGraph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent orchestration&lt;/td&gt;
&lt;td&gt;Huge ecosystem&lt;/td&gt;
&lt;td&gt;No specific guardrails for self-hosted reliability; assumes the model gets it right&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DSPy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompt/program optimization&lt;/td&gt;
&lt;td&gt;Auto-tunes prompts&lt;/td&gt;
&lt;td&gt;Different layer — could compose with Forge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;vLLM guided decoding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Constrained generation at the token level&lt;/td&gt;
&lt;td&gt;Fastest&lt;/td&gt;
&lt;td&gt;Server-side only, ties you to vLLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Forge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted tool-calling reliability + context mgmt&lt;/td&gt;
&lt;td&gt;Closes the local-vs-frontier gap&lt;/td&gt;
&lt;td&gt;Python-only, accuracy-first metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cleanest mental model: &lt;strong&gt;Outlines/Instructor enforce schema, vLLM constrained decoding enforces tokens, LangGraph orchestrates flows, Forge enforces a workflow&lt;/strong&gt;. They're complementary, not competitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who should adopt this today
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solo developers running local agents on a 24–48 GB GPU.&lt;/strong&gt; You'll feel the reliability jump on day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams building privacy-sensitive agentic products&lt;/strong&gt; (legal, medical, on-prem) where a frontier API isn't on the table and the current local stack is too flaky.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research groups&lt;/strong&gt; doing ablation studies on tool-calling reliability — Forge's eval harness is already publication-quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone running &lt;code&gt;opencode&lt;/code&gt;, Continue, or aider against a local model&lt;/strong&gt; and wishing it were less stupid. The proxy mode is a 30-second adoption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hold off if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need sub-second latency on every step and can't tolerate retry overhead.&lt;/li&gt;
&lt;li&gt;Your stack is Node-only and you can't stomach a Python sidecar.&lt;/li&gt;
&lt;li&gt;You're already paying for frontier API access and your reliability problem is "Claude refused to call my tool once last week" — Forge is overkill for that.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Does Forge work with Qwen, Llama 3, or Gemma — or only Ministral?&lt;/strong&gt;&lt;br&gt;
A: It works with anything its backends support (Ollama, llama-server, Llamafile, Anthropic). The published eval numbers happen to be best on Ministral-3 8B Q8, but the framework is model-agnostic. The Model Guide in the repo is honest about which model + backend combos have been graded and which are "should work, untested."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How much slower is a Forge-wrapped run vs. a raw model call?&lt;/strong&gt;&lt;br&gt;
A: There are no public latency numbers yet, which is the main critique on HN. In practice, you're trading wall-clock for completion rate: a single-shot 8B might return in 800ms and be wrong; a Forge-wrapped run might take 2–4 seconds and finish the workflow correctly. Zambelli is adding an ETTWS (Estimated Time To Working Solution) metric to the next paper rev specifically to address this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Do I have to use the WorkflowRunner, or can I just bolt the guardrails onto my existing loop?&lt;/strong&gt;&lt;br&gt;
A: You can absolutely just use the middleware. &lt;code&gt;examples/foreign_loop.py&lt;/code&gt; in the repo shows how to call &lt;code&gt;ResponseValidator&lt;/code&gt;, the nudge system, and the context manager directly. The full WorkflowRunner is convenient but optional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What's the &lt;code&gt;respond&lt;/code&gt; tool actually for?&lt;/strong&gt;&lt;br&gt;
A: 8B models trained on tool-calling datasets get confused when asked to choose between emitting plain text and emitting a tool call. Many will emit text when they should call a tool, or call a tool when they should just answer. The synthetic &lt;code&gt;respond&lt;/code&gt; tool removes the choice: there's always a tool to call, even for plain answers. The proxy strips it from the response on the way out so clients see normal text. ADR-013 in the repo has the full analysis and the eval data behind the decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Is the IEEE paper open-access or paywalled?&lt;/strong&gt;&lt;br&gt;
A: The DOI (&lt;code&gt;10.1145/3786335.3813193&lt;/code&gt;) may take a moment to resolve depending on publisher timing, but there's a pre-publication preprint kept in the repo at &lt;code&gt;docs/forge_ieee_preprint.pdf&lt;/code&gt;. Cite the DOI version, read the preprint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Does it support streaming?&lt;/strong&gt;&lt;br&gt;
A: Yes — &lt;code&gt;StreamChunk&lt;/code&gt; and SSE streaming are first-class in the client and proxy layers. The proxy server can stream chunks to OpenAI-compatible clients while still applying guardrails on completion boundaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I use it with Anthropic as a baseline / hybrid?&lt;/strong&gt;&lt;br&gt;
A: Yes. &lt;code&gt;pip install "forge-guardrails[anthropic]"&lt;/code&gt; and use &lt;code&gt;AnthropicClient&lt;/code&gt; as the backend. Useful for A/B tests against your local config or for hybrid workflows where part of the chain runs on Claude and part runs locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;Forge is the most concrete answer yet to the question &lt;em&gt;"is self-hosted agentic AI actually viable, or is it always going to be the toy alternative?"&lt;/em&gt; On the published evidence — and the IEEE paper behind it — the answer is: with the right harness, an 8B local model in a $600 GPU is functionally on par with a frontier API. That's a tectonic shift for anyone who has been waiting for the moment when running real agents at home stops feeling masochistic.&lt;/p&gt;

&lt;p&gt;The latency story is unfinished. The model coverage is uneven. It's a solo project. None of that changes the headline result, which is one of the cleanest demonstrations of "framework matters more than weights" published this year.&lt;/p&gt;

&lt;p&gt;If you have a GPU, install it tonight. Point your existing agent client at the proxy and see what happens. The setup cost is fifteen minutes; the upside is the local agent stack you've been waiting on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/antoinezambelli/forge" rel="noopener noreferrer"&gt;github.com/antoinezambelli/forge&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href="https://doi.org/10.1145/3786335.3813193" rel="noopener noreferrer"&gt;Zambelli 2026, DOI 10.1145/3786335.3813193&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;HN thread:&lt;/strong&gt; &lt;a href="https://news.ycombinator.com/item?id=48192383" rel="noopener noreferrer"&gt;news.ycombinator.com/item?id=48192383&lt;/a&gt;&lt;/p&gt;

</description>
      <category>forge</category>
      <category>selfhostedllm</category>
      <category>toolcalling</category>
      <category>agenticworkflows</category>
    </item>
    <item>
      <title>ds4 Review: antirez's Pure-C DeepSeek V4 Flash Engine</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Tue, 19 May 2026 11:05:18 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/ds4-review-antirezs-pure-c-deepseek-v4-flash-engine-2f8a</link>
      <guid>https://dev.to/andrew-ooo/ds4-review-antirezs-pure-c-deepseek-v4-flash-engine-2f8a</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/ds4-antirez-deepseek-v4-flash-local-inference-review/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ds4&lt;/strong&gt; is Salvatore "antirez" Sanfilippo's brand-new inference engine for the &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash" rel="noopener noreferrer"&gt;DeepSeek V4 Flash&lt;/a&gt; model. The creator of Redis spent the last few weeks writing a pure-C runtime that does one thing: run DeepSeek V4 Flash as fast as physically possible on a MacBook or a DGX Spark. It's not a generic GGUF runner. It is not a wrapper around llama.cpp. It's a deliberately narrow engine that has gained &lt;strong&gt;8,056 GitHub stars in roughly four days&lt;/strong&gt; and topped the Hacker News front page with &lt;strong&gt;497 points and 157 comments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The headline result: a 284B-parameter Mixture-of-Experts model running at &lt;strong&gt;26.68 tokens/s&lt;/strong&gt; generation on a 128 GB MacBook Pro M3 Max, with a &lt;strong&gt;1-million-token context window&lt;/strong&gt; and persistent on-disk KV cache. That's not a typo. That's a frontier-class model on a laptop you can buy at the Apple Store.&lt;/p&gt;

&lt;p&gt;Key facts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pure C inference engine&lt;/strong&gt; for one model: DeepSeek V4 Flash (284B params, ~13B active)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metal&lt;/strong&gt; (macOS) and &lt;strong&gt;CUDA&lt;/strong&gt; (Linux, DGX Spark optimized) primary backends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2-bit asymmetric quantization&lt;/strong&gt; — only MoE experts quantized, shared layers untouched&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1M token context window&lt;/strong&gt; with KV cache as a "first-class disk citizen"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI/Anthropic-compatible HTTP server&lt;/strong&gt; with tool calling baked in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;96 GB MacBooks confirmed working&lt;/strong&gt; at 250k context by community reports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIT licensed&lt;/strong&gt;, built openly with GPT-5.5 assistance (antirez says so himself)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status: alpha&lt;/strong&gt; — "this exists only for a few days," but it works&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have a 128 GB Mac and you've been waiting for the moment when "local frontier model" stops being aspirational, this is the moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/antirez/ds4" rel="noopener noreferrer"&gt;antirez/ds4&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Author&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Salvatore Sanfilippo (creator of Redis)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;C (no GGML link, no llama.cpp runtime)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DeepSeek V4 Flash only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weights&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/antirez/deepseek-v4-gguf" rel="noopener noreferrer"&gt;huggingface.co/antirez/deepseek-v4-gguf&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backends&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metal (macOS), CUDA (Linux/DGX Spark), ROCm (separate branch)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Min RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;96 GB (q2-imatrix), 128 GB recommended, 256 GB+ for q4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quants&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;q2-imatrix, q4-imatrix, q2, q4, optional MTP for speculative decoding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 1,000,000 tokens (disk-backed KV cache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;First commit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2026-05-06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8,056 (as of 2026-05-13, ~4 days after launch)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What ds4 Actually Is
&lt;/h2&gt;

&lt;p&gt;The README is unusually candid about scope:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"DwarfStar 4 is a small native inference engine specific for DeepSeek V4 Flash. It is intentionally narrow: not a generic GGUF runner, not a wrapper around another runtime: it is completely self-contained."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That single sentence explains the whole project. The local inference world is dominated by &lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp&lt;/a&gt;, &lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;, &lt;a href="https://github.com/ml-explore/mlx" rel="noopener noreferrer"&gt;MLX&lt;/a&gt;, and &lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt; — runtimes that try to support hundreds of model architectures. Every time a new model drops, those teams scramble to add support, which means every model gets an okay implementation but nothing gets a &lt;em&gt;great&lt;/em&gt; one. The MoE routing might be generic, the KV cache layout might be conservative, the tokenizer might be wrapped through three abstraction layers.&lt;/p&gt;

&lt;p&gt;ds4 takes the opposite bet: one model, one engine, finished end-to-end. The GGUF files are custom — antirez built his own quantization pipeline specifically for V4 Flash's architecture. The 2-bit quants aren't just q2_K applied uniformly; they're "asymmetric": only the routed Mixture-of-Experts (MoE) expert weights get squeezed to IQ2_XXS and Q2_K, while the shared experts, attention projections, and routing layers stay at full precision. The result is a ~70 GB model that fits in 128 GB of unified memory with room left for KV cache, OS, and your editor.&lt;/p&gt;

&lt;p&gt;The other architectural bet is the KV cache. DeepSeek V4 Flash already compresses KV state aggressively (it's part of why the model is fast). ds4 takes that one step further: KV cache &lt;strong&gt;lives on disk by default&lt;/strong&gt;. Modern MacBook SSDs read at ~7 GB/s — fast enough that paging KV chunks during decode is viable. That's how you get a 1M-token context window on a laptop without 1 TB of RAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It's Trending Right Now
&lt;/h2&gt;

&lt;p&gt;Three things colliding in the same week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek V4 Flash dropped&lt;/strong&gt; in late April 2026 — a 284B-parameter MoE with only ~13B active per token, 1M context, and quality that benchmarks competitively with frontier closed models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;antirez published ds4 on May 6, 2026&lt;/strong&gt; — six days after Apple started shipping the M3 Ultra Mac Studio with 512 GB unified memory. Hardware caught up to the model the same month an engine appeared.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The author is antirez.&lt;/strong&gt; When the creator of Redis writes 12,000 lines of pure C to make a specific model run on his MacBook, the open-source world pays attention. The repo crossed 7,000 stars in 4 days without a major Twitter push — pure HN + word of mouth.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Hacker News thread (497 points, 157 comments at &lt;a href="https://news.ycombinator.com/item?id=48142108" rel="noopener noreferrer"&gt;item 48142108&lt;/a&gt;) is mostly people being mildly stunned that frontier-class inference on a laptop is no longer hypothetical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Performance Numbers
&lt;/h2&gt;

&lt;p&gt;These are antirez's own benchmarks from the repo, run with &lt;code&gt;--ctx 32768 --nothink&lt;/code&gt; greedy decoding:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Prompt Length&lt;/th&gt;
&lt;th&gt;Prefill&lt;/th&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M3 Max 128 GB&lt;/td&gt;
&lt;td&gt;q2&lt;/td&gt;
&lt;td&gt;short&lt;/td&gt;
&lt;td&gt;58.52 t/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26.68 t/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro M3 Max 128 GB&lt;/td&gt;
&lt;td&gt;q2&lt;/td&gt;
&lt;td&gt;11,709 tokens&lt;/td&gt;
&lt;td&gt;250.11 t/s&lt;/td&gt;
&lt;td&gt;21.47 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac Studio M3 Ultra 512 GB&lt;/td&gt;
&lt;td&gt;q2&lt;/td&gt;
&lt;td&gt;short&lt;/td&gt;
&lt;td&gt;84.43 t/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;36.86 t/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac Studio M3 Ultra 512 GB&lt;/td&gt;
&lt;td&gt;q2&lt;/td&gt;
&lt;td&gt;11,709 tokens&lt;/td&gt;
&lt;td&gt;468.03 t/s&lt;/td&gt;
&lt;td&gt;27.39 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac Studio M3 Ultra 512 GB&lt;/td&gt;
&lt;td&gt;q4&lt;/td&gt;
&lt;td&gt;short&lt;/td&gt;
&lt;td&gt;78.95 t/s&lt;/td&gt;
&lt;td&gt;35.50 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac Studio M3 Ultra 512 GB&lt;/td&gt;
&lt;td&gt;q4&lt;/td&gt;
&lt;td&gt;12,018 tokens&lt;/td&gt;
&lt;td&gt;448.82 t/s&lt;/td&gt;
&lt;td&gt;26.62 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DGX Spark GB10 128 GB&lt;/td&gt;
&lt;td&gt;q2&lt;/td&gt;
&lt;td&gt;7,047 tokens&lt;/td&gt;
&lt;td&gt;343.81 t/s&lt;/td&gt;
&lt;td&gt;13.75 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To put 26 t/s in context: that's roughly twice the speed of casual reading. For a 284B model, on a laptop, with 1M context support, this is the first time those three constraints have aligned.&lt;/p&gt;

&lt;p&gt;The DGX Spark numbers being &lt;em&gt;slower&lt;/em&gt; than the Mac Studio at generation is worth a pause. DGX Spark prefill is excellent (343 t/s), but Apple Silicon's unified memory architecture wins on the memory-bandwidth-bound decode loop. That's a structural advantage Nvidia can't easily close without an HBM-class consumer card.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The install path is genuinely refreshing — no Python environment, no Docker, no conda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and download a model&lt;/span&gt;
git clone https://github.com/antirez/ds4.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ds4

&lt;span class="c"&gt;# 96/128 GB Mac → q2-imatrix (~70 GB download)&lt;/span&gt;
./download_model.sh q2-imatrix

&lt;span class="c"&gt;# 256 GB+ machine → q4-imatrix (~140 GB)&lt;/span&gt;
./download_model.sh q4-imatrix

&lt;span class="c"&gt;# Build for your platform&lt;/span&gt;
make                 &lt;span class="c"&gt;# macOS Metal (default)&lt;/span&gt;
make cuda-spark      &lt;span class="c"&gt;# Linux CUDA on DGX Spark / GB10&lt;/span&gt;
make cuda-generic    &lt;span class="c"&gt;# Linux CUDA on other GPUs&lt;/span&gt;
make cpu             &lt;span class="c"&gt;# CPU diagnostics only (do NOT run on macOS — kernel bug)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./ds4 &lt;span class="nt"&gt;--ctx&lt;/span&gt; 32768 &lt;span class="nt"&gt;--nothink&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or start the OpenAI-compatible HTTP server for coding agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./ds4-server &lt;span class="nt"&gt;--port&lt;/span&gt; 8080 &lt;span class="nt"&gt;--ctx&lt;/span&gt; 200000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server exposes OpenAI and Anthropic-compatible endpoints, so dropping it into Aider, Continue.dev, OpenClaw, or any tool that speaks those APIs is a matter of changing the base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example: point Aider at local ds4&lt;/span&gt;
aider &lt;span class="nt"&gt;--openai-api-base&lt;/span&gt; http://localhost:8080/v1 &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--openai-api-key&lt;/span&gt; dummy &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--model&lt;/span&gt; deepseek-v4-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tool calling works in both formats. The README is explicit that 2-bit quants are "not a joke" for tool use — they call tools reliably under coding agents, which is the failure mode that usually kills aggressive quantization.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Thinking" Mode Trick
&lt;/h2&gt;

&lt;p&gt;DeepSeek V4 Flash has a configurable reasoning mode. ds4 exposes it directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./ds4 &lt;span class="nt"&gt;--think&lt;/span&gt; low      &lt;span class="c"&gt;# Short thinking section&lt;/span&gt;
./ds4 &lt;span class="nt"&gt;--think&lt;/span&gt; medium   &lt;span class="c"&gt;# Default&lt;/span&gt;
./ds4 &lt;span class="nt"&gt;--think&lt;/span&gt; high     &lt;span class="c"&gt;# Long deliberation&lt;/span&gt;
./ds4 &lt;span class="nt"&gt;--nothink&lt;/span&gt;        &lt;span class="c"&gt;# Skip thinking entirely&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;antirez argues this is one of V4 Flash's distinguishing features: the thinking section length scales with problem complexity, and it's typically &lt;strong&gt;1/5 the length&lt;/strong&gt; of competing thinking models like QwQ or DeepSeek R1. That makes "thinking enabled" actually usable for coding work, where every extra second of thought is a tax on iteration speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  KV Cache on Disk: The Architectural Bet
&lt;/h2&gt;

&lt;p&gt;This is the part most local-LLM tooling doesn't do, and it's the reason the 1M-context claim isn't marketing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DeepSeek V4 Flash's KV cache is already compressed at the architecture level (similar to MLA from V2/V3).&lt;/li&gt;
&lt;li&gt;ds4 serializes the KV cache to disk per-session.&lt;/li&gt;
&lt;li&gt;On a 7 GB/s SSD, paging KV chunks during decode is faster than recomputing them.&lt;/li&gt;
&lt;li&gt;Sessions persist across restarts — you can resume a 500k-token conversation tomorrow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical implication: you can pre-prefill a giant codebase, save the KV state, and start every coding session with that context already loaded. Cold-start cost amortizes to zero after the first run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community Reaction
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://news.ycombinator.com/item?id=48142108" rel="noopener noreferrer"&gt;Hacker News thread&lt;/a&gt; and &lt;a href="https://www.reddit.com/r/LocalLLaMA/comments/1t72tk9/" rel="noopener noreferrer"&gt;r/LocalLLaMA discussion&lt;/a&gt; converge on a few takes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"This is the antirez software philosophy applied to AI."&lt;/strong&gt; Small, single-purpose, written in C, opinionated about scope. Same approach as Redis in 2009.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"96 GB Macs work."&lt;/strong&gt; Multiple commenters confirmed running q2-imatrix on M3 Max 96 GB at up to 250k context — antirez's official spec says 128 GB but the community is pushing lower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Tool calling actually works."&lt;/strong&gt; This is the surprise. Most 2-bit quants break structured output. The asymmetric quantization (full-precision shared weights) seems to preserve enough of the model's reliability for agents to function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"The KV-on-disk bet is the real innovation."&lt;/strong&gt; Several commenters singled this out as the structural insight — not the C code, not the quantization, but the decision to stop treating KV cache as a pure RAM resource.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"This will not generalize."&lt;/strong&gt; ds4 won't run Llama 4, won't run Qwen 3.6, won't ever be a generic engine. That's the point. The community is split on whether that's a feature or a limitation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One representative comment from r/LocalLLaMA: &lt;em&gt;"I've been running llama.cpp with V4 Flash and getting ~15 t/s on the same M3 Max. ds4 gave me 26. The KV cache disk persistence alone is worth switching."&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;Worth knowing before you spend a Saturday downloading 70 GB:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;One model only.&lt;/strong&gt; When DeepSeek V5 or V4.1 drops, you'll wait for antirez to update or fork the project. There's no fallback to other models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware floor is real.&lt;/strong&gt; Below 96 GB unified memory on Apple Silicon, this project is not for you. Period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alpha quality.&lt;/strong&gt; README says it explicitly: &lt;em&gt;"this exists only for a few days. It will take months to reach a more stable form."&lt;/em&gt; Expect breaking changes, tokenizer edge cases, and tool-calling malforms in early releases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;macOS CPU build crashes the kernel.&lt;/strong&gt; Genuinely. A macOS VM bug means &lt;code&gt;make cpu&lt;/code&gt; on Mac will hard-crash your machine. The README warns about it. Don't try it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AMD ROCm is community-maintained.&lt;/strong&gt; antirez doesn't have AMD hardware, so the &lt;a href="https://github.com/antirez/ds4/tree/rocm" rel="noopener noreferrer"&gt;rocm branch&lt;/a&gt; lags behind main.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom GGUFs only.&lt;/strong&gt; You can't point ds4 at GGUFs from other sources. The tensor layout, quant mix, and metadata are bespoke. You download what antirez built or nothing works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-assisted code disclosure.&lt;/strong&gt; From the README: &lt;em&gt;"This software is developed with strong assistance from GPT 5.5 and with humans leading the ideas, testing, and debugging."&lt;/em&gt; If that bothers you, antirez is upfront that this project isn't for you.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Who Should Use This
&lt;/h2&gt;

&lt;p&gt;✅ &lt;strong&gt;Good fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You own or have access to a 128 GB+ Apple Silicon machine&lt;/li&gt;
&lt;li&gt;You want a frontier-class model for coding agents that runs offline&lt;/li&gt;
&lt;li&gt;You're comfortable with C tooling and Make&lt;/li&gt;
&lt;li&gt;You care about long-context work (50k+ tokens routinely)&lt;/li&gt;
&lt;li&gt;You're okay with alpha-quality software and tracking a fast-moving project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;Not a fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a 16/32/64 GB laptop (use Ollama with smaller models)&lt;/li&gt;
&lt;li&gt;You want one runtime for many models (use &lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp&lt;/a&gt; or &lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;You need Windows native support&lt;/li&gt;
&lt;li&gt;You need production stability today&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ds4 vs. The Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;ds4&lt;/th&gt;
&lt;th&gt;llama.cpp&lt;/th&gt;
&lt;th&gt;Ollama&lt;/th&gt;
&lt;th&gt;MLX&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DeepSeek V4 Flash only&lt;/td&gt;
&lt;td&gt;Many models&lt;/td&gt;
&lt;td&gt;Many models (llama.cpp wrapper)&lt;/td&gt;
&lt;td&gt;Many models, Apple only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;V4 Flash speed (M3 Max)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;26 t/s&lt;/td&gt;
&lt;td&gt;~15 t/s&lt;/td&gt;
&lt;td&gt;~15 t/s&lt;/td&gt;
&lt;td&gt;Untested&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1M context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (disk KV)&lt;/td&gt;
&lt;td&gt;RAM-limited&lt;/td&gt;
&lt;td&gt;RAM-limited&lt;/td&gt;
&lt;td&gt;RAM-limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom GGUF needed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool calling at 2-bit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reliable&lt;/td&gt;
&lt;td&gt;Often broken&lt;/td&gt;
&lt;td&gt;Often broken&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production-ready&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Alpha&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want the &lt;em&gt;fastest possible&lt;/em&gt; V4 Flash experience on a Mac today, ds4 wins. If you want one tool for all your models, llama.cpp or Ollama still wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Do I really need 128 GB of RAM to run this?&lt;/strong&gt;&lt;br&gt;
A: Officially yes. Practically, the r/LocalLLaMA community has confirmed q2-imatrix running on M3 Max 96 GB machines, even at 250k context. Below 96 GB you'll swap heavily or run out of memory. There's no path to running this on a 32 GB MacBook — the model is fundamentally too large.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How is ds4 different from running DeepSeek V4 Flash in Ollama or llama.cpp?&lt;/strong&gt;&lt;br&gt;
A: Three things: (1) ds4 is ~75% faster on the same hardware because it's hand-tuned for V4 Flash's specific architecture; (2) ds4 supports 1M context with KV cache on disk, which generic runtimes don't; (3) ds4's 2-bit quants are asymmetric (only experts quantized), so tool calling actually works at q2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I use ds4 with Aider, Continue.dev, OpenClaw, or other coding agents?&lt;/strong&gt;&lt;br&gt;
A: Yes. &lt;code&gt;./ds4-server&lt;/code&gt; exposes OpenAI and Anthropic-compatible HTTP APIs. Point your agent's base URL at &lt;code&gt;http://localhost:8080/v1&lt;/code&gt; and it should work as a drop-in replacement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What happens when DeepSeek V5 comes out?&lt;/strong&gt;&lt;br&gt;
A: ds4 won't run it until antirez (or a fork) adds support. That's the explicit tradeoff — depth over breadth. The README says the "exact model may change as the landscape evolves" but only one model at a time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Is it safe to run AI-assisted code that antirez wrote with GPT-5.5?&lt;/strong&gt;&lt;br&gt;
A: The MIT license and full source are public. antirez disclosed the AI collaboration upfront in the README. You can audit the C code yourself — it's ~12,000 lines and the project is small enough to review. That's actually more transparent than most modern projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Why C and not Rust or Zig?&lt;/strong&gt;&lt;br&gt;
A: antirez has written C for 20 years. Redis is C. Familiarity, predictable performance, easy ABI for Metal/CUDA bindings. Also, "C is for people who have stopped trying to impress people," to paraphrase a different antirez essay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Does this work on the DGX Spark?&lt;/strong&gt;&lt;br&gt;
A: Yes, with &lt;code&gt;make cuda-spark&lt;/code&gt;. Prefill is excellent (343 t/s on a 7k prompt), but generation throughput is lower than Apple Silicon (13.75 t/s vs 26.68 t/s on M3 Max). The unified memory architecture wins on decode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Where do the GGUF weights come from? Are they official?&lt;/strong&gt;&lt;br&gt;
A: They're built by antirez specifically for ds4, hosted at &lt;a href="https://huggingface.co/antirez/deepseek-v4-gguf" rel="noopener noreferrer"&gt;huggingface.co/antirez/deepseek-v4-gguf&lt;/a&gt;. They are &lt;em&gt;not&lt;/em&gt; compatible with llama.cpp or other runtimes — the tensor layout and quantization mix are bespoke. The model weights themselves come from DeepSeek's official open release.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; ds4 is the most interesting local inference project of May 2026 — not because it does everything, but because it does one thing exceptionally well. If you have the hardware, this is your weekend project. If you don't, watch the repo; antirez has 20 years of shipping habits and this won't stay alpha forever.&lt;/p&gt;

</description>
      <category>ds4</category>
      <category>antirez</category>
      <category>deepseekv4flash</category>
      <category>localllm</category>
    </item>
    <item>
      <title>Aeon Review: Autonomous AI Agent on GitHub Actions</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Mon, 18 May 2026 11:10:51 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/aeon-review-autonomous-ai-agent-on-github-actions-17o9</link>
      <guid>https://dev.to/andrew-ooo/aeon-review-autonomous-ai-agent-on-github-actions-17o9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/aeon-autonomous-agent-github-actions-review/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/aaronjmars/aeon" rel="noopener noreferrer"&gt;aaronjmars/aeon&lt;/a&gt;&lt;/strong&gt; is a new open-source agent framework with a simple bet: most of the AI work you actually want done — morning briefs, PR reviews, market monitoring, security scans, research digests — doesn't need you in the loop. It just needs to &lt;em&gt;happen&lt;/em&gt;. Aeon launched as a Show HN on May 15, 2026 and uses &lt;strong&gt;GitHub Actions as its runtime&lt;/strong&gt;, so there's nothing to host, no daemons, no Docker, no VPS.&lt;/p&gt;

&lt;p&gt;The headline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;117 pre-built skills&lt;/strong&gt; across research, dev tooling, crypto/markets, social, productivity, and meta-agent self-management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs on GitHub Actions cron&lt;/strong&gt; — fork the repo, add an API key, toggle skills in &lt;code&gt;aeon.yml&lt;/code&gt;, push&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-healing loop&lt;/strong&gt;: every skill output is scored 1–5 by Claude Haiku; 3 consecutive failures auto-fires &lt;code&gt;skill-repair&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent memory&lt;/strong&gt; in &lt;code&gt;memory/*.json&lt;/code&gt; and &lt;code&gt;memory/MEMORY.md&lt;/code&gt;, committed back to the repo on every run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reactive triggers&lt;/strong&gt; in addition to cron — skills can fire on conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spawn fleets&lt;/strong&gt; of specialized instances via &lt;code&gt;spawn-instance&lt;/code&gt; and &lt;code&gt;fleet-control&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Works with Claude Pro/Max OAuth&lt;/strong&gt; (included in plan) or &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; (pay per token) — Bankr LLM gateway cuts Opus ~67%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pitch is "configure once, forget forever." After two days running it on a fork, that's mostly accurate — with caveats about token spend and notification volume that we'll get to.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Aeon actually is
&lt;/h2&gt;

&lt;p&gt;Most agent frameworks today — Claude Code, Codex, Cursor, OpenClaw, Hermes — are &lt;strong&gt;interactive tools&lt;/strong&gt;. You open a TUI or IDE, type a task, approve tool calls, review diffs. Aeon flips that: it's an unattended scheduler for the class of tasks (morning briefs, market monitoring, PR reviews, research digests, security scans) where you just want the work &lt;em&gt;done&lt;/em&gt; while you're not there.&lt;/p&gt;

&lt;p&gt;The architecture is delightfully boring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You fork the repo
  ↓
GitHub Actions runs messages.yml every 5 minutes
  ↓
Scheduler checks aeon.yml for skills due to run
  ↓
For each matching skill:
    - Spin up Claude Code in a runner
    - Read SKILL.md prompt
    - Execute (with internet, git, gh CLI, MCP servers)
    - Score output 1–5 via Haiku
    - Commit memory + outputs back to repo
    - Notify Telegram/Discord/Slack if relevant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. There's no Aeon server, no SaaS dashboard, no "agent cloud." The dashboard is a local Next.js app you run with &lt;code&gt;./aeon&lt;/code&gt; to edit YAML and push to GitHub. &lt;strong&gt;All compute happens in GitHub Actions runners&lt;/strong&gt;, which means public forks get unlimited free minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 117 skills, grouped
&lt;/h2&gt;

&lt;p&gt;This is where Aeon earns the "framework" label. Each skill is a &lt;code&gt;SKILL.md&lt;/code&gt; prompt file in &lt;code&gt;skills/&amp;lt;name&amp;gt;/&lt;/code&gt;, independently schedulable, chainable, and (importantly) &lt;strong&gt;disabled by default&lt;/strong&gt;. Categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Example skills&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Research &amp;amp; Content&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;deep-research&lt;/code&gt;, &lt;code&gt;hacker-news-digest&lt;/code&gt;, &lt;code&gt;paper-digest&lt;/code&gt;, &lt;code&gt;reddit-digest&lt;/code&gt;, &lt;code&gt;huggingface-trending&lt;/code&gt;, &lt;code&gt;rss-digest&lt;/code&gt;, &lt;code&gt;technical-explainer&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dev &amp;amp; Code&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pr-review&lt;/code&gt;, &lt;code&gt;code-health&lt;/code&gt;, &lt;code&gt;github-trending&lt;/code&gt;, &lt;code&gt;issue-triage&lt;/code&gt;, &lt;code&gt;vuln-scanner&lt;/code&gt;, &lt;code&gt;workflow-security-audit&lt;/code&gt;, &lt;code&gt;auto-merge&lt;/code&gt;, &lt;code&gt;repo-pulse&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crypto &amp;amp; Markets&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;defi-monitor&lt;/code&gt;, &lt;code&gt;monitor-polymarket&lt;/code&gt;, &lt;code&gt;monitor-kalshi&lt;/code&gt;, &lt;code&gt;token-alert&lt;/code&gt;, &lt;code&gt;narrative-tracker&lt;/code&gt;, &lt;code&gt;unlock-monitor&lt;/code&gt;, &lt;code&gt;price-threshold-alert&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Social &amp;amp; Writing&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;write-tweet&lt;/code&gt;, &lt;code&gt;thread-formatter&lt;/code&gt;, &lt;code&gt;farcaster-digest&lt;/code&gt;, &lt;code&gt;show-hn-draft&lt;/code&gt;, &lt;code&gt;syndicate-article&lt;/code&gt;, &lt;code&gt;tweet-roundup&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Productivity&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;morning-brief&lt;/code&gt;, &lt;code&gt;evening-recap&lt;/code&gt;, &lt;code&gt;daily-routine&lt;/code&gt;, &lt;code&gt;goal-tracker&lt;/code&gt;, &lt;code&gt;weekly-review&lt;/code&gt;, &lt;code&gt;idea-capture&lt;/code&gt;, &lt;code&gt;startup-idea&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Meta / Agent&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;heartbeat&lt;/code&gt;, &lt;code&gt;skill-health&lt;/code&gt;, &lt;code&gt;skill-repair&lt;/code&gt;, &lt;code&gt;self-improve&lt;/code&gt;, &lt;code&gt;skill-evals&lt;/code&gt;, &lt;code&gt;cost-report&lt;/code&gt;, &lt;code&gt;skill-leaderboard&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The crypto/markets category is heavier than I expected — there's a clear bias toward "monitor things and tell me when they move." But the dev tooling section (PR review, vuln scan, workflow security audit, GitHub trending) is genuinely useful for any maintainer, and the productivity skills are model-agnostic.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;meta skills are the interesting bit&lt;/strong&gt;. &lt;code&gt;skill-health&lt;/code&gt; audits quality scores, &lt;code&gt;skill-evals&lt;/code&gt; runs assertion-based tests against skill outputs to catch regressions, &lt;code&gt;skill-repair&lt;/code&gt; diagnoses and patches failing skills automatically, and &lt;code&gt;self-improve&lt;/code&gt; evolves prompts and configs based on past performance. This is the closest thing I've seen in open-source to an agent that maintains itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup: from fork to first run
&lt;/h2&gt;

&lt;p&gt;The flow is genuinely tight. Here's what I did:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone and start the local dashboard&lt;/span&gt;
git clone https://github.com/aaronjmars/aeon
&lt;span class="nb"&gt;cd &lt;/span&gt;aeon &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ./aeon
&lt;span class="c"&gt;# Opens http://localhost:5555&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the dashboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Authenticate&lt;/strong&gt; — paste an &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; or run &lt;code&gt;claude setup-token&lt;/code&gt; to get a 1-year OAuth token from your Claude Pro/Max subscription&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connect a channel&lt;/strong&gt; — I added Telegram with &lt;code&gt;TELEGRAM_BOT_TOKEN&lt;/code&gt; and &lt;code&gt;TELEGRAM_CHAT_ID&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick skills&lt;/strong&gt; — toggled &lt;code&gt;hacker-news-digest&lt;/code&gt;, &lt;code&gt;github-trending&lt;/code&gt;, &lt;code&gt;morning-brief&lt;/code&gt;, &lt;code&gt;pr-review&lt;/code&gt;, plus the default &lt;code&gt;heartbeat&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set schedules&lt;/strong&gt; — daily 8am UTC for content, every 6h for &lt;code&gt;pr-review&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push&lt;/strong&gt; — one button commits &lt;code&gt;aeon.yml&lt;/code&gt; and triggers &lt;code&gt;git push&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then on the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./onboard &lt;span class="nt"&gt;--remote&lt;/span&gt;
&lt;span class="c"&gt;# Validates secrets, workflows, memory dir, and notification channel&lt;/span&gt;
&lt;span class="c"&gt;# Posts a checklist to Telegram when it's done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;./onboard&lt;/code&gt; is a small touch but it matters — it catches the half-configured forks where you forgot to add a secret, and it does the check &lt;strong&gt;inside Actions&lt;/strong&gt;, so it's verifying the same environment your skills will run in.&lt;/p&gt;

&lt;p&gt;Within ~10 minutes of pushing, &lt;code&gt;heartbeat&lt;/code&gt; fired and Telegram got its first message: &lt;code&gt;HEARTBEAT_OK&lt;/code&gt;. The next morning at 08:00 UTC, the &lt;code&gt;morning-brief&lt;/code&gt; skill landed in my chat, pulling from the prior 24h of &lt;code&gt;hacker-news-digest&lt;/code&gt; and &lt;code&gt;github-trending&lt;/code&gt; outputs because I'd chained them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The skill chain feature
&lt;/h2&gt;

&lt;p&gt;This is where Aeon gets interesting beyond "scheduled prompts." Skills can be &lt;strong&gt;chained&lt;/strong&gt; so their outputs flow into downstream skills. The relevant chunk of &lt;code&gt;aeon.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;chains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;morning-pipeline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
    &lt;span class="na"&gt;on_error&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fail-fast&lt;/span&gt;    &lt;span class="c1"&gt;# or: continue&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;parallel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;token-movers&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;hacker-news-digest&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# run concurrently&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;skill&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;morning-brief&lt;/span&gt;
        &lt;span class="na"&gt;consume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;token-movers&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;hacker-news-digest&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;    &lt;span class="c1"&gt;# outputs injected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Behind the scenes, each step is a separate workflow dispatch. Outputs land in &lt;code&gt;.outputs/{skill}.md&lt;/code&gt; and downstream steps with &lt;code&gt;consume:&lt;/code&gt; get them injected into Claude's context. &lt;code&gt;on_error: fail-fast&lt;/code&gt; aborts on any failure; &lt;code&gt;continue&lt;/code&gt; keeps going.&lt;/p&gt;

&lt;p&gt;This isn't novel architecturally — n8n, LangGraph, Temporal all do something similar — but the implementation is shockingly simple. There's no orchestrator service. Just GitHub Actions workflows dispatching each other and writing markdown files to the repo. If a chain breaks, you &lt;code&gt;git log&lt;/code&gt; to debug it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The self-healing loop
&lt;/h2&gt;

&lt;p&gt;This is the feature I expected to be marketing fluff. It mostly isn't.&lt;/p&gt;

&lt;p&gt;Every skill output is scored 1–5 by Claude Haiku after each run. The rubric is in the repo — failed/empty outputs → 1, excellent → 5. Scores accumulate in &lt;code&gt;memory/skill-health/&amp;lt;skill&amp;gt;.json&lt;/code&gt; with a rolling 30-run history, plus flags like &lt;code&gt;api_error&lt;/code&gt;, &lt;code&gt;stale_data&lt;/code&gt;, &lt;code&gt;rate_limited&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When a skill fails 3 times in a row (a "reactive trigger"), &lt;code&gt;skill-repair&lt;/code&gt; auto-fires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;reactive&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;skill-repair&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;trigger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;consecutive_failures&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;skill-repair&lt;/code&gt; reads the skill's &lt;code&gt;SKILL.md&lt;/code&gt;, the failure logs, the rolling score history, then &lt;strong&gt;edits the SKILL.md&lt;/strong&gt; and commits a fix. I deliberately broke &lt;code&gt;hacker-news-digest&lt;/code&gt; by giving it a malformed RSS URL — after 3 failures it patched the URL back to &lt;code&gt;https://hnrss.org/frontpage&lt;/code&gt; and the next run succeeded. The patch commit is in the repo's git history with a &lt;code&gt;[skill-repair]&lt;/code&gt; prefix.&lt;/p&gt;

&lt;p&gt;This is genuinely the most interesting thing Aeon does. It's not perfect — &lt;code&gt;skill-repair&lt;/code&gt; can't fix logic bugs in its own scoring or runaway infinite loops, and it consumes Opus tokens to do the patching — but for transient API breakages and rate-limit flaps, it really does keep skills alive without intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Aeon compares to other agent frameworks
&lt;/h2&gt;

&lt;p&gt;The README pits Aeon against Claude Code, Hermes, and OpenClaw on unattended scheduling, self-healing, output quality monitoring, persistent memory, and reactive triggers. That's fair but apples-to-oranges — Claude Code and OpenClaw are &lt;strong&gt;interactive coding agents&lt;/strong&gt; built around approval loops, not schedulers. The right comparisons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;vs. n8n&lt;/strong&gt;: Aeon is more agentic (Claude reasons over each task) but less drag-and-drop. n8n wins for non-LLM workflows; Aeon wins when every step benefits from an LLM in the loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vs. LangGraph / CrewAI&lt;/strong&gt;: Aeon doesn't have a typed state-machine model. It's prompts + cron + markdown files. Simpler to start, less safety net for complex multi-agent flows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vs. self-hosted cron + scripts&lt;/strong&gt;: Aeon adds quality scoring, self-repair, fleet management, and a dashboard. If you've ever cobbled together GitHub Actions + a shell script + Anthropic SDK to get a daily digest, Aeon is the polished version of that idea.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Community reaction
&lt;/h2&gt;

&lt;p&gt;The Show HN thread went live May 15, 2026 with a mixed-but-mostly-positive response. Patterns from comments and dev.to coverage:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The wins.&lt;/strong&gt; GitHub-Actions-as-runtime got the most love — "no infra to maintain" is real, since most agent frameworks die in production because nobody wants to babysit another Python process on a VPS. The 117-skill catalog also drew praise as a working-examples library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pushback.&lt;/strong&gt; Biggest critique: &lt;strong&gt;token spend&lt;/strong&gt;. With Opus 4.7 as default and Haiku for scoring, a moderately active fork (20 skills, mostly daily) burns $5–15/day, dropping to $2–5/day via the Bankr gateway. Fine for power users, prohibitive if you wanted "$10/month scheduled assistant." Second concern: &lt;strong&gt;Actions minutes on private repos&lt;/strong&gt; — public repos get unlimited free minutes, but most people fork private because &lt;code&gt;memory/&lt;/code&gt; contains personal notes, putting them in the standard Actions quota. Third: a few security folks flagged that &lt;code&gt;./add-skill&lt;/code&gt; can pull &lt;code&gt;SKILL.md&lt;/code&gt; files from arbitrary GitHub repos; there's a &lt;code&gt;skill-security-scan&lt;/code&gt; meta-skill and &lt;code&gt;./add-skill&lt;/code&gt; runs a check, but the usual rule applies — read every &lt;code&gt;SKILL.md&lt;/code&gt; before enabling it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;p&gt;After two days of actual use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It's loud by default.&lt;/strong&gt; Every skill wants to notify you. Quieting the channel down to high-signal-only takes deliberate effort — there's no global "only ping on quality &amp;lt; 3 or change-in-state" toggle yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold starts are slow.&lt;/strong&gt; Every run is a fresh Actions checkout + Claude Code install. Median run time 2–10 minutes. Fine for daily digests, useless for "tell me about a price spike &lt;em&gt;right now&lt;/em&gt;."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reactive triggers fire on the next 5-minute tick&lt;/strong&gt; — not real-time by any stretch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The dashboard is local-only.&lt;/strong&gt; Editing &lt;code&gt;aeon.yml&lt;/code&gt; from your phone means editing YAML in the GitHub mobile app.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No first-class non-Anthropic support yet.&lt;/strong&gt; Bankr gateway adds GPT/Gemini/Kimi/Qwen, but core skills assume Claude's tool-use conventions and degrade with swaps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory grows unbounded.&lt;/strong&gt; &lt;code&gt;memory/&lt;/code&gt; accumulates forever. A &lt;code&gt;weekly-review&lt;/code&gt; skill summarizes but no automatic pruning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill quality varies.&lt;/strong&gt; Headline skills (&lt;code&gt;morning-brief&lt;/code&gt;, &lt;code&gt;hacker-news-digest&lt;/code&gt;, &lt;code&gt;pr-review&lt;/code&gt;, &lt;code&gt;deep-research&lt;/code&gt;) are polished. Some crypto/markets deep cuts are less battle-tested and score 2/5 out of the box.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Who should use Aeon
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You already have a Claude Pro/Max subscription and want OAuth-billed agent work&lt;/li&gt;
&lt;li&gt;You want a daily/weekly LLM-powered briefing on a topic, repo, market, or domain&lt;/li&gt;
&lt;li&gt;You're a solo dev or small team that wants PR reviews and security scans automated&lt;/li&gt;
&lt;li&gt;You're comfortable reading YAML and &lt;code&gt;SKILL.md&lt;/code&gt; files&lt;/li&gt;
&lt;li&gt;You want to experiment with self-healing agent loops without building one from scratch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bad fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need real-time response (sub-minute latency)&lt;/li&gt;
&lt;li&gt;You have hard budget caps below ~$50/month for API spend on a moderately-used fork&lt;/li&gt;
&lt;li&gt;You want a managed SaaS — Aeon is decidedly DIY&lt;/li&gt;
&lt;li&gt;You don't trust Claude Code to commit to your repo (it does, via GitHub Actions tokens)&lt;/li&gt;
&lt;li&gt;You need fine-grained typed state machines for multi-agent workflows — use LangGraph&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick start, recapped
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Fork on GitHub, then:&lt;/span&gt;
git clone https://github.com/&amp;lt;you&amp;gt;/aeon
&lt;span class="nb"&gt;cd &lt;/span&gt;aeon

&lt;span class="c"&gt;# 2. Local dashboard&lt;/span&gt;
./aeon
&lt;span class="c"&gt;# → http://localhost:5555&lt;/span&gt;

&lt;span class="c"&gt;# 3. In the dashboard:&lt;/span&gt;
&lt;span class="c"&gt;#    - Paste CLAUDE_CODE_OAUTH_TOKEN (or ANTHROPIC_API_KEY)&lt;/span&gt;
&lt;span class="c"&gt;#    - Connect Telegram/Discord/Slack&lt;/span&gt;
&lt;span class="c"&gt;#    - Toggle skills, set schedules&lt;/span&gt;
&lt;span class="c"&gt;#    - Push&lt;/span&gt;

&lt;span class="c"&gt;# 4. Validate the setup ran end-to-end&lt;/span&gt;
./onboard &lt;span class="nt"&gt;--remote&lt;/span&gt;

&lt;span class="c"&gt;# 5. Watch your first run land&lt;/span&gt;
gh run watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total time from &lt;code&gt;git clone&lt;/code&gt; to first skill output in Telegram: about 12 minutes the first time, mostly waiting for GitHub Actions queue.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is Aeon free?&lt;/strong&gt;&lt;br&gt;
The framework is MIT-licensed. The compute is free on public GitHub repos (unlimited Actions minutes). The LLM costs are not free — you'll spend $2–15/day on Claude API tokens depending on how many skills you enable, dropping if you route through the Bankr LLM gateway or use Sonnet/Haiku for non-critical skills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use models other than Claude?&lt;/strong&gt;&lt;br&gt;
Partially. The Bankr LLM gateway adds GPT, Gemini, Kimi, and Qwen access, set via &lt;code&gt;gateway: { provider: bankr }&lt;/code&gt; in &lt;code&gt;aeon.yml&lt;/code&gt;. But core skills are written against Claude's tool-use conventions, so non-Claude models will degrade some skills. The default and most-tested model is &lt;code&gt;claude-opus-4-7&lt;/code&gt;, with &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; and &lt;code&gt;claude-haiku-4-5&lt;/code&gt; as cheaper alternates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does Aeon compare to Claude Code or OpenClaw?&lt;/strong&gt;&lt;br&gt;
They solve different problems. Claude Code and OpenClaw are interactive — you sit at a TUI/IDE and approve actions. Aeon is unattended — you configure it once and it runs on a schedule. You'd typically use Claude Code or OpenClaw to &lt;em&gt;write&lt;/em&gt; the skills, and Aeon to &lt;em&gt;run&lt;/em&gt; them on cron without you watching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can Aeon write its own new skills?&lt;/strong&gt;&lt;br&gt;
Yes, via the &lt;code&gt;create-skill&lt;/code&gt; meta-skill. You give it a description ("watch for new packages in npm registry matching this pattern and DM me") and it generates a &lt;code&gt;SKILL.md&lt;/code&gt;, a manifest entry, and a &lt;code&gt;aeon.yml&lt;/code&gt; config block as a PR you can merge. Quality is mixed — generated skills usually need a round of human edits before they're production-ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is it safe to give Aeon write access to my repo?&lt;/strong&gt;&lt;br&gt;
Aeon commits to its own fork using GitHub Actions' built-in &lt;code&gt;GITHUB_TOKEN&lt;/code&gt;, scoped to that repository. It can't reach other repos unless you give it a personal access token explicitly. The bigger risk is third-party skills imported via &lt;code&gt;./add-skill&lt;/code&gt; — read every &lt;code&gt;SKILL.md&lt;/code&gt; before enabling it, and don't put production secrets in the same fork as experimental skills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens if a skill goes haywire and burns through tokens?&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;cost-report&lt;/code&gt; skill generates a weekly breakdown by skill and model, and &lt;code&gt;memory/token-usage.csv&lt;/code&gt; tracks every run. The &lt;code&gt;skill-health&lt;/code&gt; skill flags &lt;code&gt;rate_limited&lt;/code&gt; and &lt;code&gt;api_error&lt;/code&gt; patterns. There's no hard budget enforcement at the framework level yet — if you set a $0/day budget at Anthropic, requests will start 429ing and &lt;code&gt;skill-repair&lt;/code&gt; will (eventually) catch the pattern and pause the skill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does Aeon work with GitLab or self-hosted Git?&lt;/strong&gt;&lt;br&gt;
Not yet. The framework is hardcoded to GitHub Actions for runtime and GitHub for git operations. A GitLab port would be possible but isn't on the roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The verdict
&lt;/h2&gt;

&lt;p&gt;Aeon is the first agent framework I've tried in 2026 where "set and forget" actually rings true after a few days of use. The GitHub-Actions-as-runtime decision is the right call — it eliminates the operational tax that kills most homegrown agent setups before they reach week two. The self-healing loop is novel and works for the failure modes it's designed for.&lt;/p&gt;

&lt;p&gt;The catches are real: token costs add up fast, default notification volume is excessive, cold start latency is 2–10 minutes, and the codebase is opinionated around Claude. If those constraints fit your use case, Aeon is the polished version of every "I should automate this with Claude on cron" idea you've had this year.&lt;/p&gt;

&lt;p&gt;If you want to play, start small: fork, enable just &lt;code&gt;heartbeat&lt;/code&gt; and one content skill, run for 48 hours, then decide what to add. The framework is good. The temptation to enable all 117 skills at once is the trap.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/aaronjmars/aeon" rel="noopener noreferrer"&gt;github.com/aaronjmars/aeon&lt;/a&gt;&lt;br&gt;
🔗 &lt;strong&gt;Show HN article:&lt;/strong&gt; &lt;a href="https://dev.to/aaronjmars/aeon-the-background-ai-agent-that-runs-on-github-actions-16am"&gt;dev.to/aaronjmars/aeon-the-background-ai-agent-that-runs-on-github-actions&lt;/a&gt;&lt;br&gt;
🔗 &lt;strong&gt;License:&lt;/strong&gt; MIT&lt;/p&gt;

</description>
      <category>aeon</category>
      <category>autonomousagents</category>
      <category>githubactions</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>CloakBrowser Review: Stealth Chromium for AI Agents (2026)</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Sun, 17 May 2026 11:12:02 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/cloakbrowser-review-stealth-chromium-for-ai-agents-2026-216p</link>
      <guid>https://dev.to/andrew-ooo/cloakbrowser-review-stealth-chromium-for-ai-agents-2026-216p</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/cloakbrowser-stealth-chromium-playwright-replacement-review/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CloakBrowser&lt;/strong&gt; is an open-source stealth Chromium fork from New York-based CloakHQ that landed on GitHub Trending this week with &lt;strong&gt;13,075 stars&lt;/strong&gt; (8,618 gained in the last 7 days). Unlike every other "stealth browser" library, it isn't a JavaScript shim glued on top of Playwright — it's a real Chromium binary with &lt;strong&gt;49 source-level C++ patches&lt;/strong&gt; to canvas, WebGL, audio, fonts, GPU, screen, WebRTC, network timing, and CDP. Key facts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drop-in replacement&lt;/strong&gt; for Playwright and Puppeteer — swap one import, your code works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.9 reCAPTCHA v3 score&lt;/strong&gt; — server-side verified, statistically human&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Passes Cloudflare Turnstile&lt;/strong&gt; (non-interactive + managed challenges)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;humanize=True&lt;/code&gt; flag&lt;/strong&gt; for Bézier mouse curves, per-character keystroke timing, realistic scroll physics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pip install cloakbrowser&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;npm install cloakbrowser&lt;/code&gt;&lt;/strong&gt; — binary auto-downloads (~200MB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIT-licensed&lt;/strong&gt;, no usage limits, no API keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30/30 detection tests passed&lt;/strong&gt; as of Chromium 146 (April 2026)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building AI agents that need to read the open web — research assistants, price monitors, e-commerce trackers, lead enrichers — CloakBrowser is the first open-source tool I've tested that actually clears the bar set by paid services like Bright Data Scraping Browser or Multilogin, without the subscription.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/CloakHQ/CloakBrowser" rel="noopener noreferrer"&gt;CloakHQ/CloakBrowser&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PyPI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pypi.org/project/cloakbrowser/" rel="noopener noreferrer"&gt;&lt;code&gt;cloakbrowser&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;npm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.npmjs.com/package/cloakbrowser" rel="noopener noreferrer"&gt;&lt;code&gt;cloakbrowser&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://hub.docker.com/r/cloakhq/cloakbrowser" rel="noopener noreferrer"&gt;&lt;code&gt;cloakhq/cloakbrowser&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chromium 146 (rebased monthly)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Playwright / Puppeteer (drop-in)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;13,075 (+8,618 this week)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;First release&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Feb 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why This Tool Suddenly Matters
&lt;/h2&gt;

&lt;p&gt;For the last three years, "stealth browser" has meant one of three things, and every option had a fatal flaw:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;playwright-stealth&lt;/code&gt; / &lt;code&gt;puppeteer-extra-stealth&lt;/code&gt;&lt;/strong&gt; — JavaScript shims that patch &lt;code&gt;navigator.webdriver&lt;/code&gt; and a handful of other properties at runtime. Cloudflare started detecting &lt;em&gt;the patches themselves&lt;/em&gt; in 2024. Today these libraries score 0.1–0.3 on reCAPTCHA v3 — pure bot territory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;undetected-chromedriver&lt;/code&gt;&lt;/strong&gt; — Selenium-based, patches at the config level. Same problem: every Chrome major release breaks it, and modern Cloudflare fingerprints the patched binary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Camoufox&lt;/strong&gt; — A Firefox fork with C++ patches. Genuinely effective, but Firefox-only, no native Playwright API, and the project is hard to keep current with upstream.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;CloakBrowser is the first project to do for Chromium what Camoufox did for Firefox — patches at the C++ source level, compiled into the binary — &lt;em&gt;while&lt;/em&gt; exposing the standard Playwright API everyone already uses. That combination is why it added 6,907 stars in the week it hit GitHub Trending and why every web-scraping subreddit is suddenly talking about it.&lt;/p&gt;

&lt;p&gt;The timing also lines up with a structural change in 2026: more sites are deploying machine-learning behavioral detection that scores mouse jitter, scroll easing, and keystroke entropy. JavaScript-only stealth tools can't fake those signals because the events still originate from CDP. CloakBrowser's &lt;code&gt;humanize=True&lt;/code&gt; mode dispatches input through isolated worlds with trusted-event flags, which is why it passes deviceandbrowserinfo.com's 24-signal behavioral check.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Actually Works
&lt;/h2&gt;

&lt;p&gt;CloakBrowser ships as a thin Python/Node wrapper around a custom Chromium binary:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;pip install cloakbrowser&lt;/code&gt; or &lt;code&gt;npm install cloakbrowser&lt;/code&gt; (small, ~5 MB)&lt;/li&gt;
&lt;li&gt;On first launch, the wrapper auto-downloads a platform-specific Chromium build (~200 MB, SHA-256 verified)&lt;/li&gt;
&lt;li&gt;Every subsequent launch boots Playwright (or Puppeteer) against that binary instead of stock Chromium&lt;/li&gt;
&lt;li&gt;Your code uses the &lt;strong&gt;standard Playwright API&lt;/strong&gt; — no new vocabulary&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The 49 patches sit in the binary itself. They cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Canvas / WebGL&lt;/strong&gt;: deterministic-noise rendering so fingerprint hashes look natural but unique&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio context&lt;/strong&gt;: spoofed sample buffer with realistic dynamic range&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Font enumeration&lt;/strong&gt;: real-OS font list, not the headless-Chrome subset&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU + screen&lt;/strong&gt;: hardware-tier reporting that matches the spoofed user agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebRTC&lt;/strong&gt;: ICE candidate spoofing tied to proxy exit IP (no more IP leaks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network timing&lt;/strong&gt;: DNS/TCP/TLS handshake jitter zeroed so proxy fingerprints don't leak&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation signals&lt;/strong&gt;: &lt;code&gt;navigator.webdriver&lt;/code&gt; set to &lt;code&gt;false&lt;/code&gt; at the source, &lt;code&gt;window.chrome&lt;/code&gt; populated, plugin list realistic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDP input behavior&lt;/strong&gt;: &lt;code&gt;humanize=True&lt;/code&gt; dispatches mouse/keyboard via isolated worlds with &lt;code&gt;is_trusted=true&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Crucially, &lt;strong&gt;none of this is JavaScript injection&lt;/strong&gt;. Detection sites that look for patches see a real browser — because it is a real browser, just one that was compiled differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation: Two Minutes End to End
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Python
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;cloakbrowser
&lt;span class="c"&gt;# Optional: pip install cloakbrowser[geoip] for proxy-aware timezone/locale&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cloakbrowser&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;launch&lt;/span&gt;

&lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# headless=True also fine
&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://demo.fingerprint.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;() =&amp;gt; navigator.webdriver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# False
&lt;/span&gt;&lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. First run downloads the binary; second run starts instantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node.js (Playwright)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;cloakbrowser playwright-core
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;launch&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cloakbrowser&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://bot.sannysoft.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sannysoft.png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Migrating an existing Playwright script
&lt;/h3&gt;

&lt;p&gt;The migration is exactly two lines. From the README:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- from playwright.sync_api import sync_playwright
- pw = sync_playwright().start()
- browser = pw.chromium.launch()
&lt;/span&gt;&lt;span class="gi"&gt;+ from cloakbrowser import launch
+ browser = launch()
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;page = browser.new_page()
page.goto("https://example.com")
&lt;/span&gt;# ... rest of your code is unchanged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I migrated a 400-line scraper in under five minutes; not a single line below the import block needed to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Code: A Stealthy AI Research Agent
&lt;/h2&gt;

&lt;p&gt;Here's a complete, working example of an AI agent that uses CloakBrowser to fetch search-engine results without getting blocked — the kind of pipeline that would normally break the moment Cloudflare smells Playwright:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cloakbrowser&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;launch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stealth_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;geoip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# auto timezone/locale from proxy IP
&lt;/span&gt;        &lt;span class="n"&gt;humanize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# bezier mouse, realistic typing
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Humanize a single scroll so behavioral models see motion
&lt;/span&gt;    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mouse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wheel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://duckduckgo.com/?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stealth_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;socks5://user:pass@proxy:1080&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the top results from this SERP HTML:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;50_000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;research&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;best open source vector database 2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things to notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;geoip=True&lt;/code&gt;&lt;/strong&gt; does an HTTP call through the proxy to resolve the exit IP, then aligns the browser's timezone and locale to match. This single flag eliminates the most common detection signal — proxy-IP-vs-browser-locale mismatch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;humanize=True&lt;/code&gt;&lt;/strong&gt; doesn't just affect the explicit &lt;code&gt;mouse.wheel()&lt;/code&gt; call — it changes how &lt;em&gt;every&lt;/em&gt; &lt;code&gt;click&lt;/code&gt;, &lt;code&gt;type&lt;/code&gt;, and &lt;code&gt;fill&lt;/code&gt; dispatches, end to end.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No CAPTCHA-solving service needed&lt;/strong&gt;. CloakBrowser doesn't solve CAPTCHAs; it prevents them from triggering in the first place.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Detection-Test Scoreboard
&lt;/h2&gt;

&lt;p&gt;Numbers from the README, cross-referenced with my own runs against the same services (May 17, 2026, Chromium 146 binary, headless mode, residential proxy):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Detection Service&lt;/th&gt;
&lt;th&gt;Stock Playwright&lt;/th&gt;
&lt;th&gt;CloakBrowser&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;reCAPTCHA v3&lt;/td&gt;
&lt;td&gt;0.1 (bot)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9 (human)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare Turnstile (non-interactive)&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;PASS&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare Turnstile (managed)&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;PASS&lt;/strong&gt; (single click)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FingerprintJS bot demo&lt;/td&gt;
&lt;td&gt;DETECTED&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;PASS&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BrowserScan&lt;/td&gt;
&lt;td&gt;DETECTED&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;NORMAL (4/4)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bot.incolumitas.com&lt;/td&gt;
&lt;td&gt;13 fails&lt;/td&gt;
&lt;td&gt;1 fail (WEBDRIVER spec only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;deviceandbrowserinfo.com&lt;/td&gt;
&lt;td&gt;6 true flags&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0 true flags&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;navigator.webdriver&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;false&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;window.chrome&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;undefined&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;object&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UA string&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HeadlessChrome&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;Chrome/146.0.0.0&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLS JA4 fingerprint&lt;/td&gt;
&lt;td&gt;Mismatch&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Identical to real Chrome&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The TLS fingerprint match is the one that surprised me. Most stealth tools don't touch the network stack at all, which is why DataDome and PerimeterX block them on the very first TCP handshake — before any HTML is requested. CloakBrowser's network-timing patches and the underlying Chromium build emit ja3n/ja4/akamai fingerprints that are byte-identical to real desktop Chrome. That alone makes it the only open-source tool I've used that survives PerimeterX-protected pages in headless mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Community Is Saying
&lt;/h2&gt;

&lt;p&gt;The GitHub stars curve tells most of the story — 3,000+ stars/day during the Trending run — but a few specific signals are worth calling out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI-agent integrations:&lt;/strong&gt; the README explicitly lists drop-in support for &lt;code&gt;browser-use&lt;/code&gt;, &lt;code&gt;Crawl4AI&lt;/code&gt;, &lt;code&gt;Scrapling&lt;/code&gt;, &lt;code&gt;Stagehand&lt;/code&gt;, and &lt;code&gt;LangChain&lt;/code&gt;. Several of those projects merged CloakBrowser examples upstream in the last two weeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reddit&lt;/strong&gt; threads in r/webscraping and r/selfhosted hit hundreds of upvotes, with the dominant comment pattern being "I replaced my $500/mo proxy-browser SaaS bill with this."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pushback:&lt;/strong&gt; one widely-shared YouTube review urged users to run CloakBrowser inside a VM or Docker container, since you're running a third-party Chromium binary downloaded from a relatively young organization. The SHA-256 verification and the open-source patch set mitigate most of that risk, but it's a fair call-out — production deployments should pin a specific binary version rather than auto-update.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloakHQ also shipped a &lt;a href="https://github.com/CloakHQ/CloakBrowser-Manager" rel="noopener noreferrer"&gt;Manager UI&lt;/a&gt;&lt;/strong&gt; for browser-profile workflows — basically an open-source Multilogin/GoLogin/AdsPower clone. That's a separate Docker container, free, with noVNC built in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The general developer sentiment, summarized fairly: "Finally, somebody did the actual hard work of patching Chromium properly. About time."&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;Three things to know before you bet a production pipeline on it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It does not solve CAPTCHAs — it avoids them.&lt;/strong&gt; If your target site has interactive Turnstile or hCaptcha challenges that &lt;em&gt;always&lt;/em&gt; trigger (e.g., login forms), you still need a solver service. CloakBrowser only helps you avoid being flagged in the first place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Binary trust is a real consideration.&lt;/strong&gt; You're running a custom Chromium build from a small org. The patches are open source, the SHA-256 is checked, but if you're handling sensitive credentials (banking, healthcare logins), run it isolated. The official Docker image is the cleanest path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No built-in proxy rotation.&lt;/strong&gt; You bring your own proxies. This is by design — CloakBrowser is the &lt;em&gt;browser&lt;/em&gt;, not the network — but newcomers occasionally expect a turnkey solution. For rotation, pair it with something like &lt;a href="https://github.com/rofl0r/proxychains-ng" rel="noopener noreferrer"&gt;proxychains&lt;/a&gt; or a SDK like &lt;a href="https://www.webshare.io/" rel="noopener noreferrer"&gt;Webshare&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two smaller caveats: every Chromium major version bump (roughly monthly) requires a re-released binary, so you'll occasionally see a 24–48-hour gap before a new build drops. And the binary is ~200 MB, which inflates Docker images noticeably if you're not careful with multi-stage builds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Use This
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Building AI research agents or browsing copilots&lt;/strong&gt; — yes, immediately, especially if your current scraper occasionally hits Cloudflare walls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running price-monitoring / SERP / lead-enrichment crawlers&lt;/strong&gt; — replaces $200–500/mo SaaS browser services for free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Doing competitive intelligence or e-commerce monitoring&lt;/strong&gt; — the geoip + humanize combo is exactly what these targets are tuned to detect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Doing QA testing on your own site&lt;/strong&gt; — overkill, just use stock Playwright.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trying to bypass site terms of service or sign up for accounts en masse&lt;/strong&gt; — please don't; that's not what stealth browsers are for, and you'll burn your IP space anyway.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Compares
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Playwright&lt;/th&gt;
&lt;th&gt;playwright-stealth&lt;/th&gt;
&lt;th&gt;undetected-chromedriver&lt;/th&gt;
&lt;th&gt;Camoufox&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;CloakBrowser&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;reCAPTCHA v3 score&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.3–0.5&lt;/td&gt;
&lt;td&gt;0.3–0.7&lt;/td&gt;
&lt;td&gt;0.7–0.9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare Turnstile&lt;/td&gt;
&lt;td&gt;Fail&lt;/td&gt;
&lt;td&gt;Sometimes&lt;/td&gt;
&lt;td&gt;Sometimes&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Pass&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Patch level&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;JS injection&lt;/td&gt;
&lt;td&gt;Config patches&lt;/td&gt;
&lt;td&gt;C++ (Firefox)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;C++ (Chromium)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Survives Chrome updates&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Breaks often&lt;/td&gt;
&lt;td&gt;Breaks often&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active maintenance&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Stale&lt;/td&gt;
&lt;td&gt;Stale&lt;/td&gt;
&lt;td&gt;Unstable&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Active&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine&lt;/td&gt;
&lt;td&gt;Chromium&lt;/td&gt;
&lt;td&gt;Chromium&lt;/td&gt;
&lt;td&gt;Chrome&lt;/td&gt;
&lt;td&gt;Firefox&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Chromium&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Playwright API&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;No (Selenium)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Native&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Free (MIT)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest summary: if you're on Firefox-friendly targets and OK with a non-Playwright API, &lt;a href="https://camoufox.com/" rel="noopener noreferrer"&gt;Camoufox&lt;/a&gt; is mature and excellent. For everything else — and particularly anything Playwright-native — CloakBrowser is now the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How is CloakBrowser different from &lt;code&gt;playwright-stealth&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;playwright-stealth&lt;/code&gt; injects JavaScript at page load to overwrite &lt;code&gt;navigator.webdriver&lt;/code&gt; and a handful of other properties. Modern detection systems specifically fingerprint &lt;em&gt;the act of patching&lt;/em&gt; — they look at when properties were defined, in what order, with what stack trace. CloakBrowser patches at the C++ source level, so there's nothing to detect. As the README puts it: "Detection sites see a real browser because it is a real browser."&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use it with my existing Playwright code?
&lt;/h3&gt;

&lt;p&gt;Yes. Replace your &lt;code&gt;playwright.sync_api&lt;/code&gt;/&lt;code&gt;@playwright/test&lt;/code&gt; browser launch with &lt;code&gt;cloakbrowser&lt;/code&gt;'s &lt;code&gt;launch()&lt;/code&gt;. Every Page, Locator, and Frame API works identically — CloakBrowser only swaps the underlying Chromium binary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does it work with &lt;code&gt;browser-use&lt;/code&gt;, &lt;code&gt;Crawl4AI&lt;/code&gt;, or &lt;code&gt;LangChain&lt;/code&gt; agents?
&lt;/h3&gt;

&lt;p&gt;Yes — the README lists drop-in support for &lt;code&gt;browser-use&lt;/code&gt;, &lt;code&gt;Crawl4AI&lt;/code&gt;, &lt;code&gt;Scrapling&lt;/code&gt;, &lt;code&gt;Stagehand&lt;/code&gt;, &lt;code&gt;LangChain&lt;/code&gt;, and Selenium. You pass CloakBrowser as the underlying browser. Several of those projects shipped explicit &lt;code&gt;cloakbrowser&lt;/code&gt; examples in the last fortnight.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is it safe to run a custom Chromium binary?
&lt;/h3&gt;

&lt;p&gt;The source patches are open and the binary is SHA-256 verified, but you are still running a third-party Chromium build. For sensitive workloads (credentials, banking), use the official Docker image and pin the binary version with &lt;code&gt;cloakbrowser --version-pin&lt;/code&gt;. The patches are reviewable on GitHub if you want to audit them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does CloakBrowser solve CAPTCHAs?
&lt;/h3&gt;

&lt;p&gt;No. It &lt;em&gt;prevents&lt;/em&gt; them from triggering in the first place by looking enough like a real user that the bot-score-based challenges don't fire. If you hit an unconditional interactive CAPTCHA (e.g., on a login form), you still need a solver service like 2Captcha or CapSolver.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about IP / proxy rotation?
&lt;/h3&gt;

&lt;p&gt;CloakBrowser doesn't rotate proxies — that's outside its scope. You pass a proxy with &lt;code&gt;launch(proxy=…)&lt;/code&gt;, and with &lt;code&gt;geoip=True&lt;/code&gt; it auto-aligns timezone, locale, and WebRTC IP to that proxy's exit. Pair with any external rotator (Webshare, Bright Data, your own pool).&lt;/p&gt;

&lt;h3&gt;
  
  
  How often does the binary update?
&lt;/h3&gt;

&lt;p&gt;CloakHQ rebases the patches onto every Chromium stable release — currently Chromium 146 (April 2026 rebase). Auto-update is enabled by default; you can pin a version for reproducibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is there a managed/cloud version?
&lt;/h3&gt;

&lt;p&gt;Not as a hosted service. But the companion &lt;a href="https://github.com/CloakHQ/CloakBrowser-Manager" rel="noopener noreferrer"&gt;CloakBrowser-Manager&lt;/a&gt; gives you a self-hostable Multilogin/AdsPower-style UI for profile management with noVNC access — open-source under MIT.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;CloakBrowser is the rare project that does exactly one thing and does it better than anyone else: it lets your existing Playwright code clear modern bot defenses. The drop-in API means you can A/B test it against stock Playwright in fifteen minutes; the MIT license means you can ship it in production without legal review; and the 49 source-level patches mean it actually works against the detection systems that matter in 2026.&lt;/p&gt;

&lt;p&gt;If you've been holding off on building a real web-research agent because every scraping POC dies the moment it hits Cloudflare, this is your green light.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/CloakHQ/CloakBrowser" rel="noopener noreferrer"&gt;⭐ Star CloakBrowser on GitHub →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Related reading on andrew.ooo:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/browser-use-ai-agent-browser-automation/"&gt;Browser-Use: AI Agent Browser Automation Review&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/posts/trycua-cua-open-source-computer-use-agents/"&gt;Trycua/cua Review: Open-Source Computer-Use Agents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cloakbrowser</category>
      <category>playwright</category>
      <category>puppeteer</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Needle Review: 26M Function-Calling Model for Edge Devices</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Sat, 16 May 2026 11:11:54 +0000</pubDate>
      <link>https://dev.to/andrew-ooo/needle-review-26m-function-calling-model-for-edge-devices-3ei2</link>
      <guid>https://dev.to/andrew-ooo/needle-review-26m-function-calling-model-for-edge-devices-3ei2</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Originally published on &lt;a href="https://andrew.ooo/posts/needle-26m-function-calling-model-review/" rel="noopener noreferrer"&gt;andrew.ooo&lt;/a&gt;&lt;/strong&gt; — visit the original for any updates, code snippets that aged out, or follow-up posts.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Needle&lt;/strong&gt; is a 26-million-parameter function-calling model from the team behind &lt;a href="https://github.com/cactus-compute/cactus" rel="noopener noreferrer"&gt;Cactus&lt;/a&gt;, distilled from Gemini 3.1 Flash-Lite. It runs at &lt;strong&gt;6000 tokens/sec prefill and 1200 tokens/sec decode&lt;/strong&gt; on consumer devices — fast enough for a smartwatch — and it does one thing: turn natural-language queries into JSON tool calls.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;26M parameters&lt;/strong&gt; — roughly 100x smaller than the smallest "useful" chat LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pure attention architecture&lt;/strong&gt; — zero feed-forward / MLP layers, encoder-decoder with cross-attention&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6000 tok/s prefill, 1200 tok/s decode&lt;/strong&gt; on consumer hardware via the Cactus runtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M&lt;/strong&gt; on single-shot function calling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pretrained on 200B tokens in 27 hours&lt;/strong&gt; across 16 TPU v6e; post-trained in 45 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIT license&lt;/strong&gt; — code, weights, and dataset generation pipeline all open&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-command finetune&lt;/strong&gt; on your own tools from your Mac via &lt;code&gt;needle playground&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been waiting for a "compiler pass for agents" — a tiny, deterministic model that handles the boring tool-routing step so your main LLM never has to — Needle is the first credible attempt I've seen at that exact shape.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Most agent frameworks today shove the entire conversation, tool schema, and system prompt into a 70B+ model just to extract one JSON object like &lt;code&gt;{"name": "get_weather", "arguments": {"city": "Paris"}}&lt;/code&gt;. That's a ~$0.01 LLM call to do what is fundamentally a structured prediction problem.&lt;/p&gt;

&lt;p&gt;The waste shows up everywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency.&lt;/strong&gt; Even with Groq or Cerebras, a tool-call round-trip is 200–800ms — an eternity in a voice agent or wearable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost.&lt;/strong&gt; Function-calling is the single largest line item in most production agent bills, and most of it is repetitive routing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy.&lt;/strong&gt; Every tool call leaks query content to a cloud LLM, even when the execution is local (e.g. "turn off the lights").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline.&lt;/strong&gt; Phones, watches, glasses, in-car assistants — none can guarantee a cloud round-trip.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Needle's bet is that &lt;strong&gt;tool calling is not reasoning&lt;/strong&gt; — it's retrieval-and-assembly. Match the query to a tool name, extract argument values, emit JSON. None of those steps requires the per-position non-linear computation that FFN layers provide. So Cactus removed them entirely. That's the most interesting architectural claim to come out of the small-model world in a while.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parameters&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;26M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encoder–decoder, pure attention (no FFN)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encoder&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12 layers, GQA (8 heads / 4 KV), RoPE, gated residuals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Decoder&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8 layers, self-attn + cross-attn, gated residuals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;d_model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vocab&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8192 (SentencePiece BPE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Norm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ZCRMSNorm (zero-centered, init = 0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Precision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;bfloat16 (INT4 QAT during training)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pretraining&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200B tokens, 16× TPU v6e, 27 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Post-training&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2B tokens of function-call data, 45 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distilled from&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini 3.1 Flash-Lite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6000 tok/s prefill, 1200 tok/s decode (Cactus runtime)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/cactus-compute/needle" rel="noopener noreferrer"&gt;cactus-compute/needle&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weights&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://huggingface.co/Cactus-Compute/needle" rel="noopener noreferrer"&gt;Cactus-Compute/needle&lt;/a&gt; on Hugging Face&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What It Actually Is
&lt;/h2&gt;

&lt;p&gt;Needle is &lt;strong&gt;a specialist model that converts &lt;code&gt;(query, tools)&lt;/code&gt; into &lt;code&gt;tool_call JSON&lt;/code&gt;&lt;/strong&gt; and nothing else. It is not a chatbot. It does not do RAG. It does not write code. It does not decide &lt;em&gt;whether&lt;/em&gt; a tool should be called — that's your job, or the job of the larger model in your stack.&lt;/p&gt;

&lt;p&gt;Given &lt;code&gt;Query: What's the weather in San Francisco?&lt;/code&gt; and a &lt;code&gt;get_weather&lt;/code&gt; tool, it emits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"San Francisco"&lt;/span&gt;&lt;span class="p"&gt;}}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole product. The reason it matters: 99% of the cycles in a typical agent loop are spent doing exactly this — and Needle does it in milliseconds, locally, for free, while a 70B model is still loading its KV cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Simple Attention Network" architecture
&lt;/h3&gt;

&lt;p&gt;The Cactus team calls their architecture a &lt;strong&gt;Simple Attention Network (SAN)&lt;/strong&gt;. The key choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Encoder–decoder, not decoder-only.&lt;/strong&gt; The encoder reads the full &lt;code&gt;(query, tools)&lt;/code&gt; input bidirectionally in one shot; the decoder generates the JSON via cross-attention. No KV-cache of input tokens during generation — the encoder representation is fixed-size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No FFN / MLP layers anywhere.&lt;/strong&gt; Standard transformers use ~2/3 of their parameters in FFN. Needle removes them entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gated residuals.&lt;/strong&gt; Without FFNs, plain &lt;code&gt;x = x + Attn(Norm(x))&lt;/code&gt; is limiting. Cactus uses &lt;code&gt;x = x + sigmoid(gate) * Attn(Norm(x))&lt;/code&gt;, gate initialised to zero, so each layer starts at half-strength.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ZCRMSNorm.&lt;/strong&gt; Zero-centered RMSNorm (&lt;code&gt;x * (1 + gamma) / RMS(x)&lt;/code&gt;, gamma init = 0), identity at init. Pairs with gated residuals so the whole network starts as a damped identity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLIP-style tool retrieval head.&lt;/strong&gt; The encoder also produces a unit vector for contrastive search, letting you pre-filter to the top-k most relevant tools from a large catalogue before generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Muon + AdamW dual optimiser.&lt;/strong&gt; Muon (Newton–Schulz orthogonalisation) on Q/K/V/O projections prevents representation collapse in a deep stack of pure linear-then-softmax layers. AdamW for everything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INT4 quantisation-aware training.&lt;/strong&gt; Fake INT4 quantisation every 100 steps with straight-through estimators. The model trains at the same precision it deploys at — no post-training quantisation gap.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you've read nGPT or DeepSeek-V3, several tricks will look familiar. The novelty is the combination plus the FFN-free claim. Full write-up: &lt;a href="https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md" rel="noopener noreferrer"&gt;docs/simple_attention_networks.md&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;The full path from clone to running prediction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/cactus-compute/needle.git
&lt;span class="nb"&gt;cd &lt;/span&gt;needle &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; ./setup
needle playground
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That opens a Gradio web UI at &lt;code&gt;http://127.0.0.1:7860&lt;/code&gt; where you can paste tool schemas and queries. Weights auto-download from Hugging Face on first run.&lt;/p&gt;

&lt;p&gt;For programmatic use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_checkpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SimpleAttentionNetwork&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_tokenizer&lt;/span&gt;

&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpoints/needle.pkl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleAttentionNetwork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_tokenizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the weather in San Francisco?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Finetuning on your own tools
&lt;/h3&gt;

&lt;p&gt;This is where Needle shines. The model was post-trained on synthetic data across 15 generic tool categories (timers, messaging, navigation, smart home, etc.). For your own product's tools, you almost certainly want to finetune — which the playground UI makes shockingly easy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;needle finetune data.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;JSONL format is three fields per line — &lt;code&gt;query&lt;/code&gt;, &lt;code&gt;tools&lt;/code&gt; (JSON-encoded string), &lt;code&gt;answers&lt;/code&gt; (also JSON-encoded). Cactus recommends &lt;strong&gt;at least 120 examples per tool&lt;/strong&gt; (100 train / 10 val / 10 test); fewer and the model overfits. You can click "generate-data" in the playground to have Gemini synthesise the dataset from a tool spec, then train immediately. End-to-end "tool spec → finetuned 26M model" in ~10 minutes on a Mac.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why It's Trending Now
&lt;/h2&gt;

&lt;p&gt;Needle hit &lt;strong&gt;#1 Show HN&lt;/strong&gt; on May 14, 2026 with 280+ points and 78 comments, then climbed onto GitHub Trending the next morning. Three forces converged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The "small model" thesis is going mainstream.&lt;/strong&gt; Apple Intelligence on-device, Gemini Nano in Chrome, Phi-4-mini — everyone is shipping sub-1B models. Needle pushes that an order of magnitude further.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Function calling is the most repetitive thing LLMs do.&lt;/strong&gt; OpenAI and Anthropic both surfaced this in 2026 product talks: the same JSON-emission pattern is happening billions of times a day at huge cost. A tiny specialist is the obvious answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The architecture claim is genuinely novel.&lt;/strong&gt; "No FFN, just attention, and it still works" is the kind of thing ML Twitter loves to argue about. The Cactus team posted careful ablations, which helped.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What the Community Is Saying
&lt;/h2&gt;

&lt;p&gt;The HN thread surfaced the usual mix of skepticism and excitement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Is this just a glorified parser?"&lt;/strong&gt; Several commenters argued tool calling is structured prediction. The author (Henry Ndubuaku) agreed and framed Needle as exactly that — a compiler pass for agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"How does it compare to constrained decoding?"&lt;/strong&gt; You can get 99% JSON-validity from any LLM via grammar-constrained sampling (Outlines, Guidance, llama.cpp grammars). The Needle answer: yes, but you still pay the latency cost of a big model. Needle is 100× smaller — you get JSON-validity &lt;em&gt;plus&lt;/em&gt; a 200ms→50ms latency reduction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"What about multi-turn?"&lt;/strong&gt; Single-shot only. Multi-turn agentic loops still need a bigger model. Cactus is explicit: Needle is a routing layer, not a planner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Will this work for a 500-tool MCP setup?"&lt;/strong&gt; That's exactly what the CLIP-style retrieval head is for — encode all tool definitions once, cosine-rank per query, feed the top 10 to the decoder.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On Reddit's r/AI_Agents, a thread titled &lt;em&gt;"A 26M tool-router suggests tool calling should be a compiler pass, not a reasoning step"&lt;/em&gt; drew the most thoughtful discussion. The Needle-as-compiler-pass framing is going to stick.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;Where Needle fits in a stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On-device voice assistants.&lt;/strong&gt; Phone, watch, smart speaker, car. The big LLM handles open-ended conversation in the cloud; Needle handles "set a timer for 8 minutes" without phoning home.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency-critical loops.&lt;/strong&gt; Trading bots, robotics, autonomous driving — anywhere a 500ms LLM round-trip is unacceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-reduction layer.&lt;/strong&gt; Route 80% of tool calls through Needle, fall back to GPT-5 or Claude for ambiguous cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-sensitive verticals.&lt;/strong&gt; Healthcare, legal, defense. The big model never sees user data; Needle routes locally and you send only the resolved tool call to your audited backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP fan-out.&lt;/strong&gt; Sit Needle in front of a large MCP server (50+ tools) as a fast pre-router.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;Needle is impressive but it is &lt;strong&gt;not&lt;/strong&gt; a general-purpose model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No multi-turn reasoning.&lt;/strong&gt; Single-shot only. The "decide what to do next" step in an agent loop still needs a bigger model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No conversational fallback.&lt;/strong&gt; Given any input, Needle emits &lt;em&gt;some&lt;/em&gt; tool call — even if no tool fits. You need a guard model or confidence threshold in front.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finicky out-of-the-box.&lt;/strong&gt; The pretrained checkpoint covers 15 generic tool categories. For domain-specific tools, expect to finetune.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiny vocabulary.&lt;/strong&gt; 8192-token BPE splits rare words aggressively. Fine for tool routing, problematic for anything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One language.&lt;/strong&gt; Trained predominantly on English. Multilingual function calling needs more post-training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apples-to-oranges benchmarks.&lt;/strong&gt; Needle beats FunctionGemma-270M etc. &lt;em&gt;on single-shot tool calling&lt;/em&gt;. It is not better at conversation, code, or reasoning — it cannot do those things at all.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Who Should Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Needle if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building anything that runs on a phone, watch, glasses, in-car, or other constrained device&lt;/li&gt;
&lt;li&gt;You have a high-volume agent loop and your function-call bill is hurting&lt;/li&gt;
&lt;li&gt;Your tool catalogue is well-defined and stable (so finetuning pays off)&lt;/li&gt;
&lt;li&gt;You care about latency or offline operation&lt;/li&gt;
&lt;li&gt;You want to learn how the "no-FFN" architectural bet performs in practice — the code is small, MIT, and beautifully readable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip Needle if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need a model that can also chat, reason, or generate prose&lt;/li&gt;
&lt;li&gt;Your tool catalogue changes constantly and finetuning is operationally painful&lt;/li&gt;
&lt;li&gt;You're already happy with constrained decoding on a 7B model&lt;/li&gt;
&lt;li&gt;You need multi-turn agent loops without a separate planning model&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Comparison with Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Function Calling&lt;/th&gt;
&lt;th&gt;Conversation&lt;/th&gt;
&lt;th&gt;Local?&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Needle&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;26M&lt;/td&gt;
&lt;td&gt;✅ Specialist&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ Phone-class&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FunctionGemma&lt;/td&gt;
&lt;td&gt;270M&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ Limited&lt;/td&gt;
&lt;td&gt;✅ Laptop-class&lt;/td&gt;
&lt;td&gt;Gemma TOS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-0.6B&lt;/td&gt;
&lt;td&gt;600M&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ Laptop-class&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LFM2.5-350M&lt;/td&gt;
&lt;td&gt;350M&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ Laptop-class&lt;/td&gt;
&lt;td&gt;LFM license&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Granite-3.5-350M&lt;/td&gt;
&lt;td&gt;350M&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ Laptop-class&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5-mini (tools mode)&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌ Cloud&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest read: if you want &lt;strong&gt;the cheapest, fastest single-shot tool router that fits on a watch&lt;/strong&gt;, Needle is currently in a class of one. If you want a generalist that also does tool calling, pick one of the 300M–600M models above.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Needle production-ready?
&lt;/h3&gt;

&lt;p&gt;Cactus tags it as an &lt;strong&gt;experimental run&lt;/strong&gt; for the Simple Attention Network architecture, so be honest with yourself: weights are MIT and stable, but the docs say "small models can be finicky" — which is true. For production, finetune on your own tool set, set a confidence threshold, and route low-confidence queries to a bigger model.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Needle handle ambiguous queries?
&lt;/h3&gt;

&lt;p&gt;It doesn't, really. There's no built-in abstain mechanism — given any input, Needle will emit &lt;em&gt;some&lt;/em&gt; tool call, even if no tool fits. In production you want either (a) a separate intent classifier in front of it, or (b) a confidence-score guard derived from the contrastive retrieval head's top-1 cosine similarity. The Cactus team is explicit that this is a routing layer, not a planner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I run Needle on a Raspberry Pi or microcontroller?
&lt;/h3&gt;

&lt;p&gt;The Hugging Face weights run on any Python environment with JAX/PyTorch. For real edge deployment (phones, wearables, MCUs), you want the &lt;a href="https://github.com/cactus-compute/cactus" rel="noopener noreferrer"&gt;Cactus runtime&lt;/a&gt; — Cactus is a separate C++ inference engine built specifically for mobile, wearables, and custom hardware. That's where the 6000 tok/s prefill numbers come from.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Needle compare to constrained decoding?
&lt;/h3&gt;

&lt;p&gt;Different layers of the stack. Constrained decoding (Outlines, Guidance, llama.cpp grammars) guarantees JSON-validity on top of any model. Needle is a smaller, faster model that emits valid JSON natively — the architecture is biased toward JSON output. You can stack constrained decoding &lt;em&gt;on top of&lt;/em&gt; Needle, and probably should for paranoia-grade reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the license for the weights?
&lt;/h3&gt;

&lt;p&gt;MIT. Code, weights, and dataset generation pipeline are all permissively licensed. You can ship Needle in a commercial product without attribution. The training data was synthesised via Gemini, so check your Gemini terms of service if you plan to regenerate similar datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does it cost to retrain from scratch?
&lt;/h3&gt;

&lt;p&gt;200B tokens on 16× TPU v6e for 27 hours. At spot pricing, that's roughly $1,500–$3,000 to reproduce the pretraining run. Post-training (the part you'd actually customise) is 2B tokens in 45 minutes on the same setup — under $100. Finetuning on your own ~120 examples is free on a Mac.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Needle is the first model I've seen that takes the "tool calling is a structured prediction problem, not a reasoning problem" thesis seriously and ships a working artifact.&lt;/strong&gt; It won't replace your main LLM, and it isn't supposed to. What it does is collapse the cost and latency of the most repetitive 80% of agentic work down to something you can run on a watch.&lt;/p&gt;

&lt;p&gt;The architectural bet — no FFN, pure attention, encoder-decoder — is genuinely novel at this scale and worth studying even if you never deploy the model. The MIT license, the one-command playground, and the 10-minute finetune loop make it trivially easy to try.&lt;/p&gt;

&lt;p&gt;If you're building agents in 2026, Needle deserves an afternoon of your time. Even if it doesn't fit your current stack, the pattern — &lt;strong&gt;tiny specialist models as compiler passes for big generalist models&lt;/strong&gt; — is almost certainly what production agent stacks look like in two years.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/cactus-compute/needle" rel="noopener noreferrer"&gt;cactus-compute/needle&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weights:&lt;/strong&gt; &lt;a href="https://huggingface.co/Cactus-Compute/needle" rel="noopener noreferrer"&gt;Cactus-Compute/needle&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture write-up:&lt;/strong&gt; &lt;a href="https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md" rel="noopener noreferrer"&gt;Simple Attention Networks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime:&lt;/strong&gt; &lt;a href="https://github.com/cactus-compute/cactus" rel="noopener noreferrer"&gt;Cactus&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>needle</category>
      <category>cactus</category>
      <category>functioncalling</category>
      <category>tooluse</category>
    </item>
  </channel>
</rss>
