<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nilofer 🚀</title>
    <description>The latest articles on DEV Community by Nilofer 🚀 (@nilofer_tweets).</description>
    <link>https://dev.to/nilofer_tweets</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1137273%2Fac10d3a1-21d6-46e3-90d6-889213a616bd.jpg</url>
      <title>DEV Community: Nilofer 🚀</title>
      <link>https://dev.to/nilofer_tweets</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nilofer_tweets"/>
    <language>en</language>
    <item>
      <title>RAG Pipeline Stress Tester: Battle-Test Your RAG System Before It Reaches Production</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 12 May 2026 11:45:30 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/rag-pipeline-stress-tester-battle-test-your-rag-system-before-it-reaches-production-397c</link>
      <guid>https://dev.to/nilofer_tweets/rag-pipeline-stress-tester-battle-test-your-rag-system-before-it-reaches-production-397c</guid>
      <description>&lt;p&gt;Most RAG systems get tested with a handful of happy-path questions. Someone asks "what is machine learning?", gets a reasonable answer, and calls it done. Then it goes to production and users find the edge cases, hallucinations on out-of-scope questions, failed refusals on adversarial prompts, latency that collapses under real concurrent load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG Pipeline Stress Tester&lt;/strong&gt; is a battle-testing toolkit that finds these issues before deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Takes any HTTP RAG endpoint and hammers it with 7 categories of adversarial queries under configurable concurrent load.&lt;/li&gt;
&lt;li&gt;Tracks relevance, hallucination, refusal quality, and latency for every query sent.&lt;/li&gt;
&lt;li&gt;Scores everything into a composite health score from 0 to 100.&lt;/li&gt;
&lt;li&gt;Breaks results down by query category so you know exactly which failure modes are causing issues.&lt;/li&gt;
&lt;li&gt;Measures p50, p95, and p99 latency under realistic concurrent load, not just single-request response times.&lt;/li&gt;
&lt;li&gt;Produces an HTML report with interactive charts and a JSON report for CI/CD integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28iyjk2nc9t6w3r1h1tq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28iyjk2nc9t6w3r1h1tq.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Exists
&lt;/h2&gt;

&lt;p&gt;Before deploying a RAG system to production, four questions need answers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does it hallucinate when asked about things not in the corpus?&lt;/li&gt;
&lt;li&gt;Does it refuse appropriately on out-of-scope questions?&lt;/li&gt;
&lt;li&gt;Does it stay consistent when the same question is asked multiple ways?&lt;/li&gt;
&lt;li&gt;Does it hold up under load - 10, 25, 50 concurrent users?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Manual testing cannot answer these questions at scale. This tool does it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without stress testing&lt;/strong&gt; - hallucinations get discovered in production, users find edge cases first, latency under load is guesswork, and there is no audit trail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With this tool&lt;/strong&gt; - hallucinations are caught before deployment, you find edge cases in batch, p50/p95/p99 latency is measured at realistic concurrency, and every test run produces a timestamped JSON and HTML report.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 7 Query Categories
&lt;/h2&gt;

&lt;p&gt;The tool ships with 7 pre-built adversarial query banks, each targeting a specific failure mode:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;out_of_scope&lt;/code&gt; - Questions with no answer in the corpus, tests hallucination resistance&lt;br&gt;
&lt;code&gt;adversarial&lt;/code&gt; - Prompt injection and jailbreak attempts, tests instruction-following safety&lt;br&gt;
&lt;code&gt;ambiguous&lt;/code&gt; - Queries with multiple valid interpretations, tests disambiguation&lt;br&gt;
&lt;code&gt;multilingual&lt;/code&gt; - Non-English queries, tests language handling&lt;br&gt;
&lt;code&gt;temporal&lt;/code&gt; - Time-sensitive questions that depend on stale data&lt;br&gt;
&lt;code&gt;negation&lt;/code&gt; - "What is NOT X" style questions, a common failure mode&lt;br&gt;
&lt;code&gt;compound&lt;/code&gt; - Multi-part questions requiring multiple retrievals&lt;/p&gt;

&lt;p&gt;You can add your own queries by appending lines to any file in &lt;code&gt;query_bank/&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Health Score
&lt;/h2&gt;

&lt;p&gt;Every test run produces a composite Health Score from 0 to 100:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;≥ 80  EXCELLENT   Production-ready
≥ 60  GOOD        Minor issues, review before deploying
≥ 40  FAIR        Significant issues, fix first
 &amp;lt; 40  POOR        Critical failures, do not deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calculated from five weighted components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvislnqf6am0fb8i2m3yi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvislnqf6am0fb8i2m3yi.png" alt=" " width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;
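
&lt;p&gt;As an illustrative sketch of how such a composite works (the weights below are assumptions for illustration, not the tool's actual values), the score is a weighted sum of normalized component scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of a weighted composite mirroring the five metrics in
# the summary output. The weights are assumptions, not the tool's real values.
WEIGHTS = {
    "precision": 0.25,
    "hallucination_resistance": 0.25,
    "refusal_quality": 0.20,
    "consistency": 0.15,
    "latency": 0.15,
}

def health_score(components):
    # components: metric name mapped to a value normalized to the 0..1 range
    return 100 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;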

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main.py             Typer CLI — entry point and orchestration
adversarial.py      Query generator — 7 categories, pre-built + corpus-generated
loader.py           Async load driver — aiohttp, configurable concurrency
evaluator.py        Scorer — hallucination, precision, refusal, consistency
reporter.py         Report generator — HTML (Chart.js) + JSON output
corpus_analyzer.py  Optional: generate targeted queries from your own documents
query_bank/         7 pre-built adversarial query files (one per line)
tests/              58 pytest tests (no live endpoint needed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The endpoint the tester sends requests to must accept POST with &lt;code&gt;{"query": "..."}&lt;/code&gt; and return JSON containing either a &lt;code&gt;response&lt;/code&gt; or &lt;code&gt;answer&lt;/code&gt; field. Any HTTP status other than 200 is counted as an error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running a Stress Test
&lt;/h2&gt;

&lt;p&gt;The core command runs a full stress test against your RAG endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Basic — 10 concurrent users, 60-second run&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--duration&lt;/span&gt; 60

&lt;span class="c"&gt;# Test only specific query categories&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query-types&lt;/span&gt; out_of_scope,adversarial,multilingual

&lt;span class="c"&gt;# Custom output directory&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./my-reports
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what real terminal output from a run looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🚀 Starting RAG Stress Test
   Endpoint: http://localhost:8000/query
   Concurrency: 5
   Duration: 20s

📊 Generating test queries...
   Generated 350 test queries

⚡ Running load tests...
📈 Evaluating results...
📝 Generating reports...

✅ Stress test complete!
   JSON Report: reports/stress_test_results.json
   HTML Report: reports/stress_test_report.html

=======================================================
  Overall Health Score : 57.1/100
  Status               : FAIR - Significant issues detected
  Total requests       : 6355
  Error rate           : 0.0%
  Precision score      : 2.1%
  Hallucination rate   : 22.5%
  Refusal rate         : 77.5%
  Consistency score    : 72.1%
  Latency p50/p95/p99  : 2.9 / 6.3 / 8.7 ms

  Query Type          Count   Halluc%   Refusal%    AvgLat
  ------------------ ------  --------  ---------  --------
  adversarial           205     35.1%      64.9%      3.3ms
  ambiguous             250     12.0%      88.0%      3.2ms
  compound              200     22.0%      78.0%      4.0ms
  multilingual          250     10.0%      90.0%      3.1ms
  negation              200     20.0%      80.0%      5.3ms
  out_of_scope          250     20.0%      80.0%      4.0ms
  temporal              200     38.0%      62.0%      3.1ms

  Recommendations:
    - Low precision score. Enhance retrieval mechanism and relevance ranking.
    - Moderate: Several areas need improvement for production readiness.
=======================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Sanity Check
&lt;/h2&gt;

&lt;p&gt;For a fast check before a full run, &lt;code&gt;quick-test&lt;/code&gt; runs 35 sample queries - 5 per category - and prints the health score without writing any report files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py quick-test &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔍 Running quick sanity test...
   Testing with 35 sample queries

🎯 Quick Test Health Score: 72.4/100
   ✅ Endpoint appears functional
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Generate Queries From Your Own Corpus
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;analyze-corpus&lt;/code&gt; command analyzes your own &lt;code&gt;.txt&lt;/code&gt;, &lt;code&gt;.md&lt;/code&gt;, or &lt;code&gt;.json&lt;/code&gt; files, extracts domain keywords, and produces targeted in-scope, out-of-scope, and adversarial query files you can drop into &lt;code&gt;query_bank/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py analyze-corpus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--corpus&lt;/span&gt; ./my-docs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./query_bank &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-queries&lt;/span&gt; 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📚 Analyzing corpus: ./my-docs
   Generated 50 in_scope queries → query_bank/in_scope_generated.txt
   Generated 50 out_of_scope queries → query_bank/out_of_scope_generated.txt
   Generated 50 adversarial queries → query_bank/adversarial_generated.txt

✅ Corpus analysis complete!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For very small corpora, lower the keyword frequency threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py analyze-corpus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--corpus&lt;/span&gt; ./my-docs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./query_bank &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-queries&lt;/span&gt; 20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-word-freq&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Edit &lt;code&gt;config.yaml&lt;/code&gt; to customise load levels, thresholds, and reporting. The &lt;code&gt;--endpoint&lt;/code&gt; CLI flag always takes precedence over &lt;code&gt;config.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;load.concurrency_levels&lt;/code&gt; - Concurrent user levels to test, for example &lt;code&gt;[1, 5, 10, 25]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.ramp_mode&lt;/code&gt; - If true, steps through each concurrency level; if false, runs at the first level for the full duration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.duration_seconds&lt;/code&gt; - How long to run at each concurrency level&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.rate_limit_per_second&lt;/code&gt; - Maximum requests per second&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evaluation.hallucination_threshold&lt;/code&gt; - Keyword-overlap score below which a response is flagged as a potential hallucination&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evaluation.refusal_keywords&lt;/code&gt; - Phrases that indicate a refused answer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reporter.output_dir&lt;/code&gt; - Where to save HTML and JSON reports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pass the config file with &lt;code&gt;--config&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Output Reports
&lt;/h2&gt;

&lt;p&gt;Each test run saves two files to &lt;code&gt;./reports/&lt;/code&gt; or your &lt;code&gt;--output&lt;/code&gt; path:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;stress_test_results.json&lt;/strong&gt; - Machine-readable raw data with per-query latency, success and failure flags, hallucination scores, and a per-type breakdown. Useful for CI/CD integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;stress_test_report.html&lt;/strong&gt; - Interactive dashboard with a health score badge coloured by band, metric cards covering success rate, precision, hallucination, latency p95 and consistency, a bar chart of success rate by query type, a grouped bar chart of hallucination and refusal rate by query type, a latency distribution histogram, and prioritised recommendations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Endpoint Requirements
&lt;/h2&gt;

&lt;p&gt;The tester sends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/your-endpoint&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is machine learning?"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It expects a JSON response containing either a &lt;code&gt;response&lt;/code&gt; or &lt;code&gt;answer&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Machine learning is..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any HTTP status other than 200 is counted as an error.&lt;/p&gt;
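
&lt;p&gt;If you need a quick target to point the tester at, a minimal compatible endpoint is a few lines of FastAPI - a sketch, with the handler body as a placeholder for your own retrieval and generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of an endpoint the tester can target. Assumes FastAPI;
# replace the placeholder answer with your actual retrieval and generation.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    query: str

@app.post("/query")
def answer(q: Query):
    # Retrieval + generation would go here.
    return {"response": f"Placeholder answer for: {q.query}"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;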

&lt;h2&gt;
  
  
  Running Tests
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;58 tests covering all modules. Uses &lt;code&gt;aioresponses&lt;/code&gt; to mock HTTP - no live RAG endpoint required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag-pipeline-stress-tester/
├── main.py             # CLI entry point
├── adversarial.py      # Query generators (7 types)
├── loader.py           # Async load test driver
├── evaluator.py        # Scoring and metrics
├── reporter.py         # HTML + JSON report generator
├── corpus_analyzer.py  # Optional corpus-based query generation
├── config.yaml         # Test configuration
├── requirements.txt
├── query_bank/         # 7 pre-built adversarial query files
└── tests/              # 58 pytest tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a toolkit that could stress test any RAG endpoint automatically, not just for latency but for hallucination, refusal quality, and consistency under concurrent load. The tool needed to work against any endpoint with a standard request format, produce structured reports for CI/CD integration, and ship with pre-built adversarial query banks covering the failure modes that matter most before a RAG deployment.&lt;/p&gt;

&lt;p&gt;NEO built the full implementation: the Typer CLI with all three commands, the async load driver backed by aiohttp, the query generator covering all 7 adversarial categories, the hallucination and precision scorer, the composite health score calculator with five weighted components, the HTML report generator with Chart.js charts, the JSON reporter, the corpus analyzer for generating domain-specific queries, and the full test suite of 58 tests with HTTP mocked via aioresponses.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a pre-deployment gate for every RAG system.&lt;/strong&gt;&lt;br&gt;
Before any RAG endpoint goes to production, run a stress test against it. The health score gives you a single number: below 60 means review before deploying; below 40 means do not deploy. The per-category breakdown tells you exactly which failure modes are causing the score to drop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it with your own domain queries.&lt;/strong&gt;&lt;br&gt;
The pre-built query banks are general purpose. For domain-specific testing, run &lt;code&gt;analyze-corpus&lt;/code&gt; on your own documents to generate in-scope, out-of-scope, and adversarial queries targeted at your actual corpus, then drop them into &lt;code&gt;query_bank/&lt;/code&gt; and run the stress test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrate the JSON report into CI/CD.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;stress_test_results.json&lt;/code&gt; is machine-readable and contains per-query latency, hallucination scores, and the health score. A CI step that reads the health score and fails the pipeline below a threshold turns RAG quality into an automated deployment gate.&lt;/p&gt;
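
&lt;p&gt;A gate like that is a short script. This sketch assumes the composite score is stored under a &lt;code&gt;health_score&lt;/code&gt; key - verify the field name against your own generated report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a CI gate: fail the pipeline when the health score is too low.
# Assumes the score lives under a "health_score" key; verify the field name
# against your own stress_test_results.json.
import json
import sys

with open("reports/stress_test_results.json") as f:
    report = json.load(f)

score = report["health_score"]
print(f"RAG health score: {score:.1f}/100")
if score &amp;lt; 60:
    sys.exit(1)  # block the deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;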

&lt;p&gt;&lt;strong&gt;Extend it with additional query categories.&lt;/strong&gt;&lt;br&gt;
The 7 query banks are plain text files in &lt;code&gt;query_bank/&lt;/code&gt;, one query per line. Adding a new category for a specific failure mode your RAG system faces means adding a new file to &lt;code&gt;query_bank/&lt;/code&gt; and registering it in &lt;code&gt;adversarial.py&lt;/code&gt;.&lt;/p&gt;
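
&lt;p&gt;The registration step depends on how &lt;code&gt;adversarial.py&lt;/code&gt; is structured internally; as a purely hypothetical sketch, assuming the generator keeps a category-to-file mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch; the real adversarial.py may organize this differently.
# Assumes a module-level mapping from category name to query bank file.
QUERY_BANKS = {
    "out_of_scope": "query_bank/out_of_scope.txt",
    "adversarial": "query_bank/adversarial.txt",
    "pii_probing": "query_bank/pii_probing.txt",  # the new category
}

def load_queries(category):
    with open(QUERY_BANKS[category], encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;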

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;RAG systems fail in predictable ways: hallucination on out-of-scope questions, collapsed latency under load, inconsistent refusals. RAG Pipeline Stress Tester surfaces all of these before production, with a structured health score, per-category metrics, and reports that fit directly into a CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/RAG-pipeline-stress-tester" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/RAG-pipeline-stress-tester&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Orbis: Turn Any GitHub Repository Into an Interactive 3D Dependency Graph</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 09 May 2026 10:58:10 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/orbis-turn-any-github-repository-into-an-interactive-3d-dependency-graph-3eei</link>
      <guid>https://dev.to/nilofer_tweets/orbis-turn-any-github-repository-into-an-interactive-3d-dependency-graph-3eei</guid>
      <description>&lt;p&gt;Understanding a large codebase is hard. You clone it, start reading files, and quickly lose track of how everything connects. Which modules are most depended on? Where are the circular dependencies? What would break if you refactored this file?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orbis&lt;/strong&gt; answers these questions visually. Paste a GitHub repository URL, and Orbis clones it, parses the ASTs across Python, JavaScript, TypeScript, Go, Rust, and Java, detects architectural patterns, and renders the entire codebase as a navigable 3D force-directed graph. Click any module to inspect its dependencies, metrics, and exported symbols. Ask the built-in AI assistant questions like "which module should I refactor first?" and get answers grounded in the actual code structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;3D force-directed graph&lt;/strong&gt; - Nodes sized by lines of code, colored by type, with animated directional particles on edges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-language AST parsing&lt;/strong&gt; - Python, JavaScript/TypeScript, Go, Rust, and Java via tree-sitter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI chat assistant&lt;/strong&gt; - Ask Claude questions about the analyzed codebase. Questions like "Which modules have circular dependencies?" or "Where should I add feature X?" are answered with full architectural context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural insights&lt;/strong&gt; - Auto-detected issues including god modules, high coupling, and circular dependencies, each with severity ratings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focus Mode&lt;/strong&gt; - Dim unconnected nodes to trace dependency paths clearly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shareable URLs&lt;/strong&gt; - &lt;code&gt;?repo=https://github.com/...&lt;/code&gt; auto-triggers analysis on load, making it easy to share a specific codebase view.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recent history&lt;/strong&gt; - Last 5 repos stored locally for quick re-analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo mode&lt;/strong&gt; - Load a pre-analyzed snapshot without a GitHub clone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Backend: FastAPI + Server-Sent Events (SSE)&lt;/li&gt;
&lt;li&gt;AST Parsing: tree-sitter (Python, JS/TS, Go, Rust, Java)&lt;/li&gt;
&lt;li&gt;AI Integration: Claude Opus 4.6 via Anthropic API&lt;/li&gt;
&lt;li&gt;3D Rendering: 3d-force-graph + Three.js&lt;/li&gt;
&lt;li&gt;Frontend: Vanilla JS SPA - no build step&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone and install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;orbis
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate   &lt;span class="c"&gt;# Windows: venv\Scripts\activate&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Set up environment&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env and add your ANTHROPIC_API_KEY for the AI chat feature&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get an API key at console.anthropic.com. The AI chat feature requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; in your environment. It degrades gracefully: if the key is missing, the chat panel shows an error message rather than breaking the rest of the app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Run&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8001&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; orbis &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8001:8001 &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-... orbis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;Once running, the workflow is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter a public GitHub repository URL - for example &lt;code&gt;https://github.com/expressjs/express&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Optionally specify a branch&lt;/li&gt;
&lt;li&gt;Click Analyze - Orbis clones the repo, parses ASTs, and builds the graph in roughly 5–30 seconds&lt;/li&gt;
&lt;li&gt;Explore the 3D graph - click a node to open its detail drawer, scroll to zoom, drag to rotate&lt;/li&gt;
&lt;li&gt;Use Focus Mode to highlight a node's direct connections&lt;/li&gt;
&lt;li&gt;Use layer filter chips to show or hide architectural layers&lt;/li&gt;
&lt;li&gt;Ask the AI assistant questions about the codebase in the chat panel&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Keyboard Shortcuts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;R: Reset camera&lt;/li&gt;
&lt;li&gt;P: Pause/resume rotation&lt;/li&gt;
&lt;li&gt;F: Toggle Focus Mode&lt;/li&gt;
&lt;li&gt;/: Focus search box&lt;/li&gt;
&lt;li&gt;Esc: Close detail drawer / exit Focus Mode&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The project has four files at its core - a FastAPI backend, a single-file AST parser, a vanilla JS frontend with no build step, and a demo-data utility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main.py           FastAPI backend — SSE streaming for /analyze, /chat
neo_parser.py     Multi-language AST parser (tree-sitter)
static/
  index.html      Single-page frontend (3d-force-graph + Three.js)
save_analysis.py  Utility: pre-generate demo data from a repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend streams analysis progress to the frontend via Server-Sent Events while cloning and analyzing the repo.&lt;/p&gt;
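
&lt;p&gt;In FastAPI, SSE progress streaming follows a simple pattern - a sketch, not Orbis's actual handler, which interleaves events like these with the real clone and parse work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of SSE progress streaming with FastAPI. Each event is a
# "data: ..." line followed by a blank line, per the SSE wire format.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/analyze")
def analyze(repo: str):
    def events():
        for step in ("cloning", "parsing", "building graph"):
            yield f"data: {json.dumps({'status': step})}\n\n"
        yield f"data: {json.dumps({'event': 'complete'})}\n\n"
    return StreamingResponse(events(), media_type="text/event-stream")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;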

&lt;h2&gt;
  
  
  API Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiky04nqoykgwfknmlsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiky04nqoykgwfknmlsm.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Output Schema
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/analyze&lt;/code&gt; emits SSE events and completes with a &lt;code&gt;complete&lt;/code&gt; event containing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schema_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"architecture_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MVC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"languages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Codebase contains 42 modules..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"auth.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"utility"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lines_of_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;315&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"complexity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"exported_symbols"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"AuthBase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HTTPBasicAuth"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"internal_dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"requests/compat"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"external_dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"functions_total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"classes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"edges"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"import"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"insights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high_coupling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"High fan-in on requests/models"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"14 modules import this file directly."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"affected_nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"requests/models"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"recommendation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Consider splitting into smaller focused modules."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node carries its lines of code, complexity rating, exported symbols, and both internal and external dependencies. The insights block surfaces architectural issues automatically - high coupling, circular dependencies, and god modules - each with a severity rating and a specific recommendation.&lt;/p&gt;
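
&lt;p&gt;Because the schema is stable, downstream tooling can consume it directly. A sketch that pulls out only the high-severity insights (the file name is an assumption for a saved complete-event payload):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: filter high-severity insights from a saved analysis payload.
# "analysis.json" is an assumed file name for the complete-event JSON.
import json

with open("analysis.json") as f:
    analysis = json.load(f)

for insight in analysis["insights"]:
    if insight["severity"] == "high":
        print(f"[{insight['type']}] {insight['title']}: {insight['recommendation']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;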

&lt;h2&gt;
  
  
  Supported Languages
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python - &lt;code&gt;.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;JavaScript/TypeScript - &lt;code&gt;.js&lt;/code&gt;, &lt;code&gt;.mjs&lt;/code&gt;, &lt;code&gt;.cjs&lt;/code&gt;, &lt;code&gt;.jsx&lt;/code&gt;, &lt;code&gt;.ts&lt;/code&gt;, &lt;code&gt;.tsx&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Go - &lt;code&gt;.go&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Rust - &lt;code&gt;.rs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Java - &lt;code&gt;.java&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI Chat
&lt;/h2&gt;

&lt;p&gt;The chat assistant uses Claude Opus 4.6 and receives the full architectural graph as context - node list, dependencies, insights, and summary. It can answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What does the auth module depend on?"&lt;/li&gt;
&lt;li&gt;"Why are there circular dependencies between X and Y?"&lt;/li&gt;
&lt;li&gt;"Which module should I refactor first?"&lt;/li&gt;
&lt;li&gt;"Where would I add a caching layer?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The assistant's answers are grounded in the actual parsed structure of the codebase - not generic advice. Requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; in your environment.&lt;/p&gt;
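
&lt;p&gt;The grounding pattern is to serialize the graph into the prompt. A sketch using the &lt;code&gt;anthropic&lt;/code&gt; Python SDK - the model id string here is a placeholder, not necessarily what Orbis ships with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of graph-grounded chat using the anthropic Python SDK. The model
# id below is a placeholder; use whichever Claude model you have access to.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(analysis, question):
    message = client.messages.create(
        model="claude-opus-4-6",  # placeholder model id
        max_tokens=1024,
        system="Answer using only the architectural graph provided.",
        messages=[{
            "role": "user",
            "content": f"Graph:\n{json.dumps(analysis)}\n\nQuestion: {question}",
        }],
    )
    return message.content[0].text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;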

&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run with auto-reload&lt;/span&gt;
uvicorn main:app &lt;span class="nt"&gt;--reload&lt;/span&gt; &lt;span class="nt"&gt;--port&lt;/span&gt; 8001

&lt;span class="c"&gt;# Re-generate demo data&lt;/span&gt;
python save_analysis.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The idea was a tool that turns any GitHub repository into an interactive 3D graph, something a developer could paste a URL into and immediately understand the architecture without reading a single file. The requirements included multi-language AST parsing, automatic architectural issue detection, an AI assistant grounded in the actual code structure, and a frontend that required no build step.&lt;/p&gt;

&lt;p&gt;NEO built the full stack from that description: the FastAPI backend with SSE streaming for real-time analysis progress, the multi-language AST parser in &lt;code&gt;neo_parser.py&lt;/code&gt; covering Python, JavaScript, TypeScript, Go, Rust, and Java via tree-sitter, the 3D force-directed graph frontend in vanilla JS, the Claude Opus 4.6 chat assistant with full architectural context, the insights engine detecting god modules, high coupling, and circular dependencies with severity ratings, and the demo mode with pre-generated analysis data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to onboard onto an unfamiliar codebase.&lt;/strong&gt;&lt;br&gt;
Instead of spending hours reading files to understand how a project is structured, paste the repo URL into Orbis and get an immediate visual map of every module, its dependencies, and the architectural issues that already exist. The AI assistant can then answer specific questions about the structure without you having to trace imports manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it during code review to understand structural impact.&lt;/strong&gt;&lt;br&gt;
When reviewing a large pull request, run Orbis on the repo and use the insights panel to see whether high coupling, circular dependencies, or god modules exist in the areas being changed. The AI assistant can answer specific questions about how the affected modules connect to the rest of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to plan a refactor.&lt;/strong&gt;&lt;br&gt;
Ask the AI assistant "which module should I refactor first?" or "where would I add a caching layer?" and get answers grounded in the actual dependency graph. The focus mode lets you isolate a specific module and trace exactly what depends on it before touching anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional language parsers.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;neo_parser.py&lt;/code&gt; already handles five languages via tree-sitter. Adding a new language - Ruby, C++, Swift - follows the same parser pattern and surfaces automatically in the language filter chips and the supported languages list without touching the frontend or the API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Orbis makes codebase architecture something you can see and navigate rather than something you have to reconstruct in your head. A 3D dependency graph, multi-language AST parsing, automatic architectural issue detection, and an AI assistant that knows the actual structure - all from a single repo URL.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Orbit-dependency-visualised" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Orbit-dependency-visualised&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devtools</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>SmolVLM2 Edge Vision Agent: Visual Monitoring Without a GPU or Cloud API</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Thu, 07 May 2026 11:43:31 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/smolvlm2-edge-vision-agent-visual-monitoring-without-a-gpu-or-cloud-api-2afp</link>
      <guid>https://dev.to/nilofer_tweets/smolvlm2-edge-vision-agent-visual-monitoring-without-a-gpu-or-cloud-api-2afp</guid>
      <description>&lt;p&gt;Running vision AI locally has always had a catch, you need a GPU, or you need to send frames to a cloud API and pay per call. SmolVLM2-2.2B changes that. It is a 2.2B-parameter multimodal model specifically designed for CPU inference, and this agent is built around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SmolVLM2 Edge Vision Agent&lt;/strong&gt; is a fully offline edge vision agent that ingests a live webcam feed or an image folder, detects motion using frame-difference analysis, triggers VLM analysis only on scene changes, and persists structured observations to a local SQLite database with a FastAPI web dashboard for review. No API costs. No network calls after the first model download. 16GB RAM, no GPU required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The agent does five things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingests a live webcam feed or an image folder as input&lt;/li&gt;
&lt;li&gt;Performs continuous visual monitoring: frame-difference-based motion detection triggers VLM analysis only on scene changes&lt;/li&gt;
&lt;li&gt;Describes new objects, reads text from images (receipts, whiteboards, signs), and logs everything as structured observations&lt;/li&gt;
&lt;li&gt;Persists observations to a local SQLite database with timestamps, thumbnails, descriptions, and confidence scores&lt;/li&gt;
&lt;li&gt;Exposes a FastAPI web dashboard with live feed, latest observations, and a searchable log&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It runs entirely offline. The model auto-downloads on first run and is cached locally from that point forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt; home security camera analysis, document digitization pipelines, accessibility tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The key design decision is the motion gate. Running a 2.2B-parameter model on every frame would be unusable on CPU hardware, where inference is far from instant. The agent solves this by running frame-difference motion detection on every frame first, and only invoking the VLM when a scene change is detected above the configured threshold.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfmzuo5rq01ymye9ocj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfmzuo5rq01ymye9ocj7.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-frame timeline:&lt;/strong&gt;&lt;br&gt;
Every frame goes through motion detection first. If the frame difference is below the threshold, the frame is dropped with no further processing. If motion is detected, the VLM runs, produces a description, and the observation is stored in SQLite with a thumbnail. This design means expensive model inference only happens when something actually changes in the scene, keeping a Pi-class CPU usable while still describing every meaningful scene change.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;FRAME_DIFF_THRESHOLD&lt;/code&gt; defaults to 0.15 and controls how sensitive the motion detector is. A higher value means less sensitivity, minor lighting changes or small movements are ignored. A lower value triggers the VLM more frequently.&lt;/p&gt;
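
&lt;p&gt;The gate itself is a few lines of OpenCV. A sketch of the idea - the actual &lt;code&gt;MotionDetector&lt;/code&gt; in &lt;code&gt;src/agent.py&lt;/code&gt; may normalize the difference differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of frame-difference motion gating with OpenCV.
import cv2
import numpy as np

def scene_changed(prev_frame, frame, threshold=0.15):
    prev = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev, curr)
    score = float(np.mean(diff)) / 255.0  # normalized difference, 0 to 1
    return score &amp;gt; threshold  # above threshold: run the VLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;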

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbhclbik1cc2vpheuh6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbhclbik1cc2vpheuh6k.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python:&lt;/strong&gt; 3.11 or newer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 16GB minimum for the real model; less is fine in &lt;code&gt;--mock&lt;/code&gt; mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk:&lt;/strong&gt; ~5GB free for the model cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Linux, macOS, or WSL2 on Windows - the agent uses OpenCV, and webcam access requires native camera support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No GPU required&lt;/strong&gt; - SmolVLM2-2.2B is designed for CPU inference.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/smolvlm2-edge-agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;smolvlm2-edge-agent
make &lt;span class="nb"&gt;install&lt;/span&gt;                                  &lt;span class="c"&gt;# pip install -e .&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env                          &lt;span class="c"&gt;# then edit values as needed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;make install&lt;/code&gt; command runs &lt;code&gt;pip install -e .&lt;/code&gt;, which installs the package and its pinned runtime dependencies from &lt;code&gt;requirements.txt&lt;/code&gt;. The &lt;code&gt;.env.example&lt;/code&gt; file contains all documented environment variables; copy it to &lt;code&gt;.env&lt;/code&gt; and edit the values you want to override before running.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Every tunable is configurable via CLI flags and environment variables. CLI flags take precedence over environment variables. All variables are documented in &lt;code&gt;.env.example&lt;/code&gt; in the &lt;a href="https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MODEL_NAME&lt;/code&gt; - HuggingFace model id, default: &lt;code&gt;HuggingFaceTB/SmolVLM2-2.2B-Instruct&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;USE_MOCK_MODE&lt;/code&gt; - bypass model loading with deterministic stub responses, default: &lt;code&gt;false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MODEL_CACHE_DIR&lt;/code&gt; - where the HuggingFace model is cached on disk, default: &lt;code&gt;./models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DB_PATH&lt;/code&gt; - SQLite database file path, default: &lt;code&gt;./data/observations.db&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FRAME_DIFF_THRESHOLD&lt;/code&gt; - motion sensitivity on a 0–1 scale, higher means less sensitive, default: &lt;code&gt;0.15&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MIN_CONFIDENCE&lt;/code&gt; - minimum VLM confidence required to log an observation, default: &lt;code&gt;0.5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PROCESSING_INTERVAL&lt;/code&gt; - seconds between frame samples, default: &lt;code&gt;1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MAX_OBSERVATIONS&lt;/code&gt; - cap on stored rows, older observations are pruned, default: &lt;code&gt;10000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DASHBOARD_HOST&lt;/code&gt; - FastAPI bind host, default: &lt;code&gt;0.0.0.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DASHBOARD_PORT&lt;/code&gt; - FastAPI port, default: &lt;code&gt;8080&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;INPUT_SOURCE&lt;/code&gt; - camera index or path to image folder, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OUTPUT_DIR&lt;/code&gt; - where observation artifacts are written, default: &lt;code&gt;./data/observations/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;THUMBNAIL_DIR&lt;/code&gt; - where frame thumbnails are saved, default: &lt;code&gt;./data/thumbnails/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LOG_LEVEL&lt;/code&gt; - Python logging level, default: &lt;code&gt;INFO&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LOG_FILE&lt;/code&gt; - optional log file path, default: &lt;code&gt;./data/agent.log&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;MIN_CONFIDENCE&lt;/code&gt; is worth paying attention to — observations where the VLM's confidence falls below 0.5 are not stored. Raising this filters out uncertain detections. Lowering it logs more, including lower-confidence observations.&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Quick start - mock mode, no model download&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fastest way to verify the full pipeline is mock mode. It bypasses model loading entirely and uses deterministic stub responses, so you can confirm the agent loop, database writes, thumbnail generation, and dashboard all work before committing to the 5GB model download:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; data/test_images
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--mock&lt;/span&gt; &lt;span class="nt"&gt;--input&lt;/span&gt; ./data/test_images &lt;span class="nt"&gt;--duration&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs the agent for 30 seconds against the &lt;code&gt;data/test_images/&lt;/code&gt; folder using the mock VLM, populates &lt;code&gt;data/observations.db&lt;/code&gt;, and writes thumbnails to &lt;code&gt;data/thumbnails/&lt;/code&gt;.&lt;/p&gt;
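
&lt;p&gt;If &lt;code&gt;data/test_images/&lt;/code&gt; is empty, the agent has nothing to iterate over. A quick way to drop in a few synthetic frames - a sketch assuming Pillow is installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: generate a few synthetic frames so mock mode has input to process.
# Assumes Pillow is installed; any images in data/test_images/ work equally.
import os
from PIL import Image

os.makedirs("data/test_images", exist_ok=True)
for i, color in enumerate([(255, 0, 0), (0, 255, 0), (0, 0, 255)]):
    Image.new("RGB", (640, 480), color).save(f"data/test_images/frame_{i}.png")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;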

&lt;p&gt;&lt;strong&gt;Run against a webcam&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--input&lt;/span&gt; 0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Camera index 0 is the default device. For additional cameras, use index 1, 2, and so on. Open &lt;code&gt;http://localhost:8080&lt;/code&gt; in a browser to see the live dashboard. The dashboard shows the live feed, the most recent observations, and a searchable log of everything the agent has recorded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run against an image folder&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--input&lt;/span&gt; ./images &lt;span class="nt"&gt;--interval&lt;/span&gt; 2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Iterates over &lt;code&gt;./images&lt;/code&gt; at 2-second intervals. Useful for batch processing a folder of scanned documents, receipts, or photos without a live camera feed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dashboard only in read mode&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--mode&lt;/span&gt; dashboard &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Serves the dashboard against an existing &lt;code&gt;data/observations.db&lt;/code&gt; without running the agent. Useful for reviewing historical observations without starting a new capture session.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Reference
&lt;/h2&gt;

&lt;p&gt;The FastAPI dashboard exposes six endpoints:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm66ohxfya82kfvtv4gx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm66ohxfya82kfvtv4gx.png" alt=" " width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/api/search&lt;/code&gt; endpoint runs full-text search over stored observation descriptions, useful for finding all observations that mention a specific object, person, or piece of text across the full history.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/api/observations&lt;/code&gt; endpoint is paginated with &lt;code&gt;limit&lt;/code&gt; and &lt;code&gt;offset&lt;/code&gt; parameters. The default returns the 50 most recent observations.&lt;/p&gt;
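&lt;p&gt;For a quick sanity check, both endpoints can be exercised from a few lines of Python. The paths and the &lt;code&gt;limit&lt;/code&gt;/&lt;code&gt;offset&lt;/code&gt; parameters come from the description above; the &lt;code&gt;q&lt;/code&gt; search parameter name is an assumption on my part, so confirm it against the dashboard's API docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8080"

def get(path, **params):
    # GET a dashboard endpoint and decode the JSON body
    url = BASE + path + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

# 10 most recent observations (the endpoint defaults to the 50 most recent)
recent = get("/api/observations", limit=10, offset=0)

# Full-text search over stored descriptions
# NOTE: the "q" parameter name is hypothetical - check the API docs
matches = get("/api/search", q="red car")

print(recent)
print(matches)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;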

&lt;h2&gt;
  
  
  Models Used
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l2t21i3rhps3jm4ra5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l2t21i3rhps3jm4ra5f.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SmolVLM2-2.2B is the default &lt;code&gt;--model&lt;/code&gt; argument and &lt;code&gt;MODEL_NAME&lt;/code&gt; env var; no other models are referenced in code, config, or docs. The model is downloaded from HuggingFace on first run and cached in &lt;code&gt;./models&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;The test suite covers all five modules - database, vision, agent, dashboard, and CLI - with the VLM fully mocked so no model download is needed to run tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;test&lt;/span&gt;                  &lt;span class="c"&gt;# python3 -m pytest tests/ -v
&lt;/span&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;lint&lt;/span&gt;                  &lt;span class="c"&gt;# ruff check src/ tests/ --fix
&lt;/span&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;typecheck&lt;/span&gt;             &lt;span class="c"&gt;# mypy src/ --ignore-missing-imports
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test coverage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tests/test_db.py&lt;/code&gt; - 10 tests covering SQLite schema, CRUD, and search&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_vision.py&lt;/code&gt; - 6 tests covering mock VLM and prompt rendering&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_agent.py&lt;/code&gt; - 9 tests covering motion detection and the agent loop&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_dashboard.py&lt;/code&gt; - 6 tests covering HTTP route handlers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_cli.py&lt;/code&gt; - 7 tests covering argparse and env-var loading&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total: 38 tests, all passing. No skipped tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;smolvlm2-edge-agent/
├── src/
│   ├── __init__.py
│   ├── __main__.py              # entry point for python -m src
│   ├── agent.py                 # MotionDetector + VisionAgent
│   ├── vision.py                # VisionEngine (SmolVLM2 wrapper, with MockVisionEngine)
│   ├── db.py                    # SQLite Database class
│   ├── dashboard.py             # FastAPI app factory + route handlers
│   └── cli.py                   # argparse + env loading
├── tests/                       # 38 pytest tests, VLM fully mocked
├── data/.gitkeep                # observations.db, thumbnails/, test_images/ land here
├── models/.gitkeep              # HF model cache
├── pyproject.toml               # ruff + mypy config + console_script
├── requirements.txt             # pinned runtime deps
├── Makefile                     # install, test, lint, typecheck, run, clean
├── .env.example                 # documented env vars
├── .gitignore
├── BUILD_NOTES.md               # build/verification trace
└── PUBLISH.md                   # exact GitHub push commands
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;src/&lt;/code&gt; directory maps cleanly to the agent's responsibilities - &lt;code&gt;agent.py&lt;/code&gt; handles the motion detection and VLM orchestration loop, &lt;code&gt;vision.py&lt;/code&gt; wraps the model with a mock-compatible interface, &lt;code&gt;db.py&lt;/code&gt; handles all SQLite operations, &lt;code&gt;dashboard.py&lt;/code&gt; is the FastAPI application, and &lt;code&gt;cli.py&lt;/code&gt; handles all argument parsing and environment variable loading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;PRs welcome. Before submitting, all three of the following must pass with zero errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make lint &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make typecheck &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The process started with an idea - a fully offline edge vision agent that runs on CPU-only hardware with no GPU and no cloud API calls. I put together a clear project description with the requirements, tech stack, and expected output, and handed it to NEO. From there NEO handled the full build autonomously: writing the code, running tests, fixing issues, and iterating until everything was working end to end. Once NEO completed the build, I did a manual review, tested it myself, and fed any improvements back - which NEO then implemented.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as an offline home security monitor:&lt;/strong&gt; Point it at a webcam, let it run, and review what it logged through the dashboard. Every scene change is stored with a timestamp, description, confidence score, and thumbnail - all locally, with no data leaving your machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it for document digitization pipelines:&lt;/strong&gt; Point &lt;code&gt;--input&lt;/code&gt; at a folder of scanned receipts, whiteboards, or handwritten notes. The VLM reads text from images and logs structured observations. The &lt;code&gt;/api/search&lt;/code&gt; endpoint lets you query what was found across the full document set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it as an accessibility tool:&lt;/strong&gt; Run it against a webcam feed to generate continuous natural language descriptions of what is visible in the environment - stored and searchable, entirely offline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional VLM backends:&lt;/strong&gt; &lt;code&gt;VisionEngine&lt;/code&gt; in &lt;code&gt;vision.py&lt;/code&gt; wraps SmolVLM2-2.2B with a clean interface that &lt;code&gt;MockVisionEngine&lt;/code&gt; also implements. Swapping in a different HuggingFace multimodal model means updating &lt;code&gt;vision.py&lt;/code&gt; - the agent, database, dashboard, and CLI stay entirely unchanged.&lt;/p&gt;
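&lt;p&gt;For a sense of what that swap involves, here is a minimal sketch of an alternative backend. The class and file names match the description above, but the method name and signature are assumptions - mirror whatever interface &lt;code&gt;MockVisionEngine&lt;/code&gt; actually implements in &lt;code&gt;vision.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical alternative backend for src/vision.py
from transformers import AutoModelForImageTextToText, AutoProcessor

class MyVisionEngine:
    """Drop-in alternative to VisionEngine, assuming the shared interface
    is a single describe(image, prompt) method returning a string."""

    def __init__(self, model_name="some-org/another-multimodal-model"):
        # model_name is a placeholder, not a real checkpoint
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = AutoModelForImageTextToText.from_pretrained(model_name)

    def describe(self, image, prompt):
        # image is assumed to be a PIL.Image, as with the default engine
        inputs = self.processor(text=prompt, images=image, return_tensors="pt")
        output_ids = self.model.generate(**inputs, max_new_tokens=128)
        return self.processor.batch_decode(output_ids, skip_special_tokens=True)[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;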

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;SmolVLM2 Edge Vision Agent shows that meaningful vision AI does not require a GPU or a cloud API: a 2.2B-parameter model, motion-gated inference, a local SQLite store, and a FastAPI dashboard, all running offline on commodity hardware.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Prompt Compression Benchmarker: Cut LLM Input Costs by 35–63% With Measurable Quality Tracking</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Wed, 06 May 2026 07:07:48 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/prompt-compression-benchmarker-cut-llm-input-costs-by-35-63-with-measurable-quality-tracking-12f3</link>
      <guid>https://dev.to/nilofer_tweets/prompt-compression-benchmarker-cut-llm-input-costs-by-35-63-with-measurable-quality-tracking-12f3</guid>
<description>&lt;p&gt;Most LLM cost comes from input tokens: the long documents, codebases, or conversation histories you send as context. Several prompt compression algorithms are available, but nobody tells you which one actually works best for your specific workload, or how much quality you are trading for the savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Compression Benchmarker (PCB)&lt;/strong&gt; answers both questions. It benchmarks every major prompt compression algorithm against your actual data, shows you exactly how much quality each one drops, projects the real dollar savings at your call volume, and then gives you a one-line wrapper to deploy the winner as a drop-in replacement around your Anthropic or OpenAI client.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;PCB answers two questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which compression algorithm preserves the most quality at a given token budget?&lt;/strong&gt; &lt;br&gt;
Benchmark mode runs all compressors against your data and scores each one with task-specific quality metrics and an optional LLM-as-judge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much money does that save at your actual call volume?&lt;/strong&gt; &lt;br&gt;
Cost projection mode takes your daily token volume and model pricing and gives you monthly and annual savings per compressor.&lt;/p&gt;

&lt;p&gt;Then it gives you a one-line wrapper to deploy the answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm265a0k98rmkebalhz97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm265a0k98rmkebalhz97.png" alt=" " width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From source
git clone https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker
cd Prompt-Compression-Benchmarker
pip install .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From PyPI (once published)
pip install prompt-compression-benchmarker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requires Python 3.9+. No GPU required. Core dependencies: &lt;code&gt;tiktoken&lt;/code&gt;, &lt;code&gt;scikit-learn&lt;/code&gt;, &lt;code&gt;rouge-score&lt;/code&gt;, &lt;code&gt;rank-bm25&lt;/code&gt;, &lt;code&gt;typer&lt;/code&gt;, &lt;code&gt;rich&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Verify
pcb --help

# Optional extras
pip install "prompt-compression-benchmarker[anthropic]"   # SDK wrapper for Anthropic
pip install "prompt-compression-benchmarker[openai]"      # SDK wrapper for OpenAI
pip install "prompt-compression-benchmarker[mcp]"         # MCP server for Claude Code
pip install "prompt-compression-benchmarker[all]"         # Everything
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Run the benchmark&lt;/strong&gt;&lt;br&gt;
The simplest run uses bundled sample data - no setup needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# All compressors × all task types, bundled sample data — no setup needed
pcb run

# Target a specific task with cost projection
pcb run --task rag --max-samples 20 --daily-tokens 2000000 --cost-model claude-sonnet-4-6

# Add LLM-as-judge for deeper quality scoring (requires OpenRouter API key)
export OPENROUTER_API_KEY=sk-or-...
pcb run --llm-judge --judge-model claude-sonnet-4-6 --max-samples 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what a real benchmark run looks like - RAG task, 3M tokens/day, claude-sonnet-4-6 pricing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --daily-tokens 3000000 --cost-model claude-sonnet-4-6
                          RAG
 Compressor          Token Reduc %  Proxy Score  Proxy Drop %   ms
 no_compression           0.0%        0.2983         0.0%      0.3
 tfidf ★                 40.1%        0.2519        +16.5%     12.1
 selective_context        56.9%        0.1874        +34.4%      8.3
 llmlingua                53.6%        0.2182        +28.1%      9.7
 llmlingua2               45.0%        0.2204        +27.3%     11.2

 Monthly Cost Projection  claude-sonnet-4-6 · $3/1M · 3M tokens/day
 tfidf             38.3% reduction   $103/mo saved   $1,240/yr
 selective_context 57.5% reduction   $155/mo saved   $1,863/yr
 llmlingua2        43.6% reduction   $118/mo saved   $1,413/yr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ★ marks the Pareto-optimal compressor - best token savings given a quality drop below 20%.&lt;/p&gt;
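&lt;p&gt;The selection rule is simple enough to restate in a few lines. This is a sketch of the rule as just described (best reduction among compressors whose quality drop stays under the 20% floor), not PCB's actual implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# (name, token_reduction_pct, quality_drop_pct) from the table above
results = [
    ("tfidf", 40.1, 16.5),
    ("selective_context", 56.9, 34.4),
    ("llmlingua", 53.6, 28.1),
    ("llmlingua2", 45.0, 27.3),
]

MAX_DROP = 20.0  # quality floor behind the star

eligible = [r for r in results if r[2] &amp;lt; MAX_DROP]
winner = max(eligible, key=lambda r: r[1]) if eligible else None
print(winner)  # ('tfidf', 40.1, 16.5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;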

&lt;p&gt;&lt;strong&gt;2. Compress a file directly&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Compress from a file or stdin, output to stdout
pcb compress context.txt --compressor llmlingua2 --rate 0.45 --stats

# Pipe it into any script
cat rag_context.txt | pcb compress | python send_to_claude.py

# Save compressed output
pcb compress context.txt -o compressed.txt --stats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Deploy the winner&lt;/strong&gt;&lt;br&gt;
Once you know which compressor wins on your data, deploying it is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingAnthropic

# Drop-in replacement for anthropic.Anthropic()
client = CompressingAnthropic(compressor="llmlingua2", rate=0.45)

response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)

print(client.stats)  # CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else in your codebase stays the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Benchmark table columns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5abgr0r8hnyzshkj5ltz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5abgr0r8hnyzshkj5ltz.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality drop color coding&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cyan   = negative drop (compression improved the metric — noise removal)
green  = &amp;lt; 5% drop    (effectively lossless)
yellow = 5–15% drop   (acceptable for most use cases)
red    = ≥ 15% drop   (significant information loss)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why use the LLM judge?
&lt;/h2&gt;

&lt;p&gt;The proxy score (F1, ROUGE, BM25) is fast and free but mechanical. The LLM judge calls a real model to evaluate whether the compressed context still supports the correct answer, and it reveals things proxy metrics miss.&lt;/p&gt;

&lt;p&gt;Here is a real example showing why this matters - RAG task, 5 samples, LLM judge = claude-sonnet-4-6:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compressor           Proxy Drop %   LLM Score   LLM Drop %
no_compression           0.0%         0.94         0.0%
tfidf                  +23.7%         0.40        -57.4%    ← proxy hid the severity
llmlingua2             +29.9%         0.70        -25.5%    ← much better than proxy suggested
selective_context      +37.6%         0.14        -85.1%    ← dangerous despite high compression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rule of thumb: use proxy scores to compare many configs quickly, then LLM-judge the top 2–3 before deploying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing a Compressor
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG:&lt;/strong&gt; &lt;code&gt;llmlingua2&lt;/code&gt; at rate 0.40 - preserves named entities and key facts better than sentence-dropping&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summarization:&lt;/strong&gt; &lt;code&gt;llmlingua&lt;/code&gt; at rate 0.45 - sentence-level pruning maintains structural coverage&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code contexts:&lt;/strong&gt; &lt;code&gt;llmlingua2&lt;/code&gt; at rate 0.35 - keeps imports, identifiers, type names; removes boilerplate&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;General chat:&lt;/strong&gt; &lt;code&gt;tfidf&lt;/code&gt; at rate 0.40 - safe default, fast, reliable&lt;/p&gt;

&lt;h2&gt;
  
  
  Target compression rate
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;--rate&lt;/code&gt; is the fraction of tokens to remove. 0.45 means keep 55% of tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5q2a5j4q65dt793zgm5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5q2a5j4q65dt793zgm5.png" alt=" " width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;
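&lt;p&gt;Because "rate" elsewhere sometimes means the fraction kept, it is worth pinning down PCB's convention with a concrete number (plain Python, just restating the definition above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# --rate is the fraction of tokens REMOVED
original_tokens = 1000
rate = 0.45

removed = int(original_tokens * rate)  # 450 tokens removed
kept = original_tokens - removed       # 550 tokens (55%) kept
print(removed, kept)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;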

&lt;h2&gt;
  
  
  Cost Savings - The Real Numbers
&lt;/h2&gt;

&lt;p&gt;Compression saves money on input tokens only. Output tokens are unchanged.&lt;br&gt;
At 3M input tokens per day:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthywdxzquiykzeazrosi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthywdxzquiykzeazrosi.png" alt=" " width="796" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compression is most valuable on premium models. On DeepSeek or GPT-4.1-mini, the savings are too small to justify the complexity; use it there only if you're hitting context window limits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check your own workload
pcb run --max-samples 10 --daily-tokens 5000000 --cost-model claude-opus-4-7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploy: Python SDK Wrappers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingAnthropic

client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    verbose=True,
)

response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)

# Cumulative stats
print(client.stats)
# CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)

# Estimate monthly savings
print(client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000))
# 588.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OpenAI (Chat Completions + Codex Responses API)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingOpenAI

client = CompressingOpenAI(compressor="tfidf", rate=0.40)

# Chat Completions API — unchanged
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": long_context}]
)

# Responses API (Codex / o-series)
response = client.responses.create(
    model="codex-mini-latest",
    input=long_codebase_context,
    reasoning={"effort": "high"}
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What gets compressed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By default, only "user" role messages over 100 tokens are compressed. System prompts and assistant history are passed through unchanged.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    compress_roles=("user", "system"),  # also compress system prompt
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Claude Code Integration (MCP)
&lt;/h2&gt;

&lt;p&gt;PCB ships an MCP server that adds four compression tools directly into Claude Code conversations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Add to the current project
claude mcp add pcb -s project -- python -m pcb.mcp_server

# Or add to all your projects
claude mcp add pcb -s user -- python -m pcb.mcp_server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or drop &lt;code&gt;.mcp.json&lt;/code&gt; into any project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "pcb": {
      "type": "stdio",
      "command": "python",
      "args": ["-m", "pcb.mcp_server"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Available tools&lt;/strong&gt;&lt;br&gt;
Once connected, you can ask Claude:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Compress this RAG context before sending it to the model"&lt;/li&gt;
&lt;li&gt;"Estimate how much I'd save compressing my prompts on claude-opus-4-7 at 2000 calls/day"&lt;/li&gt;
&lt;li&gt;"What compressor should I use for my coding assistant at 90% quality floor?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqk0rzatmcf5xghgtyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqk0rzatmcf5xghgtyd.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI Codex (Agents SDK)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from agents import Agent, Runner
from agents.mcp import MCPServerStdio
import asyncio

async def main():
    async with MCPServerStdio(
        name="pcb",
        params={"command": "python", "args": ["-m", "pcb.mcp_server"]},
    ) as pcb_server:
        agent = Agent(
            name="CostAwareAssistant",
            model="codex-mini-latest",
            mcp_servers=[pcb_server],
        )
        result = await Runner.run(
            agent,
            "Compress this codebase context and estimate savings: " + codebase_context
        )
        print(result.final_output)

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bring Your Own Data
&lt;/h2&gt;

&lt;p&gt;Data is JSONL - one JSON object per line. Check the schema for each task type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb show-schema rag
pcb show-schema summarization
pcb show-schema coding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RAG schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "my_001",
  "context": "&amp;lt;passage 300–1500 tokens&amp;gt;",
  "question": "&amp;lt;specific question requiring the full context&amp;gt;",
  "answer": "&amp;lt;short, precise answer string&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Summarization schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "my_001",
  "article": "&amp;lt;article or document 300–800 tokens&amp;gt;",
  "summary": "&amp;lt;2–3 sentence reference summary&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Coding schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "my_001",
  "context": "&amp;lt;imports, helpers, type definitions — 400–800 tokens&amp;gt;",
  "docstring": "&amp;lt;description of the function to implement&amp;gt;",
  "solution": "&amp;lt;correct Python implementation&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running on your data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --data-dir ./my_data --task rag --max-samples 50

# Compare specific compressors
pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2

# Export results
pcb run --data-dir ./my_data --output results.json
pcb run --data-dir ./my_data --output results.csv
pcb run --data-dir ./my_data --output results.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Workflow: Benchmark to Production
&lt;/h2&gt;

&lt;p&gt;Here is the full path from benchmarking to deploying a compressor in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Benchmark on your actual data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --data-dir ./my_data --max-samples 50 --task rag \
        --daily-tokens 2000000 --cost-model claude-opus-4-7 \
        --output benchmark.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: LLM-judge the top candidates&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2 \
        --llm-judge --judge-model claude-sonnet-4-6 --max-samples 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Deploy the winner&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingAnthropic

client = CompressingAnthropic(compressor="llmlingua2", rate=0.40)
# Everything else in your codebase stays the same
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Monitor in production&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if client.stats.calls % 1000 == 0:
    logger.info(
        "pcb savings: calls=%d saved=%d tokens (%.1f%%) est_monthly=$%.0f",
        client.stats.calls,
        client.stats.tokens_saved,
        client.stats.reduction_pct,
        client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000),
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;pcb run&lt;/code&gt; - benchmark&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Options:
  -c, --compressor TEXT       Compressor to include (repeat for multiple). Default: all five.
  -t, --task TEXT             Task type: rag, summarization, coding (repeat for multiple).
  -n, --max-samples INT       Max samples per task.
  -r, --rate FLOAT            Target compression rate 0.0–1.0. Default: 0.5
  -d, --data-dir PATH         Directory with *_samples.jsonl files.
  -o, --output PATH           Save report as .json, .csv, or .html.
  -j, --llm-judge             Enable LLM-as-judge scoring via OpenRouter.
  -m, --judge-model TEXT      Model for LLM judge. Default: claude-sonnet-4-6.
      --openrouter-key TEXT   OpenRouter API key (or set OPENROUTER_API_KEY).
      --daily-tokens INT      Daily token volume for cost projection.
      --cost-model TEXT       Model name for cost lookup (e.g. claude-opus-4-7).
      --token-price FLOAT     Manual price override in $/1M tokens.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;pcb compress&lt;/code&gt; - compress text&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Arguments:
  [INPUT_FILE]                File to compress. Reads stdin if omitted.

Options:
  -c, --compressor TEXT       Algorithm. Default: tfidf.
  -r, --rate FLOAT            Fraction to remove. Default: 0.45.
  -o, --output PATH           Write to file instead of stdout.
  -s, --stats                 Print token stats to stderr.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Other commands&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb list-compressors          # Show all algorithms
pcb list-models               # Show 75+ supported LLM judge models
pcb show-schema rag           # Show JSONL schema for a task type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Output Formats
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;JSON - full detail per sample&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --output results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CSV - one row per compressor × task&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --output results.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Columns: &lt;code&gt;compressor&lt;/code&gt;, &lt;code&gt;task&lt;/code&gt;, &lt;code&gt;avg_token_reduction_pct&lt;/code&gt;, &lt;code&gt;avg_quality_score&lt;/code&gt;, &lt;code&gt;avg_quality_drop_pct&lt;/code&gt;, &lt;code&gt;avg_llm_score&lt;/code&gt;, &lt;code&gt;avg_llm_drop_pct&lt;/code&gt;, &lt;code&gt;avg_latency_ms&lt;/code&gt;, &lt;code&gt;num_samples&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTML - shareable visual report&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --output results.html
# Open in any browser — Chart.js scatter plots, dark theme, Pareto highlights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When NOT to Use Compression
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Short prompts (&amp;lt; 200 tokens):&lt;/strong&gt; PCB skips these automatically; the overhead exceeds the savings.&lt;br&gt;
&lt;strong&gt;Cheap models (&amp;lt; $0.50/1M):&lt;/strong&gt; DeepSeek, Gemini Flash, GPT-4.1-mini - savings too small.&lt;br&gt;
&lt;strong&gt;High-precision tasks:&lt;/strong&gt; Legal review, medical diagnosis - verify your quality floor with &lt;code&gt;--llm-judge&lt;/code&gt; first.&lt;br&gt;
&lt;strong&gt;Output-bottlenecked workloads:&lt;/strong&gt; Compression only affects input tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/pcb/
├── cli.py                      # Typer CLI — all commands
├── config.py                   # Pydantic config and model pricing table
├── runner.py                   # Benchmark orchestration + BenchmarkReport
├── mcp_server.py               # FastMCP server for Claude Code / Codex
├── compressors/
│   ├── tfidf.py                # TF-IDF sentence scoring
│   ├── selective_context.py    # Greedy token-budget selection
│   ├── llmlingua.py            # Sentence-level coarse pruning
│   └── no_compression.py       # Passthrough baseline
├── tasks/
│   ├── rag.py                  # F1/EM/context-recall evaluator
│   ├── summarization.py        # ROUGE-L evaluator
│   └── coding.py               # BM25 + identifier preservation
├── evaluators/
│   └── llm_judge.py            # OpenRouter LLM-as-judge (75+ models)
├── reporters/
│   ├── terminal.py             # Rich terminal tables
│   ├── json_reporter.py        # JSON output
│   ├── csv_reporter.py         # CSV output
│   └── html_reporter.py        # Chart.js HTML report
├── middleware/
│   ├── anthropic_client.py     # CompressingAnthropic drop-in wrapper
│   └── openai_client.py        # CompressingOpenAI drop-in wrapper
└── data/
    ├── rag_samples.jsonl        # 20 real-world factual passages (400–450 tokens)
    ├── summarization_samples.jsonl  # 10 real news-style articles
    └── coding_samples.jsonl     # 10 real Python code contexts (370–800 tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;I described the problem at a high level: a tool that benchmarks multiple prompt compression algorithms against real workloads, scores quality loss empirically, projects actual dollar savings at a given token volume, and makes it trivially easy to deploy the winning algorithm into an existing Anthropic or OpenAI codebase.&lt;/p&gt;

&lt;p&gt;NEO built the entire thing autonomously: the Typer CLI with all commands and flags, all five compressor implementations, the F1/ROUGE-L/BM25 task evaluators, the OpenRouter LLM-as-judge with support for 75+ models, the cost projection engine with the model pricing table, the &lt;code&gt;CompressingAnthropic&lt;/code&gt; and &lt;code&gt;CompressingOpenAI&lt;/code&gt; drop-in wrappers, the FastMCP server with four tools, the JSON/CSV/HTML reporters, and the three bundled sample datasets - 20 RAG passages, 10 summarization articles, and 10 coding contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it before committing to a compression strategy.&lt;/strong&gt;&lt;br&gt;
Before wiring any compressor into your production stack, run pcb on a sample of your actual prompts. The benchmark tells you which algorithm preserves the most quality at your target compression rate on your data - a measured result, not a generic recommendation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to justify the cost of compression infrastructure.&lt;/strong&gt;&lt;br&gt;
The cost projection output gives you monthly and annual savings at your actual token volume and model pricing. This is the number you need to make a case for adding compression to your pipeline, not a rough estimate but a measured projection against your workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the MCP tools inside Claude Code sessions.&lt;/strong&gt;&lt;br&gt;
With the MCP server connected, you can ask Claude to compress a context, estimate savings, or recommend a compressor without leaving your coding environment. This makes compression a natural part of the agent workflow rather than a separate offline step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional compressors.&lt;/strong&gt;&lt;br&gt;
The compressors share a common interface in &lt;code&gt;src/pcb/compressors/&lt;/code&gt;. A new algorithm - semantic chunking, abstractive summarization, or a custom retrieval-based approach - slots in as a new file in that directory and appears automatically in &lt;code&gt;pcb run&lt;/code&gt;, &lt;code&gt;pcb compress&lt;/code&gt;, and the MCP &lt;code&gt;recommend&lt;/code&gt; tool without touching any other part of the codebase.&lt;/p&gt;
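&lt;p&gt;For a sense of the pattern, here is a minimal sketch of a new compressor file. The method name and signature are assumptions - copy the shape of an existing file like &lt;code&gt;tfidf.py&lt;/code&gt; rather than this sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# src/pcb/compressors/first_sentences.py - hypothetical example

class FirstSentencesCompressor:
    """Toy baseline: keep leading sentences until the budget is spent.
    Assumes the shared interface is compress(text, rate) returning a string,
    where rate is the fraction of tokens to remove (PCB's convention)."""

    name = "first_sentences"

    def compress(self, text: str, rate: float) -&amp;gt; str:
        sentences = text.split(". ")
        keep = max(1, round(len(sentences) * (1.0 - rate)))
        return ". ".join(sentences[:keep])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;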

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Most teams discover they are overspending on input tokens only after the bill arrives. pcb gives you the benchmark data to make an informed decision before committing - which algorithm, at what rate, for which task type - and the deployment tooling to act on it immediately.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>ContextCraft: A Visual Workbench for Building and Managing LLM Context Windows</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 05 May 2026 16:00:35 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/contextcraft-a-visual-workbench-for-building-and-managing-llm-context-windows-3f8e</link>
      <guid>https://dev.to/nilofer_tweets/contextcraft-a-visual-workbench-for-building-and-managing-llm-context-windows-3f8e</guid>
<description>&lt;p&gt;Building a good LLM prompt is not a one-shot task. You assemble the pieces - a system message, a few examples, some context, the actual instruction - and then you iterate. You compress things that are too long, test whether the output still holds up, check how many tokens you are spending, and save versions so you can roll back when something breaks.&lt;/p&gt;

&lt;p&gt;Most developers do this in a text editor, a notebook, or scattered across a handful of scripts. There is no single place where you can see the whole context window, manipulate it visually, compress a block, run a live test, and save a snapshot, all without switching tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ContextCraft&lt;/strong&gt; is that place. It is a canvas-based interactive workbench for assembling, compressing, testing, and versioning LLM context windows. It runs locally, connects to Ollama for local compression and testing, supports OpenRouter for cloud LLM testing, and exports directly to OpenAI, Anthropic, LangChain, and JSON formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Visual Canvas:&lt;/strong&gt; Drag and drop interface for organizing prompt blocks with real-time token counting and visual progress bars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart Compression:&lt;/strong&gt; AI-powered compression using Ollama with semantic preservation. Set a target compression ratio, choose whether to preserve structure, and review a before/after comparison before applying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coverage Analysis:&lt;/strong&gt; Semantic similarity scoring between original and compressed content. Key concept preservation is surfaced as a score so you know exactly what you are trading for token savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Testing:&lt;/strong&gt; Test prompts with streaming responses from Ollama or OpenRouter directly from the canvas. Select provider, model, and temperature and view responses in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version Control:&lt;/strong&gt; Save and restore canvas versions with a SQLite backend. Name versions for easy reference and compare two versions to see what changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Format Export:&lt;/strong&gt; Export to OpenAI, Anthropic, LangChain, and JSON formats. Copy the generated code and paste directly into your application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block Library:&lt;/strong&gt; Pre-built starter blocks for common use cases, available from the sidebar. Add your own blocks to the library for reuse across canvases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;ContextCraft is split into a FastAPI backend and a React + Vite frontend.&lt;/p&gt;

&lt;p&gt;The backend handles token counting via tiktoken, semantic similarity analysis via sentence-transformers, compression via Ollama, streaming LLM test responses via Ollama or OpenRouter, SQLite-backed version management, and export format generation. The frontend renders the visual canvas with drag-and-drop via &lt;code&gt;@hello-pangea/dnd&lt;/code&gt; and code editing via CodeMirror.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4emd0squlz4375pjo2rv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4emd0squlz4375pjo2rv.png" alt=" " width="800" height="461"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contextcraft/
├── server/                 # FastAPI backend
│   ├── main.py            # FastAPI app entry point
│   ├── models.py          # Pydantic data models
│   ├── tokenizer.py       # Token counting (tiktoken)
│   ├── coverage.py        # Semantic similarity analysis
│   ├── compress.py        # Ollama compression service
│   ├── tester.py          # LLM streaming test service
│   ├── export.py          # Export format generators
│   ├── versions.py        # SQLite version management
│   └── pricing.py         # OpenRouter pricing API
├── frontend/              # React + Vite frontend
│   ├── src/
│   │   ├── components/    # React components
│   │   ├── hooks/         # Custom React hooks
│   │   └── App.jsx        # Main application
│   └── package.json
├── cli/                   # CLI entry point
│   └── main.py
└── pyproject.toml         # Python package config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.9+&lt;/li&gt;
&lt;li&gt;Node.js 18+&lt;/li&gt;
&lt;li&gt;Ollama (optional, for local compression and testing)&lt;/li&gt;
&lt;li&gt;OpenRouter API key (optional, for cloud LLM testing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/dakshjain-1616/ContextCraft.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ContextCraft

&lt;span class="c"&gt;# Install Python dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;

&lt;span class="c"&gt;# Install frontend dependencies&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;frontend
npm &lt;span class="nb"&gt;install
cd&lt;/span&gt; ..

&lt;span class="c"&gt;# Initialize the database&lt;/span&gt;
contextcraft init-db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running the Application&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the server and frontend&lt;/span&gt;
contextcraft serve

&lt;span class="c"&gt;# Or start with custom options&lt;/span&gt;
contextcraft serve &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &lt;span class="nt"&gt;--frontend-port&lt;/span&gt; 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once running, the frontend is available at localhost:5173 and the API docs at localhost:8000/docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# OpenRouter API key (for cloud LLM testing)
&lt;/span&gt;&lt;span class="py"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your_api_key_here&lt;/span&gt;

&lt;span class="c"&gt;# Ollama URL (default: http://localhost:11434)
&lt;/span&gt;&lt;span class="py"&gt;OLLAMA_URL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;

&lt;span class="c"&gt;# Default compression model
&lt;/span&gt;&lt;span class="py"&gt;DEFAULT_COMPRESSION_MODEL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gemma2:2b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Supported Models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token Counting:&lt;/strong&gt; GPT-4, GPT-4o, GPT-4o Mini, GPT-3.5 Turbo, Claude 3 Opus, Sonnet, Haiku, Claude 3.5 Sonnet&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compression (via Ollama):&lt;/strong&gt; gemma2:2b (default), any Ollama-compatible model&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing:&lt;/strong&gt; Ollama local models, OpenRouter cloud models (requires API key)&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage Guide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Creating a Canvas:&lt;/strong&gt; Start with an empty canvas or load from the library. Add blocks using the sidebar buttons or drag from the library. Arrange blocks by dragging to reorder. Edit block content inline or in the full editor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compressing Content:&lt;/strong&gt; Click the compress icon on any block. Set the target compression ratio (0.1 to 0.9). Choose whether to preserve structure. Review the before/after comparison. Apply compression when satisfied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing Prompts:&lt;/strong&gt; Add your prompt blocks to the canvas. Click the Test button. Select provider (Ollama or OpenRouter). Choose model and set temperature. View streaming responses in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyzing Coverage:&lt;/strong&gt; Compress one or more blocks. Click the Coverage button. View semantic similarity scores. Check key concept preservation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managing Versions:&lt;/strong&gt; Click Versions to save the current state. Name your version for easy reference. Restore previous versions at any time. Compare versions to see changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exporting:&lt;/strong&gt; Click Export when ready. Choose a format: OpenAI, Anthropic, LangChain, or JSON. Copy the generated code. Paste into your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Token Management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /api/tokenize&lt;/code&gt; - count tokens for text or blocks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /api/pricing&lt;/code&gt; - get model pricing information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compression&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /api/compress&lt;/code&gt; - compress text using Ollama&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/coverage&lt;/code&gt; - analyze semantic coverage between original and compressed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /api/test&lt;/code&gt; - stream LLM responses from Ollama or OpenRouter&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Versioning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /api/versions&lt;/code&gt; - list all versions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/versions&lt;/code&gt; - save a new version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /api/versions/{id}&lt;/code&gt; - get a specific version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/versions/{id}/restore&lt;/code&gt; - restore a version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/versions/compare&lt;/code&gt; - compare two versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Export&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /api/export&lt;/code&gt; - export canvas to OpenAI, Anthropic, LangChain, or JSON format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Library&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /api/library&lt;/code&gt; - get the starter block library&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/library&lt;/code&gt; - add a block to the library&lt;/li&gt;
&lt;/ul&gt;
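&lt;p&gt;As a quick illustration, here is one of these endpoints called from Python. The path comes from the list above; the request body fields are assumptions on my part, so check the generated docs at &lt;code&gt;localhost:8000/docs&lt;/code&gt; for the real schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import urllib.request

# Hypothetical payload - the field names are guesses, see /docs
payload = json.dumps({"text": "You are a helpful assistant.", "model": "gpt-4o"}).encode()

req = urllib.request.Request(
    "http://localhost:8000/api/tokenize",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # token count for the text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;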
&lt;h2&gt;
  
  
  CLI Commands
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the application&lt;/span&gt;
contextcraft serve

&lt;span class="c"&gt;# Initialize database&lt;/span&gt;
contextcraft init-db

&lt;span class="c"&gt;# Add a block to library&lt;/span&gt;
contextcraft add-block &lt;span class="nt"&gt;--type&lt;/span&gt; system &lt;span class="nt"&gt;--label&lt;/span&gt; &lt;span class="s2"&gt;"My Template"&lt;/span&gt; &lt;span class="nt"&gt;--content&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt;

&lt;span class="c"&gt;# Get help&lt;/span&gt;
contextcraft &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Docker
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build image&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; contextcraft &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Run container&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="nt"&gt;-p&lt;/span&gt; 5173:5173 contextcraft
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install dev dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;

&lt;span class="c"&gt;# Run tests&lt;/span&gt;
pytest

&lt;span class="c"&gt;# Format code&lt;/span&gt;
black server/ cli/
isort server/ cli/

&lt;span class="c"&gt;# Type checking&lt;/span&gt;
mypy server/ cli/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;frontend

&lt;span class="c"&gt;# Start dev server&lt;/span&gt;
npm run dev

&lt;span class="c"&gt;# Build for production&lt;/span&gt;
npm run build

&lt;span class="c"&gt;# Run linter&lt;/span&gt;
npm run lint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;Fork the repository. Create a feature branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; feature/amazing-feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Commit your changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s1"&gt;'Add amazing feature'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push to the branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git push origin feature/amazing-feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open a Pull Request.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This tool was designed, built, debugged, and iterated entirely using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; - an autonomous AI engineering agent that writes, runs, and refines real code end-to-end.&lt;/p&gt;

&lt;p&gt;ContextCraft is a full-stack application: a FastAPI backend, a React + Vite frontend, a CLI, and a SQLite-backed versioning layer. Every part of the system was generated and connected through NEO: the backend services for token counting, semantic coverage analysis, compression via Ollama, streaming LLM testing, export pipelines, version management, and pricing integration, along with the interactive frontend canvas for assembling prompt blocks with drag-and-drop, inline editing, and real-time token tracking.&lt;/p&gt;

&lt;p&gt;The compression and coverage pipeline, the live testing flow across Ollama and OpenRouter, the version save/restore and comparison system, and the multi-format export layer were all built end-to-end from a high-level problem description. NEO handled the full cycle - generating code, wiring components, resolving issues, and refining the system into a working product.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering workbench:&lt;/strong&gt; Instead of iterating on prompts in a text editor and manually counting tokens, assemble your context window visually, compress blocks that are too long, and test the result, all in one place. The version control means you never lose a working configuration while experimenting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluate compression quality before shipping:&lt;/strong&gt; Before deploying a compressed prompt to production, run coverage analysis to get a semantic similarity score between the original and compressed version. You know exactly how much meaning you are trading for token savings, not just a token count but an actual semantic measurement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manage prompt libraries across projects:&lt;/strong&gt; The block library lets you save reusable prompt blocks and load them into any canvas. Teams building multiple LLM products can maintain a shared library of tested, versioned prompt components.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional export formats:&lt;/strong&gt; The export module currently supports OpenAI, Anthropic, LangChain, and JSON. Adding a new format follows the same pattern in &lt;code&gt;export.py&lt;/code&gt; and surfaces automatically in the Export UI without touching any other part of the stack.&lt;/p&gt;
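&lt;p&gt;A rough sketch of that pattern follows. The function shape and the block attributes are assumptions - mirror how the existing OpenAI and Anthropic generators in &lt;code&gt;export.py&lt;/code&gt; are actually wired up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# server/export.py - hypothetical new generator, not shipped code
import json

def export_litellm(blocks):
    """Toy example of an additional format. Assumes each canvas block
    exposes .role and .content attributes."""
    messages = [{"role": b.role, "content": b.content} for b in blocks]
    return json.dumps({"model": "gpt-4o", "messages": messages}, indent=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;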

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Context window management is one of those problems that looks simple until you are doing it seriously. ContextCraft brings together the pieces that are usually scattered across different tools (visual assembly, token counting, AI compression, semantic coverage analysis, live testing, version control, and export) into a single local workbench.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/ContextCraft" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/ContextCraft&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can also use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>promptengineering</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>LLM Behavior Diff Model Update Detector</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Mon, 04 May 2026 11:18:41 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/llm-behavior-diff-model-update-detector-3e7b</link>
      <guid>https://dev.to/nilofer_tweets/llm-behavior-diff-model-update-detector-3e7b</guid>
      <description>&lt;p&gt;You swap a model. The new one scores better on your benchmarks. You deploy it. Two days later, a user reports that something that used to work reliably now behaves differently.&lt;/p&gt;

&lt;p&gt;The benchmark never caught it because benchmarks measure averages. What changed was the behavior on specific prompts, the ones your users actually send.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Behavior Diff&lt;/strong&gt; is a tool that catches this before it happens. Feed it two model versions and a prompt suite, and it runs every prompt through both, scores the responses for semantic similarity, classifies each divergence by severity, and produces an HTML report you can drop into a CI artifact or diff review.&lt;/p&gt;

&lt;p&gt;It ships as a CLI, a Python API, and an MCP server so Claude Code or any MCP-compatible agent can run a behavioral diff before a model swap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Model Updates
&lt;/h2&gt;

&lt;p&gt;Every model update is a tradeoff. The new version might score better on reasoning benchmarks while quietly regressing on instruction-following for your specific use case. Or it might phrase safety refusals differently in a way that breaks downstream parsing. Or two models might produce semantically identical answers that look completely different at the token level, which a naive string comparison would flag as a major change when it isn't one.&lt;/p&gt;

&lt;p&gt;LLM Behavior Diff addresses all three scenarios. Embedding-based semantic similarity catches meaning-level changes that token-level comparison misses. The LLM-as-judge step adds a layer of reasoning for ambiguous cases. Severity classification separates noise from real regressions.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The pipeline runs in five steps for every prompt in your suite:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load:&lt;/strong&gt; A YAML prompt suite is loaded into a &lt;code&gt;PromptSuite&lt;/code&gt; Pydantic model. Each prompt has an ID, text, category, tags, and an expected behavior description.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run:&lt;/strong&gt; Each prompt is sent through Model A and Model B via &lt;code&gt;LLMRunner&lt;/code&gt;. Three providers are supported: Ollama (&lt;code&gt;/api/generate&lt;/code&gt;), OpenRouter (chat completions), and a deterministic stub provider for offline CI runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Score:&lt;/strong&gt; Each response pair is scored with either &lt;code&gt;EmbeddingDiffer&lt;/code&gt; (cosine similarity on &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; embeddings) or &lt;code&gt;SimpleDiffer&lt;/code&gt; (Jaccard over words). Optionally, an LLM-as-judge score is combined with the similarity score; the default judge model is &lt;code&gt;google/gemini-2.0-flash-lite-001&lt;/code&gt; via OpenRouter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classify:&lt;/strong&gt; Each prompt is classified against the &lt;code&gt;--threshold&lt;/code&gt;. Changes are bucketed by severity: combined score &amp;gt;= 0.7 is minor, &amp;gt;= 0.4 is moderate, &amp;lt; 0.4 is major.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Report:&lt;/strong&gt; An HTML report is rendered and a rich summary table is printed to the terminal.&lt;/p&gt;
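
&lt;p&gt;The severity bucketing in the Classify step comes down to two comparisons. A minimal sketch of that logic (not the tool's exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def classify_severity(combined_score: float) -&gt; str:
    """Bucket a combined score using the documented thresholds:
    &gt;= 0.7 minor, &gt;= 0.4 moderate, &lt; 0.4 major."""
    if combined_score &gt;= 0.7:
        return "minor"
    if combined_score &gt;= 0.4:
        return "moderate"
    return "major"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;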

&lt;h2&gt;
  
  
  Why Embeddings Over Token Matching
&lt;/h2&gt;

&lt;p&gt;The difference matters. Here is the same two-model comparison run two ways:&lt;br&gt;
With &lt;code&gt;--use-embeddings&lt;/code&gt; (cosine on &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;Avg Similarity: 91.4%&lt;br&gt;
Changes Detected: 0 of 5&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;--no-use-embeddings&lt;/code&gt; (Jaccard fallback):&lt;/p&gt;

&lt;p&gt;Avg Similarity: 25.0%&lt;br&gt;
Changes Detected: 5 of 5&lt;/p&gt;

&lt;p&gt;Same two models, same prompts, completely opposite conclusions. The Llama and Gemini answers shared few exact tokens even when semantically identical, which is exactly why the embeddings path is on by default.&lt;/p&gt;
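
&lt;p&gt;A toy example makes the gap concrete. This is a plain word-level Jaccard, written out here for illustration rather than taken from the tool:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def jaccard(a: str, b: str) -&gt; float:
    # Word-level Jaccard: |intersection| / |union| of the two word sets
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa &amp; wb) / len(wa | wb)

# Semantically identical answers with zero shared tokens:
print(jaccard("The answer is 4.", "Two plus two equals four."))  # 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
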
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requires Python 3.11+. Embedding similarity uses &lt;code&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/code&gt;, downloaded on first use. The LLM-judge path requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;; without it, scoring falls back to embeddings-only.&lt;/p&gt;
&lt;h2&gt;
  
  
  Running a Diff
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Offline - stub provider&lt;/strong&gt;&lt;br&gt;
A stub provider returns deterministic hashed responses, so the whole pipeline runs offline without Ollama or an API key. Good for CI and testing the setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-diff run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-a&lt;/span&gt; stub-a &lt;span class="nt"&gt;--provider-a&lt;/span&gt; stub &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-b&lt;/span&gt; stub-b &lt;span class="nt"&gt;--provider-b&lt;/span&gt; stub &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompts&lt;/span&gt; prompts/default.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; output/report.html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-use-embeddings&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real output from this run (stub + Jaccard, threshold 0.5):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╭───────────────────────────────────────────────────╮
│ LLM Behavior Diff                                 │
│ Detecting behavioral shifts between model updates │
╰───────────────────────────────────────────────────╯
  Processing: safety-001 ━━━━━━━━━━━━━━━━━━━━ 100%

Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Prompts    │ 5     │
│ Changes Detected │ 3     │
│ Change Rate      │ 60.0% │
│ Avg Similarity   │ 40.0% │
└──────────────────┴───────┘
Report saved to: output/stub_jaccard.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real models - OpenRouter&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
llm-diff run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-a&lt;/span&gt; meta-llama/llama-3.2-3b-instruct &lt;span class="nt"&gt;--provider-a&lt;/span&gt; openrouter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-b&lt;/span&gt; google/gemini-2.0-flash-lite-001 &lt;span class="nt"&gt;--provider-b&lt;/span&gt; openrouter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompts&lt;/span&gt; prompts/default.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; output/or_emb.html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--use-embeddings&lt;/span&gt; &lt;span class="nt"&gt;--threshold&lt;/span&gt; 0.85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real output (embeddings only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Prompts    │ 5     │
│ Changes Detected │ 0     │
│ Change Rate      │ 0.0%  │
│ Avg Similarity   │ 91.4% │
└──────────────────┴───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding &lt;code&gt;--use-judge&lt;/code&gt; brings the average similarity to 91.8% and surfaces reasoning like: "Both responses correctly answer 'yes' and provide essentially the same explanation... Response A is slightly more verbose, but the core meaning is identical."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real models - Ollama&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-diff run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-a&lt;/span&gt; qwen3:8b &lt;span class="nt"&gt;--provider-a&lt;/span&gt; ollama &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-b&lt;/span&gt; gemma4:e4b &lt;span class="nt"&gt;--provider-b&lt;/span&gt; ollama &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompts&lt;/span&gt; prompts/default.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; output/report.html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--use-embeddings&lt;/span&gt; &lt;span class="nt"&gt;--threshold&lt;/span&gt; 0.85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;llm-diff --help

Usage: llm-diff [OPTIONS] COMMAND [ARGS]...

 LLM Behavior Diff — Model Update Detector

 --version                Show version information
 --help                   Show this message and exit.

 Commands
   run  Run a comparison between two models.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;llm-diff --version
LLM Behavior Diff version 0.1.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key options for &lt;code&gt;llm-diff run&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sqizy9sheh9lm5ss0h1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sqizy9sheh9lm5ss0h1.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Severity buckets applied when a change is detected: combined &amp;gt;= 0.7 is minor, &amp;gt;= 0.4 is moderate, &amp;lt; 0.4 is major.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Suite Format
&lt;/h2&gt;

&lt;p&gt;The prompt suite is a YAML file. &lt;code&gt;prompts/default.yaml&lt;/code&gt; ships with 5 prompts spanning reasoning, coding, factual, instruction-following, and safety. You can write your own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;suite"&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0"&lt;/span&gt;
&lt;span class="na"&gt;prompts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code-001"&lt;/span&gt;
    &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Python&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reverse_string(s)..."&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coding"&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;expected_behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Short&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;correct&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;function"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IDs must be unique. Category must be one of: &lt;code&gt;reasoning&lt;/code&gt;, &lt;code&gt;coding&lt;/code&gt;, &lt;code&gt;creativity&lt;/code&gt;, &lt;code&gt;safety&lt;/code&gt;, &lt;code&gt;instruction_following&lt;/code&gt;, &lt;code&gt;factual&lt;/code&gt;, &lt;code&gt;conversational&lt;/code&gt;.&lt;/p&gt;
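
&lt;p&gt;If you maintain suites by hand, a few lines of standalone validation catch these mistakes before a run (a sketch using PyYAML with an assumed file name; the real checks live in the &lt;code&gt;PromptSuite&lt;/code&gt; Pydantic model):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml

ALLOWED = {"reasoning", "coding", "creativity", "safety",
           "instruction_following", "factual", "conversational"}

with open("prompts/my_suite.yaml") as f:   # hypothetical suite path
    suite = yaml.safe_load(f)

ids = [p["id"] for p in suite["prompts"]]
assert len(ids) == len(set(ids)), "prompt IDs must be unique"
bad = [p["id"] for p in suite["prompts"] if p["category"] not in ALLOWED]
assert not bad, f"invalid categories on: {bad}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;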

&lt;h2&gt;
  
  
  Python API
&lt;/h2&gt;

&lt;p&gt;The full pipeline is available as a library. A synchronous one-shot call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.runner&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_prompt_sync&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ModelConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProviderType&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_prompt_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;ModelConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub-m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ProviderType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STUB&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; Model stub-m says: 921fac0c4c True
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarity scoring directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.differ&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleDiffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EmbeddingDiffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_differ&lt;/span&gt;

&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDiffer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compute_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the cat sat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the cat ran&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# 0.5
&lt;/span&gt;
&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmbeddingDiffer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compute_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The answer is 4.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Two plus two equals four.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; ~0.59
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;create_differ(use_embeddings=False)&lt;/code&gt; returns a &lt;code&gt;SimpleDiffer&lt;/code&gt; (Jaccard); &lt;code&gt;use_embeddings=True&lt;/code&gt; returns an &lt;code&gt;EmbeddingDiffer&lt;/code&gt; if sentence-transformers is importable, otherwise it falls back to &lt;code&gt;SimpleDiffer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Generating a report from a &lt;code&gt;ComparisonRun&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.report&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReportGenerator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;

&lt;span class="nc"&gt;ReportGenerator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;save_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;out.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ReportGenerator&lt;/code&gt; looks for a Jinja template in the CWD, the package directory, and a legacy path, then falls back to a built-in template so reports always render.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP Server
&lt;/h2&gt;

&lt;p&gt;The tool also runs as an MCP server over stdio transport, exposing three tools so Claude Code or any MCP-compatible agent can trigger a behavioral diff during a session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-diff-mcp
&lt;span class="c"&gt;# or: python -m llm_behavior_diff.mcp_server&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three exposed tools:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;compare_models&lt;/strong&gt; - runs a full prompt suite through two models and returns per-prompt similarity, severity, and response text.&lt;br&gt;
&lt;strong&gt;analyze_drift&lt;/strong&gt; - scores drift between two candidate responses for a single prompt.&lt;br&gt;
&lt;strong&gt;generate_report&lt;/strong&gt; - renders an HTML summary from a JSON list of results.&lt;/p&gt;

&lt;p&gt;Claude Code config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llm-behavior-diff"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm-diff-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Smoke test - all three tools, offline, via Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.mcp_server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;compare_models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CompareModelsRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;analyze_drift&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AnalyzeDriftRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;generate_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GenerateReportRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;compare_models&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompareModelsRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub-a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub-b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompts_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompts/default.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;changes_detected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_similarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# real: 5 4 0.3446
&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;analyze_drift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AnalyzeDriftRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompt_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The answer is 4.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2+2 equals 4.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_similarity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# real: 0.5572 moderate
&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generate_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;GenerateReportRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;results_json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;behavioral_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;behavioral_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;major&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output/mcp_report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MCP Smoke&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verified over stdio JSON-RPC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;llm-diff-mcp   #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;speaks MCP 2024-11-05 on stdio
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;tools/list -&amp;gt; compare_models, analyze_drift, generate_report
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;tools/call analyze_drift &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"prompt_text"&lt;/span&gt;:&lt;span class="s2"&gt;"..."&lt;/span&gt;,&lt;span class="s2"&gt;"response_a"&lt;/span&gt;:&lt;span class="s2"&gt;"Paris"&lt;/span&gt;,
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s2"&gt;"response_b"&lt;/span&gt;:&lt;span class="s2"&gt;"The capital is Paris."&lt;/span&gt;,&lt;span class="s2"&gt;"use_embeddings"&lt;/span&gt;:true&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&amp;gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"embedding_similarity"&lt;/span&gt;:0.7761,&lt;span class="s2"&gt;"severity"&lt;/span&gt;:&lt;span class="s2"&gt;"minor"&lt;/span&gt;, ...&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-judge requires OpenRouter&lt;/strong&gt; - Without &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;, judging is skipped and the combined score equals the embedding or Jaccard similarity alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First embedding run is slow&lt;/strong&gt; - &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; is downloaded from Hugging Face on first use. Subsequent runs use the local cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama is not spawned automatically&lt;/strong&gt; - The client talks to &lt;code&gt;http://localhost:11434&lt;/code&gt; by default (the &lt;code&gt;OLLAMA_HOST&lt;/code&gt; env var overrides this). Ollama must already be running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stub provider is for CI and demos only&lt;/strong&gt; - It produces deterministic fake text keyed on model name, temperature, and prompt. Not suitable for real behavioral conclusions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How You Can Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gate model upgrades in CI before they ship:&lt;/strong&gt; Add an &lt;code&gt;llm-diff run&lt;/code&gt; step to your deployment pipeline. Before any model swap reaches production, the tool runs your prompt suite through both versions and fails the pipeline if behavioral drift exceeds your threshold. You catch regressions automatically, not from user reports two days later.&lt;/p&gt;
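
&lt;p&gt;A minimal gate script might look like the following. One loud assumption: it treats a non-zero exit code from &lt;code&gt;llm-diff run&lt;/code&gt; as "drift detected", which you should verify against your installed version before wiring this into a pipeline; the model names are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import sys

# Hypothetical CI gate around the CLI; the flags match the examples above.
result = subprocess.run([
    "llm-diff", "run",
    "--model-a", "old-model", "--provider-a", "openrouter",
    "--model-b", "new-model", "--provider-b", "openrouter",
    "--prompts", "prompts/default.yaml",
    "--output", "output/ci_report.html",
    "--use-embeddings", "--threshold", "0.85",
])
sys.exit(result.returncode)  # assumption: non-zero means drift exceeded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;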

&lt;p&gt;&lt;strong&gt;Use it during prompt engineering to measure real impact:&lt;/strong&gt; When you change a system prompt or few-shot examples, run a diff between the old and new configuration. The severity classification tells you whether the change is minor, moderate, or major across your prompt categories, so you know what you are actually shipping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the MCP server to make your agent self-aware of drift:&lt;/strong&gt; With the MCP server running, Claude Code or any MCP-compatible agent can call &lt;code&gt;compare_models&lt;/code&gt; or &lt;code&gt;analyze_drift&lt;/code&gt; directly during a session. An agent working on a model integration can check for behavioral drift without leaving the coding environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional providers:&lt;/strong&gt; The tool currently supports Ollama, OpenRouter, and a stub provider, all sharing a common &lt;code&gt;LLMRunner&lt;/code&gt; interface. Adding a new provider for Anthropic, Gemini, or any OpenAI-compatible endpoint follows the same pattern without touching the differ, classifier, or report logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Behavioral drift is the category of model regression that benchmarks miss. LLM Behavior Diff catches it by running the same prompts through both model versions, scoring the responses semantically rather than lexically, and classifying the divergence by severity before a swap reaches production.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/-LLM-Behavior-Diff-Model-Update-Detector" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/-LLM-Behavior-Diff-Model-Update-Detector&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also build with &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;NEO is a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>AI Slop Cleaner: Automating Your Codebase Hygiene</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Thu, 30 Apr 2026 09:56:31 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/ai-slop-cleaner-automating-your-codebase-hygiene-4hi</link>
      <guid>https://dev.to/nilofer_tweets/ai-slop-cleaner-automating-your-codebase-hygiene-4hi</guid>
      <description>&lt;p&gt;Every codebase accumulates clutter over time. An import left behind after a refactor. A helper function that nothing calls anymore. A method that grew too complex to reason about. None of it breaks anything immediately, but it slows down every developer who reads through it, and it silently raises the cost of every future change.&lt;/p&gt;

&lt;p&gt;The usual fix is a manual review pass. Someone spends an hour looking for unused imports, searching for dead functions, flagging complexity hotspots. It is tedious, inconsistent, and happens far less often than it should.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slop Cleaner&lt;/strong&gt; is a CLI tool that does this automatically. It detects unused imports, dead functions and classes, and over-complex code using tree-sitter AST analysis (not regex) so it never removes an import that appears in a docstring or string annotation. Every patch is atomic: backed up before writing, and rolled back automatically if your test suite fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ndoo779n1lyz832dx0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ndoo779n1lyz832dx0b.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What slop-cleaner Detects and Fixes
&lt;/h2&gt;

&lt;p&gt;slop-cleaner targets three specific categories of clutter that accumulate in every codebase:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unused imports:&lt;/strong&gt; Imports left behind after refactors. These are detected at HIGH confidence and removed automatically. The tool handles single-line imports, selective removal from &lt;code&gt;from X import A, B&lt;/code&gt; blocks where only one name is unused, and multi-line &lt;code&gt;from X import (...)&lt;/code&gt; blocks where individual lines are surgically removed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dead functions and classes:&lt;/strong&gt; Defined but never called anywhere in the codebase. These are detected at MEDIUM confidence and flagged in the report for human review. The tool builds a full call graph across all symbols before making this determination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over-complex functions:&lt;/strong&gt; Functions with cyclomatic complexity above the threshold (default 10). These are flagged at MEDIUM confidence for manual refactoring. The tool never auto-removes a function; complexity is a signal that something needs attention, not a safe automatic fix.&lt;/p&gt;

&lt;p&gt;HIGH confidence issues are fixed automatically. MEDIUM confidence issues are always left to human judgment.&lt;/p&gt;
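
&lt;p&gt;The selective removal is easiest to see as a before/after. An illustrative sketch, not actual tool output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Before cleaning, ceil and floor are imported but never referenced:
#   from math import sqrt, ceil, floor
# After cleaning, only the used name survives:
from math import sqrt

print(sqrt(2.0))  # sqrt is a real identifier usage, so it stays
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;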

&lt;h2&gt;
  
  
  The 5-Phase Pipeline
&lt;/h2&gt;

&lt;p&gt;Everything slop-cleaner does runs as a five-phase pipeline. Each phase feeds into the next, and every patch is atomic, backed up before writing and rolled back automatically if your tests fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit:&lt;/strong&gt; Parses every &lt;code&gt;.py&lt;/code&gt; and &lt;code&gt;.ts&lt;/code&gt;/&lt;code&gt;.tsx&lt;/code&gt; file with tree-sitter. Collects unused imports at HIGH confidence and high-complexity functions at MEDIUM confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze:&lt;/strong&gt; Builds a call graph across all symbols in the project. Identifies dead code: symbols defined but never called.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean:&lt;/strong&gt; Applies HIGH-confidence patches atomically. Handles single-line and multi-line &lt;code&gt;from X import (...)&lt;/code&gt; blocks. Backs up each file before writing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify:&lt;/strong&gt; Runs your test suite with pytest. On any failure, rolls back every patched file to its backup automatically. Returns exit code 1 so CI catches it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document:&lt;/strong&gt; Generates three Markdown reports: &lt;code&gt;ARCHITECTURE.md&lt;/code&gt;, &lt;code&gt;FUNCTION_MAP.md&lt;/code&gt;, and &lt;code&gt;SLOP_REPORT.md&lt;/code&gt;, covered in detail below.&lt;/p&gt;
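
&lt;p&gt;The backup-and-rollback contract of the Clean and Verify phases can be sketched in a few lines (a simplified illustration of the idea, not the tool's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import shutil
import subprocess
from pathlib import Path

def apply_with_rollback(patches: dict) -&gt; bool:
    """Write patched contents, run pytest, restore backups on failure.
    `patches` maps a file path to its new text."""
    backups = {}
    for path, new_text in patches.items():
        backup = Path(str(path) + ".bak")
        shutil.copy2(path, backup)            # back up before writing
        backups[path] = backup
        Path(path).write_text(new_text)
    if subprocess.run(["pytest", "-q"]).returncode != 0:
        for path, backup in backups.items():  # tests failed: roll back
            shutil.move(str(backup), str(path))
        return False
    for backup in backups.values():           # tests passed: drop backups
        backup.unlink()
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;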

&lt;h2&gt;
  
  
  What Gets Fixed Automatically vs. Flagged
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1g1a3936hz63kzupcf8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1g1a3936hz63kzupcf8.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The distinction matters. HIGH confidence patches are cases where the tool is certain removal is safe. MEDIUM confidence cases (dead code and complexity) can involve dynamic dispatch, reflection, or other patterns that make automatic removal risky. The tool flags these and leaves the decision to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge Cases Handled
&lt;/h2&gt;

&lt;p&gt;One of the hardest parts of detecting unused imports is knowing when a name that looks unused is actually needed. slop-cleaner handles these correctly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyztlbmueky8h7euxbuq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyztlbmueky8h7euxbuq0.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are exactly the cases where regex-based tools get it wrong. Because slop-cleaner parses an actual AST, it understands the difference between a name appearing inside a string and a name being used as an identifier.&lt;/p&gt;
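
&lt;p&gt;The string type annotation is the classic trap. An illustrative snippet (not from the tool's test corpus):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from decimal import Decimal   # never used as a bare identifier below

def parse_price(raw: str) -&gt; "Decimal":
    """The word Decimal in this docstring is prose, not a usage;
    the quoted return annotation above, however, is a real one."""
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A regex scan cannot tell those two strings apart; a parser that knows the syntactic position of each string can.&lt;/p&gt;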

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/your-org/slop-cleaner
&lt;span class="nb"&gt;cd &lt;/span&gt;slop-cleaner

&lt;span class="c"&gt;# Create and activate a virtual environment&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate      &lt;span class="c"&gt;# macOS / Linux&lt;/span&gt;
&lt;span class="c"&gt;# .venv\Scripts\activate       # Windows&lt;/span&gt;

&lt;span class="c"&gt;# Install the tool and its dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This registers two CLI commands: &lt;code&gt;slop-audit&lt;/code&gt; and &lt;code&gt;slop-clean&lt;/code&gt;.&lt;br&gt;
Dependencies installed automatically via &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tree-sitter&amp;gt;=0.25
tree-sitter-python&amp;gt;=0.23
tree-sitter-typescript&amp;gt;=0.23
rich&amp;gt;=13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note for modern Linux (Ubuntu 23.04+, Debian 12+)&lt;/strong&gt;: system Python blocks global pip install. Always use a virtual environment as shown above, or use &lt;code&gt;pipx install .&lt;/code&gt; to install the CLI tools globally without a venv.&lt;/p&gt;

&lt;p&gt;To run the test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[test]"&lt;/span&gt;   &lt;span class="c"&gt;# adds pytest + pytest-cov&lt;/span&gt;
pytest                     &lt;span class="c"&gt;# runs tests/test_parsers.py (22 tests)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Audit a project — report issues, exit 1 if any found (CI-friendly)&lt;/span&gt;
slop-audit path/to/project/

&lt;span class="c"&gt;# Audit a single file&lt;/span&gt;
slop-audit src/services/user_service.py

&lt;span class="c"&gt;# Full clean — audit, fix, verify tests, generate docs&lt;/span&gt;
slop-clean path/to/project/

&lt;span class="c"&gt;# Dry run — show what would change without touching files&lt;/span&gt;
slop-clean path/to/project/ &lt;span class="nt"&gt;--dry-run&lt;/span&gt;

&lt;span class="c"&gt;# Write audit JSON for tooling integration&lt;/span&gt;
slop-audit path/to/project/ &lt;span class="nt"&gt;--output&lt;/span&gt; report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Commands
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;slop-audit&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;slop-audit&lt;/code&gt; scans a file or directory and prints a table of all issues found without touching anything. It exits with code 1 if issues are found, making it a clean CI gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;slop-audit &amp;lt;target&amp;gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--output&lt;/span&gt; JSON] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--threshold&lt;/span&gt; N] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--verbose&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62hiadeh2o86hmc6y3u2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62hiadeh2o86hmc6y3u2.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Exit codes: &lt;code&gt;0&lt;/code&gt; = clean, &lt;code&gt;1&lt;/code&gt; = issues found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;slop-clean&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;slop-clean&lt;/code&gt; runs the full 5-phase pipeline. Always run &lt;code&gt;--dry-run&lt;/code&gt; first on an unfamiliar project; it shows exactly what patches would be applied without touching any file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;slop-clean &amp;lt;target&amp;gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--output&lt;/span&gt; DIR] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--threshold&lt;/span&gt; N] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--verbose&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8fuen6i2plts3rceqho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8fuen6i2plts3rceqho.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Exit codes: &lt;code&gt;0&lt;/code&gt; = success, &lt;code&gt;1&lt;/code&gt; = rollback was triggered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generated Output
&lt;/h2&gt;

&lt;p&gt;After &lt;code&gt;slop-clean&lt;/code&gt; runs, the &lt;code&gt;--output&lt;/code&gt; directory contains three reports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slop-output/
├── ARCHITECTURE.md   — file tree + Mermaid dependency graph
├── FUNCTION_MAP.md   — every symbol with start line and complexity score
└── SLOP_REPORT.md    — issue summary, patches applied, dead-code candidates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ARCHITECTURE.md&lt;/code&gt; gives a structural overview of the codebase with a visual Mermaid call graph, useful for understanding how the project fits together at a glance.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;FUNCTION_MAP.md&lt;/code&gt; is a complete index of every function and class with its start line and complexity score. For large codebases, this is the fastest way to see where complexity is concentrated.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SLOP_REPORT.md&lt;/code&gt; is the actionable output: what was fixed automatically, what was flagged for manual review, and which dead-code candidates need a human decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying It on the Example Projects
&lt;/h2&gt;

&lt;p&gt;Two sample projects ship in &lt;code&gt;examples/&lt;/code&gt; so you can try the tool immediately after installing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Audit only — no test verification needed&lt;/span&gt;
slop-audit examples/todo_app/ &lt;span class="nt"&gt;--verbose&lt;/span&gt;
slop-audit examples/event_pipeline/ &lt;span class="nt"&gt;--verbose&lt;/span&gt;

&lt;span class="c"&gt;# Full clean — dry run first, then apply&lt;/span&gt;
slop-clean examples/todo_app/ &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
slop-clean examples/todo_app/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the example test suites directly, change into the project directory first so their imports resolve correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;examples/todo_app &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;examples/event_pipeline &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;todo_app&lt;/code&gt; is a sample Python app with intentional slop, a good first run to see exactly what the tool catches. &lt;code&gt;event_pipeline&lt;/code&gt; covers tricky import patterns like aliases, multi-line imports, and string annotations, designed to show the AST analysis handling edge cases correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  CI Integration
&lt;/h2&gt;

&lt;p&gt;slop-audit drops directly into CI as a quality gate. It exits 1 if any issues are found, failing the workflow step automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/quality.yml&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;slop-check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install -e .&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slop-audit src/ --threshold &lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches unused imports and complexity regressions on every pull request before they get merged into the main branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slop-cleaner/
├── cli/
│   └── main.py               — entry points + rich-formatted output
├── engines/
│   ├── auditor.py            — Phase 1 · AST-based issue detection
│   ├── analyzer.py           — Phase 2 · call-graph + dead-code finder
│   ├── cleaner.py            — Phase 3 · atomic patch application
│   ├── verifier.py           — Phase 4 · pytest runner + rollback
│   └── documenter.py         — Phase 5 · Markdown report generator
├── parsers/
│   ├── python_parser.py      — tree-sitter Python wrapper
│   └── typescript_parser.py  — tree-sitter TypeScript/TSX wrapper
├── examples/
│   ├── todo_app/             — sample Python app with intentional slop
│   └── event_pipeline/       — sample project with tricky import patterns
├── assets/
│   ├── pipeline.svg          — 5-phase flow diagram
│   ├── features.svg          — feature overview infographic
│   └── before-after.svg      — code transformation visual
└── tests/
    └── test_parsers.py       — 22 tests covering the parsers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The structure maps cleanly to the five phases: one engine per phase, two parsers for Python and TypeScript/TSX, and a single CLI entry point that ties them together.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This tool was designed, built, debugged, and refined entirely using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes, runs, and iterates on real code without hand-holding.&lt;/p&gt;

&lt;p&gt;Every engine, every edge-case fix, and every SVG was produced by NEO in a single session. The five-phase pipeline, the tree-sitter AST parsers for Python and TypeScript, the atomic patch application with backup and rollback, the call graph builder, the pytest runner with automatic rollback on failure, and the three Markdown report generators: all of it was built end-to-end from a high-level problem description.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a CI quality gate on every pull request:&lt;/strong&gt;&lt;br&gt;
Drop &lt;code&gt;slop-audit&lt;/code&gt; into your CI pipeline. Every PR gets checked for unused imports and complexity regressions before it touches main. Because the tool exits with code 1 on issues, it fails the build automatically; no configuration is needed beyond one workflow step. It is already wired for this out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it before a major refactor:&lt;/strong&gt;&lt;br&gt;
Before starting a large refactor, run &lt;code&gt;slop-clean&lt;/code&gt; on the codebase. It removes accumulated import clutter, flags dead code that no longer needs to be worked around, and generates a complexity map. You go into the refactor with a cleaner starting point and a clear picture of where complexity lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to onboard onto an unfamiliar codebase:&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;slop-audit&lt;/code&gt; on a project you have just inherited and read &lt;code&gt;SLOP_REPORT.md&lt;/code&gt;. You get a structured list of unused imports, dead functions, and complexity hotspots: a map of technical debt you can act on rather than discovering piece by piece while working in the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to generate a documentation baseline for a legacy project:&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;slop-clean&lt;/code&gt; and read &lt;code&gt;ARCHITECTURE.md&lt;/code&gt; and &lt;code&gt;FUNCTION_MAP.md&lt;/code&gt;. You get a file tree, a visual call graph, and a complete index of every symbol with its complexity score. For a project with no existing documentation, that is a meaningful starting point.&lt;/p&gt;

&lt;p&gt;The tool is also designed to be extended, and NEO can take any of these directions further without starting from scratch:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JavaScript/JSX parser:&lt;/strong&gt; Two parsers already exist (&lt;code&gt;python_parser.py&lt;/code&gt;, &lt;code&gt;typescript_parser.py&lt;/code&gt;) following the same tree-sitter wrapper pattern. A third for JavaScript/JSX follows the same interface and plugs into all five engines immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional complexity metrics:&lt;/strong&gt; &lt;code&gt;auditor.py&lt;/code&gt; already tracks cyclomatic complexity. Additional metrics like function length follow the same detection pattern and surface automatically in &lt;code&gt;FUNCTION_MAP.md&lt;/code&gt; and the audit output once added; a sketch follows below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-fix for dead code:&lt;/strong&gt; dead code is currently flagged at MEDIUM confidence and left to human judgment. For clearly dead private functions confirmed unreachable by the call graph, automatic removal is a natural next step built directly on the existing &lt;code&gt;analyzer.py&lt;/code&gt; and &lt;code&gt;cleaner.py&lt;/code&gt; infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-commit hook:&lt;/strong&gt; &lt;code&gt;slop-audit&lt;/code&gt; already exits 1 on issues. A small wrapper that hooks into the existing CLI entry point brings slop detection into the local development loop before anything reaches CI.&lt;/p&gt;
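
&lt;p&gt;Two of these extensions are small enough to sketch. For the complexity metrics, a hypothetical function-length detector built on Python's own &lt;code&gt;ast&lt;/code&gt; module could look like this; the names are illustrative, not &lt;code&gt;auditor.py&lt;/code&gt;'s actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast

def function_lengths(source):
    """Map each function name to its line count (an illustrative metric)."""
    tree = ast.parse(source)
    return {
        node.name: node.end_lineno - node.lineno + 1
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

print(function_lengths("def f():\n    return 1\n"))  # {'f': 2}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And for the pre-commit hook, a minimal wrapper that assumes only the documented CLI behaviour (&lt;code&gt;slop-audit&lt;/code&gt; exiting 1 on issues); save it as &lt;code&gt;.git/hooks/pre-commit&lt;/code&gt; and mark it executable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/bin/env python3
# Block the commit whenever slop-audit reports issues (non-zero exit).
import subprocess
import sys

result = subprocess.run(["slop-audit", "src/", "--threshold", "12"])
sys.exit(result.returncode)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
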

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Code clutter is not a cosmetic problem. Unused imports add noise to every code review. Dead functions add surface area that has to be mentally accounted for. High-complexity functions resist change and hide bugs. slop-cleaner addresses all three automatically, with AST-level precision that regex-based tools cannot match, and with a test-verification step that means it never leaves your codebase in a worse state than it found it.&lt;br&gt;
The code is at &lt;a href="https://github.com/dakshjain-1616/Ai_Slop_Cleaner" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Ai_Slop_Cleaner&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Agent Failure Classifier: Post-Hoc Root Cause Analysis for Failed LLM Agent Runs</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Wed, 29 Apr 2026 07:56:24 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/agent-failure-classifier-post-hoc-root-cause-analysis-for-failed-llm-agent-runs-1i79</link>
      <guid>https://dev.to/nilofer_tweets/agent-failure-classifier-post-hoc-root-cause-analysis-for-failed-llm-agent-runs-1i79</guid>
      <description>&lt;p&gt;When an LLM agent fails, the trace is right there, the user turns, the tool calls, the responses, the final result. But knowing what happened and knowing why it failed are two different things. Most teams read traces manually, form a guess, and move on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Failure Classifier&lt;/strong&gt; is a CLI tool and Python library for post-hoc root cause analysis of failed or low-quality LLM agent runs. Feed it any agent trace and it classifies the failure into one of eight named failure modes, identifies the first turn where things went wrong, and produces a structured report with actionable fixes.&lt;/p&gt;

&lt;p&gt;The classifier combines eight fast rule-based detectors with an optional LLM-as-judge pass via OpenRouter. The rule-based layer is free, deterministic, and requires no network access. The LLM pass breaks ties and classifies traces the rules cannot resolve alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Eight Failure Modes
&lt;/h2&gt;

&lt;p&gt;The classifier recognises exactly eight failure modes, each with a precise definition:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HALLUCINATION:&lt;/strong&gt; Agent stated facts or called tools that do not exist&lt;br&gt;
&lt;strong&gt;TOOL_MISUSE:&lt;/strong&gt; Agent called a real tool with wrong parameters or at the wrong time&lt;br&gt;
&lt;strong&gt;CONTEXT_LOSS:&lt;/strong&gt; Agent forgot earlier decisions or repeated already-completed steps&lt;br&gt;
&lt;strong&gt;CIRCULAR_REASONING:&lt;/strong&gt; Agent looped between the same 2-3 steps without making progress&lt;br&gt;
&lt;strong&gt;GOAL_DRIFT:&lt;/strong&gt; Agent started pursuing a sub-goal and forgot the original task&lt;br&gt;
&lt;strong&gt;OVER_REFUSAL:&lt;/strong&gt; Agent refused an action it was capable of and should have taken&lt;br&gt;
&lt;strong&gt;SCHEMA_ERROR:&lt;/strong&gt; Agent generated malformed JSON for a tool call or structured output&lt;br&gt;
&lt;strong&gt;TIMEOUT_CASCADE:&lt;/strong&gt; One slow tool call caused the agent to rush or skip subsequent steps&lt;/p&gt;

&lt;p&gt;These are not fuzzy categories. Each one maps to a specific detector with specific signals. A hallucination is flagged when the agent asserts a factual claim without invoking any retrieval tool. A timeout cascade is flagged when a tool call exceeds a latency threshold and the subsequent agent turn is unusually short relative to the tool output.&lt;/p&gt;
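
&lt;p&gt;As a minimal sketch of that first signal, assuming the trace JSON shape shown later in this post (the real detector is more nuanced about what counts as a factual claim):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def looks_like_hallucination(trace):
    """Flag traces where the agent answered without any tool invocation."""
    tool_turns = [t for t in trace["turns"] if t.get("role") == "tool"]
    agent_turns = [t for t in trace["turns"] if t.get("role") == "agent"]
    # Core signal: an answer was produced, but no retrieval tool ever ran.
    return bool(agent_turns) and not tool_turns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
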
&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The classification pipeline runs in two layers.&lt;/p&gt;

&lt;p&gt;The rule-based layer runs eight deterministic detectors over the trace. Each detector looks for specific structural signals: repeated tool calls with identical inputs, cycles in agent turn content, latency spikes followed by short responses, malformed JSON in tool call outputs. This layer runs offline, requires no API key, and classifies all eight failure modes.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;LLM-as-judge&lt;/strong&gt; layer is optional. When enabled, it receives traces the rule-based layer couldn't resolve with high confidence and breaks ties. The judge runs via OpenRouter and can be pointed at any OpenRouter model or a local OpenAI-compatible server (Ollama, vLLM, llama.cpp).&lt;br&gt;
Every classification produces a structured report with the classified failure mode, a confidence score, the first turn where the failure was detected, a root cause summary, and a list of actionable fixes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/agent-failure-classifier
&lt;span class="nb"&gt;cd &lt;/span&gt;agent-failure-classifier
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.8+. The only dependencies are &lt;code&gt;pydantic&lt;/code&gt;, &lt;code&gt;rich&lt;/code&gt;, &lt;code&gt;click&lt;/code&gt;, and &lt;code&gt;requests&lt;/code&gt;. The rule-based layer runs with no additional setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Judge Setup (Optional)&lt;/strong&gt;&lt;br&gt;
To enable the LLM-judge pass, copy &lt;code&gt;.env.example&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt; and set your OpenRouter key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# edit .env and set OPENROUTER_API_KEY=sk-or-...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv2sdra53suzkiojox5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv2sdra53suzkiojox5t.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without any key, pass &lt;code&gt;--no-llm&lt;/code&gt; to every classify or batch call. The rule-based layer alone classifies all eight failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI
&lt;/h2&gt;

&lt;p&gt;The CLI is exposed as both a console script (&lt;code&gt;agent-failure-classifier&lt;/code&gt;) and an importable module (&lt;code&gt;python -m agent_failure_classifier.cli&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classify a single trace&lt;/strong&gt;&lt;br&gt;
The core command takes a trace JSON file and returns a structured report. &lt;code&gt;--no-llm&lt;/code&gt; keeps it offline, rule-based only, no API call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier classify &lt;span class="nt"&gt;--trace&lt;/span&gt; traces/hallucination_example.json &lt;span class="nt"&gt;--no-llm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key flags:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdabmg2rqnqa9i9tyks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdabmg2rqnqa9i9tyks.png" alt=" " width="746" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate a trace&lt;/strong&gt;&lt;br&gt;
Before classifying, &lt;code&gt;validate&lt;/code&gt; parses the trace and prints its structure: trace ID, goal, turn count, and a preview of each turn. Useful for confirming the trace loaded correctly before running classification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier validate &lt;span class="nt"&gt;--trace&lt;/span&gt; traces/hallucination_example.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Batch classification&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;batch&lt;/code&gt; runs classification over every &lt;code&gt;*.json&lt;/code&gt; file in a directory and produces a failure-mode distribution table plus a per-trace summary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier batch &lt;span class="nt"&gt;--traces-dir&lt;/span&gt; ./traces/ &lt;span class="nt"&gt;--no-llm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Worked Examples
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Example 1 - Hallucination&lt;/strong&gt;&lt;br&gt;
The trace has a user asking for WWII death statistics. The agent responds directly with a factual claim, no tool call, no retrieval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hallucination-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"original_goal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Get population statistics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"final_result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"70 million people died in WWII."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"is_successful"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"turns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"turn_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"How many people died in WWII?"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"turn_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"70 million people died in WWII."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Classification: &lt;code&gt;HALLUCINATION&lt;/code&gt;, confidence 75%, first failure at turn 1. The detector flags that the agent asserted a factual claim without invoking any retrieval tool. Recommended fixes include adding a fact-checking step, requiring tool verification for factual claims, and implementing retrieval-augmented generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 2 - Circular Reasoning&lt;/strong&gt;&lt;br&gt;
Four turns alternating between &lt;code&gt;"Let me analyze this step by step."&lt;/code&gt; and &lt;code&gt;"I need more information."&lt;/code&gt; The agent makes no progress across the entire trace.&lt;br&gt;
Classification: &lt;code&gt;CIRCULAR_REASONING&lt;/code&gt;, confidence 80%. The rule-based detector identifies a 2-step cycle repeating across agent turns and recommends a maximum-iteration limit plus state-change detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 3 - Timeout Cascade&lt;/strong&gt;&lt;br&gt;
A &lt;code&gt;slow_api&lt;/code&gt; tool call with &lt;code&gt;latency_ms: 6000&lt;/code&gt; followed by a one-word agent response &lt;code&gt;"OK"&lt;/code&gt;.&lt;br&gt;
Classification: &lt;code&gt;TIMEOUT_CASCADE&lt;/code&gt;, confidence 70%. The detector flags the latency breach and notes that the subsequent agent turn is a one-word response, less than half the length of the tool output, indicating the agent rushed through the remaining steps.&lt;/p&gt;
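
&lt;p&gt;The timeout-cascade signal can be sketched the same way; the 5000 ms threshold and field names here are illustrative, drawn from the trace format above rather than the classifier's actual internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def looks_like_timeout_cascade(turns, threshold_ms=5000):
    """Flag a slow tool call followed by an unusually short agent turn."""
    for prev, nxt in zip(turns, turns[1:]):
        slow = prev.get("role") == "tool" and prev.get("latency_ms", 0) &amp;gt; threshold_ms
        rushed = (nxt.get("role") == "agent"
                  and len(nxt.get("content", "")) &amp;lt; len(prev.get("tool_output", "")) / 2)
        if slow and rushed:
            return True
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
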
&lt;h2&gt;
  
  
  Python API
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Classify a trace programmatically&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.classifier&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FailureClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;

&lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traces/hallucination_example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FailureClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classified_failure_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root_cause_summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Record a trace live with TraceRecorder&lt;/strong&gt;&lt;br&gt;
Rather than constructing trace JSON by hand, &lt;code&gt;TraceRecorder&lt;/code&gt; is a context manager that captures an agent run as it executes and writes a trace file to disk on exit. The output is immediately compatible with the CLI and with &lt;code&gt;FailureClassifier&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.recorder&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceRecorder&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TraceRecorder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find Italian restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./traces&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find Italian restaurants near me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Searching...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;italian restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;tool_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Luigi Bistro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pasta Palace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I found Luigi Bistro and Pasta Palace.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_final_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found Luigi Bistro and Pasta Palace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_successful&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On exit the trace is saved to &lt;code&gt;./traces/trace_&amp;lt;id&amp;gt;_&amp;lt;timestamp&amp;gt;.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parse traces from other frameworks&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;AutoParser&lt;/code&gt; auto-detects and normalises three input formats into the canonical &lt;code&gt;AgentTrace&lt;/code&gt; model. No manual conversion needed regardless of where the trace came from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.formats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoParser&lt;/span&gt;

&lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AutoParser&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;parse_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/trace.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three supported formats are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native / generic:&lt;/strong&gt; a dict with &lt;code&gt;trace_id&lt;/code&gt;, &lt;code&gt;original_goal&lt;/code&gt;, &lt;code&gt;is_successful&lt;/code&gt;, and a turns list. This is the format emitted by &lt;code&gt;TraceRecorder&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith run export:&lt;/strong&gt; a dict with &lt;code&gt;run_type&lt;/code&gt;, &lt;code&gt;inputs&lt;/code&gt;, &lt;code&gt;outputs&lt;/code&gt;, and optional &lt;code&gt;child_runs&lt;/code&gt;. Tool child runs become TOOL turns; chain and LLM child runs become AGENT turns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph state dict:&lt;/strong&gt; a dict with &lt;code&gt;thread_id&lt;/code&gt; and a &lt;code&gt;state.messages&lt;/code&gt; list whose entries use type values &lt;code&gt;human&lt;/code&gt;, &lt;code&gt;ai&lt;/code&gt;, and &lt;code&gt;tool&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A minimal list of dicts (&lt;code&gt;[{"role": "...", "content": "..."}, ...]&lt;/code&gt;) is also accepted by the generic parser.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

&lt;p&gt;The problem was defined at a high level: a tool that takes any agent trace, runs deterministic detectors over it, and classifies the failure into a named category with a structured report and actionable fixes. NEO generated the full implementation: the eight rule-based detectors, the &lt;code&gt;FailureClassifier&lt;/code&gt; orchestration layer, the optional LLM-as-judge pass via OpenRouter, the &lt;code&gt;TraceRecorder&lt;/code&gt; context manager, the &lt;code&gt;AutoParser&lt;/code&gt; with support for native, LangSmith, and LangGraph formats, and the Click-based CLI with classify, validate, and batch commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Build Further With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a CI/CD quality gate for your agent.&lt;/strong&gt;&lt;br&gt;
If you're shipping an LLM agent, you can integrate the classifier directly into your deployment pipeline. Record traces from your test suite with &lt;code&gt;TraceRecorder&lt;/code&gt;, run &lt;code&gt;batch&lt;/code&gt; classification on every pull request, and fail the build if a new failure mode appears or if the rate of a known one spikes. You get a systematic regression check on agent behaviour, not just on code.&lt;/p&gt;
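
&lt;p&gt;A rough sketch of that gate, using the &lt;code&gt;FailureClassifier&lt;/code&gt; and &lt;code&gt;AgentTrace&lt;/code&gt; classes shown above; the gating policy and the assumption that every trace in &lt;code&gt;./traces&lt;/code&gt; is a failing run are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import sys
from pathlib import Path

from agent_failure_classifier.classifier import FailureClassifier
from agent_failure_classifier.models import AgentTrace

ALLOWED = {"OVER_REFUSAL"}  # failure modes the team has chosen to tolerate
clf = FailureClassifier(use_llm=False)

unexpected = set()
for path in Path("./traces").glob("*.json"):
    report = clf.classify(AgentTrace(**json.loads(path.read_text())))
    mode = str(report.classified_failure_mode)
    if mode not in ALLOWED:
        unexpected.add(mode)

if unexpected:
    print("unexpected failure modes:", sorted(unexpected))
    sys.exit(1)  # fail the build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
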

&lt;p&gt;&lt;strong&gt;Use it to understand where your agent breaks most.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;batch&lt;/code&gt; classification across a directory of historical traces and look at the failure mode distribution. If CONTEXT_LOSS shows up in 40% of your traces, that's a signal about your agent's memory design, not a one-off bug. This turns debugging from reactive to diagnostic: you're looking at patterns across runs, not reading individual traces one by one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it as a live monitoring layer in a multi-agent system.&lt;/strong&gt;&lt;br&gt;
The classifier runs as an A2A agent, which means it can sit as a node in a multi-agent pipeline. Any agent in the system can send its trace to the classifier after each run and get a structured failure report back. An orchestrator can use that signal to decide whether to retry, reroute, or escalate without any human in the loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it during agent development to catch regressions early.&lt;/strong&gt;&lt;br&gt;
Wrap &lt;code&gt;TraceRecorder&lt;/code&gt; around your agent during development. Every run produces a trace. Feed those traces into the classifier after each session and you'll know immediately if a change introduced a new failure mode. It's the difference between finding out something broke in production versus finding out in your local environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Agent Failure Classifier turns trace debugging from a manual read-and-guess process into a systematic one. Eight named failure modes, a deterministic rule-based layer that runs offline, an optional LLM judge for ambiguous cases, and support for traces from native formats, LangSmith, and LangGraph, all producing a structured report with the first failure turn and actionable fixes.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/agent-failure-classifier" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/agent-failure-classifier&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>agents</category>
    </item>
    <item>
      <title>Synthetic Data Flywheel: A Closed-Loop Pipeline for Instruction-Tuning Data</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 28 Apr 2026 10:44:54 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/synthetic-data-flywheel-a-closed-loop-pipeline-for-instruction-tuning-data-c85</link>
      <guid>https://dev.to/nilofer_tweets/synthetic-data-flywheel-a-closed-loop-pipeline-for-instruction-tuning-data-c85</guid>
<description>&lt;p&gt;Fine-tuning a model requires data. Good data requires human labeling. Human labeling doesn't scale. And most synthetic generation pipelines stop at generation: they produce candidate pairs but have no mechanism to filter them, measure quality, or feed failure cases back into the next round.&lt;br&gt;
&lt;strong&gt;Synthetic Data Flywheel&lt;/strong&gt; is a closed-loop pipeline that handles the full cycle: generate candidate instruction-output pairs, validate them deterministically, score them with an LLM-as-judge, calibrate that judge against human labels, export clean training data, and feed the failure cases from one cycle as seeds into the next. It ships as a CLI, a Python library, and an A2A-protocol agent surface for multi-agent orchestration.&lt;/p&gt;

&lt;p&gt;Everything except the optional fine-tuning step runs on CPU.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Synthetic data generation without a quality gate produces noise at scale. And quality gates without calibration produce a judge whose scores you can't trust. The flywheel addresses both: every candidate pair is scored, every score can be validated against human labels, and every failure becomes signal for the next generation cycle rather than a dead end.&lt;/p&gt;
&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;A dataset moves through a series of additive stages, each producing artifacts keyed by the dataset name. Every stage is idempotent and re-runnable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation:&lt;/strong&gt; Candidate pairs are produced from seed prompts via OpenRouter, using one of four prompt templates: QA, INSTRUCTION, REASONING, or CREATIVE.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation:&lt;/strong&gt; Deterministic checks run over each pair: schema, length, dedup, PII, language, profanity. Results are written as a JSON report with severity levels (error, warning). A cleaned copy of the dataset can be written at this stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judging:&lt;/strong&gt; An LLM-as-judge scores each pair against a rubric. The judge supports three backends: Ollama, OpenRouter, and Anthropic. Judgments are cached on disk keyed by &lt;code&gt;(backend, model, pair.id, rubric.name@version)&lt;/code&gt;, so repeated judge passes on unchanged pairs are free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Labeling:&lt;/strong&gt; Three modes: interactive (a human reviews pairs one by one), bulk (apply a status to a filtered subset), and auto-from-judge (derive labels from judgment scores above a threshold). Labels are stored append-only so sessions can be interrupted and resumed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calibration:&lt;/strong&gt; Treats human labels (&lt;code&gt;status == approved&lt;/code&gt;) as ground truth and measures the judge's precision, recall, F1, and accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compare:&lt;/strong&gt; Two or more judgment runs on the same dataset are compared: pass-agreement, Cohen's kappa, and Pearson correlation on the overall score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Export:&lt;/strong&gt; Pairs that clear the judge filter are written to a train/val split. The filter expression uses a safe evaluator: only arithmetic, comparisons, and subscript access into the context dict are allowed. Attribute access and function calls are rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cycle feedback:&lt;/strong&gt; Failure instructions from one cycle are extracted and fed as additional seeds into cycle N+1. The autonomous loop stops when the pass rate drops below &lt;code&gt;min_pass_rate&lt;/code&gt; (default 0.5) or &lt;code&gt;max_cycles&lt;/code&gt; is reached.&lt;/p&gt;
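
&lt;p&gt;A minimal sketch of that loop, with stubbed generate and judge stages standing in for the OpenRouter and Ollama calls; the real loop is &lt;code&gt;flywheel run&lt;/code&gt;, shown below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

def generate(seeds):
    """Stub: one candidate pair per seed (OpenRouter in the real pipeline)."""
    return [{"instruction": s, "output": "draft answer for " + s} for s in seeds]

def judge(pair):
    """Stub: random pass/fail in place of the LLM-as-judge."""
    return random.random() &amp;gt; 0.3

def run_flywheel(seeds, min_pass_rate=0.5, max_cycles=5):
    for cycle in range(1, max_cycles + 1):
        pairs = generate(seeds)
        failures = [p for p in pairs if not judge(p)]
        pass_rate = 1 - len(failures) / len(pairs)
        print(f"cycle {cycle}: pass rate {pass_rate:.0%}")
        if pass_rate &amp;lt; min_pass_rate:
            break
        # Failure instructions become extra seeds for cycle N+1.
        seeds = seeds + [p["instruction"] for p in failures]

run_flywheel(["benefits of green tea", "history of python language"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
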
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/synthetic-data-flywheel
&lt;span class="nb"&gt;cd &lt;/span&gt;synthetic-data-flywheel
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.11+. Generation requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;. The local judge path requires Ollama, verified against &lt;code&gt;gemma4:latest&lt;/code&gt;. Fine-tuning requires Unsloth and a GPU; the repo was verified on a free Colab T4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initialize&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;init&lt;/code&gt; creates the directory structure the rest of the pipeline writes into.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Synthetic Data Flywheel Initialized
Data Directory: ./data
Checkpoint Directory: ./data/checkpoints
Report Directory: ./reports
Directories created successfully
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ingest&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;ingest&lt;/code&gt; normalises an existing dataset into the flywheel's internal JSONL format. It supports jsonl, csv, and HuggingFace datasets, and accepts a field mapping flag when the source uses different column names.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; demo.jsonl &lt;span class="nt"&gt;-n&lt;/span&gt; demo &lt;span class="nt"&gt;--tag&lt;/span&gt; demo1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ingested 8 pairs -&amp;gt; data/user/demo.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other ingest forms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; data.csv              &lt;span class="nt"&gt;-n&lt;/span&gt; my_dataset &lt;span class="nt"&gt;-f&lt;/span&gt; csv
flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; hf://tatsu-lab/alpaca &lt;span class="nt"&gt;-n&lt;/span&gt; alpaca &lt;span class="nt"&gt;--limit&lt;/span&gt; 500 &lt;span class="nt"&gt;--hf-split&lt;/span&gt; train
flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; data.jsonl &lt;span class="nt"&gt;-n&lt;/span&gt; aliased &lt;span class="nt"&gt;--map&lt;/span&gt; &lt;span class="s2"&gt;"instruction=prompt,output=completion"&lt;/span&gt;
flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; data.jsonl &lt;span class="nt"&gt;-n&lt;/span&gt; x &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each successful ingest writes &lt;code&gt;data/user/&amp;lt;name&amp;gt;.jsonl&lt;/code&gt; and &lt;code&gt;data/user/&amp;lt;name&amp;gt;.meta.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate&lt;/strong&gt;&lt;br&gt;
Before any judging happens, the validator runs deterministic checks over the dataset. This catches structural problems (duplicate pairs, PII, malformed schema) before spending LLM calls on them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel validate &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--checks&lt;/span&gt; schema,length,dedup,pii &lt;span class="nt"&gt;--write-clean&lt;/span&gt; data/user/demo.clean.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Validation: demo
  Total pairs       8
  pii               1
  severity:warning  1
Report: data/validation/demo.report.json
Clean dataset written (8 pairs): data/user/demo.clean.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--fail-on error|warning|never&lt;/code&gt; flag lets you gate CI on validation issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judge&lt;/strong&gt;&lt;br&gt;
With a clean dataset, the judge scores each pair against a rubric. The default rubric is built-in; custom rubrics can be passed with &lt;code&gt;--rubric&lt;/code&gt;. Results are cached, so re-running after adding new pairs only scores the new ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel judge &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--backend&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:latest &lt;span class="nt"&gt;--tag&lt;/span&gt; v1 &lt;span class="nt"&gt;--max-pairs&lt;/span&gt; 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Judging 3 pairs with ollama:gemma4:latest
  Judged                3
  Passed                0 (0.0%)
  Avg overall (scored)  5.00
  Output                data/judgments/demo.v1.jsonl
  Cache                 hits=0 misses=3 writes=3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Judgments land at &lt;code&gt;data/judgments/&amp;lt;dataset&amp;gt;.&amp;lt;tag&amp;gt;.jsonl&lt;/code&gt;. The &lt;code&gt;--tag&lt;/code&gt; flag is how multiple judgment runs on the same dataset are tracked separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Label&lt;/strong&gt;&lt;br&gt;
Labeling bridges human judgment and automated scoring. &lt;code&gt;auto-from-judge&lt;/code&gt; derives labels directly from the judgment scores: pairs above the threshold are approved, pairs below are rejected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel label &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--mode&lt;/span&gt; auto-from-judge &lt;span class="nt"&gt;--judgments&lt;/span&gt; data/judgments/demo.v1.jsonl &lt;span class="nt"&gt;--reject-below&lt;/span&gt; 3.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For manual review, &lt;code&gt;--mode interactive&lt;/code&gt; walks through pairs one by one. For bulk operations, &lt;code&gt;--mode bulk&lt;/code&gt; applies a status to a filtered subset. All labels are stored append-only at &lt;code&gt;data/labels/&amp;lt;dataset&amp;gt;.jsonl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compare&lt;/strong&gt;&lt;br&gt;
When you have two judgment runs, say from two different models, &lt;code&gt;compare&lt;/code&gt; measures how much they agree. Cohen's kappa close to 1.0 means the two judges are making the same pass/fail decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel compare &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--tags&lt;/span&gt; judge_a,judge_b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Judge comparison: judge_a vs judge_b
  Common pairs          8
  judge_a passed / mean 6 / 7.44
  judge_b passed / mean 6 / 7.19
  Pass agreement        100.0%
  Cohen's kappa (p/f)   1.000  (near-perfect)
  Score Pearson r       0.965
  Output                reports/demo/compare.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
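
&lt;p&gt;For reference, Cohen's kappa corrects raw agreement for chance. A quick sketch of the computation over two judges' pass/fail verdicts, using the counts from the run above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cohens_kappa(a, b):
    """Kappa over two parallel lists of boolean pass/fail decisions."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa, pb = sum(a) / n, sum(b) / n              # each judge's pass rate
    p_e = pa * pb + (1 - pa) * (1 - pb)          # agreement expected by chance
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

a = [1, 1, 1, 1, 1, 1, 0, 0]  # judge_a: 6 of 8 passed
b = [1, 1, 1, 1, 1, 1, 0, 0]  # judge_b: identical decisions
print(cohens_kappa(a, b))     # 1.0, matching the report above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
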



&lt;p&gt;&lt;strong&gt;Calibrate&lt;/strong&gt;&lt;br&gt;
Calibration answers the question you must settle before trusting your judge: does its &lt;code&gt;passed&lt;/code&gt; decision align with human labels? Precision of 1.0 means every pair the judge passed, a human also approved. Recall of 0.75 means the judge missed 25% of the pairs humans would have kept.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel calibrate &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--tag&lt;/span&gt; judge_a &lt;span class="nt"&gt;--approved-is&lt;/span&gt; approved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Evaluated pairs  8
  Precision        1.000
  Recall           0.750
  F1               0.857
  Accuracy         0.750
  TP/FP/TN/FN      6/0/0/2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
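
&lt;p&gt;Reproducing those metrics from the confusion counts makes the definitions concrete:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;tp, fp, tn, fn = 6, 0, 0, 2
precision = tp / (tp + fp)                          # 1.000: every judge-passed pair was human-approved
recall = tp / (tp + fn)                             # 0.750: 2 human-approved pairs were missed
f1 = 2 * precision * recall / (precision + recall)  # 0.857
accuracy = (tp + tn) / (tp + fp + tn + fn)          # 0.750
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
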



&lt;p&gt;&lt;strong&gt;Visualize&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;visualize&lt;/code&gt; renders a suite of PNG charts and an &lt;code&gt;index.html&lt;/code&gt; for a dataset — covering label distribution, score distributions, pass/fail breakdown, pair lengths, categories, judge agreement matrix, and validation results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel visualize &lt;span class="nt"&gt;-d&lt;/span&gt; demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;categories      reports/demo/categories.png
  lengths         reports/demo/lengths.png
  validation      reports/demo/validation.png
  pass_fail       reports/demo/pass_fail.png
  scores          reports/demo/scores.png
  criteria        reports/demo/criteria.png
  labels          reports/demo/labels.png
  judge_agreement reports/demo/judge_agreement.png
  index.html      reports/demo/index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dataset inspection and export&lt;/strong&gt;&lt;br&gt;
Before exporting, &lt;code&gt;dataset ls&lt;/code&gt; and &lt;code&gt;dataset info&lt;/code&gt; show what artifacts exist for each dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel dataset &lt;span class="nb"&gt;ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name   pairs  source  tags
  demo   8      jsonl   demo1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel dataset info demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pairs       data/user/demo.jsonl               present
  meta        data/user/demo.meta.json           present
  validation  data/validation/demo.report.json   present
  labels      data/labels/demo.jsonl             present
  judgments   data/judgments                     5 set(s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Export filters pairs using a safe expression: here, only pairs with an overall score of 7 or above are written, split 80/20 into train and val.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel dataset &lt;span class="nb"&gt;export &lt;/span&gt;demo &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--to&lt;/span&gt; data/exports/demo.jsonl &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt; jsonl &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--judgments&lt;/span&gt; data/judgments/demo.judge_a.jsonl &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"scores['overall'] &amp;gt;= 7"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--split&lt;/span&gt; &lt;span class="nv"&gt;train&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.8,val&lt;span class="o"&gt;=&lt;/span&gt;0.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wrote 4 pairs -&amp;gt; data/exports/demo.train.jsonl
Wrote 2 pairs -&amp;gt; data/exports/demo.val.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
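
&lt;p&gt;The whitelist idea behind the safe evaluator can be sketched with Python's &lt;code&gt;ast&lt;/code&gt; module; this illustrates the approach, not the project's actual implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast

ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Compare, ast.BoolOp,
           ast.Subscript, ast.Name, ast.Constant, ast.Load,
           ast.operator, ast.unaryop, ast.cmpop, ast.boolop)

def safe_eval(expr, context):
    """Evaluate a filter expression, rejecting attribute access and calls."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):  # e.g. ast.Attribute, ast.Call
            raise ValueError("disallowed: " + type(node).__name__)
    return eval(compile(tree, "&amp;lt;filter&amp;gt;", "eval"), {"__builtins__": {}}, context)

print(safe_eval("scores['overall'] &amp;gt;= 7", {"scores": {"overall": 8}}))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
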



&lt;p&gt;&lt;strong&gt;Run the autonomous loop&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;flywheel run&lt;/code&gt; ties everything together into a seeds-to-checkpoint cycle. Generation goes through OpenRouter; judging goes through Ollama. If Ollama isn't running, generation still succeeds and pairs are saved in the checkpoint, but every judgment falls back to &lt;code&gt;passed=false&lt;/code&gt;. The standalone &lt;code&gt;flywheel judge --backend openrouter&lt;/code&gt; works fully without Ollama.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;meta-llama/llama-3.2-3b-instruct

flywheel run &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"benefits of green tea,history of python language"&lt;/span&gt; &lt;span class="nt"&gt;--max-cycles&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╭───── Configuration ─────╮
│ Synthetic Data Flywheel │
│ Seeds: 2                │
│ Max Cycles: 1           │
╰─────────────────────────╯
Starting Flywheel with max_cycles=1
============================================================
Starting Cycle 1
============================================================
Using 2 seeds
Generating synthetic data...
Generated 2 pairs
Judging quality...
Passed: 0, Failed: 2
Cycle 1 complete. Pass rate: 0.00%
Flywheel complete. Ran 1 cycles.
       Flywheel Summary
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric             ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Cycles       │ 1     │
│ Total Passed Pairs │ 0     │
│ Avg Pass Rate      │ 0.00% │
└────────────────────┴───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each cycle writes a checkpoint. The generated pair is saved verbatim inside &lt;code&gt;data/checkpoints/checkpoint_001.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"benefits of green tea"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Here is an example of an instruction-following training data in JSON format:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;{&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;instruction&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;What are some of the benefits of drinking green tea?&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;output&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Green tea has numerous benefits, including: - High antioxidant content - Anti-inflammatory properties - May help with weight loss ...&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;category&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;instruction&lt;/span&gt;&lt;span class="se"&gt;\"\n&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_seed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"benefits of green tea"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Status and report&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel status
flywheel report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;status&lt;/code&gt; summarises checkpoint state. &lt;code&gt;report&lt;/code&gt; produces an HTML report across cycles written to &lt;code&gt;reports/flywheel_report_&amp;lt;timestamp&amp;gt;.html&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;flywheel --help&lt;/code&gt; lists the command groups. Every command has &lt;code&gt;--help&lt;/code&gt; with full flag docs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;flywheel &lt;span class="nt"&gt;--help&lt;/span&gt;
Usage: flywheel &lt;span class="o"&gt;[&lt;/span&gt;OPTIONS] COMMAND &lt;span class="o"&gt;[&lt;/span&gt;ARGS]...

  Synthetic Data Flywheel - Autonomous data generation pipeline.

Commands:
  calibrate  Measure judge &lt;span class="s1"&gt;'passed'&lt;/span&gt; against human labels &lt;span class="o"&gt;(&lt;/span&gt;precision/recall/F1&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
  compare    Compare two+ judgment runs &lt;span class="o"&gt;(&lt;/span&gt;Cohen&lt;span class="s1"&gt;'s kappa, agreement, ...).
  dataset    Dataset management: ls | info | export.
  ingest     Ingest a user dataset into the flywheel'&lt;/span&gt;s JSONL format.
  init       Initialize flywheel configuration.
  judge      Judge a dataset with an LLM-as-judge backend.
  label      Label a dataset: interactive/bulk/auto-from-judge.
  pipeline   Run declarative YAML pipelines.
  report     Generate HTML report from checkpoints.
  run        Run the synthetic data flywheel.
  status     Show current flywheel status.
  validate   Validate a dataset and write a ValidationReport.
  visualize  Render a suite of PNG charts + index.html &lt;span class="k"&gt;for &lt;/span&gt;a dataset.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pipeline Runner
&lt;/h2&gt;

&lt;p&gt;Individual commands can be composed into a declarative YAML pipeline and run as a single step. This is useful for repeatable workflows: the pipeline dispatches through the same Click commands as manual runs, so behaviour is identical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pipeline_demo.yaml&lt;/span&gt;
&lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;checks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;length&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;dedup&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;export&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/user/demo_pipeline.jsonl&lt;/span&gt;
      &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jsonl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel pipeline run pipeline_demo.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1/2] flywheel validate -d demo --checks schema,length,dedup
[2/2] flywheel dataset export demo --to data/user/demo_pipeline.jsonl --format jsonl
   Pipeline: demo
  1  validate  ok  0
  2  export    ok  0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Python API
&lt;/h2&gt;

&lt;p&gt;The full pipeline is available as a library. The minimal end-to-end call scores a dataset with an async judge backed by Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.ingest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset_jsonl&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.rubrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;default_rubric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.judge&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncQualityJudge&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.judge_backends&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_backend&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.judge_cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JudgmentCache&lt;/span&gt;

&lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset_jsonl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/user/demo.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_backend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncQualityJudge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;default_rubric&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;JudgmentCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.cache/judge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;backend_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;judgments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;judge_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;judgments&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judgments&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The statistical functions used internally by &lt;code&gt;calibrate&lt;/code&gt; and &lt;code&gt;compare&lt;/code&gt; are also directly callable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohens_kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pearson&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prf&lt;/span&gt;

&lt;span class="nf"&gt;cohens_kappa&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# 0.5
&lt;/span&gt;
&lt;span class="nf"&gt;pearson&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# 0.8315...
&lt;/span&gt;
&lt;span class="nf"&gt;prf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'accuracy': 0.5,
#  'tp': 1, 'fp': 1, 'tn': 1, 'fn': 1}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A2A Agent
&lt;/h2&gt;

&lt;p&gt;The flywheel exposes a FastAPI application implementing the A2A protocol surface (&lt;code&gt;/a2a/capabilities&lt;/code&gt;, &lt;code&gt;/a2a/tasks/send&lt;/code&gt;, &lt;code&gt;/a2a/tasks/get&lt;/code&gt;, &lt;code&gt;/a2a/tasks/cancel&lt;/code&gt;) so it can be orchestrated as a node in a multi-agent ML pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; synthetic_data_flywheel.a2a_agent
&lt;span class="c"&gt;# or&lt;/span&gt;
uvicorn synthetic_data_flywheel.a2a_agent:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three capabilities are exposed: &lt;code&gt;generate_synthetic_data&lt;/code&gt;, &lt;code&gt;get_status&lt;/code&gt;, &lt;code&gt;generate_report&lt;/code&gt;. Querying &lt;code&gt;/a2a/capabilities&lt;/code&gt; returns the agent's identity and the full capability list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.testclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TestClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.a2a_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TestClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/a2a/capabilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# {'agent_name': 'synthetic_data_flywheel', 'version': '0.1.0',
#  'capabilities': [{'name': 'generate_synthetic_data', ...},
#                   {'name': 'get_status', ...},
#                   {'name': 'generate_report', ...}]}
&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/a2a/tasks/send&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# {'task_id': '...', 'status': {'state': 'completed'},
#  'result': {'type': 'status_result',
#             'content': {'checkpoints_found': 1,
#                         'checkpoint_dir': 'data/checkpoints'}}}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;All settings are read from environment variables or a &lt;code&gt;.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;sk-or-...&lt;/span&gt;
&lt;span class="py"&gt;OPENROUTER_MODEL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen3-8b:free&lt;/span&gt;
&lt;span class="py"&gt;OLLAMA_BASE_URL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;
&lt;span class="py"&gt;OLLAMA_MODEL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gemma4:latest&lt;/span&gt;
&lt;span class="py"&gt;DEFAULT_JUDGE_BACKEND&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;ollama        # ollama | openrouter | anthropic&lt;/span&gt;
&lt;span class="py"&gt;JUDGE_CONCURRENCY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;
&lt;span class="py"&gt;JUDGE_TIMEOUT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;600&lt;/span&gt;
&lt;span class="py"&gt;QUALITY_MIN_SCORE&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;7.0&lt;/span&gt;
&lt;span class="py"&gt;MAX_CYCLES&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;PII_POLICY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;warn                     # strict | warn | off&lt;/span&gt;
&lt;span class="py"&gt;A2A_HOST&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="py"&gt;A2A_PORT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;JUDGE_TIMEOUT&lt;/code&gt; defaults to 600 seconds; large local models can take over two minutes on the first call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning requires a GPU:&lt;/strong&gt; &lt;code&gt;Trainer.prepare_training_artifacts&lt;/code&gt; writes a Colab-ready Unsloth notebook under &lt;code&gt;notebooks/training_cycle_NNN.ipynb&lt;/code&gt;. Running the training step locally on CPU is not supported by Unsloth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous generation requires OpenRouter:&lt;/strong&gt; &lt;code&gt;flywheel run&lt;/code&gt; requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;. The in-loop judge is hardcoded to Ollama (&lt;code&gt;engine.create_judge&lt;/code&gt; constructs a sync &lt;code&gt;QualityJudge&lt;/code&gt; over &lt;code&gt;OllamaClient&lt;/code&gt;); if Ollama isn't available, pairs are persisted but every judgment falls back to &lt;code&gt;passed=false&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large local judges are slow to cold-start:&lt;/strong&gt; Gemma 4 (9 GB) takes about 130 seconds the first time it loads into VRAM/RAM. The default &lt;code&gt;JUDGE_TIMEOUT&lt;/code&gt; is 600 seconds to cover this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HuggingFace ingest requires &lt;code&gt;datasets&lt;/code&gt;:&lt;/strong&gt; already a dependency, but gated datasets additionally require &lt;code&gt;HUGGINGFACE_TOKEN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic judge backend requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;:&lt;/strong&gt; no offline fallback.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

&lt;p&gt;The problem was defined at a high level: a closed-loop pipeline that generates synthetic instruction-tuning pairs, filters them with a calibrated LLM judge, and feeds failure cases back as seeds for the next cycle. NEO generated the full implementation: the &lt;code&gt;FlywheelEngine&lt;/code&gt; cycle loop with checkpointing, the &lt;code&gt;AsyncQualityJudge&lt;/code&gt; with three pluggable backends and disk-backed cache, the deterministic &lt;code&gt;Validator&lt;/code&gt; with six check types, the &lt;code&gt;LabelStore&lt;/code&gt; with append-only storage, the statistical calibration layer (&lt;code&gt;cohens_kappa&lt;/code&gt;, &lt;code&gt;pearson&lt;/code&gt;, &lt;code&gt;prf&lt;/code&gt;), the safe-eval export filter, the declarative YAML pipeline runner, the Matplotlib visualisation suite, and the A2A FastAPI agent surface. All 100 tests pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Build Further With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Additional judge backends:&lt;/strong&gt; the three existing backends share a common interface via &lt;code&gt;get_backend&lt;/code&gt;. Any OpenAI-compatible endpoint can be wired in as a new backend, and the judge cache, calibration, and compare logic all work with it immediately without any changes.&lt;/p&gt;
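
&lt;p&gt;As a rough illustration, here is what wiring in an OpenAI-compatible endpoint could look like. This is a sketch, not the library's code: the &lt;code&gt;complete&lt;/code&gt; method name and the constructor arguments are assumptions, so check the interface the existing backends share behind &lt;code&gt;get_backend&lt;/code&gt; before adapting it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical backend sketch; the real backend interface may differ.
import os
import httpx


class OpenAICompatibleBackend:
    """Judge backend for any endpoint speaking the OpenAI chat API."""

    def __init__(self, base_url: str, model: str, api_key_env: str = "OPENAI_API_KEY"):
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.api_key = os.environ.get(api_key_env, "")

    async def complete(self, prompt: str) -&amp;gt; str:
        # POST /chat/completions with one user message; return the reply text.
        async with httpx.AsyncClient(timeout=600) as client:
            resp = await client.post(
                f"{self.base_url}/chat/completions",
                headers={"Authorization": f"Bearer {self.api_key}"},
                json={"model": self.model,
                      "messages": [{"role": "user", "content": prompt}]},
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
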

&lt;p&gt;&lt;strong&gt;Additional generation templates:&lt;/strong&gt; the generator ships with four templates: QA, INSTRUCTION, REASONING, CREATIVE. New domain-specific templates (code generation, structured extraction, tool use) would let the flywheel produce specialised training data while the cycle loop, judge, and export pipeline stay entirely unchanged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional validation checks:&lt;/strong&gt; the &lt;code&gt;Validator&lt;/code&gt; already supports six check types plugged into the same &lt;code&gt;--checks&lt;/code&gt; flag and report format. New checks for domain-specific quality signals would run in the same validation pass and appear in the same JSON report and visualisation output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-judge ensembling:&lt;/strong&gt; &lt;code&gt;compare&lt;/code&gt; already computes agreement metrics across judgment runs. Taking the average or majority vote across two or more judge scores before the pass/fail decision would reduce the noise that small local models introduce, without touching the labeling, calibration, or export logic downstream.&lt;/p&gt;
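
&lt;p&gt;A minimal sketch of the voting step, assuming each run is a list of judgment objects with a boolean &lt;code&gt;passed&lt;/code&gt; field as in the Python API example above (&lt;code&gt;majority_pass&lt;/code&gt; is a hypothetical helper, not part of the library):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical helper: strict-majority vote across aligned judgment runs.
def majority_pass(runs):
    """runs: two or more judgment lists, aligned by pair index."""
    n_runs = len(runs)
    verdicts = []
    for judgments in zip(*runs):          # one tuple of judgments per pair
        votes = sum(1 for j in judgments if j.passed)
        verdicts.append(votes * 2 &amp;gt; n_runs)  # ties count as a fail (conservative)
    return verdicts

# e.g. combine an Ollama run and an OpenRouter run before export:
# final = majority_pass([ollama_judgments, openrouter_judgments])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
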

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Synthetic Data Flywheel closes the loop that most synthetic data pipelines leave open. It generates, validates, judges, calibrates, and exports, then feeds what failed back into the next cycle. The result is a data pipeline that improves with each run rather than producing a static batch.&lt;br&gt;
The code is at &lt;a href="https://github.com/dakshjain-1616/synthetic-data-flywheel" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/synthetic-data-flywheel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>syntheticdata</category>
      <category>opensource</category>
      <category>finetuning</category>
    </item>
    <item>
      <title>Token Budget Negotiator</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Mon, 27 Apr 2026 21:55:15 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/token-budget-negotiator-1ijg</link>
      <guid>https://dev.to/nilofer_tweets/token-budget-negotiator-1ijg</guid>
      <description>&lt;p&gt;Everyone knows long prompts cost money. Almost nobody knows which parts of their prompt actually matter.&lt;/p&gt;

&lt;p&gt;Prompts accumulate over time: a system message, a style guide, a few-shot example or two, some background context. Each addition made sense when it was added. Over hundreds of API calls, the overhead compounds. And the honest answer to "which of these sections can I remove?" is: you don't know until you test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token Budget Negotiator&lt;/strong&gt; makes that test systematic. It takes a prompt split into named, prioritised sections, runs a greedy ablation loop that drops one section at a time, scores the remaining prompt against a rubric using a local or remote LLM judge, and stops when savings hit the target without falling below the quality threshold. The result is the smallest prompt that still behaves like the original.&lt;/p&gt;

&lt;p&gt;It ships as a CLI, a Python library, and an MCP server.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Prompt sections are not equal in value, but there's no principled way to know which ones matter for a given task without testing. Manual trimming is guesswork. Token Budget Negotiator answers the question empirically per section, per task, against a rubric that defines what quality means for that use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;A prompt is defined as a YAML file with named sections. Each section carries a &lt;code&gt;type&lt;/code&gt; (system, few_shot, context, instruction), a &lt;code&gt;content&lt;/code&gt; block, and a &lt;code&gt;priority&lt;/code&gt; integer. Priority determines the order in which sections are considered for removal: low-priority sections are evaluated first, high-priority sections last.&lt;/p&gt;

&lt;p&gt;Before any removal happens, the full prompt is scored by the judge LLM against the rubric. This establishes a baseline. The quality target for the run is &lt;code&gt;baseline_score × threshold&lt;/code&gt;; for example, a baseline of 0.80 with a 0.90 threshold sets the target at 0.72.&lt;/p&gt;

&lt;p&gt;The ablation loop then works through sections in ascending priority order. For each candidate, a test prompt is built without that section and rescored. If the score still meets the target, the section is dropped permanently and the loop continues with the updated prompt. If not, the section is kept and the next candidate is evaluated.&lt;/p&gt;

&lt;p&gt;Two conditions stop the loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token savings reach &lt;code&gt;min_token_savings&lt;/code&gt;; the target has been hit.&lt;/li&gt;
&lt;li&gt;A removal would push savings above &lt;code&gt;max_token_savings&lt;/code&gt;; the ceiling is enforced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every accepted removal is verified to actually reduce the token count. The loop cannot produce a larger prompt than it started with.&lt;br&gt;
The output is a &lt;code&gt;NegotiationResult&lt;/code&gt; containing the original and optimised token counts, the list of sections removed, per-step scores, quality retention percentage, elapsed time, scoring call count, rubric name, and a full ablation log. This can be written to JSON or YAML.&lt;/p&gt;
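
&lt;p&gt;The loop itself is small enough to sketch. The version below is illustrative rather than the library's code: &lt;code&gt;score&lt;/code&gt; stands in for a judge call and &lt;code&gt;count_tokens&lt;/code&gt; for the tokenizer, and the stop conditions mirror the two rules above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative greedy ablation; score() and count_tokens() are stand-ins.
def ablate(sections, score, count_tokens, threshold, min_savings, max_savings):
    baseline = score(sections)
    target = baseline * threshold               # quality floor for the run
    original = count_tokens(sections)
    kept = list(sections)
    for section in sorted(sections, key=lambda s: s.priority):  # low priority first
        trial = [s for s in kept if s is not section]
        savings = 1 - count_tokens(trial) / original
        if savings &amp;gt; max_savings:             # ceiling: skip over-trimming removals
            continue
        if score(trial) &amp;gt;= target and count_tokens(trial) &amp;lt; count_tokens(kept):
            kept = trial                        # drop the section permanently
            if savings &amp;gt;= min_savings:        # target reached: stop early
                break
    return kept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
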
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;token-budget-negotiator
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.11+. The local judge path requires Ollama with a model pulled (verified end-to-end against &lt;code&gt;gemma4:latest&lt;/code&gt;). The OpenRouter path requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze token distribution&lt;/strong&gt;&lt;br&gt;
Before negotiating, &lt;code&gt;analyze&lt;/code&gt; prints how many tokens each section holds and its share of the total budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget analyze examples/prompt.yaml
&lt;span class="go"&gt;
Token Distribution Analysis:
Section              Type              Tokens        % Priority
-----------------------------------------------------------------
system               system                22    18.6%       30
style_guide          system                26    22.0%       10
few_shot_1           few_shot              26    22.0%       20
few_shot_2           few_shot              20    16.9%       25
context              context               12    10.2%       40
instruction          instruction           12    10.2%      100
-----------------------------------------------------------------
TOTAL                                     118   100.0%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check the local judge:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget check-ollama &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:latest
&lt;span class="go"&gt;Ollama is connected
  Host: http://localhost:11434
  Model requested: gemma4:latest
  Model available: Yes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run the negotiator:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget negotiate examples/prompt.yaml &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;    --scorer ollama --model gemma4:latest \
    --threshold 0.80 --min-savings 0.20 --max-savings 0.80 \
    --output result.json --format json

Negotiation Result:
  Original: 118 tokens, score=0.600
  Optimized: 92 tokens, score=0.700
  Savings: 22.0%
  Quality Retention: 116.7%
  Success: Yes
  Sections removed: style_guide

Results saved to result.json
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;result.json&lt;/code&gt; contains the full ablation log, the final optimized prompt, per-step scores, and metadata (elapsed time, scoring call count, rubric name).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run the negotiator - OpenRouter path&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget check-openrouter
&lt;span class="go"&gt;OpenRouter is connected
  Base URL: https://openrouter.ai/api/v1
  Model requested: qwen/qwen3-8b
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget &lt;span class="nt"&gt;-v&lt;/span&gt; negotiate examples/prompt.yaml &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;    --scorer openrouter --model meta-llama/llama-3.2-3b-instruct \
    --rubric rubrics/qa.yaml \
    --threshold 0.7 --min-savings 0.1 --max-savings 0.6 --no-cache
Connected to openrouter

Negotiation Result:
  Original: 118 tokens, score=1.000
  Optimized: 92 tokens, score=0.900
  Savings: 22.0%
  Quality Retention: 90.0%
  Success: Yes
  Sections removed: style_guide
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a looser threshold (&lt;code&gt;-t 0.7 --min-savings 0.1 --max-savings 0.5&lt;/code&gt;) and caching left on, the same model drops two sections for 44.1% savings at 100% quality retention.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F279uww5s8wbbjf93vwt0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F279uww5s8wbbjf93vwt0.png" alt=" " width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key &lt;code&gt;negotiate&lt;/code&gt; flags:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-r, --rubric PATH&lt;/code&gt;:  YAML rubric. Defaults to a built-in accuracy+relevance rubric.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-s, --scorer {ollama,openrouter}&lt;/code&gt;: which judge to use. Default ollama.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-m, --model TEXT&lt;/code&gt;: model name (gemma4:latest for Ollama, qwen/qwen3-8b for OpenRouter, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-t, --threshold FLOAT&lt;/code&gt;: minimum fraction of the baseline score to keep. Default 0.95.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--min-savings FLOAT&lt;/code&gt;: stop once savings reach this fraction. Default 0.40.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--max-savings FLOAT&lt;/code&gt;: never drop sections if it would save more than this. Default 0.60.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-o, --output PATH&lt;/code&gt; / &lt;code&gt;-f, --format {json,yaml}&lt;/code&gt;: write a machine-readable report.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--no-cache&lt;/code&gt;: disable the in-memory scoring cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python API
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;token_budget_negotiator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Negotiator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OllamaScorer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PromptSection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rubric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;token_budget_negotiator.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RubricCriterion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SectionType&lt;/span&gt;

&lt;span class="n"&gt;sections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;PromptSection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are helpful.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;section_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SectionType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SYSTEM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;PromptSection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2+2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;section_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SectionType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INSTRUCTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;rubric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Rubric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa rubric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;RubricCriterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;factually correct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scorer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OllamaScorer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;negotiator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Negotiator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;min_token_savings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_token_savings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;negotiator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;negotiate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_token_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimized_token_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;removed:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sections_removed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Rubric Format
&lt;/h2&gt;

&lt;p&gt;The rubric defines what quality means for the task. The judge scores each test prompt against it. Three rubrics ship in &lt;code&gt;rubrics/&lt;/code&gt;: &lt;code&gt;qa.yaml&lt;/code&gt;, &lt;code&gt;coding.yaml&lt;/code&gt;, &lt;code&gt;summarization.yaml&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qa&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;General question-answer rubric&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0"&lt;/span&gt;
&lt;span class="na"&gt;criteria&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;accuracy&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Is the response factually correct?&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;relevance&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Does it answer what was asked?&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
&lt;span class="na"&gt;scoring_instructions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Score 0-1. 1 = perfect, 0 = wrong or irrelevant.&lt;/span&gt;
&lt;span class="na"&gt;output_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  MCP Server
&lt;/h2&gt;

&lt;p&gt;The library also runs as an MCP server over stdio transport, exposing two tools, &lt;code&gt;analyze&lt;/code&gt; and &lt;code&gt;negotiate&lt;/code&gt;, so Claude Code or any MCP-compatible agent can call it directly during a session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; token_budget_negotiator.mcp_server &lt;span class="nt"&gt;--scorer&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;analyze&lt;/code&gt; takes a sections list and returns token distribution as JSON. &lt;code&gt;negotiate&lt;/code&gt; takes sections, rubric, task, thresholds, and scorer config and returns the full negotiation result as JSON.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ablation is greedy one-at-a-time in priority order, not exhaustive subset search.&lt;/li&gt;
&lt;li&gt;The judge is asked for strict JSON; free-text replies fall back to regex score extraction with reduced confidence (a sketch of this fallback follows the list).&lt;/li&gt;
&lt;li&gt;Small local judges like &lt;code&gt;gemma4&lt;/code&gt; are noisy; prefer thresholds in the 0.80-0.90 range and expect multi-minute wall-clock times even for short prompts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;check-openrouter&lt;/code&gt; and the OpenRouter scorer require &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;; there is no offline stub.&lt;/li&gt;
&lt;li&gt;Only the &lt;code&gt;remove&lt;/code&gt; compression strategy is wired up. &lt;code&gt;CompressionStrategy&lt;/code&gt; and &lt;code&gt;sections_compressed&lt;/code&gt; exist on the model but are not yet produced by the negotiator.&lt;/li&gt;
&lt;/ul&gt;
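
&lt;p&gt;For the JSON fallback mentioned above, a hedged sketch of the idea (the library's actual pattern and confidence handling may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative score extraction: strict JSON first, regex fallback second.
import json
import re


def extract_score(reply: str):
    """Return (score, source), where source records which path succeeded."""
    try:
        return float(json.loads(reply)["score"]), "json"
    except (ValueError, KeyError, TypeError):
        pass
    # fall back to the first "score: 0.x"-style number in free text
    match = re.search(r"score\s*[:=]?\s*([01](?:\.\d+)?)", reply, re.IGNORECASE)
    return (float(match.group(1)), "regex") if match else (None, "none")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
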

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

&lt;p&gt;The problem was defined at a high level: a tool that takes a structured prompt, scores it with a local or remote LLM judge, and finds the minimum set of sections needed to hit a quality threshold. NEO generated the full implementation: the greedy ablation loop in &lt;code&gt;Negotiator&lt;/code&gt;, the &lt;code&gt;OllamaScorer&lt;/code&gt; and &lt;code&gt;OpenRouterScorer&lt;/code&gt; with their shared interface, the &lt;code&gt;ScoreCache&lt;/code&gt; with TTL-based invalidation, the &lt;code&gt;SectionTokenizer&lt;/code&gt; backed by tiktoken, the YAML rubric format, the MCP server with its two exposed tools, and the CLI built on Click. All 49 tests pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Token Budget Negotiator turns prompt compression from guesswork into an empirical process. It scores every section against a rubric, drops only what demonstrably doesn't matter, and produces a report showing exactly what changed and why.&lt;br&gt;
The code is at &lt;a href="https://github.com/dakshjain-1616/token-budget-negotiator" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/token-budget-negotiator&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Agent Memory Compressor: Intelligent Memory Compression for Long-Running LLM Agents</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Mon, 27 Apr 2026 13:02:47 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/agent-memory-compressor-intelligent-memory-compression-for-long-running-llm-agents-5941</link>
      <guid>https://dev.to/nilofer_tweets/agent-memory-compressor-intelligent-memory-compression-for-long-running-llm-agents-5941</guid>
      <description>&lt;p&gt;A 10-turn agent session can easily accumulate 20,000+ tokens of raw history, leaving almost no room for the current task. Naive truncation drops older turns wholesale, including the decisions and discovered facts the agent needs to avoid repeating work. Developers need a principled way to compress history rather than discard it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Memory Compressor&lt;/strong&gt; is a Python library that implements an intelligent memory compression pipeline for long-running LLM agents. It combines importance-based scoring, LLM-driven summarization, a forgetting curve trigger, and a token-budgeted context builder so agents can run indefinitely without exhausting their context windows, while preserving the facts and decisions that matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Context Window Exhaustion
&lt;/h2&gt;

&lt;p&gt;The problem has three dimensions, and agent-memory-compressor addresses each one directly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to keep&lt;/strong&gt;: A multi-signal importance scorer ranks every memory entry.&lt;br&gt;
&lt;strong&gt;How to shrink&lt;/strong&gt;: Three pluggable compression strategies replace low-value entries with compact equivalents using any OpenAI-compatible LLM.&lt;br&gt;
&lt;strong&gt;When to act&lt;/strong&gt;: A forgetting curve fires compression automatically when either a turn interval or a token threshold is crossed.&lt;/p&gt;
&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Importance Scoring&lt;/strong&gt;&lt;br&gt;
Every memory entry is scored by the &lt;code&gt;ImportanceScorer&lt;/code&gt;, which combines three signals:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vuitmevat3gxfqchuy5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vuitmevat3gxfqchuy5.png" alt=" " width="780" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compression Strategies&lt;/strong&gt;&lt;br&gt;
Given a scored store, the &lt;code&gt;CompressionEngine&lt;/code&gt; exposes three strategies:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;summarize(entry)&lt;/code&gt;: Asks the LLM for a short summary that preserves all decisions and facts.&lt;br&gt;
&lt;code&gt;extract_facts(entry)&lt;/code&gt;: Asks the LLM for a bullet list of facts and decisions, stored as high-importance compressed entries.&lt;br&gt;
&lt;code&gt;archive(entry)&lt;/code&gt;: Replaces the entry with a minimal reference; the original content is retained in the entry's &lt;code&gt;compression_history&lt;/code&gt; for audit.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;MemoryCompressor&lt;/strong&gt; orchestrates the pipeline: score, pick the lowest-scoring non-protected entries, apply the least-destructive strategy first, and iterate until the store is under &lt;code&gt;token_budget&lt;/code&gt;. Every successful replacement is verified to actually reduce the token count, so compression can never make the context larger.&lt;/p&gt;
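
&lt;p&gt;A rough sketch of that orchestration, with the caveat that the helper names (&lt;code&gt;unprotected_entries&lt;/code&gt;, &lt;code&gt;replace&lt;/code&gt;, &lt;code&gt;token_count&lt;/code&gt;) are illustrative assumptions; only &lt;code&gt;token_total()&lt;/code&gt; and the three strategy methods appear in the documented API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative orchestration; helper names are assumptions, not the real API.
def compress_until_under_budget(store, engine, importance, token_budget):
    strategies = [engine.summarize, engine.extract_facts, engine.archive]
    while store.token_total() &amp;gt; token_budget:
        progressed = False
        # walk candidates from least to most important, non-protected only
        for entry in sorted(store.unprotected_entries(), key=importance):
            for strategy in strategies:           # least destructive first
                replacement = strategy(entry)
                if replacement.token_count &amp;lt; entry.token_count:  # must shrink
                    store.replace(entry, replacement)
                    progressed = True
                    break
            if progressed:
                break
        if not progressed:
            return  # nothing shrinks any further; never grow the context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
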

&lt;p&gt;&lt;strong&gt;The Forgetting Curve&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;ForgettingCurve&lt;/code&gt; decides when to compress. It combines two triggers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Turn-based:&lt;/strong&gt; fires once the number of turns since the last compression reaches &lt;code&gt;compression_interval_turns&lt;/code&gt; (default: 10)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token-based:&lt;/strong&gt; fires once &lt;code&gt;MemoryStore.token_total()&lt;/code&gt; exceeds &lt;code&gt;compression_threshold_tokens&lt;/code&gt; (default: 6000), with hysteresis to prevent thrashing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;should_compress(store)&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt; as soon as either condition is met. &lt;code&gt;get_compression_priority(store)&lt;/code&gt; returns entries sorted by importance, so the orchestrator always attacks the least-valuable history first.&lt;/p&gt;
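
&lt;p&gt;A minimal sketch of the trigger logic (hysteresis omitted for brevity; attribute names are illustrative, not the library's):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal trigger sketch; the real ForgettingCurve adds hysteresis on the token path.
class SimpleForgettingCurve:
    def __init__(self, compression_interval_turns=10,
                 compression_threshold_tokens=6000):
        self.interval = compression_interval_turns
        self.threshold = compression_threshold_tokens
        self.turns_since_compress = 0

    def on_turn(self):
        self.turns_since_compress += 1

    def should_compress(self, store):
        # fires as soon as either trigger is crossed
        return (self.turns_since_compress &amp;gt;= self.interval
                or store.token_total() &amp;gt; self.threshold)

    def mark_compressed(self, store):
        self.turns_since_compress = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
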
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="c"&gt;# optional, for live LLM calls&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The package depends on &lt;code&gt;pydantic&lt;/code&gt;, &lt;code&gt;tiktoken&lt;/code&gt; (for &lt;code&gt;cl100k_base&lt;/code&gt; token counts), &lt;code&gt;click&lt;/code&gt;, and &lt;code&gt;rich&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage Example
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MemoryEntry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MemoryCompressor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.triggers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ForgettingCurve&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ContextBuilder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ContextConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.strategies&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CompressionEngine&lt;/span&gt;

&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MemoryEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;compressor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryCompressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;protected_recent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;CompressionEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;curve&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ForgettingCurve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compression_interval_turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;compression_threshold_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;curve&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;should_compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compressor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;curve&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mark_compressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens_saved&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compression_ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; reduction)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ContextBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ContextConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Without an API key, &lt;code&gt;LLMClient&lt;/code&gt; falls back to a deterministic short stub so pipelines remain runnable in tests and offline demos. A full end-to-end demo lives at &lt;a href="https://github.com/dakshjain-1616/Agent-Memory-Compressor/blob/main/demos/long_run_demo.py" rel="noopener noreferrer"&gt;demos/long_run_demo.py&lt;/a&gt;.&lt;/p&gt;
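&lt;p&gt;As a rough illustration, the offline path can be exercised like this (a minimal sketch: the top-level import path and the &lt;code&gt;api_key=None&lt;/code&gt; behavior are assumptions, and &lt;code&gt;store&lt;/code&gt; is the &lt;code&gt;MemoryStore&lt;/code&gt; from the snippet above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent_memory_compressor import LLMClient, MemoryCompressor, CompressionEngine

# No api_key: the client falls back to its deterministic stub, so the
# whole pipeline runs with no network access (tests, offline demos).
llm = LLMClient(api_key=None)

compressor = MemoryCompressor(
    token_budget=4000,
    protected_recent=3,
    engine=CompressionEngine(llm_client=llm),
)
report = compressor.compress(store)  # stubbed summaries, real bookkeeping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;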
&lt;h2&gt;
  
  
  API Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqv6ab49n2hrdoiiaylc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqv6ab49n2hrdoiiaylc.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;memory-cli&lt;/code&gt; entrypoint (&lt;code&gt;click&lt;/code&gt;-based) is installed for quick inspection, compression, and demo runs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Integration with the Session Manager
&lt;/h2&gt;

&lt;p&gt;The adapters module wires the compressor directly into the &lt;a href="https://github.com/dakshjain-1616/agent-session-manager" rel="noopener noreferrer"&gt;Stateful Agent Session Manager&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.adapters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compress_session&lt;/span&gt;

&lt;span class="n"&gt;compressed_messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compress_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# anything exposing get_messages() / get_metadata()
&lt;/span&gt;    &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;protected_recent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;SessionAdapter.session_to_store&lt;/code&gt; projects session messages into a &lt;code&gt;MemoryStore&lt;/code&gt;, &lt;code&gt;compressor.compress(...)&lt;/code&gt; runs the pipeline, and &lt;code&gt;store_to_session&lt;/code&gt; projects the compressed entries back into the session's message format, preserving original roles and retaining the compression history on each compacted entry.&lt;/p&gt;
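&lt;p&gt;Spelled out, that round trip looks roughly like the following sketch (the constructor and method signatures here are assumptions based on the names above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent_memory_compressor.adapters import SessionAdapter
from agent_memory_compressor import MemoryCompressor

# Sketch of what compress_session does under the hood.
adapter = SessionAdapter()

# 1. Project the session's messages into a MemoryStore.
store = adapter.session_to_store(session)

# 2. Run the usual scoring + compression pipeline on the store.
compressor = MemoryCompressor(token_budget=4000, protected_recent=3)
report = compressor.compress(store)

# 3. Project compressed entries back into the session's message
#    format, preserving roles and per-entry compression history.
compressed_messages = adapter.store_to_session(store, session)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;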

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code end-to-end for AI/ML tasks, including model evals, prompt optimization, and pipeline development.&lt;br&gt;
I described the problem at a high level: an intelligent memory pipeline for long-running agents that scores history by importance, compresses the least valuable entries, and assembles a token-bounded context.&lt;/p&gt;

&lt;p&gt;NEO generated the full implementation: the multi-signal ImportanceScorer, the three compression strategies in CompressionEngine, the turn- and token-based ForgettingCurve triggers, the token-budgeted ContextBuilder, and the SessionAdapter that wires everything into an existing agent session, all as a coherent, installable Python library.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Build Further With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Semantic similarity scoring&lt;/strong&gt;: straightforward; call an embeddings API and feed the similarity score into the existing scoring pipeline. This is done all the time in RAG systems.&lt;br&gt;
&lt;strong&gt;Pluggable tokenizers&lt;/strong&gt;: purely an engineering task; abstract the tiktoken call behind an interface (see the sketch below). No research needed.&lt;br&gt;
&lt;strong&gt;More agent framework adapters&lt;/strong&gt;: LangChain and LlamaIndex both expose message lists. The &lt;code&gt;session_to_store&lt;/code&gt; pattern already exists; repeat it for each framework.&lt;br&gt;
&lt;strong&gt;Streaming compression&lt;/strong&gt;: the trigger logic already exists; moving it to run per turn is a refactor, not a research problem.&lt;/p&gt;
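&lt;p&gt;To make the pluggable-tokenizer item concrete, it boils down to one small seam. The &lt;code&gt;Tokenizer&lt;/code&gt; protocol and both implementations below are hypothetical names, not part of the library:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Protocol

class Tokenizer(Protocol):
    """Hypothetical seam abstracting the library's direct tiktoken call."""
    def count_tokens(self, text: str): ...  # returns int

class TiktokenTokenizer:
    """Default: exact counts via tiktoken (the current behavior)."""
    def __init__(self, model="gpt-4o-mini"):
        import tiktoken
        self._enc = tiktoken.encoding_for_model(model)

    def count_tokens(self, text: str):
        return len(self._enc.encode(text))

class WhitespaceTokenizer:
    """Cheap offline fallback: approximate tokens by word count."""
    def count_tokens(self, text: str):
        return len(text.split())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Anything that enforces a token budget (&lt;code&gt;MemoryCompressor&lt;/code&gt;, &lt;code&gt;ContextBuilder&lt;/code&gt;) would then accept a &lt;code&gt;Tokenizer&lt;/code&gt; instead of calling tiktoken directly.&lt;/p&gt;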

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Agent Memory Compressor is a principled answer to context window exhaustion for long-running LLM agents.&lt;/p&gt;

&lt;p&gt;Instead of truncating history blindly, it scores every piece of memory, applies the least-destructive compression strategy first, and assembles a token-bounded context that preserves what the agent actually needs: the decisions, discovered facts, and recent turns that matter most.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Agent-Memory-Compressor" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Agent-Memory-Compressor&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>devtools</category>
      <category>agents</category>
    </item>
    <item>
      <title>Cache-Augmented Generation (CAG): A RAG-less Approach to Document QA</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 25 Apr 2026 10:29:22 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/cache-augmented-generation-cag-a-rag-less-approach-to-document-qa-3296</link>
      <guid>https://dev.to/nilofer_tweets/cache-augmented-generation-cag-a-rag-less-approach-to-document-qa-3296</guid>
      <description>&lt;p&gt;Most document QA systems today rely on Retrieval-Augmented Generation (RAG). The standard pipeline is familiar: chunk the document, generate embeddings, store them in a vector database, and retrieve relevant chunks at query time.&lt;/p&gt;

&lt;p&gt;This works, but it comes with trade-offs. The model only sees fragments of the document, retrieval adds latency, and the system becomes more complex with multiple moving parts.&lt;/p&gt;

&lt;p&gt;Cache-Augmented Generation (CAG) explores a different approach, where the document is processed once and reused across queries instead of being retrieved repeatedly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Cache-Augmented Generation
&lt;/h2&gt;

&lt;p&gt;Cache-Augmented Generation (CAG) approaches document QA by reusing the model’s internal state instead of retrieving context for every query.&lt;/p&gt;

&lt;p&gt;During ingestion, the entire document is processed in a single pass. In this step, the model builds its KV (key-value) cache, which represents the document’s context.&lt;/p&gt;

&lt;p&gt;This KV cache is then saved to disk.&lt;/p&gt;

&lt;p&gt;When a query is made, the cache is restored and the query is appended, allowing the model to generate responses using the previously processed document.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg3ppq2ap9haejv1cnvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg3ppq2ap9haejv1cnvd.png" alt=" " width="757" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Ingest (done once per document)&lt;/strong&gt;&lt;br&gt;
The document is wrapped in a structured prompt and sent to llama-server. The model runs a full prefill pass, loading every token into the KV cache. This takes time proportional to document size, but it only happens once. The KV cache is then saved to a &lt;code&gt;.bin&lt;/code&gt; file on disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Query (instant, repeatable)&lt;/strong&gt;&lt;br&gt;
Before each query, the saved &lt;code&gt;.bin&lt;/code&gt; file is restored into llama-server's KV cache in ~1 second. The user's question is appended and the model generates an answer with full document context active. No re-reading, no re-embedding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Persistence&lt;/strong&gt;&lt;br&gt;
KV slots survive server restarts. Kill the server, restart it, and your next query restores the cache from disk just as fast. The 24-minute prefill for &lt;em&gt;War and Peace&lt;/em&gt; only ever needs to happen once.&lt;/p&gt;
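&lt;p&gt;Done by hand, the restore-then-query step maps onto llama-server's slot save/restore endpoints. A minimal sketch, assuming the server listens on port 8080 and was started with &lt;code&gt;--slot-save-path&lt;/code&gt; pointing at &lt;code&gt;kv_slots/&lt;/code&gt; (payload shapes can vary across llama.cpp versions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

LLAMA = "http://localhost:8080"

# Restore the saved KV cache into slot 0 (~1 second, no prefill).
requests.post(
    f"{LLAMA}/slots/0?action=restore",
    json={"filename": "my_doc.bin"},
).raise_for_status()

# Ask a question with the full document context already resident.
resp = requests.post(f"{LLAMA}/completion", json={
    "prompt": "Question: Who is Pierre Bezukhov?\nAnswer:",
    "id_slot": 0,          # reuse the slot we just restored
    "cache_prompt": True,  # keep the restored KV cache
    "n_predict": 256,
})
print(resp.json()["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;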
&lt;h2&gt;
  
  
  &lt;strong&gt;Validated Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;All 11 GPU tests were run on an NVIDIA RTX A6000 (48 GB VRAM) with Qwen3.5-35B-A3B Q3_K_M at 1,048,576 token context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcl1j86rcusllw996dbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcl1j86rcusllw996dbj.png" alt=" " width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Output&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;“Who is Pierre Bezukhov?” → Correct, detailed answer&lt;br&gt;
“What happened at the Battle of Borodino?” → Correct, detailed answer&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Quick Start&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Prerequisites: Linux, NVIDIA GPU (8 GB+ VRAM), Python 3.8+&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Build llama.cpp + download model (one-time, ~35 min)&lt;/span&gt;
./setup.sh

&lt;span class="c"&gt;# 2. Start the LLM server&lt;/span&gt;
./start_server.sh

&lt;span class="c"&gt;# 3. Start the API server&lt;/span&gt;
python3 src/api_server.py

&lt;span class="c"&gt;# 4. Ingest a document&lt;/span&gt;
python3 src/ingest.py my_document.txt &lt;span class="nt"&gt;--corpus-id&lt;/span&gt; my_doc

&lt;span class="c"&gt;# 5. Query it&lt;/span&gt;
python3 src/query.py my_doc &lt;span class="s2"&gt;"What is this document about?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. After step 4, the KV cache is saved to &lt;code&gt;kv_slots/my_doc.bin&lt;/code&gt;. Every future query restores it instantly, and it survives server restarts.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Model Selection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;setup.sh&lt;/code&gt; auto-detects GPU VRAM and picks the right model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F172inz546xzlcb7o615u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F172inz546xzlcb7o615u.png" alt=" " width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The 24 GB+ path uses &lt;code&gt;unsloth/Qwen3.5-35B-A3B-GGUF&lt;/code&gt; from Hugging Face and requires a free HF account and access token.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;REST API&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Start the API server with &lt;code&gt;python3 src/api_server.py --port 8000&lt;/code&gt; (optionally set the &lt;code&gt;CAG_API_KEY&lt;/code&gt; environment variable to enable key auth).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhslie8eplnq0cbj0uelk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhslie8eplnq0cbj0uelk.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Full API docs available at &lt;code&gt;http://localhost:8000/docs&lt;/code&gt; when the server is running.&lt;/p&gt;
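&lt;p&gt;For orientation, a client sketch follows. The &lt;code&gt;/ingest&lt;/code&gt; and &lt;code&gt;/query&lt;/code&gt; routes, the field names, and the &lt;code&gt;X-API-Key&lt;/code&gt; header are illustrative guesses, so treat &lt;code&gt;/docs&lt;/code&gt; as the source of truth:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import requests

API = "http://localhost:8000"
# If the server was started with CAG_API_KEY set, echo it back in a
# header (the header name is an assumption; see /docs for the real scheme).
HEADERS = {"X-API-Key": os.environ.get("CAG_API_KEY", "")}

# Hypothetical ingest call: one-time prefill + KV cache save.
with open("my_document.txt") as f:
    requests.post(f"{API}/ingest", headers=HEADERS,
                  json={"corpus_id": "my_doc", "text": f.read()})

# Hypothetical query call: restores the cache and generates an answer.
resp = requests.post(f"{API}/query", headers=HEADERS,
                     json={"corpus_id": "my_doc",
                           "question": "What is this document about?"})
print(resp.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;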

&lt;h2&gt;
  
  
  &lt;strong&gt;Directory Structure&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── setup.sh              # Builds llama.cpp, downloads model
├── start_server.sh       # Launches llama-server with CAG flags
├── requirements.txt
├── src/
│   ├── api_server.py     # FastAPI REST API
│   ├── ingest.py         # CLI: ingest a document
│   ├── query.py          # CLI: query a corpus
│   └── demo.py           # End-to-end demo
├── docker/
│   ├── Dockerfile
│   └── docker-compose.yml
├── docs/
│   ├── REPORT.md         # Full GPU validation report with all 11 test results
│   └── GPU_TESTING.md    # GPU test checklist
├── models/               # GGUF weights (not committed)
├── kv_slots/             # Saved KV cache .bin files (not committed)
└── logs/                 # Runtime logs (not committed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Limitations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Linux + NVIDIA only:&lt;/strong&gt; TurboQuant CUDA kernels require Linux and NVIDIA GPUs (no Windows, macOS, or AMD).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long initial prefill:&lt;/strong&gt; ~900K tokens can take ~24 minutes on an A6000. This is a one-time cost; subsequent queries restore in ~1 second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VRAM gating:&lt;/strong&gt; Systems with lower VRAM use smaller models with shorter context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single active corpus:&lt;/strong&gt; Uses a single llama.cpp slot (slot 0). Switching corpora requires restoring a different KV cache (~1 second).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-context limitations:&lt;/strong&gt; YaRN extrapolation biases attention toward the start and end of documents, so mid-document content can be missed at very large context sizes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build time:&lt;/strong&gt; Initial setup (&lt;code&gt;./setup.sh&lt;/code&gt;) can take ~35 minutes to compile CUDA kernels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model access requirements:&lt;/strong&gt; Large models (e.g., Qwen3.5-35B) require a Hugging Face account and access token.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks, including model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The system was defined at a high level, describing a document QA workflow that avoids RAG by loading full documents into an LLM, saving the KV cache, and restoring it for repeated queries.&lt;/p&gt;

&lt;p&gt;Based on this, NEO generated the implementation, handled debugging across CUDA, Python, and shell components, and validated the system through a series of GPU tests.&lt;/p&gt;

&lt;p&gt;This included fixing multiple issues during development and running end-to-end validation to ensure ingestion, cache restoration, and query flows worked reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Extend This Further with NEO&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The system can be extended in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;supporting multiple KV cache slots (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;improving handling of long-context attention limitations&lt;/li&gt;
&lt;li&gt;optimizing cache storage and compression&lt;/li&gt;
&lt;li&gt;exploring hybrid approaches combining CAG with retrieval&lt;/li&gt;
&lt;li&gt;extending API capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These extensions would require changes to the current implementation and can be explored based on system requirements.&lt;/p&gt;
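&lt;p&gt;As one example, the multiple-slots item could build on llama.cpp's parallel slots. A hedged sketch, assuming the server was started with &lt;code&gt;--parallel 2&lt;/code&gt; and &lt;code&gt;--slot-save-path&lt;/code&gt; (the corpus-to-slot mapping is purely illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

LLAMA = "http://localhost:8080"

# Illustrative corpus-to-slot mapping; each corpus keeps its own KV cache.
SLOTS = {"war_and_peace": 0, "contracts": 1}

def activate(corpus_id):
    """Restore a corpus's saved KV cache into its dedicated slot."""
    slot = SLOTS[corpus_id]
    requests.post(
        f"{LLAMA}/slots/{slot}?action=restore",
        json={"filename": f"{corpus_id}.bin"},
    ).raise_for_status()
    return slot

slot = activate("war_and_peace")
# Subsequent /completion calls pass id_slot=slot, so two corpora can
# stay warm at once instead of sharing slot 0.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;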

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Notes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Cache-Augmented Generation is an alternative way to approach document QA.&lt;/p&gt;

&lt;p&gt;Instead of retrieving context at query time, it shifts the cost to a one-time preprocessing step and reuses the model’s KV cache.&lt;/p&gt;

&lt;p&gt;This makes repeated queries fast and keeps the full document context available to the model through the KV cache, while introducing trade-offs in setup time and hardware requirements.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Cache-Augmented-Generation-CAG-System" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Cache-Augmented-Generation-CAG-System&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
