<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nilofer 🚀</title>
    <description>The latest articles on DEV Community by Nilofer 🚀 (@nilofer_tweets).</description>
    <link>https://dev.to/nilofer_tweets</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1137273%2Fac10d3a1-21d6-46e3-90d6-889213a616bd.jpg</url>
      <title>DEV Community: Nilofer 🚀</title>
      <link>https://dev.to/nilofer_tweets</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nilofer_tweets"/>
    <language>en</language>
    <item>
      <title>ArchGuard: Detect Architecture Drift Before It Becomes Technical Debt</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 02 Jun 2026 09:44:09 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/archguard-detect-architecture-drift-before-it-becomes-technical-debt-5b11</link>
      <guid>https://dev.to/nilofer_tweets/archguard-detect-architecture-drift-before-it-becomes-technical-debt-5b11</guid>
      <description>&lt;p&gt;Architecture degrades gradually. A circular dependency here, a god class there, a controller reaching directly into the database layer. Each violation is small on its own. Over time they compound into a codebase that is expensive to change and expensive to understand.&lt;/p&gt;

&lt;p&gt;Most teams discover this in retrospect when a refactor takes three times as long as expected, or when a seemingly isolated change breaks something unrelated. By then the drift is already embedded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArchGuard&lt;/strong&gt; is a production-ready Python static analysis tool that detects architecture degradation patterns in codebases over time. It runs six built-in detectors, compares architecture health between branches, tracks drift over the last 10 commits, and integrates into CI/CD through a GitHub Action or git hooks - all without any AI model dependency, using deterministic local static analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;6 Built-in Detectors&lt;/strong&gt; - circular dependencies, god classes, service layer bypasses, magic values, cyclomatic complexity, and layer violations.&lt;br&gt;
&lt;strong&gt;Per-PR Analysis&lt;/strong&gt; - compare architecture health between branches to catch regressions before they merge.&lt;br&gt;
&lt;strong&gt;Trend Analysis&lt;/strong&gt; - track architecture health over the last 10 commits to see drift over time.&lt;br&gt;
&lt;strong&gt;Multiple Output Formats&lt;/strong&gt; - table, JSON, YAML, Markdown, and HTML.&lt;br&gt;
&lt;strong&gt;CLI and Git Hooks&lt;/strong&gt; - command-line tool with pre-commit and pre-push hooks.&lt;br&gt;
&lt;strong&gt;GitHub Action&lt;/strong&gt; - CI/CD integration for automated architecture checks.&lt;br&gt;
&lt;strong&gt;YAML Configuration&lt;/strong&gt; - flexible, project-specific configuration via &lt;code&gt;.archguard.yml&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The CLI and YAML config feed the core engine (an AST parser, dependency graph, and base analyzer), which fans out to six detectors. Findings are graded by severity, rendered as table, JSON, YAML, Markdown, or HTML, and delivered through the CLI, git hooks, or the GitHub Action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uj8xg6cbkzervxd4ge9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uj8xg6cbkzervxd4ge9.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;From PyPI&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;archguard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;From Source&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/Arch-Guard
&lt;span class="nb"&gt;cd &lt;/span&gt;Arch-Guard
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.10+.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;Initialize a configuration file in the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scan the current tree or point it at a specific path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard scan
archguard scan ./src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For machine-readable results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard scan &lt;span class="nt"&gt;--format&lt;/span&gt; json &lt;span class="nt"&gt;--output&lt;/span&gt; report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Review architecture drift over the last 10 commits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard trend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CLI Commands
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;scan - Analyze Codebase&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard scan &lt;span class="o"&gt;[&lt;/span&gt;PATH] &lt;span class="o"&gt;[&lt;/span&gt;OPTIONS]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key flags: &lt;code&gt;--format&lt;/code&gt;, &lt;code&gt;--output&lt;/code&gt;, &lt;code&gt;--detectors&lt;/code&gt;, &lt;code&gt;--severity&lt;/code&gt;, &lt;code&gt;--fail-on-violations&lt;/code&gt;. Global flags: &lt;code&gt;--config&lt;/code&gt;, &lt;code&gt;--verbose&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;trend - Analyze Trends&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard trend &lt;span class="o"&gt;[&lt;/span&gt;OPTIONS]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flags: &lt;code&gt;--commits&lt;/code&gt;, &lt;code&gt;--format&lt;/code&gt;, &lt;code&gt;--output&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;init - Create Configuration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard init &lt;span class="o"&gt;[&lt;/span&gt;OPTIONS]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--path&lt;/code&gt; selects the config file location.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;config - Manage Configuration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard config                          &lt;span class="c"&gt;# Show active configuration&lt;/span&gt;
archguard config output_format            &lt;span class="c"&gt;# Read a value&lt;/span&gt;
archguard config output_format json       &lt;span class="c"&gt;# Update a value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Six Detectors
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Circular Dependency&lt;/strong&gt;&lt;br&gt;
Detects circular import dependencies between modules.&lt;br&gt;
&lt;code&gt;min_cycle_length&lt;/code&gt; - minimum cycle length to report, default 2&lt;br&gt;
&lt;code&gt;max_cycles&lt;/code&gt; - maximum cycles to report, default 100&lt;/p&gt;
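
&lt;p&gt;To make the idea concrete, here is a minimal sketch of import-cycle detection using only the standard library. It is not ArchGuard's implementation (the article says the real tool builds its graph with NetworkX); the function names &lt;code&gt;build_import_graph&lt;/code&gt; and &lt;code&gt;find_cycles&lt;/code&gt; are hypothetical, and the sketch ignores the &lt;code&gt;min_cycle_length&lt;/code&gt; and &lt;code&gt;max_cycles&lt;/code&gt; options.&lt;/p&gt;

```python
import ast
from collections import defaultdict


def build_import_graph(sources):
    """Map each module name to the set of project modules it imports."""
    graph = defaultdict(set)
    for module, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            # Keep only edges that point at modules inside the project.
            graph[module].update(t for t in targets if t in sources)
    return graph


def find_cycles(graph):
    """Return each import cycle once, as a list of module names."""
    cycles, seen = [], set()

    def visit(node, path):
        if node in path:
            cycle = path[path.index(node):]
            key = frozenset(cycle)  # deduplicate cycles by their node set
            if key not in seen:
                seen.add(key)
                cycles.append(cycle)
            return
        for neighbour in sorted(graph.get(node, ())):
            visit(neighbour, path + [node])

    for start in sorted(graph):
        visit(start, [])
    return cycles


# Hypothetical three-module project, given as module name to source text.
sources = {"a": "import b", "b": "from c import x", "c": "import a"}
print(find_cycles(build_import_graph(sources)))  # [['a', 'b', 'c']]
```

&lt;p&gt;The depth-first walk is fine for a small example but can be slow on large graphs, which is one reason a real tool would lean on a graph library instead.&lt;/p&gt;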

&lt;p&gt;&lt;strong&gt;God Class&lt;/strong&gt;&lt;br&gt;
Detects classes with too many methods, attributes, or lines.&lt;br&gt;
&lt;code&gt;max_methods&lt;/code&gt; - maximum methods per class, default 20&lt;br&gt;
&lt;code&gt;max_attributes&lt;/code&gt; - maximum attributes per class, default 15&lt;br&gt;
&lt;code&gt;max_lines&lt;/code&gt; - maximum lines per class, default 500&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service Layer Bypass&lt;/strong&gt;&lt;br&gt;
Detects when controller or presentation layers bypass service layers to access repositories directly.&lt;br&gt;
&lt;code&gt;controller_patterns&lt;/code&gt; - regex patterns for controller files&lt;br&gt;
&lt;code&gt;service_patterns&lt;/code&gt; - regex patterns for service files&lt;br&gt;
&lt;code&gt;repository_patterns&lt;/code&gt; - regex patterns for repository files&lt;/p&gt;
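
&lt;p&gt;A rough sketch of how such a check could work, assuming the patterns are applied to file paths and imported module names. The pattern values and the &lt;code&gt;bypasses&lt;/code&gt; helper are hypothetical illustrations, not ArchGuard's defaults or API.&lt;/p&gt;

```python
import ast
import re

# Hypothetical patterns; a real project would set these in its config.
CONTROLLER_PATTERNS = [r"controllers?/"]
REPOSITORY_PATTERNS = [r"repositor(y|ies)"]


def matches(patterns, text):
    """Return True if any regex in patterns is found in text."""
    return any(re.search(pattern, text) for pattern in patterns)


def bypasses(path, code):
    """List repository modules imported directly by a controller file."""
    if not matches(CONTROLLER_PATTERNS, path):
        return []
    found = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        found.extend(name for name in names if matches(REPOSITORY_PATTERNS, name))
    return found


code = "from app.repositories.user import UserRepo\nimport app.services.user"
print(bypasses("app/controllers/user.py", code))  # ['app.repositories.user']
```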

&lt;p&gt;&lt;strong&gt;Magic Value&lt;/strong&gt;&lt;br&gt;
Detects hardcoded literals that should be named constants.&lt;br&gt;
&lt;code&gt;min_string_length&lt;/code&gt; - minimum string length to flag, default 3&lt;br&gt;
&lt;code&gt;max_string_length&lt;/code&gt; - maximum string length to check, default 100&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cyclomatic Complexity&lt;/strong&gt;&lt;br&gt;
Detects functions and methods with high cyclomatic complexity.&lt;br&gt;
&lt;code&gt;thresholds&lt;/code&gt; - complexity thresholds for each severity level&lt;/p&gt;
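
&lt;p&gt;For readers unfamiliar with the metric, cyclomatic complexity is roughly one plus the number of decision points in a function. The sketch below counts them with Python's &lt;code&gt;ast&lt;/code&gt; module. Which node types count as decision points is an assumption of this sketch, and the function names are hypothetical; ArchGuard's own counting rules may differ.&lt;/p&gt;

```python
import ast

# Node types that each add one decision point (an assumption of this sketch).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.IfExp, ast.Assert, ast.comprehension)


def cyclomatic_complexity(func):
    """Return 1 plus the number of decision points inside one function node."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            score += len(node.values) - 1
    return score


def complexity_by_function(code):
    """Map each function name in the source text to its complexity score."""
    return {
        node.name: cyclomatic_complexity(node)
        for node in ast.walk(ast.parse(code))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


sample = """
def f(x):
    if x and x.ok:
        return 1
    for i in x:
        pass
    return 0
"""
print(complexity_by_function(sample))  # {'f': 4}
```

&lt;p&gt;A detector would then compare each score against the configured &lt;code&gt;thresholds&lt;/code&gt; to assign a severity level.&lt;/p&gt;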

&lt;p&gt;&lt;strong&gt;Layer Violation&lt;/strong&gt;&lt;br&gt;
Detects violations of layered architecture, such as the presentation layer importing from the repository layer.&lt;br&gt;
&lt;code&gt;layers&lt;/code&gt; - layer definitions with patterns and allowed calls&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Create a &lt;code&gt;.archguard.yml&lt;/code&gt; file in your project root. The config supports project metadata, include and exclude patterns, and per-detector options such as cycle length, maximum class size, and complexity thresholds. Output behavior, Git integration, and trend analysis are all controlled through the same file.&lt;/p&gt;
&lt;h2&gt;
  
  
  Git Hooks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python hooks/install.py                        &lt;span class="c"&gt;# Install pre-commit hook&lt;/span&gt;
python hooks/install.py &lt;span class="nt"&gt;--pre-commit&lt;/span&gt; &lt;span class="nt"&gt;--pre-push&lt;/span&gt;  &lt;span class="c"&gt;# Install both hooks&lt;/span&gt;
python hooks/install.py &lt;span class="nt"&gt;--force&lt;/span&gt;                &lt;span class="c"&gt;# Overwrite existing hooks&lt;/span&gt;
python hooks/install.py &lt;span class="nt"&gt;--uninstall&lt;/span&gt;            &lt;span class="c"&gt;# Remove hooks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pre-commit hook&lt;/strong&gt; - runs ArchGuard on staged Python files before committing.&lt;br&gt;
&lt;strong&gt;Pre-push hook&lt;/strong&gt; - runs trend analysis before pushing to remote.&lt;/p&gt;
&lt;h2&gt;
  
  
  GitHub Action
&lt;/h2&gt;

&lt;p&gt;The GitHub Action integrates ArchGuard into CI/CD pipelines. Basic usage runs on push or pull request workflows, checks out the repository with full history, and passes path, format, severity, and fail-on-violations settings as action inputs. Advanced configuration enables trend mode, selects Markdown output, sets the commit window, and uploads the generated report as an artifact.&lt;/p&gt;
&lt;h2&gt;
  
  
  Acknowledgments
&lt;/h2&gt;

&lt;p&gt;Built with Click for CLI, Python's built-in &lt;code&gt;ast&lt;/code&gt; module for AST parsing, NetworkX for dependency graph analysis, Rich for terminal output, and GitPython for Git integration.&lt;/p&gt;
&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/Arch-Guard
&lt;span class="nb"&gt;cd &lt;/span&gt;Arch-Guard
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
pre-commit &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running Tests&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest                                                        &lt;span class="c"&gt;# Full suite&lt;/span&gt;
pytest &lt;span class="nt"&gt;--cov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;src/archguard &lt;span class="nt"&gt;--cov-report&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html                 &lt;span class="c"&gt;# With coverage&lt;/span&gt;
pytest tests/unit/test_detectors.py                          &lt;span class="c"&gt;# Targeted detector check&lt;/span&gt;
pytest tests/integration/                                    &lt;span class="c"&gt;# Integration coverage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code Quality&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruff check src/ tests/               &lt;span class="c"&gt;# Linting&lt;/span&gt;
ruff check &lt;span class="nt"&gt;--fix&lt;/span&gt; src/ tests/         &lt;span class="c"&gt;# Auto-fix&lt;/span&gt;
pyright src/                         &lt;span class="c"&gt;# Type checking&lt;/span&gt;
ruff check src/ tests/ &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pyright src/  &lt;span class="c"&gt;# Combined gate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;Fork the repository. Create a feature branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; feature/amazing-feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make your changes, run tests with &lt;code&gt;pytest&lt;/code&gt;, run linting with &lt;code&gt;ruff check src/ tests/&lt;/code&gt;, commit, push to the branch, and open a Pull Request.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a production-ready static analysis tool that detects architecture drift in Python codebases over time - with six built-in detectors, trend analysis over git history, multiple output formats, git hook integration, and a GitHub Action for CI/CD. NEO built the full implementation: the core engine with AST parser, dependency graph via NetworkX, and base analyzer; all six detector modules; the formatter layer covering table, JSON, YAML, Markdown, and HTML output; the git integration via GitPython; the CLI built on Click; the YAML configuration layer; the git hook installer and pre-commit and pre-push hooks; the GitHub Action; and the full test suite covering unit and integration tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a quality gate in every pull request.&lt;/strong&gt;&lt;br&gt;
Add the GitHub Action to your workflow with &lt;code&gt;--fail-on-violations&lt;/code&gt; and the severity threshold you care about. Every PR is checked for new circular dependencies, god classes, layer violations, and complexity regressions before it merges, automatically and without any manual review step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use trend analysis to measure the health of an inherited codebase.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;archguard trend&lt;/code&gt; on a codebase you have just taken over. The last 10 commits give you a picture of whether the architecture is improving or degrading, and which detectors are firing most frequently - useful context before making any changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use git hooks to enforce standards locally before code reaches CI.&lt;/strong&gt;&lt;br&gt;
Install the pre-commit hook with &lt;code&gt;python hooks/install.py&lt;/code&gt;. Staged files are checked on every commit. The pre-push hook runs trend analysis before anything reaches the remote. Issues are caught on the developer's machine, not in CI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional detectors.&lt;/strong&gt;&lt;br&gt;
The six detectors share a common base analyzer interface. A new detector for a project-specific architecture rule follows the same pattern - implement the detection logic, add per-detector configuration to &lt;code&gt;.archguard.yml&lt;/code&gt;, and register it. It appears automatically in scan output, trend analysis, and all output formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Architecture drift is invisible until it is expensive. ArchGuard makes it visible at every commit, every PR, and every push - with deterministic static analysis that requires no API keys, no model downloads, and no network calls. Six detectors, trend tracking over git history, and CI/CD integration in one tool.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Arch-Guard" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Arch-Guard&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>Prepush-Guardian: Catch Secrets and Broken Tests Before They Reach Git History</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Mon, 01 Jun 2026 12:13:22 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/prepush-guardian-catch-secrets-and-broken-tests-before-they-reach-git-history-fpc</link>
      <guid>https://dev.to/nilofer_tweets/prepush-guardian-catch-secrets-and-broken-tests-before-they-reach-git-history-fpc</guid>
      <description>&lt;p&gt;You are about to push. There is a hardcoded API key buried in one of 30 changed files. Or you forgot to write a test for that new module. Or the test suite is silently failing. You will not know until it is already in git history.&lt;/p&gt;

&lt;p&gt;Prepush-Guardian catches all of this before the push lands. It is a production-grade Git pre-push hook that scans staged files for secrets, auto-generates missing tests, runs your full test suite, and blocks the push if anything fails, before it ever reaches the remote.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck3j18nwpzfqr6fvekc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck3j18nwpzfqr6fvekc0.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Tool
&lt;/h2&gt;

&lt;p&gt;Manual review - Misses things, does not scale, no enforcement&lt;br&gt;
CI/CD only - Finds it after the push, already in history&lt;br&gt;
prepush-guardian - Blocked at push time, before it ever reaches remote&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scans every staged file for 20+ secret patterns: AWS, GitHub PATs, private keys, database URLs, bearer tokens, and more&lt;/li&gt;
&lt;li&gt;Shannon entropy scanner catches novel secrets not matched by patterns&lt;/li&gt;
&lt;li&gt;Auto-generates missing tests using OpenRouter AI, with a template fallback if no API key is set&lt;/li&gt;
&lt;li&gt;Runs your full test suite and blocks the push if coverage drops below threshold&lt;/li&gt;
&lt;li&gt;Writes a markdown report at &lt;code&gt;.neo/prepush-report.md&lt;/code&gt; for every push&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone and install the hook into your repo
git clone https://github.com/dakshjain-1616/prepush-guardian
cd your-target-repo

# Install the pre-push hook
python3 /path/to/prepush-guardian/install.py

# Optional: set API key for AI test generation
cp .env.example .env   # fill in OPENROUTER_API_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The hook runs automatically on every &lt;code&gt;git push&lt;/code&gt;. To run manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 prepush_guardian.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Environment Variables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp .env.example .env
# Required only for AI-based test generation
# Free key at: https://openrouter.ai/keys
OPENROUTER_API_KEY=your_openrouter_api_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without an API key, the tool falls back to template-based test generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Commands
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tgi2wz9gn2conp5x8g0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tgi2wz9gn2conp5x8g0.png" alt=" " width="725" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Detection Patterns
&lt;/h2&gt;

&lt;p&gt;The secret scanner covers 20+ patterns across four severity levels:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6hga45t7gdnq2le8w0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6hga45t7gdnq2le8w0y.png" alt=" " width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Shannon entropy scanner runs alongside the pattern matcher. It catches novel secrets (API keys or tokens not yet covered by a named pattern) by flagging high-entropy strings assigned to variables named KEY, TOKEN, or SECRET.&lt;/p&gt;
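
&lt;p&gt;As an illustration of the technique, here is a small sketch that computes Shannon entropy in bits per character and ranks values assigned to sensitive-looking names. The article does not state the entropy threshold the tool uses, so this sketch only ranks candidates and leaves the cutoff open; &lt;code&gt;shannon_entropy&lt;/code&gt; and &lt;code&gt;rank_candidates&lt;/code&gt; are hypothetical names, not functions from &lt;code&gt;leak_detector.py&lt;/code&gt;.&lt;/p&gt;

```python
import math
from collections import Counter

SENSITIVE_WORDS = ("KEY", "TOKEN", "SECRET")


def shannon_entropy(text):
    """Return the Shannon entropy of a string in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def rank_candidates(assignments):
    """Rank values assigned to sensitive-looking names by entropy, highest first."""
    scored = [
        (name, round(shannon_entropy(value), 2))
        for name, value in assignments.items()
        if any(word in name.upper() for word in SENSITIVE_WORDS)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


print(shannon_entropy("abcd"))  # 2.0
```

&lt;p&gt;Random-looking tokens spread their characters evenly and score high, while repetitive or natural-language strings score lower, which is what lets an entropy check catch secrets that no named pattern covers.&lt;/p&gt;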

&lt;h2&gt;
  
  
  Scoring and Thresholds
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8flyrugjykeqciqwonh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8flyrugjykeqciqwonh.png" alt=" " width="382" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.neo/config.json&lt;/code&gt; to customize behavior. It is auto-created with defaults if absent:&lt;br&gt;
&lt;code&gt;coverage_warn_threshold&lt;/code&gt; - default 70. Warn if coverage drops below this percentage.&lt;br&gt;
&lt;code&gt;coverage_block_threshold&lt;/code&gt; - default 50. Block push if coverage drops below this percentage.&lt;br&gt;
&lt;code&gt;block_on_low_severity&lt;/code&gt; - default false. Also hard-block on LOW findings.&lt;br&gt;
&lt;code&gt;auto_fix_gitignore&lt;/code&gt; - default true. Add sensitive filenames to &lt;code&gt;.gitignore&lt;/code&gt; automatically.&lt;br&gt;
&lt;code&gt;generate_missing_tests&lt;/code&gt; - default true. Auto-generate tests for untested source files.&lt;br&gt;
&lt;code&gt;skip_test_check_for&lt;/code&gt; - default &lt;code&gt;["migrations/", "scripts/", "docs/"]&lt;/code&gt;. Directories excluded from test generation.&lt;/p&gt;
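
&lt;p&gt;The defaults above and the two coverage thresholds can be sketched as follows. The keys and default values come from the list above; the &lt;code&gt;load_config&lt;/code&gt; and &lt;code&gt;coverage_verdict&lt;/code&gt; helpers are hypothetical and assume the block threshold is not higher than the warn threshold.&lt;/p&gt;

```python
import json
from bisect import bisect_right
from pathlib import Path

DEFAULTS = {
    "coverage_warn_threshold": 70,
    "coverage_block_threshold": 50,
    "block_on_low_severity": False,
    "auto_fix_gitignore": True,
    "generate_missing_tests": True,
    "skip_test_check_for": ["migrations/", "scripts/", "docs/"],
}


def load_config(path=".neo/config.json"):
    """Return the defaults overlaid with any values found in the config file."""
    file = Path(path)
    overrides = json.loads(file.read_text()) if file.exists() else {}
    return {**DEFAULTS, **overrides}


def coverage_verdict(coverage, config):
    """Classify a coverage percentage as 'block', 'warn', or 'pass'."""
    cutoffs = [config["coverage_block_threshold"], config["coverage_warn_threshold"]]
    # bisect_right counts how many cutoffs the coverage has reached.
    return ("block", "warn", "pass")[bisect_right(cutoffs, coverage)]


print(coverage_verdict(65, DEFAULTS))  # warn
```

&lt;p&gt;With the defaults, coverage under 50 blocks the push, coverage from 50 up to just under 70 only warns, and 70 or more passes.&lt;/p&gt;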

&lt;h2&gt;
  
  
  Exit Codes
&lt;/h2&gt;

&lt;p&gt;0 : All checks passed - push proceeding&lt;br&gt;
1 : Push blocked - CRITICAL/HIGH findings or test failures&lt;/p&gt;

&lt;h2&gt;
  
  
  File Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prepush-guardian/
├── prepush_guardian.py      # Main orchestrator
├── leak_detector.py         # Phase 1: secret &amp;amp; entropy detection
├── test_generator.py        # Phase 2: AI test generation
├── test_runner.py           # Phase 2: test execution + coverage
├── reporter.py              # Phase 3: markdown report
├── install.py               # Hook installer
├── requirements.txt
├── .env.example
├── .gitignore
├── LICENSE
├── CONTRIBUTING.md
├── architecture.excalidraw
├── infographic.svg
└── tests/
    ├── test_leak_detector.py
    └── fixtures/
        ├── sample_with_secrets.py
        └── sample_clean.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three-phase structure maps cleanly to the file names - &lt;code&gt;leak_detector.py&lt;/code&gt; handles Phase 1, &lt;code&gt;test_generator.py&lt;/code&gt; and &lt;code&gt;test_runner.py&lt;/code&gt; handle Phase 2, and &lt;code&gt;reporter.py&lt;/code&gt; handles Phase 3. &lt;code&gt;prepush_guardian.py&lt;/code&gt; orchestrates all three phases in sequence.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a production-grade Git pre-push hook that catches secrets, validates test coverage, and auto-generates missing tests - blocking the push before anything problematic reaches the remote. NEO planned, wrote, tested, and verified every file in this repository without human intervention: the main orchestrator in &lt;code&gt;prepush_guardian.py&lt;/code&gt;, the secret and entropy scanner in &lt;code&gt;leak_detector.py&lt;/code&gt; covering 20+ patterns, the AI test generator in &lt;code&gt;test_generator.py&lt;/code&gt; with OpenRouter integration and template fallback, the test runner and coverage checker in &lt;code&gt;test_runner.py&lt;/code&gt;, the markdown report generator in &lt;code&gt;reporter.py&lt;/code&gt;, the hook installer in &lt;code&gt;install.py&lt;/code&gt;, and the test suite with fixtures.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install it into every repo your team pushes from.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;python3 install.py&lt;/code&gt; once in each repository. From that point, every &lt;code&gt;git push&lt;/code&gt; runs the full three-phase check automatically, with no CI changes and no developer workflow changes. Secrets and test failures are blocked before they reach the remote.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tune the thresholds to match your team's standards.&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;.neo/config.json&lt;/code&gt; file controls coverage warn and block thresholds, whether LOW-severity findings hard-block the push, and which directories are excluded from test generation. These can be committed to the repo so the same standards apply across the whole team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the markdown report as a push audit trail.&lt;/strong&gt;&lt;br&gt;
Every push writes a report to &lt;code&gt;.neo/prepush-report.md&lt;/code&gt;. This gives you a record of what was scanned, what was found, and what was blocked, which is useful for teams with compliance requirements or for debugging why a push was blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend the detection patterns in &lt;code&gt;leak_detector.py&lt;/code&gt;.&lt;/strong&gt;&lt;br&gt;
The secret scanner covers 20+ named patterns. Adding a new pattern for a domain-specific secret type means adding it to the pattern list in &lt;code&gt;leak_detector.py&lt;/code&gt;. It is immediately active on the next push with no other changes needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;The gap between "I think this is clean" and "I know this is clean" is where prepush-guardian lives. Secrets get committed because no one checked. Tests go missing because there was no enforcement. prepush-guardian closes both gaps at the moment they matter most: before the push lands.&lt;br&gt;
The code is at &lt;a href="https://github.com/dakshjain-1616/prepush-guardian" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/prepush-guardian&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>git</category>
      <category>opensource</category>
      <category>api</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Fine-Tuning Qwen2.5-0.5B to Write SRE Post-Mortem Summaries</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 30 May 2026 04:43:37 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/fine-tuning-qwen25-05b-to-write-sre-post-mortem-summaries-2jem</link>
      <guid>https://dev.to/nilofer_tweets/fine-tuning-qwen25-05b-to-write-sre-post-mortem-summaries-2jem</guid>
      <description>&lt;p&gt;Writing post-mortem root-cause summaries is time-consuming and inconsistent. Junior SREs miss contributing factors. Senior SREs write summaries that vary in depth and structure. Zero-shot LLMs produce verbose, generic output that does not follow SRE conventions.&lt;br&gt;
Fine-tuning a small model on real incident data produces structured, concise summaries that follow your organisation's format at a fraction of the cost of a large model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo10j1ff2xwcquhknpwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo10j1ff2xwcquhknpwo.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Why This Approach
&lt;/h2&gt;

&lt;p&gt;Different approaches and what you get:&lt;/p&gt;

&lt;p&gt;Manual SRE writing: Inconsistent, time-consuming, expertise-dependent&lt;br&gt;
Zero-shot large model: Generic format, verbose, high cost per call&lt;br&gt;
Qwen2.5-0.5B (fine-tuned): SRE-format outputs, fast, cheap, runs on CPU or consumer GPU&lt;/p&gt;

&lt;p&gt;The key advantages of this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;700-sample training set of real incident timelines mapped to root-cause summaries&lt;/li&gt;
&lt;li&gt;4-bit quantized LoRA training that runs on a single consumer GPU with 8 GB of VRAM or more&lt;/li&gt;
&lt;li&gt;Evaluated against a structured rubric covering timeline reference, contributing factors, specific component, and prevention action&lt;/li&gt;
&lt;li&gt;Compared against &lt;code&gt;qwen/qwen3.6-plus:free&lt;/code&gt; and &lt;code&gt;openai/gpt-5.4-nano&lt;/code&gt; zero-shot baselines&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The HuggingFace Model
&lt;/h2&gt;

&lt;p&gt;The fine-tuned adapter is published at: &lt;code&gt;daksh-neo/postmortem-qwen2.5-0.5b-lora&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After training, the LoRA weights are saved to &lt;code&gt;models/postmortem-lora/hf_export/&lt;/code&gt; and pushed to HuggingFace.&lt;/p&gt;
&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env   &lt;span class="c"&gt;# fill in OPENROUTER_API_KEY&lt;/span&gt;
&lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; .env | xargs&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Environment Variables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Required for baseline evaluation with OpenRouter&lt;/span&gt;
&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_openrouter_api_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; is required only for running baseline evaluations against zero-shot models via OpenRouter. The fine-tuning and local evaluation steps run without it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipeline
&lt;/h2&gt;

&lt;p&gt;The full pipeline runs in four steps:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnoprfc6o8jtctn9wsrn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnoprfc6o8jtctn9wsrn.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each step is independent: you can run the baseline evaluation before fine-tuning to establish the gap the fine-tuned model has to close, then run the evaluation again afterwards to measure the improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Configuration
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb66xdhw38x54dtb6pbtu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb66xdhw38x54dtb6pbtu.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation Rubric&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every generated summary is scored against a four-criterion rubric. Each criterion carries equal weight:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F459x0v53vxhgjlelbwvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F459x0v53vxhgjlelbwvj.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pass threshold: 0.60 weighted score or above.&lt;/p&gt;
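
&lt;p&gt;A minimal sketch of how equal-weight scoring against the 0.60 threshold could work. The criterion keys and the binary met/not-met judgement per criterion are assumptions for illustration; the actual schema lives in &lt;code&gt;data/rubric.json&lt;/code&gt;.&lt;/p&gt;

```python
# Hypothetical criterion keys; the real ones are defined in data/rubric.json.
CRITERIA = (
    "timeline_reference",
    "contributing_factors",
    "specific_component",
    "prevention_action",
)
PASS_THRESHOLD = 0.60


def weighted_score(checks: dict) -> float:
    """Add up the weight of every criterion that was met (equal weights)."""
    weight = 1.0 / len(CRITERIA)
    return sum(weight for name in CRITERIA if checks.get(name, False))


def passes(checks: dict) -> bool:
    """A summary passes when its weighted score reaches the threshold."""
    return weighted_score(checks) >= PASS_THRESHOLD


example = {
    "timeline_reference": True,
    "contributing_factors": True,
    "specific_component": True,
    "prevention_action": False,
}
print(weighted_score(example), passes(example))
```

&lt;p&gt;Under this binary assumption a summary has to meet at least three of the four criteria, because two criteria only add up to 0.50.&lt;/p&gt;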

&lt;h2&gt;
  
  
  Expected Results
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;qwen/qwen3.6-plus:free&lt;/code&gt; (zero-shot) - 20–35%&lt;br&gt;
&lt;code&gt;openai/gpt-5.4-nano&lt;/code&gt; (zero-shot) - 35–50%&lt;br&gt;
Qwen2.5-0.5B (fine-tuned, 3 epochs) - &amp;gt; 60%&lt;/p&gt;

&lt;p&gt;The fine-tuned 0.5B model outperforms both zero-shot baselines on rubric compliance because it has been trained specifically on the output format the rubric measures, not on general-purpose tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  File Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ml_project_0901/
├── scrape_postmortems.py    # Data collection
├── baseline.py              # Zero-shot baseline via OpenRouter
├── finetune.py              # LoRA fine-tuning
├── eval.py                  # Evaluation + comparison
├── requirements.txt
├── .env.example
├── .gitignore
├── LICENSE
├── CONTRIBUTING.md
├── architecture.excalidraw
├── infographic.svg
├── data/
│   ├── train.jsonl          # 700 training examples
│   ├── test_100.jsonl       # 100 held-out test examples
│   ├── rubric.json          # Scoring rubric
│   └── baseline_results.jsonl
└── models/
    └── postmortem-lora/
        └── hf_export/       # Push to HuggingFace after training
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a complete fine-tuning pipeline for a small model on SRE post-mortem data, with data scraping, zero-shot baseline comparison, 4-bit LoRA fine-tuning, and structured rubric-based evaluation. NEO planned, wrote, tested, and verified every file in the repository without human intervention: the data scraper producing 700 training examples and 100 held-out test examples, the baseline evaluator running zero-shot prompts against OpenRouter models, the LoRA fine-tuning script with the full model configuration, the rubric-based evaluator producing the comparison table, and the HuggingFace export pipeline pushing the trained adapter to &lt;code&gt;daksh-neo/postmortem-qwen2.5-0.5b-lora&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to replace inconsistent manual post-mortem writing in your team.&lt;/strong&gt;&lt;br&gt;
Train on your own organisation's incident data by replacing &lt;code&gt;data/train.jsonl&lt;/code&gt; with your own pairs of incident timelines and root-cause summaries. The rubric in &lt;code&gt;data/rubric.json&lt;/code&gt; can be adapted to match your org's specific post-mortem format; the evaluation pipeline measures compliance against whatever criteria you define.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the baseline comparison to justify the fine-tuning investment.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;python baseline.py&lt;/code&gt; before fine-tuning to measure what zero-shot models produce on your data. Run &lt;code&gt;python eval.py&lt;/code&gt; after fine-tuning to see the improvement. The comparison table gives you a concrete before-and-after that makes the case for domain-specific fine-tuning over general-purpose models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the published adapter directly without retraining.&lt;/strong&gt;&lt;br&gt;
The fine-tuned LoRA adapter is available at &lt;code&gt;daksh-neo/postmortem-qwen2.5-0.5b-lora&lt;/code&gt; on HuggingFace. You can load it directly without running the training pipeline, which is useful for teams that want to evaluate the output before committing to their own fine-tuning run.&lt;/p&gt;
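
&lt;p&gt;One way to load the adapter, sketched with the &lt;code&gt;peft&lt;/code&gt; and &lt;code&gt;transformers&lt;/code&gt; libraries. The base model is read from the adapter's own config rather than hard-coded. The prompt template in &lt;code&gt;build_prompt&lt;/code&gt; is an assumption; it should be replaced with whatever format &lt;code&gt;finetune.py&lt;/code&gt; actually trained on.&lt;/p&gt;

```python
ADAPTER_ID = "daksh-neo/postmortem-qwen2.5-0.5b-lora"


def build_prompt(timeline: str) -> str:
    """Hypothetical prompt template; match the format used during training."""
    return f"Incident timeline:\n{timeline.strip()}\n\nRoot-cause summary:\n"


def summarise(timeline: str, max_new_tokens: int = 256) -> str:
    """Load the base model named in the adapter config, attach the LoRA weights, generate."""
    # Imported lazily so build_prompt stays usable without the ML stack installed.
    from peft import PeftConfig, PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    config = PeftConfig.from_pretrained(ADAPTER_ID)
    base_id = config.base_model_name_or_path
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base_model = AutoModelForCausalLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
    model.eval()

    inputs = tokenizer(build_prompt(timeline), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the generated summary is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

&lt;p&gt;Calling &lt;code&gt;summarise(...)&lt;/code&gt; downloads the base model and adapter on first use, so it needs network access and the two libraries installed.&lt;/p&gt;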

&lt;p&gt;&lt;strong&gt;Extend it to other structured generation tasks.&lt;/strong&gt;&lt;br&gt;
The four-step pipeline (scrape, baseline, fine-tune, evaluate) is domain-agnostic. Any task where structured output format matters more than general knowledge is a candidate: alert triage summaries, change request descriptions, deployment notes. Swap the training data and rubric criteria, and the rest of the pipeline runs unchanged.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Zero-shot large models produce verbose, generic post-mortem summaries that do not follow SRE conventions. A fine-tuned 0.5B model trained on 700 domain-specific examples outperforms them on every criterion of the rubric: timeline reference, contributing factors, specific component identification, and concrete prevention actions. It does so while running on a consumer GPU at a fraction of the cost per call.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/postmortem-finetune" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/postmortem-finetune&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Morph: AST-Level Refactoring Where the LLM Describes Intent, Not Code</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 23 May 2026 11:04:25 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/morph-ast-level-refactoring-where-the-llm-describes-intent-not-code-1hh6</link>
      <guid>https://dev.to/nilofer_tweets/morph-ast-level-refactoring-where-the-llm-describes-intent-not-code-1hh6</guid>
      <description>&lt;p&gt;When an LLM generates source code for a refactor, the output is a diff a reviewer must read line by line and trust blindly. There is no way to know if the model missed a reference, broke an import, or introduced a subtle logic change without reading every line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Morph&lt;/strong&gt; takes a different approach. Instead of asking the LLM to generate code, it asks the LLM to describe what to change as a structured plan of typed operations - RenameSymbol, MoveFunction, ExtractModule, and more. A reviewer reads ten structured operations in seconds and knows exactly what will change, why, and in what order. The transformation engine then validates the plan against the real codebase dependency graph, applies each operation atomically using tree-sitter AST manipulation, runs the test suite to confirm correctness, and stages clean changes for review. Failed transformations roll back automatically.&lt;/p&gt;

&lt;p&gt;The LLM's job is intent declaration, not code writing. Morph's engine handles everything else.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fandmnivkji8ox3d2p3l4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fandmnivkji8ox3d2p3l4.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Typed Plans Beat Source Code Generation
&lt;/h2&gt;

&lt;p&gt;When a refactoring is expressed as a typed plan, every operation is verifiable before it runs. The plan validator checks file existence, symbol existence, dependency conflicts, and operation conflicts against a real dependency graph. The transformer applies operations in dependency order. The verifier runs pytest after every apply - any failure triggers automatic rollback.&lt;/p&gt;

&lt;p&gt;Source code generation has none of these guarantees. A typed plan does.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pipeline
&lt;/h2&gt;

&lt;p&gt;A natural language goal enters the LLM Planner, which outputs a validated &lt;code&gt;TransformationPlan&lt;/code&gt;. The Plan Validator checks file existence, symbol existence, dependency conflicts, and operation conflicts against a NetworkX dependency graph. The Transformer applies operations in dependency order using tree-sitter AST manipulation, creating a file backup first. The Verifier runs pytest - any failure triggers automatic rollback. Clean changes are handed off to the Staging Manager via GitPython and summarised in a Report.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23xuq96apm3jeuukb56b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23xuq96apm3jeuukb56b.png" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported Operations
&lt;/h2&gt;

&lt;p&gt;Each operation is a typed Pydantic model. The LLM populates the fields; Morph validates and executes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6q4h80qjm5jbn4yb1y8d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6q4h80qjm5jbn4yb1y8d.png" alt=" " width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;
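
&lt;p&gt;To make the idea of a typed operation concrete, here is a rough sketch. It uses stdlib dataclasses as a dependency-free stand-in for the Pydantic models Morph uses, and every field name and the file path are assumptions for illustration, not Morph's actual schema.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RenameSymbol:
    """Hypothetical field layout for a rename operation."""
    file: str
    old_name: str
    new_name: str
    op: str = "RenameSymbol"

    def __post_init__(self) -> None:
        # Reject operations that could never produce valid Python,
        # before any file on disk is read or modified.
        if not self.new_name.isidentifier():
            raise ValueError(f"{self.new_name!r} is not a valid identifier")
        if self.old_name == self.new_name:
            raise ValueError("old_name and new_name are identical")


@dataclass
class TransformationPlan:
    """A goal plus the ordered list of typed operations that realise it."""
    goal: str
    operations: list = field(default_factory=list)


plan = TransformationPlan(
    goal="rename calculate_total to compute_total",
    # src/billing.py is a made-up path used only for this example.
    operations=[RenameSymbol("src/billing.py", "calculate_total", "compute_total")],
)
print(plan.operations[0])
```

&lt;p&gt;Because each operation is plain structured data, a malformed one fails at construction time instead of partway through a rewrite.&lt;/p&gt;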

&lt;h2&gt;
  
  
  How the Dependency Graph Works
&lt;/h2&gt;

&lt;p&gt;Before validating any plan, Morph parses the entire codebase with tree-sitter and builds a NetworkX dependency graph. This graph is used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detect files that import the symbol being moved or renamed&lt;/li&gt;
&lt;li&gt;Sort operations so dependencies are updated before dependents&lt;/li&gt;
&lt;li&gt;Warn when a move will cascade across downstream files&lt;/li&gt;
&lt;li&gt;Prevent circular dependency introduction from module extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what makes Morph safe to run on real codebases - the plan is validated against the actual dependency structure before a single file is touched.&lt;/p&gt;
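
&lt;p&gt;The ordering and cycle checks described above can be sketched with Python's stdlib &lt;code&gt;graphlib&lt;/code&gt;, used here in place of NetworkX to keep the example dependency-free. The import graph is made up, and the helper names are hypothetical.&lt;/p&gt;

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical import graph: each module maps to the modules it imports.
imports = {
    "app.py": {"services.py"},
    "services.py": {"models.py", "utils.py"},
    "models.py": {"utils.py"},
    "utils.py": set(),
}


def update_order(graph: dict) -> list:
    """Return modules so that every dependency comes before its dependents."""
    return list(TopologicalSorter(graph).static_order())


def introduces_cycle(graph: dict, importer: str, imported: str) -> bool:
    """Check whether adding one import edge would create a circular dependency."""
    candidate = {name: set(deps) for name, deps in graph.items()}
    candidate.setdefault(importer, set()).add(imported)
    try:
        # static_order() is lazy, so it has to be consumed to surface a cycle.
        list(TopologicalSorter(candidate).static_order())
    except CycleError:
        return True
    return False


order = update_order(imports)
print(order)
print(introduces_cycle(imports, "utils.py", "app.py"))
```

&lt;p&gt;In this graph &lt;code&gt;utils.py&lt;/code&gt; is processed first and &lt;code&gt;app.py&lt;/code&gt; last, and making &lt;code&gt;utils.py&lt;/code&gt; import &lt;code&gt;app.py&lt;/code&gt; is flagged as a cycle.&lt;/p&gt;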

&lt;h2&gt;
  
  
  Rollback Guarantee
&lt;/h2&gt;

&lt;p&gt;Every non-dry-run apply call snapshots all affected files before touching them. If pytest reports failures after transformation, Morph restores from the snapshot automatically. The workspace is always left in a clean, known-good state.&lt;/p&gt;
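
&lt;p&gt;A minimal sketch of the snapshot-and-restore pattern, assuming plain file copies into a temporary directory. The function name and its arguments are hypothetical, and the &lt;code&gt;verify&lt;/code&gt; callback stands in for a pytest run.&lt;/p&gt;

```python
import shutil
import tempfile
from pathlib import Path


def apply_with_rollback(files, transform, verify) -> bool:
    """Snapshot files, apply transform, and restore the snapshot if verify() fails."""
    snapshot_dir = Path(tempfile.mkdtemp(prefix="snapshot-"))
    backups = {}
    try:
        # Copy every affected file before anything is modified.
        for index, path in enumerate(files):
            backup = snapshot_dir / f"{index}-{Path(path).name}"
            shutil.copy2(path, backup)
            backups[Path(path)] = backup
        try:
            transform()
            ok = bool(verify())
        except Exception:
            ok = False
        if not ok:
            # Put every file back exactly as it was.
            for original, backup in backups.items():
                shutil.copy2(backup, original)
        return ok
    finally:
        shutil.rmtree(snapshot_dir, ignore_errors=True)


# Demo: a transform rejected by the verifier leaves the file untouched.
workdir = Path(tempfile.mkdtemp())
target = workdir / "module.py"
target.write_text("def calculate_total():\n    return 1\n")
succeeded = apply_with_rollback(
    [target],
    transform=lambda: target.write_text("def broken(:\n"),
    verify=lambda: False,  # stands in for a failing pytest run
)
print(succeeded, target.read_text())
```

&lt;p&gt;In a real setup, &lt;code&gt;verify&lt;/code&gt; could wrap something like &lt;code&gt;subprocess.run(["pytest"]).returncode == 0&lt;/code&gt;.&lt;/p&gt;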

&lt;h2&gt;
  
  
  Live Results
&lt;/h2&gt;

&lt;p&gt;In a real dry run against &lt;code&gt;anthropic/claude-haiku-4-5&lt;/code&gt; via OpenRouter, the LLM parsed a natural language rename goal and produced a validated &lt;code&gt;RenameSymbol&lt;/code&gt; plan in under 5 seconds. Full output and reproduction steps are in &lt;code&gt;RESULTS.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzbng722a9nge2uwlrj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzbng722a9nge2uwlrj7.png" alt=" " width="799" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -e .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For local inference, install Ollama and pull a model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama pull gemma4:e4b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For cloud backends, set the relevant environment variable:&lt;br&gt;
&lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; - OpenRouter (recommended)&lt;br&gt;
&lt;code&gt;OPENAI_API_KEY&lt;/code&gt; - OpenAI&lt;br&gt;
&lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; - Anthropic&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;Describe what you want in plain English. Morph figures out the operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph refactor --goal "rename calculate_total to compute_total" ./src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Preview the plan without touching any files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph refactor --goal "extract validation logic into validate_input()" ./src --dry-run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate and save the plan for inspection before applying:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph plan --goal "add type annotations to all functions in utils.py" ./src --output plan.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply a saved plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph refactor --plan plan.json ./src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the codebase passes its own test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph verify ./src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate a Markdown report of the last run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph report ./src --format markdown --output REFACTOR_REPORT.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Supported Models
&lt;/h2&gt;

&lt;p&gt;Morph works with any provider. OpenRouter is the recommended starting point - one API key routes to every model below without separate accounts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpo8i7vwd5oric8b85qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpo8i7vwd5oric8b85qr.png" alt=" " width="798" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The planner uses &lt;code&gt;temperature=0.1&lt;/code&gt; - low randomness produces more consistent structured output. Unknown model strings are automatically routed through OpenRouter with no &lt;code&gt;--backend&lt;/code&gt; flag required.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;morph refactor --goal "..." PATH&lt;/code&gt; - Generate plan from goal and apply it&lt;br&gt;
&lt;code&gt;morph refactor --plan FILE PATH&lt;/code&gt; - Apply a previously saved plan&lt;br&gt;
&lt;code&gt;morph refactor ... --dry-run&lt;/code&gt; - Show plan without modifying files&lt;br&gt;
&lt;code&gt;morph plan --goal "..." PATH&lt;/code&gt; - Generate and display plan only&lt;br&gt;
&lt;code&gt;morph verify PATH&lt;/code&gt; - Run the test suite and report pass/fail&lt;br&gt;
&lt;code&gt;morph report PATH&lt;/code&gt; - Generate Markdown/JSON report of last run&lt;/p&gt;

&lt;p&gt;Key flags: &lt;code&gt;--model&lt;/code&gt;, &lt;code&gt;--backend&lt;/code&gt;, &lt;code&gt;--dry-run&lt;/code&gt;, &lt;code&gt;--no-rollback&lt;/code&gt;, &lt;code&gt;--output&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;

&lt;p&gt;Clone and install in editable mode with dev dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/morph
cd morph
pip install -e ".[dev]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the full test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pytest tests/ -v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lint and type-check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ruff check morph/ &amp;amp;&amp;amp; mypy morph/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a refactoring CLI where the LLM describes intent as a structured typed plan rather than generating raw code, with AST-level execution, dependency graph validation, automatic rollback on test failure, and support for multiple LLM backends. NEO built the full implementation: the LLM Planner producing typed &lt;code&gt;TransformationPlan&lt;/code&gt; outputs with &lt;code&gt;temperature=0.1&lt;/code&gt;, the seven typed Pydantic operation models, the Plan Validator checking file existence, symbol existence, and dependency conflicts against a NetworkX graph, the Transformer applying operations in dependency order via tree-sitter AST manipulation with file backup, the Verifier running pytest with automatic snapshot rollback on failure, the Staging Manager via GitPython, the report generator, and the full CLI with all six commands and their key flags.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to refactor production codebases safely.&lt;/strong&gt;&lt;br&gt;
Instead of asking an LLM to rewrite files, describe the refactoring goal in plain English. Morph validates the plan against the real dependency graph, applies it atomically, and rolls back automatically if tests fail. The dry-run mode lets you inspect exactly what will happen before anything is touched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the saved plan workflow for team review.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;morph plan --goal "..." PATH --output plan.json&lt;/code&gt; to generate the structured plan without applying it. Share the JSON with your team for review before running the apply step. Reviewers see ten typed operations instead of a raw diff, which is faster to review and easier to reason about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it as a refactoring step in CI/CD pipelines.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;morph verify PATH&lt;/code&gt; runs the test suite and reports pass/fail with an exit code, making it composable as a CI step. Combined with &lt;code&gt;morph refactor&lt;/code&gt; and &lt;code&gt;--dry-run&lt;/code&gt;, you can build a pipeline that proposes, reviews, and applies refactors with automated test verification at every stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional operation types.&lt;/strong&gt;&lt;br&gt;
Each operation is a typed Pydantic model in the operations layer. A new operation follows the same pattern: define the Pydantic model, implement the transformer logic, and register it. The LLM Planner, Plan Validator, and CLI all pick it up automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Morph shifts refactoring from code generation to intent declaration. The LLM describes what to change in a structured, validated plan. The engine does the mechanical work. Tests confirm correctness. The result is refactoring that is auditable before it runs, verifiable after it runs, and automatically reversible if it breaks anything.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Morph" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Morph&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>ToolRouter: Switch AI Coding Tools Freely Without Losing Context</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Fri, 22 May 2026 11:13:41 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/toolrouter-switch-ai-coding-tools-freely-without-losing-context-2bo</link>
      <guid>https://dev.to/nilofer_tweets/toolrouter-switch-ai-coding-tools-freely-without-losing-context-2bo</guid>
      <description>&lt;p&gt;Every AI coding tool has its strengths. Claude Code is strong for complex multi-step tasks. Cursor is fast for inline edits. Gemini CLI is useful for quick questions. Most developers use more than one, but every time you switch, the context is gone. The new tool has no idea what you just did, what you decided, or which files are in a partial state.&lt;/p&gt;

&lt;p&gt;On top of that, there is no clear picture of what different AI tools actually cost per session, per project, or per week. You are guessing at efficiency rather than measuring it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ToolRouter&lt;/strong&gt; is a local proxy daemon that solves both problems. It maintains shared session state across multiple AI coding tools, generates Handoff Briefs when you switch between them, and tracks real token spend per tool and model, all transparently, without changing your API keys or your tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyua07jyge5sqcy6360y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyua07jyge5sqcy6360y.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;ToolRouter sits between your AI tools and their APIs as a local proxy on port 7863. Here is what happens at each stage:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - All traffic routes through the proxy.&lt;/strong&gt;&lt;br&gt;
You point each AI tool's API base URL at &lt;code&gt;localhost:7863&lt;/code&gt;. From that point, every request your tool makes passes through ToolRouter first. The proxy forwards it transparently to the real API: your API keys are unchanged and your tools behave exactly as before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - The proxy captures what matters as a side effect.&lt;/strong&gt;&lt;br&gt;
As AI responses come back through the proxy, ToolRouter reads the token counts and extracts decisions and task state from the response text using pattern matching. Statements like "let's use bcrypt" are classified as decisions. Lines like "implemented JWT validation" are classified as completed tasks. "Still need to finish the refresh logic" becomes an in-progress item. Everything is written to the SQLite state store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 - The file tracker watches the filesystem independently.&lt;/strong&gt;&lt;br&gt;
Alongside the proxy, a Watchdog-based file tracker monitors your project directories. It computes file hashes before and after each session to build an accurate list of what changed. It also scans for syntax errors, merge conflict markers, and unresolved TODOs to detect files that are in a partial state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 - When you switch tools, a Handoff Brief is generated and injected.&lt;/strong&gt;&lt;br&gt;
The Handoff Generator reads from the state store and assembles a brief - partial files first since they carry the highest risk, then in-progress tasks, then decisions and completed items. This brief is automatically injected into the first message of your new session. The receiving tool sees exactly where the last tool left off, before it writes a single line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5 - Spend is tracked on every proxied response.&lt;/strong&gt;&lt;br&gt;
Token counts from every response are accumulated and costed against current model pricing. No separate setup is needed; spend tracking is a byproduct of the same proxy pass.&lt;/p&gt;
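
&lt;p&gt;The pattern matching in Step 2 could look roughly like the sketch below. The regular expressions and category names are assumptions chosen to fit the three example statements; ToolRouter's actual rule set is not shown here.&lt;/p&gt;

```python
import re

# Hypothetical patterns, checked in order: "in progress" wins over "completed"
# so a line like "still need to finish what we implemented" is not marked done.
PATTERNS = [
    ("in_progress", re.compile(r"\b(still need to|in progress|not yet|todo)\b", re.I)),
    ("completed", re.compile(r"\b(implemented|added|created|fixed|set up)\b", re.I)),
    ("decision", re.compile(r"\b(let's use|we'll use|decided to|going with)\b", re.I)),
]


def classify(line: str):
    """Return the first matching category for a response line, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def extract_state(response_text: str) -> dict:
    """Group the classified lines of one AI response by category."""
    state = {"decision": [], "completed": [], "in_progress": []}
    for line in response_text.splitlines():
        label = classify(line)
        if label:
            state[label].append(line.strip())
    return state


response = "let's use bcrypt\nimplemented JWT validation\nStill need to finish the refresh logic"
print(extract_state(response))
```

&lt;p&gt;Keyword rules like these are cheap and deterministic, at the price of missing decisions that are phrased in unexpected ways.&lt;/p&gt;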
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -e .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - Start the daemon&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toolrouter start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts the proxy on port 7863 and the dashboard on port 7864.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - Point your AI tools at the proxy&lt;/strong&gt;&lt;br&gt;
Each tool needs its API base URL pointed at the local proxy. This is a one-time configuration per tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code: export ANTHROPIC_API_URL=http://localhost:7863/v1
Cursor: Set OpenAI API base URL to http://localhost:7863/v1 in Settings → AI
Gemini CLI: export OPENAI_API_BASE=http://localhost:7863/v1
Ollama: export OLLAMA_HOST=http://localhost:7863/api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3 - Work normally&lt;/strong&gt;&lt;br&gt;
Switch tools whenever you like. ToolRouter handles handoffs automatically.&lt;/p&gt;
&lt;h2&gt;
  
  
  Handoff Brief
&lt;/h2&gt;

&lt;p&gt;When you switch tools on the same project, ToolRouter injects a brief like this into the first message of your new session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ToolRouter Handoff — from claude-code, 5 minutes ago]

Files changed this session:
✓ src/auth.py — implemented JWT token validation
✓ src/models.py — added User model
⚠ src/api.py — PARTIALLY MODIFIED, do not use as-is

Completed:
✓ Set up authentication middleware
✓ Created database schema

In progress:
→ Implementing refresh token logic
→ Writing API documentation

Decisions made:
- Using bcrypt for password hashing
- JWT tokens expire after 24 hours
- Refresh tokens stored in Redis

⚠ Do not touch:
- src/api.py (has syntax errors)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The brief is generated from the state store - real file changes tracked by the watchdog, decisions extracted from AI responses, and partial-state detection on modified files. The receiving tool sees this at the start of the session and can immediately continue where the last tool left off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Spend Tracking
&lt;/h2&gt;

&lt;p&gt;ToolRouter reads token counts from every proxied response and calculates cost using current model pricing. Spend reports run directly from the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toolrouter spend           # Today's report
toolrouter spend --week    # This week
toolrouter spend --month   # This month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboard at &lt;code&gt;http://localhost:7864&lt;/code&gt; shows daily spend bar charts per tool, session lists with per-session cost, per-tool and per-project breakdowns, a cost-efficiency comparison of tools measured by cost per file changed, and projected monthly costs based on current pace.&lt;/p&gt;
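
&lt;p&gt;The arithmetic behind these numbers can be sketched in a few lines. This is an illustration only, not ToolRouter's code: the model names, the prices in &lt;code&gt;PRICING&lt;/code&gt;, and the function names are assumptions, and the monthly projection is a simple linear extrapolation.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; real values come from the pricing table.
PRICING = {
    "model-a": {"input": 3.00, "output": 15.00},
    "model-b": {"input": 0.50, "output": 1.50},
}


@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int


def response_cost(usage: Usage) -> float:
    """Cost of one proxied response in USD."""
    price = PRICING[usage.model]
    return (
        usage.input_tokens * price["input"]
        + usage.output_tokens * price["output"]
    ) / 1_000_000


def cost_per_file_changed(total_cost: float, files_changed: int) -> float:
    """Efficiency metric: lower means more work per dollar."""
    return total_cost / files_changed if files_changed else float("inf")


def projected_monthly(spend_so_far: float, days_elapsed: int, days_in_month: int = 30) -> float:
    """Linear projection from the current pace."""
    return spend_so_far * days_in_month / days_elapsed


session = [Usage("model-a", 12_000, 2_000), Usage("model-a", 8_000, 1_000)]
total = sum(response_cost(u) for u in session)
print(f"session cost: ${total:.4f}")
print(f"cost per file changed: ${cost_per_file_changed(total, 3):.4f}")
print(f"projected monthly: ${projected_monthly(total, 1):.2f}")
```

&lt;p&gt;Keeping the price table as plain data means a pricing change only touches one dictionary, which is presumably why the tool costs tokens "against current model pricing" rather than storing dollar amounts per response.&lt;/p&gt;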

&lt;h2&gt;
  
  
  Model Pricing (May 2026)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek6ifmbv4fh7j4jlh78l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek6ifmbv4fh7j4jlh78l.png" alt=" " width="515" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Commands
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ksgtubtf0jy30c29svt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ksgtubtf0jy30c29svt.png" alt=" " width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;State Store -&lt;/strong&gt; SQLite with WAL mode for concurrent read/write. Stores sessions, per-session file changes with MD5 hashes, extracted decisions and tasks, and generated handoff briefs. Every table links back to a session ID so the full history is queryable.&lt;/p&gt;
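
&lt;p&gt;A minimal sketch of such a store, using only Python's standard &lt;code&gt;sqlite3&lt;/code&gt; module. The table and column names are assumptions chosen for illustration, not ToolRouter's actual schema; the parts taken from the description are WAL mode, MD5 hashes per file change, and the link from every row back to a session ID.&lt;/p&gt;

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical schema: two of the tables the description implies.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    id TEXT PRIMARY KEY,
    tool TEXT NOT NULL,
    started_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS file_changes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL REFERENCES sessions(id),
    path TEXT NOT NULL,
    md5 TEXT NOT NULL
);
"""


def open_store(db_path: Path) -> sqlite3.Connection:
    """Open the store in WAL mode so readers do not block the writer."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript(SCHEMA)
    return conn


def record_change(conn, session_id: str, path: str, content: bytes) -> None:
    """Store one file change with the MD5 of the file's current contents."""
    digest = hashlib.md5(content).hexdigest()
    conn.execute(
        "INSERT INTO file_changes (session_id, path, md5) VALUES (?, ?, ?)",
        (session_id, path, digest),
    )
    conn.commit()


with tempfile.TemporaryDirectory() as tmp:
    conn = open_store(Path(tmp) / "state.db")
    conn.execute("INSERT INTO sessions VALUES ('s1', 'claude-code', '2026-05-01T10:00:00Z')")
    record_change(conn, "s1", "src/auth.py", b"print('hello')\n")
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    rows = conn.execute(
        "SELECT path, md5 FROM file_changes WHERE session_id = ?", ("s1",)
    ).fetchall()
    conn.close()

print(mode, rows)
```

&lt;p&gt;WAL mode only applies to file-backed databases, which is why the sketch writes to a temporary directory instead of using an in-memory database.&lt;/p&gt;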

&lt;p&gt;&lt;strong&gt;File Tracker -&lt;/strong&gt; Watchdog-based monitoring of project directories. Computes file hashes before and after each session to build an accurate change list. Detects partial states by scanning for syntax errors, merge conflict markers, and unresolved TODOs.&lt;/p&gt;
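
&lt;p&gt;The before-and-after hashing can be sketched as follows. This is a simplified stand-in, not the tool's implementation: it takes two snapshots instead of listening to Watchdog events, it only handles Python files, and the function names are made up for the example. Partial-state detection here means "does not parse" or "contains a TODO".&lt;/p&gt;

```python
import ast
import hashlib
import tempfile
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's relative path to the MD5 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Paths that were added or whose hash differs after the session."""
    return sorted(path for path, digest in after.items() if before.get(path) != digest)


def is_partial(source: str) -> bool:
    """Treat a Python file as partial if it does not parse or has open TODOs."""
    try:
        ast.parse(source)
    except SyntaxError:
        return True
    return "TODO" in source


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "auth.py").write_text("def validate(token):\n    return bool(token)\n")
    before = snapshot(root)
    (root / "api.py").write_text("def handler(:\n")  # deliberately broken
    after = snapshot(root)
    changed = changed_files(before, after)
    partial = [name for name in changed if is_partial((root / name).read_text())]

print(changed, partial)
```

&lt;p&gt;Deleted files are ignored in this sketch; a fuller version would also report paths present in the first snapshot and missing from the second.&lt;/p&gt;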

&lt;p&gt;&lt;strong&gt;Decision Extractor -&lt;/strong&gt; Pattern matching over AI responses to classify statements into decisions, completed tasks, in-progress work, and blockers. Phrases like "let's use" and "we'll go with" are decisions. Words like "done", "implemented", and "✓" signal completed tasks. "I've started" and "still need to" mark in-progress work. "Blocked by" and "waiting for" identify blockers.&lt;/p&gt;
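
&lt;p&gt;A small sketch of this kind of classifier. The trigger phrases come from the description above; the regular expressions, the priority order, and the line-by-line splitting are assumptions made for the example.&lt;/p&gt;

```python
import re
from typing import Optional

# Trigger phrases from the description; the regexes themselves are illustrative.
PATTERNS = {
    "decision": re.compile(r"\b(let's use|we'll go with)\b", re.IGNORECASE),
    "completed": re.compile(r"\b(done|implemented)\b|✓", re.IGNORECASE),
    "in_progress": re.compile(r"\b(i've started|still need to)\b", re.IGNORECASE),
    "blocker": re.compile(r"\b(blocked by|waiting for)\b", re.IGNORECASE),
}

# Checked in this order so "still need to get X done" counts as in progress, not completed.
PRIORITY = ("blocker", "in_progress", "decision", "completed")


def classify(sentence: str) -> Optional[str]:
    """Return the first matching label, or None for ordinary prose."""
    for label in PRIORITY:
        if PATTERNS[label].search(sentence):
            return label
    return None


def extract(response: str) -> dict:
    """Group the lines of an AI response by label, skipping unlabeled lines."""
    found = {label: [] for label in PRIORITY}
    for line in response.splitlines():
        label = classify(line)
        if label is not None:
            found[label].append(line.strip())
    return found


response = """Let's use bcrypt for password hashing.
JWT validation is implemented.
I've started on the refresh token logic.
Blocked by the missing Redis credentials.
Here is the updated file."""

result = extract(response)
for label, lines in result.items():
    print(label, lines)
```

&lt;p&gt;Plain phrase matching is cheap and deterministic, but it will misfile sentences such as "this is not done yet", so the priority order and phrase lists are the main knobs to tune.&lt;/p&gt;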

&lt;p&gt;&lt;strong&gt;Handoff Generator -&lt;/strong&gt; Assembles the brief from state store data, ordering by recency and priority: partial files first as they carry the highest risk, then in-progress tasks, then decisions and completed items.&lt;/p&gt;
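
&lt;p&gt;The ordering rule amounts to a two-part sort key: category first, recency second. The sketch below assumes a simple &lt;code&gt;Item&lt;/code&gt; record and a plain-text output format; neither is taken from the ToolRouter source.&lt;/p&gt;

```python
from dataclasses import dataclass

# Lower number = earlier in the brief; the order follows the description above.
PRIORITY = {"partial_file": 0, "in_progress": 1, "decision": 2, "completed": 3}


@dataclass
class Item:
    kind: str
    text: str
    timestamp: float  # seconds since epoch; newer items sort first within a kind


def order_items(items):
    """Sort by category priority, then most recent first."""
    return sorted(items, key=lambda item: (PRIORITY[item.kind], -item.timestamp))


def render_brief(items, source_tool):
    """Render a plain-text brief, one ordered item per line."""
    lines = [f"Handoff from {source_tool}"]
    for item in order_items(items):
        lines.append(f"{item.kind}: {item.text}")
    return "\n".join(lines)


items = [
    Item("completed", "Set up authentication middleware", 100.0),
    Item("decision", "Using bcrypt for password hashing", 110.0),
    Item("in_progress", "Implementing refresh token logic", 120.0),
    Item("partial_file", "src/api.py is partially modified", 130.0),
    Item("in_progress", "Writing API documentation", 125.0),
]
brief = render_brief(items, "claude-code")
print(brief)
```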

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Configuration is stored at &lt;code&gt;~/.toolrouter/config.json&lt;/code&gt;. The key settings are:&lt;br&gt;
&lt;code&gt;injection_enabled&lt;/code&gt; - whether to prepend handoff briefs&lt;br&gt;
&lt;code&gt;proxy_port&lt;/code&gt; - default 7863&lt;br&gt;
&lt;code&gt;dashboard_port&lt;/code&gt; - default 7864&lt;br&gt;
&lt;code&gt;log_level&lt;/code&gt; - logging verbosity&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;toolrouter config set &amp;lt;key&amp;gt; &amp;lt;value&amp;gt;&lt;/code&gt; to change any setting without editing the file directly.&lt;/p&gt;
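
&lt;p&gt;For scripts that need to read the same settings, the file can be treated as ordinary JSON. The sketch below is an assumption about how such a config is typically handled, not ToolRouter's code: the two port defaults are the documented ones, while the defaults for &lt;code&gt;injection_enabled&lt;/code&gt; and &lt;code&gt;log_level&lt;/code&gt; are guesses, and a temporary directory stands in for &lt;code&gt;~/.toolrouter&lt;/code&gt;.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path

# Ports are the documented defaults; the other two default values are assumptions.
DEFAULTS = {
    "injection_enabled": True,
    "proxy_port": 7863,
    "dashboard_port": 7864,
    "log_level": "INFO",
}


def load_config(path: Path) -> dict:
    """Merge the file's settings over the defaults; a missing file means defaults only."""
    config = dict(DEFAULTS)
    if path.exists():
        config.update(json.loads(path.read_text()))
    return config


def set_value(path: Path, key: str, value) -> dict:
    """Roughly what a config-set command has to do: validate the key, then persist."""
    if key not in DEFAULTS:
        raise KeyError(f"unknown setting: {key}")
    config = load_config(path)
    config[key] = value
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return config


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for ~/.toolrouter/config.json so the example leaves no files behind.
    config_path = Path(tmp) / ".toolrouter" / "config.json"
    initial = load_config(config_path)
    set_value(config_path, "proxy_port", 9000)
    reloaded = load_config(config_path)

print(initial["proxy_port"], reloaded["proxy_port"])
```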

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a local proxy daemon that could sit transparently between AI coding tools and their APIs, maintain shared session state across tool switches, generate structured handoff briefs automatically, and track real token spend per tool and model - all without requiring any changes to the tools themselves or the user's API keys.&lt;/p&gt;

&lt;p&gt;NEO built the full implementation: the proxy daemon running on port 7863, the SQLite state store with WAL mode, the Watchdog-based file tracker with MD5 hashing and partial state detection, the pattern-matching decision extractor, the handoff brief generator with priority ordering, the spend tracker reading token counts from proxied responses, the dashboard on port 7864, and the full CLI with all twelve commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to switch between Claude Code, Cursor, and Gemini CLI on the same project without losing context.&lt;/strong&gt;&lt;br&gt;
Point each tool at the proxy once, and every subsequent tool switch gets an automatic handoff brief. The receiving tool knows which files changed, which tasks are in progress, and which files should not be touched - without you writing a single summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the spend dashboard to measure which AI tool is most cost-efficient for your workflow.&lt;/strong&gt;&lt;br&gt;
The dashboard breaks down cost per tool, per project, and per session. The "cost per file changed" metric tells you which tool delivers the most work per dollar - a data-driven way to decide which tool to reach for on different task types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the handoff brief preview before switching.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;toolrouter handoff&lt;/code&gt; before switching tools to see exactly what brief the next tool will receive. This lets you verify the context is accurate before handing off on a complex task where a wrong assumption by the next tool could cause real damage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional tool integrations.&lt;/strong&gt;&lt;br&gt;
The proxy currently supports Claude Code, Cursor, Gemini CLI, and Ollama via their respective API base URL environment variables. Any tool that accepts an OpenAI-compatible API base URL can be pointed at the proxy using the same pattern - no changes to ToolRouter needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;ToolRouter makes multi-tool AI development practical. Context persists across tool switches through automatically generated handoff briefs. Spend is tracked in real time with model-accurate pricing. The proxy is transparent - your tools and API keys are unchanged.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Tool-Router" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Tool-Router&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Context Time Machine: Forensic Investigation of What Your Agent Actually Saw</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 16 May 2026 11:10:19 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/contexttimemachine-forensic-investigation-of-what-your-agent-actually-saw-joo</link>
      <guid>https://dev.to/nilofer_tweets/contexttimemachine-forensic-investigation-of-what-your-agent-actually-saw-joo</guid>
      <description>&lt;p&gt;Long-running agent sessions fail in a specific way that is hard to debug. The agent runs 40 turns. At turn 38, it gives a wrong answer that ignores something it decided at turn 12. You look at the logs: the turn 12 decision is there, and so is the turn 38 response. But you cannot see what the context window looked like at turn 38. Was the turn 12 decision still in context? Was it evicted? Was it there but semantically overwhelmed by 25 other turns?&lt;/p&gt;

&lt;p&gt;This is the forensic problem that ContextTimeMachine solves. Unlike real-time session monitoring, it is built for deep post-hoc investigation of a session after it has already run. The key insight behind it: the context window at any given turn is deterministic given the conversation history. You can reconstruct exactly what the model saw at turn 38, render it interactively, and query it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcatqc3xd1wiridxycd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcatqc3xd1wiridxycd3.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Investigation Modes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mode 1 - Timeline Navigator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The primary view is a vertical timeline of all turns in the session. Each turn shows the turn number, agent name if available, turn type, token count at that turn, and a sparkline showing how the context composition changed.&lt;/p&gt;

&lt;p&gt;Click any turn to travel to it - the context window at that exact point reconstructs and renders in the main panel. You see exactly what the model saw: every message in order, with token counts, with a red line showing where the context would have been truncated if it exceeded the model's limit. Scrub through turns with keyboard arrows. Watch the context window evolve turn by turn. See turns disappear as eviction happens. See tool results arrive and push older content further back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode 2 - Fact Tracker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You know something specific: a decision made at turn 5, a fact retrieved at turn 15, or a user instruction given at turn 3. You want to know: at what turn did this fact leave the context window?&lt;/p&gt;

&lt;p&gt;Enter any text snippet in the Fact Tracker search box. ContextTimeMachine embeds it locally using sentence-transformers, then searches every turn's context snapshot for the nearest matching content. It renders a presence chart (a horizontal bar across all turns, colored green where the fact is present and red where it is absent) and shows the exact turn where the fact entered context and the exact turn where it left.&lt;/p&gt;

&lt;p&gt;This answers the most common debugging question for long agent sessions: "When exactly did the agent stop knowing X?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode 3 - Divergence Finder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have two agent sessions that started identically but ended differently. One succeeded, one failed. Load both sessions and ContextTimeMachine finds the earliest turn where their context windows diverged, that is, where they started seeing different content, and highlights that turn as the likely root cause of the different outcomes.&lt;/p&gt;

&lt;p&gt;It shows a side-by-side comparison of the two context windows at the divergence point with diffed content highlighted. This is the automated version of the manual debugging process every team does when comparing "the run that worked" against "the run that didn't."&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                    ContextTimeMachine                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Frontend (React)                                               │
│  ├─ TimelineNavigator    — Turn-by-turn timeline scrubber       │
│  ├─ ContextPanel         — Renders reconstructed context        │
│  ├─ FactTracker          — Fact presence chart                  │
│  └─ DivergenceFinder     — Two-session comparison               │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  FastAPI Backend                                                │
│  ├─ /api/session/load          — Load session from file         │
│  ├─ /api/session/{id}/profile  — Get token profile              │
│  ├─ /api/session/{id}/turn/{n} — Reconstruct context at turn    │
│  ├─ /api/session/{id}/fact     — Track fact presence            │
│  ├─ /api/divergence            — Find divergence point          │
│  └─ /api/sessions              — List all sessions              │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Core Analysis Modules                                          │
│  ├─ SessionLoader        — Load from multiple formats           │
│  ├─ ContextReconstructor — Reconstruct at any turn              │
│  ├─ FactTracker          — Track presence via embeddings        │
│  ├─ DivergenceFinder     — Find divergence points               │
│  ├─ TokenAnalyzer        — Token budget analysis                │
│  └─ EmbeddingService     — Local embeddings (all-MiniLM)        │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Storage                                                        │
│  └─ SQLite DB            — Session snapshots &amp;amp; metadata         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;pip&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quick Start&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/dakshjain-1616/context-time-machine.git
&lt;span class="nb"&gt;cd &lt;/span&gt;context-time-machine

&lt;span class="c"&gt;# Create virtual environment&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate  &lt;span class="c"&gt;# On Windows: venv\Scripts\activate&lt;/span&gt;

&lt;span class="c"&gt;# Install package&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Start the server&lt;/span&gt;
timemachine serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server will open your browser automatically if it can. If it does not, open &lt;code&gt;http://localhost:8000&lt;/code&gt; yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Loading Sessions&lt;/strong&gt;&lt;br&gt;
Sessions can be loaded from two formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From LiveContext SQLite Export:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;timemachine load &lt;span class="nt"&gt;--file&lt;/span&gt; session.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;From Generic JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;timemachine load &lt;span class="nt"&gt;--file&lt;/span&gt; session.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generic JSON format expects a &lt;code&gt;turns&lt;/code&gt; array where each turn contains a &lt;code&gt;messages&lt;/code&gt; list, a &lt;code&gt;model_id&lt;/code&gt;, and a &lt;code&gt;timestamp&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"turns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"turn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are helpful."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"token_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is 2+2?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"token_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-09T10:00:00Z"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CLI Commands&lt;/strong&gt;&lt;br&gt;
The CLI covers the full workflow from loading sessions to querying them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the web interface&lt;/span&gt;
timemachine serve

&lt;span class="c"&gt;# Load a session&lt;/span&gt;
timemachine load &lt;span class="nt"&gt;--file&lt;/span&gt; session.json

&lt;span class="c"&gt;# Track fact across session&lt;/span&gt;
timemachine fact &lt;span class="nt"&gt;--session&lt;/span&gt; &amp;lt;session-id&amp;gt; &lt;span class="nt"&gt;--fact&lt;/span&gt; &lt;span class="s2"&gt;"the user prefers JSON output"&lt;/span&gt;

&lt;span class="c"&gt;# Find divergence between two sessions&lt;/span&gt;
timemachine diverge &lt;span class="nt"&gt;--session-a&lt;/span&gt; &amp;lt;id-a&amp;gt; &lt;span class="nt"&gt;--session-b&lt;/span&gt; &amp;lt;id-b&amp;gt;

&lt;span class="c"&gt;# List all stored sessions&lt;/span&gt;
timemachine sessions

&lt;span class="c"&gt;# Clear all sessions&lt;/span&gt;
timemachine clear
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every capability the CLI and web interface expose is also available as a Python library. This makes it straightforward to integrate ContextTimeMachine into evaluation pipelines or automated debugging scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;context_time_machine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SessionLoader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ContextReconstructor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;FactTracker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DivergenceFinder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TokenAnalyzer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load session
&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SessionLoader&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Reconstruct context at turn 10
&lt;/span&gt;&lt;span class="n"&gt;reconstructor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ContextReconstructor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reconstructor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reconstruct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context at turn 10: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Messages: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Utilization: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utilization_percent&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Track a fact
&lt;/span&gt;&lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FactTracker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specific decision from turn 5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fact first appeared: Turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_appeared_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fact last present: Turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_present_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Disappeared at: Turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disappeared_at_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Analyze token budget
&lt;/span&gt;&lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TokenAnalyzer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Peak tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;peak_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; at turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;peak_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Eviction turns: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eviction_turns&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find divergence between sessions
&lt;/span&gt;&lt;span class="n"&gt;session_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_b.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;finder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DivergenceFinder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;finder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Divergence at turn: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;divergence_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Supported Session Formats
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsynbb2elgx53qdn920c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsynbb2elgx53qdn920c8.png" alt=" " width="468" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Context Reconstruction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each turn N, ContextTimeMachine loads all messages from turns 0 to N and counts the total tokens using tiktoken. If the total exceeds the model's context limit, it simulates eviction using a model-specific strategy: GPT and Claude use left-truncation (oldest messages are dropped first), DeepSeek uses a sliding window with a recency bias, and Gemma uses local-global attention sampling from the middle. System messages are never evicted regardless of which strategy applies. The result is a reconstructed context with a full token breakdown, showing exactly what the model would have seen at that turn.&lt;/p&gt;
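
&lt;p&gt;The left-truncation case can be sketched in a few lines. This is a minimal illustration, not ContextTimeMachine's code: the &lt;code&gt;Message&lt;/code&gt; class, the function names, and the whitespace tokenizer standing in for tiktoken are all assumptions made for the example.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "user", or "assistant"
    content: str


def count_tokens(message: Message) -> int:
    # Stand-in tokenizer (assumption): whitespace split instead of tiktoken.
    return len(message.content.split())


def reconstruct_left_truncated(messages: list[Message], limit: int) -> list[Message]:
    """Drop the oldest non-system messages until the total fits the limit."""
    kept = list(messages)
    total = sum(count_tokens(m) for m in kept)
    while total > limit:
        # Index of the oldest message that is allowed to be evicted.
        victim = next((i for i, m in enumerate(kept) if m.role != "system"), None)
        if victim is None:
            break  # only system messages remain, and those are never evicted
        total -= count_tokens(kept.pop(victim))
    return kept


history = [
    Message("system", "You are a helpful assistant"),      # 5 tokens
    Message("user", "My deploy target is staging"),        # 5 tokens
    Message("assistant", "Noted, staging it is"),          # 4 tokens
    Message("user", "Now write the rollout plan please"),  # 6 tokens
]
context = reconstruct_left_truncated(history, limit=15)
print([m.role for m in context])  # ['system', 'assistant', 'user']
```

&lt;p&gt;In this toy session the first user message, which states the deploy target, is the one that falls out of the window. That is the kind of silent loss the reconstructed view is meant to expose.&lt;/p&gt;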

&lt;p&gt;&lt;strong&gt;Fact Tracking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each turn, ContextTimeMachine embeds the fact text using &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;. It then computes cosine similarity between that embedding and every message in the turn's reconstructed context. A fact is considered present if any message has a similarity above 0.75. Embeddings are cached for performance so repeated queries against the same session do not recompute embeddings. The output is a presence chart showing the fact's full lifecycle across the session.&lt;/p&gt;
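
&lt;p&gt;The presence check itself is a thresholded cosine similarity. The sketch below uses tiny hand-written vectors in place of real &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; embeddings, and the function names are hypothetical; only the 0.75 threshold comes from the description above.&lt;/p&gt;

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fact_present(fact_vec, context_vecs, threshold=0.75) -> bool:
    # A fact counts as present if any message in the context is similar enough.
    return any(cosine(fact_vec, vec) > threshold for vec in context_vecs)


def presence_timeline(fact_vec, contexts_by_turn, threshold=0.75) -> list[bool]:
    # One boolean per turn: the raw data behind a presence chart.
    return [fact_present(fact_vec, ctx, threshold) for ctx in contexts_by_turn]


# Toy 2-D "embeddings" (assumption): real embeddings have hundreds of dimensions.
fact = [1.0, 0.0]
contexts = [
    [[0.0, 1.0]],               # turn 0: fact not mentioned yet
    [[0.0, 1.0], [0.9, 0.1]],   # turn 1: a message close to the fact appears
    [[0.6, 0.8]],               # turn 2: that message has been evicted
]
print(presence_timeline(fact, contexts))  # [False, True, False]
```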

&lt;p&gt;&lt;strong&gt;Divergence Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For two sessions, ContextTimeMachine aligns turns and analyzes up to the minimum length of the two sessions. At each turn it reconstructs the context for both sessions, embeds all messages, and computes an average maximum cosine similarity between the two context windows. When this similarity drops below 0.85, the turn is flagged as the divergence point. The output includes a message diff at the divergence point and a summary of what changed.&lt;/p&gt;
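
&lt;p&gt;One plausible reading of "average maximum cosine similarity" is: for each message in the first window, take its best match in the second window, then average those best scores. The sketch below implements that reading with toy vectors. The function names are hypothetical, and matching in one direction only (first window against second) is an assumption rather than something the description specifies.&lt;/p&gt;

```python
import math


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def window_similarity(ctx_a, ctx_b) -> float:
    """Average, over messages in ctx_a, of the best match found in ctx_b."""
    if not ctx_a or not ctx_b:
        return 0.0
    best = [max(cosine(a, b) for b in ctx_b) for a in ctx_a]
    return sum(best) / len(best)


def find_divergence_turn(session_a, session_b, threshold=0.85):
    # zip() stops at the shorter session, i.e. the minimum length of the two.
    for turn, (ctx_a, ctx_b) in enumerate(zip(session_a, session_b)):
        if threshold > window_similarity(ctx_a, ctx_b):
            return turn
    return None  # no turn fell below the threshold


# Each session is a list of per-turn context windows of toy embeddings.
session_a = [
    [[1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
session_b = [
    [[1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [-1.0, 0.0]],
]
print(find_divergence_turn(session_a, session_b))  # 2
```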

&lt;h2&gt;
  
  
  API Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Session Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /api/session/load&lt;/code&gt; - load session from file or JSON&lt;br&gt;
&lt;code&gt;GET /api/sessions&lt;/code&gt; - list all stored sessions&lt;br&gt;
&lt;code&gt;DELETE /api/session/{id}&lt;/code&gt; - delete a session&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET /api/session/{id}/profile&lt;/code&gt; - get token profile for session&lt;br&gt;
&lt;code&gt;GET /api/session/{id}/turn/{num}&lt;/code&gt; - reconstruct context at turn&lt;br&gt;
&lt;code&gt;POST /api/session/{id}/fact&lt;/code&gt; - track fact presence&lt;br&gt;
&lt;code&gt;POST /api/divergence&lt;/code&gt; - find divergence between sessions&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Context Reconstruction: &amp;lt; 100ms for typical sessions&lt;/li&gt;
&lt;li&gt;Fact Tracking: ~1-5 seconds for full session (includes embedding)&lt;/li&gt;
&lt;li&gt;Divergence Detection: ~2-10 seconds for 2 sessions&lt;/li&gt;
&lt;li&gt;Memory: ~50-200MB per stored session (depending on size)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Dependencies
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Core&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;fastapi&lt;/strong&gt; - Web framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;uvicorn&lt;/strong&gt; - ASGI server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pydantic&lt;/strong&gt; - Data validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;click&lt;/strong&gt; - CLI framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tiktoken&lt;/strong&gt; - Token counting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sentence-transformers&lt;/strong&gt; - Local embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;numpy&lt;/strong&gt; - Numerical operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqlalchemy&lt;/strong&gt; - Database ORM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;aiofiles&lt;/strong&gt; - Async file operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;br&gt;
React, Tailwind CSS, Framer Motion, Recharts&lt;/p&gt;

&lt;h2&gt;
  
  
  Known Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Frontend is a React stub - core analysis is fully functional&lt;/li&gt;
&lt;li&gt;LangSmith format not yet implemented&lt;/li&gt;
&lt;li&gt;No streaming support for very large sessions (&amp;gt;10k turns)&lt;/li&gt;
&lt;li&gt;Embedding cache cleared on restart&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Future Enhancements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Complete React frontend with real-time updates&lt;/li&gt;
&lt;li&gt;WebSocket streaming for large sessions&lt;/li&gt;
&lt;li&gt;LangSmith format support&lt;/li&gt;
&lt;li&gt;Multi-session comparison UI&lt;/li&gt;
&lt;li&gt;Export to markdown/HTML&lt;/li&gt;
&lt;li&gt;Attention visualization&lt;/li&gt;
&lt;li&gt;Custom eviction strategy support&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a forensic debugging tool for long-running agent sessions, one that could reconstruct the exact context window at any historical turn, track when specific facts entered and left context using semantic embeddings, and find the earliest point where two divergent sessions started seeing different content. The tool needed to support multiple session formats, expose a Python API alongside the web interface, and work entirely offline with local embeddings.&lt;/p&gt;

&lt;p&gt;NEO handled all 12 specification steps autonomously, building the &lt;code&gt;SessionLoader&lt;/code&gt; with support for LiveContext SQLite, generic JSON, and raw conversation formats, the &lt;code&gt;ContextReconstructor&lt;/code&gt; with model-specific eviction strategies for GPT, Claude, DeepSeek, and Gemma, the &lt;code&gt;FactTracker&lt;/code&gt; with &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; embeddings and cosine similarity scoring, the &lt;code&gt;DivergenceFinder&lt;/code&gt; with turn-aligned context comparison, the &lt;code&gt;TokenAnalyzer&lt;/code&gt; for peak token and eviction turn detection, the FastAPI backend with the API endpoints listed above, the SQLite storage layer via SQLAlchemy, the Click CLI with all six commands, and the full 58-test suite covering all core modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to find the root cause of long-session failures.&lt;/strong&gt;&lt;br&gt;
When an agent gives a wrong answer deep into a long session, load the session into ContextTimeMachine, travel to the failure turn in the Timeline Navigator, and see exactly what was in context at that point. The reconstructed view shows every message the model saw, in order, with token counts, so you can see immediately whether the relevant context was present or had been evicted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Fact Tracker to measure context retention across your agent design.&lt;/strong&gt;&lt;br&gt;
Before settling on a context management strategy for your agent, run Fact Tracker against a set of real sessions. The presence chart for key decisions and instructions tells you at what turn they reliably drop out of context, giving you a data-driven basis for choosing context window sizes, eviction strategies, or compression approaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Divergence Finder to debug non-deterministic agent behaviour.&lt;/strong&gt;&lt;br&gt;
When two runs of the same agent with the same input produce different outcomes, load both into Divergence Finder. The tool identifies the exact turn where their context windows started differing and shows a diff of what changed, turning a difficult debugging problem into a specific, actionable finding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional session format parsers.&lt;/strong&gt;&lt;br&gt;
SessionLoader already handles three formats that follow a common interface. Adding a new format (LangSmith is listed as planned) means implementing the same loader interface for that format. It is then immediately available in the CLI, the Python API, and the web interface without touching any of the analysis modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;ContextTimeMachine makes the context window visible. Instead of inferring what the model saw from its outputs, you can reconstruct and inspect the exact context at any turn, track when specific information entered and left the window, and find where two sessions diverged. For teams debugging long-running agents, that visibility is the difference between guessing and knowing.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/ContextTimeMachine" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/ContextTimeMachine&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Agent Constitution: Policy Enforcement and PII Protection for AI Agents</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 16 May 2026 05:50:37 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/agent-constitution-policy-enforcement-and-pii-protection-for-ai-agents-ehf</link>
      <guid>https://dev.to/nilofer_tweets/agent-constitution-policy-enforcement-and-pii-protection-for-ai-agents-ehf</guid>
      <description>&lt;p&gt;AI agents are getting more capable. They can browse the web, call APIs, read and write files, and execute code. That capability is exactly what makes them useful and exactly what makes them dangerous without guardrails.&lt;/p&gt;

&lt;p&gt;Most agent safety approaches rely on prompt instructions. Tell the model not to delete files. Tell it not to send requests to untrusted URLs. Tell it not to leak PII. But instructions in a prompt are not enforceable: a sufficiently complex agent workflow, a jailbreak attempt, or just an edge case in reasoning can bypass them silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Constitution&lt;/strong&gt; is a policy enforcement framework for AI agents that enforces behavioral rules at the code level, not the prompt level. You define rules in a YAML constitution file, wrap your agent's tool calls with the enforcer, and get PII detection, audit logging, and a real-time dashboard, all without modifying your agent's core logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e3yb18v5gw0syzfztpv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e3yb18v5gw0syzfztpv.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Policy-Based Enforcement&lt;/strong&gt; - Define rules using YAML constitution files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AST-Based Expression Evaluation&lt;/strong&gt; - Safe condition evaluation without code injection risks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII Detection&lt;/strong&gt; - Regex and Ollama-powered detection of sensitive information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit Logging&lt;/strong&gt; - JSONL-based audit trail with rotation support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Dashboard&lt;/strong&gt; - FastAPI + WebSocket + React dashboard for monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLI Interface&lt;/strong&gt; - Rich command-line interface for management&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The core concept is a constitution - a YAML file that defines policies, and within each policy, rules. Each rule has a condition written as a plain expression, an action (block or notify), and a severity level. The enforcer evaluates these conditions against every tool call before it executes.&lt;/p&gt;

&lt;p&gt;Condition evaluation uses AST-based expression parsing, not &lt;code&gt;eval()&lt;/code&gt;, so a condition cannot execute arbitrary code. An expression like &lt;code&gt;tool_name in ['rm', 'unlink', 'rmdir']&lt;/code&gt; is parsed into an abstract syntax tree and evaluated safely against the tool call context.&lt;/p&gt;
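
&lt;p&gt;To make the idea concrete, here is a minimal allowlist-based evaluator built on Python's standard &lt;code&gt;ast&lt;/code&gt; module. It is a sketch of the general technique, not Agent Constitution's evaluator: &lt;code&gt;safe_eval&lt;/code&gt; and the set of permitted node types are assumptions chosen for the example.&lt;/p&gt;

```python
import ast

# Only these syntax elements may appear in a condition; everything else
# (calls, attribute access, subscripts, arithmetic, ...) is rejected.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.Compare, ast.In, ast.NotIn, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Constant, ast.List, ast.Tuple,
)

_COMPARE = {
    ast.In: lambda a, b: a in b,
    ast.NotIn: lambda a, b: a not in b,
    ast.Eq: lambda a, b: a == b,
    ast.NotEq: lambda a, b: a != b,
}


def safe_eval(expression: str, context: dict) -> bool:
    """Evaluate a condition against context without ever calling eval()."""
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"Disallowed syntax: {type(node).__name__}")
    return bool(_eval(tree.body, context))


def _eval(node, context):
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        if node.id not in context:
            raise ValueError(f"Unknown name: {node.id}")
        return context[node.id]
    if isinstance(node, (ast.List, ast.Tuple)):
        return [_eval(element, context) for element in node.elts]
    if isinstance(node, ast.UnaryOp):  # the allowlist admits only `not`
        return not _eval(node.operand, context)
    if isinstance(node, ast.BoolOp):
        values = (_eval(value, context) for value in node.values)
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.Compare):
        left = _eval(node.left, context)
        for op, comparator in zip(node.ops, node.comparators):
            right = _eval(comparator, context)
            if not _COMPARE[type(op)](left, right):
                return False
            left = right
        return True
    raise ValueError(f"Unsupported node: {type(node).__name__}")


call = {"tool_name": "rm", "approved": False}
print(safe_eval("tool_name in ['rm', 'unlink', 'rmdir']", call))  # True
```

&lt;p&gt;An expression such as &lt;code&gt;__import__('os').system('...')&lt;/code&gt; fails validation because function calls are not in the allowlist, so it is rejected before anything is evaluated.&lt;/p&gt;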

&lt;p&gt;PII detection runs as a separate layer. It can use regex patterns for common formats like email addresses, phone numbers, and SSNs, or it can use Ollama with a local model for more nuanced detection. When PII is detected in a tool's output, it can be blocked or redacted before it reaches the agent.&lt;/p&gt;
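
&lt;p&gt;The regex path can be illustrated with a few patterns from the standard &lt;code&gt;re&lt;/code&gt; module. The patterns below are deliberately simple examples written for this sketch (US-style phone numbers and SSNs only) and are not the ones shipped with the project; &lt;code&gt;detect&lt;/code&gt; and &lt;code&gt;redact&lt;/code&gt; here are standalone functions, not the library's methods.&lt;/p&gt;

```python
import re

# Illustrative patterns (assumption): real-world PII matching needs far more care.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs for every hit."""
    found = []
    for name, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((name, match.group()))
    return found


def redact(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every match of every pattern with the placeholder."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub(placeholder, text)
    return text


sample = "Contact me at john@example.com or call 555-123-4567"
print(detect(sample))
print(redact(sample))  # Contact me at [REDACTED] or call [REDACTED]
```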

&lt;p&gt;Every enforcement decision, whether allowed or blocked, is written to a JSONL audit log with a timestamp, tool name, action taken, and the specific rule that triggered. The real-time dashboard reads from this audit log via WebSocket and shows violations, enforcement statistics, and the full constitution in one view.&lt;/p&gt;
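
&lt;p&gt;JSONL simply means one JSON object per line, which makes the log cheap to append to and easy to tail. The following sketch shows the shape of such a log using only the standard library. The field names mirror the description above, but the exact schema and both function names are assumptions for illustration.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_audit_entry(log_path, tool_name, allowed, rule_name=None) -> dict:
    """Append one enforcement decision as a single JSON line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool_name": tool_name,
        "action": "allow" if allowed else "block",
        "rule_name": rule_name,  # None when no rule triggered
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_audit_entries(log_path) -> list[dict]:
    """Parse the log back into dictionaries, skipping blank lines."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

&lt;p&gt;Calling &lt;code&gt;append_audit_entry("audit.jsonl", "rm", allowed=False, rule_name="block_file_deletion")&lt;/code&gt; would add one line to the file, and &lt;code&gt;read_audit_entries("audit.jsonl")&lt;/code&gt; would return it as a dictionary.&lt;/p&gt;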

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/yourusername/agent-constitution.git
&lt;span class="nb"&gt;cd &lt;/span&gt;agent-constitution

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Create a Constitution&lt;/strong&gt;&lt;br&gt;
Start with a sample constitution to see the format, or create an empty one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a sample constitution&lt;/span&gt;
agent-constitution init &lt;span class="nt"&gt;--sample&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; my_constitution.yaml

&lt;span class="c"&gt;# Or create an empty one&lt;/span&gt;
agent-constitution init &lt;span class="nt"&gt;-o&lt;/span&gt; my_constitution.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Validate Your Constitution&lt;/strong&gt;&lt;br&gt;
Before using it, validate that the YAML is well-formed and the expressions are safe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-constitution validate my_constitution.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Test Policy Enforcement&lt;/strong&gt;&lt;br&gt;
Check whether a specific tool call would be allowed or blocked before running it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-constitution check &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;--arg&lt;/span&gt; &lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/test &lt;span class="nt"&gt;--constitution&lt;/span&gt; my_constitution.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Start the Dashboard&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-constitution dashboard &lt;span class="nt"&gt;--constitution&lt;/span&gt; my_constitution.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;code&gt;http://localhost:8000&lt;/code&gt; in your browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Using the @enforce Decorator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The simplest integration is wrapping tool functions with the &lt;code&gt;@enforce&lt;/code&gt; decorator. The enforcer checks the call against the constitution before the function executes; if a rule blocks the call, a &lt;code&gt;PolicyViolationError&lt;/code&gt; is raised before the function body runs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
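&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_constitution&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Constitution&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enforcer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PolicyViolationError&lt;/span&gt;  &lt;span class="c1"&gt;# assumes the exception is exported at package level&lt;/span&gt;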

&lt;span class="c1"&gt;# Load constitution
&lt;/span&gt;&lt;span class="n"&gt;constitution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Constitution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_yaml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_constitution.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;enforcer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Enforcer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;constitution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;constitution&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@enforcer.enforce&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;delete_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Delete a file.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This will be blocked if rm/delete operations are restricted
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;delete_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/test.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;PolicyViolationError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Blocked: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Manual Policy Checking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For cases where you need to check a tool call without decorating a function (for example, when the tool call is constructed dynamically), the enforcer exposes a &lt;code&gt;check&lt;/code&gt; method directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_constitution&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Constitution&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enforcer&lt;/span&gt;

&lt;span class="n"&gt;constitution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Constitution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_yaml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_constitution.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;enforcer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Enforcer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;constitution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;constitution&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Check a tool call
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;enforcer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;extra_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blocked&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Blocked by rule: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;violations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;rule_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PII Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The PII detector can be used standalone - detect PII in any text, or redact it before it leaves the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_constitution.rules.pii_detector&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PIIDetector&lt;/span&gt;

&lt;span class="n"&gt;detector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PIIDetector&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Detect PII in text
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Contact me at john@example.com or call 555-123-4567&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pattern_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matched_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Redact PII
&lt;/span&gt;&lt;span class="n"&gt;redacted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;redact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redacted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "Contact me at [REDACTED] or call [REDACTED]"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Audit Logging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The audit logger writes every enforcement decision to a JSONL file and supports log rotation. Logs can be read back programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_constitution.audit&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AuditLogger&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AuditLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./audit.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Log an event
&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;block&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;violations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rule_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;block_file_deletion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read logs
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Constitution Format&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The constitution is a YAML file with versioning, named policies, and rules within each policy. Each rule has a name, a condition expression, an action, and a severity level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0"&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Agent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Constitution"&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Security&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;policies&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;my&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent"&lt;/span&gt;

&lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool_restrictions&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Restrict&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;access&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dangerous&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tools"&lt;/span&gt;
    &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;block_file_deletion&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prevent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;deletion&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;operations"&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;['rm',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'unlink',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'rmdir']"&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;block&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restrict_network_access&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Limit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;unrestricted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;network&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;access"&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'curl'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;context.get('approved',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;False)"&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notify&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data_protection&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Protect&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sensitive&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data"&lt;/span&gt;
    &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pii_detection&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Detect&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;protect&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PII&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;outputs"&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pii_detected&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;True"&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;block&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high&lt;/span&gt;

&lt;span class="na"&gt;pii_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;patterns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ssn"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phone"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;use_ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;ollama_model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b"&lt;/span&gt;
  &lt;span class="na"&gt;ollama_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434"&lt;/span&gt;

&lt;span class="na"&gt;audit_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;log_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./audit_logs.jsonl"&lt;/span&gt;
  &lt;span class="na"&gt;max_file_size_mb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
  &lt;span class="na"&gt;retention_days&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;priority&lt;/code&gt; field controls the order in which policies are evaluated: policies with a higher value run first, so &lt;code&gt;tool_restrictions&lt;/code&gt; (10) is checked before &lt;code&gt;data_protection&lt;/code&gt; (5). The &lt;code&gt;action&lt;/code&gt; field is either &lt;code&gt;block&lt;/code&gt;, which raises a &lt;code&gt;PolicyViolationError&lt;/code&gt;, or &lt;code&gt;notify&lt;/code&gt;, which logs the event but allows the call through.&lt;/p&gt;
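
&lt;p&gt;The ordering and the two actions can be sketched in a few lines of Python. This is a minimal illustration, not the framework's own code: the &lt;code&gt;Rule&lt;/code&gt; and &lt;code&gt;Policy&lt;/code&gt; dataclasses and the &lt;code&gt;check&lt;/code&gt; function are hypothetical, and conditions are plain Python predicates rather than expression strings.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class PolicyViolationError(Exception):
    """Stand-in for the framework's exception of the same name."""


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict], bool]  # assumption: a predicate, not an expression string
    action: str  # "block" or "notify"


@dataclass
class Policy:
    name: str
    priority: int
    rules: List[Rule] = field(default_factory=list)


def check(policies, context):
    """Evaluate policies from highest to lowest priority and return notify hits."""
    notifications = []
    for policy in sorted(policies, key=lambda p: p.priority, reverse=True):
        for rule in policy.rules:
            if not rule.condition(context):
                continue
            if rule.action == "block":
                # A block stops evaluation immediately.
                raise PolicyViolationError(f"{policy.name}.{rule.name}")
            # A notify is recorded and the call is allowed to continue.
            notifications.append(f"{policy.name}.{rule.name}")
    return notifications


policies = [
    Policy("data_protection", 5, [
        Rule("pii_detection", lambda c: c.get("pii_detected") is True, "block"),
    ]),
    Policy("tool_restrictions", 10, [
        Rule("block_file_deletion", lambda c: c["tool_name"] in ("rm", "unlink", "rmdir"), "block"),
        Rule("restrict_network_access",
             lambda c: c["tool_name"] == "curl" and not c.get("approved", False), "notify"),
    ]),
]
```

&lt;p&gt;With these sample policies, an unapproved &lt;code&gt;curl&lt;/code&gt; call returns one notification, while an &lt;code&gt;rm&lt;/code&gt; call raises from &lt;code&gt;tool_restrictions&lt;/code&gt; before &lt;code&gt;data_protection&lt;/code&gt; is ever consulted.&lt;/p&gt;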

&lt;h2&gt;
  
  
  CLI Commands
&lt;/h2&gt;

&lt;p&gt;The CLI covers the full lifecycle from creating and validating a constitution to inspecting audit logs and testing expressions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize a constitution&lt;/span&gt;
agent-constitution init &lt;span class="nt"&gt;--sample&lt;/span&gt;

&lt;span class="c"&gt;# Validate a constitution&lt;/span&gt;
agent-constitution validate my_constitution.yaml

&lt;span class="c"&gt;# Display constitution contents&lt;/span&gt;
agent-constitution show my_constitution.yaml

&lt;span class="c"&gt;# Check if a tool call would be allowed&lt;/span&gt;
agent-constitution check &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;--arg&lt;/span&gt; &lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/test &lt;span class="nt"&gt;--constitution&lt;/span&gt; my_constitution.yaml

&lt;span class="c"&gt;# Start the dashboard&lt;/span&gt;
agent-constitution dashboard &lt;span class="nt"&gt;--constitution&lt;/span&gt; my_constitution.yaml

&lt;span class="c"&gt;# View audit logs&lt;/span&gt;
agent-constitution audit &lt;span class="nt"&gt;--log-path&lt;/span&gt; ./audit.jsonl

&lt;span class="c"&gt;# Show statistics&lt;/span&gt;
agent-constitution stats &lt;span class="nt"&gt;--constitution&lt;/span&gt; my_constitution.yaml

&lt;span class="c"&gt;# Test expression evaluation&lt;/span&gt;
agent-constitution eval-expr &lt;span class="s2"&gt;"x &amp;gt; 5"&lt;/span&gt; &lt;span class="nt"&gt;--context&lt;/span&gt; &lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Dashboard
&lt;/h2&gt;

&lt;p&gt;The dashboard provides real-time monitoring via FastAPI, WebSocket, and a React frontend. It shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Policy violations&lt;/li&gt;
&lt;li&gt;Audit logs&lt;/li&gt;
&lt;li&gt;Constitution rules and policies&lt;/li&gt;
&lt;li&gt;Enforcement statistics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open &lt;code&gt;http://localhost:8000&lt;/code&gt; after starting with &lt;code&gt;agent-constitution dashboard&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent_constitution/
├── constitution.py      # Pydantic models and YAML handling
├── enforcer.py          # Policy enforcement and @enforce decorator
├── audit.py            # JSONL audit logging
├── cli.py              # Click CLI interface
├── rules/
│   ├── evaluator.py    # AST-based expression evaluation
│   └── pii_detector.py # PII detection with regex/Ollama
└── dashboard/
    ├── server.py       # FastAPI + WebSocket server
    └── frontend/       # React + Tailwind dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each module has a single responsibility: &lt;code&gt;constitution.py&lt;/code&gt; handles Pydantic models and YAML parsing, &lt;code&gt;enforcer.py&lt;/code&gt; owns the &lt;code&gt;@enforce&lt;/code&gt; decorator and manual check logic, &lt;code&gt;audit.py&lt;/code&gt; handles JSONL writing and rotation, and the &lt;code&gt;rules/&lt;/code&gt; directory separates expression evaluation from PII detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;The project ships with 84 unit tests, all passing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run all tests&lt;/span&gt;
pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;

&lt;span class="c"&gt;# All tests passing: 84/84 ✓&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tests cover constitution loading and YAML parsing, policy enforcement with the &lt;code&gt;@enforce&lt;/code&gt; decorator, manual policy checking, regex-based PII pattern detection, audit logging with rotation, expression evaluation and security validation, and rule violation tracking and statistics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install development dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;

&lt;span class="c"&gt;# Run tests&lt;/span&gt;
pytest

&lt;span class="c"&gt;# Run specific test file&lt;/span&gt;
pytest tests/test_evaluator.py &lt;span class="nt"&gt;-v&lt;/span&gt;

&lt;span class="c"&gt;# Run linting&lt;/span&gt;
flake8 agent_constitution

&lt;span class="c"&gt;# Run type checking&lt;/span&gt;
mypy agent_constitution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a policy enforcement framework for AI agents, one that enforces behavioral rules at the code level rather than relying on prompt instructions, with PII detection, a JSONL audit trail, and a real-time monitoring dashboard. NEO implemented the full system across 10 implementation steps, resulting in a production-ready framework with 84 tests passing.&lt;/p&gt;

&lt;p&gt;NEO built the Pydantic constitution models and YAML handling in &lt;code&gt;constitution.py&lt;/code&gt;, the policy enforcer with the &lt;code&gt;@enforce&lt;/code&gt; decorator in &lt;code&gt;enforcer.py&lt;/code&gt;, the AST-based expression evaluator in &lt;code&gt;rules/evaluator.py&lt;/code&gt;, the regex and Ollama-powered PII detector in &lt;code&gt;rules/pii_detector.py&lt;/code&gt;, the JSONL audit logger with rotation in &lt;code&gt;audit.py&lt;/code&gt;, the Click CLI with all eight commands in &lt;code&gt;cli.py&lt;/code&gt;, and the FastAPI and WebSocket dashboard server with the React and Tailwind frontend.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to enforce safety rules across any agent's tool calls.&lt;/strong&gt;&lt;br&gt;
Wrap any tool function with &lt;code&gt;@enforcer.enforce&lt;/code&gt; and define the rules in a YAML constitution. The enforcement happens at the code level, not in the prompt, so it cannot be bypassed by the agent's reasoning or by jailbreak attempts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the audit log to build an observability layer for your agents.&lt;/strong&gt;&lt;br&gt;
Every enforcement decision lands in a JSONL file with a timestamp, tool name, action, and triggering rule. This gives you a structured, queryable record of everything your agent tried to do, whether it was allowed or blocked, which is useful for debugging unexpected agent behaviour and for meeting compliance requirements.&lt;/p&gt;
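
&lt;p&gt;Because each line of the log is a standalone JSON object, a query can be a short script. The sketch below counts blocked calls per tool. The field names &lt;code&gt;tool_name&lt;/code&gt;, &lt;code&gt;action&lt;/code&gt;, and &lt;code&gt;rule&lt;/code&gt; are assumptions made for illustration, so check them against the entries your installation actually writes.&lt;/p&gt;

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def blocked_by_tool(log_path):
    """Count blocked calls per tool in a JSONL audit log."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            if entry.get("action") == "block":
                counts[entry.get("tool_name", "unknown")] += 1
    return counts


# Hypothetical sample entries; real field names may differ.
sample_entries = [
    {"timestamp": "2026-06-01T10:00:00Z", "tool_name": "rm",
     "action": "block", "rule": "block_file_deletion"},
    {"timestamp": "2026-06-01T10:00:05Z", "tool_name": "curl",
     "action": "notify", "rule": "restrict_network_access"},
    {"timestamp": "2026-06-01T10:00:09Z", "tool_name": "rm",
     "action": "block", "rule": "block_file_deletion"},
]

with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "audit_logs.jsonl"
    log_file.write_text(
        "\n".join(json.dumps(entry) for entry in sample_entries) + "\n",
        encoding="utf-8",
    )
    blocked = blocked_by_tool(log_file)

print(blocked.most_common())
```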

&lt;p&gt;&lt;strong&gt;Use PII detection as a standalone layer before agent outputs reach users.&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;PIIDetector&lt;/code&gt; works independently of the enforcer. You can run it on any text (agent responses, tool outputs, retrieved documents) before that text is displayed or stored, and redact sensitive information automatically.&lt;/p&gt;
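
&lt;p&gt;To give a feel for the regex side of this, here is a rough standalone sketch. It does not use the project's &lt;code&gt;PIIDetector&lt;/code&gt; class: the &lt;code&gt;PII_PATTERNS&lt;/code&gt; table and the &lt;code&gt;redact&lt;/code&gt; function are hypothetical, and the patterns are deliberately simple (US-style SSN and phone formats only).&lt;/p&gt;

```python
import re

# Hypothetical pattern table keyed by the names used in pii_config.patterns.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text, enabled=("email", "ssn", "phone")):
    """Replace every match of the enabled patterns with a labelled placeholder."""
    for name in enabled:
        text = PII_PATTERNS[name].sub(f"[REDACTED:{name.upper()}]", text)
    return text


print(redact("Mail jane@example.com or call 555-123-4567"))
```

&lt;p&gt;Simple patterns like these miss many real-world formats, which is presumably why the constitution also offers the Ollama-backed option.&lt;/p&gt;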

&lt;p&gt;&lt;strong&gt;Extend it with custom PII patterns.&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;pii_config&lt;/code&gt; section of the constitution accepts a &lt;code&gt;patterns&lt;/code&gt; list. New regex patterns for domain-specific sensitive data can be added to the constitution file without touching any code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional rule conditions.&lt;/strong&gt;&lt;br&gt;
The AST-based evaluator supports arithmetic, comparisons, and context dictionary access. New conditions that reference additional context fields work immediately once those fields are passed as &lt;code&gt;extra_context&lt;/code&gt; in the enforcer's &lt;code&gt;check&lt;/code&gt; call.&lt;/p&gt;
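
&lt;p&gt;For readers unfamiliar with the technique, AST-based evaluation means parsing the expression with Python's &lt;code&gt;ast&lt;/code&gt; module and walking only an allow-list of node types, instead of handing the string to &lt;code&gt;eval&lt;/code&gt;. The sketch below is an independent illustration, not the code in &lt;code&gt;rules/evaluator.py&lt;/code&gt;: &lt;code&gt;safe_eval&lt;/code&gt; is a hypothetical name, context fields are looked up as bare names, and function or method calls are rejected outright.&lt;/p&gt;

```python
import ast
import operator

# Allow-listed operators; anything not listed here is rejected.
_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
}
_CMP_OPS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.In: lambda a, b: a in b,
    ast.NotIn: lambda a, b: a not in b,
}


def safe_eval(expression, context):
    """Evaluate a small expression language against a context dict."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return context[node.id]  # unknown names raise KeyError
        if isinstance(node, (ast.List, ast.Tuple)):
            return [visit(item) for item in node.elts]
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not visit(node.operand)
        if isinstance(node, ast.BoolOp):
            # Simplification: operands are all evaluated, with no short-circuiting.
            values = [bool(visit(value)) for value in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.Compare):
            left = visit(node.left)
            for op, comparator in zip(node.ops, node.comparators):
                if type(op) not in _CMP_OPS:
                    raise ValueError(f"unsupported comparison: {type(op).__name__}")
                right = visit(comparator)
                if not _CMP_OPS[type(op)](left, right):
                    return False
                left = right  # supports chained comparisons
            return True
        # Calls, attribute access, lambdas and everything else end up here.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return visit(ast.parse(expression, mode="eval"))


print(safe_eval("tool_name in ['rm', 'unlink', 'rmdir']", {"tool_name": "rm"}))
```

&lt;p&gt;The useful property is the final &lt;code&gt;raise&lt;/code&gt;: any syntax the walker does not explicitly recognise fails closed, so adding a new context field needs no evaluator change while adding a new capability does.&lt;/p&gt;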

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Agent Constitution shifts AI agent safety from instructions to enforcement. Rules defined in a YAML file are evaluated at the code level on every tool call before the tool executes, so the safety layer is not part of the agent's reasoning but a hard boundary around it.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Agent-Constitution" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Agent-Constitution&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>security</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>ASR Evaluation Framework: Benchmarking Speech Recognition Models Across Accuracy, Speed, and Robustness</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Fri, 15 May 2026 19:53:04 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/asr-evaluation-framework-benchmarking-speech-recognition-models-across-accuracy-speed-and-5gcn</link>
      <guid>https://dev.to/nilofer_tweets/asr-evaluation-framework-benchmarking-speech-recognition-models-across-accuracy-speed-and-5gcn</guid>
      <description>&lt;p&gt;Picking an ASR model for production is not straightforward. Whisper might be the most accurate for general English but too slow for real-time use. Wav2Vec2 might be fast enough for edge devices but struggle with accented speech. Distil-Whisper might hit the sweet spot for your use case, or it might not. Without a systematic benchmark across your actual conditions, you are guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ASR Evaluation Framework&lt;/strong&gt; is an enterprise-grade benchmarking tool that answers the questions that matter before you commit to a model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which ASR model is most accurate for my use case?&lt;/li&gt;
&lt;li&gt;How fast does each model process audio relative to real time?&lt;/li&gt;
&lt;li&gt;How robust is each model against background noise, accents, and degraded audio?&lt;/li&gt;
&lt;li&gt;What are the tradeoffs between speed and accuracy?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feypaypp5aspa9sc18zcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feypaypp5aspa9sc18zcu.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5 ASR Models&lt;/strong&gt;: IBM Granite, OpenAI Whisper, NVIDIA Canary, Distil-Whisper, Wav2Vec2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive Metrics&lt;/strong&gt;: WER, CER, Accuracy, RTF, and Inference Time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15+ Test Scenarios&lt;/strong&gt;: Clean speech, background noise, accents, fast/slow speech, technical terms, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Evaluation Modes&lt;/strong&gt;: Speed, accuracy, or complete evaluation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON Output Schema&lt;/strong&gt;: Standardized metrics schema for result storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│         run_evaluation.py (CLI Entry)               │
├────────────┬──────────────┬──────────────┬──────────┤
│ --accuracy │ --speed      │ --all        │ Config   │
│ Evaluate   │ Evaluate RTF │ Complete     │ Loading  │
│ WER/CER    │ &amp;amp; Inference  │ Evaluation   │          │
└────────────┴──────────────┴──────────────┴──────────┘
              │
      ┌───────▼────────┐
      │   Evaluator    │
      │  - Load models │
      │  - Test audio  │
      │  - Calc metrics│
      └───────┬────────┘
              │
     ┌────────┼────────┐
     │        │        │
┌────▼──┐┌────▼──┐┌────▼──┐
│Granite││Whisper││ Wav2V │  ... 5 models
│ Model ││ Model ││ Model │
└────┬──┘└────┬──┘└────┬──┘
     └────────┼────────┘
              │
      ┌───────▼───────────┐
      │  Metrics Engine   │
      │ - WER/CER calc    │
      │ - RTF calc        │
      │ - Accuracy calc   │
      │ - Aggregation     │
      └───────┬───────────┘
              │
      ┌───────▼──────────┐
      │ JSON Results     │
      │ with schema      │
      └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Model Comparison Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ae75i13bofn51wtd978.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ae75i13bofn51wtd978.png" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Dimensions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Accuracy Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WER&lt;/strong&gt;: Word Error Rate. The number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference.&lt;br&gt;
&lt;strong&gt;CER&lt;/strong&gt;: Character Error Rate. The same calculation at character level, for more detailed analysis.&lt;br&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: 100% minus WER expressed as a percentage, so a WER of 0.0477 corresponds to 95.23% accuracy.&lt;/p&gt;
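
&lt;p&gt;The framework relies on &lt;code&gt;jiwer&lt;/code&gt; for these numbers. To make the arithmetic concrete, here is a dependency-free sketch of WER as a word-level edit distance. The function names are hypothetical, there is no text normalization, and clamping accuracy at zero is an assumption, since WER can exceed 1.0 when the transcript contains many insertions.&lt;/p&gt;

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the distance between the ref words seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


def accuracy_percent(wer):
    """Accuracy as 100% minus WER, clamped at zero (assumption)."""
    return max(0.0, 100.0 * (1.0 - wer))


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(round(wer, 4), round(accuracy_percent(wer), 2))
```

&lt;p&gt;CER follows the same recurrence with characters in place of words.&lt;/p&gt;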

&lt;p&gt;&lt;strong&gt;Speed Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RTF&lt;/strong&gt;: Real-Time Factor. Inference time divided by audio duration. Below 1.0 means the model transcribes faster than the audio plays and is real-time capable. Above 1.0 means transcription takes longer than the audio itself.&lt;br&gt;
&lt;strong&gt;Inference Time&lt;/strong&gt;: Absolute seconds to transcribe the audio.&lt;/p&gt;
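
&lt;p&gt;As a sketch of how the two speed metrics relate, the snippet below times a single transcription call and derives RTF from the sample count. The helper names and the lambda standing in for a model are hypothetical. A real benchmark would also warm the model up and average over several runs.&lt;/p&gt;

```python
import time


def real_time_factor(inference_seconds, audio_seconds):
    """RTF = inference time / audio duration."""
    if audio_seconds == 0:
        raise ValueError("audio duration must be non-zero")
    return inference_seconds / audio_seconds


def timed_transcription(transcribe, samples, sample_rate):
    """Time one call to a transcribe function and report inference time and RTF."""
    audio_seconds = len(samples) / sample_rate
    start = time.perf_counter()
    text = transcribe(samples)
    inference_seconds = time.perf_counter() - start
    return {
        "text": text,
        "inference_time": inference_seconds,
        "rtf": real_time_factor(inference_seconds, audio_seconds),
    }


# Hypothetical stand-in for a model: two seconds of silence at 16 kHz.
result = timed_transcription(lambda samples: "hello world", [0.0] * 32000, 16000)
print(result["text"], result["rtf"])
```

&lt;p&gt;For example, 2.34 seconds of inference on a 2-second clip gives an RTF of 1.17, slightly slower than real time.&lt;/p&gt;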

&lt;p&gt;&lt;strong&gt;Robustness Testing&lt;/strong&gt;&lt;br&gt;
15 test scenarios covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean speech - baseline accuracy testing&lt;/li&gt;
&lt;li&gt;Background noise - office and street environments&lt;/li&gt;
&lt;li&gt;Accented English&lt;/li&gt;
&lt;li&gt;Fast and slow speech rates&lt;/li&gt;
&lt;li&gt;Technical vocabulary&lt;/li&gt;
&lt;li&gt;Whispered speech&lt;/li&gt;
&lt;li&gt;Phone quality audio&lt;/li&gt;
&lt;li&gt;Numbers and acronyms&lt;/li&gt;
&lt;li&gt;And more scenarios&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create virtual environment&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requires Python 3.10+. Core dependencies:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;librosa&lt;/code&gt; - Audio processing&lt;br&gt;
&lt;code&gt;numpy, scipy&lt;/code&gt; - Numerical computing&lt;br&gt;
&lt;code&gt;transformers&lt;/code&gt; - HuggingFace model loading&lt;br&gt;
&lt;code&gt;jiwer&lt;/code&gt; - WER and CER calculation&lt;br&gt;
&lt;code&gt;soundfile&lt;/code&gt; - Audio file I/O&lt;br&gt;
&lt;code&gt;pytest&lt;/code&gt; - Testing framework&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Run Complete Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Runs accuracy and speed evaluation across all five models against all 15 test scenarios:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python run_evaluation.py &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run Accuracy Evaluation Only&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python run_evaluation.py &lt;span class="nt"&gt;--accuracy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run Speed Evaluation Only&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python run_evaluation.py &lt;span class="nt"&gt;--speed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Specify Custom Paths&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python run_evaluation.py &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-path&lt;/span&gt; ./my_data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-path&lt;/span&gt; ./my_results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results and Output
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Console Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is what a complete evaluation run looks like in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;============================================================
ASR EVALUATION FRAMEWORK v1.0.0
============================================================

=== RUNNING COMPLETE EVALUATION (ACCURACY + SPEED) ===

Evaluating Whisper...
Evaluating Wav2Vec2...
Evaluating Distil-Whisper...
Evaluating Canary...
Evaluating Granite...

✓ Results saved to: results/asr_eval_results_all_20260513_123045.json

============================================================
EVALUATION SUMMARY
============================================================

Model: Whisper
  Status: ✓ OK
  Mean Accuracy: 95.23%
  Mean WER: 0.0477

Model: Wav2Vec2
  Status: ✓ OK
  Mean Accuracy: 91.45%
  Mean WER: 0.0855

Model: Distil-Whisper
  Status: ✓ OK
  Mean Accuracy: 93.78%
  Mean WER: 0.0622
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;JSON Output Format&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Results are saved as structured JSON to &lt;code&gt;results/asr_eval_results_{type}_{timestamp}.json&lt;/code&gt;. The schema includes evaluation metadata, per-model aggregate metrics, and per-scenario test results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"evaluation_metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-13T12:30:45.123Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"evaluator_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"models_tested"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Whisper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Wav2Vec2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Distil-Whisper"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"test_scenarios"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"evaluation_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"all"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Whisper"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Whisper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai/whisper-base"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"initialized"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"aggregate_metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mean_accuracy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;95.23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mean_wer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0477&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mean_cer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0234&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mean_rtf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"std_wer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0145&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"test_results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"test_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"test_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"clean_english"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"wer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.032&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"cer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.015&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;96.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"inference_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"rtf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.17&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"evaluation_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"all"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"completed"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The per-scenario &lt;code&gt;test_results&lt;/code&gt; array shows exactly how each model performed on each specific condition, not just aggregated averages, which is what makes it useful for production decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Environment variables, documented in &lt;code&gt;.env.example&lt;/code&gt;:&lt;br&gt;
&lt;code&gt;HUGGINGFACE_TOKEN&lt;/code&gt; : HuggingFace API token for model loading&lt;br&gt;
&lt;code&gt;OPENAI_API_KEY&lt;/code&gt; : OpenAI API key&lt;br&gt;
&lt;code&gt;ASR_EVAL_DATA_PATH&lt;/code&gt; : data directory path&lt;br&gt;
&lt;code&gt;ASR_EVAL_RESULTS_PATH&lt;/code&gt; : results output path&lt;br&gt;
&lt;code&gt;VERBOSE&lt;/code&gt; : enable verbose logging&lt;/p&gt;
&lt;h2&gt;
  
  
  Test Matrix
&lt;/h2&gt;

&lt;p&gt;15 test scenarios covering four categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean Speech&lt;/strong&gt; - Baseline accuracy testing&lt;br&gt;
&lt;strong&gt;Robustness&lt;/strong&gt; - Background noise, accents, variable speech rates&lt;br&gt;
&lt;strong&gt;Challenging Conditions&lt;/strong&gt; - Whispered speech, music, phone quality audio&lt;br&gt;
&lt;strong&gt;Domain-Specific&lt;/strong&gt; - Technical vocabulary, numbers, acronyms&lt;/p&gt;
&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Accuracy Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WER (Word Error Rate)&lt;/strong&gt; - Share of words that differ from the reference (substitutions, deletions, and insertions divided by the reference word count)&lt;br&gt;
&lt;strong&gt;CER (Character Error Rate)&lt;/strong&gt; - The same ratio computed over characters&lt;br&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt; - 100% minus WER, expressed as a percentage (a WER of 0.032 gives 96.8)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RTF (Real-Time Factor)&lt;/strong&gt; - Inference time divided by audio duration. Below 1.0 is real-time capable.&lt;br&gt;
&lt;strong&gt;Inference Time&lt;/strong&gt; - Total time to transcribe audio in seconds&lt;/p&gt;
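&lt;p&gt;The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration using the standard edit-distance formulation of WER and CER; the function names are hypothetical, and whether &lt;code&gt;metrics.py&lt;/code&gt; computes them in exactly this way (for example, with or without text normalization) is an assumption.&lt;/p&gt;

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction of the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def accuracy(wer_value):
    """Accuracy in percent: 100 minus WER, floored at zero."""
    return max(0.0, 100.0 * (1.0 - wer_value))


def rtf(inference_seconds, audio_seconds):
    """Real-time factor: inference time divided by audio duration."""
    if not audio_seconds > 0:
        raise ValueError("audio duration must be positive")
    return inference_seconds / audio_seconds
```

&lt;p&gt;Because insertions count as errors, WER can exceed 1.0 on very poor transcripts, which is why the accuracy helper floors its result at zero.&lt;/p&gt;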
&lt;h2&gt;
  
  
  Model Details
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0f31xqzg2m6a8bukj6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0f31xqzg2m6a8bukj6e.png" alt=" " width="406" height="228"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  When to Use This Framework
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Benchmarking ASR models before production deployment&lt;/strong&gt; : run a full evaluation before committing to a model, not after.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparing model tradeoffs&lt;/strong&gt; : base speed-versus-accuracy decisions on measurements from your own audio rather than on published benchmarks that may not reflect your conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing robustness against real-world audio&lt;/strong&gt; : the 15 test scenarios cover conditions that synthetic benchmarks miss: phone quality audio, background noise, accents, and technical vocabulary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluating cost-performance of different models&lt;/strong&gt; : RTF and inference time metrics let you calculate the compute cost of each model at your actual workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality assurance in voice-enabled applications&lt;/strong&gt; : run evaluations to catch model regressions before they reach production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research and academic speech recognition studies&lt;/strong&gt; : the standardized JSON output schema makes results comparable and reproducible across experiments.&lt;/p&gt;
&lt;h2&gt;
  
  
  Real-World Scenarios
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1 - Call Center AI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evaluate which model handles phone quality audio best&lt;/li&gt;
&lt;li&gt;Test robustness against background noise&lt;/li&gt;
&lt;li&gt;Measure inference speed for cost calculation&lt;/li&gt;
&lt;li&gt;Result: Select fastest model that maintains accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2 - Voice Assistant&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test against various accents and speech rates&lt;/li&gt;
&lt;li&gt;Evaluate technical command recognition&lt;/li&gt;
&lt;li&gt;Measure real-time performance on edge devices&lt;/li&gt;
&lt;li&gt;Result: Pick model that runs on-device with good accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3 - Transcription Service&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark accuracy across multiple languages&lt;/li&gt;
&lt;li&gt;Evaluate cost versus accuracy tradeoffs&lt;/li&gt;
&lt;li&gt;Test on domain-specific vocabulary&lt;/li&gt;
&lt;li&gt;Result: Choose optimal model for service tier&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── src/                          # Core modules
│   ├── config.py                # Configuration
│   ├── metrics.py               # Metric calculations
│   ├── data_loader.py           # Data loading utilities
│   ├── base_model.py            # ASR model base class
│   └── evaluator.py             # Main evaluator class
├── models/                       # ASR model implementations
│   ├── wav2vec2.py
│   ├── whisper.py
│   ├── distil_whisper.py
│   ├── canary.py
│   └── granite.py
├── tests/                        # Test suite (36 tests)
├── data/                         # Audio files for evaluation
├── results/                      # Output evaluation results
├── notebooks/                    # Jupyter notebooks
├── run_evaluation.py             # CLI entry point
├── asr_eval_test_matrix.csv      # Test scenarios matrix
├── asr_eval_metrics_schema.json  # Output schema
└── requirements.txt              # Python dependencies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;36 tests covering all core modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a systematic benchmarking framework for ASR models, one that could evaluate accuracy, speed, and robustness across real-world audio conditions, support multiple models through a common interface, and produce structured output for production decisions. The framework needed to cover five distinct model architectures, a 15-scenario test matrix, and three evaluation modes selectable from the CLI.&lt;/p&gt;

&lt;p&gt;NEO built the full implementation: the base model class in &lt;code&gt;base_model.py&lt;/code&gt; that all five model implementations extend; the five model wrappers for Whisper, Wav2Vec2, Distil-Whisper, Canary, and Granite; the metrics engine in &lt;code&gt;metrics.py&lt;/code&gt; computing WER, CER, accuracy, RTF, and inference time; the main evaluator class in &lt;code&gt;evaluator.py&lt;/code&gt;; the CLI entry point in &lt;code&gt;run_evaluation.py&lt;/code&gt; with all three evaluation modes; the data loader in &lt;code&gt;data_loader.py&lt;/code&gt;; the JSON output schema in &lt;code&gt;asr_eval_metrics_schema.json&lt;/code&gt;; the test scenario matrix in &lt;code&gt;asr_eval_test_matrix.csv&lt;/code&gt;; and the 36-test suite.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it before committing to an ASR model in production.&lt;/strong&gt;&lt;br&gt;
Run the full evaluation against your own audio samples using &lt;code&gt;--data-path&lt;/code&gt;. The per-scenario breakdown shows exactly how each model performs on the conditions your application will actually encounter, not on generic benchmarks that may not reflect your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the JSON output to build model selection pipelines.&lt;/strong&gt;&lt;br&gt;
The structured output at &lt;code&gt;results/asr_eval_results_{type}_{timestamp}.json&lt;/code&gt; contains all the metrics needed to make a data-driven model selection decision programmatically. A script that reads the output and selects the model with the best WER for a given RTF threshold builds directly on top of the existing schema.&lt;/p&gt;
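&lt;p&gt;A selection script of that kind might look like the sketch below. The field names &lt;code&gt;aggregate_metrics&lt;/code&gt;, &lt;code&gt;mean_wer&lt;/code&gt;, and &lt;code&gt;mean_rtf&lt;/code&gt; come from the sample output shown earlier; the function name, the second model entry, and the assumption that the caller passes in the per-model mapping from the results file are illustrative only.&lt;/p&gt;

```python
def pick_model(models, max_rtf):
    """Return (name, metrics) for the lowest-WER model within the RTF budget.

    `models` maps a model name to its result entry, each holding an
    "aggregate_metrics" object. Returns None when no model fits the budget.
    """
    candidates = [
        (name, entry["aggregate_metrics"])
        for name, entry in models.items()
        if max_rtf >= entry["aggregate_metrics"]["mean_rtf"]
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda item: item[1]["mean_wer"])


# Illustrative data only: the first entry mirrors the sample output,
# the second is a made-up faster, less accurate model.
SAMPLE_MODELS = {
    "whisper-base": {"aggregate_metrics": {"mean_wer": 0.0477, "mean_rtf": 1.15}},
    "hypothetical-fast": {"aggregate_metrics": {"mean_wer": 0.08, "mean_rtf": 0.4}},
}
```

&lt;p&gt;With a budget of &lt;code&gt;max_rtf=1.0&lt;/code&gt; only the faster entry qualifies; relaxing the budget to 2.0 lets the lower-WER model win.&lt;/p&gt;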

&lt;p&gt;&lt;strong&gt;Use it to evaluate cost-performance before scaling.&lt;/strong&gt;&lt;br&gt;
RTF and inference time metrics per model let you calculate the compute cost of each option at your actual call volume. The per-scenario breakdown shows where each model spends the most compute, useful for optimising before scaling a voice-enabled product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional ASR models.&lt;/strong&gt;&lt;br&gt;
All five models extend &lt;code&gt;base_model.py&lt;/code&gt; and follow the same interface. Adding a new ASR model available through HuggingFace Transformers means adding a new file in &lt;code&gt;models/&lt;/code&gt; that implements the same base class. The model is then available in all three evaluation modes without touching the evaluator, metrics engine, or CLI.&lt;/p&gt;
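&lt;p&gt;The pattern can be illustrated with a small sketch. The class and method names below (&lt;code&gt;BaseASRModel&lt;/code&gt;, &lt;code&gt;load&lt;/code&gt;, &lt;code&gt;transcribe&lt;/code&gt;, &lt;code&gt;timed_transcribe&lt;/code&gt;) are hypothetical stand-ins, not the actual interface in &lt;code&gt;base_model.py&lt;/code&gt;; check the repository for the real method signatures before extending it.&lt;/p&gt;

```python
import time
from abc import ABC, abstractmethod


class BaseASRModel(ABC):
    """Hypothetical stand-in for the shared base class."""

    def __init__(self, model_id):
        self.model_id = model_id
        self.initialized = False

    @abstractmethod
    def load(self):
        """Load weights or open a client; set self.initialized."""

    @abstractmethod
    def transcribe(self, audio_path):
        """Return the transcript for one audio file."""

    def timed_transcribe(self, audio_path):
        """Transcribe and also report wall-clock inference time in seconds."""
        start = time.perf_counter()
        text = self.transcribe(audio_path)
        return text, time.perf_counter() - start


class FixedTranscriptModel(BaseASRModel):
    """Toy model that returns a fixed transcript, used to show the shape."""

    def load(self):
        self.initialized = True

    def transcribe(self, audio_path):
        return "hello world"
```

&lt;p&gt;Keeping timing in the base class means a new model only has to supply loading and transcription, and every model is measured the same way.&lt;/p&gt;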

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Choosing an ASR model without systematic evaluation is a production risk. The ASR Evaluation Framework reduces that risk by giving you per-model, per-scenario metrics across accuracy, speed, and robustness before you deploy, with structured JSON output that makes the decision data-driven rather than intuitive.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Asr-Evaluation" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Asr-Evaluation&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>whisper</category>
      <category>llm</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>SPEC-TO-SHIP: A Multi-Agent Pipeline That Turns Feature Ideas Into Production Code</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Thu, 14 May 2026 11:10:51 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/spec-to-ship-a-multi-agent-pipeline-that-turns-feature-ideas-into-production-code-5e86</link>
      <guid>https://dev.to/nilofer_tweets/spec-to-ship-a-multi-agent-pipeline-that-turns-feature-ideas-into-production-code-5e86</guid>
      <description>&lt;p&gt;Writing a feature spec and getting it to production involves a lot of steps: architecture decisions, task planning, implementation, testing, and code review. In a real engineering team, these are handled by different people with different specializations. Most AI coding tools collapse all of that into a single step and ask one model to do everything.&lt;/p&gt;

&lt;p&gt;SPEC TO SHIP takes a different approach. It orchestrates five specialized AI agents (Architect, Planner, Engineer, QA, and Reviewer) within a single Node.js process to simulate a complete startup engineering team workflow. Raw feature ideas go in. Committed, tested, reviewed code comes out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0vtp1vgz2koehvvng3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0vtp1vgz2koehvvng3j.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Agents
&lt;/h2&gt;

&lt;p&gt;The pipeline follows a sequential flow where each agent's output informs the next, with a tight loop between Engineering and QA. Each agent has a defined role, a specific output format, and a clear handoff point - so no single agent is asked to do more than it is designed for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArchitectAgent - Senior Software Architect:&lt;/strong&gt; The first agent in the pipeline. Takes the raw feature idea and generates a comprehensive technical specification covering Overview, Goals, API Contracts, Data Models, and Security sections. Output is a Markdown spec file that every downstream agent works from. Model: &lt;code&gt;google/gemini-2.0-flash-001&lt;/code&gt; via OpenRouter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PlannerAgent - Staff Engineering Manager:&lt;/strong&gt; Receives the spec from the Architect and breaks it into actionable, dependency-aware development tasks. Output is a JSON array of tasks with topological ordering and acceptance criteria - so the Engineer knows exactly what to build and in what order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EngineerAgent - Principal Software Engineer:&lt;/strong&gt; Takes each task from the Planner and implements production-grade TypeScript code for it. Output is source files with proper typing, error handling, and JSDoc. This is where the actual code gets written.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QAAgent - Senior QA Engineer:&lt;/strong&gt; Receives the Engineer's output and writes exhaustive Vitest test suites for each task. Output is test files covering acceptance criteria and edge cases. The tight loop between Engineer and QA means the implementation is always tested before the Reviewer sees it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ReviewerAgent - Principal Engineer Reviewer:&lt;/strong&gt; The final stage audits everything the previous agents produced for security, performance, and correctness. Output is a score from 0 to 100 and an approval status that tells you whether the output is ready to ship.&lt;/p&gt;
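&lt;p&gt;The Planner's dependency-aware ordering is a topological sort. The sketch below shows the idea in Python for brevity (the project itself is TypeScript), using the standard library's &lt;code&gt;graphlib&lt;/code&gt;. The task field names &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;depends_on&lt;/code&gt; are assumptions; the actual shape of &lt;code&gt;tasks.json&lt;/code&gt; may differ.&lt;/p&gt;

```python
from graphlib import TopologicalSorter


def order_tasks(tasks):
    """Return tasks sorted so that every task comes after its dependencies.

    Each task is a dict with a hypothetical "id" and an optional
    "depends_on" list of ids. graphlib raises CycleError on circular
    dependencies; unknown dependency ids raise ValueError here.
    """
    by_id = {task["id"]: task for task in tasks}
    graph = {}
    for task in tasks:
        deps = set(task.get("depends_on", []))
        unknown = deps - by_id.keys()
        if unknown:
            raise ValueError(f"unknown dependencies: {sorted(unknown)}")
        graph[task["id"]] = deps  # node mapped to its predecessors
    return [by_id[task_id] for task_id in TopologicalSorter(graph).static_order()]
```

&lt;p&gt;Given tasks &lt;code&gt;api&lt;/code&gt; (depends on &lt;code&gt;models&lt;/code&gt;), &lt;code&gt;models&lt;/code&gt;, and &lt;code&gt;tests&lt;/code&gt; (depends on &lt;code&gt;api&lt;/code&gt;), the function returns them in the order models, api, tests.&lt;/p&gt;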

&lt;h2&gt;
  
  
  Quality and Resilience
&lt;/h2&gt;

&lt;p&gt;The pipeline is built for production reliability, not just happy-path execution. Several resilience patterns are built in at the infrastructure level so agent failures do not cascade into full pipeline failures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strict TypeScript&lt;/strong&gt; - No &lt;code&gt;any&lt;/code&gt; types allowed anywhere in the generated code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exponential Backoff&lt;/strong&gt; - Retries on 429/529 errors at 1s, 2s, 4s, 8s, and 16s intervals. Rate limit hits do not kill the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON Robustness&lt;/strong&gt; - When an agent returns malformed JSON, the pipeline automatically retries with explicit instructions to fix the format rather than failing immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timeout&lt;/strong&gt; - A hard 20-minute limit per pipeline run prevents runaway executions.&lt;/p&gt;
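&lt;p&gt;The backoff schedule described above (retry on 429/529 after 1s, 2s, 4s, 8s, and 16s) can be sketched as follows. This is an illustration in Python rather than the project's TypeScript; the &lt;code&gt;RetryableError&lt;/code&gt; class and &lt;code&gt;call_with_backoff&lt;/code&gt; function are hypothetical names, and giving up after the fifth retry is an assumption about what happens when the schedule is exhausted.&lt;/p&gt;

```python
import time

RETRYABLE_STATUSES = {429, 529}
DELAYS_SECONDS = (1, 2, 4, 8, 16)


class RetryableError(Exception):
    """Raised by a request callable to report an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"request failed with status {status}")
        self.status = status


def call_with_backoff(request, sleep=time.sleep):
    """Call request(); on 429/529 wait 1, 2, 4, 8, then 16 seconds and retry.

    Other statuses are re-raised immediately. After the last delay one
    final attempt is made and any error from it propagates to the caller.
    """
    for delay in DELAYS_SECONDS:
        try:
            return request()
        except RetryableError as err:
            if err.status not in RETRYABLE_STATUSES:
                raise
            sleep(delay)
    return request()
```

&lt;p&gt;Passing &lt;code&gt;sleep&lt;/code&gt; as a parameter keeps the function testable without real waiting.&lt;/p&gt;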

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node.js 20+&lt;/li&gt;
&lt;li&gt;OpenRouter API Key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
Clone the repository, then install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npm install
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure the environment. The only required variable is your OpenRouter API key - everything else has sensible defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env and add your OPENROUTER_API_KEY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;The system uses &lt;code&gt;envalid&lt;/code&gt; for robust configuration management:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; - required for LLM access&lt;br&gt;
&lt;code&gt;DEFAULT_MODEL&lt;/code&gt; - set to &lt;code&gt;google/gemini-2.0-flash-001&lt;/code&gt;&lt;br&gt;
&lt;code&gt;PORT&lt;/code&gt; - API server port, default: &lt;code&gt;3000&lt;/code&gt;&lt;br&gt;
&lt;code&gt;DB_PATH&lt;/code&gt; - SQLite database path, default: &lt;code&gt;spec-to-ship.db&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Terminal UI&lt;/strong&gt;&lt;br&gt;
Run the interactive CLI to start a new pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npm run start
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI uses Ink to provide real-time status updates and token streaming as each agent works through its stage. You can watch the pipeline progress in real time - each agent's output appears as it is generated rather than waiting for the full run to complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Industrial Dashboard&lt;/strong&gt;&lt;br&gt;
Start the API server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npm run dev
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;code&gt;dashboard/index.html&lt;/code&gt; in your browser. The dashboard features an Industrial Command Center aesthetic with Dark Charcoal and Amber Glow styling and uses Server-Sent Events for real-time observability of the pipeline as it runs. This gives a visual view of the same pipeline that the CLI runs, useful for sharing progress with others or monitoring longer runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Output Structure
&lt;/h2&gt;

&lt;p&gt;Every pipeline run writes its artifacts to &lt;code&gt;./output/{runId}/&lt;/code&gt;. Each file maps directly to one agent's output, so you can inspect any stage independently:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;spec.md&lt;/code&gt; : Architectural specification from the Architect agent. The source of truth every downstream agent works from.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tasks.json&lt;/code&gt; : Task breakdown from the Planner agent. The dependency-ordered list of what gets built.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;src/&lt;/code&gt; : Implementation code from the Engineer agent.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tests/&lt;/code&gt; : Vitest tests from the QA agent.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;review.md&lt;/code&gt; : Final review report from the Reviewer agent, including the score and approval status.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;meta.json&lt;/code&gt; : Token usage, cost, and timing for the full run.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pipeline.log&lt;/code&gt; : NDJSON event log of the entire pipeline execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The idea was a multi-agent pipeline that mirrors how a real engineering team works: each role specialized, each handoff structured, and the whole thing running autonomously from a feature idea through to reviewed, committed code. The requirements included five distinct agent roles with clear responsibilities, a sequential handoff structure with a QA loop, production-grade TypeScript output, real-time observability via both a CLI and a web dashboard, and resilience patterns like exponential backoff and JSON retry logic.&lt;/p&gt;

&lt;p&gt;NEO built the full system: the five agent implementations with their respective prompts and output schemas, the pipeline orchestration layer coordinating sequential handoffs, the Ink-based CLI with real-time token streaming, the Node.js API server with SSE for dashboard observability, the Industrial Command Center dashboard in HTML, the SQLite-backed database, the artifact output structure, and the &lt;code&gt;envalid&lt;/code&gt; configuration layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to go from idea to working code in a single command&lt;/strong&gt;. &lt;br&gt;
Write a feature description, run the pipeline, and get a complete implementation with architecture docs, TypeScript source, Vitest tests, and a reviewer score without manually coordinating any of the steps. The five-agent structure ensures each stage is handled by a role optimised for that specific task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the reviewer score as a quality gate&lt;/strong&gt;. &lt;br&gt;
The ReviewerAgent scores every run from 0 to 100 across security, performance, and correctness. Teams can use this score as a threshold before accepting generated code - only promoting runs that clear a minimum score into the codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the NDJSON event log for pipeline observability&lt;/strong&gt;. &lt;br&gt;
Every run writes a structured &lt;code&gt;pipeline.log&lt;/code&gt; in NDJSON format. This can be parsed by any log processing tool to track pipeline performance, token costs, and approval rates across runs over time.&lt;/p&gt;
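&lt;p&gt;Since NDJSON is one JSON object per line, a log like this can be read with a few lines of standard-library Python. The sketch below is illustrative: the &lt;code&gt;event&lt;/code&gt; field name is an assumption, as the actual keys written to &lt;code&gt;pipeline.log&lt;/code&gt; are not documented here.&lt;/p&gt;

```python
import json
from collections import Counter


def read_ndjson(lines):
    """Yield one parsed object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_events(lines, key="event"):
    """Count records by a field; "event" is a hypothetical field name."""
    return Counter(record.get(key, "unknown") for record in read_ndjson(lines))
```

&lt;p&gt;Both functions accept any iterable of lines, so an open file handle can be passed directly, for example &lt;code&gt;count_events(open("pipeline.log"))&lt;/code&gt;.&lt;/p&gt;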

&lt;p&gt;&lt;strong&gt;Extend it with additional agent roles&lt;/strong&gt;. &lt;br&gt;
The five-agent structure is sequential and modular. A new agent that receives the previous stage's output and produces its own artifact can be added without restructuring the existing pipeline - the handoff pattern is already established for each stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;SPEC TO SHIP compresses the gap between a feature idea and production-ready code by distributing the work across five specialized agents, each focused on what it does best. Architecture, planning, implementation, testing, and review - all coordinated automatically, with structured handoffs and resilience built in at every stage.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Spec-To-Ship" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Spec-To-Ship&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;. &lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>multiagent</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>RAG Pipeline Stress Tester: Battle-Test Your RAG System Before It Reaches Production</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 12 May 2026 11:45:30 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/rag-pipeline-stress-tester-battle-test-your-rag-system-before-it-reaches-production-397c</link>
      <guid>https://dev.to/nilofer_tweets/rag-pipeline-stress-tester-battle-test-your-rag-system-before-it-reaches-production-397c</guid>
      <description>&lt;p&gt;Most RAG systems get tested with a handful of happy-path questions. Someone asks "what is machine learning?", gets a reasonable answer, and calls it done. Then it goes to production and users find the edge cases: hallucinations on out-of-scope questions, failed refusals on adversarial prompts, latency that collapses under real concurrent load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG Pipeline Stress Tester&lt;/strong&gt; is a battle-testing toolkit that finds these issues before deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Takes any HTTP RAG endpoint and hammers it with 7 categories of adversarial queries under configurable concurrent load.&lt;/li&gt;
&lt;li&gt;Tracks relevance, hallucination, refusal quality, and latency for every query sent.&lt;/li&gt;
&lt;li&gt;Scores everything into a composite health score from 0 to 100.&lt;/li&gt;
&lt;li&gt;Breaks results down by query category so you know exactly which failure modes are causing issues.&lt;/li&gt;
&lt;li&gt;Measures p50, p95, and p99 latency under realistic concurrent load, not just single-request response times.&lt;/li&gt;
&lt;li&gt;Produces an HTML report with interactive charts and a JSON report for CI/CD integration.&lt;/li&gt;
&lt;/ul&gt;
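&lt;p&gt;For readers unfamiliar with p50/p95/p99, the sketch below computes them from a list of recorded latencies using the nearest-rank method. It is a minimal illustration with hypothetical function names; the tool itself may use a different percentile method, such as interpolation.&lt;/p&gt;

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100.

    Returns the smallest sample such that at least pct percent of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Ceiling division in integer arithmetic avoids float rounding issues.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_summary(samples):
    """Return the p50, p95, and p99 latencies for a list of samples."""
    return {f"p{pct}": percentile(samples, pct) for pct in (50, 95, 99)}
```

&lt;p&gt;Tail percentiles matter under concurrency because a healthy median can hide a small fraction of very slow requests.&lt;/p&gt;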

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28iyjk2nc9t6w3r1h1tq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28iyjk2nc9t6w3r1h1tq.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Exists
&lt;/h2&gt;

&lt;p&gt;Before deploying a RAG system to production, four questions need answers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does it hallucinate when asked about things not in the corpus?&lt;/li&gt;
&lt;li&gt;Does it refuse appropriately on out-of-scope questions?&lt;/li&gt;
&lt;li&gt;Does it stay consistent when the same question is asked multiple ways?&lt;/li&gt;
&lt;li&gt;Does it hold up under load from 10, 25, or 50 concurrent users?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Manual testing cannot answer these questions at scale. This tool does it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without stress testing&lt;/strong&gt; - hallucinations get discovered in production, users find edge cases first, latency under load is guesswork, and there is no audit trail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With this tool&lt;/strong&gt; - hallucinations are caught before deployment, you find edge cases in batch, p50/p95/p99 latency is measured at realistic concurrency, and every test run produces a timestamped JSON and HTML report.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 7 Query Categories
&lt;/h2&gt;

&lt;p&gt;The tool ships with 7 pre-built adversarial query banks, each targeting a specific failure mode:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;out_of_scope&lt;/code&gt; - Questions with no answer in the corpus, tests hallucination resistance&lt;br&gt;
&lt;code&gt;adversarial&lt;/code&gt; - Prompt injection and jailbreak attempts, tests instruction-following safety&lt;br&gt;
&lt;code&gt;ambiguous&lt;/code&gt; - Queries with multiple valid interpretations, tests disambiguation&lt;br&gt;
&lt;code&gt;multilingual&lt;/code&gt; - Non-English queries, tests language handling&lt;br&gt;
&lt;code&gt;temporal&lt;/code&gt; - Time-sensitive questions that depend on stale data&lt;br&gt;
&lt;code&gt;negation&lt;/code&gt; - "What is NOT X" style questions, a common failure mode&lt;br&gt;
&lt;code&gt;compound&lt;/code&gt; - Multi-part questions requiring multiple retrievals&lt;/p&gt;

&lt;p&gt;You can add your own queries by appending lines to any file in &lt;code&gt;query_bank/&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Health Score
&lt;/h2&gt;

&lt;p&gt;Every test run produces a composite Health Score from 0 to 100:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;≥ 80  EXCELLENT   Production-ready
≥ 60  GOOD        Minor issues, review before deploying
≥ 40  FAIR        Significant issues, fix first
 &amp;lt; 40  POOR        Critical failures, do not deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calculated from five weighted components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvislnqf6am0fb8i2m3yi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvislnqf6am0fb8i2m3yi.png" alt=" " width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main.py             Typer CLI — entry point and orchestration
adversarial.py      Query generator — 7 categories, pre-built + corpus-generated
loader.py           Async load driver — aiohttp, configurable concurrency
evaluator.py        Scorer — hallucination, precision, refusal, consistency
reporter.py         Report generator — HTML (Chart.js) + JSON output
corpus_analyzer.py  Optional: generate targeted queries from your own documents
query_bank/         7 pre-built adversarial query files (one per line)
tests/              58 pytest tests (no live endpoint needed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
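&lt;p&gt;To make "configurable concurrency" concrete, here is a minimal sketch of a concurrency-limited async driver. It uses only &lt;code&gt;asyncio&lt;/code&gt; with a stand-in &lt;code&gt;fake_send&lt;/code&gt; coroutine in place of a real aiohttp POST. The names &lt;code&gt;run_load&lt;/code&gt; and &lt;code&gt;fake_send&lt;/code&gt; are assumptions, and the actual &lt;code&gt;loader.py&lt;/code&gt; may be structured differently.&lt;/p&gt;

```python
import asyncio
from typing import Awaitable, Callable


async def run_load(
    queries: list[str],
    send: Callable[[str], Awaitable[str]],
    concurrency: int,
) -> list[str]:
    """Send every query, keeping at most `concurrency` requests in flight."""
    gate = asyncio.Semaphore(concurrency)

    async def worker(query: str) -> str:
        async with gate:
            return await send(query)

    # gather returns results in the same order as the input queries
    return await asyncio.gather(*(worker(q) for q in queries))


async def fake_send(query: str) -> str:
    """Stand-in for an HTTP POST; a real driver would call the endpoint here."""
    await asyncio.sleep(0.01)
    return f"echo: {query}"
```

&lt;p&gt;A semaphore keeps the number of simultaneous requests bounded while still letting all queries be scheduled up front.&lt;/p&gt;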



&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The endpoint the tester sends requests to must accept POST with &lt;code&gt;{"query": "..."}&lt;/code&gt; and return JSON containing either a &lt;code&gt;response&lt;/code&gt; or &lt;code&gt;answer&lt;/code&gt; field. Any HTTP status other than 200 is counted as an error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running a Stress Test
&lt;/h2&gt;

&lt;p&gt;The core command runs a full stress test against your RAG endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Basic — 10 concurrent users, 60-second run&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--duration&lt;/span&gt; 60

&lt;span class="c"&gt;# Test only specific query categories&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query-types&lt;/span&gt; out_of_scope,adversarial,multilingual

&lt;span class="c"&gt;# Custom output directory&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./my-reports
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what real terminal output looks like, taken from a run with a concurrency of 5 and a 20-second duration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🚀 Starting RAG Stress Test
   Endpoint: http://localhost:8000/query
   Concurrency: 5
   Duration: 20s

📊 Generating test queries...
   Generated 350 test queries

⚡ Running load tests...
📈 Evaluating results...
📝 Generating reports...

✅ Stress test complete!
   JSON Report: reports/stress_test_results.json
   HTML Report: reports/stress_test_report.html

=======================================================
  Overall Health Score : 57.1/100
  Status               : FAIR - Significant issues detected
  Total requests       : 6355
  Error rate           : 0.0%
  Precision score      : 2.1%
  Hallucination rate   : 22.5%
  Refusal rate         : 77.5%
  Consistency score    : 72.1%
  Latency p50/p95/p99  : 2.9 / 6.3 / 8.7 ms

  Query Type          Count   Halluc%   Refusal%    AvgLat
  ------------------ ------  --------  ---------  --------
  adversarial           205     35.1%      64.9%      3.3ms
  ambiguous             250     12.0%      88.0%      3.2ms
  compound              200     22.0%      78.0%      4.0ms
  multilingual          250     10.0%      90.0%      3.1ms
  negation              200     20.0%      80.0%      5.3ms
  out_of_scope          250     20.0%      80.0%      4.0ms
  temporal              200     38.0%      62.0%      3.1ms

  Recommendations:
    - Low precision score. Enhance retrieval mechanism and relevance ranking.
    - Moderate: Several areas need improvement for production readiness.
=======================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
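&lt;p&gt;The p50/p95/p99 figures in the summary are latency percentiles. How the tool computes them is not shown, so the sketch below uses the nearest-rank method as an assumption, with a hypothetical &lt;code&gt;percentile&lt;/code&gt; helper.&lt;/p&gt;

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

&lt;p&gt;With only a handful of samples, p95 and p99 collapse onto the slowest request, which is one reason to prefer longer runs when tail latency matters.&lt;/p&gt;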



&lt;h2&gt;
  
  
  Quick Sanity Check
&lt;/h2&gt;

&lt;p&gt;For a fast check before a full run, &lt;code&gt;quick-test&lt;/code&gt; runs 35 sample queries (5 per category) and prints the health score without writing any report files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py quick-test &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔍 Running quick sanity test...
   Testing with 35 sample queries

🎯 Quick Test Health Score: 72.4/100
   ✅ Endpoint appears functional
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Generate Queries From Your Own Corpus
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;analyze-corpus&lt;/code&gt; command analyzes your own &lt;code&gt;.txt&lt;/code&gt;, &lt;code&gt;.md&lt;/code&gt;, or &lt;code&gt;.json&lt;/code&gt; files, extracts domain keywords, and produces targeted in-scope, out-of-scope, and adversarial query files you can drop into &lt;code&gt;query_bank/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py analyze-corpus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--corpus&lt;/span&gt; ./my-docs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./query_bank &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-queries&lt;/span&gt; 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📚 Analyzing corpus: ./my-docs
   Generated 50 in_scope queries → query_bank/in_scope_generated.txt
   Generated 50 out_of_scope queries → query_bank/out_of_scope_generated.txt
   Generated 50 adversarial queries → query_bank/adversarial_generated.txt

✅ Corpus analysis complete!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For very small corpora, lower the keyword frequency threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py analyze-corpus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--corpus&lt;/span&gt; ./my-docs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./query_bank &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-queries&lt;/span&gt; 20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-word-freq&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Edit &lt;code&gt;config.yaml&lt;/code&gt; to customise load levels, thresholds, and reporting. The &lt;code&gt;--endpoint&lt;/code&gt; CLI flag always takes precedence over &lt;code&gt;config.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;load.concurrency_levels&lt;/code&gt; - Concurrent user levels to test, for example &lt;code&gt;[1, 5, 10, 25]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.ramp_mode&lt;/code&gt; - If true, steps through each concurrency level; if false, runs at the first level for the full duration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.duration_seconds&lt;/code&gt; - How long to run at each concurrency level&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.rate_limit_per_second&lt;/code&gt; - Maximum requests per second&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evaluation.hallucination_threshold&lt;/code&gt; - Keyword-overlap score below which a response is flagged as a potential hallucination&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evaluation.refusal_keywords&lt;/code&gt; - Phrases that indicate a refused answer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reporter.output_dir&lt;/code&gt; - Where to save HTML and JSON reports&lt;/li&gt;
&lt;/ul&gt;
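&lt;p&gt;The interaction between &lt;code&gt;concurrency_levels&lt;/code&gt;, &lt;code&gt;ramp_mode&lt;/code&gt;, and &lt;code&gt;duration_seconds&lt;/code&gt; can be sketched as a function that turns the &lt;code&gt;load&lt;/code&gt; settings into a list of stages. The function name &lt;code&gt;plan_stages&lt;/code&gt; and the default of &lt;code&gt;False&lt;/code&gt; for a missing &lt;code&gt;ramp_mode&lt;/code&gt; are assumptions, and this is an illustration of the documented behaviour rather than the tool's code.&lt;/p&gt;

```python
def plan_stages(load_config: dict) -> list[tuple[int, int]]:
    """Return (concurrency, seconds) stages implied by the load settings."""
    levels = load_config["concurrency_levels"]
    seconds = load_config["duration_seconds"]
    if load_config.get("ramp_mode", False):
        # Ramp mode: one stage per configured level, each run for the full duration
        return [(level, seconds) for level in levels]
    # Otherwise only the first level is used
    return [(levels[0], seconds)]
```

&lt;p&gt;For example, levels &lt;code&gt;[1, 5, 10, 25]&lt;/code&gt; with a 60-second duration give four 60-second stages in ramp mode, and a single stage at concurrency 1 otherwise.&lt;/p&gt;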

&lt;p&gt;Pass the config file with &lt;code&gt;--config&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
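&lt;p&gt;To give a feel for what &lt;code&gt;evaluation.hallucination_threshold&lt;/code&gt; controls, here is one possible keyword-overlap score. The exact formula the evaluator uses is not documented here, so the tokenisation, the choice of reference text, and the names &lt;code&gt;keyword_overlap&lt;/code&gt; and &lt;code&gt;is_potential_hallucination&lt;/code&gt; are all assumptions.&lt;/p&gt;

```python
import re


def keyword_overlap(response: str, reference: str) -> float:
    """Fraction of reference keywords that also appear in the response."""
    def words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    reference_words = words(reference)
    if not reference_words:
        return 0.0
    shared = reference_words.intersection(words(response))
    return len(shared) / len(reference_words)


def is_potential_hallucination(response: str, reference: str, threshold: float) -> bool:
    """Flag a response whose overlap score falls below the threshold."""
    return threshold > keyword_overlap(response, reference)
```

&lt;p&gt;Raising the threshold flags more responses, so it is worth tuning against a few answers you have checked by hand.&lt;/p&gt;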



&lt;h2&gt;
  
  
  Output Reports
&lt;/h2&gt;

&lt;p&gt;Each test run saves two files to &lt;code&gt;./reports/&lt;/code&gt; or your &lt;code&gt;--output&lt;/code&gt; path:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;stress_test_results.json&lt;/strong&gt; - Machine-readable raw data with per-query latency, success and failure flags, hallucination scores, and a per-type breakdown. Useful for CI/CD integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;stress_test_report.html&lt;/strong&gt; - Interactive dashboard with a health score badge coloured by band, metric cards covering success rate, precision, hallucination, latency p95 and consistency, a bar chart of success rate by query type, a grouped bar chart of hallucination and refusal rate by query type, a latency distribution histogram, and prioritised recommendations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Endpoint Requirements
&lt;/h2&gt;

&lt;p&gt;The tester sends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/your-endpoint&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is machine learning?"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It expects a JSON response containing either a &lt;code&gt;response&lt;/code&gt; or &lt;code&gt;answer&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Machine learning is..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any HTTP status other than 200 is counted as an error.&lt;/p&gt;
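&lt;p&gt;The contract above can be checked with a few lines of code. In this sketch the helper name &lt;code&gt;extract_answer&lt;/code&gt; is hypothetical. Treating malformed JSON or a missing field as an error, and preferring &lt;code&gt;response&lt;/code&gt; over &lt;code&gt;answer&lt;/code&gt; when both are present, are assumptions beyond what is stated.&lt;/p&gt;

```python
import json
from typing import Optional


def extract_answer(status: int, body: str) -> Optional[str]:
    """Return the answer text, or None when the reply counts as an error."""
    if status != 200:
        return None
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None  # assumption: unparseable bodies are treated as errors
    if not isinstance(payload, dict):
        return None
    for field in ("response", "answer"):
        value = payload.get(field)
        if isinstance(value, str):
            return value
    return None
```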

&lt;h2&gt;
  
  
  Running Tests
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The suite has 58 tests covering all modules and uses &lt;code&gt;aioresponses&lt;/code&gt; to mock HTTP, so no live RAG endpoint is required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag-pipeline-stress-tester/
├── main.py             # CLI entry point
├── adversarial.py      # Query generators (7 types)
├── loader.py           # Async load test driver
├── evaluator.py        # Scoring and metrics
├── reporter.py         # HTML + JSON report generator
├── corpus_analyzer.py  # Optional corpus-based query generation
├── config.yaml         # Test configuration
├── requirements.txt
├── query_bank/         # 7 pre-built adversarial query files
└── tests/              # 58 pytest tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a toolkit that could stress test any RAG endpoint automatically, not just for latency but for hallucination, refusal quality, and consistency under concurrent load. The tool needed to work against any endpoint with a standard request format, produce structured reports for CI/CD integration, and ship with pre-built adversarial query banks covering the failure modes that matter most before a RAG deployment.&lt;/p&gt;

&lt;p&gt;NEO built the full implementation: the Typer CLI with all three commands, the async load driver backed by aiohttp, the query generator covering all 7 adversarial categories, the hallucination and precision scorer, the composite health score calculator with five weighted components, the HTML report generator with Chart.js charts, the JSON reporter, the corpus analyzer for generating domain-specific queries, and the full test suite of 58 tests with HTTP mocked via aioresponses.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a pre-deployment gate for every RAG system.&lt;/strong&gt;&lt;br&gt;
Before any RAG endpoint goes to production, run a stress test against it. The health score gives you a single number: below 80 means review before deploying, below 60 means fix the significant issues first, and below 40 means do not deploy. The per-category breakdown tells you exactly which failure modes are causing the score to drop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it with your own domain queries.&lt;/strong&gt;&lt;br&gt;
The pre-built query banks are general purpose. For domain-specific testing, run &lt;code&gt;analyze-corpus&lt;/code&gt; on your own documents to generate in-scope, out-of-scope, and adversarial queries targeted at your actual corpus, then drop them into &lt;code&gt;query_bank/&lt;/code&gt; and run the stress test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrate the JSON report into CI/CD.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;stress_test_results.json&lt;/code&gt; is machine-readable and contains per-query latency, hallucination scores, and the health score. A CI step that reads the health score and fails the pipeline below a threshold turns RAG quality into an automated deployment gate.&lt;/p&gt;
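&lt;p&gt;A gate like this could look roughly as follows. The report's exact schema is not shown in this post, so the top-level key &lt;code&gt;health_score&lt;/code&gt; is an assumption, as is the function name &lt;code&gt;gate&lt;/code&gt;. Check the key against your own &lt;code&gt;stress_test_results.json&lt;/code&gt; before relying on it.&lt;/p&gt;

```python
import json
from pathlib import Path


def gate(report_path: Path, minimum: float, key: str = "health_score") -> int:
    """Return a process exit code: 0 if the score meets the minimum, 1 otherwise."""
    report = json.loads(report_path.read_text(encoding="utf-8"))
    score = float(report[key])  # key name is an assumption; adjust to your report
    print(f"health score {score:.1f}, minimum {minimum:.1f}")
    return 0 if score >= minimum else 1
```

&lt;p&gt;In a CI step, passing the return value to &lt;code&gt;sys.exit&lt;/code&gt; makes the job fail whenever the score drops below the chosen minimum.&lt;/p&gt;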

&lt;p&gt;&lt;strong&gt;Extend it with additional query categories.&lt;/strong&gt;&lt;br&gt;
The 7 query banks are plain text files in &lt;code&gt;query_bank/&lt;/code&gt;, one query per line. Adding a new category for a specific failure mode your RAG system faces means adding a new file to &lt;code&gt;query_bank/&lt;/code&gt; and registering it in &lt;code&gt;adversarial.py&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;RAG systems fail in predictable ways: hallucination on out-of-scope questions, degraded latency under load, and inconsistent refusals. RAG Pipeline Stress Tester surfaces all of these before production, with a structured health score, per-category metrics, and reports that fit directly into a CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/RAG-pipeline-stress-tester" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/RAG-pipeline-stress-tester&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Orbis: Turn Any GitHub Repository Into an Interactive 3D Dependency Graph</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 09 May 2026 10:58:10 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/orbis-turn-any-github-repository-into-an-interactive-3d-dependency-graph-3eei</link>
      <guid>https://dev.to/nilofer_tweets/orbis-turn-any-github-repository-into-an-interactive-3d-dependency-graph-3eei</guid>
      <description>&lt;p&gt;Understanding a large codebase is hard. You clone it, start reading files, and quickly lose track of how everything connects. Which modules are most depended on? Where are the circular dependencies? What would break if you refactored this file?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orbis&lt;/strong&gt; answers these questions visually. Paste a GitHub repository URL, and Orbis clones it, parses the ASTs across Python, JavaScript, TypeScript, Go, Rust, and Java, detects architectural patterns, and renders the entire codebase as a navigable 3D force-directed graph. Click any module to inspect its dependencies, metrics, and exported symbols. Ask the built-in AI assistant questions like "which module should I refactor first?" and get answers grounded in the actual code structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;3D force-directed graph&lt;/strong&gt; - Nodes sized by lines of code, colored by type, with animated directional particles on edges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-language AST parsing&lt;/strong&gt; - Python, JavaScript/TypeScript, Go, Rust, and Java via tree-sitter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI chat assistant&lt;/strong&gt; - Ask Claude questions about the analyzed codebase. Questions like "Which modules have circular dependencies?" or "Where should I add feature X?" are answered with full architectural context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural insights&lt;/strong&gt; - Auto-detected issues including god modules, high coupling, and circular dependencies, each with severity ratings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focus Mode&lt;/strong&gt; - Dim unconnected nodes to trace dependency paths clearly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shareable URLs&lt;/strong&gt; - &lt;code&gt;?repo=https://github.com/...&lt;/code&gt; auto-triggers analysis on load, making it easy to share a specific codebase view.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recent history&lt;/strong&gt; - Last 5 repos stored locally for quick re-analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo mode&lt;/strong&gt; - Load a pre-analyzed snapshot without a GitHub clone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Backend: FastAPI + Server-Sent Events (SSE)&lt;/li&gt;
&lt;li&gt;AST Parsing: tree-sitter (Python, JS/TS, Go, Rust, Java)&lt;/li&gt;
&lt;li&gt;AI Integration: Claude Opus 4.6 via Anthropic API&lt;/li&gt;
&lt;li&gt;3D Rendering: 3d-force-graph + Three.js&lt;/li&gt;
&lt;li&gt;Frontend: Vanilla JS SPA - no build step&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone and install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;orbis
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate   &lt;span class="c"&gt;# Windows: venv\Scripts\activate&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Set up environment&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env and add your ANTHROPIC_API_KEY for the AI chat feature&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get an API key at console.anthropic.com. The AI chat feature requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; in your environment. It degrades gracefully: if the key is missing, the chat panel shows an error message rather than breaking the rest of the app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Run&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8001&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; orbis &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8001:8001 &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-... orbis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;Once running, the workflow is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter a public GitHub repository URL - for example &lt;code&gt;https://github.com/expressjs/express&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Optionally specify a branch&lt;/li&gt;
&lt;li&gt;Click Analyze - Orbis clones the repo, parses ASTs, and builds the graph in roughly 5–30 seconds&lt;/li&gt;
&lt;li&gt;Explore the 3D graph - click a node to open its detail drawer, scroll to zoom, drag to rotate&lt;/li&gt;
&lt;li&gt;Use Focus Mode to highlight a node's direct connections&lt;/li&gt;
&lt;li&gt;Use layer filter chips to show or hide architectural layers&lt;/li&gt;
&lt;li&gt;Ask the AI assistant questions about the codebase in the chat panel&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Keyboard Shortcuts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;R: Reset camera&lt;/li&gt;
&lt;li&gt;P: Pause/resume rotation&lt;/li&gt;
&lt;li&gt;F: Toggle Focus Mode&lt;/li&gt;
&lt;li&gt;/: Focus search box&lt;/li&gt;
&lt;li&gt;Esc: Close detail drawer / exit Focus Mode&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The project has four files at its core: a FastAPI backend, a single-file AST parser, a vanilla JS frontend with no build step, and a utility script for pre-generating demo data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main.py           FastAPI backend — SSE streaming for /analyze, /chat
neo_parser.py     Multi-language AST parser (tree-sitter)
static/
  index.html      Single-page frontend (3d-force-graph + Three.js)
save_analysis.py  Utility: pre-generate demo data from a repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend streams analysis progress to the frontend via Server-Sent Events while cloning and analyzing the repo.&lt;/p&gt;
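&lt;p&gt;For readers unfamiliar with the wire format, a Server-Sent Event is a small block of text lines ending in a blank line. The sketch below serialises and parses one frame. The helper names &lt;code&gt;format_sse&lt;/code&gt; and &lt;code&gt;parse_sse&lt;/code&gt; are hypothetical, and which event names Orbis emits during analysis, other than the final &lt;code&gt;complete&lt;/code&gt; event, is not specified here.&lt;/p&gt;

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event frame with a named event and JSON data."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def parse_sse(frame: str) -> tuple[str, dict]:
    """Parse a single frame in the shape produced by format_sse."""
    event, payload = "message", ""
    for line in frame.strip().splitlines():
        name, _, value = line.partition(": ")
        if name == "event":
            event = value
        elif name == "data":
            payload += value
    return event, json.loads(payload)
```

&lt;p&gt;Because each frame is plain text over a long-lived HTTP response, the browser can consume it with the built-in &lt;code&gt;EventSource&lt;/code&gt; API and no extra client library.&lt;/p&gt;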

&lt;h2&gt;
  
  
  API Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiky04nqoykgwfknmlsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiky04nqoykgwfknmlsm.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Output Schema
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/analyze&lt;/code&gt; emits SSE events and completes with a &lt;code&gt;complete&lt;/code&gt; event containing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schema_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"architecture_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MVC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"languages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Codebase contains 42 modules..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"auth.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"utility"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lines_of_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;315&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"complexity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"exported_symbols"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"AuthBase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HTTPBasicAuth"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"internal_dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"requests/compat"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"external_dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"functions_total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"classes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"edges"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"import"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"insights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high_coupling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"High fan-in on requests/models"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"14 modules import this file directly."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"affected_nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"requests/models"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"recommendation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Consider splitting into smaller focused modules."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node carries its lines of code, complexity rating, exported symbols, and both internal and external dependencies. The insights block automatically surfaces architectural issues (high coupling, circular dependencies, and god modules), each with a severity rating and a specific recommendation.&lt;/p&gt;
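&lt;p&gt;Since the payload is plain JSON, it is easy to post-process. As a small sketch, the code below counts fan-in per node from the &lt;code&gt;edges&lt;/code&gt; array using the &lt;code&gt;from&lt;/code&gt; and &lt;code&gt;to&lt;/code&gt; fields shown in the schema. The function names &lt;code&gt;fan_in&lt;/code&gt; and &lt;code&gt;most_depended_on&lt;/code&gt; are hypothetical, and this is not how Orbis itself derives its insights.&lt;/p&gt;

```python
from collections import Counter


def fan_in(analysis: dict) -> Counter:
    """Count how many edges point at each node id."""
    return Counter(edge["to"] for edge in analysis.get("edges", []))


def most_depended_on(analysis: dict, top: int = 3) -> list[tuple[str, int]]:
    """Return the node ids with the highest fan-in, most imported first."""
    return fan_in(analysis).most_common(top)
```

&lt;p&gt;Nodes at the top of this list are the ones where a refactor is most likely to ripple outward.&lt;/p&gt;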

&lt;h2&gt;
  
  
  Supported Languages
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python - &lt;code&gt;.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;JavaScript/TypeScript - &lt;code&gt;.js&lt;/code&gt;, &lt;code&gt;.mjs&lt;/code&gt;, &lt;code&gt;.cjs&lt;/code&gt;, &lt;code&gt;.jsx&lt;/code&gt;, &lt;code&gt;.ts&lt;/code&gt;, &lt;code&gt;.tsx&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Go - &lt;code&gt;.go&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Rust - &lt;code&gt;.rs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Java - &lt;code&gt;.java&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI Chat
&lt;/h2&gt;

&lt;p&gt;The chat assistant uses Claude Opus 4.6 and receives the full architectural graph as context - node list, dependencies, insights, and summary. It can answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What does the auth module depend on?"&lt;/li&gt;
&lt;li&gt;"Why are there circular dependencies between X and Y?"&lt;/li&gt;
&lt;li&gt;"Which module should I refactor first?"&lt;/li&gt;
&lt;li&gt;"Where would I add a caching layer?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The assistant's answers are grounded in the actual parsed structure of the codebase, not generic advice. The chat feature requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; to be set in your environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run with auto-reload&lt;/span&gt;
uvicorn main:app &lt;span class="nt"&gt;--reload&lt;/span&gt; &lt;span class="nt"&gt;--port&lt;/span&gt; 8001

&lt;span class="c"&gt;# Re-generate demo data&lt;/span&gt;
python save_analysis.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The idea was a tool that turns any GitHub repository into an interactive 3D graph, something a developer could paste a URL into and immediately understand the architecture without reading a single file. The requirements included multi-language AST parsing, automatic architectural issue detection, an AI assistant grounded in the actual code structure, and a frontend that required no build step.&lt;/p&gt;

&lt;p&gt;NEO built the full stack from that description: the FastAPI backend with SSE streaming for real-time analysis progress, the multi-language AST parser in &lt;code&gt;neo_parser.py&lt;/code&gt; covering Python, JavaScript, TypeScript, Go, Rust, and Java via tree-sitter, the 3D force-directed graph frontend in vanilla JS, the Claude Opus 4.6 chat assistant with full architectural context, the insights engine detecting god modules, high coupling, and circular dependencies with severity ratings, and the demo mode with pre-generated analysis data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to onboard onto an unfamiliar codebase.&lt;/strong&gt;&lt;br&gt;
Instead of spending hours reading files to understand how a project is structured, paste the repo URL into Orbis and get an immediate visual map of every module, its dependencies, and the architectural issues that already exist. The AI assistant can then answer specific questions about the structure without you having to trace imports manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it during code review to understand structural impact.&lt;/strong&gt;&lt;br&gt;
When reviewing a large pull request, run Orbis on the repo and use the insights panel to see whether high coupling, circular dependencies, or god modules exist in the areas being changed. The AI assistant can answer specific questions about how the affected modules connect to the rest of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to plan a refactor.&lt;/strong&gt;&lt;br&gt;
Ask the AI assistant "which module should I refactor first?" or "where would I add a caching layer?" and get answers grounded in the actual dependency graph. The focus mode lets you isolate a specific module and trace exactly what depends on it before touching anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional language parsers.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;neo_parser.py&lt;/code&gt; already handles Python, JavaScript/TypeScript, Go, Rust, and Java via tree-sitter. Adding a new language such as Ruby, C++, or Swift follows the same parser pattern and surfaces automatically in the language filter chips and the supported languages list without touching the frontend or the API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Orbis makes codebase architecture something you can see and navigate rather than something you have to reconstruct in your head. A 3D dependency graph, multi-language AST parsing, automatic architectural issue detection, and an AI assistant that knows the actual structure - all from a single repo URL.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Orbit-dependency-visualised" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Orbit-dependency-visualised&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devtools</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>SmolVLM2 Edge Vision Agent: Visual Monitoring Without a GPU or Cloud API</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Thu, 07 May 2026 11:43:31 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/smolvlm2-edge-vision-agent-visual-monitoring-without-a-gpu-or-cloud-api-2afp</link>
      <guid>https://dev.to/nilofer_tweets/smolvlm2-edge-vision-agent-visual-monitoring-without-a-gpu-or-cloud-api-2afp</guid>
      <description>&lt;p&gt;Running vision AI locally has always had a catch: you need a GPU, or you need to send frames to a cloud API and pay per call. SmolVLM2-2.2B changes that. It is a 2.2B-parameter multimodal model specifically designed for CPU inference, and this agent is built around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SmolVLM2 Edge Vision Agent&lt;/strong&gt; is a fully offline edge vision agent that ingests a live webcam feed or an image folder, detects motion using frame-difference analysis, triggers VLM analysis only on scene changes, and persists structured observations to a local SQLite database with a FastAPI web dashboard for review. No API costs. No network calls after the first model download. 16GB RAM, no GPU required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The agent does five things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingests a live webcam feed or an image folder as input&lt;/li&gt;
&lt;li&gt;Performs continuous visual monitoring using frame-difference motion detection, which triggers VLM analysis only on scene changes&lt;/li&gt;
&lt;li&gt;Describes new objects, reads text from images (receipts, whiteboards, signs), and logs everything as structured observations&lt;/li&gt;
&lt;li&gt;Persists observations to a local SQLite database with timestamps, thumbnails, descriptions, and confidence scores&lt;/li&gt;
&lt;li&gt;Exposes a FastAPI web dashboard with live feed, latest observations, and a searchable log&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It runs entirely offline. The model auto-downloads on first run and is cached locally from that point forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt; home security camera analysis, document digitization pipelines, accessibility tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The key design decision is the motion gate. Running a 2.2B-parameter model on every frame would be unusable on CPU hardware, because inference is far from instant. The agent solves this by running frame-difference motion detection on every frame first, and only invoking the VLM when a scene change is detected above the configured threshold.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfmzuo5rq01ymye9ocj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfmzuo5rq01ymye9ocj7.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-frame timeline:&lt;/strong&gt;&lt;br&gt;
Every frame goes through motion detection first. If the frame difference is below the threshold, the frame is dropped with no further processing. If motion is detected, the VLM runs, produces a description, and the observation is stored in SQLite with a thumbnail. This design means expensive model inference only happens when something actually changes in the scene, keeping a Pi-class CPU usable while still describing every meaningful scene change.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;FRAME_DIFF_THRESHOLD&lt;/code&gt; defaults to 0.15 and controls how sensitive the motion detector is. A higher value means less sensitivity: minor lighting changes or small movements are ignored. A lower value triggers the VLM more frequently.&lt;/p&gt;
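
&lt;p&gt;To make the motion gate concrete, here is a minimal sketch of a frame-difference check on a 0 to 1 scale. It assumes the difference metric is the mean absolute pixel difference between two grayscale frames; the actual &lt;code&gt;MotionDetector&lt;/code&gt; in &lt;code&gt;agent.py&lt;/code&gt; is not shown in this article and may compute it differently. The function names are hypothetical, and frames are plain &lt;code&gt;bytes&lt;/code&gt; so the example needs no OpenCV.&lt;/p&gt;

```python
FRAME_DIFF_THRESHOLD = 0.15  # default value quoted in the article


def frame_difference(prev: bytes, curr: bytes) -> float:
    """Mean absolute pixel difference of two grayscale frames, scaled to 0-1."""
    if len(prev) != len(curr) or not curr:
        raise ValueError("frames must be non-empty and the same size")
    total = sum(abs(a - b) for a, b in zip(prev, curr))
    return total / (255 * len(curr))


def should_run_vlm(prev: bytes, curr: bytes,
                   threshold: float = FRAME_DIFF_THRESHOLD) -> bool:
    """Motion gate: invoke the expensive model only above the threshold."""
    return frame_difference(prev, curr) > threshold


if __name__ == "__main__":
    dark = bytes([0] * 100)
    flicker = bytes([10] * 100)               # slight, uniform lighting change
    half_lit = bytes([255] * 50 + [0] * 50)   # half of the frame changes fully
    print(should_run_vlm(dark, flicker))      # difference is about 0.04
    print(should_run_vlm(dark, half_lit))     # difference is 0.5
```

&lt;p&gt;With the default threshold, the uniform flicker is dropped and the half-frame change passes the gate, which matches the behaviour described above: a higher threshold ignores more small changes.&lt;/p&gt;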

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbhclbik1cc2vpheuh6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbhclbik1cc2vpheuh6k.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python:&lt;/strong&gt; 3.11 or newer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 16GB minimum for the real model; less is fine in &lt;code&gt;--mock&lt;/code&gt; mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk:&lt;/strong&gt; ~5GB free for the model cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Linux, macOS, or WSL2 on Windows - uses OpenCV, and webcam access requires native camera support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No GPU required&lt;/strong&gt; - SmolVLM2-2.2B is designed for CPU inference.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/smolvlm2-edge-agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;smolvlm2-edge-agent
make &lt;span class="nb"&gt;install&lt;/span&gt;                                  &lt;span class="c"&gt;# pip install -e .&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env                          &lt;span class="c"&gt;# then edit values as needed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;make install&lt;/code&gt; command runs &lt;code&gt;pip install -e .&lt;/code&gt;, which installs the package and its pinned runtime dependencies from &lt;code&gt;requirements.txt&lt;/code&gt;. The &lt;code&gt;.env.example&lt;/code&gt; file contains all documented environment variables; copy it to &lt;code&gt;.env&lt;/code&gt; and edit the values you want to override before running.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Every tunable is configurable via CLI flags and environment variables. CLI flags take precedence over environment variables. All variables are documented in &lt;code&gt;.env.example&lt;/code&gt; in the &lt;a href="https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MODEL_NAME&lt;/code&gt; - HuggingFace model id, default: &lt;code&gt;HuggingFaceTB/SmolVLM2-2.2B-Instruct&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;USE_MOCK_MODE&lt;/code&gt; - bypass model loading with deterministic stub responses, default: &lt;code&gt;false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MODEL_CACHE_DIR&lt;/code&gt; - where the HuggingFace model is cached on disk, default: &lt;code&gt;./models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DB_PATH&lt;/code&gt; - SQLite database file path, default: &lt;code&gt;./data/observations.db&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FRAME_DIFF_THRESHOLD&lt;/code&gt; - motion sensitivity on a 0–1 scale, higher means less sensitive, default: &lt;code&gt;0.15&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MIN_CONFIDENCE&lt;/code&gt; - minimum VLM confidence required to log an observation, default: &lt;code&gt;0.5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PROCESSING_INTERVAL&lt;/code&gt; - seconds between frame samples, default: &lt;code&gt;1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MAX_OBSERVATIONS&lt;/code&gt; - cap on stored rows, older observations are pruned, default: &lt;code&gt;10000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DASHBOARD_HOST&lt;/code&gt; - FastAPI bind host, default: &lt;code&gt;0.0.0.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DASHBOARD_PORT&lt;/code&gt; - FastAPI port, default: &lt;code&gt;8080&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;INPUT_SOURCE&lt;/code&gt; - camera index or path to image folder, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OUTPUT_DIR&lt;/code&gt; - where observation artifacts are written, default: &lt;code&gt;./data/observations/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;THUMBNAIL_DIR&lt;/code&gt; - where frame thumbnails are saved, default: &lt;code&gt;./data/thumbnails/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LOG_LEVEL&lt;/code&gt; - Python logging level, default: &lt;code&gt;INFO&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LOG_FILE&lt;/code&gt; - optional log file path, default: &lt;code&gt;./data/agent.log&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;MIN_CONFIDENCE&lt;/code&gt; is worth paying attention to: observations where the VLM's confidence falls below this value (0.5 by default) are not stored. Raising it filters out uncertain detections. Lowering it logs more, including lower-confidence observations.&lt;/p&gt;
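
&lt;p&gt;The interaction between &lt;code&gt;MIN_CONFIDENCE&lt;/code&gt; and &lt;code&gt;MAX_OBSERVATIONS&lt;/code&gt; can be sketched with the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module. The table layout and the &lt;code&gt;open_db&lt;/code&gt; and &lt;code&gt;log_observation&lt;/code&gt; helpers below are assumptions for illustration; the real schema lives in &lt;code&gt;db.py&lt;/code&gt; and is not reproduced in this article.&lt;/p&gt;

```python
import sqlite3
from datetime import datetime, timezone

MIN_CONFIDENCE = 0.5      # documented default
MAX_OBSERVATIONS = 10000  # documented default


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Hypothetical schema: the real table layout in db.py is not shown.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "timestamp TEXT NOT NULL, "
        "description TEXT NOT NULL, "
        "confidence REAL NOT NULL, "
        "thumbnail_path TEXT)"
    )
    return conn


def log_observation(conn, description, confidence, thumbnail_path=None,
                    min_confidence=MIN_CONFIDENCE,
                    max_rows=MAX_OBSERVATIONS) -> bool:
    """Store an observation unless it is below the confidence floor, then prune."""
    if min_confidence > confidence:
        return False  # too uncertain, not stored
    conn.execute(
        "INSERT INTO observations "
        "(timestamp, description, confidence, thumbnail_path) "
        "VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), description,
         confidence, thumbnail_path),
    )
    # Keep only the newest max_rows rows; older ones are pruned.
    conn.execute(
        "DELETE FROM observations WHERE id NOT IN ("
        "SELECT id FROM observations ORDER BY id DESC LIMIT ?)",
        (max_rows,),
    )
    conn.commit()
    return True


if __name__ == "__main__":
    conn = open_db()
    for i, conf in enumerate([0.9, 0.3, 0.8, 0.7, 0.6]):
        log_observation(conn, f"obs-{i}", conf, max_rows=3)
    print(conn.execute(
        "SELECT description, confidence FROM observations ORDER BY id"
    ).fetchall())
```

&lt;p&gt;In the demo, the 0.3-confidence observation is rejected, and the small &lt;code&gt;max_rows&lt;/code&gt; value causes the oldest stored row to be pruned once a fourth row arrives.&lt;/p&gt;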
&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Quick start - mock mode, no model download&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fastest way to verify the full pipeline is mock mode. It bypasses model loading entirely and uses deterministic stub responses, so you can confirm the agent loop, database writes, thumbnail generation, and dashboard all work before committing to the 5GB model download:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; data/test_images
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--mock&lt;/span&gt; &lt;span class="nt"&gt;--input&lt;/span&gt; ./data/test_images &lt;span class="nt"&gt;--duration&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs the agent for 30 seconds against the &lt;code&gt;data/test_images/&lt;/code&gt; folder using the mock VLM, populates &lt;code&gt;data/observations.db&lt;/code&gt;, and writes thumbnails to &lt;code&gt;data/thumbnails/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run against a webcam&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--input&lt;/span&gt; 0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Camera index 0 is the default device. For additional cameras, use index 1, 2, and so on. Open &lt;code&gt;http://localhost:8080&lt;/code&gt; in a browser to see the live dashboard. The dashboard shows the live feed, the most recent observations, and a searchable log of everything the agent has recorded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run against an image folder&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--input&lt;/span&gt; ./images &lt;span class="nt"&gt;--interval&lt;/span&gt; 2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This iterates over &lt;code&gt;./images&lt;/code&gt; at 2-second intervals. It is useful for batch processing a folder of scanned documents, receipts, or photos without a live camera feed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dashboard only in read mode&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--mode&lt;/span&gt; dashboard &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This serves the dashboard against an existing &lt;code&gt;data/observations.db&lt;/code&gt; without running the agent. It is useful for reviewing historical observations without starting a new capture session.&lt;/p&gt;
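
&lt;p&gt;The Configuration section states that CLI flags take precedence over environment variables. One common way to get that behaviour with &lt;code&gt;argparse&lt;/code&gt; is to read the environment into each flag's default, as sketched below. The pairing of &lt;code&gt;--input&lt;/code&gt;, &lt;code&gt;--port&lt;/code&gt;, and &lt;code&gt;--interval&lt;/code&gt; with &lt;code&gt;INPUT_SOURCE&lt;/code&gt;, &lt;code&gt;DASHBOARD_PORT&lt;/code&gt;, and &lt;code&gt;PROCESSING_INTERVAL&lt;/code&gt; is an assumption based on their descriptions, and &lt;code&gt;build_parser&lt;/code&gt; is a hypothetical name; the project's &lt;code&gt;cli.py&lt;/code&gt; may be organised differently.&lt;/p&gt;

```python
import argparse
import os


def build_parser(env=None) -> argparse.ArgumentParser:
    """Build a parser whose defaults come from environment variables.

    argparse only falls back to a default when the flag is absent, so an
    explicit CLI flag always wins over the environment value.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="python3 -m src")
    parser.add_argument("--input", default=env.get("INPUT_SOURCE", "0"))
    parser.add_argument("--port", type=int,
                        default=int(env.get("DASHBOARD_PORT", "8080")))
    parser.add_argument("--interval", type=float,
                        default=float(env.get("PROCESSING_INTERVAL", "1.0")))
    return parser


if __name__ == "__main__":
    fake_env = {"DASHBOARD_PORT": "9000"}
    # No --port flag: the environment value is used.
    print(build_parser(fake_env).parse_args([]).port)
    # Explicit --port flag: the flag overrides the environment.
    print(build_parser(fake_env).parse_args(["--port", "8081"]).port)
```

&lt;p&gt;Passing the environment in as a plain dictionary keeps the precedence logic easy to test without mutating &lt;code&gt;os.environ&lt;/code&gt;.&lt;/p&gt;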

&lt;h2&gt;
  
  
  API Reference
&lt;/h2&gt;

&lt;p&gt;The FastAPI dashboard exposes six endpoints:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm66ohxfya82kfvtv4gx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm66ohxfya82kfvtv4gx.png" alt=" " width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/api/search&lt;/code&gt; endpoint runs full-text search over stored observation descriptions, useful for finding all observations that mention a specific object, person, or piece of text across the full history.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/api/observations&lt;/code&gt; endpoint is paginated with &lt;code&gt;limit&lt;/code&gt; and &lt;code&gt;offset&lt;/code&gt; parameters. By default it returns the 50 most recent observations.&lt;/p&gt;
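
&lt;p&gt;A small client for paging through &lt;code&gt;/api/observations&lt;/code&gt; might look like the sketch below. It assumes the dashboard is reachable at the default &lt;code&gt;http://localhost:8080&lt;/code&gt; and that the endpoint returns JSON; the response shape is not documented here, so the code returns the decoded payload as-is. The helper names are hypothetical.&lt;/p&gt;

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # default dashboard address from Usage


def observations_url(page: int, page_size: int = 50,
                     base_url: str = BASE_URL) -> str:
    """Build the URL for one page of /api/observations (pages are 0-indexed)."""
    query = urlencode({"limit": page_size, "offset": page * page_size})
    return f"{base_url}/api/observations?{query}"


def fetch_page(page: int, page_size: int = 50):
    """Fetch and decode one page; requires the dashboard to be running."""
    with urlopen(observations_url(page, page_size)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URLs, so it runs without a live dashboard.
    for page in range(3):
        print(observations_url(page))
```

&lt;p&gt;Translating a page number into &lt;code&gt;offset&lt;/code&gt; keeps callers from having to track row counts themselves.&lt;/p&gt;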

&lt;h2&gt;
  
  
  Models Used
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l2t21i3rhps3jm4ra5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l2t21i3rhps3jm4ra5f.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the default &lt;code&gt;--model&lt;/code&gt; argument and &lt;code&gt;MODEL_NAME&lt;/code&gt; env var. No other models are referenced in code, config, or docs. The model is downloaded from HuggingFace on first run and cached in &lt;code&gt;./models&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;The test suite covers all five modules - database, vision, agent, dashboard, and CLI - with the VLM fully mocked so no model download is needed to run tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;test&lt;/span&gt;                  &lt;span class="c"&gt;# python3 -m pytest tests/ -v
&lt;/span&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;lint&lt;/span&gt;                  &lt;span class="c"&gt;# ruff check src/ tests/ --fix
&lt;/span&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;typecheck&lt;/span&gt;             &lt;span class="c"&gt;# mypy src/ --ignore-missing-imports
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test coverage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tests/test_db.py&lt;/code&gt; - 10 tests covering SQLite schema, CRUD, and search&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_vision.py&lt;/code&gt; - 6 tests covering mock VLM and prompt rendering&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_agent.py&lt;/code&gt; - 9 tests covering motion detection and the agent loop&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_dashboard.py&lt;/code&gt; - 6 tests covering HTTP route handlers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_cli.py&lt;/code&gt; - 7 tests covering argparse and env-var loading&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total: 36 tests, all passing. No skipped tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;smolvlm2-edge-agent/
├── src/
│   ├── __init__.py
│   ├── __main__.py              # entry point for python -m src
│   ├── agent.py                 # MotionDetector + VisionAgent
│   ├── vision.py                # VisionEngine (SmolVLM2 wrapper, with MockVisionEngine)
│   ├── db.py                    # SQLite Database class
│   ├── dashboard.py             # FastAPI app factory + route handlers
│   └── cli.py                   # argparse + env loading
├── tests/                       # 36 pytest tests, VLM fully mocked
├── data/.gitkeep                # observations.db, thumbnails/, test_images/ land here
├── models/.gitkeep              # HF model cache
├── pyproject.toml               # ruff + mypy config + console_script
├── requirements.txt             # pinned runtime deps
├── Makefile                     # install, test, lint, typecheck, run, clean
├── .env.example                 # documented env vars
├── .gitignore
├── BUILD_NOTES.md               # build/verification trace
└── PUBLISH.md                   # exact GitHub push commands
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;src/&lt;/code&gt; directory maps cleanly to the agent's responsibilities - &lt;code&gt;agent.py&lt;/code&gt; handles the motion detection and VLM orchestration loop, &lt;code&gt;vision.py&lt;/code&gt; wraps the model with a mock-compatible interface, &lt;code&gt;db.py&lt;/code&gt; handles all SQLite operations, &lt;code&gt;dashboard.py&lt;/code&gt; is the FastAPI application, and &lt;code&gt;cli.py&lt;/code&gt; handles all argument parsing and environment variable loading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;PRs welcome. Before submitting, all three of the following must pass with zero errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make lint &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make typecheck &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The process started with an idea - a fully offline edge vision agent that runs on CPU-only hardware with no GPU and no cloud API calls. I put together a clear project description with the requirements, tech stack, and expected output, and handed it to NEO. From there NEO handled the full build autonomously: writing the code, running tests, fixing issues, and iterating until everything was working end to end. Once NEO completed the build, I did a manual review, tested it myself, and fed any improvements back - which NEO then implemented.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as an offline home security monitor:&lt;/strong&gt; Point it at a webcam, let it run, and review what it logged through the dashboard. Every scene change is stored with a timestamp, description, confidence score, and thumbnail - all locally, with no data leaving your machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it for document digitization pipelines:&lt;/strong&gt; Point &lt;code&gt;--input&lt;/code&gt; at a folder of scanned receipts, whiteboards, or handwritten notes. The VLM reads text from images and logs structured observations. The &lt;code&gt;/api/search&lt;/code&gt; endpoint lets you query what was found across the full document set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it as an accessibility tool:&lt;/strong&gt; Run it against a webcam feed to generate continuous natural language descriptions of what is visible in the environment - stored and searchable, entirely offline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional VLM backends:&lt;/strong&gt; &lt;code&gt;VisionEngine&lt;/code&gt; in &lt;code&gt;vision.py&lt;/code&gt; wraps SmolVLM2-2.2B with a clean interface that &lt;code&gt;MockVisionEngine&lt;/code&gt; also implements. Swapping in a different HuggingFace multimodal model means updating &lt;code&gt;vision.py&lt;/code&gt; - the agent, database, dashboard, and CLI stay entirely unchanged.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;SmolVLM2 Edge Vision Agent shows that meaningful vision AI does not require a GPU or a cloud API. A 2.2B-parameter model, motion-gated inference, a local SQLite store, and a FastAPI dashboard, all running offline on commodity hardware.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
