<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nilofer 🚀</title>
    <description>The latest articles on DEV Community by Nilofer 🚀 (@nilofer_tweets).</description>
    <link>https://dev.to/nilofer_tweets</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1137273%2Fac10d3a1-21d6-46e3-90d6-889213a616bd.jpg</url>
      <title>DEV Community: Nilofer 🚀</title>
      <link>https://dev.to/nilofer_tweets</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nilofer_tweets"/>
    <language>en</language>
    <item>
      <title>RAG Pipeline Stress Tester: Battle-Test Your RAG System Before It Reaches Production</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 12 May 2026 11:45:30 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/rag-pipeline-stress-tester-battle-test-your-rag-system-before-it-reaches-production-397c</link>
      <guid>https://dev.to/nilofer_tweets/rag-pipeline-stress-tester-battle-test-your-rag-system-before-it-reaches-production-397c</guid>
      <description>&lt;p&gt;Most RAG systems get tested with a handful of happy-path questions. Someone asks "what is machine learning?", gets a reasonable answer, and calls it done. Then it goes to production and users find the edge cases, hallucinations on out-of-scope questions, failed refusals on adversarial prompts, latency that collapses under real concurrent load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG Pipeline Stress Tester&lt;/strong&gt; is a battle-testing toolkit that finds these issues before deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Takes any HTTP RAG endpoint and hammers it with 7 categories of adversarial queries under configurable concurrent load.&lt;/li&gt;
&lt;li&gt;Tracks relevance, hallucination, refusal quality, and latency for every query sent.&lt;/li&gt;
&lt;li&gt;Scores everything into a composite health score from 0 to 100.&lt;/li&gt;
&lt;li&gt;Breaks results down by query category so you know exactly which failure modes are causing issues.&lt;/li&gt;
&lt;li&gt;Measures p50, p95, and p99 latency under realistic concurrent load, not just single-request response times.&lt;/li&gt;
&lt;li&gt;Produces an HTML report with interactive charts and a JSON report for CI/CD integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28iyjk2nc9t6w3r1h1tq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28iyjk2nc9t6w3r1h1tq.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Exists
&lt;/h2&gt;

&lt;p&gt;Before deploying a RAG system to production, four questions need answers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does it hallucinate when asked about things not in the corpus?&lt;/li&gt;
&lt;li&gt;Does it refuse appropriately on out-of-scope questions?&lt;/li&gt;
&lt;li&gt;Does it stay consistent when the same question is asked multiple ways?&lt;/li&gt;
&lt;li&gt;Does it hold up under load - 10, 25, 50 concurrent users?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Manual testing cannot answer these questions at scale. This tool does it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without stress testing&lt;/strong&gt; - hallucinations get discovered in production, users find edge cases first, latency under load is guesswork, and there is no audit trail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With this tool&lt;/strong&gt; - hallucinations are caught before deployment, you find edge cases in batch, p50/p95/p99 latency is measured at realistic concurrency, and every test run produces a timestamped JSON and HTML report.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 7 Query Categories
&lt;/h2&gt;

&lt;p&gt;The tool ships with 7 pre-built adversarial query banks, each targeting a specific failure mode:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;out_of_scope&lt;/code&gt; - Questions with no answer in the corpus, tests hallucination resistance&lt;br&gt;
&lt;code&gt;adversarial&lt;/code&gt; - Prompt injection and jailbreak attempts, tests instruction-following safety&lt;br&gt;
&lt;code&gt;ambiguous&lt;/code&gt; - Queries with multiple valid interpretations, tests disambiguation&lt;br&gt;
&lt;code&gt;multilingual&lt;/code&gt; - Non-English queries, tests language handling&lt;br&gt;
&lt;code&gt;temporal&lt;/code&gt; - Time-sensitive questions that depend on stale data&lt;br&gt;
&lt;code&gt;negation&lt;/code&gt; - "What is NOT X" style questions, a common failure mode&lt;br&gt;
&lt;code&gt;compound&lt;/code&gt; - Multi-part questions requiring multiple retrievals&lt;/p&gt;

&lt;p&gt;You can add your own queries by appending lines to any file in &lt;code&gt;query_bank/&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Health Score
&lt;/h2&gt;

&lt;p&gt;Every test run produces a composite Health Score from 0 to 100:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;≥ 80  EXCELLENT   Production-ready
≥ 60  GOOD        Minor issues, review before deploying
≥ 40  FAIR        Significant issues, fix first
 &amp;lt; 40  POOR        Critical failures, do not deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calculated from five weighted components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvislnqf6am0fb8i2m3yi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvislnqf6am0fb8i2m3yi.png" alt=" " width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;
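
&lt;p&gt;As an illustrative sketch of how such a composite works (the weights below are assumptions for illustration, not the tool's actual values), the score is a weighted sum of normalized component scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of a weighted composite mirroring the five metrics in
# the summary output. The weights are assumptions, not the tool's real values.
WEIGHTS = {
    "precision": 0.25,
    "hallucination_resistance": 0.25,
    "refusal_quality": 0.20,
    "consistency": 0.15,
    "latency": 0.15,
}

def health_score(components):
    # components: metric name mapped to a value normalized to the 0..1 range
    return 100 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;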

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main.py             Typer CLI — entry point and orchestration
adversarial.py      Query generator — 7 categories, pre-built + corpus-generated
loader.py           Async load driver — aiohttp, configurable concurrency
evaluator.py        Scorer — hallucination, precision, refusal, consistency
reporter.py         Report generator — HTML (Chart.js) + JSON output
corpus_analyzer.py  Optional: generate targeted queries from your own documents
query_bank/         7 pre-built adversarial query files (one per line)
tests/              58 pytest tests (no live endpoint needed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The endpoint the tester sends requests to must accept POST with &lt;code&gt;{"query": "..."}&lt;/code&gt; and return JSON containing either a &lt;code&gt;response&lt;/code&gt; or &lt;code&gt;answer&lt;/code&gt; field. Any HTTP status other than 200 is counted as an error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running a Stress Test
&lt;/h2&gt;

&lt;p&gt;The core command runs a full stress test against your RAG endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Basic — 10 concurrent users, 60-second run&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--duration&lt;/span&gt; 60

&lt;span class="c"&gt;# Test only specific query categories&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query-types&lt;/span&gt; out_of_scope,adversarial,multilingual

&lt;span class="c"&gt;# Custom output directory&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./my-reports
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what real terminal output from a run looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🚀 Starting RAG Stress Test
   Endpoint: http://localhost:8000/query
   Concurrency: 5
   Duration: 20s

📊 Generating test queries...
   Generated 350 test queries

⚡ Running load tests...
📈 Evaluating results...
📝 Generating reports...

✅ Stress test complete!
   JSON Report: reports/stress_test_results.json
   HTML Report: reports/stress_test_report.html

=======================================================
  Overall Health Score : 57.1/100
  Status               : FAIR - Significant issues detected
  Total requests       : 6355
  Error rate           : 0.0%
  Precision score      : 2.1%
  Hallucination rate   : 22.5%
  Refusal rate         : 77.5%
  Consistency score    : 72.1%
  Latency p50/p95/p99  : 2.9 / 6.3 / 8.7 ms

  Query Type          Count   Halluc%   Refusal%    AvgLat
  ------------------ ------  --------  ---------  --------
  adversarial           205     35.1%      64.9%      3.3ms
  ambiguous             250     12.0%      88.0%      3.2ms
  compound              200     22.0%      78.0%      4.0ms
  multilingual          250     10.0%      90.0%      3.1ms
  negation              200     20.0%      80.0%      5.3ms
  out_of_scope          250     20.0%      80.0%      4.0ms
  temporal              200     38.0%      62.0%      3.1ms

  Recommendations:
    - Low precision score. Enhance retrieval mechanism and relevance ranking.
    - Moderate: Several areas need improvement for production readiness.
=======================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Sanity Check
&lt;/h2&gt;

&lt;p&gt;For a fast check before a full run, &lt;code&gt;quick-test&lt;/code&gt; runs 35 sample queries - 5 per category - and prints the health score without writing any report files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py quick-test &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔍 Running quick sanity test...
   Testing with 35 sample queries

🎯 Quick Test Health Score: 72.4/100
   ✅ Endpoint appears functional
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Generate Queries From Your Own Corpus
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;analyze-corpus&lt;/code&gt; command analyzes your own &lt;code&gt;.txt&lt;/code&gt;, &lt;code&gt;.md&lt;/code&gt;, or &lt;code&gt;.json&lt;/code&gt; files, extracts domain keywords, and produces targeted in-scope, out-of-scope, and adversarial query files you can drop into &lt;code&gt;query_bank/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py analyze-corpus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--corpus&lt;/span&gt; ./my-docs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./query_bank &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-queries&lt;/span&gt; 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📚 Analyzing corpus: ./my-docs
   Generated 50 in_scope queries → query_bank/in_scope_generated.txt
   Generated 50 out_of_scope queries → query_bank/out_of_scope_generated.txt
   Generated 50 adversarial queries → query_bank/adversarial_generated.txt

✅ Corpus analysis complete!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For very small corpora, lower the keyword frequency threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py analyze-corpus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--corpus&lt;/span&gt; ./my-docs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./query_bank &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-queries&lt;/span&gt; 20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-word-freq&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Edit &lt;code&gt;config.yaml&lt;/code&gt; to customise load levels, thresholds, and reporting. The &lt;code&gt;--endpoint&lt;/code&gt; CLI flag always takes precedence over &lt;code&gt;config.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;load.concurrency_levels&lt;/code&gt; - Concurrent user levels to test, for example &lt;code&gt;[1, 5, 10, 25]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.ramp_mode&lt;/code&gt; - If true, steps through each concurrency level; if false, runs at the first level for the full duration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.duration_seconds&lt;/code&gt; - How long to run at each concurrency level&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.rate_limit_per_second&lt;/code&gt; - Maximum requests per second&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evaluation.hallucination_threshold&lt;/code&gt; - Keyword-overlap score below which a response is flagged as a potential hallucination&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evaluation.refusal_keywords&lt;/code&gt; - Phrases that indicate a refused answer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reporter.output_dir&lt;/code&gt; - Where to save HTML and JSON reports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pass the config file with &lt;code&gt;--config&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Output Reports
&lt;/h2&gt;

&lt;p&gt;Each test run saves two files to &lt;code&gt;./reports/&lt;/code&gt; or your &lt;code&gt;--output&lt;/code&gt; path:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;stress_test_results.json&lt;/strong&gt; - Machine-readable raw data with per-query latency, success and failure flags, hallucination scores, and a per-type breakdown. Useful for CI/CD integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;stress_test_report.html&lt;/strong&gt; - Interactive dashboard with a health score badge coloured by band, metric cards covering success rate, precision, hallucination, latency p95 and consistency, a bar chart of success rate by query type, a grouped bar chart of hallucination and refusal rate by query type, a latency distribution histogram, and prioritised recommendations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Endpoint Requirements
&lt;/h2&gt;

&lt;p&gt;The tester sends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/your-endpoint&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is machine learning?"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It expects a JSON response containing either a &lt;code&gt;response&lt;/code&gt; or &lt;code&gt;answer&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Machine learning is..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any HTTP status other than 200 is counted as an error.&lt;/p&gt;
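
&lt;p&gt;If you need a quick target to point the tester at, a minimal compatible endpoint is a few lines of FastAPI - a sketch, with the handler body as a placeholder for your own retrieval and generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of an endpoint the tester can target. Assumes FastAPI;
# replace the placeholder answer with your actual retrieval and generation.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    query: str

@app.post("/query")
def answer(q: Query):
    # Retrieval + generation would go here.
    return {"response": f"Placeholder answer for: {q.query}"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;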

&lt;h2&gt;
  
  
  Running Tests
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;58 tests covering all modules. Uses &lt;code&gt;aioresponses&lt;/code&gt; to mock HTTP - no live RAG endpoint required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag-pipeline-stress-tester/
├── main.py             # CLI entry point
├── adversarial.py      # Query generators (7 types)
├── loader.py           # Async load test driver
├── evaluator.py        # Scoring and metrics
├── reporter.py         # HTML + JSON report generator
├── corpus_analyzer.py  # Optional corpus-based query generation
├── config.yaml         # Test configuration
├── requirements.txt
├── query_bank/         # 7 pre-built adversarial query files
└── tests/              # 58 pytest tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a toolkit that could stress test any RAG endpoint automatically, not just for latency but for hallucination, refusal quality, and consistency under concurrent load. The tool needed to work against any endpoint with a standard request format, produce structured reports for CI/CD integration, and ship with pre-built adversarial query banks covering the failure modes that matter most before a RAG deployment.&lt;/p&gt;

&lt;p&gt;NEO built the full implementation: the Typer CLI with all three commands, the async load driver backed by aiohttp, the query generator covering all 7 adversarial categories, the hallucination and precision scorer, the composite health score calculator with five weighted components, the HTML report generator with Chart.js charts, the JSON reporter, the corpus analyzer for generating domain-specific queries, and the full test suite of 58 tests with HTTP mocked via aioresponses.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a pre-deployment gate for every RAG system.&lt;/strong&gt;&lt;br&gt;
Before any RAG endpoint goes to production, run a stress test against it. The health score gives you a single number: below 60 means review before deploying; below 40 means do not deploy. The per-category breakdown tells you exactly which failure modes are causing the score to drop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it with your own domain queries.&lt;/strong&gt;&lt;br&gt;
The pre-built query banks are general purpose. For domain-specific testing, run &lt;code&gt;analyze-corpus&lt;/code&gt; on your own documents to generate in-scope, out-of-scope, and adversarial queries targeted at your actual corpus, then drop them into &lt;code&gt;query_bank/&lt;/code&gt; and run the stress test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrate the JSON report into CI/CD.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;stress_test_results.json&lt;/code&gt; is machine-readable and contains per-query latency, hallucination scores, and the health score. A CI step that reads the health score and fails the pipeline below a threshold turns RAG quality into an automated deployment gate.&lt;/p&gt;
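
&lt;p&gt;A gate like that is a short script. This sketch assumes the composite score is stored under a &lt;code&gt;health_score&lt;/code&gt; key - verify the field name against your own generated report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a CI gate: fail the pipeline when the health score is too low.
# Assumes the score lives under a "health_score" key; verify the field name
# against your own stress_test_results.json.
import json
import sys

with open("reports/stress_test_results.json") as f:
    report = json.load(f)

score = report["health_score"]
print(f"RAG health score: {score:.1f}/100")
if score &amp;lt; 60:
    sys.exit(1)  # block the deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;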

&lt;p&gt;&lt;strong&gt;Extend it with additional query categories.&lt;/strong&gt;&lt;br&gt;
The 7 query banks are plain text files in &lt;code&gt;query_bank/&lt;/code&gt;, one query per line. Adding a new category for a specific failure mode your RAG system faces means adding a new file to &lt;code&gt;query_bank/&lt;/code&gt; and registering it in &lt;code&gt;adversarial.py&lt;/code&gt;.&lt;/p&gt;
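
&lt;p&gt;The registration step depends on how &lt;code&gt;adversarial.py&lt;/code&gt; is structured internally; as a purely hypothetical sketch, assuming the generator keeps a category-to-file mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch; the real adversarial.py may organize this differently.
# Assumes a module-level mapping from category name to query bank file.
QUERY_BANKS = {
    "out_of_scope": "query_bank/out_of_scope.txt",
    "adversarial": "query_bank/adversarial.txt",
    "pii_probing": "query_bank/pii_probing.txt",  # the new category
}

def load_queries(category):
    with open(QUERY_BANKS[category], encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;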

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;RAG systems fail in predictable ways: hallucination on out-of-scope questions, collapsed latency under load, inconsistent refusals. RAG Pipeline Stress Tester surfaces all of these before production, with a structured health score, per-category metrics, and reports that fit directly into a CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/RAG-pipeline-stress-tester" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/RAG-pipeline-stress-tester&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Orbis: Turn Any GitHub Repository Into an Interactive 3D Dependency Graph</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 09 May 2026 10:58:10 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/orbis-turn-any-github-repository-into-an-interactive-3d-dependency-graph-3eei</link>
      <guid>https://dev.to/nilofer_tweets/orbis-turn-any-github-repository-into-an-interactive-3d-dependency-graph-3eei</guid>
      <description>&lt;p&gt;Understanding a large codebase is hard. You clone it, start reading files, and quickly lose track of how everything connects. Which modules are most depended on? Where are the circular dependencies? What would break if you refactored this file?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orbis&lt;/strong&gt; answers these questions visually. Paste a GitHub repository URL, and Orbis clones it, parses the ASTs across Python, JavaScript, TypeScript, Go, Rust, and Java, detects architectural patterns, and renders the entire codebase as a navigable 3D force-directed graph. Click any module to inspect its dependencies, metrics, and exported symbols. Ask the built-in AI assistant questions like "which module should I refactor first?" and get answers grounded in the actual code structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;3D force-directed graph&lt;/strong&gt; - Nodes sized by lines of code, colored by type, with animated directional particles on edges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-language AST parsing&lt;/strong&gt; - Python, JavaScript/TypeScript, Go, Rust, and Java via tree-sitter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI chat assistant&lt;/strong&gt; - Ask Claude questions about the analyzed codebase. Questions like "Which modules have circular dependencies?" or "Where should I add feature X?" are answered with full architectural context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural insights&lt;/strong&gt; - Auto-detected issues including god modules, high coupling, and circular dependencies, each with severity ratings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focus Mode&lt;/strong&gt; - Dim unconnected nodes to trace dependency paths clearly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shareable URLs&lt;/strong&gt; - &lt;code&gt;?repo=https://github.com/...&lt;/code&gt; auto-triggers analysis on load, making it easy to share a specific codebase view.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recent history&lt;/strong&gt; - Last 5 repos stored locally for quick re-analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo mode&lt;/strong&gt; - Load a pre-analyzed snapshot without a GitHub clone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Backend: FastAPI + Server-Sent Events (SSE)&lt;/li&gt;
&lt;li&gt;AST Parsing: tree-sitter (Python, JS/TS, Go, Rust, Java)&lt;/li&gt;
&lt;li&gt;AI Integration: Claude Opus 4.6 via Anthropic API&lt;/li&gt;
&lt;li&gt;3D Rendering: 3d-force-graph + Three.js&lt;/li&gt;
&lt;li&gt;Frontend: Vanilla JS SPA - no build step&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone and install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;orbis
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate   &lt;span class="c"&gt;# Windows: venv\Scripts\activate&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Set up environment&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env and add your ANTHROPIC_API_KEY for the AI chat feature&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get an API key at console.anthropic.com. The AI chat feature requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; in your environment. It degrades gracefully: if the key is missing, the chat panel shows an error message rather than breaking the rest of the app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Run&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8001&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; orbis &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8001:8001 &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-... orbis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;Once running, the workflow is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter a public GitHub repository URL - for example &lt;code&gt;https://github.com/expressjs/express&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Optionally specify a branch&lt;/li&gt;
&lt;li&gt;Click Analyze - Orbis clones the repo, parses ASTs, and builds the graph in roughly 5–30 seconds&lt;/li&gt;
&lt;li&gt;Explore the 3D graph - click a node to open its detail drawer, scroll to zoom, drag to rotate&lt;/li&gt;
&lt;li&gt;Use Focus Mode to highlight a node's direct connections&lt;/li&gt;
&lt;li&gt;Use layer filter chips to show or hide architectural layers&lt;/li&gt;
&lt;li&gt;Ask the AI assistant questions about the codebase in the chat panel&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Keyboard Shortcuts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;R: Reset camera&lt;/li&gt;
&lt;li&gt;P: Pause/resume rotation&lt;/li&gt;
&lt;li&gt;F: Toggle Focus Mode&lt;/li&gt;
&lt;li&gt;/: Focus search box&lt;/li&gt;
&lt;li&gt;Esc: Close detail drawer / exit Focus Mode&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The project has four files at its core - a FastAPI backend, a single-file AST parser, a vanilla JS frontend with no build step, and a demo-data utility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main.py           FastAPI backend — SSE streaming for /analyze, /chat
neo_parser.py     Multi-language AST parser (tree-sitter)
static/
  index.html      Single-page frontend (3d-force-graph + Three.js)
save_analysis.py  Utility: pre-generate demo data from a repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend streams analysis progress to the frontend via Server-Sent Events while cloning and analyzing the repo.&lt;/p&gt;
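
&lt;p&gt;In FastAPI, SSE progress streaming follows a simple pattern - a sketch, not Orbis's actual handler, which interleaves events like these with the real clone and parse work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of SSE progress streaming with FastAPI. Each event is a
# "data: ..." line followed by a blank line, per the SSE wire format.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/analyze")
def analyze(repo: str):
    def events():
        for step in ("cloning", "parsing", "building graph"):
            yield f"data: {json.dumps({'status': step})}\n\n"
        yield f"data: {json.dumps({'event': 'complete'})}\n\n"
    return StreamingResponse(events(), media_type="text/event-stream")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;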

&lt;h2&gt;
  
  
  API Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiky04nqoykgwfknmlsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiky04nqoykgwfknmlsm.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Output Schema
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/analyze&lt;/code&gt; emits SSE events and completes with a &lt;code&gt;complete&lt;/code&gt; event containing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schema_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"architecture_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MVC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"languages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Codebase contains 42 modules..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"auth.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"utility"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lines_of_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;315&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"complexity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"exported_symbols"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"AuthBase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HTTPBasicAuth"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"internal_dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"requests/compat"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"external_dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"functions_total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"classes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"edges"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"import"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"insights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high_coupling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"High fan-in on requests/models"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"14 modules import this file directly."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"affected_nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"requests/models"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"recommendation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Consider splitting into smaller focused modules."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node carries its lines of code, complexity rating, exported symbols, and both internal and external dependencies. The insights block surfaces architectural issues automatically - high coupling, circular dependencies, and god modules - each with a severity rating and a specific recommendation.&lt;/p&gt;
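
&lt;p&gt;Because the schema is stable, downstream tooling can consume it directly. A sketch that pulls out only the high-severity insights (the file name is an assumption for a saved complete-event payload):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: filter high-severity insights from a saved analysis payload.
# "analysis.json" is an assumed file name for the complete-event JSON.
import json

with open("analysis.json") as f:
    analysis = json.load(f)

for insight in analysis["insights"]:
    if insight["severity"] == "high":
        print(f"[{insight['type']}] {insight['title']}: {insight['recommendation']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;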

&lt;h2&gt;
  
  
  Supported Languages
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python - &lt;code&gt;.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;JavaScript/TypeScript - &lt;code&gt;.js&lt;/code&gt;, &lt;code&gt;.mjs&lt;/code&gt;, &lt;code&gt;.cjs&lt;/code&gt;, &lt;code&gt;.jsx&lt;/code&gt;, &lt;code&gt;.ts&lt;/code&gt;, &lt;code&gt;.tsx&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Go - &lt;code&gt;.go&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Rust - &lt;code&gt;.rs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Java - &lt;code&gt;.java&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI Chat
&lt;/h2&gt;

&lt;p&gt;The chat assistant uses Claude Opus 4.6 and receives the full architectural graph as context - node list, dependencies, insights, and summary. It can answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What does the auth module depend on?"&lt;/li&gt;
&lt;li&gt;"Why are there circular dependencies between X and Y?"&lt;/li&gt;
&lt;li&gt;"Which module should I refactor first?"&lt;/li&gt;
&lt;li&gt;"Where would I add a caching layer?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The assistant's answers are grounded in the actual parsed structure of the codebase - not generic advice. Requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; in your environment.&lt;/p&gt;
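
&lt;p&gt;The grounding pattern is to serialize the graph into the prompt. A sketch using the &lt;code&gt;anthropic&lt;/code&gt; Python SDK - the model id string here is a placeholder, not necessarily what Orbis ships with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of graph-grounded chat using the anthropic Python SDK. The model
# id below is a placeholder; use whichever Claude model you have access to.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(analysis, question):
    message = client.messages.create(
        model="claude-opus-4-6",  # placeholder model id
        max_tokens=1024,
        system="Answer using only the architectural graph provided.",
        messages=[{
            "role": "user",
            "content": f"Graph:\n{json.dumps(analysis)}\n\nQuestion: {question}",
        }],
    )
    return message.content[0].text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;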

&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run with auto-reload&lt;/span&gt;
uvicorn main:app &lt;span class="nt"&gt;--reload&lt;/span&gt; &lt;span class="nt"&gt;--port&lt;/span&gt; 8001

&lt;span class="c"&gt;# Re-generate demo data&lt;/span&gt;
python save_analysis.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The idea was a tool that turns any GitHub repository into an interactive 3D graph, something a developer could paste a URL into and immediately understand the architecture without reading a single file. The requirements included multi-language AST parsing, automatic architectural issue detection, an AI assistant grounded in the actual code structure, and a frontend that required no build step.&lt;/p&gt;

&lt;p&gt;NEO built the full stack from that description: the FastAPI backend with SSE streaming for real-time analysis progress, the multi-language AST parser in &lt;code&gt;neo_parser.py&lt;/code&gt; covering Python, JavaScript, TypeScript, Go, Rust, and Java via tree-sitter, the 3D force-directed graph frontend in vanilla JS, the Claude Opus 4.6 chat assistant with full architectural context, the insights engine detecting god modules, high coupling, and circular dependencies with severity ratings, and the demo mode with pre-generated analysis data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to onboard onto an unfamiliar codebase.&lt;/strong&gt;&lt;br&gt;
Instead of spending hours reading files to understand how a project is structured, paste the repo URL into Orbis and get an immediate visual map of every module, its dependencies, and the architectural issues that already exist. The AI assistant can then answer specific questions about the structure without you having to trace imports manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it during code review to understand structural impact.&lt;/strong&gt;&lt;br&gt;
When reviewing a large pull request, run Orbis on the repo and use the insights panel to see whether high coupling, circular dependencies, or god modules exist in the areas being changed. The AI assistant can answer specific questions about how the affected modules connect to the rest of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to plan a refactor.&lt;/strong&gt;&lt;br&gt;
Ask the AI assistant "which module should I refactor first?" or "where would I add a caching layer?" and get answers grounded in the actual dependency graph. The focus mode lets you isolate a specific module and trace exactly what depends on it before touching anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional language parsers.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;neo_parser.py&lt;/code&gt; already handles five languages via tree-sitter. Adding a new language - Ruby, C++, Swift - follows the same parser pattern and surfaces automatically in the language filter chips and the supported languages list without touching the frontend or the API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Orbis makes codebase architecture something you can see and navigate rather than something you have to reconstruct in your head. A 3D dependency graph, multi-language AST parsing, automatic architectural issue detection, and an AI assistant that knows the actual structure - all from a single repo URL.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Orbit-dependency-visualised" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Orbit-dependency-visualised&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devtools</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>SmolVLM2 Edge Vision Agent: Visual Monitoring Without a GPU or Cloud API</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Thu, 07 May 2026 11:43:31 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/smolvlm2-edge-vision-agent-visual-monitoring-without-a-gpu-or-cloud-api-2afp</link>
      <guid>https://dev.to/nilofer_tweets/smolvlm2-edge-vision-agent-visual-monitoring-without-a-gpu-or-cloud-api-2afp</guid>
      <description>&lt;p&gt;Running vision AI locally has always had a catch, you need a GPU, or you need to send frames to a cloud API and pay per call. SmolVLM2-2.2B changes that. It is a 2.2B-parameter multimodal model specifically designed for CPU inference, and this agent is built around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SmolVLM2 Edge Vision Agent&lt;/strong&gt; is a fully offline edge vision agent that ingests a live webcam feed or an image folder, detects motion using frame-difference analysis, triggers VLM analysis only on scene changes, and persists structured observations to a local SQLite database with a FastAPI web dashboard for review. No API costs. No network calls after the first model download. 16GB RAM, no GPU required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The agent does five things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingests a live webcam feed or an image folder as input&lt;/li&gt;
&lt;li&gt;Performs continuous visual monitoring: frame-difference-based motion detection triggers VLM analysis only on scene changes&lt;/li&gt;
&lt;li&gt;Describes new objects, reads text from images (receipts, whiteboards, signs), and logs everything as structured observations&lt;/li&gt;
&lt;li&gt;Persists observations to a local SQLite database with timestamps, thumbnails, descriptions, and confidence scores&lt;/li&gt;
&lt;li&gt;Exposes a FastAPI web dashboard with live feed, latest observations, and a searchable log&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It runs entirely offline. The model auto-downloads on first run and is cached locally from that point forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt; home security camera analysis, document digitization pipelines, accessibility tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The key design decision is the motion gate. Running a 2.2B-parameter model on every frame would be unusable on CPU hardware, where inference is far from instant. The agent solves this by running frame-difference motion detection on every frame first, and only invoking the VLM when a scene change is detected above the configured threshold.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfmzuo5rq01ymye9ocj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfmzuo5rq01ymye9ocj7.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-frame timeline:&lt;/strong&gt;&lt;br&gt;
Every frame goes through motion detection first. If the frame difference is below the threshold, the frame is dropped with no further processing. If motion is detected, the VLM runs, produces a description, and the observation is stored in SQLite with a thumbnail. This design means expensive model inference only happens when something actually changes in the scene, keeping a Pi-class CPU usable while still describing every meaningful scene change.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;FRAME_DIFF_THRESHOLD&lt;/code&gt; defaults to 0.15 and controls how sensitive the motion detector is. A higher value means less sensitivity, minor lighting changes or small movements are ignored. A lower value triggers the VLM more frequently.&lt;/p&gt;
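
&lt;p&gt;The gate itself is a few lines of OpenCV. A sketch of the idea - the actual &lt;code&gt;MotionDetector&lt;/code&gt; in &lt;code&gt;src/agent.py&lt;/code&gt; may normalize the difference differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of frame-difference motion gating with OpenCV.
import cv2
import numpy as np

def scene_changed(prev_frame, frame, threshold=0.15):
    prev = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev, curr)
    score = float(np.mean(diff)) / 255.0  # normalized difference, 0 to 1
    return score &amp;gt; threshold  # above threshold: run the VLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;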

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbhclbik1cc2vpheuh6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbhclbik1cc2vpheuh6k.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python:&lt;/strong&gt; 3.11 or newer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 16GB minimum for the real model; less is fine in &lt;code&gt;--mock&lt;/code&gt; mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk:&lt;/strong&gt; ~5GB free for the model cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Linux, macOS, or WSL2 on Windows - the agent uses OpenCV, and webcam access requires native camera support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No GPU required&lt;/strong&gt; - SmolVLM2-2.2B is designed for CPU inference.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/smolvlm2-edge-agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;smolvlm2-edge-agent
make &lt;span class="nb"&gt;install&lt;/span&gt;                                  &lt;span class="c"&gt;# pip install -e .&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env                          &lt;span class="c"&gt;# then edit values as needed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;make install&lt;/code&gt; command runs &lt;code&gt;pip install -e .&lt;/code&gt;, which installs the package and its pinned runtime dependencies from &lt;code&gt;requirements.txt&lt;/code&gt;. The &lt;code&gt;.env.example&lt;/code&gt; file contains all documented environment variables; copy it to &lt;code&gt;.env&lt;/code&gt; and edit the values you want to override before running.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Every tunable is configurable via CLI flags and environment variables. CLI flags take precedence over environment variables. All variables are documented in &lt;code&gt;.env.example&lt;/code&gt; in the &lt;a href="https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MODEL_NAME&lt;/code&gt; - HuggingFace model id, default: &lt;code&gt;HuggingFaceTB/SmolVLM2-2.2B-Instruct&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;USE_MOCK_MODE&lt;/code&gt; - bypass model loading with deterministic stub responses, default: &lt;code&gt;false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MODEL_CACHE_DIR&lt;/code&gt; - where the HuggingFace model is cached on disk, default: &lt;code&gt;./models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DB_PATH&lt;/code&gt; - SQLite database file path, default: &lt;code&gt;./data/observations.db&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FRAME_DIFF_THRESHOLD&lt;/code&gt; - motion sensitivity on a 0–1 scale, higher means less sensitive, default: &lt;code&gt;0.15&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MIN_CONFIDENCE&lt;/code&gt; - minimum VLM confidence required to log an observation, default: &lt;code&gt;0.5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PROCESSING_INTERVAL&lt;/code&gt; - seconds between frame samples, default: &lt;code&gt;1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MAX_OBSERVATIONS&lt;/code&gt; - cap on stored rows, older observations are pruned, default: &lt;code&gt;10000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DASHBOARD_HOST&lt;/code&gt; - FastAPI bind host, default: &lt;code&gt;0.0.0.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DASHBOARD_PORT&lt;/code&gt; - FastAPI port, default: &lt;code&gt;8080&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;INPUT_SOURCE&lt;/code&gt; - camera index or path to image folder, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OUTPUT_DIR&lt;/code&gt; - where observation artifacts are written, default: &lt;code&gt;./data/observations/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;THUMBNAIL_DIR&lt;/code&gt; - where frame thumbnails are saved, default: &lt;code&gt;./data/thumbnails/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LOG_LEVEL&lt;/code&gt; - Python logging level, default: &lt;code&gt;INFO&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LOG_FILE&lt;/code&gt; - optional log file path, default: &lt;code&gt;./data/agent.log&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;MIN_CONFIDENCE&lt;/code&gt; is worth paying attention to — observations where the VLM's confidence falls below 0.5 are not stored. Raising this filters out uncertain detections. Lowering it logs more, including lower-confidence observations.&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Quick start - mock mode, no model download&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fastest way to verify the full pipeline is mock mode. It bypasses model loading entirely and uses deterministic stub responses, so you can confirm the agent loop, database writes, thumbnail generation, and dashboard all work before committing to the 5GB model download:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; data/test_images
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--mock&lt;/span&gt; &lt;span class="nt"&gt;--input&lt;/span&gt; ./data/test_images &lt;span class="nt"&gt;--duration&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs the agent for 30 seconds against the &lt;code&gt;data/test_images/&lt;/code&gt; folder using the mock VLM, populates &lt;code&gt;data/observations.db&lt;/code&gt;, and writes thumbnails to &lt;code&gt;data/thumbnails/&lt;/code&gt;.&lt;/p&gt;
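
&lt;p&gt;If &lt;code&gt;data/test_images/&lt;/code&gt; is empty, the agent has nothing to iterate over. A quick way to drop in a few synthetic frames - a sketch assuming Pillow is installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: generate a few synthetic frames so mock mode has input to process.
# Assumes Pillow is installed; any images in data/test_images/ work equally.
import os
from PIL import Image

os.makedirs("data/test_images", exist_ok=True)
for i, color in enumerate([(255, 0, 0), (0, 255, 0), (0, 0, 255)]):
    Image.new("RGB", (640, 480), color).save(f"data/test_images/frame_{i}.png")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;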

&lt;p&gt;&lt;strong&gt;Run against a webcam&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--input&lt;/span&gt; 0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Camera index 0 is the default device. For additional cameras, use index 1, 2, and so on. Open &lt;code&gt;http://localhost:8080&lt;/code&gt; in a browser to see the live dashboard. The dashboard shows the live feed, the most recent observations, and a searchable log of everything the agent has recorded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run against an image folder&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--input&lt;/span&gt; ./images &lt;span class="nt"&gt;--interval&lt;/span&gt; 2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Iterates over &lt;code&gt;./images&lt;/code&gt; at 2-second intervals. Useful for batch processing a folder of scanned documents, receipts, or photos without a live camera feed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dashboard only in read mode&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--mode&lt;/span&gt; dashboard &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Serves the dashboard against an existing &lt;code&gt;data/observations.db&lt;/code&gt; without running the agent. Useful for reviewing historical observations without starting a new capture session.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Reference
&lt;/h2&gt;

&lt;p&gt;The FastAPI dashboard exposes six endpoints:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm66ohxfya82kfvtv4gx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm66ohxfya82kfvtv4gx.png" alt=" " width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/api/search&lt;/code&gt; endpoint runs full-text search over stored observation descriptions, useful for finding all observations that mention a specific object, person, or piece of text across the full history.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/api/observations&lt;/code&gt; endpoint is paginated with &lt;code&gt;limit&lt;/code&gt; and &lt;code&gt;offset&lt;/code&gt; parameters. The default returns the 50 most recent observations.&lt;/p&gt;
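&lt;p&gt;For a quick sanity check, both endpoints can be exercised from a few lines of Python. The paths and the &lt;code&gt;limit&lt;/code&gt;/&lt;code&gt;offset&lt;/code&gt; parameters come from the description above; the &lt;code&gt;q&lt;/code&gt; search parameter name is an assumption on my part, so confirm it against the dashboard's API docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8080"

def get(path, **params):
    # GET a dashboard endpoint and decode the JSON body
    url = BASE + path + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

# 10 most recent observations (the endpoint defaults to the 50 most recent)
recent = get("/api/observations", limit=10, offset=0)

# Full-text search over stored descriptions
# NOTE: the "q" parameter name is hypothetical - check the API docs
matches = get("/api/search", q="red car")

print(recent)
print(matches)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;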

&lt;h2&gt;
  
  
  Models Used
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l2t21i3rhps3jm4ra5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l2t21i3rhps3jm4ra5f.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SmolVLM2-2.2B is the default &lt;code&gt;--model&lt;/code&gt; argument and &lt;code&gt;MODEL_NAME&lt;/code&gt; env var; no other models are referenced in code, config, or docs. The model is downloaded from HuggingFace on first run and cached in &lt;code&gt;./models&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;The test suite covers all five modules - database, vision, agent, dashboard, and CLI - with the VLM fully mocked so no model download is needed to run tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;test&lt;/span&gt;                  &lt;span class="c"&gt;# python3 -m pytest tests/ -v
&lt;/span&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;lint&lt;/span&gt;                  &lt;span class="c"&gt;# ruff check src/ tests/ --fix
&lt;/span&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;typecheck&lt;/span&gt;             &lt;span class="c"&gt;# mypy src/ --ignore-missing-imports
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test coverage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tests/test_db.py&lt;/code&gt; - 10 tests covering SQLite schema, CRUD, and search&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_vision.py&lt;/code&gt; - 6 tests covering mock VLM and prompt rendering&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_agent.py&lt;/code&gt; - 9 tests covering motion detection and the agent loop&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_dashboard.py&lt;/code&gt; - 6 tests covering HTTP route handlers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_cli.py&lt;/code&gt; - 7 tests covering argparse and env-var loading&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total: 38 tests, all passing. No skipped tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;smolvlm2-edge-agent/
├── src/
│   ├── __init__.py
│   ├── __main__.py              # entry point for python -m src
│   ├── agent.py                 # MotionDetector + VisionAgent
│   ├── vision.py                # VisionEngine (SmolVLM2 wrapper, with MockVisionEngine)
│   ├── db.py                    # SQLite Database class
│   ├── dashboard.py             # FastAPI app factory + route handlers
│   └── cli.py                   # argparse + env loading
├── tests/                       # 38 pytest tests, VLM fully mocked
├── data/.gitkeep                # observations.db, thumbnails/, test_images/ land here
├── models/.gitkeep              # HF model cache
├── pyproject.toml               # ruff + mypy config + console_script
├── requirements.txt             # pinned runtime deps
├── Makefile                     # install, test, lint, typecheck, run, clean
├── .env.example                 # documented env vars
├── .gitignore
├── BUILD_NOTES.md               # build/verification trace
└── PUBLISH.md                   # exact GitHub push commands
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;src/&lt;/code&gt; directory maps cleanly to the agent's responsibilities - &lt;code&gt;agent.py&lt;/code&gt; handles the motion detection and VLM orchestration loop, &lt;code&gt;vision.py&lt;/code&gt; wraps the model with a mock-compatible interface, &lt;code&gt;db.py&lt;/code&gt; handles all SQLite operations, &lt;code&gt;dashboard.py&lt;/code&gt; is the FastAPI application, and &lt;code&gt;cli.py&lt;/code&gt; handles all argument parsing and environment variable loading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;PRs welcome. Before submitting, all three of the following must pass with zero errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make lint &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make typecheck &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The process started with an idea - a fully offline edge vision agent that runs on CPU-only hardware with no GPU and no cloud API calls. I put together a clear project description with the requirements, tech stack, and expected output, and handed it to NEO. From there NEO handled the full build autonomously: writing the code, running tests, fixing issues, and iterating until everything was working end to end. Once NEO completed the build, I did a manual review, tested it myself, and fed any improvements back - which NEO then implemented.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as an offline home security monitor:&lt;/strong&gt; Point it at a webcam, let it run, and review what it logged through the dashboard. Every scene change is stored with a timestamp, description, confidence score, and thumbnail - all locally, with no data leaving your machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it for document digitization pipelines:&lt;/strong&gt; Point &lt;code&gt;--input&lt;/code&gt; at a folder of scanned receipts, whiteboards, or handwritten notes. The VLM reads text from images and logs structured observations. The &lt;code&gt;/api/search&lt;/code&gt; endpoint lets you query what was found across the full document set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it as an accessibility tool:&lt;/strong&gt; Run it against a webcam feed to generate continuous natural language descriptions of what is visible in the environment - stored and searchable, entirely offline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional VLM backends:&lt;/strong&gt; &lt;code&gt;VisionEngine&lt;/code&gt; in &lt;code&gt;vision.py&lt;/code&gt; wraps SmolVLM2-2.2B with a clean interface that &lt;code&gt;MockVisionEngine&lt;/code&gt; also implements. Swapping in a different HuggingFace multimodal model means updating &lt;code&gt;vision.py&lt;/code&gt; - the agent, database, dashboard, and CLI stay entirely unchanged.&lt;/p&gt;
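&lt;p&gt;For a sense of what that swap involves, here is a minimal sketch of an alternative backend. The class and file names match the description above, but the method name and signature are assumptions - mirror whatever interface &lt;code&gt;MockVisionEngine&lt;/code&gt; actually implements in &lt;code&gt;vision.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical alternative backend for src/vision.py
from transformers import AutoModelForImageTextToText, AutoProcessor

class MyVisionEngine:
    """Drop-in alternative to VisionEngine, assuming the shared interface
    is a single describe(image, prompt) method returning a string."""

    def __init__(self, model_name="some-org/another-multimodal-model"):
        # model_name is a placeholder, not a real checkpoint
        self.processor = AutoProcessor.from_pretrained(model_name)
        self.model = AutoModelForImageTextToText.from_pretrained(model_name)

    def describe(self, image, prompt):
        # image is assumed to be a PIL.Image, as with the default engine
        inputs = self.processor(text=prompt, images=image, return_tensors="pt")
        output_ids = self.model.generate(**inputs, max_new_tokens=128)
        return self.processor.batch_decode(output_ids, skip_special_tokens=True)[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;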

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;SmolVLM2 Edge Vision Agent shows that meaningful vision AI does not require a GPU or a cloud API: a 2.2B-parameter model, motion-gated inference, a local SQLite store, and a FastAPI dashboard, all running offline on commodity hardware.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Prompt Compression Benchmarker: Cut LLM Input Costs by 35–63% With Measurable Quality Tracking</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Wed, 06 May 2026 07:07:48 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/prompt-compression-benchmarker-cut-llm-input-costs-by-35-63-with-measurable-quality-tracking-12f3</link>
      <guid>https://dev.to/nilofer_tweets/prompt-compression-benchmarker-cut-llm-input-costs-by-35-63-with-measurable-quality-tracking-12f3</guid>
<description>&lt;p&gt;Most LLM cost comes from input tokens: the long documents, codebases, or conversation histories you send as context. Several prompt compression algorithms are available, but nobody tells you which one actually works best for your specific workload, or how much quality you are trading for the savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Compression Benchmarker (PCB)&lt;/strong&gt; answers both questions. It benchmarks every major prompt compression algorithm against your actual data, shows you exactly how much quality each one drops, projects the real dollar savings at your call volume, and then gives you a one-line wrapper to deploy the winner as a drop-in replacement around your Anthropic or OpenAI client.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;PCB answers two questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which compression algorithm preserves the most quality at a given token budget?&lt;/strong&gt; &lt;br&gt;
Benchmark mode runs all compressors against your data and scores each one with task-specific quality metrics and an optional LLM-as-judge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much money does that save at your actual call volume?&lt;/strong&gt; &lt;br&gt;
Cost projection mode takes your daily token volume and model pricing and gives you monthly and annual savings per compressor.&lt;/p&gt;

&lt;p&gt;Then it gives you a one-line wrapper to deploy the answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm265a0k98rmkebalhz97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm265a0k98rmkebalhz97.png" alt=" " width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From source
git clone https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker
cd Prompt-Compression-Benchmarker
pip install .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From PyPI (once published)
pip install prompt-compression-benchmarker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requires Python 3.9+. No GPU required. Core dependencies: &lt;code&gt;tiktoken&lt;/code&gt;, &lt;code&gt;scikit-learn&lt;/code&gt;, &lt;code&gt;rouge-score&lt;/code&gt;, &lt;code&gt;rank-bm25&lt;/code&gt;, &lt;code&gt;typer&lt;/code&gt;, &lt;code&gt;rich&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Verify
pcb --help

# Optional extras
pip install "prompt-compression-benchmarker[anthropic]"   # SDK wrapper for Anthropic
pip install "prompt-compression-benchmarker[openai]"      # SDK wrapper for OpenAI
pip install "prompt-compression-benchmarker[mcp]"         # MCP server for Claude Code
pip install "prompt-compression-benchmarker[all]"         # Everything
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Run the benchmark&lt;/strong&gt;&lt;br&gt;
The simplest run uses bundled sample data - no setup needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# All compressors × all task types, bundled sample data — no setup needed
pcb run

# Target a specific task with cost projection
pcb run --task rag --max-samples 20 --daily-tokens 2000000 --cost-model claude-sonnet-4-6

# Add LLM-as-judge for deeper quality scoring (requires OpenRouter API key)
export OPENROUTER_API_KEY=sk-or-...
pcb run --llm-judge --judge-model claude-sonnet-4-6 --max-samples 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what a real benchmark run looks like - RAG task, 3M tokens/day, claude-sonnet-4-6 pricing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --daily-tokens 3000000 --cost-model claude-sonnet-4-6
                          RAG
 Compressor          Token Reduc %  Proxy Score  Proxy Drop %   ms
 no_compression           0.0%        0.2983         0.0%      0.3
 tfidf ★                 40.1%        0.2519        +16.5%     12.1
 selective_context        56.9%        0.1874        +34.4%      8.3
 llmlingua                53.6%        0.2182        +28.1%      9.7
 llmlingua2               45.0%        0.2204        +27.3%     11.2

 Monthly Cost Projection  claude-sonnet-4-6 · $3/1M · 3M tokens/day
 tfidf             38.3% reduction   $103/mo saved   $1,240/yr
 selective_context 57.5% reduction   $155/mo saved   $1,863/yr
 llmlingua2        43.6% reduction   $118/mo saved   $1,413/yr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ★ marks the Pareto-optimal compressor - best token savings given a quality drop below 20%.&lt;/p&gt;
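&lt;p&gt;The selection rule is simple enough to restate in a few lines. This is a sketch of the rule as just described (best reduction among compressors whose quality drop stays under the 20% floor), not PCB's actual implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# (name, token_reduction_pct, quality_drop_pct) from the table above
results = [
    ("tfidf", 40.1, 16.5),
    ("selective_context", 56.9, 34.4),
    ("llmlingua", 53.6, 28.1),
    ("llmlingua2", 45.0, 27.3),
]

MAX_DROP = 20.0  # quality floor behind the star

eligible = [r for r in results if r[2] &amp;lt; MAX_DROP]
winner = max(eligible, key=lambda r: r[1]) if eligible else None
print(winner)  # ('tfidf', 40.1, 16.5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;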

&lt;p&gt;&lt;strong&gt;2. Compress a file directly&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Compress from a file or stdin, output to stdout
pcb compress context.txt --compressor llmlingua2 --rate 0.45 --stats

# Pipe it into any script
cat rag_context.txt | pcb compress | python send_to_claude.py

# Save compressed output
pcb compress context.txt -o compressed.txt --stats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Deploy the winner&lt;/strong&gt;&lt;br&gt;
Once you know which compressor wins on your data, deploying it is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingAnthropic

# Drop-in replacement for anthropic.Anthropic()
client = CompressingAnthropic(compressor="llmlingua2", rate=0.45)

response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)

print(client.stats)  # CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else in your codebase stays the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Benchmark table columns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5abgr0r8hnyzshkj5ltz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5abgr0r8hnyzshkj5ltz.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality drop color coding&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cyan   = negative drop (compression improved the metric — noise removal)
green  = &amp;lt; 5% drop    (effectively lossless)
yellow = 5–15% drop   (acceptable for most use cases)
red    = ≥ 15% drop   (significant information loss)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why use the LLM judge?
&lt;/h2&gt;

&lt;p&gt;The proxy score (F1, ROUGE, BM25) is fast and free but mechanical. The LLM judge calls a real model to evaluate whether the compressed context still supports the correct answer, and it reveals things proxy metrics miss.&lt;/p&gt;

&lt;p&gt;Here is a real example showing why this matters - RAG task, 5 samples, LLM judge = claude-sonnet-4-6:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compressor           Proxy Drop %   LLM Score   LLM Drop %
no_compression           0.0%         0.94         0.0%
tfidf                  +23.7%         0.40        -57.4%    ← proxy hid the severity
llmlingua2             +29.9%         0.70        -25.5%    ← much better than proxy suggested
selective_context      +37.6%         0.14        -85.1%    ← dangerous despite high compression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rule of thumb: use proxy scores to compare many configs quickly, then LLM-judge the top 2–3 before deploying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing a Compressor
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG:&lt;/strong&gt; &lt;code&gt;llmlingua2&lt;/code&gt; at rate 0.40 - preserves named entities and key facts better than sentence-dropping&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summarization:&lt;/strong&gt; &lt;code&gt;llmlingua&lt;/code&gt; at rate 0.45 - sentence-level pruning maintains structural coverage&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code contexts:&lt;/strong&gt; &lt;code&gt;llmlingua2&lt;/code&gt; at rate 0.35 - keeps imports, identifiers, type names; removes boilerplate&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;General chat:&lt;/strong&gt; &lt;code&gt;tfidf&lt;/code&gt; at rate 0.40 - safe default, fast, reliable&lt;/p&gt;

&lt;h2&gt;
  
  
  Target compression rate
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;--rate&lt;/code&gt; is the fraction of tokens to remove. 0.45 means keep 55% of tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5q2a5j4q65dt793zgm5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5q2a5j4q65dt793zgm5.png" alt=" " width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;
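&lt;p&gt;Because "rate" elsewhere sometimes means the fraction kept, it is worth pinning down PCB's convention with a concrete number (plain Python, just restating the definition above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# --rate is the fraction of tokens REMOVED
original_tokens = 1000
rate = 0.45

removed = int(original_tokens * rate)  # 450 tokens removed
kept = original_tokens - removed       # 550 tokens (55%) kept
print(removed, kept)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;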

&lt;h2&gt;
  
  
  Cost Savings - The Real Numbers
&lt;/h2&gt;

&lt;p&gt;Compression saves money on input tokens only. Output tokens are unchanged.&lt;br&gt;
At 3M input tokens per day:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthywdxzquiykzeazrosi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthywdxzquiykzeazrosi.png" alt=" " width="796" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compression is most valuable on premium models. On DeepSeek or GPT-4.1-mini, the savings are too small to justify the complexity; use it there only if you're hitting context window limits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check your own workload
pcb run --max-samples 10 --daily-tokens 5000000 --cost-model claude-opus-4-7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploy: Python SDK Wrappers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingAnthropic

client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    verbose=True,
)

response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)

# Cumulative stats
print(client.stats)
# CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)

# Estimate monthly savings
print(client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000))
# 588.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OpenAI (Chat Completions + Codex Responses API)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingOpenAI

client = CompressingOpenAI(compressor="tfidf", rate=0.40)

# Chat Completions API — unchanged
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": long_context}]
)

# Responses API (Codex / o-series)
response = client.responses.create(
    model="codex-mini-latest",
    input=long_codebase_context,
    reasoning={"effort": "high"}
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What gets compressed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By default, only "user" role messages over 100 tokens are compressed. System prompts and assistant history are passed through unchanged.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    compress_roles=("user", "system"),  # also compress system prompt
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Claude Code Integration (MCP)
&lt;/h2&gt;

&lt;p&gt;PCB ships an MCP server that adds four compression tools directly into Claude Code conversations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Add to the current project
claude mcp add pcb -s project -- python -m pcb.mcp_server

# Or add to all your projects
claude mcp add pcb -s user -- python -m pcb.mcp_server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or drop &lt;code&gt;.mcp.json&lt;/code&gt; into any project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "pcb": {
      "type": "stdio",
      "command": "python",
      "args": ["-m", "pcb.mcp_server"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Available tools&lt;/strong&gt;&lt;br&gt;
Once connected, you can ask Claude:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Compress this RAG context before sending it to the model"&lt;/li&gt;
&lt;li&gt;"Estimate how much I'd save compressing my prompts on claude-opus-4-7 at 2000 calls/day"&lt;/li&gt;
&lt;li&gt;"What compressor should I use for my coding assistant at 90% quality floor?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqk0rzatmcf5xghgtyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqk0rzatmcf5xghgtyd.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI Codex (Agents SDK)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from agents import Agent, Runner
from agents.mcp import MCPServerStdio
import asyncio

async def main():
    async with MCPServerStdio(
        name="pcb",
        params={"command": "python", "args": ["-m", "pcb.mcp_server"]},
    ) as pcb_server:
        agent = Agent(
            name="CostAwareAssistant",
            model="codex-mini-latest",
            mcp_servers=[pcb_server],
        )
        result = await Runner.run(
            agent,
            "Compress this codebase context and estimate savings: " + codebase_context
        )
        print(result.final_output)

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bring Your Own Data
&lt;/h2&gt;

&lt;p&gt;Data is JSONL - one JSON object per line. Check the schema for each task type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb show-schema rag
pcb show-schema summarization
pcb show-schema coding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RAG schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "my_001",
  "context": "&amp;lt;passage 300–1500 tokens&amp;gt;",
  "question": "&amp;lt;specific question requiring the full context&amp;gt;",
  "answer": "&amp;lt;short, precise answer string&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Summarization schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "my_001",
  "article": "&amp;lt;article or document 300–800 tokens&amp;gt;",
  "summary": "&amp;lt;2–3 sentence reference summary&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Coding schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "my_001",
  "context": "&amp;lt;imports, helpers, type definitions — 400–800 tokens&amp;gt;",
  "docstring": "&amp;lt;description of the function to implement&amp;gt;",
  "solution": "&amp;lt;correct Python implementation&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running on your data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --data-dir ./my_data --task rag --max-samples 50

# Compare specific compressors
pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2

# Export results
pcb run --data-dir ./my_data --output results.json
pcb run --data-dir ./my_data --output results.csv
pcb run --data-dir ./my_data --output results.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Workflow: Benchmark to Production
&lt;/h2&gt;

&lt;p&gt;Here is the full path from benchmarking to deploying a compressor in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Benchmark on your actual data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --data-dir ./my_data --max-samples 50 --task rag \
        --daily-tokens 2000000 --cost-model claude-opus-4-7 \
        --output benchmark.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: LLM-judge the top candidates&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2 \
        --llm-judge --judge-model claude-sonnet-4-6 --max-samples 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Deploy the winner&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingAnthropic

client = CompressingAnthropic(compressor="llmlingua2", rate=0.40)
# Everything else in your codebase stays the same
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Monitor in production&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if client.stats.calls % 1000 == 0:
    logger.info(
        "pcb savings: calls=%d saved=%d tokens (%.1f%%) est_monthly=$%.0f",
        client.stats.calls,
        client.stats.tokens_saved,
        client.stats.reduction_pct,
        client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000),
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;pcb run&lt;/code&gt; - benchmark&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Options:
  -c, --compressor TEXT       Compressor to include (repeat for multiple). Default: all five.
  -t, --task TEXT             Task type: rag, summarization, coding (repeat for multiple).
  -n, --max-samples INT       Max samples per task.
  -r, --rate FLOAT            Target compression rate 0.0–1.0. Default: 0.5
  -d, --data-dir PATH         Directory with *_samples.jsonl files.
  -o, --output PATH           Save report as .json, .csv, or .html.
  -j, --llm-judge             Enable LLM-as-judge scoring via OpenRouter.
  -m, --judge-model TEXT      Model for LLM judge. Default: claude-sonnet-4-6.
      --openrouter-key TEXT   OpenRouter API key (or set OPENROUTER_API_KEY).
      --daily-tokens INT      Daily token volume for cost projection.
      --cost-model TEXT       Model name for cost lookup (e.g. claude-opus-4-7).
      --token-price FLOAT     Manual price override in $/1M tokens.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;pcb compress&lt;/code&gt; - compress text&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Arguments:
  [INPUT_FILE]                File to compress. Reads stdin if omitted.

Options:
  -c, --compressor TEXT       Algorithm. Default: tfidf.
  -r, --rate FLOAT            Fraction to remove. Default: 0.45.
  -o, --output PATH           Write to file instead of stdout.
  -s, --stats                 Print token stats to stderr.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Other commands&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb list-compressors          # Show all algorithms
pcb list-models               # Show 75+ supported LLM judge models
pcb show-schema rag           # Show JSONL schema for a task type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Output Formats
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;JSON - full detail per sample&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --output results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CSV - one row per compressor × task&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --output results.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Columns: &lt;code&gt;compressor&lt;/code&gt;, &lt;code&gt;task&lt;/code&gt;, &lt;code&gt;avg_token_reduction_pct&lt;/code&gt;, &lt;code&gt;avg_quality_score&lt;/code&gt;, &lt;code&gt;avg_quality_drop_pct&lt;/code&gt;, &lt;code&gt;avg_llm_score&lt;/code&gt;, &lt;code&gt;avg_llm_drop_pct&lt;/code&gt;, &lt;code&gt;avg_latency_ms&lt;/code&gt;, &lt;code&gt;num_samples&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTML - shareable visual report&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --output results.html
# Open in any browser — Chart.js scatter plots, dark theme, Pareto highlights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When NOT to Use Compression
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Short prompts (&amp;lt; 200 tokens):&lt;/strong&gt; PCB skips these automatically; the overhead exceeds the savings.&lt;br&gt;
&lt;strong&gt;Cheap models (&amp;lt; $0.50/1M):&lt;/strong&gt; DeepSeek, Gemini Flash, GPT-4.1-mini - savings too small.&lt;br&gt;
&lt;strong&gt;High-precision tasks:&lt;/strong&gt; Legal review, medical diagnosis - verify your quality floor with &lt;code&gt;--llm-judge&lt;/code&gt; first.&lt;br&gt;
&lt;strong&gt;Output-bottlenecked workloads:&lt;/strong&gt; Compression only affects input tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/pcb/
├── cli.py                      # Typer CLI — all commands
├── config.py                   # Pydantic config and model pricing table
├── runner.py                   # Benchmark orchestration + BenchmarkReport
├── mcp_server.py               # FastMCP server for Claude Code / Codex
├── compressors/
│   ├── tfidf.py                # TF-IDF sentence scoring
│   ├── selective_context.py    # Greedy token-budget selection
│   ├── llmlingua.py            # Sentence-level coarse pruning
│   └── no_compression.py       # Passthrough baseline
├── tasks/
│   ├── rag.py                  # F1/EM/context-recall evaluator
│   ├── summarization.py        # ROUGE-L evaluator
│   └── coding.py               # BM25 + identifier preservation
├── evaluators/
│   └── llm_judge.py            # OpenRouter LLM-as-judge (75+ models)
├── reporters/
│   ├── terminal.py             # Rich terminal tables
│   ├── json_reporter.py        # JSON output
│   ├── csv_reporter.py         # CSV output
│   └── html_reporter.py        # Chart.js HTML report
├── middleware/
│   ├── anthropic_client.py     # CompressingAnthropic drop-in wrapper
│   └── openai_client.py        # CompressingOpenAI drop-in wrapper
└── data/
    ├── rag_samples.jsonl        # 20 real-world factual passages (400–450 tokens)
    ├── summarization_samples.jsonl  # 10 real news-style articles
    └── coding_samples.jsonl     # 10 real Python code contexts (370–800 tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;I described the problem at a high level: a tool that benchmarks multiple prompt compression algorithms against real workloads, scores quality loss empirically, projects actual dollar savings at a given token volume, and makes it trivially easy to deploy the winning algorithm into an existing Anthropic or OpenAI codebase.&lt;/p&gt;

&lt;p&gt;NEO built the entire thing autonomously: the Typer CLI with all commands and flags, all five compressor implementations, the F1/ROUGE-L/BM25 task evaluators, the OpenRouter LLM-as-judge with support for 75+ models, the cost projection engine with the model pricing table, the &lt;code&gt;CompressingAnthropic&lt;/code&gt; and &lt;code&gt;CompressingOpenAI&lt;/code&gt; drop-in wrappers, the FastMCP server with four tools, the JSON/CSV/HTML reporters, and the three bundled sample datasets - 20 RAG passages, 10 summarization articles, and 10 coding contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it before committing to a compression strategy.&lt;/strong&gt;&lt;br&gt;
Before wiring any compressor into your production stack, run pcb on a sample of your actual prompts. The benchmark tells you which algorithm preserves the most quality at your target compression rate on your data - a measured result, not a generic recommendation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to justify the cost of compression infrastructure.&lt;/strong&gt;&lt;br&gt;
The cost projection output gives you monthly and annual savings at your actual token volume and model pricing. This is the number you need to make a case for adding compression to your pipeline, not a rough estimate but a measured projection against your workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the MCP tools inside Claude Code sessions.&lt;/strong&gt;&lt;br&gt;
With the MCP server connected, you can ask Claude to compress a context, estimate savings, or recommend a compressor without leaving your coding environment. This makes compression a natural part of the agent workflow rather than a separate offline step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional compressors.&lt;/strong&gt;&lt;br&gt;
The compressors share a common interface in &lt;code&gt;src/pcb/compressors/&lt;/code&gt;. A new algorithm - semantic chunking, abstractive summarization, or a custom retrieval-based approach - slots in as a new file in that directory and appears automatically in &lt;code&gt;pcb run&lt;/code&gt;, &lt;code&gt;pcb compress&lt;/code&gt;, and the MCP &lt;code&gt;recommend&lt;/code&gt; tool without touching any other part of the codebase.&lt;/p&gt;
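&lt;p&gt;For a sense of the pattern, here is a minimal sketch of a new compressor file. The method name and signature are assumptions - copy the shape of an existing file like &lt;code&gt;tfidf.py&lt;/code&gt; rather than this sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# src/pcb/compressors/first_sentences.py - hypothetical example

class FirstSentencesCompressor:
    """Toy baseline: keep leading sentences until the budget is spent.
    Assumes the shared interface is compress(text, rate) returning a string,
    where rate is the fraction of tokens to remove (PCB's convention)."""

    name = "first_sentences"

    def compress(self, text: str, rate: float) -&amp;gt; str:
        sentences = text.split(". ")
        keep = max(1, round(len(sentences) * (1.0 - rate)))
        return ". ".join(sentences[:keep])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;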

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Most teams discover they are overspending on input tokens only after the bill arrives. pcb gives you the benchmark data to make an informed decision before committing - which algorithm, at what rate, for which task type - and the deployment tooling to act on it immediately.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>ContextCraft: A Visual Workbench for Building and Managing LLM Context Windows</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 05 May 2026 16:00:35 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/contextcraft-a-visual-workbench-for-building-and-managing-llm-context-windows-3f8e</link>
      <guid>https://dev.to/nilofer_tweets/contextcraft-a-visual-workbench-for-building-and-managing-llm-context-windows-3f8e</guid>
<description>&lt;p&gt;Building a good LLM prompt is not a one-shot task. You assemble the pieces - a system message, a few examples, some context, the actual instruction - and then you iterate. You compress things that are too long, test whether the output still holds up, check how many tokens you are spending, and save versions so you can roll back when something breaks.&lt;/p&gt;

&lt;p&gt;Most developers do this in a text editor, a notebook, or scattered across a handful of scripts. There is no single place where you can see the whole context window, manipulate it visually, compress a block, run a live test, and save a snapshot, all without switching tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ContextCraft&lt;/strong&gt; is that place. It is a canvas-based interactive workbench for assembling, compressing, testing, and versioning LLM context windows. It runs locally, connects to Ollama for local compression and testing, supports OpenRouter for cloud LLM testing, and exports directly to OpenAI, Anthropic, LangChain, and JSON formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Visual Canvas:&lt;/strong&gt; Drag and drop interface for organizing prompt blocks with real-time token counting and visual progress bars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart Compression:&lt;/strong&gt; AI-powered compression using Ollama with semantic preservation. Set a target compression ratio, choose whether to preserve structure, and review a before/after comparison before applying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coverage Analysis:&lt;/strong&gt; Semantic similarity scoring between original and compressed content. Key concept preservation is surfaced as a score so you know exactly what you are trading for token savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Testing:&lt;/strong&gt; Test prompts with streaming responses from Ollama or OpenRouter directly from the canvas. Select provider, model, and temperature and view responses in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version Control:&lt;/strong&gt; Save and restore canvas versions with a SQLite backend. Name versions for easy reference and compare two versions to see what changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Format Export:&lt;/strong&gt; Export to OpenAI, Anthropic, LangChain, and JSON formats. Copy the generated code and paste directly into your application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block Library:&lt;/strong&gt; Pre-built starter blocks for common use cases, available from the sidebar. Add your own blocks to the library for reuse across canvases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;ContextCraft is split into a FastAPI backend and a React + Vite frontend.&lt;/p&gt;

&lt;p&gt;The backend handles token counting via tiktoken, semantic similarity analysis via sentence-transformers, compression via Ollama, streaming LLM test responses via Ollama or OpenRouter, SQLite-backed version management, and export format generation. The frontend renders the visual canvas with drag-and-drop via &lt;code&gt;@hello-pangea/dnd&lt;/code&gt; and code editing via CodeMirror.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4emd0squlz4375pjo2rv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4emd0squlz4375pjo2rv.png" alt=" " width="800" height="461"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contextcraft/
├── server/                 # FastAPI backend
│   ├── main.py            # FastAPI app entry point
│   ├── models.py          # Pydantic data models
│   ├── tokenizer.py       # Token counting (tiktoken)
│   ├── coverage.py        # Semantic similarity analysis
│   ├── compress.py        # Ollama compression service
│   ├── tester.py          # LLM streaming test service
│   ├── export.py          # Export format generators
│   ├── versions.py        # SQLite version management
│   └── pricing.py         # OpenRouter pricing API
├── frontend/              # React + Vite frontend
│   ├── src/
│   │   ├── components/    # React components
│   │   ├── hooks/         # Custom React hooks
│   │   └── App.jsx        # Main application
│   └── package.json
├── cli/                   # CLI entry point
│   └── main.py
└── pyproject.toml         # Python package config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.9+&lt;/li&gt;
&lt;li&gt;Node.js 18+&lt;/li&gt;
&lt;li&gt;Ollama (optional, for local compression and testing)&lt;/li&gt;
&lt;li&gt;OpenRouter API key (optional, for cloud LLM testing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/dakshjain-1616/ContextCraft.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ContextCraft

&lt;span class="c"&gt;# Install Python dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;

&lt;span class="c"&gt;# Install frontend dependencies&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;frontend
npm &lt;span class="nb"&gt;install
cd&lt;/span&gt; ..

&lt;span class="c"&gt;# Initialize the database&lt;/span&gt;
contextcraft init-db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running the Application&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the server and frontend&lt;/span&gt;
contextcraft serve

&lt;span class="c"&gt;# Or start with custom options&lt;/span&gt;
contextcraft serve &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &lt;span class="nt"&gt;--frontend-port&lt;/span&gt; 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once running, the frontend is available at localhost:5173 and the API docs at localhost:8000/docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# OpenRouter API key (for cloud LLM testing)
&lt;/span&gt;&lt;span class="py"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your_api_key_here&lt;/span&gt;

&lt;span class="c"&gt;# Ollama URL (default: http://localhost:11434)
&lt;/span&gt;&lt;span class="py"&gt;OLLAMA_URL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;

&lt;span class="c"&gt;# Default compression model
&lt;/span&gt;&lt;span class="py"&gt;DEFAULT_COMPRESSION_MODEL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gemma2:2b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Supported Models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token Counting:&lt;/strong&gt; GPT-4, GPT-4o, GPT-4o Mini, GPT-3.5 Turbo, Claude 3 Opus, Sonnet, Haiku, Claude 3.5 Sonnet&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compression (via Ollama):&lt;/strong&gt; gemma2:2b (default), any Ollama-compatible model&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing:&lt;/strong&gt; Ollama local models, OpenRouter cloud models (requires API key)&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage Guide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Creating a Canvas:&lt;/strong&gt; Start with an empty canvas or load from the library. Add blocks using the sidebar buttons or drag from the library. Arrange blocks by dragging to reorder. Edit block content inline or in the full editor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compressing Content:&lt;/strong&gt; Click the compress icon on any block. Set the target compression ratio (0.1 to 0.9). Choose whether to preserve structure. Review the before/after comparison. Apply compression when satisfied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing Prompts:&lt;/strong&gt; Add your prompt blocks to the canvas. Click the Test button. Select provider (Ollama or OpenRouter). Choose model and set temperature. View streaming responses in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyzing Coverage:&lt;/strong&gt; Compress one or more blocks. Click the Coverage button. View semantic similarity scores. Check key concept preservation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managing Versions:&lt;/strong&gt; Click Versions to save the current state. Name your version for easy reference. Restore previous versions at any time. Compare versions to see changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exporting:&lt;/strong&gt; Click Export when ready. Choose a format: OpenAI, Anthropic, LangChain, or JSON. Copy the generated code. Paste into your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Token Management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /api/tokenize&lt;/code&gt; - count tokens for text or blocks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /api/pricing&lt;/code&gt; - get model pricing information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compression&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /api/compress&lt;/code&gt; - compress text using Ollama&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/coverage&lt;/code&gt; - analyze semantic coverage between original and compressed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /api/test&lt;/code&gt; - stream LLM responses from Ollama or OpenRouter&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Versioning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /api/versions&lt;/code&gt; - list all versions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/versions&lt;/code&gt; - save a new version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /api/versions/{id}&lt;/code&gt; - get a specific version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/versions/{id}/restore&lt;/code&gt; - restore a version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/versions/compare&lt;/code&gt; - compare two versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Export&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /api/export&lt;/code&gt; - export canvas to OpenAI, Anthropic, LangChain, or JSON format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Library&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /api/library&lt;/code&gt; - get the starter block library&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/library&lt;/code&gt; - add a block to the library&lt;/li&gt;
&lt;/ul&gt;
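&lt;p&gt;As a quick illustration, here is one of these endpoints called from Python. The path comes from the list above; the request body fields are assumptions on my part, so check the generated docs at &lt;code&gt;localhost:8000/docs&lt;/code&gt; for the real schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import urllib.request

# Hypothetical payload - the field names are guesses, see /docs
payload = json.dumps({"text": "You are a helpful assistant.", "model": "gpt-4o"}).encode()

req = urllib.request.Request(
    "http://localhost:8000/api/tokenize",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # token count for the text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;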
&lt;h2&gt;
  
  
  CLI Commands
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the application&lt;/span&gt;
contextcraft serve

&lt;span class="c"&gt;# Initialize database&lt;/span&gt;
contextcraft init-db

&lt;span class="c"&gt;# Add a block to library&lt;/span&gt;
contextcraft add-block &lt;span class="nt"&gt;--type&lt;/span&gt; system &lt;span class="nt"&gt;--label&lt;/span&gt; &lt;span class="s2"&gt;"My Template"&lt;/span&gt; &lt;span class="nt"&gt;--content&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt;

&lt;span class="c"&gt;# Get help&lt;/span&gt;
contextcraft &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Docker
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build image&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; contextcraft &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Run container&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="nt"&gt;-p&lt;/span&gt; 5173:5173 contextcraft
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install dev dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;

&lt;span class="c"&gt;# Run tests&lt;/span&gt;
pytest

&lt;span class="c"&gt;# Format code&lt;/span&gt;
black server/ cli/
isort server/ cli/

&lt;span class="c"&gt;# Type checking&lt;/span&gt;
mypy server/ cli/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;frontend

&lt;span class="c"&gt;# Start dev server&lt;/span&gt;
npm run dev

&lt;span class="c"&gt;# Build for production&lt;/span&gt;
npm run build

&lt;span class="c"&gt;# Run linter&lt;/span&gt;
npm run lint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;Fork the repository. Create a feature branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; feature/amazing-feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Commit your changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s1"&gt;'Add amazing feature'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push to the branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git push origin feature/amazing-feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open a Pull Request.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This tool was designed, built, debugged, and iterated entirely using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; - an autonomous AI engineering agent that writes, runs, and refines real code end-to-end.&lt;/p&gt;

&lt;p&gt;ContextCraft is a full-stack application: a FastAPI backend, a React + Vite frontend, a CLI, and a SQLite-backed versioning layer. Every part of the system was generated and connected through NEO: the backend services for token counting, semantic coverage analysis, compression via Ollama, streaming LLM testing, export pipelines, version management, and pricing integration, along with the interactive frontend canvas for assembling prompt blocks with drag-and-drop, inline editing, and real-time token tracking.&lt;/p&gt;

&lt;p&gt;The compression and coverage pipeline, the live testing flow across Ollama and OpenRouter, the version save/restore and comparison system, and the multi-format export layer were all built end-to-end from a high-level problem description. NEO handled the full cycle - generating code, wiring components, resolving issues, and refining the system into a working product.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering workbench:&lt;/strong&gt; Instead of iterating on prompts in a text editor and manually counting tokens, assemble your context window visually, compress blocks that are too long, and test the result, all in one place. The version control means you never lose a working configuration while experimenting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluate compression quality before shipping:&lt;/strong&gt; Before deploying a compressed prompt to production, run coverage analysis to get a semantic similarity score between the original and compressed version. You know exactly how much meaning you are trading for token savings, not just a token count but an actual semantic measurement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manage prompt libraries across projects:&lt;/strong&gt; The block library lets you save reusable prompt blocks and load them into any canvas. Teams building multiple LLM products can maintain a shared library of tested, versioned prompt components.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional export formats:&lt;/strong&gt; The export module currently supports OpenAI, Anthropic, LangChain, and JSON. Adding a new format follows the same pattern in &lt;code&gt;export.py&lt;/code&gt; and surfaces automatically in the Export UI without touching any other part of the stack.&lt;/p&gt;
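&lt;p&gt;A rough sketch of that pattern follows. The function shape and the block attributes are assumptions - mirror how the existing OpenAI and Anthropic generators in &lt;code&gt;export.py&lt;/code&gt; are actually wired up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# server/export.py - hypothetical new generator, not shipped code
import json

def export_litellm(blocks):
    """Toy example of an additional format. Assumes each canvas block
    exposes .role and .content attributes."""
    messages = [{"role": b.role, "content": b.content} for b in blocks]
    return json.dumps({"model": "gpt-4o", "messages": messages}, indent=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;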

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Context window management is one of those problems that looks simple until you are doing it seriously. ContextCraft brings together the pieces that are usually scattered across different tools (visual assembly, token counting, AI compression, semantic coverage analysis, live testing, version control, and export) into a single local workbench.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/ContextCraft" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/ContextCraft&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can also use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>promptengineering</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>LLM Behavior Diff Model Update Detector</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Mon, 04 May 2026 11:18:41 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/llm-behavior-diff-model-update-detector-3e7b</link>
      <guid>https://dev.to/nilofer_tweets/llm-behavior-diff-model-update-detector-3e7b</guid>
      <description>&lt;p&gt;You swap a model. The new one scores better on your benchmarks. You deploy it. Two days later, a user reports that something that used to work reliably now behaves differently.&lt;/p&gt;

&lt;p&gt;The benchmark never caught it because benchmarks measure averages. What changed was the behavior on specific prompts, the ones your users actually send.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Behavior Diff&lt;/strong&gt; is a tool that catches this before it happens. Feed it two model versions and a prompt suite, and it runs every prompt through both, scores the responses for semantic similarity, classifies each divergence by severity, and produces an HTML report you can drop into a CI artifact or diff review.&lt;/p&gt;

&lt;p&gt;It ships as a CLI, a Python API, and an MCP server so Claude Code or any MCP-compatible agent can run a behavioral diff before a model swap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Model Updates
&lt;/h2&gt;

&lt;p&gt;Every model update is a tradeoff. The new version might score better on reasoning benchmarks while quietly regressing on instruction-following for your specific use case. Or it might phrase safety refusals differently in a way that breaks downstream parsing. Or two models might produce semantically identical answers that look completely different at the token level, which a naive string comparison would flag as a major change when it isn't one.&lt;/p&gt;

&lt;p&gt;LLM Behavior Diff addresses all three scenarios. Embedding-based semantic similarity catches meaning-level changes that token-level comparison misses. The LLM-as-judge step adds a layer of reasoning for ambiguous cases. Severity classification separates noise from real regressions.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The pipeline runs in five steps for every prompt in your suite:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load:&lt;/strong&gt; A YAML prompt suite is loaded into a &lt;code&gt;PromptSuite&lt;/code&gt; Pydantic model. Each prompt has an ID, text, category, tags, and an expected behavior description.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run:&lt;/strong&gt; Each prompt is sent through Model A and Model B via &lt;code&gt;LLMRunner&lt;/code&gt;. Three providers are supported: Ollama (&lt;code&gt;/api/generate&lt;/code&gt;), OpenRouter (chat completions), and a deterministic stub provider for offline CI runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Score:&lt;/strong&gt; Each response pair is scored with either &lt;code&gt;EmbeddingDiffer&lt;/code&gt; (cosine similarity on &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; embeddings) or &lt;code&gt;SimpleDiffer&lt;/code&gt; (Jaccard over words). Optionally, an LLM-as-judge score is combined with the similarity score; the default judge model is &lt;code&gt;google/gemini-2.0-flash-lite-001&lt;/code&gt; via OpenRouter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classify:&lt;/strong&gt; Each prompt is classified against the &lt;code&gt;--threshold&lt;/code&gt;. Changes are bucketed by severity: combined score &amp;gt;= 0.7 is minor, &amp;gt;= 0.4 is moderate, &amp;lt; 0.4 is major.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Report:&lt;/strong&gt; An HTML report is rendered and a rich summary table is printed to the terminal.&lt;/p&gt;
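
&lt;p&gt;The severity bucketing in the Classify step comes down to two comparisons. A minimal sketch of that logic (not the tool's exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def classify_severity(combined_score: float) -&gt; str:
    """Bucket a combined score using the documented thresholds:
    &gt;= 0.7 minor, &gt;= 0.4 moderate, &lt; 0.4 major."""
    if combined_score &gt;= 0.7:
        return "minor"
    if combined_score &gt;= 0.4:
        return "moderate"
    return "major"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;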

&lt;h2&gt;
  
  
  Why Embeddings Over Token Matching
&lt;/h2&gt;

&lt;p&gt;The difference matters. Here is the same two-model comparison run two ways:&lt;br&gt;
With &lt;code&gt;--use-embeddings&lt;/code&gt; (cosine on &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;Avg Similarity: 91.4%&lt;br&gt;
Changes Detected: 0 of 5&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;--no-use-embeddings&lt;/code&gt; (Jaccard fallback):&lt;/p&gt;

&lt;p&gt;Avg Similarity: 25.0%&lt;br&gt;
Changes Detected: 5 of 5&lt;/p&gt;

&lt;p&gt;Same two models, same prompts, completely opposite conclusions. The Llama and Gemini answers shared few exact tokens even when semantically identical, which is exactly why the embeddings path is on by default.&lt;/p&gt;
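
&lt;p&gt;A toy example makes the gap concrete. This is a plain word-level Jaccard, written out here for illustration rather than taken from the tool:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def jaccard(a: str, b: str) -&gt; float:
    # Word-level Jaccard: |intersection| / |union| of the two word sets
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa &amp; wb) / len(wa | wb)

# Semantically identical answers with zero shared tokens:
print(jaccard("The answer is 4.", "Two plus two equals four."))  # 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
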
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requires Python 3.11+. Embedding similarity uses &lt;code&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/code&gt;, downloaded on first use. The LLM-judge path requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;; without it, scoring falls back to embeddings-only.&lt;/p&gt;
&lt;h2&gt;
  
  
  Running a Diff
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Offline - stub provider&lt;/strong&gt;&lt;br&gt;
A stub provider returns deterministic hashed responses, so the whole pipeline runs offline without Ollama or an API key. Good for CI and testing the setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-diff run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-a&lt;/span&gt; stub-a &lt;span class="nt"&gt;--provider-a&lt;/span&gt; stub &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-b&lt;/span&gt; stub-b &lt;span class="nt"&gt;--provider-b&lt;/span&gt; stub &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompts&lt;/span&gt; prompts/default.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; output/report.html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-use-embeddings&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real output from this run (stub + Jaccard, threshold 0.5):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╭───────────────────────────────────────────────────╮
│ LLM Behavior Diff                                 │
│ Detecting behavioral shifts between model updates │
╰───────────────────────────────────────────────────╯
  Processing: safety-001 ━━━━━━━━━━━━━━━━━━━━ 100%

Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Prompts    │ 5     │
│ Changes Detected │ 3     │
│ Change Rate      │ 60.0% │
│ Avg Similarity   │ 40.0% │
└──────────────────┴───────┘
Report saved to: output/stub_jaccard.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real models - OpenRouter&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
llm-diff run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-a&lt;/span&gt; meta-llama/llama-3.2-3b-instruct &lt;span class="nt"&gt;--provider-a&lt;/span&gt; openrouter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-b&lt;/span&gt; google/gemini-2.0-flash-lite-001 &lt;span class="nt"&gt;--provider-b&lt;/span&gt; openrouter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompts&lt;/span&gt; prompts/default.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; output/or_emb.html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--use-embeddings&lt;/span&gt; &lt;span class="nt"&gt;--threshold&lt;/span&gt; 0.85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real output (embeddings only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Prompts    │ 5     │
│ Changes Detected │ 0     │
│ Change Rate      │ 0.0%  │
│ Avg Similarity   │ 91.4% │
└──────────────────┴───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding &lt;code&gt;--use-judge&lt;/code&gt; brings the average similarity to 91.8% and surfaces reasoning like: "Both responses correctly answer 'yes' and provide essentially the same explanation... Response A is slightly more verbose, but the core meaning is identical."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real models - Ollama&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-diff run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-a&lt;/span&gt; qwen3:8b &lt;span class="nt"&gt;--provider-a&lt;/span&gt; ollama &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-b&lt;/span&gt; gemma4:e4b &lt;span class="nt"&gt;--provider-b&lt;/span&gt; ollama &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompts&lt;/span&gt; prompts/default.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; output/report.html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--use-embeddings&lt;/span&gt; &lt;span class="nt"&gt;--threshold&lt;/span&gt; 0.85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;llm-diff --help

Usage: llm-diff [OPTIONS] COMMAND [ARGS]...

 LLM Behavior Diff — Model Update Detector

 --version                Show version information
 --help                   Show this message and exit.

 Commands
   run  Run a comparison between two models.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;llm-diff --version
LLM Behavior Diff version 0.1.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key options for &lt;code&gt;llm-diff run&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sqizy9sheh9lm5ss0h1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sqizy9sheh9lm5ss0h1.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Severity buckets applied when a change is detected: combined &amp;gt;= 0.7 is minor, &amp;gt;= 0.4 is moderate, &amp;lt; 0.4 is major.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Suite Format
&lt;/h2&gt;

&lt;p&gt;The prompt suite is a YAML file. &lt;code&gt;prompts/default.yaml&lt;/code&gt; ships with 5 prompts spanning reasoning, coding, factual, instruction-following, and safety. You can write your own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;suite"&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0"&lt;/span&gt;
&lt;span class="na"&gt;prompts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code-001"&lt;/span&gt;
    &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Python&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reverse_string(s)..."&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coding"&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;expected_behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Short&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;correct&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;function"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IDs must be unique. Category must be one of: &lt;code&gt;reasoning&lt;/code&gt;, &lt;code&gt;coding&lt;/code&gt;, &lt;code&gt;creativity&lt;/code&gt;, &lt;code&gt;safety&lt;/code&gt;, &lt;code&gt;instruction_following&lt;/code&gt;, &lt;code&gt;factual&lt;/code&gt;, &lt;code&gt;conversational&lt;/code&gt;.&lt;/p&gt;
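
&lt;p&gt;If you maintain suites by hand, a few lines of standalone validation catch these mistakes before a run (a sketch using PyYAML with an assumed file name; the real checks live in the &lt;code&gt;PromptSuite&lt;/code&gt; Pydantic model):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml

ALLOWED = {"reasoning", "coding", "creativity", "safety",
           "instruction_following", "factual", "conversational"}

with open("prompts/my_suite.yaml") as f:   # hypothetical suite path
    suite = yaml.safe_load(f)

ids = [p["id"] for p in suite["prompts"]]
assert len(ids) == len(set(ids)), "prompt IDs must be unique"
bad = [p["id"] for p in suite["prompts"] if p["category"] not in ALLOWED]
assert not bad, f"invalid categories on: {bad}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;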

&lt;h2&gt;
  
  
  Python API
&lt;/h2&gt;

&lt;p&gt;The full pipeline is available as a library. A synchronous one-shot call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.runner&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_prompt_sync&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ModelConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProviderType&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_prompt_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;ModelConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub-m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ProviderType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STUB&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; Model stub-m says: 921fac0c4c True
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarity scoring directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.differ&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleDiffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EmbeddingDiffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_differ&lt;/span&gt;

&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDiffer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compute_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the cat sat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the cat ran&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# 0.5
&lt;/span&gt;
&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmbeddingDiffer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compute_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The answer is 4.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Two plus two equals four.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; ~0.59
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;create_differ(use_embeddings=False)&lt;/code&gt; returns a &lt;code&gt;SimpleDiffer&lt;/code&gt; (Jaccard); &lt;code&gt;use_embeddings=True&lt;/code&gt; returns an &lt;code&gt;EmbeddingDiffer&lt;/code&gt; if sentence-transformers is importable, otherwise it falls back to &lt;code&gt;SimpleDiffer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Generating a report from a &lt;code&gt;ComparisonRun&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.report&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReportGenerator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;

&lt;span class="nc"&gt;ReportGenerator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;save_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;out.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ReportGenerator&lt;/code&gt; looks for a Jinja template in the CWD, the package directory, and a legacy path, then falls back to a built-in template so reports always render.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP Server
&lt;/h2&gt;

&lt;p&gt;The tool also runs as an MCP server over stdio transport, exposing three tools so Claude Code or any MCP-compatible agent can trigger a behavioral diff during a session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-diff-mcp
&lt;span class="c"&gt;# or: python -m llm_behavior_diff.mcp_server&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three exposed tools:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;compare_models&lt;/strong&gt; - runs a full prompt suite through two models and returns per-prompt similarity, severity, and response text.&lt;br&gt;
&lt;strong&gt;analyze_drift&lt;/strong&gt; - scores drift between two candidate responses for a single prompt.&lt;br&gt;
&lt;strong&gt;generate_report&lt;/strong&gt; - renders an HTML summary from a JSON list of results.&lt;/p&gt;

&lt;p&gt;Claude Code config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llm-behavior-diff"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm-diff-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Smoke test - all three tools, offline, via Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.mcp_server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;compare_models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CompareModelsRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;analyze_drift&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AnalyzeDriftRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;generate_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GenerateReportRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;compare_models&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompareModelsRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub-a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub-b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompts_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompts/default.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;changes_detected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_similarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# real: 5 4 0.3446
&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;analyze_drift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AnalyzeDriftRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompt_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The answer is 4.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2+2 equals 4.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_similarity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# real: 0.5572 moderate
&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generate_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;GenerateReportRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;results_json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;behavioral_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;behavioral_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;major&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output/mcp_report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MCP Smoke&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verified over stdio JSON-RPC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;llm-diff-mcp   #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;speaks MCP 2024-11-05 on stdio
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;tools/list -&amp;gt; compare_models, analyze_drift, generate_report
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;tools/call analyze_drift &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"prompt_text"&lt;/span&gt;:&lt;span class="s2"&gt;"..."&lt;/span&gt;,&lt;span class="s2"&gt;"response_a"&lt;/span&gt;:&lt;span class="s2"&gt;"Paris"&lt;/span&gt;,
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s2"&gt;"response_b"&lt;/span&gt;:&lt;span class="s2"&gt;"The capital is Paris."&lt;/span&gt;,&lt;span class="s2"&gt;"use_embeddings"&lt;/span&gt;:true&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&amp;gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"embedding_similarity"&lt;/span&gt;:0.7761,&lt;span class="s2"&gt;"severity"&lt;/span&gt;:&lt;span class="s2"&gt;"minor"&lt;/span&gt;, ...&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-judge requires OpenRouter&lt;/strong&gt; - Without &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;, judging is skipped and the combined score equals the embedding or Jaccard similarity alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First embedding run is slow&lt;/strong&gt; - &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; is downloaded from Hugging Face on first use. Subsequent runs use the local cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama is not spawned automatically&lt;/strong&gt; - The client talks to &lt;code&gt;http://localhost:11434&lt;/code&gt; by default (the &lt;code&gt;OLLAMA_HOST&lt;/code&gt; env var overrides this). Ollama must already be running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stub provider is for CI and demos only&lt;/strong&gt; - It produces deterministic fake text keyed on model name, temperature, and prompt. Not suitable for real behavioral conclusions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How You Can Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gate model upgrades in CI before they ship:&lt;/strong&gt; Add an &lt;code&gt;llm-diff run&lt;/code&gt; step to your deployment pipeline. Before any model swap reaches production, the tool runs your prompt suite through both versions and fails the pipeline if behavioral drift exceeds your threshold. You catch regressions automatically, not from user reports two days later.&lt;/p&gt;
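
&lt;p&gt;A minimal gate script might look like the following. One loud assumption: it treats a non-zero exit code from &lt;code&gt;llm-diff run&lt;/code&gt; as "drift detected", which you should verify against your installed version before wiring this into a pipeline; the model names are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import sys

# Hypothetical CI gate around the CLI; the flags match the examples above.
result = subprocess.run([
    "llm-diff", "run",
    "--model-a", "old-model", "--provider-a", "openrouter",
    "--model-b", "new-model", "--provider-b", "openrouter",
    "--prompts", "prompts/default.yaml",
    "--output", "output/ci_report.html",
    "--use-embeddings", "--threshold", "0.85",
])
sys.exit(result.returncode)  # assumption: non-zero means drift exceeded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;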

&lt;p&gt;&lt;strong&gt;Use it during prompt engineering to measure real impact:&lt;/strong&gt; When you change a system prompt or few-shot examples, run a diff between the old and new configuration. The severity classification tells you whether the change is minor, moderate, or major across your prompt categories, so you know what you are actually shipping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the MCP server to make your agent self-aware of drift:&lt;/strong&gt; With the MCP server running, Claude Code or any MCP-compatible agent can call &lt;code&gt;compare_models&lt;/code&gt; or &lt;code&gt;analyze_drift&lt;/code&gt; directly during a session. An agent working on a model integration can check for behavioral drift without leaving the coding environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional providers:&lt;/strong&gt; The tool currently supports Ollama, OpenRouter, and a stub provider, all sharing a common &lt;code&gt;LLMRunner&lt;/code&gt; interface. Adding a new provider for Anthropic, Gemini, or any OpenAI-compatible endpoint follows the same pattern without touching the differ, classifier, or report logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Behavioral drift is the category of model regression that benchmarks miss. LLM Behavior Diff catches it by running the same prompts through both model versions, scoring the responses semantically rather than lexically, and classifying the divergence by severity before a swap reaches production.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/-LLM-Behavior-Diff-Model-Update-Detector" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/-LLM-Behavior-Diff-Model-Update-Detector&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also build with &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;NEO is a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>AI Slop Cleaner: Automating Your Codebase Hygiene</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Thu, 30 Apr 2026 09:56:31 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/ai-slop-cleaner-automating-your-codebase-hygiene-4hi</link>
      <guid>https://dev.to/nilofer_tweets/ai-slop-cleaner-automating-your-codebase-hygiene-4hi</guid>
      <description>&lt;p&gt;Every codebase accumulates clutter over time. An import left behind after a refactor. A helper function that nothing calls anymore. A method that grew too complex to reason about. None of it breaks anything immediately, but it slows down every developer who reads through it, and it silently raises the cost of every future change.&lt;/p&gt;

&lt;p&gt;The usual fix is a manual review pass. Someone spends an hour looking for unused imports, searching for dead functions, flagging complexity hotspots. It is tedious, inconsistent, and happens far less often than it should.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slop Cleaner&lt;/strong&gt; is a CLI tool that does this automatically. It detects unused imports, dead functions and classes, and over-complex code using tree-sitter AST analysis (not regex) so it never removes an import that appears in a docstring or string annotation. Every patch is atomic: backed up before writing, and rolled back automatically if your test suite fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ndoo779n1lyz832dx0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ndoo779n1lyz832dx0b.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What slop-cleaner Detects and Fixes
&lt;/h2&gt;

&lt;p&gt;slop-cleaner targets three specific categories of clutter that accumulate in every codebase:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unused imports:&lt;/strong&gt; Imports left behind after refactors. These are detected at HIGH confidence and removed automatically. The tool handles single-line imports, selective removal from &lt;code&gt;from X import A, B&lt;/code&gt; blocks where only one name is unused, and multi-line &lt;code&gt;from X import (...)&lt;/code&gt; blocks where individual lines are surgically removed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dead functions and classes:&lt;/strong&gt; Defined but never called anywhere in the codebase. These are detected at MEDIUM confidence and flagged in the report for human review. The tool builds a full call graph across all symbols before making this determination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over-complex functions:&lt;/strong&gt; Functions with cyclomatic complexity above the threshold (default 10). These are flagged at MEDIUM confidence for manual refactoring. The tool never auto-removes a function; complexity is a signal that something needs attention, not a safe automatic fix.&lt;/p&gt;

&lt;p&gt;HIGH confidence issues are fixed automatically. MEDIUM confidence issues are always left to human judgment.&lt;/p&gt;
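
&lt;p&gt;The selective removal is easiest to see as a before/after. An illustrative sketch, not actual tool output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Before cleaning, ceil and floor are imported but never referenced:
#   from math import sqrt, ceil, floor
# After cleaning, only the used name survives:
from math import sqrt

print(sqrt(2.0))  # sqrt is a real identifier usage, so it stays
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;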

&lt;h2&gt;
  
  
  The 5-Phase Pipeline
&lt;/h2&gt;

&lt;p&gt;Everything slop-cleaner does runs as a five-phase pipeline. Each phase feeds into the next, and every patch is atomic, backed up before writing and rolled back automatically if your tests fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit:&lt;/strong&gt; Parses every &lt;code&gt;.py&lt;/code&gt; and &lt;code&gt;.ts&lt;/code&gt;/&lt;code&gt;.tsx&lt;/code&gt; file with tree-sitter. Collects unused imports at HIGH confidence and high-complexity functions at MEDIUM confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze:&lt;/strong&gt; Builds a call graph across all symbols in the project. Identifies dead code: symbols defined but never called.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean:&lt;/strong&gt; Applies HIGH-confidence patches atomically. Handles single-line and multi-line &lt;code&gt;from X import (...)&lt;/code&gt; blocks. Backs up each file before writing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify:&lt;/strong&gt; Runs your test suite with pytest. On any failure, rolls back every patched file to its backup automatically. Returns exit code 1 so CI catches it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document:&lt;/strong&gt; Generates three Markdown reports: &lt;code&gt;ARCHITECTURE.md&lt;/code&gt;, &lt;code&gt;FUNCTION_MAP.md&lt;/code&gt;, and &lt;code&gt;SLOP_REPORT.md&lt;/code&gt;, covered in detail below.&lt;/p&gt;
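
&lt;p&gt;The backup-and-rollback contract of the Clean and Verify phases can be sketched in a few lines (a simplified illustration of the idea, not the tool's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import shutil
import subprocess
from pathlib import Path

def apply_with_rollback(patches: dict) -&gt; bool:
    """Write patched contents, run pytest, restore backups on failure.
    `patches` maps a file path to its new text."""
    backups = {}
    for path, new_text in patches.items():
        backup = Path(str(path) + ".bak")
        shutil.copy2(path, backup)            # back up before writing
        backups[path] = backup
        Path(path).write_text(new_text)
    if subprocess.run(["pytest", "-q"]).returncode != 0:
        for path, backup in backups.items():  # tests failed: roll back
            shutil.move(str(backup), str(path))
        return False
    for backup in backups.values():           # tests passed: drop backups
        backup.unlink()
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;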

&lt;h2&gt;
  
  
  What Gets Fixed Automatically vs. Flagged
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1g1a3936hz63kzupcf8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1g1a3936hz63kzupcf8.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The distinction matters. HIGH confidence patches are cases where the tool is certain removal is safe. MEDIUM confidence cases (dead code and complexity) can involve dynamic dispatch, reflection, or other patterns that make automatic removal risky. The tool flags these and leaves the decision to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge Cases Handled
&lt;/h2&gt;

&lt;p&gt;One of the hardest parts of detecting unused imports is knowing when a name that looks unused is actually needed. slop-cleaner handles these correctly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyztlbmueky8h7euxbuq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyztlbmueky8h7euxbuq0.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are exactly the cases where regex-based tools get it wrong. Because slop-cleaner parses an actual AST, it understands the difference between a name appearing inside a string and a name being used as an identifier.&lt;/p&gt;
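
&lt;p&gt;The string type annotation is the classic trap. An illustrative snippet (not from the tool's test corpus):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from decimal import Decimal   # never used as a bare identifier below

def parse_price(raw: str) -&gt; "Decimal":
    """The word Decimal in this docstring is prose, not a usage;
    the quoted return annotation above, however, is a real one."""
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A regex scan cannot tell those two strings apart; a parser that knows the syntactic position of each string can.&lt;/p&gt;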

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/your-org/slop-cleaner
&lt;span class="nb"&gt;cd &lt;/span&gt;slop-cleaner

&lt;span class="c"&gt;# Create and activate a virtual environment&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate      &lt;span class="c"&gt;# macOS / Linux&lt;/span&gt;
&lt;span class="c"&gt;# .venv\Scripts\activate       # Windows&lt;/span&gt;

&lt;span class="c"&gt;# Install the tool and its dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This registers two CLI commands: &lt;code&gt;slop-audit&lt;/code&gt; and &lt;code&gt;slop-clean&lt;/code&gt;.&lt;br&gt;
Dependencies installed automatically via &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tree-sitter&amp;gt;=0.25
tree-sitter-python&amp;gt;=0.23
tree-sitter-typescript&amp;gt;=0.23
rich&amp;gt;=13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note for modern Linux (Ubuntu 23.04+, Debian 12+)&lt;/strong&gt;: system Python blocks global pip install. Always use a virtual environment as shown above, or use &lt;code&gt;pipx install .&lt;/code&gt; to install the CLI tools globally without a venv.&lt;/p&gt;

&lt;p&gt;To run the test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[test]"&lt;/span&gt;   &lt;span class="c"&gt;# adds pytest + pytest-cov&lt;/span&gt;
pytest                     &lt;span class="c"&gt;# runs tests/test_parsers.py (22 tests)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Audit a project — report issues, exit 1 if any found (CI-friendly)&lt;/span&gt;
slop-audit path/to/project/

&lt;span class="c"&gt;# Audit a single file&lt;/span&gt;
slop-audit src/services/user_service.py

&lt;span class="c"&gt;# Full clean — audit, fix, verify tests, generate docs&lt;/span&gt;
slop-clean path/to/project/

&lt;span class="c"&gt;# Dry run — show what would change without touching files&lt;/span&gt;
slop-clean path/to/project/ &lt;span class="nt"&gt;--dry-run&lt;/span&gt;

&lt;span class="c"&gt;# Write audit JSON for tooling integration&lt;/span&gt;
slop-audit path/to/project/ &lt;span class="nt"&gt;--output&lt;/span&gt; report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Commands
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;slop-audit&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;slop-audit&lt;/code&gt; scans a file or directory and prints a table of all issues found without touching anything. It exits with code 1 if issues are found, making it a clean CI gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;slop-audit &amp;lt;target&amp;gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--output&lt;/span&gt; JSON] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--threshold&lt;/span&gt; N] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--verbose&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62hiadeh2o86hmc6y3u2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62hiadeh2o86hmc6y3u2.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Exit codes: &lt;code&gt;0&lt;/code&gt; = clean, &lt;code&gt;1&lt;/code&gt; = issues found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;slop-clean&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;slop-clean&lt;/code&gt; runs the full 5-phase pipeline. Always run &lt;code&gt;--dry-run&lt;/code&gt; first on an unfamiliar project; it shows exactly what patches would be applied without touching any file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;slop-clean &amp;lt;target&amp;gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--output&lt;/span&gt; DIR] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--threshold&lt;/span&gt; N] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--verbose&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8fuen6i2plts3rceqho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8fuen6i2plts3rceqho.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Exit codes: &lt;code&gt;0&lt;/code&gt; = success, &lt;code&gt;1&lt;/code&gt; = rollback was triggered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generated Output
&lt;/h2&gt;

&lt;p&gt;After &lt;code&gt;slop-clean&lt;/code&gt; runs, the &lt;code&gt;--output&lt;/code&gt; directory contains three reports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slop-output/
├── ARCHITECTURE.md   — file tree + Mermaid dependency graph
├── FUNCTION_MAP.md   — every symbol with start line and complexity score
└── SLOP_REPORT.md    — issue summary, patches applied, dead-code candidates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ARCHITECTURE.md&lt;/code&gt; gives a structural overview of the codebase with a visual Mermaid call graph, useful for understanding how the project fits together at a glance.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;FUNCTION_MAP.md&lt;/code&gt; is a complete index of every function and class with its start line and complexity score. For large codebases, this is the fastest way to see where complexity is concentrated.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SLOP_REPORT.md&lt;/code&gt; is the actionable output: what was fixed automatically, what was flagged for manual review, and which dead-code candidates need a human decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying It on the Example Projects
&lt;/h2&gt;

&lt;p&gt;Two sample projects ship in &lt;code&gt;examples/&lt;/code&gt; so you can try the tool immediately after installing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Audit only — no test verification needed&lt;/span&gt;
slop-audit examples/todo_app/ &lt;span class="nt"&gt;--verbose&lt;/span&gt;
slop-audit examples/event_pipeline/ &lt;span class="nt"&gt;--verbose&lt;/span&gt;

&lt;span class="c"&gt;# Full clean — dry run first, then apply&lt;/span&gt;
slop-clean examples/todo_app/ &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
slop-clean examples/todo_app/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the example test suites directly, change into the project directory first so their imports resolve correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;examples/todo_app &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;examples/event_pipeline &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;todo_app&lt;/code&gt; is a sample Python app with intentional slop, a good first run to see exactly what the tool catches. &lt;code&gt;event_pipeline&lt;/code&gt; covers tricky import patterns like aliases, multi-line imports, and string annotations, designed to show the AST analysis handling edge cases correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  CI Integration
&lt;/h2&gt;

&lt;p&gt;slop-audit drops directly into CI as a quality gate. It exits 1 if any issues are found, failing the workflow step automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/quality.yml&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;slop-check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install -e .&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slop-audit src/ --threshold &lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches unused imports and complexity regressions on every pull request before they get merged into the main branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slop-cleaner/
├── cli/
│   └── main.py               — entry points + rich-formatted output
├── engines/
│   ├── auditor.py            — Phase 1 · AST-based issue detection
│   ├── analyzer.py           — Phase 2 · call-graph + dead-code finder
│   ├── cleaner.py            — Phase 3 · atomic patch application
│   ├── verifier.py           — Phase 4 · pytest runner + rollback
│   └── documenter.py         — Phase 5 · Markdown report generator
├── parsers/
│   ├── python_parser.py      — tree-sitter Python wrapper
│   └── typescript_parser.py  — tree-sitter TypeScript/TSX wrapper
├── examples/
│   ├── todo_app/             — sample Python app with intentional slop
│   └── event_pipeline/       — sample project with tricky import patterns
├── assets/
│   ├── pipeline.svg          — 5-phase flow diagram
│   ├── features.svg          — feature overview infographic
│   └── before-after.svg      — code transformation visual
└── tests/
    └── test_parsers.py       — 22 tests covering the parsers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The structure maps cleanly to the five phases: one engine per phase, two parsers for Python and TypeScript/TSX, and a single CLI entry point that ties them together.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This tool was designed, built, debugged, and refined entirely using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes, runs, and iterates on real code without hand-holding.&lt;/p&gt;

&lt;p&gt;Every engine, every edge-case fix, and every SVG was produced by NEO in a single session. The five-phase pipeline, the tree-sitter AST parsers for Python and TypeScript, the atomic patch application with backup and rollback, the call graph builder, the pytest runner with automatic rollback on failure, and the three Markdown report generators: all of it was built end-to-end from a high-level problem description.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a CI quality gate on every pull request:&lt;/strong&gt;&lt;br&gt;
Drop &lt;code&gt;slop-audit&lt;/code&gt; into your CI pipeline. Every PR gets checked for unused imports and complexity regressions before it touches main. Because the tool exits with code 1 on issues, it fails the build automatically; no configuration is needed beyond one workflow step. It is already wired for this out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it before a major refactor:&lt;/strong&gt;&lt;br&gt;
Before starting a large refactor, run &lt;code&gt;slop-clean&lt;/code&gt; on the codebase. It removes accumulated import clutter, flags dead code that no longer needs to be worked around, and generates a complexity map. You go into the refactor with a cleaner starting point and a clear picture of where complexity lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to onboard onto an unfamiliar codebase:&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;slop-audit&lt;/code&gt; on a project you have just inherited and read &lt;code&gt;SLOP_REPORT.md&lt;/code&gt;. You get a structured list of unused imports, dead functions, and complexity hotspots: a map of technical debt you can act on rather than discovering piece by piece while working in the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to generate a documentation baseline for a legacy project:&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;slop-clean&lt;/code&gt; and read &lt;code&gt;ARCHITECTURE.md&lt;/code&gt; and &lt;code&gt;FUNCTION_MAP.md&lt;/code&gt;. You get a file tree, a visual call graph, and a complete index of every symbol with its complexity score. For a project with no existing documentation, that is a meaningful starting point.&lt;/p&gt;

&lt;p&gt;The tool is also designed to be extended, and NEO can take any of these directions further without starting from scratch:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JavaScript/JSX parser:&lt;/strong&gt; Two parsers already exist (&lt;code&gt;python_parser.py&lt;/code&gt;, &lt;code&gt;typescript_parser.py&lt;/code&gt;) following the same tree-sitter wrapper pattern. A third for JavaScript/JSX follows the same interface and plugs into all five engines immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional complexity metrics:&lt;/strong&gt; &lt;code&gt;auditor.py&lt;/code&gt; already tracks cyclomatic complexity. Additional metrics like function length follow the same detection pattern and surface automatically in &lt;code&gt;FUNCTION_MAP.md&lt;/code&gt; and the audit output once added; a sketch follows below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-fix for dead code:&lt;/strong&gt; dead code is currently flagged at MEDIUM confidence and left to human judgment. For clearly dead private functions confirmed unreachable by the call graph, automatic removal is a natural next step built directly on the existing &lt;code&gt;analyzer.py&lt;/code&gt; and &lt;code&gt;cleaner.py&lt;/code&gt; infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-commit hook:&lt;/strong&gt; &lt;code&gt;slop-audit&lt;/code&gt; already exits 1 on issues. A small wrapper that hooks into the existing CLI entry point brings slop detection into the local development loop before anything reaches CI.&lt;/p&gt;
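
&lt;p&gt;Two of these extensions are small enough to sketch. For the complexity metrics, a hypothetical function-length detector built on Python's own &lt;code&gt;ast&lt;/code&gt; module could look like this; the names are illustrative, not &lt;code&gt;auditor.py&lt;/code&gt;'s actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast

def function_lengths(source):
    """Map each function name to its line count (an illustrative metric)."""
    tree = ast.parse(source)
    return {
        node.name: node.end_lineno - node.lineno + 1
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

print(function_lengths("def f():\n    return 1\n"))  # {'f': 2}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And for the pre-commit hook, a minimal wrapper that assumes only the documented CLI behaviour (&lt;code&gt;slop-audit&lt;/code&gt; exiting 1 on issues); save it as &lt;code&gt;.git/hooks/pre-commit&lt;/code&gt; and mark it executable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;#!/usr/bin/env python3
# Block the commit whenever slop-audit reports issues (non-zero exit).
import subprocess
import sys

result = subprocess.run(["slop-audit", "src/", "--threshold", "12"])
sys.exit(result.returncode)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
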

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Code clutter is not a cosmetic problem. Unused imports add noise to every code review. Dead functions add surface area that has to be mentally accounted for. High-complexity functions resist change and hide bugs. slop-cleaner addresses all three automatically, with AST-level precision that regex-based tools cannot match, and with a test-verification step that means it never leaves your codebase in a worse state than it found it.&lt;br&gt;
The code is at &lt;a href="https://github.com/dakshjain-1616/Ai_Slop_Cleaner" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Ai_Slop_Cleaner&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Agent Failure Classifier: Post-Hoc Root Cause Analysis for Failed LLM Agent Runs</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Wed, 29 Apr 2026 07:56:24 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/agent-failure-classifier-post-hoc-root-cause-analysis-for-failed-llm-agent-runs-1i79</link>
      <guid>https://dev.to/nilofer_tweets/agent-failure-classifier-post-hoc-root-cause-analysis-for-failed-llm-agent-runs-1i79</guid>
      <description>&lt;p&gt;When an LLM agent fails, the trace is right there, the user turns, the tool calls, the responses, the final result. But knowing what happened and knowing why it failed are two different things. Most teams read traces manually, form a guess, and move on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Failure Classifier&lt;/strong&gt; is a CLI tool and Python library for post-hoc root cause analysis of failed or low-quality LLM agent runs. Feed it any agent trace and it classifies the failure into one of eight named failure modes, identifies the first turn where things went wrong, and produces a structured report with actionable fixes.&lt;/p&gt;

&lt;p&gt;The classifier combines eight fast rule-based detectors with an optional LLM-as-judge pass via OpenRouter. The rule-based layer is free, deterministic, and requires no network access. The LLM pass breaks ties and classifies traces the rules cannot resolve alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Eight Failure Modes
&lt;/h2&gt;

&lt;p&gt;The classifier recognises exactly eight failure modes, each with a precise definition:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HALLUCINATION:&lt;/strong&gt; Agent stated facts or called tools that do not exist&lt;br&gt;
&lt;strong&gt;TOOL_MISUSE:&lt;/strong&gt; Agent called a real tool with wrong parameters or at the wrong time&lt;br&gt;
&lt;strong&gt;CONTEXT_LOSS:&lt;/strong&gt; Agent forgot earlier decisions or repeated already-completed steps&lt;br&gt;
&lt;strong&gt;CIRCULAR_REASONING:&lt;/strong&gt; Agent looped between the same 2-3 steps without making progress&lt;br&gt;
&lt;strong&gt;GOAL_DRIFT:&lt;/strong&gt; Agent started pursuing a sub-goal and forgot the original task&lt;br&gt;
&lt;strong&gt;OVER_REFUSAL:&lt;/strong&gt; Agent refused an action it was capable of and should have taken&lt;br&gt;
&lt;strong&gt;SCHEMA_ERROR:&lt;/strong&gt; Agent generated malformed JSON for a tool call or structured output&lt;br&gt;
&lt;strong&gt;TIMEOUT_CASCADE:&lt;/strong&gt; One slow tool call caused the agent to rush or skip subsequent steps&lt;/p&gt;

&lt;p&gt;These are not fuzzy categories. Each one maps to a specific detector with specific signals. A hallucination is flagged when the agent asserts a factual claim without invoking any retrieval tool. A timeout cascade is flagged when a tool call exceeds a latency threshold and the subsequent agent turn is unusually short relative to the tool output.&lt;/p&gt;
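
&lt;p&gt;As a minimal sketch of that first signal, assuming the trace JSON shape shown later in this post (the real detector is more nuanced about what counts as a factual claim):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def looks_like_hallucination(trace):
    """Flag traces where the agent answered without any tool invocation."""
    tool_turns = [t for t in trace["turns"] if t.get("role") == "tool"]
    agent_turns = [t for t in trace["turns"] if t.get("role") == "agent"]
    # Core signal: an answer was produced, but no retrieval tool ever ran.
    return bool(agent_turns) and not tool_turns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
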
&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The classification pipeline runs in two layers.&lt;/p&gt;

&lt;p&gt;The rule-based layer runs eight deterministic detectors over the trace. Each detector looks for specific structural signals: repeated tool calls with identical inputs, cycles in agent turn content, latency spikes followed by short responses, malformed JSON in tool call outputs. This layer runs offline, requires no API key, and classifies all eight failure modes.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;LLM-as-judge&lt;/strong&gt; layer is optional. When enabled, it receives traces the rule-based layer couldn't resolve with high confidence and breaks ties. The judge runs via OpenRouter and can be pointed at any OpenRouter model or a local OpenAI-compatible server (Ollama, vLLM, llama.cpp).&lt;br&gt;
Every classification produces a structured report with the classified failure mode, a confidence score, the first turn where the failure was detected, a root cause summary, and a list of actionable fixes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/agent-failure-classifier
&lt;span class="nb"&gt;cd &lt;/span&gt;agent-failure-classifier
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.8+. The only dependencies are &lt;code&gt;pydantic&lt;/code&gt;, &lt;code&gt;rich&lt;/code&gt;, &lt;code&gt;click&lt;/code&gt;, and &lt;code&gt;requests&lt;/code&gt;. The rule-based layer runs with no additional setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Judge Setup (Optional)&lt;/strong&gt;&lt;br&gt;
To enable the LLM-judge pass, copy &lt;code&gt;.env.example&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt; and set your OpenRouter key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# edit .env and set OPENROUTER_API_KEY=sk-or-...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv2sdra53suzkiojox5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv2sdra53suzkiojox5t.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without any key, pass &lt;code&gt;--no-llm&lt;/code&gt; to every classify or batch call. The rule-based layer alone classifies all eight failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI
&lt;/h2&gt;

&lt;p&gt;The CLI is exposed as both a console script (&lt;code&gt;agent-failure-classifier&lt;/code&gt;) and an importable module (&lt;code&gt;python -m agent_failure_classifier.cli&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classify a single trace&lt;/strong&gt;&lt;br&gt;
The core command takes a trace JSON file and returns a structured report. &lt;code&gt;--no-llm&lt;/code&gt; keeps it offline, rule-based only, no API call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier classify &lt;span class="nt"&gt;--trace&lt;/span&gt; traces/hallucination_example.json &lt;span class="nt"&gt;--no-llm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key flags:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdabmg2rqnqa9i9tyks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdabmg2rqnqa9i9tyks.png" alt=" " width="746" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate a trace&lt;/strong&gt;&lt;br&gt;
Before classifying, &lt;code&gt;validate&lt;/code&gt; parses the trace and prints its structure: trace ID, goal, turn count, and a preview of each turn. Useful for confirming the trace loaded correctly before running classification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier validate &lt;span class="nt"&gt;--trace&lt;/span&gt; traces/hallucination_example.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Batch classification&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;batch&lt;/code&gt; runs classification over every &lt;code&gt;*.json&lt;/code&gt; file in a directory and produces a failure-mode distribution table plus a per-trace summary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier batch &lt;span class="nt"&gt;--traces-dir&lt;/span&gt; ./traces/ &lt;span class="nt"&gt;--no-llm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Worked Examples
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Example 1 - Hallucination&lt;/strong&gt;&lt;br&gt;
The trace has a user asking for WWII death statistics. The agent responds directly with a factual claim, no tool call, no retrieval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hallucination-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"original_goal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Get population statistics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"final_result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"70 million people died in WWII."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"is_successful"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"turns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"turn_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"How many people died in WWII?"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"turn_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"70 million people died in WWII."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Classification: &lt;code&gt;HALLUCINATION&lt;/code&gt;, confidence 75%, first failure at turn 1. The detector flags that the agent asserted a factual claim without invoking any retrieval tool. Recommended fixes include adding a fact-checking step, requiring tool verification for factual claims, and implementing retrieval-augmented generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 2 - Circular Reasoning&lt;/strong&gt;&lt;br&gt;
Four turns alternating between &lt;code&gt;"Let me analyze this step by step."&lt;/code&gt; and &lt;code&gt;"I need more information."&lt;/code&gt; The agent makes no progress across the entire trace.&lt;br&gt;
Classification: &lt;code&gt;CIRCULAR_REASONING&lt;/code&gt;, confidence 80%. The rule-based detector identifies a 2-step cycle repeating across agent turns and recommends a maximum-iteration limit plus state-change detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 3 - Timeout Cascade&lt;/strong&gt;&lt;br&gt;
A &lt;code&gt;slow_api&lt;/code&gt; tool call with &lt;code&gt;latency_ms: 6000&lt;/code&gt; followed by a one-word agent response &lt;code&gt;"OK"&lt;/code&gt;.&lt;br&gt;
Classification: &lt;code&gt;TIMEOUT_CASCADE&lt;/code&gt;, confidence 70%. The detector flags the latency breach and notes that the subsequent agent turn is a one-word response, less than half the length of the tool output, indicating the agent rushed through the remaining steps.&lt;/p&gt;
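
&lt;p&gt;The timeout-cascade signal can be sketched the same way; the 5000 ms threshold and field names here are illustrative, drawn from the trace format above rather than the classifier's actual internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def looks_like_timeout_cascade(turns, threshold_ms=5000):
    """Flag a slow tool call followed by an unusually short agent turn."""
    for prev, nxt in zip(turns, turns[1:]):
        slow = prev.get("role") == "tool" and prev.get("latency_ms", 0) &amp;gt; threshold_ms
        rushed = (nxt.get("role") == "agent"
                  and len(nxt.get("content", "")) &amp;lt; len(prev.get("tool_output", "")) / 2)
        if slow and rushed:
            return True
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
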
&lt;h2&gt;
  
  
  Python API
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Classify a trace programmatically&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.classifier&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FailureClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;

&lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traces/hallucination_example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FailureClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classified_failure_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root_cause_summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Record a trace live with TraceRecorder&lt;/strong&gt;&lt;br&gt;
Rather than constructing trace JSON by hand, &lt;code&gt;TraceRecorder&lt;/code&gt; is a context manager that captures an agent run as it executes and writes a trace file to disk on exit. The output is immediately compatible with the CLI and with &lt;code&gt;FailureClassifier&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.recorder&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceRecorder&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TraceRecorder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find Italian restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./traces&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find Italian restaurants near me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Searching...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;italian restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;tool_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Luigi Bistro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pasta Palace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I found Luigi Bistro and Pasta Palace.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_final_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found Luigi Bistro and Pasta Palace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_successful&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On exit the trace is saved to &lt;code&gt;./traces/trace_&amp;lt;id&amp;gt;_&amp;lt;timestamp&amp;gt;.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parse traces from other frameworks&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;AutoParser&lt;/code&gt; auto-detects and normalises three input formats into the canonical &lt;code&gt;AgentTrace&lt;/code&gt; model. No manual conversion needed regardless of where the trace came from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.formats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoParser&lt;/span&gt;

&lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AutoParser&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;parse_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/trace.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three supported formats are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native / generic:&lt;/strong&gt; a dict with &lt;code&gt;trace_id&lt;/code&gt;, &lt;code&gt;original_goal&lt;/code&gt;, &lt;code&gt;is_successful&lt;/code&gt;, and a turns list. This is the format emitted by &lt;code&gt;TraceRecorder&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith run export:&lt;/strong&gt; a dict with &lt;code&gt;run_type&lt;/code&gt;, &lt;code&gt;inputs&lt;/code&gt;, &lt;code&gt;outputs&lt;/code&gt;, and optional &lt;code&gt;child_runs&lt;/code&gt;. Tool child runs become TOOL turns; chain and LLM child runs become AGENT turns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph state dict:&lt;/strong&gt; a dict with &lt;code&gt;thread_id&lt;/code&gt; and a &lt;code&gt;state.messages&lt;/code&gt; list whose entries use type values &lt;code&gt;human&lt;/code&gt;, &lt;code&gt;ai&lt;/code&gt;, and &lt;code&gt;tool&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A minimal list of dicts (&lt;code&gt;[{"role": "...", "content": "..."}, ...]&lt;/code&gt;) is also accepted by the generic parser.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

&lt;p&gt;The problem was defined at a high level: a tool that takes any agent trace, runs deterministic detectors over it, and classifies the failure into a named category with a structured report and actionable fixes. NEO generated the full implementation: the eight rule-based detectors, the &lt;code&gt;FailureClassifier&lt;/code&gt; orchestration layer, the optional LLM-as-judge pass via OpenRouter, the &lt;code&gt;TraceRecorder&lt;/code&gt; context manager, the &lt;code&gt;AutoParser&lt;/code&gt; with support for native, LangSmith, and LangGraph formats, and the Click-based CLI with classify, validate, and batch commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Build Further With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a CI/CD quality gate for your agent.&lt;/strong&gt;&lt;br&gt;
If you're shipping an LLM agent, you can integrate the classifier directly into your deployment pipeline. Record traces from your test suite with &lt;code&gt;TraceRecorder&lt;/code&gt;, run &lt;code&gt;batch&lt;/code&gt; classification on every pull request, and fail the build if a new failure mode appears or if the rate of a known one spikes. You get a systematic regression check on agent behaviour, not just on code.&lt;/p&gt;
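
&lt;p&gt;A rough sketch of that gate, using the &lt;code&gt;FailureClassifier&lt;/code&gt; and &lt;code&gt;AgentTrace&lt;/code&gt; classes shown above; the gating policy and the assumption that every trace in &lt;code&gt;./traces&lt;/code&gt; is a failing run are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import sys
from pathlib import Path

from agent_failure_classifier.classifier import FailureClassifier
from agent_failure_classifier.models import AgentTrace

ALLOWED = {"OVER_REFUSAL"}  # failure modes the team has chosen to tolerate
clf = FailureClassifier(use_llm=False)

unexpected = set()
for path in Path("./traces").glob("*.json"):
    report = clf.classify(AgentTrace(**json.loads(path.read_text())))
    mode = str(report.classified_failure_mode)
    if mode not in ALLOWED:
        unexpected.add(mode)

if unexpected:
    print("unexpected failure modes:", sorted(unexpected))
    sys.exit(1)  # fail the build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
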

&lt;p&gt;&lt;strong&gt;Use it to understand where your agent breaks most.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;batch&lt;/code&gt; classification across a directory of historical traces and look at the failure mode distribution. If CONTEXT_LOSS shows up in 40% of your traces, that's a signal about your agent's memory design, not a one-off bug. This turns debugging from reactive to diagnostic: you're looking at patterns across runs, not reading individual traces one by one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it as a live monitoring layer in a multi-agent system.&lt;/strong&gt;&lt;br&gt;
The classifier runs as an A2A agent, which means it can sit as a node in a multi-agent pipeline. Any agent in the system can send its trace to the classifier after each run and get a structured failure report back. An orchestrator can use that signal to decide whether to retry, reroute, or escalate without any human in the loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it during agent development to catch regressions early.&lt;/strong&gt;&lt;br&gt;
Wrap &lt;code&gt;TraceRecorder&lt;/code&gt; around your agent during development. Every run produces a trace. Feed those traces into the classifier after each session and you'll know immediately if a change introduced a new failure mode. It's the difference between finding out something broke in production versus finding out in your local environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Agent Failure Classifier turns trace debugging from a manual read-and-guess process into a systematic one. Eight named failure modes, a deterministic rule-based layer that runs offline, an optional LLM judge for ambiguous cases, and support for traces from native formats, LangSmith, and LangGraph, all producing a structured report with the first failure turn and actionable fixes.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/agent-failure-classifier" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/agent-failure-classifier&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>agents</category>
    </item>
    <item>
      <title>Synthetic Data Flywheel: A Closed-Loop Pipeline for Instruction-Tuning Data</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 28 Apr 2026 10:44:54 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/synthetic-data-flywheel-a-closed-loop-pipeline-for-instruction-tuning-data-c85</link>
      <guid>https://dev.to/nilofer_tweets/synthetic-data-flywheel-a-closed-loop-pipeline-for-instruction-tuning-data-c85</guid>
<description>&lt;p&gt;Fine-tuning a model requires data. Good data requires human labeling. Human labeling doesn't scale. And most synthetic generation pipelines stop at generation: they produce candidate pairs but have no mechanism to filter them, measure quality, or feed failure cases back into the next round.&lt;br&gt;
&lt;strong&gt;Synthetic Data Flywheel&lt;/strong&gt; is a closed-loop pipeline that handles the full cycle: generate candidate instruction-output pairs, validate them deterministically, score them with an LLM-as-judge, calibrate that judge against human labels, export clean training data, and feed the failure cases from one cycle as seeds into the next. It ships as a CLI, a Python library, and an A2A-protocol agent surface for multi-agent orchestration.&lt;/p&gt;

&lt;p&gt;Everything except the optional fine-tuning step runs on CPU.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Synthetic data generation without a quality gate produces noise at scale. And quality gates without calibration produce a judge whose scores you can't trust. The flywheel addresses both: every candidate pair is scored, every score can be validated against human labels, and every failure becomes signal for the next generation cycle rather than a dead end.&lt;/p&gt;
&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;A dataset moves through a series of additive stages, each producing artifacts keyed by the dataset name. Every stage is idempotent and re-runnable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation:&lt;/strong&gt; Candidate pairs are produced from seed prompts via OpenRouter, using one of four prompt templates: QA, INSTRUCTION, REASONING, or CREATIVE.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation:&lt;/strong&gt; Deterministic checks run over each pair: schema, length, dedup, PII, language, profanity. Results are written as a JSON report with severity levels (error, warning). A cleaned copy of the dataset can be written at this stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judging:&lt;/strong&gt; An LLM-as-judge scores each pair against a rubric. The judge supports three backends: Ollama, OpenRouter, and Anthropic. Judgments are cached on disk keyed by &lt;code&gt;(backend, model, pair.id, rubric.name@version)&lt;/code&gt;, so repeated judge passes on unchanged pairs are free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Labeling:&lt;/strong&gt; Three modes: interactive (a human reviews pairs one by one), bulk (apply a status to a filtered subset), and auto-from-judge (derive labels from judgment scores above a threshold). Labels are stored append-only so sessions can be interrupted and resumed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calibration:&lt;/strong&gt; Treats human labels (&lt;code&gt;status == approved&lt;/code&gt;) as ground truth and measures the judge's precision, recall, F1, and accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compare:&lt;/strong&gt; Two or more judgment runs on the same dataset are compared: pass-agreement, Cohen's kappa, and Pearson correlation on the overall score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Export:&lt;/strong&gt; Pairs that clear the judge filter are written to a train/val split. The filter expression uses a safe evaluator: only arithmetic, comparisons, and subscript access into the context dict are allowed. Attribute access and function calls are rejected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cycle feedback:&lt;/strong&gt; Failure instructions from one cycle are extracted and fed as additional seeds into cycle N+1. The autonomous loop stops when the pass rate drops below &lt;code&gt;min_pass_rate&lt;/code&gt; (default 0.5) or &lt;code&gt;max_cycles&lt;/code&gt; is reached.&lt;/p&gt;
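
&lt;p&gt;A minimal sketch of that loop, with stubbed generate and judge stages standing in for the OpenRouter and Ollama calls; the real loop is &lt;code&gt;flywheel run&lt;/code&gt;, shown below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

def generate(seeds):
    """Stub: one candidate pair per seed (OpenRouter in the real pipeline)."""
    return [{"instruction": s, "output": "draft answer for " + s} for s in seeds]

def judge(pair):
    """Stub: random pass/fail in place of the LLM-as-judge."""
    return random.random() &amp;gt; 0.3

def run_flywheel(seeds, min_pass_rate=0.5, max_cycles=5):
    for cycle in range(1, max_cycles + 1):
        pairs = generate(seeds)
        failures = [p for p in pairs if not judge(p)]
        pass_rate = 1 - len(failures) / len(pairs)
        print(f"cycle {cycle}: pass rate {pass_rate:.0%}")
        if pass_rate &amp;lt; min_pass_rate:
            break
        # Failure instructions become extra seeds for cycle N+1.
        seeds = seeds + [p["instruction"] for p in failures]

run_flywheel(["benefits of green tea", "history of python language"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
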
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/synthetic-data-flywheel
&lt;span class="nb"&gt;cd &lt;/span&gt;synthetic-data-flywheel
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.11+. Generation requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;. The local judge path requires Ollama, verified against &lt;code&gt;gemma4:latest&lt;/code&gt;. Fine-tuning requires Unsloth and a GPU; the repo was verified on a free Colab T4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initialize&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;init&lt;/code&gt; creates the directory structure the rest of the pipeline writes into.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Synthetic Data Flywheel Initialized
Data Directory: ./data
Checkpoint Directory: ./data/checkpoints
Report Directory: ./reports
Directories created successfully
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ingest&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;ingest&lt;/code&gt; normalises an existing dataset into the flywheel's internal JSONL format. It supports jsonl, csv, and HuggingFace datasets, and accepts a field mapping flag when the source uses different column names.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; demo.jsonl &lt;span class="nt"&gt;-n&lt;/span&gt; demo &lt;span class="nt"&gt;--tag&lt;/span&gt; demo1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ingested 8 pairs -&amp;gt; data/user/demo.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other ingest forms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; data.csv              &lt;span class="nt"&gt;-n&lt;/span&gt; my_dataset &lt;span class="nt"&gt;-f&lt;/span&gt; csv
flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; hf://tatsu-lab/alpaca &lt;span class="nt"&gt;-n&lt;/span&gt; alpaca &lt;span class="nt"&gt;--limit&lt;/span&gt; 500 &lt;span class="nt"&gt;--hf-split&lt;/span&gt; train
flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; data.jsonl &lt;span class="nt"&gt;-n&lt;/span&gt; aliased &lt;span class="nt"&gt;--map&lt;/span&gt; &lt;span class="s2"&gt;"instruction=prompt,output=completion"&lt;/span&gt;
flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; data.jsonl &lt;span class="nt"&gt;-n&lt;/span&gt; x &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each successful ingest writes &lt;code&gt;data/user/&amp;lt;name&amp;gt;.jsonl&lt;/code&gt; and &lt;code&gt;data/user/&amp;lt;name&amp;gt;.meta.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate&lt;/strong&gt;&lt;br&gt;
Before any judging happens, the validator runs deterministic checks over the dataset. This catches structural problems (duplicate pairs, PII, malformed schema) before spending LLM calls on them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel validate &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--checks&lt;/span&gt; schema,length,dedup,pii &lt;span class="nt"&gt;--write-clean&lt;/span&gt; data/user/demo.clean.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Validation: demo
  Total pairs       8
  pii               1
  severity:warning  1
Report: data/validation/demo.report.json
Clean dataset written (8 pairs): data/user/demo.clean.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--fail-on error|warning|never&lt;/code&gt; flag lets you gate CI on validation issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judge&lt;/strong&gt;&lt;br&gt;
With a clean dataset, the judge scores each pair against a rubric. The default rubric is built-in; custom rubrics can be passed with &lt;code&gt;--rubric&lt;/code&gt;. Results are cached, so re-running after adding new pairs only scores the new ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel judge &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--backend&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:latest &lt;span class="nt"&gt;--tag&lt;/span&gt; v1 &lt;span class="nt"&gt;--max-pairs&lt;/span&gt; 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Judging 3 pairs with ollama:gemma4:latest
  Judged                3
  Passed                0 (0.0%)
  Avg overall (scored)  5.00
  Output                data/judgments/demo.v1.jsonl
  Cache                 hits=0 misses=3 writes=3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Judgments land at &lt;code&gt;data/judgments/&amp;lt;dataset&amp;gt;.&amp;lt;tag&amp;gt;.jsonl&lt;/code&gt;. The &lt;code&gt;--tag&lt;/code&gt; flag is how multiple judgment runs on the same dataset are tracked separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Label&lt;/strong&gt;&lt;br&gt;
Labeling bridges human judgment and automated scoring. &lt;code&gt;auto-from-judge&lt;/code&gt; derives labels directly from the judgment scores: pairs above the threshold are approved, pairs below are rejected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel label &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--mode&lt;/span&gt; auto-from-judge &lt;span class="nt"&gt;--judgments&lt;/span&gt; data/judgments/demo.v1.jsonl &lt;span class="nt"&gt;--reject-below&lt;/span&gt; 3.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For manual review, &lt;code&gt;--mode interactive&lt;/code&gt; walks through pairs one by one. For bulk operations, &lt;code&gt;--mode bulk&lt;/code&gt; applies a status to a filtered subset. All labels are stored append-only at &lt;code&gt;data/labels/&amp;lt;dataset&amp;gt;.jsonl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compare&lt;/strong&gt;&lt;br&gt;
When you have two judgment runs, say from two different models, &lt;code&gt;compare&lt;/code&gt; measures how much they agree. Cohen's kappa close to 1.0 means the two judges are making the same pass/fail decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel compare &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--tags&lt;/span&gt; judge_a,judge_b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Judge comparison: judge_a vs judge_b
  Common pairs          8
  judge_a passed / mean 6 / 7.44
  judge_b passed / mean 6 / 7.19
  Pass agreement        100.0%
  Cohen's kappa (p/f)   1.000  (near-perfect)
  Score Pearson r       0.965
  Output                reports/demo/compare.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
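
&lt;p&gt;For reference, Cohen's kappa corrects raw agreement for chance. A quick sketch of the computation over two judges' pass/fail verdicts, using the counts from the run above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cohens_kappa(a, b):
    """Kappa over two parallel lists of boolean pass/fail decisions."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa, pb = sum(a) / n, sum(b) / n              # each judge's pass rate
    p_e = pa * pb + (1 - pa) * (1 - pb)          # agreement expected by chance
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

a = [1, 1, 1, 1, 1, 1, 0, 0]  # judge_a: 6 of 8 passed
b = [1, 1, 1, 1, 1, 1, 0, 0]  # judge_b: identical decisions
print(cohens_kappa(a, b))     # 1.0, matching the report above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
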



&lt;p&gt;&lt;strong&gt;Calibrate&lt;/strong&gt;&lt;br&gt;
Calibration answers the question you must settle before trusting your judge: does its &lt;code&gt;passed&lt;/code&gt; decision align with human labels? Precision of 1.0 means every pair the judge passed, a human also approved. Recall of 0.75 means the judge missed 25% of the pairs humans would have kept.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel calibrate &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--tag&lt;/span&gt; judge_a &lt;span class="nt"&gt;--approved-is&lt;/span&gt; approved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Evaluated pairs  8
  Precision        1.000
  Recall           0.750
  F1               0.857
  Accuracy         0.750
  TP/FP/TN/FN      6/0/0/2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
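
&lt;p&gt;Reproducing those metrics from the confusion counts makes the definitions concrete:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;tp, fp, tn, fn = 6, 0, 0, 2
precision = tp / (tp + fp)                          # 1.000: every judge-passed pair was human-approved
recall = tp / (tp + fn)                             # 0.750: 2 human-approved pairs were missed
f1 = 2 * precision * recall / (precision + recall)  # 0.857
accuracy = (tp + tn) / (tp + fp + tn + fn)          # 0.750
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
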



&lt;p&gt;&lt;strong&gt;Visualize&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;visualize&lt;/code&gt; renders a suite of PNG charts and an &lt;code&gt;index.html&lt;/code&gt; for a dataset — covering label distribution, score distributions, pass/fail breakdown, pair lengths, categories, judge agreement matrix, and validation results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel visualize &lt;span class="nt"&gt;-d&lt;/span&gt; demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;categories      reports/demo/categories.png
  lengths         reports/demo/lengths.png
  validation      reports/demo/validation.png
  pass_fail       reports/demo/pass_fail.png
  scores          reports/demo/scores.png
  criteria        reports/demo/criteria.png
  labels          reports/demo/labels.png
  judge_agreement reports/demo/judge_agreement.png
  index.html      reports/demo/index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dataset inspection and export&lt;/strong&gt;&lt;br&gt;
Before exporting, &lt;code&gt;dataset ls&lt;/code&gt; and &lt;code&gt;dataset info&lt;/code&gt; show what artifacts exist for each dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel dataset &lt;span class="nb"&gt;ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name   pairs  source  tags
  demo   8      jsonl   demo1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel dataset info demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pairs       data/user/demo.jsonl               present
  meta        data/user/demo.meta.json           present
  validation  data/validation/demo.report.json   present
  labels      data/labels/demo.jsonl             present
  judgments   data/judgments                     5 set(s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Export filters pairs using a safe expression: here, only pairs with an overall score of 7 or above are written, split 80/20 into train and val.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel dataset &lt;span class="nb"&gt;export &lt;/span&gt;demo &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--to&lt;/span&gt; data/exports/demo.jsonl &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt; jsonl &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--judgments&lt;/span&gt; data/judgments/demo.judge_a.jsonl &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"scores['overall'] &amp;gt;= 7"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--split&lt;/span&gt; &lt;span class="nv"&gt;train&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.8,val&lt;span class="o"&gt;=&lt;/span&gt;0.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wrote 4 pairs -&amp;gt; data/exports/demo.train.jsonl
Wrote 2 pairs -&amp;gt; data/exports/demo.val.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
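
&lt;p&gt;The whitelist idea behind the safe evaluator can be sketched with Python's &lt;code&gt;ast&lt;/code&gt; module; this illustrates the approach, not the project's actual implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast

ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Compare, ast.BoolOp,
           ast.Subscript, ast.Name, ast.Constant, ast.Load,
           ast.operator, ast.unaryop, ast.cmpop, ast.boolop)

def safe_eval(expr, context):
    """Evaluate a filter expression, rejecting attribute access and calls."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):  # e.g. ast.Attribute, ast.Call
            raise ValueError("disallowed: " + type(node).__name__)
    return eval(compile(tree, "&amp;lt;filter&amp;gt;", "eval"), {"__builtins__": {}}, context)

print(safe_eval("scores['overall'] &amp;gt;= 7", {"scores": {"overall": 8}}))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
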



&lt;p&gt;&lt;strong&gt;Run the autonomous loop&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;flywheel run&lt;/code&gt; ties everything together into a seeds-to-checkpoint cycle. Generation goes through OpenRouter; judging goes through Ollama. If Ollama isn't running, generation still succeeds and pairs are saved in the checkpoint, but every judgment falls back to &lt;code&gt;passed=false&lt;/code&gt;. The standalone &lt;code&gt;flywheel judge --backend openrouter&lt;/code&gt; works fully without Ollama.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;meta-llama/llama-3.2-3b-instruct

flywheel run &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"benefits of green tea,history of python language"&lt;/span&gt; &lt;span class="nt"&gt;--max-cycles&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╭───── Configuration ─────╮
│ Synthetic Data Flywheel │
│ Seeds: 2                │
│ Max Cycles: 1           │
╰─────────────────────────╯
Starting Flywheel with max_cycles=1
============================================================
Starting Cycle 1
============================================================
Using 2 seeds
Generating synthetic data...
Generated 2 pairs
Judging quality...
Passed: 0, Failed: 2
Cycle 1 complete. Pass rate: 0.00%
Flywheel complete. Ran 1 cycles.
       Flywheel Summary
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric             ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Cycles       │ 1     │
│ Total Passed Pairs │ 0     │
│ Avg Pass Rate      │ 0.00% │
└────────────────────┴───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each cycle writes a checkpoint. The generated pair is saved verbatim inside &lt;code&gt;data/checkpoints/checkpoint_001.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"benefits of green tea"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Here is an example of an instruction-following training data in JSON format:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;{&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;instruction&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;What are some of the benefits of drinking green tea?&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;output&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Green tea has numerous benefits, including: - High antioxidant content - Anti-inflammatory properties - May help with weight loss ...&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;category&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;instruction&lt;/span&gt;&lt;span class="se"&gt;\"\n&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_seed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"benefits of green tea"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Status and report&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel status
flywheel report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;status&lt;/code&gt; summarises checkpoint state. &lt;code&gt;report&lt;/code&gt; produces an HTML report across cycles written to &lt;code&gt;reports/flywheel_report_&amp;lt;timestamp&amp;gt;.html&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;flywheel --help&lt;/code&gt; lists the command groups. Every command has &lt;code&gt;--help&lt;/code&gt; with full flag docs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;flywheel &lt;span class="nt"&gt;--help&lt;/span&gt;
Usage: flywheel &lt;span class="o"&gt;[&lt;/span&gt;OPTIONS] COMMAND &lt;span class="o"&gt;[&lt;/span&gt;ARGS]...

  Synthetic Data Flywheel - Autonomous data generation pipeline.

Commands:
  calibrate  Measure judge &lt;span class="s1"&gt;'passed'&lt;/span&gt; against human labels &lt;span class="o"&gt;(&lt;/span&gt;precision/recall/F1&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
  compare    Compare two+ judgment runs &lt;span class="o"&gt;(&lt;/span&gt;Cohen&lt;span class="s1"&gt;'s kappa, agreement, ...).
  dataset    Dataset management: ls | info | export.
  ingest     Ingest a user dataset into the flywheel'&lt;/span&gt;s JSONL format.
  init       Initialize flywheel configuration.
  judge      Judge a dataset with an LLM-as-judge backend.
  label      Label a dataset: interactive/bulk/auto-from-judge.
  pipeline   Run declarative YAML pipelines.
  report     Generate HTML report from checkpoints.
  run        Run the synthetic data flywheel.
  status     Show current flywheel status.
  validate   Validate a dataset and write a ValidationReport.
  visualize  Render a suite of PNG charts + index.html &lt;span class="k"&gt;for &lt;/span&gt;a dataset.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pipeline Runner
&lt;/h2&gt;

&lt;p&gt;Individual commands can be composed into a declarative YAML pipeline and run as a single step. This is useful for repeatable workflows: the pipeline dispatches through the same Click commands as manual runs, so behaviour is identical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pipeline_demo.yaml&lt;/span&gt;
&lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;checks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;length&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;dedup&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;export&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/user/demo_pipeline.jsonl&lt;/span&gt;
      &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jsonl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel pipeline run pipeline_demo.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1/2] flywheel validate -d demo --checks schema,length,dedup
[2/2] flywheel dataset export demo --to data/user/demo_pipeline.jsonl --format jsonl
   Pipeline: demo
  1  validate  ok  0
  2  export    ok  0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Python API
&lt;/h2&gt;

&lt;p&gt;The full pipeline is available as a library. The minimal end-to-end call scores a dataset with an async judge backed by Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.ingest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset_jsonl&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.rubrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;default_rubric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.judge&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncQualityJudge&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.judge_backends&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_backend&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.judge_cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JudgmentCache&lt;/span&gt;

&lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset_jsonl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/user/demo.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_backend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncQualityJudge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;default_rubric&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;JudgmentCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.cache/judge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;backend_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;judgments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;judge_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;judgments&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judgments&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The statistical functions used internally by &lt;code&gt;calibrate&lt;/code&gt; and &lt;code&gt;compare&lt;/code&gt; are also directly callable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohens_kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pearson&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prf&lt;/span&gt;

&lt;span class="nf"&gt;cohens_kappa&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# 0.5
&lt;/span&gt;
&lt;span class="nf"&gt;pearson&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# 0.8315...
&lt;/span&gt;
&lt;span class="nf"&gt;prf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'accuracy': 0.5,
#  'tp': 1, 'fp': 1, 'tn': 1, 'fn': 1}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A2A Agent
&lt;/h2&gt;

&lt;p&gt;The flywheel exposes a FastAPI application implementing the A2A protocol surface (&lt;code&gt;/a2a/capabilities&lt;/code&gt;, &lt;code&gt;/a2a/tasks/send&lt;/code&gt;, &lt;code&gt;/a2a/tasks/get&lt;/code&gt;, &lt;code&gt;/a2a/tasks/cancel&lt;/code&gt;) so it can be orchestrated as a node in a multi-agent ML pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; synthetic_data_flywheel.a2a_agent
&lt;span class="c"&gt;# or&lt;/span&gt;
uvicorn synthetic_data_flywheel.a2a_agent:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three capabilities are exposed: &lt;code&gt;generate_synthetic_data&lt;/code&gt;, &lt;code&gt;get_status&lt;/code&gt;, &lt;code&gt;generate_report&lt;/code&gt;. Querying &lt;code&gt;/a2a/capabilities&lt;/code&gt; returns the agent's identity and the full capability list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.testclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TestClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.a2a_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TestClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/a2a/capabilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# {'agent_name': 'synthetic_data_flywheel', 'version': '0.1.0',
#  'capabilities': [{'name': 'generate_synthetic_data', ...},
#                   {'name': 'get_status', ...},
#                   {'name': 'generate_report', ...}]}
&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/a2a/tasks/send&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# {'task_id': '...', 'status': {'state': 'completed'},
#  'result': {'type': 'status_result',
#             'content': {'checkpoints_found': 1,
#                         'checkpoint_dir': 'data/checkpoints'}}}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;All settings are read from environment variables or a &lt;code&gt;.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;sk-or-...&lt;/span&gt;
&lt;span class="py"&gt;OPENROUTER_MODEL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen3-8b:free&lt;/span&gt;
&lt;span class="py"&gt;OLLAMA_BASE_URL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;
&lt;span class="py"&gt;OLLAMA_MODEL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gemma4:latest&lt;/span&gt;
&lt;span class="py"&gt;DEFAULT_JUDGE_BACKEND&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;ollama        # ollama | openrouter | anthropic&lt;/span&gt;
&lt;span class="py"&gt;JUDGE_CONCURRENCY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;
&lt;span class="py"&gt;JUDGE_TIMEOUT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;600&lt;/span&gt;
&lt;span class="py"&gt;QUALITY_MIN_SCORE&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;7.0&lt;/span&gt;
&lt;span class="py"&gt;MAX_CYCLES&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;PII_POLICY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;warn                     # strict | warn | off&lt;/span&gt;
&lt;span class="py"&gt;A2A_HOST&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="py"&gt;A2A_PORT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;JUDGE_TIMEOUT&lt;/code&gt; defaults to 600 seconds; large local models can take over two minutes on the first call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning requires a GPU:&lt;/strong&gt; &lt;code&gt;Trainer.prepare_training_artifacts&lt;/code&gt; writes a Colab-ready Unsloth notebook under &lt;code&gt;notebooks/training_cycle_NNN.ipynb&lt;/code&gt;. Running the training step locally on CPU is not supported by Unsloth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous generation requires OpenRouter:&lt;/strong&gt; &lt;code&gt;flywheel run&lt;/code&gt; requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;. The in-loop judge is hardcoded to Ollama (&lt;code&gt;engine.create_judge&lt;/code&gt; constructs a sync &lt;code&gt;QualityJudge&lt;/code&gt; over &lt;code&gt;OllamaClient&lt;/code&gt;); if Ollama isn't available, pairs are persisted but every judgment falls back to &lt;code&gt;passed=false&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large local judges are slow to cold-start:&lt;/strong&gt; Gemma 4 (9 GB) takes about 130 seconds the first time it loads into VRAM/RAM. The default &lt;code&gt;JUDGE_TIMEOUT&lt;/code&gt; is 600 seconds to cover this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HuggingFace ingest requires &lt;code&gt;datasets&lt;/code&gt;:&lt;/strong&gt; already a dependency, but gated datasets additionally require &lt;code&gt;HUGGINGFACE_TOKEN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic judge backend requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;:&lt;/strong&gt; no offline fallback.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

&lt;p&gt;The problem was defined at a high level: a closed-loop pipeline that generates synthetic instruction-tuning pairs, filters them with a calibrated LLM judge, and feeds failure cases back as seeds for the next cycle. NEO generated the full implementation: the &lt;code&gt;FlywheelEngine&lt;/code&gt; cycle loop with checkpointing, the &lt;code&gt;AsyncQualityJudge&lt;/code&gt; with three pluggable backends and disk-backed cache, the deterministic &lt;code&gt;Validator&lt;/code&gt; with six check types, the &lt;code&gt;LabelStore&lt;/code&gt; with append-only storage, the statistical calibration layer (&lt;code&gt;cohens_kappa&lt;/code&gt;, &lt;code&gt;pearson&lt;/code&gt;, &lt;code&gt;prf&lt;/code&gt;), the safe-eval export filter, the declarative YAML pipeline runner, the Matplotlib visualisation suite, and the A2A FastAPI agent surface. All 100 tests pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Build Further With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Additional judge backends:&lt;/strong&gt; the three existing backends share a common interface via &lt;code&gt;get_backend&lt;/code&gt;. Any OpenAI-compatible endpoint can be wired in as a new backend, and the judge cache, calibration, and compare logic all work with it immediately without any changes.&lt;/p&gt;
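
&lt;p&gt;As a rough illustration, here is what wiring in an OpenAI-compatible endpoint could look like. This is a sketch, not the library's code: the &lt;code&gt;complete&lt;/code&gt; method name and the constructor arguments are assumptions, so check the interface the existing backends share behind &lt;code&gt;get_backend&lt;/code&gt; before adapting it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical backend sketch; the real backend interface may differ.
import os
import httpx


class OpenAICompatibleBackend:
    """Judge backend for any endpoint speaking the OpenAI chat API."""

    def __init__(self, base_url: str, model: str, api_key_env: str = "OPENAI_API_KEY"):
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.api_key = os.environ.get(api_key_env, "")

    async def complete(self, prompt: str) -&amp;gt; str:
        # POST /chat/completions with one user message; return the reply text.
        async with httpx.AsyncClient(timeout=600) as client:
            resp = await client.post(
                f"{self.base_url}/chat/completions",
                headers={"Authorization": f"Bearer {self.api_key}"},
                json={"model": self.model,
                      "messages": [{"role": "user", "content": prompt}]},
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
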

&lt;p&gt;&lt;strong&gt;Additional generation templates:&lt;/strong&gt; the generator ships with four templates: QA, INSTRUCTION, REASONING, CREATIVE. New domain-specific templates (code generation, structured extraction, tool use) would let the flywheel produce specialised training data while the cycle loop, judge, and export pipeline stay entirely unchanged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional validation checks:&lt;/strong&gt; the &lt;code&gt;Validator&lt;/code&gt; already supports six check types plugged into the same &lt;code&gt;--checks&lt;/code&gt; flag and report format. New checks for domain-specific quality signals would run in the same validation pass and appear in the same JSON report and visualisation output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-judge ensembling:&lt;/strong&gt; &lt;code&gt;compare&lt;/code&gt; already computes agreement metrics across judgment runs. Taking the average or majority vote across two or more judge scores before the pass/fail decision would reduce the noise that small local models introduce, without touching the labeling, calibration, or export logic downstream.&lt;/p&gt;
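
&lt;p&gt;A minimal sketch of the voting step, assuming each run is a list of judgment objects with a boolean &lt;code&gt;passed&lt;/code&gt; field as in the Python API example above (&lt;code&gt;majority_pass&lt;/code&gt; is a hypothetical helper, not part of the library):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical helper: strict-majority vote across aligned judgment runs.
def majority_pass(runs):
    """runs: two or more judgment lists, aligned by pair index."""
    n_runs = len(runs)
    verdicts = []
    for judgments in zip(*runs):          # one tuple of judgments per pair
        votes = sum(1 for j in judgments if j.passed)
        verdicts.append(votes * 2 &amp;gt; n_runs)  # ties count as a fail (conservative)
    return verdicts

# e.g. combine an Ollama run and an OpenRouter run before export:
# final = majority_pass([ollama_judgments, openrouter_judgments])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
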

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Synthetic Data Flywheel closes the loop that most synthetic data pipelines leave open. It generates, validates, judges, calibrates, and exports, then feeds what failed back into the next cycle. The result is a data pipeline that improves with each run rather than producing a static batch.&lt;br&gt;
The code is at &lt;a href="https://github.com/dakshjain-1616/synthetic-data-flywheel" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/synthetic-data-flywheel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>syntheticdata</category>
      <category>opensource</category>
      <category>finetuning</category>
    </item>
    <item>
      <title>Token Budget Negotiator</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Mon, 27 Apr 2026 21:55:15 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/token-budget-negotiator-1ijg</link>
      <guid>https://dev.to/nilofer_tweets/token-budget-negotiator-1ijg</guid>
      <description>&lt;p&gt;Everyone knows long prompts cost money. Almost nobody knows which parts of their prompt actually matter.&lt;/p&gt;

&lt;p&gt;Prompts accumulate over time: a system message, a style guide, a few-shot example or two, some background context. Each addition made sense when it was added. Over hundreds of API calls, the overhead compounds. And the honest answer to "which of these sections can I remove?" is: you don't know until you test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token Budget Negotiator&lt;/strong&gt; makes that test systematic. It takes a prompt split into named, prioritised sections, runs a greedy ablation loop that drops one section at a time, scores the remaining prompt against a rubric using a local or remote LLM judge, and stops when savings hit the target without falling below the quality threshold. The result is the smallest prompt that still behaves like the original.&lt;/p&gt;

&lt;p&gt;It ships as a CLI, a Python library, and an MCP server.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Prompt sections are not equal in value, but there's no principled way to know which ones matter for a given task without testing. Manual trimming is guesswork. Token Budget Negotiator answers the question empirically per section, per task, against a rubric that defines what quality means for that use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;A prompt is defined as a YAML file with named sections. Each section carries a &lt;code&gt;type&lt;/code&gt; (system, few_shot, context, instruction), a &lt;code&gt;content&lt;/code&gt; block, and a &lt;code&gt;priority&lt;/code&gt; integer. Priority determines the order in which sections are considered for removal: low-priority sections are evaluated first, high-priority sections last.&lt;/p&gt;

&lt;p&gt;Before any removal happens, the full prompt is scored by the judge LLM against the rubric. This establishes a baseline. The quality target for the run is &lt;code&gt;baseline_score × threshold&lt;/code&gt;; for example, a baseline of 0.80 with a 0.90 threshold sets the target at 0.72.&lt;/p&gt;

&lt;p&gt;The ablation loop then works through sections in ascending priority order. For each candidate, a test prompt is built without that section and rescored. If the score still meets the target, the section is dropped permanently and the loop continues with the updated prompt. If not, the section is kept and the next candidate is evaluated.&lt;/p&gt;

&lt;p&gt;Two conditions stop the loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token savings reach &lt;code&gt;min_token_savings&lt;/code&gt;; the target has been hit.&lt;/li&gt;
&lt;li&gt;A removal would push savings above &lt;code&gt;max_token_savings&lt;/code&gt;; the ceiling is enforced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every accepted removal is verified to actually reduce the token count. The loop cannot produce a larger prompt than it started with.&lt;br&gt;
The output is a &lt;code&gt;NegotiationResult&lt;/code&gt; containing the original and optimised token counts, the list of sections removed, per-step scores, quality retention percentage, elapsed time, scoring call count, rubric name, and a full ablation log. This can be written to JSON or YAML.&lt;/p&gt;
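
&lt;p&gt;The loop itself is small enough to sketch. The version below is illustrative rather than the library's code: &lt;code&gt;score&lt;/code&gt; stands in for a judge call and &lt;code&gt;count_tokens&lt;/code&gt; for the tokenizer, and the stop conditions mirror the two rules above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative greedy ablation; score() and count_tokens() are stand-ins.
def ablate(sections, score, count_tokens, threshold, min_savings, max_savings):
    baseline = score(sections)
    target = baseline * threshold               # quality floor for the run
    original = count_tokens(sections)
    kept = list(sections)
    for section in sorted(sections, key=lambda s: s.priority):  # low priority first
        trial = [s for s in kept if s is not section]
        savings = 1 - count_tokens(trial) / original
        if savings &amp;gt; max_savings:             # ceiling: skip over-trimming removals
            continue
        if score(trial) &amp;gt;= target and count_tokens(trial) &amp;lt; count_tokens(kept):
            kept = trial                        # drop the section permanently
            if savings &amp;gt;= min_savings:        # target reached: stop early
                break
    return kept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
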
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;token-budget-negotiator
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.11+. The local judge path requires Ollama with a model pulled (verified end-to-end against &lt;code&gt;gemma4:latest&lt;/code&gt;). The OpenRouter path requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze token distribution&lt;/strong&gt;&lt;br&gt;
Before negotiating, &lt;code&gt;analyze&lt;/code&gt; prints how many tokens each section holds and its share of the total budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget analyze examples/prompt.yaml
&lt;span class="go"&gt;
Token Distribution Analysis:
Section              Type              Tokens        % Priority
-----------------------------------------------------------------
system               system                22    18.6%       30
style_guide          system                26    22.0%       10
few_shot_1           few_shot              26    22.0%       20
few_shot_2           few_shot              20    16.9%       25
context              context               12    10.2%       40
instruction          instruction           12    10.2%      100
-----------------------------------------------------------------
TOTAL                                     118   100.0%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check the local judge:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget check-ollama &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:latest
&lt;span class="go"&gt;Ollama is connected
  Host: http://localhost:11434
  Model requested: gemma4:latest
  Model available: Yes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run the negotiator:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget negotiate examples/prompt.yaml &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;    --scorer ollama --model gemma4:latest \
    --threshold 0.80 --min-savings 0.20 --max-savings 0.80 \
    --output result.json --format json

Negotiation Result:
  Original: 118 tokens, score=0.600
  Optimized: 92 tokens, score=0.700
  Savings: 22.0%
  Quality Retention: 116.7%
  Success: Yes
  Sections removed: style_guide

Results saved to result.json
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;result.json&lt;/code&gt; contains the full ablation log, the final optimized prompt, per-step scores, and metadata (elapsed time, scoring call count, rubric name).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run the negotiator - OpenRouter path&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget check-openrouter
&lt;span class="go"&gt;OpenRouter is connected
  Base URL: https://openrouter.ai/api/v1
  Model requested: qwen/qwen3-8b
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget &lt;span class="nt"&gt;-v&lt;/span&gt; negotiate examples/prompt.yaml &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;    --scorer openrouter --model meta-llama/llama-3.2-3b-instruct \
    --rubric rubrics/qa.yaml \
    --threshold 0.7 --min-savings 0.1 --max-savings 0.6 --no-cache
Connected to openrouter

Negotiation Result:
  Original: 118 tokens, score=1.000
  Optimized: 92 tokens, score=0.900
  Savings: 22.0%
  Quality Retention: 90.0%
  Success: Yes
  Sections removed: style_guide
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a looser threshold (&lt;code&gt;-t 0.7 --min-savings 0.1 --max-savings 0.5&lt;/code&gt;) and caching left on, the same model drops two sections for 44.1% savings at 100% quality retention.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F279uww5s8wbbjf93vwt0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F279uww5s8wbbjf93vwt0.png" alt=" " width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key &lt;code&gt;negotiate&lt;/code&gt; flags:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-r, --rubric PATH&lt;/code&gt;:  YAML rubric. Defaults to a built-in accuracy+relevance rubric.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-s, --scorer {ollama,openrouter}&lt;/code&gt;: which judge to use. Default ollama.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-m, --model TEXT&lt;/code&gt;: model name (gemma4:latest for Ollama, qwen/qwen3-8b for OpenRouter, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-t, --threshold FLOAT&lt;/code&gt;: minimum fraction of the baseline score to keep. Default 0.95.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--min-savings FLOAT&lt;/code&gt;: stop once savings reach this fraction. Default 0.40.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--max-savings FLOAT&lt;/code&gt;: never drop sections if it would save more than this. Default 0.60.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-o, --output PATH&lt;/code&gt; / &lt;code&gt;-f, --format {json,yaml}&lt;/code&gt;: write a machine-readable report.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--no-cache&lt;/code&gt;: disable the in-memory scoring cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python API
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;token_budget_negotiator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Negotiator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OllamaScorer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PromptSection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rubric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;token_budget_negotiator.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RubricCriterion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SectionType&lt;/span&gt;

&lt;span class="n"&gt;sections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;PromptSection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are helpful.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;section_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SectionType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SYSTEM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;PromptSection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2+2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;section_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SectionType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INSTRUCTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;rubric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Rubric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa rubric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;RubricCriterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;factually correct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scorer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OllamaScorer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;negotiator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Negotiator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;min_token_savings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_token_savings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;negotiator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;negotiate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_token_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimized_token_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;removed:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sections_removed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Rubric Format
&lt;/h2&gt;

&lt;p&gt;The rubric defines what quality means for the task. The judge scores each test prompt against it. Three rubrics ship in &lt;code&gt;rubrics/&lt;/code&gt;: &lt;code&gt;qa.yaml&lt;/code&gt;, &lt;code&gt;coding.yaml&lt;/code&gt;, &lt;code&gt;summarization.yaml&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qa&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;General question-answer rubric&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0"&lt;/span&gt;
&lt;span class="na"&gt;criteria&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;accuracy&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Is the response factually correct?&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;relevance&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Does it answer what was asked?&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
&lt;span class="na"&gt;scoring_instructions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Score 0-1. 1 = perfect, 0 = wrong or irrelevant.&lt;/span&gt;
&lt;span class="na"&gt;output_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  MCP Server
&lt;/h2&gt;

&lt;p&gt;The library also runs as an MCP server over stdio transport, exposing two tools, &lt;code&gt;analyze&lt;/code&gt; and &lt;code&gt;negotiate&lt;/code&gt;, so Claude Code or any MCP-compatible agent can call it directly during a session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; token_budget_negotiator.mcp_server &lt;span class="nt"&gt;--scorer&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;analyze&lt;/code&gt; takes a sections list and returns token distribution as JSON. &lt;code&gt;negotiate&lt;/code&gt; takes sections, rubric, task, thresholds, and scorer config and returns the full negotiation result as JSON.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ablation is greedy one-at-a-time in priority order, not exhaustive subset search.&lt;/li&gt;
&lt;li&gt;The judge is asked for strict JSON; free-text replies fall back to regex score extraction with reduced confidence (a sketch of this fallback follows the list).&lt;/li&gt;
&lt;li&gt;Small local judges like &lt;code&gt;gemma4&lt;/code&gt; are noisy; prefer thresholds in the 0.80-0.90 range and expect multi-minute wall-clock times even for short prompts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;check-openrouter&lt;/code&gt; and the OpenRouter scorer require &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;; there is no offline stub.&lt;/li&gt;
&lt;li&gt;Only the &lt;code&gt;remove&lt;/code&gt; compression strategy is wired up. &lt;code&gt;CompressionStrategy&lt;/code&gt; and &lt;code&gt;sections_compressed&lt;/code&gt; exist on the model but are not yet produced by the negotiator.&lt;/li&gt;
&lt;/ul&gt;
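
&lt;p&gt;For the JSON fallback mentioned above, a hedged sketch of the idea (the library's actual pattern and confidence handling may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative score extraction: strict JSON first, regex fallback second.
import json
import re


def extract_score(reply: str):
    """Return (score, source), where source records which path succeeded."""
    try:
        return float(json.loads(reply)["score"]), "json"
    except (ValueError, KeyError, TypeError):
        pass
    # fall back to the first "score: 0.x"-style number in free text
    match = re.search(r"score\s*[:=]?\s*([01](?:\.\d+)?)", reply, re.IGNORECASE)
    return (float(match.group(1)), "regex") if match else (None, "none")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
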

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

&lt;p&gt;The problem was defined at a high level: a tool that takes a structured prompt, scores it with a local or remote LLM judge, and finds the minimum set of sections needed to hit a quality threshold. NEO generated the full implementation: the greedy ablation loop in &lt;code&gt;Negotiator&lt;/code&gt;, the &lt;code&gt;OllamaScorer&lt;/code&gt; and &lt;code&gt;OpenRouterScorer&lt;/code&gt; with their shared interface, the &lt;code&gt;ScoreCache&lt;/code&gt; with TTL-based invalidation, the &lt;code&gt;SectionTokenizer&lt;/code&gt; backed by tiktoken, the YAML rubric format, the MCP server with its two exposed tools, and the CLI built on Click. All 49 tests pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Token Budget Negotiator turns prompt compression from guesswork into an empirical process. It scores every section against a rubric, drops only what demonstrably doesn't matter, and produces a report showing exactly what changed and why.&lt;br&gt;
The code is at &lt;a href="https://github.com/dakshjain-1616/token-budget-negotiator" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/token-budget-negotiator&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Agent Memory Compressor: Intelligent Memory Compression for Long-Running LLM Agents</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Mon, 27 Apr 2026 13:02:47 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/agent-memory-compressor-intelligent-memory-compression-for-long-running-llm-agents-5941</link>
      <guid>https://dev.to/nilofer_tweets/agent-memory-compressor-intelligent-memory-compression-for-long-running-llm-agents-5941</guid>
      <description>&lt;p&gt;A 10-turn agent session can easily accumulate 20,000+ tokens of raw history, leaving almost no room for the current task. Naive truncation drops older turns wholesale, including the decisions and discovered facts the agent needs to avoid repeating work. Developers need a principled way to compress history rather than discard it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Memory Compressor&lt;/strong&gt; is a Python library that implements an intelligent memory compression pipeline for long-running LLM agents. It combines importance-based scoring, LLM-driven summarization, a forgetting curve trigger, and a token-budgeted context builder so agents can run indefinitely without exhausting their context windows, while preserving the facts and decisions that matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Context Window Exhaustion
&lt;/h2&gt;

&lt;p&gt;The problem has three dimensions, and agent-memory-compressor addresses each one directly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to keep&lt;/strong&gt;: A multi-signal importance scorer ranks every memory entry.&lt;br&gt;
&lt;strong&gt;How to shrink&lt;/strong&gt;: Three pluggable compression strategies replace low-value entries with compact equivalents using any OpenAI-compatible LLM.&lt;br&gt;
&lt;strong&gt;When to act&lt;/strong&gt;: A forgetting curve fires compression automatically when either a turn interval or a token threshold is crossed.&lt;/p&gt;
&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Importance Scoring&lt;/strong&gt;&lt;br&gt;
Every memory entry is scored by the &lt;code&gt;ImportanceScorer&lt;/code&gt;, which combines three signals:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vuitmevat3gxfqchuy5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vuitmevat3gxfqchuy5.png" alt=" " width="780" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compression Strategies&lt;/strong&gt;&lt;br&gt;
Given a scored store, the &lt;code&gt;CompressionEngine&lt;/code&gt; exposes three strategies:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;summarize(entry)&lt;/code&gt;: Asks the LLM for a short summary that preserves all decisions and facts.&lt;br&gt;
&lt;code&gt;extract_facts(entry)&lt;/code&gt;: Asks the LLM for a bullet list of facts and decisions, stored as high-importance compressed entries.&lt;br&gt;
&lt;code&gt;archive(entry)&lt;/code&gt;: Replaces the entry with a minimal reference; the original content is retained in the entry's &lt;code&gt;compression_history&lt;/code&gt; for audit.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;MemoryCompressor&lt;/strong&gt; orchestrates the pipeline: score, pick the lowest-scoring non-protected entries, apply the least-destructive strategy first, and iterate until the store is under &lt;code&gt;token_budget&lt;/code&gt;. Every successful replacement is verified to actually reduce the token count, so compression can never make the context larger.&lt;/p&gt;
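
&lt;p&gt;A rough sketch of that orchestration, with the caveat that the helper names (&lt;code&gt;unprotected_entries&lt;/code&gt;, &lt;code&gt;replace&lt;/code&gt;, &lt;code&gt;token_count&lt;/code&gt;) are illustrative assumptions; only &lt;code&gt;token_total()&lt;/code&gt; and the three strategy methods appear in the documented API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative orchestration; helper names are assumptions, not the real API.
def compress_until_under_budget(store, engine, importance, token_budget):
    strategies = [engine.summarize, engine.extract_facts, engine.archive]
    while store.token_total() &amp;gt; token_budget:
        progressed = False
        # walk candidates from least to most important, non-protected only
        for entry in sorted(store.unprotected_entries(), key=importance):
            for strategy in strategies:           # least destructive first
                replacement = strategy(entry)
                if replacement.token_count &amp;lt; entry.token_count:  # must shrink
                    store.replace(entry, replacement)
                    progressed = True
                    break
            if progressed:
                break
        if not progressed:
            return  # nothing shrinks any further; never grow the context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
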

&lt;p&gt;&lt;strong&gt;The Forgetting Curve&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;ForgettingCurve&lt;/code&gt; decides when to compress. It combines two triggers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Turn-based:&lt;/strong&gt; fires once the number of turns since the last compression reaches &lt;code&gt;compression_interval_turns&lt;/code&gt; (default: 10)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token-based:&lt;/strong&gt; fires once &lt;code&gt;MemoryStore.token_total()&lt;/code&gt; exceeds &lt;code&gt;compression_threshold_tokens&lt;/code&gt; (default: 6000), with hysteresis to prevent thrashing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;should_compress(store)&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt; as soon as either condition is met. &lt;code&gt;get_compression_priority(store)&lt;/code&gt; returns entries sorted by importance, so the orchestrator always attacks the least-valuable history first.&lt;/p&gt;
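
&lt;p&gt;A minimal sketch of the trigger logic (hysteresis omitted for brevity; attribute names are illustrative, not the library's):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal trigger sketch; the real ForgettingCurve adds hysteresis on the token path.
class SimpleForgettingCurve:
    def __init__(self, compression_interval_turns=10,
                 compression_threshold_tokens=6000):
        self.interval = compression_interval_turns
        self.threshold = compression_threshold_tokens
        self.turns_since_compress = 0

    def on_turn(self):
        self.turns_since_compress += 1

    def should_compress(self, store):
        # fires as soon as either trigger is crossed
        return (self.turns_since_compress &amp;gt;= self.interval
                or store.token_total() &amp;gt; self.threshold)

    def mark_compressed(self, store):
        self.turns_since_compress = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
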
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="c"&gt;# optional, for live LLM calls&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The package depends on &lt;code&gt;pydantic&lt;/code&gt;, &lt;code&gt;tiktoken&lt;/code&gt; (for &lt;code&gt;cl100k_base&lt;/code&gt; token counts), &lt;code&gt;click&lt;/code&gt;, and &lt;code&gt;rich&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage Example
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MemoryEntry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MemoryCompressor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.triggers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ForgettingCurve&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ContextBuilder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ContextConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.strategies&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CompressionEngine&lt;/span&gt;

&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MemoryEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;compressor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryCompressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;protected_recent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;CompressionEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;curve&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ForgettingCurve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compression_interval_turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;compression_threshold_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;curve&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;should_compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compressor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;curve&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mark_compressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens_saved&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compression_ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; reduction)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ContextBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ContextConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Without an API key, &lt;code&gt;LLMClient&lt;/code&gt; falls back to a deterministic short stub so pipelines remain runnable in tests and offline demos. A full end-to-end demo lives at &lt;a href="https://github.com/dakshjain-1616/Agent-Memory-Compressor/blob/main/demos/long_run_demo.py" rel="noopener noreferrer"&gt;demos/long_run_demo.py&lt;/a&gt;.&lt;/p&gt;
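&lt;p&gt;As a rough illustration, the offline path can be exercised like this (a minimal sketch: the top-level import path and the &lt;code&gt;api_key=None&lt;/code&gt; behavior are assumptions, and &lt;code&gt;store&lt;/code&gt; is the &lt;code&gt;MemoryStore&lt;/code&gt; from the snippet above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent_memory_compressor import LLMClient, MemoryCompressor, CompressionEngine

# No api_key: the client falls back to its deterministic stub, so the
# whole pipeline runs with no network access (tests, offline demos).
llm = LLMClient(api_key=None)

compressor = MemoryCompressor(
    token_budget=4000,
    protected_recent=3,
    engine=CompressionEngine(llm_client=llm),
)
report = compressor.compress(store)  # stubbed summaries, real bookkeeping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;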
&lt;h2&gt;
  
  
  API Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqv6ab49n2hrdoiiaylc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqv6ab49n2hrdoiiaylc.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;memory-cli&lt;/code&gt; entrypoint (&lt;code&gt;click&lt;/code&gt;-based) is installed for quick inspection, compression, and demo runs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Integration with the Session Manager
&lt;/h2&gt;

&lt;p&gt;The adapters module wires the compressor directly into the &lt;a href="https://github.com/dakshjain-1616/agent-session-manager" rel="noopener noreferrer"&gt;Stateful Agent Session Manager&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.adapters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compress_session&lt;/span&gt;

&lt;span class="n"&gt;compressed_messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compress_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# anything exposing get_messages() / get_metadata()
&lt;/span&gt;    &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;protected_recent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;SessionAdapter.session_to_store&lt;/code&gt; projects session messages into a &lt;code&gt;MemoryStore&lt;/code&gt;, &lt;code&gt;compressor.compress(...)&lt;/code&gt; runs the pipeline, and &lt;code&gt;store_to_session&lt;/code&gt; projects the compressed entries back into the session's message format, preserving original roles and retaining the compression history on each compacted entry.&lt;/p&gt;
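&lt;p&gt;Spelled out, that round trip looks roughly like the following sketch (the constructor and method signatures here are assumptions based on the names above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent_memory_compressor.adapters import SessionAdapter
from agent_memory_compressor import MemoryCompressor

# Sketch of what compress_session does under the hood.
adapter = SessionAdapter()

# 1. Project the session's messages into a MemoryStore.
store = adapter.session_to_store(session)

# 2. Run the usual scoring + compression pipeline on the store.
compressor = MemoryCompressor(token_budget=4000, protected_recent=3)
report = compressor.compress(store)

# 3. Project compressed entries back into the session's message
#    format, preserving roles and per-entry compression history.
compressed_messages = adapter.store_to_session(store, session)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;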

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code end-to-end for AI/ML tasks, including model evals, prompt optimization, and pipeline development.&lt;br&gt;
I described the problem at a high level: an intelligent memory pipeline for long-running agents that scores history by importance, compresses the least valuable entries, and assembles a token-bounded context.&lt;/p&gt;

&lt;p&gt;NEO generated the full implementation: the multi-signal ImportanceScorer, the three compression strategies in CompressionEngine, the turn- and token-based ForgettingCurve triggers, the token-budgeted ContextBuilder, and the SessionAdapter that wires everything into an existing agent session, all as a coherent, installable Python library.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Build Further With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Semantic similarity scoring&lt;/strong&gt;: straightforward; call an embeddings API and feed the similarity score into the existing scoring pipeline. This is done all the time in RAG systems.&lt;br&gt;
&lt;strong&gt;Pluggable tokenizers&lt;/strong&gt;: purely an engineering task; abstract the tiktoken call behind an interface (see the sketch below). No research needed.&lt;br&gt;
&lt;strong&gt;More agent framework adapters&lt;/strong&gt;: LangChain and LlamaIndex both expose message lists. The &lt;code&gt;session_to_store&lt;/code&gt; pattern already exists; repeat it for each framework.&lt;br&gt;
&lt;strong&gt;Streaming compression&lt;/strong&gt;: the trigger logic already exists; moving it to run per turn is a refactor, not a research problem.&lt;/p&gt;
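&lt;p&gt;To make the pluggable-tokenizer item concrete, it boils down to one small seam. The &lt;code&gt;Tokenizer&lt;/code&gt; protocol and both implementations below are hypothetical names, not part of the library:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Protocol

class Tokenizer(Protocol):
    """Hypothetical seam abstracting the library's direct tiktoken call."""
    def count_tokens(self, text: str): ...  # returns int

class TiktokenTokenizer:
    """Default: exact counts via tiktoken (the current behavior)."""
    def __init__(self, model="gpt-4o-mini"):
        import tiktoken
        self._enc = tiktoken.encoding_for_model(model)

    def count_tokens(self, text: str):
        return len(self._enc.encode(text))

class WhitespaceTokenizer:
    """Cheap offline fallback: approximate tokens by word count."""
    def count_tokens(self, text: str):
        return len(text.split())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Anything that enforces a token budget (&lt;code&gt;MemoryCompressor&lt;/code&gt;, &lt;code&gt;ContextBuilder&lt;/code&gt;) would then accept a &lt;code&gt;Tokenizer&lt;/code&gt; instead of calling tiktoken directly.&lt;/p&gt;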

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Agent Memory Compressor is a principled answer to context window exhaustion for long-running LLM agents.&lt;/p&gt;

&lt;p&gt;Instead of truncating history blindly, it scores every piece of memory, applies the least-destructive compression strategy first, and assembles a token-bounded context that preserves what the agent actually needs: the decisions, discovered facts, and recent turns that matter most.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Agent-Memory-Compressor" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Agent-Memory-Compressor&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>devtools</category>
      <category>agents</category>
    </item>
    <item>
      <title>Cache-Augmented Generation (CAG): A RAG-less Approach to Document QA</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 25 Apr 2026 10:29:22 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/cache-augmented-generation-cag-a-rag-less-approach-to-document-qa-3296</link>
      <guid>https://dev.to/nilofer_tweets/cache-augmented-generation-cag-a-rag-less-approach-to-document-qa-3296</guid>
      <description>&lt;p&gt;Most document QA systems today rely on Retrieval-Augmented Generation (RAG). The standard pipeline is familiar: chunk the document, generate embeddings, store them in a vector database, and retrieve relevant chunks at query time.&lt;/p&gt;

&lt;p&gt;This works, but it comes with trade-offs. The model only sees fragments of the document, retrieval adds latency, and the system becomes more complex with multiple moving parts.&lt;/p&gt;

&lt;p&gt;Cache-Augmented Generation (CAG) explores a different approach, where the document is processed once and reused across queries instead of being retrieved repeatedly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Cache-Augmented Generation
&lt;/h2&gt;

&lt;p&gt;Cache-Augmented Generation (CAG) approaches document QA by reusing the model’s internal state instead of retrieving context for every query.&lt;/p&gt;

&lt;p&gt;During ingestion, the entire document is processed in a single pass. In this step, the model builds its KV (key-value) cache, which represents the document’s context.&lt;/p&gt;

&lt;p&gt;This KV cache is then saved to disk.&lt;/p&gt;

&lt;p&gt;When a query is made, the cache is restored and the query is appended, allowing the model to generate responses using the previously processed document.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg3ppq2ap9haejv1cnvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg3ppq2ap9haejv1cnvd.png" alt=" " width="757" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Ingest (done once per document)&lt;/strong&gt;&lt;br&gt;
The document is wrapped in a structured prompt and sent to llama-server. The model runs a full prefill pass, loading every token into the KV cache. This takes time proportional to document size, but it only happens once. The KV cache is then saved to a &lt;code&gt;.bin&lt;/code&gt; file on disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Query (instant, repeatable)&lt;/strong&gt;&lt;br&gt;
Before each query, the saved &lt;code&gt;.bin&lt;/code&gt; file is restored into llama-server's KV cache in ~1 second. The user's question is appended and the model generates an answer with full document context active. No re-reading, no re-embedding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Persistence&lt;/strong&gt;&lt;br&gt;
KV slots survive server restarts. Kill the server, restart it, and your next query restores the cache from disk just as fast. The 24-minute prefill for &lt;em&gt;War and Peace&lt;/em&gt; only ever needs to happen once.&lt;/p&gt;
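&lt;p&gt;Done by hand, the restore-then-query step maps onto llama-server's slot save/restore endpoints. A minimal sketch, assuming the server listens on port 8080 and was started with &lt;code&gt;--slot-save-path&lt;/code&gt; pointing at &lt;code&gt;kv_slots/&lt;/code&gt; (payload shapes can vary across llama.cpp versions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

LLAMA = "http://localhost:8080"

# Restore the saved KV cache into slot 0 (~1 second, no prefill).
requests.post(
    f"{LLAMA}/slots/0?action=restore",
    json={"filename": "my_doc.bin"},
).raise_for_status()

# Ask a question with the full document context already resident.
resp = requests.post(f"{LLAMA}/completion", json={
    "prompt": "Question: Who is Pierre Bezukhov?\nAnswer:",
    "id_slot": 0,          # reuse the slot we just restored
    "cache_prompt": True,  # keep the restored KV cache
    "n_predict": 256,
})
print(resp.json()["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;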
&lt;h2&gt;
  
  
  &lt;strong&gt;Validated Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;All 11 GPU tests were run on an NVIDIA RTX A6000 (48 GB VRAM) with Qwen3.5-35B-A3B Q3_K_M at 1,048,576 token context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcl1j86rcusllw996dbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcl1j86rcusllw996dbj.png" alt=" " width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Output&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;“Who is Pierre Bezukhov?” → Correct, detailed answer&lt;br&gt;
“What happened at the Battle of Borodino?” → Correct, detailed answer&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Quick Start&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Prerequisites: Linux, NVIDIA GPU (8 GB+ VRAM), Python 3.8+&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Build llama.cpp + download model (one-time, ~35 min)&lt;/span&gt;
./setup.sh

&lt;span class="c"&gt;# 2. Start the LLM server&lt;/span&gt;
./start_server.sh

&lt;span class="c"&gt;# 3. Start the API server&lt;/span&gt;
python3 src/api_server.py

&lt;span class="c"&gt;# 4. Ingest a document&lt;/span&gt;
python3 src/ingest.py my_document.txt &lt;span class="nt"&gt;--corpus-id&lt;/span&gt; my_doc

&lt;span class="c"&gt;# 5. Query it&lt;/span&gt;
python3 src/query.py my_doc &lt;span class="s2"&gt;"What is this document about?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. After step 4, the KV cache is saved to &lt;code&gt;kv_slots/my_doc.bin&lt;/code&gt;. Every future query restores it instantly, and it survives server restarts.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Model Selection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;setup.sh&lt;/code&gt; auto-detects GPU VRAM and picks the right model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F172inz546xzlcb7o615u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F172inz546xzlcb7o615u.png" alt=" " width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The 24 GB+ path uses &lt;code&gt;unsloth/Qwen3.5-35B-A3B-GGUF&lt;/code&gt; from Hugging Face and requires a free HF account and access token.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;REST API&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Start the API server with &lt;code&gt;python3 src/api_server.py --port 8000&lt;/code&gt; (optionally set the &lt;code&gt;CAG_API_KEY&lt;/code&gt; environment variable to enable key auth).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhslie8eplnq0cbj0uelk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhslie8eplnq0cbj0uelk.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Full API docs available at &lt;code&gt;http://localhost:8000/docs&lt;/code&gt; when the server is running.&lt;/p&gt;
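&lt;p&gt;For orientation, a client sketch follows. The &lt;code&gt;/ingest&lt;/code&gt; and &lt;code&gt;/query&lt;/code&gt; routes, the field names, and the &lt;code&gt;X-API-Key&lt;/code&gt; header are illustrative guesses, so treat &lt;code&gt;/docs&lt;/code&gt; as the source of truth:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import requests

API = "http://localhost:8000"
# If the server was started with CAG_API_KEY set, echo it back in a
# header (the header name is an assumption; see /docs for the real scheme).
HEADERS = {"X-API-Key": os.environ.get("CAG_API_KEY", "")}

# Hypothetical ingest call: one-time prefill + KV cache save.
with open("my_document.txt") as f:
    requests.post(f"{API}/ingest", headers=HEADERS,
                  json={"corpus_id": "my_doc", "text": f.read()})

# Hypothetical query call: restores the cache and generates an answer.
resp = requests.post(f"{API}/query", headers=HEADERS,
                     json={"corpus_id": "my_doc",
                           "question": "What is this document about?"})
print(resp.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;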

&lt;h2&gt;
  
  
  &lt;strong&gt;Directory Structure&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── setup.sh              # Builds llama.cpp, downloads model
├── start_server.sh       # Launches llama-server with CAG flags
├── requirements.txt
├── src/
│   ├── api_server.py     # FastAPI REST API
│   ├── ingest.py         # CLI: ingest a document
│   ├── query.py          # CLI: query a corpus
│   └── demo.py           # End-to-end demo
├── docker/
│   ├── Dockerfile
│   └── docker-compose.yml
├── docs/
│   ├── REPORT.md         # Full GPU validation report with all 11 test results
│   └── GPU_TESTING.md    # GPU test checklist
├── models/               # GGUF weights (not committed)
├── kv_slots/             # Saved KV cache .bin files (not committed)
└── logs/                 # Runtime logs (not committed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Limitations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Linux + NVIDIA only:&lt;/strong&gt; TurboQuant CUDA kernels require Linux and NVIDIA GPUs (no Windows, macOS, or AMD).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long initial prefill:&lt;/strong&gt; ~900K tokens can take ~24 minutes on an A6000. This is a one-time cost; subsequent queries restore in ~1 second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VRAM gating:&lt;/strong&gt; Systems with lower VRAM use smaller models with shorter context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single active corpus:&lt;/strong&gt; Uses a single llama.cpp slot (slot 0). Switching corpora requires restoring a different KV cache (~1 second).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-context limitations:&lt;/strong&gt; YaRN extrapolation biases attention toward the start and end of documents, so mid-document content can be missed at very large context sizes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build time:&lt;/strong&gt; Initial setup (&lt;code&gt;./setup.sh&lt;/code&gt;) can take ~35 minutes to compile CUDA kernels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model access requirements:&lt;/strong&gt; Large models (e.g., Qwen3.5-35B) require a Hugging Face account and access token.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks, including model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The system was defined at a high level, describing a document QA workflow that avoids RAG by loading full documents into an LLM, saving the KV cache, and restoring it for repeated queries.&lt;/p&gt;

&lt;p&gt;Based on this, NEO generated the implementation, handled debugging across CUDA, Python, and shell components, and validated the system through a series of GPU tests.&lt;/p&gt;

&lt;p&gt;This included fixing multiple issues during development and running end-to-end validation to ensure ingestion, cache restoration, and query flows worked reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Extend This Further with NEO&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The system can be extended in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;supporting multiple KV cache slots (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;improving handling of long-context attention limitations&lt;/li&gt;
&lt;li&gt;optimizing cache storage and compression&lt;/li&gt;
&lt;li&gt;exploring hybrid approaches combining CAG with retrieval&lt;/li&gt;
&lt;li&gt;extending API capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These extensions would require changes to the current implementation and can be explored based on system requirements.&lt;/p&gt;
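&lt;p&gt;As one example, the multiple-slots item could build on llama.cpp's parallel slots. A hedged sketch, assuming the server was started with &lt;code&gt;--parallel 2&lt;/code&gt; and &lt;code&gt;--slot-save-path&lt;/code&gt; (the corpus-to-slot mapping is purely illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

LLAMA = "http://localhost:8080"

# Illustrative corpus-to-slot mapping; each corpus keeps its own KV cache.
SLOTS = {"war_and_peace": 0, "contracts": 1}

def activate(corpus_id):
    """Restore a corpus's saved KV cache into its dedicated slot."""
    slot = SLOTS[corpus_id]
    requests.post(
        f"{LLAMA}/slots/{slot}?action=restore",
        json={"filename": f"{corpus_id}.bin"},
    ).raise_for_status()
    return slot

slot = activate("war_and_peace")
# Subsequent /completion calls pass id_slot=slot, so two corpora can
# stay warm at once instead of sharing slot 0.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;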

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Notes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Cache-Augmented Generation is an alternative way to approach document QA.&lt;/p&gt;

&lt;p&gt;Instead of retrieving context at query time, it shifts the cost to a one-time preprocessing step and reuses the model’s KV cache.&lt;/p&gt;

&lt;p&gt;This makes repeated queries fast and keeps the full document context available to the model through the KV cache, while introducing trade-offs in setup time and hardware requirements.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Cache-Augmented-Generation-CAG-System" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Cache-Augmented-Generation-CAG-System&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
