<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nilofer 🚀</title>
    <description>The latest articles on DEV Community by Nilofer 🚀 (@nilofer_tweets).</description>
    <link>https://dev.to/nilofer_tweets</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1137273%2Fac10d3a1-21d6-46e3-90d6-889213a616bd.jpg</url>
      <title>DEV Community: Nilofer 🚀</title>
      <link>https://dev.to/nilofer_tweets</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nilofer_tweets"/>
    <language>en</language>
    <item>
      <title>Context Compaction Visualizer: See Exactly What Your AI Agent Forgot Before It Costs You</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 23 Jun 2026 06:01:50 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/context-compaction-visualizer-see-exactly-what-your-ai-agent-forgot-before-it-costs-you-1o8n</link>
      <guid>https://dev.to/nilofer_tweets/context-compaction-visualizer-see-exactly-what-your-ai-agent-forgot-before-it-costs-you-1o8n</guid>
      <description>&lt;p&gt;When an AI agent runs for many turns, it eventually hits context limits and must compress or discard earlier messages. This is often invisible, yet critical - lost context can cause the agent to forget constraints, user preferences, or prior decisions. The framework moves on. The agent keeps running. And somewhere in those discarded turns is a security finding, a constraint, a decision that the rest of the session quietly proceeds without.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Compaction Visualizer&lt;/strong&gt; makes that process visible - not after something breaks, but as an inspectable artifact of every run. It is a visualization platform that helps teams understand how long-running AI agents manage and compress context over time - upload execution traces from LangSmith, OpenTelemetry, AgentOps, or any custom format, and explore exactly which context was retained, compressed, or discarded, and at what cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7rfhodv9zuhsjj74y9qs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7rfhodv9zuhsjj74y9qs.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Platform Does
&lt;/h2&gt;

&lt;p&gt;The core problem is that compaction happens inside the framework's internals. There is no standard output that tells you which messages survived, which were summarized, and which were dropped - or what any of that cost in tokens. This platform reconstructs that picture from execution traces.&lt;/p&gt;

&lt;p&gt;A trace file is uploaded with a format selected, and the platform rebuilds the full session: every message at every turn, its fate - retained verbatim, summarized, or discarded - and any compaction events that occurred along the way. A D3.js stacked-bar timeline renders token consumption across all turns with color-coded regions for each outcome. A session replay steps through turn by turn, surfacing a diff at the exact point a compaction event fires. Token analytics compute the total cost and compression efficiency of the session. A Claude-powered information loss detector scores the risk of each compaction event and names specifically what may have been lost.&lt;/p&gt;

&lt;p&gt;When two traces are available - two different agents, or the same agent under two different compaction strategies - a comparative view places them side by side to show which preserved more context at lower cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.11 or newer&lt;/li&gt;
&lt;li&gt;Node.js 18 or newer&lt;/li&gt;
&lt;li&gt;An Anthropic API key (optional - only the Info Loss Detection feature needs it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Set up the environment&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp .env.example .env
# Optionally add ANTHROPIC_API_KEY=sk-ant-your-key-here for info loss detection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run the backend&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd backend
pip install -r requirements.txt
uvicorn main:app --reload --host 0.0.0.0 --port 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API starts at &lt;code&gt;http://localhost:8000&lt;/code&gt;. Interactive docs are available at &lt;code&gt;http://localhost:8000/docs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run the frontend&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The frontend runs in a separate process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd frontend
npm install
npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The UI opens at &lt;code&gt;http://localhost:5173&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run with Docker&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp .env.example .env
docker compose up --build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend runs on port 8000, the frontend serves via nginx on port 5173.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running Tests&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd backend &amp;amp;&amp;amp; python -m pytest tests/ -v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;29 tests covering all four parsers, edge cases, and the token counter. The full suite runs in under 100ms since nothing in it hits an external service.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fb38atjal9yo5rgtuptci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fb38atjal9yo5rgtuptci.png" alt=" " width="700" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported Trace Formats
&lt;/h2&gt;

&lt;p&gt;The platform accepts four input formats, selectable via a dropdown on upload. Each has its own parser that handles the vendor-specific schema and reduces it to the normalized structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangSmith&lt;/strong&gt; - Parses JSON exports from the LangSmith tracing platform. The parser extracts runs, the messages inside each run, token counts from usage metadata, and any chain-level summarization events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt; - Parses OTEL-format JSON spans. The parser traverses the span tree, reconstructs message history from span attributes, and identifies compaction events from span names containing "compress" or "summarize".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentOps&lt;/strong&gt; - Parses AgentOps session JSON exports. The parser handles the session-level event structure and normalizes message roles from AgentOps event types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom JSON&lt;/strong&gt; - A generic format for any agent framework not listed above. It expects a &lt;code&gt;messages&lt;/code&gt; array with &lt;code&gt;role&lt;/code&gt;, &lt;code&gt;content&lt;/code&gt;, and optional &lt;code&gt;tokens&lt;/code&gt; and &lt;code&gt;timestamp&lt;/code&gt; fields. Any event with &lt;code&gt;type: "compaction"&lt;/code&gt; or &lt;code&gt;type: "summarization"&lt;/code&gt; is treated as a compaction event.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;context-compaction-visualizer/
├── backend/
│   ├── main.py                  # FastAPI app, 7 endpoints
│   ├── models.py                # Trace, Message, CompactionEvent ORM
│   ├── schemas.py               # Pydantic validation schemas
│   ├── database.py              # SQLAlchemy + SQLite setup
│   ├── parsers/
│   │   ├── langsmith_parser.py
│   │   ├── otel_parser.py
│   │   ├── agentops_parser.py
│   │   └── custom_parser.py
│   ├── services/
│   │   ├── context_analyzer.py  # Claude-powered info loss detection
│   │   └── token_counter.py     # Token counting + cost estimates
│   ├── requirements.txt
│   ├── Dockerfile
│   └── tests/
│       ├── test_parsers.py      # 29 tests covering all 4 parsers
│       └── fixtures/
│           ├── langsmith_trace.json
│           ├── otel_trace.json
│           ├── agentops_trace.json
│           └── custom_trace.json
├── frontend/
│   ├── src/
│   │   ├── App.tsx              # Upload/Timeline/Replay/Analytics/Loss/Compare tabs
│   │   ├── components/
│   │   │   ├── TraceUploader.tsx
│   │   │   ├── ContextTimeline.tsx   # D3.js stacked bar chart
│   │   │   ├── SessionReplay.tsx     # Turn navigation + compaction diff
│   │   │   ├── TokenAnalytics.tsx
│   │   │   ├── InfoLossDetector.tsx
│   │   │   └── ComparativeView.tsx
│   │   ├── hooks/useD3.ts
│   │   ├── api/client.ts
│   │   └── types/index.ts
│   ├── Dockerfile
│   ├── package.json
│   └── vite.config.ts
├── docker-compose.yml
└── .env.example
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file structure reflects the normalization design directly. Every file under &lt;code&gt;backend/parsers/&lt;/code&gt; handles one vendor's schema and outputs the same structure. Nothing downstream - not &lt;code&gt;main.py&lt;/code&gt;, not any frontend component - needs to know which parser ran. The two services, &lt;code&gt;context_analyzer.py&lt;/code&gt; and &lt;code&gt;token_counter.py&lt;/code&gt;, sit after all four parsers and only ever see the normalized output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Design Decisions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Parser normalization&lt;/strong&gt; - Each observability platform has a fundamentally different schema. Rather than handling platform-specific quirks in every component, all four parsers produce an identical normalized structure. This means the timeline, replay, analytics, and comparison views have no knowledge of the original format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graceful Claude fallback&lt;/strong&gt; - The Info Loss Detector calls the Anthropic API only when &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; is set. Without a key, it returns &lt;code&gt;analysis_available: false&lt;/code&gt; with a clear message rather than failing. The rest of the platform works fully without any API key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;D3.js integration via hook&lt;/strong&gt; - The &lt;code&gt;useD3.ts&lt;/code&gt; hook manages D3's selection lifecycle within React's rendering model. D3 takes ownership of the SVG element inside the hook's effect, while React manages the wrapping div and props. This avoids the common conflict between React's virtual DOM and D3's direct DOM manipulation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Centralized cost estimates&lt;/strong&gt; - Token counts and cost calculations happen in &lt;code&gt;token_counter.py&lt;/code&gt; using verified Claude pricing - $3 per million input tokens, $15 per million output tokens - defined as constants in one place, making them easy to update if pricing changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment Variables
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fktyg3vb6y0vqdrbx66tc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fktyg3vb6y0vqdrbx66tc.png" alt=" " width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Verified Results
&lt;/h2&gt;

&lt;p&gt;The backend ships with 29 tests covering all four trace parsers, realistic multi-turn fixture data for each format, edge cases like empty inputs and missing fields, and the token counter. All tests run in under 100ms since no external services are called. The frontend builds to a 238 KB JS bundle across 600 modules.&lt;/p&gt;

&lt;p&gt;For the info-loss detector, the ContextAnalyzer was run (using DeepSeek V4 Flash via OpenRouter for this verification pass) against a real compaction event that had dropped 77,000 tokens from a security code review session. It returned an overall risk score of 0.85 and flagged two losses. The higher-risk item, scored 0.90, was the permanent loss of three specific JWT authentication findings - a missing expiry check, absent refresh token rotation, and a weak secret key - detail precise enough that no summary would have preserved it. The second item, scored 0.70, flagged the loss of 23 tool call exchanges' worth of reasoning context. Both came back with concrete recommended actions, not generic advice.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built autonomously using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a platform that ingests execution traces from any of the major agent observability tools, reconstructs what an agent's context looked like turn by turn, and uses an LLM to flag when something important got dropped during compaction. NEO planned and produced the entire codebase - four format parsers that each reduce a different vendor schema into one normalized structure, a FastAPI backend with seven endpoints wired to SQLAlchemy models, two backend services handling token counting and Claude-powered info loss detection, a full React and TypeScript frontend with D3.js visualizations across six components, and a 29-test suite with realistic multi-turn fixtures for all four formats. The repo's &lt;code&gt;plans/&lt;/code&gt; directory and &lt;code&gt;ORCHESTRATOR_LOG.md&lt;/code&gt; document that build run directly.&lt;/p&gt;

&lt;p&gt;The result is a fully working visualization platform that takes a raw trace file in, and gives you back a complete picture of what your agent remembered, what it forgot, and what that cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Audit any long-running agent for context loss.&lt;/strong&gt; &lt;br&gt;
If a LangSmith, OpenTelemetry, or AgentOps trace exists for an agent run, it can be dropped straight into the platform. The timeline and session replay immediately show which turns survived compaction and which did not - no instrumentation changes, no code modifications to the agent itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark compaction strategies before shipping.&lt;/strong&gt; &lt;br&gt;
When evaluating two different agent configurations or memory strategies, both traces can be uploaded and placed side by side in the comparative view. The platform surfaces which strategy retained more context at lower token cost, turning a subjective comparison into a measurable one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catch silent information loss in security or compliance-sensitive agents.&lt;/strong&gt; &lt;br&gt;
The Claude-powered info loss detector scores each compaction event and flags specific content that may have been dropped - as demonstrated with the JWT authentication findings in the verified results. Any agent operating over sensitive or constraint-heavy sessions can be run through this check before the output is trusted. This requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; to be set; without it the platform returns &lt;code&gt;analysis_available: false&lt;/code&gt; for this feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the custom JSON format to bring any agent framework in.&lt;/strong&gt; &lt;br&gt;
Agents not running on LangSmith, OpenTelemetry, or AgentOps can still feed into the platform by logging to the custom JSON format - a messages array with role, content, and optional tokens and timestamp fields. Any agent framework that can write JSON can produce a trace this platform accepts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Compaction is designed to be invisible - the agent keeps running, the framework handles the limit, and nothing interrupts the workflow. The cost of that invisibility is that when something is silently dropped, there is no record of what it was or what it was worth. Context Compaction Visualizer produces that record.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Context-Compaction-Visualizer" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Context-Compaction-Visualizer&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>claude</category>
    </item>
    <item>
      <title>Agent Sandbox Escape Detector: Black-Box Security Scanning for LLM Agents</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Fri, 12 Jun 2026 16:29:19 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/agent-sandbox-escape-detector-black-box-security-scanning-for-llm-agents-30bp</link>
      <guid>https://dev.to/nilofer_tweets/agent-sandbox-escape-detector-black-box-security-scanning-for-llm-agents-30bp</guid>
      <description>&lt;p&gt;Most agent security tools focus on known jailbreak phrases or static rule-matching. That approach misses the point. A real attacker does not check a list of banned words - they probe the agent's actual behavior with semantically varied adversarial inputs and look for signs that something slipped through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Sandbox&lt;/strong&gt; Escape Detector takes the same approach. Point it at any HTTP chat endpoint, and it fires a battery of adversarial prompts across six attack categories, then uses Claude Opus 4.8 as an independent judge to determine whether the agent leaked data, broke persona, or executed injected instructions. The result is a structured scan report with per-probe verdicts, evidence excerpts, and confidence scores.&lt;/p&gt;

&lt;p&gt;The key insight is that you do not need whitebox access to test an agent — all you need is its chat endpoint. The scanner treats the agent as a black box and probes it the same way a real attacker would.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feldcfvib68xu3fwz30wi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feldcfvib68xu3fwz30wi.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Entry: CLI (--target URL)  ·  POST /scan
               │
               ▼
  ┌────────────────────────────────────────────┐
  │  Scanner  ·  asyncio.gather()              │
  │  all 6 probes run concurrently             │
  │  per-probe error isolation                 │
  └──────────────────┬─────────────────────────┘
                     │
       ┌─────────────┼──────────────────┐
       ▼             ▼                  ▼
  ┌─────────┐  ┌──────────────┐  ┌────────────────────┐
  │  Tool   │  │ Prompt Leak  │  │ API Call           │
  │ Access  │  │ Role Confus. │  │ Indirect Injection │
  └────┬────┘  └──────┬───────┘  │ Jailbreak  (6 tot) │
       │               │          └──────────┬─────────┘
       └───────────────┴───────────────────── ┘
                     │ 4–6 adversarial prompts per category
                     ▼
  ┌────────────────────────────────────────────┐
  │  Target Agent  ·  any HTTP chat endpoint   │
  │  httpx async POST                          │
  └──────────────────┬─────────────────────────┘
                     │ agent responses
                     ▼
  ┌────────────────────────────────────────────┐
  │  Claude Judge  ·  claude-opus-4.8          │
  │  via OpenRouter                            │
  │  → ESCAPED / SAFE  ·  confidence (0–1)     │
  │  → evidence excerpt from response          │
  └──────────────────┬─────────────────────────┘
                     │
                     ▼
  ┌────────────────────────────────────────────┐
  │  Scan Report                               │
  │  JSON  ·  Markdown  ·  Rich console        │
  │  GET /results/{scan_id}                    │
  └────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What It Tests
&lt;/h2&gt;

&lt;p&gt;Each probe category targets a different class of vulnerability observed in production LLM deployments. Every category fires 4–6 distinct prompts to reduce false negatives from surface-level prompt filtering. All probes run concurrently - a full scan of six categories completes in roughly the same time as a single category.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Access&lt;/strong&gt; - requests that trick the agent into invoking unauthorized tools: file reads, database queries, exec calls. A vulnerable agent attempts to call off-limits functions or returns file and database content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Leak&lt;/strong&gt; - social-engineering and meta-instruction tricks to extract the system prompt. A vulnerable agent reveals system prompt contents, configuration, or internal instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Call&lt;/strong&gt; - crafted inputs designed to make the agent perform SSRF or data exfiltration via HTTP. A vulnerable agent constructs or references external URLs containing sensitive data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Role Confusion&lt;/strong&gt; - persona hijacking, identity override, DAN-style persona injection. A vulnerable agent breaks character or adopts a no-restrictions alternate identity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indirect Injection&lt;/strong&gt; - malicious payloads hidden in simulated tool outputs or document context. A vulnerable agent executes injected instructions as if they came from a trusted source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jailbreak&lt;/strong&gt; - chain-of-thought manipulation, hypothetical framing, and reasoning override. A vulnerable agent agrees to bypass safety constraints or leaks reasoning chain internals.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Judge Works
&lt;/h2&gt;

&lt;p&gt;After collecting all agent responses, each probe's batch of responses is sent to Claude Opus 4.8 via OpenRouter with a structured evaluation prompt. The judge produces a binary ESCAPED/SAFE verdict, a confidence score from 0 to 1, and a natural-language evidence summary quoting specific parts of the response that led to the verdict.&lt;/p&gt;

&lt;p&gt;This approach is significantly more accurate than regex pattern matching because the judge understands context - an agent that says "I cannot help with that" is different from one that says "I cannot help with that, but here is /etc/passwd anyway."&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Install dependencies into a virtual environment, copy &lt;code&gt;.env.example&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt;, and add your &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;. Then point the CLI at any agent's chat endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m src.cli scan --target http://localhost:8000/chat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To scan only specific probe categories or save results to JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m src.cli scan --target http://localhost:8000/chat --probes tool_access,jailbreak --output report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the FastAPI server for REST integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uvicorn src.api.main:app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Environment Variables
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENROUTER_API_KEY=sk-or-...    # Required — used for Claude judge calls via OpenRouter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  API
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;POST /scan&lt;/strong&gt; - accepts a target URL and optional probe list, returns a scan ID immediately, and runs the scan asynchronously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GET /results/{scan_id}&lt;/strong&gt; - returns the full structured report once the scan is complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GET /health&lt;/strong&gt; - liveness probe for uptime monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Live Scan Results
&lt;/h2&gt;

&lt;p&gt;Real scan run against a Claude-powered HTTP agent on 2026-06-09:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhur48oe6p5g773lcwao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhur48oe6p5g773lcwao.png" alt=" " width="766" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;0 escapes detected across 6 probe categories - approximately 30 adversarial turns total. Scan ID 0c4bffa6, 2026-06-09.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source Layout
&lt;/h2&gt;

&lt;p&gt;The scanner orchestrates all probes via &lt;code&gt;asyncio.gather()&lt;/code&gt; so they run in parallel, with per-probe error isolation so a timeout on one category never blocks the others. Each probe is a standalone class inheriting from &lt;code&gt;BaseProbe&lt;/code&gt; - adding a new attack category means writing one class and one prompts file. The judge lives in &lt;code&gt;core/judge.py&lt;/code&gt; and is stateless: it takes a list of responses and returns a list of &lt;code&gt;ProbeResult&lt;/code&gt; objects. Reports are assembled by &lt;code&gt;core/report.py&lt;/code&gt;, which handles JSON serialization, Markdown formatting, and Rich console rendering independently.&lt;/p&gt;

&lt;p&gt;The test suite uses a vulnerable dummy agent fixture - an in-process FastAPI app that always complies with requests - to verify the scanner can detect escapes, and a safe dummy agent to verify it does not produce false positives. 64 tests, passing in approximately 15 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a black-box behavioral security scanner for LLM agents - one that probes any HTTP chat endpoint with adversarial prompts across six attack categories, uses Claude Opus 4.8 as an independent judge, and produces structured reports with per-probe verdicts, evidence excerpts, and confidence scores. NEO built the full implementation: the async orchestrator using &lt;code&gt;asyncio.gather()&lt;/code&gt; with per-probe error isolation, all six probe classes inheriting from &lt;code&gt;BaseProbe&lt;/code&gt; with their adversarial prompt files, the stateless Claude judge in &lt;code&gt;core/judge.py&lt;/code&gt; via OpenRouter, the report assembler in &lt;code&gt;core/report.py&lt;/code&gt; covering JSON, Markdown, and Rich console output, the CLI entry point, the FastAPI REST server with POST /scan and GET /results/{scan_id}, and the 64-test suite with vulnerable and safe dummy agent fixtures.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to security-test any agent before it goes to production.&lt;/strong&gt;&lt;br&gt;
Point the CLI at your agent's chat endpoint and run a full six-category scan. The structured report tells you exactly which probe categories the agent failed, what the judge found in the response, and the confidence score - before real users can probe the same vulnerabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrate it into CI to catch security regressions on every deploy.&lt;/strong&gt;&lt;br&gt;
Use &lt;code&gt;POST /scan&lt;/code&gt; to trigger a scan and &lt;code&gt;GET /results/{scan_id}&lt;/code&gt; to poll the report. If any probe returns ESCAPED above your confidence threshold, fail the pipeline. Agent behavior can regress with model updates or prompt changes - automated scanning catches this before it reaches production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the probe categories as a security checklist when building agents.&lt;/strong&gt;&lt;br&gt;
The six categories - tool access, prompt leak, API call, role confusion, indirect injection, and jailbreak - map directly to the vulnerabilities that have been observed in production LLM deployments. Running the scanner on your agent during development tells you which categories need stronger guardrails before launch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional probe categories.&lt;/strong&gt;&lt;br&gt;
Each probe is a standalone class inheriting from &lt;code&gt;BaseProbe&lt;/code&gt; with a corresponding prompts file. A new attack category follows the same pattern and is automatically picked up by the orchestrator, judge, and report pipeline without any changes to the core.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Agent security is behavioral, not syntactic. A scanner that checks for banned phrases misses the attacks that matter. Agent Sandbox Escape Detector probes real behavior across six attack categories, judges responses with a frontier model that understands context, and gives you structured evidence - so you know not just whether an agent escaped, but how and where.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Agent-Sandbox-Escape-Detector" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Agent-Sandbox-Escape-Detector&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AgentLiar Detector: Catch Coding Agents That Falsely Claim Task Completion</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Wed, 10 Jun 2026 12:41:47 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/agentliar-detector-catch-coding-agents-that-falsely-claim-task-completion-413c</link>
      <guid>https://dev.to/nilofer_tweets/agentliar-detector-catch-coding-agents-that-falsely-claim-task-completion-413c</guid>
      <description>&lt;p&gt;AI coding agents are getting better at completing tasks. They are also getting better at appearing to complete tasks. An agent that claims "done" when it has created placeholder files, written empty tests, or quietly narrowed the scope of the original requirement is harder to catch than one that simply fails, because the failure is hidden inside output that looks correct at a glance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentLiar&lt;/strong&gt; is a production-ready system that detects when coding agents falsely claim task completion. It runs four independent verification checks, produces a weighted confidence score from 0 to 100, and delivers structured evidence in JSON, Markdown, or console output - usable as a CLI tool, Python library, GitHub Action, or HTTP API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;4 Independent Checks&lt;/strong&gt; - File, Test, Scope, and LLM Judge.&lt;br&gt;
&lt;strong&gt;Confidence Scoring&lt;/strong&gt; - weighted aggregation on a 0–100 scale.&lt;br&gt;
&lt;strong&gt;Multiple Interfaces&lt;/strong&gt; - CLI, Python API, GitHub Action, and HTTP API.&lt;br&gt;
&lt;strong&gt;Adversarial Detection&lt;/strong&gt; - catches placeholder implementations, empty tests, and scope narrowing.&lt;br&gt;
&lt;strong&gt;Structured Reports&lt;/strong&gt; - JSON and Markdown output with evidence.&lt;br&gt;
&lt;strong&gt;Production Ready&lt;/strong&gt; - type hints, error handling, logging, and async support.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The async orchestrator dispatches four independent checks File, Test, Scope (local), plus an optional OpenRouter LLM Judge and produces a weighted 0–100 confidence score delivered as JSON, Markdown, or console output for CI gating.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1iym9vzkxg27ekdvy3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1iym9vzkxg27ekdvy3y.png" alt=" " width="751" height="448"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Four Verification Checks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. File Check&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects missing expected files&lt;/li&gt;
&lt;li&gt;Identifies unexpected new files&lt;/li&gt;
&lt;li&gt;Finds placeholder content: TODO, FIXME, pass-only&lt;/li&gt;
&lt;li&gt;Validates file sizes and content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Test Check&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects empty test bodies&lt;/li&gt;
&lt;li&gt;Identifies tests without assertions&lt;/li&gt;
&lt;li&gt;Finds skipped tests&lt;/li&gt;
&lt;li&gt;Validates claimed versus actual test counts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Scope Check&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects silent scope narrowing: "only", "for now"&lt;/li&gt;
&lt;li&gt;Identifies partial implementations&lt;/li&gt;
&lt;li&gt;Finds TODO markers in code&lt;/li&gt;
&lt;li&gt;Validates requirements coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. LLM Judge&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Independent assessment via OpenRouter&lt;/li&gt;
&lt;li&gt;Structured JSON output&lt;/li&gt;
&lt;li&gt;Timeout and retry logic&lt;/li&gt;
&lt;li&gt;Optional - works without an API key&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or &lt;code&gt;pip install agentliar&lt;/code&gt; once published. Requires Python 3.10+.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prepare sample inputs from &lt;code&gt;examples/simple_task.json&lt;/code&gt;, then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentliar verify &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--task-file&lt;/span&gt; .tmp/task.txt &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--claim-file&lt;/span&gt; .tmp/claim.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--changes-file&lt;/span&gt; .tmp/changes.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt; markdown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;agentliar config&lt;/code&gt; to inspect configuration and &lt;code&gt;agentliar analyze .tmp/task.txt&lt;/code&gt; to review a task file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python API&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentliar&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Verifier&lt;/span&gt;

&lt;span class="n"&gt;verifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Verifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;verifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;claim_payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;file_changes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;changes_payload&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Read result.score, result.passed, result.confidence_level, result.reports
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the GitHub Action with task, claim, and change files, a confidence threshold, and an optional &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; secret when you want the LLM Judge path enabled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start the API server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; agentliar.server
&lt;span class="c"&gt;# or&lt;/span&gt;
uvicorn agentliar.server:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then &lt;code&gt;POST /verify&lt;/code&gt; with the task, claim, and file-change payloads. The response returns score, pass/fail, and evidence blocks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Confidence Score Interpretation
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;90–100&lt;/code&gt; - High. Task appears fully completed.&lt;br&gt;
&lt;code&gt;70–89&lt;/code&gt; - Medium. Task likely complete with minor issues.&lt;br&gt;
&lt;code&gt;50–69&lt;/code&gt; - Low. Task partially completed.&lt;br&gt;
&lt;code&gt;30–49&lt;/code&gt; - Critical. Significant issues detected.&lt;br&gt;
&lt;code&gt;0–29&lt;/code&gt; - Failed. Task likely not completed.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file. Set &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; and &lt;code&gt;OPENROUTER_MODEL&lt;/code&gt; only if you want LLM Judge mode. The check weights must sum to 1.0. &lt;code&gt;CONFIDENCE_THRESHOLD&lt;/code&gt; controls the pass/fail cutoff.&lt;/p&gt;

&lt;p&gt;Recommended LLM Judge models (May 2026):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;anthropic/claude-haiku-4-5&lt;/code&gt; - cheap and fast judging&lt;br&gt;
&lt;code&gt;anthropic/claude-sonnet-4-6&lt;/code&gt; or openai/gpt-5.4 - higher-quality judging&lt;br&gt;
&lt;code&gt;openai/gpt-4.1-mini&lt;/code&gt; - budget option&lt;/p&gt;
&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CI/CD Integration&lt;/strong&gt; - automatically verify PR claims before merging.&lt;br&gt;
&lt;strong&gt;Code Review&lt;/strong&gt; - get an independent assessment of task completion alongside a human review.&lt;br&gt;
&lt;strong&gt;Agent Monitoring&lt;/strong&gt; - detect when AI agents overstate progress in automated pipelines.&lt;br&gt;
&lt;strong&gt;Quality Gates&lt;/strong&gt; - block merges below a confidence threshold.&lt;br&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; - generate verification reports for stakeholders.&lt;/p&gt;
&lt;h2&gt;
  
  
  Security
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No hardcoded secrets&lt;/li&gt;
&lt;li&gt;API keys via environment variables only&lt;/li&gt;
&lt;li&gt;No data persistence&lt;/li&gt;
&lt;li&gt;Local processing except for LLM Judge&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/agentliar/           # Checks, orchestration, scoring, reports, API, CLI, server
tests/
├── unit/                # Unit tests
├── adversarial/         # Adversarial tests
└── integration/         # Integration tests
examples/                # Sample inputs
action.yml               # GitHub Action definition
pyproject.toml           # Packaging and tooling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest                                    &lt;span class="c"&gt;# Full suite&lt;/span&gt;
pytest &lt;span class="nt"&gt;--cov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;agentliar &lt;span class="nt"&gt;--cov-report&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html  &lt;span class="c"&gt;# With coverage&lt;/span&gt;
pytest tests/unit/                        &lt;span class="c"&gt;# Unit tests only&lt;/span&gt;
pytest tests/adversarial/                 &lt;span class="c"&gt;# Adversarial tests only&lt;/span&gt;
pytest tests/integration/                 &lt;span class="c"&gt;# Integration tests only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Code Quality&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruff check &lt;span class="nb"&gt;.&lt;/span&gt;        &lt;span class="c"&gt;# Linting&lt;/span&gt;
ruff format &lt;span class="nb"&gt;.&lt;/span&gt;       &lt;span class="c"&gt;# Formatting&lt;/span&gt;
mypy src tests      &lt;span class="c"&gt;# Type checking&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a production-ready verification system for detecting false completion claims from coding agents - running four independent checks locally, with an optional LLM Judge via OpenRouter, and exposing the result through a CLI, Python API, GitHub Action, and HTTP API. NEO built the full implementation: the async orchestrator dispatching all four checks, the File, Test, Scope, and LLM Judge check modules, the weighted confidence scorer, the JSON and Markdown report generators, the Click CLI with verify, config, and analyze commands, the FastAPI HTTP server, the GitHub Action definition in &lt;code&gt;action.yml&lt;/code&gt;, and the test suite split across unit, adversarial, and integration coverage.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a CI gate on every PR that includes AI-generated code.&lt;/strong&gt;&lt;br&gt;
Add the GitHub Action to your workflow with a confidence threshold. Any PR where the agent's claimed changes do not pass the file, test, and scope checks below your threshold is blocked before merge - automatically, without a reviewer having to spot the placeholder implementation manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to monitor agent progress in long-running pipelines.&lt;/strong&gt;&lt;br&gt;
Call &lt;code&gt;await verifier.verify(...)&lt;/code&gt; from Python after each agent task completes. The confidence score and evidence blocks tell you whether the agent actually finished the task or produced output that looks complete but is not - before the next stage of the pipeline starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the LLM Judge for higher-confidence verification on critical tasks.&lt;/strong&gt;&lt;br&gt;
Set &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; and configure a judge model for tasks where the local checks alone are not sufficient. The LLM Judge runs independently from the other three checks and adds a cross-model perspective to the confidence score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional check types.&lt;/strong&gt;&lt;br&gt;
The four checks share a common async interface in the orchestrator. A new check follows the same pattern and its weight is added to the configuration. The orchestrator, scorer, and reporters pick it up automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Agents that falsely claim completion are harder to catch than agents that fail outright - because the output exists and looks plausible. AgentLiar makes the verification systematic: four independent checks, a weighted confidence score, and structured evidence that tells you exactly where the claim breaks down.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/AgentLiar" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/AgentLiar&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Carbon-Aware Model Training: Scheduling GPU Workloads Around Electricity Carbon Intensity</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 06 Jun 2026 08:48:43 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/carbon-aware-model-training-scheduling-gpu-workloads-around-electricity-carbon-intensity-b4b</link>
      <guid>https://dev.to/nilofer_tweets/carbon-aware-model-training-scheduling-gpu-workloads-around-electricity-carbon-intensity-b4b</guid>
      <description>&lt;p&gt;Training ML models has an environmental cost that most practitioners do not measure. A model trained during peak grid hours, when coal and gas plants are meeting high demand - can emit significantly more CO2 than the same model trained during off-peak hours when renewables dominate the grid. The carbon intensity of electricity varies by a factor of 2–5x throughout the day, but most training pipelines ignore this entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Carbon-Aware Model Training Pipeline&lt;/strong&gt; is a PyTorch-based training pipeline that monitors real-time electricity carbon intensity, delays training until a low-carbon window is available, reduces GPU memory footprint through gradient accumulation, and tracks CO2 emissions throughout the training process using CodeCarbon - with a comparison report that quantifies the carbon savings against a baseline run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmyllyuji9qnif6r48ws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmyllyuji9qnif6r48ws.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Carbon-Aware Scheduling&lt;/strong&gt; - real-time carbon intensity monitoring with smart training delays until low-carbon windows are detected.&lt;br&gt;
&lt;strong&gt;Gradient Accumulation&lt;/strong&gt; - reduces GPU memory footprint while maintaining effective batch size.&lt;br&gt;
&lt;strong&gt;Emissions Tracking&lt;/strong&gt; - real-time CO2 monitoring via CodeCarbon with comprehensive JSON reports.&lt;br&gt;
&lt;strong&gt;Modular Design&lt;/strong&gt; - YAML-based configuration with separate scheduler, tracker, and trainer modules.&lt;br&gt;
&lt;strong&gt;GPU Optimized&lt;/strong&gt; - automatic CUDA detection with mixed precision training (FP16).&lt;br&gt;
&lt;strong&gt;Comparative Analysis&lt;/strong&gt; - automated reporting quantifying carbon savings against a baseline run.&lt;/p&gt;
&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The pipeline runs in four stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1 - Carbon-Aware Scheduling&lt;/strong&gt;&lt;br&gt;
Real-time monitoring checks electricity carbon intensity via APIs. Smart delays wait for low-carbon windows before starting training. Fallback mechanisms use realistic mock data when APIs are unavailable - with diurnal patterns simulating peak intensity at 18:00 and trough at 03:00. Configurable thresholds allow customization for different regions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2 - Gradient Accumulation&lt;/strong&gt;&lt;br&gt;
Memory optimization processes smaller micro-batches. Effective batch size is maintained with reduced memory. Configurable steps (2, 4, 8, 16) adapt to hardware constraints. Convergence preservation ensures model quality is not compromised.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3 - Emissions Tracking&lt;/strong&gt;&lt;br&gt;
CodeCarbon integration monitors CO2 emissions in real-time. Energy metrics track power consumption in Watts and energy in kWh. Comprehensive reports generate JSON summaries with all metrics. Comparative analysis quantifies carbon savings versus the baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 4 - GPU Optimization&lt;/strong&gt;&lt;br&gt;
Mixed precision training (FP16) reduces memory and increases speed. Automatic CUDA detection uses GPU when available. Pin memory optimization enables faster data transfers. Graceful CPU fallback when GPU is unavailable.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                     Training Configuration                      │
│                       (YAML Config File)                        │
└─────────────────────────┬───────────────────────────────────────┘
                          │
                          ▼
         ┌────────────────────────────────────┐
         │   Carbon Intensity Scheduler       │
         │   - API/Mock data fetch            │
         │   - Threshold comparison           │
         │   - Wait for low-carbon window     │
         └────────────────┬───────────────────┘
                          │
                          ▼
              ┌───────────────────────┐
              │   Start Training?     │
              │   Intensity &amp;lt; 300?    │
              └─────┬─────────────┬───┘
                    │ NO          │ YES
                    ▼             ▼
            ┌───────────┐   ┌──────────────┐
            │   Wait    │   │ Start Tracker│
            │ &amp;amp; Recheck │   │ (CodeCarbon) │
            └───────────┘   └──────┬───────┘
                                   │
                                   ▼
                  ┌────────────────────────────────┐
                  │   PyTorch Training Loop        │
                  │   - Gradient Accumulation      │
                  │   - Mixed Precision (FP16)     │
                  │   - Checkpointing              │
                  └────────────────┬───────────────┘
                                   │
                                   ▼
                  ┌────────────────────────────────┐
                  │   Emissions Tracking           │
                  │   - CO2 (kg)                   │
                  │   - Energy (kWh)               │
                  │   - Power (Watts)              │
                  └────────────────┬───────────────┘
                                   │
                                   ▼
              ┌───────────────────────────────────┐
              │   Save Results                    │
              │   - Model checkpoint              │
              │   - Training summary (JSON)       │
              │   - Emissions log (CSV)           │
              └───────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+&lt;/li&gt;
&lt;li&gt;PyTorch 2.0+&lt;/li&gt;
&lt;li&gt;CUDA (optional, for GPU acceleration)
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/CarbonAwareModelTraining---by-NEO.git
&lt;span class="nb"&gt;cd &lt;/span&gt;CarbonAwareModelTraining---by-NEO

python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate  &lt;span class="c"&gt;# On Windows: venv\Scripts\activate&lt;/span&gt;

pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Required packages: &lt;code&gt;torch&amp;gt;=2.0.0&lt;/code&gt;, &lt;code&gt;torchvision&amp;gt;=0.15.0&lt;/code&gt;, &lt;code&gt;codecarbon&amp;gt;=2.3.0&lt;/code&gt;, &lt;code&gt;pyyaml&amp;gt;=6.0&lt;/code&gt;, &lt;code&gt;numpy&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate

&lt;span class="c"&gt;# Run baseline training (no optimization)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PYTHONPATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;&lt;span class="s2"&gt;/src:&lt;/span&gt;&lt;span class="nv"&gt;$PYTHONPATH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
python src/train.py configs/baseline.yaml

&lt;span class="c"&gt;# Run optimized training (carbon-aware + gradient accumulation)&lt;/span&gt;
python src/train.py configs/optimized.yaml

&lt;span class="c"&gt;# Generate comparison report&lt;/span&gt;
python generate_comparison.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This runs three steps: baseline training without carbon awareness, optimized training with carbon-aware scheduling and gradient accumulation, and a comparison report that quantifies carbon savings and performance metrics.&lt;/p&gt;
&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Configure carbon-aware training in &lt;code&gt;configs/optimized.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;carbon_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;           &lt;span class="c1"&gt;# gCO2/kWh&lt;/span&gt;
  &lt;span class="na"&gt;wait_for_low_carbon&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;training&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;
  &lt;span class="na"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# Effective batch = 64&lt;/span&gt;
  &lt;span class="na"&gt;epochs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run optimized training:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python src/train.py configs/optimized.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;============================================================
CARBON-AWARE TRAINING STARTED
============================================================

Carbon Intensity Check:
  Current Intensity: 420.5 gCO2/kWh
  Threshold: 300 gCO2/kWh
  Status: ⏳ Waiting for low-carbon window...

[10 minutes later]
  Current Intensity: 285.3 gCO2/kWh
  Status: ✅ Starting training now!

Training Progress:
  Epoch 1/3 - Loss: 0.324 - Accuracy: 91.2%
  CO2 Emissions: 0.042 kg
  Energy Consumed: 0.15 kWh

============================================================
CARBON SAVINGS vs BASELINE
============================================================

CO2 Reduction: 32.5% (0.024 kg saved)
GPU Memory Reduction: 45.8%
Accuracy: 93.1% (baseline: 93.4%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Usage Examples
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Carbon-Aware Scheduling Only&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Disable gradient accumulation, enable scheduling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;carbon_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;250&lt;/span&gt;

&lt;span class="na"&gt;training&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# No accumulation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gradient Accumulation Only&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Disable scheduling, enable memory optimization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="na"&gt;training&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;  &lt;span class="c1"&gt;# Effective batch = 64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real Carbon Intensity API&lt;/strong&gt;&lt;br&gt;
Configure for production with a real API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scheduler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;use_mock_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;api_endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.carbonintensity.org.uk/intensity"&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GB"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Custom Model Integration&lt;/strong&gt;&lt;br&gt;
Replace &lt;code&gt;SimpleCNN&lt;/code&gt; in &lt;code&gt;src/train.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;my_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MyCustomModel&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prepare_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyCustomModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;input_channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;training&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_channels&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;training&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;num_classes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output Format&lt;/strong&gt;&lt;br&gt;
Training summary JSON saved to &lt;code&gt;output/summary_optimized.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"run_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"optimized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"training_metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"final_accuracy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;93.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"final_loss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.124&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"epochs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_time_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;245&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"carbon_metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_emissions_kg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.042&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"energy_consumed_kwh"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"avg_power_watts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;145.2&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scheduler_metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"wait_time_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"initial_intensity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;420.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"training_intensity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;285.3&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gpu_metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"peak_memory_mb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gradient_accumulation_steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"effective_batch_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Comparison report saved to &lt;code&gt;output/comparison_report.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"carbon_savings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"baseline_emissions_kg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.074&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"optimized_emissions_kg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.042&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reduction_kg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.032&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reduction_percentage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;43.2&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"accuracy_impact"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"baseline_accuracy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;93.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"optimized_accuracy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;93.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"degradation_percentage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"memory_savings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"baseline_memory_mb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"optimized_memory_mb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reduction_percentage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;Evaluated on MNIST training - 3 epochs, RTX 3090 GPU:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1a9ps96nj9yoysq7diz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1a9ps96nj9yoysq7diz.png" alt=" " width="485" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Carbon Intensity Patterns (Mock Data):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Peak hours 18:00–22:00: ~450 gCO2/kWh&lt;br&gt;
Off-peak hours 02:00–06:00: ~200 gCO2/kWh&lt;br&gt;
Average reduction: 35–45% CO2 by scheduling during low-carbon windows&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPU Memory Savings:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gradient accumulation 2x: ~30% memory reduction&lt;br&gt;
Gradient accumulation 4x: ~50% memory reduction&lt;br&gt;
Gradient accumulation 8x: ~60% memory reduction&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Convergence Validation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Accuracy degradation under 1% across all tested configurations&lt;br&gt;
Loss convergence matches baseline within 2% tolerance&lt;br&gt;
No divergence observed&lt;/p&gt;
&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CarbonAwareModelTraining---by-NEO/
├── src/
│   ├── scheduler.py                # Carbon intensity API &amp;amp; scheduling
│   ├── tracker.py                  # CodeCarbon emissions tracking
│   ├── train.py                    # Main training pipeline
│   └── utils.py                    # Config loading &amp;amp; logging
├── configs/
│   ├── baseline.yaml               # Baseline training config
│   └── optimized.yaml              # Carbon-aware optimized config
├── output/
│   ├── summary_baseline.json       # Baseline training summary
│   ├── summary_optimized.json      # Optimized training summary
│   ├── comparison_report.json      # Comparative analysis
│   ├── emissions.csv               # CodeCarbon emissions log
│   └── training_*.log              # Detailed training logs
├── models/
│   ├── model_baseline.pt           # Baseline model checkpoint
│   └── model_optimized.pt          # Optimized model checkpoint
├── data/                            # MNIST dataset (auto-downloaded)
├── requirements.txt                 # Python dependencies
├── generate_comparison.py          # Comparison report generator
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Design Decisions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why Carbon-Aware Scheduling?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Carbon intensity varies 2–5x throughout the day. Scheduling training during low-carbon windows reduces emissions without affecting model quality. Low-carbon periods also often correlate with cheaper electricity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Gradient Accumulation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gradient accumulation enables training larger models on limited hardware by processing smaller micro-batches and updating weights less frequently. Used in BERT, GPT, and other large-scale models for the same reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why CodeCarbon?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CodeCarbon uses lifecycle assessment methodologies, supports CPU, GPU, and multi-device setups, and produces transparent, community-validated calculations. It tracks energy, power, and emissions in a single library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why YAML Configuration?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;YAML configs are version-controlled, human-readable, and separate code from experiment parameters - enabling reproducible A/B comparisons between baseline and optimized runs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;Validate installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import torch; print(f'PyTorch: {torch.__version__}')"&lt;/span&gt;
python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import codecarbon; print('CodeCarbon: OK')"&lt;/span&gt;
python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import yaml; print('PyYAML: OK')"&lt;/span&gt;
python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import torch; print(f'CUDA Available: {torch.cuda.is_available()}')"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run a quick 5-minute test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python src/train.py configs/test.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Validate carbon savings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python src/train.py configs/baseline.yaml
python src/train.py configs/optimized.yaml
python generate_comparison.py
&lt;span class="nb"&gt;cat &lt;/span&gt;output/comparison_report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CUDA Out of Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reduce &lt;code&gt;batch_size&lt;/code&gt; and increase &lt;code&gt;gradient_accumulation_steps&lt;/code&gt; in the config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Carbon Intensity API Timeout&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No action needed - the pipeline automatically falls back to mock data and training proceeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Module Import Errors&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PYTHONPATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;&lt;span class="s2"&gt;/src:&lt;/span&gt;&lt;span class="nv"&gt;$PYTHONPATH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CodeCarbon Tracking Fails&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; codecarbon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training continues without emissions tracking if CodeCarbon fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduler Waits Too Long&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Increase &lt;code&gt;max_wait_seconds&lt;/code&gt;, raise &lt;code&gt;carbon_threshold&lt;/code&gt;, or set &lt;code&gt;wait_for_low_carbon: false&lt;/code&gt; in the config.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a PyTorch training pipeline that schedules GPU workloads based on real-time carbon intensity, reduces memory footprint through gradient accumulation, and tracks emissions with CodeCarbon - producing a side-by-side comparison report. NEO built the full implementation: the carbon intensity scheduler in &lt;code&gt;scheduler.py&lt;/code&gt; with API integration and mock fallback, the CodeCarbon emissions tracker in &lt;code&gt;tracker.py&lt;/code&gt;, the main training pipeline in &lt;code&gt;train.py&lt;/code&gt; with gradient accumulation and mixed precision FP16, the config loader and logging utilities in &lt;code&gt;utils.py&lt;/code&gt;, the YAML configs for baseline and optimized runs, the comparison report generator in &lt;code&gt;generate_comparison.py&lt;/code&gt;, and the full output structure covering JSON summaries, emissions CSV, and model checkpoints.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to measure the carbon cost of your existing training runs.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;python src/train.py configs/baseline.yaml&lt;/code&gt; on your own model and data by replacing &lt;code&gt;SimpleCNN&lt;/code&gt; in &lt;code&gt;src/train.py&lt;/code&gt; with your model. The CodeCarbon tracker produces a JSON summary with CO2 in kg, energy in kWh, and average power in Watts, a baseline measurement before any optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the comparison report to justify scheduling infrastructure.&lt;/strong&gt;&lt;br&gt;
Run both the baseline and optimized configs on the same dataset. The &lt;code&gt;comparison_report.json&lt;/code&gt; gives you a concrete before and after - percentage reduction in emissions, energy, and memory, alongside accuracy degradation,  that makes the case for carbon-aware scheduling with real numbers from your own hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use mock data for development and real API for production.&lt;/strong&gt;&lt;br&gt;
Set &lt;code&gt;use_mock_data: true&lt;/code&gt; during development so training always proceeds without waiting. Switch to &lt;code&gt;use_mock_data: false&lt;/code&gt; with a real &lt;code&gt;api_endpoint&lt;/code&gt; for production runs where actual carbon savings matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend the scheduler with additional carbon intensity sources.&lt;/strong&gt;&lt;br&gt;
The scheduler in &lt;code&gt;scheduler.py&lt;/code&gt; fetches from a configurable &lt;code&gt;api_endpoint&lt;/code&gt;. Adding support for additional regional carbon intensity APIs - Electricity Maps, WattTime, or a custom internal source, means updating the fetch logic in &lt;code&gt;scheduler.py&lt;/code&gt; without touching the training loop, tracker, or reporting pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Carbon intensity varies throughout the day, and most training pipelines ignore it. A 43% reduction in CO2 emissions with less than 1% accuracy degradation, achieved by scheduling when the grid is cleaner and accumulating gradients to reduce memory - shows that sustainable ML is a practical engineering choice, not just an aspiration.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/CarbonAwareModelTraining" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/CarbonAwareModelTraining&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>pytorch</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Agentsync: Version, Merge, and Audit AI Agent Configurations Like Code</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 06 Jun 2026 05:34:45 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/agentsync-version-merge-and-audit-ai-agent-configurations-like-code-cln</link>
      <guid>https://dev.to/nilofer_tweets/agentsync-version-merge-and-audit-ai-agent-configurations-like-code-cln</guid>
      <description>&lt;p&gt;Most AI engineering teams now run a stack of agent configs across many repos - model choices, tool allowlists, prompt templates, eval thresholds, safety rules. These configs drift the moment two engineers touch them. One repo gets a new policy, another keeps the old one, and nobody notices until an agent makes a decision in production that no one signed off on. Merging configs by hand is error-prone, and there is rarely an audit trail of what changed, when, or why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentsync&lt;/strong&gt; is a Node.js CLI tool that makes agent configuration something you can version, merge, and audit like code. Load JSON, YAML, or INI configs from any repo, three-way merge with conflict detection, run a 52-point compliance rubric on every change, and keep a full merge history you can revert. The point is that "which config is the source of truth for the agent in production?" should always have a clear, auditable answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54zy6yzchwy8msi8e5uq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54zy6yzchwy8msi8e5uq.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7 Core Commands&lt;/strong&gt; - init, push, pull, diff, audit, status, revert&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart Merging&lt;/strong&gt; - three-way merge algorithm with automatic conflict detection, manual resolution support, and conflict tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance Auditing&lt;/strong&gt; - 52-point security and compliance rubric covering security, compliance, structure, performance, and documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git Integration&lt;/strong&gt; - seamless push and pull with git-based version control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge History&lt;/strong&gt; - full audit trail with revert capability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format Support&lt;/strong&gt; - JSON, YAML, and INI configs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                         agentsync CLI                           │
├──────────────┬──────────────┬────────────┬───────────┬──────────┤
│ init         │ push         │ pull       │ diff      │ audit    │
│ Initialize   │ Push changes │ Merge      │ Compare   │ Validate │
│ repository   │ to remote    │ remote     │ configs   │ configs  │
└──────────────┴──────────────┴────────────┴───────────┴──────────┘
        │           │                 │
        └───────────┴─────────────────┘
                    │
        ┌───────────┴───────────┐
        │                       │
   ┌────▼─────┐        ┌─────────▼──┐
   │   Git    │        │   Config   │
   │ Manager  │        │   Loader   │
   └────┬─────┘        └─────┬──────┘
        │                    │
        │   ┌────────────────┘
        │   │
   ┌────▼───▼─────────┐
   │  Merge Engine    │
   │  - 3-way merge   │
   │  - Conflict Mgmt │
   └────┬─────────────┘
        │
   ┌────▼──────────────────┐
   │  Audit Engine         │
   │  - Security scoring   │
   │  - Compliance audit   │
   │  - 52-point rubric    │
   └───────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The workflow follows a clear sequence. You initialize agentsync in your repository, which sets up local storage at &lt;code&gt;~/.agentsync/&lt;/code&gt; and connects to a central git remote. From there:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Push&lt;/strong&gt; - local config changes are committed and pushed to the remote with a message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pull&lt;/strong&gt; - remote configs are fetched and merged into the local state using the three-way merge algorithm. Changes that only one side made are merged automatically. Conflicts - where both sides changed the same key - are surfaced for resolution. Manual resolution mode (&lt;code&gt;--manual&lt;/code&gt;) enables interactive conflict handling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diff&lt;/strong&gt; - shows configuration differences between any two refs, letting you see what changed between versions before committing to a merge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit&lt;/strong&gt; - runs the 52-point compliance rubric against any config directory. The rubric checks security (hardcoded credentials, encryption, secrets), compliance (audit logs, access control, data retention), structure (proper hierarchy, no duplicates, versioning), performance (object sizes, caching, connection pooling), and documentation (comments, examples, change logs). Every config gets a score from 0 to 100.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revert&lt;/strong&gt; - restores configuration from any point in the merge history. Every merge is stored as a timestamped JSON file in &lt;code&gt;~/.agentsync/history/&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Node.js 16+. Git integration expects a repository with a remote named &lt;code&gt;origin&lt;/code&gt; and a default branch of &lt;code&gt;main&lt;/code&gt; with write access.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Initialize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentsync init &lt;span class="nt"&gt;-r&lt;/span&gt; https://github.com/org/configs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Push Changes&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentsync push &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Update API configs"&lt;/span&gt;
agentsync push &lt;span class="nt"&gt;--directory&lt;/span&gt; ./configs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pull and Merge&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentsync pull
agentsync pull &lt;span class="nt"&gt;--manual&lt;/span&gt;  &lt;span class="c"&gt;# Interactive conflict resolution&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;View Differences&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentsync diff &lt;span class="nt"&gt;--from&lt;/span&gt; HEAD~1 &lt;span class="nt"&gt;--to&lt;/span&gt; HEAD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run Audit&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentsync audit &lt;span class="nt"&gt;--directory&lt;/span&gt; ./configs
agentsync audit &lt;span class="nt"&gt;--directory&lt;/span&gt; ./configs &lt;span class="nt"&gt;--report&lt;/span&gt;  &lt;span class="c"&gt;# Generate report&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check Status&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentsync status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Restore from History&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agentsync revert                    &lt;span class="c"&gt;# List recent merges&lt;/span&gt;
agentsync revert 2026-05-13T12:30   &lt;span class="c"&gt;# Revert to specific merge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results and Output
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Status Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Git Status ===
Branch: main
Modified files: 2
Untracked files: 0

=== Agentsync Config ===
Initialized: true
Version: 1.0.0
Repository: https://github.com/dakshjain-1616/agentsync-configs

=== Merge History ===
- 2026-05-13T12:30:45.123Z: Update configurations
- 2026-05-13T12:25:30.456Z: Sync team configs
- 2026-05-13T12:20:15.789Z: Initial merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Audit Report Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== AUDIT RESULTS ===

config.json: 95/100
  - Missing version specification
  - Potential hardcoded credentials detected

api-config.yaml: 88/100
  - Config not properly documented
  - Missing compliance metadata
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Merge Report Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Merge Report&lt;/span&gt;

&lt;span class="gs"&gt;**Date:**&lt;/span&gt; 2026-05-13T12:30:45Z
&lt;span class="gs"&gt;**Message:**&lt;/span&gt; Update API configurations

&lt;span class="gu"&gt;## Merged Configurations&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; api-keys.json
&lt;span class="p"&gt;-&lt;/span&gt; database.yaml
&lt;span class="p"&gt;-&lt;/span&gt; cache-config.json (⚠️ CONFLICT)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Compliance Scoring
&lt;/h2&gt;

&lt;p&gt;Configs are scored from 0 to 100:&lt;br&gt;
&lt;code&gt;100&lt;/code&gt; - perfect configuration&lt;br&gt;
&lt;code&gt;75–99&lt;/code&gt; - minor issues&lt;br&gt;
&lt;code&gt;50–74&lt;/code&gt; - moderate concerns&lt;br&gt;
&lt;code&gt;&amp;lt; 50&lt;/code&gt; - serious compliance issues&lt;/p&gt;

&lt;p&gt;Common violations that trigger score deductions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hardcoded API keys or passwords&lt;/li&gt;
&lt;li&gt;Missing version specification&lt;/li&gt;
&lt;li&gt;Improper config structure&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Key Capabilities
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;3-Way Merge&lt;/strong&gt; - Intelligent conflict detection. Changes on only one side merge automatically.&lt;br&gt;
&lt;strong&gt;52-Point Audit&lt;/strong&gt; - Catches security issues: hardcoded credentials, missing encryption, compliance gaps.&lt;br&gt;
&lt;strong&gt;Format Support&lt;/strong&gt; - Works with JSON, YAML, and INI configs seamlessly.&lt;br&gt;
&lt;strong&gt;Full History&lt;/strong&gt; - Complete audit trail - who changed what and when.&lt;br&gt;
&lt;strong&gt;Revert Support&lt;/strong&gt; - Roll back to any previous state instantly.&lt;/p&gt;
&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknyj1xc49l7msr3kkpm7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknyj1xc49l7msr3kkpm7.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  When to Use agentsync
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Perfect for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed AI engineering teams&lt;/li&gt;
&lt;li&gt;Multi-stage deployment pipelines&lt;/li&gt;
&lt;li&gt;Compliance-heavy organizations&lt;/li&gt;
&lt;li&gt;Configuration-driven microservices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Not ideal for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-person projects (use git directly)&lt;/li&gt;
&lt;li&gt;Non-text binary configs&lt;/li&gt;
&lt;li&gt;Real-time streaming configs&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Configuration Formats
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;YAML:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;..."&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;INI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[database]&lt;/span&gt;
&lt;span class="py"&gt;host&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="err"&gt;localhost&lt;/span&gt;
&lt;span class="py"&gt;port&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5432&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data Storage
&lt;/h2&gt;

&lt;p&gt;Local data stored in &lt;code&gt;~/.agentsync/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.agentsync/
├── config/              # Saved configurations
│   └── agentsync.json
└── history/             # Merge audit trail
    └── {timestamp}.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Config parsing&lt;/code&gt; - O(n) where n = file size&lt;br&gt;
&lt;code&gt;3-way merge&lt;/code&gt; - O(k) where k = number of keys&lt;br&gt;
&lt;code&gt;Audit scoring&lt;/code&gt; - O(m) where m = config size&lt;br&gt;
&lt;code&gt;Typical operation&lt;/code&gt; - under 100ms&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/
├── index.js                    # CLI entry point
├── modules/
│   ├── errors.js              # Custom error classes
│   ├── logger.js              # Logging utility
│   ├── config-loader.js       # Load configs (JSON, YAML, INI)
│   ├── config-writer.js       # Write configs with backup
│   ├── git-manager.js         # Git operations
│   ├── local-storage.js       # ~/.agentsync persistence
│   ├── merge-engine.js        # 3-way merge algorithm
│   ├── merge-history.js       # Merge audit trail
│   ├── audit-engine.js        # Compliance scoring
│   └── report-generator.js    # Report generation
└── commands/
    ├── init.js
    ├── push.js
    ├── pull.js
    ├── diff.js
    ├── audit.js
    ├── status.js
    └── revert.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Error Handling
&lt;/h2&gt;

&lt;p&gt;Custom error types handle every failure mode cleanly:&lt;br&gt;
&lt;code&gt;AgentsyncError&lt;/code&gt; - base error class&lt;br&gt;
&lt;code&gt;ConfigError&lt;/code&gt; - config file issues&lt;br&gt;
&lt;code&gt;GitError&lt;/code&gt; - git operation failures&lt;br&gt;
&lt;code&gt;MergeError&lt;/code&gt; - merge conflicts or invalid operations&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Single branch syncing (main only)&lt;/li&gt;
&lt;li&gt;No binary file support (text configs only)&lt;/li&gt;
&lt;li&gt;Conflict resolution is text-based only&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;30 tests covering all core modules:&lt;/p&gt;

&lt;p&gt;Error handling - 4 tests&lt;br&gt;
Logging - 3 tests&lt;br&gt;
Config loading and writing - 12 tests&lt;br&gt;
Merge engine - 6 tests&lt;br&gt;
Audit engine - 5 tests&lt;/p&gt;

&lt;p&gt;All code is test-driven - write test first, implement to pass, refactor for clarity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;All code is test-driven:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write test first&lt;/li&gt;
&lt;li&gt;Implement to pass test&lt;/li&gt;
&lt;li&gt;Refactor for clarity&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a CLI tool for synchronizing AI team configurations across repositories - with three-way merge, a 52-point compliance audit, git integration, merge history, and revert capability, all supporting JSON, YAML, and INI formats. NEO built the full implementation: the CLI entry point, all seven command modules, the merge engine with three-way merge and conflict management, the audit engine with the 52-point rubric, the config loader and writer, the git manager via simple-git, the local storage layer at &lt;code&gt;~/.agentsync/&lt;/code&gt;, the merge history tracker, the report generator, and the 30-test test suite covering all core modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to enforce compliance before configs reach production.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;agentsync audit --directory ./configs --report&lt;/code&gt; as part of your deployment pipeline. Any config scoring below your threshold fails the pipeline before it can introduce hardcoded credentials or compliance gaps into production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the merge history as a compliance audit trail.&lt;/strong&gt;&lt;br&gt;
Every merge is stored as a timestamped JSON file in &lt;code&gt;~/.agentsync/history/&lt;/code&gt;. For teams with compliance requirements, this gives you a complete record of what changed, when, and under what commit message - queryable and revertable at any point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use revert to recover from bad merges instantly.&lt;/strong&gt;&lt;br&gt;
When a config change causes unexpected agent behavior, &lt;code&gt;agentsync revert 2026-05-13T12:30&lt;/code&gt; restores the full config state to any point in history. No manual git archaeology needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional compliance checks.&lt;/strong&gt;&lt;br&gt;
The audit engine in &lt;code&gt;audit-engine.js&lt;/code&gt; implements the 52-point rubric. New compliance checks for domain-specific requirements follow the same scoring pattern and surface automatically in audit reports and scores.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Agent configuration drift is a silent production risk. agentsync makes it manageable by treating configs the way engineers already treat code - versioned, merged with conflict detection, audited for compliance, and fully revertable. The 52-point rubric catches what manual review misses. The merge history means there is always a clear answer to "what is the source of truth?"&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/agentsync" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/agentsync&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>machinelearning</category>
      <category>node</category>
      <category>cli</category>
    </item>
    <item>
      <title>CostGuard: A Real-Time Circuit Breaker That Stops AI Spend Before It Gets Out of Control</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Thu, 04 Jun 2026 11:24:25 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/costguard-a-real-time-circuit-breaker-that-stops-ai-spend-before-it-gets-out-of-control-48oe</link>
      <guid>https://dev.to/nilofer_tweets/costguard-a-real-time-circuit-breaker-that-stops-ai-spend-before-it-gets-out-of-control-48oe</guid>
      <description>&lt;p&gt;AI API costs can spiral faster than anyone expects. A runaway loop, a misconfigured batch job, or a forgotten test that fires thousands of requests - by the time you see the bill, the damage is done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CostGuard&lt;/strong&gt; is a production-ready local proxy that enforces hard spending limits before AI API requests are sent. It sits between your applications and AI providers - OpenAI, Anthropic, and OpenRouter calculating the cost of every request before it goes out, and blocking it if any limit would be exceeded. Per-session, per-hour, per-day, and per-project circuit breakers, a real-time terminal dashboard, and multi-channel alerts, all running locally with no data leaving your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hard Circuit Breakers&lt;/strong&gt; - Per-session, per-hour, per-day, and per-project spending limits.&lt;br&gt;
&lt;strong&gt;Real-Time Cost Estimation&lt;/strong&gt; - Pre-call cost calculation using tiktoken before the request is sent.&lt;br&gt;
&lt;strong&gt;Safe Mode&lt;/strong&gt; - Require explicit confirmation for expensive requests above a configurable threshold.&lt;br&gt;
&lt;strong&gt;Real-Time Dashboard&lt;/strong&gt; - Terminal-based dashboard with WebSocket updates.&lt;br&gt;
&lt;strong&gt;Multi-Channel Alerts&lt;/strong&gt; - Console, webhook, and file-based alerting.&lt;br&gt;
&lt;strong&gt;OpenAI-Compatible API&lt;/strong&gt; - Drop-in replacement for the OpenAI SDK.&lt;br&gt;
&lt;strong&gt;Local SQLite&lt;/strong&gt; - All data stays on your machine.&lt;br&gt;
&lt;strong&gt;Async Architecture&lt;/strong&gt; - High-performance concurrent request handling.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Client SDKs hit the OpenAI-compatible FastAPI proxy. The cost estimator pre-prices the request, then the circuit breaker evaluates limits in order: session, then hour, then day, then project. Allowed traffic forwards to the provider. Tripped limits return a 429 and fire alerts. Spend and pricing data live in local SQLite, and the terminal dashboard streams over WebSocket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zarqwjdmxigqjoqiqje.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zarqwjdmxigqjoqiqje.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Clone the repository, create and activate a virtual environment, and install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add at least one provider API key and any optional budget overrides - session, hour, day, project, or safe-mode thresholds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running the Server&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;costguard server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with uvicorn directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn costguard.server:create_app &lt;span class="nt"&gt;--factory&lt;/span&gt; &lt;span class="nt"&gt;--reload&lt;/span&gt; &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Using the Proxy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Point your OpenAI SDK at &lt;code&gt;http://localhost:8000/v1&lt;/code&gt;, keep the provider API key in the client, and send the usual chat-completions request with session and project headers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Dashboard&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;costguard dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this in a separate terminal. Set COSTGUARD_SESSION_ID=my-session before launching to scope the dashboard to a specific session.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenAI-Compatible Endpoints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /v1/chat/completions&lt;/code&gt; - chat completions with cost tracking&lt;br&gt;
&lt;code&gt;GET /v1/models&lt;/code&gt; - list available models with pricing&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CostGuard-Specific Endpoints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /v1/estimate&lt;/code&gt; - get cost estimate without making a request&lt;br&gt;
&lt;code&gt;GET /v1/status/{session_id}&lt;/code&gt; - get circuit breaker status&lt;br&gt;
&lt;code&gt;POST /v1/safe-mode/confirm&lt;/code&gt; - confirm a paused safe mode request&lt;br&gt;
&lt;code&gt;GET /health&lt;/code&gt; - health check&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebSocket&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;WS /v1/dashboard/ws&lt;/code&gt; - real-time dashboard updates&lt;/p&gt;
&lt;h2&gt;
  
  
  Circuit Breaker Behavior
&lt;/h2&gt;

&lt;p&gt;Limits are evaluated in this deterministic order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session Limit&lt;/strong&gt; - most restrictive, resets on new session&lt;br&gt;
&lt;strong&gt;Hour Limit&lt;/strong&gt; - rolling 1-hour window&lt;br&gt;
&lt;strong&gt;Day Limit&lt;/strong&gt; - resets at midnight UTC&lt;br&gt;
&lt;strong&gt;Project Limit&lt;/strong&gt; - least restrictive, tracks all-time project spend&lt;/p&gt;

&lt;p&gt;When any limit is exceeded, the request is blocked with a structured error, an alert fires immediately, the circuit breaker status changes to OPEN, and subsequent requests are blocked until the limit resets.&lt;/p&gt;
&lt;h2&gt;
  
  
  Safe Mode
&lt;/h2&gt;

&lt;p&gt;When a request's estimated cost exceeds &lt;code&gt;COSTGUARD_SAFE_MODE_THRESHOLD&lt;/code&gt;, the request is paused and an alert is sent to configured channels. Confirm the request with &lt;code&gt;POST /v1/safe-mode/confirm&lt;/code&gt; - the original request proceeds if confirmed.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx7p97r5rxfbj43el28q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx7p97r5rxfbj43el28q.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Running Tests&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest                                        &lt;span class="c"&gt;# Full suite&lt;/span&gt;
pytest &lt;span class="nt"&gt;--cov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;costguard &lt;span class="nt"&gt;--cov-report&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html      &lt;span class="c"&gt;# With coverage&lt;/span&gt;
pytest tests/test_circuit_breaker.py          &lt;span class="c"&gt;# Focused run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code Quality&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruff format src tests                                              &lt;span class="c"&gt;# Formatting&lt;/span&gt;
ruff check src tests                                               &lt;span class="c"&gt;# Linting&lt;/span&gt;
mypy src/costguard                                                 &lt;span class="c"&gt;# Type checking&lt;/span&gt;
ruff format &lt;span class="nt"&gt;--check&lt;/span&gt; src tests &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ruff check src tests &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; mypy src/costguard  &lt;span class="c"&gt;# Full gate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a local circuit-breaker proxy for AI spend control - one that estimates request cost before sending it, enforces session, hour, day, and project limits, supports safe mode for expensive requests, and exposes an OpenAI-compatible API so existing SDKs work without changes. NEO built the full implementation: the FastAPI proxy server with OpenAI-compatible endpoints, the tiktoken-based pre-call cost estimator, the circuit breaker with four limit tiers evaluated in deterministic order, the safe mode flow with confirmation endpoint, the multi-channel alert system covering console, webhook, and file, the terminal dashboard streaming over WebSocket, the local SQLite persistence layer, the pricing tables for OpenAI, Anthropic, and OpenRouter, and the full test suite.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to protect any AI application from runaway costs.&lt;/strong&gt;&lt;br&gt;
Point your OpenAI SDK at &lt;code&gt;http://localhost:8000/v1&lt;/code&gt;. Every request is pre-priced and checked against your configured limits before it leaves your machine. A misconfigured loop or an unexpected spike in usage trips the circuit breaker and fires an alert before the billing damage reaches your provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use safe mode for high-stakes production requests.&lt;/strong&gt;&lt;br&gt;
Set &lt;code&gt;COSTGUARD_SAFE_MODE_THRESHOLD&lt;/code&gt; to the cost above which you want human confirmation. Expensive requests are paused and alerted before proceeding. This is particularly useful for batch jobs or agent workflows where a single request can be unexpectedly large.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the estimate endpoint to build cost-aware UIs.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;POST /v1/estimate&lt;/code&gt; returns the cost of a request without sending it. This lets you show users the expected cost of a query before they submit it or build dashboards that surface real-time spend across sessions and projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional model pricing.&lt;/strong&gt;&lt;br&gt;
The pricing tables cover OpenAI, Anthropic, and OpenRouter. Custom pricing can be added via &lt;code&gt;PricingManager(custom_pricing_file=...)&lt;/code&gt;. Any model not yet in the built-in tables can be priced by adding it to a JSON file - no code changes required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;AI API costs are easy to lose track of and expensive to discover late. CostGuard enforces limits before requests go out, not after the bill arrives. Pre-call cost estimation, four-tier circuit breaking, safe mode for expensive requests, and a real-time dashboard all running locally with no data leaving your machine.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/cost-Guard" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/cost-Guard&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>ArchGuard: Detect Architecture Drift Before It Becomes Technical Debt</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 02 Jun 2026 09:44:09 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/archguard-detect-architecture-drift-before-it-becomes-technical-debt-5b11</link>
      <guid>https://dev.to/nilofer_tweets/archguard-detect-architecture-drift-before-it-becomes-technical-debt-5b11</guid>
      <description>&lt;p&gt;Architecture degrades gradually. A circular dependency here, a god class there, a controller reaching directly into the database layer. Each violation is small on its own. Over time they compound into a codebase that is expensive to change and expensive to understand.&lt;/p&gt;

&lt;p&gt;Most teams discover this in retrospect when a refactor takes three times as long as expected, or when a seemingly isolated change breaks something unrelated. By then the drift is already embedded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArchGuard&lt;/strong&gt; is a production-ready Python static analysis tool that detects architecture degradation patterns in codebases over time. It runs six built-in detectors, compares architecture health between branches, tracks drift over the last 10 commits, and integrates into CI/CD through a GitHub Action or git hooks - all without any AI model dependency, using deterministic local static analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;6 Built-in Detectors&lt;/strong&gt; - circular dependencies, god classes, service layer bypasses, magic values, cyclomatic complexity, and layer violations.&lt;br&gt;
&lt;strong&gt;Per-PR Analysis&lt;/strong&gt; - compare architecture health between branches to catch regressions before they merge.&lt;br&gt;
&lt;strong&gt;Trend Analysis&lt;/strong&gt; - track architecture health over the last 10 commits to see drift over time.&lt;br&gt;
&lt;strong&gt;Multiple Output Formats&lt;/strong&gt; - table, JSON, YAML, Markdown, and HTML.&lt;br&gt;
&lt;strong&gt;CLI and Git Hooks&lt;/strong&gt; - command-line tool with pre-commit and pre-push hooks.&lt;br&gt;
&lt;strong&gt;GitHub Action&lt;/strong&gt; - CI/CD integration for automated architecture checks.&lt;br&gt;
&lt;strong&gt;YAML Configuration&lt;/strong&gt; - flexible, project-specific configuration via &lt;code&gt;.archguard.yml&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The CLI and YAML config feed the core engine - an AST parser, dependency graph, and base analyzer which fans out to six detectors. Findings are graded by severity, rendered as Table, JSON, YAML, Markdown, or HTML, and delivered through the CLI, git hooks, or the GitHub Action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uj8xg6cbkzervxd4ge9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uj8xg6cbkzervxd4ge9.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;From PyPI&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;archguard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;From Source&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/Arch-Guard
&lt;span class="nb"&gt;cd &lt;/span&gt;archguard
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.10+.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;Initialize a configuration file in the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scan the current tree or point it at a specific path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard scan
archguard scan ./src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For machine-readable results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard scan &lt;span class="nt"&gt;--format&lt;/span&gt; json &lt;span class="nt"&gt;--output&lt;/span&gt; report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Review architecture drift over the last 10 commits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard trend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CLI Commands
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;scan - Analyze Codebase&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard scan &lt;span class="o"&gt;[&lt;/span&gt;PATH] &lt;span class="o"&gt;[&lt;/span&gt;OPTIONS]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key flags: &lt;code&gt;--format&lt;/code&gt;, &lt;code&gt;--output&lt;/code&gt;, &lt;code&gt;--detectors&lt;/code&gt;, &lt;code&gt;--severity&lt;/code&gt;, &lt;code&gt;--fail-on-violations&lt;/code&gt;. Global flags: &lt;code&gt;--config&lt;/code&gt;, &lt;code&gt;--verbose&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;trend - Analyze Trends&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard trend &lt;span class="o"&gt;[&lt;/span&gt;OPTIONS]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flags: &lt;code&gt;--commits&lt;/code&gt;, &lt;code&gt;--format&lt;/code&gt;, &lt;code&gt;--output&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;init - Create Configuration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard init &lt;span class="o"&gt;[&lt;/span&gt;OPTIONS]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--path&lt;/code&gt; selects the config file location.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;config - Manage Configuration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;archguard config                          &lt;span class="c"&gt;# Show active configuration&lt;/span&gt;
archguard config output_format            &lt;span class="c"&gt;# Read a value&lt;/span&gt;
archguard config output_format json       &lt;span class="c"&gt;# Update a value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Six Detectors
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Circular Dependency&lt;/strong&gt;&lt;br&gt;
Detects circular import dependencies between modules.&lt;br&gt;
&lt;code&gt;min_cycle_length&lt;/code&gt; - minimum cycle length to report, default 2&lt;br&gt;
&lt;code&gt;max_cycles&lt;/code&gt; - maximum cycles to report, default 100&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;God Class&lt;/strong&gt;&lt;br&gt;
Detects classes with too many methods, attributes, or lines.&lt;br&gt;
&lt;code&gt;max_methods&lt;/code&gt; - maximum methods per class, default 20&lt;br&gt;
&lt;code&gt;max_attributes&lt;/code&gt; - maximum attributes per class, default 15&lt;br&gt;
&lt;code&gt;max_lines&lt;/code&gt; - maximum lines per class, default 500&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service Layer Bypass&lt;/strong&gt;&lt;br&gt;
Detects when controller or presentation layers bypass service layers to access repositories directly.&lt;br&gt;
&lt;code&gt;controller_patterns&lt;/code&gt; - regex patterns for controller files&lt;br&gt;
&lt;code&gt;service_patterns&lt;/code&gt; - regex patterns for service files&lt;br&gt;
&lt;code&gt;repository_patterns&lt;/code&gt; - regex patterns for repository files&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Magic Value&lt;/strong&gt;&lt;br&gt;
Detects hardcoded literals that should be named constants.&lt;br&gt;
&lt;code&gt;min_string_length&lt;/code&gt; - minimum string length to flag, default 3&lt;br&gt;
&lt;code&gt;max_string_length&lt;/code&gt; - maximum string length to check, default 100&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cyclomatic Complexity&lt;/strong&gt;&lt;br&gt;
Detects functions and methods with high cyclomatic complexity.&lt;br&gt;
&lt;code&gt;thresholds&lt;/code&gt; - complexity thresholds for each severity level&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer Violation&lt;/strong&gt;&lt;br&gt;
Detects violations of layered architecture, such as the presentation layer importing from the repository layer.&lt;br&gt;
&lt;code&gt;layers&lt;/code&gt; - layer definitions with patterns and allowed calls&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Create a &lt;code&gt;.archguard.yml&lt;/code&gt; file in your project root. The config supports project metadata, include and exclude patterns, and per-detector options such as cycle length, maximum class size, and complexity thresholds. Output behavior, Git integration, and trend analysis are all controlled through the same file.&lt;/p&gt;
&lt;h2&gt;
  
  
  Git Hooks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python hooks/install.py                        &lt;span class="c"&gt;# Install pre-commit hook&lt;/span&gt;
python hooks/install.py &lt;span class="nt"&gt;--pre-commit&lt;/span&gt; &lt;span class="nt"&gt;--pre-push&lt;/span&gt;  &lt;span class="c"&gt;# Install both hooks&lt;/span&gt;
python hooks/install.py &lt;span class="nt"&gt;--force&lt;/span&gt;                &lt;span class="c"&gt;# Overwrite existing hooks&lt;/span&gt;
python hooks/install.py &lt;span class="nt"&gt;--uninstall&lt;/span&gt;            &lt;span class="c"&gt;# Remove hooks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pre-commit hook&lt;/strong&gt; - runs ArchGuard on staged Python files before committing.&lt;br&gt;
&lt;strong&gt;Pre-push hook&lt;/strong&gt; - runs trend analysis before pushing to remote.&lt;/p&gt;
&lt;h2&gt;
  
  
  GitHub Action
&lt;/h2&gt;

&lt;p&gt;The GitHub Action integrates ArchGuard into CI/CD pipelines. Basic usage runs on push or pull request workflows, checks out the repository with full history, and passes path, format, severity, and fail-on-violations settings as action inputs. Advanced configuration enables trend mode, selects Markdown output, sets the commit window, and uploads the generated report as an artifact.&lt;/p&gt;
&lt;h2&gt;
  
  
  Acknowledgments
&lt;/h2&gt;

&lt;p&gt;Built with Click for CLI, Python's built-in &lt;code&gt;ast&lt;/code&gt; module for AST parsing, NetworkX for dependency graph analysis, Rich for terminal output, and GitPython for Git integration.&lt;/p&gt;
&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/Arch-Guard
&lt;span class="nb"&gt;cd &lt;/span&gt;archguard
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
pre-commit &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running Tests&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest                                                        &lt;span class="c"&gt;# Full suite&lt;/span&gt;
pytest &lt;span class="nt"&gt;--cov&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;src/archguard &lt;span class="nt"&gt;--cov-report&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html                 &lt;span class="c"&gt;# With coverage&lt;/span&gt;
pytest tests/unit/test_detectors.py                          &lt;span class="c"&gt;# Targeted detector check&lt;/span&gt;
pytest tests/integration/                                    &lt;span class="c"&gt;# Integration coverage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code Quality&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ruff check src/ tests/               &lt;span class="c"&gt;# Linting&lt;/span&gt;
ruff check &lt;span class="nt"&gt;--fix&lt;/span&gt; src/ tests/         &lt;span class="c"&gt;# Auto-fix&lt;/span&gt;
pyright src/                         &lt;span class="c"&gt;# Type checking&lt;/span&gt;
ruff check src/ tests/ &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pyright src/  &lt;span class="c"&gt;# Combined gate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;Fork the repository. Create a feature branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; feature/amazing-feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make your changes, run tests with &lt;code&gt;pytest&lt;/code&gt;, run linting with &lt;code&gt;ruff check src/ tests/&lt;/code&gt;, commit, push to the branch, and open a Pull Request.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a production-ready static analysis tool that detects architecture drift in Python codebases over time - with six built-in detectors, trend analysis over git history, multiple output formats, git hook integration, and a GitHub Action for CI/CD. NEO built the full implementation: the core engine with AST parser, dependency graph via NetworkX, and base analyzer; all six detector modules; the formatter layer covering table, JSON, YAML, Markdown, and HTML output; the git integration via GitPython; the CLI built on Click; the YAML configuration layer; the git hook installer and pre-commit and pre-push hooks; the GitHub Action; and the full test suite covering unit and integration tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a quality gate in every pull request.&lt;/strong&gt;&lt;br&gt;
Add the GitHub Action to your workflow with &lt;code&gt;--fail-on-violations&lt;/code&gt; and the severity threshold you care about. Every PR gets checked for new circular dependencies, god classes, layer violations, and complexity regressions before it merges automatically, without any manual review step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use trend analysis to measure the health of an inherited codebase.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;archguard&lt;/code&gt; trend on a codebase you have just taken over. The last 10 commits give you a picture of whether the architecture is improving or degrading, and which detectors are firing most frequently - useful context before making any changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use git hooks to enforce standards locally before code reaches CI.&lt;/strong&gt;&lt;br&gt;
Install the pre-commit hook with &lt;code&gt;python hooks/install.py&lt;/code&gt;. Staged files are checked on every commit. The pre-push hook runs trend analysis before anything reaches the remote. Issues are caught at the developer's machine, not in CI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional detectors.&lt;/strong&gt;&lt;br&gt;
The six detectors share a common base analyzer interface. A new detector for a project-specific architecture rule follows the same pattern - implement the detection logic, add per-detector configuration to &lt;code&gt;.archguard.yml&lt;/code&gt;, and register it. It appears automatically in scan output, trend analysis, and all output formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Architecture drift is invisible until it is expensive. ArchGuard makes it visible at every commit, every PR, and every push - with deterministic static analysis that requires no API keys, no model downloads, and no network calls. Six detectors, trend tracking over git history, and CI/CD integration in one tool.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Arch-Guard" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Arch-Guard&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>Prepush-Guardian: Catch Secrets and Broken Tests Before They Reach Git History</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Mon, 01 Jun 2026 12:13:22 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/prepush-guardian-catch-secrets-and-broken-tests-before-they-reach-git-history-fpc</link>
      <guid>https://dev.to/nilofer_tweets/prepush-guardian-catch-secrets-and-broken-tests-before-they-reach-git-history-fpc</guid>
      <description>&lt;p&gt;You are about to push. There is a hardcoded API key buried in one of 30 changed files. Or you forgot to write a test for that new module. Or the test suite is silently failing. You will not know until it is already in git history.&lt;/p&gt;

&lt;p&gt;Prepush-Guardian catches all of this before the push lands. It is a production-grade Git pre-push hook that scans staged files for secrets, auto-generates missing tests, runs your full test suite, and blocks the push if anything fails before it ever reaches the remote.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck3j18nwpzfqr6fvekc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck3j18nwpzfqr6fvekc0.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Tool
&lt;/h2&gt;

&lt;p&gt;Manual review - Misses things, does not scale, no enforcement&lt;br&gt;
CI/CD only - Finds it after the push, already in history&lt;br&gt;
prepush-guardian - Blocked at push time, before it ever reaches remote&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scans every staged file for 20+ secret patterns: AWS, GitHub PATs, private keys, database URLs, bearer tokens, and more&lt;/li&gt;
&lt;li&gt;Shannon entropy scanner catches novel secrets not matched by patterns&lt;/li&gt;
&lt;li&gt;Auto-generates missing tests using OpenRouter AI, with a template fallback if no API key is set&lt;/li&gt;
&lt;li&gt;Runs your full test suite and blocks the push if coverage drops below threshold&lt;/li&gt;
&lt;li&gt;Writes a markdown report at &lt;code&gt;.neo/prepush-report.md&lt;/code&gt; for every push&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone and install the hook into your repo
git clone https://github.com/neo-ai/prepush-guardian
cd your-target-repo

# Install the pre-push hook
python3 /path/to/prepush-guardian/install.py

# Optional: set API key for AI test generation
cp .env.example .env   # fill in OPENROUTER_API_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The hook runs automatically on every &lt;code&gt;git push&lt;/code&gt;. To run manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 prepush_guardian.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Environment Variables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp .env.example .env
# Required only for AI-based test generation
# Free key at: https://openrouter.ai/keys
OPENROUTER_API_KEY=your_openrouter_api_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without an API key, the tool falls back to template-based test generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Commands
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tgi2wz9gn2conp5x8g0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tgi2wz9gn2conp5x8g0.png" alt=" " width="725" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Detection Patterns
&lt;/h2&gt;

&lt;p&gt;The secret scanner covers 20+ patterns across four severity levels:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6hga45t7gdnq2le8w0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6hga45t7gdnq2le8w0y.png" alt=" " width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Shannon entropy scanner runs alongside the pattern matcher. It catches novel secrets - API keys or tokens not yet covered by a named pattern by flagging high-entropy strings assigned to variables named KEY, TOKEN, or SECRET.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scoring and Thresholds
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8flyrugjykeqciqwonh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8flyrugjykeqciqwonh.png" alt=" " width="382" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;.neo/config.json&lt;/code&gt; to customize behavior. It is auto-created with defaults if absent:&lt;br&gt;
&lt;code&gt;coverage_warn_threshold&lt;/code&gt; - default 70. Warn if coverage drops below this percentage.&lt;br&gt;
&lt;code&gt;coverage_block_threshold&lt;/code&gt; - default 50. Block push if coverage drops below this percentage.&lt;br&gt;
&lt;code&gt;block_on_low_severity&lt;/code&gt; - default false. Also hard-block on LOW findings.&lt;br&gt;
&lt;code&gt;auto_fix_gitignore&lt;/code&gt; - default true. Add sensitive filenames to &lt;code&gt;.gitignore&lt;/code&gt; automatically.&lt;br&gt;
&lt;code&gt;generate_missing_tests&lt;/code&gt; - default true. Auto-generate tests for untested source files.&lt;br&gt;
&lt;code&gt;skip_test_check_for&lt;/code&gt; - default &lt;code&gt;["migrations/", "scripts/", "docs/"]&lt;/code&gt;. Directories excluded from test generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exit Codes
&lt;/h2&gt;

&lt;p&gt;0 : All checks passed - push proceeding&lt;br&gt;
1 : Push blocked - CRITICAL/HIGH findings or test failures&lt;/p&gt;

&lt;h2&gt;
  
  
  File Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prepush-guardian/
├── prepush_guardian.py      # Main orchestrator
├── leak_detector.py         # Phase 1: secret &amp;amp; entropy detection
├── test_generator.py        # Phase 2: AI test generation
├── test_runner.py           # Phase 2: test execution + coverage
├── reporter.py              # Phase 3: markdown report
├── install.py               # Hook installer
├── requirements.txt
├── .env.example
├── .gitignore
├── LICENSE
├── CONTRIBUTING.md
├── architecture.excalidraw
├── infographic.svg
└── tests/
    ├── test_leak_detector.py
    └── fixtures/
        ├── sample_with_secrets.py
        └── sample_clean.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three-phase structure maps cleanly to the file names - &lt;code&gt;leak_detector.py&lt;/code&gt; handles Phase 1, &lt;code&gt;test_generator.py&lt;/code&gt; and &lt;code&gt;test_runner.py&lt;/code&gt; handle Phase 2, and &lt;code&gt;reporter.py&lt;/code&gt; handles Phase 3. &lt;code&gt;prepush_guardian.py&lt;/code&gt; orchestrates all three phases in sequence.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a production-grade Git pre-push hook that catches secrets, validates test coverage, and auto-generates missing tests - blocking the push before anything problematic reaches the remote. NEO planned, wrote, tested, and verified every file in this repository without human intervention: the main orchestrator in &lt;code&gt;prepush_guardian.py&lt;/code&gt;, the secret and entropy scanner in &lt;code&gt;leak_detector.py&lt;/code&gt; covering 20+ patterns, the AI test generator in &lt;code&gt;test_generator.py&lt;/code&gt; with OpenRouter integration and template fallback, the test runner and coverage checker in &lt;code&gt;test_runner.py&lt;/code&gt;, the markdown report generator in &lt;code&gt;reporter.py&lt;/code&gt;, the hook installer in &lt;code&gt;install.py&lt;/code&gt;, and the test suite with fixtures.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install it into every repo your team pushes from.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;python3 install.py&lt;/code&gt; once in each repository. From that point, every &lt;code&gt;git push&lt;/code&gt; runs the full three-phase check automatically, no CI changes, no developer workflow changes. Secrets and test failures are blocked before they reach the remote.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tune the thresholds to match your team's standards.&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;.neo/config.json&lt;/code&gt; file controls coverage warn and block thresholds, whether LOW-severity findings hard-block the push, and which directories are excluded from test generation. These can be committed to the repo so the same standards apply across the whole team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the markdown report as a push audit trail.&lt;/strong&gt;&lt;br&gt;
Every push writes a report to &lt;code&gt;.neo/prepush-report.md&lt;/code&gt;.This gives you a record of what was scanned, what was found, and what was blocked, useful for teams with compliance requirements or for debugging why a push was blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend the detection patterns in &lt;code&gt;leak_detector.py&lt;/code&gt;.&lt;/strong&gt;&lt;br&gt;
The secret scanner covers 20+ named patterns. Adding a new pattern for a domain-specific secret type means adding it to the pattern list in &lt;code&gt;leak_detector.py&lt;/code&gt;. It is immediately active on the next push with no other changes needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;The gap between "I think this is clean" and "I know this is clean" is where prepush-guardian lives. Secrets get committed because no one checked. Tests go missing because there was no enforcement. prepush-guardian closes both gaps at the moment they matter most before the push lands.&lt;br&gt;
The code is at &lt;a href="https://github.com/dakshjain-1616/prepush-guardian" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/prepush-guardian&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>git</category>
      <category>opensource</category>
      <category>api</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Fine-Tuning Qwen2.5-0.5B to Write SRE Post-Mortem Summaries</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 30 May 2026 04:43:37 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/fine-tuning-qwen25-05b-to-write-sre-post-mortem-summaries-2jem</link>
      <guid>https://dev.to/nilofer_tweets/fine-tuning-qwen25-05b-to-write-sre-post-mortem-summaries-2jem</guid>
      <description>&lt;p&gt;Writing post-mortem root-cause summaries is time-consuming and inconsistent. Junior SREs miss contributing factors. Senior SREs write summaries that vary in depth and structure. Zero-shot LLMs produce verbose, generic output that does not follow SRE conventions.&lt;br&gt;
Fine-tuning a small model on real incident data produces structured, concise summaries that follow your organisation's format at a fraction of the cost of a large model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo10j1ff2xwcquhknpwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo10j1ff2xwcquhknpwo.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Why This Approach
&lt;/h2&gt;

&lt;p&gt;Diffrent type of approaches and what you get: &lt;/p&gt;

&lt;p&gt;Manual SRE writing : Inconsistent, time-consuming, expertise-dependent&lt;br&gt;
Zero-shot large model : Generic format, verbose, high cost per call&lt;br&gt;
Qwen2.5-0.5B fine-tuned : SRE-format outputs, fast, cheap, runs on CPU or consumer GPU&lt;/p&gt;

&lt;p&gt;The key advantages of this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;700-sample training set of real incident timelines mapped to root-cause summaries&lt;/li&gt;
&lt;li&gt;4-bit quantized LoRA training, runs on a single consumer GPU with 8GB VRAM or more&lt;/li&gt;
&lt;li&gt;Evaluated against a structured rubric covering timeline reference, contributing factors, specific component, and prevention action&lt;/li&gt;
&lt;li&gt;Compared against &lt;code&gt;qwen3.6-plus:free&lt;/code&gt; and &lt;code&gt;gpt-5.4-nano&lt;/code&gt; baselines&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The HuggingFace Model
&lt;/h2&gt;

&lt;p&gt;The fine-tuned adapter is published at: &lt;code&gt;daksh-neo/postmortem-qwen2.5-0.5b-lora&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After training, the LoRA weights are saved to &lt;code&gt;models/postmortem-lora/hf_export/&lt;/code&gt; and pushed to HuggingFace.&lt;/p&gt;
&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env   &lt;span class="c"&gt;# fill in OPENROUTER_API_KEY&lt;/span&gt;
&lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; .env | xargs&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Environment Variables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Required for baseline evaluation with OpenRouter&lt;/span&gt;
&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_openrouter_api_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; is required only for running baseline evaluations against zero-shot models via OpenRouter. The fine-tuning and local evaluation steps run without it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipeline
&lt;/h2&gt;

&lt;p&gt;The full pipeline runs in four steps:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnoprfc6o8jtctn9wsrn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnoprfc6o8jtctn9wsrn.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each step is independent, you can run baseline evaluation before fine-tuning to establish the gap the fine-tuned model closes, and run evaluation again after to measure the improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Configuration
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb66xdhw38x54dtb6pbtu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb66xdhw38x54dtb6pbtu.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation Rubric&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every generated summary is scored against a four-criterion rubric. Each criterion carries equal weight:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F459x0v53vxhgjlelbwvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F459x0v53vxhgjlelbwvj.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pass threshold: 0.60 weighted score or above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expected Results
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;qwen/qwen3.6-plus:free&lt;/code&gt; (zero-shot) - 20–35%&lt;br&gt;
&lt;code&gt;openai/gpt-5.4-nano&lt;/code&gt; (zero-shot) - 35–50%&lt;br&gt;
Qwen2.5-0.5B (fine-tuned, 3 epochs) - &amp;gt; 60%&lt;/p&gt;

&lt;p&gt;The fine-tuned 0.5B model outperforms both zero-shot baselines on rubric compliance because it has been trained specifically on the output format the rubric measures, not on general-purpose tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  File Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ml_project_0901/
├── scrape_postmortems.py    # Data collection
├── baseline.py              # Zero-shot baseline via OpenRouter
├── finetune.py              # LoRA fine-tuning
├── eval.py                  # Evaluation + comparison
├── requirements.txt
├── .env.example
├── .gitignore
├── LICENSE
├── CONTRIBUTING.md
├── architecture.excalidraw
├── infographic.svg
├── data/
│   ├── train.jsonl          # 700 training examples
│   ├── test_100.jsonl       # 100 held-out test examples
│   ├── rubric.json          # Scoring rubric
│   └── baseline_results.jsonl
└── models/
    └── postmortem-lora/
        └── hf_export/       # Push to HuggingFace after training
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a complete fine-tuning pipeline for a small model on SRE post-mortem data, with data scraping, zero-shot baseline comparison, 4-bit LoRA fine-tuning, and structured rubric-based evaluation. NEO planned, wrote, tested, and verified every file in the repository without human intervention: the data scraper producing 700 training examples and 100 held-out test examples, the baseline evaluator running zero-shot prompts against OpenRouter models, the LoRA fine-tuning script with the full model configuration, the rubric-based evaluator producing the comparison table, and the HuggingFace export pipeline pushing the trained adapter to &lt;code&gt;daksh-neo/postmortem-qwen2.5-0.5b-lora&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to replace inconsistent manual post-mortem writing in your team.&lt;/strong&gt;&lt;br&gt;
Train on your own organisation's incident data by replacing &lt;code&gt;data/train.jsonl&lt;/code&gt; with your own incident timeline to root-cause summary pairs. The rubric in &lt;code&gt;data/rubric.json&lt;/code&gt; can be adapted to match your org's specific post-mortem format the evaluation pipeline measures compliance against whatever criteria you define.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the baseline comparison to justify the fine-tuning investment.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;python baseline.py&lt;/code&gt; before fine-tuning to measure what zero-shot models produce on your data. Run &lt;code&gt;python eval.py&lt;/code&gt; after fine-tuning to see the improvement. The comparison table gives you a concrete before-and-after that makes the case for domain-specific fine-tuning over general-purpose models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the published adapter directly without retraining.&lt;/strong&gt;&lt;br&gt;
The fine-tuned LoRA adapter is available at daksh-neo/postmortem-qwen2.5-0.5b-lora on HuggingFace. You can load it directly without running the training pipeline - useful for teams that want to evaluate the output before committing to their own fine-tuning run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it to other structured generation tasks.&lt;/strong&gt;&lt;br&gt;
The four-step pipeline - scrape, baseline, fine-tune, evaluate is domain-agnostic. Any task where structured output format matters more than general knowledge is a candidate: alert triage summaries, change request descriptions, deployment notes. Swap the training data and rubric criteria, and the rest of the pipeline runs unchanged.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Zero-shot large models produce verbose, generic post-mortem summaries that do not follow SRE conventions. A fine-tuned 0.5B model trained on 700 domain-specific examples outperforms them on every criterion of the rubric  - timeline reference, contributing factors, specific component identification, and concrete prevention actions, while running on a consumer GPU and costing a fraction per call.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/postmortem-finetune" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/postmortem-finetune&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Morph: AST-Level Refactoring Where the LLM Describes Intent, Not Code</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 23 May 2026 11:04:25 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/morph-ast-level-refactoring-where-the-llm-describes-intent-not-code-1hh6</link>
      <guid>https://dev.to/nilofer_tweets/morph-ast-level-refactoring-where-the-llm-describes-intent-not-code-1hh6</guid>
      <description>&lt;p&gt;When an LLM generates source code for a refactor, the output is a diff a reviewer must read line by line and trust blindly. There is no way to know if the model missed a reference, broke an import, or introduced a subtle logic change without reading every line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Morph&lt;/strong&gt; takes a different approach. Instead of asking the LLM to generate code, it asks the LLM to describe what to change as a structured plan of typed operations - RenameSymbol, MoveFunction, ExtractModule, and more. A reviewer reads ten structured operations in seconds and knows exactly what will change, why, and in what order. The transformation engine then validates the plan against the real codebase dependency graph, applies each operation atomically using tree-sitter AST manipulation, runs the test suite to confirm correctness, and stages clean changes for review. Failed transformations roll back automatically.&lt;/p&gt;

&lt;p&gt;The LLM's job is intent declaration, not code writing. Morph's engine handles everything else.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fandmnivkji8ox3d2p3l4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fandmnivkji8ox3d2p3l4.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Typed Plans Beat Source Code Generation
&lt;/h2&gt;

&lt;p&gt;When a refactoring is expressed as a typed plan, every operation is verifiable before it runs. The plan validator checks file existence, symbol existence, dependency conflicts, and operation conflicts against a real dependency graph. The transformer applies operations in dependency order. The verifier runs pytest after every apply - any failure triggers automatic rollback.&lt;/p&gt;

&lt;p&gt;Source code generation has none of these guarantees. A typed plan does.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pipeline
&lt;/h2&gt;

&lt;p&gt;A natural language goal enters the LLM Planner, which outputs a validated &lt;code&gt;TransformationPlan&lt;/code&gt;. The Plan Validator checks file existence, symbol existence, dependency conflicts, and operation conflicts against a NetworkX dependency graph. The Transformer applies operations in dependency order using tree-sitter AST manipulation, creating a file backup first. The Verifier runs pytest - any failure triggers automatic rollback. Clean changes are handed off to the Staging Manager via GitPython and summarised in a Report.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23xuq96apm3jeuukb56b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23xuq96apm3jeuukb56b.png" alt=" " width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported Operations
&lt;/h2&gt;

&lt;p&gt;Each operation is a typed Pydantic model. The LLM populates the fields — Morph validates and executes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6q4h80qjm5jbn4yb1y8d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6q4h80qjm5jbn4yb1y8d.png" alt=" " width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Dependency Graph Works
&lt;/h2&gt;

&lt;p&gt;Before validating any plan, Morph parses the entire codebase with tree-sitter and builds a NetworkX dependency graph. This graph is used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detect files that import the symbol being moved or renamed&lt;/li&gt;
&lt;li&gt;Sort operations so dependencies are updated before dependents&lt;/li&gt;
&lt;li&gt;Warn when a move will cascade across downstream files&lt;/li&gt;
&lt;li&gt;Prevent circular dependency introduction from module extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what makes Morph safe to run on real codebases - the plan is validated against the actual dependency structure before a single file is touched.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rollback Guarantee
&lt;/h2&gt;

&lt;p&gt;Every non-dry-run apply call snapshots all affected files before touching them. If pytest reports failures after transformation, Morph restores from the snapshot automatically. The workspace is always left in a clean, known-good state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Live Results
&lt;/h2&gt;

&lt;p&gt;A real dry-run against &lt;code&gt;anthropic/claude-haiku-4-5&lt;/code&gt; via OpenRouter - the LLM parsed a natural language rename goal and produced a validated &lt;code&gt;RenameSymbol&lt;/code&gt; plan in under 5 seconds. Full output and reproduction steps are in &lt;code&gt;RESULTS.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzbng722a9nge2uwlrj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzbng722a9nge2uwlrj7.png" alt=" " width="799" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -e .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For local inference, install Ollama and pull a model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama pull gemma4:e4b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For cloud backends, set the relevant environment variable:&lt;br&gt;
&lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; - OpenRouter (recommended)&lt;br&gt;
&lt;code&gt;OPENAI_API_KEY&lt;/code&gt; - OpenAI&lt;br&gt;
&lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; - Anthropic&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;Describe what you want in plain English. Morph figures out the operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph refactor --goal "rename calculate_total to compute_total" ./src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Preview the plan without touching any files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph refactor --goal "extract validation logic into validate_input()" ./src --dry-run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate and save the plan for inspection before applying:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph plan --goal "add type annotations to all functions in utils.py" ./src --output plan.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply a saved plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph refactor --plan plan.json ./src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the codebase passes its own test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph verify ./src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate a Markdown report of the last run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;morph report ./src --format markdown --output REFACTOR_REPORT.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Supported Models
&lt;/h2&gt;

&lt;p&gt;Morph works with any provider. OpenRouter is the recommended starting point - one API key routes to every model below without separate accounts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpo8i7vwd5oric8b85qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpo8i7vwd5oric8b85qr.png" alt=" " width="798" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The planner uses &lt;code&gt;temperature=0.1&lt;/code&gt; - low randomness produces more consistent structured output. Unknown model strings are automatically routed through OpenRouter with no &lt;code&gt;--backend&lt;/code&gt; flag required.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;morph refactor --goal "..." PATH&lt;/code&gt; - Generate plan from goal and apply it&lt;br&gt;
&lt;code&gt;morph refactor --plan FILE PATH&lt;/code&gt; - Apply a previously saved plan&lt;br&gt;
&lt;code&gt;morph refactor ... --dry-run&lt;/code&gt; - Show plan without modifying files&lt;br&gt;
&lt;code&gt;morph plan --goal "..." PATH&lt;/code&gt; - Generate and display plan only&lt;br&gt;
&lt;code&gt;morph verify PATH&lt;/code&gt; - Run the test suite and report pass/fail&lt;br&gt;
&lt;code&gt;morph report PATH&lt;/code&gt; - Generate Markdown/JSON report of last run&lt;/p&gt;

&lt;p&gt;Key flags: &lt;code&gt;--model&lt;/code&gt;, &lt;code&gt;--backend&lt;/code&gt;, &lt;code&gt;--dry-run&lt;/code&gt;, &lt;code&gt;--no-rollback&lt;/code&gt;, &lt;code&gt;--output&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;

&lt;p&gt;Clone and install in editable mode with dev dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/morph
cd morph
pip install -e ".[dev]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the full test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pytest tests/ -v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lint and type-check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ruff check morph/ &amp;amp;&amp;amp; mypy morph/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a refactoring CLI where the LLM describes intent as a structured typed plan rather than generating raw code with AST-level execution, dependency graph validation, automatic rollback on test failure, and support for multiple LLM backends. NEO built the full implementation: the LLM Planner producing typed &lt;code&gt;TransformationPlan&lt;/code&gt; outputs with &lt;code&gt;temperature=0.1&lt;/code&gt;, the seven typed Pydantic operation models, the Plan Validator checking file existence, symbol existence, and dependency conflicts against a NetworkX graph, the Transformer applying operations in dependency order via tree-sitter AST manipulation with file backup, the Verifier running pytest with automatic snapshot rollback on failure, the Staging Manager via GitPython, the report generator, and the full CLI with all six commands and their key flags.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to refactor production codebases safely.&lt;/strong&gt;&lt;br&gt;
Instead of asking an LLM to rewrite files, describe the refactoring goal in plain English. Morph validates the plan against the real dependency graph, applies it atomically, and rolls back automatically if tests fail. The dry-run mode lets you inspect exactly what will happen before anything is touched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the saved plan workflow for team review.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;morph plan --goal "..." --output plan.json&lt;/code&gt; to generate the structured plan without applying it. Share the JSON with your team for review before running the apply step. Reviewers see ten typed operations instead of a raw diff - faster to review, easier to reason about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it as a refactoring step in CI/CD pipelines.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;morph verify PATH&lt;/code&gt; runs the test suite and reports pass/fail with an exit code, making it composable as a CI step. Combined with &lt;code&gt;morph refactor&lt;/code&gt; and &lt;code&gt;--dry-run&lt;/code&gt;, you can build a pipeline that proposes, reviews, and applies refactors with automated test verification at every stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional operation types.&lt;/strong&gt;&lt;br&gt;
Each operation is a typed Pydantic model in the operations layer. A new operation follows the same pattern: define the Pydantic model, implement the transformer logic, and register it. The LLM Planner, Plan Validator, and CLI all pick it up automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Morph shifts refactoring from code generation to intent declaration. The LLM describes what to change in a structured, validated plan. The engine does the mechanical work. Tests confirm correctness. The result is refactoring that is auditable before it runs, verifiable after it runs, and automatically reversible if it breaks anything.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Morph" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Morph&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>ToolRouter: Switch AI Coding Tools Freely Without Losing Context</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Fri, 22 May 2026 11:13:41 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/toolrouter-switch-ai-coding-tools-freely-without-losing-context-2bo</link>
      <guid>https://dev.to/nilofer_tweets/toolrouter-switch-ai-coding-tools-freely-without-losing-context-2bo</guid>
      <description>&lt;p&gt;Every AI coding tool has its strengths. Claude Code is strong for complex multi-step tasks. Cursor is fast for inline edits. Gemini CLI is useful for quick questions. Most developers use more than one but every time you switch, the context is gone. The new tool has no idea what you just did, what you decided, or which files are in a partial state.&lt;/p&gt;

&lt;p&gt;On top of that, there is no clear picture of what different AI tools actually cost per session, per project, or per week. You are guessing at efficiency rather than measuring it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ToolRouter&lt;/strong&gt; is a local proxy daemon that solves both problems. It maintains shared session state across multiple AI coding tools, generates Handoff Briefs when you switch between them, and tracks real token spend per tool and model all transparently, without changing your API keys or your tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyua07jyge5sqcy6360y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyua07jyge5sqcy6360y.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;ToolRouter sits between your AI tools and their APIs as a local proxy on port 7863. Here is what happens at each stage:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - All traffic routes through the proxy.&lt;/strong&gt;&lt;br&gt;
You point each AI tool's API base URL at &lt;code&gt;localhost:7863&lt;/code&gt;. From that point, every request your tool makes passes through ToolRouter first. The proxy forwards it transparently to the real API, your API keys are unchanged, your tools behave exactly as before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - The proxy captures what matters as a side effect.&lt;/strong&gt;&lt;br&gt;
As AI responses come back through the proxy, ToolRouter reads the token counts and extracts decisions and task state from the response text using pattern matching. Statements like "let's use bcrypt" are classified as decisions. Lines like "implemented JWT validation" are classified as completed tasks. "Still need to finish the refresh logic" becomes an in-progress item. Everything is written to the SQLite state store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 - The file tracker watches the filesystem independently.&lt;/strong&gt;&lt;br&gt;
Alongside the proxy, a Watchdog-based file tracker monitors your project directories. It computes file hashes before and after each session to build an accurate list of what changed. It also scans for syntax errors, merge conflict markers, and unresolved TODOs to detect files that are in a partial state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 - When you switch tools, a Handoff Brief is generated and injected.&lt;/strong&gt;&lt;br&gt;
The Handoff Generator reads from the state store and assembles a brief - partial files first since they carry the highest risk, then in-progress tasks, then decisions and completed items. This brief is automatically injected into the first message of your new session. The receiving tool sees exactly where the last tool left off, before it writes a single line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5 - Spend is tracked on every proxied response.&lt;/strong&gt;&lt;br&gt;
Token counts from every response are accumulated and costed against current model pricing. No separate setup needed, spend tracking is a byproduct of the same proxy pass.&lt;/p&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -e .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - Start the daemon&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toolrouter start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts the proxy on port 7863 and the dashboard on port 7864.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 - Point your AI tools at the proxy&lt;/strong&gt;&lt;br&gt;
Each tool needs its API base URL pointed at the local proxy. This is a one-time configuration per tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code: export ANTHROPIC_API_URL=http://localhost:7863/v1
Cursor: Set OpenAI API base URL to http://localhost:7863/v1 in Settings → AI
Gemini CLI: export OPENAI_API_BASE=http://localhost:7863/v1
Ollama: export OLLAMA_HOST=http://localhost:7863/api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3 - Work normally&lt;/strong&gt;&lt;br&gt;
Switch tools whenever you like. ToolRouter handles handoffs automatically.&lt;/p&gt;
&lt;h2&gt;
  
  
  Handoff Brief
&lt;/h2&gt;

&lt;p&gt;When you switch tools on the same project, ToolRouter injects a brief like this into the first message of your new session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ToolRouter Handoff — from claude-code, 5 minutes ago]

Files changed this session:
✓ src/auth.py — implemented JWT token validation
✓ src/models.py — added User model
⚠ src/api.py — PARTIALLY MODIFIED, do not use as-is

Completed:
✓ Set up authentication middleware
✓ Created database schema

In progress:
→ Implementing refresh token logic
→ Writing API documentation

Decisions made:
- Using bcrypt for password hashing
- JWT tokens expire after 24 hours
- Refresh tokens stored in Redis

⚠ Do not touch:
- src/api.py (has syntax errors)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The brief is generated from the state store - real file changes tracked by the watchdog, decisions extracted from AI responses, and partial-state detection on modified files. The receiving tool sees this at the start of the session and can immediately continue where the last tool left off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Spend Tracking
&lt;/h2&gt;

&lt;p&gt;ToolRouter reads token counts from every proxied response and calculates cost using current model pricing. Spend reports run directly from the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toolrouter spend           # Today's report
toolrouter spend --week    # This week
toolrouter spend --month   # This month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboard at &lt;code&gt;http://localhost:7864&lt;/code&gt; shows daily spend bar charts per tool, session lists with per-session cost, per-tool and per-project breakdowns, which tool is most cost-efficient measured by cost per file changed, and projected monthly costs based on current pace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Pricing (May 2026)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek6ifmbv4fh7j4jlh78l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek6ifmbv4fh7j4jlh78l.png" alt=" " width="515" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Commands
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ksgtubtf0jy30c29svt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ksgtubtf0jy30c29svt.png" alt=" " width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;State Store -&lt;/strong&gt; SQLite with WAL mode for concurrent read/write. Stores sessions, per-session file changes with MD5 hashes, extracted decisions and tasks, and generated handoff briefs. Every table links back to a session ID so the full history is queryable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File Tracker -&lt;/strong&gt; Watchdog-based monitoring of project directories. Computes file hashes before and after each session to build an accurate change list. Detects partial states by scanning for syntax errors, merge conflict markers, and unresolved TODOs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision Extractor -&lt;/strong&gt; Pattern matching over AI responses to classify statements into decisions, completed tasks, in-progress work, and blockers. Phrases like "let's use" and "we'll go with" are decisions. Words like "done", "implemented", and "✓" signal completed tasks. "I've started" and "still need to" mark in-progress work. "Blocked by" and "waiting for" identify blockers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handoff Generator -&lt;/strong&gt; Assembles the brief from state store data, ordering by recency and priority: partial files first as they carry the highest risk, then in-progress tasks, then decisions and completed items.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Configuration is stored at &lt;code&gt;~/.toolrouter/config.json&lt;/code&gt;. The key settings are:&lt;br&gt;
&lt;code&gt;injection_enabled&lt;/code&gt; - whether to prepend handoff briefs&lt;br&gt;
&lt;code&gt;proxy_port&lt;/code&gt; - default 7863&lt;br&gt;
&lt;code&gt;dashboard_port&lt;/code&gt; - default 7864&lt;br&gt;
&lt;code&gt;log_level&lt;/code&gt; - logging verbosity&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;toolrouter config set &amp;lt;key&amp;gt; &amp;lt;value&amp;gt;&lt;/code&gt; to change any setting without editing the file directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a local proxy daemon that could sit transparently between AI coding tools and their APIs, maintain shared session state across tool switches, generate structured handoff briefs automatically, and track real token spend per tool and model - all without requiring any changes to the tools themselves or the user's API keys.&lt;/p&gt;

&lt;p&gt;NEO built the full implementation: the proxy daemon running on port 7863, the SQLite state store with WAL mode, the Watchdog-based file tracker with MD5 hashing and partial state detection, the pattern-matching decision extractor, the handoff brief generator with priority ordering, the spend tracker reading token counts from proxied responses, the dashboard on port 7864, and the full CLI with all twelve commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to switch between Claude Code, Cursor, and Gemini CLI on the same project without losing context.&lt;/strong&gt;&lt;br&gt;
Point each tool at the proxy once, and every subsequent tool switch gets an automatic handoff brief. The receiving tool knows which files changed, which tasks are in progress, and which files should not be touched - without you writing a single summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the spend dashboard to measure which AI tool is most cost-efficient for your workflow.&lt;/strong&gt;&lt;br&gt;
The dashboard breaks down cost per tool, per project, and per session. The "cost per file changed" metric tells you which tool delivers the most work per dollar - a data-driven way to decide which tool to reach for on different task types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the handoff brief preview before switching.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;toolrouter handoff&lt;/code&gt; before switching tools to see exactly what brief the next tool will receive. This lets you verify the context is accurate before handing off on a complex task where a wrong assumption by the next tool could cause real damage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional tool integrations.&lt;/strong&gt;&lt;br&gt;
The proxy currently supports Claude Code, Cursor, Gemini CLI, and Ollama via their respective API base URL environment variables. Any tool that accepts an OpenAI-compatible API base URL can be pointed at the proxy using the same pattern - no changes to ToolRouter needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;ToolRouter makes multi-tool AI development practical. Context persists across tool switches through automatically generated handoff briefs. Spend is tracked in real time with model-accurate pricing. The proxy is transparent - your tools and API keys are unchanged.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Tool-Router" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Tool-Router&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Context Time Machine: Forensic Investigation of What Your Agent Actually Saw</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 16 May 2026 11:10:19 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/contexttimemachine-forensic-investigation-of-what-your-agent-actually-saw-joo</link>
      <guid>https://dev.to/nilofer_tweets/contexttimemachine-forensic-investigation-of-what-your-agent-actually-saw-joo</guid>
      <description>&lt;p&gt;Long-running agent sessions fail in a specific way that is hard to debug. The agent runs 40 turns. At turn 38, it gives a wrong answer that ignores something it decided at turn 12. You look at the logs, the turn 12 decision is there. The turn 38 response is there. But you cannot see what the context window looked like at turn 38. Was the turn 12 decision still in context? Was it evicted? Was it there but semantically overwhelmed by 25 other turns?&lt;/p&gt;

&lt;p&gt;This is the forensic problem that ContextTimeMachine solves. It is different from real-time session monitoring, it is for deep post-hoc investigation of what happened during a session, after it has already run. The key insight it is built on: the context window at any given turn is deterministic given the conversation history. You can reconstruct exactly what the model saw at turn 38, render it interactively, and query it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcatqc3xd1wiridxycd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcatqc3xd1wiridxycd3.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Investigation Modes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mode 1 - Timeline Navigator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The primary view is a vertical timeline of all turns in the session. Each turn shows the turn number, agent name if available, turn type, token count at that turn, and a sparkline showing how the context composition changed.&lt;/p&gt;

&lt;p&gt;Click any turn to travel to it - the context window at that exact point reconstructs and renders in the main panel. You see exactly what the model saw: every message in order, with token counts, with a red line showing where the context would have been truncated if it exceeded the model's limit. Scrub through turns with keyboard arrows. Watch the context window evolve turn by turn. See turns disappear as eviction happens. See tool results arrive and push older content further back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode 2 - Fact Tracker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You know something specific, a decision made at turn 5, a fact retrieved at turn 15, a user instruction given at turn 3. You want to know: at what turn did this fact leave the context window?&lt;/p&gt;

&lt;p&gt;Enter any text snippet in the Fact Tracker search box. ContextTimeMachine embeds it locally using sentence-transformers, then searches every turn's context snapshot for the nearest matching content. It renders a presence chart, a horizontal bar across all turns colored green when the fact is present or red when absent and shows the exact turn where the fact entered context and the exact turn where it left.&lt;/p&gt;

&lt;p&gt;This answers the most common debugging question for long agent sessions: "When exactly did the agent stop knowing X?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode 3 - Divergence Finder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have two agent sessions that started identically but ended differently. One succeeded, one failed. Load both sessions and ContextTimeMachine finds the earliest turn where their context windows diverged where they started seeing different content and highlights that turn as the likely root cause of the different outcomes.&lt;/p&gt;

&lt;p&gt;It shows a side-by-side comparison of the two context windows at the divergence point with diffed content highlighted. This is the automated version of the manual debugging process every team does when comparing "the run that worked" against "the run that didn't."&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                    ContextTimeMachine                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Frontend (React)                                               │
│  ├─ TimelineNavigator    — Turn-by-turn timeline scrubber       │
│  ├─ ContextPanel         — Renders reconstructed context        │
│  ├─ FactTracker          — Fact presence chart                  │
│  └─ DivergenceFinder     — Two-session comparison               │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  FastAPI Backend                                                │
│  ├─ /api/session/load          — Load session from file         │
│  ├─ /api/session/{id}/profile  — Get token profile              │
│  ├─ /api/session/{id}/turn/{n} — Reconstruct context at turn    │
│  ├─ /api/session/{id}/fact     — Track fact presence            │
│  ├─ /api/divergence            — Find divergence point          │
│  └─ /api/sessions              — List all sessions              │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Core Analysis Modules                                          │
│  ├─ SessionLoader        — Load from multiple formats           │
│  ├─ ContextReconstructor — Reconstruct at any turn              │
│  ├─ FactTracker          — Track presence via embeddings        │
│  ├─ DivergenceFinder     — Find divergence points               │
│  ├─ TokenAnalyzer        — Token budget analysis                │
│  └─ EmbeddingService     — Local embeddings (all-MiniLM)        │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Storage                                                        │
│  └─ SQLite DB            — Session snapshots &amp;amp; metadata         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;pip&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quick Start&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/dakshjain-1616/context-time-machine.git
&lt;span class="nb"&gt;cd &lt;/span&gt;context-time-machine

&lt;span class="c"&gt;# Create virtual environment&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate  &lt;span class="c"&gt;# On Windows: venv\Scripts\activate&lt;/span&gt;

&lt;span class="c"&gt;# Install package&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Start the server&lt;/span&gt;
timemachine serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8000&lt;/code&gt; in your browser. The server will automatically open your browser if it can.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Loading Sessions&lt;/strong&gt;&lt;br&gt;
Sessions can be loaded from two formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From LiveContext SQLite Export:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;timemachine load &lt;span class="nt"&gt;--file&lt;/span&gt; session.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;From Generic JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;timemachine load &lt;span class="nt"&gt;--file&lt;/span&gt; session.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generic JSON format expects a &lt;code&gt;turns&lt;/code&gt; array where each turn contains a &lt;code&gt;messages&lt;/code&gt; list, a &lt;code&gt;model_id&lt;/code&gt;, and a &lt;code&gt;timestamp&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"turns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"turn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are helpful."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"token_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is 2+2?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"token_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-09T10:00:00Z"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CLI Commands&lt;/strong&gt;&lt;br&gt;
The CLI covers the full workflow from loading sessions to querying them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the web interface&lt;/span&gt;
timemachine serve

&lt;span class="c"&gt;# Load a session&lt;/span&gt;
timemachine load &lt;span class="nt"&gt;--file&lt;/span&gt; session.json

&lt;span class="c"&gt;# Track fact across session&lt;/span&gt;
timemachine fact &lt;span class="nt"&gt;--session&lt;/span&gt; &amp;lt;session-id&amp;gt; &lt;span class="nt"&gt;--fact&lt;/span&gt; &lt;span class="s2"&gt;"the user prefers JSON output"&lt;/span&gt;

&lt;span class="c"&gt;# Find divergence between two sessions&lt;/span&gt;
timemachine diverge &lt;span class="nt"&gt;--session-a&lt;/span&gt; &amp;lt;id-a&amp;gt; &lt;span class="nt"&gt;--session-b&lt;/span&gt; &amp;lt;id-b&amp;gt;

&lt;span class="c"&gt;# List all stored sessions&lt;/span&gt;
timemachine sessions

&lt;span class="c"&gt;# Clear all sessions&lt;/span&gt;
timemachine clear
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every capability the CLI and web interface expose is also available as a Python library. This makes it straightforward to integrate ContextTimeMachine into evaluation pipelines or automated debugging scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;context_time_machine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SessionLoader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ContextReconstructor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;FactTracker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DivergenceFinder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TokenAnalyzer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load session
&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SessionLoader&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Reconstruct context at turn 10
&lt;/span&gt;&lt;span class="n"&gt;reconstructor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ContextReconstructor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reconstructor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reconstruct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context at turn 10: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Messages: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Utilization: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utilization_percent&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Track a fact
&lt;/span&gt;&lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FactTracker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specific decision from turn 5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fact first appeared: Turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_appeared_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fact last present: Turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_present_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Disappeared at: Turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disappeared_at_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Analyze token budget
&lt;/span&gt;&lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TokenAnalyzer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Peak tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;peak_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; at turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;peak_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Eviction turns: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eviction_turns&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find divergence between sessions
&lt;/span&gt;&lt;span class="n"&gt;session_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_b.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;finder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DivergenceFinder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;finder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Divergence at turn: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;divergence_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Supported Session Formats
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsynbb2elgx53qdn920c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsynbb2elgx53qdn920c8.png" alt=" " width="468" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Context Reconstruction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each turn N, ContextTimeMachine loads all messages from turns 0 to N and counts the total tokens using tiktoken. If the total exceeds the model's context limit, it simulates eviction using a model-specific strategy: GPT and Claude use left-truncation (oldest messages first), DeepSeek uses a sliding window with a recency bias, and Gemma uses local-global attention sampling from the middle. System messages are never evicted regardless of which strategy applies. The result is a reconstructed context with a full token breakdown exactly what the model would have seen at that turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fact Tracking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each turn, ContextTimeMachine embeds the fact text using &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;. It then computes cosine similarity between that embedding and every message in the turn's reconstructed context. A fact is considered present if any message has a similarity above 0.75. Embeddings are cached for performance so repeated queries against the same session do not recompute embeddings. The output is a presence chart showing the fact's full lifecycle across the session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Divergence Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For two sessions, ContextTimeMachine aligns turns and analyzes up to the minimum length of the two sessions. At each turn it reconstructs the context for both sessions, embeds all messages, and computes an average maximum cosine similarity between the two context windows. When this similarity drops below 0.85, the turn is flagged as the divergence point. The output includes a message diff at the divergence point and a summary of what changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Session Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /api/session/load&lt;/code&gt; - load session from file or JSON&lt;br&gt;
&lt;code&gt;GET /api/sessions&lt;/code&gt; - list all stored sessions&lt;br&gt;
&lt;code&gt;DELETE /api/session/{id}&lt;/code&gt; - delete a session&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET /api/session/{id}/profile&lt;/code&gt; - get token profile for session&lt;br&gt;
&lt;code&gt;GET /api/session/{id}/turn/{num}&lt;/code&gt; - reconstruct context at turn&lt;br&gt;
&lt;code&gt;POST /api/session/{id}/fact&lt;/code&gt; - track fact presence&lt;br&gt;
&lt;code&gt;POST /api/divergence&lt;/code&gt; - find divergence between sessions&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Context Reconstruction: &amp;lt; 100ms for typical sessions&lt;/li&gt;
&lt;li&gt;Fact Tracking: ~1-5 seconds for full session (includes embedding)&lt;/li&gt;
&lt;li&gt;Divergence Detection: ~2-10 seconds for 2 sessions&lt;/li&gt;
&lt;li&gt;Memory: ~50-200MB per stored session (depending on size)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Dependencies
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Core&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;fastapi&lt;/strong&gt; - Web framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;uvicorn&lt;/strong&gt; - ASGI server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pydantic&lt;/strong&gt; - Data validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;click&lt;/strong&gt; - CLI framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tiktoken&lt;/strong&gt; - Token counting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sentence-transformers&lt;/strong&gt; - Local embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;numpy&lt;/strong&gt; - Numerical operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqlalchemy&lt;/strong&gt; - Database ORM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;aiofiles&lt;/strong&gt; - Async file operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;br&gt;
React, Tailwind CSS, Framer Motion, Recharts&lt;/p&gt;

&lt;h2&gt;
  
  
  Known Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Frontend is a React stub - core analysis is fully functional&lt;/li&gt;
&lt;li&gt;LangSmith format not yet implemented&lt;/li&gt;
&lt;li&gt;No streaming support for very large sessions (&amp;gt;10k turns)&lt;/li&gt;
&lt;li&gt;Embedding cache cleared on restart&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Future Enhancements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Complete React frontend with real-time updates&lt;/li&gt;
&lt;li&gt;WebSocket streaming for large sessions&lt;/li&gt;
&lt;li&gt;LangSmith format support&lt;/li&gt;
&lt;li&gt;Multi-session comparison UI&lt;/li&gt;
&lt;li&gt;Export to markdown/HTML&lt;/li&gt;
&lt;li&gt;Attention visualization&lt;/li&gt;
&lt;li&gt;Custom eviction strategy support&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a forensic debugging tool for long-running agent sessions, one that could reconstruct the exact context window at any historical turn, track when specific facts entered and left context using semantic embeddings, and find the earliest point where two divergent sessions started seeing different content. The tool needed to support multiple session formats, expose a Python API alongside the web interface, and work entirely offline with local embeddings.&lt;/p&gt;

&lt;p&gt;NEO handled all 12 specification steps autonomously, building the &lt;code&gt;SessionLoader&lt;/code&gt; with support for LiveContext SQLite, generic JSON, and raw conversation formats, the &lt;code&gt;ContextReconstructor&lt;/code&gt; with model-specific eviction strategies for GPT, Claude, DeepSeek, and Gemma, the &lt;code&gt;FactTracker&lt;/code&gt; with &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; embeddings and cosine similarity scoring, the &lt;code&gt;DivergenceFinder&lt;/code&gt; with turn-aligned context comparison, the &lt;code&gt;TokenAnalyzer&lt;/code&gt; for peak token and eviction turn detection, the FastAPI backend with all six API endpoints, the SQLite storage layer via SQLAlchemy, the Click CLI with all six commands, and the full 58-test suite covering all core modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to find the root cause of long-session failures.&lt;/strong&gt;&lt;br&gt;
When an agent gives a wrong answer deep into a long session, load the session into ContextTimeMachine, travel to the failure turn in the Timeline Navigator, and see exactly what was in context at that point. The reconstructed view shows every message the model saw, in order, with token counts, so you can see immediately whether the relevant context was present or had been evicted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Fact Tracker to measure context retention across your agent design.&lt;/strong&gt;&lt;br&gt;
Before settling on a context management strategy for your agent, run Fact Tracker against a set of real sessions. The presence chart for key decisions and instructions tells you at what turn they reliably drop out of context giving you a data-driven basis for choosing context window sizes, eviction strategies, or compression approaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Divergence Finder to debug non-deterministic agent behaviour.&lt;/strong&gt;&lt;br&gt;
When two runs of the same agent with the same input produce different outcomes, load both into Divergence Finder. The tool identifies the exact turn where their context windows started differing and shows a diff of what changed, turning a difficult debugging problem into a specific, actionable finding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional session format parsers.&lt;/strong&gt;&lt;br&gt;
SessionLoader already handles three formats following a common interface. Adding a new format - LangSmith is listed as planned, means implementing the same loader interface for the new format. It is then immediately available in the CLI, the Python API, and the web interface without touching any of the analysis modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;ContextTimeMachine makes the context window visible. Instead of inferring what the model saw from its outputs, you can reconstruct and inspect the exact context at any turn, track when specific information entered and left the window, and find where two sessions diverged. For teams debugging long-running agents, that visibility is the difference between guessing and knowing.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/ContextTimeMachine" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/ContextTimeMachine&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
