<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community:  HJS Foundation</title>
    <description>The latest articles on DEV Community by  HJS Foundation (@hjs-foundation).</description>
    <link>https://dev.to/hjs-foundation</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3799777%2F7741370c-7b73-4cda-ac68-88f6912fd9ce.png</url>
      <title>DEV Community:  HJS Foundation</title>
      <link>https://dev.to/hjs-foundation</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hjs-foundation"/>
    <language>en</language>
    <item>
      <title>My AI Agent Could See 167 Tools. Then I Told It to shutup.</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Mon, 13 Apr 2026 11:24:18 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/my-ai-agent-could-see-167-tools-then-i-told-it-to-shutup-1kp9</link>
      <guid>https://dev.to/hjs-foundation/my-ai-agent-could-see-167-tools-then-i-told-it-to-shutup-1kp9</guid>
      <description>&lt;p&gt;&lt;strong&gt;Token usage dropped. Accuracy improved. And I built a 200-line Python proxy to prove it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) was supposed to be the universal remote for AI agents. Connect once, and your agent can interact with GitHub, Jira, Slack, filesystems, databases—you name it.&lt;/p&gt;

&lt;p&gt;But here's what nobody tells you: &lt;strong&gt;connect four MCP servers, and your agent burns 60,000 tokens before you even say "hello."&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Redis ran the numbers. A typical setup with Redis, GitHub, Jira, and Grafana—four servers, 167 tools—consumes &lt;strong&gt;~60,000 tokens upfront&lt;/strong&gt; just loading tool descriptions. In production, it's often &lt;strong&gt;150,000+ tokens&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Atlassian found their own MCP server alone consumes &lt;strong&gt;~10,000 tokens&lt;/strong&gt; for Jira and Confluence. GitHub's official server exposes &lt;strong&gt;94 tools&lt;/strong&gt; and chews through &lt;strong&gt;~17,600 tokens&lt;/strong&gt; per request. Combine several, and you hit &lt;strong&gt;30,000+ tokens&lt;/strong&gt; of pure metadata—before your agent solves anything. &lt;/p&gt;

&lt;p&gt;Every extra tool is a chance to pick the wrong one. Redis measured &lt;strong&gt;42% tool selection accuracy&lt;/strong&gt; without filtering. The model gets lost in the noise, grabs the wrong tool, overwrites data, or sends requests into the void. &lt;/p&gt;

&lt;p&gt;We gave agents unlimited power. And they became &lt;strong&gt;slower, dumber, and more expensive&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solutions (and Why They're Not Enough)
&lt;/h2&gt;

&lt;p&gt;The industry noticed. Multiple solutions emerged:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Core Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regex-based filtering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;mcpwrapped&lt;/code&gt;, &lt;code&gt;Tool Filter MCP&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;You must manually configure which tools to hide. 167 tools? Good luck.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema compression&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Atlassian &lt;code&gt;mcp-compressor&lt;/code&gt; (97% reduction)&lt;/td&gt;
&lt;td&gt;Strips descriptions to save tokens, but accuracy drops—models can't tell &lt;code&gt;create_jira_issue&lt;/code&gt; from &lt;code&gt;create_confluence_page&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool Search (Anthropic)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code built-in&lt;/td&gt;
&lt;td&gt;85% token reduction, but only &lt;strong&gt;34% selection accuracy&lt;/strong&gt; in independent testing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector search (Redis)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Redis Tool Filtering&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;98% token reduction, 8x faster, 2x accuracy&lt;/strong&gt;—but requires Redis infrastructure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hybrid search (Stacklok)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MCP Optimizer&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;94% accuracy&lt;/strong&gt; on 2,792 tools, but closed-source commercial product.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All of them fall into one of two traps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Manual configuration&lt;/strong&gt;: You have to know in advance which tools to hide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavy infrastructure&lt;/strong&gt;: You need Redis, a cloud service, or a commercial license.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What I wanted was simple: &lt;strong&gt;zero-config, 100% local, and smart enough to figure out what tools I actually need.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So I built it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing &lt;code&gt;shutup-mcp&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;shutup&lt;/code&gt; is an MCP proxy that shows your agent only the tools it actually needs—zero config, 100% local, no API keys.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;shutup &lt;span class="nt"&gt;--config&lt;/span&gt; ~/claude_desktop_config.json &lt;span class="nt"&gt;--intent&lt;/span&gt; &lt;span class="s2"&gt;"read and write files"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Behind the scenes, &lt;code&gt;shutup&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reads your MCP config&lt;/strong&gt; and discovers all connected servers—filesystem, GitHub, Jira, whatever.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fetches all tool definitions&lt;/strong&gt; and builds a local embedding index using &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; (~80MB, runs entirely offline).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watches for changes&lt;/strong&gt;—add a new MCP server, &lt;code&gt;shutup&lt;/code&gt; rebuilds the index automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filters tools by intent&lt;/strong&gt;—when your agent requests tools, &lt;code&gt;shutup&lt;/code&gt; intercepts and returns only the top-K most relevant ones.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your agent never knows the other 162 tools exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Approach Wins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Zero Config, Actually&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;No regex. No YAML. No manual whitelists. You already have a &lt;code&gt;claude_desktop_config.json&lt;/code&gt;. &lt;code&gt;shutup&lt;/code&gt; reads it directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filesystem"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-filesystem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/tmp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"github"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-github"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fetch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-fetch"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;shutup&lt;/code&gt; connects to all three, aggregates their tools, and filters them intelligently. No extra configuration files needed.&lt;/p&gt;
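&lt;p&gt;As a rough sketch of that discovery step (the function name is illustrative, not &lt;code&gt;shutup&lt;/code&gt;'s actual API), parsing the &lt;code&gt;mcpServers&lt;/code&gt; block is just a JSON walk:&lt;/p&gt;

```python
import json

def discover_servers(config_path: str) -> dict[str, dict]:
    """Read a Claude-style config and return {server_name: launch_spec}."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return {
        name: {"command": spec["command"], "args": spec.get("args", [])}
        for name, spec in config.get("mcpServers", {}).items()
    }
```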

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Intent-Based Filtering&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Most proxies hide tools based on names or regex patterns. &lt;code&gt;shutup&lt;/code&gt; hides tools based on &lt;strong&gt;what you're actually trying to do&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Say "read and write files"—&lt;code&gt;shutup&lt;/code&gt; returns filesystem tools, hiding GitHub and fetch tools.&lt;/p&gt;

&lt;p&gt;Say "create a GitHub issue"—&lt;code&gt;shutup&lt;/code&gt; surfaces GitHub tools while hiding filesystem operations.&lt;/p&gt;

&lt;p&gt;It treats tool selection as a &lt;strong&gt;retrieval problem&lt;/strong&gt;, not a reasoning one—the same insight that drove Redis to 98% token reduction. &lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Multi-Server Aggregation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is where &lt;code&gt;shutup&lt;/code&gt; differs from most open-source alternatives. It doesn't just filter one MCP server—it &lt;strong&gt;aggregates all of them&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When Stacklok analyzed 2,792 tools, they found &lt;strong&gt;94% selection accuracy&lt;/strong&gt; using hybrid search. But their Optimizer is a commercial product. &lt;code&gt;shutup&lt;/code&gt; brings the same pattern—semantic retrieval across multiple servers—to an &lt;strong&gt;open-source, zero-infrastructure&lt;/strong&gt; tool. &lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Privacy-First, 100% Local&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Two embedding backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;sentence-transformers&lt;/code&gt;&lt;/strong&gt; (default): Downloads &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; once (~80MB), runs entirely offline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ollama&lt;/code&gt;&lt;/strong&gt;: Use &lt;code&gt;nomic-embed-text&lt;/code&gt; or any Ollama embedding model. Completely air-gapped.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No API keys. No telemetry. No cloud dependencies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark Context (Why This Matters)
&lt;/h2&gt;

&lt;p&gt;Let's put numbers to the problem.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Tools Loaded&lt;/th&gt;
&lt;th&gt;Token Overhead (Est.)&lt;/th&gt;
&lt;th&gt;Selection Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single MCP server (GitHub)&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;~17,600&lt;/td&gt;
&lt;td&gt;79-88% (Opus 4.5)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Four servers (Redis+GitHub+Jira+Grafana)&lt;/td&gt;
&lt;td&gt;167&lt;/td&gt;
&lt;td&gt;~60,000&lt;/td&gt;
&lt;td&gt;~42% (without filtering)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise setup (10+ servers)&lt;/td&gt;
&lt;td&gt;500+&lt;/td&gt;
&lt;td&gt;150,000+&lt;/td&gt;
&lt;td&gt;&amp;lt; 30%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sources: Atlassian, Redis, Stacklok, Anthropic &lt;/p&gt;

&lt;p&gt;Now look at what filtering achieves:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;th&gt;Token Reduction&lt;/th&gt;
&lt;th&gt;Selection Accuracy&lt;/th&gt;
&lt;th&gt;Infrastructure Required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Tool Search&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;34% (2,792 tools)&lt;/td&gt;
&lt;td&gt;Built into Claude&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atlassian mcp-compressor&lt;/td&gt;
&lt;td&gt;70-97%&lt;/td&gt;
&lt;td&gt;Drops at high compression&lt;/td&gt;
&lt;td&gt;Proxy only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis Tool Filtering&lt;/td&gt;
&lt;td&gt;98%&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;Redis + vector DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stacklok MCP Optimizer&lt;/td&gt;
&lt;td&gt;60-85%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;td&gt;Commercial platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;shutup-mcp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~98% (projected)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TBD (benchmarking)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Zero&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;shutup&lt;/code&gt; uses the same architectural pattern as Redis (vector embeddings + semantic search) but without the Redis dependency. It's the "Redis approach" in a &lt;strong&gt;single pip install&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works (Under the Hood)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent (Claude Code / Cursor / Windsurf)
    ↓
shutup-mcp (stdio proxy)
    ↓
┌─────────────────────────┐
│ ServerManager           │
│ - Parses mcp.json       │
│ - Manages connections   │
│ - Watches for changes   │
└─────────────────────────┘
    ↓
┌─────────────────────────┐
│ ToolEmbedder            │
│ - Builds local index    │
│ - Cosine similarity     │
│ - Returns top-K tools   │
└─────────────────────────┘
    ↓
Upstream MCP Servers (filesystem, github, fetch, …)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Loop
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Startup&lt;/strong&gt;: Parse &lt;code&gt;claude_desktop_config.json&lt;/code&gt;, connect to each MCP server, fetch tool definitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embed&lt;/strong&gt;: For each tool, create text &lt;code&gt;"{name}: {description}"&lt;/code&gt; and embed using chosen backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request&lt;/strong&gt;: User provides intent (e.g., &lt;code&gt;--intent "read and write files"&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter&lt;/strong&gt;: Compute cosine similarity, return top-K tools (default K=5).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy&lt;/strong&gt;: Forward &lt;code&gt;tools/list&lt;/code&gt; and &lt;code&gt;tools/call&lt;/code&gt; requests transparently.&lt;/li&gt;
&lt;/ol&gt;
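&lt;p&gt;Steps 2 and 4 can be sketched in a few lines of dependency-free Python. Here a hashing-based toy embedder stands in for &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;, so the function names and scores are illustrative only:&lt;/p&gt;

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hashing-based stand-in for a real embedding model (illustration only)."""
    vec = [0.0] * dim
    for token in text.lower().replace("_", " ").split():
        vec[int(hashlib.sha256(token.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are already unit-length

def top_k_tools(tools: dict[str, str], intent: str, k: int = 5) -> list[str]:
    """tools maps name -> description; return the k names closest to the intent."""
    query = toy_embed(intent)
    scored = [(cosine(toy_embed(f"{name}: {desc}"), query), name)
              for name, desc in tools.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

&lt;p&gt;A real embedding model captures synonyms the toy hasher can't, but the control flow is the same: embed every &lt;code&gt;"{name}: {description}"&lt;/code&gt; string once, then rank by cosine similarity for each intent.&lt;/p&gt;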

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;shutup &lt;span class="nt"&gt;--config&lt;/span&gt; ~/Library/Application&lt;span class="se"&gt;\ &lt;/span&gt;Support/Claude/claude_desktop_config.json &lt;span class="se"&gt;\&lt;/span&gt;
         &lt;span class="nt"&gt;--intent&lt;/span&gt; &lt;span class="s2"&gt;"create a GitHub issue about the API outage"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
         &lt;span class="nt"&gt;--top-k&lt;/span&gt; 3

&lt;span class="o"&gt;[&lt;/span&gt;shutup] Loading config: claude_desktop_config.json
&lt;span class="o"&gt;[&lt;/span&gt;shutup] Connected to 3 MCP servers &lt;span class="o"&gt;(&lt;/span&gt;filesystem, github, fetch&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;shutup] Fetched 47 total tools
&lt;span class="o"&gt;[&lt;/span&gt;shutup] Intent: &lt;span class="s2"&gt;"create a GitHub issue about the API outage"&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;shutup] Returning 3/47 tools:
  - github__create_issue
  - github__list_issues
  - github__get_repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent only sees 3 tools. Token overhead drops from ~25,000 to ~300.&lt;/p&gt;
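&lt;p&gt;Those figures line up with the common ~4-characters-per-token heuristic (the schema sizes below are illustrative, not measured):&lt;/p&gt;

```python
def estimated_tokens(tool_schemas: list[str]) -> int:
    """Rough token count for a set of tool definitions (~4 chars per token)."""
    return sum(len(s) for s in tool_schemas) // 4

# 47 tools at ~2,000 characters of schema each, vs. 3 filtered tools at ~400
unfiltered = estimated_tokens(["x" * 2000] * 47)   # ~23,500 tokens
filtered   = estimated_tokens(["x" * 400] * 3)     # 300 tokens
```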




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;shutup-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Default: sentence-transformers (auto-downloads model)&lt;/span&gt;
shutup &lt;span class="nt"&gt;--config&lt;/span&gt; ~/Library/Application&lt;span class="se"&gt;\ &lt;/span&gt;Support/Claude/claude_desktop_config.json &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--intent&lt;/span&gt; &lt;span class="s2"&gt;"your task description"&lt;/span&gt;

&lt;span class="c"&gt;# Privacy mode: use Ollama&lt;/span&gt;
shutup &lt;span class="nt"&gt;--config&lt;/span&gt; ~/Library/Application&lt;span class="se"&gt;\ &lt;/span&gt;Support/Claude/claude_desktop_config.json &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--intent&lt;/span&gt; &lt;span class="s2"&gt;"read and write files"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--embedder&lt;/span&gt; ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Integrate with Claude Code
&lt;/h3&gt;

&lt;p&gt;In your &lt;code&gt;claude_desktop_config.json&lt;/code&gt;, replace direct MCP server entries with &lt;code&gt;shutup&lt;/code&gt; as a proxy, or run &lt;code&gt;shutup&lt;/code&gt; as a standalone gateway. Full integration docs are on the GitHub repo.&lt;/p&gt;
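&lt;p&gt;For illustration only (check the repo for the exact entry), a proxy setup might register &lt;code&gt;shutup&lt;/code&gt; as the sole MCP server and point it at your original config:&lt;/p&gt;

```json
{
  "mcpServers": {
    "shutup": {
      "command": "shutup",
      "args": ["--config", "/path/to/original_config.json", "--intent", "read and write files"]
    }
  }
}
```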




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;This is &lt;strong&gt;v0.1.0&lt;/strong&gt;—a minimal, functional proxy that proves the pattern works. I'm actively working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Benchmarking&lt;/strong&gt;: Head-to-head comparison with Anthropic Tool Search, mcp-compressor, and Stacklok Optimizer (public dataset, reproducible).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid search&lt;/strong&gt;: BM25 + embeddings for better exact-match performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust rewrite&lt;/strong&gt;: Move embedding and similarity computation to Rust for sub-millisecond latency at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool usage analytics&lt;/strong&gt;: Show which tools your agent &lt;em&gt;actually&lt;/em&gt; uses vs. what gets filtered out.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;I was tired of watching my agent burn tokens on tools it would never use. Tired of "pick the wrong tool" errors. Tired of configuring regex filters every time I added a new MCP server.&lt;/p&gt;

&lt;p&gt;The Redis team proved the pattern: &lt;strong&gt;treat tool selection as retrieval&lt;/strong&gt;. 98% token reduction. 8x faster. Double the accuracy.&lt;/p&gt;

&lt;p&gt;But their solution required Redis. Stacklok's required a commercial platform. Anthropic's couldn't reliably find the right tools.&lt;/p&gt;

&lt;p&gt;I wanted something that worked &lt;strong&gt;out of the box, completely local, with zero configuration&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I built it. In 200 lines of Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/hjs-spec/shutup-mcp" rel="noopener noreferrer"&gt;github.com/hjs-spec/shutup-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;code&gt;pip install shutup-mcp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the repo if this solves a problem for you. PRs welcome—especially if you want to help with benchmarking or the Rust rewrite.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Your agent doesn't need 167 tools. It needs 3. Tell it to &lt;code&gt;shutup&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>I Logged Every Decision My AI Agent Made for a Week. Here's What I Learned.</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Sat, 11 Apr 2026 02:51:26 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/i-logged-every-decision-my-ai-agent-made-for-a-week-heres-what-i-learned-2cp5</link>
      <guid>https://dev.to/hjs-foundation/i-logged-every-decision-my-ai-agent-made-for-a-week-heres-what-i-learned-2cp5</guid>
      <description>&lt;p&gt;&lt;em&gt;10,847 decision events. 3 surprising insights. And one $23 wake-up call that changed how I think about agent observability.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The $23 Mystery
&lt;/h2&gt;

&lt;p&gt;I run a multi-agent system that does market research. Three agents, one goal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scout&lt;/strong&gt;: Gathers data from APIs and web sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyst&lt;/strong&gt;: Processes raw data into insights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writer&lt;/strong&gt;: Produces the final report&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It worked fine. Until it didn't.&lt;/p&gt;

&lt;p&gt;One Monday morning, I found a report that was &lt;strong&gt;48 hours late&lt;/strong&gt; and cost &lt;strong&gt;$23 in API credits&lt;/strong&gt;. Normal runs take 2 hours and cost around $4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I checked everything. API rate limits? No. Model downtime? No. LangSmith traces showed the chain completed successfully. Each agent reported "task done." Every log line was green.&lt;/p&gt;

&lt;p&gt;But somewhere between "task done" and "report ready," &lt;strong&gt;46 hours vanished&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's when I realized: I had no idea what my agents were actually &lt;em&gt;deciding&lt;/em&gt; to do. I only knew what they &lt;em&gt;did&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So I ran an experiment.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Experiment: 50 Lines of Code, One Week, Every Decision
&lt;/h2&gt;

&lt;p&gt;I added a lightweight decision logger to my agent orchestrator. Not tracing API calls—we already have that. I wanted to log &lt;strong&gt;decisions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;J&lt;/code&gt; (Judge): An agent initiates a new task or makes a determination&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;D&lt;/code&gt; (Delegate): An agent hands off work to another agent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T&lt;/code&gt; (Terminate): An agent ends a task, successfully or not&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V&lt;/code&gt; (Verify): An agent validates someone else's output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the core code (simplified—full version on GitHub):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid4&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hash_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create a content-addressable hash for the decision payload.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha256:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_decision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;what&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;verb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;who&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;hash_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;what&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;what&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;when&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nonce&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-pipeline-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;# Async write—doesn't block the agent
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;write_to_ndjson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
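&lt;p&gt;The &lt;code&gt;write_to_ndjson&lt;/code&gt; helper isn't shown above; a minimal version (path and serialization choices are mine, not necessarily what's in the repo) could be:&lt;/p&gt;

```python
import asyncio
import json

def _append_line(path: str, line: str) -> None:
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)

async def write_to_ndjson(event: dict, path: str = "decisions.ndjson") -> None:
    """Append one decision event as a single NDJSON line without blocking."""
    line = json.dumps(event, separators=(",", ":")) + "\n"
    # File I/O runs in a worker thread so the agent's event loop never stalls.
    await asyncio.to_thread(_append_line, path, line)
```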



&lt;p&gt;I deployed it on a Tuesday. One week later, I had &lt;strong&gt;10,847 decision events&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  Discovery #1: 35% of Delegations Were Circular
&lt;/h2&gt;

&lt;p&gt;My agents delegate work to each other constantly. Scout hands raw data to Analyst. Analyst hands insights to Writer. Writer asks Scout for clarification. Normal.&lt;/p&gt;

&lt;p&gt;But when I graphed the &lt;code&gt;D&lt;/code&gt; (Delegate) events by &lt;code&gt;ref&lt;/code&gt; chain, I saw something unexpected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scout → Analyst → Scout → Analyst → Scout (terminates)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1,203 times&lt;/strong&gt; in one week, agents created delegation loops of length ≥ 2. Each loop burned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~2 seconds of compute time&lt;/li&gt;
&lt;li&gt;One extra LLM call for the handoff reasoning&lt;/li&gt;
&lt;li&gt;Token costs for the delegation message itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total waste: &lt;strong&gt;~40 minutes of compute and $3.20 in API costs&lt;/strong&gt;. Not catastrophic. But completely invisible until I logged the &lt;code&gt;D&lt;/code&gt; events with their &lt;code&gt;ref&lt;/code&gt; chains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: I added a simple rule—if an agent receives a delegation from someone it already delegated to in the current chain, break and escalate. Loops dropped to near zero.&lt;/p&gt;
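&lt;p&gt;The guard itself is tiny. Here is a minimal sketch of the idea; the function names and event fields are illustrative, not the logger's actual API:&lt;/p&gt;

```python
def has_delegation_loop(chain, next_target):
    """Return True if delegating to next_target would revisit an agent
    already in the current delegation chain."""
    return next_target in chain

def delegate(chain, who, to):
    # Break and escalate instead of creating Scout -> Analyst -> Scout -> ...
    if has_delegation_loop(chain, to):
        return {"verb": "T", "who": who, "reason": "delegation_loop"}
    return {"verb": "D", "who": who, "to": to, "chain": chain + [to]}

ok = delegate(["Scout"], "Scout", "Analyst")               # normal handoff
loop = delegate(["Scout", "Analyst"], "Analyst", "Scout")  # would revisit Scout
```

&lt;p&gt;In practice the chain would be reconstructed from the &lt;code&gt;ref&lt;/code&gt; links in the decision log rather than passed around by hand.&lt;/p&gt;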




&lt;h2&gt;
  
  
  Discovery #2: Failed Tool Calls Retried 7 Times Before Giving Up
&lt;/h2&gt;

&lt;p&gt;One of Scout's jobs is scraping competitor pricing from public websites. Occasionally, a site times out. Normal.&lt;/p&gt;

&lt;p&gt;What wasn't normal: &lt;strong&gt;the retry behavior&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a tool call failed, my agent retried—on average—&lt;strong&gt;7 times&lt;/strong&gt; before terminating. The worst offender was that scraping tool. One timeout at 11:23 PM turned into:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;11:23 PM - Tool call fails (timeout)
11:24 PM - Retry 1 fails
11:26 PM - Retry 2 fails
11:29 PM - Retry 3 fails
...
03:17 AM - Retry 11 fails, agent finally terminates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four hours. Eleven retries. Each one a fresh API call with a new browser instance. Cost of that single failure chain: &lt;strong&gt;$1.87&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Across the week, excessive retries wasted &lt;strong&gt;~$9.40&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: I capped retries at 3 for non-critical tools. If it still fails, the agent logs a &lt;code&gt;T&lt;/code&gt; with &lt;code&gt;reason: "tool_unavailable"&lt;/code&gt; and moves on with partial data. The report might be slightly less complete, but it arrives on time and under budget.&lt;/p&gt;
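&lt;p&gt;A minimal sketch of that retry cap, assuming a tool that raises on timeout (the names are illustrative):&lt;/p&gt;

```python
MAX_RETRIES = 3  # cap for non-critical tools

def call_with_cap(tool):
    """Try a non-critical tool at most MAX_RETRIES times; on exhaustion,
    emit a T event and let the agent continue with partial data."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return {"ok": True, "data": tool(), "attempts": attempt}
        except TimeoutError:
            continue
    # Give up instead of retrying until 3 AM
    return {"ok": False, "attempts": MAX_RETRIES,
            "event": {"verb": "T", "reason": "tool_unavailable"}}

def flaky_scraper():
    raise TimeoutError("site timed out")

result = call_with_cap(flaky_scraper)
```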




&lt;h2&gt;
  
  
  Discovery #3: The 3 AM Termination Storm
&lt;/h2&gt;

&lt;p&gt;At 3:14 AM on Wednesday, I saw something strange in the logs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;47 &lt;code&gt;T&lt;/code&gt; (Terminate) events within 90 seconds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Normal rate is ~10 per hour.&lt;/p&gt;

&lt;p&gt;Every single one had &lt;code&gt;reason: "empty_response"&lt;/code&gt;. Turns out, a data provider's API had a brief outage, returning &lt;code&gt;200 OK&lt;/code&gt; with an empty body. Every parallel agent hit it simultaneously, received nothing, and terminated immediately.&lt;/p&gt;

&lt;p&gt;No alert fired. From the orchestrator's perspective, all tasks "completed successfully"—they just completed with &lt;strong&gt;zero data&lt;/strong&gt;. The final report that morning was 40% shorter than usual, and I had no idea why until I dug through the decision logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: I added a simple monitor—if &lt;code&gt;T&lt;/code&gt; events with &lt;code&gt;reason: "empty_response"&lt;/code&gt; exceed 5 per minute, pause the pipeline and alert. The next time that API flaked, I knew within 60 seconds.&lt;/p&gt;
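&lt;p&gt;The monitor is just a sliding window per termination reason. A sketch with the same threshold and window as the rule above (the class and field names are my own, not the logger's API):&lt;/p&gt;

```python
from collections import deque

class StormMonitor:
    """Alert when more than `threshold` T events with the same reason
    arrive within `window` seconds."""
    def __init__(self, threshold=5, window=60):
        self.threshold = threshold
        self.window = window
        self.events = {}  # reason -> deque of timestamps

    def record(self, reason, ts):
        q = self.events.setdefault(reason, deque())
        q.append(ts)
        # Drop timestamps that fell out of the window
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q) > self.threshold  # True means: pause pipeline, alert

monitor = StormMonitor()
# 8 terminations with the same reason in 8 seconds trips the alert
alerts = [monitor.record("empty_response", t) for t in range(8)]
```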




&lt;h2&gt;
  
  
  Discovery #4: Verification Was Silently Slowing Down
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;V&lt;/code&gt; (Verify) event happens when one agent checks another's output. Analyst produces insights; Writer verifies they're coherent before including them.&lt;/p&gt;

&lt;p&gt;I noticed something in the timestamps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Day&lt;/th&gt;
&lt;th&gt;Avg Time Between &lt;code&gt;J&lt;/code&gt; and &lt;code&gt;V&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tuesday&lt;/td&gt;
&lt;td&gt;1.2 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wednesday&lt;/td&gt;
&lt;td&gt;1.8 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thursday&lt;/td&gt;
&lt;td&gt;2.9 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Friday&lt;/td&gt;
&lt;td&gt;4.1 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monday&lt;/td&gt;
&lt;td&gt;4.7 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The verification service was &lt;strong&gt;drifting&lt;/strong&gt;. Not enough to break anything yet, but a clear trend. It turned out the vector database used for fact-checking had accumulated six months of stale embeddings, and queries were getting steadily slower.&lt;/p&gt;

&lt;p&gt;Without decision-level timestamps, I would have found out only after it started timing out and breaking the pipeline. Instead, I scheduled a re-indexing job over the weekend. Tuesday morning: back to 1.3 seconds.&lt;/p&gt;
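&lt;p&gt;Computing that latency from the log is a one-pass join, assuming each &lt;code&gt;V&lt;/code&gt; event's &lt;code&gt;ref&lt;/code&gt; points at the &lt;code&gt;J&lt;/code&gt; event it verifies (the field names are illustrative):&lt;/p&gt;

```python
def verify_latencies(events):
    """Pair each V event with the J event its ref points to and
    return the J -> V delays in seconds."""
    judges = {e["id"]: e["ts"] for e in events if e["verb"] == "J"}
    return [e["ts"] - judges[e["ref"]]
            for e in events if e["verb"] == "V" and e["ref"] in judges]

events = [
    {"verb": "J", "id": "j1", "ts": 100.0},
    {"verb": "V", "id": "v1", "ref": "j1", "ts": 101.2},
    {"verb": "J", "id": "j2", "ts": 200.0},
    {"verb": "V", "id": "v2", "ref": "j2", "ts": 204.7},
]
delays = verify_latencies(events)
avg = sum(delays) / len(delays)  # daily average; plot this to see the drift
```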




&lt;h2&gt;
  
  
  What I Changed (And What You Can Steal)
&lt;/h2&gt;

&lt;p&gt;I didn't build a complex observability platform. I added &lt;strong&gt;three rules&lt;/strong&gt; to my orchestrator based on what the decision logs revealed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Circular delegation guard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;D&lt;/code&gt; chain contains duplicate &lt;code&gt;who&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Break loop, escalate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retry cap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tool call fails &amp;gt; 3 times&lt;/td&gt;
&lt;td&gt;Log &lt;code&gt;T&lt;/code&gt;, continue with partial data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Termination storm alert&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;gt; 5 &lt;code&gt;T&lt;/code&gt; with same reason in 1 min&lt;/td&gt;
&lt;td&gt;Pause pipeline, notify&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The next week's run: &lt;strong&gt;1.8 hours. $3.70.&lt;/strong&gt; And I caught an API outage before it silently corrupted a report.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters (And Why Most Agent Logs Are Useless)
&lt;/h2&gt;

&lt;p&gt;Here's what I learned: &lt;strong&gt;there's a difference between logging actions and logging decisions.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action Log&lt;/th&gt;
&lt;th&gt;Decision Log&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Called API X"&lt;/td&gt;
&lt;td&gt;"Delegated to Analyst because confidence &amp;lt; 0.7"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Task completed"&lt;/td&gt;
&lt;td&gt;"Terminated with partial data due to tool timeout"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Received response"&lt;/td&gt;
&lt;td&gt;"Verified output—hash matches, coherence score 0.82"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Action logs tell you &lt;em&gt;what&lt;/em&gt; happened. Decision logs tell you &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Without the "why," debugging multi-agent systems is just guesswork.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Get Started (Without Adopting a New Protocol)
&lt;/h2&gt;

&lt;p&gt;You don't need to rebuild your entire stack. Start simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Structured handoff logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add one JSON line every time an agent hands off work to another. Include &lt;code&gt;from&lt;/code&gt;, &lt;code&gt;to&lt;/code&gt;, &lt;code&gt;reason&lt;/code&gt;, and a &lt;code&gt;hash&lt;/code&gt; of the payload. That alone will catch delegation loops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Add decision verbs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tag each log with what kind of decision it represents: &lt;code&gt;initiated&lt;/code&gt;, &lt;code&gt;delegated&lt;/code&gt;, &lt;code&gt;terminated&lt;/code&gt;, &lt;code&gt;verified&lt;/code&gt;. This makes it searchable and graphable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Chain them together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use a &lt;code&gt;ref&lt;/code&gt; field to link events. Now you have a trace of the entire decision chain, not just isolated events.&lt;/p&gt;
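&lt;p&gt;Levels 1 through 3 fit in a few lines of plain Python. The field names below follow the levels above but are a suggestion, not a fixed schema:&lt;/p&gt;

```python
import hashlib, json, time

def handoff_event(frm, to, reason, payload, ref=None):
    """One NDJSON line per handoff: Level 1 fields (from/to/reason/hash),
    a Level 2 decision verb, and a Level 3 ref link to the parent event."""
    event = {
        "verb": "delegated",                                   # Level 2
        "from": frm,
        "to": to,
        "reason": reason,
        "hash": hashlib.sha256(payload.encode()).hexdigest(),  # Level 1
        "ref": ref,                                            # Level 3
        "ts": time.time(),
    }
    return json.dumps(event)

line1 = handoff_event("Scout", "Analyst", "raw data ready", "pricing rows")
line2 = handoff_event("Analyst", "Writer", "insights ready", "summary",
                      ref=json.loads(line1)["hash"])
```

&lt;p&gt;Append each line to an NDJSON file and the whole chain stays greppable.&lt;/p&gt;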

&lt;p&gt;&lt;strong&gt;Level 4: Add signatures (if you need non-repudiation)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're building something where audit trails matter—compliance, finance, multi-party systems—you'll want cryptographic signatures. The format I used above is compatible with &lt;a href="https://github.com/hjs-spec/jep" rel="noopener noreferrer"&gt;JEP&lt;/a&gt; (Judgment Event Protocol), which adds signing and anti-replay protection out of the box. But you can get 80% of the value with plain JSON and a &lt;code&gt;ref&lt;/code&gt; field.&lt;/p&gt;




&lt;h2&gt;
  
  
  I Open-Sourced the Logger
&lt;/h2&gt;

&lt;p&gt;The logger I used for this experiment is now open-source. It's 200 lines of Python, works with any agent framework, and writes to NDJSON so you can &lt;code&gt;cat&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt; like it's 1999.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/hjs-spec/agent-decision-logger" rel="noopener noreferrer"&gt;GitHub: agent-decision-logger&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The core logger with all four decision verbs&lt;/li&gt;
&lt;li&gt;A Mermaid visualization script (see your agents' decision chains as flowcharts)&lt;/li&gt;
&lt;li&gt;Analysis tools to detect delegation loops and termination storms&lt;/li&gt;
&lt;li&gt;Complete examples and tests&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Your Turn
&lt;/h2&gt;

&lt;p&gt;My agent wasn't broken. It was just making expensive decisions I couldn't see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the weirdest thing you've found in your agent's logs?&lt;/strong&gt; Or are you flying blind?&lt;/p&gt;

&lt;p&gt;Drop a comment—I'd love to hear what you're seeing (or not seeing) in your own systems.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>agents</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Built a "Blame Finder" for AI Agents – So You Never Have to Guess Who Broke Production</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:42:03 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/i-built-a-blame-finder-for-ai-agents-so-you-never-have-to-guess-who-broke-production-252g</link>
      <guid>https://dev.to/hjs-foundation/i-built-a-blame-finder-for-ai-agents-so-you-never-have-to-guess-who-broke-production-252g</guid>
      <description>&lt;h2&gt;
  
  
  The 3 AM Slack Message We All Fear
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Hey, the multi-agent pipeline just deleted the staging database. Any idea which agent did it?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your PM Agent says it passed a clean requirement.&lt;br&gt;&lt;br&gt;
Your Coder Agent says it followed the spec perfectly.&lt;br&gt;&lt;br&gt;
Your Verifier Agent says it never even got the output.  &lt;/p&gt;

&lt;p&gt;You spend the next 4 hours grepping through thousands of lines of logs. You find nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the Accountability Vacuum.&lt;/strong&gt; And it's a nightmare.&lt;/p&gt;

&lt;p&gt;So I built a cure: &lt;strong&gt;&lt;a href="https://github.com/hjs-spec/agent-blame-finder" rel="noopener noreferrer"&gt;Agent Blame-Finder&lt;/a&gt;&lt;/strong&gt; – an open‑source cryptographic black box for multi‑agent systems.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Does It Do?
&lt;/h2&gt;

&lt;p&gt;In 3 seconds, it tells you &lt;strong&gt;exactly which agent messed up&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;blame-finder blame incident-abc123

🎯 Verdict: Coder-Agent
💡 Reason: Input requirement was correct, but output didn&lt;span class="s1"&gt;'t match expectations
🔗 Chain:
   ✅ PM-Agent – success
   ❌ Coder-Agent – failed
   ⏳ Verifier-Agent – not reached
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No more finger‑pointing. No more log spelunking. Just a verifiable, signed receipt of every decision.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works (The 10‑Second Technical Version)
&lt;/h2&gt;

&lt;p&gt;Under the hood, it implements two IETF Internet‑Drafts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JEP (Judgment Event Protocol)&lt;/strong&gt; – a minimal, cryptographically signed log format for agent decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JAC (Judgment Accountability Chain)&lt;/strong&gt; – a &lt;code&gt;task_based_on&lt;/code&gt; field that links every decision to its parent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each time an agent does something, a JEP receipt is created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"J"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"who"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Coder-Agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"when"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1742345678&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"what"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"task_based_on"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"parent-task-hash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sig"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ed25519 signature"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four verbs – &lt;strong&gt;J&lt;/strong&gt; (Judge), &lt;strong&gt;D&lt;/strong&gt; (Delegate), &lt;strong&gt;T&lt;/strong&gt; (Terminate), &lt;strong&gt;V&lt;/strong&gt; (Verify) – are all you need to model any accountability flow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integration: One Decorator
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;blame_finder&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BlameFinder&lt;/span&gt;

&lt;span class="n"&gt;finder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlameFinder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./blackbox_logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@finder.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Coder-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requirement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Your existing logic – no changes needed
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;print(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Later, when something breaks:
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. The decorator handles hashing, signing, storage, and chain linking.&lt;/p&gt;
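&lt;p&gt;Conceptually, the &lt;code&gt;blame&lt;/code&gt; call walks the receipt chain in causal order and stops at the first failure. A simplified sketch of that idea, not the actual Blame-Finder internals:&lt;/p&gt;

```python
def find_blame(receipts):
    """Walk JEP-style receipts in causal order and return the first
    agent whose step failed; agents after it are 'not reached'.
    (Illustrative only -- the real tool also checks signatures.)"""
    chain = [(r["who"], r["status"]) for r in receipts]
    for r in receipts:
        if r["status"] == "failed":
            return {"verdict": r["who"], "chain": chain}
    return {"verdict": None, "chain": chain}

receipts = [
    {"who": "PM-Agent", "status": "success", "task_based_on": None},
    {"who": "Coder-Agent", "status": "failed", "task_based_on": "pm-hash"},
    {"who": "Verifier-Agent", "status": "not reached", "task_based_on": "coder-hash"},
]
verdict = find_blame(receipts)
```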




&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Without Blame‑Finder&lt;/th&gt;
&lt;th&gt;With Blame‑Finder&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hours of log hunting&lt;/td&gt;
&lt;td&gt;&lt;code&gt;blame-finder blame &amp;lt;id&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Maybe Agent X?" finger‑pointing&lt;/td&gt;
&lt;td&gt;Cryptographic proof&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No audit trail&lt;/td&gt;
&lt;td&gt;JEP receipts (immutable, signed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broken causality&lt;/td&gt;
&lt;td&gt;Full &lt;code&gt;task_based_on&lt;/code&gt; tree&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It’s like &lt;code&gt;git blame&lt;/code&gt; but for AI agents.&lt;br&gt;&lt;br&gt;
And because it’s based on IETF drafts, it’s not another walled garden – it’s infrastructure.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Road Ahead
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ Rust core engine (fast)&lt;/li&gt;
&lt;li&gt;✅ Python &amp;amp; TypeScript SDKs&lt;/li&gt;
&lt;li&gt;🚧 LangChain / CrewAI native adapters&lt;/li&gt;
&lt;li&gt;✅ Visual dashboard (&lt;code&gt;blame-finder dashboard&lt;/code&gt; – already works!)&lt;/li&gt;
&lt;li&gt;🚧 One‑click PDF/HTML blame reports&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Try It Right Now
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agent-blame-finder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then launch the dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;blame-finder dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll see a causality tree visualizer that looks like a Git graph – but for agent decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Contribute
&lt;/h2&gt;

&lt;p&gt;MIT licensed. We need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrations with popular agent frameworks&lt;/li&gt;
&lt;li&gt;More tests&lt;/li&gt;
&lt;li&gt;Documentation improvements&lt;/li&gt;
&lt;li&gt;Your crazy ideas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/hjs-spec/Agent-Blackbox" rel="noopener noreferrer"&gt;https://github.com/hjs-spec/Agent-Blackbox&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stop the guessing game. Start the Blame‑Finder.&lt;/em&gt; 🔍&lt;/p&gt;




&lt;p&gt;&lt;em&gt;P.S. The name is intentionally provocative. Your PM will hate it. Your CTO will love it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
      <category>agents</category>
    </item>
    <item>
      <title>Stop Debugging Black Boxes: How jac-agent Solves the 3 Hardest Pain Points in Training Production-Grade AI Agents</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Wed, 01 Apr 2026 02:23:59 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/stop-debugging-black-boxes-how-jac-agent-solves-the-3-hardest-pain-points-in-training-1j34</link>
      <guid>https://dev.to/hjs-foundation/stop-debugging-black-boxes-how-jac-agent-solves-the-3-hardest-pain-points-in-training-1j34</guid>
      <description>&lt;h2&gt;
  
  
  Subtitle
&lt;/h2&gt;

&lt;p&gt;If your training data is messy, your logs are useless, and you can’t prove why your agent failed — this is for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Training AI agents isn’t just about prompt engineering anymore.&lt;br&gt;&lt;br&gt;
If you’re building anything that touches production, you’re already hitting these walls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You can’t trace failures.&lt;/strong&gt; A bad decision derailed your pipeline — but you have no idea which step caused it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your training data is garbage.&lt;/strong&gt; You’re scraping unstructured logs to build SFT/RL datasets, wasting hours cleaning noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can’t deploy safely.&lt;/strong&gt; Regulators and auditors want proof your agent isn’t making harmful choices, but you have no way to show it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, I’m releasing &lt;code&gt;jac-agent&lt;/code&gt;: an open-source SDK built on IETF Internet-Drafts, designed to solve exactly these problems — while adding zero overhead to your training loop.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/hjs-spec/jac-agent" rel="noopener noreferrer"&gt;github.com/hjs-spec/jac-agent&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The 3 Pain Points of Training Production Agents
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. The "Black Box" Debugging Nightmare
&lt;/h3&gt;

&lt;p&gt;You run an agent for 100 steps. It makes 99 good decisions, then one catastrophic call.&lt;br&gt;&lt;br&gt;
Your logs look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO: Processing user request
INFO: Calling tool
INFO: Tool response received
ERROR: Pipeline failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You have no causal link between the steps. No way to know &lt;em&gt;why&lt;/em&gt; it failed, only that it did.&lt;/p&gt;

&lt;p&gt;This isn’t just annoying — it makes training slow, risky, and impossible to validate.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Training Data That Costs You Hours to Clean
&lt;/h3&gt;

&lt;p&gt;To fine-tune your agent, you need structured trajectories.&lt;br&gt;&lt;br&gt;
But raw logs are unstructured, inconsistent, and often missing context.&lt;br&gt;&lt;br&gt;
You end up writing brittle scripts to parse free-text outputs, only to find half the data is corrupted or incomplete.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. The "How Do We Prove It’s Safe?" Compliance Gap
&lt;/h3&gt;

&lt;p&gt;Regulators and enterprise clients are already asking:&lt;br&gt;&lt;br&gt;
&lt;em&gt;"Can you show us exactly why your agent made that decision?"&lt;/em&gt;&lt;br&gt;&lt;br&gt;
If you can’t, you can’t deploy.&lt;/p&gt;


&lt;h2&gt;
  
  
  How &lt;code&gt;jac-agent&lt;/code&gt; Fixes All 3 Problems
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;jac-agent&lt;/code&gt; isn’t just another logging library. It’s built on three open standards (JEP/HJS/JAC) to turn your agent’s decisions into &lt;strong&gt;provable, structured, and training-ready data&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Trace Failures to the Exact Step (No More Black Boxes)
&lt;/h3&gt;

&lt;p&gt;Every decision your agent makes is recorded in an immutable, cryptographically verified chain.&lt;br&gt;&lt;br&gt;
You get a clear causal path from the root task to the final action — no guesswork, no missing links.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jac_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_trace_chain&lt;/span&gt;

&lt;span class="c1"&gt;# Record decisions in your agent loop
&lt;/span&gt;&lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Route selection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judgment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Choose Route A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low congestion, high safety&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cost check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judgment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approve Route A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Under budget, valid tolls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Print the full causal trace
&lt;/span&gt;&lt;span class="nf"&gt;show_trace_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You see exactly which decision caused a failure, in seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Turn Logs Into Training Data — Automatically
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;jac-agent&lt;/code&gt;’s &lt;code&gt;task_based_on&lt;/code&gt; field structures every decision into causal chains.&lt;br&gt;&lt;br&gt;
When you’re ready to train, one call exports a clean, ready-to-use dataset for SFT/RL/DPO.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jac_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;enable_training_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;export_training_dataset&lt;/span&gt;

&lt;span class="c1"&gt;# Enable zero-overhead mode for training
&lt;/span&gt;&lt;span class="nf"&gt;enable_training_mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run your training loop as usual — logging happens in memory
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judgment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent observation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Export structured causal trajectories
&lt;/span&gt;&lt;span class="nf"&gt;export_training_dataset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No parsing. No cleaning. Just high-quality training data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Build a Provable, Auditable Safety Layer
&lt;/h3&gt;

&lt;p&gt;Every record is cryptographically signed, timestamped, and linked to the previous step.&lt;br&gt;&lt;br&gt;
You can export a formal audit report at any time to prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your agent followed its rules.&lt;/li&gt;
&lt;li&gt;Decisions were made in order.&lt;/li&gt;
&lt;li&gt;No logs were altered after the fact.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jac_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;export_audit_report&lt;/span&gt;
&lt;span class="nf"&gt;export_audit_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_audit_2026-04-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get compliance-ready evidence without changing your agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Under the Hood: Built on Open Standards
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;jac-agent&lt;/code&gt; isn’t proprietary. It’s the first reference implementation of three IETF Internet-Draft specifications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JEP&lt;/strong&gt;: Standard event format for agent decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HJS&lt;/strong&gt;: Immutable accountability layer with privacy controls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JAC&lt;/strong&gt;: Causal chain linking via &lt;code&gt;task_based_on&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No vendor lock-in.&lt;/li&gt;
&lt;li&gt;Interoperable with any agent framework.&lt;/li&gt;
&lt;li&gt;Built to evolve with open standards, not closed tools.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It in 2 Minutes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jac-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jac_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_trace_chain&lt;/span&gt;

&lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judgment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approve action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Policy check passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;show_trace_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’re already recording verified decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Training production-grade agents requires more than good prompts. It requires visibility, safety, and proof.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;jac-agent&lt;/code&gt;, you don’t have to choose between training speed and auditability — you get both.&lt;/p&gt;

&lt;p&gt;I’d love your feedback. Star the repo, open an issue, or drop a comment below.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/hjs-spec/jac-agent" rel="noopener noreferrer"&gt;github.com/hjs-spec/jac-agent&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
      <category>agents</category>
    </item>
    <item>
      <title>Full-Link Accountability for AI Agents</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Fri, 27 Mar 2026 04:11:57 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/full-link-accountability-for-ai-agents-41h6</link>
      <guid>https://dev.to/hjs-foundation/full-link-accountability-for-ai-agents-41h6</guid>
      <description>&lt;h2&gt;
  
  
  Core Event Primitives
&lt;/h2&gt;

&lt;p&gt;Four standard event types (J, D, V, T) cover the full accountability lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;J: Judge – Create and initiate a judgment/decision&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;D: Delegate – Transfer authority or assign a task&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;V: Verify – Review and validate a record&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;T: Terminate – End a judgment or task lifecycle&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
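&lt;p&gt;A minimal sketch of that lifecycle, assuming a hypothetical &lt;code&gt;make_event&lt;/code&gt; helper (signatures and hashes are omitted for brevity; a real record is signed, and the parent hashes here are placeholders):&lt;/p&gt;

```python
import time
import uuid

def make_event(verb, who, task_based_on=None, ref=""):
    # Minimal JEP-style record; the sig field is omitted in this sketch
    return {
        "jep": "1",
        "verb": verb,
        "who": who,
        "when": int(time.time()),
        "nonce": str(uuid.uuid4()),
        "task_based_on": task_based_on,
        "ref": ref,
    }

# Full lifecycle: Judge starts a chain, Delegate hands off,
# Verify reviews a prior event, Terminate closes the task
j = make_event("J", "did:example:orchestrator")
d = make_event("D", "did:example:orchestrator", task_based_on="hash-of-j")
v = make_event("V", "did:example:auditor", task_based_on="hash-of-d", ref="id-of-d")
t = make_event("T", "did:example:orchestrator", task_based_on="hash-of-d")

print([e["verb"] for e in (j, d, v, t)])  # ['J', 'D', 'V', 'T']
```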

&lt;h1&gt;
  
  
  Pain Points &amp;amp; Technical Solutions
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Pain Point 1: Broken chain in multi-agent workflows, unable to trace root cause
&lt;/h2&gt;

&lt;p&gt;Trigger Primitives: J + D&lt;/p&gt;

&lt;p&gt;Solution: Add a task_based_on field to every record to enforce a hash reference to the parent task. A null value indicates the start of a chain; a populated value links to a preceding action, ensuring full end-to-end traceability.&lt;/p&gt;
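&lt;p&gt;The traceability this enables can be sketched as a walk back along &lt;code&gt;task_based_on&lt;/code&gt; references until a null value marks the chain's start (the record store layout and hashing are illustrative assumptions):&lt;/p&gt;

```python
import hashlib
import json

def record_hash(record):
    # Illustrative content hash; the draft's exact canonicalization may differ
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def trace_to_root(store, start_hash):
    # Follow task_based_on references until None marks the chain's start
    chain = []
    current = start_hash
    while current is not None:
        record = store[current]  # a missing parent here means the chain is broken
        chain.append(record)
        current = record["task_based_on"]
    return chain

root = {"verb": "J", "who": "did:example:a", "task_based_on": None}
mid = {"verb": "D", "who": "did:example:a", "task_based_on": record_hash(root)}
leaf = {"verb": "D", "who": "did:example:b", "task_based_on": record_hash(mid)}
store = {record_hash(r): r for r in (root, mid, leaf)}

chain = trace_to_root(store, record_hash(leaf))
print([r["verb"] for r in chain])  # ['D', 'D', 'J'] (leaf back to root)
```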

&lt;h2&gt;
  
  
  Pain Point 2: Unclear accountability, hard to assign fault for errors
&lt;/h2&gt;

&lt;p&gt;Trigger Primitives: D + V&lt;/p&gt;

&lt;p&gt;Solution: Permanently include a who field in every record, bound to the actor’s DID or public key hash. Combined with cryptographic signing, records become non-repudiable and tamper-proof, enabling precise accountability.&lt;/p&gt;
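&lt;p&gt;A toy demonstration of the tamper-evidence property. Note this uses a symmetric HMAC stand-in for brevity; the spec calls for asymmetric signing (Ed25519 JWS), so this is not the real signature scheme:&lt;/p&gt;

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"  # stand-in only: the spec uses asymmetric Ed25519/JWS, not HMAC

def sign(record):
    # Sign everything except the sig field itself
    body = json.dumps({k: v for k, v in record.items() if k != "sig"}, sort_keys=True)
    return hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()

record = {"verb": "D", "who": "did:example:agent-789", "what": "deploy"}
record["sig"] = sign(record)

print(hmac.compare_digest(record["sig"], sign(record)))   # True: record intact
record["who"] = "did:example:someone-else"                # tamper with the actor
print(hmac.compare_digest(record["sig"], sign(record)))   # False: signature breaks
```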

&lt;h2&gt;
  
  
  Pain Point 3: Lack of compliant audit evidence for regulatory requirements
&lt;/h2&gt;

&lt;p&gt;Trigger Primitives: V + T&lt;/p&gt;

&lt;p&gt;Solution: Equip every record with a timestamp, unique nonce, and signature verification. Full audit trails with replay protection are natively supported, directly satisfying compliance requirements under the EU AI Act and Singapore IMDA frameworks.&lt;/p&gt;
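&lt;p&gt;The nonce and timestamp checks can be sketched as follows (the in-memory nonce set is a stand-in for the shared, persistent store a real deployment would need):&lt;/p&gt;

```python
import time
import uuid

SEEN_NONCES = set()   # stand-in: production needs a shared, persistent store
WINDOW_SECONDS = 300  # the +/-5 minute clock-skew tolerance

def accept(record, now=None):
    now = int(time.time()) if now is None else now
    # Reject records outside the clock-skew window (stale or post-dated)
    if abs(now - record["when"]) not in range(WINDOW_SECONDS + 1):
        return "INVALID"
    # Reject replays: each nonce may be used exactly once
    if record["nonce"] in SEEN_NONCES:
        return "INVALID"
    SEEN_NONCES.add(record["nonce"])
    return "VALID"

fresh = {"when": int(time.time()), "nonce": str(uuid.uuid4())}
print(accept(fresh))   # VALID
print(accept(fresh))   # INVALID (same nonce replayed)
stale = {"when": int(time.time()) - 3600, "nonce": str(uuid.uuid4())}
print(accept(stale))   # INVALID (outside the timestamp window)
```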

&lt;h2&gt;
  
  
  Core Data Structure (Ready for Use)
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "jep": "1",
  "verb": "J",
  "who": "did:example:agent-789",
  "when": 1742345678,
  "what": "122059e8878aa9a38f4d123456789abcdef01234",
  "nonce": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "aud": "https://platform.example.com",
  "task_based_on": "hash-of-parent-task",
  "ref": "",
  "sig": "eyJhbGciOiJFZERTQSJ9..."
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Field Definitions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;verb: Required; one of [J, D, V, T]&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;who: Required; unique identifier of the actor&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;when: Required; Unix timestamp to prevent stale/tampered records&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;nonce: Required; UUIDv4 to prevent replay attacks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;task_based_on: Traceability field; hash of the parent task&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ref: For verification events only; references the ID of the event being checked&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;sig: Required; JWS digital signature&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Verification Logic (Pseudocode)
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def verify_record(record):
    # 1. Verify signature integrity
    if not verify_jws_signature(record):
        return "INVALID"
    # 2. Ensure nonce uniqueness to prevent replay
    if not is_valid_nonce(record["nonce"]):
        return "INVALID"
    # 3. Validate timestamp within acceptable window
    if not is_within_time_window(record["when"]):
        return "INVALID"
    # 4. Verify parent task chain integrity
    if record.get("task_based_on") and not task_exists(record["task_based_on"]):
        return "INVALID"
    # 5. Verify (V) events must include a reference
    if record["verb"] == "V" and not record.get("ref"):
        return "INVALID"
    return "VALID"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Critical Security Rules
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;All records must be signed; any tampering invalidates the signature&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nonces must be globally unique; duplicate requests are rejected&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Timestamp tolerance: ±5 minutes to account for clock skew&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verify (V) events must include a ref field to avoid circular validation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ed25519 recommended; support for SM2, ECDSA P-256, and post-quantum algorithms&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Optional Extensions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Task State: Add status field (pending, executing, completed, terminated)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assignment Log: Record DID of assigner and assignee&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Result Validation: Include confidence score and human review flag&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fault Handling: Log missing parent tasks and failure reasons for chain breaks&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation Based on These IETF Drafts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;draft-wang-jep-judgment-event-protocol-01&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;draft-wang-jac-00&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Beyond App-Level Harness: A Technical Analysis of Native Underlying AI Constraints</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Mon, 23 Mar 2026 09:48:13 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/beyond-app-level-harness-a-technical-analysis-of-native-underlying-ai-constraints-2d54</link>
      <guid>https://dev.to/hjs-foundation/beyond-app-level-harness-a-technical-analysis-of-native-underlying-ai-constraints-2d54</guid>
      <description>&lt;p&gt;As AI engineering evolves, a key technical distinction in Harness design has become increasingly clear. Current Harness implementations focus on app-level, post-execution adjustments, while a more foundational approach—built into the protocol layer—offers distinct advantages in AI control and reliability.&lt;/p&gt;

&lt;p&gt;This analysis focuses on the technical differences between these two approaches, using protocol-level designs for AI boundary and accountability as a framework for comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Characteristics of App-Level Harness Implementations
&lt;/h2&gt;

&lt;p&gt;Existing Harness solutions deliver practical value through a set of operational adjustments, all implemented as layers built on top of pre-existing models. Core technical components include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Context engineering to curate and deliver relevant information to AI agents during execution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI/CD linting and structured testing to identify and correct errors after execution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Behavioral guideline documents to establish operational parameters for agents&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tool curation to limit agent capabilities to predefined scopes&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These components effectively translate raw model capability into usable output—with documented improvements in performance metrics when optimized Harnesses are applied. However, their technical limitation lies in being soft constraints: they operate as external guidance rather than inherent controls, creating potential for agent drift or boundary bypass under complex operational conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Advantages of Protocol-Level Harness Design
&lt;/h2&gt;

&lt;p&gt;A protocol-level approach differs fundamentally by embedding control mechanisms into the core operational layer, rather than adding them as external wrappers. This design prioritizes inherent constraints and accountability, with three key technical differentiators:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Native Isolation vs. Post-Execution Constraints
&lt;/h2&gt;

&lt;p&gt;Protocol-level designs establish hard boundaries between distinct entities from the outset. Instead of relying on external prompts or linting to guide behavior, they define separate execution domains, identity isolation, and permission boundaries that are enforced by the protocol itself and cannot be bypassed from the application layer. This shifts control from reactive adjustment to proactive prevention: an out-of-bounds action is rejected before it executes rather than corrected after the fact.&lt;/p&gt;
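&lt;p&gt;As a loose illustration (the agent IDs and policy store here are hypothetical), a protocol-layer gate rejects out-of-scope actions before they execute, in contrast to post-execution linting:&lt;/p&gt;

```python
# Hypothetical sketch: a protocol-layer gate that checks an agent's
# permission boundary before any action runs, rather than auditing output afterwards.
BOUNDARIES = {
    "did:example:reader": {"read"},
    "did:example:deployer": {"read", "deploy"},
}

class BoundaryViolation(Exception):
    pass

def execute(agent_id, action):
    # Enforcement happens before execution: out-of-scope actions never run
    allowed = BOUNDARIES.get(agent_id, set())
    if action not in allowed:
        raise BoundaryViolation(f"{agent_id} may not perform {action!r}")
    return f"{action} executed for {agent_id}"

print(execute("did:example:deployer", "deploy"))  # permitted, runs
try:
    execute("did:example:reader", "deploy")       # blocked before execution
except BoundaryViolation as err:
    print("rejected:", err)
```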

&lt;h2&gt;
  
  
  2. Accountability as a Core Technical Primitive
&lt;/h2&gt;

&lt;p&gt;Unlike app-level Harnesses that focus on error correction after occurrence, protocol-level designs embed accountability into the foundational architecture. This includes a technical framework for tracking agent actions, linking them to verifiable identities, and enabling full traceability—all integrated natively into the protocol. This moves beyond feedback loops to create a persistent, auditable system for AI behavior accountability.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Open Source Interoperability as a Technical Priority
&lt;/h2&gt;

&lt;p&gt;Protocol-level Harness designs prioritize open source principles to enable interoperability across diverse model architectures and toolchains. By avoiding proprietary lock-in, they create a universal foundation that can be adopted, extended, and integrated into varied AI workflows. This technical design choice addresses a critical challenge as AI scales: preventing fragmentation across different Harness implementations.&lt;/p&gt;

&lt;p&gt;The Technical Case for Depth in Harness Design&lt;/p&gt;

&lt;p&gt;A simple technical analogy illustrates the core difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;App-level Harnesses operate like external safety features—effective for standard conditions but vulnerable to bypass under complex scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Protocol-level Harnesses function as inherent structural controls—integrated into the operational foundation to eliminate the technical possibility of drift or bypass.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The growing recognition of Harness importance in AI engineering is driving a shift toward deeper, more integrated control mechanisms. Soft, external constraints are sufficient for small-scale, well-defined use cases, but as AI systems become more autonomous and complex, a protocol-level approach becomes technically necessary. It ensures that control, isolation, and accountability scale proportionally with AI capability, rather than relying on external adjustments that may fail under stress.&lt;/p&gt;

&lt;p&gt;The value of protocol-level Harness design lies in its ability to create a foundational layer for reliable, controllable AI at scale. By embedding control mechanisms into the protocol itself, it addresses the technical limitations of app-level implementations, offering a more robust solution for increasingly complex AI systems.&lt;/p&gt;

&lt;p&gt;Further technical discussion and collaboration around protocol-level Harness design are encouraged to advance the reliability and controllability of AI systems.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>security</category>
      <category>blockchain</category>
    </item>
  </channel>
</rss>
