<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community:  HJS Foundation</title>
    <description>The latest articles on DEV Community by  HJS Foundation (@hjs-foundation).</description>
    <link>https://dev.to/hjs-foundation</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3799777%2F7741370c-7b73-4cda-ac68-88f6912fd9ce.png</url>
      <title>DEV Community:  HJS Foundation</title>
      <link>https://dev.to/hjs-foundation</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hjs-foundation"/>
    <language>en</language>
    <item>
      <title>My AI Agent Could See 167 Tools. Then I Told It to shutup.</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Mon, 13 Apr 2026 11:24:18 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/my-ai-agent-could-see-167-tools-then-i-told-it-to-shutup-1kp9</link>
      <guid>https://dev.to/hjs-foundation/my-ai-agent-could-see-167-tools-then-i-told-it-to-shutup-1kp9</guid>
      <description>&lt;p&gt;&lt;strong&gt;Token usage dropped. Accuracy improved. And I built a 200-line Python proxy to prove it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) was supposed to be the universal remote for AI agents. Connect once, and your agent can interact with GitHub, Jira, Slack, filesystems, databases—you name it.&lt;/p&gt;

&lt;p&gt;But here's what nobody tells you: &lt;strong&gt;connect four MCP servers, and your agent burns 60,000 tokens before you even say "hello."&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Redis ran the numbers. A typical setup with Redis, GitHub, Jira, and Grafana—four servers, 167 tools—consumes &lt;strong&gt;~60,000 tokens upfront&lt;/strong&gt; just loading tool descriptions. In production, it's often &lt;strong&gt;150,000+ tokens&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Atlassian found their own MCP server alone consumes &lt;strong&gt;~10,000 tokens&lt;/strong&gt; for Jira and Confluence. GitHub's official server exposes &lt;strong&gt;94 tools&lt;/strong&gt; and chews through &lt;strong&gt;~17,600 tokens&lt;/strong&gt; per request. Combine several, and you hit &lt;strong&gt;30,000+ tokens&lt;/strong&gt; of pure metadata—before your agent solves anything. &lt;/p&gt;

&lt;p&gt;Every extra tool is a chance to pick the wrong one. Redis measured &lt;strong&gt;42% tool selection accuracy&lt;/strong&gt; without filtering. The model gets lost in the noise, grabs the wrong tool, overwrites data, or sends requests into the void. &lt;/p&gt;

&lt;p&gt;We gave agents unlimited power. And they became &lt;strong&gt;slower, dumber, and more expensive&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solutions (and Why They're Not Enough)
&lt;/h2&gt;

&lt;p&gt;The industry noticed. Multiple solutions emerged:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Core Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regex-based filtering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;mcpwrapped&lt;/code&gt;, &lt;code&gt;Tool Filter MCP&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;You must manually configure which tools to hide. 167 tools? Good luck.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema compression&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Atlassian &lt;code&gt;mcp-compressor&lt;/code&gt; (97% reduction)&lt;/td&gt;
&lt;td&gt;Strips descriptions to save tokens, but accuracy drops—models can't tell &lt;code&gt;create_jira_issue&lt;/code&gt; from &lt;code&gt;create_confluence_page&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool Search (Anthropic)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code built-in&lt;/td&gt;
&lt;td&gt;85% token reduction, but only &lt;strong&gt;34% selection accuracy&lt;/strong&gt; in independent testing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector search (Redis)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Redis Tool Filtering&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;98% token reduction, 8x faster, 2x accuracy&lt;/strong&gt;—but requires Redis infrastructure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hybrid search (Stacklok)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MCP Optimizer&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;94% accuracy&lt;/strong&gt; on 2,792 tools, but closed-source commercial product.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All of them fall into one of two traps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Manual configuration&lt;/strong&gt;: You have to know in advance which tools to hide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavy infrastructure&lt;/strong&gt;: You need Redis, a cloud service, or a commercial license.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What I wanted was simple: &lt;strong&gt;zero-config, 100% local, and smart enough to figure out what tools I actually need.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So I built it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing &lt;code&gt;shutup-mcp&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;shutup&lt;/code&gt; is an MCP proxy that shows your agent only the tools it actually needs—zero config, 100% local, no API keys.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;shutup &lt;span class="nt"&gt;--config&lt;/span&gt; ~/claude_desktop_config.json &lt;span class="nt"&gt;--intent&lt;/span&gt; &lt;span class="s2"&gt;"read and write files"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Behind the scenes, &lt;code&gt;shutup&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reads your MCP config&lt;/strong&gt; and discovers all connected servers—filesystem, GitHub, Jira, whatever.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fetches all tool definitions&lt;/strong&gt; and builds a local embedding index using &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; (~80MB, runs entirely offline).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watches for changes&lt;/strong&gt;—add a new MCP server, &lt;code&gt;shutup&lt;/code&gt; rebuilds the index automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filters tools by intent&lt;/strong&gt;—when your agent requests tools, &lt;code&gt;shutup&lt;/code&gt; intercepts and returns only the top-K most relevant ones.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your agent never knows the other 162 tools exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Approach Wins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Zero Config, Actually&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;No regex. No YAML. No manual whitelists. You already have a &lt;code&gt;claude_desktop_config.json&lt;/code&gt;. &lt;code&gt;shutup&lt;/code&gt; reads it directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filesystem"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-filesystem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/tmp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"github"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-github"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fetch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@modelcontextprotocol/server-fetch"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;shutup&lt;/code&gt; connects to all three, aggregates their tools, and filters them intelligently. No extra configuration files needed.&lt;/p&gt;
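&lt;p&gt;As a rough sketch of that discovery step (the function name is illustrative, not &lt;code&gt;shutup&lt;/code&gt;'s actual API), parsing the &lt;code&gt;mcpServers&lt;/code&gt; block is just a JSON walk:&lt;/p&gt;

```python
import json

def discover_servers(config_path: str) -> dict[str, dict]:
    """Read a Claude-style config and return {server_name: launch_spec}."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return {
        name: {"command": spec["command"], "args": spec.get("args", [])}
        for name, spec in config.get("mcpServers", {}).items()
    }
```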

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Intent-Based Filtering&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Most proxies hide tools based on names or regex patterns. &lt;code&gt;shutup&lt;/code&gt; hides tools based on &lt;strong&gt;what you're actually trying to do&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Say "read and write files"—&lt;code&gt;shutup&lt;/code&gt; returns filesystem tools, hiding GitHub and fetch tools.&lt;/p&gt;

&lt;p&gt;Say "create a GitHub issue"—&lt;code&gt;shutup&lt;/code&gt; surfaces GitHub tools while hiding filesystem operations.&lt;/p&gt;

&lt;p&gt;It treats tool selection as a &lt;strong&gt;retrieval problem&lt;/strong&gt;, not a reasoning one—the same insight that drove Redis to 98% token reduction. &lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Multi-Server Aggregation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is where &lt;code&gt;shutup&lt;/code&gt; differs from most open-source alternatives. It doesn't just filter one MCP server—it &lt;strong&gt;aggregates all of them&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When Stacklok analyzed 2,792 tools, they found &lt;strong&gt;94% selection accuracy&lt;/strong&gt; using hybrid search. But their Optimizer is a commercial product. &lt;code&gt;shutup&lt;/code&gt; brings the same pattern—semantic retrieval across multiple servers—to an &lt;strong&gt;open-source, zero-infrastructure&lt;/strong&gt; tool. &lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Privacy-First, 100% Local&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Two embedding backends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;sentence-transformers&lt;/code&gt;&lt;/strong&gt; (default): Downloads &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; once (~80MB), runs entirely offline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ollama&lt;/code&gt;&lt;/strong&gt;: Use &lt;code&gt;nomic-embed-text&lt;/code&gt; or any Ollama embedding model. Completely air-gapped.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No API keys. No telemetry. No cloud dependencies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark Context (Why This Matters)
&lt;/h2&gt;

&lt;p&gt;Let's put numbers to the problem.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Tools Loaded&lt;/th&gt;
&lt;th&gt;Token Overhead (Est.)&lt;/th&gt;
&lt;th&gt;Selection Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single MCP server (GitHub)&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;~17,600&lt;/td&gt;
&lt;td&gt;79-88% (Opus 4.5)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Four servers (Redis+GitHub+Jira+Grafana)&lt;/td&gt;
&lt;td&gt;167&lt;/td&gt;
&lt;td&gt;~60,000&lt;/td&gt;
&lt;td&gt;~42% (without filtering)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise setup (10+ servers)&lt;/td&gt;
&lt;td&gt;500+&lt;/td&gt;
&lt;td&gt;150,000+&lt;/td&gt;
&lt;td&gt;&amp;lt; 30%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sources: Atlassian, Redis, Stacklok, Anthropic &lt;/p&gt;

&lt;p&gt;Now look at what filtering achieves:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;th&gt;Token Reduction&lt;/th&gt;
&lt;th&gt;Selection Accuracy&lt;/th&gt;
&lt;th&gt;Infrastructure Required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Tool Search&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;34% (2,792 tools)&lt;/td&gt;
&lt;td&gt;Built into Claude&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atlassian mcp-compressor&lt;/td&gt;
&lt;td&gt;70-97%&lt;/td&gt;
&lt;td&gt;Drops at high compression&lt;/td&gt;
&lt;td&gt;Proxy only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis Tool Filtering&lt;/td&gt;
&lt;td&gt;98%&lt;/td&gt;
&lt;td&gt;85%&lt;/td&gt;
&lt;td&gt;Redis + vector DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stacklok MCP Optimizer&lt;/td&gt;
&lt;td&gt;60-85%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;td&gt;Commercial platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;shutup-mcp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~98% (projected)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TBD (benchmarking)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Zero&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;shutup&lt;/code&gt; uses the same architectural pattern as Redis (vector embeddings + semantic search) but without the Redis dependency. It's the "Redis approach" in a &lt;strong&gt;single pip install&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works (Under the Hood)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent (Claude Code / Cursor / Windsurf)
    ↓
shutup-mcp (stdio proxy)
    ↓
┌─────────────────────────┐
│ ServerManager           │
│ - Parses mcp.json       │
│ - Manages connections   │
│ - Watches for changes   │
└─────────────────────────┘
    ↓
┌─────────────────────────┐
│ ToolEmbedder            │
│ - Builds local index    │
│ - Cosine similarity     │
│ - Returns top-K tools   │
└─────────────────────────┘
    ↓
Upstream MCP Servers (filesystem, github, fetch, …)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Core Loop
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Startup&lt;/strong&gt;: Parse &lt;code&gt;claude_desktop_config.json&lt;/code&gt;, connect to each MCP server, fetch tool definitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embed&lt;/strong&gt;: For each tool, create text &lt;code&gt;"{name}: {description}"&lt;/code&gt; and embed using chosen backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request&lt;/strong&gt;: User provides intent (e.g., &lt;code&gt;--intent "read and write files"&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter&lt;/strong&gt;: Compute cosine similarity, return top-K tools (default K=5).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy&lt;/strong&gt;: Forward &lt;code&gt;tools/list&lt;/code&gt; and &lt;code&gt;tools/call&lt;/code&gt; requests transparently.&lt;/li&gt;
&lt;/ol&gt;
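&lt;p&gt;Steps 2 and 4 can be sketched in a few lines of dependency-free Python. Here a hashing-based toy embedder stands in for &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;, so the function names and scores are illustrative only:&lt;/p&gt;

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hashing-based stand-in for a real embedding model (illustration only)."""
    vec = [0.0] * dim
    for token in text.lower().replace("_", " ").split():
        vec[int(hashlib.sha256(token.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are already unit-length

def top_k_tools(tools: dict[str, str], intent: str, k: int = 5) -> list[str]:
    """tools maps name -> description; return the k names closest to the intent."""
    query = toy_embed(intent)
    scored = [(cosine(toy_embed(f"{name}: {desc}"), query), name)
              for name, desc in tools.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

&lt;p&gt;A real embedding model captures synonyms the toy hasher can't, but the control flow is the same: embed every &lt;code&gt;"{name}: {description}"&lt;/code&gt; string once, then rank by cosine similarity for each intent.&lt;/p&gt;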

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;shutup &lt;span class="nt"&gt;--config&lt;/span&gt; ~/Library/Application&lt;span class="se"&gt;\ &lt;/span&gt;Support/Claude/claude_desktop_config.json &lt;span class="se"&gt;\&lt;/span&gt;
         &lt;span class="nt"&gt;--intent&lt;/span&gt; &lt;span class="s2"&gt;"create a GitHub issue about the API outage"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
         &lt;span class="nt"&gt;--top-k&lt;/span&gt; 3

&lt;span class="o"&gt;[&lt;/span&gt;shutup] Loading config: claude_desktop_config.json
&lt;span class="o"&gt;[&lt;/span&gt;shutup] Connected to 3 MCP servers &lt;span class="o"&gt;(&lt;/span&gt;filesystem, github, fetch&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;shutup] Fetched 47 total tools
&lt;span class="o"&gt;[&lt;/span&gt;shutup] Intent: &lt;span class="s2"&gt;"create a GitHub issue about the API outage"&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;shutup] Returning 3/47 tools:
  - github__create_issue
  - github__list_issues
  - github__get_repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent only sees 3 tools. Token overhead drops from ~25,000 to ~300.&lt;/p&gt;
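&lt;p&gt;Those figures line up with the common ~4-characters-per-token heuristic (the schema sizes below are illustrative, not measured):&lt;/p&gt;

```python
def estimated_tokens(tool_schemas: list[str]) -> int:
    """Rough token count for a set of tool definitions (~4 chars per token)."""
    return sum(len(s) for s in tool_schemas) // 4

# 47 tools at ~2,000 characters of schema each, vs. 3 filtered tools at ~400
unfiltered = estimated_tokens(["x" * 2000] * 47)   # ~23,500 tokens
filtered   = estimated_tokens(["x" * 400] * 3)     # 300 tokens
```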




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;shutup-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Default: sentence-transformers (auto-downloads model)&lt;/span&gt;
shutup &lt;span class="nt"&gt;--config&lt;/span&gt; ~/Library/Application&lt;span class="se"&gt;\ &lt;/span&gt;Support/Claude/claude_desktop_config.json &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--intent&lt;/span&gt; &lt;span class="s2"&gt;"your task description"&lt;/span&gt;

&lt;span class="c"&gt;# Privacy mode: use Ollama&lt;/span&gt;
shutup &lt;span class="nt"&gt;--config&lt;/span&gt; ~/Library/Application&lt;span class="se"&gt;\ &lt;/span&gt;Support/Claude/claude_desktop_config.json &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--intent&lt;/span&gt; &lt;span class="s2"&gt;"read and write files"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--embedder&lt;/span&gt; ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Integrate with Claude Code
&lt;/h3&gt;

&lt;p&gt;In your &lt;code&gt;claude_desktop_config.json&lt;/code&gt;, replace direct MCP server entries with &lt;code&gt;shutup&lt;/code&gt; as a proxy, or run &lt;code&gt;shutup&lt;/code&gt; as a standalone gateway. Full integration docs are on the GitHub repo.&lt;/p&gt;
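&lt;p&gt;For illustration only (check the repo for the exact entry), a proxy setup might register &lt;code&gt;shutup&lt;/code&gt; as the sole MCP server and point it at your original config:&lt;/p&gt;

```json
{
  "mcpServers": {
    "shutup": {
      "command": "shutup",
      "args": ["--config", "/path/to/original_config.json", "--intent", "read and write files"]
    }
  }
}
```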




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;This is &lt;strong&gt;v0.1.0&lt;/strong&gt;—a minimal, functional proxy that proves the pattern works. I'm actively working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Benchmarking&lt;/strong&gt;: Head-to-head comparison with Anthropic Tool Search, mcp-compressor, and Stacklok Optimizer (public dataset, reproducible).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid search&lt;/strong&gt;: BM25 + embeddings for better exact-match performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust rewrite&lt;/strong&gt;: Move embedding and similarity computation to Rust for sub-millisecond latency at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool usage analytics&lt;/strong&gt;: Show which tools your agent &lt;em&gt;actually&lt;/em&gt; uses vs. what gets filtered out.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;I was tired of watching my agent burn tokens on tools it would never use. Tired of "pick the wrong tool" errors. Tired of configuring regex filters every time I added a new MCP server.&lt;/p&gt;

&lt;p&gt;The Redis team proved the pattern: &lt;strong&gt;treat tool selection as retrieval&lt;/strong&gt;. 98% token reduction. 8x faster. Double the accuracy.&lt;/p&gt;

&lt;p&gt;But their solution required Redis. Stacklok's required a commercial platform. Anthropic's couldn't reliably find the right tools.&lt;/p&gt;

&lt;p&gt;I wanted something that worked &lt;strong&gt;out of the box, completely local, with zero configuration&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I built it. In 200 lines of Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/hjs-spec/shutup-mcp" rel="noopener noreferrer"&gt;github.com/hjs-spec/shutup-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyPI&lt;/strong&gt;: &lt;code&gt;pip install shutup-mcp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Star the repo if this solves a problem for you. PRs welcome—especially if you want to help with benchmarking or the Rust rewrite.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Your agent doesn't need 167 tools. It needs 3. Tell it to &lt;code&gt;shutup&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>mcp</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>I Logged Every Decision My AI Agent Made for a Week. Here's What I Learned.</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Sat, 11 Apr 2026 02:51:26 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/i-logged-every-decision-my-ai-agent-made-for-a-week-heres-what-i-learned-2cp5</link>
      <guid>https://dev.to/hjs-foundation/i-logged-every-decision-my-ai-agent-made-for-a-week-heres-what-i-learned-2cp5</guid>
      <description>&lt;p&gt;&lt;em&gt;10,847 decision events. 3 surprising insights. And one $23 wake-up call that changed how I think about agent observability.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The $23 Mystery
&lt;/h2&gt;

&lt;p&gt;I run a multi-agent system that does market research. Three agents, one goal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scout&lt;/strong&gt;: Gathers data from APIs and web sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyst&lt;/strong&gt;: Processes raw data into insights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writer&lt;/strong&gt;: Produces the final report&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It worked fine. Until it didn't.&lt;/p&gt;

&lt;p&gt;One Monday morning, I found a report that was &lt;strong&gt;48 hours late&lt;/strong&gt; and cost &lt;strong&gt;$23 in API credits&lt;/strong&gt;. Normal runs take 2 hours and cost around $4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I checked everything. API rate limits? No. Model downtime? No. LangSmith traces showed the chain completed successfully. Each agent reported "task done." Every log line was green.&lt;/p&gt;

&lt;p&gt;But somewhere between "task done" and "report ready," &lt;strong&gt;46 hours vanished&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's when I realized: I had no idea what my agents were actually &lt;em&gt;deciding&lt;/em&gt; to do. I only knew what they &lt;em&gt;did&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So I ran an experiment.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Experiment: 50 Lines of Code, One Week, Every Decision
&lt;/h2&gt;

&lt;p&gt;I added a lightweight decision logger to my agent orchestrator. Not tracing API calls—we already have that. I wanted to log &lt;strong&gt;decisions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;J&lt;/code&gt; (Judge): An agent initiates a new task or makes a determination&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;D&lt;/code&gt; (Delegate): An agent hands off work to another agent&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T&lt;/code&gt; (Terminate): An agent ends a task, successfully or not&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;V&lt;/code&gt; (Verify): An agent validates someone else's output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the core code (simplified—full version on GitHub):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid4&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hash_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create a content-addressable hash for the decision payload.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha256:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_decision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;what&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;verb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;who&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;hash_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;what&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;what&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;when&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nonce&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-pipeline-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;# Async write—doesn't block the agent
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;write_to_ndjson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
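&lt;p&gt;The &lt;code&gt;write_to_ndjson&lt;/code&gt; helper isn't shown above; a minimal version (path and serialization choices are mine, not necessarily what's in the repo) could be:&lt;/p&gt;

```python
import asyncio
import json

def _append_line(path: str, line: str) -> None:
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)

async def write_to_ndjson(event: dict, path: str = "decisions.ndjson") -> None:
    """Append one decision event as a single NDJSON line without blocking."""
    line = json.dumps(event, separators=(",", ":")) + "\n"
    # File I/O runs in a worker thread so the agent's event loop never stalls.
    await asyncio.to_thread(_append_line, path, line)
```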



&lt;p&gt;I deployed it on a Tuesday. One week later, I had &lt;strong&gt;10,847 decision events&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  Discovery #1: 35% of Delegations Were Circular
&lt;/h2&gt;

&lt;p&gt;My agents delegate work to each other constantly. Scout hands raw data to Analyst. Analyst hands insights to Writer. Writer asks Scout for clarification. Normal.&lt;/p&gt;

&lt;p&gt;But when I graphed the &lt;code&gt;D&lt;/code&gt; (Delegate) events by &lt;code&gt;ref&lt;/code&gt; chain, I saw something unexpected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scout → Analyst → Scout → Analyst → Scout (terminates)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1,203 times&lt;/strong&gt; in one week, agents created delegation loops of length ≥ 2. Each loop burned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~2 seconds of compute time&lt;/li&gt;
&lt;li&gt;One extra LLM call for the handoff reasoning&lt;/li&gt;
&lt;li&gt;Token costs for the delegation message itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total waste: &lt;strong&gt;~40 minutes of compute and $3.20 in API costs&lt;/strong&gt;. Not catastrophic. But completely invisible until I logged the &lt;code&gt;D&lt;/code&gt; events with their &lt;code&gt;ref&lt;/code&gt; chains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: I added a simple rule—if an agent receives a delegation from someone it already delegated to in the current chain, break and escalate. Loops dropped to near zero.&lt;/p&gt;
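&lt;p&gt;The guard itself is tiny. Here is a minimal sketch of the idea; the function names and event fields are illustrative, not the logger's actual API:&lt;/p&gt;

```python
def has_delegation_loop(chain, next_target):
    """Return True if delegating to next_target would revisit an agent
    already in the current delegation chain."""
    return next_target in chain

def delegate(chain, who, to):
    # Break and escalate instead of creating Scout -> Analyst -> Scout -> ...
    if has_delegation_loop(chain, to):
        return {"verb": "T", "who": who, "reason": "delegation_loop"}
    return {"verb": "D", "who": who, "to": to, "chain": chain + [to]}

ok = delegate(["Scout"], "Scout", "Analyst")               # normal handoff
loop = delegate(["Scout", "Analyst"], "Analyst", "Scout")  # would revisit Scout
```

&lt;p&gt;In practice the chain would be reconstructed from the &lt;code&gt;ref&lt;/code&gt; links in the decision log rather than passed around by hand.&lt;/p&gt;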




&lt;h2&gt;
  
  
  Discovery #2: Failed Tool Calls Retried 7 Times Before Giving Up
&lt;/h2&gt;

&lt;p&gt;One of Scout's jobs is scraping competitor pricing from public websites. Occasionally, a site times out. Normal.&lt;/p&gt;

&lt;p&gt;What wasn't normal: &lt;strong&gt;the retry behavior&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a tool call failed, my agent retried—on average—&lt;strong&gt;7 times&lt;/strong&gt; before terminating. The worst offender was that scraping tool. One timeout at 11:23 PM turned into:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;11:23 PM - Tool call fails (timeout)
11:24 PM - Retry 1 fails
11:26 PM - Retry 2 fails
11:29 PM - Retry 3 fails
...
03:17 AM - Retry 11 fails, agent finally terminates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four hours. Eleven retries. Each one a fresh API call with a new browser instance. Cost of that single failure chain: &lt;strong&gt;$1.87&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Across the week, excessive retries wasted &lt;strong&gt;~$9.40&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: I capped retries at 3 for non-critical tools. If it still fails, the agent logs a &lt;code&gt;T&lt;/code&gt; with &lt;code&gt;reason: "tool_unavailable"&lt;/code&gt; and moves on with partial data. The report might be slightly less complete, but it arrives on time and under budget.&lt;/p&gt;
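&lt;p&gt;A minimal sketch of that retry cap, assuming a tool that raises on timeout (the names are illustrative):&lt;/p&gt;

```python
MAX_RETRIES = 3  # cap for non-critical tools

def call_with_cap(tool):
    """Try a non-critical tool at most MAX_RETRIES times; on exhaustion,
    emit a T event and let the agent continue with partial data."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return {"ok": True, "data": tool(), "attempts": attempt}
        except TimeoutError:
            continue
    # Give up instead of retrying until 3 AM
    return {"ok": False, "attempts": MAX_RETRIES,
            "event": {"verb": "T", "reason": "tool_unavailable"}}

def flaky_scraper():
    raise TimeoutError("site timed out")

result = call_with_cap(flaky_scraper)
```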




&lt;h2&gt;
  
  
  Discovery #3: The 3 AM Termination Storm
&lt;/h2&gt;

&lt;p&gt;At 3:14 AM on Wednesday, I saw something strange in the logs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;47 &lt;code&gt;T&lt;/code&gt; (Terminate) events within 90 seconds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Normal rate is ~10 per hour.&lt;/p&gt;

&lt;p&gt;Every single one had &lt;code&gt;reason: "empty_response"&lt;/code&gt;. Turns out, a data provider's API had a brief outage, returning &lt;code&gt;200 OK&lt;/code&gt; with an empty body. Every parallel agent hit it simultaneously, received nothing, and terminated immediately.&lt;/p&gt;

&lt;p&gt;No alert fired. From the orchestrator's perspective, all tasks "completed successfully"—they just completed with &lt;strong&gt;zero data&lt;/strong&gt;. The final report that morning was 40% shorter than usual, and I had no idea why until I dug through the decision logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: I added a simple monitor—if &lt;code&gt;T&lt;/code&gt; events with &lt;code&gt;reason: "empty_response"&lt;/code&gt; exceed 5 per minute, pause the pipeline and alert. The next time that API flaked, I knew within 60 seconds.&lt;/p&gt;
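&lt;p&gt;The monitor is just a sliding window per termination reason. A sketch with the same threshold and window as the rule above (the class and field names are my own, not the logger's API):&lt;/p&gt;

```python
from collections import deque

class StormMonitor:
    """Alert when more than `threshold` T events with the same reason
    arrive within `window` seconds."""
    def __init__(self, threshold=5, window=60):
        self.threshold = threshold
        self.window = window
        self.events = {}  # reason -> deque of timestamps

    def record(self, reason, ts):
        q = self.events.setdefault(reason, deque())
        q.append(ts)
        # Drop timestamps that fell out of the window
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q) > self.threshold  # True means: pause pipeline, alert

monitor = StormMonitor()
# 8 terminations with the same reason in 8 seconds trips the alert
alerts = [monitor.record("empty_response", t) for t in range(8)]
```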




&lt;h2&gt;
  
  
  Discovery #4: Verification Was Silently Slowing Down
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;V&lt;/code&gt; (Verify) event happens when one agent checks another's output. Analyst produces insights; Writer verifies they're coherent before including them.&lt;/p&gt;

&lt;p&gt;I noticed something in the timestamps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Day&lt;/th&gt;
&lt;th&gt;Avg Time Between &lt;code&gt;J&lt;/code&gt; and &lt;code&gt;V&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tuesday&lt;/td&gt;
&lt;td&gt;1.2 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wednesday&lt;/td&gt;
&lt;td&gt;1.8 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thursday&lt;/td&gt;
&lt;td&gt;2.9 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Friday&lt;/td&gt;
&lt;td&gt;4.1 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monday&lt;/td&gt;
&lt;td&gt;4.7 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The verification service was &lt;strong&gt;drifting&lt;/strong&gt;. Not enough to break anything yet, but a clear trend. It turned out the vector database used for fact-checking had accumulated six months of stale embeddings, and queries were getting steadily slower.&lt;/p&gt;

&lt;p&gt;Without decision-level timestamps, I would have found out only after it started timing out and breaking the pipeline. Instead, I scheduled a re-indexing job over the weekend. Tuesday morning: back to 1.3 seconds.&lt;/p&gt;
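&lt;p&gt;Computing that latency from the log is a one-pass join, assuming each &lt;code&gt;V&lt;/code&gt; event's &lt;code&gt;ref&lt;/code&gt; points at the &lt;code&gt;J&lt;/code&gt; event it verifies (the field names are illustrative):&lt;/p&gt;

```python
def verify_latencies(events):
    """Pair each V event with the J event its ref points to and
    return the J -> V delays in seconds."""
    judges = {e["id"]: e["ts"] for e in events if e["verb"] == "J"}
    return [e["ts"] - judges[e["ref"]]
            for e in events if e["verb"] == "V" and e["ref"] in judges]

events = [
    {"verb": "J", "id": "j1", "ts": 100.0},
    {"verb": "V", "id": "v1", "ref": "j1", "ts": 101.2},
    {"verb": "J", "id": "j2", "ts": 200.0},
    {"verb": "V", "id": "v2", "ref": "j2", "ts": 204.7},
]
delays = verify_latencies(events)
avg = sum(delays) / len(delays)  # daily average; plot this to see the drift
```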




&lt;h2&gt;
  
  
  What I Changed (And What You Can Steal)
&lt;/h2&gt;

&lt;p&gt;I didn't build a complex observability platform. I added &lt;strong&gt;three rules&lt;/strong&gt; to my orchestrator based on what the decision logs revealed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Circular delegation guard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;D&lt;/code&gt; chain contains duplicate &lt;code&gt;who&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Break loop, escalate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retry cap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tool call fails &amp;gt; 3 times&lt;/td&gt;
&lt;td&gt;Log &lt;code&gt;T&lt;/code&gt;, continue with partial data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Termination storm alert&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;gt; 5 &lt;code&gt;T&lt;/code&gt; with same reason in 1 min&lt;/td&gt;
&lt;td&gt;Pause pipeline, notify&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The next week's run: &lt;strong&gt;1.8 hours. $3.70.&lt;/strong&gt; And I caught an API outage before it silently corrupted a report.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters (And Why Most Agent Logs Are Useless)
&lt;/h2&gt;

&lt;p&gt;Here's what I learned: &lt;strong&gt;there's a difference between logging actions and logging decisions.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action Log&lt;/th&gt;
&lt;th&gt;Decision Log&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Called API X"&lt;/td&gt;
&lt;td&gt;"Delegated to Analyst because confidence &amp;lt; 0.7"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Task completed"&lt;/td&gt;
&lt;td&gt;"Terminated with partial data due to tool timeout"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Received response"&lt;/td&gt;
&lt;td&gt;"Verified output—hash matches, coherence score 0.82"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Action logs tell you &lt;em&gt;what&lt;/em&gt; happened. Decision logs tell you &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Without the "why," debugging multi-agent systems is just guesswork.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Get Started (Without Adopting a New Protocol)
&lt;/h2&gt;

&lt;p&gt;You don't need to rebuild your entire stack. Start simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Structured handoff logs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add one JSON line every time an agent hands off work to another. Include &lt;code&gt;from&lt;/code&gt;, &lt;code&gt;to&lt;/code&gt;, &lt;code&gt;reason&lt;/code&gt;, and a &lt;code&gt;hash&lt;/code&gt; of the payload. That alone will catch delegation loops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Add decision verbs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tag each log with what kind of decision it represents: &lt;code&gt;initiated&lt;/code&gt;, &lt;code&gt;delegated&lt;/code&gt;, &lt;code&gt;terminated&lt;/code&gt;, &lt;code&gt;verified&lt;/code&gt;. This makes it searchable and graphable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Chain them together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use a &lt;code&gt;ref&lt;/code&gt; field to link events. Now you have a trace of the entire decision chain, not just isolated events.&lt;/p&gt;
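&lt;p&gt;Levels 1 through 3 fit in a few lines of plain Python. The field names below follow the levels above but are a suggestion, not a fixed schema:&lt;/p&gt;

```python
import hashlib, json, time

def handoff_event(frm, to, reason, payload, ref=None):
    """One NDJSON line per handoff: Level 1 fields (from/to/reason/hash),
    a Level 2 decision verb, and a Level 3 ref link to the parent event."""
    event = {
        "verb": "delegated",                                   # Level 2
        "from": frm,
        "to": to,
        "reason": reason,
        "hash": hashlib.sha256(payload.encode()).hexdigest(),  # Level 1
        "ref": ref,                                            # Level 3
        "ts": time.time(),
    }
    return json.dumps(event)

line1 = handoff_event("Scout", "Analyst", "raw data ready", "pricing rows")
line2 = handoff_event("Analyst", "Writer", "insights ready", "summary",
                      ref=json.loads(line1)["hash"])
```

&lt;p&gt;Append each line to an NDJSON file and the whole chain stays greppable.&lt;/p&gt;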

&lt;p&gt;&lt;strong&gt;Level 4: Add signatures (if you need non-repudiation)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're building something where audit trails matter—compliance, finance, multi-party systems—you'll want cryptographic signatures. The format I used above is compatible with &lt;a href="https://github.com/hjs-spec/jep" rel="noopener noreferrer"&gt;JEP&lt;/a&gt; (Judgment Event Protocol), which adds signing and anti-replay protection out of the box. But you can get 80% of the value with plain JSON and a &lt;code&gt;ref&lt;/code&gt; field.&lt;/p&gt;




&lt;h2&gt;
  
  
  I Open-Sourced the Logger
&lt;/h2&gt;

&lt;p&gt;The logger I used for this experiment is now open-source. It's 200 lines of Python, works with any agent framework, and writes to NDJSON so you can &lt;code&gt;cat&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt; like it's 1999.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/hjs-spec/agent-decision-logger" rel="noopener noreferrer"&gt;GitHub: agent-decision-logger&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The core logger with all four decision verbs&lt;/li&gt;
&lt;li&gt;A Mermaid visualization script (see your agents' decision chains as flowcharts)&lt;/li&gt;
&lt;li&gt;Analysis tools to detect delegation loops and termination storms&lt;/li&gt;
&lt;li&gt;Complete examples and tests&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Your Turn
&lt;/h2&gt;

&lt;p&gt;My agent wasn't broken. It was just making expensive decisions I couldn't see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the weirdest thing you've found in your agent's logs?&lt;/strong&gt; Or are you flying blind?&lt;/p&gt;

&lt;p&gt;Drop a comment—I'd love to hear what you're seeing (or not seeing) in your own systems.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>agents</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Built a "Blame Finder" for AI Agents – So You Never Have to Guess Who Broke Production</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:42:03 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/i-built-a-blame-finder-for-ai-agents-so-you-never-have-to-guess-who-broke-production-252g</link>
      <guid>https://dev.to/hjs-foundation/i-built-a-blame-finder-for-ai-agents-so-you-never-have-to-guess-who-broke-production-252g</guid>
      <description>&lt;h2&gt;
  
  
  The 3 AM Slack Message We All Fear
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Hey, the multi-agent pipeline just deleted the staging database. Any idea which agent did it?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your PM Agent says it passed a clean requirement.&lt;br&gt;&lt;br&gt;
Your Coder Agent says it followed the spec perfectly.&lt;br&gt;&lt;br&gt;
Your Verifier Agent says it never even got the output.  &lt;/p&gt;

&lt;p&gt;You spend the next 4 hours grepping through thousands of lines of logs. You find nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the Accountability Vacuum.&lt;/strong&gt; And it's a nightmare.&lt;/p&gt;

&lt;p&gt;So I built a cure: &lt;strong&gt;&lt;a href="https://github.com/hjs-spec/agent-blame-finder" rel="noopener noreferrer"&gt;Agent Blame-Finder&lt;/a&gt;&lt;/strong&gt; – an open‑source cryptographic black box for multi‑agent systems.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Does It Do?
&lt;/h2&gt;

&lt;p&gt;In 3 seconds, it tells you &lt;strong&gt;exactly which agent messed up&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;blame-finder blame incident-abc123

🎯 Verdict: Coder-Agent
💡 Reason: Input requirement was correct, but output didn&lt;span class="s1"&gt;'t match expectations
🔗 Chain:
   ✅ PM-Agent – success
   ❌ Coder-Agent – failed
   ⏳ Verifier-Agent – not reached
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No more finger‑pointing. No more log spelunking. Just a verifiable, signed receipt of every decision.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works (The 10‑Second Technical Version)
&lt;/h2&gt;

&lt;p&gt;Under the hood, it implements two IETF Internet‑Drafts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JEP (Judgment Event Protocol)&lt;/strong&gt; – a minimal, cryptographically signed log format for agent decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JAC (Judgment Accountability Chain)&lt;/strong&gt; – a &lt;code&gt;task_based_on&lt;/code&gt; field that links every decision to its parent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each time an agent does something, a JEP receipt is created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"verb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"J"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"who"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Coder-Agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"when"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1742345678&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"what"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"task_based_on"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"parent-task-hash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sig"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ed25519 signature"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four verbs – &lt;strong&gt;J&lt;/strong&gt; (Judge), &lt;strong&gt;D&lt;/strong&gt; (Delegate), &lt;strong&gt;T&lt;/strong&gt; (Terminate), &lt;strong&gt;V&lt;/strong&gt; (Verify) – are all you need to model any accountability flow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integration: One Decorator
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;blame_finder&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BlameFinder&lt;/span&gt;

&lt;span class="n"&gt;finder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlameFinder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./blackbox_logs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@finder.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Coder-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requirement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Your existing logic – no changes needed
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;print(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Later, when something breaks:
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. The decorator handles hashing, signing, storage, and chain linking.&lt;/p&gt;
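&lt;p&gt;Conceptually, the &lt;code&gt;blame&lt;/code&gt; call walks the receipt chain in causal order and stops at the first failure. A simplified sketch of that idea, not the actual Blame-Finder internals:&lt;/p&gt;

```python
def find_blame(receipts):
    """Walk JEP-style receipts in causal order and return the first
    agent whose step failed; agents after it are 'not reached'.
    (Illustrative only -- the real tool also checks signatures.)"""
    chain = [(r["who"], r["status"]) for r in receipts]
    for r in receipts:
        if r["status"] == "failed":
            return {"verdict": r["who"], "chain": chain}
    return {"verdict": None, "chain": chain}

receipts = [
    {"who": "PM-Agent", "status": "success", "task_based_on": None},
    {"who": "Coder-Agent", "status": "failed", "task_based_on": "pm-hash"},
    {"who": "Verifier-Agent", "status": "not reached", "task_based_on": "coder-hash"},
]
verdict = find_blame(receipts)
```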




&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Without Blame‑Finder&lt;/th&gt;
&lt;th&gt;With Blame‑Finder&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hours of log hunting&lt;/td&gt;
&lt;td&gt;&lt;code&gt;blame-finder blame &amp;lt;id&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Maybe Agent X?" finger‑pointing&lt;/td&gt;
&lt;td&gt;Cryptographic proof&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No audit trail&lt;/td&gt;
&lt;td&gt;JEP receipts (immutable, signed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broken causality&lt;/td&gt;
&lt;td&gt;Full &lt;code&gt;task_based_on&lt;/code&gt; tree&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It’s like &lt;code&gt;git blame&lt;/code&gt; but for AI agents.&lt;br&gt;&lt;br&gt;
And because it’s based on IETF drafts, it’s not another walled garden – it’s infrastructure.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Road Ahead
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ Rust core engine (fast)&lt;/li&gt;
&lt;li&gt;✅ Python &amp;amp; TypeScript SDKs&lt;/li&gt;
&lt;li&gt;🚧 LangChain / CrewAI native adapters&lt;/li&gt;
&lt;li&gt;✅ Visual dashboard (&lt;code&gt;blame-finder dashboard&lt;/code&gt; – already works!)&lt;/li&gt;
&lt;li&gt;🚧 One‑click PDF/HTML blame reports&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Try It Right Now
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agent-blame-finder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then launch the dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;blame-finder dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll see a causality tree visualizer that looks like a Git graph – but for agent decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Contribute
&lt;/h2&gt;

&lt;p&gt;MIT licensed. We need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrations with popular agent frameworks&lt;/li&gt;
&lt;li&gt;More tests&lt;/li&gt;
&lt;li&gt;Documentation improvements&lt;/li&gt;
&lt;li&gt;Your crazy ideas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/hjs-spec/Agent-Blackbox" rel="noopener noreferrer"&gt;https://github.com/hjs-spec/Agent-Blackbox&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stop the guessing game. Start the Blame‑Finder.&lt;/em&gt; 🔍&lt;/p&gt;




&lt;p&gt;&lt;em&gt;P.S. The name is intentionally provocative. Your PM will hate it. Your CTO will love it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
      <category>agents</category>
    </item>
    <item>
      <title>Stop Debugging Black Boxes: How jac-agent Solves the 3 Hardest Pain Points in Training Production-Grade AI Agents</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Wed, 01 Apr 2026 02:23:59 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/stop-debugging-black-boxes-how-jac-agent-solves-the-3-hardest-pain-points-in-training-1j34</link>
      <guid>https://dev.to/hjs-foundation/stop-debugging-black-boxes-how-jac-agent-solves-the-3-hardest-pain-points-in-training-1j34</guid>
      <description>&lt;h2&gt;
  
  
  Subtitle
&lt;/h2&gt;

&lt;p&gt;If your training data is messy, your logs are useless, and you can’t prove why your agent failed — this is for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Training AI agents isn’t just about prompt engineering anymore.&lt;br&gt;&lt;br&gt;
If you’re building anything that touches production, you’re already hitting these walls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You can’t trace failures.&lt;/strong&gt; A bad decision derailed your pipeline — but you have no idea which step caused it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your training data is garbage.&lt;/strong&gt; You’re scraping unstructured logs to build SFT/RL datasets, wasting hours cleaning noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can’t deploy safely.&lt;/strong&gt; Regulators and auditors want proof your agent isn’t making harmful choices, but you have no way to show it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, I’m releasing &lt;code&gt;jac-agent&lt;/code&gt;: an open-source SDK built on IETF Internet-Drafts, designed to solve exactly these problems — while adding zero overhead to your training loop.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/hjs-spec/jac-agent" rel="noopener noreferrer"&gt;github.com/hjs-spec/jac-agent&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The 3 Pain Points of Training Production Agents
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. The "Black Box" Debugging Nightmare
&lt;/h3&gt;

&lt;p&gt;You run an agent for 100 steps. It makes 99 good decisions, then one catastrophic call.&lt;br&gt;&lt;br&gt;
Your logs look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO: Processing user request
INFO: Calling tool
INFO: Tool response received
ERROR: Pipeline failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You have no causal link between the steps. No way to know &lt;em&gt;why&lt;/em&gt; it failed, only that it did.&lt;/p&gt;

&lt;p&gt;This isn’t just annoying — it makes training slow, risky, and impossible to validate.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Training Data That Costs You Hours to Clean
&lt;/h3&gt;

&lt;p&gt;To fine-tune your agent, you need structured trajectories.&lt;br&gt;&lt;br&gt;
But raw logs are unstructured, inconsistent, and often missing context.&lt;br&gt;&lt;br&gt;
You end up writing brittle scripts to parse free-text outputs, only to find half the data is corrupted or incomplete.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. The "How Do We Prove It’s Safe?" Compliance Gap
&lt;/h3&gt;

&lt;p&gt;Regulators and enterprise clients are already asking:&lt;br&gt;&lt;br&gt;
&lt;em&gt;"Can you show us exactly why your agent made that decision?"&lt;/em&gt;&lt;br&gt;&lt;br&gt;
If you can’t, you can’t deploy.&lt;/p&gt;


&lt;h2&gt;
  
  
  How &lt;code&gt;jac-agent&lt;/code&gt; Fixes All 3 Problems
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;jac-agent&lt;/code&gt; isn’t just another logging library. It’s built on three open standards (JEP/HJS/JAC) to turn your agent’s decisions into &lt;strong&gt;provable, structured, and training-ready data&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Trace Failures to the Exact Step (No More Black Boxes)
&lt;/h3&gt;

&lt;p&gt;Every decision your agent makes is recorded in an immutable, cryptographically verified chain.&lt;br&gt;&lt;br&gt;
You get a clear causal path from the root task to the final action — no guesswork, no missing links.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jac_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_trace_chain&lt;/span&gt;

&lt;span class="c1"&gt;# Record decisions in your agent loop
&lt;/span&gt;&lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Route selection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judgment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Choose Route A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low congestion, high safety&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cost check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judgment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approve Route A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Under budget, valid tolls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Print the full causal trace
&lt;/span&gt;&lt;span class="nf"&gt;show_trace_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You see exactly which decision caused a failure, in seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Turn Logs Into Training Data — Automatically
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;jac-agent&lt;/code&gt;’s &lt;code&gt;task_based_on&lt;/code&gt; field structures every decision into causal chains.&lt;br&gt;&lt;br&gt;
When you’re ready to train, one call exports a clean, ready-to-use dataset for SFT/RL/DPO.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jac_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;enable_training_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;export_training_dataset&lt;/span&gt;

&lt;span class="c1"&gt;# Enable zero-overhead mode for training
&lt;/span&gt;&lt;span class="nf"&gt;enable_training_mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run your training loop as usual — logging happens in memory
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judgment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Action &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent observation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Export structured causal trajectories
&lt;/span&gt;&lt;span class="nf"&gt;export_training_dataset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No parsing. No cleaning. Just high-quality training data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Build a Provable, Auditable Safety Layer
&lt;/h3&gt;

&lt;p&gt;Every record is cryptographically signed, timestamped, and linked to the previous step.&lt;br&gt;&lt;br&gt;
You can export a formal audit report at any time to prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your agent followed its rules.&lt;/li&gt;
&lt;li&gt;Decisions were made in order.&lt;/li&gt;
&lt;li&gt;No logs were altered after the fact.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jac_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;export_audit_report&lt;/span&gt;
&lt;span class="nf"&gt;export_audit_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_audit_2026-04-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get compliance-ready evidence without changing your agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Under the Hood: Built on Open Standards
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;jac-agent&lt;/code&gt; isn’t proprietary. It’s the first reference implementation of three IETF Internet-Draft specifications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JEP&lt;/strong&gt;: Standard event format for agent decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HJS&lt;/strong&gt;: Immutable accountability layer with privacy controls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JAC&lt;/strong&gt;: Causal chain linking via &lt;code&gt;task_based_on&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No vendor lock-in.&lt;/li&gt;
&lt;li&gt;Interoperable with any agent framework.&lt;/li&gt;
&lt;li&gt;Built to evolve with open standards, not closed tools.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It in 2 Minutes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jac-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jac_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;show_trace_chain&lt;/span&gt;

&lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judgment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approve action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Policy check passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;show_trace_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’re already recording verified decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Training production-grade agents requires more than good prompts. It requires visibility, safety, and proof.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;jac-agent&lt;/code&gt;, you don’t have to choose between training speed and auditability — you get both.&lt;/p&gt;

&lt;p&gt;I’d love your feedback. Star the repo, open an issue, or drop a comment below.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/hjs-spec/jac-agent" rel="noopener noreferrer"&gt;github.com/hjs-spec/jac-agent&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>opensource</category>
      <category>agents</category>
    </item>
    <item>
      <title>Full-Link Accountability for AI Agents</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Fri, 27 Mar 2026 04:11:57 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/full-link-accountability-for-ai-agents-41h6</link>
      <guid>https://dev.to/hjs-foundation/full-link-accountability-for-ai-agents-41h6</guid>
      <description>&lt;h2&gt;
  
  
  Core Event Primitives
&lt;/h2&gt;

&lt;p&gt;Four standard event types (J, D, V, T) cover the full accountability lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;J: Judge – Create and initiate a judgment/decision&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;D: Delegate – Transfer authority or assign a task&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;V: Verify – Review and validate a record&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;T: Terminate – End a judgment or task lifecycle&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
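&lt;p&gt;A minimal sketch of that lifecycle, assuming a hypothetical &lt;code&gt;make_event&lt;/code&gt; helper (signatures and hashes are omitted for brevity; a real record is signed, and the parent hashes here are placeholders):&lt;/p&gt;

```python
import time
import uuid

def make_event(verb, who, task_based_on=None, ref=""):
    # Minimal JEP-style record; the sig field is omitted in this sketch
    return {
        "jep": "1",
        "verb": verb,
        "who": who,
        "when": int(time.time()),
        "nonce": str(uuid.uuid4()),
        "task_based_on": task_based_on,
        "ref": ref,
    }

# Full lifecycle: Judge starts a chain, Delegate hands off,
# Verify reviews a prior event, Terminate closes the task
j = make_event("J", "did:example:orchestrator")
d = make_event("D", "did:example:orchestrator", task_based_on="hash-of-j")
v = make_event("V", "did:example:auditor", task_based_on="hash-of-d", ref="id-of-d")
t = make_event("T", "did:example:orchestrator", task_based_on="hash-of-d")

print([e["verb"] for e in (j, d, v, t)])  # ['J', 'D', 'V', 'T']
```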

&lt;h1&gt;
  
  
  Pain Points &amp;amp; Technical Solutions
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Pain Point 1: Broken chain in multi-agent workflows, unable to trace root cause
&lt;/h2&gt;

&lt;p&gt;Trigger Primitives: J + D&lt;/p&gt;

&lt;p&gt;Solution: Add a task_based_on field to every record to enforce a hash reference to the parent task. A null value indicates the start of a chain; a populated value links to a preceding action, ensuring full end-to-end traceability.&lt;/p&gt;
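&lt;p&gt;The traceability this enables can be sketched as a walk back along &lt;code&gt;task_based_on&lt;/code&gt; references until a null value marks the chain's start (the record store layout and hashing are illustrative assumptions):&lt;/p&gt;

```python
import hashlib
import json

def record_hash(record):
    # Illustrative content hash; the draft's exact canonicalization may differ
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def trace_to_root(store, start_hash):
    # Follow task_based_on references until None marks the chain's start
    chain = []
    current = start_hash
    while current is not None:
        record = store[current]  # a missing parent here means the chain is broken
        chain.append(record)
        current = record["task_based_on"]
    return chain

root = {"verb": "J", "who": "did:example:a", "task_based_on": None}
mid = {"verb": "D", "who": "did:example:a", "task_based_on": record_hash(root)}
leaf = {"verb": "D", "who": "did:example:b", "task_based_on": record_hash(mid)}
store = {record_hash(r): r for r in (root, mid, leaf)}

chain = trace_to_root(store, record_hash(leaf))
print([r["verb"] for r in chain])  # ['D', 'D', 'J'] (leaf back to root)
```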

&lt;h2&gt;
  
  
  Pain Point 2: Unclear accountability, hard to assign fault for errors
&lt;/h2&gt;

&lt;p&gt;Trigger Primitives: D + V&lt;/p&gt;

&lt;p&gt;Solution: Permanently include a who field in every record, bound to the actor’s DID or public key hash. Combined with cryptographic signing, records become non-repudiable and tamper-proof, enabling precise accountability.&lt;/p&gt;
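&lt;p&gt;A toy demonstration of the tamper-evidence property. Note this uses a symmetric HMAC stand-in for brevity; the spec calls for asymmetric signing (Ed25519 JWS), so this is not the real signature scheme:&lt;/p&gt;

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"  # stand-in only: the spec uses asymmetric Ed25519/JWS, not HMAC

def sign(record):
    # Sign everything except the sig field itself
    body = json.dumps({k: v for k, v in record.items() if k != "sig"}, sort_keys=True)
    return hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()

record = {"verb": "D", "who": "did:example:agent-789", "what": "deploy"}
record["sig"] = sign(record)

print(hmac.compare_digest(record["sig"], sign(record)))   # True: record intact
record["who"] = "did:example:someone-else"                # tamper with the actor
print(hmac.compare_digest(record["sig"], sign(record)))   # False: signature breaks
```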

&lt;h2&gt;
  
  
  Pain Point 3: Lack of compliant audit evidence for regulatory requirements
&lt;/h2&gt;

&lt;p&gt;Trigger Primitives: V + T&lt;/p&gt;

&lt;p&gt;Solution: Equip every record with a timestamp, unique nonce, and signature verification. Full audit trails with replay protection are natively supported, directly satisfying compliance requirements under the EU AI Act and Singapore IMDA frameworks.&lt;/p&gt;
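&lt;p&gt;The nonce and timestamp checks can be sketched as follows (the in-memory nonce set is a stand-in for the shared, persistent store a real deployment would need):&lt;/p&gt;

```python
import time
import uuid

SEEN_NONCES = set()   # stand-in: production needs a shared, persistent store
WINDOW_SECONDS = 300  # the +/-5 minute clock-skew tolerance

def accept(record, now=None):
    now = int(time.time()) if now is None else now
    # Reject records outside the clock-skew window (stale or post-dated)
    if abs(now - record["when"]) not in range(WINDOW_SECONDS + 1):
        return "INVALID"
    # Reject replays: each nonce may be used exactly once
    if record["nonce"] in SEEN_NONCES:
        return "INVALID"
    SEEN_NONCES.add(record["nonce"])
    return "VALID"

fresh = {"when": int(time.time()), "nonce": str(uuid.uuid4())}
print(accept(fresh))   # VALID
print(accept(fresh))   # INVALID (same nonce replayed)
stale = {"when": int(time.time()) - 3600, "nonce": str(uuid.uuid4())}
print(accept(stale))   # INVALID (outside the timestamp window)
```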

&lt;h2&gt;
  
  
  Core Data Structure (Ready for Use)
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "jep": "1",
  "verb": "J",
  "who": "did:example:agent-789",
  "when": 1742345678,
  "what": "122059e8878aa9a38f4d123456789abcdef01234",
  "nonce": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "aud": "https://platform.example.com",
  "task_based_on": "hash-of-parent-task",
  "ref": "",
  "sig": "eyJhbGciOiJFZERTQSJ9..."
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Field Definitions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;verb: Required; one of [J, D, V, T]&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;who: Required; unique identifier of the actor&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;when: Required; Unix timestamp to prevent stale/tampered records&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;nonce: Required; UUIDv4 to prevent replay attacks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;task_based_on: Traceability field; hash of the parent task&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ref: For verification events only; references the ID of the event being checked&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;sig: Required; JWS digital signature&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Verification Logic (Pseudocode)
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def verify_record(record):
    # 1. Verify signature integrity
    if not verify_jws_signature(record):
        return "INVALID"
    # 2. Ensure nonce uniqueness to prevent replay
    if not is_valid_nonce(record["nonce"]):
        return "INVALID"
    # 3. Validate timestamp within acceptable window
    if not is_within_time_window(record["when"]):
        return "INVALID"
    # 4. Verify parent task chain integrity
    if record.get("task_based_on") and not task_exists(record["task_based_on"]):
        return "INVALID"
    # 5. Verify (V) events must include a reference
    if record["verb"] == "V" and not record.get("ref"):
        return "INVALID"
    return "VALID"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Critical Security Rules
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;All records must be signed; any tampering invalidates the signature&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nonces must be globally unique; duplicate requests are rejected&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Timestamp tolerance: ±5 minutes to account for clock skew&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verify (V) events must include a ref field to avoid circular validation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ed25519 recommended; support for SM2, ECDSA P-256, and post-quantum algorithms&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Optional Extensions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Task State: Add status field (pending, executing, completed, terminated)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assignment Log: Record DID of assigner and assignee&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Result Validation: Include confidence score and human review flag&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fault Handling: Log missing parent tasks and failure reasons for chain breaks&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation Based on These IETF Drafts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;draft-wang-jep-judgment-event-protocol-01&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;draft-wang-jac-00&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Beyond App-Level Harness: A Technical Analysis of Native Underlying AI Constraints</title>
      <dc:creator> HJS Foundation</dc:creator>
      <pubDate>Mon, 23 Mar 2026 09:48:13 +0000</pubDate>
      <link>https://dev.to/hjs-foundation/beyond-app-level-harness-a-technical-analysis-of-native-underlying-ai-constraints-2d54</link>
      <guid>https://dev.to/hjs-foundation/beyond-app-level-harness-a-technical-analysis-of-native-underlying-ai-constraints-2d54</guid>
      <description>&lt;p&gt;As AI engineering evolves, a key technical distinction in Harness design has become increasingly clear. Current Harness implementations focus on app-level, post-execution adjustments, while a more foundational approach—built into the protocol layer—offers distinct advantages in AI control and reliability.&lt;/p&gt;

&lt;p&gt;This analysis focuses on the technical differences between these two approaches, using protocol-level designs for AI boundary and accountability as a framework for comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Characteristics of App-Level Harness Implementations
&lt;/h2&gt;

&lt;p&gt;Existing Harness solutions deliver practical value through a set of operational adjustments, all implemented as layers built on top of pre-existing models. Core technical components include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Context engineering to curate and deliver relevant information to AI agents during execution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI/CD linting and structured testing to identify and correct errors after execution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Behavioral guideline documents to establish operational parameters for agents&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tool curation to limit agent capabilities to predefined scopes&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These components effectively translate raw model capability into usable output—with documented improvements in performance metrics when optimized Harnesses are applied. However, their technical limitation lies in being soft constraints: they operate as external guidance rather than inherent controls, creating potential for agent drift or boundary bypass under complex operational conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Advantages of Protocol-Level Harness Design
&lt;/h2&gt;

&lt;p&gt;A protocol-level approach differs fundamentally by embedding control mechanisms into the core operational layer, rather than adding them as external wrappers. This design prioritizes inherent constraints and accountability, with three key technical differentiators:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Native Isolation vs. Post-Execution Constraints
&lt;/h2&gt;

&lt;p&gt;Protocol-level designs establish hard boundaries between distinct entities from the outset. Instead of relying on external prompts or linting to guide behavior, they define separate execution domains, identity isolation, and permission boundaries that are enforced by the protocol itself and cannot be bypassed from the application layer. This shifts control from reactive adjustment to proactive prevention: an out-of-bounds action is rejected before it executes rather than corrected after the fact.&lt;/p&gt;
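&lt;p&gt;As a loose illustration (the agent IDs and policy store here are hypothetical), a protocol-layer gate rejects out-of-scope actions before they execute, in contrast to post-execution linting:&lt;/p&gt;

```python
# Hypothetical sketch: a protocol-layer gate that checks an agent's
# permission boundary before any action runs, rather than auditing output afterwards.
BOUNDARIES = {
    "did:example:reader": {"read"},
    "did:example:deployer": {"read", "deploy"},
}

class BoundaryViolation(Exception):
    pass

def execute(agent_id, action):
    # Enforcement happens before execution: out-of-scope actions never run
    allowed = BOUNDARIES.get(agent_id, set())
    if action not in allowed:
        raise BoundaryViolation(f"{agent_id} may not perform {action!r}")
    return f"{action} executed for {agent_id}"

print(execute("did:example:deployer", "deploy"))  # permitted, runs
try:
    execute("did:example:reader", "deploy")       # blocked before execution
except BoundaryViolation as err:
    print("rejected:", err)
```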

&lt;h2&gt;
  
  
  2. Accountability as a Core Technical Primitive
&lt;/h2&gt;

&lt;p&gt;Unlike app-level Harnesses that focus on error correction after occurrence, protocol-level designs embed accountability into the foundational architecture. This includes a technical framework for tracking agent actions, linking them to verifiable identities, and enabling full traceability—all integrated natively into the protocol. This moves beyond feedback loops to create a persistent, auditable system for AI behavior accountability.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Open Source Interoperability as a Technical Priority
&lt;/h2&gt;

&lt;p&gt;Protocol-level Harness designs prioritize open source principles to enable interoperability across diverse model architectures and toolchains. By avoiding proprietary lock-in, they create a universal foundation that can be adopted, extended, and integrated into varied AI workflows. This technical design choice addresses a critical challenge as AI scales: preventing fragmentation across different Harness implementations.&lt;/p&gt;

&lt;p&gt;The Technical Case for Depth in Harness Design&lt;/p&gt;

&lt;p&gt;A simple technical analogy illustrates the core difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;App-level Harnesses operate like external safety features—effective for standard conditions but vulnerable to bypass under complex scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Protocol-level Harnesses function as inherent structural controls—integrated into the operational foundation to eliminate the technical possibility of drift or bypass.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The growing recognition of Harness importance in AI engineering is driving a shift toward deeper, more integrated control mechanisms. Soft, external constraints are sufficient for small-scale, well-defined use cases, but as AI systems become more autonomous and complex, a protocol-level approach becomes technically necessary. It ensures that control, isolation, and accountability scale proportionally with AI capability, rather than relying on external adjustments that may fail under stress.&lt;/p&gt;

&lt;p&gt;The value of protocol-level Harness design lies in its ability to create a foundational layer for reliable, controllable AI at scale. By embedding control mechanisms into the protocol itself, it addresses the technical limitations of app-level implementations, offering a more robust solution for increasingly complex AI systems.&lt;/p&gt;

&lt;p&gt;Further technical discussion and collaboration around protocol-level Harness design are encouraged to advance the reliability and controllability of AI systems.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>security</category>
      <category>blockchain</category>
    </item>
  </channel>
</rss>
