Shunsuke Hayashi

Posted on Mar 20 • Originally published at github.com

Your AI Agent Skills Are Silently Breaking Right Now (And Nobody's Watching)

#ai #opensource #monitoring #agents

91% of ML models degrade over time. Only 5% of production agents have real monitoring. Here's how to fix it.

Your AI agent looked great on demo day.

Three months later, it's quietly returning wrong answers. Not crashing. Not throwing errors. Just... subtly worse. Your API caller skill scored 0.95 in January. By March, it's 0.42. Nobody changed your code. Nobody noticed.

This is the biggest unsolved problem in AI agent operations — and it's happening to everyone.

The Data Is Terrifying

Let's start with the numbers:

91% of ML models experience degradation over time — MIT research across 32 datasets and 4 industries
67% of enterprises report measurable AI degradation within 12 months — Gartner
Only 5% of production AI agents have mature monitoring — Cleanlab 2025 survey
90-95% of AI initiatives fail to reach sustained production value — Industry reports

But here's the number that should keep you up at night:

If an AI agent achieves 85% accuracy per action — which sounds great — a 10-step workflow only succeeds 20% of the time.

That's not a bug. That's math. And it gets worse every week your skills go unmonitored.

The Four Horsemen of Skill Degradation

Agent skills don't fail catastrophically. They rot gradually. Here are the four ways it happens:

1. API Changes

External APIs change without warning. A field gets renamed. A response format shifts. Your agent was calling v2/users but the provider silently redirected to v3/users with a different schema.

Week 1: API returns { "name": "Alice" }           → skill score: 1.0
Week 2: API returns { "full_name": "Alice" }       → skill score: 0.6
Week 3: API returns { "user": { "name": "Alice" }} → skill score: 0.3

Nobody changed your code. The score just... dropped.

2. Model Updates

Your LLM provider ships an update. The model is "better" overall, but your carefully tuned prompts now produce slightly different outputs. Your parsing logic breaks on 15% of responses. No error. Just worse results.

3. Auth Expiry

OAuth tokens expire. API keys get rotated. Service accounts lose permissions. Your skill still runs — it just gets 401s that it handles as "empty results" instead of errors.

4. Prompt Drift

Over time, small modifications accumulate. Someone adds "be more concise." Someone else adds "include all details." The prompt contradicts itself. The skill still works, but its quality oscillates unpredictably.

Why Traditional Monitoring Fails

Here's the gap: most agent frameworks handle execution. None handle operational health.

Framework	Executes tasks	Monitors skill quality over time
LangGraph	Yes	No
CrewAI	Yes	No
AutoGen	Yes	No
Mastra	Yes	No
VoltAgent	Yes	No

Your APM tool tells you the agent ran. It doesn't tell you the agent's output quality dropped 40% compared to last week.

Your error tracker catches crashes. It doesn't catch the skill that returns "technically valid but wrong" results.

Your observability stack monitors latency and error rates. It's completely blind to semantic degradation — the silent drift from 95% quality to 60% quality with zero errors.

The Industry Is Waking Up

2026 is the year the industry acknowledged this problem:

OpenAI shipped Skill Eval — unit tests for agent skills.

Google's Minko Gechev published Skill Eval with the tagline about skills deserving the same rigor as code.

Spring AI acknowledged there's no built-in versioning system for skills — if you update a skill, all applications immediately use the new version with no rollback path.

CNBC ran a feature on "Silent failure at scale" — calling it the AI risk nobody sees coming.

The EU AI Act mandates continuous monitoring for high-risk AI systems by August 2, 2026. If your agents make decisions that affect people, monitoring isn't optional anymore.

The consensus is clear: agent skills need the same operational rigor as production software.

But detecting the problem is only half the battle.

Detection Alone Isn't Enough

OpenAI's Skill Eval and Google's Skill Eval are valuable tools. They tell you that a skill degraded.

They don't tell you why.

They don't tell you what to do about it.

And they certainly don't fix it automatically.

That's the gap we built Agent Skill Bus to fill.

Introducing the Self-Improving Loop

Agent Skill Bus is a framework-agnostic runtime that adds three capabilities no existing framework provides:

External Changes ──→ Knowledge Watcher ──→ Prompt Request Bus ──→ Execute
                                                ↑                    │
                                                │                    ↓
                                          Self-Improving ←── Skill Runs Log
                                             Skills

Module 1: Self-Improving Skills

A 7-step quality loop that runs continuously:

OBSERVE → ANALYZE → DIAGNOSE → PROPOSE → EVALUATE → APPLY → RECORD

Every skill execution gets scored (0.0 to 1.0). The system watches for:

Score drops — Moving average falls below threshold
Drift — 15%+ score decrease week-over-week
Consecutive failures — 3+ failures in a row triggers immediate alert

When degradation is detected, an LLM reads the failing skill definition + error logs, diagnoses the root cause, and proposes a fix. Low-risk fixes (like updating an API endpoint URL) are applied automatically. High-risk fixes get routed to a human for approval.

This is the key differentiator: not just detection, but diagnosis and repair.

Module 2: Knowledge Watcher

Instead of waiting for skills to break, Knowledge Watcher proactively monitors for changes that will break them:

Tier 1 (every 6 hours): Dependency versions, API endpoint health, config drift
Tier 2 (daily): GitHub issue patterns, user feedback, platform changelogs
Tier 3 (weekly): Industry trends, competitor releases, best practice updates

When a breaking change is detected upstream, the system generates a task to update affected skills before they fail in production.

Module 3: Prompt Request Bus

A DAG-based task queue that coordinates multi-agent workflows:

Tasks declare dependencies (dependsOn) for automatic ordering
File-level locking prevents two agents from editing the same resource
Priority routing (critical > high > medium > low)
Deduplication prevents the same task from being queued twice

Zero Dependencies. Just JSONL Files.

This is the part that surprises people.

Agent Skill Bus doesn't use a database. It doesn't need Redis or RabbitMQ. It doesn't lock you into a specific framework.

Everything is stored in plain JSONL files:

.agent-skill-bus/
├── skill-runs.jsonl         # Execution history
├── queue.jsonl              # Task queue
├── knowledge-diffs.jsonl    # Detected changes
└── active-locks.jsonl       # File locks

Any language can read these files. Any CI pipeline can process them. Any framework can integrate by simply appending a line of JSON.

This means you can add Agent Skill Bus to your existing setup in 30 seconds:

npx agent-skill-bus init

And start recording skill executions immediately:

npx agent-skill-bus record-run \
  --agent my-agent \
  --skill api-caller \
  --task "fetch user data" \
  --result success \
  --score 0.95

Works With Everything

Framework	How to integrate
Claude Code	Drop a SKILL.md into `.claude/skills/`
Codex	Drop into `.codex/skills/`
LangGraph	Call `record-run` in your tool functions
CrewAI	Add a task completion callback
Custom	Append to JSONL files directly

For Claude Code users, just add one line to your AGENTS.md:

After completing any task, log the result:
npx agent-skill-bus record-run --agent claude --skill <skill-name> --task "<task>" --result <success|fail|partial> --score <0.0-1.0>

That's it. The self-improving loop runs automatically from there.

Production Results

We run Agent Skill Bus in production at LLC Miyabi, coordinating 42 AI agents daily:

27 tasks/day average throughput
44 cron jobs feeding the bus
57% reduction in skill failures after enabling the self-improvement loop
7-minute fastest security incident response

The 57% number is the one that matters. More than half of skill failures were preventable — they were caused by silent degradation that the loop caught and fixed before users noticed.

The Bigger Picture

We're at an inflection point. AI agents are moving from demos to production. Gartner predicts 40% of enterprise applications will embed task-specific AI agents by end of 2026, up from less than 5% in 2025.

That's an 8x increase in agents — without a corresponding increase in monitoring infrastructure.

The software industry solved this decades ago. We don't ship code without tests, CI/CD, and monitoring. Agent skills deserve the same treatment.

Agent Skill Bus is our answer: the missing runtime that keeps agent skills healthy in production.

Get Started

# Install and initialize
npx agent-skill-bus init

# Record your first skill execution
npx agent-skill-bus record-run --agent my-agent --skill api-caller --task "test" --result success --score 1.0

# Check what needs attention
npx agent-skill-bus flagged

# See the dashboard
npx agent-skill-bus dashboard

GitHub: github.com/ShunsukeHayashi/agent-skill-bus

Full ecosystem (110+ skills): agentskills.bath.me

Zero dependencies. MIT licensed. Framework-agnostic.

Built by LLC Miyabi — running 42 AI agents in production daily.

Follow @The_AGI_WAY for updates.

DEV Community