DEV Community

Shunsuke Hayashi
Shunsuke Hayashi

Posted on • Originally published at github.com

Your AI Agent Skills Are Silently Breaking Right Now (And Nobody's Watching)

91% of ML models degrade over time. Only 5% of production agents have real monitoring. Here's how to fix it.


Your AI agent looked great on demo day.

Three months later, it's quietly returning wrong answers. Not crashing. Not throwing errors. Just... subtly worse. Your API caller skill scored 0.95 in January. By March, it's 0.42. Nobody changed your code. Nobody noticed.

This is the biggest unsolved problem in AI agent operations — and it's happening to everyone.

The Data Is Terrifying

Let's start with the numbers:

  • 91% of ML models experience degradation over time — MIT research across 32 datasets and 4 industries
  • 67% of enterprises report measurable AI degradation within 12 months — Gartner
  • Only 5% of production AI agents have mature monitoring — Cleanlab 2025 survey
  • 90-95% of AI initiatives fail to reach sustained production value — Industry reports

But here's the number that should keep you up at night:

If an AI agent achieves 85% accuracy per action — which sounds great — a 10-step workflow only succeeds 20% of the time.

That's not a bug. That's math. And it gets worse every week your skills go unmonitored.

The Four Horsemen of Skill Degradation

Agent skills don't fail catastrophically. They rot gradually. Here are the four ways it happens:

1. API Changes

External APIs change without warning. A field gets renamed. A response format shifts. Your agent was calling v2/users but the provider silently redirected to v3/users with a different schema.

Week 1: API returns { "name": "Alice" }           → skill score: 1.0
Week 2: API returns { "full_name": "Alice" }       → skill score: 0.6
Week 3: API returns { "user": { "name": "Alice" }} → skill score: 0.3
Enter fullscreen mode Exit fullscreen mode

Nobody changed your code. The score just... dropped.

2. Model Updates

Your LLM provider ships an update. The model is "better" overall, but your carefully tuned prompts now produce slightly different outputs. Your parsing logic breaks on 15% of responses. No error. Just worse results.

3. Auth Expiry

OAuth tokens expire. API keys get rotated. Service accounts lose permissions. Your skill still runs — it just gets 401s that it handles as "empty results" instead of errors.

4. Prompt Drift

Over time, small modifications accumulate. Someone adds "be more concise." Someone else adds "include all details." The prompt contradicts itself. The skill still works, but its quality oscillates unpredictably.

Why Traditional Monitoring Fails

Here's the gap: most agent frameworks handle execution. None handle operational health.

Framework Executes tasks Monitors skill quality over time
LangGraph Yes No
CrewAI Yes No
AutoGen Yes No
Mastra Yes No
VoltAgent Yes No

Your APM tool tells you the agent ran. It doesn't tell you the agent's output quality dropped 40% compared to last week.

Your error tracker catches crashes. It doesn't catch the skill that returns "technically valid but wrong" results.

Your observability stack monitors latency and error rates. It's completely blind to semantic degradation — the silent drift from 95% quality to 60% quality with zero errors.

The Industry Is Waking Up

2026 is the year the industry acknowledged this problem:

OpenAI shipped Skill Eval — unit tests for agent skills.

Google's Minko Gechev published Skill Eval with the tagline about skills deserving the same rigor as code.

Spring AI acknowledged there's no built-in versioning system for skills — if you update a skill, all applications immediately use the new version with no rollback path.

CNBC ran a feature on "Silent failure at scale" — calling it the AI risk nobody sees coming.

The EU AI Act mandates continuous monitoring for high-risk AI systems by August 2, 2026. If your agents make decisions that affect people, monitoring isn't optional anymore.

The consensus is clear: agent skills need the same operational rigor as production software.

But detecting the problem is only half the battle.

Detection Alone Isn't Enough

OpenAI's Skill Eval and Google's Skill Eval are valuable tools. They tell you that a skill degraded.

They don't tell you why.

They don't tell you what to do about it.

And they certainly don't fix it automatically.

That's the gap we built Agent Skill Bus to fill.

Introducing the Self-Improving Loop

Agent Skill Bus is a framework-agnostic runtime that adds three capabilities no existing framework provides:

External Changes ──→ Knowledge Watcher ──→ Prompt Request Bus ──→ Execute
                                                ↑                    │
                                                │                    ↓
                                          Self-Improving ←── Skill Runs Log
                                             Skills
Enter fullscreen mode Exit fullscreen mode

Module 1: Self-Improving Skills

A 7-step quality loop that runs continuously:

OBSERVE → ANALYZE → DIAGNOSE → PROPOSE → EVALUATE → APPLY → RECORD
Enter fullscreen mode Exit fullscreen mode

Every skill execution gets scored (0.0 to 1.0). The system watches for:

  • Score drops — Moving average falls below threshold
  • Drift — 15%+ score decrease week-over-week
  • Consecutive failures — 3+ failures in a row triggers immediate alert

When degradation is detected, an LLM reads the failing skill definition + error logs, diagnoses the root cause, and proposes a fix. Low-risk fixes (like updating an API endpoint URL) are applied automatically. High-risk fixes get routed to a human for approval.

This is the key differentiator: not just detection, but diagnosis and repair.

Module 2: Knowledge Watcher

Instead of waiting for skills to break, Knowledge Watcher proactively monitors for changes that will break them:

  • Tier 1 (every 6 hours): Dependency versions, API endpoint health, config drift
  • Tier 2 (daily): GitHub issue patterns, user feedback, platform changelogs
  • Tier 3 (weekly): Industry trends, competitor releases, best practice updates

When a breaking change is detected upstream, the system generates a task to update affected skills before they fail in production.

Module 3: Prompt Request Bus

A DAG-based task queue that coordinates multi-agent workflows:

  • Tasks declare dependencies (dependsOn) for automatic ordering
  • File-level locking prevents two agents from editing the same resource
  • Priority routing (critical > high > medium > low)
  • Deduplication prevents the same task from being queued twice

Zero Dependencies. Just JSONL Files.

This is the part that surprises people.

Agent Skill Bus doesn't use a database. It doesn't need Redis or RabbitMQ. It doesn't lock you into a specific framework.

Everything is stored in plain JSONL files:

.agent-skill-bus/
├── skill-runs.jsonl         # Execution history
├── queue.jsonl              # Task queue
├── knowledge-diffs.jsonl    # Detected changes
└── active-locks.jsonl       # File locks
Enter fullscreen mode Exit fullscreen mode

Any language can read these files. Any CI pipeline can process them. Any framework can integrate by simply appending a line of JSON.

This means you can add Agent Skill Bus to your existing setup in 30 seconds:

npx agent-skill-bus init
Enter fullscreen mode Exit fullscreen mode

And start recording skill executions immediately:

npx agent-skill-bus record-run \
  --agent my-agent \
  --skill api-caller \
  --task "fetch user data" \
  --result success \
  --score 0.95
Enter fullscreen mode Exit fullscreen mode

Works With Everything

Framework How to integrate
Claude Code Drop a SKILL.md into .claude/skills/
Codex Drop into .codex/skills/
LangGraph Call record-run in your tool functions
CrewAI Add a task completion callback
Custom Append to JSONL files directly

For Claude Code users, just add one line to your AGENTS.md:

After completing any task, log the result:
npx agent-skill-bus record-run --agent claude --skill <skill-name> --task "<task>" --result <success|fail|partial> --score <0.0-1.0>
Enter fullscreen mode Exit fullscreen mode

That's it. The self-improving loop runs automatically from there.

Production Results

We run Agent Skill Bus in production at LLC Miyabi, coordinating 42 AI agents daily:

  • 27 tasks/day average throughput
  • 44 cron jobs feeding the bus
  • 57% reduction in skill failures after enabling the self-improvement loop
  • 7-minute fastest security incident response

The 57% number is the one that matters. More than half of skill failures were preventable — they were caused by silent degradation that the loop caught and fixed before users noticed.

The Bigger Picture

We're at an inflection point. AI agents are moving from demos to production. Gartner predicts 40% of enterprise applications will embed task-specific AI agents by end of 2026, up from less than 5% in 2025.

That's an 8x increase in agents — without a corresponding increase in monitoring infrastructure.

The software industry solved this decades ago. We don't ship code without tests, CI/CD, and monitoring. Agent skills deserve the same treatment.

Agent Skill Bus is our answer: the missing runtime that keeps agent skills healthy in production.


Get Started

# Install and initialize
npx agent-skill-bus init

# Record your first skill execution
npx agent-skill-bus record-run --agent my-agent --skill api-caller --task "test" --result success --score 1.0

# Check what needs attention
npx agent-skill-bus flagged

# See the dashboard
npx agent-skill-bus dashboard
Enter fullscreen mode Exit fullscreen mode

GitHub: github.com/ShunsukeHayashi/agent-skill-bus

Full ecosystem (110+ skills): agentskills.bath.me

Zero dependencies. MIT licensed. Framework-agnostic.


Built by LLC Miyabi — running 42 AI agents in production daily.

Follow @The_AGI_WAY for updates.

Top comments (0)