My AI Agent Deleted My Skills and Thought It Did a Good Job

#ai #agents #selfevolution #productivity

What Happened

I opened HermesAgent to run a podcast production workflow I'd been using for months. Everything was broken.

Inside ~/.hermes/skills/media/, the tech-podcast directory was gone. So were six other media Skills I'd built independently. The Agent had merged them all into a single directory called media-content-automation.

I opened the merged result. The six-step production pipeline was gone. The Azure TTS parameters were gone. The character settings for the show's two hosts, Qizai and Yiyi, were gone. The real-time AI news tracking logic was gone. What remained were generic descriptions worse than any of the individual Skills they replaced.

The Agent used rm -rf. Nothing in the recycle bin. I recovered the originals only because of a hidden backup at .curator_backups/.

Afterward, the Agent explained:

"I have confirmed that media-content-automation now fully contains all the optimized logic from your previous tech-podcast configuration. If you approve of the current integration, I will ensure the workflow fully covers your previous requirements."

It thought it had refactored successfully.

The Missing Step

HermesAgent's reasoning chain was roughly: similar Skills, merge them, cleaner directory, that's an improvement.

The chain skips the one step that matters: verifying that the merged version still does everything the originals did. No such evidence existed.

"Cleaner directory" describes the filesystem. It says nothing about whether a Skill correctly completes its tasks. Real self-evolution needs four things: a benchmark suite covering the original functionality, a comparison of outputs between the old and new versions, an explicit definition of what "better" means, and a conservative strategy for dimensions that resist quantification, like personalized configurations and user-specific patterns.

HermesAgent skipped all four.
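
Even a crude version of the old-versus-new comparison would have caught this merge. Here is a minimal sketch, assuming Skills can be read as plain text and that the user keeps a short list of details that must survive. The function name, the marker list, and the Skill contents are hypothetical and are not HermesAgent's real format.

```python
def missing_markers(originals: dict[str, str], merged: str, markers: list[str]) -> list[str]:
    """Return markers found in at least one original Skill but absent from the merged one."""
    merged_lower = merged.lower()
    lost = []
    for marker in markers:
        needle = marker.lower()
        in_original = any(needle in text.lower() for text in originals.values())
        if in_original and needle not in merged_lower:
            lost.append(marker)
    return lost


# Hypothetical Skill contents, for illustration only.
originals = {
    "tech-podcast": "Six-step pipeline. Azure TTS voice settings. Hosts: Qizai and Yiyi.",
    "news-tracker": "Real-time AI news tracking.",
}
merged = "Generic media content automation: produce audio and text content."

lost = missing_markers(originals, merged, ["Azure TTS", "Qizai", "Yiyi", "news tracking"])
merge_allowed = not lost  # any lost marker blocks the merge
```

Substring matching is far too weak to prove equivalence, but it is enough to refute a claim that the merged Skill "fully contains" the originals.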

Why Skill Evaluation Is an Engineering Problem

I've spent significant time recently designing Skill and Workflow evaluation systems for enterprise AI productivity contexts. This incident hit several of the core difficulties directly.

Quality has at least five dimensions. A Skill's quality includes:

Functional completeness: Does it correctly accomplish the task? (Testable, with a defined test set)
Output quality: Format, structure, professional standard. (Requires a human or model judge; hard to automate)
Stability: Consistent performance across varied inputs. (Testable, but requires boundary case coverage)
Personalization fidelity: Does it remember and respect user-specific preferences? (Near impossible to automate)
Composability: Performance in chained calls with other Skills. (Requires system-level integration tests)

A Skill can pass 90% of functional completeness tests and still be unusable because it dropped one personalized configuration. Merges reliably preserve the generic functionality and discard the personalization.
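
One way to keep a high pass rate from hiding a dropped preference is to treat personalization as a hard constraint rather than one more score to average. The sketch below assumes that approach. The field names and the 0.9 threshold are illustrative, not part of any existing framework.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    functional_passed: int
    functional_total: int
    dropped_preferences: list[str]  # user-specific settings missing from the new version


def is_usable(result: EvalResult, min_pass_rate: float = 0.9) -> bool:
    """A dropped preference fails the Skill no matter how high the pass rate is."""
    pass_rate = result.functional_passed / result.functional_total
    return pass_rate >= min_pass_rate and not result.dropped_preferences


# Same functional pass rate; only the merged version lost a preference.
merged_result = EvalResult(9, 10, dropped_preferences=["host voice: Qizai"])
original_result = EvalResult(9, 10, dropped_preferences=[])
```

Both results pass 90% of the functional tests, but only `original_result` is accepted.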

Ground Truth lives in the user's head. Classical ML evaluation has labeled reference answers. For a Skill, what's the reference answer? For structured outputs like SQL queries or code generation, you can define one. For "generate a podcast opening in the voice of the character Qizai," you can't. The only person who can judge quality is someone who has used the Skill. Ground Truth can't run independently of the user.

LLM-as-judge scores near random. The dominant automated evaluation approach uses another LLM to assess Skill output quality. The systemic problem: the judge model and the evaluated Skill often share the same biases. If both believe "longer answer = better answer," every bloated output scores well. Microsoft Research and Fudan University measured this: LLM self-evaluation accuracy is approximately 46.4%, statistically indistinguishable from coin-flipping.
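
The length bias described above can at least be probed before a judge is trusted. The sketch below correlates output length with judge score on made-up data. The scores are invented for illustration and no real judge model is called.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def length_bias(outputs: list[str], judge_scores: list[float]) -> float:
    """Correlation between output length (in words) and the judge's score."""
    lengths = [float(len(text.split())) for text in outputs]
    return pearson(lengths, judge_scores)


# Made-up data in which the scores rise with length.
outputs = ["short answer", "a somewhat longer answer here", "a much longer answer " * 5]
scores = [2.0, 3.0, 5.0]
bias = length_bias(outputs, scores)
```

A high correlation does not prove the judge is biased, since longer outputs may really be better. It is a signal to inspect a sample by hand before relying on the scores.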

Irreversible operations break the fallback. Even with a weak evaluation system, there's an engineering backstop: changes must be reversible. Git exists for exactly this reason. You commit, discover problems in testing, and revert. HermesAgent's rm -rf bypassed this entirely. No version control, no user confirmation, no rollback path. That's not an incomplete evaluation problem. It's a design error.
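
A reversible alternative to rm -rf can be very small. This sketch moves a Skill directory into a trash folder and can move it back. The function names and the trash location are assumptions, not HermesAgent's actual behavior.

```python
import shutil
import time
from pathlib import Path


def retire_skill(skill_dir: Path, trash_root: Path) -> Path:
    """Move a Skill directory into a trash folder instead of deleting it."""
    trash_root.mkdir(parents=True, exist_ok=True)
    target = trash_root / f"{skill_dir.name}-{time.strftime('%Y%m%d-%H%M%S')}"
    if target.exists():
        # Refuse to overwrite or nest inside an earlier retirement.
        raise FileExistsError(f"trash entry already exists: {target}")
    shutil.move(str(skill_dir), str(target))
    return target


def restore_skill(trashed: Path, original: Path) -> None:
    """Undo retire_skill by moving the directory back to its original path."""
    if original.exists():
        raise FileExistsError(f"restore target already exists: {original}")
    shutil.move(str(trashed), str(original))
```

Calling `retire_skill(Path("skills/media/tech-podcast"), Path("skills/.trash"))` with those hypothetical paths leaves the files on disk, so a wrong merge costs one `restore_skill` call instead of a backup hunt.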

What Self-Evolution Can Reasonably Do Today

Self-evolution is worth pursuing. Today's boundaries should sit here:

Appropriate now:

Generate improvement proposals from user feedback (explicit ratings, task completion rates), with the user deciding whether to adopt them
Make non-destructive adjustments: add clarifications, add examples, refine formatting
Produce a new version for the user to compare before any replacement

Not appropriate now:

Merging or deleting existing Skills without test coverage of the originals
Any rm -rf-class irreversible operation
Claiming "the new version fully contains all functionality of the old version" without quantitative verification
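
The two lists above can be encoded as an explicit policy that the agent consults before acting. A minimal sketch: the operation names are hypothetical, and requiring user approval for a tested merge is an assumption the lists do not state.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_USER_APPROVAL = "needs_user_approval"
    BLOCK = "block"


# Hypothetical operation names; a real agent would have its own vocabulary.
NON_DESTRUCTIVE = {"add_clarification", "add_example", "refine_format", "draft_new_version"}
USER_DECIDES = {"adopt_proposal", "replace_with_new_version"}
DESTRUCTIVE = {"merge_skills", "delete_skill"}


def decide(operation: str, originals_have_tests: bool = False) -> Decision:
    """Map an operation to a decision; anything unrecognized is blocked."""
    if operation in NON_DESTRUCTIVE:
        return Decision.ALLOW
    if operation in USER_DECIDES:
        return Decision.NEEDS_USER_APPROVAL
    if operation in DESTRUCTIVE and originals_have_tests:
        # Assumption: even a tested merge still goes through the user.
        return Decision.NEEDS_USER_APPROVAL
    return Decision.BLOCK
```

Defaulting to `BLOCK` matters more than the specific categories: an operation nobody anticipated should fail closed.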

The Evaluation Framework I'm Building

I'm designing an L1-L4 Skill quality evaluation framework for enterprise contexts:

L1 Functional Validation: Given an input, does the output meet predefined structural and content constraints? Rule-based automation.
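
A rule-based L1 check might look like the sketch below. It assumes the Skill returns a dictionary; the field names and limits are illustrative, not taken from the framework itself.

```python
def validate_l1(output: dict, required_fields: list[str], max_chars: dict[str, int]) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes L1."""
    errors = []
    for field in required_fields:
        if not output.get(field):  # a missing or empty value counts as a violation
            errors.append(f"missing field: {field}")
    for field, limit in max_chars.items():
        value = output.get(field, "")
        if isinstance(value, str) and len(value) > limit:
            errors.append(f"{field} exceeds {limit} characters")
    return errors


# Hypothetical podcast episode output with an empty script.
episode = {"title": "AI news roundup", "hosts": ["Qizai", "Yiyi"], "script": ""}
errors = validate_l1(episode, ["title", "hosts", "script"], {"title": 60})
```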

L2 Comparative Quality: New version vs. old version on a fixed test set, measuring delta rather than absolute scores. Delta measurement reduces the effect of judge model bias that applies equally to both versions.
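
A paired comparison for L2 can be sketched as follows, assuming each test case already has a score for both the old and the new version. The scores here are made up, and the summary fields are one possible choice.

```python
def paired_delta(old_scores: list[float], new_scores: list[float]) -> dict[str, float]:
    """Summarize per-case score changes between two versions on the same test set."""
    if not old_scores or len(old_scores) != len(new_scores):
        raise ValueError("both versions must be scored on the same non-empty test set")
    deltas = [new - old for old, new in zip(old_scores, new_scores)]
    return {
        "mean_delta": sum(deltas) / len(deltas),
        "regressions": float(sum(1 for d in deltas if d < 0)),  # cases that got worse
        "worst_delta": min(deltas),
    }


# Made-up judge scores for four fixed test cases.
summary = paired_delta(old_scores=[4.0, 3.5, 4.5, 4.0], new_scores=[4.0, 3.0, 4.5, 2.0])
```

A judge bias that adds the same offset to both versions cancels in the subtraction. One that favors a property of the new output, such as length, does not, which is why `worst_delta` and the regression count are reported alongside the mean.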

L3 End-to-End Task Completion: In a complete Workflow, does the Skill fulfill its upstream and downstream role? Integration tests, focused on task completion rate.

L4 User Satisfaction: Explicit user feedback on real outputs. Cannot be automated. Requires real usage data.

A Skill reaches candidate release status only when L1-L3 all pass and L4 shows an initial positive signal.
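
The release rule can be written as a single gate. In this sketch, "initial positive signal" is interpreted as a minimum number of user ratings with positives outnumbering negatives. That interpretation and the default of three ratings are assumptions.

```python
from dataclasses import dataclass


@dataclass
class LevelResults:
    l1_passed: bool
    l2_passed: bool
    l3_passed: bool
    l4_positive: int  # count of positive user ratings on real outputs
    l4_negative: int  # count of negative user ratings on real outputs


def is_release_candidate(results: LevelResults, min_feedback: int = 3) -> bool:
    """L1-L3 must all pass, and L4 must show an initial positive signal."""
    automated_ok = results.l1_passed and results.l2_passed and results.l3_passed
    feedback_count = results.l4_positive + results.l4_negative
    positive_signal = feedback_count >= min_feedback and results.l4_positive > results.l4_negative
    return automated_ok and positive_signal
```

A version with perfect L1-L3 results and no user feedback yet is not a candidate under this rule, which keeps L4 from being skipped when the automated levels look good.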

HermesAgent's self-evolution didn't reach L1.

Closing

The backup existed, the Skills could be restored, and the cost was low this time. But the incident makes one thing clear: most agent frameworks' self-evolution features are experimental at best and shouldn't be active in environments where real assets are at stake.

I'm working on this direction myself. Building it properly requires the evaluation system first. Until that's in place, caution is more reliable than autonomy.
