Built an AI tool description optimizer today that rewrites MCP tool docs and A/B tests them against real queries.
First run stats on our Cognition memory system:
- 14 tools analyzed
- 67 rewrite candidates generated
- 8 winners identified
- 5 committed to production
- 2 rejected by AI safety review
Best win: the search_memory tool description improved MRR (mean reciprocal rank) by 25.67% after the rewrite clarified when to use search vs get vs list vs ask.
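For anyone who wants the metric spelled out, here is a rough sketch of how an MRR lift like this could be computed. It assumes MRR means the mean reciprocal rank of the correct tool in the model's ranked tool choices, and that the 25.67% is a relative lift over the old description (the stats above don't say which). The get_memory name and the sample rankings are made up for illustration, not taken from the real run.

```python
from typing import List, Tuple

# One evaluated query: (tools as ranked by the model, tool that should have been picked)
Run = Tuple[List[str], str]


def reciprocal_rank(ranked_tools: List[str], expected_tool: str) -> float:
    """Return 1/rank of the expected tool, or 0.0 if it never appears."""
    for position, tool in enumerate(ranked_tools, start=1):
        if tool == expected_tool:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: List[Run]) -> float:
    """Average the reciprocal rank over all evaluated queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, expected) for ranked, expected in runs) / len(runs)


def relative_lift_percent(baseline: float, candidate: float) -> float:
    """Relative improvement of candidate over baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline MRR must be non-zero to compute a relative lift")
    return (candidate - baseline) / baseline * 100.0


# Hypothetical data: the old description ranks the right tool second on one query.
baseline_runs: List[Run] = [
    (["get_memory", "search_memory"], "search_memory"),
    (["search_memory", "get_memory"], "search_memory"),
]
# With the rewritten description the right tool is ranked first on both queries.
candidate_runs: List[Run] = [
    (["search_memory", "get_memory"], "search_memory"),
    (["search_memory", "get_memory"], "search_memory"),
]

lift = relative_lift_percent(
    mean_reciprocal_rank(baseline_runs),
    mean_reciprocal_rank(candidate_runs),
)
print(f"{lift:.2f}%")  # prints "33.33%"
```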
The AI reviewer rejected 2 candidates that scored well but dropped critical functionality details. Safety system working as designed.
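The gating idea can be sketched roughly like this: a candidate ships only if it beats the baseline score and keeps every detail the original description promised. This is not the actual reviewer, which is an AI model. The sketch swaps in a plain substring check as a stand-in, and the Candidate type, the gate function, and the required-details list are all hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    description: str  # rewritten tool description
    mrr: float        # score from the A/B run


def missing_details(description: str, required_details: List[str]) -> List[str]:
    """Return the required phrases that no longer appear in the description."""
    lowered = description.lower()
    return [detail for detail in required_details if detail.lower() not in lowered]


def gate(candidate: Candidate, baseline_mrr: float,
         required_details: List[str]) -> Tuple[bool, str]:
    """Accept only candidates that improve the score AND keep required details."""
    if candidate.mrr <= baseline_mrr:
        return False, "no MRR improvement"
    missing = missing_details(candidate.description, required_details)
    if missing:
        # High score, but functionality was dropped: reject anyway.
        return False, "dropped details: " + ", ".join(missing)
    return True, "accepted"


# Hypothetical example: the second rewrite scores higher but loses the filter detail.
required = ["semantic query", "filters"]
kept = Candidate("Search memories by semantic query; supports filters.", mrr=0.80)
lossy = Candidate("Search memories fast.", mrr=0.90)

print(gate(kept, 0.70, required))   # prints (True, 'accepted')
print(gate(lossy, 0.70, required))  # prints (False, 'dropped details: semantic query, filters')
```

The point of the second check is that the score alone never decides: a rewrite that wins on MRR still fails if it drops something the tool actually does.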
Using AI to optimize AI tool descriptions. Meta. https://publicmcp.org/cognitionmcp
Top comments (1)
This is a sharper idea than it first looks, because the tool description IS the interface the model reasons over: it's literally the instructions the agent uses to decide when to call search vs get vs list vs ask. Most people write those descriptions once, by hand, and never measure whether the model actually picks the right tool, so a 25% MRR lift just from clarifying that boundary is a huge, invisible win nobody was capturing. A/B testing descriptions against real queries treats the description as a first-class artifact with a measurable objective, which is exactly right. The detail I love most is the AI reviewer rejecting 2 candidates that scored well but dropped critical functionality. That's the verify-or-abstain pattern applied to your own optimizer, guarding against the classic optimization failure where you hill-climb the metric and silently lose correctness. Optimizing for MRR without that guard would eventually produce a description that's clear and wrong. That gate-the-optimizer instinct is exactly how I think about Moonshift. Is the reviewer checking rewrites against the original behavior spec, or is it a separate model judging "did this lose info"?