## The Problem Nobody Talks About
AI coding agents (Claude Code, Codex, Cursor) have a dirty secret: they waste ~50,000 tokens every time they need a new skill.
The flow looks like this:
1. Agent gets a task it doesn't know how to do
2. Searches GitHub for SKILL.md files
3. Reads 5-10 of them to compare
4. Picks one
5. Finally starts the actual work
Steps 2-4 burn through the context window like jet fuel. And here's the kicker: the agent repeats this every session, because it doesn't remember what it found last time.
## The Math That Changed My Mind
I measured it across 50 sessions:
| Approach | Tokens for skill discovery | Tokens for actual task |
|---|---|---|
| Manual search | ~50,000 | ~5,000 |
| Pre-loaded (20 skills) | 0 (pre-loaded) | ~5,000 + 40,000 context tax |
| Resolver | ~200 | ~5,000 |
The resolver approach is 250x more efficient than manual search and doesn't waste context on skills you might never use.
## What I Built
SkillsHub — an open-source skill registry with a resolve endpoint that matches natural language task descriptions to the best skill:
```shell
curl 'https://skillshub.wtf/api/v1/skills/resolve?task=terraform+modules+testing'
```
Response:

```json
{
  "data": [{
    "skill": {
      "name": "terraform-skill",
      "description": "Terraform and OpenTofu — modules, tests, CI/CD..."
    },
    "confidence": 0.92,
    "fetchUrl": "https://skillshub.wtf/antonbabenko/terraform-skill/terraform-skill?format=md"
  }],
  "tokenWeights": { "terraform": 3.0, "modules": 1.5, "testing": 1.2 }
}
```
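A consumer-side sketch of how an agent might use that response. The response shape mirrors the sample JSON above; the `pickSkill` helper and the 0.8 confidence threshold are my own assumptions, not part of the SkillsHub API:

```typescript
// Shape of one entry in the resolve response's "data" array (from the sample above).
interface ResolveResult {
  skill: { name: string; description: string };
  confidence: number;
  fetchUrl: string;
}

// Take the top-ranked result only if it clears a confidence bar;
// below the bar, the agent should fall back to asking or searching.
// (The 0.8 default is illustrative, not an API contract.)
function pickSkill(results: ResolveResult[], minConfidence = 0.8): ResolveResult | null {
  const top = results[0]; // results are assumed sorted best-first
  return top && top.confidence >= minConfidence ? top : null;
}

const sample: ResolveResult[] = [{
  skill: { name: "terraform-skill", description: "Terraform and OpenTofu..." },
  confidence: 0.92,
  fetchUrl: "https://skillshub.wtf/antonbabenko/terraform-skill/terraform-skill?format=md",
}];
console.log(pickSkill(sample)?.fetchUrl);
```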
One call. Best skill. Confidence score. Direct URL to fetch.
## The Ranking Algorithm
The resolver uses a multi-signal scoring system (no LLM inference required):
### 1. Text Relevance (0-70 points)

TF-IDF-inspired token weighting. Rare terms score higher:

`weight = log2(totalMatched / tokenMatchCount) + 1`
"terraform" appears in 12 out of 5,000 skills → weight 3x.
"testing" appears in 400 → weight 1.2x.
This means domain-specific terms dominate the ranking naturally.
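The weighting formula above can be sketched in TypeScript. The function name and example numbers are mine, and the exact semantics of `totalMatched` in SkillsHub may differ:

```typescript
// TF-IDF-style inverse-frequency weight: the fewer skills a token matches,
// the more it counts toward the relevance score.
function idfWeight(totalMatched: number, tokenMatchCount: number): number {
  if (tokenMatchCount === 0) return 0; // token matched nothing: no contribution
  return Math.log2(totalMatched / tokenMatchCount) + 1;
}

// A rare token (12 matches out of 48 candidates) gets a high weight...
console.log(idfWeight(48, 12)); // log2(4) + 1 = 3
// ...while a near-ubiquitous token stays close to the 1.0 baseline.
console.log(idfWeight(48, 44));
```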
### 2. Phrase Matching
"infrastructure as code" is treated as one concept, not three separate words. 25+ phrase mappings prevent "code signing" from matching "infrastructure as code."
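One way to implement this kind of phrase collapsing, as a hypothetical sketch: known multi-word concepts are folded into single tokens before scoring, so their component words can't match independently. The phrase table here is illustrative, not SkillsHub's actual mapping:

```typescript
// Map known multi-word phrases to a single atomic token.
const PHRASES: Record<string, string> = {
  "infrastructure as code": "infrastructure-as-code",
  "code signing": "code-signing",
  "continuous integration": "continuous-integration",
};

// Collapse phrases first, then split on whitespace.
function tokenize(query: string): string[] {
  let q = query.toLowerCase();
  for (const [phrase, token] of Object.entries(PHRASES)) {
    q = q.split(phrase).join(token); // replace every occurrence
  }
  return q.split(/\s+/).filter(Boolean);
}

console.log(tokenize("infrastructure as code testing"));
// → ["infrastructure-as-code", "testing"]
```

After collapsing, a "code signing" query yields the token `code-signing`, which shares nothing with `infrastructure-as-code` — the cross-match the post mentions can't happen.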
### 3. Quality Score (0-20 points)
- Readme depth (log₂ of character count)
- Tag completeness
- Description quality
### 4. Popularity (0-10 points)
- GitHub stars (log₁₀ scaled)
### 5. Domain Conflict Penalties
If the query says "nodejs" but the skill is for Django → -15 points. Prevents cross-domain false positives.
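Signals 3-5 can be sketched together. The point budgets (0-20 quality, 0-10 popularity, flat -15 conflict) come from the post; the inner formulas, caps, and names are my assumptions:

```typescript
interface Skill {
  readmeChars: number;
  tags: string[];
  stars: number;
  domain?: string; // e.g. "nodejs", "django"
}

// Quality: log2 of README length plus a tag-completeness bonus, capped at 20.
function qualityScore(s: Skill): number {
  const readmeDepth = Math.min(10, Math.log2(Math.max(1, s.readmeChars)));
  const tagCompleteness = Math.min(10, s.tags.length * 2);
  return Math.min(20, readmeDepth + tagCompleteness);
}

// Popularity: log10-scaled stars, capped at 10, so 10k stars doesn't drown relevance.
function popularityScore(s: Skill): number {
  return Math.min(10, Math.log10(1 + s.stars) * 2.5);
}

// Conflict: query targets one ecosystem, skill targets another → flat -15.
function domainPenalty(queryDomain: string | undefined, s: Skill): number {
  if (queryDomain && s.domain && queryDomain !== s.domain) return -15;
  return 0;
}
```

Log scaling on both README length and stars keeps any single metadata signal from swamping text relevance, which holds the bulk of the point budget.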
## What I Learned
### 1. TF-IDF beats embeddings for structured matching
At 5,000 skills, keyword matching with IDF weighting runs in ~300ms and achieves 7.75/10 accuracy on our benchmark. No GPU needed, no embedding model to maintain.
Embeddings would be better for semantic matching ("infrastructure as code" ≈ "IaC tooling") but the added complexity and latency aren't worth it at this scale.
### 2. Tag quality is make-or-break
We imported 5,000+ skills from 200+ GitHub repos. Auto-tagging from descriptions created massive pollution — an email composer skill got tagged "docker" because the description mentioned Docker deployment. We had to build 3 layers of tag validation to fix it.
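The post doesn't spell out what the three validation layers are, but one plausible layer looks like this (entirely my guess, not SkillsHub's actual pipeline): only keep an auto-extracted tag if it appears somewhere authoritative, such as the skill's name or its explicit tag list, rather than as an incidental mention in the description body:

```typescript
// Hypothetical validation layer: reject tags that only appear as
// incidental mentions in free-text descriptions.
function validateTag(
  tag: string,
  skill: { name: string; tags: string[]; description: string }
): boolean {
  const t = tag.toLowerCase();
  return (
    skill.name.toLowerCase().includes(t) ||
    skill.tags.map((x) => x.toLowerCase()).includes(t)
  );
}

// "docker" mentioned only in an email-composer skill's description is dropped:
console.log(
  validateTag("docker", {
    name: "email-composer",
    tags: ["email"],
    description: "Deploys via Docker",
  })
); // → false
```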
### 3. The feedback loop is the hardest part
Our initial confidence scores measured semantic match — how well the words align. But "sounds right" ≠ "actually works." We shipped a feedback endpoint so agents can report whether a skill actually helped:
```
POST /api/v1/skills/{id}/feedback
{"task": "terraform testing", "helpful": true}
```
The real ranking will come from execution outcomes, not word matching.
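One way an outcome-based signal could fold into ranking, as a sketch under my own assumptions (the post doesn't describe SkillsHub's actual aggregation):

```typescript
// Laplace-smoothed success rate from helpful/unhelpful feedback counts.
// The +1/+2 smoothing keeps one lucky report from pinning a skill at 100%
// and gives unrated skills a neutral 0.5 prior.
function successRate(helpful: number, unhelpful: number): number {
  return (helpful + 1) / (helpful + unhelpful + 2);
}

console.log(successRate(0, 0));  // unrated skill sits at 0.5
console.log(successRate(1, 0));  // one good report: ~0.667, not 1.0
console.log(successRate(40, 10)); // well-rated skill: ~0.79
```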
### 4. Agents need methodology, not knowledge
The most useful SKILL.md files aren't encyclopedias. They're playbooks:
- When to use this vs that
- Decision tree for common scenarios
- What breaks and how to fix it
- When to stop and ask for help
Trail of Bits' 61 security skills are the gold standard. Each one reads like an engineering runbook.
## The Stack
| Layer | Technology |
|---|---|
| API | Next.js 16 API Routes |
| Database | PostgreSQL + Drizzle ORM |
| Scoring | TF-IDF + phrase matching (pure TypeScript) |
| Deploy | Vercel |
| Auth | API keys (skh_...) for writes, none for reads |
## Try It
```shell
# Get the API guide
curl https://skillshub.wtf/api/v1

# Resolve a skill
curl 'https://skillshub.wtf/api/v1/skills/resolve?task=react+testing+best+practices'

# Fetch the skill markdown
curl 'https://skillshub.wtf/trailofbits/skills/modern-python?format=md'
```
No auth needed for reading. GitHub repo (MIT license).
5,000+ skills from Microsoft, OpenAI, Trail of Bits, HashiCorp, Sentry, Snyk, and 200+ more repos.
What's your experience with AI agent skill discovery? Do your agents pre-load skills or resolve on demand? I'd love to hear about different approaches in the comments.