Luren L.

Posted on • Originally published at skillshub.wtf

I Built a Skill Resolver for AI Agents - What I Learned About Token Economics

The Problem Nobody Talks About

AI coding agents (Claude Code, Codex, Cursor) have a dirty secret: they waste ~50,000 tokens every time they need a new skill.

The flow looks like this:

  1. Agent gets a task it doesn't know how to do
  2. Searches GitHub for SKILL.md files
  3. Reads 5-10 of them to compare
  4. Picks one
  5. Finally starts the actual work

Steps 2-4 burn through context window like jet fuel. And here's the kicker — the agent does this every session, because it doesn't remember what it found last time.

The Math That Changed My Mind

I measured it across 50 sessions:

| Approach | Tokens for skill discovery | Tokens for actual task |
|---|---|---|
| Manual search | ~50,000 | ~5,000 |
| Pre-loaded (20 skills) | 0 (pre-loaded) | ~5,000 + ~40,000 context tax |
| Resolver | ~200 | ~5,000 |

The resolver approach is 250x more efficient than manual search and doesn't waste context on skills you might never use.

What I Built

SkillsHub — an open-source skill registry with a resolve endpoint that matches natural language task descriptions to the best skill:

curl 'https://skillshub.wtf/api/v1/skills/resolve?task=terraform+modules+testing'

Response:

{
  "data": [{
    "skill": {
      "name": "terraform-skill",
      "description": "Terraform and OpenTofu — modules, tests, CI/CD..."
    },
    "confidence": 0.92,
    "fetchUrl": "https://skillshub.wtf/antonbabenko/terraform-skill/terraform-skill?format=md"
  }],
  "tokenWeights": { "terraform": 3.0, "modules": 1.5, "testing": 1.2 }
}

One call. Best skill. Confidence score. Direct URL to fetch.
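For agents written in TypeScript, the same call can be made programmatically. This is a minimal sketch: the URL shape comes from the curl example above, but the `ResolveMatch` type is my guess at the response schema based on the sample JSON.

```typescript
// Shape of one match, inferred from the sample response above (assumption).
interface ResolveMatch {
  skill: { name: string; description: string };
  confidence: number;
  fetchUrl: string;
}

// Build the resolve URL for a natural-language task description.
function resolveUrl(task: string): string {
  const base = "https://skillshub.wtf/api/v1/skills/resolve";
  return `${base}?task=${encodeURIComponent(task)}`;
}

// Usage (requires network access):
// const res = await fetch(resolveUrl("terraform modules testing"));
// const { data } = (await res.json()) as { data: ResolveMatch[] };
// console.log(data[0]?.skill.name, data[0]?.confidence);
```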

The Ranking Algorithm

The resolver uses a multi-signal scoring system (no LLM inference required):

1. Text Relevance (0-70 points)

TF-IDF-inspired token weighting. Rare terms score higher:

weight = log2(totalMatched / tokenMatchCount) + 1

"terraform" appears in 12 out of 5,000 skills → weight 3x.
"testing" appears in 400 → weight 1.2x.

This means domain-specific terms dominate the ranking naturally.
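A minimal TypeScript sketch of that weighting, implementing exactly the formula above (variable names follow the formula; how `totalMatched` is computed per query is not shown in the post):

```typescript
// IDF-style weight: rarer tokens (small tokenMatchCount) score higher.
// A token present in every matched skill bottoms out at weight 1.
function idfWeight(totalMatched: number, tokenMatchCount: number): number {
  return Math.log2(totalMatched / tokenMatchCount) + 1;
}
```

With 400 matched skills, a token found in only 100 of them gets weight 3, while a token found in all 400 gets the floor weight of 1.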

2. Phrase Matching

"infrastructure as code" is treated as one concept, not three separate words. 25+ phrase mappings prevent "code signing" from matching "infrastructure as code."
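A sketch of how that phrase collapsing might work: multi-word concepts are folded into single tokens before scoring, so "infrastructure as code" can never partially match an unrelated skill through the word "code". The phrase table entries here are illustrative, not the registry's actual 25+ mappings.

```typescript
// Illustrative phrase table (the real registry has 25+ mappings).
const PHRASES: Record<string, string> = {
  "infrastructure as code": "iac",
  "code signing": "code-signing",
};

// Collapse known phrases into single tokens, then split on whitespace.
function tokenize(query: string): string[] {
  let normalized = query.toLowerCase();
  for (const [phrase, token] of Object.entries(PHRASES)) {
    normalized = normalized.split(phrase).join(token);
  }
  return normalized.split(/\s+/).filter(Boolean);
}
```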

3. Quality Score (0-20 points)

  • Readme depth (log₂ of character count)
  • Tag completeness
  • Description quality

4. Popularity (0-10 points)

  • GitHub stars (log₁₀ scaled)
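The quality and popularity signals can be sketched like this. The log scaling and the 0-20 / 0-10 point ranges come from the post; the individual caps and sub-weights inside are my guesses:

```typescript
// Quality: README depth + tag completeness + description (0-20 points).
// Internal caps (10 / 5 / 5) are assumptions, not the published values.
function qualityScore(readmeChars: number, tagCount: number, hasDescription: boolean): number {
  const readmeDepth = Math.min(10, Math.log2(Math.max(1, readmeChars)));
  const tags = Math.min(5, tagCount);
  const desc = hasDescription ? 5 : 0;
  return Math.min(20, readmeDepth + tags + desc);
}

// Popularity: GitHub stars, log10-scaled into 0-10 points.
function popularityScore(stars: number): number {
  return Math.min(10, 2 * Math.log10(1 + stars));
}
```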

5. Domain Conflict Penalties

If the query says "nodejs" but the skill is for Django → -15 points. Prevents cross-domain false positives.
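A sketch of that penalty check, assuming query tokens and skill tags are first mapped to a coarse domain. The -15 value is from the post; the domain table and detection logic are illustrative:

```typescript
// Illustrative mapping from framework/runtime tokens to coarse domains.
const DOMAINS: Record<string, string> = {
  nodejs: "javascript", express: "javascript",
  django: "python", flask: "python",
};

// Penalize only when both sides declare a domain and they share none.
function domainPenalty(queryTokens: string[], skillTags: string[]): number {
  const queryDomains = new Set(queryTokens.map(t => DOMAINS[t]).filter(Boolean));
  const skillDomains = new Set(skillTags.map(t => DOMAINS[t]).filter(Boolean));
  if (queryDomains.size === 0 || skillDomains.size === 0) return 0;
  for (const d of skillDomains) if (queryDomains.has(d)) return 0;
  return -15;
}
```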

What I Learned

1. TF-IDF beats embeddings for structured matching

At 5,000 skills, keyword matching with IDF weighting runs in ~300ms and achieves 7.75/10 accuracy on our benchmark. No GPU needed, no embedding model to maintain.

Embeddings would be better for semantic matching ("infrastructure as code" ≈ "IaC tooling") but the added complexity and latency aren't worth it at this scale.

2. Tag quality is make-or-break

We imported 5,000+ skills from 200+ GitHub repos. Auto-tagging from descriptions created massive pollution — an email composer skill got tagged "docker" because the description mentioned Docker deployment. We had to build 3 layers of tag validation to fix it.

3. The feedback loop is the hardest part

Our initial confidence scores measured semantic match — how well the words align. But "sounds right" ≠ "actually works." We shipped a feedback endpoint so agents can report whether a skill actually helped:

POST /api/v1/skills/{id}/feedback
{"task": "terraform testing", "helpful": true}

The real ranking will come from execution outcomes, not word matching.
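A hypothetical TypeScript helper for that call — the endpoint path and body shape come from the example above; the helper itself (and the idea of an API-key header for this write) are illustrative:

```typescript
// Build a fetch()-ready request for the feedback endpoint.
// skillId and the header name are assumptions for illustration.
function feedbackRequest(skillId: string, task: string, helpful: boolean) {
  return {
    url: `https://skillshub.wtf/api/v1/skills/${skillId}/feedback`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ task, helpful }),
    },
  };
}

// Usage (requires network): const { url, init } = feedbackRequest(...); await fetch(url, init);
```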

4. Agents need methodology, not knowledge

The most useful SKILL.md files aren't encyclopedias. They're playbooks:

  • When to use this vs that
  • Decision tree for common scenarios
  • What breaks and how to fix it
  • When to stop and ask for help

Trail of Bits' 61 security skills are the gold standard. Each one reads like an engineering runbook.

The Stack

| Layer | Technology |
|---|---|
| API | Next.js 16 API Routes |
| Database | PostgreSQL + Drizzle ORM |
| Scoring | TF-IDF + phrase matching (pure TypeScript) |
| Deploy | Vercel |
| Auth | API keys (skh_...) for writes, none for reads |

Try It

# Get the API guide
curl https://skillshub.wtf/api/v1

# Resolve a skill
curl 'https://skillshub.wtf/api/v1/skills/resolve?task=react+testing+best+practices'

# Fetch the skill markdown
curl 'https://skillshub.wtf/trailofbits/skills/modern-python?format=md'

No auth needed for reading. GitHub repo (MIT license).

5,000+ skills from Microsoft, OpenAI, Trail of Bits, HashiCorp, Sentry, Snyk, and 200+ more repos.

What's your experience with AI agent skill discovery? Do your agents pre-load skills or resolve on demand? I'd love to hear about different approaches in the comments.
