Luren L.

Posted on • Originally published at skillshub.wtf

I Built a Skill Resolver for AI Agents - What I Learned About Token Economics

The Problem Nobody Talks About

AI coding agents (Claude Code, Codex, Cursor) have a dirty secret: they waste ~50,000 tokens every time they need a new skill.

The flow looks like this:

  1. Agent gets a task it doesn't know how to do
  2. Searches GitHub for SKILL.md files
  3. Reads 5-10 of them to compare
  4. Picks one
  5. Finally starts the actual work

Steps 2-4 burn through context window like jet fuel. And here's the kicker — the agent does this every session, because it doesn't remember what it found last time.

The Math That Changed My Mind

I measured it across 50 sessions:

| Approach | Tokens for skill discovery | Tokens for actual task |
|---|---|---|
| Manual search | ~50,000 | ~5,000 |
| Pre-loaded (20 skills) | 0 (pre-loaded) | ~5,000 + ~40,000 context tax |
| Resolver | ~200 | ~5,000 |

The resolver approach is 250x more efficient than manual search and doesn't waste context on skills you might never use.

What I Built

SkillsHub — an open-source skill registry with a resolve endpoint that matches natural language task descriptions to the best skill:

curl 'https://skillshub.wtf/api/v1/skills/resolve?task=terraform+modules+testing'

Response:

{
  "data": [{
    "skill": {
      "name": "terraform-skill",
      "description": "Terraform and OpenTofu — modules, tests, CI/CD..."
    },
    "confidence": 0.92,
    "fetchUrl": "https://skillshub.wtf/antonbabenko/terraform-skill/terraform-skill?format=md"
  }],
  "tokenWeights": { "terraform": 3.0, "modules": 1.5, "testing": 1.2 }
}

One call. Best skill. Confidence score. Direct URL to fetch.
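For agents written in TypeScript, the same call can be made programmatically. This is a minimal sketch: the URL shape comes from the curl example above, but the `ResolveMatch` type is my guess at the response schema based on the sample JSON.

```typescript
// Shape of one match, inferred from the sample response above (assumption).
interface ResolveMatch {
  skill: { name: string; description: string };
  confidence: number;
  fetchUrl: string;
}

// Build the resolve URL for a natural-language task description.
function resolveUrl(task: string): string {
  const base = "https://skillshub.wtf/api/v1/skills/resolve";
  return `${base}?task=${encodeURIComponent(task)}`;
}

// Usage (requires network access):
// const res = await fetch(resolveUrl("terraform modules testing"));
// const { data } = (await res.json()) as { data: ResolveMatch[] };
// console.log(data[0]?.skill.name, data[0]?.confidence);
```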

The Ranking Algorithm

The resolver uses a multi-signal scoring system (no LLM inference required):

1. Text Relevance (0-70 points)

TF-IDF-inspired token weighting. Rare terms score higher:

weight = log2(totalMatched / tokenMatchCount) + 1

"terraform" appears in 12 out of 5,000 skills → weight 3x.
"testing" appears in 400 → weight 1.2x.

This means domain-specific terms dominate the ranking naturally.
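A minimal TypeScript sketch of that weighting, implementing exactly the formula above (variable names follow the formula; how `totalMatched` is computed per query is not shown in the post):

```typescript
// IDF-style weight: rarer tokens (small tokenMatchCount) score higher.
// A token present in every matched skill bottoms out at weight 1.
function idfWeight(totalMatched: number, tokenMatchCount: number): number {
  return Math.log2(totalMatched / tokenMatchCount) + 1;
}
```

With 400 matched skills, a token found in only 100 of them gets weight 3, while a token found in all 400 gets the floor weight of 1.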

2. Phrase Matching

"infrastructure as code" is treated as one concept, not three separate words. 25+ phrase mappings prevent "code signing" from matching "infrastructure as code."
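A sketch of how that phrase collapsing might work: multi-word concepts are folded into single tokens before scoring, so "infrastructure as code" can never partially match an unrelated skill through the word "code". The phrase table entries here are illustrative, not the registry's actual 25+ mappings.

```typescript
// Illustrative phrase table (the real registry has 25+ mappings).
const PHRASES: Record<string, string> = {
  "infrastructure as code": "iac",
  "code signing": "code-signing",
};

// Collapse known phrases into single tokens, then split on whitespace.
function tokenize(query: string): string[] {
  let normalized = query.toLowerCase();
  for (const [phrase, token] of Object.entries(PHRASES)) {
    normalized = normalized.split(phrase).join(token);
  }
  return normalized.split(/\s+/).filter(Boolean);
}
```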

3. Quality Score (0-20 points)

  • Readme depth (log₂ of character count)
  • Tag completeness
  • Description quality

4. Popularity (0-10 points)

  • GitHub stars (log₁₀ scaled)
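The quality and popularity signals can be sketched like this. The log scaling and the 0-20 / 0-10 point ranges come from the post; the individual caps and sub-weights inside are my guesses:

```typescript
// Quality: README depth + tag completeness + description (0-20 points).
// Internal caps (10 / 5 / 5) are assumptions, not the published values.
function qualityScore(readmeChars: number, tagCount: number, hasDescription: boolean): number {
  const readmeDepth = Math.min(10, Math.log2(Math.max(1, readmeChars)));
  const tags = Math.min(5, tagCount);
  const desc = hasDescription ? 5 : 0;
  return Math.min(20, readmeDepth + tags + desc);
}

// Popularity: GitHub stars, log10-scaled into 0-10 points.
function popularityScore(stars: number): number {
  return Math.min(10, 2 * Math.log10(1 + stars));
}
```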

5. Domain Conflict Penalties

If the query says "nodejs" but the skill is for Django → -15 points. Prevents cross-domain false positives.
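A sketch of that penalty check, assuming query tokens and skill tags are first mapped to a coarse domain. The -15 value is from the post; the domain table and detection logic are illustrative:

```typescript
// Illustrative mapping from framework/runtime tokens to coarse domains.
const DOMAINS: Record<string, string> = {
  nodejs: "javascript", express: "javascript",
  django: "python", flask: "python",
};

// Penalize only when both sides declare a domain and they share none.
function domainPenalty(queryTokens: string[], skillTags: string[]): number {
  const queryDomains = new Set(queryTokens.map(t => DOMAINS[t]).filter(Boolean));
  const skillDomains = new Set(skillTags.map(t => DOMAINS[t]).filter(Boolean));
  if (queryDomains.size === 0 || skillDomains.size === 0) return 0;
  for (const d of skillDomains) if (queryDomains.has(d)) return 0;
  return -15;
}
```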

What I Learned

1. TF-IDF beats embeddings for structured matching

At 5,000 skills, keyword matching with IDF weighting runs in ~300ms and achieves 7.75/10 accuracy on our benchmark. No GPU needed, no embedding model to maintain.

Embeddings would be better for semantic matching ("infrastructure as code" ≈ "IaC tooling") but the added complexity and latency aren't worth it at this scale.

2. Tag quality is make-or-break

We imported 5,000+ skills from 200+ GitHub repos. Auto-tagging from descriptions created massive pollution — an email composer skill got tagged "docker" because the description mentioned Docker deployment. We had to build 3 layers of tag validation to fix it.

3. The feedback loop is the hardest part

Our initial confidence scores measured semantic match — how well the words align. But "sounds right" ≠ "actually works." We shipped a feedback endpoint so agents can report whether a skill actually helped:

POST /api/v1/skills/{id}/feedback
{"task": "terraform testing", "helpful": true}

The real ranking will come from execution outcomes, not word matching.
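A hypothetical TypeScript helper for that call — the endpoint path and body shape come from the example above; the helper itself (and the idea of an API-key header for this write) are illustrative:

```typescript
// Build a fetch()-ready request for the feedback endpoint.
// skillId and the header name are assumptions for illustration.
function feedbackRequest(skillId: string, task: string, helpful: boolean) {
  return {
    url: `https://skillshub.wtf/api/v1/skills/${skillId}/feedback`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ task, helpful }),
    },
  };
}

// Usage (requires network): const { url, init } = feedbackRequest(...); await fetch(url, init);
```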

4. Agents need methodology, not knowledge

The most useful SKILL.md files aren't encyclopedias. They're playbooks:

  • When to use this vs that
  • Decision tree for common scenarios
  • What breaks and how to fix it
  • When to stop and ask for help

Trail of Bits' 61 security skills are the gold standard. Each one reads like an engineering runbook.

The Stack

| Layer | Technology |
|---|---|
| API | Next.js 16 API Routes |
| Database | PostgreSQL + Drizzle ORM |
| Scoring | TF-IDF + phrase matching (pure TypeScript) |
| Deploy | Vercel |
| Auth | API keys (skh_...) for writes, none for reads |

Try It

# Get the API guide
curl https://skillshub.wtf/api/v1

# Resolve a skill
curl 'https://skillshub.wtf/api/v1/skills/resolve?task=react+testing+best+practices'

# Fetch the skill markdown
curl 'https://skillshub.wtf/trailofbits/skills/modern-python?format=md'

No auth needed for reading. GitHub repo (MIT license).

5,000+ skills from Microsoft, OpenAI, Trail of Bits, HashiCorp, Sentry, Snyk, and 200+ more repos.

What's your experience with AI agent skill discovery? Do your agents pre-load skills or resolve on demand? I'd love to hear about different approaches in the comments.
