
Luren L.

I Ran 60+ Automated Tests on My AI Skills Registry — Here's What Broke

The setup

I've been building an open registry that indexes AI agent skills — think npm but for agent capabilities. The idea: crawl GitHub repos, extract skill metadata, and let agents discover tools they need at runtime.

After indexing 5,090 skills from 200+ repositories, I figured it was time to actually test whether any of this worked. I wrote 60+ automated tests covering the API surface, search quality, security headers, and data integrity.

The results were... humbling.

Auto-tagging was wrong 50% of the time

This was the biggest gut punch. I had an auto-tagger that analyzed skill descriptions and assigned category tags. Seemed smart. Seemed useful.

It tagged a PostgreSQL migration skill as robotics. A bioinformatics pipeline skill got iOS. A Redis caching skill got embedded-systems.

50% of auto-assigned tags were wrong. Not slightly-off wrong — completely unrelated domain wrong.

The root cause was pretty mundane: the tagger was matching on incidental keywords in descriptions rather than understanding what the skill actually did. A description mentioning "arm" (as in ARM architecture) triggered robotics. Mentioning "cell" triggered biology, which cascaded to iOS through some associative chain I still don't fully understand.

Lesson: Keyword-based classification on short technical text is basically a coin flip. Either invest in proper few-shot classification with domain examples, or don't auto-tag at all. Wrong tags are worse than no tags — they actively erode trust in search results.
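To make the failure mode concrete, here's a minimal sketch of a naive keyword tagger like the one described. The tag lists and function names are hypothetical, not the registry's actual tagger — the point is just how substring matching misfires:

```python
# Hypothetical naive tagger: assigns a tag if any of its keywords
# appears as a substring of the description. This is the coin-flip
# behavior described above, not the registry's real implementation.

TAG_KEYWORDS = {
    "robotics": ["arm", "servo", "actuator"],
    "biology": ["cell", "gene", "protein"],
    "databases": ["postgres", "redis", "sql"],
}

def naive_tags(description: str) -> set[str]:
    text = description.lower()
    return {
        tag
        for tag, keywords in TAG_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    }

# "ARM architecture" trips the robotics keyword list;
# "cell-level" trips biology even though the skill is about Postgres.
naive_tags("Cross-compile for ARM architecture targets")
naive_tags("Manage PostgreSQL cell-level permissions")
```

Word-boundary matching would fix the `arm` case but not the deeper problem: short technical descriptions don't carry enough signal for keyword classification.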

The resolve API: 45% perfect, 80% usable

The resolve endpoint is the core of the project — an agent describes what it needs, and the API returns matching skills. I tested it against a curated set of queries with known correct answers.

  • 45% of responses were perfect (returned exactly the right skill, top result)
  • 80% were usable (correct skill appeared somewhere in the top 5)
  • 20% returned garbage or missed entirely
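For reference, hit rates like these can be computed from a curated query set in a few lines. `CASES` and the stand-in resolver below are toy examples, not the project's actual test fixtures:

```python
# Compute top-1 ("perfect") and top-k ("usable") hit rates over a
# query set with known correct answers. `resolve` is any function
# returning a ranked list of skill ids.

def hit_rates(cases, resolve, k=5):
    perfect = usable = 0
    for query, expected in cases.items():
        ranked = resolve(query)
        if ranked and ranked[0] == expected:
            perfect += 1
        if expected in ranked[:k]:
            usable += 1
    n = len(cases)
    return perfect / n, usable / n

# Toy stand-in resolver for illustration only.
CASES = {"postgres pooling": "pgbouncer-setup", "docker deploy": "docker-deploy"}
fake = lambda q: ["docker-deploy", "pgbouncer-setup"]
hit_rates(CASES, fake)  # → (0.5, 1.0)
```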

The interesting finding: keyword matching consistently beat semantic search for this use case. When an agent asks for "postgres connection pooling," matching on "postgres" and "pool" in skill names and descriptions outperformed embedding similarity.

But keyword matching has a pollution problem. Skills from forked repos with identical names flood the results. A query for "docker-deploy" might return the same skill 3 times from 3 different forks, pushing actually-different skills off the first page.

Lesson: For structured, technical queries (which is what agents generate), keyword search with good deduplication probably beats semantic search. The AI community's instinct to embed everything isn't always right.
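A minimal sketch of what "keyword search with good deduplication" can look like — the skill records and field names here are hypothetical, assuming each skill has a name and description:

```python
# Keyword scoring with content-hash dedup: identical skills from
# forked repos collapse to one result instead of flooding the page.
import hashlib

SKILLS = [
    {"name": "docker-deploy", "desc": "Deploy containers", "repo": "a/skills"},
    {"name": "docker-deploy", "desc": "Deploy containers", "repo": "b/skills-fork"},
    {"name": "pg-pool", "desc": "Postgres connection pooling", "repo": "c/db"},
]

def content_hash(skill):
    return hashlib.sha256((skill["name"] + skill["desc"]).encode()).hexdigest()

def keyword_search(query, skills):
    terms = query.lower().split()
    seen, results = set(), []
    for s in skills:
        haystack = (s["name"] + " " + s["desc"]).lower()
        score = sum(t in haystack for t in terms)
        if score and content_hash(s) not in seen:
            seen.add(content_hash(s))
            results.append((score, s))
    return [s for _, s in sorted(results, key=lambda x: -x[0])]
```

With this, a query for "docker-deploy" returns the skill once rather than once per fork.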

Security: started at 1/7, ended at 7/7

I ran a basic security header audit against seven standard headers. On the first run, only one of these was present:

  • X-Content-Type-Options
  • Strict-Transport-Security
  • X-Frame-Options
  • Content-Security-Policy
  • Referrer-Policy
  • Permissions-Policy
  • X-XSS-Protection

1 out of 7. For a project that serves executable skill metadata to AI agents, this was not great.

The fix was straightforward — a middleware adding the missing headers took about 20 minutes. Now at 7/7. But the fact that I shipped without them, and didn't notice until automated tests caught it, is the real takeaway.

Lesson: Security header checks should be in your CI pipeline from day one, not something you add after a QA sweep. Especially for APIs that serve content agents will act on.
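The CI check itself can be tiny. Here's a sketch of the audit logic — in a real pipeline you'd feed it the headers from an HTTP response to your API root:

```python
# Audit a response's headers against the required list. Header name
# comparison is case-insensitive, per HTTP semantics.

REQUIRED = [
    "X-Content-Type-Options",
    "Strict-Transport-Security",
    "X-Frame-Options",
    "Content-Security-Policy",
    "Referrer-Policy",
    "Permissions-Policy",
    "X-XSS-Protection",
]

def missing_headers(response_headers: dict) -> list[str]:
    present = {k.lower() for k in response_headers}
    return [h for h in REQUIRED if h.lower() not in present]

# Against a stubbed 1/7 response, six headers come back missing.
missing_headers({"X-Frame-Options": "DENY"})
```

Fail the build if the returned list is non-empty, and the 1/7 state can't ship again.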

Duplicate skills are growing, not shrinking

I found the same skill appearing 2, then 3 times across different repos. The cause: GitHub forks. Someone forks a repo with 15 skills, makes one change, and now I'm indexing 15 duplicate skills from a slightly-different source.

The duplication is growing over time because forks keep happening. When I first checked: 2 copies of common skills. A week later: 3 copies. The indexer treats each repo as authoritative, so forks look like legitimate new sources.

Lesson: Any registry that crawls GitHub needs fork detection from the start. The GitHub API exposes fork relationships — use them. Deduplicate on content hash, not just name, because forks with minor changes are still essentially duplicates for discovery purposes.
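A sketch of both halves of that lesson. The repo dicts mirror the shape of GitHub's REST API repo objects (which include a boolean `fork` field and a `parent` link); the skill records are hypothetical:

```python
# Fork filtering plus content-hash dedup, as the lesson suggests.
import hashlib

def canonical_repos(repos):
    # GitHub marks forks with fork=True and links the upstream repo
    # in `parent`, so dropping forks still leaves the original indexed.
    return [r for r in repos if not r.get("fork")]

def dedupe_skills(skills):
    # Hash the skill content, not the name: forks with minor edits
    # produce different hashes, but exact copies collapse to one.
    seen, out = set(), []
    for s in skills:
        h = hashlib.sha256(s["content"].encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(s)
    return out
```

Note the trade-off: pure content hashing won't catch a fork that changed one line. Catching those "essentially duplicate" skills would need fuzzy matching on top, which this sketch doesn't attempt.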

Template placeholders in production

This one was embarrassing. Several indexed skills had descriptions like:

```
TODO: Add description here
A skill that does [THING]
Template skill - replace with your implementation
```

These were template/scaffold skills from starter repos that the indexer treated as real skills. Nobody caught them because they had valid structure — a name, a SKILL.md file, the right directory layout. They just had zero actual content.

Lesson: Validate content, not just structure. A skill with "TODO" in its description should be filtered out or flagged. This seems obvious in retrospect, but when you're focused on parsing metadata correctly, you forget to check whether the metadata is actually meaningful.
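A content validator for this doesn't need to be clever. Here's a sketch using the placeholder patterns found above — the pattern list is illustrative and would grow as more scaffold text turns up:

```python
# Flag descriptions that are scaffold text rather than real content.
import re

PLACEHOLDER_PATTERNS = [
    re.compile(r"\bTODO\b", re.IGNORECASE),
    re.compile(r"\[THING\]"),
    re.compile(r"template skill", re.IGNORECASE),
]

def is_placeholder(description: str) -> bool:
    return any(p.search(description) for p in PLACEHOLDER_PATTERNS)

is_placeholder("TODO: Add description here")   # True — filter or flag it
is_placeholder("Manages Postgres migrations")  # False — real content
```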

Search returns 0 results. Resolve works fine.

This was the weirdest bug. The /search endpoint — meant for humans browsing the registry — returned 0 results for queries like "kubernetes deployment" or "database migration." Meanwhile, the /resolve endpoint — meant for agents — found relevant skills instantly for equivalent queries.

The cause: search used full-text search against a subset of fields (name + short description), while resolve searched against all fields including README content, tags, and examples. The skills that matched were rich in their full metadata but had terse names and descriptions.

Example: a skill named k8s-deploy with description "Manages deployments" would never match a search for "kubernetes deployment." But resolve would find it through README content mentioning "Kubernetes deployment orchestration."

Lesson: If your data has inconsistent metadata richness, your search needs to account for that. Either enforce richer required fields, or search across everything. Having two endpoints with different search scopes is a bug, not a feature.
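A sketch of the "search across everything" fix, using hypothetical field names. Concatenating all metadata into one searchable text is the simplest way to give both endpoints the same scope:

```python
# Build one searchable text per skill from every metadata field, so a
# terse name/description can still match via README content or tags.

def searchable_text(skill: dict) -> str:
    parts = [
        skill.get("name", ""),
        skill.get("description", ""),
        skill.get("readme", ""),
        " ".join(skill.get("tags", [])),
    ]
    return " ".join(parts).lower()

def search(query: str, skills: list[dict]) -> list[dict]:
    terms = query.lower().split()
    return [s for s in skills if all(t in searchable_text(s) for t in terms)]

# The k8s-deploy example from above: terse description, rich README.
k8s = {"name": "k8s-deploy", "description": "Manages deployments",
       "readme": "Kubernetes deployment orchestration"}
search("kubernetes deployment", [k8s])  # now matches via the README
```

A production version would weight fields (name matches should rank above README matches), but the scope should be identical everywhere.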

What I'd do differently

  1. Write the tests before indexing. I built the crawler, indexed 5k skills, then tested. Should have had quality gates before anything entered the registry.

  2. Fork detection on day one. The duplicate problem compounds daily and is harder to fix retroactively because people may already reference the duplicate entries.

  3. No auto-tagging without a validation set. I should have manually tagged 100 skills first and measured accuracy before deploying the auto-tagger to 5,000.

  4. Security headers in the project template. Not as an afterthought.

  5. One search implementation, not two. The search/resolve split made sense architecturally but created a confusing quality gap.

Numbers summary

| Metric | Result |
| --- | --- |
| Skills indexed | 5,090 |
| Source repos | 200+ |
| Auto-tag accuracy | ~50% |
| Resolve: perfect match | 45% |
| Resolve: usable (top-5) | 80% |
| Security headers (before) | 1/7 |
| Security headers (after) | 7/7 |
| Duplicate skill copies | 2→3 and growing |
| Template placeholders found | Multiple |
| Search zero-result rate | High for reasonable queries |

Wrapping up

The project is skillshub.wtf if you want to poke at it. It's open source and clearly still has rough edges.

The meta-lesson from all of this: building a registry is easy; building a trustworthy registry is hard. The indexing, API, and infrastructure were the fun parts. Data quality, deduplication, and search relevance are where the actual work lives — and where I underinvested.

If you're building anything that aggregates open-source metadata at scale, write your quality tests first. Your crawler will happily ingest garbage with perfect formatting, and you won't notice until someone searches for "kubernetes" and gets zero results.


All findings are from real QA runs against a real system. Nothing was cherry-picked to look worse (or better) than it is.
