
Adam Lankamer

Originally published at claudskills.com

How I indexed 69,000 Claude Code skills (and what I learned doing it)


One month ago I started building an open catalog of Claude Code skills. Yesterday it crossed 69,369 indexed SKILL.md files. This post is the engineering story — what I built, what surprised me, and what's free for anyone to use.

If you've never written a Claude Code skill: it's a Markdown file with YAML frontmatter that gives Anthropic's Claude Code agent specialized behavior. Drop it in ~/.claude/skills/<name>/SKILL.md and Claude can invoke it as a slash command. Think of it like a Vim plugin or a VSCode extension, except the contract is "instructions in English" rather than "code in Lua / TypeScript."
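To make the file shape concrete, here is a minimal sketch of a SKILL.md and a stdlib-only splitter for its frontmatter. The skill name, description, and body are invented for illustration, and the parser only handles flat `key: value` lines; a real loader would use a full YAML parser.

```python
# A made-up skill, used only to show the layout: YAML frontmatter
# between two "---" lines, then a Markdown body.
SAMPLE_SKILL = """\
---
name: changelog-writer
description: Drafts a changelog entry from the staged git diff.
---
# Changelog writer

Summarize the staged changes as a single changelog entry.
"""


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md into (frontmatter fields, Markdown body).

    Handles only flat `key: value` pairs, not nested YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at all
    try:
        end = lines.index("---", 1)  # position of the closing delimiter
    except ValueError:
        return {}, text  # unterminated frontmatter: treat as plain body
    fields: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return fields, body


fields, body = split_frontmatter(SAMPLE_SKILL)
print(fields["name"])
```

Only `name` and `description` appear here because those are the two fields the post treats as the baseline; anything beyond them is optional.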

The format is brand-new. The official spec doesn't ship a catalog. The awesome-* lists I could find at the time covered maybe 300 hand-picked entries. Meanwhile, GitHub's code search showed thousands of public repos with SKILL.md files in them. The long tail of the ecosystem was completely invisible. That's the gap I set out to close.

The shape of the problem

Here's what I knew going in:

  1. Discovery was broken. A skill author would push their SKILL.md to GitHub and ... nothing. No directory, no aggregator, no search surface. The only way another developer found it was Twitter, Discord, or stumbling onto the repo.

  2. Quality varied wildly. Some skills were 200-line operator-grade tools with pricing tables, anti-trigger sections, and structured examples. Others were 4-line stubs that read like "TODO: write a skill that does X." Both were indexable, neither was distinguishable from outside.

  3. The format itself was changing fast. The frontmatter spec gained fields monthly — allowed-tools, user-invokable, model, metadata.api_base. Yesterday's "good" SKILL.md could be tomorrow's missing-required-field.

  4. There was no good API surface. If you wanted to build something on top of the skill ecosystem (a tool for evaluating skills, a recommender, an installer), you had to scrape GitHub yourself.

I wanted a catalog that fixed all four. Open data, daily refresh, free API, free dataset. No pay-to-list, no listing fees, no ranking-for-money. The only paid product would be an evaluation layer for end-users (a quality score in the desktop app), never anything skill authors had to opt into. Anti-rent-seeking by construction.

The miner — 24 sources, every night

The catalog is built by a single Python script that runs on a Mac mini in my office at 01:00 local. It crawls 24 public sources looking for SKILL.md files:

| Source | What it discovers |
| --- | --- |
| GitHub code search (filename:SKILL.md) | The bulk of the catalog: 101 query variants covering language hints, frontmatter fields, and date-bounded slices to defeat the 1000-result hard cap |
| GitHub Topics (topic:claude-code-skills) + 31 variants | Topic-tagged repos |
| GitHub Gists | Single-file skills posted as gists (most catalogs miss these) |
| Awesome-list READMEs (32 lists) | Anything the existing curators picked |
| GitLab, Codeberg | Skills outside GitHub |
| HuggingFace | Skills uploaded as datasets |
| Reddit, HackerNews Algolia, Bluesky, Mastodon, dev.to, YouTube, Telegram | Mentions in posts/comments; text-blob scan for repo URLs |
| Wayback Machine CDX API | Renamed / deleted repos still discoverable via archive.org |
| Stargazer graph mining | Once we find one good skill repo, mine who starred it; they often have skills too |
| Author repo enumeration | When we admit one of an author's skills, scan their other repos |
| Topic co-occurrence | Topics tagged alongside claude-code-skills get crawled for next run |
| VSCode + Open VSX marketplaces | Some extensions ship with SKILL.md companions |
| Brave Search API | Web-search-anchored discovery |
| LLM query expansion | Claude generates next-week's search queries based on what's been found |
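
The "text-blob scan for repo URLs" used on the social sources can be sketched as a single regex pass. This is an illustration, not the miner's actual code: the pattern is a simplification of GitHub's naming rules, and it does not filter out non-repo paths such as github.com/topics/....

```python
import re

# Matches github.com/<owner>/<repo>. The character classes are an
# approximation of what GitHub allows in owner and repo names.
REPO_URL = re.compile(
    r"https?://(?:www\.)?github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)"
)


def extract_repos(text: str) -> list[str]:
    """Return unique, lowercased "owner/repo" strings in first-seen order."""
    seen: dict[str, None] = {}  # dict keeps insertion order
    for owner, repo in REPO_URL.findall(text):
        repo = repo.rstrip(".")  # drop trailing sentence punctuation
        if repo.endswith(".git"):
            repo = repo[:-4]  # normalize clone URLs
        if repo:
            seen.setdefault(f"{owner}/{repo}".lower(), None)
    return list(seen)
```

Lowercasing before deduplication reflects that GitHub treats owner and repo names case-insensitively, so the same repo linked with different casing counts once.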

Each source returns candidate repo URLs. The miner fetches the SKILL.md, validates the YAML frontmatter, runs admission scoring (more on this below), categorizes by domain (Engineering / Security / Growth / etc. — 10 categories total), tags across ~100 orthogonal dimensions (language, framework, AI provider, cloud, integration type), and writes a static HTML page at /skills/<slug>/.

The miner is bounded: per-source caps prevent any one source from draining the GitHub API budget; every section runs inside a _safe_section() try-block so a single broken endpoint can't kill the run.
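
The two guards described here can be sketched together. The post only names `_safe_section()`; the `safe_section` context manager and `run_sources` helper below are hypothetical stand-ins showing one way to get the same behavior, with a per-source cap enforced by `itertools.islice`.

```python
import logging
from contextlib import contextmanager
from itertools import islice
from typing import Callable, Iterable, Iterator

log = logging.getLogger("miner")


@contextmanager
def safe_section(name: str) -> Iterator[None]:
    """Run one miner section; log and swallow any error it raises."""
    try:
        yield
    except Exception:
        log.exception("section %r failed; continuing with the next one", name)


def run_sources(
    sources: dict[str, Callable[[], Iterable[str]]],
    per_source_cap: int,
) -> dict[str, list[str]]:
    """Collect at most `per_source_cap` candidates from each source."""
    found: dict[str, list[str]] = {}
    for name, fetch in sources.items():
        found[name] = []  # a failed source still reports an empty list
        with safe_section(name):
            # islice stops pulling from a lazy source once the cap is hit,
            # so an oversized source cannot drain the shared API budget.
            found[name] = list(islice(fetch(), per_source_cap))
    return found
```

The cap only saves API calls if each source is a lazy generator that fetches pages on demand; a source that eagerly downloads everything before returning would spend the budget regardless.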

A full run takes about 4 hours. New skills appear on the live catalog the same day they're discovered.

Admission — content signals only, no popularity

This is the part I'm most opinionated about. Ranking can't be bought. The moment a paid signal influences who appears in the catalog (or in what order), the value proposition collapses — nobody pays for "objective evaluation" when it isn't objective.

So the catalog admits skills based on a content score derived from the SKILL.md itself:

  • Anti-trigger discipline — does the SKILL.md have a "when NOT to use" or "out of scope" section? That's a +4 per pattern, capped at +16. Strong negative-space marking is the single best signal that the author thought about edge cases.
  • Pricing / quota transparency — does it document costs, rate limits, or expected API spend? +10.
  • Frontmatter depth — beyond name: and description:, how many other fields are present (model:, tags:, version:, license:, allowed-tools:, metadata.*)? Capped at 10 distinct keys to prevent gaming.
  • Length × structure — is the body substantive (>800 chars in description:, multiple code blocks, headings)?
  • Filler-phrase penalty: // TODO, Lorem ipsum, generic templated phrases → minus 5.

The score never weighs stars, forks, install counts, GitHub follower count, or any other popularity signal. A skill written by a developer with 0 GitHub followers and a clear anti-trigger section beats a flashy skill by a 50k-follower influencer that's just frontmatter-and-vibes. That's the bar.
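
A rough sketch of how the signals above could combine into one number. The +4/+16, +10, 10-key cap, and minus 5 values come from the list; the regex patterns, the one-point-per-key weighting, and applying the filler penalty once are assumptions made for illustration. The length × structure signal is left out because its weights are not stated.

```python
import re

# Hypothetical patterns; the catalog's real pattern lists are not shown
# in this post.
ANTI_TRIGGER_PATTERNS = [
    r"when not to use",
    r"out of scope",
    r"do not use (?:this|when)",
    r"not suitable for",
    r"anti-?triggers?",
]
PRICING_PATTERN = r"rate limit|pricing|quota|cost per|api spend"
FILLER_PATTERN = r"\btodo\b|lorem ipsum"
BASE_KEYS = {"name", "description"}


def content_score(frontmatter: dict[str, str], body: str) -> int:
    """Score a skill from its own text only; no popularity inputs exist."""
    text = body.lower()
    score = 0
    # +4 per distinct anti-trigger pattern found, capped at +16.
    hits = sum(bool(re.search(p, text)) for p in ANTI_TRIGGER_PATTERNS)
    score += min(4 * hits, 16)
    # +10 for pricing / quota transparency.
    if re.search(PRICING_PATTERN, text):
        score += 10
    # Frontmatter depth: assumed 1 point per extra key, capped at 10 keys.
    score += min(len(set(frontmatter) - BASE_KEYS), 10)
    # Filler-phrase penalty: assumed to apply once, minus 5.
    if re.search(FILLER_PATTERN, text):
        score -= 5
    return score
```

Note that the function signature takes only the frontmatter and body, so stars, forks, and follower counts cannot leak into the result even by accident.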

For ranking inside the desktop app's Pro tier — a separate evaluation layer — the formula is the same content-only structural score plus frontmatter-completeness, rescaled to [50, 100]. Still no popularity signals.
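
The rescaling step can be illustrated with a plain linear map. Whether the app uses a linear map, and what the raw score bounds are, is not stated in this post; `raw_min` and `raw_max` below are hypothetical parameters.

```python
def rescale(
    raw: float,
    raw_min: float,
    raw_max: float,
    lo: float = 50.0,
    hi: float = 100.0,
) -> float:
    """Linearly map raw in [raw_min, raw_max] onto [lo, hi], clamping outliers."""
    if raw_max <= raw_min:
        raise ValueError("raw_max must be greater than raw_min")
    clamped = min(max(raw, raw_min), raw_max)
    return lo + (hi - lo) * (clamped - raw_min) / (raw_max - raw_min)
```

With assumed raw bounds of 0 and 40, a raw score of 20 lands at 75, and anything at or below 0 is displayed as 50.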

This costs me about 30% of the skills an unconstrained "rank by stars" catalog would surface. I'm OK with that trade.

What surprised me

1. The catalog is dominated by a handful of prolific authors. One contributor has 3,446 admitted skills (yes, really). The top 25 authors account for ~30% of the catalog. There's a Pareto distribution underneath the long tail.

2. Sales-category skills score highest on content quality. Counter-intuitive — I expected Engineering or Security to be most polished. Turns out sales-focused skill authors over-index on structure (anti-trigger sections, scope discipline, pricing transparency) because that's their professional habit. Engineering authors more often skip the "when NOT to use" section because they assume it's obvious.

3. Vendor-side adoption is still zero. The catalog has no skills whose author_url points at anthropic.com, openai.com, or any other large AI vendor. Every entry is independent, so the ecosystem so far is entirely community-driven.

4. The SKILL.md format is leaking sideways. I found skills in repos tagged cline-skills, cursor-rules, aider-skills, windsurf-rules. The format is becoming a portable agent-skill standard, not just a Claude Code thing. The catalog admits these too — they're SKILL.md files, the agent that loads them is the user's choice.

5. The biggest discovery surface isn't GitHub code search. It's the stargazer graph. When a repo containing a SKILL.md reaches a few hundred stars, more than 30% of the people who starred it have their own SKILL.md somewhere in their account. Mining the graph yields skills the code-search queries don't find.
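
One hop of stargazer mining can be sketched as follows. The three callables are injected so the traversal logic stays separate from API access; in practice they might wrap GitHub's REST endpoints `GET /repos/{owner}/{repo}/stargazers` and `GET /users/{username}/repos`. The function name, its parameters, and the `max_users` budget are hypothetical, not the miner's real interface.

```python
from typing import Callable, Iterable


def mine_stargazers(
    seed_repos: Iterable[str],
    list_stargazers: Callable[[str], Iterable[str]],
    list_user_repos: Callable[[str], Iterable[str]],
    has_skill_file: Callable[[str], bool],
    max_users: int = 500,
) -> set[str]:
    """One hop of graph mining: seed repo -> stargazers -> their repos."""
    known = set(seed_repos)
    seen_users: set[str] = set()
    discovered: set[str] = set()
    for repo in sorted(known):  # sorted for a reproducible crawl order
        for user in list_stargazers(repo):
            if user in seen_users:
                continue  # someone who starred several seed repos
            if len(seen_users) >= max_users:
                return discovered  # stay inside the API budget
            seen_users.add(user)
            for candidate in list_user_repos(user):
                if candidate not in known and has_skill_file(candidate):
                    discovered.add(candidate)
    return discovered
```

Feeding the returned repos back in as seeds on the next nightly run would extend the crawl by one more hop each time.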

What's free

Everything the catalog produces is open:

  • Public catalog at https://claudskills.com/ — browseable + searchable.
  • Open dataset at github.com/claudskills/catalog-public — daily refresh in 6 formats (JSON, NDJSON, CSV, Parquet, Atom feed, README). CC BY 4.0.
  • HuggingFace mirror at huggingface.co/datasets/claudskills/skills — same data, parquet-native, suitable for LLM training.
  • Public REST API at https://claudskills.com/api/v1/ — read-only, no auth, CORS-open, edge-cached. OpenAPI 3.1 spec covers every endpoint. Paginated /skills, single-skill /skills/<slug>, /categories, /tags, /stats. The catalog API itself is ~300 LOC of Cloudflare Worker code; the heavy lifting is the daily miner.
  • Embeddable skill card at https://claudskills.com/embed/<slug>.js — one-line <script> tag that injects a styled card into any blog post or doc page. The card you'd drop into your own writeup of a favorite skill.
  • Shields.io-style badge at https://claudskills.com/badge/<slug>.svg — for skill authors to drop into their GitHub READMEs.
  • Daily Skill-of-the-Day archive at /sotd/YYYY-MM-DD/ — every UTC day picks one skill via a date-hash that stays consistent across mobile push, social posts, and the web.
  • Per-category, per-tag, per-author, and per-use-case landing pages — about 2,800 hub pages total covering the catalog from every browsing angle.
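
The Skill-of-the-Day date-hash in the list above could work along these lines. The choice of SHA-256, the sorted slug list, and the function name are assumptions; the post only says the pick is derived from a hash of the UTC date.

```python
import hashlib
from datetime import date, datetime, timezone
from typing import Optional


def skill_of_the_day(slugs: list[str], day: Optional[date] = None) -> str:
    """Pick one slug deterministically from the UTC calendar date."""
    if not slugs:
        raise ValueError("catalog is empty")
    if day is None:
        day = datetime.now(timezone.utc).date()
    # Hash the ISO date string, e.g. "2025-01-31", and use the first
    # 8 bytes of the digest as an integer index into the catalog.
    digest = hashlib.sha256(day.isoformat().encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(slugs)
    # Sorting makes the pick independent of the order slugs arrive in.
    return sorted(slugs)[index]
```

Because every client computes the same hash from the same date, mobile push, social posts, and the web agree without coordinating. The pick does shift if the slug list changes size during the day, so a real implementation would likely hash against a fixed daily snapshot.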

What I'd change if starting over

A few things I learned the hard way:

  1. Build the public dataset first, the website second. I spent the first two months making the website nice. The dataset would have driven more usage faster — researchers and tool-builders pick up CC BY 4.0 data within days of finding it; consumer-facing UIs take months to build word-of-mouth.

  2. Cloudflare Workers + R2 + Netlify together is more reliable than any one of them. The site has 64,000+ per-skill HTML pages, which would blow Netlify's deploy-prep budget at scale. So per-skill HTML files live in Cloudflare R2 with a Netlify rewrite to serve them from claudskills.com/skills/<slug>/. API + embed + badge endpoints are Cloudflare Workers bound to the same domain. The homepage + static pages are direct from Netlify. Each layer doing what it's best at.

  3. Anti-popularity signals were the hardest decision and the most important one. Every time I evaluate a candidate change to the ranking algorithm, "would skill authors pay to influence this?" is the test. If yes, the change doesn't ship. The discipline pays off when you have a Pro subscription product — it's "pay $9/month for the multi-signal Quality Score in the desktop app," and there's nothing for me to defend about why the score is honest. It's honest by construction.

What's next

The next quarter is about distribution — the catalog exists, now developers need to find it. The roadmap:

  • 25 awesome-list PRs (live next week)
  • A weekly catalog-growth report cross-posted to dev.to / Hashnode / Medium / LinkedIn
  • Embed cards in third-party blog posts (the API is ready; the inbound demand will tell us if the embed surface gets traction)
  • iOS and Android companion apps for discovery (already in App Store review at the time of writing)

If you've written a SKILL.md, it's probably already in the catalog — search for your repo name at claudskills.com. If you haven't, the catalog will pick it up within 24 hours of you pushing to a public GitHub repo. If you want to fast-track it, there's a submit form on the homepage.

If you're a researcher, a tool-builder, or an LLM-pipeline operator who wants to ingest the data: the public dataset refreshes daily, and the API is rate-limit-free for normal use. Build something cool — I'd love to hear about it.
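
For anyone who wants to start ingesting, a minimal stdlib client might look like this. The base URL and the endpoint paths come from the list of free resources above; the helper names are hypothetical, and the response shapes and any pagination parameters are not described in this post, so check the OpenAPI spec before relying on specific fields.

```python
import json
from typing import Any, Optional
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API_BASE = "https://claudskills.com/api/v1"


def endpoint_url(path: str, params: Optional[dict[str, str]] = None) -> str:
    """Build a URL for a collection endpoint such as "skills" or "stats"."""
    url = f"{API_BASE}/{path.strip('/')}"
    if params:
        url += "?" + urlencode(params)  # parameter names are up to the caller
    return url


def skill_url(slug: str) -> str:
    """Build the URL for the single-skill endpoint /skills/<slug>."""
    return f"{API_BASE}/skills/{quote(slug, safe='')}"


def fetch_json(url: str, timeout: float = 10.0) -> Any:
    """GET a URL and decode the JSON body. The API needs no auth header."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_json(endpoint_url("stats"))` would then retrieve the catalog statistics, and `fetch_json(skill_url(slug))` a single entry.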

The catalog is at claudskills.com. The dataset is at github.com/claudskills/catalog-public. Comments + questions welcome.


ClaudSkills is an independent community catalog. Claude™ is a trademark of Anthropic PBC; ClaudSkills is not affiliated with, endorsed by, or sponsored by Anthropic.

Top comments (1)

Adam Lankamer

Thanks for reading 👋 Happy to dig into specifics — the miner's discovery-source priorities, the Cloudflare Workers + R2 split, the content-only ranking scorecard, or why I dropped popularity signals entirely. The catalog itself is at claudskills.com if you want to poke around — search for your own GitHub username, your skills are probably already in there.