TL;DR: I built automated crawlers that discover AI coding prompts, skills, and MCP servers from GitHub, running daily via Vercel cron jobs. Here's how.
The Manual Content Problem
When I launched indx.sh, I had a content problem. The AI coding ecosystem moves fast:
- New MCP servers pop up daily
- Developers publish cursor rules and skill definitions constantly
- Official repositories get updates
- Star counts change
Manually tracking all this? Impossible.
The Solution: GitHub Crawlers
I built three automated crawlers that run daily:
1. **Prompts Crawler** - Discovers `.cursorrules`, `CLAUDE.md`, and `copilot-instructions.md` files
2. **Skills Crawler** - Finds repos with `SKILL.md` files
3. **MCP Crawler** - Finds Model Context Protocol servers
All run as Vercel cron jobs, so the directory stays fresh without manual work.
How the Prompts Crawler Works
The newest crawler searches for AI coding rules across multiple tools:
```ts
const FILE_SEARCHES = [
  { query: 'filename:.cursorrules', tool: 'cursor' },
  { query: 'filename:CLAUDE.md', tool: 'claude-code' },
  { query: 'filename:copilot-instructions.md', tool: 'copilot' },
];

const REPO_SEARCHES = [
  'cursor-rules in:name,description',
  'awesome-cursorrules',
  'topic:cursor-rules',
];
```
For each file found, the crawler (see the sketch after this list):

- Fetches the content from GitHub
- Generates a slug from `owner-repo-filename`
- Infers category and tags from the content
- Auto-verifies repos with 100+ stars
- Upserts to the database
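In code, that loop looks roughly like this. It's a sketch rather than the exact implementation: `fetchFileContent`, `getRepoStars`, `inferCategory`, and `inferTags` are hypothetical helpers, and the `prompt` model name is assumed.

```ts
for (const { query, tool } of FILE_SEARCHES) {
  const { items } = await searchGitHub(query);

  for (const item of items) {
    const [owner, repo] = item.repository.full_name.split('/');
    const content = await fetchFileContent(owner, repo, item.path); // hypothetical helper
    const stars = await getRepoStars(owner, repo); // separate per-repo lookup
    const slug = `${owner}-${repo}-${item.name}`.toLowerCase();

    await prisma.prompt.upsert({
      where: { slug },
      create: {
        slug,
        tool,                             // 'cursor' | 'claude-code' | 'copilot'
        content,
        category: inferCategory(content), // keyword-based guess
        tags: inferTags(content),
        verified: stars >= 100,           // auto-verify popular repos
        githubStars: stars,
      },
      update: { content, githubStars: stars },
    });
  }
}
```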
First run indexed 175 prompts across Cursor, Claude Code, and Copilot.
How the Skills Crawler Works
```ts
// Search GitHub for SKILL.md files
const { items } = await searchGitHub('filename:SKILL.md');

for (const item of items) {
  const [owner, repo] = item.repository.full_name.split('/');
  const slug = `${owner}-${repo}-${item.path}`.toLowerCase();

  // Fetch the actual SKILL.md content
  const content = await fetchFileContent(owner, repo, item.path);

  // Parse frontmatter (name, description, tags)
  const metadata = parseFrontmatter(content);

  // Code search results don't include star counts, so look them up separately
  const githubStars = await getRepoStars(owner, repo);

  // Upsert to database
  await prisma.skill.upsert({
    where: { slug },
    create: { slug, ...metadata, content, githubStars },
    update: { githubStars }, // Keep stars fresh
  });
}
```
The key insight: GitHub's code search API lets you search by filename. `filename:SKILL.md` returns every repo with that file.
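For reference, `searchGitHub` can be a thin wrapper around that endpoint. A minimal sketch (the code search API requires an authenticated token; pagination and error handling mostly omitted):

```ts
async function searchGitHub(query: string, page = 1) {
  const url = `https://api.github.com/search/code?q=${encodeURIComponent(query)}&per_page=50&page=${page}`;
  const res = await fetch(url, {
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
    },
  });
  if (!res.ok) throw new Error(`GitHub search failed: ${res.status}`);
  const data = await res.json();
  return { items: data.items };
}
```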
How the MCP Crawler Works
MCP servers are trickier - there's no single file convention. I use multiple search strategies:
```ts
const SEARCH_STRATEGIES = [
  'mcp server in:name,description',
  'model context protocol server',
  'topic:mcp',
  '@modelcontextprotocol/server',
  'mcp server typescript',
  'mcp server python',
];
```
For each strategy, the crawler (see the sketch after this list):

- Searches GitHub repos sorted by stars
- Filters for MCP-related content
- Fetches `package.json` for npm package names
- Infers categories from description/topics
- Marks official repos (from the `modelcontextprotocol` org) as verified
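Putting those steps together, the loop looks roughly like this. Again a sketch: `searchRepos`, `isMcpRelated`, `fetchPackageJson`, and `inferCategory` are hypothetical helpers, and the `mcpServer` model name is assumed.

```ts
for (const query of SEARCH_STRATEGIES) {
  // Hypothetical wrapper over GET /search/repositories?q=...&sort=stars
  const repos = await searchRepos(query);

  for (const repo of repos) {
    // Skip repos that merely mention "mcp" without being MCP servers
    if (!isMcpRelated(repo)) continue;

    const owner = repo.owner.login;
    const slug = `${owner}-${repo.name}`.toLowerCase();
    const pkg = await fetchPackageJson(owner, repo.name); // may return null

    await prisma.mcpServer.upsert({
      where: { slug },
      create: {
        slug,
        name: repo.name,
        npmPackage: pkg?.name ?? null,
        category: inferCategory(repo.description, repo.topics),
        verified: owner === 'modelcontextprotocol', // official org
        githubStars: repo.stargazers_count,
      },
      update: { githubStars: repo.stargazers_count },
    });
  }
}
```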
The Cron Schedule
The schedule lives in `vercel.json`:

```json
{
  "crons": [
    { "path": "/api/cron/sync-github-stats", "schedule": "0 3 * * *" },
    { "path": "/api/cron/crawl-skills", "schedule": "0 4 * * *" },
    { "path": "/api/cron/crawl-mcp", "schedule": "0 5 * * *" },
    { "path": "/api/cron/crawl-prompts", "schedule": "0 6 * * *" }
  ]
}
```
Every night (UTC):
- 3:00 AM - Sync GitHub star counts for existing resources
- 4:00 AM - Discover new skills
- 5:00 AM - Discover new MCP servers
- 6:00 AM - Discover new prompts/rules
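Each of those paths is a regular API route. Vercel sends your `CRON_SECRET` environment variable as a bearer token on cron invocations, so the handler can reject everything else. A minimal sketch, assuming the Next.js App Router and a hypothetical `crawlSkills` entry point:

```ts
// app/api/cron/crawl-skills/route.ts
export async function GET(request: Request) {
  // Only accept requests carrying the Vercel cron secret
  if (request.headers.get('authorization') !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response('Unauthorized', { status: 401 });
  }

  const result = await crawlSkills(); // hypothetical: runs one 50-item batch
  return Response.json(result);
}
```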
Rate Limiting Matters
GitHub's API has limits. Unauthenticated, the search API allows only 10 requests per minute; with a token you get 30 search requests per minute and 5,000 requests per hour against the rest of the API.
I handle this carefully:
- Small delays between requests
- Process in batches (50 items per cron run)
- Graceful retry on rate limit errors
```ts
// A 403 with zero remaining quota means we've hit the rate limit
if (res.status === 403 && res.headers.get('X-RateLimit-Remaining') === '0') {
  const reset = Number(res.headers.get('X-RateLimit-Reset')); // epoch seconds
  console.log(`Rate limited. Resets at ${new Date(reset * 1000)}`);
  await sleep(60_000); // Wait a minute, then retry
}
```
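The `sleep` above is just a promisified `setTimeout`. Combined with batching, the pacing helpers might look like this (a sketch; the 50-item batch size is real, the 500 ms delay is an illustrative value):

```ts
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Process at most 50 items per cron run, with a small delay between requests
async function processBatch<T>(items: T[], handle: (item: T) => Promise<void>) {
  for (const item of items.slice(0, 50)) {
    await handle(item);
    await sleep(500); // stay well clear of GitHub's secondary rate limits
  }
}
```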
What I Learned
1. Incremental is better than bulk
Early versions tried to crawl everything at once. Timeouts, rate limits, chaos. Now I process 50 items per run and let it accumulate.
2. Deduplication by slug
The same repo can appear in multiple search strategies. I generate consistent slugs (`owner-repo-path`) and upsert instead of insert.
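For illustration, a slug helper could look like this (hypothetical; the exact sanitization rules are my guess):

```ts
function makeSlug(owner: string, repo: string, path: string): string {
  return `${owner}-${repo}-${path}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse slashes, dots, and other separators
    .replace(/^-+|-+$/g, '');    // trim stray dashes at the ends
}
```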
3. Don't trust descriptions
Many repos have empty or useless descriptions. I fall back to: "AI rules from {owner}/{repo}". Not pretty, but works.
4. Official = trusted
Repos from modelcontextprotocol, anthropics, or anthropic-ai orgs get auto-verified badges. Community repos need manual verification.
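In code, that's just an allowlist check (a sketch):

```ts
// Orgs whose repos get an auto-verified badge
const OFFICIAL_ORGS = new Set(['modelcontextprotocol', 'anthropics', 'anthropic-ai']);

const verified = OFFICIAL_ORGS.has(owner.toLowerCase());
```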
Current Stats
After the crawlers have been running:
- 790+ MCP servers indexed
- 1,300+ skills discovered
- 300+ prompts/rules indexed
- Daily updates keep star counts fresh
The Honest Struggle
GitHub search isn't perfect. I get false positives - repos that mention "mcp" but aren't MCP servers. Manual review still matters for quality.
Also: the 50-item limit per cron run means it takes days to fully index everything. Vercel's 10-second timeout for hobby plans is real.
What's Next
- Better category inference using AI
- README parsing for richer descriptions
- Automatic quality scoring based on stars, activity, docs
- User submissions to fill gaps
Try It
Browse the auto-discovered resources at indx.sh:
- Rules & Prompts - Cursor, Claude Code, Copilot rules
- MCP Servers - sorted by GitHub stars
- Skills - searchable by name/tags
Got a resource that's not indexed? Submit it or wait for the crawlers to find it.
This is part 2 of the "Building indx.sh" series.