Ajit for AWS Community Builders

How I Search 10,000+ AWS GitHub Repos in 10 Seconds

The Problem

Every AWS developer knows this pain: you need a reference architecture or sample code, and you end up with 47 browser tabs open across GitHub, AWS docs, Stack Overflow, and random blog posts from 2019.

GitHub search gives you 2,000 results with zero context. ChatGPT confidently hallucinates repos that don't exist. Stack Overflow answers are outdated.

What I Built

I built an AI-powered search engine that indexes 10,000+ repos from AWS's official GitHub organizations:

  • aws-samples: 8,031 repos
  • awslabs: 993 repos
  • aws-solutions-library-samples: 315 repos
  • aws-ia: 234 repos
  • aws-solutions: 72 repos

How It Works

The search uses a hybrid approach:

  • 70% BM25 (keyword matching) — catches exact AWS service names
  • 30% FAISS (semantic vector search) — understands what you mean, not just what you type
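The 70/30 fusion can be sketched in plain Python. This is a minimal illustration, not the production code: it assumes each retriever returns raw scores per repo ID, min-max normalizes them so the two score scales are comparable, then takes a weighted sum. The function and variable names are hypothetical.

```python
def hybrid_scores(bm25_scores, vector_scores, bm25_weight=0.7):
    """Fuse keyword and semantic scores with a weighted sum.

    bm25_scores / vector_scores: dicts mapping repo ID -> raw score.
    Each retriever's scores are min-max normalized first, so the
    70/30 weighting is meaningful across different score scales.
    """
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    bm25 = normalize(bm25_scores)
    vec = normalize(vector_scores)
    fused = {}
    for repo in set(bm25) | set(vec):
        fused[repo] = (bm25_weight * bm25.get(repo, 0.0)
                       + (1 - bm25_weight) * vec.get(repo, 0.0))
    # Highest fused score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

A repo that only matches semantically still surfaces, but an exact keyword hit (say, the literal string "Lambda") dominates the ranking, which is the point of weighting BM25 at 70%.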

Each repo is classified by Amazon Bedrock (Nova Pro) across 22 metadata fields: solution type, AWS services used, complexity, freshness, setup time estimate, and more.

Auto-indexed twice daily via EventBridge — new AWS repos are searchable within 12 hours.

Tech Stack

  • Amazon Bedrock (Nova Pro) for AI classification
  • FAISS with Titan Embed v2 (1024-dim) for vector search
  • AWS Lambda (Docker, Python 3.12)
  • API Gateway
  • DynamoDB for usage tracking
  • CloudFront + S3 for frontend
  • EventBridge for scheduled indexing
  • AWS CDK (TypeScript) for infrastructure
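To make the vector-search piece concrete, here is a NumPy sketch of the computation FAISS's `IndexFlatIP` performs over L2-normalized embeddings: an inner-product (cosine) search returning the top-k nearest repos. The random vectors stand in for real Titan Embed v2 outputs, which in the actual pipeline would come from Bedrock; everything else here is illustrative.

```python
import numpy as np

DIM = 1024  # Titan Embed v2 output dimension

def normalize(v):
    """L2-normalize along the last axis so inner product == cosine."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def top_k(index_matrix, query, k=3):
    """Inner-product search over normalized embeddings --
    the same computation FAISS's IndexFlatIP performs."""
    scores = index_matrix @ query            # cosine similarities
    order = np.argsort(-scores)[:k]          # best matches first
    return list(zip(order.tolist(), scores[order].tolist()))

# Placeholder embeddings; in the real pipeline each row would be a
# Titan Embed v2 vector for one repo's README/description.
rng = np.random.default_rng(42)
repo_vectors = normalize(rng.normal(size=(100, DIM)))

# A query slightly perturbed from repo 7's vector should rank repo 7 first.
query_vector = normalize(repo_vectors[7] + 0.01 * rng.normal(size=DIM))
results = top_k(repo_vectors, query_vector)
```

In production the index is serialized to S3 and loaded by the Lambda on cold start, which is what keeps the cost so low compared to an always-on search cluster.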

Key Architecture Decisions

Why hybrid search over pure vector?
AWS service names are precise — "Lambda" means something specific. Pure semantic search sometimes returns "serverless compute" results when you specifically want Lambda. BM25 catches exact matches, FAISS catches intent. The 70/30 split was tuned through testing.

Why FAISS over OpenSearch?
Cost. OpenSearch Serverless has a minimum cost that runs 24/7. FAISS on Lambda + S3 costs under $6/month for the search component. For a bootstrapped product, that matters.

Why 22 metadata fields?
Developers don't just want to find a repo — they want to know: Is it maintained? How complex is it? What AWS services does it use? How long will setup take? Nova Pro classifies each repo across all 22 fields automatically.
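The classification step can be sketched as prompt-construction plus strict-JSON parsing. This is an assumption-laden illustration: the field names below are a made-up subset of the real 22-field schema (the article only names a few), and the canned reply stands in for an actual `bedrock-runtime` call to Nova Pro.

```python
import json

# Illustrative subset of the classification schema. The real system
# uses 22 fields; these names are assumptions, not the exact schema.
FIELDS = ["solution_type", "aws_services", "complexity",
          "freshness", "setup_time_estimate"]

def build_prompt(repo_name, readme_excerpt):
    """Ask the model to return strict JSON with one key per field."""
    return (
        f"Classify the AWS sample repository '{repo_name}'.\n"
        f"README excerpt:\n{readme_excerpt}\n\n"
        "Respond with a JSON object containing exactly these keys: "
        + ", ".join(FIELDS)
    )

def parse_classification(model_text):
    """Parse the model's reply, tolerating missing fields."""
    data = json.loads(model_text)
    return {f: data.get(f) for f in FIELDS}

# In production the reply would come from a bedrock-runtime call
# (e.g. boto3.client("bedrock-runtime").converse(...) with the Nova
# Pro model ID); here we parse a canned reply to show the shape.
reply = ('{"solution_type": "reference architecture", '
         '"aws_services": ["Lambda", "S3"], '
         '"complexity": "intermediate", '
         '"freshness": "active", '
         '"setup_time_estimate": "30 min"}')
metadata = parse_classification(reply)
```

Storing the parsed fields alongside each repo is what lets the search results answer "is this maintained, and how long will setup take?" without the user opening the repo at all.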

What I Learned

  1. Hybrid search beats pure vector for domain-specific queries
  2. FAISS on Lambda is surprisingly cost-effective
  3. The 22-field classification is what makes results actually useful — not just the search itself
  4. Auto-indexing twice daily keeps results fresh without manual work

Try It

Free to use — 3 searches without registration, 10 per day with a free account.

AWS Solution Finder

What's Next

Exploring ways to improve relevance scoring and add more metadata fields. If you've worked with FAISS or hybrid search, I'd love to hear your approach.


Built solo by an AWS Community Builder. Questions? Drop a comment.
