Ajit for AWS Community Builders

How I Search 10,000+ AWS GitHub Repos in 10 Seconds

The Problem

Every AWS developer knows this pain: you need a reference architecture or sample code, and you end up with 47 browser tabs open across GitHub, AWS docs, Stack Overflow, and random blog posts from 2019.

GitHub search gives you 2,000 results with zero context. ChatGPT confidently hallucinates repos that don't exist. Stack Overflow answers are outdated.

What I Built

I built an AI-powered search engine that indexes 10,000+ repos from AWS's official GitHub organizations:

  • aws-samples: 8,031 repos
  • awslabs: 993 repos
  • aws-solutions-library-samples: 315 repos
  • aws-ia: 234 repos
  • aws-solutions: 72 repos

How It Works

The search uses a hybrid approach:

  • 70% BM25 (keyword matching) — catches exact AWS service names
  • 30% FAISS (semantic vector search) — understands what you mean, not just what you type
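The 70/30 fusion can be sketched in plain Python. This is a minimal illustration, not the production code: it assumes each retriever returns raw scores per repo ID, min-max normalizes them so the two score scales are comparable, then takes a weighted sum. The function and variable names are hypothetical.

```python
def hybrid_scores(bm25_scores, vector_scores, bm25_weight=0.7):
    """Fuse keyword and semantic scores with a weighted sum.

    bm25_scores / vector_scores: dicts mapping repo ID -> raw score.
    Each retriever's scores are min-max normalized first, so the
    70/30 weighting is meaningful across different score scales.
    """
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    bm25 = normalize(bm25_scores)
    vec = normalize(vector_scores)
    fused = {}
    for repo in set(bm25) | set(vec):
        fused[repo] = (bm25_weight * bm25.get(repo, 0.0)
                       + (1 - bm25_weight) * vec.get(repo, 0.0))
    # Highest fused score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

A repo that only matches semantically still surfaces, but an exact keyword hit (say, the literal string "Lambda") dominates the ranking, which is the point of weighting BM25 at 70%.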

Each repo is classified by Amazon Bedrock (Nova Pro) across 22 metadata fields: solution type, AWS services used, complexity, freshness, setup time estimate, and more.

Auto-indexed twice daily via EventBridge — new AWS repos are searchable within 12 hours.

Tech Stack

  • Amazon Bedrock (Nova Pro) for AI classification
  • FAISS with Titan Embed v2 (1024-dim) for vector search
  • AWS Lambda (Docker, Python 3.12)
  • API Gateway
  • DynamoDB for usage tracking
  • CloudFront + S3 for frontend
  • EventBridge for scheduled indexing
  • AWS CDK (TypeScript) for infrastructure
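To make the vector-search piece concrete, here is a NumPy sketch of the computation FAISS's `IndexFlatIP` performs over L2-normalized embeddings: an inner-product (cosine) search returning the top-k nearest repos. The random vectors stand in for real Titan Embed v2 outputs, which in the actual pipeline would come from Bedrock; everything else here is illustrative.

```python
import numpy as np

DIM = 1024  # Titan Embed v2 output dimension

def normalize(v):
    """L2-normalize along the last axis so inner product == cosine."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def top_k(index_matrix, query, k=3):
    """Inner-product search over normalized embeddings --
    the same computation FAISS's IndexFlatIP performs."""
    scores = index_matrix @ query            # cosine similarities
    order = np.argsort(-scores)[:k]          # best matches first
    return list(zip(order.tolist(), scores[order].tolist()))

# Placeholder embeddings; in the real pipeline each row would be a
# Titan Embed v2 vector for one repo's README/description.
rng = np.random.default_rng(42)
repo_vectors = normalize(rng.normal(size=(100, DIM)))

# A query slightly perturbed from repo 7's vector should rank repo 7 first.
query_vector = normalize(repo_vectors[7] + 0.01 * rng.normal(size=DIM))
results = top_k(repo_vectors, query_vector)
```

In production the index is serialized to S3 and loaded by the Lambda on cold start, which is what keeps the cost so low compared to an always-on search cluster.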

Key Architecture Decisions

Why hybrid search over pure vector?
AWS service names are precise — "Lambda" means something specific. Pure semantic search sometimes returns "serverless compute" results when you specifically want Lambda. BM25 catches exact matches, FAISS catches intent. The 70/30 split was tuned through testing.

Why FAISS over OpenSearch?
Cost. OpenSearch Serverless has a minimum cost that runs 24/7. FAISS on Lambda + S3 costs under $6/month for the search component. For a bootstrapped product, that matters.

Why 22 metadata fields?
Developers don't just want to find a repo — they want to know: Is it maintained? How complex is it? What AWS services does it use? How long will setup take? Nova Pro classifies each repo across all 22 fields automatically.
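The classification step can be sketched as prompt-construction plus strict-JSON parsing. This is an assumption-laden illustration: the field names below are a made-up subset of the real 22-field schema (the article only names a few), and the canned reply stands in for an actual `bedrock-runtime` call to Nova Pro.

```python
import json

# Illustrative subset of the classification schema. The real system
# uses 22 fields; these names are assumptions, not the exact schema.
FIELDS = ["solution_type", "aws_services", "complexity",
          "freshness", "setup_time_estimate"]

def build_prompt(repo_name, readme_excerpt):
    """Ask the model to return strict JSON with one key per field."""
    return (
        f"Classify the AWS sample repository '{repo_name}'.\n"
        f"README excerpt:\n{readme_excerpt}\n\n"
        "Respond with a JSON object containing exactly these keys: "
        + ", ".join(FIELDS)
    )

def parse_classification(model_text):
    """Parse the model's reply, tolerating missing fields."""
    data = json.loads(model_text)
    return {f: data.get(f) for f in FIELDS}

# In production the reply would come from a bedrock-runtime call
# (e.g. boto3.client("bedrock-runtime").converse(...) with the Nova
# Pro model ID); here we parse a canned reply to show the shape.
reply = ('{"solution_type": "reference architecture", '
         '"aws_services": ["Lambda", "S3"], '
         '"complexity": "intermediate", '
         '"freshness": "active", '
         '"setup_time_estimate": "30 min"}')
metadata = parse_classification(reply)
```

Storing the parsed fields alongside each repo is what lets the search results answer "is this maintained, and how long will setup take?" without the user opening the repo at all.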

What I Learned

  1. Hybrid search beats pure vector for domain-specific queries
  2. FAISS on Lambda is surprisingly cost-effective
  3. The 22-field classification is what makes results actually useful — not just the search itself
  4. Auto-indexing twice daily keeps results fresh without manual work

Try It

Free to use — 3 searches without registration, 10 per day with a free account.

AWS Solution Finder

What's Next

Exploring ways to improve relevance scoring and add more metadata fields. If you've worked with FAISS or hybrid search, I'd love to hear your approach.


Built solo by an AWS Community Builder. Questions? Drop a comment.
