Why Scrape GitHub?
GitHub hosts 400M+ repositories and 100M+ developers. That's a goldmine if you know how to extract it:
- Recruiter sourcing — Find active contributors to specific frameworks (e.g., PyTorch, LangChain) and reach out with context
- Competitive analysis — Track competitor repos: stars growth, commit frequency, contributor count
- Tech stack research — Map which languages and tools companies actually use (not what their job posts claim)
- Contributor tracking — Monitor who's building what in your niche, spot rising talent early
The challenge? Doing this at scale without getting rate-limited into oblivion.
GitHub REST API vs. Web Scraping
Don't scrape GitHub's HTML. Their API is better in every way:
| | REST API | Web Scraping |
|---|---|---|
| Rate limit | 60 req/hr (unauth), 5,000/hr (with token) | Aggressive bot detection |
| Data format | Clean JSON | Fragile HTML parsing |
| Reliability | Stable endpoints | Breaks on layout changes |
| Fields | Rich metadata | What's visible on page |
The only downside? Rate limits. At 60 requests/hour without auth, scraping 1,000 repos takes ~17 hours. Even with a token (5,000/hr), large-scale jobs need smart throttling.
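The arithmetic behind that estimate is worth internalizing before planning a job. A minimal sketch (the helper name is mine, not part of any API):

```python
def estimated_hours(num_requests: int, per_hour_limit: int) -> float:
    """Rough wall-clock estimate for a batch of API calls under a rate limit."""
    return num_requests / per_hour_limit

# 1,000 repos, unauthenticated (60 req/hr) vs. with a token (5,000 req/hr)
print(round(estimated_hours(1000, 60), 1))   # 16.7
print(round(estimated_hours(1000, 5000), 2))  # 0.2
```

With a token the same 1,000-repo job drops from ~17 hours to about 12 minutes, which is why authenticating is the first optimization, not throttling.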
The Easier Way: A Purpose-Built GitHub Scraper
I built a GitHub Scraper on Apify that handles all the API complexity for you. It runs 5 modes:
1. search-repos
Search repositories by keyword, language, stars, or any GitHub search qualifier.
Output per repo (18 fields):
name, full_name, description, url, html_url, language, stars, forks, open_issues, watchers, created_at, updated_at, pushed_at, size, default_branch, topics, license, owner
2. search-users
Find developers by location, language, followers, or bio keywords.
3. user-profile
Get full profile data for specific users — repos, contributions, bio, company, location.
Output per user (15 fields):
login, name, bio, company, location, email, blog, twitter_username, public_repos, public_gists, followers, following, created_at, updated_at, avatar_url
4. repo-details
Deep-dive into specific repositories — contributors, languages breakdown, recent commits.
5. org-repos
List all public repos for any GitHub organization. Great for mapping a company's open-source footprint.
Rate limiting is built in — the actor uses 1.5-second delays between requests to stay well within API limits, and automatically backs off on 429 responses.
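The actor's internals aren't published, but the described behavior (fixed spacing, backoff on 429) is straightforward to reason about. Here's one common way such a delay schedule is implemented — the 1.5-second base matches the actor's description, while the doubling multiplier is my assumption:

```python
BASE_DELAY = 1.5  # seconds between requests, per the actor's description

def next_delay(status_code: int, attempt: int) -> float:
    """Seconds to sleep before the next request.

    Normal responses keep the fixed 1.5 s spacing; a 429 (rate limited)
    triggers exponential backoff: 3 s, 6 s, 12 s on successive retries.
    Assumed multiplier -- the actor's exact backoff curve isn't documented.
    """
    if status_code == 429:
        return BASE_DELAY * (2 ** (attempt + 1))
    return BASE_DELAY

print(next_delay(200, 0))  # 1.5
print(next_delay(429, 0))  # 3.0
print(next_delay(429, 2))  # 12.0
```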
Python Quick Start (10 Lines)
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("cryptosignals/github-scraper").call(input={
    "mode": "search-repos",
    "query": "machine learning language:python stars:>100",
    "maxItems": 50,
})

for repo in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{repo['full_name']} - {repo['stars']} stars - {repo['description']}")
```
Install the client first: `pip install apify-client`
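Since `iterate_items()` yields plain dicts, getting the results into a spreadsheet takes only the standard library. A sketch using the field names from the 18-field schema above:

```python
import csv

def repos_to_csv(repos: list[dict], path: str) -> int:
    """Write selected repo fields to a CSV file; returns the row count."""
    fields = ["full_name", "stars", "language", "html_url"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for repo in repos:
            writer.writerow(repo)
    return len(repos)

# In practice you'd pass the items iterated from the dataset; this sample
# mirrors the shape of a search-repos result.
sample = [{"full_name": "pytorch/pytorch", "stars": 80000,
           "language": "Python", "html_url": "https://github.com/pytorch/pytorch",
           "description": "extra fields are ignored"}]
repos_to_csv(sample, "repos.csv")
```

`extrasaction="ignore"` lets you pass the full 18-field dicts while exporting only the columns you care about.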
Use Case: Finding Python ML Contributors for Outreach
Let's say you're recruiting ML engineers or building a developer community. Here's a step-by-step workflow:
Step 1: Find trending ML repos
```python
run = client.actor("cryptosignals/github-scraper").call(input={
    "mode": "search-repos",
    "query": "machine learning language:python stars:>500 pushed:>2026-01-01",
    "maxItems": 20,
})
repos = list(client.dataset(run["defaultDatasetId"]).iterate_items())
```
Step 2: Get contributor profiles from top repos
```python
for repo in repos[:5]:
    run = client.actor("cryptosignals/github-scraper").call(input={
        "mode": "repo-details",
        "repoUrl": repo["html_url"],
    })
    details = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    print(f"\n{repo['full_name']} contributors:")
    for contributor in details[0].get("contributors", [])[:10]:
        print(f"  - {contributor['login']}")
```
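Rather than copying usernames by hand into step 3, you can collect them from step 2's output. A small helper, assuming the `contributors` list shape shown above:

```python
def collect_logins(repo_details: list[dict], top_n: int = 10) -> list[str]:
    """Deduplicated contributor logins across repo-details results, in order seen."""
    seen, logins = set(), []
    for details in repo_details:
        for contributor in details.get("contributors", [])[:top_n]:
            login = contributor["login"]
            if login not in seen:
                seen.add(login)
                logins.append(login)
    return logins

# Sample data mirroring the repo-details output shape
sample = [
    {"contributors": [{"login": "alice"}, {"login": "bob"}]},
    {"contributors": [{"login": "bob"}, {"login": "carol"}]},
]
print(collect_logins(sample))  # ['alice', 'bob', 'carol']
```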
Step 3: Enrich with full profiles
```python
contributor_logins = ["username1", "username2"]  # from step 2

for login in contributor_logins:
    run = client.actor("cryptosignals/github-scraper").call(input={
        "mode": "user-profile",
        "username": login,
    })
    profile = list(client.dataset(run["defaultDatasetId"]).iterate_items())[0]
    print(f"{profile['name']} | {profile['company']} | {profile['location']}")
    if profile.get("email"):
        print(f"  Email: {profile['email']}")
You now have a targeted list of active ML contributors with their company, location, and (when public) email — all from structured API data, no HTML parsing required.
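To turn those profiles into an outreach-ready list, filter for the contactable ones. A sketch using the 15-field user schema above (the helper and its selection of fields are mine):

```python
def outreach_rows(profiles: list[dict]) -> list[dict]:
    """Keep only profiles with a public email, plus the fields useful for outreach."""
    rows = []
    for p in profiles:
        if p.get("email"):
            rows.append({
                "name": p.get("name") or p["login"],   # fall back to the login
                "company": p.get("company") or "",
                "location": p.get("location") or "",
                "email": p["email"],
            })
    return rows

# Sample data mirroring the user-profile output shape
sample = [
    {"login": "alice", "name": "Alice", "company": "ACME",
     "location": "Berlin", "email": "alice@example.com"},
    {"login": "bob", "name": None, "company": None,
     "location": None, "email": None},
]
print(outreach_rows(sample))  # one row: Alice's details; bob has no public email
```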
What's Next
The GitHub Scraper is launching this week on Apify. Starting April 3, the Pro plan ($4.99/month) unlocks higher concurrency and priority support.
Try it free today — the Apify free tier gives you enough compute to test all 5 modes.
Questions? Drop a comment below or open an issue on the actor's page.