agenthustler
Scraping GitHub in 2026: Repos, Users & Organization Data via API

Why Scrape GitHub?

GitHub hosts 400M+ repositories and 100M+ developers. That's a goldmine if you know how to extract it:

  • Recruiter sourcing — Find active contributors to specific frameworks (e.g., PyTorch, LangChain) and reach out with context
  • Competitive analysis — Track competitor repos: stars growth, commit frequency, contributor count
  • Tech stack research — Map which languages and tools companies actually use (not what their job posts claim)
  • Contributor tracking — Monitor who's building what in your niche, spot rising talent early

The challenge? Doing this at scale without getting rate-limited into oblivion.

GitHub REST API vs. Web Scraping

Don't scrape GitHub's HTML. Their API is better in every way:

              REST API                                Web Scraping
Rate limit    60 req/hr (unauth), 5,000/hr (token)    Aggressive bot detection
Data format   Clean JSON                              Fragile HTML parsing
Reliability   Stable endpoints                        Breaks on layout changes
Fields        Rich metadata                           Only what's visible on the page

The only downside? Rate limits. At 60 requests/hour without auth, scraping 1,000 repos takes ~17 hours. Even with a token (5,000/hr), large-scale jobs need smart throttling.
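The arithmetic behind those numbers is worth making explicit — pacing a job evenly against an hourly quota is just division (the two helper functions below are illustrative, not part of any library):

```python
def min_delay_seconds(requests_per_hour: int) -> float:
    """Minimum spacing between requests to stay under an hourly quota."""
    return 3600 / requests_per_hour

def hours_for_job(total_requests: int, requests_per_hour: int) -> float:
    """Duration of a job if requests are paced evenly."""
    return total_requests / requests_per_hour

# Unauthenticated (60 req/hr): 1,000 repos takes ~16.7 hours
print(round(hours_for_job(1000, 60), 1))
# Authenticated (5,000 req/hr): one request every 0.72 seconds
print(min_delay_seconds(5000))
```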

The Easier Way: A Purpose-Built GitHub Scraper

I built a GitHub Scraper on Apify that handles all the API complexity for you. It runs 5 modes:

1. search-repos

Search repositories by keyword, language, stars, or any GitHub search qualifier.

Output per repo (18 fields):
name, full_name, description, url, html_url, language, stars, forks, open_issues, watchers, created_at, updated_at, pushed_at, size, default_branch, topics, license, owner

2. search-users

Find developers by location, language, followers, or bio keywords.

3. user-profile

Get full profile data for specific users — repos, contributions, bio, company, location.

Output per user (15 fields):
login, name, bio, company, location, email, blog, twitter_username, public_repos, public_gists, followers, following, created_at, updated_at, avatar_url

4. repo-details

Deep-dive into specific repositories — contributors, languages breakdown, recent commits.

5. org-repos

List all public repos for any GitHub organization. Great for mapping a company's open-source footprint.
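The actor handles pagination for you, but if you were hitting the REST API directly, listing an org's repos means following GitHub's page-based pagination (`GET /orgs/{org}/repos?per_page=100&page=N`) until a page comes back short. A sketch of that loop, with the HTTP call injected as a function so the logic stays visible:

```python
from typing import Callable

def list_all_org_repos(org: str, fetch_page: Callable[[str, int], list]) -> list:
    """Collect every repo for an org by walking page-based pagination.

    fetch_page(org, page) should GET /orgs/{org}/repos?per_page=100&page={page}
    and return the decoded JSON list for that page.
    """
    repos, page = [], 1
    while True:
        batch = fetch_page(org, page)
        repos.extend(batch)
        if len(batch) < 100:  # a short (or empty) page means we've hit the end
            return repos
        page += 1
```

In a real script, `fetch_page` would wrap `requests.get` with an `Authorization: Bearer <token>` header; injecting it also makes the loop trivially testable.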

Rate limiting is built in — the actor uses 1.5-second delays between requests to stay well within API limits, and automatically backs off on 429 responses.
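That throttle-plus-backoff behavior can be sketched like this (a simplified model — `do_request` is a stand-in for the real HTTP call, and the actor's actual retry policy may differ):

```python
import time

def call_with_backoff(do_request, max_retries: int = 5, base_delay: float = 1.5):
    """Fixed pacing between calls; exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        status, body = do_request()
        if status != 429:
            time.sleep(base_delay)  # fixed spacing between successful requests
            return body
        # Rate limited: wait base_delay * 2^attempt, then retry
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("still rate limited after retries")
```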

Python Quick Start (10 Lines)

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("cryptosignals/github-scraper").call(input={
    "mode": "search-repos",
    "query": "machine learning language:python stars:>100",
    "maxItems": 50
})

for repo in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{repo['full_name']} - {repo['stars']} stars - {repo['description']}")

Install the client: pip install apify-client

Use Case: Finding Python ML Contributors for Outreach

Let's say you're recruiting ML engineers or building a developer community. Here's a step-by-step workflow:

Step 1: Find trending ML repos

run = client.actor("cryptosignals/github-scraper").call(input={
    "mode": "search-repos",
    "query": "machine learning language:python stars:>500 pushed:>2026-01-01",
    "maxItems": 20
})
repos = list(client.dataset(run["defaultDatasetId"]).iterate_items())

Step 2: Get contributor profiles from top repos

for repo in repos[:5]:
    run = client.actor("cryptosignals/github-scraper").call(input={
        "mode": "repo-details",
        "repoUrl": repo["html_url"]
    })
    details = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    print(f"\n{repo['full_name']} contributors:")
    for contributor in details[0].get("contributors", [])[:10]:
        print(f"  - {contributor['login']}")

Step 3: Enrich with full profiles

contributor_logins = ["username1", "username2"]  # from step 2

for login in contributor_logins:
    run = client.actor("cryptosignals/github-scraper").call(input={
        "mode": "user-profile",
        "username": login
    })
    profile = list(client.dataset(run["defaultDatasetId"]).iterate_items())[0]
    print(f"{profile['name']} | {profile['company']} | {profile['location']}")
    if profile.get('email'):
        print(f"  Email: {profile['email']}")

You now have a targeted list of active ML contributors with their company, location, and (when public) email — all from structured API data, no HTML parsing required.
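One wrinkle when wiring steps 1-3 together: the same contributor often shows up in several repos. A `Counter` both deduplicates the logins and gives you a crude activity signal for prioritizing who to enrich first (the data below is made up for illustration):

```python
from collections import Counter

# Contributor logins collected per repo in step 2 (illustrative data)
contributors_by_repo = {
    "org/repo-a": ["alice", "bob", "carol"],
    "org/repo-b": ["bob", "dave"],
    "org/repo-c": ["alice", "bob"],
}

# Count how many of the scraped repos each login appears in
counts = Counter(
    login for logins in contributors_by_repo.values() for login in logins
)

# Most cross-repo activity first — enrich these profiles in step 3 first
for login, n_repos in counts.most_common():
    print(login, n_repos)
```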

What's Next

The GitHub Scraper is launching this week on Apify. Starting April 3, the Pro plan ($4.99/month) unlocks higher concurrency and priority support.

Try it free today — the Apify free tier gives you enough compute to test all 5 modes.

Questions? Drop a comment below or open an issue on the actor's page.
