agenthustler
Scraping GitHub in 2026: Repos, Users & Organization Data via API

Why Scrape GitHub?

GitHub hosts 400M+ repositories and 100M+ developers. That's a goldmine if you know how to extract it:

  • Recruiter sourcing — Find active contributors to specific frameworks (e.g., PyTorch, LangChain) and reach out with context
  • Competitive analysis — Track competitor repos: stars growth, commit frequency, contributor count
  • Tech stack research — Map which languages and tools companies actually use (not what their job posts claim)
  • Contributor tracking — Monitor who's building what in your niche, spot rising talent early

The challenge? Doing this at scale without getting rate-limited into oblivion.

GitHub REST API vs. Web Scraping

Don't scrape GitHub's HTML. Their API is better in every way:

              REST API                                Web Scraping
Rate limit    60 req/hr (unauth), 5,000/hr (token)    Aggressive bot detection
Data format   Clean JSON                              Fragile HTML parsing
Reliability   Stable endpoints                        Breaks on layout changes
Fields        Rich metadata                           Only what's visible on the page

The only downside? Rate limits. At 60 requests/hour without auth, scraping 1,000 repos takes ~17 hours. Even with a token (5,000/hr), large-scale jobs need smart throttling.
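The arithmetic behind those numbers is worth making explicit — pacing a job evenly against an hourly quota is just division (the two helper functions below are illustrative, not part of any library):

```python
def min_delay_seconds(requests_per_hour: int) -> float:
    """Minimum spacing between requests to stay under an hourly quota."""
    return 3600 / requests_per_hour

def hours_for_job(total_requests: int, requests_per_hour: int) -> float:
    """Duration of a job if requests are paced evenly."""
    return total_requests / requests_per_hour

# Unauthenticated (60 req/hr): 1,000 repos takes ~16.7 hours
print(round(hours_for_job(1000, 60), 1))
# Authenticated (5,000 req/hr): one request every 0.72 seconds
print(min_delay_seconds(5000))
```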

The Easier Way: A Purpose-Built GitHub Scraper

I built a GitHub Scraper on Apify that handles all the API complexity for you. It runs 5 modes:

1. search-repos

Search repositories by keyword, language, stars, or any GitHub search qualifier.

Output per repo (18 fields):
name, full_name, description, url, html_url, language, stars, forks, open_issues, watchers, created_at, updated_at, pushed_at, size, default_branch, topics, license, owner

2. search-users

Find developers by location, language, followers, or bio keywords.

3. user-profile

Get full profile data for specific users — repos, contributions, bio, company, location.

Output per user (15 fields):
login, name, bio, company, location, email, blog, twitter_username, public_repos, public_gists, followers, following, created_at, updated_at, avatar_url

4. repo-details

Deep-dive into specific repositories — contributors, languages breakdown, recent commits.

5. org-repos

List all public repos for any GitHub organization. Great for mapping a company's open-source footprint.
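The actor handles pagination for you, but if you were hitting the REST API directly, listing an org's repos means following GitHub's page-based pagination (`GET /orgs/{org}/repos?per_page=100&page=N`) until a page comes back short. A sketch of that loop, with the HTTP call injected as a function so the logic stays visible:

```python
from typing import Callable

def list_all_org_repos(org: str, fetch_page: Callable[[str, int], list]) -> list:
    """Collect every repo for an org by walking page-based pagination.

    fetch_page(org, page) should GET /orgs/{org}/repos?per_page=100&page={page}
    and return the decoded JSON list for that page.
    """
    repos, page = [], 1
    while True:
        batch = fetch_page(org, page)
        repos.extend(batch)
        if len(batch) < 100:  # a short (or empty) page means we've hit the end
            return repos
        page += 1
```

In a real script, `fetch_page` would wrap `requests.get` with an `Authorization: Bearer <token>` header; injecting it also makes the loop trivially testable.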

Rate limiting is built in — the actor uses 1.5-second delays between requests to stay well within API limits, and automatically backs off on 429 responses.
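That throttle-plus-backoff behavior can be sketched like this (a simplified model — `do_request` is a stand-in for the real HTTP call, and the actor's actual retry policy may differ):

```python
import time

def call_with_backoff(do_request, max_retries: int = 5, base_delay: float = 1.5):
    """Fixed pacing between calls; exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        status, body = do_request()
        if status != 429:
            time.sleep(base_delay)  # fixed spacing between successful requests
            return body
        # Rate limited: wait base_delay * 2^attempt, then retry
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("still rate limited after retries")
```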

Python Quick Start (10 Lines)

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("cryptosignals/github-scraper").call(input={
    "mode": "search-repos",
    "query": "machine learning language:python stars:>100",
    "maxItems": 50
})

for repo in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{repo['full_name']} - {repo['stars']} stars - {repo['description']}")

Install the client: pip install apify-client

Use Case: Finding Python ML Contributors for Outreach

Let's say you're recruiting ML engineers or building a developer community. Here's a step-by-step workflow:

Step 1: Find trending ML repos

run = client.actor("cryptosignals/github-scraper").call(input={
    "mode": "search-repos",
    "query": "machine learning language:python stars:>500 pushed:>2026-01-01",
    "maxItems": 20
})
repos = list(client.dataset(run["defaultDatasetId"]).iterate_items())

Step 2: Get contributor profiles from top repos

for repo in repos[:5]:
    run = client.actor("cryptosignals/github-scraper").call(input={
        "mode": "repo-details",
        "repoUrl": repo["html_url"]
    })
    details = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    print(f"\n{repo['full_name']} contributors:")
    for contributor in details[0].get("contributors", [])[:10]:
        print(f"  - {contributor['login']}")

Step 3: Enrich with full profiles

contributor_logins = ["username1", "username2"]  # from step 2

for login in contributor_logins:
    run = client.actor("cryptosignals/github-scraper").call(input={
        "mode": "user-profile",
        "username": login
    })
    profile = list(client.dataset(run["defaultDatasetId"]).iterate_items())[0]
    print(f"{profile['name']} | {profile['company']} | {profile['location']}")
    if profile.get('email'):
        print(f"  Email: {profile['email']}")

You now have a targeted list of active ML contributors with their company, location, and (when public) email — all from structured API data, no HTML parsing required.
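One wrinkle when wiring steps 1-3 together: the same contributor often shows up in several repos. A `Counter` both deduplicates the logins and gives you a crude activity signal for prioritizing who to enrich first (the data below is made up for illustration):

```python
from collections import Counter

# Contributor logins collected per repo in step 2 (illustrative data)
contributors_by_repo = {
    "org/repo-a": ["alice", "bob", "carol"],
    "org/repo-b": ["bob", "dave"],
    "org/repo-c": ["alice", "bob"],
}

# Count how many of the scraped repos each login appears in
counts = Counter(
    login for logins in contributors_by_repo.values() for login in logins
)

# Most cross-repo activity first — enrich these profiles in step 3 first
for login, n_repos in counts.most_common():
    print(login, n_repos)
```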

What's Next

The GitHub Scraper is launching this week on Apify. Starting April 3, the Pro plan ($4.99/month) unlocks higher concurrency and priority support.

Try it free today — the Apify free tier gives you enough compute to test all 5 modes.

Questions? Drop a comment below or open an issue on the actor's page.
