DEV Community

agenthustler

Scraping GitHub Data in 2026: Repos, Users, and Organizations via API

GitHub hosts over 400 million repositories and 100+ million developers. Whether you're building developer tools, analyzing open-source trends, or recruiting engineers, GitHub data is a goldmine. But the official API's rate limits can be a serious bottleneck.

GitHub API Rate Limits: The Problem

GitHub's REST API allows:

  • 60 requests/hour for unauthenticated requests
  • 5,000 requests/hour with a personal access token

The Search API has its own, much tighter limit on top of that: 30 requests/minute with a token, 10/minute without.

That sounds generous until you need to scan thousands of repos or profile hundreds of developers. Fetching full details for each of an organization's 500 repos, at one request per repo, burns 10% of your hourly budget before you've done any real analysis.
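Before each batch of requests, it's worth knowing how close you are to the wall. The API reports this in `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers; here's a small helper (my own naming, not an official client) that turns those headers into a sleep duration:

```python
import time

def seconds_until_reset(headers, buffer=1.0):
    """Given GitHub's rate-limit response headers, return how long to
    sleep before the next request (0 if quota remains)."""
    remaining = int(headers.get("X-RateLimit-Remaining", "0"))
    if remaining > 0:
        return 0.0
    reset = int(headers.get("X-RateLimit-Reset", "0"))  # Unix epoch seconds
    return max(0.0, reset - time.time()) + buffer

# Example: quota exhausted, window resets 30 seconds from now
hdrs = {
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": str(int(time.time()) + 30),
}
wait = seconds_until_reset(hdrs)  # roughly 31 seconds
```

Call this after every response and `time.sleep(wait)` when it's non-zero; that alone keeps a long-running script from dying on 403s.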

Three Approaches to GitHub Data at Scale

1. Direct API with Smart Pagination

The most straightforward approach — use the API directly but be smart about it:

import requests
import time

TOKEN = "ghp_your_token"
headers = {"Authorization": f"token {TOKEN}"}

def search_repos(query, max_results=100):
    repos = []
    page = 1
    while len(repos) < max_results:
        resp = requests.get(
            "https://api.github.com/search/repositories",
            headers=headers,
            params={"q": query, "per_page": 100, "page": page},  # 100 is the max page size
            timeout=10,
        )
        resp.raise_for_status()

        # Respect rate limits: sleep until the window resets if we're low
        remaining = int(resp.headers.get("X-RateLimit-Remaining", 0))
        if remaining < 5:
            reset = int(resp.headers["X-RateLimit-Reset"])
            time.sleep(max(0, reset - time.time()) + 1)

        items = resp.json().get("items", [])
        repos.extend(items)
        if len(items) < 100:  # last page reached
            break
        page += 1

    return repos[:max_results]

# Find popular Python AI repos
results = search_repos("language:python topic:ai stars:>100")
for repo in results:
    print(f"{repo['full_name']}: ⭐ {repo['stargazers_count']}")

This works for small-scale needs but falls apart when you need data on thousands of entities.
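One more trick before giving up on the official API: GitHub supports conditional requests, and a 304 Not Modified response does not count against your rate limit. A minimal in-memory sketch (a real app would persist the cache to disk):

```python
import requests

# Cache of URL -> (etag, cached_json). In-memory only for this sketch.
_cache = {}

def get_cached(url, headers=None):
    """Fetch a GitHub API URL with a conditional request.
    A 304 Not Modified response doesn't count against the rate limit."""
    headers = dict(headers or {})
    if url in _cache:
        headers["If-None-Match"] = _cache[url][0]
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        return _cache[url][1]  # unchanged since last fetch: reuse cached body
    data = resp.json()
    etag = resp.headers.get("ETag")
    if etag:
        _cache[url] = (etag, data)
    return data
```

For polling workloads (watching the same repos over and over), most responses come back 304, which effectively multiplies your quota.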

2. Free API Endpoint (No Rate Limits)

I built a free API that proxies GitHub data without the rate limit headaches:

https://frog03-20494.wykr.es/api/v1/github

5 modes available:

| Mode | Endpoint | Description |
|------|----------|-------------|
| search-repos | `?mode=search-repos&q=fastapi` | Search repositories |
| search-users | `?mode=search-users&q=python` | Search users |
| user-profile | `?mode=user-profile&username=torvalds` | Full user profile |
| repo-details | `?mode=repo-details&repo=facebook/react` | Repository details |
| org-repos | `?mode=org-repos&org=microsoft` | Organization repos |

Example usage:

import requests

# Search for FastAPI-related repos
resp = requests.get(
    "https://frog03-20494.wykr.es/api/v1/github",
    params={"mode": "search-repos", "q": "fastapi", "limit": 20}
)

for repo in resp.json()["items"]:
    print(f"{repo['full_name']}: ⭐ {repo['stars']}")
# Quick CLI usage
curl "https://frog03-20494.wykr.es/api/v1/github?mode=user-profile&username=torvalds"

No API key needed. No rate limits for reasonable usage.

3. Cloud Scraper for Large-Scale Collection

For serious data collection — thousands of repos, bulk user profiles, full org analysis — use our GitHub Scraper on Apify:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Scrape all repos from an organization
run = client.actor("cryptosignals/github-scraper").call(
    run_input={
        "mode": "org-repos",
        "organization": "microsoft",
        "includeReadme": True,
        "maxItems": 500
    }
)

for repo in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{repo['name']}: {repo['language']} | ⭐ {repo['stars']}")

This runs in the cloud with automatic rate limit handling, pagination, and structured output.
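Once the dataset is downloaded, the structured output is easy to post-process. For example, a quick primary-language breakdown of the scraped repos (field names follow the loop above; the sample records here are made up for illustration):

```python
from collections import Counter

def language_breakdown(repos):
    """Tally primary languages across scraped repo records,
    most common first. Each record is a dict with a 'language' field."""
    counts = Counter(r.get("language") or "Unknown" for r in repos)
    return counts.most_common()

sample = [
    {"name": "vscode", "language": "TypeScript"},
    {"name": "terminal", "language": "C++"},
    {"name": "playwright", "language": "TypeScript"},
]
print(language_breakdown(sample))
# → [('TypeScript', 2), ('C++', 1)]
```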

Practical Use Cases

Developer Recruiting

Find active contributors in specific technologies:

# Find top Python developers in Berlin
resp = requests.get(
    "https://frog03-20494.wykr.es/api/v1/github",
    params={
        "mode": "search-users",
        "q": "location:Berlin language:python followers:>50"
    }
)

for user in resp.json()["items"]:
    print(f"{user['login']} - {user['bio']}")

Open Source Trend Analysis

Track which technologies are gaining traction by monitoring repo creation rates, star velocity, and fork patterns.
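"Star velocity" can be as simple as average stars per day since the repo was created, computed from the `created_at` timestamp the API returns:

```python
from datetime import datetime, timezone

def star_velocity(stars, created_at, now=None):
    """Average stars gained per day since the repo was created.
    `created_at` is an ISO-8601 timestamp as returned by the GitHub API."""
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    days = max((now - created).total_seconds() / 86400, 1.0)  # avoid div-by-zero
    return stars / days

# A repo that earned 1,000 stars in its first 10 days
v = star_velocity(1000, "2024-01-01T00:00:00Z",
                  now=datetime(2024, 1, 11, tzinfo=timezone.utc))
# → 100.0 stars/day
```

Comparing this metric across snapshots taken a week apart is a cheap way to spot projects that are suddenly accelerating.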

Competitive Intelligence

Monitor competitor engineering activity — what languages they're adopting, what projects they're open-sourcing, and who they're hiring.

Dependency Auditing

Map your dependency tree and monitor the health of critical open-source projects your product relies on.
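A rough health check can be built from fields the GitHub API already returns (`pushed_at`, `open_issues_count`, `archived`); the thresholds below are arbitrary illustrations, not recommendations:

```python
from datetime import datetime, timezone

def health_flags(repo, now=None):
    """Return a list of warning flags for a dependency, based on
    standard GitHub API repo fields. Thresholds are illustrative only."""
    now = now or datetime.now(timezone.utc)
    pushed = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    flags = []
    if repo.get("archived"):
        flags.append("archived")
    if (now - pushed).days > 365:
        flags.append("no pushes in over a year")
    if repo.get("open_issues_count", 0) > 500:
        flags.append("large open-issue backlog")
    return flags

repo = {"pushed_at": "2020-01-01T00:00:00Z",
        "archived": False, "open_issues_count": 10}
print(health_flags(repo, now=datetime(2022, 1, 1, tzinfo=timezone.utc)))
# → ['no pushes in over a year']
```

Run this over every repo in your lockfile's dependency list and you get a simple early-warning report for abandonware.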

Choosing the Right Approach

| Need | Best Option |
|------|-------------|
| Quick lookups, < 100 requests | GitHub API directly |
| Medium scale, no API key hassle | Free API endpoint |
| Large-scale bulk collection | GitHub Scraper on Apify |

Conclusion

GitHub data is immensely valuable for developer tools, recruiting, market research, and competitive intelligence. The official API is great but rate-limited. For anything beyond casual use, consider our free API endpoint or the full GitHub Scraper for cloud-scale collection.

All the code examples above work today — try them out and let me know what you build with the data.
