DEV Community

agenthustler

Scraping GitHub Data in 2026: Repos, Users, and Organizations via API

GitHub hosts over 400 million repositories and 100+ million developers. Whether you're building developer tools, analyzing open-source trends, or recruiting engineers, GitHub data is a goldmine. But the official API's rate limits can be a serious bottleneck.

GitHub API Rate Limits: The Problem

GitHub's REST API allows:

  • 60 requests/hour for unauthenticated requests
  • 5,000 requests/hour with a personal access token

The Search API has its own, much tighter limit on top of that: 30 requests/minute with a token, 10/minute without.

That sounds generous until you need to scan thousands of repos or profile hundreds of developers. Fetching full details for each of an organization's 500 repos, at one request per repo, burns 10% of your hourly budget before you've done any real analysis.
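Before each batch of requests, it's worth knowing how close you are to the wall. The API reports this in `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers; here's a small helper (my own naming, not an official client) that turns those headers into a sleep duration:

```python
import time

def seconds_until_reset(headers, buffer=1.0):
    """Given GitHub's rate-limit response headers, return how long to
    sleep before the next request (0 if quota remains)."""
    remaining = int(headers.get("X-RateLimit-Remaining", "0"))
    if remaining > 0:
        return 0.0
    reset = int(headers.get("X-RateLimit-Reset", "0"))  # Unix epoch seconds
    return max(0.0, reset - time.time()) + buffer

# Example: quota exhausted, window resets 30 seconds from now
hdrs = {
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": str(int(time.time()) + 30),
}
wait = seconds_until_reset(hdrs)  # roughly 31 seconds
```

Call this after every response and `time.sleep(wait)` when it's non-zero; that alone keeps a long-running script from dying on 403s.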

Three Approaches to GitHub Data at Scale

1. Direct API with Smart Pagination

The most straightforward approach — use the API directly but be smart about it:

import requests
import time

TOKEN = "ghp_your_token"
headers = {"Authorization": f"token {TOKEN}"}

def search_repos(query, max_results=100):
    repos = []
    page = 1
    while len(repos) < max_results:
        resp = requests.get(
            "https://api.github.com/search/repositories",
            headers=headers,
            params={"q": query, "per_page": 100, "page": page},  # 100 is the max page size
            timeout=10,
        )
        resp.raise_for_status()

        # Respect rate limits: sleep until the window resets if we're low
        remaining = int(resp.headers.get("X-RateLimit-Remaining", 0))
        if remaining < 5:
            reset = int(resp.headers["X-RateLimit-Reset"])
            time.sleep(max(0, reset - time.time()) + 1)

        items = resp.json().get("items", [])
        repos.extend(items)
        if len(items) < 100:  # last page reached
            break
        page += 1

    return repos[:max_results]

# Find popular Python AI repos
results = search_repos("language:python topic:ai stars:>100")
for repo in results:
    print(f"{repo['full_name']}: ⭐ {repo['stargazers_count']}")

This works for small-scale needs but falls apart when you need data on thousands of entities.
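One more trick before giving up on the official API: GitHub supports conditional requests, and a 304 Not Modified response does not count against your rate limit. A minimal in-memory sketch (a real app would persist the cache to disk):

```python
import requests

# Cache of URL -> (etag, cached_json). In-memory only for this sketch.
_cache = {}

def get_cached(url, headers=None):
    """Fetch a GitHub API URL with a conditional request.
    A 304 Not Modified response doesn't count against the rate limit."""
    headers = dict(headers or {})
    if url in _cache:
        headers["If-None-Match"] = _cache[url][0]
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:
        return _cache[url][1]  # unchanged since last fetch: reuse cached body
    data = resp.json()
    etag = resp.headers.get("ETag")
    if etag:
        _cache[url] = (etag, data)
    return data
```

For polling workloads (watching the same repos over and over), most responses come back 304, which effectively multiplies your quota.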

2. Free API Endpoint (No Rate Limits)

I built a free API that proxies GitHub data without the rate limit headaches:

https://frog03-20494.wykr.es/api/v1/github

5 modes available:

| Mode | Endpoint | Description |
|------|----------|-------------|
| search-repos | `?mode=search-repos&q=fastapi` | Search repositories |
| search-users | `?mode=search-users&q=python` | Search users |
| user-profile | `?mode=user-profile&username=torvalds` | Full user profile |
| repo-details | `?mode=repo-details&repo=facebook/react` | Repository details |
| org-repos | `?mode=org-repos&org=microsoft` | Organization repos |

Example usage:

import requests

# Search for FastAPI-related repos
resp = requests.get(
    "https://frog03-20494.wykr.es/api/v1/github",
    params={"mode": "search-repos", "q": "fastapi", "limit": 20}
)

for repo in resp.json()["items"]:
    print(f"{repo['full_name']}: ⭐ {repo['stars']}")
# Quick CLI usage
curl "https://frog03-20494.wykr.es/api/v1/github?mode=user-profile&username=torvalds"

No API key needed. No rate limits for reasonable usage.

3. Cloud Scraper for Large-Scale Collection

For serious data collection — thousands of repos, bulk user profiles, full org analysis — use our GitHub Scraper on Apify:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Scrape all repos from an organization
run = client.actor("cryptosignals/github-scraper").call(
    run_input={
        "mode": "org-repos",
        "organization": "microsoft",
        "includeReadme": True,
        "maxItems": 500
    }
)

for repo in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{repo['name']}: {repo['language']} | ⭐ {repo['stars']}")

This runs in the cloud with automatic rate limit handling, pagination, and structured output.
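Once the dataset is downloaded, the structured output is easy to post-process. For example, a quick primary-language breakdown of the scraped repos (field names follow the loop above; the sample records here are made up for illustration):

```python
from collections import Counter

def language_breakdown(repos):
    """Tally primary languages across scraped repo records,
    most common first. Each record is a dict with a 'language' field."""
    counts = Counter(r.get("language") or "Unknown" for r in repos)
    return counts.most_common()

sample = [
    {"name": "vscode", "language": "TypeScript"},
    {"name": "terminal", "language": "C++"},
    {"name": "playwright", "language": "TypeScript"},
]
print(language_breakdown(sample))
# → [('TypeScript', 2), ('C++', 1)]
```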

Practical Use Cases

Developer Recruiting

Find active contributors in specific technologies:

# Find top Python developers in Berlin
resp = requests.get(
    "https://frog03-20494.wykr.es/api/v1/github",
    params={
        "mode": "search-users",
        "q": "location:Berlin language:python followers:>50"
    }
)

for user in resp.json()["items"]:
    print(f"{user['login']} - {user['bio']}")

Open Source Trend Analysis

Track which technologies are gaining traction by monitoring repo creation rates, star velocity, and fork patterns.
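"Star velocity" can be as simple as average stars per day since the repo was created, computed from the `created_at` timestamp the API returns:

```python
from datetime import datetime, timezone

def star_velocity(stars, created_at, now=None):
    """Average stars gained per day since the repo was created.
    `created_at` is an ISO-8601 timestamp as returned by the GitHub API."""
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    days = max((now - created).total_seconds() / 86400, 1.0)  # avoid div-by-zero
    return stars / days

# A repo that earned 1,000 stars in its first 10 days
v = star_velocity(1000, "2024-01-01T00:00:00Z",
                  now=datetime(2024, 1, 11, tzinfo=timezone.utc))
# → 100.0 stars/day
```

Comparing this metric across snapshots taken a week apart is a cheap way to spot projects that are suddenly accelerating.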

Competitive Intelligence

Monitor competitor engineering activity — what languages they're adopting, what projects they're open-sourcing, and who they're hiring.

Dependency Auditing

Map your dependency tree and monitor the health of critical open-source projects your product relies on.
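A rough health check can be built from fields the GitHub API already returns (`pushed_at`, `open_issues_count`, `archived`); the thresholds below are arbitrary illustrations, not recommendations:

```python
from datetime import datetime, timezone

def health_flags(repo, now=None):
    """Return a list of warning flags for a dependency, based on
    standard GitHub API repo fields. Thresholds are illustrative only."""
    now = now or datetime.now(timezone.utc)
    pushed = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    flags = []
    if repo.get("archived"):
        flags.append("archived")
    if (now - pushed).days > 365:
        flags.append("no pushes in over a year")
    if repo.get("open_issues_count", 0) > 500:
        flags.append("large open-issue backlog")
    return flags

repo = {"pushed_at": "2020-01-01T00:00:00Z",
        "archived": False, "open_issues_count": 10}
print(health_flags(repo, now=datetime(2022, 1, 1, tzinfo=timezone.utc)))
# → ['no pushes in over a year']
```

Run this over every repo in your lockfile's dependency list and you get a simple early-warning report for abandonware.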

Choosing the Right Approach

| Need | Best Option |
|------|-------------|
| Quick lookups, < 100 requests | GitHub API directly |
| Medium scale, no API key hassle | Free API endpoint |
| Large-scale bulk collection | GitHub Scraper on Apify |

Conclusion

GitHub data is immensely valuable for developer tools, recruiting, market research, and competitive intelligence. The official API is great but rate-limited. For anything beyond casual use, consider our free API endpoint or the full GitHub Scraper for cloud-scale collection.

All the code examples above work today — try them out and let me know what you build with the data.
