Devil Scrapes

Posted on Jun 1

GitHub Repo Scraper: bulk-extract repository metadata for $2.00/1K

#webscraping #python #apify #data

Quick answer: The GitHub REST API is public and documented, but turning a list of owner/repo slugs into a clean, typed dataset at any kind of scale means wiring up parallel fetches, exponential-backoff retry on secondary rate limits, token rotation, and two optional sub-resource calls per repo (languages and latest release). The GitHub Repository Scraper handles all of that and returns 25-field Pydantic-validated rows at $0.002 per repo — about $2.00 per 1,000 results. No token required for small batches; bring your own for the 5,000-requests/hour tier.

Someone on my team needed to benchmark 800 competitor repos for a DevRel quarterly review: stars, forks, open issues, primary language, latest release tag, and whether any had gone archived. The GitHub API gives you all of it — in theory. In practice the job was three days of plumbing: a paginated client, GitHub's secondary rate limit (the "you're making requests too fast" one that doesn't count against the hard cap), fan-out for the /languages and /releases/latest sub-resources, and enough retry logic to survive a token that hit a burst limit mid-run. I spent more time on infrastructure than on the analysis.

That pattern repeats. Here's the whole picture, and the one-call alternative.

What is the GitHub REST API? 🔎

The GitHub REST API is GitHub's programmatic interface for public repository data. It exposes repo metadata — stars, forks, language, topics, license, default branch, creation date — plus separate endpoints for language breakdowns (/repos/{owner}/{repo}/languages) and releases (/repos/{owner}/{repo}/releases/latest). Unauthenticated callers get 60 requests/hour per IP; with a personal access token (PAT) that rises to 5,000/hour. GitHub also enforces a secondary rate limit on burst concurrency that it describes only vaguely and adjusts silently.
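To make the primary quota concrete: GitHub reports it on every response through the `x-ratelimit-limit`, `x-ratelimit-remaining`, and `x-ratelimit-reset` headers. Here is a minimal sketch of reading them, assuming the headers arrive as a plain dict. The `quota_status` helper is a hypothetical name for illustration, not part of any library.

```python
import time


def quota_status(headers, now=None):
    """Summarise GitHub's primary rate-limit headers from one response.

    `headers` is assumed to be a plain dict of response headers.
    """
    now = time.time() if now is None else now
    # Header names are case-insensitive on the wire, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    limit = int(lowered["x-ratelimit-limit"])
    remaining = int(lowered["x-ratelimit-remaining"])
    reset_epoch = int(lowered["x-ratelimit-reset"])  # UTC epoch seconds
    return {
        "limit": limit,
        "remaining": remaining,
        "seconds_until_reset": max(0, reset_epoch - int(now)),
        "exhausted": remaining == 0,
    }


# Illustrative header values, not captured from a real response.
example = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1700003600",
}
status = quota_status(example, now=1700000000)
print(status)
```

These headers only describe the primary limit. The secondary limit has no equivalent counter, which is part of why it is harder to plan around.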

For ten repos, it's fine. For 800 it becomes a project.

Does GitHub have a bulk-repo-metadata API?

No. As of 2026, GitHub ships no endpoint that accepts a list of slugs and returns all metadata in one call. You call /repos/{owner}/{repo} per repo, plus two more calls for languages and latest-release — up to three round trips per repo, multiplied by your list size and bounded by rate limits. The GraphQL API lets you batch some fields, but it has its own cost model (points per node) and still requires token management at scale.
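The arithmetic behind "up to three round trips per repo" is worth writing down before you plan a batch. This is a rough sketch: `request_plan` is a hypothetical helper, and it ignores retries and the secondary rate limit, so treat the hours as a lower bound.

```python
def request_plan(n_repos, include_languages=True, include_latest_release=True,
                 requests_per_hour=5000):
    """Estimate total API calls and the minimum hours the primary limit allows."""
    # One base call per repo, plus one per enabled sub-resource.
    calls_per_repo = 1 + int(include_languages) + int(include_latest_release)
    total_calls = n_repos * calls_per_repo
    return total_calls, total_calls / requests_per_hour


# 800 repos with a token (5,000 requests/hour).
calls, hours = request_plan(800)
print(calls, hours)

# The same batch on the unauthenticated tier (60 requests/hour).
calls_anon, hours_anon = request_plan(800, requests_per_hour=60)
print(calls_anon, hours_anon)
```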

What the data looks like

Each repo comes back as one flat, typed row, validated by a Pydantic model before it's written to the dataset:

{
  "owner": "apify",
  "name": "apify-sdk-python",
  "full_name": "apify/apify-sdk-python",
  "html_url": "https://github.com/apify/apify-sdk-python",
  "description": "The Apify SDK for Python.",
  "fork": false,
  "archived": false,
  "disabled": false,
  "stargazers_count": 415,
  "forks_count": 41,
  "watchers_count": 415,
  "open_issues_count": 12,
  "size_kb": 3840,
  "language": "Python",
  "languages": {
    "Python": 198432,
    "Dockerfile": 812
  },
  "topics": ["apify", "scraping", "sdk"],
  "license": "Apache-2.0",
  "default_branch": "main",
  "homepage": "https://docs.apify.com/sdk/python",
  "created_at": "2022-06-14T12:00:00Z",
  "updated_at": "2026-05-28T08:14:22Z",
  "pushed_at": "2026-05-27T16:45:03Z",
  "latest_release_tag": "v3.4.0",
  "latest_release_published_at": "2026-05-10T09:00:00Z",
  "scraped_at": "2026-05-31T10:22:14Z"
}

Twenty-five fields, stable schema, ISO-8601 timestamps throughout. Drop it into Pandas, push it to BigQuery, or feed it into a RAG pipeline — no positional-array wrangling on your side.
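One thing the flat row makes easy: the `languages` map holds bytes of code per language, so turning it into percentage shares is a few lines. This is a stdlib-only sketch run against the sample row above; `language_shares` is a hypothetical helper, not something the Actor ships.

```python
def language_shares(languages):
    """Convert a bytes-per-language map into percentage shares."""
    total = sum(languages.values())
    if total == 0:
        # Empty repos (or runs with languages disabled) have nothing to share out.
        return {}
    return {name: 100 * size / total for name, size in languages.items()}


row = {
    "full_name": "apify/apify-sdk-python",
    "language": "Python",
    "languages": {"Python": 198432, "Dockerfile": 812},
}
shares = language_shares(row["languages"])
# Print the largest share first.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.2f}%")
```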

The naive approach (and why it falls apart) ⚠️

The first attempt looks obvious: loop over repo slugs, call requests.get(f"https://api.github.com/repos/{slug}"), parse JSON. Here's where that unravels at any real scale:

1. The secondary rate limit. GitHub's primary rate limit (60/hr unauthenticated, 5,000/hr with a token) is the one everyone knows. The secondary limit, which throttles burst concurrency, is the one that breaks scripts in production. It fires 403 Forbidden (or 429 Too Many Requests) with a Retry-After header, and the interval varies by load. We handle this by retrying with exponential backoff on 408/429/5xx and honouring every Retry-After header, up to 5 attempts per resource.

2. Fan-out for sub-resources. Stars and forks are in the base response; languages and latest-release are separate endpoints. With both enabled, each repo is up to 3 calls; across 800 repos that's 2,400 requests, nearly half of a token's 5,000/hour budget. We fan them out concurrently with a configurable concurrency parameter (default 6; up to 8 comfortably with a token).

3. Missing repos and error surfacing. A repo might not exist, might have been deleted, or might be temporarily unavailable. A script that silently drops errors gives you a dataset short by an unknown number of rows. We log a warning on every skipped repo and write the partial dataset — you always know exactly what you got.

4. Token management. A single PAT at 5,000/hr sounds like enough until you run two jobs in parallel. We accept an optional githubToken, use it when present, and fall back gracefully to the unauthenticated tier — while rotating curl-cffi browser fingerprints (Chrome, Firefox, Safari TLS profiles) so every request looks like a browser, not a Python script.
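The retry behaviour in point 1 can be sketched as two small decisions: whether to retry, and how long to wait. This is not the Actor's actual code. The base delay, the cap, the jitter, and the choice to retry a 403 that carries Retry-After are all assumptions made for illustration; only the status classes and the 5-attempt ceiling come from the text above.

```python
import random

RETRYABLE = {408, 429}
MAX_ATTEMPTS = 5  # per resource, as stated above


def should_retry(status, headers, attempt):
    """Decide whether a failed response deserves another attempt.

    `attempt` counts from 1. Treating a 403 that carries Retry-After as
    retryable is an assumption made for this sketch.
    """
    if attempt >= MAX_ATTEMPTS:
        return False
    if status in RETRYABLE or 500 <= status <= 599:
        return True
    return status == 403 and "retry-after" in {k.lower() for k in headers}


def retry_delay(attempt, headers, base=1.0, cap=60.0, rng=random.random):
    """Seconds to sleep before the next attempt.

    A numeric Retry-After header wins; otherwise use capped exponential
    backoff with full jitter. `base` and `cap` are illustrative values.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    retry_after = lowered.get("retry-after", "")
    if retry_after.strip().isdigit():
        return float(retry_after)
    return rng() * min(cap, base * 2 ** (attempt - 1))


print(should_retry(403, {"Retry-After": "30"}, attempt=1))
print(retry_delay(1, {"Retry-After": "30"}))
```

Retry-After may also arrive as an HTTP date; this sketch only handles the numeric form and falls back to backoff otherwise.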

None of this is heroic engineering. It's the mandatory plumbing that costs 1–2 dev-weeks before any analysis begins. We've already paid that cost — and we charge $0.002 per repo.

The Actor 🔥

The result is an Apify Actor: GitHub Repository Scraper. Paste a list of slugs in the Apify Console and click Start, or run it from Python:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/github-repo-scraper").call(
    run_input={
        "repos": [
            "apify/apify-sdk-python",
            "apify/crawlee-python",
            "openai/openai-python",
            "langchain-ai/langchain",
        ],
        "includeLanguages": True,
        "includeLatestRelease": True,
        "concurrency": 6,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["full_name"], item["stargazers_count"], item["language"])

Input parameters:

Field	Default	Notes
`repos`	`["apify/apify-sdk-python", "apify/crawlee-python"]`	`owner/repo` or full GitHub URL
`githubToken`	(none)	Optional PAT — lifts rate limit from 60 to 5,000/hr
`includeLanguages`	`true`	Adds `languages` map; one extra call per repo
`includeLatestRelease`	`true`	Adds `latest_release_tag` + timestamp; one extra call
`concurrency`	`6`	Parallel requests — 8 is the comfortable ceiling with a token
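
Since `repos` accepts either an `owner/repo` slug or a full GitHub URL, it can help to normalise your list before submitting it, for example to de-duplicate entries. How the Actor parses input internally isn't shown here, so the `normalize_slug` helper below is a hypothetical sketch of one reasonable approach.

```python
from urllib.parse import urlparse


def normalize_slug(value):
    """Return 'owner/repo' from a slug or a github.com URL, else raise ValueError."""
    text = value.strip()
    is_url = "://" in text
    if is_url:
        parsed = urlparse(text)
        if parsed.netloc.lower() not in {"github.com", "www.github.com"}:
            raise ValueError(f"not a GitHub URL: {value!r}")
        text = parsed.path
    parts = [part for part in text.split("/") if part]
    # URLs may carry extra path segments (/tree/main); bare slugs may not.
    if len(parts) < 2 or (not is_url and len(parts) != 2):
        raise ValueError(f"expected owner/repo, got {value!r}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"{owner}/{repo}"


inputs = [
    "apify/crawlee-python",
    "https://github.com/apify/crawlee-python/tree/master",
    "https://github.com/openai/openai-python.git",
]
# A set comprehension removes duplicates once every entry has the same shape.
unique = sorted({normalize_slug(item) for item in inputs})
print(unique)
```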

Output goes to the run's default dataset — exportable as JSON, CSV, or Excel from the Console or via the Apify API.

What you'd actually use this for

Five concrete patterns:

Open-source watch list. Track stars, forks, and open issues across your top 20 competitor repos weekly and pipe the diff to Slack — new releases, sustained growth, or a sudden spike in open issues (often a production incident on their side) all become first-class signals.

Dependency-health scanning. Feed the repos your stack depends on and flag anything archived: true or where pushed_at is more than six months old. A repo with 50k stars and no commits in 18 months is a liability your security team needs to know about — the same signal tools like Snyk and Socket.dev surface, built from raw metadata.

M&A / investment diligence. Quantify the open-source surface area of a target company. Stars, fork count, release cadence (latest_release_published_at), and language diversity feed an "OSS health score" that VCs and acquirers increasingly ask for. Pull 200 repos in a single run.

AI / RAG corpus building. The description, topics, language, and languages fields make repo metadata an excellent seed for a RAG retrieval layer — especially for developer tools, where the repo IS the documentation. Feed 10,000 repos into a vector store and you have a searchable "what does this library do" knowledge base.

Hiring research. Pull metadata for the repos a candidate lists on their profile. Stars, topics, and activity timestamps are a rough proxy for technical reputation and recency — a first pass at scale, not a replacement for code review.
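
The dependency-health pattern above reduces to two checks per row: is the repo archived, and how long ago was the last push? A minimal sketch follows, assuming rows shaped like the sample output. The 183-day threshold stands in for "six months", the `health_flags` helper is hypothetical, and the second row is made up for the example.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse the dataset's ISO-8601 timestamps (e.g. '2026-05-27T16:45:03Z')."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def health_flags(row, now, max_idle_days=183):
    """Return the reasons a dependency looks risky; an empty list means healthy."""
    flags = []
    if row.get("archived"):
        flags.append("archived")
    idle = now - parse_ts(row["pushed_at"])
    if idle > timedelta(days=max_idle_days):
        flags.append(f"no push in {idle.days} days")
    return flags


now = datetime(2026, 5, 31, tzinfo=timezone.utc)
rows = [
    {"full_name": "apify/apify-sdk-python", "archived": False,
     "pushed_at": "2026-05-27T16:45:03Z"},
    {"full_name": "example/abandoned", "archived": True,  # made-up row
     "pushed_at": "2024-11-01T00:00:00Z"},
]
report = {row["full_name"]: health_flags(row, now) for row in rows}
for name, flags in report.items():
    print(name, flags or "ok")
```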

Pricing — exact numbers 💰

Pay-per-event — you pay for repos you get, nothing for the infrastructure.

Event	Price
Actor start (one-time per run)	$0.005
Result emitted (per repo)	$0.002

Volume	Cost
100 repos	$0.21
1,000 repos	$2.01
10,000 repos	$20.01
100,000 repos (big benchmark)	$200.01

Apify's $5 free trial credit covers your first ~2,490 repos with no credit card. A PAT in your input lifts the rate-limit ceiling and makes large runs faster.
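
If you want to budget a run before starting it, the two event prices above are all you need. This is a small sketch using `Decimal` so the fractions of a cent don't drift; the helper names are hypothetical.

```python
from decimal import Decimal

START_FEE = Decimal("0.005")  # charged once per run
PER_REPO = Decimal("0.002")   # charged per result emitted


def run_cost(n_repos):
    """Exact cost in USD of one run that emits `n_repos` results."""
    return START_FEE + PER_REPO * n_repos


def repos_within(budget):
    """Largest number of repos a single run can emit within `budget` USD."""
    remaining = Decimal(str(budget)) - START_FEE
    return max(0, int(remaining // PER_REPO))


for volume in (100, 1_000, 10_000, 100_000):
    print(volume, run_cost(volume))
print(repos_within(5))
```

The table rounds each total up to the nearest cent. Every extra run spends another start fee, so splitting a budget across several runs covers slightly fewer repos than one large run.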

The technically interesting part

Here's the piece most blog posts skip: GitHub's secondary rate limit fires on burst concurrency, not just total volume. The docs publish a few thresholds (such as a cap of 100 concurrent requests) but warn that the limits can change without notice and can trigger for undisclosed reasons. In practice the sweet spot with a PAT is around 6–8 concurrent requests; without a token, 2–3 is the reliable ceiling before 403s start eating your retry budget.

We encode this as the concurrency default (6), and the input description tells users why the number matters. We pace requests with a Retry-After parser: if GitHub asks us to wait 30 seconds, we wait and report partial progress rather than silently blocking. The curl-cffi layer rotates four TLS profiles (Chrome 131, Chrome 124, Firefox 147, Safari 18.0) so the handshake stays indistinguishable from a real browser.
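
A concurrency ceiling like this is commonly enforced with a semaphore. The sketch below shows the idea with `asyncio` and a stand-in fetch function so it runs offline. It illustrates the general technique and is not the Actor's implementation; `fetch_all` and `fake_fetch` are hypothetical names.

```python
import asyncio


async def fetch_all(slugs, fetch_one, concurrency=6):
    """Run `fetch_one(slug)` for every slug with at most `concurrency` in flight."""
    gate = asyncio.Semaphore(concurrency)

    async def guarded(slug):
        async with gate:
            return await fetch_one(slug)

    # gather preserves input order, so results line up with `slugs`.
    return await asyncio.gather(*(guarded(slug) for slug in slugs))


# Stand-in for a real HTTP call; it records how many calls overlap.
stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(slug):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # pretend to wait on the network
    stats["in_flight"] -= 1
    return {"full_name": slug}


slugs = [f"owner/repo-{i}" for i in range(20)]
results = asyncio.run(fetch_all(slugs, fake_fetch, concurrency=6))
print(len(results), "rows, peak concurrency", stats["peak"])
```

Swapping `fake_fetch` for a real request function keeps the ceiling intact, because every call has to pass through the same semaphore.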

Limitations

Private repos require a scoped token. A PAT grants access to whatever that token can see, but this Actor is built for public-data extraction. Don't reuse production tokens with broad scopes.

README content and commit graphs are out of scope. This Actor returns the repos endpoint metadata. Code content, commit history, issues, and pull requests are separate GitHub APIs with their own rate-limit envelopes — they belong in a dedicated Actor.

Stars/forks can be slightly stale. GitHub caches some counts for a few minutes at high traffic. If you need real-time counts, run twice 5+ minutes apart and compare.

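Large batches without a token are slow. At 60 requests/hour unauthenticated, 800 repos take about 13 hours for the base metadata alone, and about 40 hours with both sub-resources enabled (2,400 requests). Bring a PAT for any serious batch job.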

No contributor count. Despite the field appearing in some store copy, contributor count is not in the current schema — the /contributors endpoint is one paginated call per repo and was deferred. Don't rely on a field that isn't in the output above.

FAQ

Is scraping GitHub repo metadata legal?
The GitHub Terms of Service permit automated access to public data via the documented REST API. This Actor calls only documented, public GitHub endpoints at reasonable request rates — it doesn't scrape the HTML UI or touch private data. As always, verify your own jurisdiction and use case.

Does GitHub have a bulk-repo API?
No. The REST API is per-repo; the GraphQL API lets you batch some fields but still requires token management at scale. There is no "give me all repos in this list" endpoint.

Can I export to Google Sheets or a data warehouse?
Yes — export JSON, CSV, or Excel from the Console, webhook the dataset on ACTOR.RUN.SUCCEEDED into Make/Zapier/n8n, or pull the rows via the Apify Dataset API.

What if a repo slug doesn't exist?
The Actor logs a warning, skips that repo, and continues. The final dataset will have fewer rows than your input list — the run log tells you exactly which slugs were skipped and why.
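
If you'd rather compute the gap than read it off the log, you can compare your input list against the `full_name` values that came back. This is a minimal sketch with a hypothetical `skipped_slugs` helper. One caveat: a renamed repo may come back under its new `full_name` and would then show up here as skipped.

```python
def skipped_slugs(requested, rows):
    """Return requested slugs with no matching `full_name` in the dataset rows."""
    # GitHub treats owner and repo names case-insensitively, so compare lowered.
    returned = {row["full_name"].lower() for row in rows}
    return [slug for slug in requested if slug.lower() not in returned]


requested = ["apify/crawlee-python", "Apify/apify-sdk-python", "nobody/does-not-exist"]
rows = [
    {"full_name": "apify/crawlee-python"},
    {"full_name": "apify/apify-sdk-python"},
]
print(skipped_slugs(requested, rows))
```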

Try it

The Actor is on the Apify Store: apify.com/DevilScrapes/github-repo-scraper.

Free $5 trial credit, no credit card. Drop in 10 slugs and you'll have a clean 25-field dataset in under a minute. Need a field that isn't there — contributor count, latest commit SHA, branch protection rules? Open a request in the Actor's Issues tab. We ship based on what people actually use.

What metadata would make your repo-monitoring pipeline complete?

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈
