Devil Scrapes

Posted on May 31

GitHub Organization Scraper: export any org's profile + repos for $3/1K

#webscraping #python #apify #data

Quick answer: GitHub has a public REST API at api.github.com/orgs/{slug} that returns org metadata — display name, location, blog, follower count, public repo count. A GitHub organization scraper calls that endpoint in bulk, fans out to each org's public repo list, and returns everything as clean, typed JSON. The Apify Actor below does it for $0.003 per org (~$3.00 per 1,000), with fingerprint rotation, proxy threading, and Pydantic-validated rows handled for you.

GitHub organisations are an underused public data surface in developer-tools research. Every company that ships an open-source library has an org page exposing its name, description, blog URL, public repo and follower counts, location, and the canonical list of its public repositories. There is no download button and no bulk export API. If you want 500 of these records in a spreadsheet, you either write a script or use a hosted Actor that already handles everything your script will hit on the third run.

What is a GitHub Organisation?

A GitHub Organisation is an account type designed for teams and companies. Unlike personal accounts, orgs own repositories on behalf of the whole team and expose a public profile at github.com/{org-slug}. The GitHub REST API surfaces org metadata — name, description, location, blog, creation date, follower count, public repo and gist counts — plus the list of public repositories sorted by last-push date.

Does GitHub have a bulk export API for organisations?

No. The GitHub REST API lets you read a single org at GET /orgs/{org} and its repos at GET /orgs/{org}/repos, but there is no endpoint that accepts a list of org slugs and returns all their metadata in one payload. You iterate, one org at a time. That is fine for five orgs. For five hundred, you need a loop that handles rate limits, retries on 429s, paginates repos correctly, and doesn't lose partial progress when GitHub's unauthenticated cap (60 requests/hour) kicks in after the first few dozen calls.
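One piece of that loop is knowing how long to pause once the quota runs out. Here is a minimal sketch, assuming you read the x-ratelimit-remaining and x-ratelimit-reset response headers that GitHub sends with REST responses. The helper name seconds_until_reset is hypothetical and is not part of the Actor.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to sleep before the next call, or 0.0 if quota remains.

    Hypothetical helper. x-ratelimit-reset is a UTC epoch timestamp in seconds.
    """
    # Header names are case-insensitive on the wire, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(lowered.get("x-ratelimit-reset", "0"))
    current = time.time() if now is None else now
    # Never return a negative wait if the reset time has already passed.
    return max(0.0, reset_at - current)
```

Call it with the headers of the last response and sleep for the returned number of seconds before the next request. Passing now explicitly is only there to make the function easy to test.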

What the data looks like

Each organisation comes back as one flat, typed row. Every field is declared in the Pydantic ResultRow model, so there are no surprises at the other end:

{
  "login": "apify",
  "name": "Apify",
  "description": "Web scraping and automation platform.",
  "company": null,
  "blog": "https://apify.com",
  "location": "Prague, Czechia",
  "email": null,
  "twitter_username": null,
  "public_repos": 412,
  "public_gists": 0,
  "followers": 3820,
  "html_url": "https://github.com/apify",
  "avatar_url": "https://avatars.githubusercontent.com/u/24586296?v=4",
  "members_url_template": "https://api.github.com/orgs/apify/members{/member}",
  "type": "Organization",
  "is_verified": null,
  "created_at": "2017-06-08T13:22:00Z",
  "updated_at": "2024-11-15T11:40:00Z",
  "repos": [
    {
      "name": "apify-sdk-python",
      "full_name": "apify/apify-sdk-python",
      "html_url": "https://github.com/apify/apify-sdk-python",
      "description": "Apify SDK for Python",
      "language": "Python",
      "stargazers_count": 312,
      "forks_count": 48,
      "fork": false,
      "archived": false,
      "pushed_at": "2026-05-28T10:12:00Z"
    }
  ],
  "scraped_at": "2026-05-31T09:00:00+00:00"
}

Twenty top-level fields, including the optional repos array. Pydantic validates every row before it hits the dataset: no None where an int belongs, no silent field truncation.

The naive approach (and why it falls apart)

The first thing everyone who knows the GitHub REST API tries:

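import requests

org_list = ["apify", "openai"]  # placeholder slugs; substitute your own list

for slug in org_list:
    # No token, no retry, no status check: every call draws on the 60/hour quota.
    r = requests.get(f"https://api.github.com/orgs/{slug}")
    print(r.json())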

This works for the first 60 requests. Then it stops, because:

1. Unauthenticated rate limits. GitHub's REST API allows 60 requests/hour without a token. At that rate a 500-org list takes over 8 hours. With a GitHub token the ceiling lifts to 5,000/hour — but the token must be threaded through correctly, and a shared IP can still hit secondary rate limits.

2. TLS fingerprinting still matters. Datacenter IP ranges get 403s on anything resembling automated traffic at scale. We run every request through curl-cffi with rotating impersonation profiles — chrome131, chrome124, firefox147, safari180 — so the TLS handshake looks like a browser client, not Python's stdlib SSL. We thread Apify residential proxies and rotate session_id on every block so GitHub sees a distribution of exit IPs, not a single datacenter CIDR.

3. Repo lists are paginated. The GET /orgs/{slug}/repos endpoint returns at most 100 repos per page and doesn't return a total count. For a large org you have to walk the Link header. The Actor fetches up to maxReposPerOrg repos (default 30, max 100), a limit that is documented and exposed in the input rather than applied silently.

4. A 404 isn't a failure you should stop on. Org slugs in a lead-gen list are frequently stale: renamed, deleted, or pointing to a personal account. The bare loop above never checks the status code, so it prints GitHub's "Not Found" error body as if it were data, and once you add raise_for_status() the first stale slug crashes the run. We log a warning, skip the row, and carry on. We retry with exponential backoff (start 2s, double, cap 30s, five attempts) on 408 / 429 / 5xx and honour Retry-After headers. On partial success we surface a clear status message instead of an empty dataset with a green run status.
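The backoff schedule in point 4 can be sketched as two small pure functions. This illustrates the stated numbers (start 2s, double, cap 30s, five attempts) and is not the Actor's actual code. The names retry_delay and should_retry are hypothetical, and treating only an integer Retry-After value as authoritative is an assumption.

```python
from typing import Optional

# Statuses worth retrying, per the list above: 408, 429 and any 5xx.
RETRYABLE_STATUSES = {408, 429} | set(range(500, 600))


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 2.0, cap: float = 30.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    # An integer Retry-After header from the server takes priority.
    # (Assumption: HTTP-date values are ignored and fall back to backoff.)
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    # 2s, 4s, 8s, 16s, then capped at 30s.
    return min(cap, base * 2 ** (attempt - 1))


def should_retry(status: int, attempt: int, max_attempts: int = 5) -> bool:
    """Retry only on 408 / 429 / 5xx, and only while attempts remain."""
    return status in RETRYABLE_STATUSES and attempt < max_attempts
```

A 404 never passes should_retry, which is what lets the caller log it and move on to the next slug instead of burning retries on an org that no longer exists.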

None of that is exotic — but it's the difference between a script that works today and a feed that still runs next week.
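For point 3, walking the Link header comes down to pulling out the rel="next" URL until there isn't one. This is a minimal stdlib sketch, and the function name next_page_url is hypothetical. If you already use requests, its response.links attribute exposes the same parsed header.

```python
import re
from typing import Optional

# Matches one entry of a Link header, e.g. <https://...page=2>; rel="next"
_LINK_PART = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the rel="next" URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    for url, rel in _LINK_PART.findall(link_header):
        if rel == "next":
            return url
    return None
```

A pagination loop requests the first page with per_page=100, then keeps following next_page_url(response.headers.get("Link")) until it returns None or your own repo cap is reached.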

The Actor

The result is on the Apify Store: GitHub Organisation Scraper.

Open the Apify Console and paste your list of org slugs, or call it programmatically with the Apify Python client:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/github-org-scraper").call(
    run_input={
        "orgs": ["apify", "anthropics", "openai", "microsoft", "vercel"],
        "includeRepos": True,
        "maxReposPerOrg": 30,
        "githubToken": "ghp_yourReadOnlyToken",  # optional but recommended
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["login"], item["public_repos"], item["location"])

All six input parameters are documented and validated by Pydantic before the Actor touches the network. Pass a GitHub personal access token with public read-only scope and the per-hour limit lifts from 60 to 5,000 — enough for most research lists. The Actor also accepts full GitHub URLs (https://github.com/apify) and strips them to the slug automatically, so you can paste raw links from a browser tab.
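If you want to clean a mixed list of slugs and URLs yourself before submitting it, the normalisation is short. This sketch shows one way to do it. It is not the Actor's own implementation, and to_org_slug is a hypothetical name.

```python
from urllib.parse import urlparse


def to_org_slug(value: str) -> str:
    """Reduce 'https://github.com/apify/' or ' apify ' to the bare slug 'apify'."""
    text = value.strip()
    if "github.com" in text.lower():
        # Tolerate links pasted without a scheme, e.g. github.com/apify
        if "://" not in text:
            text = "https://" + text
        # Keep only the first path segment: /apify/some-repo -> apify
        segments = urlparse(text).path.strip("/").split("/")
        return segments[0]
    return text.strip("/")
```

Repo links such as https://github.com/apify/apify-sdk-python collapse to the owning org, which is usually what you want when the source list came from browser tabs.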

What you would actually use this for

Four concrete use cases:

DevRel competitive mapping. Pull every org in your competitive landscape — your SDK's adopters, rivals, and adjacent tools. Compare public_repos, followers, and created_at in a spreadsheet. At $3 per 1,000 orgs, a 200-org sweep costs $0.60.

Sales / BD qualification. Your outbound list has company names but no GitHub signals. Feed the org slugs in, filter for public_repos > 20 and followers > 500 to score engineering-heavy targets before a rep spends time on them.

Hiring target research. Filter by location and public_repos to find engineering-dense orgs in your city, then use the repos language distribution (Python vs Go vs Rust) to target teams likely to hire for your stack.

Dependency and M&A intelligence. Pull the full public repo list with includeRepos: true and combine the output with the GitHub Repo Scraper to inventory the tech stack of an acquisition target.
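Two of these filters are easy to sketch against the row shape shown earlier. The function names engineering_heavy and language_mix are hypothetical, and the default thresholds are simply the ones quoted in the sales example.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def engineering_heavy(rows: Iterable[Dict[str, Any]],
                      min_repos: int = 20, min_followers: int = 500) -> List[str]:
    """Logins of orgs strictly above both thresholds."""
    return [
        row["login"] for row in rows
        # Treat a missing or null count as 0 so sparse rows don't raise.
        if (row.get("public_repos") or 0) > min_repos
        and (row.get("followers") or 0) > min_followers
    ]


def language_mix(row: Dict[str, Any]) -> Counter:
    """Count repos per primary language, skipping repos with no language set."""
    return Counter(
        repo["language"] for repo in row.get("repos") or [] if repo.get("language")
    )
```

Keep in mind that language_mix only sees the repos included in the row, so with the default maxReposPerOrg of 30 it describes an org's most recently pushed repos, not its whole history.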

Pricing — exact numbers

Pay-Per-Event. You pay for orgs that land in the dataset; nothing for orgs that 404 or error out.

$0.005 per run (covers warm-up and token handshake)
$0.003 per org written to the dataset

Orgs	Cost
100	$0.31
1,000	$3.01
5,000	$15.01
10,000	$30.01

Apify's free $5 trial credit covers your first ~1,660 orgs with no credit card. Pay-per-event suits research workloads where volume spikes one week and sits idle the next — you never pay for an idle subscription.
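To budget a run before launching it, the arithmetic is one per-run fee plus one per-org fee for each row written. This sketch uses the two prices listed above. The function names are hypothetical, and Decimal is used so the sub-cent amounts stay exact.

```python
from decimal import Decimal

RUN_FEE = Decimal("0.005")  # charged once per run
ORG_FEE = Decimal("0.003")  # charged per org written to the dataset


def run_cost(orgs_written: int) -> Decimal:
    """Exact cost in USD of one run that writes `orgs_written` rows."""
    return RUN_FEE + ORG_FEE * orgs_written


def orgs_for_budget(budget_usd) -> int:
    """Largest number of orgs a single run can write within the budget."""
    budget = Decimal(str(budget_usd))
    if budget <= RUN_FEE:
        return 0
    return int((budget - RUN_FEE) // ORG_FEE)
```

The exact totals are $0.305 for 100 orgs and $3.005 for 1,000, which match the table once rounded up to the cent. Under these assumptions a $5 credit spent in a single run covers 1,665 orgs, in line with the approximate figure above.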

The technically interesting bit

GitHub's unauthenticated rate limit is enforced per source IP, not per session or per user-agent. On a shared datacenter range your first 60 calls might all "succeed," but you're depleting a pool shared with dozens of other tenants on the same machine. We route through Apify residential proxies (real exit nodes with rotating IPs) so each request slot burns from a fresh allocation. On authenticated calls the limit follows the account that owns the token, not the IP, and the Actor threads the token through every concurrent session without sharing state between parallel fetches.

The concurrency parameter (default 4, max 16) lets you tune parallelism to your token's remaining headroom, and the Actor reports via set_status_message if it starts hitting rate limits mid-run.
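Bounded parallelism of this kind is commonly built on a semaphore. The sketch below shows that general pattern with asyncio. It assumes nothing about how the Actor implements its concurrency setting, and fetch_all and fetch_one are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def fetch_all(slugs: List[str],
                    fetch_one: Callable[[str], Awaitable[dict]],
                    concurrency: int = 4) -> Dict[str, dict]:
    """Run fetch_one for every slug with at most `concurrency` calls in flight."""
    gate = asyncio.Semaphore(concurrency)

    async def guarded(slug: str) -> dict:
        # Each task waits here until one of the `concurrency` slots is free.
        async with gate:
            return await fetch_one(slug)

    # gather preserves input order, so results line up with slugs.
    results = await asyncio.gather(*(guarded(slug) for slug in slugs))
    return dict(zip(slugs, results))
```

Lowering concurrency slows how quickly the hourly quota is consumed, which is the knob to reach for when a token is close to its limit.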

Limitations (the honest list)

Member lists are out of scope. GET /orgs/{slug}/members returns only an org's publicly visible members unless the caller is itself a member of that org, so most member lists look incomplete from the outside. We return the members_url_template field so you can call it yourself with an appropriate token, but the Actor doesn't fan out to member profiles.
Private repos, teams, and projects are never returned. This is a public read-only Actor; authenticated private data is out of scope.
Some fields are frequently null. email, location, blog, and twitter_username are user-supplied and most orgs skip them. We surface what GitHub publishes — we never infer or guess.
Repo cap is 100 per org per run. maxReposPerOrg caps at 100 (one API page); for orgs with thousands of repos you'll only see the 100 most recently pushed.
GitHub's verified-org flag is sparse. is_verified is true only for orgs that have gone through GitHub's verification flow. Most orgs return null.

FAQ

Is scraping GitHub org data legal?
The GitHub REST API /orgs endpoints are public, unauthenticated endpoints for public data — the same data visible at github.com/{slug} in a browser. The Actor operates well within GitHub's acceptable use policies for accessing publicly available information at respectful request rates. Use this Actor responsibly and within your own jurisdiction's applicable rules.

Can I export the output to a spreadsheet or warehouse?
Yes — the Apify Console has one-click CSV, Excel, and JSON export from any dataset. You can also webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or pull via the Apify REST API directly into Pandas, BigQuery, or any warehouse.

Is there an official GitHub org bulk-export API?
No. The GitHub REST API reads one org at a time. The GitHub GraphQL API can batch queries, but you still need to manage pagination, rate limits, and the cursor protocol yourself. This Actor bundles all of that.
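To make the GraphQL option concrete, batching works by giving each organization lookup its own alias inside a single query document. The sketch below only builds the query string. The helper name build_batch_query is hypothetical, the selected fields are a small illustrative subset, and you would still need an authenticated POST to the GraphQL endpoint plus cursor handling for orgs with more repos than one page.

```python
import json
from typing import List


def build_batch_query(slugs: List[str], repos_per_org: int = 30) -> str:
    """Build one GraphQL document that fetches several orgs via aliases."""
    parts = []
    for index, slug in enumerate(slugs):
        # Aliases must be valid GraphQL names, so use org0, org1, ... rather
        # than the slug itself (slugs may contain hyphens).
        parts.append(
            f"  org{index}: organization(login: {json.dumps(slug)}) {{\n"
            f"    login name description location websiteUrl createdAt\n"
            f"    repositories(first: {repos_per_org}, privacy: PUBLIC) {{\n"
            f"      totalCount\n"
            f"      pageInfo {{ hasNextPage endCursor }}\n"
            f"      nodes {{ name stargazerCount isArchived pushedAt }}\n"
            f"    }}\n"
            f"  }}"
        )
    return "query {\n" + "\n".join(parts) + "\n}"
```

The pageInfo block is the cursor protocol mentioned above: when hasNextPage is true, the next request has to pass endCursor back as the after argument for that org's repositories.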

How does this differ from the GitHub Repo Scraper and GitHub User Scraper?
Different scope, different row shape. This Actor returns one row per org with the org's profile fields and an optional repos array. The GitHub Repo Scraper returns one row per repository (all fields including topics, license, and contributor count). The GitHub User Scraper returns personal account profiles. All three can be chained: org scraper to discover the org → repo scraper to detail its repos → user scraper to profile key contributors.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/github-org-scraper.

Free $5 trial credit, no credit card. Paste five org slugs, hit Start, and you'll have a clean JSON dataset in under a minute. Need a field from the GitHub API that isn't exposed yet? Leave a comment — the Actor ships updates weekly.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈
