Devil Scrapes

Posted on Jun 1

GitHub User Scraper: extract profile data for recruiting & OSINT

#webscraping #python #apify #data

Quick answer: A GitHub user scraper calls the GitHub REST API for each username or org slug you provide and returns a typed JSON row with name, bio, company, location, blog, follower count, public repo count, hireable flag, account creation date, and — optionally — the user's top public repos. The Apify Actor below bundles all of that for $0.003 per profile (~$3.00 per 1,000), with rate-limit pacing, retries, and Pydantic-validated output handled for you.

GitHub is the closest thing the software industry has to a public professional directory. Every public profile carries self-declared company, location, bio keywords, and a follower graph: exactly the signals a technical recruiter, DevRel team, or OSINT analyst wants. The problem is extraction at scale: the unauthenticated REST endpoint allows only 60 requests per hour per IP. Authenticated, you get 5,000, which covers a thousand-profile sweep on paper but only once you have built a token layer and a retry harness. Here's what that work involves, and how I collapsed it to one API call.

What is GitHub's user API? 🔎

GitHub exposes every public user and organisation profile via its REST API at GET /users/{login}. The endpoint is documented, stable, and intentionally public — GitHub's own web UI and mobile app hit it. It returns a JSON object covering identity, stats, and timestamps.

What it gives you per account:

Profile identity: login, display name, bio, company, blog URL, location, public email, and linked X (Twitter) handle.
Activity stats: public repository count, public gist count, follower count, following count.
Account metadata: hireable flag, account creation date, last-profile-updated timestamp.
Repo summary (optional, one extra call): the user's public repos with name, stars, and language.

What /users/{login} does not give you: private repo count, contribution graph, sponsor relationships, starred repos, or private-org membership. Some of these live on other endpoints (public stars are at GET /users/{login}/starred), some require OAuth scope elevation or GraphQL, and some are simply not exposed via the public API.
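A minimal sketch of a single lookup against GET /users/{login}, using only the standard library. The helper names (build_user_request, fetch_user) and the User-Agent string are made up for illustration; this is not the Actor's code, just the smallest request that follows GitHub's documented conventions.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_user_request(login, token=None):
    """Build the GET /users/{login} request (hypothetical helper)."""
    headers = {
        # GitHub's recommended media type for REST responses.
        "Accept": "application/vnd.github+json",
        # GitHub rejects requests without a User-Agent; this value is arbitrary.
        "User-Agent": "profile-lookup-sketch",
    }
    if token:
        # A token raises the hourly ceiling from 60 to 5,000 requests.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{API_ROOT}/users/{login}", headers=headers)


def fetch_user(login, token=None):
    """Fetch one profile and return the decoded JSON object."""
    request = build_user_request(login, token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling fetch_user("apify") returns the raw GitHub object, which carries more keys than the row shown later in this post (API URLs, node_id, and so on).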

Does GitHub have an official bulk export for user profiles?

No. The GitHub REST API is a per-profile lookup, not a bulk-export endpoint. GET /users/{login} returns one user per call. There is no POST /users/batch or CSV-download surface. The Search API (/search/users) lets you query by keyword but returns abbreviated objects and caps results at 1,000 per query — not a reliable pipeline for sourcing at scale. The only way to get a clean, fully-typed dataset for a list of usernames is to call /users/{login} for each one, handle rate limits and transient failures, and validate the response shape before it lands.

What the data looks like

Each profile comes back as one flat, typed row. Here is a real one from the apify organisation:

{
  "login": "apify",
  "type": "Organization",
  "name": "Apify",
  "company": null,
  "blog": "https://apify.com",
  "location": "Prague, Czechia",
  "email": null,
  "bio": null,
  "twitter_username": "apify",
  "public_repos": 412,
  "public_gists": 0,
  "followers": 856,
  "following": 2,
  "html_url": "https://github.com/apify",
  "avatar_url": "https://avatars.githubusercontent.com/u/24586296?v=4",
  "hireable": null,
  "created_at": "2017-01-11T14:36:55Z",
  "updated_at": "2026-05-20T09:14:12Z",
  "repos": null,
  "scraped_at": "2026-05-28T11:03:44+00:00"
}

Twenty fields, the same shape every run, validated with Pydantic before it lands in your dataset. It drops directly into Pandas, a PostgreSQL insert, or an ATS webhook — no response-shape wrangling on your end.
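To make "same shape every run" concrete, here is a standard-library stand-in for that validation step. It is not the Actor's Pydantic model: the schema below is read off the sample row, which fields are nullable is an assumption, and validate_row is a hypothetical name.

```python
from typing import Any

NoneType = type(None)

# Field -> accepted types, read off the sample row above. Which fields are
# nullable is an assumption; the Actor's real ResultRow model may differ.
ROW_SCHEMA: dict[str, tuple[type, ...]] = {
    "login": (str,),
    "type": (str,),
    "name": (str, NoneType),
    "company": (str, NoneType),
    "blog": (str, NoneType),
    "location": (str, NoneType),
    "email": (str, NoneType),
    "bio": (str, NoneType),
    "twitter_username": (str, NoneType),
    "public_repos": (int,),
    "public_gists": (int,),
    "followers": (int,),
    "following": (int,),
    "html_url": (str,),
    "avatar_url": (str,),
    "hireable": (bool, NoneType),
    "created_at": (str,),
    "updated_at": (str,),
    "repos": (list, NoneType),
    "scraped_at": (str,),
}


def validate_row(row: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the row fits the schema."""
    problems = []
    for field, allowed in ROW_SCHEMA.items():
        if field not in row:
            problems.append(f"{field}: missing")
            continue
        value = row[field]
        # bool is a subclass of int, so reject True/False in integer fields.
        wrong_bool = isinstance(value, bool) and bool not in allowed
        if wrong_bool or not isinstance(value, allowed):
            problems.append(f"{field}: unexpected {type(value).__name__}")
    return problems
```

A row where followers arrives as the string "856" comes back with one problem instead of slipping into a DataFrame or an INSERT unnoticed.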

The naive approach (and why it falls apart) 🔥

The first thing any Python developer tries:

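import httpx

usernames = ["torvalds", "gvanrossum"]  # whatever list you start from
data = []
for username in usernames:
    # No token, no pacing, no retries, no status check
    r = httpx.get(f"https://api.github.com/users/{username}")
    data.append(r.json())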

This works for five users. It falls apart at fifty. Three reasons, and they are exactly what a hosted Actor earns its keep solving:

1. Rate-limit exhaustion at 60 req/hour. Without a token, GitHub returns 403 after 60 requests per hour per IP. GitHub's rate-limit docs put the ceiling at 5,000/hour with authentication — but that still means token management, secret storage, and a refresh strategy. We pace against the X-RateLimit-Remaining and X-RateLimit-Reset headers and slow down before the wall rather than hammering into it.

2. Transient failures at scale. A thousand-profile sweep takes minutes, and in that window you'll see 408, 429, and occasional 503 responses from GitHub's CDN edge. A naive loop either crashes or silently skips rows. We make up to five attempts per request, with exponential backoff that starts at two seconds and is capped at thirty, and we honour Retry-After headers when present. Partial success surfaces as a status message, not a silently-truncated dataset.

3. Response schema drift. A plain r.json() pipeline emits whatever the wire sends — including None where downstream expects an integer. We run every response through a Pydantic ResultRow model at write time, so your dataset never gets a row where followers is a string. No data, no charge.
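The pacing and retry rules in points 1 and 2 boil down to two small delay calculations. The sketch below is one way to write them, not the Actor's source: the function names, the `floor` threshold of five remaining requests, and the choice to leave Retry-After uncapped are all assumptions.

```python
import time

MAX_ATTEMPTS = 5     # total tries per request, as described above
BASE_DELAY = 2.0     # seconds before the second attempt
MAX_DELAY = 30.0     # ceiling for the exponential schedule
RETRYABLE = {408, 429, 503}


def pacing_delay(headers, now=None, floor=5):
    """Seconds to wait before the next request, from GitHub's rate-limit headers.

    `headers` is any mapping with the exact header names used below
    (httpx response headers are case-insensitive, a plain dict is not).
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))  # epoch seconds
    window = max(reset_at - now, 0.0)
    if remaining <= 0:
        return window              # budget spent: wait for the reset
    if remaining <= floor:
        return window / remaining  # nearly spent: spread what is left
    return 0.0


def backoff_delay(attempt, retry_after=None):
    """Delay after a failed attempt (1-based). Retry-After in seconds wins."""
    if retry_after is not None:
        try:
            return float(retry_after)
        except ValueError:
            pass  # HTTP-date form: fall back to the exponential schedule
    return min(BASE_DELAY * 2 ** (attempt - 1), MAX_DELAY)
```

With these constants the schedule runs 2, 4, 8, 16 seconds between the five attempts; the thirty-second cap only bites if you raise MAX_ATTEMPTS.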

The Actor ⚙️

I packaged the result as an Apify Actor: GitHub User & Org Scraper.

Paste a list of usernames in the Apify Console and click Start, or call it programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/github-user-scraper").call(
    run_input={
        "usernames": ["torvalds", "gvanrossum", "antirez", "apify"],
        "includeRepos": True,
        "maxReposPerUser": 10,
        "concurrency": 6,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["login"], item["followers"], item.get("hireable"))

The usernames field accepts plain handles (torvalds), profile URLs (https://github.com/torvalds — the host is stripped automatically), and organisation slugs (apify). Supply an optional githubToken to raise the rate ceiling from 60 to 5,000 requests per hour. Set includeRepos: true to append a repos array to each row (capped at maxReposPerUser, default 30, ceiling 100) — useful for keyword-matching against repo descriptions.
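If you want to clean your list before submitting it, the host-stripping step can be approximated in a few lines. This is a guess at the behaviour described above, not the Actor's parser, and normalize_login is a hypothetical name.

```python
from urllib.parse import urlparse


def normalize_login(raw: str) -> str:
    """Reduce a handle or profile URL to a bare login.

    Plain handles and org slugs pass through unchanged; for URLs, the first
    path segment is taken as the login. Returns "" if a URL has no path.
    """
    value = raw.strip()
    if "://" in value:
        path = urlparse(value).path
        value = path.strip("/").split("/")[0]
    return value
```

Deduplicating on the normalised value before a run is a cheap way to avoid fetching the same profile twice when a list mixes handles and URLs.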

Concurrency defaults to 6 parallel fetches (ceiling 32). With a token you can push it up; without one, keep it at or below 6 to stay inside the hourly budget.

What you would actually use this for 💡

Four concrete patterns, not generic "developer intelligence":

Technical recruiting sweep. Pull 500 GitHub accounts whose bios mention "Rust" and whose location contains "Berlin", then filter to public_repos > 20 and hireable: true. That's your cold-outreach list for a Rust-engineer search, assembled without a LinkedIn Recruiter seat. At $3.00 per 1,000 profiles, a 500-name sweep costs $1.50.

ATS profile enrichment. When a candidate applies with a GitHub URL, pipe it through the Actor via a Make/Zapier webhook on ACTOR.RUN.SUCCEEDED. You get followers, public_repos, bio, blog, and company alongside the raw URL — enough to pre-score before a screen call.

DevRel contributor mapping. Your open-source project merged 200 PRs this year. Pull contributor usernames through the Actor and rank by followers to identify high-signal voices for a community newsletter or conference invite.

Developer OSINT. A pseudonymous developer posts as user-xyz across GitHub and a personal blog. Their profile may declare company, location, blog, and twitter_username — fields the web UI shows but no bulk-download surface assembles. The Actor returns all of them in one call.
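The recruiting filter and the DevRel ranking are both a few lines over the dataset rows. A rough sketch, assuming rows shaped like the sample output; the function names and the substring matching are illustrative choices, not part of the Actor.

```python
def matches_search(row, keyword, city, min_repos=20):
    """True if a profile row passes the recruiting filter described above."""
    bio = (row.get("bio") or "").lower()            # bio may be null
    location = (row.get("location") or "").lower()  # location may be null
    return (
        keyword.lower() in bio
        and city.lower() in location
        and (row.get("public_repos") or 0) > min_repos
        and row.get("hireable") is True             # null and False both fail
    )


def rank_by_followers(rows):
    """Rows sorted by follower count, highest first; null counts as zero."""
    return sorted(rows, key=lambda r: r.get("followers") or 0, reverse=True)
```

Typical use is `[r for r in rows if matches_search(r, "Rust", "Berlin")]` for the outreach list and `rank_by_followers(rows)[:20]` for the newsletter shortlist. Substring matching on free-text bios is crude, so expect to review the results by hand.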

Pricing — exact numbers 💰

Pay-per-event. You pay for profiles you receive, nothing for profiles you request that fail.

$0.005 per run (one-off warm-up charge, covers Actor boot)
$0.003 per profile written to the dataset

Pull	Cost (per-profile charges plus one $0.005 run charge, rounded to the cent)
100 profiles	$0.31
1,000 profiles	$3.01
10,000 profiles	$30.01
50,000 profiles (monthly ATS enrichment)	$150.01

Apify's $5 free trial credit covers roughly 1,660 profiles, no credit card required. For context, commercial enrichment APIs (Clearbit, Hunter, Apollo) typically charge $0.01–$0.10 per enriched contact — this Actor, at $0.003 per GitHub profile, is the raw-data layer underneath those services.
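The arithmetic behind these numbers is simple enough to check yourself. A small sketch using the two prices listed above; the function names are made up, and rounding half-up to the cent is an assumption that happens to reproduce the table.

```python
from decimal import Decimal, ROUND_HALF_UP

RUN_FEE = Decimal("0.005")      # charged once per run
PER_PROFILE = Decimal("0.003")  # charged per profile written


def run_cost(profiles: int) -> Decimal:
    """Exact cost of a single run that writes `profiles` rows."""
    return RUN_FEE + PER_PROFILE * profiles


def to_cents(amount: Decimal) -> Decimal:
    """Round to two decimals, half-up (assumed display rounding)."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def profiles_for_budget(budget: str, runs: int = 1) -> int:
    """How many profiles a budget covers when split across `runs` runs."""
    remaining = Decimal(budget) - RUN_FEE * runs
    return max(int(remaining // PER_PROFILE), 0)
```

By this arithmetic the $5 credit stretches to 1,665 profiles in a single run, and slightly fewer if you split the work across many runs, since each run pays the start charge again.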

The technically interesting part

The interesting engineering here is not the request layer — it is the response normalisation. GitHub returns hireable as true, false, or null depending on whether the user ever set it. The company field may contain @acme (an org reference), ACME Corp (free text), or null. The blog field may be a full URL, a scheme-less URL, or free text that is not a URL at all.

The Pydantic ResultRow model types each nullable field as T | None, so every state described above validates and downstream code never needs to distinguish "field missing" from "field present but null". The blog field is returned as-is, because the caller's pipeline is better positioned to decide whether to prefix https://. That's a deliberate choice the README documents.
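Since blog and company arrive untouched, the caller-side cleanup might look something like this. Both helpers are hypothetical and sit in your pipeline, not in the Actor; the scheme-less URL pattern is a deliberately loose heuristic.

```python
import re


def company_org_login(company):
    """Return the org login when company is an @org reference, else None."""
    if company and company.strip().startswith("@"):
        first_token = company.strip().split()[0]  # "@acme @other" -> "@acme"
        return first_token[1:] or None
    return None


def blog_as_url(blog):
    """Return a usable URL for the blog field, or None for free text."""
    if not blog:
        return None
    value = blog.strip()
    if re.match(r"^https?://", value, re.IGNORECASE):
        return value                # already a full URL
    # Scheme-less: host-like text with a dot-separated suffix and no spaces.
    if re.match(r"^[\w.-]+\.[a-z]{2,}(/\S*)?$", value, re.IGNORECASE):
        return "https://" + value
    return None                     # free text, not a URL
```

Returning None for free text keeps junk out of a URL column; if you would rather keep the raw string for manual review, store it alongside the cleaned value.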

Limitations 🚧

These are the real ones, not hedges:

No contribution graph. GitHub's public REST API does not expose the contribution heatmap. It requires GraphQL with authentication and is out of scope for v1.
No private org membership. You see only publicly-declared company text and public org memberships. Private-org membership needs OAuth scope read:org and is out of scope.
No starred repos. The Actor hits /users/{login} only. GET /users/{login}/starred requires a separate paginated call not currently included.
Email is often null. GitHub hides email by default unless the user explicitly makes it public. We surface whatever the public API returns; we do not probe commit metadata or noreply address patterns.
60 req/hour without a token. The unauthenticated ceiling is real. For lists longer than ~50 profiles, supply a githubToken or accept a slower run.
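Point-in-time snapshots. We fetch on demand and do not cache, so every row reflects the profile's state at fetch time (recorded in scraped_at). A profile that changed five minutes before the run shows its current state, but a stored dataset goes stale from that moment on. For freshness guarantees, schedule re-runs.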

FAQ ❓

Is scraping GitHub user profiles legal?
This Actor calls the documented, public GitHub REST API — the same endpoint GitHub's own mobile app uses. It reads only data GitHub makes publicly available. It does not bypass authentication, access private data, or exceed the rate limits for an authenticated session. As always, review your jurisdiction and intended use case before ingesting profile data into outreach systems.

Does it work for organisations, not just individual users?
Yes. The /users/{login} endpoint returns "type": "Organization" for org slugs and "type": "User" for personal accounts. The dataset shape is identical; hireable is always null for organisations.

Can I export to Google Sheets or a data warehouse?
Yes — export CSV, JSON, or Excel from the Apify Console after a run, webhook the dataset into Make/Zapier/n8n on ACTOR.RUN.SUCCEEDED, or pull rows via the Apify Dataset API. The Actor's output is standard Apify dataset format.

What is the difference between this Actor and the GitHub Repo Scraper?
This Actor targets user and organisation profiles (/users/{login}). The GitHub Repo Scraper targets repository metadata (/repos/{owner}/{repo}). Different endpoints, different schemas, designed to complement each other: fetch the user first to get their profile, then pass their login to the repo scraper to enumerate their repositories in detail.

Try it

The Actor is on the Apify Store: apify.com/DevilScrapes/github-user-scraper.

Free $5 trial credit, no credit card. Drop 20 GitHub usernames in the input box and you'll have clean, typed JSON in your dataset inside a minute. Want a field it doesn't return yet? Leave it in the comments — I ship from what people actually ask for.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈
