Devil Scrapes

Posted on Jun 1

GitHub Trending API: scrape today's hottest repos for $2.00/1K

#webscraping #python #apify #data

Quick answer: GitHub's Trending page (github.com/trending) has no public API and no bulk export. To get the data programmatically you scrape the HTML page. A GitHub Trending scraper fetches the page — filtered by programming language, spoken language, and time window — and returns every trending repo as a typed JSON row: rank, repo slug, description, total stars, stars gained in the window, fork count, primary language, and contributor avatars. The Apify Actor below does it for $0.002 per repo (~$2.00 per 1,000 results), with TLS fingerprint rotation and proxy handling managed for you.

GitHub's Trending page is one of the most watched signals in open source. Newsletter editors check it every Monday. Indie hackers scan it for saturating niches. VC analysts use it as early-warning radar for seed-stage OSS projects. The problem: no download button and no API. Just an HTML page that re-renders every ~6 hours, 25 repos at a time, with no stable RSS feed and no documented endpoint. If you want this data in a spreadsheet, a Slack bot, or a vector store, you extract it yourself. Here is what that involves, and how I got it to one API call.

What is the GitHub Trending page? 🔎

The GitHub Trending page (github.com/trending) is a curated leaderboard that ranks public repositories by stars gained in the last day, week, or month. It is not powered by the GitHub REST API — it is a separately rendered HTML view, generated by GitHub's internal heuristics, with no documented data contract.

Each card gives you:

The repo slug (owner/name) and a link to the repo
A description (the repo's tagline)
The primary language displayed on the card
Total star count and stars gained in the current window
Fork count
Up to five contributor avatar URLs ("built by" section)

The page accepts filters for time window (?since=daily|weekly|monthly), programming language (a path segment, as in github.com/trending/python), and spoken language (?spoken_language_code=en). None of them is documented. The whole thing is an HTML page, not an API.
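Those three filters combine into a single URL. Here is a minimal sketch of a URL builder, assuming the shape GitHub's own filter links use today (language as a path segment, spoken language as spoken_language_code). The function name build_trending_url is hypothetical, and since none of this is documented, the shape can change.

```python
from urllib.parse import quote, urlencode

TRENDING_BASE = "https://github.com/trending"
VALID_WINDOWS = ("daily", "weekly", "monthly")


def build_trending_url(language: str = "", since: str = "daily", spoken_language: str = "") -> str:
    """Build a Trending page URL from the three filters (hypothetical helper)."""
    if since not in VALID_WINDOWS:
        raise ValueError(f"since must be one of {VALID_WINDOWS}, got {since!r}")
    # Assumption: the programming language is a path segment, e.g. /trending/python.
    path = f"{TRENDING_BASE}/{quote(language.lower())}" if language else TRENDING_BASE
    params = {"since": since}
    if spoken_language:
        # Assumption: the spoken-language filter is the spoken_language_code parameter.
        params["spoken_language_code"] = spoken_language
    return f"{path}?{urlencode(params)}"
```

Calling build_trending_url("python", "weekly") gives the URL for this week's trending Python repos; leaving language empty gives the all-languages page.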

Does GitHub Trending have an API? 🤔

No. As of 2026, GitHub publishes no official API or bulk-export endpoint for the Trending page. The GitHub REST API has a search endpoint (GET /search/repositories) with a sort=stars option, but that orders the repositories matching your query by total star count. It does not return the curated trending list GitHub computes internally, so the two surfaces give different results. If your use case needs the trending list specifically, you scrape the HTML page. GitHub is aware of the gap; they have been discussing a Trending API since 2012 and it has not shipped.
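To see why the search endpoint is not a substitute, here is a sketch of the closest approximation people usually reach for: recently created repos, sorted by total stars. The function name and the seven-day cutoff are illustrative assumptions. The query only sees repos created inside the cutoff, so an older repo that spikes this week never shows up.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def search_url_recent_popular(language: str, today: date, days: int = 7) -> str:
    """Build a search URL for repos created in the last `days` days, most-starred first."""
    created_after = (today - timedelta(days=days)).isoformat()
    # Search qualifiers: language:<name> and created:><date>.
    query = f"language:{language} created:>{created_after}"
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": 25}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"
```

The result ranks by all-time stars within that slice, which is a different question from "what gained the most stars this week".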

What the data looks like

Every repo comes back as one flat, typed row:

{
  "rank": 1,
  "full_name": "microsoft/TypeScript",
  "owner": "microsoft",
  "name": "TypeScript",
  "html_url": "https://github.com/microsoft/TypeScript",
  "description": "TypeScript is JavaScript with syntax for types.",
  "language": "TypeScript",
  "stars_total": 101200,
  "stars_in_window": 412,
  "forks": 12800,
  "built_by": [
    "https://avatars.githubusercontent.com/u/7809484?s=40&v=4",
    "https://avatars.githubusercontent.com/u/4538978?s=40&v=4"
  ],
  "scraped_at": "2026-05-31T08:14:22+00:00",
  "window": "daily"
}

Thirteen fields, stable shape, Pydantic-validated before it lands in your dataset. The scraped_at ISO-8601 timestamp and window label let you append rows from multiple runs and still know which trending window each row came from.
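If you want to re-check rows on your side of the pipe, the shape is easy to mirror. The Actor uses Pydantic; the sketch below uses a standard-library dataclass so it runs without dependencies. The class name, the Optional markers, and the specific checks are assumptions for illustration, not the Actor's actual model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

VALID_WINDOWS = {"daily", "weekly", "monthly"}


@dataclass(frozen=True)
class TrendingRepo:
    """One dataset row, mirroring the thirteen fields in the sample above."""

    rank: int
    full_name: str
    owner: str
    name: str
    html_url: str
    description: Optional[str]  # assumption: a repo may have no tagline
    language: Optional[str]     # assumption: a repo may have no detected language
    stars_total: int
    stars_in_window: int
    forks: int
    built_by: List[str]
    scraped_at: str
    window: str

    def __post_init__(self) -> None:
        if self.rank < 1:
            raise ValueError("rank starts at 1")
        if self.full_name != f"{self.owner}/{self.name}":
            raise ValueError("full_name must equal owner/name")
        if min(self.stars_total, self.stars_in_window, self.forks) < 0:
            raise ValueError("counts cannot be negative")
        if self.window not in VALID_WINDOWS:
            raise ValueError(f"unknown window {self.window!r}")
        # Raises ValueError if the timestamp is not ISO-8601.
        datetime.fromisoformat(self.scraped_at)
```

Building it with TrendingRepo(**item) for each dataset item raises on the first malformed row instead of letting it slip into a spreadsheet.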

The naive approach (and why it falls apart) ⚙️

The obvious first attempt: fetch https://github.com/trending with requests, parse the HTML with BeautifulSoup, loop the cards. This works once in a dev environment, then breaks in production for three compounding reasons.

1. TLS fingerprinting and bot detection. GitHub inspects the TLS handshake signature of incoming requests. A plain Python HTTP library emits a JA3 fingerprint nothing like a real browser's, so GitHub's infrastructure throttles or blocks it, especially from datacenter IPs. We sidestep this by running curl-cffi with explicit browser impersonation (chrome131, chrome124, firefox147, safari180, rotating across requests), which replays a real browser's TLS ClientHello and HTTP/2 SETTINGS frame. To GitHub's TLS layer, the connection looks like an ordinary browser session, because the handshake matches what that browser sends.

2. Rate limiting and proxy session loss. Sending unauthenticated volume from a single datacenter IP triggers rate limiting fast. We thread Apify proxies through every request and rotate sessions on any 408 / 429 / 5xx response: fresh exit IP, fresh cookie jar, and exponential backoff that starts at 2 seconds and is capped at 20 seconds, with up to 4 attempts per page. When the attempts run out, the run reports the failure; we never return an empty dataset behind a green status light.

3. HTML contract drift. The Trending page's markup changes without notice. A brittle CSS selector that works today breaks silently next month, leaving a dataset full of nulls. Our parser fails loud on unexpected structure rather than swallowing bad parses. If GitHub changes the page shape, you see an error in the run log with the exact selector that missed, not a silent empty result.
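The fail-loud rule from the third reason fits in a few lines. This is a sketch of the pattern, not the Actor's parser: SelectorMissError and require_text are hypothetical names, and the lookup callable stands in for whatever HTML library does the actual selecting.

```python
from typing import Callable, Optional


class SelectorMissError(RuntimeError):
    """Raised when a selector that must match comes back empty."""

    def __init__(self, selector: str, card_index: int) -> None:
        super().__init__(f"selector {selector!r} matched nothing on card {card_index}")
        self.selector = selector
        self.card_index = card_index


def require_text(lookup: Callable[[str], Optional[str]], selector: str, card_index: int) -> str:
    """Return the stripped text for `selector`, or raise naming the selector that missed."""
    value = lookup(selector)
    if value is None or not value.strip():
        # Fail loud: a missing node is an error, never a silent None in the dataset.
        raise SelectorMissError(selector, card_index)
    return value.strip()
```

Every required field goes through require_text, so a markup change surfaces as one named selector in the log rather than a column of nulls.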

None of that is glamorous. All of it is the difference between a script that ran clean on your laptop last Tuesday and a feed that holds up in production next quarter.
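The first two reasons combine into one retry loop. Below is a minimal sketch under stated assumptions: the function names are hypothetical, the impersonation target names are copied from the text (which ones your installed curl-cffi version accepts is worth checking), and the proxy URL is a placeholder you supply. It is not the Actor's source.

```python
import time
from itertools import cycle
from typing import Callable, Optional, Sequence, Tuple

IMPERSONATION_TARGETS = ("chrome131", "chrome124", "firefox147", "safari180")


def is_retryable(status: int) -> bool:
    """408, 429 and any 5xx trigger a fresh session and another attempt."""
    return status in (408, 429) or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 20.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based), doubling up to `cap`."""
    return min(cap, base * 2 ** (attempt - 1))


def fetch_with_retries(
    fetch_once: Callable[[str], Tuple[int, str]],
    targets: Sequence[str] = IMPERSONATION_TARGETS,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_once with a rotating impersonation target until it succeeds or attempts run out."""
    rotation = cycle(targets)
    last_status: Optional[int] = None
    for attempt in range(1, max_attempts + 1):
        last_status, body = fetch_once(next(rotation))
        if 200 <= last_status < 300:
            return body
        if not is_retryable(last_status):
            raise RuntimeError(f"non-retryable HTTP {last_status}")
        if attempt < max_attempts:
            sleep(backoff_delay(attempt))
    # Fail loud instead of handing back an empty page.
    raise RuntimeError(f"gave up after {max_attempts} attempts, last HTTP {last_status}")


def curl_cffi_fetcher(url: str, proxy_url: Optional[str] = None) -> Callable[[str], Tuple[int, str]]:
    """Build a fetch_once backed by curl-cffi; each call is a new request with no shared cookies."""
    def fetch_once(target: str) -> Tuple[int, str]:
        # Imported lazily so the retry logic above works without curl-cffi installed.
        from curl_cffi import requests

        proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
        response = requests.get(url, impersonate=target, proxies=proxies, timeout=30)
        return response.status_code, response.text

    return fetch_once
```

With four attempts the waits are 2, 4, and 8 seconds, so the 20-second cap only matters if you raise max_attempts.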

The Actor 🛠️

I packaged this as an Apify Actor — GitHub Trending Scraper. Run it from the Apify Console by setting a language filter and clicking Start, or programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/github-trending-scraper").call(
    run_input={
        "since": "weekly",
        "language": "python",
        "maxResults": 25,
        "proxyConfiguration": {"useApifyProxy": True},
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["rank"], item["full_name"], item["stars_in_window"])

Input parameters:

Field	Default	Notes
`since`	`"daily"`	Time window: `"daily"`, `"weekly"`, or `"monthly"`
`language`	`""`	Language slug: `"python"`, `"typescript"`, `"rust"`, etc.
`spokenLanguage`	`""`	Spoken-language filter (ISO code, e.g. `"en"`, `"zh"`)
`maxResults`	`25`	Max repos to return (GitHub typically shows 25 per page)
`proxyConfiguration`	Apify Proxy	Datacenter is fine; residential available for higher-volume runs

Use cases

Five concrete patterns:

Newsletter automation. Pull the top 10 Python trending repos every Friday into your newsletter template, the workflow PyCoder's Weekly and TLDR Dev editors run by hand today. One Actor run, one $0.025 charge (the $0.005 start fee plus 10 × $0.002), done.

Tech-trend dashboards. Schedule daily runs across ten language filters into a time-series table, and chart which languages are surging week over week.

Investor scouting. Surface early-stage open-source projects gaining stars before they hit the Hacker News front page. A repo jumping from 200 to 800 stars in a single week is a meaningful signal — catching it in the trending window is the point.

Competitive alerts. Schedule a run and alert when a competitor's repo appears in the results. That is more direct than GitHub's watch notifications, which fire on repository activity such as issues, pull requests, and releases rather than on trending status.

AI agent data feed. Feed full_name, description, language, and stars_in_window into an LLM briefing pipeline that produces a daily "what's hot in OSS" digest. The typed rows slot into a vector store or prompt without cleanup.
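For the investor-scouting pattern, the useful number is how large the window gain is relative to where the repo started. Here is a small sketch over the dataset rows; the function names and both thresholds are illustrative assumptions, not part of the Actor.

```python
from typing import Dict, List


def growth_ratio(row: Dict) -> float:
    """Stars gained in the window divided by the star count before the window."""
    before = row["stars_total"] - row["stars_in_window"]
    if before <= 0:
        # Every star arrived inside the window; treat as unbounded growth.
        return float("inf")
    return row["stars_in_window"] / before


def breakout_repos(rows: List[Dict], min_ratio: float = 1.0, min_gain: int = 100) -> List[Dict]:
    """Keep rows that at least doubled (by default) with a meaningful absolute gain, fastest first."""
    hits = [
        row for row in rows
        if row["stars_in_window"] >= min_gain and growth_ratio(row) >= min_ratio
    ]
    return sorted(hits, key=growth_ratio, reverse=True)
```

A repo that went from 200 to 800 stars has stars_total 800 and stars_in_window 600, which is a ratio of 3.0; a large established repo gaining a few hundred stars scores close to zero and drops out.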

Pricing — exact numbers 💰

Pay-per-event. You pay for repos that land in your dataset, nothing for repos that never arrived.

Event	Price
Actor start (one-off per run)	$0.005
Result emitted (per repo)	$0.002

Pull	Cost
25 repos (one full page)	$0.055
100 repos	$0.205
1,000 repos	$2.005
10,000 repos (daily archival)	$20.005

Apify's $5 free trial credit covers your first ~2,490 repos with no credit card.
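The arithmetic behind the cost table is one start fee per run plus the per-repo fee. A minimal sketch, with run_cost as a hypothetical helper and Decimal used so the cents stay exact:

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")  # charged once per run
PER_RESULT_USD = Decimal("0.002")   # charged per repo emitted


def run_cost(repos: int, runs: int = 1) -> Decimal:
    """Total cost in USD for `repos` results spread across `runs` Actor runs."""
    if repos < 0 or runs < 1:
        raise ValueError("repos must be >= 0 and runs must be >= 1")
    return ACTOR_START_USD * runs + PER_RESULT_USD * repos
```

The table counts a single start fee per row. If you collect 1,000 repos as 40 separate runs of 25, run_cost(1000, runs=40) adds the start fee 40 times.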

The technically interesting part

The stars_in_window field is the one GitHub doesn't expose anywhere else. The REST API gives you a repo's total star count, but not how many stars it gained this week. The Trending page renders that number from an internal heuristics model, and it's the signal most people want. We parse it straight from the rendered HTML (the "412 stars today" line on the repo card, to use the daily sample row above) into the typed stars_in_window field on every row. No approximation, no delta from polling: the number GitHub itself rendered.
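Turning that rendered line into a number is a small parsing job. The sketch below assumes the card text reads "N stars today", "N stars this week", or "N stars this month", with thousands separators on larger counts; the function name is hypothetical, and the exact wording is GitHub's to change.

```python
import re
from typing import Tuple

# Assumed wording of the star-gain line on a Trending card.
_STARS_LINE = re.compile(
    r"(\d[\d,]*)\s+stars?\s+(today|this\s+week|this\s+month)", re.IGNORECASE
)
_WINDOW_BY_PHRASE = {"today": "daily", "this week": "weekly", "this month": "monthly"}


def parse_stars_in_window(text: str) -> Tuple[int, str]:
    """Return (stars gained, window label) from a card's star-gain line."""
    match = _STARS_LINE.search(text)
    if match is None:
        # Fail loud: an unparseable line is an error, not a zero.
        raise ValueError(f"no star-gain line found in {text!r}")
    count = int(match.group(1).replace(",", ""))
    phrase = " ".join(match.group(2).lower().split())
    return count, _WINDOW_BY_PHRASE[phrase]
```

Returning the window label alongside the count gives a cheap cross-check: if the label does not match the since value you requested, something upstream is off.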

Limitations 🚧

One page per time-window + language combination. GitHub Trending exposes ~25 repos per page, per window. Run multiple Actor calls with different language values to broaden coverage; there is no way to retrieve more than ~25 repos for a single language-window combination — that's what GitHub's page shows.
No topic-based filtering. The Trending page exposes no topic filters (a REST API feature). Use the GitHub REST API for topic queries.
Trending data is ~6 hours stale at worst. GitHub recalculates its list throughout the day, not in real time; a run reflects the page as rendered at request time.
HTML structure can change. GitHub ships UI changes without notice. If the page HTML changes shape, the parser fails loud (no rows emitted, plus an error in the run log) rather than emitting garbage. We aim to ship a parser patch within 24 hours of any breakage.
Developer-level trending is not scraped. This Actor scrapes repository trending, not the separate developer/user leaderboard.

FAQ

Is scraping GitHub Trending legal?
The Trending page is a public, unauthenticated HTML page GitHub serves to any browser without login. This Actor reads only what a logged-out visitor sees, at a polite request rate, and collects no personal data beyond the publicly-displayed contributor avatar URLs. As always, verify your use case against GitHub's Terms of Service and your own jurisdiction before running at volume.

Does GitHub Trending have an RSS feed?
No. GitHub removed their RSS feeds in 2012, and the third-party rebuilds have gone stale. This Actor is the cleanest way to get the data as a structured feed on a schedule — set up a scheduled run and webhook the dataset downstream.

Can I export to Google Sheets or a warehouse?
Yes. Export CSV / Excel / JSON from the Apify Console, configure a webhook on ACTOR.RUN.SUCCEEDED to push results into Make / Zapier / n8n, or pull the dataset via the Apify API.

How is this different from the GitHub REST API's search endpoint?
The REST API's sort=stars option orders search results by total star count across all of GitHub; it does not return GitHub's internal trending list. The two produce different rankings because the Trending page incorporates velocity, recency, and editorial signals the API's search doesn't expose. If you want the trending list, you need this page.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/github-trending-scraper.

Free $5 trial credit, no credit card. Set language: "python" and since: "weekly", click Start, and you'll have this week's 25 trending Python repos — ranked, with star counts — in under 30 seconds. Feeding it into a Slack bot, a newsletter, or a dashboard? Drop what you built in the comments — I'm curious what the OSS-trend signal ends up powering.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈
