Devil Scrapes

Posted on Jun 1

HuggingFace Scraper: export models, datasets & Spaces to JSON

#webscraping #python #apify #datascience

Quick answer: The HuggingFace Hub is the world's largest public registry of open-source AI models, datasets, and Spaces, but its public REST API returns one page at a time with no bulk export. A HuggingFace scraper paginates that API for you, optionally enriches each row with safetensors parameter counts and GGUF detection, and writes every repo as a typed JSON row. The Apify Actor below does it for $0.002 per row in list mode (~$2.05 per 1,000), with rate-limit handling, backoff, and Pydantic-validated output included.

The HuggingFace Hub now hosts over one million public models. That number is noise: for any real purpose (trend tracking, hardware-fit analysis, GGUF availability, investor dashboards) you need a slice, such as the top 1,000 trending text-generation models by 30-day download velocity, every Space an org has deployed, or every dataset tagged speech and multilingual. The Hub's search UI shows you that slice visually, but there is no "Download" button that hands it to you as a CSV.

What is the HuggingFace Hub? 🤗

The HuggingFace Hub is a Git-based hosting platform for machine-learning repos: models, datasets, and Spaces (interactive ML demos). It is the de facto distribution channel for open-source AI. Nearly every major foundation model (LLaMA, Mistral, Falcon, Whisper, SDXL) is listed here, alongside the datasets used to build them and live inference Spaces.

Each public repo carries structured metadata: repo type, owner, download count, like count, creation timestamp, tag list, and — for models — a pipeline tag (e.g. text-generation) and library name. That metadata is what this Actor exports.

Does HuggingFace have a bulk export API?

No — not for trending snapshots or cross-author catalog scans. HuggingFace publishes a public REST API at huggingface.co/api/models (and the equivalent /datasets and /spaces paths). It is well-documented for per-query access — fetch one page of models filtered by a single tag. What it doesn't give you: pagination that survives 10,000+ results, a "trending" sort that maps correctly to the internal trendingScore param, derived fields like model size categories or GGUF detection, or a scheduled weekly pull with retry logic. That gap is the job.
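To make the per-query shape concrete, here is a minimal sketch of building a list-endpoint URL. The `trending` to `trendingScore` mapping, the four documented sort fields, and `full=true` come from this post; the user-facing sort names `modified` and `created`, the helper name `build_list_url`, and the `filter`, `direction`, and `limit` query parameters are assumptions about the public API, not the Actor's actual code.

```python
from typing import Optional
from urllib.parse import urlencode

API_ROOT = "https://huggingface.co/api"
REPO_PATHS = {"model": "models", "dataset": "datasets", "space": "spaces"}

# User-facing sort name -> query value. The keys "modified" and "created"
# are illustrative; the values are the sort fields named in the article.
SORT_PARAM = {
    "trending": "trendingScore",
    "downloads": "downloads",
    "likes": "likes",
    "modified": "lastModified",
    "created": "createdAt",
}


def build_list_url(repo_type: str, sort: str,
                   tag: Optional[str] = None, limit: int = 100) -> str:
    """Build the URL for the first page of a list query."""
    if repo_type not in REPO_PATHS:
        raise ValueError(f"unknown repo type: {repo_type!r}")
    if sort not in SORT_PARAM:
        # Fail loudly instead of letting the API fall back to another sort.
        raise ValueError(f"unknown sort: {sort!r}")
    params = {
        "sort": SORT_PARAM[sort],
        "direction": "-1",  # assumed: descending order
        "limit": limit,
        "full": "true",
    }
    if tag:
        params["filter"] = tag  # assumed: single-tag filter parameter
    return f"{API_ROOT}/{REPO_PATHS[repo_type]}?{urlencode(params)}"
```

Rejecting an unknown sort name up front is the design point: a typo should raise before any request goes out, rather than quietly returning a downloads-sorted page.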

What the data looks like 📊

One row per repo, Pydantic-validated before it hits the dataset. Here is a real detail-mode row for openai/whisper-large-v3:

{
  "repo_type": "model",
  "repo_id": "openai/whisper-large-v3",
  "repo_owner": "openai",
  "repo_name": "whisper-large-v3",
  "repo_url": "https://huggingface.co/openai/whisper-large-v3",
  "downloads": 4932732,
  "likes": 5690,
  "created_at": "2023-11-07T18:41:14.000Z",
  "last_modified": "2024-08-12T10:20:10.000Z",
  "tags": ["transformers", "safetensors", "whisper", "automatic-speech-recognition"],
  "gated": false,
  "private": false,
  "pipeline_tag": "automatic-speech-recognition",
  "library_name": "transformers",
  "safetensors_total_params": 1543490560,
  "model_size_category": "1B-7B",
  "has_gguf": false,
  "dataset_size_categories": null,
  "dataset_task_categories": null,
  "dataset_languages": null,
  "space_sdk": null,
  "space_runtime_stage": null,
  "scraped_at": "2026-05-16T12:00:00+00:00"
}

Twenty-three fields in a flat, stable schema. Type-specific fields (pipeline_tag, library_name, safetensors_total_params, dataset_*, space_*) are null for repo types where they don't apply — never omitted, always present. The row lands straight into Pandas, BigQuery, or DuckDB without structural wrangling.
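Because every row carries the same keys, writing a CSV needs nothing beyond the standard library. The sketch below is illustrative only: the helper name `rows_to_csv` and the choice to join list fields such as `tags` with `;` are assumptions, not something the Actor does for you.

```python
import csv
import io
from typing import List


def rows_to_csv(rows: List[dict]) -> str:
    """Serialize flat rows to CSV text; list values are joined with ';'."""
    if not rows:
        return ""
    buf = io.StringIO()
    # Every row has the same keys, so the first row defines the header.
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    for row in rows:
        writer.writerow({
            key: ";".join(value) if isinstance(value, list) else value
            for key, value in row.items()
        })
    return buf.getvalue()
```

Null fields come out as empty cells, which is what most warehouse loaders expect.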

The naive approach (and why it falls apart) ⚙️

The HuggingFace API is well-behaved for a single fetch. A dozen pages in, things start to fall apart:

1. Rate limiting bites at scale. HuggingFace enforces 500 requests per 5 minutes (verified from the ratelimit-policy response header). In detail mode — where each list row also needs a detail request — you hit the ceiling in under two minutes. We pace requests, honour Retry-After, and retry on 429 and 503 with exponential backoff (up to 5 attempts). Partial runs surface with a clear status message rather than a silently truncated dataset.

2. Pagination state is fragile. The list endpoint returns a Link: <...>; rel="next" header for cursor-based pagination. Resume a dropped session naively and you restart from page 1, double-counting early rows. We thread the cursor across pages, de-duplicate on repo_id, and fail loud if the cursor goes missing mid-run.

3. The trending sort is undocumented. The Hub's "Trending" view is backed by the trendingScore param, but the public API docs only reference downloads, likes, lastModified, createdAt. We map all five sort fields — including trending → trendingScore — correctly. Guess the wrong param and you silently get a downloads-sorted list with no error.

4. Detail enrichment is a separate request per row. Safetensors parameter counts, GGUF file detection, and Space runtime stage (RUNNING / SLEEPING / STOPPED) are not on the list endpoint — they require a per-repo detail call. We issue those in batches, rotate the curl-cffi browser fingerprint across requests, and back off when the rate limit kicks in. You pay $0.005/row when you need this enrichment; $0.002/row when you don't.
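The cursor handling in point 2 can be sketched in a few lines. This is a simplified illustration, not the Actor's code: the helper names are hypothetical, and the regex only covers the `<url>; rel="next"` shape quoted above.

```python
import re
from typing import Iterable, Iterator, Optional

_NEXT_LINK = re.compile(r'<([^>]+)>\s*;\s*rel="next"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the rel="next" URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    match = _NEXT_LINK.search(link_header)
    return match.group(1) if match else None


def dedupe_by_repo_id(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield each repo once, keeping the first occurrence."""
    seen = set()
    for row in rows:
        if row["repo_id"] in seen:
            continue
        seen.add(row["repo_id"])
        yield row
```

A caller would keep requesting `next_page_url(...)` until it returns None, and pass the combined rows through `dedupe_by_repo_id` so a restarted page does not double-count.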

In short: the Actor rotates browser TLS fingerprints via curl-cffi impersonation, retries every recoverable failure with exponential backoff, and hands back clean, Pydantic-validated typed rows, with no sparse dicts and no positional arrays to unpick. Row events are charged only for rows that actually land in the dataset.
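For readers who want to roll their own, the retry policy described above (429 and 503, up to 5 attempts, honouring Retry-After) might look roughly like this. The base delay, the 60-second cap, and the use of full jitter are assumptions chosen for the sketch; the post does not state the Actor's actual values.

```python
import random
from typing import Optional

RETRYABLE_STATUSES = {429, 503}
MAX_ATTEMPTS = 5


def should_retry(status: int, attempt: int) -> bool:
    """True if a response with `status` on 1-based `attempt` deserves another try."""
    return status in RETRYABLE_STATUSES and attempt < MAX_ATTEMPTS


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after a failed 1-based `attempt`."""
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds: obey the server.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form: fall through to computed backoff
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0.0, ceiling)  # full jitter below the ceiling
```

The server's own Retry-After value wins whenever it is parseable; the exponential ceiling only applies when the header is absent or in date form.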

The Actor 🚀

The Actor is live: HuggingFace Hub Scraper on Apify Store. Run it from the Apify Console or call it from code. Here is a Python example using the Apify client:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Top 500 trending text-generation models, with GGUF detection
run = client.actor("DevilScrapes/huggingface-hub-scraper").call(
    run_input={
        "repoType": "model",
        "sort": "trending",
        "filterTags": ["text-generation", "transformers"],
        "maxResults": 500,
        "includeDetails": True,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["repo_id"], item["model_size_category"], item["has_gguf"])

Or run a quick snapshot of the 100 most-downloaded models with the minimum input:

{
  "repoType": "model",
  "sort": "downloads",
  "maxResults": 100,
  "includeDetails": false
}

A handful of input fields cover most use cases: repoType (model, dataset, space), sort (five options), one optional filter (filterTags, searchQuery, author, or repoId — at most one, validated before any network call), maxResults (1 to 5,000), and includeDetails. Pass repoId in owner/name form for a single-repo deep fetch.
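The input rules above are simple enough to check before spending a run. Here is a rough pre-flight validator; the function name and error messages are hypothetical, and the default of 100 for a missing `maxResults` is an assumption rather than the Actor's documented default.

```python
FILTER_KEYS = ("filterTags", "searchQuery", "author", "repoId")
REPO_TYPES = ("model", "dataset", "space")


def validate_input(run_input: dict) -> None:
    """Raise ValueError if the input breaks the rules described in the post."""
    if run_input.get("repoType") not in REPO_TYPES:
        raise ValueError("repoType must be one of model, dataset, space")

    # At most one of the four filters may be set.
    used = [key for key in FILTER_KEYS if run_input.get(key)]
    if len(used) > 1:
        raise ValueError(f"use at most one filter, got: {', '.join(used)}")

    max_results = run_input.get("maxResults", 100)  # assumed default
    if not isinstance(max_results, int) or not 1 <= max_results <= 5000:
        raise ValueError("maxResults must be an integer from 1 to 5000")

    repo_id = run_input.get("repoId")
    if repo_id:
        parts = repo_id.split("/")
        if len(parts) != 2 or not all(parts):
            raise ValueError("repoId must be in owner/name form")
```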

What you would actually use this for 💡

Weekly trending model leaderboard. Set up an Apify Schedule to run weekly, sort by trending, export to CSV, and diff against last week's snapshot to see which model families are gaining ground. At 100 list-mode rows that's $0.25 per run — under $15/year for a clean weekly dataset.
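The week-over-week diff can be as simple as comparing positions in two exported snapshots. A minimal sketch, assuming each snapshot is a list of rows already ordered best-first; the function name `rank_changes` is made up for illustration.

```python
from typing import List, Optional, Tuple


def rank_changes(previous: List[dict],
                 current: List[dict]) -> List[Tuple[str, Optional[int]]]:
    """Return (repo_id, places_gained) for every repo in `current`.

    places_gained is positive when a repo moved up, negative when it
    moved down, and None when it was absent from `previous`.
    """
    old_rank = {row["repo_id"]: index for index, row in enumerate(previous)}
    changes = []
    for new_rank, row in enumerate(current):
        repo_id = row["repo_id"]
        before = old_rank.get(repo_id)
        changes.append((repo_id, None if before is None else before - new_rank))
    return changes
```

New entrants show up with None, which is usually the most interesting signal in a trending list.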

GGUF availability tracker for local inference. Set includeDetails=true and filter on has_gguf=true after the run to find quantized models ready for llama.cpp, Ollama, or LM Studio. The flag comes from scanning sibling filenames in the detail endpoint — something the Hub's own search doesn't expose as a filter.
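The sibling-filename scan is easy to reproduce against a raw detail response. This sketch assumes the detail payload exposes a `siblings` list whose entries carry an `rfilename` key, and that a `.gguf` extension is the only signal; the Actor's exact rule is not spelled out in this post.

```python
def has_gguf(detail: dict) -> bool:
    """True if any sibling file in a detail payload ends in .gguf."""
    siblings = detail.get("siblings") or []
    return any(
        str(sibling.get("rfilename", "")).lower().endswith(".gguf")
        for sibling in siblings
    )
```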

Hardware-fit analysis before benchmarking. Pull safetensors_total_params and the derived model_size_category bucket (<1B, 1B-7B, 7B-13B, 13B+) to pre-filter models that fit a target GPU memory budget — before downloading a 70B checkpoint to find out it doesn't fit in 24 GB.
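As a rough illustration of the pre-filter, the sketch below buckets a parameter count and estimates weight memory. The half-open bucket boundaries (exactly 7B lands in 7B-13B) and the 2 bytes per parameter (fp16/bf16 weights only, ignoring activations and KV cache) are assumptions for this example, not the Actor's documented rules.

```python
from typing import Optional


def size_category(total_params: Optional[int]) -> Optional[str]:
    """Map a parameter count to one of the four buckets named above."""
    if total_params is None:
        return None
    if total_params < 1_000_000_000:
        return "<1B"
    if total_params < 7_000_000_000:
        return "1B-7B"
    if total_params < 13_000_000_000:
        return "7B-13B"
    return "13B+"


def weights_gib(total_params: int, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return total_params * bytes_per_param / 2**30


def fits(total_params: int, budget_gib: float) -> bool:
    """True if the weights alone fit within the memory budget."""
    return weights_gib(total_params) <= budget_gib
```

Under these assumptions a 70B model needs well over 100 GiB for weights alone, so it drops out of a 24 GB shortlist before any download starts.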

VC / analyst adoption dashboard. Export 30-day download counts for an org (author=meta-llama or author=mistralai) weekly and chart download velocity over time to map open-source adoption for an investment thesis. The downloads field is a rolling 30-day count — consistent and directly comparable across runs.

Dataset discovery for training pipelines. Filter datasets with filterTags: ["task_categories:text-classification", "language:de"] to surface German text-classification corpora. The dataset_task_categories and dataset_languages fields are auto-parsed from HuggingFace's tag-prefix convention, so you get structured arrays rather than raw tag strings.
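The tag-prefix convention is straightforward to parse yourself if you are working from raw tags. A minimal sketch; the helper name is hypothetical and the `prefix:value` shape is taken from the filter examples above.

```python
from typing import List


def parse_prefixed_tags(tags: List[str], prefix: str) -> List[str]:
    """Return the values of all tags shaped like '<prefix>:<value>'."""
    marker = prefix + ":"
    return [tag[len(marker):] for tag in tags if tag.startswith(marker)]
```

For example, `parse_prefixed_tags(tags, "language")` pulls the language codes out of a mixed tag list while leaving unrelated tags alone.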

Pricing — exact numbers 💰

Pay-Per-Event. You are charged per row that lands in the dataset, not per page fetched.

Event	Price (USD)	When
`actor-start`	$0.05	Once per run, at boot
`result-row`	$0.002	Per row, list mode (`includeDetails=false`)
`result-row-detailed`	$0.005	Per row, detail mode or `repoId` mode

Example costs:

Rows	Mode	Cost
100	list	$0.25
500	list	$1.05
1,000	list	$2.05
1,000	detail	$5.05
5,000	list	$10.05
5,000	detail	$25.05

The actor-start fee of $0.05 covers warm-up regardless of how many rows you pull. Detail mode is 2.5× the per-row cost because it doubles the request count — factor that in when scoping a run.
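The table follows one formula: the start fee plus rows times the per-row price. A small calculator using the prices listed above, with `Decimal` to avoid float rounding; the function name is just for illustration.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")    # once per run
ROW_LIST = Decimal("0.002")      # per row, list mode
ROW_DETAILED = Decimal("0.005")  # per row, detail or repoId mode


def estimate_cost(rows: int, detailed: bool = False) -> Decimal:
    """Estimated USD cost of one run that returns `rows` rows."""
    per_row = ROW_DETAILED if detailed else ROW_LIST
    return ACTOR_START + per_row * rows
```

Multiplying `estimate_cost(100)` by 52 gives the yearly cost of a weekly 100-row list-mode schedule.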

The technically interesting part

HuggingFace's list endpoint accepts a full=true query parameter that is supposed to return richer fields. We always send it — but model list rows still omit repo_owner, safetensors, and sibling files, which only appear on the per-repo detail endpoint. The flag meaningfully enriches dataset list rows yet behaves inconsistently across repo types. So we treat full=true as a baseline and fall back to the detail endpoint for any field the list endpoint leaves null, rather than trusting the flag to deliver the same schema every time. That undocumented contract drift is exactly what bites scrapers that assume the API does what the docs say.
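The fallback logic amounts to "fill nulls from the detail row, never overwrite what the list row already has". A simplified sketch with hypothetical helper names, assuming both rows have already been normalized to the same field names:

```python
from typing import Iterable


def needs_detail(row: dict, required: Iterable[str]) -> bool:
    """True if any required field is still missing or null on a list row."""
    return any(row.get(field) is None for field in required)


def merge_with_detail(list_row: dict, detail_row: dict) -> dict:
    """Fill null fields in `list_row` from `detail_row` without overwriting."""
    merged = dict(list_row)
    for key, value in detail_row.items():
        if merged.get(key) is None and value is not None:
            merged[key] = value
    return merged
```

Checking `is None` rather than truthiness matters here: legitimate values such as `False` or `0` on the list row must survive the merge.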

Limitations 🚧

Private and gated repos are not accessible. The unauthenticated public API returns only publicly visible data, by design. This Actor never bypasses authentication.
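Rate limit: 500 requests per 5 minutes (verified 2026-05-16 via ratelimit-policy response header), which averages out to 100 requests per minute. In detail mode every enriched row needs its own detail request, so sustained throughput cannot exceed roughly 100 enriched rows per minute.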
Spaces never have a downloads count. The field is always null for repo_type=space — HuggingFace does not track Spaces downloads. Space list rows are also sparse: repo_owner, last_modified, and space_runtime_stage need includeDetails=true.
Structured metadata only. No model-card or dataset-card README body, no YAML front-matter beyond the tags the list endpoint surfaces.
No cross-run deduplication. Re-running the same input returns the same repos with refreshed metadata; dedupe is a downstream concern.
Maximum 5,000 rows per run. For full-catalog exports (1M+ models) you partition by tag or time window across multiple runs.
The FREE Apify tier retains run-scoped storage for 7 days. Export your dataset immediately or use a named dataset for longer retention.
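Since deduplication across runs is left to you, here is one way to do it when merging partitioned or repeated exports: keep the most recently scraped row per repo. The function name is hypothetical, and the sketch assumes `scraped_at` is always an ISO 8601 string with an explicit offset, as in the sample row.

```python
from datetime import datetime
from typing import Iterable, List


def latest_per_repo(rows: Iterable[dict]) -> List[dict]:
    """Keep one row per (repo_type, repo_id): the one with the newest scraped_at."""
    latest = {}
    for row in rows:
        key = (row["repo_type"], row["repo_id"])
        scraped = datetime.fromisoformat(row["scraped_at"])
        if key not in latest or scraped > latest[key][0]:
            latest[key] = (scraped, row)
    return [row for _, row in latest.values()]
```

Keying on the pair of `repo_type` and `repo_id` avoids collisions between, say, a model and a dataset that share the same owner/name.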

FAQ

Is scraping the HuggingFace Hub legal?
The Hub API at huggingface.co/api/ is unauthenticated by design for public repo metadata, and HuggingFace's Terms of Service permit accessing public data via the API. This Actor never bypasses authentication and only reads public metadata. Check your own jurisdiction and use case before using scraped data commercially.

Is this the same as the huggingface_hub Python library?
No. The official huggingface_hub library is excellent for one-off queries inside your application. This Actor is for scheduled, batched, cross-author catalog pulls — weekly trending snapshots, full-org exports, recurring GGUF scans — with rate-limit handling, retry logic, and a clean structured dataset waiting when it finishes. Use the library in your app; use this Actor for the data pipeline.

Can I export to CSV, Google Sheets, or a data warehouse?
Yes. Click Export in the Apify Console to download JSON, CSV, Excel, or XML. Webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or pull via the Apify REST API: GET /datasets/{id}/items?format=csv&clean=true.
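If you would rather script the REST pull, a standard-library sketch follows. The base URL `https://api.apify.com/v2` and the Bearer-token header reflect my understanding of the Apify API and should be checked against its docs; the helper names are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

APIFY_API = "https://api.apify.com/v2"  # assumed base URL


def dataset_items_url(dataset_id: str, fmt: str = "csv", clean: bool = True) -> str:
    """Build the items URL for a dataset export."""
    query = urlencode({"format": fmt, "clean": "true" if clean else "false"})
    return f"{APIFY_API}/datasets/{dataset_id}/items?{query}"


def download_items(dataset_id: str, token: str, fmt: str = "csv") -> bytes:
    """Fetch the exported dataset body; performs a real network call."""
    request = Request(
        dataset_items_url(dataset_id, fmt),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(request) as response:
        return response.read()
```

Pass the `defaultDatasetId` from a finished run as `dataset_id` and write the returned bytes straight to a `.csv` file.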

Why does my Space row have so many null fields?
Space list rows are sparse by design — HuggingFace's API does not return repo_owner, last_modified, or space_runtime_stage on the list endpoint. Enable includeDetails=true to populate them. Note downloads is always null for Spaces regardless of mode.

Try it

The Actor is on the Apify Store: apify.com/DevilScrapes/huggingface-hub-scraper.

Free $5 trial credit, no credit card required. A list-mode run on the top 100 trending models costs $0.25 and finishes in under a minute — pipe it into a spreadsheet and you have the open-source AI leaderboard as a live table. Need a field that's not in the schema, or hit a rate-limit edge case you want handled differently? Drop it in the comments — Devil Scrapes ships updates based on what users actually need.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈

DEV Community