Devil Scrapes

Posted on Jun 2

PyPI Metadata API: bulk-scrape package data for $1.50/1K

#webscraping #python #apify #data

Quick answer: PyPI's pypi.org/pypi/<package>/json endpoint returns rich metadata for one package at a time — version, dependencies, classifiers, license, author, release history — but it has no bulk API and no published rate-limit docs. A PyPI package scraper fans out requests in parallel, handles the pacing, and writes every result as a clean typed row. The Apify Actor below does it for $0.0015 per package (~$1.50 per 1,000), with retries, proxy rotation, and Pydantic validation handled for you.

PyPI hosts over 580,000 packages, and there are plenty of reasons to want their metadata in bulk: auditing a requirements.txt with 120 dependencies, running a nightly license-compliance check, or pulling READMEs for 40,000 packages to ground an LLM. The catch isn't authentication; it's doing it reliably at scale: 404s for deleted or misspelled packages, pacing so you don't get throttled, parsing the classifier array, and capturing release history without a second request. Here's what that looks like, and how I reduced it to one API call per package.

What is PyPI? 🔎

PyPI — the Python Package Index — is the official third-party package repository for Python, maintained by the Python Software Foundation. When you pip install requests, pip resolves and downloads from PyPI. Every published package exposes a canonical JSON endpoint at pypi.org/pypi/<name>/json (and pypi.org/pypi/<name>/<version>/json for specific versions) returning the full distribution metadata: METADATA file fields, release artifacts with SHA256 checksums, and upload timestamps for every release.
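As a small illustration of the two endpoint shapes above, here is a sketch of a URL builder. The helper names normalize_name and json_url are made up for this example; the normalization rule (lowercase, runs of -, _ and . collapsed to a single -) follows PEP 503.

```python
import re
from typing import Optional
from urllib.parse import quote


def normalize_name(name: str) -> str:
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def json_url(name: str, version: Optional[str] = None) -> str:
    """Build the JSON endpoint URL for a package, optionally for one version."""
    base = f"https://pypi.org/pypi/{quote(normalize_name(name))}"
    if version is not None:
        return f"{base}/{quote(version)}/json"
    return f"{base}/json"


print(json_url("Typing_Extensions"))
print(json_url("httpx", "0.27.0"))
```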

What PyPI does not provide out of the box:

A batch endpoint that accepts a list of package names
A search-by-classifier or search-by-license API
Download counts via the JSON API (separate datasets on BigQuery and pypistats.org)
A webhook or feed for new releases in a watchlist

Does PyPI have a bulk metadata API? 🤔

No. The PyPI JSON API is per-package only: one package's metadata per request. There's a Simple Repository API (PEP 503) for pip-resolver use and a Stats API that reports index storage totals, but neither bundles the rich metadata (classifiers, author email, requires-dist, project URLs) you need for dependency analysis or compliance work.

To pull metadata for 1,000 packages you need 1,000 HTTP requests (run in parallel if you want them done quickly), 404 handling for packages that were deleted or never existed, and normalization for the dozen ways authors fill (or skip) optional fields.
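The fan-out itself can be sketched with nothing but asyncio. This is a minimal illustration, not the Actor's code: fetch_all, fake_fetch, and the package name no-such-pkg are all hypothetical, and the fetcher is a stub so the example runs without network access. In a real run the stub would be replaced by an HTTP call that returns None on a 404.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Fetcher = Callable[[str], Awaitable[Optional[dict]]]


async def fetch_all(names: list[str], fetch_one: Fetcher, concurrency: int = 8) -> list[dict]:
    """Fetch every name with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(name: str) -> Optional[dict]:
        async with semaphore:
            return await fetch_one(name)

    # gather() preserves input order, so results line up with `names`.
    results = await asyncio.gather(*(guarded(n) for n in names))
    # A fetcher returns None for a 404, so those packages are skipped.
    return [row for row in results if row is not None]


async def fake_fetch(name: str) -> Optional[dict]:
    """Stand-in for a real HTTP call; pretends 'no-such-pkg' is a 404."""
    await asyncio.sleep(0)
    return None if name == "no-such-pkg" else {"name": name}


rows = asyncio.run(fetch_all(["requests", "no-such-pkg", "httpx"], fake_fetch))
print(rows)
```

The semaphore is the whole pacing mechanism here: raising or lowering concurrency changes how many requests overlap without touching the fetcher.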

What the data looks like 📤

Each package resolves to one flat typed row:

{
  "name": "httpx",
  "version": "0.27.0",
  "summary": "The next generation HTTP client.",
  "description": "# httpx\n\nA next-generation HTTP client...",
  "description_content_type": "text/markdown",
  "author": null,
  "author_email": "Tom Christie <tom@tomchristie.com>",
  "maintainer": null,
  "license": "BSD-3-Clause",
  "home_page": null,
  "project_url": "https://pypi.org/project/httpx/",
  "project_urls": {
    "Documentation": "https://www.python-httpx.org",
    "Source": "https://github.com/encode/httpx"
  },
  "requires_python": ">=3.8",
  "requires_dist": [
    "certifi",
    "httpcore==1.*",
    "idna",
    "sniffio",
    "h2>=3,<5; extra == \"http2\""
  ],
  "classifiers": [
    "Development Status :: 4 - Beta",
    "Framework :: AsyncIO",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: BSD License",
    "Programming Language :: Python :: 3 :: Only"
  ],
  "keywords": "async, http, httpx",
  "yanked": false,
  "release_history": [
    {"version": "0.27.0", "upload_time": "2024-02-21T13:15:11"},
    {"version": "0.26.0", "upload_time": "2023-11-15T09:42:33"}
  ],
  "package_url": "https://pypi.org/project/httpx/",
  "scraped_at": "2026-05-31T10:22:04+00:00"
}

Twenty fields, Pydantic-validated before the row lands in your dataset. The requires_dist array preserves PEP 508 markers verbatim, classifiers holds the full trove classifier strings, and release_history gives the latest 10 versions with upload timestamps when you ask for it.
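Because requires_dist entries keep their markers verbatim, a consumer has to split them. Here is a deliberately simplified sketch using only the standard library; split_requirement is a hypothetical helper, and for anything beyond quick filtering a full PEP 508 parser such as packaging.requirements.Requirement is the safer choice.

```python
from typing import Optional


def split_requirement(entry: str) -> tuple[str, Optional[str]]:
    """Split 'spec; marker' into (spec, marker); marker is None when absent."""
    spec, sep, marker = entry.partition(";")
    return spec.strip(), (marker.strip() if sep else None)


requires_dist = ["certifi", "httpcore==1.*", 'h2>=3,<5; extra == "http2"']

# Keep only dependencies that carry no environment marker at all.
# Simplification: this also drops marker-gated deps that are not extras
# (for example ones conditioned on python_version).
unconditional = [
    spec for spec, marker in map(split_requirement, requires_dist) if marker is None
]
print(unconditional)
```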

The naive approach (and why it falls apart) 🔧

The obvious first pass:

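import httpx

packages = ["requests", "httpx"]  # ... times 10,000

for pkg in packages:
    # One blocking request per package: no pacing, no retries, no 404 handling
    r = httpx.get(f"https://pypi.org/pypi/{pkg}/json")
    data = r.json()["info"]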

That works for 20 packages. At 10,000 it frays — four reasons:

1. Rate limits without documentation. PyPI doesn't publish its rate-limit policy. In practice a burst of sequential requests from one IP starts seeing 429s, and the response doesn't always carry a Retry-After header. We pace with configurable concurrency (default: 8), track 429s, and retry with exponential backoff so the queue drains cleanly instead of hammering the server or failing silently.

2. 404s that can't be told apart. A package that was never published and one that was deleted from the index both return 404. (Yanked releases, by contrast, are still served and show up with yanked set to true.) We log the status code and skip the row with a warning rather than hard-failing, so a single bad package name doesn't abort a 5,000-package sweep.

3. Field normalization. PyPI's JSON isn't rigidly typed across packages. author may be an empty string or null; license may be a long SPDX expression, a short tag, or "Custom"; home_page was deprecated in favor of project_urls but both still appear. We run every row through a Pydantic ResultRow model (str | None fields, list[Any] for classifiers and requires_dist) so your dataset keeps a stable schema.

4. Release history is in a separate key. It isn't in info — it's in the top-level releases key of the same response, a dict of version → list[artifact]. Flattening it needs a sort + truncation step. We do that when includeReleases is true, at no extra request.
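To make point 3 concrete, here is a rough stand-in for that normalization step. It uses a plain dataclass instead of Pydantic so it runs with the standard library alone; ResultRowSketch, blank_to_none, and normalize are names invented for this example, and only a handful of the twenty fields are shown.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


def blank_to_none(value: Any) -> Optional[str]:
    """Treat missing, empty, and whitespace-only strings as None."""
    if value is None:
        return None
    text = str(value).strip()
    return text or None


@dataclass
class ResultRowSketch:
    name: str
    version: str
    author: Optional[str] = None
    license: Optional[str] = None
    home_page: Optional[str] = None
    classifiers: list = field(default_factory=list)
    requires_dist: list = field(default_factory=list)


def normalize(info: dict) -> ResultRowSketch:
    """Map the raw `info` object onto a row with a stable shape."""
    return ResultRowSketch(
        name=info["name"],
        version=info["version"],
        author=blank_to_none(info.get("author")),
        license=blank_to_none(info.get("license")),
        home_page=blank_to_none(info.get("home_page")),
        # Either key may be null or missing; fall back to an empty list.
        classifiers=list(info.get("classifiers") or []),
        requires_dist=list(info.get("requires_dist") or []),
    )


row = normalize({"name": "demo", "version": "1.0", "author": "", "license": None, "requires_dist": None})
print(row)
```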

We rotate browser fingerprints via curl-cffi, can route requests through Apify residential proxies for IP hygiene on long runs, and retry with backoff on 408 / 429 / 5xx. Partial results surface with a clear status message, never a silent empty dataset.
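The retry policy can be sketched like this. It is one plausible shape under stated assumptions, not the Actor's actual implementation: send is a hypothetical callable returning (status_code, headers), the 0.5-second base and 30-second cap are arbitrary, and Retry-After is only honored when it is a plain number of seconds (the HTTP-date form is ignored here).

```python
import time
from typing import Callable, Optional

RETRYABLE = {408, 429} | set(range(500, 600))


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before the retry that follows attempt number `attempt` (0-based)."""
    if retry_after is not None and retry_after.isdigit():
        return min(float(retry_after), cap)  # server hint wins when it is numeric
    return min(base * (2 ** attempt), cap)   # otherwise double the wait each time


def get_with_retries(send: Callable[[], tuple[int, dict]], max_attempts: int = 5,
                     sleep: Callable[[float], None] = time.sleep) -> tuple[int, dict]:
    """Call `send` until the status is not retryable or attempts run out."""
    status, headers = send()
    for attempt in range(max_attempts - 1):
        if status not in RETRYABLE:
            break
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
        status, headers = send()
    return status, headers


# Scripted responses instead of real HTTP, and a recording sleep, so this runs instantly.
responses = iter([(429, {}), (503, {"Retry-After": "2"}), (200, {})])
waits: list[float] = []
status, _ = get_with_retries(lambda: next(responses), sleep=waits.append)
print(status, waits)
```

Injecting sleep keeps the function testable; production code would leave the default and probably add random jitter so parallel workers do not retry in lockstep.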

The Actor

I packaged this as an Apify Actor: PyPI Package Scraper. Feed it a list of package names in the Apify Console or call it from Python:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/pypi-package-scraper").call(
    run_input={
        "packages": ["requests", "httpx", "fastapi", "pydantic"],
        "includeReleases": True,
        "concurrency": 8,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["name"], item["version"], item["license"])

Key input parameters, all from the input schema:

Field	Default	Notes
`packages`	`["requests","httpx","selectolax"]`	Any PyPI package name; case-insensitive
`includeReleases`	`true`	Adds the 10 most recent release versions + upload times
`concurrency`	`8`	Parallel fetches; up to 32 for large batches
`proxyConfiguration`	`{useApifyProxy: false}`	Optional; residential for high-volume runs

Use cases 💡

Dependency audit at scale. Parse your requirements.txt (or pyproject.toml, or pip freeze), pass the list to the Actor, and query for packages where yanked == true, requires_python excludes your runtime, or the most recent release_history entry is over 18 months old. For a 300-package monorepo, that's a one-run, one-query job.
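Two of those checks (yanked and stale) can be sketched over exported rows as follows. The audit function, the 548-day approximation of 18 months, and the package names fresh-pkg and pulled-pkg are assumptions made for the example; the requires_python check is left out because it needs a version-specifier parser.

```python
from datetime import datetime, timedelta


def audit(rows: list[dict], now: datetime, max_age_days: int = 548) -> list[tuple[str, str]]:
    """Return (name, reason) pairs for rows that look risky."""
    findings = []
    for row in rows:
        if row.get("yanked"):
            findings.append((row["name"], "yanked"))
            continue
        uploads = [
            datetime.fromisoformat(release["upload_time"])
            for release in row.get("release_history") or []
        ]
        # Use the newest upload, regardless of how the list is ordered.
        if uploads and now - max(uploads) > timedelta(days=max_age_days):
            findings.append((row["name"], "stale"))
    return findings


rows = [
    {"name": "httpx", "yanked": False, "release_history": [
        {"version": "0.27.0", "upload_time": "2024-02-21T13:15:11"},
        {"version": "0.26.0", "upload_time": "2023-11-15T09:42:33"},
    ]},
    {"name": "fresh-pkg", "yanked": False, "release_history": [
        {"version": "1.0", "upload_time": "2026-01-10T08:00:00"},
    ]},
    {"name": "pulled-pkg", "yanked": True, "release_history": []},
]
findings = audit(rows, now=datetime(2026, 5, 31))
print(findings)
```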

License compliance. Pull the license field and classifiers array for every direct dependency, cross-reference against your allowed list (MIT, BSD, Apache-2.0, ISC), and flag GPL, AGPL, or empty strings. This is the tedious part of an SBOM — handling it once with structured data beats grep-ing PyPI pages.
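A first-pass license triage over the rows might look like the sketch below. The allow-list extends the one named above with two BSD SPDX identifiers, which is my assumption, and the substring check is crude on purpose: it flags anything containing "GPL", including LGPL, so treat "flag" as "needs a human", not as a verdict.

```python
ALLOWED = {"MIT", "BSD", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "ISC"}


def classify_license(row: dict) -> str:
    """Return 'ok', 'flag', 'review', or 'missing' for one dataset row."""
    license_text = (row.get("license") or "").strip()
    if not license_text:
        # Fall back to trove classifiers when the license field is empty.
        from_classifiers = [
            c for c in row.get("classifiers") or [] if c.startswith("License ::")
        ]
        if not from_classifiers:
            return "missing"
        license_text = " ".join(from_classifiers)
    if "GPL" in license_text.upper():  # matches GPL, AGPL, and LGPL
        return "flag"
    if license_text in ALLOWED:
        return "ok"
    return "review"


print(classify_license({"license": "BSD-3-Clause"}))
print(classify_license({"license": "", "classifiers": []}))
```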

Maintainer mapping. The author_email, maintainer, and project_urls fields are enough to build a maintainer graph — for security review, contributor outreach, or bus-factor risk across your stack.

Release monitoring. Schedule the Actor daily on a watchlist of 50-100 packages. Diff today's release_history against yesterday's run via the Apify API, and webhook new versions into Slack, Jira, or a CI job that opens an update PR.
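The diff step is small enough to sketch in a few lines. new_versions is a hypothetical helper that takes two lists of dataset rows (yesterday's and today's) and assumes each row carries the release_history shape shown earlier; fetching the two runs and posting to Slack are left out.

```python
def new_versions(yesterday: list[dict], today: list[dict]) -> dict[str, list[str]]:
    """Map package name to versions present today but not yesterday."""
    seen = {
        row["name"]: {r["version"] for r in row.get("release_history") or []}
        for row in yesterday
    }
    fresh: dict[str, list[str]] = {}
    for row in today:
        before = seen.get(row["name"], set())
        added = [
            r["version"]
            for r in row.get("release_history") or []
            if r["version"] not in before
        ]
        if added:
            fresh[row["name"]] = added
    return fresh


old = [{"name": "httpx", "release_history": [{"version": "0.26.0"}]}]
new = [{"name": "httpx", "release_history": [{"version": "0.27.0"}, {"version": "0.26.0"}]}]
changes = new_versions(old, new)
print(changes)
```

One caveat with this shape: a package that was not in yesterday's run reports every version as new, so a watchlist change produces a noisy first diff.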

RAG corpus for PyPI documentation. The description field carries the package's full README. For LLM-grounded tools — package advisors, dependency explainers — pulling descriptions for tens of thousands of packages gives you a retrieval corpus without scraping individual GitHub repos.

Pricing — exact numbers 💰

Pay-per-event. You pay a small start fee per run, plus a per-package fee only for packages that resolve.

Event	Price
Actor start	$0.005 per run
Result emitted	$0.0015 per package

Batch	Cost
100 packages	$0.16
1,000 packages	$1.51
10,000 packages	$15.01
50,000 packages	$75.01

Apify's $5 free-trial credit covers your first ~3,300 packages, no credit card. A full pip freeze on a mid-sized Django app lists 60-120 packages — under $0.20 per audit run.
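The batch figures are just the start fee plus the per-package fee, rounded to the cent. Here is the arithmetic as a sketch; run_cost is a name I made up, and rounding half-up is my assumption about how the table was rounded. Decimal is used because binary floats can round values like 1.505 the wrong way.

```python
from decimal import Decimal, ROUND_HALF_UP

START_FEE = Decimal("0.005")     # per run
PER_PACKAGE = Decimal("0.0015")  # per resolved package


def run_cost(resolved_packages: int) -> Decimal:
    """Total cost of one run in dollars, rounded to the nearest cent."""
    total = START_FEE + PER_PACKAGE * resolved_packages
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for n in (100, 1_000, 10_000, 50_000):
    print(n, run_cost(n))
```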

The part that surprised me when I built this

PyPI's releases dict uses version strings as keys, and each version stores a list of distribution artifacts (wheels, sdists, eggs), each with its own upload_time. A package with a 10-year history can have 200+ entries, which adds up in memory across thousands of fetches.

The includeReleases option caps this at the 10 most recent versions by sorting releases.keys() with packaging.version.Version and slicing. That's the technically interesting bit: a naive sorted(releases.keys())[-10:] is lexicographic, so it sorts 0.10.0 before 0.9.0 and mishandles pre-releases like 2.0.0a1 — PyPI doesn't guarantee version order in the JSON.
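To show the difference, here is a stdlib-only sketch of the flatten, sort, and slice step. It does not use packaging.version.Version; version_key is a simplified stand-in that only understands dotted integers with an optional a/b/rc tag, so it is not a full PEP 440 comparison. Taking the earliest artifact's upload_time per version is also my assumption, and latest_releases is a hypothetical name.

```python
import re

_VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)(?:(a|b|rc)(\d+))?$")
_PRE_ORDER = {"a": 0, "b": 1, "rc": 2}


def version_key(version: str) -> tuple:
    """Simplified sort key: dotted integers plus an optional a/b/rc tag."""
    match = _VERSION_RE.match(version)
    if match is None:
        return (0, (), 0, 0)  # unparseable versions sort first
    release = tuple(int(part) for part in match.group(1).split("."))
    if match.group(2) is None:
        return (1, release, 3, 0)  # final releases sort after their pre-releases
    return (1, release, _PRE_ORDER[match.group(2)], int(match.group(3)))


def latest_releases(releases: dict, limit: int = 10) -> list[dict]:
    """Flatten the `releases` dict into newest-first rows, capped at `limit`."""
    rows = []
    for version, artifacts in releases.items():
        if not artifacts:
            continue  # a version with no uploaded files has no timestamp
        first_upload = min(a["upload_time"] for a in artifacts)
        rows.append({"version": version, "upload_time": first_upload})
    rows.sort(key=lambda r: version_key(r["version"]), reverse=True)
    return rows[:limit]


releases = {
    "0.9.0": [{"upload_time": "2020-01-01T00:00:00"}],
    "0.10.0": [{"upload_time": "2020-06-01T00:00:00"}],
    "1.0.0a1": [{"upload_time": "2021-01-01T00:00:00"}],
    "1.0.0": [{"upload_time": "2021-03-01T00:00:00"}, {"upload_time": "2021-03-01T00:05:00"}],
    "0.8.0": [],
}
top = latest_releases(releases, limit=3)
print([r["version"] for r in top])
print(sorted(["0.9.0", "0.10.0"]))  # plain string sort, for comparison
```

The last line is the lexicographic trap in miniature: a plain string sort puts 0.10.0 ahead of 0.9.0, while the version-aware key orders them correctly.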

Limitations 🚧

Download counts are not included. PyPI's JSON API does not expose them. Use the pypistats.org API or the public PyPI downloads dataset on BigQuery; both are purpose-built for that question.

Vulnerability data is out of scope. The Actor returns what PyPI's metadata API returns. CVE enrichment requires OSV, GitHub Advisory, or Snyk and is not bundled.

Reverse dependencies are not available. PyPI has no API for "what depends on X". Libraries.io is a commonly used source for that graph.

Classifier completeness varies. Many packages set no classifiers, or only the License one. The Actor returns whatever is there and never fabricates values — if classifiers is empty, it's empty in the dataset too.

Large description fields. Packages that bundle a full README into their metadata can have description values in the hundreds of kilobytes. Drop the field after export if you don't need it.

FAQ ❓

Is it legal to scrape PyPI this way?
PyPI's Terms of Use allow programmatic access to the public JSON API. The Actor hits only public endpoints, collects no personal data beyond what authors publish, and paces requests to avoid excessive load. As always, review your own use case and jurisdiction.

Is there an official bulk PyPI data export?
Partial yes: Google's PyPI public dataset on BigQuery covers downloads and metadata snapshots, but querying it needs BigQuery and SQL. The Actor gives you on-demand JSON without a BigQuery account.

Can I export the results to a spreadsheet or warehouse?
Yes — export CSV, JSON, or Excel from the Apify Console, trigger a webhook on ACTOR.RUN.SUCCEEDED into Google Sheets, or pull rows via the Apify API into your own pipeline.

What if a package in my list doesn't exist on PyPI?
The Actor logs a 404 and skips it. The dataset contains every package that resolved, and the run status surfaces the count of 404s.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/pypi-package-scraper. Free $5 trial credit, no credit card. Drop in your requirements.txt package list and you'll have a structured metadata snapshot in under a minute. Need a field that isn't there — download counts, vulnerability data, reverse deps? Drop it in the comments and we'll ship it.

Built by Devil Scrapes — Apify Actors that do the dirty work. Pay-per-event, honest pricing, no junk fields. 😈
