Devil Scrapes

Posted on Jun 2

npm metadata API: bulk-fetch downloads, license & deprecation flags

#webscraping #python #apify #javascript

Quick answer: The npm registry exposes rich per-package metadata at registry.npmjs.org/<package>, but there is no bulk-query endpoint. Getting data for a list of 500 packages means 500 separate HTTP calls across two different API hosts — the registry for metadata and api.npmjs.org for download counts — with rate limits that kick in well before you finish. The Apify Actor below fans out those requests, merges the two data sources, and returns 20 fields per package as clean JSON for $0.0015 per package (~$1.50 per 1,000).

Every JS/TS codebase eventually faces the same audit: which of our 47 direct dependencies are deprecated? Which have fallen below the weekly-download threshold that signals community abandonment? Which carry a license legal won't approve? The data exists — npm publishes it openly — but assembling it at scale is the friction point nobody mentions.

The registry serves one package at a time, the download counts live on a separate host entirely, and the author shapes have been inconsistent since 2012. Merge all that, handle the rate limits, and you've got an afternoon of plumbing before a single line of business logic. Here's what that plumbing looks like — and how to skip it.

What is the npm registry? 🔎

The npm registry is the package repository that ships JavaScript and TypeScript dependencies to the world. Operated by GitHub (Microsoft) since 2020, it hosts millions of public packages and serves billions of install requests per week. Every package in node_modules that didn't come from a private registry traces back to registry.npmjs.org.

The registry exposes two public data surfaces:

registry.npmjs.org/<package> — full package document: versions, dist-tags, maintainers, dependencies, engines, repository, license, README, and the deprecated message if one has been set.
api.npmjs.org/downloads/point/last-week/<package> — download counts for the trailing 7 days.

What it does not expose: a bulk endpoint, a metadata-search API that filters by license or deprecation status, or a "give me everything for these 500 packages" call.

Does npm have a bulk metadata API?

No. As of 2026, npm publishes no official endpoint that returns full metadata for multiple packages in a single request. The registry accepts one package name per URL path. There is a separate /-/v1/search endpoint for text search, but it returns only a subset of fields and enforces a result cap per query — it cannot be used for arbitrary bulk lookups.

The Actor below works around the missing bulk surface: it fans out individual registry calls at configurable concurrency, pulls each package's last-week download count from api.npmjs.org, merges both payloads, and returns a unified row per package.
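The fan-out-and-merge pattern described above can be sketched with asyncio. This is a minimal illustration under stated assumptions, not the Actor's source: `enrich`, `fetch_meta`, and `fetch_downloads` are hypothetical names, and the fetchers are injected so the example runs offline against stand-in data.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

# A fetcher takes a package name and returns parsed JSON, or None on a 404.
Fetcher = Callable[[str], Awaitable[Optional[dict]]]


async def enrich(
    names: List[str],
    fetch_meta: Fetcher,
    fetch_downloads: Fetcher,
    concurrency: int = 8,
) -> List[dict]:
    """Fetch both payloads per package and merge them into one row each."""
    sem = asyncio.Semaphore(concurrency)  # caps packages in flight, not raw requests

    async def one(name: str) -> Optional[dict]:
        async with sem:
            meta, downloads = await asyncio.gather(
                fetch_meta(name), fetch_downloads(name)
            )
        if meta is None:  # registry 404: nothing to emit for this name
            return None
        return {
            "name": name,
            "version": meta.get("version"),
            # The downloads host can 404 on its own; keep the row, null the count.
            "weekly_downloads": downloads.get("downloads") if downloads else None,
        }

    rows = await asyncio.gather(*(one(n) for n in names))
    return [row for row in rows if row is not None]


# Stand-in fetchers with made-up data, so the sketch runs without network access.
async def fake_meta(name: str) -> Optional[dict]:
    await asyncio.sleep(0)
    return None if name == "missing-pkg" else {"version": "1.0.0"}


async def fake_downloads(name: str) -> Optional[dict]:
    await asyncio.sleep(0)
    return None if name == "no-stats" else {"downloads": 42, "package": name}


if __name__ == "__main__":
    names = ["a", "no-stats", "missing-pkg"]
    print(asyncio.run(enrich(names, fake_meta, fake_downloads)))
```

Holding the semaphore around the pair of requests keeps the two payloads for a package together, at the cost of up to two requests per slot. A real fetcher would replace the fakes with HTTP calls to the two hosts.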

What the data looks like

Every package produces one flat, typed row. Here is a real example for express (dependency maps abridged):

{
  "name": "express",
  "version": "4.21.2",
  "description": "Fast, unopinionated, minimalist web framework",
  "homepage": "https://expressjs.com/",
  "license": "MIT",
  "author": "TJ Holowaychuk <tj@vision-media.ca>",
  "maintainers": ["dougwilson", "wesleytodd", "linusu", "ulisesgascon"],
  "keywords": ["express", "framework", "sinatra", "web", "rest", "restful", "router", "app", "api"],
  "repository_url": "git+https://github.com/expressjs/express.git",
  "bugs_url": "https://github.com/expressjs/express/issues",
  "dist_tarball": "https://registry.npmjs.org/express/-/express-4.21.2.tgz",
  "engines": {"node": ">= 0.10.0"},
  "dependencies": {"accepts": "~1.3.8", "array-flatten": "1.1.1"},
  "dev_dependencies": {"eslint": "8.57.1", "mocha": "10.7.3"},
  "peer_dependencies": null,
  "deprecated": null,
  "weekly_downloads": 34000000,
  "package_url": "https://www.npmjs.com/package/express",
  "published_at": "2024-03-25T20:12:51.000Z",
  "scraped_at": "2026-05-31T00:00:00+00:00"
}

Twenty fields, Pydantic-validated before writing. The dependencies and dev_dependencies maps come out as objects keyed by package name — ready for graph analysis, diff comparison, or direct insertion into a PostgreSQL jsonb column.
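Because the dependency maps arrive as plain name-to-range objects, comparing two snapshots takes only a few dictionary operations. A minimal sketch: `diff_dependencies` is a hypothetical helper, and the "before" map is invented for illustration.

```python
from typing import Optional


def diff_dependencies(old: Optional[dict], new: Optional[dict]) -> dict:
    """Compare two name -> semver-range maps; None is treated as empty."""
    old, new = old or {}, new or {}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": {
            name: (old[name], new[name])
            for name in sorted(old.keys() & new.keys())
            if old[name] != new[name]
        },
    }


# "before" is invented; "after" reuses the abridged map from the express row.
before = {"accepts": "~1.3.7", "array-flatten": "1.1.1", "qs": "6.11.0"}
after = {"accepts": "~1.3.8", "array-flatten": "1.1.1"}
print(diff_dependencies(before, after))
```

The `old or {}` guard matters because a row can carry null for a dependency map, as peer_dependencies does in the example.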

The naive approach (and why it falls apart) ⚙️

The obvious starting point is two requests.get() calls per package in a loop. It breaks before you've enriched a real dependency tree. Here is what goes wrong:

1. Two hosts, two data shapes, one row. The registry (registry.npmjs.org) and the downloads service (api.npmjs.org) are separate services with different response formats. A naive script has to fetch both, line them up by package name, and reconcile the cases where one host answers and the other 404s. We fetch both per package, bound the parallelism with a semaphore so we don't hammer either host, and emit a single merged row.

2. The author field is not a string. The npm registry's author field varies by package convention: sometimes a string, sometimes an object with name, email, and url sub-keys, sometimes missing entirely. Code that assumes package["author"] is a string fails silently on a large share of packages. We normalise this to a display string consistently across all three shapes.

3. Scoped packages and URL-encoding. A package named @apify/sdk must be URL-encoded as %40apify%2Fsdk in the request path. Getting this wrong produces a 404 with no obvious error to grep for. Our request layer encodes the name correctly before every call.

4. Datacenter IPs get throttled. The registry is permissive from a laptop; at scrape scale from datacenter IPs you get rate-limited. We send requests through curl-cffi, which presents a real Chrome / Firefox / Safari TLS handshake instead of a Python one, and we thread Apify Proxy through the request layer — residential exit IPs on paid plans, datacenter fallback on free-tier runs.

5. The result is what you paid for. We push Pydantic-validated ResultRow objects. If a field is nullable in the registry response, it arrives as null in your dataset, never as an absent key and never as an empty string masking a missing value. No data, no per-result charge: the only fixed cost is the $0.005 actor-start fee, and if no packages land in the dataset you pay nothing beyond it.

None of this is hard in isolation. All of it is plumbing you don't want to own.
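Points 2 and 3 are the easiest to show in code. The sketch below is one way to handle them with the standard library only. The helper names are hypothetical, and the "Name <email> (url)" display format is an assumption modelled on the author string in the express row, not a confirmed detail of the Actor.

```python
from typing import Optional, Union
from urllib.parse import quote

REGISTRY = "https://registry.npmjs.org"


def registry_url(name: str) -> str:
    """Percent-encode the whole name so '@' and '/' in scoped names are escaped."""
    return f"{REGISTRY}/{quote(name, safe='')}"


def normalise_author(author: Union[str, dict, None]) -> Optional[str]:
    """Collapse the three author shapes (string, object, missing) to one string."""
    if not author:
        return None
    if isinstance(author, str):
        return author.strip() or None
    if isinstance(author, dict):
        name = (author.get("name") or "").strip()
        parts = [name] if name else []
        if author.get("email"):
            parts.append(f"<{author['email']}>")
        if author.get("url"):
            parts.append(f"({author['url']})")
        return " ".join(parts) or None
    return None  # unexpected shape: treat as missing rather than guess


print(registry_url("@apify/sdk"))  # https://registry.npmjs.org/%40apify%2Fsdk
print(normalise_author({"name": "TJ Holowaychuk", "email": "tj@vision-media.ca"}))
```

Passing `safe=''` to `quote` is the important detail: by default it leaves `/` unescaped, which would leave the scoped name split across two path segments.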

The Actor 🔥

The Actor is on the Apify Store: apify.com/DevilScrapes/npm-package-scraper.

Paste a list of package names in the Apify Console and click Start — or call it programmatically:

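from apify_client import ApifyClient

# Authenticate with your personal API token from the Apify Console.
client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and block until the run finishes.
run = client.actor("DevilScrapes/npm-package-scraper").call(
    run_input={
        "packages": [
            "express",
            "react",
            "@apify/sdk",  # scoped names are passed verbatim
            "lodash",
            "axios",
        ],
        "includeDownloads": True,
        "concurrency": 8,
    }
)

# Each dataset item is one merged row per package.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["name"], item["version"], item["weekly_downloads"])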

Key input parameters from the input schema:

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `packages` | array | `["express","react","@apify/sdk"]` | Scoped names (`@scope/name`) work verbatim |
| `includeDownloads` | boolean | `true` | Adds one request to the downloads API per package |
| `concurrency` | integer | `8` | Parallel registry fetches (max 32) |
| `proxyConfiguration` | object | `{useApifyProxy: false}` | Enable residential proxies for high-volume runs |

The Actor streams results into the run's dataset as each package resolves — you can watch rows arrive in the Apify Console before the run finishes.

Use cases

Dependency audit before adoption. Run your shortlist of candidate libraries through the Actor and cross-reference deprecated, weekly_downloads, published_at, and license in one pass instead of opening five npmjs.com tabs. A package with 400 weekly downloads, a deprecation message, and a last-publish date of 2019 tells you everything you need to know in three fields.
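A first-pass audit over the returned rows might look like the sketch below. The function name and both thresholds (1,000 weekly downloads, two years since last publish) are illustrative assumptions to tune for your own policy, and the two sample rows are invented.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def audit_flags(
    row: dict,
    now: Optional[datetime] = None,
    min_weekly_downloads: int = 1_000,  # assumed threshold
    max_age_days: int = 730,  # assumed threshold
) -> List[str]:
    """Return the audit warnings that apply to one result row."""
    now = now or datetime.now(timezone.utc)
    flags = []
    if row.get("deprecated"):
        flags.append("deprecated")
    downloads = row.get("weekly_downloads")
    if downloads is not None and downloads < min_weekly_downloads:
        flags.append("low-downloads")
    published = row.get("published_at")
    if published:
        # Older Python versions do not parse a trailing "Z", so swap it for an offset.
        published_dt = datetime.fromisoformat(published.replace("Z", "+00:00"))
        if now - published_dt > timedelta(days=max_age_days):
            flags.append("stale")
    return flags


# Invented rows for illustration; "now" is pinned so the result is reproducible.
now = datetime(2026, 5, 31, tzinfo=timezone.utc)
abandoned = {
    "deprecated": "no longer maintained",
    "weekly_downloads": 400,
    "published_at": "2019-06-01T00:00:00.000Z",
}
healthy = {
    "deprecated": None,
    "weekly_downloads": 34_000_000,
    "published_at": "2026-01-10T00:00:00.000Z",
}
print(audit_flags(abandoned, now=now))  # ['deprecated', 'low-downloads', 'stale']
print(audit_flags(healthy, now=now))  # []
```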

License compliance sweep. Legal teams need an SPDX license identifier for every dependency. The Actor returns license as a string ("MIT", "Apache-2.0", "GPL-3.0") or null when the registry entry doesn't declare one — which is itself a flag. Feed the output to a policy script: [p for p in packages if p["license"] not in APPROVED_LICENSES].
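The one-line filter above lumps undeclared licenses in with rejected ones, because null is never in the approved set. If legal wants those reported separately, a slightly longer sketch does it. The approved set and the sample rows here are placeholders, not a recommendation.

```python
from typing import Dict, Iterable, List

APPROVED_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause", "ISC"}  # example policy only


def license_report(rows: Iterable[dict]) -> Dict[str, List[str]]:
    """Bucket package names by license status."""
    report: Dict[str, List[str]] = {"approved": [], "rejected": [], "undeclared": []}
    for row in rows:
        license_id = row.get("license")
        if license_id is None:
            report["undeclared"].append(row["name"])  # a null license is itself a flag
        elif license_id in APPROVED_LICENSES:
            report["approved"].append(row["name"])
        else:
            report["rejected"].append(row["name"])
    return report


# Invented rows for illustration.
sample_rows = [
    {"name": "express", "license": "MIT"},
    {"name": "some-gpl-lib", "license": "GPL-3.0"},
    {"name": "mystery-pkg", "license": None},
]
print(license_report(sample_rows))
```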

SDK competitive intelligence. Schedule a weekly run against your own package and three competitors, push the output to a time-series table, and alert when a competitor's trajectory crosses yours. The downloads number is noisy at roughly ±5% (npm's acknowledged bot-filtering variance), but the trend is reliable.
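One way to turn "trajectory crosses yours" into an alert is to ignore leads smaller than the noise band. The sketch below assumes two aligned lists of weekly counts pulled from your time-series table. The function name, the 5% default margin, and the sample numbers are illustrative.

```python
from typing import Optional, Sequence


def first_crossover(
    ours: Sequence[int], theirs: Sequence[int], margin: float = 0.05
) -> Optional[int]:
    """Index of the first week the competitor leads by more than `margin`,
    after having been level or behind in some earlier week."""
    seen_behind = False
    for week, (our_count, their_count) in enumerate(zip(ours, theirs)):
        if their_count <= our_count:
            seen_behind = True
        elif seen_behind and their_count > our_count * (1 + margin):
            return week
    return None


# Invented weekly download counts.
ours = [100, 100, 100, 100]
theirs = [80, 95, 103, 120]
print(first_crossover(ours, theirs))  # 3: week 2's 3% lead is inside the noise band
```

Setting `margin=0` reports the raw crossing instead, which for this sample is week 2.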

Supply-chain enrichment. Tools like Socket, Snyk, and Tidelift build risk signals on top of the registry. The Actor returns the raw data — maintainers, repository URL, engines constraint, the full dependency and dev-dependency maps — as structured rows you can join against a CVE database or a known-maintainer graph.

RAG knowledge graph over npm. The description, keywords, and repository_url fields make a reasonable embedding target for a "which npm package does X?" retrieval system. A run over the top 5,000 packages gives you a base corpus for about $7.51 in Actor fees.

Pricing — exact numbers 💰

Pay-per-event. You pay a flat start fee per run plus a per-package fee for each row written; packages that 404 or get skipped cost nothing.

| Event | USD |
| --- | --- |
| `actor-start` (once per run) | $0.005 |
| `result` (per package written) | $0.0015 |

| Volume | Cost |
| --- | --- |
| 100 packages | $0.16 |
| 1,000 packages | $1.51 |
| 10,000 packages | $15.01 |
| 100,000 packages | $150.01 |

Apify's $5 free trial credit covers your first ~3,300 packages with no credit card required.
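The arithmetic behind those figures is one start fee plus the per-result fee times the package count. The sketch below reproduces the volume table and the trial-credit estimate. It assumes a single run and rounds half-up to the cent, which is what the table's figures imply; the function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

ACTOR_START = Decimal("0.005")  # charged once per run
PER_RESULT = Decimal("0.0015")  # charged per package written


def run_cost(packages: int) -> Decimal:
    """Cost in USD of one run that writes `packages` rows, rounded to the cent."""
    total = ACTOR_START + PER_RESULT * packages
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def packages_for_budget(budget_usd: str) -> int:
    """How many packages a single run can write within a budget."""
    remaining = Decimal(budget_usd) - ACTOR_START
    return max(0, int(remaining // PER_RESULT))


for volume in (100, 1_000, 10_000, 100_000):
    print(volume, run_cost(volume))
print(packages_for_budget("5"))  # 3330, in line with the ~3,300 quoted above
```

Decimal avoids the float rounding surprises that show up when a total lands exactly on half a cent, as every row in the table does.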

The technically interesting bit

The detail most "just curl the registry" snippets miss is the TLS handshake. A plain requests/httpx client negotiates a TLS and HTTP/2 fingerprint that screams "Python script", and that fingerprint is exactly what edge throttling keys on long before it ever inspects a User-Agent. This Actor opens its session through curl-cffi, which impersonates a real browser's TLS+H2 fingerprint, and it picks one of four profiles at random per run: chrome131, chrome124, firefox147, or safari180. So the roughly 2,000 requests in a 1,000-package run carry a browser fingerprint rather than a Python one, and successive runs don't all share the same profile. That is the difference between a run that completes and a run that stalls at HTTP 429, and it's the part that's genuinely tedious to reproduce from scratch.
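A rough sketch of per-run profile selection with curl-cffi follows. The profile names are copied from the paragraph above. Whether your installed curl-cffi release accepts every one of them is an assumption to check against its documentation. `pick_profile` and `open_session` are hypothetical helpers, not the Actor's code.

```python
import random
from typing import Optional

# Names taken from the article; support depends on the installed curl-cffi version.
PROFILES = ("chrome131", "chrome124", "firefox147", "safari180")


def pick_profile(rng: Optional[random.Random] = None) -> str:
    """Choose one impersonation profile for the whole run."""
    return (rng or random).choice(PROFILES)


def open_session(profile: str):
    """Open a curl-cffi session that presents the chosen browser fingerprint."""
    # Imported lazily so the rest of the sketch runs without curl-cffi installed.
    from curl_cffi import requests

    return requests.Session(impersonate=profile)


profile = pick_profile()
print(f"this run would impersonate: {profile}")
# session = open_session(profile)
# session.get("https://registry.npmjs.org/express")
```

Picking the profile once and reusing the session keeps the fingerprint stable within a run, which is closer to how a single browser behaves than rotating on every request.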

Limitations 🚧

latest dist-tag only. The Actor resolves each package to its latest version. Per-version metadata and version history are not in scope.
deprecated is version-specific. The flag reflects whether the latest version carries a deprecation notice — not whether any older version does. A package with deprecated v1 and an active v2 shows deprecated: null.
Download counts are noisy. The npm downloads API filters out install automation from CI systems inconsistently. Use the weekly number as a trend signal, not an exact count.
No keyword-based search. You supply package names; the Actor does not discover packages matching a description. Use npm's /-/v1/search directly for discovery, then feed the names to this Actor for enrichment.
One request pair per package. Each package costs one registry call plus, when includeDownloads is on, one downloads call. Very large runs (50k+ packages) take proportionally longer in wall-clock time — the per-result price stays the same, but the clock keeps ticking.
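The first two limitations follow directly from how a registry document is laid out: the `dist-tags` map names the latest version, and the deprecation message lives on each entry under `versions`. A minimal sketch against a trimmed, invented document shows why a deprecated v1 is invisible once v2 is latest. `latest_summary` is a hypothetical helper.

```python
def latest_summary(packument: dict) -> dict:
    """Resolve the latest dist-tag and read fields from that version only."""
    latest = packument.get("dist-tags", {}).get("latest")
    manifest = packument.get("versions", {}).get(latest, {})
    return {
        "name": packument.get("name"),
        "version": latest,
        # None unless the latest version itself carries a deprecation message.
        "deprecated": manifest.get("deprecated"),
        "published_at": packument.get("time", {}).get(latest),
    }


# Trimmed, invented registry document: v1 is deprecated, v2 is not.
packument = {
    "name": "demo-pkg",
    "dist-tags": {"latest": "2.0.0"},
    "versions": {
        "1.0.0": {"deprecated": "upgrade to v2"},
        "2.0.0": {},
    },
    "time": {
        "1.0.0": "2019-06-01T00:00:00.000Z",
        "2.0.0": "2025-01-01T00:00:00.000Z",
    },
}
print(latest_summary(packument))
```

To catch deprecations on older versions you would iterate over every entry in `versions` instead, which is outside what the Actor returns.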

FAQ

Is scraping the npm registry legal?
The npm registry is a public package repository. All data returned by this Actor is available without authentication to anyone who makes an HTTP request to registry.npmjs.org, and we read only that public metadata at a paced request rate. As always, review your own jurisdiction and use case.

Can I export to Google Sheets or a database?
Yes — export CSV / Excel / JSON from the Apify Console, connect the dataset via webhook on ACTOR.RUN.SUCCEEDED to Make, Zapier, or n8n, or pull rows programmatically with the Apify API.

Is there an official npm bulk metadata API?
No. The npm registry serves one package per request, and the download counts live on a separate host (api.npmjs.org). This Actor abstracts both surfaces and merges the results into one row per package so you don't have to.

Does it handle scoped packages like @types/node or @apify/sdk?
Yes. Pass the literal scoped name (@scope/name) in the packages array; the Actor handles the URL-encoding correctly for both the registry and the downloads API.

Try it