Devil Scrapes

Posted on Jun 2

PubMed scraper: structured biomedical papers from NCBI for $2/1K

#webscraping #python #apify #datascience

Quick answer: A PubMed scraper turns any PubMed search query into structured JSON rows — PMID, DOI, title, abstract, authors, journal, MeSH terms, publication date. You get there via NCBI's E-utilities API (esearch → efetch), which returns XML that requires careful pagination, rate-limit pacing, and multi-call orchestration to drive reliably. The Apify Actor below absorbs that choreography and ships Pydantic-validated typed rows at $0.002 per paper (~$2.00 per 1,000).

The PubMed database holds more than 36 million biomedical citations. It is where clinical-AI engineers source training corpora, where pharmacovigilance teams track drug-mention frequency, and where bioinformaticians assemble ground truth for retrieval-augmented generation (RAG) pipelines. The data is public. Getting it programmatically without hitting quota walls is not. Here is what the extraction actually takes.

What is PubMed? 🔎

PubMed is a free, publicly accessible bibliographic database maintained by the U.S. National Library of Medicine (NLM) at NCBI. It indexes citations and abstracts primarily from MEDLINE, life-science journals, and online biomedical books. Each record carries structured metadata: unique PMID, DOI, MeSH vocabulary terms assigned by trained NLM indexers, author affiliations, journal ISO abbreviation, and (where available) a PMCID linking to full text in PubMed Central.

What PubMed gives you per record:

A canonical PMID (permanent stable identifier) and DOI
Full author list in Last-Initial format, preserving authorship order
Abstract text — sometimes structured with labeled headings (Background, Methods, Results, Conclusions)
MeSH headings — the controlled vocabulary that makes PubMed genuinely searchable across synonymous terminology
Journal full name, ISO abbreviation, and best-available publication date

What PubMed does not give you: full-text PDFs (those live on publisher sites), citation graphs, or an official endpoint that bulk-exports the results of a search.

Does PubMed have a bulk download API? 📦

No, not for search results. NLM does publish the whole database as baseline and daily-update XML files, but nothing exports the records matching a query in one call. For that, NCBI provides E-utilities, a set of HTTP endpoints (esearch, efetch, esummary) for programmatic access. But E-utilities is a pipeline, not a download button. Driving it at scale means running an esearch to collect PMIDs, chunking those IDs into batches for efetch, parsing XML (which differs between article types and changes when NCBI updates its DTD), and staying inside the published rate limit: 3 requests/second without an API key, 10/second with one. Miss any step and you get a 429, an empty result, or silently malformed records.
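
As a rough sketch of that pipeline's shape (not the Actor's actual code), the request plan for one query can be laid out with the standard library alone. The endpoint URLs and the db, term, retmax, retmode, id, and api_key parameters are the documented E-utilities ones; the helper names and the batch size of 200 are assumptions made for illustration.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=2000, api_key=None):
    """Build the esearch request that returns matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def chunked(pmids, size=200):
    """Yield consecutive batches of PMIDs (batch size is an assumed value)."""
    for start in range(0, len(pmids), size):
        yield pmids[start:start + size]


def efetch_url(batch, api_key=None):
    """Build the efetch request that returns one batch of records as XML."""
    params = {"db": "pubmed", "id": ",".join(batch), "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def min_interval(api_key=None):
    """Seconds to wait between requests: 10/s with a key, 3/s without."""
    return 1 / 10 if api_key else 1 / 3
```

Each URL from efetch_url would then be requested no faster than min_interval() seconds apart, counting the esearch call against the same budget.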

What the data looks like

Every paper returns as one flat typed row:

{
  "pmid": "38462034",
  "pmcid": "PMC10987321",
  "doi": "10.1038/s41591-024-02847-3",
  "title": "Foundation models for biomedical image segmentation: a systematic review",
  "abstract": "BACKGROUND: Large vision-language foundation models ...\n\nMETHODS: We systematically reviewed ...",
  "authors": ["Zhang W", "Patel A", "Komninos N", "Chen X"],
  "journal": "Nature Medicine",
  "journal_iso": "Nat Med",
  "publication_types": ["Journal Article", "Review", "Systematic Review"],
  "mesh_terms": ["Deep Learning", "Image Segmentation", "Medical Informatics"],
  "keywords": ["foundation models", "biomedical imaging", "SAM"],
  "pub_date": "2024-03-11",
  "pubmed_url": "https://pubmed.ncbi.nlm.nih.gov/38462034/",
  "scraped_at": "2026-05-31T10:14:22+00:00"
}

Fourteen fields. abstract preserves structured headings when they exist. mesh_terms carries the NLM-assigned controlled vocabulary — invaluable for filtering without hand-tuning synonyms. Every field is Pydantic-validated; nullable fields surface null rather than an empty string when NCBI omits them.
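
The null-versus-empty-string convention can be illustrated with a small stand-in. This sketch uses a standard-library dataclass instead of Pydantic so it runs without dependencies; the class name PaperRow, the subset of fields, the numeric-PMID check, and the blank_to_none helper are all hypothetical and are not the Actor's real model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def blank_to_none(value):
    """Map missing or whitespace-only strings to None."""
    if value is None:
        return None
    value = value.strip()
    return value or None


@dataclass
class PaperRow:
    # Required identifiers and title.
    pmid: str
    title: str
    # Nullable fields: None when NCBI omits them, never "".
    doi: Optional[str] = None
    pmcid: Optional[str] = None
    abstract: Optional[str] = None
    # List fields default to empty lists.
    authors: List[str] = field(default_factory=list)
    mesh_terms: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Assumed validation rule for this sketch: PMIDs are all digits.
        if not self.pmid.isdigit():
            raise ValueError(f"PMID must be numeric, got {self.pmid!r}")
        for name in ("doi", "pmcid", "abstract"):
            setattr(self, name, blank_to_none(getattr(self, name)))


# A record whose DOI and abstract came back empty from the source.
row = PaperRow(pmid="38462034", title="Example title", doi="", abstract="   ")
```

Here row.doi and row.abstract end up as None, which is what lets downstream code test for a missing value with a plain `is None` check.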

The naive approach (and why it falls apart)

The first thing an engineer tries: call esearch.fcgi with the query, collect the PMID list, loop through efetch.fcgi with each ID, parse the XML. It works on a one-off run for 100 records.

It breaks at production scale for three interconnected reasons.

1. Rate-limit choreography across two endpoints. NCBI enforces its 3-req/sec (or 10-req/sec with a key) limit across the combination of esearch and efetch calls. Batching naively without inter-call pacing triggers HTTP 429s, and NCBI's 429 does not always include a Retry-After header — so the backoff has to be inferred from the limit itself. We pace between efetch batches and retry up to 5 times with exponential backoff on 408/429/5xx, then surface a partial-success message and stop cleanly rather than silently underdelivering.
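
A minimal sketch of that retry policy, assuming the request itself is wrapped in a zero-argument callable that returns a (status, body) pair. The function names, the doubling schedule seeded from the rate-limit interval, and the 30-second ceiling are assumptions for illustration; only the retryable status codes and the five-retry budget come from the description above.

```python
import time

RETRYABLE_EXACT = {408, 429}


def is_retryable(status):
    """408, 429, and any 5xx are worth retrying; everything else is final."""
    return status in RETRYABLE_EXACT or 500 <= status <= 599


def backoff_delay(attempt, min_interval, cap=30.0):
    """Delay before retry number `attempt` (0-based), doubling each time.

    With no Retry-After header to rely on, the delay is derived from the
    rate-limit interval itself: 2x, 4x, 8x ... up to `cap` seconds.
    """
    return min(cap, min_interval * 2 ** (attempt + 1))


def fetch_with_retry(send, min_interval=1 / 3, max_retries=5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or retries run out.

    Returns the last (status, body) pair so the caller can report a
    partial success instead of failing silently.
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if not is_retryable(status):
            return status, body
        if attempt < max_retries:
            sleep(backoff_delay(attempt, min_interval))
    return status, body
```

Passing `sleep` in as a parameter keeps the policy testable: a test can record the requested delays instead of actually waiting.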

2. XML parsing drift. PubMed's efetch XML does not follow a single fixed schema across record types. Review articles, letters, errata, and retracted papers all embed the metadata differently. A naive article.findtext("AbstractText") fails on structured abstracts, which use multiple <AbstractText Label="BACKGROUND"> child nodes. We walk the element tree and join labeled sections, flatten mixed-content titles with superscripts and special characters, and guard every field with a None fallback so a malformed record drops gracefully rather than crashing the run.
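
To make the structured-abstract case concrete, here is a small sketch using the standard library's ElementTree. The sample XML is hand-written for illustration, not a real PubMed record, and the function names are hypothetical; the element names (ArticleTitle, Abstract, AbstractText with a Label attribute) follow the efetch XML described above.

```python
import xml.etree.ElementTree as ET


def flatten(element):
    """Join all text inside an element, including nested tags such as
    <sup> or <i>, and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())


def extract_title(article):
    """Return the flattened title, or None when the node is missing."""
    node = article.find(".//ArticleTitle")
    return flatten(node) if node is not None else None


def extract_abstract(article):
    """Join labeled abstract sections as 'LABEL: text' paragraphs.

    Returns None when the record has no abstract text at all.
    """
    sections = []
    for node in article.findall(".//Abstract/AbstractText"):
        text = flatten(node)
        if not text:
            continue
        label = node.get("Label")
        sections.append(f"{label}: {text}" if label else text)
    return "\n\n".join(sections) or None


# Hand-written sample record for illustration only.
SAMPLE = """<PubmedArticle><MedlineCitation><Article>
<ArticleTitle>Role of Ca<sup>2+</sup> signalling in <i>Drosophila</i></ArticleTitle>
<Abstract>
<AbstractText Label="BACKGROUND">Calcium matters.</AbstractText>
<AbstractText Label="METHODS">We measured it.</AbstractText>
</Abstract>
</Article></MedlineCitation></PubmedArticle>"""

article = ET.fromstring(SAMPLE)
print(extract_title(article))
print(extract_abstract(article))
```

Using itertext() rather than the element's .text attribute is what keeps the text after an inline tag from being lost.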

3. TLS fingerprinting. Even though E-utilities is an official NCBI endpoint, the infrastructure sits behind standard network inspection. We rotate through real browser TLS fingerprints via curl-cffi so the ClientHello looks like a genuine browser, not a Python HTTP library. We thread Apify residential proxies with sticky sessions and rotate the session on every block. We fail loud with a set_status_message if we exhaust retries — you never get a silent empty dataset.

None of this is exotic engineering. All of it is what sits between "a script that worked once" and a feed that runs weekly and doesn't page you at 2am.

The Actor ⚙️

I packaged the result as an Apify Actor: PubMed Papers Scraper.

Paste your search query in the Apify Console and click Start, or drive it from Python:

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and block until the run finishes.
run = client.actor("DevilScrapes/pubmed-papers-scraper").call(
    run_input={
        "searchQuery": "CRISPR Cas9 off-target[Title] AND 2023:2025[PDat]",
        "maxResults": 500,
        "sortBy": "most_recent",
        "proxyConfiguration": {"useApifyProxy": True},
    }
)

# Stream the rows from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["pmid"], item["title"])

The searchQuery field accepts the full PubMed query syntax — field tags like [Author], [Title], [MeSH], and date qualifiers like [PDat] all work. Sort by relevance, most_recent, or pub_date. Maximum 2,000 results per run; attach your NCBI API key to lift throughput from 3 req/s to 10 req/s.

Use cases

Literature review bootstrapping. Feed a systematic-review protocol query (e.g., "myocardial infarction" AND "machine learning"[Title] AND 2020:2025[PDat]) and get 500–2,000 candidate abstracts as JSON in minutes, ready to drop into a LangChain + ChromaDB pipeline.

Pharma-signal monitoring. Search by drug INN (International Nonproprietary Name) combined with the adverse-effects MeSH subheading, e.g., semaglutide[Supplementary Concept] AND "adverse effects"[MeSH Subheading], and schedule a weekly run. The pub_date field makes diffing trivial.

Author publication tracking. Query Shah N[Author] and schedule daily runs to detect new publications. The stable pmid field lets you diff against yesterday's dataset without deduplication logic.
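
The daily diff described here reduces to a set lookup on pmid. A minimal sketch, assuming both runs' items are already loaded as lists of dicts shaped like the sample row; the function name and the placeholder rows are hypothetical.

```python
def new_papers(today, yesterday):
    """Return rows from `today` whose PMID was absent from `yesterday`,
    preserving the order in which today's run delivered them."""
    seen = {row["pmid"] for row in yesterday}
    return [row for row in today if row["pmid"] not in seen]


# Placeholder rows for illustration only.
yesterday_rows = [{"pmid": "100", "title": "Older paper"}]
today_rows = [
    {"pmid": "101", "title": "New paper"},
    {"pmid": "100", "title": "Older paper"},
]

for row in new_papers(today_rows, yesterday_rows):
    print(row["pmid"], row["title"])
```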

MeSH-based corpus assembly. Filter by a MeSH heading — "Neoplasms"[MeSH] AND "Deep Learning"[MeSH] — and pull every indexed paper at that intersection. The mesh_terms field lets you filter further downstream without re-querying NCBI.

RAG dataset seeding. Pull 10k–50k abstracts for a disease area, split across runs of up to 2,000 results each, and push straight to a HuggingFace Dataset. The structured output (title, abstract, authors, DOI, journal, MeSH) is exactly what a biomedical embedding model needs as a fine-tuning corpus.

Pricing — exact numbers 💰

Pay-per-event. You pay for papers that land in your dataset, nothing for papers that don't.

Event	Price
Actor start (one-off per run)	$0.005
Per paper written to dataset	$0.002

Pull	Runs at the 2,000-per-run cap	Cost
100 papers	1	$0.21
1,000 papers	1	$2.01
10,000 papers	5	$20.03
50,000 papers (monthly sweep)	25	$100.13

Apify's $5 free trial credit covers your first ~2,500 papers with no credit card. You're paying for the engineering layer, not a data licence — PubMed itself is free.
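
The arithmetic behind these figures is short enough to write down. This is a sketch under two assumptions: the start fee is charged once per run, as the event table states, and a pull larger than the 2,000-result cap is split into the fewest possible runs. Function names are hypothetical.

```python
from decimal import Decimal
from math import ceil

START_FEE = Decimal("0.005")  # charged once per run
PER_PAPER = Decimal("0.002")  # charged per paper written to the dataset


def runs_needed(papers, max_per_run=2000):
    """Fewest runs that can deliver `papers` rows under the per-run cap."""
    return max(1, ceil(papers / max_per_run))


def pull_cost(papers, max_per_run=2000):
    """Total cost in dollars: one start fee per run plus the per-paper fee."""
    return runs_needed(papers, max_per_run) * START_FEE + papers * PER_PAPER


for papers in (100, 1000, 10000, 50000):
    print(papers, runs_needed(papers), pull_cost(papers))
```

Under those assumptions the start fee is a rounding error at every scale: a 10,000-paper pull split into five runs pays five start fees, a few cents on top of the $20.00 in per-paper charges.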

The part worth calling out specifically

The esearch → efetch two-step is not a minor inconvenience. esearch returns only PMIDs, at most 10,000 per call (the retmax ceiling); usehistory=y keeps the result set on NCBI's history server so later efetch calls can page through it. To get actual metadata you must issue a separate efetch per batch and parse XML (NCBI provides no JSON mode for efetch). When a PMID exists in esearch but returns a malformed or retracted record in efetch, which happens, a naive parser either crashes or silently drops the record. We validate every record against the Pydantic ResultRow model before it is pushed; a record that fails validation is logged at WARNING level and skipped, so nothing disappears without a trace.

MeSH extraction is also non-trivial: <MeshHeading> nodes contain a <DescriptorName> (the main heading) plus zero or more <QualifierName> children. We extract the DescriptorName in v1, which is sufficient for most corpus-assembly use cases.
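
Descriptor-only extraction can be sketched in a few lines of ElementTree. The sample XML below is hand-written for illustration and the function name is hypothetical; the element names are the ones described above.

```python
import xml.etree.ElementTree as ET


def extract_mesh_terms(article):
    """Collect the DescriptorName of every MeshHeading, ignoring qualifiers."""
    terms = []
    for heading in article.iter("MeshHeading"):
        descriptor = heading.find("DescriptorName")
        if descriptor is not None and descriptor.text:
            terms.append(descriptor.text.strip())
    return terms


# Hand-written sample for illustration only.
SAMPLE = """<MedlineCitation><MeshHeadingList>
<MeshHeading>
  <DescriptorName>Deep Learning</DescriptorName>
</MeshHeading>
<MeshHeading>
  <DescriptorName>Neoplasms</DescriptorName>
  <QualifierName>diagnostic imaging</QualifierName>
</MeshHeading>
</MeshHeadingList></MedlineCitation>"""

print(extract_mesh_terms(ET.fromstring(SAMPLE)))
```

Dropping the qualifiers keeps mesh_terms a flat list of strings; keeping them would mean emitting descriptor/qualifier pairs, which changes the field's shape.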

Limitations 🚧

Metadata only. We call E-utilities (esearch + efetch). Citation graphs (which papers cite which) are not returned by either endpoint and are out of scope.
Maximum 2,000 results per run. NCBI's retmax parameter caps at 10,000 but the Actor caps each run at 2,000 for reliable delivery. Larger pulls can be batched by date range across multiple runs.
Some abstracts are absent. Older records, letters, and errata sometimes have no abstract text. The Actor surfaces null for those rather than fabricating or omitting the row.
MeSH enrichment lag. NLM indexers assign MeSH terms after publication, so very recent papers (roughly the last 4–8 weeks) may have an empty mesh_terms array until indexing completes.
No full-text. Full text lives on publisher sites and requires separate licensing. We surface pmcid when available for downstream PubMed Central lookups.
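
For the date-range batching mentioned in the limitations above, one simple way to slice a large pull is one query per publication year. This is a sketch, not the Actor's own logic: the helper names are hypothetical, and the input keys mirror the searchQuery and maxResults fields shown in the Python example earlier.

```python
def yearly_queries(base_query, first_year, last_year):
    """Build one query per publication year, each meant for its own run."""
    return [
        f"({base_query}) AND {year}[PDat]"
        for year in range(first_year, last_year + 1)
    ]


def run_inputs(base_query, first_year, last_year, max_results=2000):
    """Wrap each yearly query in an input dict for a separate Actor run."""
    return [
        {"searchQuery": query, "maxResults": max_results}
        for query in yearly_queries(base_query, first_year, last_year)
    ]


for run_input in run_inputs('"Neoplasms"[MeSH] AND "Deep Learning"[MeSH]', 2023, 2025):
    print(run_input["searchQuery"])
```

A single year can still match more than 2,000 papers for a broad query, in which case the same idea applies with narrower date windows.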

FAQ ❓

Is scraping PubMed legal?
PubMed is a publicly funded database designed for open access, and NCBI provides E-utilities specifically for programmatic access. This Actor calls those official endpoints, respects their stated rate limits, and retrieves only abstract-level metadata — never patient data, never PHI. Check your own jurisdiction as always; NCBI's terms of service are the authoritative reference.

Can I export to CSV or Google Sheets?
Yes — the Apify Console lets you export your dataset as JSON, CSV, or Excel directly. You can also webhook on ACTOR.RUN.SUCCEEDED to pipe results into Make, Zapier, or n8n, or pull via the Apify API.

Is there an official PubMed API?
NCBI's E-utilities is the official programmatic interface, and this Actor is built on top of it. What E-utilities does not provide is: a bulk export endpoint, a simple JSON-over-HTTP interface for paper metadata, or managed rate-limiting and retry logic. That gap is what the Actor fills.

How is this different from Biopython's Entrez module?
Biopython Entrez is the right tool for local one-off queries in a Jupyter notebook. This Actor is for batch refresh at scale — when you want 10k records weekly in a cloud dataset with JSON/CSV export and don't want to manage rate-limit state yourself. Use Biopython locally, use this for scheduled bulk pulls.

Try it

The Actor is on the Apify Store: apify.com/DevilScrapes/pubmed-papers-scraper.

Free $5 trial credit, no credit card. Run it against "CRISPR review"[Title] AND 2024[PDat] and you will have 30 structured paper records in under a minute. Found a use case I missed, or a field missing from the output schema? Drop it in the comments.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈

DEV Community