Devil Scrapes

Posted on May 31

arXiv Scraper: build a RAG-ready paper corpus for $1.50/1K

#webscraping #python #apify

Quick answer: arXiv publishes an Atom feed API at export.arxiv.org/api/query that any program can call — but it rate-limits hard, paginates awkwardly, and ships XML you still have to wrangle into rows. An arXiv scraper wraps that feed with proper pagination, retry logic, and rate-limit pacing, then returns every matching paper as structured JSON. The Apify Actor below does it for $0.0015 per paper (~$1.50 per 1,000), handling the failure modes so your embedding pipeline sees clean rows instead of dropped requests.

There is a moment every ML engineer knows: you have picked your retrieval architecture, your vector store is standing by, and you just need a clean slice of arXiv papers to run benchmarks on. So you open the arXiv export API docs, write twenty lines of Python, and an hour later you are still debugging timeouts, malformed Unicode in author names, and a cursor that resets when you change the start offset without the right headers.

The papers are not the problem. Getting them out is.

What is arXiv? 📄

arXiv (pronounced "archive") is a free, open-access preprint server launched in 1991 and now operated by Cornell University. Researchers in physics, mathematics, computer science, quantitative biology, statistics, and economics deposit papers there before peer review, ahead of formal journal submission and sometimes in place of it.

For machine learning, it is effectively the primary literature. Every major model, benchmark, and technique shows up on arXiv before it shows up anywhere else. The cs.AI, cs.LG, and cs.CL subcategories receive hundreds of new papers every weekday. As of 2026, the total corpus sits above 2.3 million papers.

For researchers and engineers building retrieval systems, evaluation corpora, or trend-monitoring pipelines, arXiv is an obvious source of truth — if you can get the data out reliably.

Does arXiv have an API?

Yes — but it is more limited than it looks. arXiv runs a public Atom feed API at export.arxiv.org/api/query. You can query by title, author, category, abstract, or combinations. It supports sorting by submission date or last-update date. Pagination works through start and max_results parameters.
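To make the query shape concrete, here is a minimal sketch of building one page's URL with the standard library. The helper name `build_query_url` is ours, not part of any arXiv client; the parameter names are the ones the feed itself accepts.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=50,
                    sort_by="submittedDate", sort_order="descending"):
    """Return the full URL for one page of results (hypothetical helper)."""
    params = {
        "search_query": search_query,  # e.g. "cat:cs.AI AND ti:transformer"
        "start": start,                # zero-based offset into the result set
        "max_results": max_results,    # page size
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url("cat:cs.AI AND ti:transformer", start=100)
print(url)
```

Stepping `start` by `max_results` walks the result set one page at a time.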

What the documentation underplays: arXiv's own guidelines request no more than one request every three seconds from a single client. On a 30,000-paper category sweep at 50 papers per page, that is 600 pages, 1,800 seconds of wait time, and dozens of transient network failures to handle across a 30-minute run. The API also returns Atom XML, not JSON — which is fine for a one-off, and tedious for a pipeline.
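The arithmetic is easy to check. A rough sketch, assuming a fixed delay between pages and ignoring the time each request itself takes (`sweep_plan` is an illustrative name):

```python
import math


def sweep_plan(total_papers, page_size=50, delay_seconds=3.0):
    """Return (pages, total_wait_seconds) for a paced sweep."""
    pages = math.ceil(total_papers / page_size)
    return pages, pages * delay_seconds


pages, wait = sweep_plan(30_000)
print(f"{pages} pages, {wait:.0f} s of pacing ({wait / 60:.0f} min)")
```

Larger pages cut the wait proportionally: the same sweep at 500 papers per page is 60 requests and three minutes of pacing, at the price of bigger responses to retry when one fails.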

This is the gap the Actor fills.

What the data looks like

Each paper comes back as one flat, typed row. Here is a representative example with every field the Actor returns:

{
  "arxiv_id": "2401.12345v2",
  "url": "https://arxiv.org/abs/2401.12345v2",
  "pdf_url": "https://arxiv.org/pdf/2401.12345v2",
  "title": "Scaling Laws for Sparse Mixture-of-Experts Language Models",
  "summary": "We investigate scaling laws for sparse mixture-of-experts (MoE) language models...",
  "authors": ["Alex Doe", "Jamie Smith", "Morgan Lee"],
  "primary_category": "cs.CL",
  "categories": ["cs.CL", "cs.LG", "cs.AI"],
  "doi": "10.18653/v1/2024.acl-long.001",
  "journal_ref": "Proceedings of ACL 2024",
  "comment": "18 pages, 7 figures. Accepted at ACL 2024.",
  "published": "2024-01-22T16:00:00+00:00",
  "updated": "2024-03-15T09:30:00+00:00",
  "scraped_at": "2026-05-31T11:00:00+00:00"
}

Fourteen fields, Pydantic-validated before they reach your dataset. arxiv_id includes the version suffix; published and updated are ISO-8601 UTC; doi and journal_ref are null for preprints that have not been formally published. That is the full schema — no hidden fields, no surprise null rows.
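If you are curious how an Atom entry maps onto a row like that, here is a minimal standard-library sketch. It is not the Actor's code: the sample feed is made up to mirror the example above, the helper names are ours, timestamps are passed through as the feed gives them, and `scraped_at` is left out.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Made-up sample entry, shaped like the feed's output.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2401.12345v2</id>
    <updated>2024-03-15T09:30:00Z</updated>
    <published>2024-01-22T16:00:00Z</published>
    <title>Scaling Laws for Sparse
      Mixture-of-Experts Language Models</title>
    <summary>We investigate scaling laws for sparse MoE language models.</summary>
    <author><name>Alex Doe</name></author>
    <author><name>Jamie Smith</name></author>
    <arxiv:comment>18 pages, 7 figures.</arxiv:comment>
    <link href="http://arxiv.org/abs/2401.12345v2" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/2401.12345v2" rel="related" type="application/pdf"/>
    <arxiv:primary_category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""


def text_of(node, path):
    """Return whitespace-normalised text at `path`, or None if absent."""
    found = node.find(path, NS)
    if found is None or found.text is None:
        return None
    return " ".join(found.text.split())


def entry_to_row(entry):
    """Flatten one <entry> element into a dict of plain values."""
    arxiv_id = text_of(entry, "atom:id").rsplit("/abs/", 1)[-1]
    pdf_link = entry.find("atom:link[@title='pdf']", NS)
    primary = entry.find("arxiv:primary_category", NS)
    return {
        "arxiv_id": arxiv_id,
        "url": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf_url": pdf_link.get("href") if pdf_link is not None else None,
        "title": text_of(entry, "atom:title"),
        "summary": text_of(entry, "atom:summary"),
        "authors": [text_of(a, "atom:name") for a in entry.findall("atom:author", NS)],
        "primary_category": primary.get("term") if primary is not None else None,
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "doi": text_of(entry, "arxiv:doi"),  # None when the entry has no DOI
        "journal_ref": text_of(entry, "arxiv:journal_ref"),
        "comment": text_of(entry, "arxiv:comment"),
        "published": text_of(entry, "atom:published"),
        "updated": text_of(entry, "atom:updated"),
    }


rows = [entry_to_row(e) for e in ET.fromstring(SAMPLE_FEED).findall("atom:entry", NS)]
print(rows[0]["arxiv_id"], rows[0]["title"])
```

Optional elements such as `arxiv:doi` simply come back as `None`, which is what keeps the rows flat and predictable downstream.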

The naive approach (and why it falls apart) 🔧

The first script most engineers write looks something like this:

import requests, time
base = "https://export.arxiv.org/api/query"
for start in range(0, 3000, 50):
    r = requests.get(base, params={"search_query": "cat:cs.AI", "start": start, "max_results": 50})
    # parse Atom XML, extract entries
    time.sleep(3)

That script fails in production for three reasons that are not obvious until you have already burned a Sunday on them.

1. The XML response is finicky. arXiv's Atom feed can return an empty <feed> when pagination goes past the result set, a 503 under load, or a malformed Unicode character in an author name that breaks naive XML parsers. A parse failure stops your sweep silently at row 2,400.

2. The rate limit pushes back mid-sweep. arXiv enforces the three-second guideline inconsistently. During peak hours you can hit a 503 that a simple time.sleep(3) loop will not recover from gracefully. We retry with exponential backoff on 408 / 429 / 5xx, honour Retry-After headers, and surface partial success rather than returning an empty dataset.

3. Pagination is fragile. The API keeps no cursor: each request re-runs the query, so results shift between pages as new papers are indexed mid-sweep, and a run restarted from a different start offset can skip or duplicate papers. We log the precise offset for each page so you can audit exactly what was fetched.

We absorb all three failure modes. We also rotate browser fingerprints via curl-cffi so requests look like a browser hitting the feed, not a raw Python requests call — which matters less for arXiv than for more aggressive targets, but the defence is already there.
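For a feel of what the retry layer involves, here is a minimal sketch using only the standard library. It is an illustration under our own assumptions, not the Actor's implementation: the function names, the 3-second base, the 60-second cap, and the five-attempt limit are all picked for the example.

```python
import time
import urllib.error
import urllib.request

RETRYABLE = {408, 429, 500, 502, 503, 504}


def backoff_delay(attempt, retry_after=None, base=3.0, cap=60.0):
    """Seconds to wait before the next try; `attempt` is 0-based."""
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    return min(cap, base * (2 ** attempt))


def fetch_with_retry(url, max_attempts=5,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, retrying retryable HTTP statuses and network errors."""
    for attempt in range(max_attempts):
        last_try = attempt == max_attempts - 1
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code not in RETRYABLE or last_try:
                raise
            retry_after = (err.headers or {}).get("Retry-After")
            sleep(backoff_delay(attempt, retry_after))
        except urllib.error.URLError:
            if last_try:
                raise
            sleep(backoff_delay(attempt))
```

The `opener` and `sleep` arguments exist so the loop can be exercised without touching the network. A non-retryable status such as 400 is raised immediately, since repeating a malformed query only wastes the server's time.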

The Actor ⚙️

The arXiv Papers Scraper is on the Apify Store. You can run it from the Apify Console UI, or call it programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/arxiv-papers-scraper").call(
    run_input={
        "searchQuery": "cat:cs.AI AND ti:transformer",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "maxResults": 1000,
        "pageSize": 50,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["arxiv_id"], item["title"])

The key input parameters:

Parameter	Default	Notes
`searchQuery`	`cat:cs.AI`	arXiv query string. Supports `ti:`, `au:`, `cat:`, `abs:` prefixes and Boolean operators.
`sortBy`	`submittedDate`	`relevance`, `lastUpdatedDate`, or `submittedDate`.
`sortOrder`	`descending`	`ascending` or `descending`.
`maxResults`	50	Cap at 5,000 per run. arXiv recommends ≤30,000 per query.
`pageSize`	50	Papers per API call; arXiv caps at 2,000.

The output streams live into the run's dataset — export as JSON, CSV, or Excel from the Apify Console, or fetch via the Apify dataset API.

Use cases

RAG corpus ingestion. Build a domain-specific retrieval pipeline — run cat:cs.CL for 12 months, drop arxiv_id, title, summary, and published into your vector store, and you have a searchable knowledge base in an afternoon. The Actor handles pagination; you handle embeddings.

Literature review seeding. Query ti:scaling laws AND cat:cs.LG to seed a systematic review. Every result comes back with DOI and journal reference where available — enough to drop straight into a citation manager.

Trend monitoring. Schedule a daily run on cat:cs.AI sorted by submittedDate descending with maxResults: 50. Pipe the results into a Slack webhook and get yesterday's new papers in your channel every morning.

Research attribution tracking. Run au:lecun weekly, diff arxiv_id against last week's run, and alert on new papers. Useful for tracking lab output or advisor submissions.
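The weekly diff in that last use case is a few lines once both runs are loaded as lists of rows. A minimal sketch, with `new_papers` as an illustrative name and made-up sample rows:

```python
def new_papers(previous_rows, current_rows):
    """Return rows from the current run whose arxiv_id was not seen before."""
    seen = {row["arxiv_id"] for row in previous_rows}
    return [row for row in current_rows if row["arxiv_id"] not in seen]


last_week = [{"arxiv_id": "2401.12345v2", "title": "Older paper"}]
this_week = [
    {"arxiv_id": "2401.12345v2", "title": "Older paper"},
    {"arxiv_id": "2405.00001v1", "title": "Fresh paper"},
]

for row in new_papers(last_week, this_week):
    print(row["arxiv_id"], row["title"])
```

Because `arxiv_id` carries the version suffix, a revised paper (v2 replacing v1) also shows up as new. Whether that counts as an alert or as noise depends on what you are tracking.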

Pricing — exact numbers 💰

Pay-per-event. You pay for papers that land in your dataset; nothing for the setup or failed retries.

Event	Price
Actor start (one-off per run)	$0.005
Result (per paper written)	$0.0015

Volume	Cost
100 papers	$0.16
1,000 papers	$1.51
10,000 papers	$15.01
30,000 papers (full category sweep, six runs at the 5,000 cap)	$45.03

Apify's $5 free trial credit covers your first ~3,300 papers with no credit card required. For context: building and maintaining the pagination, retry, and parsing layer yourself costs engineering time you could spend on the embedding pipeline.
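If you want to budget a sweep before running it, the pricing reduces to one line of arithmetic. A small sketch using the two event prices above (`estimate_cost` is an illustrative helper, and `runs` is something you supply yourself):

```python
from decimal import Decimal

START_FEE = Decimal("0.005")   # per Actor start
PER_PAPER = Decimal("0.0015")  # per paper written to the dataset


def estimate_cost(papers, runs=1):
    """Exact cost in USD for `papers` results spread over `runs` starts."""
    return runs * START_FEE + papers * PER_PAPER


for papers in (100, 1_000, 10_000):
    print(f"{papers:>6} papers: ${estimate_cost(papers)}")
```

Decimal keeps the sub-cent amounts exact. With `maxResults` capped at 5,000 per run, a larger sweep needs several runs, and each one adds another start fee: `estimate_cost(30_000, runs=6)` comes to 45.03.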

The part worth knowing: arXiv's rate-limit is real

Most arXiv scrapers ignore the official bulk-access guidance. arXiv asks for a 3-second delay between requests and asks automated clients to identify themselves in the User-Agent header. We follow both: the Actor sets a User-Agent that identifies it as a DevilScrapes Apify Actor, paces requests at the recommended interval, and backs off further when the server signals overload.

This is not altruism — it is the reason arXiv's API is still open. If you hammer arXiv with an aggressive client, you hurt every researcher who relies on programmatic access. We take the slower throughput and keep the access clean.
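If you roll your own client, polite pacing is not much code. Here is a minimal sketch, assuming a single-threaded client; the `Pacer` class and the User-Agent string are placeholders of ours, not what the Actor sends.

```python
import time
import urllib.request

# Placeholder identity; replace with something that names your project.
USER_AGENT = "example-corpus-builder/0.1 (mailto:you@example.com)"


class Pacer:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_request(url):
    """Attach the identifying User-Agent header to a request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

Call `pacer.wait()` before each request. Measuring elapsed time rather than sleeping a flat three seconds means slow responses do not stretch the sweep further than they have to.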

Limitations

Metadata only. The Actor uses the Atom feed API. Full-text search over PDF content is not supported — queries operate on metadata fields (title, abstract, author, category). For PDF content extraction, you would need a second-stage pipeline.
No canonical author IDs. arXiv does not assign stable author identifiers in the public API. Author-disambiguation across name collisions ("J. Smith" being three different people) is on you.
Version pinning. arxiv_id includes the version suffix (e.g. 2401.12345v2). If you want only the latest version of each paper, filter on the highest v suffix yourself — the API can return multiple versions of the same paper in a sweep.
Result cap. maxResults is capped at 5,000 per run in the Actor input. For larger sweeps (a full cs.AI category, 30k+ papers), run multiple queries partitioned by date range and merge the datasets.
DOI availability. Most preprints have no DOI until the paper is journal-published. doi is null for the majority of recent papers.
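The version-pinning filter above can be sketched in a few lines. This assumes IDs shaped like the examples in this post (a base ID with an optional `vN` suffix); the helper names are ours.

```python
import re

# Base ID, then an optional "v<number>" suffix at the very end.
_ID_RE = re.compile(r"^(?P<base>.+?)(?:v(?P<version>\d+))?$")


def split_version(arxiv_id):
    """Split '2401.12345v2' into ('2401.12345', 2); no suffix gives version 0."""
    match = _ID_RE.match(arxiv_id)
    return match.group("base"), int(match.group("version") or 0)


def latest_versions(rows):
    """Keep only the highest-versioned row for each base ID."""
    best = {}
    for row in rows:
        base, version = split_version(row["arxiv_id"])
        if base not in best or version > best[base][0]:
            best[base] = (version, row)
    return [row for _, row in best.values()]


rows = [
    {"arxiv_id": "2401.12345v1"},
    {"arxiv_id": "2401.12345v2"},
    {"arxiv_id": "2402.00001v1"},
]
print([row["arxiv_id"] for row in latest_versions(rows)])
```

Run it over the merged dataset after a partitioned sweep, so duplicates that straddle two date ranges collapse as well.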

FAQ

Is scraping arXiv legal?
arXiv publishes the Atom API specifically for programmatic access and has explicit bulk-access guidance. The Actor follows that guidance: polite request pacing, identified User-Agent, metadata only, no auth bypass. As always, review arXiv's terms and your use case.

Does arXiv have a proper JSON API?
No. The public programmatic surface is the Atom XML feed at export.arxiv.org/api/query — no official REST/JSON API exists for bulk metadata export. The Actor converts Atom responses to clean JSON rows.

Can I download the actual PDFs?
Not through this Actor. We surface pdf_url for each paper; pulling the PDFs is a separate step. Use the URL as input to a follow-up pipeline.

How do I build a RAG pipeline on top of this?
Run the Actor, export as JSON, embed summary, and load each paper into your vector store (ChromaDB, Pinecone, Qdrant, or similar) with arxiv_id as the record ID and title and summary as metadata. Then query semantically.
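The glue between the export and the vector store is mostly reshaping. A minimal sketch, assuming the export is a JSON array of rows; the `id` / `text` / `metadata` layout is a common convention rather than any particular store's API, and the embedding call is left to whichever model you use.

```python
import json

# Stand-in for the contents of an exported dataset file.
EXPORT = json.dumps([
    {
        "arxiv_id": "2401.12345v2",
        "title": "Scaling Laws for Sparse Mixture-of-Experts Language Models",
        "summary": "We investigate scaling laws for sparse MoE language models.",
        "published": "2024-01-22T16:00:00+00:00",
        "url": "https://arxiv.org/abs/2401.12345v2",
    }
])


def to_documents(rows):
    """Reshape Actor rows into id / text / metadata records for embedding."""
    return [
        {
            "id": row["arxiv_id"],
            "text": row["summary"],  # the field that gets embedded
            "metadata": {
                "title": row["title"],
                "published": row["published"],
                "url": row["url"],
            },
        }
        for row in rows
    ]


documents = to_documents(json.loads(EXPORT))
print(documents[0]["id"], len(documents[0]["text"]))
```

From here, embed each `text` and upsert the vectors with their `id` and `metadata` using your store's own client.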

Try it

The Actor is on the Apify Store: apify.com/DevilScrapes/arxiv-papers-scraper.

Free $5 credit, no credit card. Run cat:cs.AI with maxResults: 100 and you will have a hundred recent AI papers as clean JSON in under two minutes. If you build something interesting on top of the corpus — a paper recommender, a retrieval benchmark, a trend chart — drop it in the comments. The dataset is the interesting part; we just make it easier to get.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈
