"arXiv Has One of the Last Truly Open APIs. Here Is How to Build a Paper Monitor on It"

#ai #showdev #machinelearning #api

Every scraping post I write lately is about working around something: bot walls, consent screens, keys that require a developer account. This one is different. arXiv runs one of the last genuinely open APIs on the research web, and it is the right way to keep up with the ~800 AI papers that land there every week.

The whole API is one endpoint

GET https://export.arxiv.org/api/query
      ?search_query=(all:"multi-agent") AND (cat:cs.AI)
      &sortBy=submittedDate&sortOrder=descending
      &start=0&max_results=100

No key. No login. Atom XML out. It is documented, sanctioned for programmatic use, and has been stable for well over a decade — the opposite of reverse-engineered endpoints that break monthly.
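The request above is written for readability; on the wire, the quotes, parentheses, and spaces need URL encoding. Here is a minimal sketch of building that URL with the Python standard library. The helper names `build_query_url` and `fetch_feed` and the User-Agent string are my own placeholders, not anything arXiv prescribes.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=100):
    """Return an encoded arXiv API URL for one page, newest submissions first."""
    params = {
        "search_query": search_query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": start,
        "max_results": max_results,
    }
    # urlencode percent-encodes quotes, parentheses and colons,
    # and turns spaces into "+"
    return f"{API_URL}?{urlencode(params)}"


def fetch_feed(url, timeout=30):
    """Fetch one page and return the raw Atom XML as bytes."""
    # The User-Agent value is a hypothetical placeholder
    request = Request(url, headers={"User-Agent": "paper-monitor-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


url = build_query_url('(all:"multi-agent") AND (cat:cs.AI)')
```

Nothing here touches the network until you call `fetch_feed(url)`, which keeps the URL-building part easy to test on its own.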

The query grammar is small but composes well:

Fields: all: (title + abstract + more), ti:, abs:, au:, cat:
Boolean: AND, OR, ANDNOT, with parentheses
Phrases: double quotes, so all:"retrieval augmented generation" matches the exact phrase rather than the individual words scattered through the text

So "anything about tool use or agents, in the AI or NLP categories, by this lab" is one query string.
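To show how the pieces compose, here is a small sketch that assembles that kind of query from parts. The helpers `phrase`, `any_of`, and `all_of` are hypothetical conveniences of mine, and `au:placeholder_author` stands in for whichever author names you would use to approximate "this lab", since the grammar searches authors rather than affiliations. I am using cs.CL as the NLP category.

```python
def phrase(field, text):
    """Field-scoped exact phrase, e.g. all:"tool use"."""
    return f'{field}:"{text}"'


def any_of(*terms):
    """Parenthesised OR group."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Parenthesised AND group."""
    return "(" + " AND ".join(terms) + ")"


query = all_of(
    any_of(phrase("all", "tool use"), "all:agents"),
    any_of("cat:cs.AI", "cat:cs.CL"),
    "au:placeholder_author",  # hypothetical author term, swap in real names
)
```

Wrapping every group in parentheses is more than the grammar strictly needs, but it means I never have to think about operator precedence when nesting groups.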

The three things worth knowing before you build

Politeness is the rate limit. The API guidance asks for about one request every 3 seconds. With max_results=100 per page, that works out to 20 requests, or up to 2,000 papers, a minute, which is more than any monitoring workflow needs.
Sort by submittedDate descending and dedupe by ID. arXiv IDs are versioned (2507.01234v2), so decide whether a revision counts as "new" for you. For monitoring, tracking the ID without the version suffix and diffing daily is usually what people want.
The feed carries each abstract in full. You do not need to touch a PDF to build a useful pipeline: the abstract is enough for embeddings, topic classification, and digest summaries. That turns "paper monitoring" into a pure text-processing problem: parse the Atom XML once and everything downstream is structured data.
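Those three points fit in one short sketch: parse the Atom feed into plain dicts, strip the version suffix from each ID, and pause between pages. This is an illustration under my own assumptions, not a reference client. `parse_feed`, `base_id`, and `iter_pages` are names I made up, and `fetch_page` is a hypothetical callable you supply that takes `(start, max_results)` and returns the raw XML bytes.

```python
import re
import time
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}
VERSION_SUFFIX = re.compile(r"v\d+$")


def base_id(versioned_id):
    """Drop the version suffix: '2507.01234v2' -> '2507.01234'."""
    return VERSION_SUFFIX.sub("", versioned_id)


def _text(node, tag):
    """Text of a child element with whitespace collapsed, or '' if absent."""
    return " ".join(node.findtext(tag, default="", namespaces=ATOM).split())


def parse_feed(xml_bytes):
    """Turn one Atom response into a list of plain dicts."""
    root = ET.fromstring(xml_bytes)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        # <id> is a URL such as http://arxiv.org/abs/2507.01234v2
        versioned = _text(entry, "atom:id").rsplit("/abs/", 1)[-1]
        papers.append({
            "id": base_id(versioned),
            "versioned_id": versioned,
            "title": _text(entry, "atom:title"),
            "abstract": _text(entry, "atom:summary"),
            "published": _text(entry, "atom:published"),
            "authors": [_text(a, "atom:name")
                        for a in entry.findall("atom:author", ATOM)],
        })
    return papers


def iter_pages(fetch_page, page_size=100, max_pages=5, delay=3.0,
               sleep=time.sleep):
    """Yield papers page by page, sleeping `delay` seconds between requests."""
    for page in range(max_pages):
        if page:
            sleep(delay)  # politeness pause before every request after the first
        papers = parse_feed(fetch_page(page * page_size, page_size))
        if not papers:
            return  # an empty page means there is nothing more to read
        yield from papers
```

Passing `sleep` in as a parameter is only there so the delay can be stubbed out in tests; in a real run the default `time.sleep` does the waiting.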

The workflow that actually keeps you current

Nobody reads listing pages. The pattern that works is a scheduled diff:

Daily run: query your topics, newest first.
Skip every ID you have seen before.
Push what is left to Slack, a newsletter draft, or an embedding index.
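The "skip every ID you have seen" step is the only stateful part, and a JSON file is enough to start with. A minimal sketch, assuming each paper is a dict with an `"id"` key holding the ID without its version suffix; the `load_seen` and `diff_and_record` names and the file layout are my own choices.

```python
import json
from pathlib import Path


def load_seen(path):
    """Return the set of IDs recorded by earlier runs (empty on the first run)."""
    p = Path(path)
    return set(json.loads(p.read_text())) if p.exists() else set()


def diff_and_record(papers, path):
    """Return only the papers not seen before, then persist the updated ID set."""
    seen = load_seen(path)
    fresh = []
    for paper in papers:
        if paper["id"] not in seen:
            seen.add(paper["id"])  # also drops duplicates within a single run
            fresh.append(paper)
    Path(path).write_text(json.dumps(sorted(seen)))
    return fresh
```

Whatever `diff_and_record` returns is the list to push to Slack, a newsletter draft, or an embedding index. For a larger archive I would probably swap the JSON file for SQLite, but the logic stays the same.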

The result is a feed of only-new, only-relevant papers with abstracts, which is what all the "stay current with AI research" tools sell, built on an API that gives the data away.

I packaged this as an Apify actor this week: arXiv Papers Scraper takes keywords, categories, and authors, handles the pagination and politeness delays, normalizes rows (title, full abstract, authors, categories, dates, PDF link, DOI), and has cross-run dedupe built in for exactly the scheduled-diff workflow above. The first 2 rows of every run are free.

One small irony to end on: the hardest part of building AI research tooling is not the AI. It is that most of the web fights being read by machines. arXiv does not, and it is not a coincidence that it is also the most machine-cited corpus in the field.