Grounding an LLM or running a literature review? You need a clean corpus of papers — titles, abstracts, authors, citations, PDF links. Here's how to build one in minutes without writing a scraper, pulling from arXiv, OpenAlex and PubMed.
What you'll build: a structured JSON dataset of academic papers on your topic, ready to drop into a vector database, a notebook, or a RAG pipeline.
Why not just hit the APIs directly?
arXiv, OpenAlex and PubMed are all free and open, but each returns a different format (Atom XML from arXiv, nested JSON from OpenAlex, XML from PubMed's E-utilities), with its own pagination and rate limits. Wiring that up and flattening it is roughly half a day of glue code that you then have to maintain.
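To give a sense of that glue code, here is a minimal sketch that flattens a single arXiv-style Atom entry into a flat record. The sample feed is a trimmed, made-up entry with placeholder values, and the output field names simply mirror the ones used later in this post; pagination, rate limiting and the other two sources would all come on top of this.

```python
import xml.etree.ElementTree as ET

# Atom namespace used by the arXiv feed.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Illustrative, trimmed entry with placeholder values (not a real paper).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>An Example
      Title</title>
    <summary>An example abstract.</summary>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" type="application/pdf"/>
  </entry>
</feed>"""


def flatten_entry(entry):
    """Turn one Atom <entry> element into a flat dict."""

    def text(tag):
        node = entry.find(f"atom:{tag}", ATOM)
        # Collapse the line breaks and indentation that feeds put inside text nodes.
        return " ".join(node.text.split()) if node is not None and node.text else ""

    # The PDF link is the <link> element whose title attribute is "pdf".
    pdf_url = next(
        (link.get("href") for link in entry.findall("atom:link", ATOM)
         if link.get("title") == "pdf"),
        None,
    )
    return {
        "arxiv_id": text("id").rsplit("/", 1)[-1],
        "title": text("title"),
        "authors": [
            author.findtext("atom:name", default="", namespaces=ATOM)
            for author in entry.findall("atom:author", ATOM)
        ],
        "abstract": text("summary"),
        "pdf_url": pdf_url,
    }


records = [flatten_entry(e) for e in ET.fromstring(SAMPLE_FEED).findall("atom:entry", ATOM)]
```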
Instead, we'll use three no-code scrapers that wrap those official APIs and return clean, flat JSON:
- arXiv Scraper — CS/ML/physics preprints
- OpenAlex Scraper — 250M+ works across all fields, with citations
- PubMed Scraper — 37M+ biomedical citations
Step 1 — Pick your source by field
- Machine learning / CS / physics → arXiv
- Anything, with citation counts → OpenAlex
- Medicine / life sciences → PubMed
For an AI/ML RAG corpus, start with arXiv.
Step 2 — Run the arXiv Scraper
Open the arXiv Scraper and set:
{
  "allFields": "retrieval augmented generation",
  "sortBy": "newest",
  "maxResults": 200
}
Or use the advanced query syntax for precision:
{ "searchQuery": "cat:cs.CL AND abs:retrieval augmented", "maxResults": 500 }
Each result includes the abstract and a direct PDF link:
{
  "arxiv_id": "2605.30351v1",
  "title": "VideoMLA: Low-Rank Latent KV Cache...",
  "authors": ["Hidir Yesiltepe", "Jiazhen Hu"],
  "abstract": "Long-rollout causal video diffusion...",
  "primary_category": "cs.CV",
  "pdf_url": "https://arxiv.org/pdf/2605.30351v1"
}
Step 3 — Add breadth and citations with OpenAlex
To rank by impact or cover non-arXiv venues, run the OpenAlex Scraper:
{
  "searchQuery": "retrieval augmented generation",
  "sortBy": "citations",
  "fromYear": 2020,
  "maxResults": 200
}
You get cited_by_count, is_open_access and PDF links — perfect for prioritising the most influential papers in your corpus.
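Once the OpenAlex dataset is exported, prioritising by impact can be a few lines of Python. The sketch below assumes each record is a dict carrying the cited_by_count and is_open_access fields mentioned above; the function name top_cited and the sample records are made up for illustration.

```python
def top_cited(records, n=10, open_access_only=False):
    """Return the n most-cited records, optionally restricted to open access."""
    pool = [
        r for r in records
        if not open_access_only or r.get("is_open_access")
    ]
    # Treat a missing or null citation count as zero so sorting never fails.
    return sorted(pool, key=lambda r: r.get("cited_by_count") or 0, reverse=True)[:n]


# Placeholder records, not real papers.
papers = [
    {"title": "A", "cited_by_count": 3, "is_open_access": False},
    {"title": "B", "cited_by_count": 10, "is_open_access": True},
    {"title": "C", "cited_by_count": None, "is_open_access": True},
]
most_cited = top_cited(papers, n=2)
open_only = top_cited(papers, open_access_only=True)
```

Filtering to open-access papers first is handy when you plan to download full text, since those are the records most likely to have a usable PDF link.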
Step 4 — Export and load into your pipeline
From the Storage tab, export each dataset as JSON (or pull it via the API). Then in your RAG pipeline:
- Use the abstract (and the pdf_url if you want full text) as your documents.
- Keep title, authors, doi, arxiv_id as metadata for citations.
- Chunk, embed, and load into your vector DB (Pinecone, Weaviate, pgvector, etc.).
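The three steps above can be sketched roughly as follows, stopping just before the embedding call, which depends on your model and vector DB. The helper names (chunk_words, to_documents), the word-based chunking and the default sizes are assumptions for illustration; the record fields are the ones listed above.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into overlapping word windows."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def to_documents(record, max_words=200, overlap=40):
    """Turn one paper record into chunked documents with citation metadata."""
    key = record.get("arxiv_id") or record.get("doi") or record.get("title", "")
    metadata = {
        field: record.get(field)
        for field in ("title", "authors", "doi", "arxiv_id", "pdf_url")
    }
    return [
        {"id": f"{key}#{i}", "text": chunk, "metadata": metadata}
        for i, chunk in enumerate(chunk_words(record.get("abstract", ""), max_words, overlap))
    ]


# Placeholder record with a ten-word "abstract" so the windows are easy to see.
example = {
    "arxiv_id": "0000.00000v1",
    "title": "An Example Title",
    "authors": ["A. Author"],
    "abstract": "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9",
}
docs = to_documents(example, max_words=4, overlap=1)
```

Most abstracts fit in a single chunk at the default size, so the chunking mainly matters once you add full text from the PDFs.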
Because the output is already flat JSON with consistent field names across runs, there's no per-source parsing — you merge the three datasets and you're done.
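One thing worth handling when you merge: the same paper can show up in more than one source, for example as an arXiv preprint and as an OpenAlex work. Here is a minimal sketch of a merge that de-duplicates on DOI, then arXiv ID (ignoring the version suffix), then normalised title. The matching rules and the sources field are assumptions for illustration, not something the scrapers do for you.

```python
import re


def record_keys(record):
    """Return identity keys for a record, strongest first."""
    keys = []
    doi = re.sub(r"^https?://(dx\.)?doi\.org/", "", (record.get("doi") or "").lower())
    if doi:
        keys.append(("doi", doi))
    arxiv_id = re.sub(r"v\d+$", "", record.get("arxiv_id") or "")
    if arxiv_id:
        keys.append(("arxiv", arxiv_id))
    title = re.sub(r"[^a-z0-9]+", " ", (record.get("title") or "").lower()).strip()
    if title:
        keys.append(("title", title))
    return keys


def merge_datasets(**datasets):
    """Merge named datasets; duplicates only fill in fields the first copy lacks."""
    by_key, merged = {}, []
    for source, records in datasets.items():
        for record in records:
            keys = record_keys(record)
            existing = next((by_key[k] for k in keys if k in by_key), None)
            if existing is None:
                existing = {**record, "sources": [source]}
                merged.append(existing)
            else:
                for field, value in record.items():
                    existing.setdefault(field, value)
                if source not in existing["sources"]:
                    existing["sources"].append(source)
            for k in keys:
                by_key[k] = existing
    return merged


# Placeholder records, not real papers.
corpus = merge_datasets(
    arxiv=[{"arxiv_id": "0000.00001v2", "title": "A Study of Things", "abstract": "x"}],
    openalex=[
        {"doi": "https://doi.org/10.0000/example", "title": "A study of things.", "cited_by_count": 5},
        {"title": "Another Paper", "cited_by_count": 1},
    ],
)
```

Title matching is deliberately loose and can produce false positives on very short or generic titles, so treat it as a fallback rather than a guarantee.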
Step 5 — Keep it fresh
Schedule the arXiv Scraper with sortBy: newest to append new papers on your topic each week, so your RAG index stays current without re-scraping everything.
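On the consuming side, a weekly run only needs to add papers you have not seen yet. A minimal sketch, assuming the corpus lives in a local JSON file and that a revised version of a paper (v2, v3, ...) should not be added again; the append_new helper and the file layout are hypothetical.

```python
import json
import re
from pathlib import Path


def base_id(arxiv_id):
    """Strip the version suffix, e.g. '0000.00001v2' -> '0000.00001'."""
    return re.sub(r"v\d+$", "", arxiv_id)


def append_new(corpus_path, fresh_records):
    """Append records whose arXiv ID is not yet in the corpus file; return them."""
    path = Path(corpus_path)
    corpus = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    known = {base_id(r["arxiv_id"]) for r in corpus if r.get("arxiv_id")}

    added = []
    for record in fresh_records:
        arxiv_id = record.get("arxiv_id")
        if arxiv_id and base_id(arxiv_id) not in known:
            known.add(base_id(arxiv_id))
            added.append(record)

    if added:
        path.write_text(
            json.dumps(corpus + added, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
    return added
```

The returned list is exactly the set of records that still need chunking and embedding, so the rest of the index stays untouched.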
Bonus — add real-world signal
Papers tell you what researchers say; forums tell you what practitioners say. To enrich a dataset with public discussion, add the Reddit Archive Scraper to pull historical threads from subreddits like r/MachineLearning by date range and keyword.
Wrap up
Three no-code scrapers, three short inputs, and you've got a clean, multi-source research corpus — no XML parsing, no rate-limit juggling. Start with the arXiv Scraper and layer in OpenAlex and PubMed as you need breadth.