How to Build a Research Paper Dataset for RAG & LLMs (No Code, 2026)

#rag #ai #machinelearning #datascience

Grounding an LLM or running a literature review? You need a clean corpus of papers — titles, abstracts, authors, citations, PDF links. Here's how to build one in minutes without writing a scraper, pulling from arXiv, OpenAlex and PubMed.

What you'll build: a structured JSON dataset of academic papers on your topic, ready to drop into a vector database, a notebook, or a RAG pipeline.

Why not just hit the APIs directly?

arXiv, OpenAlex and PubMed are all free and open, but each returns a different format (Atom XML from arXiv, nested JSON from OpenAlex, XML via NCBI's E-utilities for PubMed), with its own pagination and rate limits. Wiring that up and flattening it is roughly a half-day of glue code that you then have to maintain.
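
For a sense of what that glue code involves, here is a minimal sketch that flattens one arXiv Atom entry using only Python's standard library. The function names and output keys are illustrative choices (they mirror the sample output shown in Step 2), and the sample feed is a hand-written, trimmed stand-in for a real API response, not actual data.

```python
import xml.etree.ElementTree as ET

# XML namespaces used in arXiv's Atom responses.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written placeholder feed, trimmed to the fields used below.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>An Example
      Title</title>
    <summary>  An example abstract.  </summary>
    <author><name>Ada Example</name></author>
    <author><name>Bo Sample</name></author>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" type="application/pdf"/>
    <arxiv:primary_category term="cs.CL"/>
  </entry>
</feed>"""


def flatten_arxiv_entry(entry):
    """Turn one Atom <entry> element into a flat dict."""

    def text(path):
        node = entry.find(path, NS)
        if node is None or not node.text:
            return ""
        # Collapse the line breaks that appear inside titles and abstracts.
        return " ".join(node.text.split())

    pdf_url = ""
    for link in entry.findall("atom:link", NS):
        if link.get("title") == "pdf":
            pdf_url = link.get("href", "")

    category = entry.find("arxiv:primary_category", NS)
    return {
        # The <id> is a URL; keep only the part after "/abs/".
        "arxiv_id": text("atom:id").split("/abs/")[-1],
        "title": text("atom:title"),
        "authors": [a.text for a in entry.findall("atom:author/atom:name", NS)],
        "abstract": text("atom:summary"),
        "primary_category": category.get("term", "") if category is not None else "",
        "pdf_url": pdf_url,
    }


def flatten_feed(xml_text):
    """Flatten every <entry> in an Atom feed."""
    root = ET.fromstring(xml_text)
    return [flatten_arxiv_entry(e) for e in root.findall("atom:entry", NS)]
```

That covers a single source, with no pagination, retries or rate limiting yet.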

Instead, we'll use three no-code scrapers that wrap those official APIs and return clean, flat JSON:

arXiv Scraper — CS/ML/physics preprints
OpenAlex Scraper — 250M+ works across all fields, with citations
PubMed Scraper — 37M+ biomedical citations

Step 1 — Pick your source by field

Machine learning / CS / physics → arXiv
Anything, with citation counts → OpenAlex
Medicine / life sciences → PubMed

For an AI/ML RAG corpus, start with arXiv.

Step 2 — Run the arXiv Scraper

Open the arXiv Scraper and set:

{
  "allFields": "retrieval augmented generation",
  "sortBy": "newest",
  "maxResults": 200
}

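Or use arXiv's advanced query syntax for precision. Quote multi-word phrases so the field prefix applies to the whole phrase rather than only its first word:

{ "searchQuery": "cat:cs.CL AND abs:\"retrieval augmented\"", "maxResults": 500 }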
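
If you generate these inputs from a script, a small helper can assemble the query string for you. This is a sketch only: `field_term` and `build_input` are hypothetical helpers, the `cat:` and `abs:` prefixes and double-quoted phrases come from arXiv's query syntax, and the `searchQuery` and `maxResults` keys are taken from the input above.

```python
import json


def field_term(field, value):
    """Format one fielded term, quoting multi-word values as a phrase."""
    value = value.strip()
    if " " in value:
        value = f'"{value}"'
    return f"{field}:{value}"


def build_input(category, abstract_phrase, max_results=500):
    """Build a scraper input dict (key names assumed from the example above)."""
    query = " AND ".join(
        [field_term("cat", category), field_term("abs", abstract_phrase)]
    )
    return {"searchQuery": query, "maxResults": max_results}


if __name__ == "__main__":
    # json.dumps takes care of escaping the inner quotes.
    print(json.dumps(build_input("cs.CL", "retrieval augmented")))
```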

Each result includes the abstract and a direct PDF link:

{
  "arxiv_id": "2605.30351v1",
  "title": "VideoMLA: Low-Rank Latent KV Cache...",
  "authors": ["Hidir Yesiltepe", "Jiazhen Hu"],
  "abstract": "Long-rollout causal video diffusion...",
  "primary_category": "cs.CV",
  "pdf_url": "https://arxiv.org/pdf/2605.30351v1"
}
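
Once you have an export, you may want typed records instead of raw dicts. The sketch below assumes the export is a JSON array of objects shaped like the sample above; the `Paper` class, `base_id` property and `parse_papers` function are illustrative names, not part of the scraper.

```python
import json
import re
from dataclasses import dataclass, field, fields


@dataclass
class Paper:
    arxiv_id: str
    title: str
    abstract: str
    authors: list = field(default_factory=list)
    primary_category: str = ""
    pdf_url: str = ""

    @property
    def base_id(self):
        """arXiv ID without its version suffix ('1234.56789v2' -> '1234.56789')."""
        return re.sub(r"v\d+$", "", self.arxiv_id)


def parse_papers(json_text):
    """Parse an exported JSON array into Paper objects."""
    known = {f.name for f in fields(Paper)}
    papers = []
    for record in json.loads(json_text):
        # Skip records missing the fields every downstream step depends on.
        if not all(record.get(k) for k in ("arxiv_id", "title", "abstract")):
            continue
        # Ignore any keys the dataclass does not declare.
        papers.append(Paper(**{k: v for k, v in record.items() if k in known}))
    return papers
```

Dropping records without an abstract is a design choice here: an entry with nothing to embed adds noise to a retrieval index.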

Step 3 — Add breadth and citations with OpenAlex

To rank by impact or cover non-arXiv venues, run the OpenAlex Scraper:

{
  "searchQuery": "retrieval augmented generation",
  "sortBy": "citations",
  "fromYear": 2020,
  "maxResults": 200
}

You get cited_by_count, is_open_access and PDF links — perfect for prioritising the most influential papers in your corpus.
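
As a rough illustration of that prioritisation, the sketch below sorts exported records by `cited_by_count` and optionally keeps only open-access ones. The field names come from the output described above; `top_cited` is a hypothetical helper and the records in the usage lines are made up.

```python
def top_cited(works, limit=50, open_access_only=False):
    """Return the most-cited works, optionally restricted to open access."""
    if open_access_only:
        works = [w for w in works if w.get("is_open_access")]
    # Treat a missing or null citation count as zero.
    ranked = sorted(works, key=lambda w: w.get("cited_by_count") or 0, reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    # Made-up records for illustration only.
    works = [
        {"title": "A", "cited_by_count": 5, "is_open_access": False},
        {"title": "B", "cited_by_count": None, "is_open_access": True},
        {"title": "C", "cited_by_count": 12, "is_open_access": True},
    ]
    for work in top_cited(works, limit=2, open_access_only=True):
        print(work["title"], work["cited_by_count"])
```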

Step 4 — Export and load into your pipeline

From the Storage tab, export each dataset as JSON (or pull it via the API). Then in your RAG pipeline:

Use the abstract as your document text. If you want full text, download the PDF from pdf_url and extract its text first.
Keep title, authors, doi, arxiv_id as metadata for citations.
Chunk, embed, and load into your vector DB (Pinecone, Weaviate, pgvector, etc.).
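
The chunking and metadata steps above can be sketched as follows. This is one simple approach (fixed-size word windows with overlap), not the only reasonable one; `chunk_words` and `to_documents` are hypothetical helpers, and the embedding and upsert calls are left out because they depend on your vector DB client.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a window has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks


def to_documents(paper, size=200, overlap=40):
    """Turn one paper record into chunk dicts that carry citation metadata."""
    paper_id = paper.get("arxiv_id") or paper.get("doi") or paper.get("title", "")
    metadata = {key: paper.get(key) for key in ("title", "authors", "doi", "arxiv_id")}
    return [
        {"id": f"{paper_id}-{i}", "text": chunk, "metadata": dict(metadata)}
        for i, chunk in enumerate(chunk_words(paper.get("abstract", ""), size, overlap))
    ]
```

Most abstracts fit in a single 200-word window, so the overlap mainly matters once you index full text.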

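Because each scraper returns flat JSON with field names that stay consistent across runs, there's no XML or nested-JSON parsing left to do. Merging the three datasets comes down to concatenating them and dropping papers that appear in more than one source, for example by matching on doi or title.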
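
A merge step might look like the sketch below. It assumes every record has a `title` and that some also carry a `doi`. The matching rule (DOI first, then a normalized title) is an illustrative choice, and `merge_datasets` is a hypothetical helper rather than something the scrapers provide.

```python
import re


def norm_doi(record):
    """Lower-case the DOI and strip a leading resolver URL, if present."""
    doi = (record.get("doi") or "").strip().lower()
    return re.sub(r"^https?://(dx\.)?doi\.org/", "", doi)


def norm_title(record):
    """Lower-case the title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", (record.get("title") or "").lower()).strip()


def merge_datasets(*datasets):
    """Combine datasets, folding duplicates into the first record seen."""
    merged, by_doi, by_title = [], {}, {}
    for dataset in datasets:
        for record in dataset:
            doi, title = norm_doi(record), norm_title(record)
            existing = by_doi.get(doi) if doi else None
            if existing is None and title:
                existing = by_title.get(title)
            if existing is None:
                existing = dict(record)
                merged.append(existing)
            else:
                # Fill only the fields the earlier record was missing.
                for key, value in record.items():
                    if existing.get(key) in (None, "", []):
                        existing[key] = value
            if doi:
                by_doi[doi] = existing
            if title:
                by_title[title] = existing
    return merged
```

Title matching can occasionally merge two different papers that share a generic title, so check a sample of the merged output before indexing.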

Step 5 — Keep it fresh

Schedule the arXiv Scraper with sortBy: newest to run weekly and append new papers on your topic, skipping any arxiv_id already in your index, so your RAG index stays current without re-scraping everything.
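
Since consecutive scheduled runs can return overlapping results, you'll likely want to filter each batch before embedding it. A minimal sketch, assuming each record has an `arxiv_id` like the sample in Step 2; `select_new` is a hypothetical helper, and storing the set of seen IDs between runs is up to you.

```python
import re


def base_arxiv_id(arxiv_id):
    """Strip the version suffix so v1 and v2 count as the same paper."""
    return re.sub(r"v\d+$", "", arxiv_id)


def select_new(fresh, seen_ids):
    """Return papers not seen before and record their IDs in seen_ids."""
    new_papers = []
    for paper in fresh:
        base = base_arxiv_id(paper["arxiv_id"])
        if base not in seen_ids:
            seen_ids.add(base)
            new_papers.append(paper)
    return new_papers
```

This treats a revised version (v2, v3) as already indexed. If you'd rather re-embed revisions, compare the full `arxiv_id` instead.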

Bonus — add real-world signal

Papers tell you what researchers say; forums tell you what practitioners say. To enrich a dataset with public discussion, add the Reddit Archive Scraper to pull historical threads from subreddits like r/MachineLearning by date range and keyword.

Wrap up

Three no-code scrapers, three short inputs, and you've got a clean, multi-source research corpus — no XML parsing, no rate-limit juggling. Start with the arXiv Scraper and layer in OpenAlex and PubMed as you need breadth.