Anyone else building training corpora from academic literature?

Curious what your data collection pipeline looks like.

I've been pulling from ScholarAPI for domain-specific RAG datasets: medical, materials science, and chemistry. The structured JSON plus PDF access makes chunking for embeddings cleaner than parsing scraped HTML.

Current setup: ScholarAPI → extract → chunk → embed into Chroma. Works well for domain-specific Q&A.
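For anyone comparing notes, here is a rough sketch of what the chunk step can look like. It is not my exact code and it does not touch ScholarAPI itself: the `paper` dict shape (`id`, `title`, `text`), the `Chunk` class, and the size/overlap defaults are all assumptions made up for illustration. It just splits already-extracted text into overlapping word windows and attaches metadata for the vector store.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a window reaches the end, so no tail-only duplicate is emitted.
        if start + size >= len(words):
            break
    return chunks


def chunk_paper(paper: dict, size: int = 200, overlap: int = 40) -> list[Chunk]:
    """Turn one extracted paper (hypothetical dict with id, title, text) into Chunk records."""
    return [
        Chunk(
            chunk_id=f"{paper['id']}-{i}",
            text=piece,
            metadata={"title": paper["title"], "chunk_index": i},
        )
        for i, piece in enumerate(chunk_words(paper["text"], size, overlap))
    ]
```

The `chunk_id`, `text`, and `metadata` fields are meant to line up with the `ids`, `documents`, and `metadatas` arguments of a Chroma collection's `add` call, so the embed step can be a thin loop over these records. Word windows are the simplest option; splitting on section or sentence boundaries would probably retrieve better for long papers.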

What are you using?

Genuinely curious if there's something better I'm missing for open-access coverage.

(https://scholarapi.net?via=-asig3)

DEV Community
