DEV Community

Reel Crave

Anyone else building training corpora from academic literature?

Curious what your data collection pipeline looks like.

I've been pulling from ScholarAPI for domain-specific RAG datasets: medical, materials science, and chemistry. The structured JSON + PDF access makes chunking for embeddings cleaner than parsing scraped HTML.

Current setup: ScholarAPI → extract → chunk → embed into Chroma. Works well for domain-specific Q&A.
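
For anyone comparing notes, here is a rough sketch of the chunk step in that pipeline. It is not my exact code, and the record fields (`id`, `title`, `text`) are placeholders I made up for illustration rather than ScholarAPI's actual schema. It splits the extracted text into overlapping word windows and builds parallel lists of ids, documents, and metadatas.

```python
from typing import Iterator


def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> Iterator[str]:
    """Yield overlapping word-window chunks from text."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + max_words])
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break


def build_batch(record: dict, max_words: int = 200, overlap: int = 40) -> dict:
    """Turn one paper record into parallel lists of ids, documents, and metadatas.

    The keys "id", "title", and "text" are hypothetical field names.
    """
    ids, documents, metadatas = [], [], []
    for i, chunk in enumerate(chunk_text(record["text"], max_words, overlap)):
        ids.append(f"{record['id']}-{i}")
        documents.append(chunk)
        metadatas.append({"paper_id": record["id"], "title": record["title"], "chunk": i})
    return {"ids": ids, "documents": documents, "metadatas": metadatas}
```

The returned dict is shaped so it should be passable to a Chroma collection as `collection.add(**batch)`, which would leave embedding to the collection's configured embedding function. Word windows are the simplest option; splitting on section or paragraph boundaries is probably better when the source JSON exposes them.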

What are you using?


Genuinely curious if there's something better I'm missing for open-access coverage.

(https://scholarapi.net?via=-asig3)
