Curious what your data collection pipeline looks like.
I've been pulling from ScholarAPI for domain-specific RAG datasets: medical, materials science, chemistry. The structured JSON + PDF access makes chunking for embeddings cleaner than parsing scraped HTML.
Current setup: ScholarAPI → extract → chunk → embed into Chroma. Works well for domain-specific Q&A.
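
For anyone comparing notes, here is a rough sketch of what the chunk step can look like. It is not my exact code and it does not call ScholarAPI at all: the `paper` dict shape (`id`, `title`, `sections` with `heading` and `text`) is a made-up stand-in for whatever the extract step produces, and the word-based window sizes are arbitrary defaults.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def build_records(paper: dict, max_words: int = 200, overlap: int = 40):
    """Turn one extracted paper into parallel ids / documents / metadatas lists.

    The field names read from `paper` are hypothetical, not a real API schema.
    """
    ids, documents, metadatas = [], [], []
    for s_idx, section in enumerate(paper.get("sections", [])):
        pieces = chunk_text(section.get("text", ""), max_words, overlap)
        for c_idx, piece in enumerate(pieces):
            ids.append(f"{paper['id']}-{s_idx}-{c_idx}")
            documents.append(piece)
            metadatas.append({
                "paper_id": paper["id"],
                "title": paper.get("title", ""),
                "heading": section.get("heading", ""),
            })
    return ids, documents, metadatas
```

The three lists line up with the `ids`, `documents`, and `metadatas` arguments of Chroma's `collection.add`, so the embed step can take them as-is. Chunking per section instead of per document keeps the heading in the metadata, which helps when filtering results later.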
What are you using?

Genuinely curious if there's something better I'm missing for open-access coverage.