If you're training NLP models or building RAG systems, you need diverse text data. Here are 7 free data sources I built tools for:
1. Reddit — Conversational Data
Public JSON endpoints (append .json to any post or listing URL). 20+ fields per post, plus full comment trees.
Use for: dialogue systems, sentiment analysis, topic modeling
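To make the ".json" trick concrete, here is a minimal sketch, not the code from my tool. It assumes the usual thread response shape (a two-element list: post listing, then comment listing), and the helper names and User-Agent string are placeholders I made up for illustration.

```python
import json
import urllib.request


def reddit_json_url(url: str) -> str:
    """Turn a Reddit page URL into its JSON equivalent by appending .json."""
    base, _, query = url.partition("?")
    base = base.rstrip("/") + ".json"
    return f"{base}?{query}" if query else base


def flatten_comments(listing: dict, depth: int = 0) -> list[dict]:
    """Walk a comment Listing depth-first and return flat rows."""
    rows = []
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":  # skip "more" placeholders
            continue
        data = child["data"]
        rows.append(
            {"author": data.get("author"), "body": data.get("body"), "depth": depth}
        )
        replies = data.get("replies")
        if isinstance(replies, dict):  # an empty string means no replies
            rows.extend(flatten_comments(replies, depth + 1))
    return rows


def fetch_thread(url: str, user_agent: str = "nlp-data-sketch/0.1") -> list[dict]:
    """Fetch one thread; the response is [post listing, comment listing]."""
    req = urllib.request.Request(
        reddit_json_url(url), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        _post_listing, comment_listing = json.load(resp)
    return flatten_comments(comment_listing)
```

The depth field keeps the reply structure, which is handy if you want to rebuild dialogue turns later.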
2. YouTube Comments — Engagement-Weighted Text
Innertube (YouTube's internal, undocumented API), so the official Data API quota does not apply. Returns author, text, likes, and replies.
Use for: sentiment analysis, opinion mining
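One way to use the likes and replies fields is to weight comments by engagement when sampling. The sketch below is only an illustration: the "likes" and "replies" keys and the log-damped formula are my assumptions, not a fixed schema or a recommended weighting.

```python
import math
import random


def engagement_weight(likes: int, replies: int = 0) -> float:
    """Log-damped weight so one viral comment does not dominate the sample."""
    return 1.0 + math.log1p(max(likes, 0)) + math.log1p(max(replies, 0))


def sample_comments(comments: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw k comments with replacement, proportional to engagement."""
    rng = random.Random(seed)
    weights = [
        engagement_weight(c.get("likes", 0), c.get("replies", 0)) for c in comments
    ]
    return rng.choices(comments, weights=weights, k=k)
```

Every comment keeps a base weight of 1.0, so text with zero likes can still be drawn.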
3. Stack Overflow — Technical Q&A
Stack Exchange API v2.3. Questions with full answers and code.
Use for: code generation, technical Q&A assistants
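A rough sketch of pulling questions with their bodies from the v2.3 API, assuming the built-in "withbody" filter. The function names and the output record keys are mine. Answers are not included here; they would need a separate call such as /questions/{ids}/answers.

```python
import gzip
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.stackexchange.com/2.3"


def questions_url(tag: str, page: int = 1, pagesize: int = 50) -> str:
    """Build a /questions URL for top-voted questions under one tag."""
    params = {
        "site": "stackoverflow",
        "tagged": tag,
        "sort": "votes",
        "order": "desc",
        "page": page,
        "pagesize": pagesize,
        "filter": "withbody",  # built-in filter that adds the HTML body
    }
    return f"{API_ROOT}/questions?{urllib.parse.urlencode(params)}"


def to_record(item: dict) -> dict:
    """Keep only the fields useful for a Q&A dataset (keys are my own naming)."""
    return {
        "id": item["question_id"],
        "title": item["title"],
        "body_html": item.get("body", ""),
        "tags": item.get("tags", []),
        "score": item.get("score", 0),
    }


def fetch_questions(tag: str) -> list[dict]:
    """Fetch one page; the API compresses responses, so decompress if needed."""
    req = urllib.request.Request(
        questions_url(tag), headers={"Accept-Encoding": "gzip"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        raw = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
    return [to_record(item) for item in json.loads(raw)["items"]]
```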
4. Wikipedia — Encyclopedic Knowledge
MediaWiki API, 40+ languages. Full article text with categories.
Use for: knowledge grounding, RAG, entity extraction
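For plain article text plus categories, a query along these lines works against the MediaWiki API on Wikipedia. This is a sketch under a few assumptions: it relies on the extracts property (TextExtracts extension) and on formatversion=2, and the helper names are my own.

```python
import json
import urllib.parse
import urllib.request


def extract_url(title: str, lang: str = "en") -> str:
    """Build a query URL for plain-text article content and categories."""
    params = {
        "action": "query",
        "prop": "extracts|categories",
        "explaintext": 1,   # plain text instead of HTML
        "titles": title,
        "cllimit": "max",
        "redirects": 1,
        "format": "json",
        "formatversion": 2,  # pages come back as a list
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urllib.parse.urlencode(params)}"


def parse_page(payload: dict) -> dict:
    """Pull title, text, and category names out of the first returned page."""
    page = payload["query"]["pages"][0]
    return {
        "title": page["title"],
        "text": page.get("extract", ""),
        "categories": [c["title"] for c in page.get("categories", [])],
    }


def fetch_article(title: str, lang: str = "en") -> dict:
    req = urllib.request.Request(
        extract_url(title, lang), headers={"User-Agent": "nlp-data-sketch/0.1"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_page(json.load(resp))
```

Changing lang switches the language edition, for example fetch_article("Berlin", lang="de").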
5. arXiv — Scientific Text
Atom API, 150+ categories. Titles, abstracts, authors.
Use for: scientific Q&A, research assistants
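Since the arXiv API returns an Atom feed, parsing is mostly XML namespace handling. Here is a small sketch using only the standard library; the function names and the output keys are my choices, not part of the API.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def arxiv_url(category: str, start: int = 0, max_results: int = 25) -> str:
    """Build a query URL for one arXiv category, e.g. cs.CL."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return "https://export.arxiv.org/api/query?" + urllib.parse.urlencode(params)


def _text(entry: ET.Element, tag: str) -> str:
    """Read a child element and collapse the line breaks arXiv leaves in."""
    return " ".join(entry.findtext(tag, default="", namespaces=ATOM).split())


def parse_feed(xml_text: str) -> list[dict]:
    """Extract id, title, abstract, and authors from each Atom entry."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        papers.append(
            {
                "id": _text(entry, "atom:id"),
                "title": _text(entry, "atom:title"),
                "abstract": _text(entry, "atom:summary"),
                "authors": [
                    _text(author, "atom:name")
                    for author in entry.findall("atom:author", ATOM)
                ],
            }
        )
    return papers


def fetch_papers(category: str, max_results: int = 25) -> list[dict]:
    with urllib.request.urlopen(arxiv_url(category, 0, max_results), timeout=30) as resp:
        return parse_feed(resp.read().decode("utf-8"))
```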
6. Hacker News — Tech Discourse
Firebase + Algolia APIs. Stories with comment trees.
Use for: tech trend detection, opinion mining
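The two APIs complement each other: Firebase gives item IDs (for example the top stories list), while the Algolia items endpoint returns a story with its comments already nested. A minimal sketch, assuming the Algolia item shape with "author", "text", and "children" keys; the helper names are mine.

```python
import html
import json
import re
import urllib.request

FIREBASE_TOP = "https://hacker-news.firebaseio.com/v0/topstories.json"
ALGOLIA_ITEM = "https://hn.algolia.com/api/v1/items/{id}"


def get_json(url: str):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def clean(text: str) -> str:
    """Comment text arrives as HTML: drop tags, unescape entities, tidy spaces."""
    no_tags = re.sub(r"<[^>]+>", " ", text or "")
    return " ".join(html.unescape(no_tags).split())


def flatten_tree(item: dict, depth: int = 0) -> list[dict]:
    """Depth-first walk over nested children, skipping deleted (empty) comments."""
    rows = []
    for child in item.get("children", []):
        if child.get("text"):
            rows.append(
                {"author": child.get("author"), "text": clean(child["text"]), "depth": depth}
            )
        rows.extend(flatten_tree(child, depth + 1))
    return rows


def top_story_comments(n_stories: int = 3) -> dict:
    """Map each of the first n top story IDs to its flattened comments."""
    ids = get_json(FIREBASE_TOP)[:n_stories]
    return {i: flatten_tree(get_json(ALGOLIA_ITEM.format(id=i))) for i in ids}
```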
7. Bluesky — Social Network Data
AT Protocol, fully open. Posts, engagement metrics.
Use for: social NLP, sentiment analysis
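Because the protocol is open, public posts can be read without logging in. The sketch below assumes the public AppView host public.api.bsky.app and the app.bsky.feed.getAuthorFeed method; treat the response field names as my reading of the lexicon, and the helper names as made up for this example.

```python
import json
import urllib.parse
import urllib.request

PUBLIC_APPVIEW = "https://public.api.bsky.app/xrpc"


def author_feed_url(actor: str, limit: int = 50) -> str:
    """Build an unauthenticated getAuthorFeed URL for a handle or DID."""
    params = {"actor": actor, "limit": limit}
    return f"{PUBLIC_APPVIEW}/app.bsky.feed.getAuthorFeed?{urllib.parse.urlencode(params)}"


def parse_author_feed(payload: dict) -> list[dict]:
    """Flatten feed items into text plus engagement counts."""
    rows = []
    for item in payload.get("feed", []):
        post = item["post"]
        rows.append(
            {
                "uri": post["uri"],
                "handle": post["author"]["handle"],
                "text": post["record"].get("text", ""),
                "likes": post.get("likeCount", 0),
                "reposts": post.get("repostCount", 0),
                "replies": post.get("replyCount", 0),
            }
        )
    return rows


def fetch_author_feed(actor: str, limit: int = 50) -> list[dict]:
    with urllib.request.urlopen(author_feed_url(actor, limit), timeout=30) as resp:
        return parse_author_feed(json.load(resp))
```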
All tools output structured JSON. Free on Apify Store (search knotless_cadence).
What data sources do you use for training?