DEV Community

Алексей Спинов

Free Tools for Building AI Training Datasets — Reddit, YouTube, Wikipedia, arXiv

If you're training NLP models or building RAG systems, you need diverse text data. Here are 7 free data sources I built tools for:

1. Reddit — Conversational Data

Public JSON endpoints (append .json to any Reddit URL). 20+ fields per post, full comment trees.
Use for: dialogue systems, sentiment analysis, topic modeling
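
As a minimal sketch of the ".json" trick, the helpers below build the JSON URL and flatten a comment listing into rows. The function names and the User-Agent string are hypothetical, not taken from the tools in this post, and the field names (`kind`, `data`, `replies`) reflect my understanding of Reddit's listing format.

```python
import json
import urllib.request
from typing import Iterator


def reddit_json_url(url: str) -> str:
    """Turn a Reddit page URL into its JSON equivalent by appending .json."""
    base, sep, query = url.partition("?")
    return base.rstrip("/") + ".json" + sep + query


def walk_comments(children: list, depth: int = 0) -> Iterator[dict]:
    """Yield every comment in a comment listing, depth-first."""
    for child in children:
        if child.get("kind") != "t1":  # skip "more" stubs and non-comment kinds
            continue
        data = child["data"]
        yield {
            "author": data.get("author"),
            "body": data.get("body"),
            "score": data.get("score"),
            "depth": depth,
        }
        replies = data.get("replies")
        if isinstance(replies, dict):  # an empty string means "no replies"
            yield from walk_comments(replies["data"]["children"], depth + 1)


def fetch_thread(url: str) -> list:
    """Download a thread as JSON. Performs a network request."""
    req = urllib.request.Request(
        reddit_json_url(url),
        headers={"User-Agent": "dataset-sketch/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

A thread response is a two-element array (the post listing, then the comment listing), so `walk_comments(fetch_thread(url)[1]["data"]["children"])` would walk the whole tree.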

2. YouTube Comments — Engagement-Weighted Text

Innertube (YouTube's internal, undocumented API), so YouTube Data API quotas do not apply. Author, text, likes, replies.
Use for: sentiment analysis, opinion mining

3. Stack Overflow — Technical Q&A

Stack Exchange API v2.3. Questions with full answers and code.
Use for: code generation, technical Q&A assistants
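
A rough sketch of querying the Stack Exchange API and pulling code blocks out of the returned HTML bodies. The helper names are hypothetical. I'm assuming the built-in `withbody` filter to include post bodies, and gzip-compressed responses, which `urllib` does not decompress on its own.

```python
import gzip
import html
import json
import re
import urllib.request
from urllib.parse import urlencode

API_ROOT = "https://api.stackexchange.com/2.3"


def questions_url(tag: str, page: int = 1, pagesize: int = 50) -> str:
    """Build a /questions URL for top-voted Stack Overflow questions on a tag."""
    params = {
        "site": "stackoverflow",
        "tagged": tag,
        "order": "desc",
        "sort": "votes",
        "filter": "withbody",  # assumed built-in filter that adds the body field
        "page": page,
        "pagesize": pagesize,
    }
    return f"{API_ROOT}/questions?{urlencode(params)}"


def extract_code(body_html: str) -> list:
    """Return the unescaped contents of every <pre><code> block in a post body."""
    blocks = re.findall(
        r"<pre[^>]*><code[^>]*>(.*?)</code></pre>", body_html, flags=re.DOTALL
    )
    return [html.unescape(block) for block in blocks]


def fetch_questions(tag: str) -> list:
    """Fetch one page of questions. Performs a network request."""
    with urllib.request.urlopen(questions_url(tag), timeout=30) as resp:
        raw = resp.read()
    if raw[:2] == b"\x1f\x8b":  # gzip magic bytes
        raw = gzip.decompress(raw)
    return json.loads(raw)["items"]
```

Splitting prose from code this way makes it easy to emit instruction/answer pairs for code generation, or to keep only the text for Q&A assistants.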

4. Wikipedia — Encyclopedic Knowledge

MediaWiki API, 40+ languages. Full article text with categories.
Use for: knowledge grounding, RAG, entity extraction
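
One possible way to request plain article text and categories from the MediaWiki API, sketched with hypothetical helper names. It assumes the `extracts` property (from the TextExtracts extension that Wikipedia runs) and `formatversion=2`, which returns `pages` as a list.

```python
import json
import urllib.request
from urllib.parse import urlencode


def wikipedia_url(title: str, lang: str = "en") -> str:
    """Build a query URL for one article's plain text and categories."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "extracts|categories",
        "explaintext": "1",  # plain text instead of HTML
        "cllimit": "max",
        "redirects": "1",
        "titles": title,
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def parse_page(payload: dict) -> dict:
    """Reduce an API response to title, text, and bare category names."""
    page = payload["query"]["pages"][0]
    return {
        "title": page["title"],
        "text": page.get("extract", ""),
        # category titles carry a namespace prefix such as "Category:"
        "categories": [
            c["title"].split(":", 1)[-1] for c in page.get("categories", [])
        ],
    }


def fetch_article(title: str, lang: str = "en") -> dict:
    """Fetch and parse one article. Performs a network request."""
    req = urllib.request.Request(
        wikipedia_url(title, lang),
        headers={"User-Agent": "dataset-sketch/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_page(json.load(resp))
```

Changing `lang` switches the language edition, since each one lives on its own subdomain.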

5. arXiv — Scientific Text

Atom API, 150+ categories. Titles, abstracts, authors.
Use for: scientific Q&A, research assistants
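
The arXiv API returns an Atom feed, so a standard XML parser is enough. This is a small sketch with hypothetical function names; the `cat:` search prefix and the `export.arxiv.org` endpoint are what I understand the public API to use.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace


def arxiv_url(category: str, start: int = 0, max_results: int = 10) -> str:
    """Build a query URL for the newest-indexed papers in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)


def parse_feed(xml_data) -> list:
    """Extract id, title, abstract, and authors from an Atom feed (str or bytes)."""
    root = ET.fromstring(xml_data)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        papers.append({
            "id": entry.findtext(f"{ATOM}id", default=""),
            # titles and abstracts contain hard line breaks; collapse whitespace
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "abstract": " ".join(entry.findtext(f"{ATOM}summary", default="").split()),
            "authors": [
                a.findtext(f"{ATOM}name", default="")
                for a in entry.findall(f"{ATOM}author")
            ],
        })
    return papers


def fetch_papers(category: str, max_results: int = 10) -> list:
    """Fetch and parse one page of results. Performs a network request."""
    url = arxiv_url(category, max_results=max_results)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_feed(resp.read())
```

For larger pulls, step `start` forward by `max_results` and pause between requests.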

6. Hacker News — Tech Discourse

Official Firebase API plus the Algolia search API. Stories with comment trees.
Use for: tech trend detection, opinion mining
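
To illustrate how the two APIs differ: as far as I know, the Firebase endpoint returns one item at a time with child IDs in `kids`, while the Algolia `items` endpoint returns the whole thread nested under `children`. The sketch below flattens the Algolia shape; the function names are hypothetical.

```python
import json
import urllib.request
from typing import Iterator


def firebase_item_url(item_id: int) -> str:
    """URL for a single item (story or comment) on the Firebase API."""
    return f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"


def algolia_item_url(item_id: int) -> str:
    """URL for a story with its full nested comment tree on the Algolia API."""
    return f"https://hn.algolia.com/api/v1/items/{item_id}"


def flatten_comments(item: dict, depth: int = 0) -> Iterator[dict]:
    """Yield every comment under an Algolia item, depth-first."""
    for child in item.get("children", []):
        if child.get("text"):  # deleted comments come back without text
            yield {
                "author": child.get("author"),
                "text": child["text"],
                "depth": depth,
            }
        # still descend, so replies to deleted comments are kept
        yield from flatten_comments(child, depth + 1)


def fetch_story(item_id: int) -> dict:
    """Fetch a story and its comments from Algolia. Performs a network request."""
    with urllib.request.urlopen(algolia_item_url(item_id), timeout=30) as resp:
        return json.load(resp)
```

One Algolia call replaces the many per-comment Firebase requests a recursive crawl would need, which is why combining the two is convenient.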

7. Bluesky — Social Network Data

AT Protocol, fully open. Posts, engagement metrics.
Use for: social NLP, sentiment analysis
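
A hedged sketch of reading public posts over the AT Protocol's HTTP (XRPC) interface. I'm assuming the unauthenticated `public.api.bsky.app` host and the `app.bsky.feed.getAuthorFeed` method; the helper names and the output row layout are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode

PUBLIC_API = "https://public.api.bsky.app/xrpc"  # assumed public AppView host


def author_feed_url(actor: str, limit: int = 50, cursor: str = "") -> str:
    """Build a getAuthorFeed URL for a handle or DID."""
    params = {"actor": actor, "limit": limit}
    if cursor:
        params["cursor"] = cursor  # pagination token from the previous page
    return f"{PUBLIC_API}/app.bsky.feed.getAuthorFeed?{urlencode(params)}"


def parse_feed(payload: dict) -> list:
    """Reduce a feed response to text plus engagement counts."""
    rows = []
    for entry in payload.get("feed", []):
        post = entry["post"]
        record = post.get("record", {})
        rows.append({
            "uri": post.get("uri"),
            "handle": post.get("author", {}).get("handle"),
            "text": record.get("text", ""),
            "created_at": record.get("createdAt"),
            "likes": post.get("likeCount", 0),
            "reposts": post.get("repostCount", 0),
            "replies": post.get("replyCount", 0),
        })
    return rows


def fetch_author_feed(actor: str, limit: int = 50) -> list:
    """Fetch and parse one page of an author's feed. Performs a network request."""
    with urllib.request.urlopen(author_feed_url(actor, limit), timeout=30) as resp:
        return parse_feed(json.load(resp))
```

The engagement counts sit on the post view rather than inside `record`, so the sketch reads them from two different levels of the response.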

All tools output structured JSON. Free on Apify Store (search knotless_cadence).

What data sources do you use for training?
