How to Scrape Reddit Posts & Comments for AI / RAG (Python + No-Code)

#ai #webscraping #python

Reddit is one of the richest sources of real human opinion on the internet — which makes it gold for RAG pipelines, sentiment analysis, and market research. Here's how to pull Reddit posts and comments in 2026, the limits to know about, and a no-code option that outputs clean markdown ready for embeddings.

Option 1: the official Reddit API (PRAW)

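import praw

# Credentials come from an app registered in your Reddit account.
reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="my-app")

# Walk the 25 hottest posts in r/dataengineering.
for post in reddit.subreddit("dataengineering").hot(limit=25):
    print(post.title, post.score, post.url)
    # Drop "load more comments" placeholders so only real comments remain.
    post.comments.replace_more(limit=0)
    # Print the first 120 characters of up to 10 comments.
    for c in post.comments.list()[:10]:
        print("  ", c.body[:120])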

This works, but note the catches: you need a registered app with OAuth credentials, the free tier is rate-limited, and listings are capped at ~1000 items, so you can't page back through a large subreddit's full history.
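One way to live with the listing cap is to pull on a schedule and accumulate what you fetch, keyed by post ID, so repeated pulls don't produce duplicates. The sketch below is one possible approach, not an official pattern: `to_record` and `merge_records` are hypothetical helper names, and the code only assumes post objects expose the same attributes used in the loop above (`id`, `title`, `score`, `url`, `comments`).

```python
def to_record(post, max_comments=10):
    """Flatten a PRAW-style submission into a plain dict (hypothetical helper)."""
    # Drop "load more comments" placeholders, as in the loop above.
    post.comments.replace_more(limit=0)
    comments = [c.body for c in post.comments.list()[:max_comments]]
    return {
        "id": post.id,
        "title": post.title,
        "score": post.score,
        "url": post.url,
        "comments": comments,
    }


def merge_records(store, records):
    """Merge records into `store` (a dict keyed by post id); return how many were new."""
    added = 0
    for rec in records:
        if rec["id"] not in store:
            added += 1
        # A later pull overwrites the earlier copy, so scores and comments stay fresh.
        store[rec["id"]] = rec
    return added
```

Calling `merge_records(store, [to_record(p) for p in listing])` after each pull keeps one entry per post. Where `store` lives between runs (a JSON file, a database) is left open here.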

Option 2: historical archives (beyond the 1000-item cap)

For data older than what the API will return, the community archive PullPush (the successor to Pushshift) lets you query historical posts and comments by subreddit, date range, and keyword — useful for backfilling years of data.
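As a rough illustration of a subreddit, date-range, and keyword query, here is a minimal sketch using only the standard library. Treat the details as assumptions to check against the PullPush documentation: the endpoint URL, the Pushshift-style parameter names (`subreddit`, `after`, `before`, `q`, `size`), and the `data` key in the response are not taken from this article. `build_query` and `fetch` are hypothetical helper names.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the PullPush docs before relying on it.
PULLPUSH_SUBMISSIONS = "https://api.pullpush.io/reddit/search/submission/"


def to_epoch(date_str):
    """Convert 'YYYY-MM-DD' (interpreted as UTC midnight) to a Unix timestamp."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def build_query(subreddit, start, end, keyword=None, size=100):
    """Build a search URL for posts in `subreddit` between two dates."""
    params = {
        "subreddit": subreddit,
        "after": to_epoch(start),   # assumed parameter name
        "before": to_epoch(end),    # assumed parameter name
        "size": size,               # assumed parameter name
    }
    if keyword:
        params["q"] = keyword       # assumed parameter name
    return PULLPUSH_SUBMISSIONS + "?" + urlencode(params)


def fetch(url):
    """Fetch one page of results; assumes a JSON body shaped like {"data": [...]}."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp).get("data", [])
```

For example, `fetch(build_query("dataengineering", "2023-01-01", "2023-02-01", keyword="airflow"))` would request one page of posts from January 2023. Backfilling years of data would mean stepping the date window forward and repeating the call.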

Option 3: no-code / markdown output for AI

If your goal is an AI/RAG dataset, you mostly want clean text, not JSON plumbing. The Reddit Scraper returns posts, comments, and user data as AI-ready markdown (with word and token counts), with no API keys to manage. An example input:

{
  "subreddits": ["MachineLearning", "LocalLLaMA"],
  "sort": "top",
  "maxPosts": 200,
  "includeComments": true
}

For deep history beyond the API's cap, there's a companion Reddit Archive Scraper that pulls years of posts from the archive by date range.
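To make "markdown with word/token counts" concrete, here is a minimal sketch of what such a conversion might look like if you did it yourself. It is not the Reddit Scraper's actual output format: the input field names (`title`, `selftext`, `comments`, `author`, `score`, `body`) and the function name `post_to_markdown` are assumptions for illustration, and the token figure is a rough words-times-4/3 heuristic rather than a real tokenizer count.

```python
def post_to_markdown(post):
    """Render a post dict as markdown and attach rough size counts (hypothetical helper)."""
    lines = [f"# {post['title']}", ""]

    body = post.get("selftext", "").strip()
    if body:
        lines += [body, ""]

    comments = post.get("comments", [])
    if comments:
        lines += ["## Comments", ""]
        for c in comments:
            # Collapse internal newlines so each comment stays on one bullet.
            text = " ".join(c["body"].split())
            lines.append(f"- **{c['author']}** ({c['score']} points): {text}")

    markdown = "\n".join(lines).strip() + "\n"
    # Whitespace-separated word count; markdown markers like "#" are counted too.
    words = len(markdown.split())
    return {
        "markdown": markdown,
        "word_count": words,
        # Heuristic only: roughly 4 tokens per 3 English words.
        "approx_tokens": round(words * 4 / 3),
    }
```

If the counts need to match a specific embedding model's limits, swapping the heuristic for that model's own tokenizer would be the safer choice.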

Which should you use?

Small, live pulls? The official API via PRAW is free and fine.
AI/RAG datasets or historical backfill? A managed scraper saves you OAuth, rate-limit handling, and the markdown conversion.

Use cases

RAG / fine-tuning datasets — real Q&A and discussion as training context.
Sentiment & trend analysis — track opinion on products, tickers, or topics.
Market research — mine niche subreddits for pain points and feature requests.
Community monitoring — watch mentions of your brand or competitors.

FAQ

Do I need Reddit API credentials? For PRAW, yes. The managed scraper handles access for you.

Why only ~1000 posts from the API? Reddit caps listing pagination. Use an archive source for deeper history.

What format is best for AI? Markdown or plain text with token counts — which is what the Reddit Scraper outputs.

Building an AI app on Reddit data? The Reddit Scraper gives you posts and comments as clean markdown — no API keys, no rate-limit juggling.

DEV Community