DEV Community

muskert
Build a Reddit Comment Scraper with Apify: Extract Deep Discussion Data in Minutes

If you've ever wanted to analyze what people are actually saying on Reddit — not just the post titles, but the full thread of comments, the debate, the nuance — you know how painful it is to get that data reliably. Reddit's official API is rate-limited, third-party tools are often paywalled, and browser automation is fragile.

This tutorial shows you how to build your own Reddit Comment Scraper using Apify and Python. By the end, you'll have an Apify Actor that can:

  • Scrape comments from any Reddit post URL
  • Search posts by keyword across Reddit (or within a specific subreddit)
  • Extract rich metadata: author, karma, upvotes, downvotes, gilded status, timestamps, and nested replies
  • Run on Apify's infrastructure for free (the free tier includes $5 of usage credits per month)

Updated for 2026: uses the PullPush API (the community successor to Pushshift) together with Reddit's public JSON API for reliable data extraction without authentication.


Why Reddit Comment Data Is Valuable

Reddit is one of the richest sources of authentic public discourse on the internet. Unlike Twitter (now X), Reddit threads allow for deep, nested discussions that can span dozens of replies and reveal real sentiment, expertise, and community consensus.

Use cases:

  • Market Research: Understand what users genuinely think about a product
  • Sentiment Analysis: Train NLP models on real consumer opinions from niche communities
  • Academic Research: Study how communities discuss topics like AI, climate change, or political events
  • Competitor Monitoring: Track discussions about competitors in relevant subreddits
  • Content Inspiration: Find high-engagement post topics in your niche

Architecture Overview

Our scraper uses a two-layer approach:

Layer 1: Post Discovery (PullPush API / Reddit Search)
         ↓ post URLs + metadata
Layer 2: Comment Extraction (Reddit JSON API .json endpoint)

This approach is authentication-free — we use public APIs that don't require Reddit login credentials.


Project Setup

Create the following file structure:

reddit-comment-scraper/
├── .actor/
│   └── actor.json
├── src/
│   └── main.py
├── Dockerfile
└── README.md

The Input Schema

The actor.json defines two input modes: url (scrape a specific post) and search (find posts by keyword). Key parameters include maxComments (up to 500), maxReplies per comment, and sort order (top, hot, new, controversial).
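A minimal example input for search mode might look like this (the field names follow the schema described above; the values are illustrative):

```json
{
  "mode": "search",
  "query": "mechanical keyboards",
  "subreddit": "MechanicalKeyboards",
  "maxComments": 200,
  "maxReplies": 5,
  "sort": "top"
}
```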


Core Scraper Logic

URL Parsing

import re
from urllib.parse import urlparse

def parse_reddit_url(url):
    path = urlparse(url).path.rstrip("/")
    match = re.match(r"/r/([^/]+)/comments/([a-zA-Z0-9]+)", path)
    if match:
        return match.group(1), match.group(2)
    match = re.match(r"/comments/([a-zA-Z0-9]+)", path)
    if match:
        return None, match.group(1)
    return None, None
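To sanity-check the first pattern, here is what it extracts from a typical post URL (the URL itself is made up for illustration):

```python
import re
from urllib.parse import urlparse

# Canonical post URL -> (subreddit, post_id); same regex as in parse_reddit_url
url = "https://www.reddit.com/r/python/comments/1abc23/why_i_like_snakes/"
path = urlparse(url).path.rstrip("/")
match = re.match(r"/r/([^/]+)/comments/([a-zA-Z0-9]+)", path)
print(match.group(1), match.group(2))  # python 1abc23
```

The second pattern covers short-form `reddit.com/comments/<id>` links, where the subreddit isn't present in the URL at all.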

Reddit JSON API

import requests
HEADERS = {"User-Agent": "ApifyRedditScraper/1.0 (by u/apify_dev)", "Accept": "application/json"}

def get_post_json(subreddit, post_id):
    if subreddit:
        url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"
    else:
        url = f"https://www.reddit.com/comments/{post_id}.json"
    resp = requests.get(url, headers=HEADERS, timeout=20)
    if resp.status_code == 200:
        return resp.json()
    return None

Reddit's .json endpoint returns a two-element array: post metadata at [0] and the comment tree at [1].
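In other words, once `get_post_json` succeeds you can split the payload like this (the payload below is a synthetic stand-in for a real response, trimmed to the fields we care about):

```python
# Synthetic stand-in for the two-element array returned by the .json endpoint
payload = [
    {"kind": "Listing", "data": {"children": [
        {"kind": "t3", "data": {"title": "Example post", "score": 42}},
    ]}},
    {"kind": "Listing", "data": {"children": [
        {"kind": "t1", "data": {"author": "alice", "body": "First comment"}},
    ]}},
]

post = payload[0]["data"]["children"][0]["data"]   # t3 = the submission itself
comment_tree = payload[1]["data"]["children"]      # t1 = top-level comments
print(post["title"], len(comment_tree))  # Example post 1
```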

Recursive Comment Extraction

def extract_comment(comment_data, depth=0):
    data = comment_data.get("data", {})
    author = data.get("author", "[deleted]")
    body = data.get("body", "")
    if author == "[deleted]" or not body or body == "[removed]":
        return None
    return {
        "author": author,
        "body": body[:2000],
        "score": data.get("score", 0),
        "upvotes": data.get("ups", 0),
        "downvotes": data.get("downs", 0),  # the modern Reddit API reports 0 here
        "gilded": data.get("gilded", 0),
        "timestamp_utc": data.get("created_utc", 0),
        "depth": depth,
        "comment_id": data.get("id", ""),
        "permalink": f"https://www.reddit.com{data.get('permalink', '')}",
    }

Comment Tree Traversal

def extract_comments_listing(listing_data, max_comments=100, max_replies=5):
    comments = []
    count = 0
    def traverse(post_data, depth=0):
        nonlocal count
        if count >= max_comments:
            return
        kind = post_data.get("kind", "")
        data = post_data.get("data", {})
        if kind == "t1":
            comment = extract_comment(post_data, depth=depth)
            if comment:
                comments.append(comment)
                count += 1
            replies = data.get("replies", {})
            if isinstance(replies, dict):
                for reply in replies.get("data", {}).get("children", [])[:max_replies]:
                    traverse(reply, depth=depth + 1)
        elif kind == "Listing":
            for child in data.get("children", []):
                traverse(child, depth=depth)
    for post_data in listing_data:
        traverse(post_data)
    return comments
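One quirk worth knowing: when a comment has no replies, Reddit returns an empty string for `replies` instead of a Listing object, which is why `traverse` guards with `isinstance(replies, dict)`. A synthetic comment illustrates both cases:

```python
# A t1 comment with one nested reply, in the shape traverse() walks
comment = {"kind": "t1", "data": {
    "author": "alice", "body": "parent comment", "score": 5,
    "replies": {"kind": "Listing", "data": {"children": [
        {"kind": "t1", "data": {"author": "bob", "body": "child reply",
                                "replies": ""}},
    ]}},
}}

replies = comment["data"]["replies"]
# With replies present, "replies" is a dict (a Listing); without, it's ""
children = replies["data"]["children"] if isinstance(replies, dict) else []
print(len(children), children[0]["data"]["author"])  # 1 bob

leaf_replies = children[0]["data"]["replies"]  # "" -- no nested Listing here
print(isinstance(leaf_replies, dict))  # False
```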

Post Discovery via PullPush (the Pushshift successor)

def search_reddit_posts(query, subreddit="", limit=5):
    url = "https://api.pullpush.io/reddit/search/submission/"
    params = {"q": query, "sort": "desc", "sort_type": "score", "size": limit}
    if subreddit:
        params["subreddit"] = subreddit
    resp = requests.get(url, params=params, timeout=15)
    if resp.status_code == 200:
        return [{"title": p.get("title", ""), "post_id": p.get("id", ""),
                 "subreddit": p.get("subreddit", ""), "score": p.get("score", 0)}
                for p in resp.json().get("data", [])]
    return []
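The response body is a JSON object with a `data` array of submissions; the comprehension above condenses each one to four fields. With a synthetic response (field names match the PullPush/Pushshift schema) it behaves like this:

```python
# Synthetic PullPush response body, trimmed to the fields we extract
resp_json = {"data": [
    {"title": "Best budget mechanical keyboards?", "id": "xyz789",
     "subreddit": "MechanicalKeyboards", "score": 312, "num_comments": 87},
]}

posts = [{"title": p.get("title", ""), "post_id": p.get("id", ""),
          "subreddit": p.get("subreddit", ""), "score": p.get("score", 0)}
         for p in resp_json.get("data", [])]
print(posts[0]["post_id"], posts[0]["score"])  # xyz789 312
```

Each `post_id` can then be fed straight into `get_post_json` for comment extraction.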

Dockerfile

FROM python:3.11-slim
WORKDIR /actor
RUN apt-get update && apt-get install -y wget curl && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir requests beautifulsoup4 lxml
COPY src/ ./src/
CMD ["python3", "src/main.py"]

Deploy to Apify

npm install -g apify-cli
apify login
cd reddit-comment-scraper
apify actors push --version 0.1

Live at: https://apify.com/yawning_pit/reddit-comment-scraper


Example Output

{
  "mode": "url",
  "items": [{
    "type": "post",
    "title": "LLMs are killing my ability to code. I feel like a fraud.",
    "author": "throwaway_dev_42",
    "subreddit": "r/programming",
    "score": 4281,
    "comments": [{
      "author": "senior_eng_10yrs",
      "body": "This is a natural part of skill evolution...",
      "score": 1247,
      "depth": 0,
      "gilded": 2
    }],
    "comments_count": 100
  }]
}

Limitations

  1. Deleted/Removed Comments: Skipped automatically
  2. Rate Limiting: Space out large crawls
  3. NSFW Content: Set "over_18": true in the input JSON
  4. Pushshift Stability: Fall back to Reddit native search if Pushshift is down
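For item 2, the simplest mitigation is a fixed pause between requests. A minimal sketch (`fetch_all` and its arguments are placeholders you'd adapt to your crawl, e.g. passing `get_post_json` as the fetch callable):

```python
import time

def fetch_all(post_ids, fetch, delay_s=2.0):
    """Call fetch(post_id) for each ID, pausing between requests to stay polite."""
    results = []
    for i, pid in enumerate(post_ids):
        if i > 0:
            time.sleep(delay_s)  # space out requests to avoid HTTP 429s
        results.append(fetch(pid))
    return results

print(fetch_all(["a1", "b2"], str.upper, delay_s=0.0))  # ['A1', 'B2']
```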

Next Steps

  • Add GPT-powered sentiment scoring per comment
  • Multi-subreddit simultaneous search
  • Time-series engagement tracking
  • CSV/Google Sheets export integration

Full source on GitHub: xiaclaw2018/devnest

Built with Apify, Python, and the public Reddit JSON API. No authentication required.
