DEV Community

muskert
Build a Reddit Comment Scraper with Apify: Extract Deep Discussion Data in Minutes

If you've ever wanted to analyze what people are actually saying on Reddit — not just the post titles, but the full thread of comments, the debate, the nuance — you know how painful it is to get that data reliably. Reddit's official API is rate-limited, third-party tools are often paywalled, and browser automation is fragile.

This tutorial shows you how to build your own Reddit Comment Scraper using Apify and Python. By the end, you'll have an Apify Actor that can:

  • Scrape comments from any Reddit post URL
  • Search posts by keyword across Reddit (or within a specific subreddit)
  • Extract rich metadata: author, karma, upvotes, downvotes, gilded status, timestamps, and nested replies
  • Run on Apify's infrastructure for free (the free tier includes $5 of usage credits per month)

Updated for 2026: uses the PullPush API (the community successor to Pushshift) together with Reddit's public JSON API for reliable data extraction without authentication.


Why Reddit Comment Data Is Valuable

Reddit is one of the richest sources of authentic public discourse on the internet. Unlike Twitter (now X), Reddit threads allow for deep, nested discussions that can span dozens of replies and reveal real sentiment, expertise, and community consensus.

Use cases:

  • Market Research: Understand what users genuinely think about a product
  • Sentiment Analysis: Train NLP models on real consumer opinions from niche communities
  • Academic Research: Study how communities discuss topics like AI, climate change, or political events
  • Competitor Monitoring: Track discussions about competitors in relevant subreddits
  • Content Inspiration: Find high-engagement post topics in your niche

Architecture Overview

Our scraper uses a two-layer approach:

Layer 1: Post Discovery (PullPush API / Reddit Search)
         ↓ post URLs + metadata
Layer 2: Comment Extraction (Reddit JSON API .json endpoint)

This approach is authentication-free — we use public APIs that don't require Reddit login credentials.


Project Setup

Create the following file structure:

reddit-comment-scraper/
├── .actor/
│   └── actor.json
├── src/
│   └── main.py
├── Dockerfile
└── README.md

The Input Schema

The actor.json defines two input modes: url (scrape a specific post) and search (find posts by keyword). Key parameters include maxComments (up to 500), maxReplies per comment, and sort order (top, hot, new, controversial).
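A minimal example input for search mode might look like this (the field names follow the schema described above; the values are illustrative):

```json
{
  "mode": "search",
  "query": "mechanical keyboards",
  "subreddit": "MechanicalKeyboards",
  "maxComments": 200,
  "maxReplies": 5,
  "sort": "top"
}
```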


Core Scraper Logic

URL Parsing

import re
from urllib.parse import urlparse

def parse_reddit_url(url):
    path = urlparse(url).path.rstrip("/")
    match = re.match(r"/r/([^/]+)/comments/([a-zA-Z0-9]+)", path)
    if match:
        return match.group(1), match.group(2)
    match = re.match(r"/comments/([a-zA-Z0-9]+)", path)
    if match:
        return None, match.group(1)
    return None, None
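To sanity-check the first pattern, here is what it extracts from a typical post URL (the URL itself is made up for illustration):

```python
import re
from urllib.parse import urlparse

# Canonical post URL -> (subreddit, post_id); same regex as in parse_reddit_url
url = "https://www.reddit.com/r/python/comments/1abc23/why_i_like_snakes/"
path = urlparse(url).path.rstrip("/")
match = re.match(r"/r/([^/]+)/comments/([a-zA-Z0-9]+)", path)
print(match.group(1), match.group(2))  # python 1abc23
```

The second pattern covers short-form `reddit.com/comments/<id>` links, where the subreddit isn't present in the URL at all.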

Reddit JSON API

import requests
HEADERS = {"User-Agent": "ApifyRedditScraper/1.0 (by u/apify_dev)", "Accept": "application/json"}

def get_post_json(subreddit, post_id):
    if subreddit:
        url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"
    else:
        url = f"https://www.reddit.com/comments/{post_id}.json"
    resp = requests.get(url, headers=HEADERS, timeout=20)
    if resp.status_code == 200:
        return resp.json()
    return None

Reddit's .json endpoint returns a two-element array: post metadata at [0] and the comment tree at [1].
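In other words, once `get_post_json` succeeds you can split the payload like this (the payload below is a synthetic stand-in for a real response, trimmed to the fields we care about):

```python
# Synthetic stand-in for the two-element array returned by the .json endpoint
payload = [
    {"kind": "Listing", "data": {"children": [
        {"kind": "t3", "data": {"title": "Example post", "score": 42}},
    ]}},
    {"kind": "Listing", "data": {"children": [
        {"kind": "t1", "data": {"author": "alice", "body": "First comment"}},
    ]}},
]

post = payload[0]["data"]["children"][0]["data"]   # t3 = the submission itself
comment_tree = payload[1]["data"]["children"]      # t1 = top-level comments
print(post["title"], len(comment_tree))  # Example post 1
```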

Recursive Comment Extraction

def extract_comment(comment_data, depth=0):
    data = comment_data.get("data", {})
    author = data.get("author", "[deleted]")
    body = data.get("body", "")
    if author == "[deleted]" or not body or body == "[removed]":
        return None
    return {
        "author": author,
        "body": body[:2000],
        "score": data.get("score", 0),
        "upvotes": data.get("ups", 0),
        "downvotes": data.get("downs", 0),  # the modern Reddit API reports 0 here
        "gilded": data.get("gilded", 0),
        "timestamp_utc": data.get("created_utc", 0),
        "depth": depth,
        "comment_id": data.get("id", ""),
        "permalink": f"https://www.reddit.com{data.get('permalink', '')}",
    }

Comment Tree Traversal

def extract_comments_listing(listing_data, max_comments=100, max_replies=5):
    comments = []
    count = 0
    def traverse(post_data, depth=0):
        nonlocal count
        if count >= max_comments:
            return
        kind = post_data.get("kind", "")
        data = post_data.get("data", {})
        if kind == "t1":
            comment = extract_comment(post_data, depth=depth)
            if comment:
                comments.append(comment)
                count += 1
            replies = data.get("replies", {})
            if isinstance(replies, dict):
                for reply in replies.get("data", {}).get("children", [])[:max_replies]:
                    traverse(reply, depth=depth + 1)
        elif kind == "Listing":
            for child in data.get("children", []):
                traverse(child, depth=depth)
    for post_data in listing_data:
        traverse(post_data)
    return comments
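One quirk worth knowing: when a comment has no replies, Reddit returns an empty string for `replies` instead of a Listing object, which is why `traverse` guards with `isinstance(replies, dict)`. A synthetic comment illustrates both cases:

```python
# A t1 comment with one nested reply, in the shape traverse() walks
comment = {"kind": "t1", "data": {
    "author": "alice", "body": "parent comment", "score": 5,
    "replies": {"kind": "Listing", "data": {"children": [
        {"kind": "t1", "data": {"author": "bob", "body": "child reply",
                                "replies": ""}},
    ]}},
}}

replies = comment["data"]["replies"]
# With replies present, "replies" is a dict (a Listing); without, it's ""
children = replies["data"]["children"] if isinstance(replies, dict) else []
print(len(children), children[0]["data"]["author"])  # 1 bob

leaf_replies = children[0]["data"]["replies"]  # "" -- no nested Listing here
print(isinstance(leaf_replies, dict))  # False
```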

Post Discovery via PullPush (the Pushshift successor)

def search_reddit_posts(query, subreddit="", limit=5):
    url = "https://api.pullpush.io/reddit/search/submission/"
    params = {"q": query, "sort": "desc", "sort_type": "score", "size": limit}
    if subreddit:
        params["subreddit"] = subreddit
    resp = requests.get(url, params=params, timeout=15)
    if resp.status_code == 200:
        return [{"title": p.get("title", ""), "post_id": p.get("id", ""),
                 "subreddit": p.get("subreddit", ""), "score": p.get("score", 0)}
                for p in resp.json().get("data", [])]
    return []
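The response body is a JSON object with a `data` array of submissions; the comprehension above condenses each one to four fields. With a synthetic response (field names match the PullPush/Pushshift schema) it behaves like this:

```python
# Synthetic PullPush response body, trimmed to the fields we extract
resp_json = {"data": [
    {"title": "Best budget mechanical keyboards?", "id": "xyz789",
     "subreddit": "MechanicalKeyboards", "score": 312, "num_comments": 87},
]}

posts = [{"title": p.get("title", ""), "post_id": p.get("id", ""),
          "subreddit": p.get("subreddit", ""), "score": p.get("score", 0)}
         for p in resp_json.get("data", [])]
print(posts[0]["post_id"], posts[0]["score"])  # xyz789 312
```

Each `post_id` can then be fed straight into `get_post_json` for comment extraction.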

Dockerfile

FROM python:3.11-slim
WORKDIR /actor
RUN apt-get update && apt-get install -y wget curl && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir requests beautifulsoup4 lxml
COPY src/ ./src/
CMD ["python3", "src/main.py"]

Deploy to Apify

npm install -g apify-cli
apify login
cd reddit-comment-scraper
apify actors push --version 0.1

Live at: https://apify.com/yawning_pit/reddit-comment-scraper


Example Output

{
  "mode": "url",
  "items": [{
    "type": "post",
    "title": "LLMs are killing my ability to code. I feel like a fraud.",
    "author": "throwaway_dev_42",
    "subreddit": "r/programming",
    "score": 4281,
    "comments": [{
      "author": "senior_eng_10yrs",
      "body": "This is a natural part of skill evolution...",
      "score": 1247,
      "depth": 0,
      "gilded": 2
    }],
    "comments_count": 100
  }]
}

Limitations

  1. Deleted/Removed Comments: Skipped automatically
  2. Rate Limiting: Space out large crawls
  3. NSFW Content: Set "over_18": true in the input JSON
  4. Pushshift Stability: Fall back to Reddit native search if Pushshift is down
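For item 2, the simplest mitigation is a fixed pause between requests. A minimal sketch (`fetch_all` and its arguments are placeholders you'd adapt to your crawl, e.g. passing `get_post_json` as the fetch callable):

```python
import time

def fetch_all(post_ids, fetch, delay_s=2.0):
    """Call fetch(post_id) for each ID, pausing between requests to stay polite."""
    results = []
    for i, pid in enumerate(post_ids):
        if i > 0:
            time.sleep(delay_s)  # space out requests to avoid HTTP 429s
        results.append(fetch(pid))
    return results

print(fetch_all(["a1", "b2"], str.upper, delay_s=0.0))  # ['A1', 'B2']
```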

Next Steps

  • Add GPT-powered sentiment scoring per comment
  • Multi-subreddit simultaneous search
  • Time-series engagement tracking
  • CSV/Google Sheets export integration

Full source on GitHub: xiaclaw2018/devnest

Built with Apify, Python, and the public Reddit JSON API. No authentication required.
