Build a Reddit Comment Scraper with Apify: Extract Deep Discussion Data in Minutes
If you've ever wanted to analyze what people are actually saying on Reddit — not just the post titles, but the full thread of comments, the debate, the nuance — you know how painful it is to get that data reliably. Reddit's official API is rate-limited, third-party tools are often paywalled, and browser automation is fragile.
This tutorial shows you how to build your own Reddit Comment Scraper using Apify and Python. By the end, you'll have an Apify Actor that can:
- Scrape comments from any Reddit post URL
- Search posts by keyword across Reddit (or within a specific subreddit)
- Extract rich metadata: author, karma, upvotes, downvotes, gilded status, timestamps, and nested replies
- Run on Apify's infrastructure for free (the free tier includes $5 in platform credits per month)
Updated for 2026: uses the PullPush API (the community-maintained Pushshift successor) plus Reddit's public JSON endpoints for reliable data extraction without authentication.
Why Reddit Comment Data Is Valuable
Reddit is one of the richest sources of authentic public discourse on the internet. Unlike Twitter (now X), Reddit threads allow for deep, nested discussions that can span dozens of replies and reveal real sentiment, expertise, and community consensus.
Use cases:
- Market Research: Understand what users genuinely think about a product
- Sentiment Analysis: Train NLP models on real consumer opinions from niche communities
- Academic Research: Study how communities discuss topics like AI, climate change, or political events
- Competitor Monitoring: Track discussions about competitors in relevant subreddits
- Content Inspiration: Find high-engagement post topics in your niche
Architecture Overview
Our scraper uses a two-layer approach:
Layer 1: Post Discovery (PullPush API / Reddit search)
↓ post URLs + metadata
Layer 2: Comment Extraction (Reddit JSON API .json endpoint)
This approach is authentication-free — we use public APIs that don't require Reddit login credentials.
Project Setup
Create the following file structure:
reddit-comment-scraper/
├── .actor/
│   └── actor.json
├── src/
│   └── main.py
├── Dockerfile
└── README.md
The Input Schema
The actor.json defines two input modes: url (scrape a specific post) and search (find posts by keyword). Key parameters include maxComments (up to 500), maxReplies per comment, and sort order (top, hot, new, controversial).
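As a rough illustration, a schema along these lines covers the parameters above (Apify actors typically keep this in .actor/input_schema.json or inline in actor.json; the exact field names here are my assumptions, not the project's actual schema):

```json
{
    "title": "Reddit Comment Scraper input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "mode": { "title": "Mode", "type": "string", "enum": ["url", "search"], "default": "url" },
        "postUrl": { "title": "Post URL", "type": "string", "editor": "textfield" },
        "query": { "title": "Search query", "type": "string", "editor": "textfield" },
        "subreddit": { "title": "Subreddit", "type": "string", "editor": "textfield" },
        "maxComments": { "title": "Max comments", "type": "integer", "default": 100, "maximum": 500 },
        "maxReplies": { "title": "Max replies per comment", "type": "integer", "default": 5 },
        "sort": { "title": "Sort order", "type": "string", "enum": ["top", "hot", "new", "controversial"], "default": "top" }
    },
    "required": ["mode"]
}
```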
Core Scraper Logic
URL Parsing
import re
from urllib.parse import urlparse

def parse_reddit_url(url):
    """Extract (subreddit, post_id) from a Reddit post URL; either may be None."""
    path = urlparse(url).path.rstrip("/")
    match = re.match(r"/r/([^/]+)/comments/([a-zA-Z0-9]+)", path)
    if match:
        return match.group(1), match.group(2)
    # Some share links omit the subreddit segment entirely
    match = re.match(r"/comments/([a-zA-Z0-9]+)", path)
    if match:
        return None, match.group(1)
    return None, None
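A quick sanity check of the parser (the function is repeated here so the snippet runs standalone; the URLs are made-up examples):

```python
import re
from urllib.parse import urlparse

def parse_reddit_url(url):
    """Extract (subreddit, post_id) from a Reddit post URL; either may be None."""
    path = urlparse(url).path.rstrip("/")
    match = re.match(r"/r/([^/]+)/comments/([a-zA-Z0-9]+)", path)
    if match:
        return match.group(1), match.group(2)
    match = re.match(r"/comments/([a-zA-Z0-9]+)", path)
    if match:
        return None, match.group(1)
    return None, None

print(parse_reddit_url("https://www.reddit.com/r/programming/comments/1abc23x/some_title/"))
# -> ('programming', '1abc23x')
print(parse_reddit_url("https://www.reddit.com/comments/1abc23x"))
# -> (None, '1abc23x')
print(parse_reddit_url("https://www.reddit.com/user/someone"))
# -> (None, None)
```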
Reddit JSON API
import requests

HEADERS = {
    "User-Agent": "ApifyRedditScraper/1.0 (by u/apify_dev)",
    "Accept": "application/json",
}

def get_post_json(subreddit, post_id):
    """Fetch a post plus its comment tree from Reddit's public .json endpoint."""
    if subreddit:
        url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"
    else:
        url = f"https://www.reddit.com/comments/{post_id}.json"
    resp = requests.get(url, headers=HEADERS, timeout=20)
    if resp.status_code == 200:
        return resp.json()
    return None
Reddit's .json endpoint returns a two-element array: post metadata at [0] and the comment tree at [1].
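To make that shape concrete, here is a minimal synthetic payload that mirrors the structure (real responses carry many more fields). One quirk worth knowing: when a comment has no replies, Reddit returns an empty string "" instead of an object, which is why the traversal code below checks isinstance(replies, dict):

```python
# A trimmed-down stand-in for the two-element array returned by the .json endpoint.
sample = [
    {  # [0]: a Listing whose single child is the post itself (kind "t3")
        "kind": "Listing",
        "data": {"children": [
            {"kind": "t3", "data": {"title": "Example post", "score": 42, "num_comments": 1}}
        ]},
    },
    {  # [1]: a Listing of top-level comments (kind "t1")
        "kind": "Listing",
        "data": {"children": [
            {"kind": "t1", "data": {"author": "alice", "body": "First!", "replies": ""}}
        ]},
    },
]

post = sample[0]["data"]["children"][0]["data"]
top_comments = sample[1]["data"]["children"]
print(post["title"])                       # Example post
print(top_comments[0]["data"]["author"])   # alice
```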
Recursive Comment Extraction
def extract_comment(comment_data, depth=0):
    """Flatten one "t1" comment node into a plain dict; returns None for deleted or empty comments."""
    data = comment_data.get("data", {})
    author = data.get("author", "[deleted]")
    body = data.get("body", "")
    if author == "[deleted]" or not body:
        return None
    return {
        "author": author,
        "body": body[:2000],  # truncate very long comments
        "score": data.get("score", 0),
        "upvotes": data.get("ups", 0),
        "downvotes": data.get("downs", 0),  # Reddit's API reports "downs" as 0 nowadays
        "gilded": data.get("gilded", 0),
        "timestamp_utc": data.get("created_utc", 0),
        "depth": depth,
        "comment_id": data.get("id", ""),
        "permalink": f"https://www.reddit.com{data.get('permalink', '')}",
    }
Comment Tree Traversal
def extract_comments_listing(listing_data, max_comments=100, max_replies=5):
    """Walk the comment tree depth-first, collecting up to max_comments flattened comments."""
    comments = []
    count = 0

    def traverse(node, depth=0):
        nonlocal count
        if count >= max_comments:
            return
        kind = node.get("kind", "")
        data = node.get("data", {})
        if kind == "t1":  # a comment
            comment = extract_comment(node, depth=depth)
            if comment:
                comments.append(comment)
                count += 1
            replies = data.get("replies", {})
            # Reddit sends "" instead of an object when there are no replies
            if isinstance(replies, dict):
                for reply in replies.get("data", {}).get("children", [])[:max_replies]:
                    traverse(reply, depth=depth + 1)
        elif kind == "Listing":  # a container of children ("more" stubs are skipped)
            for child in data.get("children", []):
                traverse(child, depth=depth)

    for node in listing_data:
        traverse(node)
    return comments
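Because the traversal returns a flat list in which each dict carries a depth field, rebuilding a readable thread view is just indentation. A small sketch (render_thread is an illustrative helper, and the sample data is made up):

```python
def render_thread(comments, indent="    "):
    """Turn a flat comment list (dicts with "depth", "author", "body") into indented text."""
    lines = []
    for c in comments:
        lines.append(f"{indent * c['depth']}{c['author']}: {c['body']}")
    return "\n".join(lines)

flat = [
    {"depth": 0, "author": "alice", "body": "Great post."},
    {"depth": 1, "author": "bob", "body": "Agreed."},
    {"depth": 2, "author": "carol", "body": "Same here."},
    {"depth": 0, "author": "dave", "body": "Not convinced."},
]
print(render_thread(flat))
```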
Post Discovery via PullPush (the Pushshift Successor)
def search_reddit_posts(query, subreddit="", limit=5):
    """Find top-scoring posts matching a query via the PullPush API (a Pushshift drop-in)."""
    url = "https://api.pullpush.io/reddit/search/submission/"
    params = {"q": query, "sort": "desc", "sort_type": "score", "size": limit}
    if subreddit:
        params["subreddit"] = subreddit
    resp = requests.get(url, params=params, timeout=15)
    if resp.status_code == 200:
        return [
            {
                "title": p.get("title", ""),
                "post_id": p.get("id", ""),
                "subreddit": p.get("subreddit", ""),
                "score": p.get("score", 0),
            }
            for p in resp.json().get("data", [])
        ]
    return []
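The main.py glue code isn't shown above; one way to wire the two layers together is to inject the HTTP helpers as callables so the flow can be exercised without network access. This is a sketch under assumptions: run_scraper and the "postId" input key are names I'm inventing here, not part of the Apify SDK; in the real actor you would pass get_post_json and search_reddit_posts:

```python
def run_scraper(actor_input, fetch_json, search_posts):
    """Tie the layers together: discover posts, then pull each post's payload.

    fetch_json(subreddit, post_id) -> parsed .json payload (or None)
    search_posts(query, subreddit) -> list of {"post_id": ..., "subreddit": ...}
    """
    results = []
    if actor_input.get("mode") == "search":
        posts = search_posts(actor_input["query"], actor_input.get("subreddit", ""))
    else:
        # url mode: assume the caller already parsed the URL into subreddit + post id
        posts = [{"subreddit": actor_input.get("subreddit"), "post_id": actor_input["postId"]}]
    for post in posts:
        payload = fetch_json(post.get("subreddit"), post["post_id"])
        if payload:
            results.append({"post_id": post["post_id"], "payload": payload})
    return results

# Fake fetchers keep the example self-contained and offline.
fake_fetch = lambda sub, pid: {"post": pid, "comments": []}
fake_search = lambda q, sub: [{"post_id": "abc", "subreddit": "python"}]

print(run_scraper({"mode": "search", "query": "apify"}, fake_fetch, fake_search))
```

Injecting the fetchers also makes retry and rate-limit policies easy to swap in later without touching the orchestration logic.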
Dockerfile
FROM python:3.11-slim
WORKDIR /actor
RUN apt-get update && apt-get install -y wget curl && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir requests beautifulsoup4 lxml
COPY src/ ./src/
CMD ["python3", "src/main.py"]
Deploy to Apify
npm install -g apify-cli
apify login
cd reddit-comment-scraper
apify push
Live at: https://apify.com/yawning_pit/reddit-comment-scraper
Example Output
{
  "mode": "url",
  "items": [
    {
      "type": "post",
      "title": "LLMs are killing my ability to code. I feel like a fraud.",
      "author": "throwaway_dev_42",
      "subreddit": "r/programming",
      "score": 4281,
      "comments": [
        {
          "author": "senior_eng_10yrs",
          "body": "This is a natural part of skill evolution...",
          "score": 1247,
          "depth": 0,
          "gilded": 2
        }
      ],
      "comments_count": 100
    }
  ]
}
Limitations
- Deleted/Removed Comments: Skipped automatically
- Rate Limiting: Space out large crawls
- NSFW Content: Set over_18: True in the input
- Pushshift/PullPush Stability: Fall back to Reddit's native search if the service is down
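For the rate-limiting point, a simple exponential backoff between requests goes a long way. The helpers below are illustrative, not part of requests; the sleep function is injectable so the schedule can be tested without actually waiting:

```python
import time

def backoff_delays(attempts, base=2.0, cap=60.0):
    """Exponential backoff schedule: base, base*2, base*4, ... capped at `cap` seconds."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

def retry(fn, attempts=4, sleep=time.sleep):
    """Call fn() until it returns a non-None value, sleeping between failures."""
    for delay in backoff_delays(attempts):
        result = fn()
        if result is not None:
            return result
        sleep(delay)
    return None

print(backoff_delays(4))  # [2.0, 4.0, 8.0, 16.0]
```

Wrapping get_post_json calls in retry(lambda: get_post_json(sub, pid)) would then absorb transient 429 responses, since that function already returns None on non-200 status codes.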
Next Steps
- Add GPT-powered sentiment scoring per comment
- Multi-subreddit simultaneous search
- Time-series engagement tracking
- CSV/Google Sheets export integration
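The CSV export item can be prototyped with the standard library alone. comments_to_csv is an illustrative helper; the column names follow the comment dicts produced by extract_comment above:

```python
import csv
import io

def comments_to_csv(comments, fields=("author", "body", "score", "depth")):
    """Serialize a list of comment dicts to CSV text; missing keys become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fields), extrasaction="ignore")
    writer.writeheader()
    for c in comments:
        writer.writerow({k: c.get(k, "") for k in fields})
    return buf.getvalue()

sample = [{"author": "alice", "body": "Nice", "score": 10, "depth": 0}]
print(comments_to_csv(sample))
```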
Full source on GitHub: xiaclaw2018/devnest
Built with Apify, Python, and the public Reddit JSON API. No authentication required.