Why I stopped using Puppeteer for Reddit scraping
If you've ever tried to scrape Reddit comments programmatically, you've hit one of two walls:
- Reddit's official API — rate-limited, requires app registration, and got far more restrictive in 2023.
- Headless browser scrapers — slow, resource-hungry, and break every time Reddit updates its Shreddit DOM.
I maintained a Puppeteer-based Reddit scraper for over a year and was constantly patching it. So I rebuilt it from scratch with a completely different architecture.
The architecture: bridge pattern
Instead of spinning up a browser on every run, the Actor acts as an authenticated bridge to a dedicated scraping cluster. The Actor itself is lightweight — it:
- Receives the Reddit post URL as input
- Forwards the request securely to the cluster
- Receives pre-scraped, structured data
- Pushes it to Apify's dataset storage
Results:
- Sub-second response times vs. 5–15s for browser scrapers
- No DOM dependency — Reddit can update its frontend without breaking anything
- Built-in rotating proxies and fingerprint emulation at the cluster level
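The four-step flow above can be sketched as a thin handler. `fetch_from_cluster` is a hypothetical stand-in for the private cluster call, injected as a callable so the bridge itself stays trivial:

```python
from typing import Callable

def bridge_scrape(post_url: str, fetch_from_cluster: Callable[[str], dict]) -> dict:
    """Thin bridge: validate input, delegate scraping, return structured data."""
    if not post_url.startswith("https://www.reddit.com/"):
        raise ValueError(f"Not a Reddit URL: {post_url}")
    # The cluster does the heavy lifting (proxies, fingerprints, parsing).
    payload = fetch_from_cluster(post_url)
    # The Actor only relays the pre-scraped, structured result.
    return {
        "post_url": post_url,
        "data": payload.get("comments", []),
        "success": bool(payload.get("comments")),
    }
```

In the real Actor the returned dict is pushed to dataset storage rather than returned to the caller; the sketch only shows why there's no DOM dependency on the Actor side.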
The output structure
```json
{
  "post_url": "https://www.reddit.com/r/technology/comments/...",
  "data": [
    {
      "id": "odffoln",
      "author": "redditor_123",
      "body": "This is a comment from the Reddit community.",
      "score": 15,
      "created_utc": 1774923322,
      "permalink": "/r/subreddit/comments/.../id/"
    }
  ],
  "success": true,
  "processed_at": "2026-05-05T05:27:45.136Z"
}
```
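Because the shape is flat, consuming it downstream takes only a small dataclass; the field names here match the sample above:

```python
from dataclasses import dataclass

@dataclass
class Comment:
    id: str
    author: str
    body: str
    score: int
    created_utc: int
    permalink: str

def parse_output(result: dict) -> list[Comment]:
    """Turn the Actor's output dict into typed Comment records."""
    return [Comment(**c) for c in result.get("data", [])]
```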
The output is deliberately flat. No nested reply trees, no HTML markup — just clean text, ready for:
- RAG pipelines with LangChain or LlamaIndex
- Sentiment analysis with zero preprocessing
- LLM fine-tuning datasets from niche subreddits
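For the RAG case, the flat comments map straight to framework-agnostic documents. A minimal sketch (the `min_score` filter is my own addition, not part of the Actor's output):

```python
def to_documents(comments: list[dict], min_score: int = 1) -> list[dict]:
    """Map flat comments to {"text", "metadata"} documents.

    Most RAG frameworks (LangChain, LlamaIndex) accept this shape
    with a one-line adapter; no HTML stripping is needed.
    """
    return [
        {
            "text": c["body"],
            "metadata": {"author": c["author"], "score": c["score"]},
        }
        for c in comments
        if c["score"] >= min_score  # drop heavily downvoted noise
    ]
```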
How to use it
The input couldn't be simpler:
```json
{
  "post_url": "https://www.reddit.com/r/MachineLearning/comments/..."
}
```
Via the Apify Python client:

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("akash9078/reddit-comment-scraper").call(
    run_input={"post_url": "https://www.reddit.com/r/MachineLearning/comments/..."}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["author"], item["score"], item["body"])
```
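Once the items are in memory, post-processing is plain Python. For example, surfacing the highest-scoring comments first:

```python
def top_comments(items: list[dict], n: int = 5) -> list[dict]:
    """Return the n highest-scoring comments."""
    return sorted(items, key=lambda c: c["score"], reverse=True)[:n]
```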
Pricing model
Uses Apify's Pay-Per-Event (PPE) model:
- Small base fee per successfully processed post
- Per-comment fee for each comment returned
- Zero charge if zero results are returned
Much fairer than flat per-run pricing for posts with varying comment volumes.
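As a back-of-the-envelope model of that pricing (the rates below are placeholders for illustration, not the Actor's actual prices):

```python
def estimate_cost(num_comments: int,
                  base_fee: float = 0.001,
                  per_comment: float = 0.0001) -> float:
    """PPE cost: base fee per processed post plus a per-comment fee.

    Zero results cost nothing, matching the rules above.
    Rates are illustrative placeholders.
    """
    if num_comments == 0:
        return 0.0
    return base_fee + num_comments * per_comment
```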
Integration use cases
n8n / Make / Zapier: Webhook support lets you trigger it from any automation workflow and pipe data to Airtable, Notion, Google Sheets, or a database.
AI agents: Fully compatible with Apify's MCP server — can be called directly by Claude, GPT-based agents, or any LLM with tool use.
Sentiment pipeline: Feed the body field from each comment into any sentiment model. No HTML stripping. It's already clean text.
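To make the "zero preprocessing" point concrete, here is a toy lexicon scorer standing in for a real model; swap in any classifier that accepts plain text:

```python
import string

POSITIVE = {"great", "love", "good", "amazing", "helpful"}
NEGATIVE = {"bad", "hate", "broken", "awful", "useless"}

def toy_sentiment(body: str) -> int:
    """Naive lexicon score: +1 per positive word, -1 per negative word.

    A stand-in for a real sentiment model; the point is that `body`
    is already clean text and needs no HTML stripping before scoring.
    """
    words = [w.strip(string.punctuation) for w in body.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
```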
Try it
Live on Apify Store: https://apify.com/akash9078/reddit-comment-scraper
Happy to discuss the architecture, edge cases, or specific use cases in the comments.