Akash Kumar Naik

How I built a sub-second Reddit comment scraper (without Puppeteer)

Why I stopped using Puppeteer for Reddit scraping

If you've ever tried to scrape Reddit comments programmatically, you've hit one of two walls:

  1. Reddit's official API — rate-limited, requires app registration, and got far more restrictive in 2023.
  2. Headless browser scrapers — slow, resource-hungry, and break every time Reddit updates its Shreddit DOM.

I maintained a Puppeteer-based Reddit scraper for over a year and was constantly patching it. So I rebuilt it from scratch with a completely different architecture.


The architecture: bridge pattern

Instead of spinning up a browser on every run, the Actor acts as an authenticated bridge to a dedicated scraping cluster. The Actor itself is lightweight — it:

  1. Receives the Reddit post URL as input
  2. Forwards the request securely to the cluster
  3. Receives pre-scraped, structured data
  4. Pushes it to Apify's dataset storage

Results:

  • Sub-second response times vs. 5–15s for browser scrapers
  • No DOM dependency — Reddit can update its frontend without breaking anything
  • Built-in rotating proxies and fingerprint emulation at the cluster level
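The four-step flow above can be sketched in a few lines. This is a minimal illustration, not the Actor's actual code — the cluster endpoint, header names, and helper function are all hypothetical:

```python
import json
import re
from urllib.parse import urlparse

# Hypothetical cluster endpoint -- illustrative only, not the real service URL.
CLUSTER_ENDPOINT = "https://scraper-cluster.example.com/v1/reddit/comments"

def build_bridge_request(post_url: str, api_key: str) -> dict:
    """Validate the input URL and build the request the Actor would
    forward to the scraping cluster (steps 1-2 above)."""
    parsed = urlparse(post_url)
    # Simple host check; a real implementation would also accept
    # subdomains like old.reddit.com.
    if not re.match(r"(www\.)?reddit\.com$", parsed.netloc):
        raise ValueError(f"not a reddit.com URL: {post_url}")
    if "/comments/" not in parsed.path:
        raise ValueError("expected a post URL containing /comments/")
    return {
        "url": CLUSTER_ENDPOINT,
        "headers": {"Authorization": f"Bearer {api_key}"},
        "body": json.dumps({"post_url": post_url}),
    }
```

Because all the browser work happens on the cluster side, the Actor itself never touches a DOM — it only validates, forwards, and stores.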

The output structure

{
  "post_url": "https://www.reddit.com/r/technology/comments/...",
  "data": [
    {
      "id": "odffoln",
      "author": "redditor_123",
      "body": "This is a comment from the Reddit community.",
      "score": 15,
      "created_utc": 1774923322,
      "permalink": "/r/subreddit/comments/.../id/"
    }
  ],
  "success": true,
  "processed_at": "2026-05-05T05:27:45.136Z"
}

The output is deliberately flat. No nested reply trees, no HTML markup — just clean text, ready for:

  • RAG pipelines with LangChain or LlamaIndex
  • Sentiment analysis with zero preprocessing
  • LLM fine-tuning datasets from niche subreddits
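To show what "zero preprocessing" means in practice, here is a small sketch that turns the flat comment records into text chunks ready for a RAG ingest step. The record shape is taken from the sample output above; the function name is mine, not part of the Actor:

```python
def comments_to_chunks(output: dict) -> list[str]:
    """Format each comment as 'author (score): body' -- no HTML stripping
    or reply-tree walking needed, because the data is already flat."""
    return [
        f'{c["author"]} ({c["score"]}): {c["body"]}'
        for c in output.get("data", [])
    ]

sample = {
    "data": [
        {"author": "redditor_123", "score": 15,
         "body": "This is a comment from the Reddit community."}
    ]
}
print(comments_to_chunks(sample)[0])
# redditor_123 (15): This is a comment from the Reddit community.
```

Each chunk can go straight into a LangChain or LlamaIndex document loader.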

How to use it

The input couldn't be simpler:

{
  "post_url": "https://www.reddit.com/r/MachineLearning/comments/..."
}

Via the Apify Python SDK:

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("akash9078/reddit-comment-scraper").call(
    run_input={"post_url": "https://www.reddit.com/r/MachineLearning/comments/..."}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["author"], item["score"], item["body"])

Pricing model

Uses Apify's Pay-Per-Event (PPE) model:

  • Small base fee per successfully processed post
  • Per-comment fee for each comment returned
  • Zero charge if zero results are returned

This is much fairer than flat per-run pricing, since comment volume varies wildly from post to post.
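The cost model is easy to reason about. The fee values below are hypothetical placeholders (the real rates are listed on the Actor's Apify Store page); the point is the shape of the calculation:

```python
# Hypothetical fee values -- check the Apify Store page for actual rates.
BASE_FEE_PER_POST = 0.002   # charged once per successfully processed post
FEE_PER_COMMENT = 0.0001    # charged for each comment returned

def estimate_cost(comment_count: int) -> float:
    """Zero results -> zero charge; otherwise base fee plus per-comment fees."""
    if comment_count == 0:
        return 0.0
    return BASE_FEE_PER_POST + FEE_PER_COMMENT * comment_count

print(estimate_cost(0))     # no results, no charge
print(estimate_cost(500))   # base fee + 500 per-comment fees
```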


Integration use cases

n8n / Make / Zapier: Webhook support lets you trigger it from any automation workflow and pipe data to Airtable, Notion, Google Sheets, or a database.

AI agents: Fully compatible with Apify's MCP server — can be called directly by Claude, GPT-based agents, or any LLM with tool use.

Sentiment pipeline: Feed the body field from each comment into any sentiment model. No HTML stripping. It's already clean text.
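As a toy illustration of that last point, here is a tiny lexicon scorer standing in for a real sentiment model (VADER, a transformer, whatever you prefer). Everything here is illustrative; the only claim from the post is that the comment body needs no cleanup before scoring:

```python
import re

# Toy word lists -- a stand-in for a real sentiment model's vocabulary.
POSITIVE = {"great", "love", "helpful", "clean"}
NEGATIVE = {"broken", "slow", "hate", "spam"}

def toy_sentiment(body: str) -> int:
    """+1 per positive word, -1 per negative word; 0 is neutral.
    The body field is plain text, so tokenizing is all the prep needed."""
    words = re.findall(r"[a-z']+", body.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(toy_sentiment("This scraper is great, love it"))   # 2
print(toy_sentiment("Too slow and broken for me"))       # -2
```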


Try it

Live on Apify Store: https://apify.com/akash9078/reddit-comment-scraper

Happy to discuss the architecture, edge cases, or specific use cases in the comments.
