Akash Kumar Naik

How I built a sub-second Reddit comment scraper (without Puppeteer)

Why I stopped using Puppeteer for Reddit scraping

If you've ever tried to scrape Reddit comments programmatically, you've hit one of two walls:

  1. Reddit's official API — rate-limited, requires app registration, and got far more restrictive in 2023.
  2. Headless browser scrapers — slow, resource-hungry, and break every time Reddit updates its Shreddit DOM.

I maintained a Puppeteer-based Reddit scraper for over a year and was constantly patching it. So I rebuilt it from scratch with a completely different architecture.


The architecture: bridge pattern

Instead of spinning up a browser on every run, the Actor acts as an authenticated bridge to a dedicated scraping cluster. The Actor itself is lightweight — it:

  1. Receives the Reddit post URL as input
  2. Forwards the request securely to the cluster
  3. Receives pre-scraped, structured data
  4. Pushes it to Apify's dataset storage

Results:

  • Sub-second response times vs. 5–15s for browser scrapers
  • No DOM dependency — Reddit can update its frontend without breaking anything
  • Built-in rotating proxies and fingerprint emulation at the cluster level
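The four-step flow above can be sketched in a few lines. This is a minimal illustration, not the Actor's actual code — the cluster endpoint, header names, and helper function are all hypothetical:

```python
import json
import re
from urllib.parse import urlparse

# Hypothetical cluster endpoint -- illustrative only, not the real service URL.
CLUSTER_ENDPOINT = "https://scraper-cluster.example.com/v1/reddit/comments"

def build_bridge_request(post_url: str, api_key: str) -> dict:
    """Validate the input URL and build the request the Actor would
    forward to the scraping cluster (steps 1-2 above)."""
    parsed = urlparse(post_url)
    # Simple host check; a real implementation would also accept
    # subdomains like old.reddit.com.
    if not re.match(r"(www\.)?reddit\.com$", parsed.netloc):
        raise ValueError(f"not a reddit.com URL: {post_url}")
    if "/comments/" not in parsed.path:
        raise ValueError("expected a post URL containing /comments/")
    return {
        "url": CLUSTER_ENDPOINT,
        "headers": {"Authorization": f"Bearer {api_key}"},
        "body": json.dumps({"post_url": post_url}),
    }
```

Because all the browser work happens on the cluster side, the Actor itself never touches a DOM — it only validates, forwards, and stores.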

The output structure

{
  "post_url": "https://www.reddit.com/r/technology/comments/...",
  "data": [
    {
      "id": "odffoln",
      "author": "redditor_123",
      "body": "This is a comment from the Reddit community.",
      "score": 15,
      "created_utc": 1774923322,
      "permalink": "/r/subreddit/comments/.../id/"
    }
  ],
  "success": true,
  "processed_at": "2026-05-05T05:27:45.136Z"
}

The output is deliberately flat. No nested reply trees, no HTML markup — just clean text, ready for:

  • RAG pipelines with LangChain or LlamaIndex
  • Sentiment analysis with zero preprocessing
  • LLM fine-tuning datasets from niche subreddits
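To show what "zero preprocessing" means in practice, here is a small sketch that turns the flat comment records into text chunks ready for a RAG ingest step. The record shape is taken from the sample output above; the function name is mine, not part of the Actor:

```python
def comments_to_chunks(output: dict) -> list[str]:
    """Format each comment as 'author (score): body' -- no HTML stripping
    or reply-tree walking needed, because the data is already flat."""
    return [
        f'{c["author"]} ({c["score"]}): {c["body"]}'
        for c in output.get("data", [])
    ]

sample = {
    "data": [
        {"author": "redditor_123", "score": 15,
         "body": "This is a comment from the Reddit community."}
    ]
}
print(comments_to_chunks(sample)[0])
# redditor_123 (15): This is a comment from the Reddit community.
```

Each chunk can go straight into a LangChain or LlamaIndex document loader.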

How to use it

The input couldn't be simpler:

{
  "post_url": "https://www.reddit.com/r/MachineLearning/comments/..."
}

Via the Apify Python SDK:

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("akash9078/reddit-comment-scraper").call(
    run_input={"post_url": "https://www.reddit.com/r/MachineLearning/comments/..."}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["author"], item["score"], item["body"])

Pricing model

Uses Apify's Pay-Per-Event (PPE) model:

  • Small base fee per successfully processed post
  • Per-comment fee for each comment returned
  • Zero charge if zero results are returned

This is much fairer than flat per-run pricing, since comment volume varies wildly from post to post.
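The cost model is easy to reason about. The fee values below are hypothetical placeholders (the real rates are listed on the Actor's Apify Store page); the point is the shape of the calculation:

```python
# Hypothetical fee values -- check the Apify Store page for actual rates.
BASE_FEE_PER_POST = 0.002   # charged once per successfully processed post
FEE_PER_COMMENT = 0.0001    # charged for each comment returned

def estimate_cost(comment_count: int) -> float:
    """Zero results -> zero charge; otherwise base fee plus per-comment fees."""
    if comment_count == 0:
        return 0.0
    return BASE_FEE_PER_POST + FEE_PER_COMMENT * comment_count

print(estimate_cost(0))     # no results, no charge
print(estimate_cost(500))   # base fee + 500 per-comment fees
```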


Integration use cases

n8n / Make / Zapier: Webhook support lets you trigger it from any automation workflow and pipe data to Airtable, Notion, Google Sheets, or a database.

AI agents: Fully compatible with Apify's MCP server — can be called directly by Claude, GPT-based agents, or any LLM with tool use.

Sentiment pipeline: Feed the body field from each comment into any sentiment model. No HTML stripping. It's already clean text.
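As a toy illustration of that last point, here is a tiny lexicon scorer standing in for a real sentiment model (VADER, a transformer, whatever you prefer). Everything here is illustrative; the only claim from the post is that the comment body needs no cleanup before scoring:

```python
import re

# Toy word lists -- a stand-in for a real sentiment model's vocabulary.
POSITIVE = {"great", "love", "helpful", "clean"}
NEGATIVE = {"broken", "slow", "hate", "spam"}

def toy_sentiment(body: str) -> int:
    """+1 per positive word, -1 per negative word; 0 is neutral.
    The body field is plain text, so tokenizing is all the prep needed."""
    words = re.findall(r"[a-z']+", body.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(toy_sentiment("This scraper is great, love it"))   # 2
print(toy_sentiment("Too slow and broken for me"))       # -2
```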


Try it

Live on Apify Store: https://apify.com/akash9078/reddit-comment-scraper

Happy to discuss the architecture, edge cases, or specific use cases in the comments.
