Building the Ultimate Reddit Scraper: A Full-Featured, API-Free Data Collection Suite


December 2024 | By Sanjeev Kumar


TL;DR
I built a complete Reddit scraper suite that requires zero API keys. It comes with a beautiful Streamlit dashboard, REST API for integration with tools like Grafana and Metabase, plugin system for post-processing, scheduled scraping, notifications, and much more. Best of all—it’s completely open source.
🔗 GitHub: reddit-universal-scraper


The Problem
If you’ve ever tried to scrape Reddit data for analysis, research, or just personal projects, you know the pain:

  1. Reddit’s API is heavily rate-limited (especially after the 2023 API changes)
  2. API keys require approval and are increasingly restricted
  3. Existing scrapers are often single-purpose - scrape posts OR comments, not both
  4. No easy way to visualize or analyze the data after scraping
  5. Running scrapes manually is tedious - you want automation

I decided to solve all of these problems at once.


The Solution: Universal Reddit Scraper Suite
After weeks of development, I created a full-featured scraper that:
Feature What It Does
📊 Full Scraping Posts, comments, images, videos, galleries—everything
🚫 No API Keys Uses Reddit’s public JSON endpoints and mirrors
📈 Web Dashboard Beautiful 7-tab Streamlit UI for analysis
🚀 REST API Connect Metabase, Grafana, DuckDB, and more
🔌 Plugin System Extensible post-processing (sentiment analysis, deduplication, keywords)
📅 Scheduled Scraping Cron-style automation
📧 Notifications Discord & Telegram alerts when scrapes complete
🐳 Docker Ready One command to deploy anywhere


Architecture Deep Dive
How It Works Without API Keys
The secret sauce is in the approach. Instead of using Reddit’s official (and restricted) API, I leverage:
  1. Reddit’s public JSON endpoints: Every Reddit page has a .json suffix that returns structured data
  2. Multiple mirror fallbacks: When one source is rate-limited, the scraper automatically rotates through alternatives like Redlib instances
  3. Smart rate limiting: Built-in delays and cool-down periods to stay under the radar

MIRRORS = [
    "https://old.reddit.com",
    "https://redlib.catsarch.com",
    "https://redlib.vsls.cz",
    "https://r.nf",
    "https://libreddit.northboot.xyz",
    "https://redlib.tux.pizza"
]

When one source fails, it automatically tries the next. No manual intervention needed.
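
To make the rotation concrete, here is a minimal sketch of how a fetch-with-fallback could be written around that MIRRORS list. It is illustrative rather than the project’s actual code: the fetch_subreddit_json name is hypothetical, and it assumes every mirror serves the same /r/&lt;subreddit&gt;/new.json shape, which may not hold for every instance.

import time
import requests

def fetch_subreddit_json(subreddit, limit=100):
    """Try each mirror in turn until one returns usable JSON."""
    for mirror in MIRRORS:  # MIRRORS is the list shown above
        url = f"{mirror}/r/{subreddit}/new.json"
        try:
            resp = requests.get(
                url,
                params={"limit": limit},
                headers={"User-Agent": "reddit-universal-scraper"},
                timeout=10,
            )
            if resp.status_code == 200:
                return resp.json()
        except (requests.RequestException, ValueError):
            pass  # bad response or rate limit: fall through to the next mirror
        time.sleep(3)  # polite cooldown between attempts
    raise RuntimeError("All mirrors failed or were rate-limited")
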
The Core Scraping Engine
The scraper operates in three modes:
  1. Full Mode - The complete package
python main.py python --mode full --limit 100
This scrapes posts, downloads all media (images, videos, galleries), and fetches comments with their full thread hierarchy. (Here python is the target subreddit.)
  2. History Mode - Fast metadata-only
python main.py python --mode history --limit 500
Perfect for quickly building a dataset of post metadata without the overhead of media downloads.
  3. Monitor Mode - Live watching
python main.py python --mode monitor
Continuously checks for new posts every 5 minutes. Ideal for tracking breaking news or trending discussions.
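
As a rough mental model, monitor mode boils down to a polling loop that remembers which post IDs it has already seen. A minimal sketch, reusing the hypothetical fetch_subreddit_json helper from the mirror example above (the real implementation will differ):

import time

def monitor(subreddit, interval=300):
    """Poll a subreddit every `interval` seconds and yield posts not seen before."""
    seen_ids = set()
    while True:
        listing = fetch_subreddit_json(subreddit, limit=25)
        for child in listing["data"]["children"]:
            post = child["data"]
            if post["id"] not in seen_ids:
                seen_ids.add(post["id"])
                yield post  # hand off to the normal scraping pipeline
        time.sleep(interval)  # 5-minute default, matching --mode monitor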


The Dashboard Experience
One of the standout features is the 7-tab Streamlit dashboard that makes data exploration a joy:
📊 Overview Tab
At a glance, see:
• Total posts and comments
• Cumulative score across all posts
• Media post breakdown
• Posts-over-time chart
• Top 10 posts by score
📈 Analytics Tab
This is where it gets interesting:
• Sentiment Analysis: Run VADER-based sentiment scoring on your entire dataset
• Keyword Cloud: See the most frequently used terms
• Best Posting Times: Data-driven insights on when posts get the most engagement (see the sketch below)
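
The dashboard computes posting times for you, but as an example, the same breakdown is a few lines of pandas over the exported posts.csv. This is not the dashboard’s actual code; the file path is hypothetical and it assumes created_utc is stored as a Unix timestamp.

import pandas as pd

# Hypothetical path -- substitute the folder your scrape produced.
posts = pd.read_csv("data/r_python/posts.csv")
posts["created"] = pd.to_datetime(posts["created_utc"], unit="s")

# Average score by hour of day (UTC): higher = historically better time to post.
by_hour = (
    posts.groupby(posts["created"].dt.hour)["score"]
    .mean()
    .sort_values(ascending=False)
)
print(by_hour.head())
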
🔍 Search Tab
Full-text search across all scraped data with filters for:
• Minimum score
• Post type (text, image, video, gallery, link)
• Author
• Custom sorting
💬 Comments Analysis
• View top-scoring comments
• See who the most active commenters are
• Track comment patterns over time
⚙️ Scraper Controls
Start new scrapes right from the dashboard! Configure:
• Target subreddit/user
• Post limits
• Mode (full/history)
• Media and comment toggles
📋 Job History
Full observability into every scrape job:
• Status tracking (running, completed, failed)
• Duration metrics
• Post/comment/media counts
• Error logging
🔌 Integrations
Pre-configured instructions for connecting:
• Metabase
• Grafana
• DreamFactory
• DuckDB


The Plugin Architecture
I designed a plugin system to allow extensible post-processing. The architecture is simple but powerful:

class Plugin:
    """Base class for all plugins."""
    name = "base"
    description = "Base plugin"
    enabled = True

    def process_posts(self, posts):
        return posts

    def process_comments(self, comments):
        return comments

Built-in Plugins

  1. Sentiment Tagger - Analyzes the emotional tone of every post and comment using VADER sentiment analysis:
class SentimentTagger(Plugin):
    name = "sentiment_tagger"
    description = "Adds sentiment scores and labels to posts"

    def process_posts(self, posts):
        for post in posts:
            text = f"{post.get('title', '')} {post.get('selftext', '')}"
            score, label = analyze_sentiment(text)
            post['sentiment_score'] = score
            post['sentiment_label'] = label
        return posts
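
The analyze_sentiment helper isn’t shown above. A minimal version built on the vaderSentiment package might look like this; the ±0.05 thresholds are the conventional VADER cut-offs, not necessarily what the project uses.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

_analyzer = SentimentIntensityAnalyzer()

def analyze_sentiment(text):
    """Return a (compound score, label) pair for a piece of text."""
    score = _analyzer.polarity_scores(text)["compound"]
    if score >= 0.05:
        label = "positive"
    elif score <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    return score, label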

  2. Deduplicator - Removes duplicate posts that may appear across multiple scraping sessions.

  3. Keyword Extractor - Pulls out the most significant terms from your scraped content for trend analysis.
Creating Your Own Plugin
Drop a new Python file in the plugins/ directory:

from plugins import Plugin

class MyCustomPlugin(Plugin):
    name = "my_plugin"
    description = "Does something cool"
    enabled = True

    def process_posts(self, posts):
        # Your logic here
        return posts

Enable plugins during scraping:
python main.py python --mode full --plugins
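
For the --plugins flag to pick up a dropped-in file, the scraper has to discover Plugin subclasses in that directory. The repo’s loader may work differently, but a typical discovery routine looks roughly like this (load_plugins is a hypothetical name):

import importlib
import pkgutil

import plugins  # the package containing the Plugin base class

def load_plugins():
    """Import every module in plugins/ and instantiate enabled Plugin subclasses."""
    found = []
    for info in pkgutil.iter_modules(plugins.__path__):
        module = importlib.import_module(f"plugins.{info.name}")
        for obj in vars(module).values():
            if (isinstance(obj, type)
                    and issubclass(obj, plugins.Plugin)
                    and obj is not plugins.Plugin
                    and obj.enabled):
                found.append(obj())
    return found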


REST API for External Integrations
The REST API opens up the scraper to a whole ecosystem of tools:
python main.py --api

API at http://localhost:8000

Docs at http://localhost:8000/docs

Key Endpoints
Endpoint Description
GET /posts List posts with filters (subreddit, limit, offset)
GET /comments List comments
GET /subreddits All scraped subreddits
GET /jobs Job history
GET /query?sql=... Raw SQL queries for power users
GET /grafana/query Grafana-compatible time-series data
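
For example, pulling recently scraped posts into another Python process takes a couple of lines with requests. The parameter names follow the filters listed above; the exact response shape (a list of post objects) is an assumption, so check /docs on your instance.

import requests

# Fetch the 25 most recent scraped posts for r/python from the local API.
resp = requests.get(
    "http://localhost:8000/posts",
    params={"subreddit": "python", "limit": 25, "offset": 0},
    timeout=10,
)
resp.raise_for_status()
for post in resp.json():
    print(post["title"], post["score"])
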
Real-World Integration: Grafana Dashboard

  1. Install the “JSON API” or “Infinity” plugin in Grafana
  2. Add datasource pointing to http://localhost:8000
  3. Use the /grafana/query endpoint for time-series panels

SELECT date(created_utc) as time, COUNT(*) as posts
FROM posts
GROUP BY date(created_utc)

Now you have a real-time dashboard tracking Reddit activity!


Scheduled Scraping & Notifications
Automation Made Easy
Set up recurring scrapes with cron-style scheduling:

# Scrape every 60 minutes
python main.py --schedule delhi --every 60

With custom options

python main.py --schedule delhi --every 30 --mode full --limit 50
Get Notified
Configure Discord or Telegram alerts when scrapes complete:

Environment variables

export DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/..."
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_CHAT_ID="987654321"
Now you get notified with scrape summaries directly in your preferred platform.
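
Under the hood, a Discord notification is just an HTTP POST to that webhook URL. A minimal sketch of what such a notifier might do (the payload and function name are illustrative, not the project’s exact code):

import os
import requests

def notify_discord(message):
    """Post a scrape summary to the configured Discord webhook, if any."""
    webhook = os.environ.get("DISCORD_WEBHOOK_URL")
    if not webhook:
        return  # notifications are optional
    requests.post(webhook, json={"content": message}, timeout=10)

notify_discord("✅ Scrape of r/python complete: 100 posts, 245 comments")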


Dry Run Mode: Test Before You Commit
One of my favorite features is dry run mode. It simulates the entire scrape without saving any data:
python main.py python --mode full --limit 50 --dry-run
Output:
🧪 DRY RUN MODE - No data will be saved
🧪 DRY RUN COMPLETE!
📊 Would scrape: 100 posts
💬 Would scrape: 245 comments
Perfect for:
• Testing your scrape configuration
• Estimating data volume before committing
• Debugging without cluttering your dataset


Docker Deployment
Quick Start

Build

docker build -t reddit-scraper .

Run a scrape

docker run -v ./data:/app/data reddit-scraper python --limit 100

Run with plugins

docker run -v ./data:/app/data reddit-scraper python --plugins
Full Stack with Docker Compose
docker-compose up -d
This spins up:
• Dashboard at http://localhost:8501
• REST API at http://localhost:8000
Deploy to Any VPS
ssh user@your-server-ip
git clone https://github.com/ksanjeev284/reddit-universal-scraper.git
cd reddit-universal-scraper
docker-compose up -d
Open the firewall:
sudo ufw allow 8000
sudo ufw allow 8501
You now have a production-ready Reddit scraping platform!


Data Export Options
CSV (Default)
All scraped data is saved as CSV files:
• data/r_&lt;subreddit&gt;/posts.csv
• data/r_&lt;subreddit&gt;/comments.csv
Parquet (Analytics-Optimized)
Export to columnar format for analytics tools:
python main.py --export-parquet python
Query directly with DuckDB:
import duckdb
duckdb.query("SELECT * FROM 'data/parquet/*.parquet'").df()
Database Maintenance

Backup

python main.py --backup

Optimize/vacuum

python main.py --vacuum

View job history

python main.py --job-history


Data Schema
Posts Table
Column Description
id Reddit post ID
title Post title
author Username
score Net upvotes
num_comments Comment count
post_type text/image/video/gallery/link
selftext Post body (for text posts)
created_utc Timestamp
permalink Reddit URL
is_nsfw NSFW flag
flair Post flair
sentiment_score -1.0 to 1.0 (with plugins)
Comments Table
Column Description
comment_id Comment ID
post_permalink Parent post URL
author Username
body Comment text
score Upvotes
depth Nesting level
is_submitter Whether author is OP
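
With these two tables, joining comments back to their parent posts is straightforward. A small pandas sketch using the columns above (the CSV paths are illustrative):

import pandas as pd

# Illustrative paths -- substitute the folder your scrape produced.
posts = pd.read_csv("data/r_python/posts.csv")
comments = pd.read_csv("data/r_python/comments.csv")

# Join each comment to its parent post via the permalink columns.
merged = comments.merge(
    posts[["permalink", "title", "post_type"]],
    left_on="post_permalink",
    right_on="permalink",
    how="left",
)
print(merged[["title", "body", "score"]].head())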


Use Cases

  1. Academic Research
• Analyze subreddit community dynamics
• Track sentiment over time during events
• Study user engagement patterns
  2. Market Research
• Monitor brand mentions
• Track product feedback
• Identify emerging trends
  3. Content Creation
• Find popular topics in your niche
• Analyze what makes posts go viral
• Discover optimal posting times
  4. Data Journalism
• Archive discussions around breaking news
• Analyze public sentiment during events
• Track narrative evolution
  5. Personal Projects
• Build a dataset for ML training
• Create Reddit-based recommendation systems
• Archive communities you care about


Performance Considerations
Respect Reddit’s Servers
The scraper includes built-in delays:
• 3 second cooldown between API requests
• 30 second wait if all mirrors fail
• Automatic mirror rotation to distribute load
Optimize Your Scrapes
• Use --mode history for faster metadata-only scrapes
• Use --no-media if you don’t need images/videos
• Use --no-comments for post-only data
Handle Large Datasets
• Parquet export for analytics queries
• SQLite database for structured storage
• Automatic deduplication to avoid bloat


What’s Next? Roadmap
I’m actively developing new features:
• ☐ Async scraping for even faster data collection
• ☐ Multi-subreddit monitoring in a single command
• ☐ Email notifications in addition to Discord/Telegram
• ☐ Cloud deployment templates (AWS, GCP, Azure)
• ☐ Web-based scraper configuration (no CLI needed)


Getting Started
Prerequisites
• Python 3.10+
• pip
Installation

# Clone the repo
git clone https://github.com/ksanjeev284/reddit-universal-scraper.git
cd reddit-universal-scraper

Install dependencies

pip install -r requirements.txt

Your first scrape

python main.py python --mode full --limit 50

Launch the dashboard

python main.py --dashboard
That’s it! You’re now scraping Reddit like a pro.


Contributing
This is an open-source project and contributions are welcome! Whether it’s:
• Bug fixes
• New plugins
• Documentation improvements
• Feature suggestions
Open an issue or submit a PR on GitHub.


If you found this useful, consider giving the project a ⭐ on GitHub!


Connect
• GitHub: @ksanjeev284
• Project: reddit-universal-scraper
