Introduction
It all started with a simple question in our Extract Data Discord community:
“Hey, I’m trying to scrape this gaming leaderboard, but I keep getting blocked. Any idea how to get around it?”
A familiar problem for anyone in web scraping: modern websites block regular scrapers with JavaScript rendering, rate limits, and IP restrictions. What began as a quick fix with Zyte API soon grew into a bigger idea.
After sharing a working demo, I asked myself:
What if this could be more than just a script?
What if it could scrape reliably, filter intelligently, and notify automatically - all while plugging into Discord?
And that’s how this project came to life!
In this article, I’ll share what I learned while building the system, from scraping and filtering data to sending real-time updates in Discord, and even triggering scrapes directly via a Discord bot. No heavy code walkthroughs, just insights you can apply to your own projects.
🧭 Overview: What We Built
At its core, the project does five simple but powerful things:
- Scrapes leaderboard data from a gaming site using Scrapy
- Bypasses anti-bot protections using Zyte API’s browser automation
- Filters players based on customizable level thresholds
- Notifies your Discord channel about new high-level players
- Runs continuously on autopilot with scheduled checks
🎯 Scraping Goal: Build a scraper that scans the game’s leaderboard using custom input filters within a defined page range, then instantly alerts our Discord community when matching high-level players are found.
Key Components
- 🕷️ Scrapy handles the crawling and parsing.
- 🛡️ Zyte API bypasses tough anti-bot protections.
- ⏱️ An automated monitoring scheduler keeps it running.
- 🤖 A Discord bot acts as the control center for commands and results.
No manual refreshing. No getting blocked. Just clean, filtered data delivered where your community hangs out.
Architecture
Project Structure
scrape_filter_notify/
├── main.py # Main CLI entry point
├── discord_bot.py # Discord bot with all commands
├── continuous_monitor.py # Automated monitoring scheduler
├── requirements.txt # Python dependencies
├── .env # Environment variables (create this)
├── .gitignore # Git ignore rules
│
└── scrape_filter_notify/ # Scrapy project
├── scrapy.cfg # Scrapy configuration ( Default )
└── scrape_filter_notify/
├── settings.py # Scrapy settings ( Modified )
├── items.py # Scrapy data models ( Modified )
├── pipelines.py # Data processing ( Modified )
├── discord_notifier.py # Discord integration ( New )
└── spiders/
└── leaderboard_spider.py # Main web scraper ( Modified )
⚙️ Getting Started: Setting Up the Spider (with Scrapy + Zyte API)
I began by setting up the scraper engine with a Scrapy spider. The gaming site in focus wasn’t friendly: it threw JavaScript rendering, rate limits, and the occasional CAPTCHA at us.
Scrapy alone couldn’t get through, so we brought in Zyte API to handle rendering, retries, and anti-bot defenses. That way, the spider could focus on what matters: pulling clean data.
🧭 New to Scrapy?
If you’re just getting started, this tutorial will walk you through setting up your first Scrapy project from scratch.
Scraping Process
Here’s the architecture for a smart and robust leaderboard_spider.py:
The Scrapy setup crawls through paginated leaderboard pages and extracts player info, with Zyte API’s smart backend helping it navigate the website’s tricky parts under the hood.
To keep things clean and easy to maintain, I split the logic into three main files - each doing exactly one job:
- leaderboard_spider.py - does the crawling and parsing
- items.py - defines the structure for raw data
- pipelines.py - filters, saves, and notifies
One important step before starting the spider is configuring Scrapy to use Zyte API as the backend for all requests. This goes into our Scrapy settings.py file:
import os

# Load Zyte API key securely from environment (recommended)
ZYTE_API_KEY = os.getenv("ZYTE_API_KEY")

# Enable transparent mode for better debugging and easier dev experience
ZYTE_API_TRANSPARENT_MODE = True

# Use Zyte’s download handler and middleware
DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 633,
}
📝 Tip: keep sensitive info like API keys in environment variables, never hardcode credentials directly.
Defining Raw Data
In Scrapy, items basically define how we want to shape the raw data we’re scraping. They’re like organized containers that hold everything the spider grabs. Later on, pipelines handle cleaning and validation.
For instance, here’s the simple RawPlayerItem class I used:
# items.py
import scrapy

# 🧱 Defines the structure of raw player data scraped from the leaderboard
class RawPlayerItem(scrapy.Item):
    player_name_raw = scrapy.Field()
    kingdom_raw = scrapy.Field()
    level_raw = scrapy.Field()
    game_exp_raw = scrapy.Field()
    page = scrapy.Field()
💡 Tip: If your setup includes pagination, it’s helpful to capture the page number as part of your data. For me, this was useful for estimating how long the scraping would take and for debugging issues.
I set the spider up around three main functions:
- Initialize settings
- Send requests
- Parse the data
Inside the __init__ method, I just set up some basic configurations (a minimal sketch follows the list below), like:
- Minimum player level to consider
- Number of pages to scrape
- Output location
- Whether or not to send a Discord notification
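Here’s a minimal sketch of what that __init__ can look like, assuming the options are passed in as Scrapy -a arguments; the argument names are illustrative, not necessarily the exact ones from the project:
# leaderboard_spider.py (illustrative sketch)
import scrapy

class LeaderboardSpider(scrapy.Spider):
    name = "leaderboard"

    def __init__(self, min_level=75, max_pages=2, output_file="players.json",
                 notify="true", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Scrapy passes -a arguments in as strings, so cast them explicitly
        self.min_level = int(min_level)
        self.max_pages = int(max_pages)
        self.output_file = output_file
        self.notify_discord = str(notify).lower() == "true"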
When the spider starts sending requests, it’s not just grabbing plain HTML. Because the site uses a lot of JavaScript, we rely on Zyte API’s browser automation to fully load content before scraping.
A couple of things to keep in mind while sending requests:
- Add a little wait time using actions with a timeout, because sometimes the page content takes a few seconds to fully load.
- Set the geolocation to US - this was a key discovery. The site sometimes shows incomplete or blocked content depending on the request’s region. Setting it to the US gave consistent, clean data every time.
Example request setup:
meta = {
    'zyte_api': {
        'browserHtml': True,   # Get full browser-rendered HTML
        'javascript': True,    # Enable JS execution
        'actions': [           # Wait time before scraping
            {'action': 'waitForTimeout', 'timeout': 15}
        ],
        'geolocation': 'US'    # Set location to US for consistent data
    },
    'page': page,
}
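For completeness, here’s roughly how that meta dictionary gets attached to each paginated request in start_requests(); the leaderboard URL below is a placeholder, not the real site:
def start_requests(self):
    for page in range(1, self.max_pages + 1):
        meta = {
            'zyte_api': {
                'browserHtml': True,
                'actions': [{'action': 'waitForTimeout', 'timeout': 15}],
                'geolocation': 'US',
            },
            'page': page,
        }
        # Placeholder URL - swap in the real leaderboard endpoint
        url = f"https://example.com/leaderboard?page={page}"
        yield scrapy.Request(url, meta=meta, callback=self.parse)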
One of the nice things Scrapy handles for us behind the scenes is retries and error handling.
If you were working with just plain Python + Zyte API, you’d have to write your own retry logic for bans, 520 errors, and other hiccups.
Just add these to your settings.py to handle retries automatically:
# Retry settings
RETRY_HTTP_CODES = [403, 429, 500, 502, 503, 504, 520, 524]
RETRY_TIMES = 5
Once Zyte sends back the fully rendered HTML, the spider’s parse() method gets to work. It uses CSS selectors to sift through the messy HTML and pick out exactly what we need: player names, kingdoms, levels, and more.
📝 Pro Tip: I actually use two CSS selectors as a backup plan, because sometimes the page’s HTML is a little different - like text wrapped in <font> tags on some pages but not others. This helps the spider stay flexible and not break; something you learn while debugging! A quick sketch of the idea is below.
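The selectors here are hypothetical (the real ones depend on the site’s markup), but they show the fallback pattern:
def extract_text(row, selector):
    # Try the plain selector first, then the <font>-wrapped variant some pages use
    return row.css(f"{selector}::text").get() or row.css(f"{selector} font::text").get()

# Usage inside parse(), with made-up column classes:
# name = extract_text(row, "td.player-name")
# level = extract_text(row, "td.level")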
Pipelines: Cleaning, Filtering & Notifying
Once the spider scrapes raw data, the pipelines take over to clean, validate, save, and notify. I split the pipeline into two main parts:
- PlayerProcessingPipeline: cleans up the raw data, filters out players below the minimum level, avoids duplicates, and saves the final list.
- DiscordNotificationPipeline: at the end, this pipeline checks for any new players and shoots a neat summary over to Discord to keep everyone in the loop.
One cool thing I learned about how pipelines work in Scrapy is that each pipeline class gets its own “wrap-up” moment when the spider finishes running. Scrapy lets every pipeline class define its own finishing method, close_spider(), and it runs these automatically in the order you set in ITEM_PIPELINES in settings.py.
# Enable pipelines
ITEM_PIPELINES = {
    "scrape_filter_notify.pipelines.PlayerProcessingPipeline": 300,
    "scrape_filter_notify.pipelines.DiscordNotificationPipeline": 800,
}
That’s why, in my case, the processing pipeline runs first to clean and save the data, and the Discord pipeline runs right after to send notifications based on that data.
Remember, this all happens under the hood: once the spider starts crawling, the pipeline quietly takes over in the background. It filters out duplicates, skips players below the level threshold, and stores the clean data in a JSON file. A rough sketch of that processing pipeline is below.
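This sketch assumes the item fields from RawPlayerItem above and the spider attributes (min_level, output_file) set in __init__; the real pipeline does more cleanup, but the shape is the same:
# pipelines.py (illustrative sketch)
import json
from scrapy.exceptions import DropItem

class PlayerProcessingPipeline:
    def __init__(self):
        self.seen_names = set()
        self.players = []

    def process_item(self, item, spider):
        name = item["player_name_raw"].strip()
        level = int(item["level_raw"])
        if level < spider.min_level:
            raise DropItem(f"{name} is below the level threshold")
        if name in self.seen_names:
            raise DropItem(f"Duplicate player: {name}")
        self.seen_names.add(name)
        self.players.append({"name": name, "level": level, "page": item.get("page")})
        return item

    def close_spider(self, spider):
        # Wrap-up moment: persist the clean, filtered list once the crawl ends
        with open(spider.output_file, "w") as f:
            json.dump(self.players, f, indent=2)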
That wraps up our Scraper Engine.
But scraping data is only useful if it reaches the people who need it.
Next, I set up the Discord notifier to deliver the fresh data we just scraped right to the Discord server.
Sending Updates with Discord Notifier
Building a Discord bot isn’t hard; libraries like:
- Discord.py - a modern, easy to use, feature-rich, and async ready API wrapper for Discord.
- Discord.js - a powerful Node.js module that allows you to interact with the Discord API very easily.
…make it pretty straightforward.
Since our scraper is all in Python, I went with Discord.py. That way, everything runs in one language with no extra headaches - no child processes, no separate API layer just to talk to the scraper engine. That said, Discord.js has its own perks and can be the better pick if you’re already deep in the Node.js ecosystem. We’ll explore that route another time.
With discord_notifier.py, the workflow is pretty simple (a rough sketch follows the steps):
1️⃣ Load secrets (bot token & channel ID) securely via environment variables.
2️⃣ Log in to Discord, find the target channel, and build a polished embed message with the top new players.
3️⃣ Send the message, then log out cleanly.
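As a sketch, that flow boils down to something like this with discord.py; the env variable names and embed layout here are assumptions, not necessarily the project’s exact ones:
# discord_notifier.py (illustrative sketch)
import os
import discord

async def send_summary(new_players):
    token = os.getenv("DISCORD_BOT_TOKEN")
    channel_id = int(os.getenv("DISCORD_CHANNEL_ID"))

    client = discord.Client(intents=discord.Intents.default())

    @client.event
    async def on_ready():
        channel = client.get_channel(channel_id)
        embed = discord.Embed(title="🏆 New high-level players", color=0x2ECC71)
        for player in new_players[:10]:
            embed.add_field(name=player["name"], value=f"Level {player['level']}", inline=False)
        await channel.send(embed=embed)
        await client.close()  # log out cleanly once the summary is sent

    await client.start(token)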
The fun part was dealing with the event loop clash between Scrapy and Discord.py.
Here’s the thing: Scrapy runs asynchronously on top of Twisted, a networking library that provides the asynchronous framework Scrapy is built on. That means Scrapy manages a lot of things (like web requests and processing) concurrently within its own Twisted event loop.
When the spider finishes scraping, Scrapy begins shutting down. But in my second pipeline class (DiscordNotificationPipeline), we still need to run the notifier - and we’re still inside Scrapy’s Twisted event loop.
On the other hand, when we run discord_notifier using the discord.py library, it uses asyncio, which runs its own separate event loop. And the key problem is that:
🔥 You cannot start an asyncio loop while another event loop (like Twisted’s) is already running.
Python will raise a RuntimeError, because you’re trying to start one event loop inside another.
To avoid that, I added a check:
If the Scrapy loop is already active, the notification runs in a separate thread with its own event loop.
If not, it runs normally on the main loop.
Something like this does the trick (send_notification() here stands in for the async notifier coroutine):
import asyncio
import threading

loop = asyncio.get_event_loop()
if loop.is_running():
    # Scrapy's loop is active: run the notifier (send_notification is a stand-in name) in its own thread
    threading.Thread(target=lambda: asyncio.run(send_notification())).start()
else:
    # No loop running yet: safe to run the notifier on the current loop
    loop.run_until_complete(send_notification())
That little workaround ensures the scraper finishes cleanly and the Discord server gets a clean summary message every time the job completes - no crashes, no conflicts.
Note: The discord_notifier.py script we discussed isn’t a full-fledged bot - it just logs in, sends a summary message, and logs out. It’s great for running the scraper on a schedule and pushing updates to Discord automatically. I created a separate Discord bot that gives us full control over the scraping process directly from Discord. This setup keeps the scraper independent and flexible!
Before we move on, here’s a quick visual that ties everything together - from fetching the rendered HTML to storing filtered data as JSON, sending updates to Discord, and setting up the scheduler in the next step:
Pretty solid, right?
Now that we’ve got scraping and notifications working, the next question is: what if we want this whole flow to run automatically, without having to trigger it manually every hour or so?
That is exactly what continuous_monitor.py was set up for. It’s a smart loop that runs our spider at regular intervals…
🔁Autopilot Mode: Let the Spider Run Itself
Here’s what I did for scheduling the scraper. The monitor:
- Keeps track of run stats: start time, last run, next scheduled run, and total runs completed.
- Handles shutdown signals cleanly, so we never leave half-finished runs hanging.
- Launches the spider as a subprocess, waits for it to finish, and then sleeps for the interval you’ve set:
monitor = ContinuousMonitor()
monitor.start_monitoring(min_level=75, max_pages=2, interval_minutes=60)
- Then it runs again… and again… automatically. Something like this captures the core idea:
while monitoring_active:
    run_spider_subprocess()
    report_status()        # print or send to Discord
    sleep_for_interval()
I set it up using asyncio so everything runs smoothly without blocking, even when integrated with Discord notifications. The async loop handles spider runs, reporting, and sleep intervals efficiently, with no interference between tasks. A minimal async sketch of that loop follows.
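This sketch assumes the spider is registered under the name "leaderboard"; the real ContinuousMonitor class tracks more state, but the core loop looks like this:
# continuous_monitor.py (illustrative sketch)
import asyncio

async def monitoring_loop(interval_minutes: int, min_level: int, max_pages: int):
    run_count = 0
    while True:
        # Launch the spider as a subprocess without blocking the event loop
        process = await asyncio.create_subprocess_exec(
            "scrapy", "crawl", "leaderboard",
            "-a", f"min_level={min_level}", "-a", f"max_pages={max_pages}",
        )
        await process.wait()
        run_count += 1
        print(f"Run #{run_count} finished; next run in {interval_minutes} minutes")
        await asyncio.sleep(interval_minutes * 60)  # non-blocking sleep between runs

# asyncio.run(monitoring_loop(interval_minutes=60, min_level=75, max_pages=2))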
This script could run in two ways:
- Standalone: Just schedule it with a cron job, or even run it manually. It scrapes, saves JSON data, and optionally sends Discord notifications.
- Inside a bot: Later, we can plug it into a Discord bot to give us full control - start, stop, or check stats directly from Discord.
Now everything is wired up: the spider does the scraping, the notifier sends updates, and the monitor keeps things running on a loop.
But let’s be real, running a spider manually every time wasn’t exactly the goal. So we built a Discord bot!
Here’s a quick look at the full bot lifecycle to visualize how it all works (just make sure the .env file has the bot token and channel ID set up before running it):
This bot isn’t just a helper. It’s a full control panel for our scraper, right inside the Discord server. Want to scrape once on demand? Run /scrape. Want it to auto-run every 60 minutes? Do /monitor_start interval:60. Want to stop it? Check status? It’s all there, and the responses look good too (with progress bars, timestamps, and interactive result buttons).
🤖 Discord Bot: A Quick Walkthrough
- The bot launches and registers slash commands like /scrape, /monitor_start, /monitor_status, etc.
- We can interact with it via those commands. Depending on the command, it either:
  - Runs a single scraping job using the parameters we give (or defaults),
  - Starts the monitor, which loops and runs jobs periodically,
  - Or just gives helpful info with /help_sccrape, or lets us stop ongoing monitoring with /monitor_stop.
- While the scraping is in progress, we get live updates with visually satisfying progress bars, estimated times, and player counts.
- Once it’s done, it gives back a clean summary with a “View Results” button that opens an embedded, paginated view of the players it found, right in Discord itself.
A minimal sketch of what one of those slash commands looks like follows.
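This sketch uses discord.py’s app commands; the parameters and messages are simplified stand-ins for the project’s actual bot:
# discord_bot.py (illustrative sketch of a single slash command)
import discord
from discord import app_commands

client = discord.Client(intents=discord.Intents.default())
tree = app_commands.CommandTree(client)

@tree.command(name="scrape", description="Run a single leaderboard scrape")
async def scrape(interaction: discord.Interaction, min_level: int = 75, max_pages: int = 2):
    await interaction.response.defer()  # acknowledge right away; the spider takes a while
    # ...launch the spider subprocess here and collect the results...
    await interaction.followup.send(
        f"✅ Scrape finished (min level {min_level}, {max_pages} pages)"
    )

@client.event
async def on_ready():
    await tree.sync()  # register the slash commands with Discord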
So far, I built out all the pieces:
- A spider that scrapes,
- A Discord bot that commands it,
- A monitor that loops it in the background.
But I needed one more thing… A way to integrate all components together, for ease of control.
That’s why I created main.py: it’s the single command-line interface that ties the whole project together. Whether we want to run a quick scrape, start the Discord bot, or launch background monitoring, it does it all for us. A rough sketch of that CLI is below.
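The subcommands in this sketch mirror the ones used later in the article (scrape, monitor, bot), though the exact flags and internals are assumptions:
# main.py (illustrative sketch)
import argparse
import subprocess
import sys

def main():
    parser = argparse.ArgumentParser(description="Scrape, filter, and notify")
    sub = parser.add_subparsers(dest="command", required=True)

    scrape = sub.add_parser("scrape", help="Run a single scrape")
    scrape.add_argument("--min-level", type=int, default=75)
    scrape.add_argument("--max-pages", type=int, default=2)

    monitor = sub.add_parser("monitor", help="Run the background monitor")
    monitor.add_argument("--interval", type=int, default=60)

    sub.add_parser("bot", help="Start the Discord bot")

    args = parser.parse_args()
    if args.command == "scrape":
        subprocess.run(["scrapy", "crawl", "leaderboard",
                        "-a", f"min_level={args.min_level}",
                        "-a", f"max_pages={args.max_pages}"], check=True)
    elif args.command == "monitor":
        subprocess.run([sys.executable, "continuous_monitor.py",
                        "--interval", str(args.interval)], check=True)
    elif args.command == "bot":
        subprocess.run([sys.executable, "discord_bot.py"], check=True)

if __name__ == "__main__":
    main()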
Next up! Let’s see how the results actually look when this thing runs.
🌀 Output Preview: What Happens When It Runs
- Triggering a Scrape (Terminal Output): Here’s what it looks like when we run a scrape directly from the CLI. It kicks off the spider, runs through the pages, and wraps up by notifying us on our Discord channel.
Notified on Discord
- Running Background Monitoring (Terminal Output): When we want the spider to keep working in the background, automatically running every X minutes, we just trigger:
python main.py monitor --interval 30
We’ll see something like this in our terminal:
💬 And Just like before, it’ll ping us on Discord with updates!
- Live Discord Bot in Action: Once we start the Discord bot with the command below…
python main.py bot
…it boots up and gets right to work behind the scenes. It registers all the slash commands we built. Now we don’t need to touch the terminal - just head to Discord and start interacting with the bot directly.
Available Commands on Discord
a. Run a scrape instantly
Just type /scrape, hit enter, and the bot takes care of the rest.
b. Control the monitoring loop right from Discord:
And that’s it, we’ve seen it all in action.
From scraping and filtering to live Discord alerts and full automation via CLI and bot commands, every part of this project works together to keep us and the community updated on the latest leaderboard shifts with minimal effort.
🎯 Final Thoughts: Scraping That Talks Back
This project started with a simple goal: to help someone get past anti-bot walls and grab some game data with Zyte API. But along the way, it became something more - a full system that scrapes, filters, and talks back to us in real time via Discord.
The best part? It’s modular. Want to tweak the filter logic? Modify the pipeline. Want to plug it into another Discord server? Just update the .env. Need to scrape something entirely different? Swap out the spider logic and keep the rest.
Just imagine, a single question turned into a full-fledged project…
That’s exactly the kind of spark our community runs on. If you're into this kind of stuff, scraping tricky sites, building smarter automations, or just geeking out over ideas, come hang out in the Extract Data Discord. We’re 20,000+ strong and growing, with data lovers, scraping pros, and creative hackers sharing projects, questions, and solutions every single day.
And as for this project - I hope walking through this gave you a solid blueprint for how to go beyond just writing a spider and instead, build scraping workflows that feel more interactive, automated, and fun.
Play around, and let us know what you build next!
Thanks for reading, 🙂
Catch you in the Discord!