Jerry A. Henley

From 403 Forbidden to 200 OK: Stealth Scraping AppSumo

The "Vibe Coding" era has arrived. With AI-powered IDEs like Cursor and Windsurf, you can describe a scraper in plain English and receive a valid Python or Node.js script in seconds. On paper, these scripts look flawless. However, the moment you point that code at a high-value target like AppSumo, reality sets in.

Your code might work once. Then, suddenly, every request returns a 403 Forbidden error, or you find yourself staring at a Cloudflare CAPTCHA your script can't solve.

The problem isn't the AI's coding ability. It's that AI writes code for a "polite" web that no longer exists. To scrape AppSumo at scale, you need more than just logic; you need stealth. This guide explains how to bypass these defenses using the production-ready tools found in the AppSumo Scrapers repository.

Why Standard Scrapers Get Blocked

Modern anti-bot solutions like Cloudflare and Akamai analyze the "vibe" of your entire connection. If any detail feels artificial, you're blocked. Three main factors usually trigger these defenses:

  1. TLS Fingerprinting: When you use a library like requests in Python or axios in Node.js, the way your script negotiates the encrypted connection (the TLS handshake) differs from a real browser's. Anti-bot systems recognize this fingerprint and immediately flag the traffic as automated.

  2. Browser Leaks: Standard Selenium or Playwright instances leave traces. For example, the property navigator.webdriver is set to true by default. Real browsers don't do this.

  3. IP Reputation: Scraping from an AWS or DigitalOcean server uses a Datacenter IP. These are heavily monitored. AppSumo expects traffic from real humans on home Wi-Fi or mobile networks.

To move from a 403 to a 200 OK, we have to patch these leaks.
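For contrast, here is roughly what the "polite", AI-generated approach looks like. This is a minimal sketch using the plain requests library; with a default TLS fingerprint and a datacenter IP, a request like this usually comes back as a 403 or a Cloudflare challenge page rather than product HTML:

import requests

# A naive request: default TLS fingerprint, no browser environment, and
# (on a typical cloud server) a datacenter IP. All three get flagged.
response = requests.get(
    "https://appsumo.com/browse/",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    timeout=30,
)

print(response.status_code)              # often 403 instead of 200
print("Just a moment" in response.text)  # True when Cloudflare serves a challenge page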

The Python Approach: Undetected ChromeDriver

For Python developers, the most effective way to mimic a human is the undetected-chromedriver (uc) library. Unlike standard Selenium, uc patches the ChromeDriver binary on the fly to strip the telltale markers that trigger anti-bot sensors.

The Python implementation in the repository handles it like this:


import seleniumwire.undetected_chromedriver as uc  # selenium-wire's undetected-chromedriver integration


def get_driver():
    options = uc.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--window-size=1920,1080")

    # selenium-wire integrates authenticated proxy support (PROXY_CONFIG is defined below)
    driver = uc.Chrome(
        options=options,
        seleniumwire_options=PROXY_CONFIG,
    )

    # Mask the driver via CDP commands before any page script runs
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
            Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
        """
    })
    return driver


Why this works:

  • --disable-blink-features=AutomationControlled: This critical flag prevents the browser from notifying the website that it is being controlled by automated software.

  • CDP Commands: Manually overwriting the navigator.webdriver property to undefined makes the browser environment far harder to distinguish from a standard user session. You can verify the override directly, as shown below.
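A quick way to confirm the masking took effect is to query those properties from the driver itself. A minimal check, assuming the get_driver() helper defined above:

driver = get_driver()
driver.get("https://appsumo.com/browse/")

# An unpatched Selenium session returns True here; after the CDP override,
# the page sees undefined, which comes back to Python as None.
print(driver.execute_script("return navigator.webdriver"))
print(driver.execute_script("return navigator.languages"))

driver.quit()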

The Node.js Approach: Playwright with Stealth

For JavaScript developers, Playwright is the standard, but it is easily detectable out of the box. To stay under the radar, use playwright-extra combined with the puppeteer-extra-plugin-stealth.

Here is how the Node.js scraper configures stealth:


const { chromium } = require('playwright-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Inject the stealth plugin into the chromium engine
chromium.use(StealthPlugin());

async function runScraper() {
    // PROXY_CONFIG is defined in the proxy section below
    const browser = await chromium.launch({
        proxy: PROXY_CONFIG,
        headless: true
    });
    const page = await browser.newPage();
    await page.goto('https://appsumo.com/browse/');
    // ...extract product data here...
    await browser.close();
}

runScraper().catch(console.error);


The Role of the Stealth Plugin

The stealth plugin is a collection of "evasions." It mocks the WebGL vendor to make it look like you have a real Nvidia or Intel GPU, fixes the navigator.languages array, and mocks the permissions API. It addresses dozens of small leaks that AI-generated code usually misses.

Integrating Residential Proxies

Even a stealthy browser will eventually be blocked if it sends 1,000 requests from a single IP. AppSumo blocks most datacenter IPs immediately. To succeed, you need Residential Proxies.

Residential proxies route your traffic through real home devices, providing a clean IP that looks like an ordinary customer's connection. Every scraper in the repository includes a PROXY_CONFIG block designed for providers like ScrapeOps.

Python Proxy Integration:


# selenium-wire handles authenticated proxies
API_KEY = "YOUR_SCRAPEOPS_API_KEY"

PROXY_CONFIG = {
    'proxy': {
        'http': f'http://scrapeops:{API_KEY}@residential-proxy.scrapeops.io:8181',
        'https': f'http://scrapeops:{API_KEY}@residential-proxy.scrapeops.io:8181',
        'no_proxy': 'localhost,127.0.0.1'
    }
}


Node.js Proxy Integration:


const PROXY_CONFIG = {
    server: 'http://residential-proxy.scrapeops.io:8181',
    username: 'scrapeops',
    password: API_KEY
};


Using a proxy gateway removes the need to rotate IPs manually. Every request automatically receives a new, high-reputation residential IP address.
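Before pointing the gateway at AppSumo, it is worth sanity-checking the rotation. Here is a minimal sketch using requests; the IP-echo endpoint is just an illustrative choice, and any "what is my IP" service works:

import requests

API_KEY = "YOUR_SCRAPEOPS_API_KEY"  # the same key used in PROXY_CONFIG above

proxies = {
    "http": f"http://scrapeops:{API_KEY}@residential-proxy.scrapeops.io:8181",
    "https": f"http://scrapeops:{API_KEY}@residential-proxy.scrapeops.io:8181",
}

# Each request routed through the gateway should exit from a different
# residential IP, so the three printed addresses should differ.
for _ in range(3):
    print(requests.get("https://api.ipify.org", proxies=proxies, timeout=30).text)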

Handling Data at Scale

Once you’ve bypassed the blocks, you need a reliable way to store the data. The AppSumo repository uses the JSONL (JSON Lines) format. Unlike a standard JSON array, JSONL stores one object per line, which is much safer for scraping: if your script crashes on item #500, the first 499 items are already saved to disk.
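A minimal sketch of that write pattern (the field names here are illustrative, not the repository's exact schema):

import json

def save_item(item: dict, output_file: str = "appsumo_products.jsonl") -> None:
    # Append one JSON object per line; a crash mid-run leaves every
    # line written so far intact and parseable.
    with open(output_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(item) + "\n")

save_item({"productId": "example-deal", "price": 59, "rating": 4.8})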

The repository also uses a DataPipeline class to handle deduplication, which ensures you don't pay for proxy traffic only to scrape the same deal twice.


class DataPipeline {
    constructor(outputFile) {
        this.itemsSeen = new Set();
        this.outputFile = outputFile;
    }

    isDuplicate(data) {
        const itemKey = data.productId;
        if (this.itemsSeen.has(itemKey)) {
            console.warn(`Duplicate found: ${itemKey}`);
            return true;
        }
        this.itemsSeen.add(itemKey);
        return false;
    }
}


Testing and Validation

To run these scrapers, clone the repository and install the dependencies:


git clone https://github.com/scraper-bank/AppSumo.com-Scrapers.git
cd AppSumo.com-Scrapers/python/selenium
pip install -r requirements.txt
python product_data/scraper/appsumo.com_scraper_product_v1.py


Check the output folder for a .jsonl file. You should see structured data including prices, ratings, and features. If you see empty fields or "Just a moment..." text, your stealth settings or proxy credentials likely need adjustment.
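A quick sanity check you can run over the output (the file path here is an assumption; point it at whatever your run produced):

import json

with open("output/appsumo_products.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} records scraped")

# Records containing Cloudflare's challenge text mean the blocks came back.
blocked = [r for r in records if "Just a moment" in json.dumps(r)]
print(f"{len(blocked)} records look like challenge pages")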

To Wrap Up

Scraping AppSumo requires a defense-in-depth strategy. You cannot rely on AI-generated logic alone. Successful extraction combines:

  • Patched Browsers: Use undetected-chromedriver or Playwright Stealth to hide automation flags.

  • High-Reputation IPs: Use residential proxies so your traffic isn't flagged as coming from a datacenter.

  • Reliable Data Pipelines: Use JSONL and deduplication to ensure data integrity.

The AppSumo Scrapers repository provides the blueprint for this. Fork it, add your API key, and move past the 403 errors.
