We’ve all been there. You ask an LLM like ChatGPT or Claude to write a simple web scraper for a site like AppSumo. It confidently spits out a script using soup.select('.price-tag-123'). You run it, and nothing happens. The classes are dynamic, the data is buried in a Next.js hydration blob, or the site’s anti-bot protection kicks you out before the page even loads.
This is the "Vibe Coding" bottleneck. You want to move from idea to execution using AI, but web scraping often forces you back into the weeds of manual DOM inspection and brittle CSS selectors.
We can break that cycle. This guide covers how to build a production-ready AppSumo scraper using Python and Playwright without writing a single manual CSS selector. Instead, we’ll use "hidden" data structures and AI-generated architecture to create a script that lasts.
Why Standard LLMs Fail on AppSumo
If you try to build a scraper using a generic prompt, you’ll likely run into three major roadblocks:
1. Dynamic/Tailwind Classes: AppSumo uses utility-first CSS (Tailwind) and dynamic class names. An LLM might guess a selector like .text-midnight, but if the developers change the padding or color scheme, the scraper breaks.
2. Client-Side Rendering: As a modern Next.js application, much of AppSumo's data isn't in the initial HTML. It's loaded dynamically. If you use a simple requests and BeautifulSoup approach, you'll often find yourself staring at an empty div.
3. Hallucination: LLMs often imagine that websites have logical IDs like #product-price. AppSumo doesn't work that way.
To build something reliable, stop looking at what the website looks like and start looking at how it stores its data.
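You can verify this shift in perspective yourself before writing any scraper. A quick diagnostic sketch, assuming a plain requests call gets through at all (AppSumo's anti-bot protection may block it, which is addressed later in this post; the URL and User-Agent header here are illustrative):

import requests

resp = requests.get(
    "https://appsumo.com/products/triplo-ai/",  # example product page
    headers={"User-Agent": "Mozilla/5.0"},      # illustrative header, not a bypass
    timeout=30,
)
# The useful data lives in embedded blobs, not in hand-labeled markup
print("has __NEXT_DATA__ blob:", "__NEXT_DATA__" in resp.text)
print("has JSON-LD blob:", "application/ld+json" in resp.text)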
The Solution: The AI Scraper Builder
Instead of asking a general-purpose AI to guess selectors, I used the ScrapeOps AI Scraper Builder. This tool analyzes a target URL and generates a Playwright script that targets the most stable data sources on the page: JSON-LD and __NEXT_DATA__.
By pasting an AppSumo product URL into the builder, we get a script that doesn't care if a button turns from blue to green. It targets the raw data blobs the website uses to render itself.
Code Walkthrough: Analyzing the Generated Script
Let’s look at the core script from the AppSumo Scrapers repository. We’ll focus on the Playwright implementation found at python/playwright/product_data/scraper/appsumo.com_scraper_product_v1.py.
1. The Data Schema
First, we define the shape of the data we want. Using Python dataclasses keeps the script type-safe and structured.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ScrapedData:
    name: str = ""
    brand: str = ""
    price: float = 0.0
    preDiscountPrice: float = 0.0
    currency: str = "USD"
    availability: str = "in_stock"
    aggregateRating: Dict[str, Any] = field(default_factory=dict)
    description: str = ""
    features: List[str] = field(default_factory=list)
    images: List[Dict[str, str]] = field(default_factory=list)
    url: str = ""
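For illustration (the values here are hypothetical), a dataclass instance serializes cleanly with asdict, which the pipeline later in this post relies on:

from dataclasses import asdict

item = ScrapedData(name="Example Deal", price=59.0, url="https://appsumo.com/products/example/")
print(asdict(item)["price"])  # 59.0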
2. Extraction Without Selectors
This is the most critical part of the script. Instead of searching for a price inside a <span>, the script evaluates a JavaScript block to find the JSON-LD (Structured Data) and NEXT_DATA (Next.js state) objects.
from typing import Optional
from playwright.async_api import Page

async def extract_data(page: Page) -> Optional[ScrapedData]:
    # Extraction via JSON-LD
    json_ld_data = await page.evaluate("""() => {
        const scripts = Array.from(document.querySelectorAll('script[type="application/ld+json"]'));
        for (const s of scripts) {
            try {
                const data = JSON.parse(s.innerText);
                const findProduct = (obj) => {
                    if (Array.isArray(obj)) return obj.find(item => item['@type'] === 'Product');
                    if (obj['@type'] === 'Product') return obj;
                    return null;
                };
                const product = findProduct(data);
                if (product) return product;
            } catch (e) {}
        }
        return null;
    }""")
AppSumo, like many modern sites, embeds a JSON object containing the product name, price, and reviews for SEO purposes. This JSON is highly structured and rarely changes, making it significantly more reliable than CSS selectors.
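The snippet above ends mid-function. A minimal sketch of how extract_data might continue, mapping the schema.org Product fields onto the dataclass (the exact mapping and the __NEXT_DATA__ traversal in the repository may differ):

    if not json_ld_data:
        # Fallback (not shown): parse <script id="__NEXT_DATA__"> the same way and
        # walk props.pageProps to the product object; its layout is site-specific
        return None

    # Map schema.org Product fields onto the dataclass
    brand = json_ld_data.get("brand", "")
    if isinstance(brand, dict):  # brand can be a nested object or a bare string
        brand = brand.get("name", "")
    offers = json_ld_data.get("offers") or {}
    if isinstance(offers, list):  # offers can be a single object or a list
        offers = offers[0] if offers else {}
    return ScrapedData(
        name=json_ld_data.get("name", ""),
        brand=brand,
        price=float(offers.get("price", 0) or 0),
        currency=offers.get("priceCurrency", "USD"),
        description=json_ld_data.get("description", ""),
        url=page.url,
    )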
3. Handling Proxies and Anti-Bot Measures
AppSumo employs anti-bot measures that block standard headless browsers. The generated script handles this using playwright-stealth and the ScrapeOps Proxy integrated directly into the browser launch:
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

# ScrapeOps Residential Proxy Configuration (API_KEY is defined at the top of the script)
PROXY_CONFIG = {
    "server": "http://residential-proxy.scrapeops.io:8181",
    "username": "scrapeops",
    "password": API_KEY
}

async def run_scraper():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=PROXY_CONFIG
        )
        context = await browser.new_context()
        page = await context.new_page()
        await stealth_async(page)  # Apply stealth patterns
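The generated script continues from here. A minimal sketch of how the rest of run_scraper might proceed (PRODUCT_URL is a hypothetical constant; extract_data is the helper above, and DataPipeline is covered in the next section):

        # Continuation of run_scraper: navigate, extract, persist, clean up
        pipeline = DataPipeline()
        await page.goto(PRODUCT_URL, wait_until="domcontentloaded")
        data = await extract_data(page)
        if data:
            pipeline.add_data(data)
        await browser.close()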
Handling Pipelines and Deduplication
To make this production-ready, the script includes a DataPipeline class that handles deduplication and saves data in JSONL format.
import json
from dataclasses import asdict

class DataPipeline:
    def __init__(self, jsonl_filename="output.jsonl"):
        self.items_seen = set()
        self.jsonl_filename = jsonl_filename

    def is_duplicate(self, input_data):
        # Key on productId when present, falling back to the URL
        # (the ScrapedData schema above has no productId field)
        item_key = input_data.get("productId") or input_data.get("url")
        if item_key in self.items_seen:
            return True
        self.items_seen.add(item_key)
        return False

    def add_data(self, scraped_data: ScrapedData):
        data_dict = asdict(scraped_data)
        if not self.is_duplicate(data_dict):
            with open(self.jsonl_filename, mode="a", encoding="UTF-8") as f:
                f.write(json.dumps(data_dict) + "\n")
JSONL is ideal for scraping because it allows you to stream data to a file line-by-line. If the script crashes on the 500th page, you preserve the first 499 results.
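And because each line is a standalone JSON document, reading the results back is one json.loads per record. A quick sketch, assuming the output.jsonl written above:

import json

with open("output.jsonl", encoding="UTF-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(records)} unique products")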
Running the Scraper
To run this yourself, follow these steps. The repository includes Python (Playwright, Selenium, and BeautifulSoup) and Node.js implementations.
1. Clone the Repo:

   git clone https://github.com/scraper-bank/AppSumo.com-Scrapers.git
   cd AppSumo.com-Scrapers/python/playwright

2. Install Dependencies:

   pip install playwright playwright-stealth
   playwright install chromium

3. Add your API Key: Get a free key from ScrapeOps and paste it into the API_KEY variable in the script.

4. Execute:

   python product_data/scraper/appsumo.com_scraper_product_v1.py
The Result: Structured Data
The result is a clean, structured JSONL file. There are no HTML tags or messy whitespace—just data ready for a database or spreadsheet:
{
  "name": "Triplo AI",
  "brand": "Triplo AI",
  "price": 59.0,
  "preDiscountPrice": 102.0,
  "currency": "USD",
  "availability": "in_stock",
  "aggregateRating": {"ratingValue": 4.9, "reviewCount": 128},
  "category": "Productivity"
}
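If the final destination really is a spreadsheet, pandas (an extra dependency, not part of the repository's requirements as far as this post shows) can convert the JSONL in two lines:

import pandas as pd

# read_json with lines=True parses one JSON object per line
df = pd.read_json("output.jsonl", lines=True)
df.to_csv("appsumo_products.csv", index=False)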
To Wrap Up
Vibe coding is a fast way to build, but it requires a specific strategy for the web. By moving away from brittle CSS selectors and toward structured data blobs like JSON-LD, you can build scrapers that are both faster to write and harder to break.
Key Takeaways:
Don't fight the DOM: Look for __NEXT_DATA__ or ld+json scripts first.
Use specialized tools: The ScrapeOps AI Scraper Builder handles the heavy lifting of script generation.
Think in Pipelines: Use JSONL and deduplication for production-grade data.
For more examples, including Node.js versions and search page scrapers, check out the full AppSumo Scrapers GitHub repository.