We’ve all been there. You ask an LLM like ChatGPT or Claude to write a simple web scraper for a site like AppSumo. It confidently spits out a script using soup.select('.price-tag-123'). You run it, and nothing happens. The classes are dynamic, the data is buried in a Next.js hydration blob, or the site’s anti-bot protection kicks you out before the page even loads.
This is the "Vibe Coding" bottleneck. You want to move from idea to execution using AI, but web scraping often forces you back into the weeds of manual DOM inspection and brittle CSS selectors.
We can break that cycle. This guide covers how to build a production-ready AppSumo scraper using Python and Playwright without writing a single manual CSS selector. Instead, we’ll use "hidden" data structures and AI-generated architecture to create a script that lasts.
Why Standard LLMs Fail on AppSumo
If you try to build a scraper using a generic prompt, you’ll likely run into three major roadblocks:
1. Dynamic/Tailwind Classes: AppSumo uses utility-first CSS (Tailwind) and dynamic class names. An LLM might guess a selector like .text-midnight, but if the developers change the padding or color scheme, the scraper breaks.
2. Client-Side Rendering: As a modern Next.js application, much of AppSumo's data isn't in the initial HTML. It's loaded dynamically. If you use a simple requests and BeautifulSoup approach, you'll often find yourself staring at an empty div.
3. Hallucination: LLMs often imagine that websites have logical IDs like #product-price. AppSumo doesn't work that way.
To build something reliable, stop looking at what the website looks like and start looking at how it stores its data.
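You can verify this shift in perspective yourself before writing any scraper. A quick diagnostic sketch, assuming a plain requests call gets through at all (AppSumo's anti-bot protection may block it, which is addressed later in this post; the URL and User-Agent header here are illustrative):

import requests

resp = requests.get(
    "https://appsumo.com/products/triplo-ai/",  # example product page
    headers={"User-Agent": "Mozilla/5.0"},      # illustrative header, not a bypass
    timeout=30,
)
# The useful data lives in embedded blobs, not in hand-labeled markup
print("has __NEXT_DATA__ blob:", "__NEXT_DATA__" in resp.text)
print("has JSON-LD blob:", "application/ld+json" in resp.text)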
The Solution: The AI Scraper Builder
Instead of asking a general-purpose AI to guess selectors, I used the ScrapeOps AI Scraper Builder. This tool analyzes a target URL and generates a Playwright script that targets the most stable data sources on the page: JSON-LD and __NEXT_DATA__.
By pasting an AppSumo product URL into the builder, we get a script that doesn't care if a button turns from blue to green. It targets the raw data blobs the website uses to render itself.
Code Walkthrough: Analyzing the Generated Script
Let’s look at the core script from the AppSumo Scrapers repository. We’ll focus on the Playwright implementation found at python/playwright/product_data/scraper/appsumo.com_scraper_product_v1.py.
1. The Data Schema
First, we define the shape of the data we want. Using Python dataclasses keeps the script type-safe and structured.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ScrapedData:
    name: str = ""
    brand: str = ""
    price: float = 0.0
    preDiscountPrice: float = 0.0
    currency: str = "USD"
    availability: str = "in_stock"
    aggregateRating: Dict[str, Any] = field(default_factory=dict)
    description: str = ""
    features: List[str] = field(default_factory=list)
    images: List[Dict[str, str]] = field(default_factory=list)
    url: str = ""
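For illustration (the values here are hypothetical), a dataclass instance serializes cleanly with asdict, which the pipeline later in this post relies on:

from dataclasses import asdict

item = ScrapedData(name="Example Deal", price=59.0, url="https://appsumo.com/products/example/")
print(asdict(item)["price"])  # 59.0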
2. Extraction Without Selectors
This is the most critical part of the script. Instead of searching for a price inside a <span>, the script evaluates a JavaScript block to find the JSON-LD (Structured Data) and NEXT_DATA (Next.js state) objects.
from typing import Optional
from playwright.async_api import Page

async def extract_data(page: Page) -> Optional[ScrapedData]:
    # Extraction via JSON-LD
    json_ld_data = await page.evaluate("""() => {
        const scripts = Array.from(document.querySelectorAll('script[type="application/ld+json"]'));
        for (const s of scripts) {
            try {
                const data = JSON.parse(s.innerText);
                const findProduct = (obj) => {
                    if (Array.isArray(obj)) return obj.find(item => item['@type'] === 'Product');
                    if (obj['@type'] === 'Product') return obj;
                    return null;
                };
                const product = findProduct(data);
                if (product) return product;
            } catch (e) {}
        }
        return null;
    }""")
AppSumo, like many modern sites, embeds a JSON object containing the product name, price, and reviews for SEO purposes. This JSON is highly structured and rarely changes, making it significantly more reliable than CSS selectors.
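The snippet above ends mid-function. A minimal sketch of how extract_data might continue, mapping the schema.org Product fields onto the dataclass (the exact mapping and the __NEXT_DATA__ traversal in the repository may differ):

    if not json_ld_data:
        # Fallback (not shown): parse <script id="__NEXT_DATA__"> the same way and
        # walk props.pageProps to the product object; its layout is site-specific
        return None

    # Map schema.org Product fields onto the dataclass
    brand = json_ld_data.get("brand", "")
    if isinstance(brand, dict):  # brand can be a nested object or a bare string
        brand = brand.get("name", "")
    offers = json_ld_data.get("offers") or {}
    if isinstance(offers, list):  # offers can be a single object or a list
        offers = offers[0] if offers else {}
    return ScrapedData(
        name=json_ld_data.get("name", ""),
        brand=brand,
        price=float(offers.get("price", 0) or 0),
        currency=offers.get("priceCurrency", "USD"),
        description=json_ld_data.get("description", ""),
        url=page.url,
    )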
3. Handling Proxies and Anti-Bot Measures
AppSumo employs anti-bot measures that block standard headless browsers. The generated script handles this using playwright-stealth and the ScrapeOps Proxy integrated directly into the browser launch:
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

# ScrapeOps Residential Proxy Configuration (API_KEY is defined at the top of the script)
PROXY_CONFIG = {
    "server": "http://residential-proxy.scrapeops.io:8181",
    "username": "scrapeops",
    "password": API_KEY
}

async def run_scraper():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=PROXY_CONFIG
        )
        context = await browser.new_context()
        page = await context.new_page()
        await stealth_async(page)  # Apply stealth patterns
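The generated script continues from here. A minimal sketch of how the rest of run_scraper might proceed (PRODUCT_URL is a hypothetical constant; extract_data is the helper above, and DataPipeline is covered in the next section):

        # Continuation of run_scraper: navigate, extract, persist, clean up
        pipeline = DataPipeline()
        await page.goto(PRODUCT_URL, wait_until="domcontentloaded")
        data = await extract_data(page)
        if data:
            pipeline.add_data(data)
        await browser.close()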
Handling Pipelines and Deduplication
To make this production-ready, the script includes a DataPipeline class that handles deduplication and saves data in JSONL format.
import json
from dataclasses import asdict

class DataPipeline:
    def __init__(self, jsonl_filename="output.jsonl"):
        self.items_seen = set()
        self.jsonl_filename = jsonl_filename

    def is_duplicate(self, input_data):
        # Key on productId when present, falling back to the URL
        # (the ScrapedData schema above has no productId field)
        item_key = input_data.get("productId") or input_data.get("url")
        if item_key in self.items_seen:
            return True
        self.items_seen.add(item_key)
        return False

    def add_data(self, scraped_data: ScrapedData):
        data_dict = asdict(scraped_data)
        if not self.is_duplicate(data_dict):
            with open(self.jsonl_filename, mode="a", encoding="UTF-8") as f:
                f.write(json.dumps(data_dict) + "\n")
JSONL is ideal for scraping because it allows you to stream data to a file line-by-line. If the script crashes on the 500th page, you preserve the first 499 results.
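And because each line is a standalone JSON document, reading the results back is one json.loads per record. A quick sketch, assuming the output.jsonl written above:

import json

with open("output.jsonl", encoding="UTF-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(records)} unique products")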
Running the Scraper
To run this yourself, follow these steps. The repository includes Python (Playwright, Selenium, and BeautifulSoup) and Node.js implementations.
1. Clone the Repo:

   git clone https://github.com/scraper-bank/AppSumo.com-Scrapers.git
   cd AppSumo.com-Scrapers/python/playwright

2. Install Dependencies:

   pip install playwright playwright-stealth
   playwright install chromium

3. Add your API Key: Get a free key from ScrapeOps and paste it into the API_KEY variable in the script.

4. Execute:

   python product_data/scraper/appsumo.com_scraper_product_v1.py
The Result: Structured Data
The result is a clean, structured JSONL file. There are no HTML tags or messy whitespace—just data ready for a database or spreadsheet:
{
  "name": "Triplo AI",
  "brand": "Triplo AI",
  "price": 59.0,
  "preDiscountPrice": 102.0,
  "currency": "USD",
  "availability": "in_stock",
  "aggregateRating": {"ratingValue": 4.9, "reviewCount": 128},
  "category": "Productivity"
}
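If the final destination really is a spreadsheet, pandas (an extra dependency, not part of the repository's requirements as far as this post shows) can convert the JSONL in two lines:

import pandas as pd

# read_json with lines=True parses one JSON object per line
df = pd.read_json("output.jsonl", lines=True)
df.to_csv("appsumo_products.csv", index=False)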
To Wrap Up
Vibe coding is a fast way to build, but it requires a specific strategy for the web. By moving away from brittle CSS selectors and toward structured data blobs like JSON-LD, you can build scrapers that are both faster to write and harder to break.
Key Takeaways:
Don't fight the DOM: Look for __NEXT_DATA__ or ld+json scripts first.
Use specialized tools: The ScrapeOps AI Scraper Builder handles the heavy lifting of script generation.
Think in Pipelines: Use JSONL and deduplication for production-grade data.
For more examples, including Node.js versions and search page scrapers, check out the full AppSumo Scrapers GitHub repository.