Jonathan D. Fisher

Beyond requests.get: Analyzing the Architecture of an AI-Generated Spider

There is a common stigma that AI-generated code is "toy-grade"—fine for a quick script, but too messy for a production pipeline. We often expect to see spaghetti code that lacks error handling, deduplication, or stealth.

However, the reality is shifting. Modern AI-generated scrapers increasingly use sophisticated design patterns that many developers miss on their first pass. We’ve seen this in the Beautylish.com-Scrapers repository, which contains production-ready spiders for both Python and Node.js.

By dissecting the beautylish_scraper_product_data_v1.py script, we can look past simple requests.get calls to see how to implement stealth, robust data pipelines, and intelligent extraction strategies that withstand modern anti-bot measures.

Why requests.get Fails

Modern e-commerce sites like Beautylish present significant hurdles for basic scraping scripts. Fetching a product page using a standard HTTP client usually leads to three major problems:

  1. Dynamic Content: Beautylish uses frontend frameworks like React or Next.js. Much of the product data is hydrated into the DOM via JavaScript after the initial page load. A simple GET request sees only an empty shell.
  2. Anti-Bot Measures: High-traffic retail sites use fingerprinting to detect automated scripts, looking for "headless" browser signatures and non-residential IP ranges.
  3. Data Fragility: Layouts change. If a scraper relies on a single CSS selector for the price, it breaks the moment the UI is updated.

To solve this, the architecture in our repository moves away from simple requests toward a "Browser-First" approach using Playwright and Puppeteer integrated with residential proxies.

Architecture and Configuration

A professional scraper should be maintainable. The Beautylish script follows a clear separation of concerns, splitting logic into three distinct layers:

  • Configuration: Centralized settings for API keys, retries, and browser timeouts.
  • Data Pipeline: A dedicated class for handling deduplication and storage.
  • Extraction Logic: A strategy-based function that tries multiple ways to find data.
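The configuration layer can be sketched as a small frozen dataclass. The field names below are illustrative, not taken from the repository:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScraperConfig:
    """Centralized settings (field names are illustrative, not the repo's)."""
    scrapeops_api_key: str = ""
    max_retries: int = 3
    browser_timeout_ms: int = 60_000
    output_dir: str = "."
```

Freezing the dataclass keeps settings immutable once the run starts, so no layer can silently change a timeout mid-crawl.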

The script also focuses on dynamic output. Rather than overwriting files, it uses a timestamping utility to ensure every run is isolated, preventing data corruption and simplifying debugging.

from datetime import datetime

def generate_output_filename() -> str:
    """Generate an output filename stamped with the current time."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"beautylish_com_product_page_scraper_data_{timestamp}.jsonl"

The Stealth Layer

To bypass anti-bot protections, the script implements a "Stealth Layer." It uses playwright_stealth to mask common automation signals, such as the navigator.webdriver flag, that websites use to identify bots.

It also integrates ScrapeOps Residential Proxies. Unlike data center IPs, which are easily flagged, residential proxies route traffic through home devices, making it indistinguishable from a standard user.
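Playwright accepts proxy settings when a browser or context is launched. A minimal sketch of building that settings dict — the endpoint, port, and credential format here are placeholders, not the actual ScrapeOps values:

```python
def build_proxy_config(api_key: str,
                       host: str = "residential-proxy.example.com",
                       port: int = 8181) -> dict:
    """Assemble a Playwright-style proxy settings dict (placeholder endpoint)."""
    return {
        "server": f"http://{host}:{port}",  # proxy gateway
        "username": "proxy_user",           # provider-specific login
        "password": api_key,                # API key often doubles as password
    }
```

The resulting dict would be passed as `proxy=build_proxy_config(...)` to `browser.new_context()`; consult your proxy provider's docs for the real hostname and auth scheme.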

The architecture initializes the browser context like this:

from playwright.async_api import Browser
from playwright_stealth import stealth_async

async def scrape_page(browser: Browser, url: str, pipeline: DataPipeline, retries: int = 3) -> None:
    context = await browser.new_context(
        ignore_https_errors=True,
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36..."
    )

    page = await context.new_page()
    await stealth_async(page)

    # Block heavy resources to save proxy bandwidth and speed up loads
    async def block_resources(route, request):
        if request.resource_type in ["image", "media", "font"]:
            await route.abort()
        else:
            await route.continue_()

    await page.route("**/*", block_resources)
    await page.goto(url, wait_until="domcontentloaded", timeout=60000)

Blocking images and fonts reduces proxy load while still allowing JavaScript to execute and populate the data.

The DataPipeline Class: Handling Scale

One of the most effective parts of this architecture is the DataPipeline class. Beginners often store scraped data in a list and write it to a JSON file at the end. This is risky: if the script crashes at item 999 of 1,000, you lose everything.

The DataPipeline avoids this by writing JSON Lines (JSONL) incrementally, appending each record to disk as soon as it is scraped.

import json
import logging
from dataclasses import asdict

logger = logging.getLogger(__name__)

class DataPipeline:
    def __init__(self, jsonl_filename="output.jsonl"):
        self.items_seen = set()
        self.jsonl_filename = jsonl_filename

    def is_duplicate(self, input_data: ScrapedData):
        item_key = input_data.url
        if item_key in self.items_seen:
            logger.warning(f"Duplicate item found: {item_key}. Skipping.")
            return True
        self.items_seen.add(item_key)
        return False

    def add_data(self, scraped_data: ScrapedData):
        if not self.is_duplicate(scraped_data):
            # Append mode ('a') ensures we don't lose data if the script restarts
            with open(self.jsonl_filename, mode="a", encoding="UTF-8") as output_file:
                json_line = json.dumps(asdict(scraped_data), ensure_ascii=False)
                output_file.write(json_line + "\n")
            logger.info(f"Saved item to {self.jsonl_filename}")

This approach provides three main benefits:

  • Memory Efficiency: Writing line-by-line means the script doesn't keep the entire dataset in RAM.
  • Resume Capability: If the scraper stops, the .jsonl file contains all data collected up to that moment.
  • Deduplication: The items_seen set prevents saving the same product twice if the crawler hits a circular link.

Intelligent Extraction and Fallback Strategies

The extract_data function doesn't just look for a CSS class; it uses a multi-tiered strategy to remain resilient against website updates.

Strategy 1: JSON-LD

Most modern e-commerce sites embed JSON-LD (JavaScript Object Notation for Linked Data) for SEO. This structured data sits in a <script> tag and is highly reliable because it follows the Schema.org vocabulary.

json_ld_scripts = await page.locator("script[type='application/ld+json']").all_text_contents()
json_data = None
for script in json_ld_scripts:
    try:
        data = json.loads(script)
        if isinstance(data, dict) and data.get("@type", "").lower() == "product":
            json_data = data
            break
    except Exception:
        continue
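One caveat worth noting: in real JSON-LD payloads, `@type` can be a list (e.g. `["Product"]`), and the product node is sometimes nested inside an `@graph` array, which the simple check above would miss. A more defensive lookup — `find_product` is a helper name introduced here for illustration:

```python
import json

def find_product(json_ld_text: str):
    """Return the first Schema.org Product node in a JSON-LD payload, if any."""
    try:
        data = json.loads(json_ld_text)
    except json.JSONDecodeError:
        return None
    # Normalize the three common shapes: bare list, @graph wrapper, single node
    nodes = data if isinstance(data, list) else data.get("@graph", [data])
    for node in nodes:
        if isinstance(node, dict):
            t = node.get("@type", "")
            types = t if isinstance(t, list) else [t]
            if any(str(x).lower() == "product" for x in types):
                return node
    return None
```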

Strategy 2: CSS Fallbacks

If JSON-LD is missing or incomplete, the script falls back to DOM scraping.

if not brand:
    brand_el = page.locator(".product-brand").first
    brand = (await brand_el.inner_text()).strip() if await brand_el.count() > 0 else ""

By prioritizing invisible data (JSON-LD) over the visible UI, the scraper survives even if the website theme changes entirely.
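That ordering — structured data first, DOM selectors second — can be expressed as a tiny helper that picks the first strategy to return a usable value (`first_non_empty` is a name introduced here, not from the repo):

```python
def first_non_empty(*candidates):
    """Return the first non-blank string from ordered extraction results."""
    for value in candidates:
        if value and value.strip():
            return value.strip()
    return ""
```

Usage would look like `brand = first_non_empty(json_ld_brand, css_brand, meta_brand)`, keeping the priority order explicit in one place.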

Concurrency and Error Handling

The architecture is built on Python's asyncio, which makes the I/O non-blocking: while one request waits for a proxy response, the event loop processes data from another page.

The script wraps execution in try/except blocks and uses the logging module rather than print statements. This is essential for production, as it allows you to pipe logs to a file or a monitoring service.
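The retry-with-logging pattern can be sketched as a small async wrapper — `with_retries` is an illustrative helper, not the repository's exact implementation:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def with_retries(coro_factory, retries: int = 3, delay: float = 1.0):
    """Retry an async operation, logging each failure before re-raising the last."""
    for attempt in range(1, retries + 1):
        try:
            return await coro_factory()
        except Exception as exc:
            logger.warning("Attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt == retries:
                raise
            # Linear backoff between attempts; exponential is a common variant
            await asyncio.sleep(delay * attempt)
```

A call site might read `await with_retries(lambda: scrape_page(browser, url, pipeline))`, keeping the retry policy out of the scraping logic itself.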

async def main():
    tasks = [scrape_page(browser, url, pipeline) for url in urls]
    await asyncio.gather(*tasks) # Run multiple scrapes concurrently
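One practical caution: a bare `asyncio.gather` over a long URL list launches every scrape at once, which can exhaust proxy bandwidth or trip rate limits. A semaphore caps concurrency — `bounded_gather` below is a sketch of the pattern, not code from the repo:

```python
import asyncio

async def bounded_gather(coros, limit: int = 5):
    """Run coroutines concurrently, but never more than `limit` at once."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:  # blocks when `limit` tasks are already in flight
            return await coro
    return await asyncio.gather(*(run(c) for c in coros))
```

Swapping `asyncio.gather(*tasks)` for `bounded_gather(tasks, limit=5)` keeps throughput high while staying under the proxy's concurrent-connection ceiling.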

To Wrap Up

This Beautylish scraper demonstrates that AI can implement design patterns that ensure data integrity and stealth at scale.

Key Takeaways:

  • Stealth is mandatory: Use plugins like playwright-stealth and residential proxies to avoid detection.
  • JSONL over JSON: Use streamable formats to protect data from crashes and minimize RAM usage.
  • Extract structure, not style: Prioritize JSON-LD and Schema.org data over brittle CSS selectors.
  • Deduplicate at the source: Use a DataPipeline class to manage state and prevent duplicate records.

To see these patterns in action, you can clone the full repository:

git clone https://github.com/scraper-bank/Beautylish.com-Scrapers.git

Use this architecture as a template for your next project. It solves the common problems of scraping—blocking, duplicates, and storage—right from the start.
