DEV Community

Cover image for How to Scrape E-Commerce Sites for AI Agents Using Playwright and LLMs
AlterLab
AlterLab

Posted on • Originally published at alterlab.io

How to Scrape E-Commerce Sites for AI Agents Using Playwright and LLMs

TL;DR

AI agents require structured JSON data (prices, specifications, availability), but modern e-commerce sites serve heavily obfuscated, JavaScript-rendered HTML. To bridge this gap, modern scraping pipelines use headless browsers like Playwright to execute JavaScript and normalize browser fingerprints, combined with LLMs to extract schema-validated JSON directly from the rendered DOM. This approach eliminates brittle CSS selectors and scales across diverse retail layouts.

The AI Agent Data Bottleneck

Autonomous agents and LLM-powered applications rely on real-time external data. When an AI agent needs to analyze market trends, compare product specifications, or track inventory, it cannot parse raw, minified HTML effectively. Traditional rules-based web scraping relies heavily on XPath or CSS selectors to parse this HTML.

The problem is that retail engineering teams constantly deploy A/B tests, obfuscate class names using CSS-in-JS frameworks, and alter page structures. A pipeline relying on soup.select('.price-tag-v2') will inevitably fail.

To build a robust data ingestion pipeline for AI agents, you need two distinct layers:

  1. The Rendering Layer: A headless browser configuration capable of executing React/Vue applications and returning the final, hydrated DOM.
  2. The Extraction Layer: An LLM configured to read the hydrated DOM and map the unstructured text into a deterministic JSON schema.

Handling JavaScript Rendering and Fingerprinting

Standard HTTP clients like the Python requests library or Go's net/http only retrieve the initial HTML payload. For modern retail sites, this payload is often just an empty <div id="root"></div> waiting for JavaScript to fetch and render the actual product data.

Headless browsers solve the rendering issue, but they introduce a new problem: fingerprinting. Headless Chrome leaks its automated nature through dozens of browser APIs. For instance, the navigator.webdriver property is set to true by default in headless mode.

To reliably access public e-commerce data without being blocked by automated security challenges, you must implement stealth techniques. This involves patching the browser environment before the page loads.

Implementing Playwright Stealth Locally

If you are managing your own scraping infrastructure, you need to configure Playwright to mask its default fingerprint. The Python playwright-stealth package applies common evasions, such as overriding the webdriver property, mocking the languages array, and normalizing WebGL vendor strings.

```python title="local_renderer.py" {8-10,13}

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def render_page(url: str):
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
# Apply stealth patches to a new browser context
context = await browser.new_context(
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36..."
)
page = await context.new_page()
await stealth_async(page)

    # Navigate and wait for network idle to ensure JS executes
    await page.goto(url, wait_until="networkidle")
    html = await page.content()

    await browser.close()
    return html
Enter fullscreen mode Exit fullscreen mode

if name == "main":
asyncio.run(render_page("https://shop.example.com/product/123"))




While this local approach works for small-scale operations, maintaining these evasion scripts is a full-time engineering effort. Browser fingerprinting techniques evolve weekly. 

### Scaling with Managed Infrastructure

When deploying AI agents to production, running clusters of Playwright instances becomes a massive resource drain. Memory consumption spikes, and IP addresses get rate-limited. 

Rather than maintaining your own browser cluster, you can offload this to an API that handles the headless rendering and proxy rotation automatically. Utilizing a dedicated [anti-bot handling](https://alterlab.io/smart-rendering-api) layer allows your pipeline to focus strictly on data extraction.

Here is how you achieve the same result using the [Python SDK](https://alterlab.io/web-scraping-api-python) to handle the rendering infrastructure server-side:



```python title="managed_renderer.py" {4-6}

client = alterlab.Client("YOUR_API_KEY")

# The API automatically handles headless rendering and proxy rotation
response = client.scrape(
    url="https://shop.example.com/product/123",
    render_js=True
)

html_content = response.text
print(f"Retrieved {len(html_content)} bytes of rendered HTML.")
Enter fullscreen mode Exit fullscreen mode

LLM-Powered JSON Extraction

Once you possess the fully hydrated HTML, the next step is extracting the data. Passing raw HTML to an LLM is inefficient. A typical e-commerce product page can contain 500,000 characters of HTML, heavily bloated with inline SVG icons, analytics scripts, and CSS styling. This consumes massive amounts of context window tokens and increases latency.

Before extraction, the DOM must be sanitized. You should strip out <script>, <style>, <svg>, and <path> tags. You only care about the semantic HTML containing text nodes and relevant attributes like href or src.

After sanitizing the payload, you instruct the LLM to act as a structured data extractor. You provide a rigid JSON schema defining the exact fields your AI agent expects.

Defining the Extraction Schema

Your AI agent requires deterministic keys. If the agent expects current_price as a float, the LLM must not return "$49.99" as a string. You define these constraints using standard JSON Schema definitions.

```json title="schema.json"
{
"name": "ecommerce_product",
"description": "Extract product details from the page.",
"parameters": {
"type": "object",
"properties": {
"product_name": { "type": "string" },
"current_price": { "type": "number", "description": "Numeric price only" },
"in_stock": { "type": "boolean" },
"specifications": {
"type": "object",
"additionalProperties": { "type": "string" }
}
},
"required": ["product_name", "current_price", "in_stock"]
}
}




### Executing the AI Extraction

Instead of building a separate microservice to sanitize HTML and call OpenAI or Anthropic, you can use built-in Cortex AI extraction capabilities. You pass the target URL and your JSON schema in a single request. The platform renders the page, sanitizes the DOM, executes the LLM extraction, and returns only the validated JSON.



```bash title="Terminal" {6-17}
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com/product/123",
    "extract": {
      "schema": {
        "product_name": "string",
        "current_price": "number",
        "currency": "string",
        "in_stock": "boolean",
        "features": ["string"]
      },
      "system_prompt": "Extract the core product details. Convert prices to float."
    }
  }'
Enter fullscreen mode Exit fullscreen mode

The response payload strips away all the rendering complexity and delivers exactly what your agent needs:

```json title="response.json"
{
"data": {
"product_name": "Wireless Mechanical Keyboard v2",
"current_price": 149.99,
"currency": "USD",
"in_stock": true,
"features": [
"Hot-swappable switches",
"Bluetooth 5.1",
"Aluminum frame"
]
},
"metadata": {
"tokens_used": 4120,
"latency_ms": 2450
}
}




<div data-infographic="try-it" data-url="https://shop.example.com/dp/B09V3KXJPB" data-description="Test schema-based AI extraction on a generic product URL."></div>

## Ethical Data Collection and Resiliency

When operating web scraping pipelines at scale, strict adherence to engineering best practices and ethical guidelines is required. The goal is to collect publicly accessible data without degrading the performance of the target infrastructure.

1. **Respect Concurrency Limits:** Do not flood a single domain with hundreds of concurrent headless browser sessions. Implement token bucket algorithms or distributed queues to enforce strict rate limits per domain.
2. **Implement Jittered Backoff:** When requests fail due to rate limiting (HTTP 429), implement exponential backoff with randomized jitter to prevent thundering herd problems on retries.
3. **Target Public Endpoints Only:** LLM extraction should be restricted to publicly accessible content. Never configure agents to bypass authentication walls or scrape paywalled data.
4. **Cache Aggressively:** E-commerce product details do not change every minute. Implement a caching layer (like Redis) keyed by the product URL and a time-to-live (TTL) of 6 to 24 hours depending on the volatility of the specific category. Check the cache before dispatching a rendering request.

## Takeaways

Building a data ingestion pipeline for AI agents requires moving beyond basic HTTP requests and rigid CSS selectors. By leveraging headless browsers for accurate JavaScript rendering and LLMs for semantic data mapping, you create scraping pipelines that are resilient to UI changes and A/B tests. 

* Use Playwright and stealth configurations to reliably render client-side web applications.
* Sanitize DOM payloads heavily before passing them to LLMs to optimize token usage and latency.
* Enforce strict JSON schemas to ensure your AI agents receive predictable, strongly-typed data structures.

For advanced schema configurations and detailed parameter structures for extraction, consult the [API docs](https://alterlab.io/docs) to optimize your agent's data ingestion capabilities.
Enter fullscreen mode Exit fullscreen mode

Top comments (0)