AlterLab

Posted on • Originally published at alterlab.io
Extract Structured Data from Websites Using AI Instead of CSS Selectors

## The Problem with CSS Selectors

You write a scraper targeting .product-price .amount. It works. Two weeks later, the site ships a redesign and your selector returns null. You inspect the DOM, find the new class, patch your code, and move on. This repeats every few months for every site you scrape.

CSS selectors couple your extraction logic to implementation details you do not control. Class names change. DOM structures shift. A/B tests swap element order. Each change breaks your pipeline silently until you notice missing data downstream.

AI extraction removes this coupling. You describe the data you want in plain text. The model reads the page, understands the semantic structure, and returns clean JSON. No selectors to maintain. No DOM inspection when layouts change.

## How AI Extraction Works

The process has three steps:

  1. Fetch the page content (rendered, with JavaScript executed)
  2. Pass the content and your extraction schema to a language model
  3. Return structured JSON matching your schema

The model does not guess. It reads the actual rendered DOM, identifies elements matching your description, and extracts their values. If a product page has a price, name, and rating, you describe those fields and get them back as typed JSON.
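The three steps above can be sketched as plain functions. This is a conceptual sketch only: `render_page` and `ask_model` are stand-in stubs for the platform's headless browser and model call, not real internals.

```python
import json

def render_page(url):
    # Step 1 stand-in: a real pipeline returns the DOM after JavaScript runs.
    return "<h1>Sony WH-1000XM5</h1><span>$348.00</span>"

def ask_model(html, prompt):
    # Step 2 stand-in: a real model reads the rendered DOM and the prompt.
    return '{"product_name": "Sony WH-1000XM5", "price": 348.0}'

def extract(url, prompt):
    html = render_page(url)        # 1. fetch rendered content
    raw = ask_model(html, prompt)  # 2. model + extraction prompt
    return json.loads(raw)         # 3. structured JSON out

data = extract("https://example.com/p/1", "Extract product_name and price")
```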

## Setting Up

Install the Python SDK:

```bash title="Terminal"
pip install alterlab
```
Or use the REST API directly with curl. Both approaches are covered below. You will need an API key from your [dashboard](https://alterlab.io/signup).

## Example: Extracting Product Data

Here is a product page on an e-commerce site. You need the product name, price, rating, and number of reviews. With CSS selectors, you would inspect the DOM, write four selectors, and hope they survive the next deploy.

With AI extraction, you describe the fields:



```python title="extract_product.py" {5-12}
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://example-store.com/products/wireless-headphones",
    formats=["json"],
    cortex={
        "prompt": "Extract: product_name (string), price (float), rating (float out of 5), review_count (integer)"
    }
)

data = response.json["cortex"]
print(data)
```

Output:

```json title="response.json"
{
  "product_name": "Sony WH-1000XM5 Wireless Headphones",
  "price": 348.00,
  "rating": 4.7,
  "review_count": 2841
}
```

The same request via curl:



```bash title="Terminal" {4-7}
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-store.com/products/wireless-headphones",
    "formats": ["json"],
    "cortex": {
      "prompt": "Extract: product_name (string), price (float), rating (float out of 5), review_count (integer)"
    }
  }'
```

## Structured Schemas with JSON Schema

For production pipelines, you want type guarantees. Pass a JSON Schema instead of a plain text prompt. The model validates its output against your schema before returning it.

```python title="extract_with_schema.py" {8-25}
import alterlab

client = alterlab.Client("YOUR_API_KEY")

schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "in_stock": {"type": "boolean"},
                    "sku": {"type": "string"}
                },
                "required": ["name", "price", "in_stock"]
            }
        }
    }
}

response = client.scrape(
    url="https://example-store.com/category/electronics",
    formats=["json"],
    cortex={"prompt": "Extract all products from this category page", "schema": schema}
)

for product in response.json["cortex"]["products"]:
    print(f"{product['name']}: ${product['price']}")
```
This returns an array of products with typed fields. Missing optional fields are omitted. Required fields are always present. If the model cannot confidently extract a required field, it returns an error you can handle in your pipeline.
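If you want to re-check those guarantees client-side (for example after deserializing stored results), a minimal validator for the item schema above might look like this. `check_item` is an illustrative helper, not part of the SDK:

```python
# Map JSON Schema type names to Python types (subset used by the schema above)
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}

def check_item(item, item_schema):
    """Minimal re-validation: required fields present, declared types match."""
    for field in item_schema.get("required", []):
        if field not in item:
            raise ValueError(f"missing required field: {field}")
    for field, spec in item_schema["properties"].items():
        if field in item and not isinstance(item[field], TYPE_MAP[spec["type"]]):
            raise ValueError(f"field {field} is not a {spec['type']}")
    return True

item_schema = {
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
        "sku": {"type": "string"},
    },
    "required": ["name", "price", "in_stock"],
}

ok = check_item({"name": "USB-C Hub", "price": 29.99, "in_stock": True}, item_schema)
```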

## Handling Dynamic Content

Many sites load data client-side. A product listing might render empty HTML, then populate via JavaScript fetches. Traditional scrapers that only fetch raw HTML get nothing back.

AI extraction requires the rendered DOM. The platform handles this automatically: it launches a headless browser, waits for the page to stabilize, then passes the rendered content to the model. You do not need to configure wait times or detect network idle.

For sites with aggressive bot detection, the [anti-bot bypass](https://alterlab.io/anti-bot-bypass-api) layer handles fingerprint rotation, TLS fingerprint matching, and challenge solving before the page ever reaches the extraction step.

## When to Use AI Extraction vs CSS Selectors

AI extraction is not a replacement for every scraping pattern. It is a tool for specific scenarios.

<div data-infographic="comparison">
  <table>
    <thead><tr><th>Criteria</th><th>AI Extraction</th><th>CSS Selectors</th></tr></thead>
    <tbody>
      <tr><td>Setup time</td><td>Seconds &mdash; describe fields in text</td><td>Minutes &mdash; inspect DOM, write selectors</td></tr>
      <tr><td>Maintenance</td><td>None &mdash; model adapts to layout changes</td><td>Ongoing &mdash; selectors break on redesign</td></tr>
      <tr><td>Cost per request</td><td>Higher &mdash; includes model inference</td><td>Lower &mdash; raw extraction only</td></tr>
      <tr><td>Type safety</td><td>Strong &mdash; JSON Schema validation</td><td>Manual &mdash; you parse and validate</td></tr>
      <tr><td>Best for</td><td>Dynamic pages, complex layouts, prototyping</td><td>Stable pages, high volume, simple structures</td></tr>
    </tbody>
  </table>
</div>

Use AI extraction when:
- The site changes its layout frequently
- You are prototyping and need data fast
- The page structure is complex or inconsistent
- You need to extract from many different sites with one pipeline

Use CSS selectors when:
- The page structure is stable and predictable
- You are scraping at very high volume and cost matters
- You need sub-second response times
- The data is in simple, consistent locations

You can mix both approaches in the same pipeline. Use AI extraction for complex pages and selectors for stable ones. The [Python SDK](https://alterlab.io/web-scraping-api-python) supports both patterns with the same client interface.
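A minimal sketch of that hybrid pattern. The selector regex and the canned AI fallback are illustrative stand-ins for a real selector library and a `client.scrape` call:

```python
import re

def price_from_selector(html):
    """Fast path: brittle, tied to the current markup."""
    m = re.search(r'class="product-price[^"]*">\$?([\d.]+)<', html)
    return float(m.group(1)) if m else None

def price_from_ai(html):
    """Slow path stub: in practice this calls the AI extraction endpoint."""
    return 19.99  # canned value for illustration

def get_price(html):
    price = price_from_selector(html)
    return price if price is not None else price_from_ai(html)

stable = '<span class="product-price amount">$19.99</span>'
redesigned = '<span class="pdp-cost">$19.99</span>'  # selector breaks here
```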

## Real-World Pattern: Monitoring Competitor Prices

Here is a practical pipeline that combines scheduling with AI extraction. You want to track prices for a list of competitor products daily.



```python title="price_monitor.py" {10-18}
import alterlab

client = alterlab.Client("YOUR_API_KEY")

competitors = [
    {"url": "https://competitor-a.com/product/123", "name": "Competitor A"},
    {"url": "https://competitor-b.com/p/abc", "name": "Competitor B"},
]

for competitor in competitors:
    response = client.scrape(
        url=competitor["url"],
        formats=["json"],
        cortex={
            "prompt": "Extract: product_name (string), price (float), availability (string)"
        }
    )

    data = response.json["cortex"]
    print(f"{competitor['name']}: {data['product_name']} @ ${data['price']} - {data['availability']}")
```

Wrap this in a scheduled job and store results in your database. When prices change, your pipeline detects the delta automatically. The monitoring feature can also handle this natively by watching pages for content changes and pushing diffs to your webhook endpoint.
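The delta check itself is small. A sketch using an in-memory dict where production code would use a database table:

```python
last_seen = {}  # product name -> last recorded price

def record_price(name, price):
    """Store the latest price and return the change since the previous run."""
    previous = last_seen.get(name)
    last_seen[name] = price
    if previous is None or previous == price:
        return None
    return round(price - previous, 2)

first = record_price("Competitor A", 348.00)  # no history yet
delta = record_price("Competitor A", 329.00)  # price dropped
```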

## Error Handling

AI extraction can fail when the page does not contain the requested data, the model cannot parse the structure, or the schema validation fails. Handle these cases explicitly:

```python title="error_handling.py" {12-18}
import alterlab

client = alterlab.Client("YOUR_API_KEY")

try:
    response = client.scrape(
        url="https://example.com/page",
        formats=["json"],
        cortex={"prompt": "Extract: email (string), phone (string)"}
    )

    if "error" in response.json.get("cortex", {}):
        print(f"Extraction failed: {response.json['cortex']['error']}")
    else:
        print(response.json["cortex"])
except alterlab.APIError as e:
    print(f"API error: {e.status_code} - {e.message}")
```
Common errors include pages that require authentication, content behind CAPTCHAs that exceed your tier, and schemas with impossible constraints. The API returns structured error messages so you can retry, adjust your prompt, or skip the page.
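Retries are worth wrapping once. A generic backoff helper: the `fetch` callable and the exception type are stand-ins, since the actual failure mode would be an `alterlab.APIError` from a `client.scrape` call:

```python
import time

def with_retry(fetch, retries=3, base_delay=1.0):
    """Call fetch(), backing off exponentially on transient failures."""
    for attempt in range(retries):
        try:
            return fetch()
        except RuntimeError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)

calls = {"n": 0}

def flaky_fetch():
    """Simulated scrape that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return {"email": "sales@example.com"}

result = with_retry(flaky_fetch, retries=3, base_delay=0)
```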

## Performance Considerations

AI extraction adds latency compared to raw HTML fetching. A typical request takes 3-8 seconds depending on page complexity and model load. For most pipelines, this is acceptable. Price monitoring, lead generation, and market research do not require sub-second responses.

If you need speed, use a two-tier approach:
1. Fetch raw HTML with a basic tier (fast, cheap)
2. Only escalate to AI extraction when the raw response is insufficient

Set `min_tier` in your request to skip lower tiers for known-difficult sites. This avoids the retry loop and gets you to the rendering tier on the first attempt.
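The escalation logic itself can live in one small function. The fetchers below are stubs standing in for a raw-HTML tier call and an AI extraction call:

```python
def two_tier_fetch(url, fetch_raw, fetch_ai, is_sufficient):
    """Try the cheap raw fetch first; escalate to AI extraction only if needed."""
    raw = fetch_raw(url)
    if is_sufficient(raw):
        return raw
    return fetch_ai(url)

# Stub tiers for illustration
def fetch_raw(url):
    return ""  # empty shell: page renders client-side

def fetch_ai(url):
    return {"price": 348.0}  # rendered and extracted

def is_sufficient(raw):
    return bool(raw)

result = two_tier_fetch("https://example.com/p/1", fetch_raw, fetch_ai, is_sufficient)
```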

Check the [pricing page](https://alterlab.io/pricing) for current tier costs and rate limits.

## Takeaway

CSS selectors tie your scraping logic to markup you do not control. AI extraction breaks that dependency. Describe the data you need, get back typed JSON, and stop maintaining selectors every time a site redesigns.

Use AI extraction for dynamic pages, prototyping, and multi-site pipelines. Use selectors for stable, high-volume targets. Mix both in the same pipeline based on each site's characteristics.

The [quickstart guide](https://alterlab.io/docs/quickstart/installation) covers installation and your first request in under five minutes.
