Yukendiran Jayachandiran

Posted on Feb 24

Stop Writing CSS Selectors That Break - Extract Web Data with Plain English Using AI

#ai #python #tutorial #webdev

If you have ever scraped websites, you know the pain. You spend hours crafting the perfect CSS selectors, XPath expressions, or regex patterns. Everything works beautifully... until the website updates its HTML structure. Then your entire scraper breaks overnight.

I got tired of this cycle. So I built something different.

The Problem: Fragile Selectors

Here is a typical web scraping scenario. You want to extract product data from an e-commerce site:

# The traditional approach - brittle CSS selectors
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example-store.com/products")
soup = BeautifulSoup(response.text, "html.parser")

products = []
for item in soup.select("div.product-card__wrapper > div.content"):
    name = item.select_one("h3.product-title__text span.name")
    price = item.select_one("div.price-container span.current-price")
    rating = item.select_one("div.reviews-wrapper span.avg-rating")

    products.append({
        "name": name.text.strip() if name else None,
        "price": price.text.strip() if price else None,
        "rating": rating.text.strip() if rating else None,
    })

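This works today. But what happens when the site changes div.product-card__wrapper to div.product-item? Or wraps the price in a different element? With the None checks above, your scraper silently returns empty or missing data. Without them, it crashes with an AttributeError the first time select_one() finds nothing.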
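Even if you keep the selectors, you can at least make this failure loud. Here is a minimal sketch of a sanity check that runs on the scraped records. The function names and the 20% threshold are illustrative assumptions, not part of BeautifulSoup or any other library:

```python
REQUIRED_FIELDS = ("name", "price", "rating")


def missing_field_rates(records, fields=REQUIRED_FIELDS):
    """Return, per field, the fraction of records where the value is missing or empty."""
    records = list(records)
    if not records:
        # Zero records usually means the outer container selector stopped matching.
        return {field: 1.0 for field in fields}
    return {
        field: sum(1 for record in records if not record.get(field)) / len(records)
        for field in fields
    }


def assert_selectors_healthy(records, max_missing=0.2):
    """Raise if any field is missing in more than max_missing of the records."""
    rates = missing_field_rates(records)
    stale = {field: rate for field, rate in rates.items() if rate > max_missing}
    if stale:
        raise RuntimeError(f"Selectors look stale, missing-field rates: {stale}")
```

Calling assert_selectors_healthy(products) right after the loop turns a quiet run of None values into an error you can alert on. It does not fix the selectors, it only tells you sooner that they broke.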

The maintenance burden is real:

E-commerce sites update layouts every 2-4 weeks
News sites redesign quarterly
Dynamic SPAs can change DOM structure on every deploy, for example when class names are generated at build time
A/B tests create inconsistent page structures

The Solution: Tell the AI What You Want in Plain English

What if instead of specifying HOW to find data (CSS selectors), you just described WHAT data you want? That is the core idea behind AI-powered extraction.

Here is the same task, but using an AI extraction approach:

import requests

response = requests.post(
    "https://api.lucidextractor.liceron.in/api/v1/scrape",
    json={
        "url": "https://example-store.com/products",
        "ai_prompt": "Extract all products with their name, price, and customer rating",
        "output_format": "json"
    },
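    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=60,  # requests has no default timeout, so set one explicitly
)
response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError below

products = response.json()["data"]
# Returns clean, structured JSON - no selectors needed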

That is it. No CSS selectors. No XPath. No regex. Just plain English describing what you want.

How Does It Work Under The Hood?

The AI extraction pipeline works in several stages:

1. Intelligent Page Rendering
The system uses a headless browser (not just HTTP requests) to fully render JavaScript-heavy pages. This means SPAs, infinite scroll pages, and dynamically loaded content all work out of the box.

2. Content Understanding
Instead of pattern-matching HTML tags, a large language model (in this case, Gemini 2.5 Flash) reads and understands the actual page content - just like a human would. It identifies the semantic meaning of elements, not just their CSS classes.

3. Structured Extraction
Based on your natural language prompt, the AI extracts exactly what you asked for and returns it as clean, structured JSON. The output format is consistent even if the underlying HTML structure changes.
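To make the three stages concrete, here is a rough sketch of how such a pipeline could be wired together. This is not LucidExtractor's actual implementation. The names build_prompt, parse_model_reply, and extract are hypothetical, and the renderer and the model call are passed in as plain functions so the sketch stays independent of any particular browser or LLM client:

```python
import json
from typing import Callable


def build_prompt(user_request: str, page_text: str) -> str:
    """Combine the plain-English request with the rendered page content."""
    return (
        "Extract the requested data from the page below. "
        "Reply with a JSON array of objects and nothing else.\n"
        f"Request: {user_request}\n"
        f"Page content:\n{page_text}"
    )


def parse_model_reply(reply: str) -> list:
    """Pull the JSON array out of the model's reply and validate its shape."""
    # Models sometimes add text around the JSON, so slice from the first "[" to the last "]".
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("model reply did not contain a JSON array")
    records = json.loads(reply[start:end + 1])
    if not all(isinstance(record, dict) for record in records):
        raise ValueError("expected every extracted record to be a JSON object")
    return records


def extract(url: str, user_request: str,
            render: Callable[[str], str],
            ask_model: Callable[[str], str]) -> list:
    """Run the three stages: render, understand, return structured records."""
    page_text = render(url)                                    # 1. page rendering
    reply = ask_model(build_prompt(user_request, page_text))   # 2. content understanding
    return parse_model_reply(reply)                            # 3. structured extraction
```

Notice that nothing in this sketch mentions a CSS class or tag name. The only contract is the shape of the JSON that comes back, which is why the output can stay stable when the markup changes.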

Real-World Example: Extracting Job Listings

Let me walk through a practical example. Say you want to extract job listings from multiple job boards with different HTML structures.

Traditional approach - you need different selectors for each site:

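# Each block assumes soup is the parsed results page for that site
# (built with BeautifulSoup as in the first example).

# Indeed - one set of selectors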
jobs_indeed = soup.select("div.job_seen_beacon")
for job in jobs_indeed:
    title = job.select_one("h2.jobTitle span")
    company = job.select_one("span.companyName")

# LinkedIn - completely different selectors  
jobs_linkedin = soup.select("div.base-card")
for job in jobs_linkedin:
    title = job.select_one("h3.base-search-card__title")
    company = job.select_one("h4.base-search-card__subtitle")

# Glassdoor - yet another structure
jobs_glassdoor = soup.select("li.react-job-listing")
# ... you get the idea

AI extraction approach - same prompt works everywhere:

import requests

sites = [
    "https://indeed.com/jobs?q=python+developer",
    "https://linkedin.com/jobs/search?keywords=python",
    "https://glassdoor.com/Job/python-developer-jobs.htm"
]

jobs_by_site = {}
for site in sites:
    response = requests.post(
        "https://api.lucidextractor.liceron.in/api/v1/scrape",
        json={
            "url": site,
            "ai_prompt": "Extract job listings with: job title, company name, location, salary if available, and posting date",
            "output_format": "json"
        },
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        timeout=60,  # requests has no default timeout, so set one explicitly
    )
    response.raise_for_status()
    # Same structured output regardless of source site
    jobs_by_site[site] = response.json()["data"]

The AI understands that a "job title" is a job title whether it is in an <h2>, an <h3>, a <span>, or a <div>. It extracts based on meaning, not markup.
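Because every site comes back in the same shape, the results can be merged into one list. Here is a minimal sketch, assuming a hypothetical fetch(site) function that wraps the requests.post call above and returns the list found under "data". The name collect_records and the "source" key are illustrative choices as well:

```python
def collect_records(sites, fetch):
    """Run fetch(site) for each site and return (records, errors)."""
    records, errors = [], {}
    for site in sites:
        try:
            site_records = fetch(site)
        except Exception as exc:  # one failing site should not abort the whole run
            errors[site] = str(exc)
            continue
        # Tag each record with its source so merged results stay traceable.
        records.extend({**record, "source": site} for record in site_records)
    return records, errors
```

Calling collect_records(sites, fetch) gives you one combined list plus a dict of the sites that failed, so a single timeout does not cost you the listings from the other boards.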

When Should You Use AI Extraction vs Traditional Scraping?

AI extraction is not always the right choice. Here is a comparison to help you choose:

Use AI Extraction when:

You are scraping multiple sites with different structures
The target site changes layout frequently
You need to extract unstructured or semi-structured content
Speed of development matters more than per-request cost
You want to extract data from content-heavy pages (articles, reviews, listings)

Use Traditional Selectors when:

You are scraping a single stable API or site
You need sub-second response times at massive scale
The data structure is extremely consistent (like XML feeds)
You have strict budget constraints at millions of requests/day

Performance and Cost

A fair question is: does the AI overhead make this impractical?

In my testing with LucidExtractor (the tool I built around this concept):

Average extraction time: 5-15 seconds per page (including full rendering)
Cost per extraction: roughly $0.003-0.01 depending on page complexity
Accuracy: 90-95% on first attempt for most websites
Zero maintenance time when sites update their HTML

Compare that to the traditional approach, where you might spend 2-4 hours fixing broken selectors every month per site. The math works out quickly in favor of AI extraction for most use cases.
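To see where the break-even sits for your own workload, here is a back-of-the-envelope sketch. The per-page prices and maintenance hours are the figures quoted above. The $50 hourly rate is purely an assumption, so plug in your own number:

```python
def monthly_api_cost(pages_per_day, cost_per_page, days=30):
    """API spend per month at a fixed per-page price."""
    return pages_per_day * cost_per_page * days


def monthly_maintenance_cost(sites, hours_per_site, hourly_rate):
    """Developer time spent fixing selectors, expressed in dollars per month."""
    return sites * hours_per_site * hourly_rate


def break_even_pages_per_day(sites, hours_per_site, hourly_rate, cost_per_page, days=30):
    """Daily page volume at which API spend equals selector maintenance cost."""
    maintenance = monthly_maintenance_cost(sites, hours_per_site, hourly_rate)
    return maintenance / (cost_per_page * days)


if __name__ == "__main__":
    HOURLY_RATE = 50  # assumed developer cost in dollars per hour
    for cost_per_page in (0.003, 0.01):
        pages = break_even_pages_per_day(1, 2, HOURLY_RATE, cost_per_page)
        print(f"${cost_per_page}/page: break-even near {pages:.0f} pages/day per site")
```

With 2 hours of maintenance per site per month at the assumed $50/hour, the break-even lands at roughly 333 pages per day at $0.01 per page and roughly 1,111 pages per day at $0.003. Below that volume the API is cheaper than the developer time. At a million pages a day, even the $0.003 price comes to about $3,000 per day, which is where traditional selectors earn their keep.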

Getting Started

If you want to try AI-powered extraction, LucidExtractor offers 500 free credits to get started. Here is a quick setup:

1. Get your API key at lucidextractor.liceron.in

2. Make your first extraction:

import requests

result = requests.post(
    "https://api.lucidextractor.liceron.in/api/v1/scrape",
    json={
        "url": "https://news.ycombinator.com",
        "ai_prompt": "Extract the top 10 stories with title, points, author, and comment count",
        "output_format": "json"
    },
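    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=60,  # requests has no default timeout, so set one explicitly
)
result.raise_for_status()  # stop here on HTTP errors rather than failing on the "data" lookup

for story in result.json()["data"]:
    print(f"{story['title']} - {story['points']} points by {story['author']}")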

3. Explore advanced features:

Batch scraping for multiple URLs
Scheduled extractions
Custom output schemas
Screenshot capture alongside data

The Future of Web Scraping

I believe we are at an inflection point in web scraping. In the same way that high-level programming languages replaced assembly for most tasks, AI-powered extraction will replace manual selector writing for most scraping use cases.

The websites of 2025 are more complex than ever - SPAs, shadow DOMs, web components, dynamic rendering. Fighting this complexity with increasingly fragile selectors is a losing battle.

Instead, let the AI understand the page the way a human would, and just tell it what you need.

What do you think? Have you tried AI-powered extraction in your projects? I would love to hear about your experiences in the comments.

LucidExtractor is open for early access at lucidextractor.liceron.in - 500 free credits, no credit card required.
