DEV Community: Yukendiran Jayachandiran

Stop Writing CSS Selectors That Break - Extract Web Data with Plain English Using AI

Yukendiran Jayachandiran — Tue, 24 Feb 2026 23:53:22 +0000

If you have ever scraped websites, you know the pain. You spend hours crafting the perfect CSS selectors, XPath expressions, or regex patterns. Everything works beautifully... until the website updates its HTML structure. Then your entire scraper breaks overnight.

I got tired of this cycle. So I built something different.

The Problem: Fragile Selectors

Here is a typical web scraping scenario. You want to extract product data from an e-commerce site:

# The traditional approach - brittle CSS selectors
import requests
from bs4 import BeautifulSoup

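response = requests.get("https://example-store.com/products", timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.text, "html.parser")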

products = []
for item in soup.select("div.product-card__wrapper > div.content"):
    name = item.select_one("h3.product-title__text span.name")
    price = item.select_one("div.price-container span.current-price")
    rating = item.select_one("div.reviews-wrapper span.avg-rating")

    products.append({
        "name": name.text.strip() if name else None,
        "price": price.text.strip() if price else None,
        "rating": rating.text.strip() if rating else None,
    })

This works today. But what happens when the site changes div.product-card__wrapper to div.product-item? Or wraps the price in a different element? Your scraper silently returns empty data or crashes entirely.
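The silent case is the dangerous one, and it can at least be made loud. Here is a minimal sketch of a sanity check to run on the scraped records before using them. The helper name `check_records` and the 50% threshold are assumptions chosen for illustration, not part of any library.

```python
def check_records(records, required_fields, max_missing_ratio=0.5):
    """Raise ValueError if scraped records look like the result of broken selectors.

    `records` is a list of dicts such as the `products` list above.
    The 0.5 threshold is an arbitrary illustrative default.
    """
    if not records:
        raise ValueError(
            "no records extracted: the container selector probably no longer matches"
        )

    for field in required_fields:
        missing = sum(1 for record in records if not record.get(field))
        ratio = missing / len(records)
        if ratio > max_missing_ratio:
            raise ValueError(
                f"field {field!r} is empty in {missing} of {len(records)} records: "
                "its selector may be stale"
            )
    return records


# Example: the price selector stopped matching after a redesign.
sample = [
    {"name": "Desk Lamp", "price": None, "rating": "4.2"},
    {"name": "Monitor Arm", "price": None, "rating": "4.7"},
]
try:
    check_records(sample, ["name", "price", "rating"])
except ValueError as error:
    print(error)
```

A check like this does not fix the selectors, but it turns a quiet data-quality problem into an error you notice on the first broken run.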

The maintenance burden is real:

E-commerce sites update layouts every 2-4 weeks
News sites redesign quarterly
Dynamic SPAs change DOM structure on every deploy
A/B tests create inconsistent page structures

The Solution: Tell the AI What You Want in Plain English

What if instead of specifying HOW to find data (CSS selectors), you just described WHAT data you want? That is the core idea behind AI-powered extraction.

Here is the same task, but using an AI extraction approach:

import requests

response = requests.post(
    "https://api.lucidextractor.liceron.in/api/v1/scrape",
    json={
        "url": "https://example-store.com/products",
        "ai_prompt": "Extract all products with their name, price, and customer rating",
        "output_format": "json"
    },
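    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=60,  # extraction includes full page rendering, so allow a generous timeout
)
response.raise_for_status()  # surface auth or quota errors before reading the body

products = response.json()["data"]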
# Returns clean, structured JSON - no selectors needed

That is it. No CSS selectors. No XPath. No regex. Just plain English describing what you want.

How Does It Work Under The Hood?

The AI extraction pipeline works in several stages:

1. Intelligent Page Rendering
The system uses a headless browser (not just HTTP requests) to fully render JavaScript-heavy pages. This means SPAs, infinite scroll pages, and dynamically loaded content all work out of the box.

2. Content Understanding
Instead of pattern-matching HTML tags, a large language model (in this case, Gemini 2.5 Flash) reads and understands the actual page content - just like a human would. It identifies the semantic meaning of elements, not just their CSS classes.

3. Structured Extraction
Based on your natural language prompt, the AI extracts exactly what you asked for and returns it as clean, structured JSON. The output format is consistent even if the underlying HTML structure changes.
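To make stages 2 and 3 concrete, here is a rough sketch of the general idea using only the standard library. It is not LucidExtractor's actual implementation. Stage 1 is left out (the function takes HTML that has already been rendered), and `call_llm` is a hypothetical callable standing in for whatever model client you use. The prompt wording is also just an assumption.

```python
import json
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(rendered_html):
    """Reduce markup to the text a reader would see, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(rendered_html)
    parser.close()
    return "\n".join(parser.chunks)


def extract(rendered_html, user_prompt, call_llm):
    """Send page text plus the plain-English request to a model and parse its JSON reply.

    `call_llm` is hypothetical: any function that takes a prompt string
    and returns the model's reply as a string.
    """
    prompt = (
        "Return only a JSON array of objects.\n"
        f"Task: {user_prompt}\n"
        f"Page content:\n{html_to_text(rendered_html)}"
    )
    return json.loads(call_llm(prompt))


def fake_llm(prompt):
    # Stand-in for a real model call so the sketch runs offline.
    return '[{"name": "Desk Lamp", "price": "$24.99"}]'


page = '<div class="card"><h3>Desk Lamp</h3><script>track()</script><span>$24.99</span></div>'
print(html_to_text(page))
print(extract(page, "Extract all products with name and price", fake_llm))
```

Notice that the class names never reach the model. That is why renaming `card` to something else would not change what gets extracted.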

Real-World Example: Extracting Job Listings

Let me walk through a practical example. Say you want to extract job listings from multiple job boards with different HTML structures.

Traditional approach - you need different selectors for each site:

# Indeed - one set of selectors
jobs_indeed = soup.select("div.job_seen_beacon")
for job in jobs_indeed:
    title = job.select_one("h2.jobTitle span")
    company = job.select_one("span.companyName")

# LinkedIn - completely different selectors
jobs_linkedin = soup.select("div.base-card")
for job in jobs_linkedin:
    title = job.select_one("h3.base-search-card__title")
    company = job.select_one("h4.base-search-card__subtitle")

# Glassdoor - yet another structure
jobs_glassdoor = soup.select("li.react-job-listing")
# ... you get the idea

AI extraction approach - same prompt works everywhere:

sites = [
    "https://indeed.com/jobs?q=python+developer",
    "https://linkedin.com/jobs/search?keywords=python",
    "https://glassdoor.com/Job/python-developer-jobs.htm"
]

jobs_by_site = {}  # keep every site's results instead of overwriting them each iteration
for site in sites:
    response = requests.post(
        "https://api.lucidextractor.liceron.in/api/v1/scrape",
        json={
            "url": site,
            "ai_prompt": "Extract job listings with: job title, company name, location, salary if available, and posting date",
            "output_format": "json"
        },
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        timeout=60,
    )
    response.raise_for_status()
    # Same structured output regardless of source site
    jobs_by_site[site] = response.json()["data"]

The AI understands that a "job title" is a job title whether it is in an <h2>, an <h3>, a <span>, or a <div>. It extracts based on meaning, not markup.
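Because the prompt marks salary as optional, records from different boards may not all carry the same keys. Here is a small sketch for merging per-site results into one uniform list. The field names are assumptions, since the keys the API actually returns are not shown here, and `normalize_jobs` is a hypothetical helper.

```python
# Assumed key names; adjust them to whatever the API actually returns.
EXPECTED_FIELDS = ("job_title", "company_name", "location", "salary", "posting_date")


def normalize_jobs(records, source_url, fields=EXPECTED_FIELDS):
    """Give every record the same keys and remember which site it came from."""
    normalized = []
    for record in records:
        row = {field: record.get(field) for field in fields}  # missing keys become None
        row["source"] = source_url
        normalized.append(row)
    return normalized


# Example with two made-up records, one of them without a salary.
raw = [
    {"job_title": "Python Developer", "company_name": "Acme", "location": "Remote",
     "salary": "$120k", "posting_date": "2026-02-20"},
    {"job_title": "Backend Engineer", "company_name": "Globex", "location": "Berlin",
     "posting_date": "2026-02-21"},
]
for row in normalize_jobs(raw, "https://example.com/jobs"):
    print(row)
```

With every row shaped the same way, the combined list can go straight into a CSV writer or a database table.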

When Should You Use AI Extraction vs Traditional Scraping?

AI extraction is not always the right choice. Here is a comparison to help you choose:

Use AI Extraction when:

You are scraping multiple sites with different structures
The target site changes layout frequently
You need to extract unstructured or semi-structured content
Speed of development matters more than per-request cost
You want to extract data from content-heavy pages (articles, reviews, listings)

Use Traditional Selectors when:

You are scraping a single stable API or site
You need sub-second response times at massive scale
The data structure is extremely consistent (like XML feeds)
You have strict budget constraints at millions of requests/day

Performance and Cost

A fair question is: does the AI overhead make this impractical?

In my testing with LucidExtractor (the tool I built around this concept):

Average extraction time: 5-15 seconds per page (including full rendering)
Cost per extraction: roughly $0.003-0.01 depending on page complexity
Accuracy: 90-95% on first attempt for most websites
Zero maintenance time when sites update their HTML

Compare that to the traditional approach where you might spend 2-4 hours fixing broken selectors every month per site. The math works out quickly in favor of AI extraction for most use cases.
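To see where that math tips, here is a back-of-the-envelope sketch. The per-extraction cost and the maintenance hours come from the figures above. The $50 hourly rate is purely an assumption, so plug in your own number.

```python
def monthly_ai_cost(pages_per_month, cost_per_page):
    """Total extraction spend for one month."""
    return pages_per_month * cost_per_page


def monthly_maintenance_cost(hours_per_month, hourly_rate):
    """What selector upkeep costs in developer time for one site."""
    return hours_per_month * hourly_rate


def break_even_pages(hours_per_month, hourly_rate, cost_per_page):
    """Pages per month at which AI extraction spend equals selector upkeep."""
    return monthly_maintenance_cost(hours_per_month, hourly_rate) / cost_per_page


HOURLY_RATE = 50.0  # assumed developer cost in USD, not a figure from this article

# Best case for selectors: 2 hours of upkeep, $0.01 per AI extraction.
low = break_even_pages(2, HOURLY_RATE, 0.01)
# Best case for AI extraction: 4 hours of upkeep, $0.003 per extraction.
high = break_even_pages(4, HOURLY_RATE, 0.003)

print(f"Break-even: roughly {low:,.0f} to {high:,.0f} pages per month per site")
```

Under these assumptions the break-even lands somewhere between about 10,000 and 67,000 pages per month per site. Below that volume the extraction fees are smaller than the upkeep they replace. Far above it, per-request cost dominates, which is the millions-of-requests caveat from the comparison earlier. The sketch ignores initial development time and the cost of fetching pages yourself.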

Getting Started

If you want to try AI-powered extraction, LucidExtractor offers 500 free credits to get started. Here is a quick setup:

1. Get your API key at lucidextractor.liceron.in

2. Make your first extraction:

import requests

result = requests.post(
    "https://api.lucidextractor.liceron.in/api/v1/scrape",
    json={
        "url": "https://news.ycombinator.com",
        "ai_prompt": "Extract the top 10 stories with title, points, author, and comment count",
        "output_format": "json"
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=60,  # rendering plus extraction can take several seconds
)
result.raise_for_status()

for story in result.json()["data"]:
    print(f"{story['title']} - {story['points']} points by {story['author']}")

3. Explore advanced features:

Batch scraping for multiple URLs
Scheduled extractions
Custom output schemas
Screenshot capture alongside data
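The batch endpoint itself is not shown here, so as a stopgap, here is a client-side sketch that runs single-URL extractions in parallel threads. `scrape_one` is a hypothetical function you would write around the `requests.post` call above. The worker count of 5 is an arbitrary default, so check your plan's rate limits before raising it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_many(urls, scrape_one, max_workers=5):
    """Run `scrape_one(url)` for every URL in parallel threads.

    Returns (results, errors), two dicts keyed by URL, so that one
    failing page does not discard the pages that succeeded.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as error:  # keep going if a single URL fails
                errors[url] = error
    return results, errors


def fake_scrape(url):
    # Stand-in for the real API call so the sketch runs offline.
    if "broken" in url:
        raise RuntimeError("extraction failed")
    return [{"url": url, "title": "Example"}]


results, errors = scrape_many(
    ["https://example.com/a", "https://example.com/b", "https://example.com/broken"],
    fake_scrape,
)
print(sorted(results), sorted(errors))
```

Threads fit here because each extraction spends its time waiting on the network rather than using the CPU.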

The Future of Web Scraping

I believe we are at an inflection point in web scraping. The same way that high-level programming languages replaced assembly for most tasks, AI-powered extraction will replace manual selector writing for most scraping use cases.

Today's websites are more complex than ever - SPAs, shadow DOMs, web components, dynamic rendering. Fighting this complexity with increasingly fragile selectors is a losing battle.

Instead, let the AI understand the page the way a human would, and just tell it what you need.

What do you think? Have you tried AI-powered extraction in your projects? I would love to hear about your experiences in the comments.

LucidExtractor is open for early access at lucidextractor.liceron.in - 500 free credits, no credit card required.

How I Built an AI Web Scraper That Understands Plain English

Yukendiran Jayachandiran — Mon, 23 Feb 2026 04:58:42 +0000

The Problem Every Developer Knows

If you've ever built a web scraper, you know the drill:

1. Inspect the page
2. Find the right CSS selectors
3. Write brittle code that breaks when the site changes
4. Repeat forever

I spent years doing this. Every time a website updated its layout, my scrapers would break. I'd spend hours fixing selectors just to have them break again next week.

There had to be a better way.

The "Aha" Moment

What if instead of telling a scraper where data is on a page, you could tell it what you want?

Instead of:

price = soup.select_one('.product-price .sale-value span')

What if you could just say:

"Get me the product name, price, and customer rating"

That's exactly what I built.

Introducing LucidExtractor

LucidExtractor is an AI-powered web scraping API that understands natural language. You describe the data you want in plain English, and it returns clean, structured JSON.

How It Works

1. Send a URL + description - Tell the API what data you want
2. AI analyzes the page - LLMs understand the page structure
3. Get structured data - Clean JSON/CSV output, every time

Example API Call

import requests

response = requests.post("https://lucidextractor.liceron.in/api/scrape", json={
    "url": "https://example-store.com/product/123",
    "prompt": "Extract product name, price, rating, and availability"
})

data = response.json()
# {
#   "product_name": "Wireless Headphones Pro",
#   "price": "$79.99",
#   "rating": "4.5/5",
#   "availability": "In Stock"
# }

No CSS selectors. No XPath. No breaking when layouts change.

Key Features

Natural Language Input - Describe data in plain English
Anti-Bot Bypass - Handles Cloudflare, CAPTCHAs, and other protections automatically
Dynamic JS Sites - Full browser rendering for JavaScript-heavy pages
130+ API Endpoints - Specialized endpoints for different scraping needs
Bulk Processing - Process hundreds of URLs in parallel
Proxy Rotation - Built-in rotating proxies and browser fingerprinting
Multiple Formats - JSON, CSV, and structured data output

The Tech Stack

Backend: FastAPI + Python
AI/LLM: Multiple model support for understanding page structure
Browser Engine: Playwright for dynamic rendering
Frontend: React + Vite + Tailwind CSS
Infrastructure: Google Cloud Run, Firebase

Why This Approach Works

Traditional scrapers are fragile because they rely on the DOM structure. Change a div class name? Scraper breaks.

LucidExtractor is resilient because it understands what data looks like, not where it is. A price is a price whether it's in a <span class="price"> or a <div data-value="cost">.

Who Is This For?

Data analysts who need web data without coding expertise
Developers tired of maintaining brittle scrapers
Researchers collecting data from multiple sources
Businesses needing competitive intelligence or market data
Marketers tracking competitor pricing and content

Try It Free

LucidExtractor starts at $29/month with 10,000 credits. Early adopters get bonus credits when signing up.

Try it here: lucidextractor.liceron.in

I'd love to hear your feedback. What's your biggest web scraping pain point? Drop a comment below!

Follow me for more updates on building AI-powered developer tools.