kryptonation

Posted on Jun 21

How We Built a WAF-Resilient, Selector-Free Web Scraper in Python (Using Gemini 2.5 & Playwright)

#webscraping #ai #webdev #python

Every developer who has built web scrapers knows the pain:

Fragile CSS selectors/XPaths: The target website updates its Tailwind classes or shifts its React component tree, and your data pipeline crashes.
Web Application Firewalls (WAFs): Cloudflare, DataDome, and Akamai block your requests at the edge, returning a 403 Forbidden or challenge page.

We wanted to build a scraping engine that bypasses selectors entirely and stays resilient against anti-bot systems.

Here is the exact architecture we used to build QueryScrape AI using FastAPI, Playwright, and Gemini 2.5 Flash.

🛠️ The Architecture Stack
Our scraper relies on a three-stage pipeline:

The Stealth Crawler (Playwright): Launches a headless Chromium instance with custom user-agents, screen sizes, and browser flags to bypass basic anti-bot blockers and execute client-side JavaScript.
The DOM Cleaner (BeautifulSoup & html2text): Strips noisy scripts, styles, headers, and footers, converting raw HTML into token-efficient Markdown.
Dynamic Pydantic Schema compiler & Gemini structured output: Compiles fields submitted in the API request into Pydantic models at runtime, forcing the LLM to output validated JSON matching that model.

[Diagram: request flow between API User, FastAPI Server, Playwright Stealth Crawler, DOM Cleaner, Gemini 2.5 Flash, and Dynamic Pydantic Model. Edge labels, in order: POST URL + Schema, Fetch Page, HTML Source, Clean Markdown, Compile fields, Output validation schema, Validated JSON output.]
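
The first stage of the pipeline can be sketched roughly as follows. This is a minimal illustration, not our production crawler: the user agent string, viewport size, locale, and the single Chromium flag are placeholder values chosen for the example, and `build_context_options` and `fetch_rendered_html` are hypothetical names.

```python
from typing import Any, Dict

# Hypothetical defaults: the post does not list the real user agent,
# viewport, or Chromium flags, so these values are illustrative only.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
LAUNCH_ARGS = ["--disable-blink-features=AutomationControlled"]


def build_context_options(width: int = 1366, height: int = 768) -> Dict[str, Any]:
    """Return keyword arguments for Playwright's browser.new_context()."""
    return {
        "user_agent": DEFAULT_USER_AGENT,
        "viewport": {"width": width, "height": height},
        "locale": "en-US",
    }


async def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load a page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, args=LAUNCH_ARGS)
        try:
            context = await browser.new_context(**build_context_options())
            page = await context.new_page()
            # Wait for network activity to settle so client-side JS has run.
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()
```

With Playwright and its Chromium build installed, `asyncio.run(fetch_rendered_html("https://example.com"))` would return the rendered page source. Settings like these only help against basic checks; stricter WAF configurations may still serve a challenge page.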

💻 Under The Hood: The Core Code

Cleaning raw HTML to Markdown

Passing a raw DOM tree with thousands of lines of HTML to an LLM wastes tokens and blows up latency. We strip non-content tags and convert the HTML structure to Markdown:

from bs4 import BeautifulSoup
import html2text

def clean_html(html_content: str) -> str:
    soup = BeautifulSoup(html_content, "html.parser")

    # Remove script/style boilerplate and noisy non-content elements
    for element in soup(["script", "style", "nav", "footer", "header", "svg", "noscript", "iframe"]):
        element.decompose()

    # Convert remaining DOM structure to clean Markdown
    h = html2text.HTML2Text()
    h.ignore_links = False
    h.ignore_images = True
    h.ignore_tables = False
    h.body_width = 0  # Do not wrap lines

    return h.handle(str(soup)).strip()
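
To get a rough feel for how much a cleaning step like this can shrink a page, here is a sketch that uses only the standard library's `html.parser` as a stand-in for the BeautifulSoup and html2text step. It is not the cleaner above: it emits plain text rather than Markdown, and character count is only a crude proxy for tokens. `VisibleTextExtractor`, `reduction_ratio`, and `SAMPLE` are hypothetical names made up for the example.

```python
from html.parser import HTMLParser

# Same non-content tags that clean_html() removes.
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "svg", "noscript", "iframe"}


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not nested inside any tag in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped subtrees.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def reduction_ratio(html: str) -> float:
    """Fraction of characters removed by the text extraction."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return 1 - len(parser.text()) / max(len(html), 1)


SAMPLE = (
    "<html><head><style>body{margin:0}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<h1>Blue Widget</h1><p>Price: $19.99</p>"
    "<script>trackPageView()</script><footer>Copyright</footer>"
    "</body></html>"
)
```

Feeding `SAMPLE` through the extractor leaves just the heading and the price line. Real pages carry far more script and style weight than this toy input, so the ratio you measure will vary widely from site to site.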

Compiling Pydantic schemas dynamically at runtime

When a user calls our API, they pass an array of fields they want to extract, like:

[
  {"name": "title", "type": "string", "description": "The title of the product"},
  {"name": "price", "type": "float", "description": "The numerical price in USD"}
]

We use Pydantic's create_model function to compile these fields into a validated Pydantic class dynamically:

from pydantic import BaseModel, Field, create_model
from typing import List, Dict, Any, Type

TYPE_MAP = {
    "string": str, "integer": int, "float": float, "boolean": bool
}

def compile_schema(fields: List[Dict[str, Any]], is_list: bool = False) -> Type[BaseModel]:
    pydantic_fields = {}
    for f in fields:
        name = f.get("name")
        type_str = f.get("type", "string").lower()
        desc = f.get("description", f"Extracted value for {name}")

        # Unknown type strings fall back to str
        py_type = TYPE_MAP.get(type_str, str)
        pydantic_fields[name] = (py_type, Field(description=desc))

    ItemModel = create_model("ExtractedItem", **pydantic_fields)

    if is_list:
        # Wrap the item model so the LLM can return several records
        return create_model(
            "ExtractedList",
            items=(List[ItemModel], Field(description="A collection of extracted items."))
        )
    return ItemModel
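
Note that compile_schema falls back to str for unknown type strings and does not check that a name is present. A small pre-check run before compilation could reject such requests early. The sketch below uses only the standard library; `validate_field_specs` and `ALLOWED_TYPES` are hypothetical names, and the rules it applies (identifier-style names, no leading underscore, no duplicates) are assumptions about what a service like this might enforce, not a description of our actual validation.

```python
from typing import Any, Dict, List

# Mirrors the keys of TYPE_MAP above.
ALLOWED_TYPES = {"string", "integer", "float", "boolean"}


def validate_field_specs(fields: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the specs look usable."""
    problems: List[str] = []
    seen = set()
    for index, spec in enumerate(fields):
        name = spec.get("name")
        # Assumed rule: names must be plain identifiers without a leading underscore.
        if not isinstance(name, str) or not name.isidentifier() or name.startswith("_"):
            problems.append(f"field {index}: invalid name {name!r}")
            continue
        if name in seen:
            problems.append(f"field {index}: duplicate name {name!r}")
        seen.add(name)
        type_str = str(spec.get("type", "string")).lower()
        if type_str not in ALLOWED_TYPES:
            problems.append(f"field {index}: unsupported type {type_str!r}")
    return problems
```

An API handler could return these messages in a 4xx response instead of silently coercing an unsupported type to a string field.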

Invoking Gemini with Structured Schema Enforcement

We use the google-genai SDK. By passing the compiled Pydantic schema class directly as response_schema in the generation configuration, we have Gemini constrain its output to that schema, which saves us from writing fragile regex-and-retry loops:

from google import genai
from google.genai import types

async def extract_structured_data(content: str, fields: list, is_list: bool = False):
    # YOUR_API_KEY is a placeholder; supply your own Gemini API key
    client = genai.Client(api_key=YOUR_API_KEY)
    target_model = compile_schema(fields, is_list)

    prompt = f"Extract structured information according to the schema:\n\n{content}"

    response = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=target_model,
            temperature=0.1
        )
    )

    # response.parsed holds an instance of target_model
    return response.parsed.model_dump()
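
One edge case worth guarding against: as far as we understand the SDK, response.parsed can be None when the returned text could not be turned into the requested type, in which case calling model_dump() on it would raise. The sketch below shows one possible fallback that decodes response.text as JSON instead. `result_to_dict` is a hypothetical helper, and the `SimpleNamespace` object merely stands in for a real SDK response so the example runs without network access.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict


def result_to_dict(response: Any) -> Dict[str, Any]:
    """Prefer response.parsed; fall back to decoding response.text as JSON."""
    parsed = getattr(response, "parsed", None)
    if parsed is not None:
        # A Pydantic instance exposes model_dump(); a plain mapping is copied.
        return parsed.model_dump() if hasattr(parsed, "model_dump") else dict(parsed)
    text = getattr(response, "text", None)
    if not text:
        raise ValueError("model returned neither parsed output nor text")
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


# Stand-in for an SDK response where parsing did not produce an object.
fake_response = SimpleNamespace(
    parsed=None,
    text='{"title": "Blue Widget", "price": 19.99}',
)
```

The fallback skips Pydantic validation, so a stricter variant might pass the decoded dict back through the compiled model before returning it.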

🛡️ Bypassing Edge Firewalls (WAF)
To show this in action, we built a public WAF Shield Detector tool directly into our landing page.

It does two things:

Performs a naive static HTTP connection (which easily triggers WAF response headers like cf-ray or x-datadome-cid and yields a 403 Blocked page).
Runs our dynamic stealth-Playwright crawler (which loads the target page, executes JS, and extracts clean markdown).

The side-by-side comparison in our playground shows how a stealth browser crawler can get past edge bot guards that stop a plain HTTP request, and then feed clean content to the LLM for structured extraction.
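
The naive static check in the first step can be approximated with the standard library alone. This is a sketch under assumptions, not the detector running on our landing page: `detect_waf` and `naive_probe` are hypothetical names, only the two header names mentioned above are checked, and treating 429 and 503 as "blocked" alongside 403 is an added assumption.

```python
import urllib.error
import urllib.request
from typing import Dict, Mapping

# Header names taken from the text above; vendor labels are for display only.
WAF_HEADERS = {"cf-ray": "Cloudflare", "x-datadome-cid": "DataDome"}

# 403 comes from the text; 429 and 503 are assumed extra block signals.
BLOCK_STATUSES = (403, 429, 503)


def detect_waf(status: int, headers: Mapping[str, str]) -> Dict[str, object]:
    """Classify a response from its status code and header names."""
    lowered = {key.lower() for key in headers}
    vendors = sorted({vendor for name, vendor in WAF_HEADERS.items() if name in lowered})
    return {"status": status, "vendors": vendors, "blocked": status in BLOCK_STATUSES}


def naive_probe(url: str, timeout: float = 10.0) -> Dict[str, object]:
    """Send one plain GET with urllib and report what came back."""
    request = urllib.request.Request(url, headers={"User-Agent": "python-urllib"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return detect_waf(resp.status, dict(resp.headers.items()))
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the error still carries code and headers.
        return detect_waf(err.code, dict(err.headers.items()))
```

A vendor header on its own does not mean the request was blocked, since a CDN can attach it to successful responses too; the status code and the response body are the stronger signals.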

📈 Learn More & Try it Out
We have deployed the fully operational API pipeline on AWS App Runner.

👉 Try the live playground, check your target URLs for WAF headers, and get 1,000 free API extractions per month: 🔗 Live Demo Playground

The full source code for the server and playground is open source. What are your thoughts on using generative models for resilient data pipelines? Let's discuss in the comments below!

Top comments (1)

Roberto Kerber • Jun 28

Selector-free via LLM is a clever way to buy resilience. My only hesitation is cost/latency per page at scale - an LLM pass on every page adds up fast. Where it's paid off for me is the opposite end: probing /api/ paths first, because when a site exposes a clean internal JSON endpoint there's no DOM to parse at all. Curious how the token cost holds up on high-volume runs for you?