kryptonation

How We Built a WAF-Resilient, Selector-Free Web Scraper in Python (Using Gemini 2.5 & Playwright)

Every developer who has built web scrapers knows the pain:

Fragile CSS selectors/XPaths: The target website updates its Tailwind classes or shifts its React component tree, and your data pipeline crashes.
Web Application Firewalls (WAFs): Cloudflare, DataDome, and Akamai block your requests at the edge, returning a 403 Forbidden or challenge page.
We wanted to build a scraping engine that needs no selectors at all and handles anti-bot systems resiliently.

Here is the architecture behind QueryScrape AI, built with FastAPI, Playwright, and Gemini 2.5 Flash.

🛠️ The Architecture Stack
Our scraper relies on a three-stage pipeline:

1. The Stealth Crawler (Playwright): Launches a headless Chromium instance with custom user agents, screen sizes, and browser flags to get past basic anti-bot checks and execute client-side JavaScript.
2. The DOM Cleaner (BeautifulSoup & html2text): Strips noisy scripts, styles, headers, and footers, then converts the remaining HTML into token-efficient Markdown.
3. The Dynamic Pydantic Schema Compiler & Gemini Structured Output: Compiles the fields submitted in the API request into Pydantic models at runtime and constrains the LLM to output JSON that validates against that model.
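
The three stages above compose into a single request path. As a minimal sketch of that composition, here is one way to chain them with each stage injected as a callable. The names `run_pipeline`, `FetchFn`, `CleanFn`, and `ExtractFn` and the stand-in stages are hypothetical and only illustrate the flow; they are not taken from the QueryScrape AI codebase.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

# Hypothetical type aliases for the three stages (illustrative names).
FetchFn = Callable[[str], Awaitable[str]]
CleanFn = Callable[[str], str]
ExtractFn = Callable[[str, List[Dict[str, Any]], bool], Awaitable[Dict[str, Any]]]


async def run_pipeline(
    url: str,
    fields: List[Dict[str, Any]],
    fetch: FetchFn,
    clean: CleanFn,
    extract: ExtractFn,
    is_list: bool = False,
) -> Dict[str, Any]:
    """Chain crawl -> clean -> extract for a single request."""
    html = await fetch(url)        # stage 1: rendered HTML
    markdown = clean(html)         # stage 2: token-efficient Markdown
    return await extract(markdown, fields, is_list)  # stage 3: structured JSON


# Stand-in stages so the sketch runs without a browser or an API key.
async def fake_fetch(url: str) -> str:
    return "<html><body><h1>Blue Mug</h1><p>12.50</p></body></html>"


def fake_clean(html: str) -> str:
    return "# Blue Mug\n\n12.50"


async def fake_extract(markdown: str, fields: List[Dict[str, Any]], is_list: bool) -> Dict[str, Any]:
    return {"title": "Blue Mug", "price": 12.5}


demo_fields = [{"name": "title", "type": "string"}, {"name": "price", "type": "float"}]
result = asyncio.run(
    run_pipeline("https://example.com/mug", demo_fields, fake_fetch, fake_clean, fake_extract)
)
```

Injecting the stages keeps each one testable on its own: the fakes can be swapped for a real crawler, cleaner, and LLM call without touching the orchestration.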
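
Diagram: request flow through the pipeline

API User -> FastAPI Server: POST URL + Schema
FastAPI Server -> Playwright Stealth Crawler: Fetch Page
Playwright Stealth Crawler -> DOM Cleaner: HTML Source
DOM Cleaner -> Gemini 2.5 Flash: Clean Markdown
FastAPI Server -> Dynamic Pydantic Model: Compile fields
Dynamic Pydantic Model -> Gemini 2.5 Flash: Output validation schema
Gemini 2.5 Flash -> API User: Validated JSON output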
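
The crawler stage is described above only in words, so here is a minimal sketch of what a Playwright fetch with a custom user agent, viewport, and launch flag could look like. The user-agent string, viewport size, locale, the `--disable-blink-features=AutomationControlled` flag, and the names `build_context_options` and `fetch_rendered_html` are all assumptions chosen for illustration. This is not the configuration QueryScrape AI ships, and settings like these are unlikely to be enough against stricter anti-bot systems.

```python
from typing import Any, Dict, List

# Illustrative defaults; the exact values are assumptions, not the authors' settings.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
LAUNCH_ARGS: List[str] = ["--disable-blink-features=AutomationControlled"]


def build_context_options(
    user_agent: str = DEFAULT_USER_AGENT, width: int = 1366, height: int = 768
) -> Dict[str, Any]:
    """Build keyword arguments for Playwright's browser.new_context()."""
    return {
        "user_agent": user_agent,
        "viewport": {"width": width, "height": height},
        "locale": "en-US",
    }


async def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load a page in headless Chromium and return the HTML after JS has run."""
    # Imported lazily so the option helper works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, args=LAUNCH_ARGS)
        try:
            context = await browser.new_context(**build_context_options())
            page = await context.new_page()
            # Wait for network activity to settle so client-side rendering can finish.
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()
```

With Playwright and its Chromium build installed, `asyncio.run(fetch_rendered_html("https://example.com"))` would return the rendered HTML as a string, ready for the cleaning stage.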

💻 Under the Hood: The Core Code

1. Cleaning raw HTML to Markdown

Passing a raw DOM tree with thousands of lines of HTML to an LLM wastes tokens and increases latency. We strip non-content tags and convert the remaining HTML to Markdown:

from bs4 import BeautifulSoup
import html2text

def clean_html(html_content: str) -> str:
    soup = BeautifulSoup(html_content, "html.parser")

    # Remove script/style boilerplate and noisy non-content elements
    for element in soup(["script", "style", "nav", "footer", "header", "svg", "noscript", "iframe"]):
        element.decompose()

    # Convert remaining DOM structure to clean Markdown
    h = html2text.HTML2Text()
    h.ignore_links = False
    h.ignore_images = True
    h.ignore_tables = False
    h.body_width = 0  # Do not wrap lines

    return h.handle(str(soup)).strip()
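
To check that the cleaning step actually pays off for a given page, a rough before/after estimate can help. The sketch below uses the common rule of thumb of about four characters per token, which is an approximation and not a real tokenizer. The helper names `estimate_tokens` and `reduction_report` and the sample strings are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; ~4 chars per token is a rule of thumb only."""
    return int(len(text) / chars_per_token)


def reduction_report(raw_html: str, markdown: str) -> dict:
    """Compare estimated token counts before and after cleaning."""
    raw = estimate_tokens(raw_html)
    cleaned = estimate_tokens(markdown)
    saved = 0.0 if raw == 0 else 1 - cleaned / raw
    return {"raw_tokens": raw, "clean_tokens": cleaned, "saved_ratio": round(saved, 3)}


# Sample input standing in for a fetched page and its cleaned Markdown.
raw_html = "<html><head><script>var a = 1;</script></head><body><h1>Blue Mug</h1></body></html>"
markdown = "# Blue Mug"
report = reduction_report(raw_html, markdown)
```

For exact figures, the model's own token counter should replace the character heuristic.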

2. Compiling Pydantic schemas dynamically at runtime

When a user calls our API, they pass an array of the fields they want to extract, for example:

[
  {"name": "title", "type": "string", "description": "The title of the product"},
  {"name": "price", "type": "float", "description": "The numerical price in USD"}
]
We use Pydantic's create_model function to compile these fields into a Pydantic model class at runtime:

from pydantic import BaseModel, Field, create_model
from typing import List, Dict, Any, Type

TYPE_MAP = {
    "string": str, "integer": int, "float": float, "boolean": bool
}

def compile_schema(fields: List[Dict[str, Any]], is_list: bool = False) -> Type[BaseModel]:
    pydantic_fields = {}
    for f in fields:
        name = f["name"]  # required: fail fast instead of passing a None key to create_model
        type_str = f.get("type", "string").lower()
        desc = f.get("description", f"Extracted value for {name}")

        # Unknown type strings fall back to str
        py_type = TYPE_MAP.get(type_str, str)
        pydantic_fields[name] = (py_type, Field(description=desc))

    ItemModel = create_model("ExtractedItem", **pydantic_fields)

    if is_list:
        # Wrap the item model so the output has the shape {"items": [...]}
        return create_model(
            "ExtractedList",
            items=(List[ItemModel], Field(description="A collection of extracted items."))
        )
    return ItemModel
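
Because compile_schema quietly falls back to str for unknown type strings, it can be useful to reject malformed field specs before compiling them. The following is a minimal sketch of such a pre-check. The function name `validate_field_specs` and the specific rules (plain identifiers, no leading underscore, no duplicates, known types only) are a deliberately conservative policy assumed for illustration, not something the original pipeline is known to enforce.

```python
import keyword
from typing import Any, Dict, List

# Mirrors the keys of TYPE_MAP in the article's compile_schema.
ALLOWED_TYPES = {"string", "integer", "float", "boolean"}


def validate_field_specs(fields: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the specs look usable."""
    problems: List[str] = []
    seen = set()
    for i, spec in enumerate(fields):
        name = spec.get("name")
        if not isinstance(name, str) or not name.isidentifier() or keyword.iskeyword(name):
            problems.append(f"field {i}: name {name!r} is not a plain identifier")
            continue
        if name.startswith("_"):
            problems.append(f"field {i}: name {name!r} must not start with an underscore")
        if name in seen:
            problems.append(f"field {i}: duplicate name {name!r}")
        seen.add(name)
        type_str = str(spec.get("type", "string")).lower()
        if type_str not in ALLOWED_TYPES:
            problems.append(f"field {i}: unsupported type {type_str!r}")
    return problems
```

An API handler could call this first and return a 4xx response listing the problems, so that a typo such as "type": "flaot" surfaces as an error instead of silently becoming a string field.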

3. Invoking Gemini with Structured Schema Enforcement

We use the google-genai SDK. Passing the compiled Pydantic class as response_schema in the generation config constrains Gemini's output to JSON that matches the schema, which saves us from writing fragile regex-and-retry loops:

import os

from google import genai
from google.genai import types

async def extract_structured_data(content: str, fields: list, is_list: bool = False):
    # Assumption: the API key is supplied via the GEMINI_API_KEY environment variable
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    target_model = compile_schema(fields, is_list)

    prompt = f"Extract structured information according to the schema:\n\n{content}"

    response = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=target_model,
            temperature=0.1
        )
    )

    # response.parsed is the output parsed into an instance of target_model
    return response.parsed.model_dump()
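
The function above calls response.parsed.model_dump() directly, which raises an AttributeError if parsed is empty. My understanding is that the SDK can leave parsed unset when the model output does not parse into the schema, though that is worth verifying against the SDK version in use. As a hedged sketch, a small helper can fall back to the raw JSON text. The name `response_to_dict` is hypothetical, and the stand-in response object only mimics the two attributes used here.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict


def response_to_dict(response: Any) -> Dict[str, Any]:
    """Prefer the SDK-parsed model; otherwise fall back to the raw JSON text."""
    parsed = getattr(response, "parsed", None)
    if parsed is not None and hasattr(parsed, "model_dump"):
        return parsed.model_dump()

    text = getattr(response, "text", None)
    if not text:
        raise ValueError("model returned neither a parsed object nor text")
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


# Stand-in for a response whose `parsed` attribute is empty.
fake_response = SimpleNamespace(parsed=None, text='{"title": "Blue Mug", "price": 12.5}')
extracted = response_to_dict(fake_response)
```

Note that the fallback path returns plain JSON that has not been validated by the Pydantic model, so a caller that needs the guarantee should validate it again before returning it to the user.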

🛡️ Bypassing Edge Firewalls (WAF)
To show this in action, we built a public WAF Shield Detector tool directly into our landing page.

It does two things:

1. Sends a plain HTTP request with no browser behind it. This often comes back as a 403 Forbidden or challenge page, along with WAF response headers such as cf-ray or x-datadome-cid.
2. Runs our stealth Playwright crawler, which loads the target page, executes its JavaScript, and extracts clean Markdown.
The side-by-side comparison in our playground shows a browser-based crawler getting past edge bot protection that blocks the plain request, so the page content can be fed to the LLM for structured extraction.
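
The first of the two checks above can be approximated with the standard library alone. The sketch below sends one plain GET request and inspects the status code and headers. It only looks for the two header names mentioned in this post and treats 403 as the blocked status; the names `detect_waf` and `naive_probe` are hypothetical, and the actual detector on the landing page may work differently.

```python
import urllib.error
import urllib.request
from typing import Dict, Mapping

# Header names taken from the text above; extend for other vendors as needed.
WAF_HEADER_HINTS: Dict[str, str] = {
    "cf-ray": "Cloudflare",
    "x-datadome-cid": "DataDome",
}
BLOCK_STATUSES = {403}  # the post mentions 403; other codes could be added


def detect_waf(status: int, headers: Mapping[str, str]) -> Dict[str, object]:
    """Classify a response from its status code and header names."""
    lowered = {key.lower() for key in headers}
    vendors = sorted({vendor for name, vendor in WAF_HEADER_HINTS.items() if name in lowered})
    return {"status": status, "blocked": status in BLOCK_STATUSES, "vendors": vendors}


def naive_probe(url: str, timeout: float = 10.0) -> Dict[str, object]:
    """Send one plain GET request, with no browser and no JavaScript."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return detect_waf(resp.status, dict(resp.headers.items()))
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry headers worth inspecting.
        return detect_waf(err.code, dict(err.headers.items()))
```

Keeping detect_waf separate from the network call makes the classification easy to test with canned headers. A header such as cf-ray only indicates that a vendor sits in front of the site, so the blocked flag is derived from the status code alone.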

📈 Learn More & Try it Out
We have deployed the fully operational API pipeline on AWS App Runner.

👉 Try the live playground, check your target URLs for WAF headers, and get 1,000 free API extractions per month: 🔗 Live Demo Playground

The full source code for the server and playground is open source. What are your thoughts on using generative models for resilient data pipelines? Let's discuss in the comments below!
