Every developer who has built web scrapers knows the pain:
Fragile CSS selectors/XPaths: The target website updates its Tailwind classes or shifts its React component tree, and your data pipeline crashes.
Web Application Firewalls (WAFs): Cloudflare, DataDome, and Akamai block your requests at the edge, returning a 403 Forbidden or challenge page.
We wanted to build a scraping engine that bypasses selectors entirely and handles anti-bot systems resiliently.
Here is the architecture behind QueryScrape AI, built with FastAPI, Playwright, and Gemini 2.5 Flash.
The Architecture Stack
Our scraper relies on a three-stage pipeline:
The Stealth Crawler (Playwright): Launches a headless Chromium instance with custom user-agents, screen sizes, and browser flags to bypass basic anti-bot blockers and execute client-side JavaScript.
The DOM Cleaner (BeautifulSoup & html2text): Strips noisy scripts, styles, headers, and footers, converting raw HTML into token-efficient Markdown.
The Schema Compiler and Extractor (dynamic Pydantic models & Gemini structured output): Compiles the fields submitted in the API request into a Pydantic model at runtime, then constrains the LLM to return JSON that validates against that model.
[Diagram: request flow between API User, FastAPI Server, Playwright Stealth Crawler, DOM Cleaner, Dynamic Pydantic Model, and Gemini 2.5 Flash. Edge labels, in order: POST URL + Schema, Fetch Page, HTML Source, Clean Markdown, Compile fields, Output validation schema, Validated JSON output.]
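The first stage of the pipeline can be sketched roughly as follows. This is a minimal sketch, not the project's actual crawler: the names `build_context_options` and `fetch_rendered_html`, the user-agent string, the viewport size, and the single Chromium flag are all illustrative assumptions, since the post does not list the real values.

```python
from typing import Any, Dict

# Hypothetical defaults: the article does not publish its real user-agent,
# viewport, or browser flags, so these values are illustrative only.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def build_context_options(width: int = 1366, height: int = 768) -> Dict[str, Any]:
    """Return keyword arguments for Playwright's browser.new_context()."""
    return {
        "user_agent": DEFAULT_USER_AGENT,
        "viewport": {"width": width, "height": height},
        "locale": "en-US",
    }


async def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load a page in headless Chromium and return the rendered HTML."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            # Chromium switch commonly used to hide the automation hint (assumed choice).
            args=["--disable-blink-features=AutomationControlled"],
        )
        try:
            context = await browser.new_context(**build_context_options())
            page = await context.new_page()
            # Wait for the network to go quiet so client-side rendering can finish.
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return await page.content()
        finally:
            # Always release the browser, even if navigation fails.
            await browser.close()
```

Calling `asyncio.run(fetch_rendered_html("https://example.com"))` would return the rendered HTML, which is the input the DOM cleaner expects. Flags and headers like these only help against basic checks; stronger bot detection may still block the request.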
Under The Hood: The Core Code
- Cleaning raw HTML to Markdown
Passing a raw DOM tree with thousands of lines of HTML to an LLM wastes tokens and blows up latency. We strip non-content tags and convert the HTML structure to Markdown:
from bs4 import BeautifulSoup
import html2text


def clean_html(html_content: str) -> str:
    soup = BeautifulSoup(html_content, "html.parser")

    # Remove script/style boilerplate and noisy non-content elements
    for element in soup(["script", "style", "nav", "footer", "header", "svg", "noscript", "iframe"]):
        element.decompose()

    # Convert remaining DOM structure to clean Markdown
    h = html2text.HTML2Text()
    h.ignore_links = False
    h.ignore_images = True
    h.ignore_tables = False
    h.body_width = 0  # Do not wrap lines
    return h.handle(str(soup)).strip()
- Compiling Pydantic schemas dynamically at runtime
When a user calls our API, they pass an array of fields they want to extract, like:
[
{"name": "title", "type": "string", "description": "The title of the product"},
{"name": "price", "type": "float", "description": "The numerical price in USD"}
]
We use Pydantic's create_model function to compile these fields into a validated Pydantic class dynamically:
from pydantic import BaseModel, Field, create_model
from typing import List, Dict, Any, Type

# Maps the type names accepted by the API to Python types
TYPE_MAP = {
    "string": str, "integer": int, "float": float, "boolean": bool
}


def compile_schema(fields: List[Dict[str, Any]], is_list: bool = False) -> Type[BaseModel]:
    pydantic_fields = {}
    for f in fields:
        name = f.get("name")
        if not name:
            # A field without a name cannot become a model attribute
            raise ValueError("Each field needs a non-empty 'name'")
        type_str = f.get("type", "string").lower()
        desc = f.get("description", f"Extracted value for {name}")
        # Unknown type names fall back to str
        py_type = TYPE_MAP.get(type_str, str)
        pydantic_fields[name] = (py_type, Field(description=desc))

    ItemModel = create_model("ExtractedItem", **pydantic_fields)

    if is_list:
        # Wrap the item model so the LLM returns a collection of items
        return create_model(
            "ExtractedList",
            items=(List[ItemModel], Field(description="A collection of extracted items."))
        )
    return ItemModel
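To make the effect of `create_model` concrete, here is a small sketch that builds the model the two example fields would produce and runs a payload through it. It assumes Pydantic v2 (the same API family as `model_dump`); the sample product values are made up for illustration.

```python
from pydantic import Field, ValidationError, create_model

# Build the same kind of model the two example fields would produce.
ProductModel = create_model(
    "ExtractedItem",
    title=(str, Field(description="The title of the product")),
    price=(float, Field(description="The numerical price in USD")),
)

# A well-formed payload validates; the numeric string is coerced to a float.
item = ProductModel.model_validate({"title": "Mechanical Keyboard", "price": "89.5"})
print(item.model_dump())

# The generated JSON Schema carries the field descriptions along.
schema = ProductModel.model_json_schema()
print(schema["properties"]["price"]["description"])

# A payload with a non-numeric price is rejected instead of passing through.
try:
    ProductModel.model_validate({"title": "Broken item", "price": "not a number"})
except ValidationError as exc:
    print(f"{exc.error_count()} validation error(s)")
```

The descriptions matter because they travel with the schema, so they double as per-field extraction hints for the model.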
- Invoking Gemini with Structured Schema Enforcement
We use the google-genai SDK. Passing the compiled Pydantic class as response_schema in the generation config constrains Gemini's output to that schema, which saves us from writing fragile regex-and-retry loops:
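import os

from google import genai
from google.genai import types


async def extract_structured_data(content: str, fields: list, is_list: bool = False):
    # The environment variable name is a placeholder; load the key however you manage secrets
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    target_model = compile_schema(fields, is_list)

    prompt = f"Extract structured information according to the schema:\n\n{content}"

    response = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=target_model,
            temperature=0.1
        )
    )

    # response.parsed is None if the SDK could not parse the output into the model
    if response.parsed is None:
        raise ValueError("Gemini response could not be parsed into the schema")
    return response.parsed.model_dump()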
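Schema enforcement shapes the output, but it is still worth validating defensively before returning data to the caller. The sketch below is one possible approach, not part of the original code: `validate_llm_json` is a hypothetical helper that validates raw JSON text (for example the SDK's `response.text`) against the compiled model, and `Product` is a stand-in for a model returned by `compile_schema`.

```python
from typing import Any, Dict, Optional, Type

from pydantic import BaseModel, ValidationError


def validate_llm_json(raw_text: Optional[str], model: Type[BaseModel]) -> Dict[str, Any]:
    """Validate raw JSON text against a Pydantic model (hypothetical helper)."""
    if not raw_text:
        raise ValueError("Model returned an empty response")
    try:
        # model_validate_json parses and validates in one step (Pydantic v2).
        return model.model_validate_json(raw_text).model_dump()
    except ValidationError as exc:
        # Covers both malformed JSON and schema mismatches.
        raise ValueError(
            f"Response did not match schema: {exc.error_count()} error(s)"
        ) from exc


class Product(BaseModel):
    """Stand-in for a model produced by compile_schema()."""
    title: str
    price: float


# Sample payload is made up for illustration.
result = validate_llm_json('{"title": "Desk Lamp", "price": 24.0}', Product)
print(result)
```

Raising a single `ValueError` keeps the API layer simple: the endpoint can map it to one error response regardless of whether the JSON was malformed or merely incomplete.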
Bypassing Edge Firewalls (WAF)
To show this in action, we built a public WAF Shield Detector tool directly into our landing page.
It does two things:
Performs a naive static HTTP request (which often surfaces WAF response headers like cf-ray or x-datadome-cid and returns a 403 Forbidden page).
Runs our dynamic stealth Playwright crawler (which loads the target page, executes JS, and extracts clean Markdown).
The side-by-side comparison on our playground shows that a browser-based crawler can get past edge bot guards that block a naive HTTP client, and then feed clean content directly to the LLM.
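The naive half of that comparison boils down to inspecting the status code and response headers. Here is a minimal sketch of that check, assuming only the two header names mentioned above; `detect_waf` and the header-to-vendor mapping are hypothetical and not the detector's actual code.

```python
from typing import Dict, Mapping

# Header names come from the text above; the vendor labels are illustrative.
WAF_HEADER_HINTS: Dict[str, str] = {
    "cf-ray": "Cloudflare",
    "x-datadome-cid": "DataDome",
}


def detect_waf(status: int, headers: Mapping[str, str]) -> Dict[str, object]:
    """Report which known WAF headers are present and whether the request was blocked."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    present = {name.lower() for name in headers}
    vendors = sorted({vendor for header, vendor in WAF_HEADER_HINTS.items() if header in present})
    # Only 403 is treated as blocked here; real challenge pages may use other codes.
    return {"vendors": vendors, "blocked": status == 403}


# Example with made-up header values.
report = detect_waf(403, {"CF-RAY": "abc123-LHR", "Content-Type": "text/html"})
print(report)
```

A header such as cf-ray only indicates that the site sits behind that vendor; it is the combination with a 403 status that suggests the request itself was rejected.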
Learn More & Try it Out
We have deployed the fully operational API pipeline on AWS App Runner.
Try the live playground, check your target URLs for WAF headers, and get 1,000 free API extractions per month: Live Demo Playground
The full source code for the server and playground is open-source. What are your thoughts on using generative models for resilient data pipelines? Let's discuss in the comments below!