<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kryptonation</title>
    <description>The latest articles on DEV Community by kryptonation (@kryptonation).</description>
    <link>https://dev.to/kryptonation</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1816137%2Fa8494b15-cbc9-40e9-b6b7-44b570dd4784.png</url>
      <title>DEV Community: kryptonation</title>
      <link>https://dev.to/kryptonation</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kryptonation"/>
    <language>en</language>
    <item>
      <title>How We Built a WAF-Resilient, Selector-Free Web Scraper in Python (Using Gemini 2.5 &amp; Playwright)</title>
      <dc:creator>kryptonation</dc:creator>
      <pubDate>Sun, 21 Jun 2026 18:29:14 +0000</pubDate>
      <link>https://dev.to/kryptonation/how-we-built-a-waf-resilient-selector-free-web-scraper-in-python-using-gemini-25-playwright-2iof</link>
      <guid>https://dev.to/kryptonation/how-we-built-a-waf-resilient-selector-free-web-scraper-in-python-using-gemini-25-playwright-2iof</guid>
      <description>&lt;p&gt;Every developer who has built web scrapers knows the pain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fragile CSS selectors/XPaths: The target website updates its Tailwind classes or shifts its React component tree, and your data pipeline crashes.&lt;/li&gt;
&lt;li&gt;Web Application Firewalls (WAFs): Cloudflare, DataDome, and Akamai block your requests at the edge, returning a 403 Forbidden or a challenge page.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We wanted to build a scraping engine that bypasses selectors entirely and copes with anti-bot systems resiliently.&lt;/p&gt;

&lt;p&gt;Here is the exact architecture we used to build QueryScrape AI using FastAPI, Playwright, and Gemini 2.5 Flash.&lt;/p&gt;

&lt;p&gt;🛠️ The Architecture Stack&lt;br&gt;
Our scraper relies on a three-stage pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Stealth Crawler (Playwright): Launches a headless Chromium instance with custom user agents, screen sizes, and browser flags to get past basic anti-bot blockers and execute client-side JavaScript.&lt;/li&gt;
&lt;li&gt;The DOM Cleaner (BeautifulSoup &amp;amp; html2text): Strips noisy scripts, styles, headers, and footers, converting raw HTML into token-efficient Markdown.&lt;/li&gt;
&lt;li&gt;The Dynamic Pydantic Schema Compiler &amp;amp; Gemini structured output: Compiles the fields submitted in the API request into a Pydantic model at runtime and constrains the LLM to output JSON that validates against that model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;[Diagram: request flow between API User, FastAPI Server, Playwright Stealth Crawler, DOM Cleaner, Dynamic Pydantic Model, and Gemini 2.5 Flash. Edge labels: POST URL + Schema, Fetch Page, HTML Source, Clean Markdown, Compile fields, Output validation schema, Validated JSON output.]&lt;/p&gt;
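
&lt;p&gt;To make the Stealth Crawler stage more concrete, here is a minimal sketch using Playwright's async API. The user-agent string, viewport size, Chromium flag, and the names build_context_options and fetch_rendered_html are illustrative assumptions rather than the production configuration:&lt;/p&gt;

```python
# Illustrative values only (assumptions): any realistic desktop profile would do.
ASSUMED_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
ASSUMED_LAUNCH_ARGS = ["--disable-blink-features=AutomationControlled"]


def build_context_options(width: int = 1366, height: int = 768) -> dict:
    """Keyword arguments for browser.new_context() describing a desktop profile."""
    return {
        "user_agent": ASSUMED_USER_AGENT,
        "viewport": {"width": width, "height": height},
        "locale": "en-US",
    }


async def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load the page in headless Chromium and return the HTML after JavaScript ran."""
    # Imported here so the helper above stays importable without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, args=ASSUMED_LAUNCH_ARGS)
        try:
            context = await browser.new_context(**build_context_options())
            page = await context.new_page()
            # "networkidle" waits for client-side requests to settle before reading the DOM.
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()
```

&lt;p&gt;Calling asyncio.run(fetch_rendered_html(url)) would return the rendered HTML that the DOM Cleaner then receives. A profile like this only addresses basic checks; stricter bot defenses may still challenge it.&lt;/p&gt;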

&lt;p&gt;💻 Under The Hood: The Core Code&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cleaning raw HTML to Markdown&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Passing a raw DOM tree with thousands of lines of HTML to an LLM wastes tokens and increases latency. We strip non-content tags and convert the remaining HTML structure to Markdown:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from bs4 import BeautifulSoup
import html2text


def clean_html(html_content: str) -&amp;gt; str:
    soup = BeautifulSoup(html_content, "html.parser")

    # Remove script/style boilerplate and noisy non-content elements
    for element in soup(["script", "style", "nav", "footer", "header", "svg", "noscript", "iframe"]):
        element.decompose()

    # Convert remaining DOM structure to clean Markdown
    h = html2text.HTML2Text()
    h.ignore_links = False
    h.ignore_images = True
    h.ignore_tables = False
    h.body_width = 0  # Do not wrap lines

    return h.handle(str(soup)).strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Compiling Pydantic schemas dynamically at runtime&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When a user calls our API, they pass an array of the fields they want to extract, like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
  {"name": "title", "type": "string", "description": "The title of the product"},
  {"name": "price", "type": "float", "description": "The numerical price in USD"}
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We use Pydantic's create_model function to compile these fields into a Pydantic model class at runtime:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import Any, Dict, List, Type

from pydantic import BaseModel, Field, create_model

# Maps the API's type names to Python types; unknown names fall back to str
TYPE_MAP = {
    "string": str, "integer": int, "float": float, "boolean": bool
}


def compile_schema(fields: List[Dict[str, Any]], is_list: bool = False) -&amp;gt; Type[BaseModel]:
    pydantic_fields = {}
    for f in fields:
        name = f["name"]  # every field must be named; a missing name fails fast
        type_str = f.get("type", "string").lower()
        desc = f.get("description", f"Extracted value for {name}")

        py_type = TYPE_MAP.get(type_str, str)
        # A (type, Field(...)) tuple with no default makes the field required
        pydantic_fields[name] = (py_type, Field(description=desc))

    ItemModel = create_model("ExtractedItem", **pydantic_fields)

    if is_list:
        # Wrap the item model so one page can yield many records
        return create_model(
            "ExtractedList",
            items=(List[ItemModel], Field(description="A collection of extracted items."))
        )
    return ItemModel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
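
&lt;p&gt;To see what such a runtime-compiled model gives you, here is a small sketch that builds the two example fields by hand with create_model and checks candidate payloads against it. It assumes Pydantic v2; the helper name try_validate and the sample payloads are hypothetical:&lt;/p&gt;

```python
from pydantic import Field, ValidationError, create_model

# The same two fields as the example request, compiled by hand.
Product = create_model(
    "ExtractedItem",
    title=(str, Field(description="The title of the product")),
    price=(float, Field(description="The numerical price in USD")),
)


def try_validate(payload: dict):
    """Return (data, None) on success or (None, number_of_errors) on failure."""
    try:
        return Product.model_validate(payload).model_dump(), None
    except ValidationError as exc:
        return None, exc.error_count()


# A complete payload validates and round-trips to a plain dict.
ok, _ = try_validate({"title": "Desk lamp", "price": 24.5})

# A payload without "price" is rejected because the field has no default.
_, errors = try_validate({"title": "Desk lamp"})

# The JSON Schema derived from the model lists both fields as required.
required = Product.model_json_schema()["required"]
```

&lt;p&gt;The field descriptions end up in the generated JSON Schema, so they can double as per-field extraction hints for the model.&lt;/p&gt;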

&lt;ol start="3"&gt;
&lt;li&gt;Invoking Gemini with Structured Schema Enforcement&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We use the google-genai SDK. Passing the compiled Pydantic model class as response_schema in the generation config constrains Gemini's output to that schema, which saves us from writing fragile regex-and-retry loops:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google import genai
from google.genai import types


async def extract_structured_data(content: str, fields: list, is_list: bool = False):
    # YOUR_API_KEY is a placeholder for your Gemini API key
    client = genai.Client(api_key=YOUR_API_KEY)
    target_model = compile_schema(fields, is_list)

    prompt = f"Extract structured information according to the schema:\n\n{content}"

    response = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=target_model,
            temperature=0.1
        )
    )

    # response.parsed is None when the reply could not be parsed into the schema
    if response.parsed is None:
        raise ValueError("Gemini returned no output matching the schema")
    return response.parsed.model_dump()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;🛡️ Bypassing Edge Firewalls (WAF)&lt;br&gt;
To show this in action, we built a public WAF Shield Detector tool directly into our landing page.&lt;/p&gt;

&lt;p&gt;It does two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Performs a naive static HTTP request, which is easily blocked with a 403 Forbidden page and exposes WAF response headers such as cf-ray or x-datadome-cid.&lt;/li&gt;
&lt;li&gt;Runs our dynamic stealth Playwright crawler, which loads the target page, executes its JavaScript, and extracts clean Markdown.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The side-by-side comparison on our playground shows a browser-based crawler getting past edge bot guards that stop a plain HTTP client, and feeding structured data directly to an LLM.&lt;/p&gt;
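
&lt;p&gt;The naive half of that check can be approximated with the standard library alone. The sketch below is not the detector's actual code: the names WAF_HEADER_HINTS, detect_waf, and probe are hypothetical, and it only knows the two headers mentioned above.&lt;/p&gt;

```python
import urllib.error
import urllib.request
from typing import Mapping

# Header names mentioned in this post, mapped to a vendor label (not exhaustive).
WAF_HEADER_HINTS = {
    "cf-ray": "Cloudflare",
    "x-datadome-cid": "DataDome",
}


def detect_waf(status: int, headers: Mapping[str, str]) -> dict:
    """Classify a response from its status code and header names."""
    lowered = {name.lower() for name in headers}
    vendors = sorted(
        {vendor for header, vendor in WAF_HEADER_HINTS.items() if header in lowered}
    )
    # A vendor header alone only shows the WAF is present; 403 suggests a block.
    return {"status": status, "vendors": vendors, "blocked": status == 403 and bool(vendors)}


def probe(url: str, timeout: float = 10.0) -> dict:
    """Send one plain HTTP GET (no browser, no JavaScript) and classify the reply."""
    request = urllib.request.Request(url, headers={"User-Agent": "python-urllib"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return detect_waf(response.status, dict(response.headers.items()))
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the status and headers are still available.
        return detect_waf(err.code, dict(err.headers.items()))
```

&lt;p&gt;Note that cf-ray appears on every response proxied by Cloudflare, blocked or not, which is why the sketch reports a block only when a vendor header coincides with a 403.&lt;/p&gt;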

&lt;p&gt;📈 Learn More &amp;amp; Try it Out&lt;br&gt;
We have deployed the fully operational API pipeline on AWS App Runner.&lt;/p&gt;

&lt;p&gt;👉 Try the live playground, check your target URLs for WAF headers, and get 1,000 free API extractions per month: 🔗 Live Demo Playground&lt;/p&gt;

&lt;p&gt;The full source code for the server and playground is open-source. What are your thoughts on using generative models for resilient data pipelines? Let's discuss in the comments below!&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
