We Built a Custom Playwright Rendering Pipeline for Our MCP Server — Here's What We Learned
At Haunt API, we build web extraction tools for AI agents. Our MCP server lets Claude and other AI assistants extract structured data from any URL. Simple enough on paper — fetch a page, parse the HTML, return JSON.
The problem? Half the internet doesn't want to be fetched.
The Problem With "Just Use Playwright"
Most web scraping tutorials go something like this:
```python
from playwright.async_api import async_playwright

async with async_playwright() as p:
    browser = await p.chromium.launch()
    page = await browser.new_page()
    await page.goto(url)
    html = await page.content()
```
And that works! For a demo. For a product that real users depend on, it falls apart fast:
- Sites detect headless browsers and serve captchas or empty pages
- SPA pages need time to render — how long do you wait? 2 seconds? 5? 10?
- You're burning resources loading images, fonts, and CSS when you only need text
- Every render costs the same — no caching, no intelligence
We went through all of these. Here's how we solved each one.
Lesson 1: Don't Use One Tool For Everything
Our pipeline has three tiers, and most requests never hit Playwright:
- Direct HTTP — Works for ~80% of the web. Fast, cheap, no browser needed.
- FlareSolverr — Handles Cloudflare challenges and basic JS rendering.
- Playwright — Full browser rendering for JS-heavy SPAs that return empty skeletons.
The key insight: we detect skeleton pages — HTML that has a <div id="root"></div> but no actual content — and only spin up the browser when we need to. Most pages don't need it.
```python
import re

def strip_tags(html: str) -> str:
    """Minimal helper: drop script/style blocks and tags, keep visible text."""
    html = re.sub(r"<(script|style)\b.*?</\1>", "", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", html).strip()

def is_skeleton_html(html: str) -> bool:
    """Detect if HTML is an unrendered JS skeleton."""
    if len(html) < 500:
        return True
    # Strip scripts/styles and check for visible text
    text = strip_tags(html)
    if len(text) < 100:
        return True
    # Common SPA markers
    skeleton_markers = [
        '<div id="root"></div>',
        '<div id="__next"></div>',
        'You need to enable JavaScript',
    ]
    return any(marker in html for marker in skeleton_markers)
```
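Skeleton detection is what drives tier escalation. A minimal sketch of the dispatch loop — `tiers` here is a list of (name, async fetcher) pairs standing in for our real HTTP client, FlareSolverr proxy, and Playwright renderer, and `looks_skeletal` is a trimmed-down stand-in for the detector above:

```python
import asyncio

def looks_skeletal(html: str) -> bool:
    """Simplified stand-in for is_skeleton_html above."""
    return len(html) < 500 or '<div id="root"></div>' in html

async def fetch_page(url: str, tiers) -> tuple[str, str]:
    """Try each tier in cost order; stop at the first real render."""
    html = ""
    for name, fetch in tiers:
        html = await fetch(url)
        if not looks_skeletal(html):
            return name, html
    return tiers[-1][0], html  # even the last tier came back empty
```

Each tier only runs if the one before it returned a skeleton, so the expensive browser path stays cold for most URLs.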
Lesson 2: Smart Wait Strategies Beat Fixed Timers
The worst thing about browser automation is the waiting. time.sleep(5) is either too short (page hasn't loaded) or too long (wasting time on pages that loaded instantly).
We built three concurrent wait strategies. First one to trigger wins:
Content Stability — Poll the page's visible text every 200ms. If it hasn't changed for 1 second, the content has loaded.
Network Idle — Wait for no new network requests for 500ms. Good for pages that make API calls after initial load.
Meaningful Content — Wait until the page has at least 500 characters of visible text. Catches pages that load something but aren't done yet.
```python
import asyncio

async def wait_for_content(page, timeout=10):
    """Smart wait — detect when content has actually loaded."""
    # asyncio.wait needs Tasks, not bare coroutines (required on Python 3.11+)
    tasks = [
        asyncio.create_task(wait_for_content_stability(page)),
        asyncio.create_task(wait_for_network_idle(page)),
        asyncio.create_task(wait_for_meaningful_content(page)),
    ]
    done, pending = await asyncio.wait(
        tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
    )
    for t in pending:
        t.cancel()
    return done.pop().result() if done else {"strategy": "timeout"}
```
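The strategies themselves are small. A sketch of the content-stability one, assuming a `page` object with an async `inner_text` method (Playwright's `Page` has one); the poll interval and stability window are parameters rather than hard-coded:

```python
import asyncio
import time

async def wait_for_content_stability(page, poll_s=0.2, stable_s=1.0):
    """Resolve once the page's visible text stops changing for stable_s."""
    last_text, stable_start = None, None
    while True:
        text = await page.inner_text("body")
        now = time.monotonic()
        if text != last_text:
            last_text, stable_start = text, now  # content moved — reset clock
        elif now - stable_start >= stable_s:
            return {"strategy": "content_stability"}
        await asyncio.sleep(poll_s)
```

The caller's overall `timeout` is what stops this loop on pages that never settle, which is why the strategy itself can afford to loop forever.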
This cut our average render time from 6 seconds to under 3.
Lesson 3: Fingerprint Rotation Matters
Headless Chromium has tells. Sites check for them. If every request comes from the same user agent with the same viewport on the same timezone, you get blocked.
We rotate fingerprints per-site — every page on a site sees a consistent browser (so cookies and sessions work), but different sites see different browsers:
```python
import hashlib
from urllib.parse import urlsplit

FINGERPRINTS = [
    {"ua": "Chrome/120.0 Windows", "viewport": [1920, 1080], "locale": "en-US"},
    {"ua": "Chrome/119.0 macOS", "viewport": [1440, 900], "locale": "en-GB"},
    {"ua": "Chrome/120.0 Linux", "viewport": [1366, 768], "locale": "en-US"},
    # ... 10 total variants
]

def get_fingerprint(url: str) -> dict:
    """Deterministic per-site fingerprint selection — hashing the host
    keeps every page on one site looking like the same browser."""
    host = urlsplit(url).netloc
    idx = int(hashlib.md5(host.encode()).hexdigest(), 16) % len(FINGERPRINTS)
    return FINGERPRINTS[idx]
```
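The "same site, same browser" property holds exactly when the hash is keyed on the hostname, and it's easy to verify in isolation. A self-contained sketch with dummy fingerprint labels:

```python
import hashlib
from urllib.parse import urlsplit

LABELS = ["fp-a", "fp-b", "fp-c"]

def pick(url: str) -> str:
    """Hash the host, not the full URL, so one site sees one browser."""
    host = urlsplit(url).netloc
    return LABELS[int(hashlib.md5(host.encode()).hexdigest(), 16) % len(LABELS)]
```

Two pages on one site always agree; distinct sites usually land on different entries, and a collision is harmless — it just means two sites share a profile.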
Lesson 4: Block What You Don't Need
When you're extracting text data, images and fonts are dead weight. We block them at the network level:
```python
BLOCKED_RESOURCES = {
    "image", "font", "media", "texttrack", "object",
    "beacon", "csp_report", "eventsource",
}

BLOCKED_DOMAINS = {
    "google-analytics.com", "facebook.net", "doubleclick.net",
    "hotjar.com", "mixpanel.com", "segment.io",
    # ... 20+ tracking domains
}

async def route_handler(route):
    if route.request.resource_type in BLOCKED_RESOURCES:
        await route.abort()
    elif any(d in route.request.url for d in BLOCKED_DOMAINS):
        await route.abort()
    else:
        await route.continue_()
```
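Pulling the decision out of the handler into a pure predicate makes the blocking policy unit-testable without a browser. A sketch with trimmed block lists (`should_block` is our name, not a Playwright API):

```python
BLOCKED_RESOURCES = {"image", "font", "media"}
BLOCKED_DOMAINS = {"google-analytics.com", "doubleclick.net"}

def should_block(resource_type: str, url: str) -> bool:
    """True if the request is dead weight for text extraction."""
    return resource_type in BLOCKED_RESOURCES or any(
        d in url for d in BLOCKED_DOMAINS
    )
```

The route handler then collapses to a single branch: abort when `should_block(...)` is true, otherwise continue.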
This cuts total page weight by 40-60% on most pages, which means faster renders and less RAM.
Lesson 5: Cache Renders, Not Requests
If two users extract data from the same URL within 5 minutes, the page probably hasn't changed. We cache the rendered HTML with a TTL:
```python
import time
from collections import OrderedDict

class RenderCache:
    def __init__(self, max_size=50, default_ttl=300):
        self.cache = OrderedDict()
        self.max_size = max_size
        self.default_ttl = default_ttl

    def set(self, url, html, ttl=None):
        if len(self.cache) >= self.max_size:
            self.cache.popitem(last=False)  # evict the oldest entry
        self.cache[url] = {"html": html, "cached_at": time.time(),
                           "ttl": ttl or self.default_ttl}

    def get(self, url):
        if url in self.cache:
            entry = self.cache[url]
            if time.time() - entry["cached_at"] < entry["ttl"]:
                return entry
            del self.cache[url]  # expired — evict lazily on read
        return None
```
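The expire-on-read pattern can be exercised without a browser; a self-contained sketch using a plain dict and the same entry layout:

```python
import time

store = {}

def put(url: str, html: str, ttl: float = 300.0):
    store[url] = {"html": html, "cached_at": time.time(), "ttl": ttl}

def get(url: str):
    entry = store.get(url)
    if entry is None:
        return None
    if time.time() - entry["cached_at"] >= entry["ttl"]:
        del store[url]  # expired — evict lazily on read
        return None
    return entry["html"]
```

Lazy eviction keeps the hot path simple: nothing is ever scanned or swept, stale entries just disappear the next time someone asks for them.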
Cache hits return in effectively zero time. For an API that charges per request, this saves users money and makes responses instant.
The Architecture
Final structure — 6 modules, each with a single job:
playwright-service/
├── server.py # FastAPI orchestration, browser lifecycle
├── fingerprint.py # UA/viewport/locale rotation
├── smart_wait.py # Content stability + network idle detection
├── site_detect.py # Static vs SPA classification
├── cache.py # LRU render cache with TTL
└── stealth.py # Resource blocking + headless detection evasion
Each module is ~100 lines. Easy to test, easy to modify, easy to explain to new contributors.
What We Learned
Don't reach for the browser first. Most pages are server-rendered. Direct HTTP is 10x faster and 100x cheaper.
Wait smarter, not longer. Detecting when content has actually loaded saves seconds per request.
Be a moving target. Rotating fingerprints and blocking trackers keeps you under the radar.
Cache aggressively. Web pages don't change every second. A 5-minute render cache saves users money and makes your API feel fast.
Build modules, not monoliths. Each piece of the pipeline has its own concerns. Keep them separate.
The Playwright browser engine is the oven. Everything around it — the routing, the waiting, the caching, the stealth — is the recipe. That's where the actual engineering lives.
We're Haunt API — web extraction built for AI agents. If you're building with Claude, Cursor, or any AI assistant, our MCP server gives your agent the ability to extract data from any URL in one line.