mariatanbobo

Posted on May 30

I Tested Every Web Scraping Tool Against Lazada — Here's What Actually Works (May 2026)

#python #webscraping #ai #devops

I came across Scrapling through a recommendation on X and decided to put it through its paces — not against a demo page, but against Lazada Singapore, a production site with Google reCAPTCHA and a custom slider verification. The setup: a single 4GB VPS, no residential proxies, no credits, just open-source tools.

Here's the full journey: installation pitfalls, wiring it into an AI agent, choosing the right browser for the job, and the real-world benchmarks that followed.

What Is Scrapling?

Scrapling is an adaptive web scraping framework for Python (BSD-3, v0.4.8). It handles everything from single HTTP requests to full-scale concurrent crawls. What sets it apart from the BeautifulSoup/Scrapy world:

Adaptive element tracking — saves fingerprints of targeted elements and relocates them after site redesigns using similarity scoring. Your scrapers survive CSS changes without maintenance.
Three fetchers, one API — HTTP (Fetcher, curl_cffi), browser (DynamicFetcher, Playwright Chromium), and stealth (StealthyFetcher, Chromium + anti-bot patches). Swap with one line.
Spider framework — Scrapy-like API with async, concurrent crawling, Ctrl+C pause/resume via checkpoint persistence, multi-session support.
MCP server — 14 tools exposed natively for AI coding agents. Your agent can call mcp_scrapling_get, mcp_scrapling_fetch, mcp_scrapling_stealthy_fetch directly.

It's open source, pip-installable, and designed to be the backbone of a scraping stack — not just another tool in the toolbox.

Installation on a 4GB VPS

This is where the real story starts. The VPS has 4GB RAM, 2 vCPUs, 77GB disk, and runs an AI agent gateway (615MB baseline). Every browser installation decision matters.

What we installed

pip install "scrapling[fetchers,ai]" # HTTP + Chromium + MCP server (quoted so shells like zsh don't glob the brackets)
scrapling install                     # Downloads Playwright browsers

This pulls in Playwright Chromium, Firefox, and WebKit (~1.3GB disk), plus curl_cffi for HTTP requests and patchright (Playwright fork) for browser automation.

What we deliberately skipped (at first)

Camoufox. Every discussion about Scrapling mentions a GitHub thread where someone's VPS hit 1.4GB of RAM running Camoufox. That was enough to scare me off — on a 4GB machine, 1.4GB for one browser is a non-starter. So we skipped it and let Scrapling's StealthyFetcher fall back to Chromium.

Turns out this was the wrong call. More on that later.

First test

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/', timeout=15)
quotes = page.css('.quote .text::text').getall()
# 0.88s, 200 OK, 10 quotes parsed
# Memory: 56MB RSS

Clean. Fast. No browser needed. The HTTP fetcher uses curl_cffi with TLS fingerprint impersonation: it looks like Chrome to the server but costs almost nothing in RAM compared with a real browser.

Wiring into an AI Agent

Scrapling ships with a built-in MCP (Model Context Protocol) server. Start it with scrapling mcp and your AI coding agent gets 14 native tools, including:

Tool	What it does
`get` / `bulk_get`	HTTP fetch with CSS selector extraction
`fetch` / `bulk_fetch`	Browser fetch with JS rendering
`stealthy_fetch` / `bulk_stealthy_fetch`	Anti-bot browser fetch
`open_session` / `close_session` / `list_sessions`	Persistent browser management
`screenshot`	Full-page PNG/JPEG capture

The key advantage: CSS selector support means the agent extracts only relevant elements instead of dumping entire pages into context. Token savings compound fast.

Session management is critical

The MCP server's session tools aren't optional — they're the difference between stable and catastrophic:

# ❌ Don't do this in a loop (Python API)
for url in urls:
    page = StealthyFetcher.fetch(url)  # New browser every time

# ✅ Do this instead (MCP tool calls, written as pseudocode)
session_id = open_session(type="dynamic")
try:
    for url in urls:
        page = fetch(url, session_id=session_id)  # Reuses same browser
finally:
    close_session(session_id)  # Always release the browser, even after an error

One browser, reused. Without sessions, each one-shot fetch spawns a new Chromium process. After 5+ calls, memory pressure spikes. After 20+, you're in OOM territory.
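
The reuse pattern can be sketched without any scraping library. This is a minimal illustration, not Scrapling's API: `FakeBrowser` and `browser_session` are hypothetical stand-ins that only count launches. They show why one session for N URLs means one browser process, and why the close belongs in a `finally` block.

```python
from contextlib import contextmanager


class FakeBrowser:
    """Hypothetical stand-in for a real browser; it only counts launches."""

    launches = 0  # total browsers started
    open_now = 0  # browsers currently running

    def __init__(self):
        FakeBrowser.launches += 1
        FakeBrowser.open_now += 1

    def fetch(self, url):
        return f"<html>{url}</html>"

    def close(self):
        FakeBrowser.open_now -= 1


@contextmanager
def browser_session():
    """Start one browser and guarantee it is closed, even on errors."""
    browser = FakeBrowser()
    try:
        yield browser
    finally:
        browser.close()


def fetch_all(urls):
    """Fetch every URL through a single shared browser."""
    with browser_session() as browser:
        return [browser.fetch(url) for url in urls]


pages = fetch_all([f"https://example.com/page/{n}" for n in range(20)])
print(len(pages), FakeBrowser.launches, FakeBrowser.open_now)
```

Twenty fetches go through one launch, and nothing is left running afterwards. The one-shot loop would instead create twenty browsers and rely on each being cleaned up in time.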

Browser Selection — The Three-Tier Architecture

Scrapling's three fetchers form a natural escalation ladder:

Tier	Fetcher	Engine	Best for
1	`Fetcher`	curl_cffi (HTTP)	Static pages, APIs
2	`DynamicFetcher`	Playwright Chromium	JS-rendered SPAs
3	`StealthyFetcher`	Chromium + anti-bot patches	Cloudflare, bot detection

Same API across all three. Same CSS selectors. Same response object. You're not choosing between different libraries — you're choosing how much overhead to pay.

But the real question is: do you need a browser at all? Let's benchmark.

Speed (4 sites, 3 runs each, averaged)

Fetcher	Avg Speed	vs Fastest
`Fetcher` (HTTP)	0.77s	1×
`DynamicFetcher` (Chromium)	3.66s	4.8×
`StealthyFetcher`	~4s	5.2×

The HTTP fetcher is absurdly fast. Browser-based tools add roughly 3 seconds of overhead per page. That gap compounds: at ~4s per browser fetch, 10 pages is 7.7s vs 40s, and 100 pages is 77s vs about 6.7 minutes (400s).
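
To make the compounding concrete, here is a small sketch that projects total fetch time from the per-page averages in the speed table. It assumes pages are fetched one after another with no concurrency, and `projected_seconds` is an illustrative helper, not part of Scrapling.

```python
# Per-page averages from the speed table, in seconds.
AVG_SECONDS = {
    "Fetcher": 0.77,
    "DynamicFetcher": 3.66,
    "StealthyFetcher": 4.0,  # the table reports this only as roughly 4s
}


def projected_seconds(fetcher: str, pages: int) -> float:
    """Total sequential fetch time for `pages` pages, in seconds."""
    return AVG_SECONDS[fetcher] * pages


for pages in (10, 100):
    row = ", ".join(
        f"{name}: {projected_seconds(name, pages):.0f}s" for name in AVG_SECONDS
    )
    print(f"{pages} pages -> {row}")
```

The relationship is linear: the per-page gap stays the same, but the absolute difference grows with every page added to the job.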

Memory (headless, single page, measured on VPS)

Fetcher	RAM Delta
`Fetcher` (HTTP)	~0 MB
`StealthyFetcher`	+120 MB
`DynamicFetcher`	+180 MB

The rule is simple: start at tier 1 and only escalate when proven necessary. If the page is static, you don't need a browser. If it's JS-rendered, you don't need stealth. If it has anti-bot, you don't need a different IP. Prove each escalation before taking it.

The Camoufox Plot Twist

Remember how I skipped Camoufox because of that 1.4GB horror story? After getting the stack running, I decided to test it properly.

pip install camoufox
python -m camoufox fetch  # Downloads the browser binary (~713MB)

Camoufox is actually the lightest browser. Measured on our VPS:

Browser	RAM (headless)	Stealth Level
Camoufox (Firefox)	81 MB	C++-level
Scrapling StealthyFetcher (Chromium)	120 MB	JS-patched
Scrapling DynamicFetcher (Chromium)	180 MB	None

The 1.4GB from that GitHub thread was user error — spawning a fresh browser per request without closing old ones. Same thing happens with any browser. Camoufox is a debloated Firefox fork: telemetry stripped, Mozilla services removed, navigator.webdriver genuinely absent at the C++ level.

But there's a catch: Scrapling's StealthyFetcher uses patchright (a Playwright Chromium fork) and does NOT auto-detect Camoufox. They don't integrate at the browser level because Playwright's Firefox protocol differs from Chromium's.

The workaround is straightforward:

from camoufox import Camoufox
from scrapling import Selector

# Camoufox: stealth browsing with Firefox fingerprint (81MB)
with Camoufox(headless=True) as browser:
    page = browser.new_page()
    page.goto('https://target.com')
    html = page.content()

# Scrapling: adaptive parsing with CSS/XPath
sel = Selector(html)
data = sel.css('.product::text').getall()

Camoufox fetches undetected. Scrapling parses with adaptive resilience. Best of both worlds — but it's slow. More on that next.

Camoufox Speed

Browser	Avg Page Load
Scrapling DynamicFetcher (Chromium)	3.66s
Camoufox (Firefox)	8.84s

Roughly 11.5× slower than the HTTP fetcher (8.84s vs 0.77s) and 2.4× slower than Chromium (8.84s vs 3.66s). Firefox on Linux appears to pay a cold-start tax. Camoufox earns its place at tier 5 in the ladder: not a replacement for Chromium, but a fallback when Chromium's fingerprint is the problem.

The Priority Ladder

All of this — the speed data, the memory measurements, the Camoufox discovery — points to one design:

Priority 1:  Fetcher (HTTP)              0.77s   ~0 MB    Static pages
   ↓ page is empty / JS-rendered?
Priority 3:  DynamicFetcher (Chromium)    3.66s   180 MB   JS-rendered SPAs
   ↓ blocked by anti-bot?
Priority 4:  StealthyFetcher (Chromium)   ~4s     120 MB   Cloudflare, basic WAF
   ↓ Chromium itself blocked?
Priority 5:  Camoufox (Firefox)           8.84s    81 MB   Firefox fingerprint
   ↓ CAPTCHA / aggressive WAF?
Priority 6:  Firecrawl enhanced proxy     ~3-5s   credits  Hard targets

Each tier costs more — time or money. Only escalate when proven necessary. The ladder is encoded as an agent skill, so every scraping task automatically starts at tier 1 and escalates on failure.
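
A minimal sketch of that escalation logic follows. It is not the actual agent skill: `run_ladder`, the `Tier` tuple, and the stub fetchers are hypothetical. The assumption is that each real fetcher gets wrapped as a plain callable returning a list of extracted items, with an empty list meaning the tier failed.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Tier(NamedTuple):
    priority: int
    name: str
    fetch: Callable[[str], List[str]]  # returns extracted items, [] on failure


def run_ladder(url: str, tiers: List[Tier]) -> Tuple[Optional[Tier], List[str]]:
    """Try tiers in priority order and stop at the first one that yields items."""
    for tier in sorted(tiers, key=lambda t: t.priority):
        try:
            items = tier.fetch(url)
        except Exception:
            items = []  # treat errors (timeouts, blocks) as a failed tier
        if items:
            return tier, items
    return None, []


calls: List[str] = []  # records which stubs actually ran


def http_stub(url: str) -> List[str]:
    calls.append("Fetcher")
    return []  # e.g. a JS-rendered page whose raw HTML has no items


def browser_stub(url: str) -> List[str]:
    calls.append("DynamicFetcher")
    return ["item-1", "item-2"]


def stealth_stub(url: str) -> List[str]:
    calls.append("StealthyFetcher")
    return ["item-1", "item-2"]


ladder = [
    Tier(1, "Fetcher", http_stub),
    Tier(3, "DynamicFetcher", browser_stub),
    Tier(4, "StealthyFetcher", stealth_stub),
]
winner, items = run_ladder("https://example.com/catalog", ladder)
print(winner.name, len(items), calls)
```

Because the loop returns on the first non-empty result, the more expensive tiers are never started unless a cheaper one has already failed.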

Real-World Test: Lazada Singapore

Lazada SG was the proving ground. Two-layer defense: Google reCAPTCHA → custom slider verification. In a previous test (early May 2026), only Lightpanda's Zig-based browser survived. Every Chromium tool got blocked.

Running the ladder:

Priority	Tool	Page 1	Page 2	Page 3	Time
1	HTTP Fetcher	❌ Empty	—	—	0.77s
3	DynamicFetcher	✅ 41 items	✅ 41 items	✅ 41 items	~3s/page
5	Camoufox	✅ 40 items	—	—	42s/page

The ladder worked exactly as designed:

Tier 1 correctly failed — Lazada is JS-rendered, raw HTML is empty. No time wasted.
Tier 3 succeeded on all 3 pages at ~3s each. No IP ban, no reCAPTCHA. That differs from the early-May test, where StealthyFetcher was banned on page 3; either Lazada relaxed its detection or DynamicFetcher's lighter fingerprint helps.
Tier 5 worked but was never needed: 42s vs 3s per page confirms it belongs near the bottom of the ladder.

The ladder saved us from jumping straight to Camoufox or paying Firecrawl credits when a simple Chromium browser handled everything.
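
One way to decide that a tier "failed" is to count the product cards in the returned HTML and escalate when there are none. The sketch below uses only the standard library. The `product-card` class name and both HTML snippets are made up for illustration and are not Lazada's real markup.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name: str):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_items(html: str, class_name: str = "product-card") -> int:
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


def should_escalate(html: str, min_items: int = 1) -> bool:
    """True when the page has fewer items than expected, e.g. an empty JS shell."""
    return count_items(html) < min_items


js_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
rendered = (
    '<html><body><div class="product-card featured">A</div>'
    '<div class="product-card">B</div></body></html>'
)

print(should_escalate(js_shell), should_escalate(rendered))
```

A check like this catches the tier 1 case in the table, where the HTTP response arrives with a 200 status but contains no items because the listing is rendered client-side.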

The Complete Stack

Priority 1:  Scrapling Fetcher (HTTP)      0.77s   $0
Priority 3:  Scrapling DynamicFetcher       3.66s   $0
Priority 4:  Scrapling StealthyFetcher      ~4s     $0
Priority 5:  Camoufox + Scrapling Selector  8.84s   $0
Priority 6:  Firecrawl enhanced proxy       ~3-5s   credits

Everything runs on a single 4GB VPS. Peak memory with one browser session: ~800MB including the AI agent gateway. 39GB free disk after cleaning stale caches and old kernels. Total scraping cost: $0.

Key Lessons

Installation is the first test. Read the docs before pip install. Know what each dependency costs in RAM. Skip what you don't need — you can always add it later.
The 1.4GB Camoufox story was user error. Spawning browsers in a loop without sessions will eat any machine. With persistent sessions, Camoufox is the lightest browser in the stack at 81MB. Don't believe benchmark threads — run your own.
Speed differences compound silently. 0.77s vs 8.84s is nothing for one page. For 100 pages, it's 77 seconds vs nearly 15 minutes. The gap grows linearly with page count, so choosing the right tier pays off more with every page you add.
Fingerprint diversity is a superpower. Having both Chromium and Firefox in your arsenal means you can bypass sites that target either. Camoufox is slow but it's a different shape entirely — and sometimes that's all you need.
Wire the ladder, not the tools. Individual tools leave you guessing. A priority ladder gives you a protocol: start cheap, escalate on failure. Encode it as an agent skill and you never have to think about it again.
Scrapling is the platform, not just a fetcher. Adaptive element tracking, three-tier architecture, spider framework with pause/resume, MCP server for AI agents — it's the foundation everything else plugs into. The benchmarks measure its fetchers, but the framework is what makes them interchangeable.

Questions? Find me on X @mariatanbobo

DEV Community