In web scraping, developers usually find themselves at a crossroads: do you choose the raw speed of static HTML parsing or the heavy-duty reliability of browser automation?
If you use Cheerio, your scrapers are lightning-fast and consume minimal CPU, but they crumble the moment a site serves a JavaScript-heavy layout or triggers a security challenge. If you use Playwright, you can bypass almost any client-side hurdle, but infrastructure costs skyrocket and execution time increases significantly.
The most sophisticated scraping pipelines don't choose between them—they use both. By implementing a Hybrid Fallback Strategy, you can attempt the "fast path" first and only spin up a browser when necessary.
The Core Trade-off: Speed vs. Reliability
Before building the hybrid system, let's look at the two components we'll merge, based on the Ebay.com-Scrapers repository.
| Feature | Cheerio + Axios (The Sprinter) | Playwright (The Heavy Lifter) |
|---|---|---|
| Execution Speed | Extremely Fast (< 1s) | Slow (5s - 15s+) |
| Resource Usage | Low CPU/RAM | High (Browser overhead) |
| JS Rendering | No | Yes |
| Bot Detection Resistance | Vulnerable | Strong (with Stealth) |
| Cost | Cheap | Expensive |
The static approach relies on a simple GET request, which works for the majority of eBay pages. However, for the pages where content is injected via React or protected by anti-bot walls, we need the logic found in the Playwright implementation.
Phase 1: The Components
We'll start by modularizing the code from the repository into two distinct functions.
The Sprinter: Cheerio + Axios
This function uses axios to fetch HTML and cheerio to parse it. It is optimized for speed with a strict timeout.
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeWithCheerio(url) {
  const response = await axios.get(url, {
    timeout: 5000,
    headers: { 'User-Agent': 'Mozilla/5.0...' }
  });
  const $ = cheerio.load(response.data);
  return {
    name: $(".x-item-title__mainTitle").text().trim(),
    price: parseFloat($(".x-price-primary").text().replace(/[^0-9.]/g, "")) || 0,
    source: 'cheerio'
  };
}
```
The Heavy Lifter: Playwright + Stealth
This function launches a headless Chromium instance. We use playwright-extra with the StealthPlugin to mimic real human behavior.
```javascript
const { chromium } = require('playwright-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

chromium.use(StealthPlugin());

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
    const name = await page.locator("h1.x-item-title__mainTitle").innerText();
    const priceText = await page.locator(".x-price-primary").innerText();
    return {
      name: name.trim(),
      price: parseFloat(priceText.replace(/[^0-9.]/g, "")) || 0,
      source: 'playwright'
    };
  } finally {
    await browser.close();
  }
}
```
Phase 2: Defining Failure
A failure in web scraping isn't always an HTTP 403 or 500 error. Often, the request succeeds with an HTTP 200, but the data is missing because the site served a CAPTCHA or a skeleton screen that requires JavaScript.
We need a validation function to decide if the Cheerio attempt was successful or if we need to trigger the fallback.
```javascript
function isValid(data) {
  // Guard against a missing name so .includes() never throws
  const name = data?.name || '';

  // Check if critical fields are present and valid
  const hasName = name.length > 5;
  const hasPrice = data?.price > 0;

  // If eBay serves a "Security Challenge" page, the title will
  // likely be missing or contain "Verify" / "Security"
  const isNotBlocked = !name.includes("Verify") && !name.includes("Security");

  return hasName && hasPrice && isNotBlocked;
}
```
Phase 3: Implementing the Hybrid Controller
The controller manages the lifecycle of the request. It attempts the fast path first and escalates to the browser path only if validation fails or a network error occurs.
```javascript
async function hybridScrape(url) {
  let result = null;

  // Step 1: Attempt the Fast Scrape
  try {
    console.log(`[1/2] Attempting Cheerio scrape: ${url}`);
    result = await scrapeWithCheerio(url);
    if (isValid(result)) {
      console.log("Success: Data extracted via Cheerio.");
      return result;
    }
    console.warn("Soft Fail: Cheerio returned empty or blocked data.");
  } catch (error) {
    console.error(`Hard Fail: Cheerio network error: ${error.message}`);
  }

  // Step 2: Fallback to Playwright
  console.log(`[2/2] Escalating to Playwright browser automation...`);
  try {
    result = await scrapeWithPlaywright(url);
    return result;
  } catch (error) {
    console.error("Critical Fail: Both methods failed.");
    throw error;
  }
}
```
This "Waterfall" approach ensures that for most requests, you use the fewest resources possible.
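When scraping many URLs, it also helps to cap how many fallbacks can run at once, since each Playwright escalation launches a full browser. A minimal batching helper (the `runBatches` name and batch size are illustrative, not part of the repository):

```javascript
// Process items in fixed-size batches so that, even if every URL
// falls back to Playwright, at most `batchSize` browsers run at once.
async function runBatches(items, batchSize, worker) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}

// Usage: const data = await runBatches(urls, 3, hybridScrape);
```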
Phase 4: Optimization and Proxy Strategy
To get the most out of this strategy, vary your proxy approach based on the method:
- Cheerio Path: Use cheaper datacenter proxies. Since this method targets low-security pages, you can reduce costs here.
- Playwright Path: Use premium residential proxies. If you've reached the fallback stage, the site is likely detecting your scraper. Using ScrapeOps residential proxies here maximizes the success rate.
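For the fast path, axios accepts a `proxy` option directly in its request config. A sketch of a datacenter-proxy setup; the host, port, and credentials below are placeholders, not real endpoints:

```javascript
// Builds the axios request config for the Cheerio path.
// The proxy host, port, and credentials are placeholders --
// substitute your datacenter proxy provider's values.
function buildCheerioRequestConfig(timeoutMs = 5000) {
  return {
    timeout: timeoutMs,
    headers: { 'User-Agent': 'Mozilla/5.0...' },
    proxy: {
      protocol: 'http',
      host: 'datacenter-proxy.example.com', // placeholder
      port: 8080,                           // placeholder
      auth: { username: 'user', password: 'YOUR_API_KEY' }
    }
  };
}

// Usage: await axios.get(url, buildCheerioRequestConfig());
```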
You can modify the scrapeWithPlaywright function to include a proxy configuration:
```javascript
const PROXY_CONFIG = {
  server: 'http://residential-proxy.scrapeops.io:8181',
  username: 'scrapeops',
  password: 'YOUR_API_KEY'
};

// Pass this to chromium.launch({ proxy: PROXY_CONFIG })
```
Monitoring Performance
It is vital to log which method succeeds. If the Playwright success rate jumps from 10% to 90%, the target website has likely updated its layout or bot protection. In that case, you should update your Cheerio selectors or accept that the site now requires a browser by default.
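A minimal in-memory counter is enough to spot this drift (function names here are illustrative; in production you would export these numbers to your metrics system):

```javascript
// Track which path produced each successful result so the
// fallback rate can be monitored over time.
const stats = { cheerio: 0, playwright: 0 };

function recordSuccess(source) {
  if (source in stats) stats[source] += 1;
}

function fallbackRate() {
  const total = stats.cheerio + stats.playwright;
  return total === 0 ? 0 : stats.playwright / total;
}

// Usage: after each hybridScrape() call, recordSuccess(result.source);
// alert if fallbackRate() climbs well above its historical baseline.
```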
To Wrap Up
The hybrid fallback strategy transforms a brittle scraping script into a production-grade data pipeline. By combining the speed of Cheerio with the reliability of Playwright, you balance infrastructure costs with data integrity.
Key Takeaways:
- Attempt Fast First: Try the lightweight HTTP request before launching a heavy browser.
- Validate Data, Not Just Status: An HTTP 200 is not a success if the data fields are empty.
- Escalate Intelligently: Use premium resources like residential proxies only when standard methods fail.
- Standardize Output: Ensure both extraction methods return the same JSON schema so your database remains consistent regardless of the source.
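The last point can be enforced by passing both paths' raw output through one small normalizer. A sketch (the `scrapedAt` field is an optional addition, not part of the repository's schema):

```javascript
// Coerce raw results from either extraction path into one canonical shape,
// so downstream storage never sees source-specific quirks.
function normalizeItem(raw, source) {
  return {
    name: (raw.name || '').trim(),
    price: Number.isFinite(raw.price) ? raw.price : 0,
    source,
    scrapedAt: new Date().toISOString() // optional audit field
  };
}
```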
To start implementing these patterns, clone the Ebay.com-Scrapers repository and merge the logic from the cheerio-axios and playwright directories into your own hybrid controller.