DEV Community

John Rooney for Extract by Zyte

Posted on • Originally published at zyte.com

Hybrid scraping: The architecture for the modern web

If you scrape the modern web, you probably know the pain of the JavaScript challenge.

Before you can access any data, the website forces your browser to execute a snippet of JavaScript code. It calculates a result, sends it back to an endpoint for verification, and often captures extensive fingerprinting data in the process.

(Image: the browser running its checks before granting access)

Once you pass this test, the server assigns you a session cookie. This cookie acts as your "access pass." It tells the website, "This user has passed the challenge," so you don’t have to re-run the JavaScript test on every single page load.
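The challenge-then-cookie flow can be simulated in plain Python. This is a toy sketch of the protocol, not any site's real challenge (which would be obfuscated JavaScript): the "server" issues a nonce, the "client" computes the expected result, and a session token is issued that unlocks subsequent requests.

```python
import hashlib
import secrets

# --- Toy "server" side (illustrative only) ---
SESSIONS = set()

def issue_challenge() -> str:
    """Server sends a random nonce for the client to work on."""
    return secrets.token_hex(8)

def verify_and_issue_cookie(nonce: str, proof: str):
    """If the proof checks out, hand back a session token (the 'access pass')."""
    expected = hashlib.sha256(nonce.encode()).hexdigest()
    if proof != expected:
        return None
    token = secrets.token_hex(16)
    SESSIONS.add(token)
    return token

def serve_page(cookie) -> int:
    """Subsequent requests succeed only if the session cookie is known."""
    return 200 if cookie in SESSIONS else 403

# --- Client side ---
nonce = issue_challenge()
proof = hashlib.sha256(nonce.encode()).hexdigest()  # the "JS challenge" result
cookie = verify_and_issue_cookie(nonce, proof)

print(serve_page(None))    # no cookie: blocked (403)
print(serve_page(cookie))  # with cookie: allowed (200)
```

The key point is the last two lines: once the token exists, every later request rides on it without re-running the challenge.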

(Image: DevTools showing the session token in storage)

For web scrapers, this mechanism creates a massive inefficiency.

It looks like you are forced to use a headless browser (like Puppeteer or Playwright) for every single request just to handle that initial check. But browsers are heavy: they are slow, and they consume massive amounts of RAM and bandwidth.

Running a browser for thousands of requests can quickly become an infrastructure nightmare. You end up paying for CPU cycles just to render a page when all you wanted was the JSON payload.

The solution: Hybrid scraping

The answer to this problem is a technique I’ve started calling hybrid scraping.

This involves using the browser only to open the initial request, pass the check, and create a session. Once that is done, you extract the session cookies and hand them over to a standard, lightweight HTTP client.

This architecture gives you the access of a browser with the speed and efficiency of a script.

Implementing this in Python

To build this in Python, we need two specific packages:

  1. A browser: We will use ZenDriver, a modern wrapper for headless Chrome that handles the "undetected" configuration for us.
  2. HTTP client: We will use rnet, a Rust-based HTTP client for Python.

But why rnet? Well, during the initial TLS handshake, the Client Hello message exposes information that can be fingerprinted, such as the TLS version and the cipher suites available for encryption. This information can be hashed into a fingerprint and profiled.

Python’s requests package, which is built on urllib3, has a very distinctive TLS fingerprint, containing cipher suites (amongst other things) that aren’t seen in a browser. This makes it very easy to spot. Both rnet and other options such as curl-cffi are able to send a TLS fingerprint similar to that of a browser. This reduces the chances of our request being blocked.
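To make "hashed into a fingerprint" concrete, here is a sketch in the style of the well-known JA3 scheme: five Client Hello fields are joined into a delimited string and MD5-hashed. The numeric field values below are made-up samples for illustration, not values captured from a real browser.

```python
import hashlib

def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style fingerprint: five Client Hello fields joined
    with commas (each list joined with dashes), then MD5-hashed."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Two clients offering different cipher lists produce different hashes --
# which is exactly how a plain Python client stands out from a browser.
fp_browser_like = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
fp_plain_client = ja3_hash(771, [4865, 4866], [0, 23, 65281], [29, 23, 24], [0])
print(fp_browser_like != fp_plain_client)  # True
```

Emulation layers like rnet's `Emulation.Chrome142` work by making the Client Hello (and therefore this hash) match what a real Chrome build sends.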

Here is how we assemble the pipeline.

Step 1: Load the page (The handshake)

First, we define our browser logic. Notice that we are not trying to parse HTML here. Our only goal is to visit the site, pass the initial JavaScript challenge, and extract the session cookies.

import zendriver as zd
import asyncio

async def get_cookies():
    """
    Use ZenDriver to launch a browser, navigate to the page, 
    and retrieve the cookies.
    """

    browser = await zd.start()

    # Hit the homepage to trigger the check
    await browser.get("https://auto.hylnd7.com")

    # Wait briefly for the JS challenge to complete
    await asyncio.sleep(1) 

    # Extract the cookies
    requests_style_cookies = await browser.cookies.get_all()
    await browser.stop()

    return requests_style_cookies

What’s happening here:

We launch the browser, visit the site, and wait just one second for the JS challenge to run. Once we have the cookies, we call browser.stop(). This is the most important line: we do not want a browser instance wasting resources when we don’t need it.

Step 2: Use the cookies

Now that we have the "access pass," we can switch to our lightweight HTTP client. We take those cookies and inject them into the rnet client headers.

from rnet import Client, Emulation

async def http_request_rnet(cookies=None):
    """
    Make a fast request using RNet with the borrowed cookies.
    """
    headers = {
        "referer": "https://auto.hylnd7.com/",
    }

    # Format the browser cookies into a simple HTTP header string
    if cookies:
        cookie_list = []
        for cookie in cookies:
            cookie_list.append(f"{cookie.name}={cookie.value}")
        headers["Cookie"] = "; ".join(cookie_list)

    # We use Emulation.Chrome142 to change the TLS Fingerprint.
    # This is site dependent - but worth using
    client = Client(emulation=Emulation.Chrome142, headers=headers)

    response = await client.get("https://auto.hylnd7.com/api/products?page=1&limit=8")
    return response

What’s happening here:

We convert the browser's cookie format into a standard header string. Note the “Emulation.Chrome142” parameter. We are layering two techniques here: hybrid scraping (reusing real browser cookies) and TLS fingerprint emulation (via a modern HTTP client). This double-layer approach covers all our bases.

(Note: Many HTTP clients have a cookie jar that you could also use; for this example, sending the header directly worked perfectly).
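If you are unsure whether your browser library hands back cookies as dicts or as objects (the exact shape can vary between zendriver versions), a small defensive helper keeps the header-building logic in one place. This is an illustrative utility, not part of either library's API:

```python
def cookies_to_header(cookies) -> str:
    """Join browser cookies into a single Cookie header value,
    accepting either dict-style or attribute-style cookie objects."""
    pairs = []
    for c in cookies:
        if isinstance(c, dict):
            name, value = c["name"], c["value"]
        else:
            name, value = c.name, c.value
        pairs.append(f"{name}={value}")
    return "; ".join(pairs)

print(cookies_to_header([{"name": "session", "value": "abc123"},
                         {"name": "cf_clearance", "value": "xyz"}]))
# session=abc123; cf_clearance=xyz
```

You would then set `headers["Cookie"] = cookies_to_header(cookies)` instead of building the list inline.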

Step 3: Run the code

Finally, we tie it together. For this demo, we use a simple argparse flag to show the difference with and without the cookie.

async def main(use_cookies: bool):
    cookies = None

    # The Decision Logic
    if use_cookies:
        cookies = await get_cookies() # Run the heavy browser

    # Always run the fast HTTP client
    resp = await http_request_rnet(cookies)

    status_code = resp.status
    print("Status Code:", status_code)

    if status_code == 200:
        print("Response Body:", await resp.json())
    else:
        print("Request blocked")

Get the complete script

Want to run this yourself? We’ve put the full, copy-pasteable script (including the argument parsers and imports) in the block below.

uv init
uv add zendriver rnet rich
# linux/mac
source .venv/bin/activate
# windows
.venv\Scripts\activate
import argparse
import asyncio

import zendriver as zd
from rnet import Client, Emulation
from rich import print


async def http_request_rnet(cookies=None):
    """
    Make an HTTP GET request using rnet with the provided cookies.
    Cookies are sent in the headers; note that this site needs the referer too.
    Return the Response object.
    """
    headers = {
        "referer": "https://auto.hylnd7.com/",
    }

    if cookies:
        cookie_list = []
        for cookie in cookies:
            # Adjust based on the actual structure of the cookie object from zendriver
            # If it's a dict: cookie['name'], cookie['value']
            # If it's an object: cookie.name, cookie.value
            cookie_list.append(f"{cookie.name}={cookie.value}")
        headers["Cookie"] = "; ".join(cookie_list)

    client = Client(emulation=Emulation.Chrome142, headers=headers)
    response = await client.get("https://auto.hylnd7.com/api/products?page=1&limit=8")
    return response

async def get_cookies():
    """
    Use zendriver to launch a browser, navigate to a page, and retrieve cookies.
    """
    browser = await zd.start()
    await browser.get("https://auto.hylnd7.com")
    await asyncio.sleep(1)
    requests_style_cookies = await browser.cookies.get_all()
    await browser.stop()
    return requests_style_cookies

async def main(use_cookies: bool):
    cookies = None
    if use_cookies:
        cookies = await get_cookies()

    resp = await http_request_rnet(cookies)
    status_code = resp.status
    print("Status Code:", status_code)

    if status_code == 200:
        print("Response Body:", await resp.json())

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Make HTTP request with optional browser cookies")
    parser.add_argument(
        "--cookies",
        type=lambda x: x.lower() == "true",
        default=False,
        help="Set to 'true' to launch browser and get cookies, 'false' to skip (default: false)"
    )
    args = parser.parse_args()
    asyncio.run(main(args.cookies))

Pros and Cons of Hybrid Scraping

| Feature | Pros | Cons |
| --- | --- | --- |
| Efficiency | Reduces RAM usage massively compared to pure browser scraping. | Higher complexity: you must manage two libraries (zendriver and rnet) and the glue code. |
| Speed | HTTP requests complete in milliseconds; browsers take seconds. | State management: you need logic to handle cookie expiry. If the cookie dies, you must "wake up" the browser. |
| Access | You get the verification of a real browser without the drag. | Maintenance: you are debugging two points of failure: the browser's ability to solve the challenge, and the client's ability to fetch data. |
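The "wake up the browser" con can be handled with a small orchestration loop: try the cheap HTTP path first, and only fall back to the browser when the response looks blocked. The sketch below uses stand-in async callables rather than the real zendriver/rnet calls, so the control flow is the point here, not the APIs:

```python
import asyncio

async def fetch_with_refresh(fetch, refresh_cookies, cookies=None, max_refreshes=1):
    """Try the fast HTTP path; on a blocked response, refresh the
    session via the (expensive) browser and retry."""
    for attempt in range(max_refreshes + 1):
        status, body = await fetch(cookies)
        if status == 200:
            return body
        if attempt < max_refreshes:
            cookies = await refresh_cookies()  # the "wake up the browser" step
    raise RuntimeError("Still blocked after refreshing cookies")

# --- Demo with stand-ins: the first call is blocked, a refresh unlocks it ---
async def fake_fetch(cookies):
    return (200, {"ok": True}) if cookies else (403, None)

async def fake_refresh():
    return [{"name": "session", "value": "fresh"}]

result = asyncio.run(fetch_with_refresh(fake_fetch, fake_refresh))
print(result)  # {'ok': True}
```

In the real pipeline, `fetch` would wrap `http_request_rnet` and `refresh_cookies` would wrap `get_cookies`, so the browser only ever runs when the cheap path fails.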

Final thoughts

For smaller jobs, it might be easier to just use the browser; the benefits won’t necessarily outweigh the extra complexity required.

But for production pipelines, this approach is the standard. It treats the browser as a luxury resource: used only when strictly necessary to unlock the door, so the HTTP client can do the real work. It’s this session and state management that allows you to scrape harder-to-access sites effectively and efficiently.

If building this orchestration layer yourself feels like too much overhead, this is exactly what the Zyte API handles internally. We manage the browser/HTTP switching logic automatically, so you just make a single request and get the data.
