Did you know that Playwright allows you to block requests and thus speed up your scraping or testing? You can block certain resource types like images, any requests by domain, or many different ways.
Prerequisites
For the code to work, you will need python3 installed. Some systems have it pre-installed. After that, install Playwright and the browser binaries for Chromium, Firefox, and WebKit.
pip install playwright
playwright install
Intro to Playwright
Playwright "is a Python library to automate Chromium, Firefox, and WebKit browsers with a single API." It allows us to browse the Internet with a headless browser programmatically.
Playwright is also available for Node.js, and everything shown below can be done with a similar syntax. Check the docs for more details.
Here's how to start a browser (i.e., Chromium) in a few lines, navigate to a page, and get its title.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page()
page.goto("https://www.zenrows.com/")
print(page.title())
# Web Scraping API & Data Extraction - ZenRows
page.context.close()
browser.close()
Loggin Network Events
Subscribe to events such as request or response and log their content to see what is happening. Since we did not tell Playwright otherwise, it will load the entire page: HTML, CSS, execute Javascript, get images, and so on. Add these two lines before requesting the page to see what's going on.
page.on("request", lambda request: print(
">>", request.method, request.url,
request.resource_type))
page.on("response", lambda response: print(
"<<", response.status, response.url))
page.goto("https://www.zenrows.com/")
# >> GET https://www.zenrows.com/ document
# << 200 https://www.zenrows.com/
# >> GET https://cdn.zenrows.com/images_dash/logo-instagram.svg image
# << 200 https://cdn.zenrows.com/images_dash/logo-instagram.svg
The entire output is almost 50 lines long, with 24 resource requests. We probably don't need most of those for website scraping, so we will see how to block them and save time and bandwidth.
Blocking Resources
Why load resources and content that we won't use? Learn how to avoid unnecessary data and network requests with these techniques.
Block by Glob Pattern
page
also exposes a method route
that will execute a handler for each matching route or pattern. Let's say that we don't want SVGs to load. Using a pattern like "**/*.svg"
will match requests ending with that extension. As for the handler, we need no logic for the moment, only to abort the request. For that, we'll use a lambda and the route
param's abort method.
page.route("**/*.jpg", lambda route: route.abort())
page.goto("https://www.zenrows.com/")
Note: according to the official documentation, patterns like "**/*.{png,jpg,jpeg}"
should work, but we found otherwise. Anyway, it's doable with the next blocking strategy.
Block by Regex
If (for some reason ๐) you are into Regex, feel free to use them. But compiling them first is mandatory. We will block three image extensions in this case. Regex are tricky, but they offer a ton of flexibility.
import re
# ...
page.route(re.compile(r"\.(jpg|png|svg)$"),
lambda route: route.abort())
page.goto("https://www.zenrows.com/")
Now there are 23 requests and only 15 responses. We saved 8 images from being downloaded!
Block by Resource Type
But what happens if they use "jpeg" extension instead of "jpg"? Or avif, gif, webp? Should we maintain an updated list?
Luckily for us, the route
param exposed in the lambda function above includes the original request and resource type. And one of those types is image
, perfect! You can access the whole resource type list.
We'll now match every request ("**/*"
) and add conditional logic to the lambda function. In case it is an image, abort the request as before. Else, continue with it as usual.
page.route("**/*", lambda route: route.abort()
if route.request.resource_type == "image"
else route.continue_()
)
page.goto("https://www.zenrows.com/")
Take into consideration that some trackers use images. It is probably not a big deal when scraping or testing, but just in case.
Function handler
We can also define functions for the handlers instead of using lambdas. That comes in handy in case we need to reuse it or it grows past a single conditional.
Suppose that we want to block aggressively now. Looking at the output from the previous runs will show a list of the used resources. We'll add those to a list and then check if the type is in that list.
excluded_resource_types = ["stylesheet", "script", "image", "font"]
def block_aggressively(route):
if (route.request.resource_type in excluded_resource_types):
route.abort()
else:
route.continue_()
# ...
page.route("**/*", block_aggressively)
page.goto("https://www.zenrows.com/")
We are entirely in control now, and the versatility is absolute. From routes.request
, the original URL, the headers, and several other info are available.
Being even more strict: block everything that is not document
type. That will effectively prevent anything but the initial HTML from being loaded.
def block_aggressively(route):
if (route.request.resource_type != "document"):
route.abort()
else:
route.continue_()
There is a single response now! We got the HTML without any other resource being downloaded. We sure have saved a lot of time and bandwidth, right? But... how much exactly?
Measuring Performance Boost
We can only say that it got better if we could measure the differences. We will take a look at three approaches. But just running a script with several URLs will do. Spoiler: we did that for 10 URLs, 1.3 seconds VS 8.4.
HAR Files
For those of you used to checking the DevTools Network tab, we have good news! Playwright allows HAR recording by providing an extra parameter in the new_page
method. As easy as that.
page = browser.new_page(record_har_path="playwright_test.har")
page.goto("https://www.zenrows.com/")
There are some HAR visualizers out there, but the easiest way is to use Chrome DevTools. Open the Network tab and click on the import button or drag&drop the HAR file.
Check time! Below is the comparison between two different HAR files. The first one is without blocking (regular navigation). The second one is blocking everything except for the initial document.
Almost every resource has a "-1" Status and "Pending" Time on the blocking side. That's DevTools's way of telling us that those were blocked and not downloaded. We can see clearly on the bottom left that we performed fewer requests, and the transferred data amount is a fraction of the original! From 524kB to 17.4kB, a 96% cut.
Browser's Performance API
The browsers offer an interface to check the performance that shows how it went for things like timing. Playwright can evaluate Javascript, so we'll use it to print those results.
The output will be a JSON object with a lot of timestamps. The most straightforward check is to get the difference between navigationStart
and loadEventEnd
. When blocking, it should be under half a second (i.e., 346ms); for regular navigation, above a second or even two (i.e., 1363ms).
page.goto("https://www.zenrows.com/")
print(page.evaluate("JSON.stringify(window.performance)"))
# {"timing":{"connectStart":1632902378272,"navigationStart":1632902378244, ...
As you can see, blocking can be a second faster, even more for slower sites. The less you download, the faster you can scrape!
CDP Session
Going a step further, we connect directly with Chrome DevTools Protocol. Playwright creates a CDP Session for us to extract, for example, performance metrics.
We have to create a client from the page context and start communication with CDP. In our case, we will enable "Performance" before visiting the page and getting the metrics after it.
The output will be a JSON-like string with interesting values such as nodes, process time, JS Heap used, and many more.
client = page.context.new_cdp_session(page)
client.send("Performance.enable")
page.goto("https://www.zenrows.com/")
print(client.send("Performance.getMetrics"))
Conclusion
We'd like you to part with three main points:
- Load only needed resources.
- Save time and bandwidth when possible.
- Measure your efforts and performance before scaling up.
Let's not forget that website scraping is a process with multiple steps, one of them being rotating proxies. Those add processing time and, sometimes, charge per bandwidth.
You can achieve precisely the same results while saving time/bandwidth/money. Many times, images or CSS are only overhead. Some others, no need for JS if you only want the initial static content.
Thanks for reading! Did you find the content helpful? Please, spread the word and share it. ๐
Originally published at https://www.zenrows.com
Top comments (0)