Ondra Urban for Apify

Posted on Aug 3, 2020 • Originally published at blog.apify.com

How to scrape the web with Playwright

#tutorial #playwright #automation #javascript

Playwright is a browser automation library very similar to Puppeteer. Both allow you to control a web browser with only a few lines of code. The possibilities are endless. From automating mundane tasks and testing web applications to data mining.

With Playwright you can run Firefox and Safari (WebKit), not only Chromium based browsers. It will also save you time, because Playwright automates away repetitive code, such as waiting for buttons to appear in the page.

You don’t need to be familiar with Playwright, Puppeteer or web scraping to enjoy this tutorial, but knowledge of HTML, CSS and JavaScript is expected.

In this tutorial you’ll learn how to:

Start a browser with Playwright
Click buttons and wait for actions
Extract data from a website

The Project

To showcase the basics of Playwright, we will create a simple scraper that extracts data about GitHub Topics. You’ll be able to select a topic and the scraper will return information about repositories tagged with this topic.

We will use Playwright to start a browser, open the GitHub topic page, click the Load more button to display more repositories, and then extract the following information:

Owner
Name
URL
Number of stars
Description
List of repository topics

Installation

To use Playwright you’ll need Node.js version higher than 10 and a package manager. We’ll use npm, which comes preinstalled with Node.js. You can confirm their existence on your machine by running:

node -v && npm -v

If you’re missing either Node.js or NPM, visit the to get started.

Now that we know our environment checks out, let’s create a new project and install Playwright.

mkdir playwright-scraper && cd playwright-scraper
npm init -y
npm i playwright

The first time you install Playwright, it will download browser binaries, so the installation may take a bit longer.

Building a scraper

Creating a scraper with Playwright is surprisingly easy, even if you have no previous scraping experience. If you understand JavaScript and CSS, it will be a piece of cake.

In your project folder, create a file called scraper.js (or choose any other name) and open it in your favorite code editor. First, we will confirm that Playwright is correctly installed and working by running a simple script.

Now run it using your code editor or by executing the following command in your project folder.

node scraper.js

If you saw a Chromium window open and the GitHub Topics page successfully loaded, congratulations, you just robotized your web browser with Playwright!

Loading more repositories

When you first open the topic page, the number of displayed repositories is limited to 30. You can load more by clicking the Load more… button at the bottom of the page.

There are two things we need to tell Playwright to load more repositories:

Click the Load more… button.
Wait for the repositories to load.

Clicking buttons is extremely easy with Playwright. By prefixing text= to a string you’re looking for, Playwright will find the element that includes this string and click it. It will also wait for the element to appear if it’s not rendered on the page yet.

await page.click('text=Load more');

This is a huge improvement over Puppeteer and it makes Playwright lovely to work with.

After clicking, we need to wait for the repositories to load. If we didn’t, the scraper could finish before the new repositories show up on the page and we would miss that data. page.waitForFunction() allows you to execute a function inside the browser and wait until the function returns true.

await page.waitForFunction(() => {
    const repoCards = document.querySelectorAll('article.border');
    return repoCards.length > 30;
});

To find that article.border selector, we used browser Dev Tools, which you can open in most browsers by right-clicking anywhere on the page and selecting Inspect. It means: Select the <article> tag with the border class.

Let’s plug this into our code and do a test run.

If you watch the run, you’ll see that the browser first scrolls down and clicks the Load more… button, which changes the text into Loading more. After a second or two, you’ll see the next batch of 30 repositories appear. Great job!

Extracting data

Now that we know how to load more repositories, we will extract the data we want. To do this, we’ll use the page.$$eval function. It tells the browser to find certain elements and then execute a JavaScript function with those elements.

It works like this: page.$$eval finds our repositories and executes the provided function in the browser. We get repoCards which is an Array of all the repo elements. The return value of the function becomes the return value of the
page.$$eval call. Thanks to Playwright, you can pull data out of the browser and save them to a variable in Node.js. Magic!

If you’re struggling to understand the extraction code itself, be sure to check out this guide on working with CSS selectors and this tutorial on using those selectors to find HTML elements.

And here’s the code with extraction included. When you run it, you’ll see 60 repositories with their information printed to the console.

Conclusion

In this tutorial we learned how to start a browser with Playwright, and control its actions with some of Playwright’s most useful functions: page.click() to emulate mouse clicks, page.waitForFunction() to wait for things to happen and page.$$eval() to extract data from a browser page.

But we’ve only scratched the surface of what’s possible with Playwright. You can log into websites, fill forms, intercept network communication, and most importantly, use almost any browser in existence. Where will you take this project next? How about turning it into a command-line interface (CLI) tool that takes a topic and number of repositories on input and outputs a file with the repositories? You can do it now. Happy scraping!

DEV Community