Introduction
The web today isn't what it used to be. For years, I relied on requests and BeautifulSoup to extract data from the web. It worked perfectly until it didn't. I hit a wall when websites started loading content dynamically with JavaScript. That's when I discovered Playwright.
What is Playwright?
Playwright is an open-source automation framework developed by Microsoft. It allows you to control modern browsers with simple code. Instead of just fetching static HTML, Playwright:
- Renders the full page, executing all JavaScript just like a human user.
- Interacts with elements like clicking buttons, filling forms, scrolling, and even waiting for content to load as needed.
- Works consistently across all modern browsers.
Why Use Playwright?
Here are just a few things you can do with it:
- Scrape dynamic websites that load content via JavaScript
- Write end-to-end tests for complex web applications.
- Automate repetitive tasks like form submissions or screenshots.
In this first post, I'm going to document my first steps with Playwright. We'll move from theory to practice by installing it and writing a simple script to scrape a site that would have been impossible with my old toolkit.
Getting Started with Playwright
Now that we know why Playwright matters, let's actually install it and run our first script. I'll use Python here, but Playwright also works with Node.js, Java and .NET.
Step 1: Install Playwright
First, let's install the Playwright package using pip:
pip install playwright
Once installed, we need to install the browser binaries. Playwright can control Chromium, Firefox, and WebKit (Safari):
playwright install
This command installs the necessary browsers so Playwright can control them.
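By default this downloads all three engines. If you only need one of them and want to save disk space, the installer also accepts a browser name; for example, this should fetch just Chromium:
playwright install chromium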
Challenge Encountered: Missing Host Dependencies
After running the playwright install command, I encountered a warning that initially confused me. While the browsers were successfully downloaded and placed in the cache, Playwright's host system validation noted that my Linux operating system was missing essential libraries needed for the browsers to actually run.
Here is the exact warning I received:
Playwright Host validation warning:
╔══════════════════════════════════════════════════════╗
║ Host system is missing dependencies to run browsers. ║
║ Please install them with the following command: ║
║ ║
║ sudo playwright install-deps ║
║ ║
║ Alternatively, use apt: ║
║ sudo apt-get install libwoff1\ ║
║ ... [and a long list of other libraries] ║
║ ║
╚══════════════════════════════════════════════════════╝
Resolving the Dependency Issue
The good news is that the Playwright team anticipated this and provided two clear solutions.
Solution A: Using the Recommended Playwright Command
Playwright provides a single, easy-to-use command specifically for installing these system dependencies:
sudo playwright install-deps
I attempted to run this command outside of my virtual environment, but it immediately failed:
orina@Orina:~$ sudo playwright install-deps
[sudo] password for orina:
sudo: playwright: command not found
The issue here is that the playwright command is installed within my isolated Python virtual environment (venv). Even though I was running it with sudo, my system's global PATH did not include the virtual environment's binary directory, leading to the command not found error.
To correctly use this command, I would have needed to either:
- Use the absolute path to the executable inside the venv: sudo /path/to/venv/bin/playwright install-deps
- Or run the command while inside the venv and use the -E flag with sudo to preserve environment variables: sudo -E playwright install-deps (though note that sudo often resets PATH anyway via its secure_path setting, so this isn't guaranteed to work either)
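In hindsight, there's also a middle ground: let the shell resolve the venv's executable path for you. A quick sketch, assuming the virtual environment is active so which can find the binary:
sudo "$(which playwright)" install-deps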
Solution B: Manually Installing Dependencies with apt (The Fix I Used)
Rather than debugging the path issue, the most reliable and straightforward fix was to take the list of packages provided for my Linux distribution and install them directly onto the host system using apt-get:
sudo apt-get install libwoff1 libvpx9 libevent-2.1-7t64 libgstreamer-gl1.0-0 libgstreamer-plugins-bad1.0-0 libwebpdemux2 libharfbuzz-icu0 libenchant-2-2 libsecret-1-0 libhyphen0 libmanette-0.2-0 libflite1 gstreamer1.0-libav
Step 2: Writing Our First Playwright Script
Now that we have all dependencies installed, let's create our first Playwright script. I'll break this down into manageable parts with detailed explanations.
Create a file called first_scraper.py and let's build it step by step.
Part 1: Import and Basic Setup
from playwright.sync_api import sync_playwright

def main():
    # The sync_playwright() context manager handles browser lifecycle
    with sync_playwright() as p:
Explanation
- from playwright.sync_api import sync_playwright: We import the synchronous API. Playwright also has an async API, but we're starting with synchronous for simplicity.
- with sync_playwright() as p: This context manager automatically handles starting and stopping Playwright. The p variable gives us access to the Playwright instance.
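For the curious, here's a minimal sketch of the same setup using the async API instead. We won't use it in this post; the sketch assumes Python 3.8+ for asyncio.run:

import asyncio
from playwright.async_api import async_playwright

async def main():
    # Same lifecycle management, but every Playwright call is awaited
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://webscraper.io/test-sites/e-commerce/more')
        print(await page.title())
        await browser.close()

asyncio.run(main())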
Part 2: Launching the Browser
        # Launch a Chromium browser instance
        # headless=False means we'll see the browser window
        browser = p.chromium.launch(headless=False)
Explanation
- p.chromium.launch(headless=False): This launches a Chromium browser.
- headless=False means the browser window will be visible. For production scripts, you'd set this to True to run in the background.
- You could also use p.firefox.launch() or p.webkit.launch() for other browsers.
- The browser variable represents the browser instance we'll work with.
Part 3: Creating a Page and Navigation
        # Create a new page (tab) in the browser
        page = browser.new_page()

        # Navigate to a website
        page.goto('https://webscraper.io/test-sites/e-commerce/more')
Explanation
- browser.new_page(): Creates a new tab/page in the browser. Most of our interactions will happen through this page object.
- page.goto('https://webscraper.io/test-sites/e-commerce/more'): Tells the browser to navigate to the specified URL. Playwright automatically waits for the page to load.
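By default, goto() waits for the page's load event. If a site keeps fetching data after that, you can ask Playwright to wait until the network goes quiet instead; a small sketch using the documented wait_until option:

        # 'networkidle' waits until there have been no network connections
        # for at least 500 ms before considering navigation finished
        page.goto('https://webscraper.io/test-sites/e-commerce/more',
                  wait_until='networkidle')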
Part 4: Interacting with the Page
        # Get the page title and print it
        title = page.title()
        print(f"Page title: {title}")

        # Take a screenshot of the page
        page.screenshot(path='screenshot.png')
        print("Screenshot saved as 'screenshot.png'")
Explanation
- page.title(): Returns the title of the current page (what you see in the browser tab).
- page.screenshot(path='screenshot.png'): Captures a screenshot of the visible viewport and saves it to a file.
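If you want more than the viewport, screenshot() also accepts a full_page flag; a quick sketch:

        # full_page=True scrolls and captures the entire page,
        # not just the visible viewport
        page.screenshot(path='full_screenshot.png', full_page=True)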
Part 5: Finding and Extracting Content
        # Extract the main heading using a CSS selector
        heading = page.text_content('h1')
        print(f"Main heading: {heading}")

        # Extract all paragraph text
        paragraphs = page.query_selector_all('p')
        print("Paragraphs found:")
        for i, paragraph in enumerate(paragraphs, 1):
            print(f"{i}. {paragraph.text_content()}")
Explanation
- page.text_content('h1'): Finds the first <h1> element and extracts its text content.
- page.query_selector_all('p'): Finds ALL <p> elements on the page and returns a list.
- enumerate(paragraphs, 1): Loops through the paragraphs with numbering starting at 1.
- paragraph.text_content(): Extracts text from each paragraph element.
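As an aside, newer Playwright docs recommend the locator API for element handling. Here's a sketch of the same extraction rewritten that way, reusing our page object:

        # Locators are lazy: they re-query the page each time they're used
        heading = page.locator('h1').first.text_content()
        print(f"Main heading: {heading}")
        for i, text in enumerate(page.locator('p').all_text_contents(), 1):
            print(f"{i}. {text}")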
Part 6: Cleanup
        # Close the browser
        print("Closing browser...")
        browser.close()

if __name__ == '__main__':
    main()
Explanation
- browser.close(): Properly closes the browser and releases system resources. This is important to prevent memory leaks.
- if __name__ == '__main__': Standard Python practice that ensures the main() function only runs when the script is executed directly.
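One thing to keep in mind: if a step in the middle raises an exception, browser.close() never runs. A defensive sketch of my own (not required for this tutorial) that guarantees cleanup:

        browser = p.chromium.launch(headless=False)
        try:
            page = browser.new_page()
            page.goto('https://webscraper.io/test-sites/e-commerce/more')
            print(f"Page title: {page.title()}")
        finally:
            # Runs even if goto() or title() raises
            browser.close()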
Complete Script
Here's the complete script put together:
from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://webscraper.io/test-sites/e-commerce/more')

        title = page.title()
        print(f"Page title: {title}")

        page.screenshot(path='screenshot.png')
        print("Screenshot saved as 'screenshot.png'.")

        heading = page.text_content('h1')
        print(f"Main heading: {heading}")

        paragraphs = page.query_selector_all('p')
        print("Paragraphs found:")
        for i, paragraph in enumerate(paragraphs, 1):
            print(f"{i}. {paragraph.text_content()}")

        print("Closing browser...")
        browser.close()

if __name__ == "__main__":
    main()
Running Our Script
Save the file and run it with:
python first_scraper.py
You should see:
- A browser window opening and navigating to the test site.
- Output in your terminal showing the page title, heading, and paragraphs.
- A screenshot file created in your directory.
Step 3: Seeing Playwright's Real Power
While our first script works great, you might be thinking: "This looks similar to what I could do with requests + BeautifulSoup." And you'd be right, for this simple example.
But remember the dynamic content problem I mentioned in the introduction? Let me show you why Playwright is truly special. Try this quick experiment:
Create a file called requests_vs_playwright.py:
import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def compare_versions():
    """Compare JavaScript vs non-JavaScript versions of the same site"""
    print("COMPARISON: JavaScript vs Static Content")

    # Test the JavaScript version with requests (will fail)
    print("1. Testing requests on the JavaScript version:")
    response = requests.get('https://quotes.toscrape.com/js/')
    soup = BeautifulSoup(response.content, 'html.parser')
    quotes = soup.find_all('div', class_='quote')
    print(f"Quotes found: {len(quotes)}")
    print("   Requests cannot see JavaScript-rendered content")

    # Test the non-JavaScript version with requests
    print("2. Testing requests on the non-JavaScript version:")
    response = requests.get('https://quotes.toscrape.com/')
    soup = BeautifulSoup(response.content, 'html.parser')
    quotes = soup.find_all('div', class_='quote')
    print(f"Quotes found: {len(quotes)}")
    print("   Requests can see static content")

    # Test the JavaScript version with Playwright
    print("3. Testing Playwright on the JavaScript version:")
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://quotes.toscrape.com/js/')
        quotes = page.query_selector_all('div.quote')
        print(f"Quotes found: {len(quotes)}")
        print("   Playwright can see JavaScript-rendered content")
        browser.close()

if __name__ == "__main__":
    compare_versions()
In this file we scrape data from two similar sites; the difference is that one site renders its content with JavaScript while the other serves static HTML. When you run python requests_vs_playwright.py, you will notice that the first test case, which scrapes with requests and BeautifulSoup, finds 0 quotes. The third test case, which scrapes the same JavaScript page with Playwright, finds all the quotes even though they are rendered with JavaScript. The second test case also uses requests and BeautifulSoup, but it scrapes a page built from static HTML, which is why it is able to find the quotes.
Why This Matters:
- The JavaScript version ('/js/') loads quotes dynamically after the page loads
- Requests only sees the initial HTML skeleton - no quotes
- Playwright waits for JavaScript to execute and sees the complete page
- This is exactly the problem that made me switch to Playwright
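One caveat worth mentioning: the comparison script assumes the quotes have rendered by the time query_selector_all runs. On slower dynamic pages you'd wait for them explicitly; a sketch using Playwright's documented wait_for_selector:

        page.goto('https://quotes.toscrape.com/js/')
        # Blocks until at least one quote element is visible on the page
        page.wait_for_selector('div.quote')
        quotes = page.query_selector_all('div.quote')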
Please try running this script on your own and observe the difference.
What We've Accomplished
In this first journey with Playwright, we've:
- Identified the problem with traditional scraping tools for modern web applications.
- Successfully installed Playwright and resolved real-world dependency issues.
- Written our first automation script that can navigate, extract content, and take screenshots.
- Proven Playwright's superiority for dynamic content with an undeniable side-by-side comparison.
The most important takeaway is that Playwright sees what requests cannot: it renders JavaScript exactly like a human user, making previously "unscrapable" websites accessible again.
What's Next in Our Playwright Journey
This is just the beginning. In the next posts, we'll explore:
- Interactive scraping - clicking buttons, filling forms, handling dropdowns.
- Smart waiting strategies - waiting for elements, network idle, and custom conditions.
- Authentication handling - logging into websites and maintaining sessions.
- Advanced data extraction - tables, pagination, and complex data structures.
- Performance optimization - making our scripts faster and more reliable.
Stay tuned for the next post where we'll make our script truly interactive.
In the meantime, I'd love to hear about your experiences with Playwright. Have you encountered other dynamic content challenges? What websites are you thinking of automating? Let me know in the comments.