Nelson Orina

Introduction to Playwright.

Introduction

The web today isn't what it used to be. For years, I relied on requests and BeautifulSoup to extract data from the web. It worked perfectly until it didn't. I hit a wall when websites started loading content dynamically with JavaScript. That's when I discovered Playwright.

What is Playwright?
Playwright is an open-source automation framework developed by Microsoft. It allows you to control modern browsers with simple code. Instead of just fetching static HTML, Playwright:

  • Renders the full page, executing all JavaScript just like a human user.
  • Interacts with elements like clicking buttons, filling forms, scrolling, and even waiting for content to load as needed.
  • Works consistently across all modern browsers.

Why Use Playwright?
Here are just a few things you can do with it:

  • Scrape dynamic websites that load content via JavaScript
  • Write end-to-end tests for complex web applications.
  • Automate repetitive tasks like form submissions or screenshots.

In this first post, I'm going to document my first steps with Playwright. We'll move from theory to practice by installing it and writing a simple script to scrape a site that would have been impossible with my old toolkit.

Getting Started with Playwright

Now that we know why Playwright matters, let's actually install it and run our first script. I'll use Python here, but Playwright also works with Node.js, Java and .NET.

Step 1: Install Playwright

First, let's install the Playwright package using pip:

pip install playwright

Once installed, we need to install the browser binaries. Playwright can control Chromium, Firefox, and WebKit (Safari):

playwright install

This command installs the necessary browsers so Playwright can control them.

Challenge Encountered: Missing Host Dependencies
After running the playwright install command, I encountered a warning that initially confused me. The browsers were downloaded and cached successfully, but Playwright's host system validation reported that my Linux system was missing essential libraries the browsers need to actually run.

Here is the exact warning I received:

Playwright Host validation warning: 
╔══════════════════════════════════════════════════════╗
║ Host system is missing dependencies to run browsers. ║
║ Please install them with the following command:      ║
║                                                      ║
║      sudo playwright install-deps                    ║
║                                                      ║
║ Alternatively, use apt:                              ║
║      sudo apt-get install libwoff1\                  ║
║      ... [and a long list of other libraries]        ║
║                                                      ║
╚══════════════════════════════════════════════════════╝

Resolving the Dependency Issue
The good news is that the Playwright team anticipated this and provided two clear solutions.

Solution A: Using the Recommended Playwright Command

Playwright provides a single, easy-to-use command specifically for installing these system dependencies:

sudo playwright install-deps

I attempted to run this command outside of my virtual environment, but it immediately failed:

orina@Orina:~$ sudo playwright install-deps 
[sudo] password for orina: 
sudo: playwright: command not found

The issue here is that the playwright command is installed within my isolated Python virtual environment (venv). Even though I was running it with sudo, my system's global PATH did not include the virtual environment's binary directory, leading to the command not found error.

To correctly use this command, I would have needed to either:

  1. Use the absolute path to the executable inside the venv: sudo /path/to/venv/bin/playwright install-deps

  2. Or, run the command while inside the venv and use the -E flag with sudo to preserve the environment variables: sudo -E playwright install-deps

Solution B: Manually Installing Dependencies with apt (The Fix I Used)

Rather than debugging the path issue, the most reliable and straightforward fix was to take the list of packages provided for my Linux distribution and install them directly onto the host system using apt-get:

sudo apt-get install libwoff1 libvpx9 libevent-2.1-7t64 libgstreamer-gl1.0-0 libgstreamer-plugins-bad1.0-0 libwebpdemux2 libharfbuzz-icu0 libenchant-2-2 libsecret-1-0 libhyphen0 libmanette-0.2-0 libflite1 gstreamer1.0-libav

Step 2: Writing Our First Playwright Script

Now that we have all dependencies installed, let's create our first Playwright script. I'll break this down into manageable parts with detailed explanations.

Create a file called first_scraper.py and let's build it step by step.

Part 1: Import and Basic Setup

from playwright.sync_api import sync_playwright

def main():
    # The sync_playwright() context manager handles browser lifecycle
    with sync_playwright() as p:


Explanation

  • from playwright.sync_api import sync_playwright: We import the synchronous API. Playwright also has an async API (a minimal sketch follows this list), but we're starting with synchronous for simplicity.
  • with sync_playwright() as p: This context manager automatically handles starting and stopping Playwright. The p variable gives us access to the Playwright instance.
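For comparison, here is a minimal sketch of the same skeleton using the async API mentioned above; asyncio is standard-library, and the URL is just the test site we use later in this post:

import asyncio
from playwright.async_api import async_playwright

async def main():
    # async_playwright() plays the same lifecycle role as sync_playwright()
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto('https://webscraper.io/test-sites/e-commerce/more')
        print(await page.title())
        await browser.close()

asyncio.run(main())

The async API pays off when you drive several pages concurrently; for a single-page script, the synchronous API is easier to follow.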

Part 2: Launching the Browser

# Launch a Chromium browser instance
# headless=False means we'll see the browser window
browser = p.chromium.launch(headless=False)

Explanation

  • p.chromium.launch(headless=False): This launches a Chromium browser.
    • headless=False means the browser window will be visible. For production scripts, you'd set this to True to run in the background.
    • You could also use p.firefox.launch() or p.webkit.launch() for other browsers (see the sketch after this list).
  • The browser variable represents the browser instance we'll work with.
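To make the multi-browser point concrete, here is a minimal sketch that launches each bundled engine in turn and prints the same page title from all three. It assumes the browsers were installed with playwright install:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # p.chromium, p.firefox, and p.webkit are the three bundled engines
    for browser_type in (p.chromium, p.firefox, p.webkit):
        browser = browser_type.launch(headless=True)
        page = browser.new_page()
        page.goto('https://webscraper.io/test-sites/e-commerce/more')
        print(f"{browser_type.name}: {page.title()}")
        browser.close()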

Part 3: Creating a Page and Navigation

# Create a new page (tab) in the browser
page = browser.new_page()

# Navigate to a website
page.goto('https://webscraper.io/test-sites/e-commerce/more')

Explanation

  • browser.new_page(): Creates a new tab/page in the browser. Most of our interactions will happen through this page object.
  • page.goto('https://webscraper.io/test-sites/e-commerce/more'): Tells the browser to navigate to the specified URL. Playwright automatically waits for the page to load.
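That automatic waiting is also tunable. goto() waits for the page's load event by default, but it accepts a wait_until value and a timeout in milliseconds; both parameters are part of Playwright's documented API:

# goto() waits for the 'load' event by default; both behaviors can be tuned
page.goto(
    'https://webscraper.io/test-sites/e-commerce/more',
    wait_until='domcontentloaded',  # return as soon as the DOM is parsed
    timeout=30000,                  # give up after 30 seconds
)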

Part 4: Interacting with the Page

# Get the page title and print it
title = page.title()
print(f"Page title: {title}")

# Take a screenshot of the entire page
page.screenshot(path='screenshot.png')
print("Screenshot saved as 'screenshot.png'")


Explanation

  • page.title(): Returns the title of the current page (what you see in the browser tab).
  • page.screenshot(path='screenshot.png'): Captures a screenshot of the entire visible page and saves it to a file.
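One option worth knowing: screenshot() captures only the visible viewport by default. Passing full_page=True (a documented parameter) captures the entire scrollable page instead; the filename here is just an example:

# Capture the whole scrollable page, not just the visible viewport
page.screenshot(path='full_screenshot.png', full_page=True)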

Part 5: Finding and Extracting Content

# Extract the main heading using a CSS selector
heading = page.text_content('h1')
print(f"Main heading: {heading}")

# Extract all paragraph text
paragraphs = page.query_selector_all('p')
print("Paragraphs found:")
for i, paragraph in enumerate(paragraphs, 1):
    print(f"{i}. {paragraph.text_content()}")


Explanation

  • page.text_content('h1'): Finds the first <h1> element and extracts its text content.
  • page.query_selector_all('p'): Finds ALL <p> elements on the page and returns a list.
  • enumerate(paragraphs,1): Loops through the paragraphs with numbering starting at 1.
  • paragraph.text_content(): Extracts text from each paragraph element.
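As an aside, recent Playwright releases steer you toward the locator API instead of query_selector_*: locators re-query the page when used and wait for elements automatically. A minimal sketch of the same extraction rewritten with locators:

# Locator-based version of the same extraction
heading = page.locator('h1').first.text_content()
print(f"Main heading: {heading}")

# all_text_contents() collects the text of every matching element at once
for i, text in enumerate(page.locator('p').all_text_contents(), 1):
    print(f"{i}. {text}")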

Part 6: Cleanup

    # Close the browser
    print("Closing browser...")
    browser.close()

if __name__ == '__main__':
    main()



Explanation

  • browser.close(): Properly closes the browser and releases system resources. This is important to prevent memory leaks (see the try/finally sketch after this list).
  • if __name__ == '__main__': This standard Python guard ensures the main() function only runs when the script is executed directly.
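Because a crash partway through would otherwise leave the browser running, in my own scripts I like to guard the close call with try/finally. A minimal sketch of that pattern:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    try:
        page = browser.new_page()
        page.goto('https://webscraper.io/test-sites/e-commerce/more')
        print(page.title())
    finally:
        # Runs even if goto() or title() raises, so no orphaned browsers
        browser.close()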

Complete Script

Here's the complete script put together:

from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:

        browser = p.chromium.launch(headless=False)

        page = browser.new_page()
        page.goto('https://webscraper.io/test-sites/e-commerce/more')

        title = page.title()
        print(f"Page title: {title}")

        page.screenshot(path='screenshot.png')
        print("Screenshot saved as 'screenshot.png'.")

        heading = page.text_content('h1')
        print(f"Main heading: {heading}")

        paragraphs = page.query_selector_all('p')
        print ("Paragraphs found:")
        for i, paragraph in enumerate(paragraphs, 1): 
            print(f"{i}. {paragraph.text_content()}")

        print("Closing browser...")
        browser.close()

if __name__ == "__main__":
    main()


Running Our Script

Save the file and run it with:

python first_scraper.py

You should see:

  1. A browser window opening and navigating to the test site.
  2. Output in your terminal showing the page title, heading, and paragraphs.
  3. A screenshot file created in your directory.

Step 3: Seeing Playwright's Real Power

While our first script works great, you might be thinking: "This looks similar to what I could do with requests + BeautifulSoup." And you'd be right, for this simple example.

But remember the dynamic content problem I mentioned in the introduction? Let me show you why Playwright is truly special. Try this quick experiment:

Create a file called requests_vs_playwright.py:

import requests 
from bs4 import BeautifulSoup 
from playwright.sync_api import sync_playwright

def compare_versions():

    """Compare JavaScript vs non-JavaScript versions of the same site"""

    print("COMPARISON: JavaScript vs Static Content")

    # Test the JavaScript version with requests (will fail)
    print("1.Testing Request on JavaScript version:")

    response = requests.get('https://quotes.toscrape.com/js/')
    soup = BeautifulSoup(response.content, 'html.parser')
    quotes = soup.find_all('div', class_='quote')
    print(f"Quotes found: {len(quotes)}")
    print(" Requests cannot see JavaScript rendered content")

    # Test the non-JavaScript version with requests
    print("2.Testing Request on non-JavaScript version:")
    response = requests.get('https://quotes.toscrape.com/')
    soup = BeautifulSoup(response.content, 'html.parser')
    quotes = soup.find_all('div', class_='quote')
    print(f"Quotes found: {len(quotes)}")
    print(" Requests can see static content")

    # Test the JavaScript version with Playwright
    print("3.Testing Playwright on JavaScript version:")
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://quotes.toscrape.com/js/')
        quotes = page.query_selector_all('div.quote')
        print(f"Quotes found: {len(quotes)}")
        print(" Playwright can see JavaScript rendered content")

        browser.close()

if __name__ == "__main__":
    compare_versions()


In this file we scrape data from two similar sites; the difference is that one renders its content with JavaScript while the other serves static HTML. When you run python requests_vs_playwright.py, you will notice that the first test case, which scrapes with requests and BeautifulSoup, finds 0 quotes. The third test case, which scrapes the same JavaScript page with Playwright, finds all the quotes even though they are rendered with JavaScript. The second test case also uses requests and BeautifulSoup, but it scrapes the static HTML version of the page, which is why it is able to find the quotes.

Why This Matters:

  • The JavaScript version ('/js/') loads quotes dynamically after the page loads
  • Requests only sees the initial HTML skeleton - no quotes
  • Playwright waits for JavaScript to execute and sees the complete page (a caveat sketch follows this list)
  • This is exactly the problem that made me switch to Playwright
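One caveat worth noting: page.goto() waits for the page's load event, which happens to be enough here because the quotes render during load. If a site fetches content only after load, wait for the element explicitly. A minimal sketch against the same quotes page, using page.wait_for_selector():

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://quotes.toscrape.com/js/')
    # Block until at least one quote element is attached to the DOM
    page.wait_for_selector('div.quote')
    print(f"Quotes found: {len(page.query_selector_all('div.quote'))}")
    browser.close()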

Please try running this script on your own and observe the difference.

What We've Accomplished

In this first journey with Playwright, we've:

  • Identified the problem with traditional scraping tools for modern web applications.
  • Successfully installed Playwright and resolved real-world dependency issues.
  • Written our first automation script that can navigate, extract content, and take screenshots.
  • Proven Playwright's superiority for dynamic content with an undeniable side-by-side comparison.

The most important takeaway is that Playwright sees what requests cannot: it renders JavaScript exactly like a human user, making previously "unscrapable" websites accessible again.

What's Next in Our Playwright Journey

This is just the beginning. In the next posts, we'll explore:

  1. Interactive scraping - clicking buttons, filling forms, handling dropdowns.
  2. Smart waiting strategies - waiting for elements, network idle, and custom conditions.
  3. Authentication handling - logging into websites and maintaining sessions.
  4. Advanced data extraction - tables, pagination, and complex data structures.
  5. Performance optimization - making our scripts faster and more reliable.

Stay tuned for the next post where we'll make our script truly interactive.

In the meantime, I'd love to hear about your experiences with Playwright. Have you encountered other dynamic content challenges? What websites are you thinking of automating? Let me know in the comments.
