Introduction
The web today isn't what it used to be. For years, I relied on requests and BeautifulSoup to extract data from the web. It worked perfectly until it didn't. I hit a wall when websites started loading content dynamically with JavaScript. That's when I discovered Playwright.
What is Playwright?
Playwright is an open-source automation framework developed by Microsoft. It allows you to control modern browsers with simple code. Instead of just fetching static HTML, Playwright:
- Renders the full page, executing all JavaScript just like a human user.
- Interacts with elements like clicking buttons, filling forms, scrolling, and even waiting for content to load as needed.
- Works consistently across all modern browsers.
Why Use Playwright?
Here are just a few things you can do with it:
- Scrape dynamic websites that load content via JavaScript
- Write end-to-end tests for complex web applications.
- Automate repetitive tasks like form submissions or screenshots.
In this first post, I'm going to document my first steps with Playwright. We'll move from theory to practice by installing it and writing a simple script to scrape a site that would have been impossible with my old toolkit.
Getting Started with Playwright
Now that we know why Playwright matters, let's actually install it and run our first script. I'll use Python here, but Playwright also works with Node.js, Java and .NET.
Step 1: Install Playwright
First, let's install the Playwright package using pip:
pip install playwright
Once installed, we need to install the browser binaries. Playwright can control Chromium, Firefox, and WebKit (Safari):
playwright install
This command installs the necessary browsers so Playwright can control them.
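By default this downloads all three engines. If you only need one of them and want to save disk space, the installer also accepts a browser name; for example, this should fetch just Chromium:
playwright install chromium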
Challenge Encountered: Missing Host Dependencies
After running the playwright install command, I encountered a warning that initially confused me. While the browsers were successfully downloaded and placed in the cache, Playwright's host system validation noted that my Linux operating system was missing essential libraries needed for the browsers to actually run.
Here is the exact warning I received:
Playwright Host validation warning:
╔══════════════════════════════════════════════════════╗
║ Host system is missing dependencies to run browsers. ║
║ Please install them with the following command: ║
║ ║
║ sudo playwright install-deps ║
║ ║
║ Alternatively, use apt: ║
║ sudo apt-get install libwoff1\ ║
║ ... [and a long list of other libraries] ║
║ ║
╚══════════════════════════════════════════════════════╝
Resolving the Dependency Issue
The good news is that the Playwright team anticipated this and provided two clear solutions.
Solution A: Using the Recommended Playwright Command
Playwright provides a single, easy-to-use command specifically for installing these system dependencies:
sudo playwright install-deps
I attempted to run this command outside of my virtual environment, but it immediately failed:
orina@Orina:~$ sudo playwright install-deps
[sudo] password for orina:
sudo: playwright: command not found
The issue here is that the playwright command is installed within my isolated Python virtual environment (venv). Even though I was running it with sudo, my system's global PATH did not include the virtual environment's binary directory, leading to the command not found error.
To correctly use this command, I would have needed to either:
- Use the absolute path to the executable inside the venv: sudo /path/to/venv/bin/playwright install-deps
- Or run the command while inside the venv and use the -E flag with sudo to preserve environment variables: sudo -E playwright install-deps (though note that sudo often resets PATH anyway via its secure_path setting, so this isn't guaranteed to work either)
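In hindsight, there's also a middle ground: let the shell resolve the venv's executable path for you. A quick sketch, assuming the virtual environment is active so which can find the binary:
sudo "$(which playwright)" install-deps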
Solution B: Manually Installing Dependencies with apt (The Fix I Used)
Rather than debugging the path issue, the most reliable and straightforward fix was to take the list of packages provided for my Linux distribution and install them directly onto the host system using apt-get:
sudo apt-get install libwoff1 libvpx9 libevent-2.1-7t64 libgstreamer-gl1.0-0 libgstreamer-plugins-bad1.0-0 libwebpdemux2 libharfbuzz-icu0 libenchant-2-2 libsecret-1-0 libhyphen0 libmanette-0.2-0 libflite1 gstreamer1.0-libav
Step 2: Writing Our First Playwright Script
Now that we have all dependencies installed, let's create our first Playwright script. I'll break this down into manageable parts with detailed explanations.
Create a file called first_scraper.py and let's build it step by step.
Part 1: Import and Basic Setup
from playwright.sync_api import sync_playwright

def main():
    # The sync_playwright() context manager handles browser lifecycle
    with sync_playwright() as p:
Explanation
- from playwright.sync_api import sync_playwright: We import the synchronous API. Playwright also has an async API, but we're starting with synchronous for simplicity.
- with sync_playwright() as p: This context manager automatically handles starting and stopping Playwright. The p variable gives us access to the Playwright instance.
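For the curious, here's a minimal sketch of the same setup using the async API instead. We won't use it in this post; the sketch assumes Python 3.8+ for asyncio.run:

import asyncio
from playwright.async_api import async_playwright

async def main():
    # Same lifecycle management, but every Playwright call is awaited
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://webscraper.io/test-sites/e-commerce/more')
        print(await page.title())
        await browser.close()

asyncio.run(main())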
Part 2: Launching the Browser
        # Launch a Chromium browser instance
        # headless=False means we'll see the browser window
        browser = p.chromium.launch(headless=False)
Explanation
- p.chromium.launch(headless=False): This launches a Chromium browser.
- headless=False means the browser window will be visible. For production scripts, you'd set this to True to run in the background.
- You could also use p.firefox.launch() or p.webkit.launch() for other browsers.
- The browser variable represents the browser instance we'll work with.
Part 3: Creating a Page and Navigation
        # Create a new page (tab) in the browser
        page = browser.new_page()

        # Navigate to a website
        page.goto('https://webscraper.io/test-sites/e-commerce/more')
Explanation
- browser.new_page(): Creates a new tab/page in the browser. Most of our interactions will happen through this page object.
- page.goto('https://webscraper.io/test-sites/e-commerce/more'): Tells the browser to navigate to the specified URL. Playwright automatically waits for the page to load.
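By default, goto() waits for the page's load event. If a site keeps fetching data after that, you can ask Playwright to wait until the network goes quiet instead; a small sketch using the documented wait_until option:

        # 'networkidle' waits until there have been no network connections
        # for at least 500 ms before considering navigation finished
        page.goto('https://webscraper.io/test-sites/e-commerce/more',
                  wait_until='networkidle')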
Part 4: Interacting with the Page
        # Get the page title and print it
        title = page.title()
        print(f"Page title: {title}")

        # Take a screenshot of the page
        page.screenshot(path='screenshot.png')
        print("Screenshot saved as 'screenshot.png'")
Explanation
- page.title(): Returns the title of the current page (what you see in the browser tab).
- page.screenshot(path='screenshot.png'): Captures a screenshot of the visible viewport and saves it to a file.
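If you want more than the viewport, screenshot() also accepts a full_page flag; a quick sketch:

        # full_page=True scrolls and captures the entire page,
        # not just the visible viewport
        page.screenshot(path='full_screenshot.png', full_page=True)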
Part 5: Finding and Extracting Content
        # Extract the main heading using a CSS selector
        heading = page.text_content('h1')
        print(f"Main heading: {heading}")

        # Extract all paragraph text
        paragraphs = page.query_selector_all('p')
        print("Paragraphs found:")
        for i, paragraph in enumerate(paragraphs, 1):
            print(f"{i}. {paragraph.text_content()}")
Explanation
- page.text_content('h1'): Finds the first <h1> element and extracts its text content.
- page.query_selector_all('p'): Finds ALL <p> elements on the page and returns a list.
- enumerate(paragraphs, 1): Loops through the paragraphs with numbering starting at 1.
- paragraph.text_content(): Extracts text from each paragraph element.
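As an aside, newer Playwright docs recommend the locator API for element handling. Here's a sketch of the same extraction rewritten that way, reusing our page object:

        # Locators are lazy: they re-query the page each time they're used
        heading = page.locator('h1').first.text_content()
        print(f"Main heading: {heading}")
        for i, text in enumerate(page.locator('p').all_text_contents(), 1):
            print(f"{i}. {text}")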
Part 6: Cleanup
        # Close the browser
        print("Closing browser...")
        browser.close()

if __name__ == '__main__':
    main()
Explanation
- browser.close(): Properly closes the browser and releases system resources. This is important to prevent memory leaks.
- if __name__ == '__main__': Standard Python practice that ensures the main() function only runs when the script is executed directly.
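One thing to keep in mind: if a step in the middle raises an exception, browser.close() never runs. A defensive sketch of my own (not required for this tutorial) that guarantees cleanup:

        browser = p.chromium.launch(headless=False)
        try:
            page = browser.new_page()
            page.goto('https://webscraper.io/test-sites/e-commerce/more')
            print(f"Page title: {page.title()}")
        finally:
            # Runs even if goto() or title() raises
            browser.close()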
Complete Script
Here's the complete script put together:
from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://webscraper.io/test-sites/e-commerce/more')

        title = page.title()
        print(f"Page title: {title}")

        page.screenshot(path='screenshot.png')
        print("Screenshot saved as 'screenshot.png'.")

        heading = page.text_content('h1')
        print(f"Main heading: {heading}")

        paragraphs = page.query_selector_all('p')
        print("Paragraphs found:")
        for i, paragraph in enumerate(paragraphs, 1):
            print(f"{i}. {paragraph.text_content()}")

        print("Closing browser...")
        browser.close()

if __name__ == "__main__":
    main()
Running Our Script
Save the file and run it with:
python first_scraper.py
You should see:
- A browser window opening and navigating to the test site.
- Output in your terminal showing the page title, heading, and paragraphs.
- A screenshot file created in your directory.
Step 3: Seeing Playwright's Real Power
While our first script works great, you might be thinking: "This looks similar to what I could do with requests + BeautifulSoup." And you'd be right, for this simple example.
But remember the dynamic content problem I mentioned in the introduction? Let me show you why Playwright is truly special. Try this quick experiment:
Create a file called requests_vs_playwright.py:
import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def compare_versions():
    """Compare JavaScript vs non-JavaScript versions of the same site"""
    print("COMPARISON: JavaScript vs Static Content")

    # Test the JavaScript version with requests (will fail)
    print("1. Testing requests on the JavaScript version:")
    response = requests.get('https://quotes.toscrape.com/js/')
    soup = BeautifulSoup(response.content, 'html.parser')
    quotes = soup.find_all('div', class_='quote')
    print(f"Quotes found: {len(quotes)}")
    print("   Requests cannot see JavaScript-rendered content")

    # Test the non-JavaScript version with requests
    print("2. Testing requests on the non-JavaScript version:")
    response = requests.get('https://quotes.toscrape.com/')
    soup = BeautifulSoup(response.content, 'html.parser')
    quotes = soup.find_all('div', class_='quote')
    print(f"Quotes found: {len(quotes)}")
    print("   Requests can see static content")

    # Test the JavaScript version with Playwright
    print("3. Testing Playwright on the JavaScript version:")
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://quotes.toscrape.com/js/')
        quotes = page.query_selector_all('div.quote')
        print(f"Quotes found: {len(quotes)}")
        print("   Playwright can see JavaScript-rendered content")
        browser.close()

if __name__ == "__main__":
    compare_versions()
In this file we scrape data from two similar sites; the difference is that one site renders its content with JavaScript while the other serves static HTML. When you run python requests_vs_playwright.py, you will notice that the first test case, which scrapes with requests and BeautifulSoup, finds 0 quotes. The third test case, which scrapes the same JavaScript page with Playwright, finds all the quotes even though they are rendered with JavaScript. The second test case also uses requests and BeautifulSoup, but it scrapes a page built from static HTML, which is why it is able to find the quotes.
Why This Matters:
- The JavaScript version ('/js/') loads quotes dynamically after the page loads
- Requests only sees the initial HTML skeleton - no quotes
- Playwright waits for JavaScript to execute and sees the complete page
- This is exactly the problem that made me switch to Playwright
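One caveat worth mentioning: the comparison script assumes the quotes have rendered by the time query_selector_all runs. On slower dynamic pages you'd wait for them explicitly; a sketch using Playwright's documented wait_for_selector:

        page.goto('https://quotes.toscrape.com/js/')
        # Blocks until at least one quote element is visible on the page
        page.wait_for_selector('div.quote')
        quotes = page.query_selector_all('div.quote')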
Please try running this script on your own and observe the difference.
What We've Accomplished
In this first journey with Playwright, we've:
- Identified the problem with traditional scraping tools for modern web applications.
- Successfully installed Playwright and resolved real-world dependency issues.
- Written our first automation script that can navigate, extract content, and take screenshots.
- Proven Playwright's superiority for dynamic content with an undeniable side-by-side comparison.
The most important takeaway is that Playwright sees what requests cannot: it renders JavaScript exactly like a human user, making previously "unscrapable" websites accessible again.
What's Next in Our Playwright Journey
This is just the beginning. In the next posts, we'll explore:
- Interactive scraping - clicking buttons, filling forms, handling dropdowns.
- Smart waiting strategies - waiting for elements, network idle, and custom conditions.
- Authentication handling - logging into websites and maintaining sessions.
- Advanced data extraction - tables, pagination, and complex data structures.
- Performance optimization - making our scripts faster and more reliable.
Stay tuned for the next post where we'll make our script truly interactive.
In the meantime, I'd love to hear about your experiences with Playwright. Have you encountered other dynamic content challenges? What websites are you thinking of automating? Let me know in the comments.