Scrapy is a widely used web scraping library with convenient and comprehensive architecture support for the common web scraping processes. However, it lacks a major feature: JavaScript rendering.
In this tutorial, we'll explore Scrapy Playwright, a Scrapy integration that allows scraping dynamic web pages with Scrapy. We'll explain web scraping with Scrapy Playwright through an example project and show how to use it for common scraping use cases, such as clicking elements, scrolling, and waiting for elements. Let's dive in!
What is Scrapy Playwright?
scrapy-playwright is an integration between Scrapy and Playwright. It enables scraping dynamic web pages with Scrapy by processing the web scraping requests using a Playwright instance.
Scrapy Playwright gives access to the underlying Playwright page objects, which enables most of the Playwright features, such as:
- Simulating mouse and keyboard actions.
- Waiting for events, load states and HTML elements.
- Taking screenshots.
- Executing custom JavaScript code.
How to Install Scrapy Playwright?
To web scrape with Scrapy Playwright, we'll have to install a few Python libraries:
- Scrapy: For creating a Scrapy project and executing the scraping spiders.
- scrapy-playwright: A middleware for processing the requests using Playwright.
- Playwright: The Playwright Python API for automating the headless browsers.
The above libraries can be installed using the pip command:
pip install scrapy scrapy-playwright playwright
After running the above command, install the Playwright headless browser binaries and dependencies:
playwright install chromium
playwright install-deps chromium
The above commands install the Chromium binaries and their system dependencies. We can also specify other browser engines: firefox or webkit.
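To confirm the installation before going further, a quick check like the following can help. It is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical, and the list holds import names rather than pip names (scrapy-playwright is imported as scrapy_playwright).

```python
from importlib.util import find_spec

# Import names, not pip names: scrapy-playwright is imported as scrapy_playwright.
REQUIRED_MODULES = ["scrapy", "scrapy_playwright", "playwright"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```

Note that this only checks the Python packages. The browser binaries are installed separately through the playwright install command shown above.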
Note that scrapy-playwright relies on the asyncio SelectorEventLoop. So, to use Scrapy Playwright on Windows, we have to use WSL, an interface for running Linux environments in Windows. For the installation instructions, refer to the official Microsoft guide.
How to Scrape with Scrapy Playwright?
In this section, we'll go over a step-by-step tutorial on creating a Scrapy project, integrating it with Playwright and creating a scraping Spider to extract data using Playwright.
This Scrapy Playwright tutorial will briefly cover the basics of Scrapy. For further details, refer to our dedicated guide on Scrapy.
Setting Up Scrapy Project
Let's start by creating a new Scrapy project with the scrapy command:
$ scrapy startproject reviewgather reviewgather-scraper
#                     ^ name       ^ project directory
Executing the above command will create a Scrapy project in the reviewgather-scraper folder. Let's navigate to its directory and inspect the created project files:
$ cd reviewgather-scraper
$ tree
.
├── reviewgather
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
Now that the Scrapy project is ready, let's power it with Playwright!
Integrating Playwright With Scrapy
Setting up Playwright with Scrapy is fairly straightforward. All we have to do is add the following download handlers to the settings.py file in the Scrapy project:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
Also, enable the AsyncioSelectorReactor by making sure that the following line exists in the same file, adding it if it doesn't:
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
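Since a missing handler or reactor line tends to fail silently (requests simply go through the regular Scrapy downloader), a small check over the settings values can be useful. The following is a rough sketch in plain Python; playwright_setting_problems is a hypothetical helper and takes an ordinary dict rather than a Scrapy Settings object.

```python
HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def playwright_setting_problems(settings: dict) -> list:
    """Return human-readable problems found in a settings-like dict."""
    problems = []
    handlers = settings.get("DOWNLOAD_HANDLERS", {})
    # Both schemes need the Playwright download handler.
    for scheme in ("http", "https"):
        if handlers.get(scheme) != HANDLER:
            problems.append(f"DOWNLOAD_HANDLERS[{scheme!r}] is not the Playwright handler")
    # The asyncio-based Twisted reactor must be selected as well.
    if settings.get("TWISTED_REACTOR") != REACTOR:
        problems.append("TWISTED_REACTOR is not the asyncio reactor")
    return problems


good = {
    "DOWNLOAD_HANDLERS": {"http": HANDLER, "https": HANDLER},
    "TWISTED_REACTOR": REACTOR,
}
print(playwright_setting_problems(good))
print(playwright_setting_problems({}))
```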
Our Scrapy project can now use Playwright. Let's create the first Scrapy Playwright scraping spider to put it to the test!
Creating Scraping Spider
In this Scrapy Playwright tutorial, we'll scrape review data from web-scraping.dev:
[Image: webpage with review data. Caption: Reviews on web-scraping.dev]
To scrape the above review data, we have to create a Scrapy spider:
$ scrapy genspider reviews web-scraping.dev
#                  ^ name  ^ domain to scrape
The above Scrapy command will generate a spider file named reviews.py with boilerplate code:
import scrapy


class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]
    start_urls = ["https://web-scraping.dev"]

    def parse(self, response):
        pass
The starting point of the above spider is start_urls, which is used for crawling purposes. Since our scraping target is only one page, we'll change it to a start_requests function and request the target web page with Playwright:
import scrapy


class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/testimonials"
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True
            }
        )

    def parse(self, response):
        reviews = response.css("div.testimonial")
        for review in reviews:
            yield {
                "rate": len(review.css("span.rating > svg").getall()),
                "text": review.css("p.text::text").get()
            }
Let's go through the above spider changes:
- Add a start_requests function along with the target page URL.
- Request the URL using scrapy.Request and add the playwright parameter to the request metadata to process it with Playwright.
- Update the parse() callback function to parse the review data on the page by iterating over the review elements and extracting their fields using CSS selectors.
The next step is executing the reviews spider and saving the results:
scrapy crawl reviews --output reviews.json
The above command will create a reviews.json file with the extracted data:
[
{"rate": 5, "text": "We've been using this utility for years - awesome service!"},
{"rate": 5, "text": "This Python app simplified my workflow significantly. Highly recommended."},
{"rate": 4, "text": "Had a few issues at first, but their support team is top-notch!"},
{"rate": 5, "text": "A fantastic tool - it has everything you need and more."},
{"rate": 5, "text": "The interface could be a little more user-friendly."},
{"rate": 5, "text": "Been a fan of this app since day one. It just keeps getting better!"},
{"rate": 4, "text": "The recent updates really improved the overall experience."},
{"rate": 3, "text": "A decent web app. There's room for improvement though."},
{"rate": 5, "text": "The app is reliable and efficient. I can't imagine my day without it now."},
{"rate": 1, "text": "Encountered some bugs. Hope they fix it soon."}
]
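Once the export exists, a quick sanity check of its contents can be sketched as follows. This is an illustrative helper only: summarize_reviews is a hypothetical name, and the inline sample stands in for the contents of reviews.json so the snippet runs on its own.

```python
import json
from collections import Counter


def summarize_reviews(raw_json: str) -> dict:
    """Summarize a JSON array of {"rate": int, "text": str} review items."""
    reviews = json.loads(raw_json)
    rates = [review["rate"] for review in reviews]
    return {
        "count": len(reviews),
        "average_rate": round(sum(rates) / len(rates), 2) if rates else None,
        "by_rate": dict(Counter(rates)),
        # Items with an empty or null text often point to a selector mismatch.
        "missing_text": sum(1 for review in reviews if not review.get("text")),
    }


# Stand-in for open("reviews.json").read()
sample = '[{"rate": 5, "text": "Great"}, {"rate": 4, "text": "Good"}, {"rate": 1, "text": null}]'
print(summarize_reviews(sample))
```

A count that stays at the size of the first page is a useful hint that the dynamically loaded reviews were not captured.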
Cool, our Scrapy Playwright scraping spider extracted the review data! However, it only contains the data from the first review page. To load and scrape more reviews, we have to scroll down the page. To do this, let's have a closer look at configuring Scrapy Playwright and automating the headless browser!
Implement Common Scraping Cases With Scrapy Playwright
In the following sections, we'll explore configuring Playwright with the Scrapy setup and controlling the Playwright headless browser for common web scraping use cases.
The scrapy-playwright middleware supports most of the Playwright methods. This means that we can apply the regular Playwright features in Scrapy. For further details on these features, refer to our dedicated guide on Playwright.
Configuring Scrapy Playwright
Before we explore using Scrapy Playwright to execute different web scraping tasks, let's have a look at configuring the Playwright browser and its context first.
The scrapy-playwright middleware allows for defining global Playwright configuration through the settings.py file in the Scrapy project:
# settings.py
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,  # run in the headful mode
    "timeout": 60 * 1000,  # 60 seconds
}
PLAYWRIGHT_CONTEXTS = {
    "some_context_name": {
        "viewport": {"width": 1280, "height": 720},
        "locale": "fr-FR",
        "timezone_id": "Europe/Paris",
    }
}
In the above code, we define two different configurations:
- Launch options: A timeout for the browser instance and whether to run the browser in headless mode.
- Browser context: The Playwright browser emulation settings, such as viewport and locality configuration.
The Playwright context settings are global and can include several context profiles. They can be used by declaring the profile name in the request metadata:
yield scrapy.Request(
    url=url,
    meta={
        "playwright": True,
        "playwright_context": "some_context_name",
    }
)
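As the number of meta keys grows, it can help to build the dictionary in one place. The sketch below assumes nothing beyond the meta keys used in this tutorial (playwright, playwright_context, playwright_page_methods and playwright_include_page); the helper name playwright_meta is hypothetical and not part of scrapy-playwright.

```python
def playwright_meta(context=None, page_methods=None, include_page=False):
    """Build the request meta dict for a Playwright-rendered Scrapy request."""
    meta = {"playwright": True}
    if context is not None:
        # Must match a profile name declared in PLAYWRIGHT_CONTEXTS.
        meta["playwright_context"] = context
    if page_methods:
        meta["playwright_page_methods"] = list(page_methods)
    if include_page:
        meta["playwright_include_page"] = True
    return meta


print(playwright_meta(context="some_context_name"))
```

The returned dictionary can then be passed as the meta argument of scrapy.Request, exactly like the literal dictionaries used in the spiders here.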
Scrapy Playwright also allows for defining custom headers that will be used across all the requests:
# settings.py
from playwright.async_api import Request
from scrapy.http.headers import Headers


def custom_headers(
    browser_type: str,
    playwright_request: Request,
    scrapy_headers: Headers,
) -> dict:
    return {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.35"}


PLAYWRIGHT_PROCESS_REQUEST_HEADERS = custom_headers
Here, we define a custom_headers function that returns specific header values and pass it to PLAYWRIGHT_PROCESS_REQUEST_HEADERS to use it across all Playwright requests. It will also override the default Scrapy headers and the headers passed to the Scrapy request.
Scrolling
Let's update our previous Scrapy Playwright scraping spider to scroll down and load more reviews. For this, we'll use the scrapy-playwright PageMethod, which supports most of the default Playwright page methods.
We'll execute custom JavaScript code to simulate a scroll action that loads all the review data on the page, and then parse it:
import scrapy
from scrapy_playwright.page import PageMethod


class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/testimonials"
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # execute the scroll script
                    PageMethod("evaluate", "for (let i = 0; i < 8; i++) setTimeout(() => window.scrollTo(0, document.body.scrollHeight), i * 2000);"),
                    # wait for 15 seconds
                    PageMethod("wait_for_timeout", 15000)
                ],
            }
        )

    async def parse(self, response):
        reviews = response.css("div.testimonial")
        for review in reviews:
            yield {
                "rate": len(review.css("span.rating > svg").getall()),
                "text": review.css("p.text::text").get()
            }
The above code is almost the same as the previous spider. We only add a JavaScript snippet for scrolling and wait 15 seconds for it to finish.
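The 15-second wait is tied to the scroll script: setTimeout only schedules the scrolls, so the evaluate call returns right away and the last scroll fires at (8 - 1) * 2000 = 14000 ms. A small sketch of that relationship, using hypothetical helper names, could look like this:

```python
def build_scroll_script(scrolls: int, interval_ms: int) -> str:
    """Return JavaScript that schedules `scrolls` scroll-to-bottom actions."""
    return (
        f"for (let i = 0; i < {scrolls}; i++) "
        f"setTimeout(() => window.scrollTo(0, document.body.scrollHeight), i * {interval_ms});"
    )


def last_scroll_at_ms(scrolls: int, interval_ms: int) -> int:
    """Delay of the final scheduled scroll; the first one fires at 0 ms."""
    return max(scrolls - 1, 0) * interval_ms


script = build_scroll_script(8, 2000)
# Any fixed wait should be at least this long, plus time for the content to load.
print(last_scroll_at_ms(8, 2000))
```

Generating both values from the same two numbers keeps the wait in sync when the scroll count or interval changes.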
If we run the above spider and look at the result file, we'll find all the review data scraped:
[
....
{"rate": 5, "text": "I've tried many similar apps, but this one stands out with its exceptional performance and features."},
{"rate": 2, "text": "The app's user interface is outdated and not intuitive. It needs a modern redesign."},
{"rate": 5, "text": "I'm extremely satisfied with this app. It has exceeded my expectations in every way."},
{"rate": 5, "text": "The app's documentation is comprehensive and easy to follow, making it easy to get started."},
{"rate": 5, "text": "The app's performance has been flawless. I haven't experienced any issues or slowdowns."}
]
We can successfully handle infinite scrolling with Scrapy Playwright. However, the script waits for a fixed timeout, which isn't advised. Let's wait for a specific element instead!
Timeouts and Waiting For Elements
Playwright provides support for different waiting types:
- An event.
- A function to finish.
- A load state, such as domcontentloaded or networkidle.
- A URL, in case of navigation.
- A specific element to be present.
- Fixed timeouts.
Relying on dynamic waits rather than fixed timeouts is more efficient, as it avoids unnecessary delays between the script actions.
The load state and fixed timeouts are usually used to wait for the natural page loading without explicit actions from the scraper side:
def start_requests(self):
    url = "https://web-scraping.dev/testimonials"
    yield scrapy.Request(
        url=url,
        meta={
            "playwright": True,
            "playwright_page_methods": [
                # fixed timeout wait
                PageMethod("wait_for_timeout", 5000),
                # wait for the document to load
                PageMethod("wait_for_load_state", "domcontentloaded"),
                # wait for the network to be idle
                PageMethod("wait_for_load_state", "networkidle"),
            ],
        }
    )
In the context of our reviews scraper, we'll wait for the last review on the page to load after the scroll:
import scrapy
from scrapy_playwright.page import PageMethod


class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/testimonials"
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # execute the scroll script
                    PageMethod("evaluate", "for (let i = 0; i < 10; i++) setTimeout(() => window.scrollTo(0, document.body.scrollHeight), i * 2000);"),
                    # wait for the last element to load
                    PageMethod("wait_for_selector", "div.testimonial:nth-child(60)"),
                ],
            }
        )

    async def parse(self, response):
        reviews = response.css("div.testimonial")
        for review in reviews:
            yield {
                "rate": len(review.css("span.rating > svg").getall()),
                "text": review.css("p.text::text").get()
            }
Here, we use the same JavaScript code to scroll and add an additional PageMethod to wait for the last review element to appear in the HTML.
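One caveat with :nth-child(60) is that it counts an element's position among all of its siblings, not only among the matching ones. If the container holds other children, waiting on the number of matches may be more robust. The sketch below builds such a condition as a JavaScript expression; min_count_expression is a hypothetical helper, and passing its result to Playwright's wait_for_function page method is an assumption about how it would be wired into the spider.

```python
import json


def min_count_expression(selector: str, minimum: int) -> str:
    """Return a JS expression that is true once `minimum` elements match `selector`."""
    # json.dumps yields a safely quoted JavaScript string literal.
    return f"document.querySelectorAll({json.dumps(selector)}).length >= {int(minimum)}"


expression = min_count_expression("div.testimonial", 60)
print(expression)
# Possible use in the spider (assumption, not run here):
# PageMethod("wait_for_function", expression)
```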
Taking Screenshots
To capture a screenshot with Scrapy Playwright, we can utilize the screenshot page method:
import scrapy
from scrapy_playwright.page import PageMethod


class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/testimonials"
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod(
                        "screenshot",
                        path="screenshot.png",
                        full_page=True  # whether to capture the whole page
                    ),
                ],
            }
        )
Here, we use the screenshot method in a PageMethod to save the screenshot to the project directory. However, screenshots are usually captured after some browser actions. Luckily, we can capture the screenshot from the callback function instead:
import scrapy


class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/testimonials"
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_include_page": True
            }
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.screenshot(path="screenshot.png", full_page=True)
Here, we pass the Playwright page instance to the callback using the playwright_include_page parameter. Then, we access it from the response metadata and use it to take a screenshot directly.
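When the page object is handed to the callback, the scrapy-playwright documentation advises closing it once it is no longer needed so that pages don't pile up. A minimal sketch of that pattern is shown below. FakePage is a stand-in defined only so the snippet runs without a browser, and screenshot_then_close is a hypothetical helper name.

```python
import asyncio


async def screenshot_then_close(page, path):
    """Take a full-page screenshot and always close the page afterwards."""
    try:
        await page.screenshot(path=path, full_page=True)
    finally:
        # Runs even if the screenshot raises, so the page is never leaked.
        await page.close()


class FakePage:
    """Stand-in for a Playwright page that records the calls it receives."""

    def __init__(self):
        self.calls = []

    async def screenshot(self, path, full_page=False):
        self.calls.append(("screenshot", path, full_page))

    async def close(self):
        self.calls.append(("close",))


fake = FakePage()
asyncio.run(screenshot_then_close(fake, "screenshot.png"))
print(fake.calls)
```

In the spider, the same helper could be awaited from parse() with response.meta["playwright_page"] as the page argument.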
Clicking Buttons And Filling Forms
Interacting with the DOM elements on a page is a common task in web scraping. In this Scrapy Playwright tutorial, we'll explain clicking buttons and filling forms by attempting to log in to the web-scraping.dev/login example.
We'll create a Scrapy Playwright spider to request the page URL, accept the cookies policy, fill in the login credentials, and then click the login button:
# spiders/login.py
# scrapy crawl login
import scrapy
from scrapy_playwright.page import PageMethod


class LoginSpider(scrapy.Spider):
    name = "login"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/login?cookies="
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # wait for the page to fully load
                    PageMethod("wait_for_load_state", "networkidle"),
                    # accept the cookie policy
                    PageMethod("click", "button#cookie-ok"),
                    # fill in the login credentials
                    PageMethod("fill", "input[name='username']", "user123"),
                    PageMethod("fill", "input[name='password']", "password"),
                    # click the submit button
                    PageMethod("click", "button[type='submit']"),
                    # wait for an element on the redirect page
                    PageMethod("wait_for_selector", "div#secret-message"),
                ]
            }
        )

    def parse(self, response):
        print(f"The secret message is {response.css('div#secret-message::text').get()}")
        "The secret message is 🤫"
Before you run the above spider, make sure to disable the default Scrapy headers by adding the following line to the settings.py file: PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
In the above Scrapy Playwright scraper, we use the click and fill methods to complete the login process, while waiting for the load state and the target element between the steps to ensure successful execution.
ScrapFly: Scrapy Playwright Alternative
ScrapFly is a web scraping API that supports scraping dynamic web pages using a JavaScript rendering feature. It also provides built-in JavaScript scenarios for controlling the headless browsers for common scraping use cases, such as waiting for elements, scrolling, filling and clicking elements.
Moreover, ScrapFly allows for scraping at scale by providing:
- Anti-scraping protection bypass: For scraping any website without getting blocked.
- Residential proxies in over 50 countries: For avoiding IP address blocking and throttling while also allowing for scraping from almost any geographical location.
- Scrapy Integration, as well as Python and TypeScript SDKs.
- And much more!
ScrapFly is available as a Scrapy integration. Simply add the following lines to the settings.py file in the Scrapy project to authorize the API calls and set the concurrency limit:
SCRAPFLY_API_KEY = "Your ScrapFly API key"
CONCURRENT_REQUESTS = 2 # Adjust according to your plan limit rate and your needs
Let's replicate the latest Scrapy spider with the ScrapFly API. All we have to do is enable the asp parameter to avoid scraping blocking and control the headless browser through the JavaScript scenarios.
ScrapFly X Scrapy:
from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse


class LoginSpider(ScrapflySpider):
    name = 'login'
    allowed_domains = ['web-scraping.dev']

    def start_requests(self):
        yield ScrapflyScrapyRequest(
            scrape_config=ScrapeConfig(
                # target website URL
                url="https://web-scraping.dev/login?cookies=",
                # bypass anti scraping protection
                asp=True,
                # set the proxy location to a specific country
                country="US",
                # enable JavaScript rendering
                render_js=True,
                # scroll down the page automatically
                auto_scroll=True,
                # add JavaScript scenarios
                js_scenario=[
                    {"click": {"selector": "button#cookie-ok"}},
                    {"fill": {"selector": "input[name='username']", "clear": True, "value": "user123"}},
                    {"fill": {"selector": "input[name='password']", "clear": True, "value": "password"}},
                    {"click": {"selector": "form > button[type='submit']"}},
                    {"wait_for_navigation": {"timeout": 5000}}
                ],
                # take a screenshot
                screenshots={"logged_in_screen": "fullpage"}
            ),
            callback=self.parse
        )

    def parse(self, response: ScrapflyScrapyResponse):
        print(f"The secret message is {response.css('div#secret-message::text').get()}")
        "The secret message is 🤫"
ScrapFly SDK:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # target website URL
        url="https://web-scraping.dev/login?cookies=",
        # bypass anti scraping protection
        asp=True,
        # set the proxy location to a specific country
        country="US",
        # # enable the cookies policy
        # headers={"cookie": "cookiesAccepted=true"},
        # enable JavaScript rendering
        render_js=True,
        # scroll down the page automatically
        auto_scroll=True,
        # add JavaScript scenarios
        js_scenario=[
            {"click": {"selector": "button#cookie-ok"}},
            {"fill": {"selector": "input[name='username']", "clear": True, "value": "user123"}},
            {"fill": {"selector": "input[name='password']", "clear": True, "value": "password"}},
            {"click": {"selector": "form > button[type='submit']"}},
            {"wait_for_navigation": {"timeout": 5000}}
        ],
        # take a screenshot
        screenshots={"logged_in_screen": "fullpage"},
        debug=True
    )
)
# get the HTML from the response
html = api_response.scrape_result['content']
# use the built-in Parsel selector
selector = api_response.selector
print(f"The secret message is {selector.css('div#secret-message::text').get()}")
"The secret message is 🤫"
Sign up to get your API key!
FAQ
To wrap up this guide on web scraping with Scrapy Playwright, let's have a look at some frequently asked questions.
How to solve the error "NotImplementedError in ('twisted.internet.asyncioreactor.AsyncioSelectorReactor')"?
This is a common error that occurs while running scrapy-playwright on Windows. It happens due to the lack of support for the SelectorEventLoop on Windows. The alternative for using Scrapy Playwright on Windows is running it on WSL. For further details, refer to the official scrapy-playwright known issues.
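To surface this limitation early instead of hitting the error mid-run, a platform check can be sketched as below. It assumes the limitation described above applies to the installed scrapy-playwright version; running_on_native_windows is a hypothetical helper name. Inside WSL, sys.platform reports linux, so the check passes there.

```python
import sys


def running_on_native_windows(platform_name: str = sys.platform) -> bool:
    """Return True when the interpreter runs on native Windows (not WSL)."""
    return platform_name.startswith("win")


if running_on_native_windows():
    print("Native Windows detected: consider running the spider inside WSL.")
```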
Can I scrape dynamic web pages with Scrapy?
Yes. Scrapy Playwright is a middleware integration that enables scraping dynamic pages with Scrapy by processing the requests using a Playwright instance.
Are there alternatives for Scrapy Playwright?
Yes, there are other integrations that allow Scrapy to scrape dynamic web pages, such as Scrapy Selenium and Scrapy Splash.
Summary
In this guide, we explored the scrapy-playwright integration, which allows scraping dynamic web pages with Scrapy using Playwright headless browsers.
We went through a step-by-step guide on installing Scrapy Playwright and using it through an example project. We also explained how to implement common web scraping use cases with Scrapy Playwright, such as:
- Handling infinite scrolling while scraping.
- Executing custom JavaScript code.
- Applying timeout waits.
- Taking screenshots.
- Clicking buttons and filling out forms.