I spent a week trying to scrape a React-based e-commerce site with regular Scrapy. The page source was nearly empty. Just a <div id="root"></div> and a bunch of JavaScript files.
I tried everything. Different selectors. XPath. Nothing worked because the content didn't exist until JavaScript ran.
Then I discovered Scrapy-Playwright. Suddenly, scraping JavaScript-heavy sites became easy. Let me show you everything you need to know.
What Is Scrapy-Playwright?
Scrapy-Playwright integrates Playwright (a browser automation tool) with Scrapy.
What Playwright does:
- Launches real browsers (Chromium, Firefox, WebKit)
- Executes JavaScript
- Renders pages fully
- Handles dynamic content
- Supports modern web features
Why use it with Scrapy:
- Scrape JavaScript-heavy sites
- Handle infinite scroll
- Interact with pages (click, type, scroll)
- Take screenshots
- Bypass simple bot detection
Installation
Step 1: Install Scrapy-Playwright
pip install scrapy-playwright
Step 2: Install Playwright Browsers
playwright install
This downloads Chromium, Firefox, and WebKit (a download of several hundred megabytes in total).
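If you only need one engine (Chromium is the most common choice), you can install just that one:
playwright install chromium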
Step 3: Enable in Scrapy
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
That's it! You're ready. Requests without the playwright meta key still go through Scrapy's regular download handler, so enabling this doesn't slow down the rest of your spider.
Your First Playwright Spider
Basic Example
import scrapy

class PlaywrightSpider(scrapy.Spider):
    name = 'playwright_basic'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com',
            meta={'playwright': True}  # Enable Playwright for this request
        )

    def parse(self, response):
        # JavaScript has executed!
        # Response contains fully rendered HTML
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
Key point: Add meta={'playwright': True} to enable Playwright for that request.
Choosing Browser
You can choose which browser to use:
# settings.py
PLAYWRIGHT_BROWSER_TYPE = 'chromium' # Default
# or 'firefox'
# or 'webkit'
When to use which:
- Chromium: Best compatibility, most features
- Firefox: Good for debugging, different fingerprint
- WebKit: Safari engine, for Mac/iOS specific sites
Headless vs Headed Mode
Headless (Default)
The browser runs without a visible window. Faster, and it uses fewer resources.
# settings.py
PLAYWRIGHT_LAUNCH_OPTIONS = {
'headless': True # Default
}
Headed (Visible Browser)
Useful for debugging: you can watch what the browser is doing.
PLAYWRIGHT_LAUNCH_OPTIONS = {
'headless': False # See the browser
}
What the docs don't tell you:
- Headed mode is 2-3x slower
- Use headed only for debugging
- Production should always be headless
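One convenient pattern is to key headless mode off an environment variable so the same settings file works for both. A minimal sketch; HEADFUL here is my own convention, not a Playwright or Scrapy flag:
# settings.py
import os

PLAYWRIGHT_LAUNCH_OPTIONS = {
    # Run headed only when explicitly requested: HEADFUL=1 scrapy crawl myspider
    'headless': os.environ.get('HEADFUL') != '1',
}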
Waiting for Content
JavaScript takes time to run. Tell Playwright when to consider the page "ready".
Wait for Selector
Most common approach. Wait until a specific element appears. Note that scrapy-playwright expects PageMethod objects in playwright_page_methods, not plain dicts:
from scrapy_playwright.page import PageMethod

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                PageMethod('wait_for_selector', '.product'),  # Wait for products to load
            ]
        }
    )
(The remaining snippets assume the PageMethod import.)
Wait for Network Idle
Wait until there has been no network activity for a while:
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('wait_for_load_state', 'networkidle'),  # No requests for 500 ms
    ]
}
Load states:
- 'load' - the page load event has fired
- 'domcontentloaded' - the DOM is ready
- 'networkidle' - no network activity for at least 500 ms
Wait for Timeout
Simple time delay:
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('wait_for_timeout', 3000),  # Wait 3 seconds
    ]
}
Multiple Waits
Chain multiple wait conditions:
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('wait_for_selector', '.loading'),  # Wait for loader to appear
        PageMethod('wait_for_selector', '.loading', state='hidden'),  # Wait for it to disappear
        PageMethod('wait_for_selector', '.product'),  # Wait for products
    ]
}
Page Interactions
Click buttons, type text, scroll:
Click Elements
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('click', 'button.load-more'),  # Click "Load More" button
        PageMethod('wait_for_selector', '.new-products'),  # Wait for new content
    ]
}
Type in Inputs
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('fill', 'input#search', 'laptop'),  # Type in search box
        PageMethod('click', 'button.search'),  # Click search button
        PageMethod('wait_for_selector', '.results'),  # Wait for results
    ]
}
Scroll Page
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('evaluate', 'window.scrollTo(0, document.body.scrollHeight)'),  # Scroll to bottom
        PageMethod('wait_for_timeout', 2000),  # Wait for content to load
    ]
}
Select Dropdown
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('select_option', 'select#category', 'electronics'),
    ]
}
Screenshots
Take screenshots of pages:
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('screenshot', path='screenshot.png', full_page=True),
    ]
}
Options:
- full_page=True - entire page (scrolls automatically)
- full_page=False - visible viewport only
- path - where to save the screenshot
(Note that the Python API uses full_page, not the JavaScript-style fullPage.)
Accessing Page Object
For advanced interactions, get access to the page object:
def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_include_page': True  # Include page object
        },
        callback=self.parse
    )

async def parse(self, response):
    page = response.meta['playwright_page']
    # Now you can use the full Playwright API
    await page.click('button.load-more')
    await page.wait_for_selector('.new-products')
    # Get updated HTML
    content = await page.content()
    # Don't forget to close the page!
    await page.close()
    # Parse content
    from scrapy.http import HtmlResponse
    new_response = HtmlResponse(
        url=response.url,
        body=content.encode('utf-8'),
        encoding='utf-8'
    )
    for product in new_response.css('.product'):
        yield {'name': product.css('h2::text').get()}
Important: when using playwright_include_page, your callback MUST be async, and you should also close the page in an errback in case the request fails before your callback runs.
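A minimal errback for that (the full example later in this post uses the same pattern); attach it with errback=self.errback_close_page on the request:
async def errback_close_page(self, failure):
    # Close the page so a failed request doesn't leak a browser tab
    page = failure.request.meta.get('playwright_page')
    if page:
        await page.close()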
Handling Infinite Scroll
Common pattern for infinite scroll sites:
async def parse(self, response):
    page = response.meta['playwright_page']
    # Scroll multiple times
    for i in range(10):  # Scroll 10 times
        # Scroll to bottom
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
        # Wait for new content to load
        await page.wait_for_timeout(2000)
    # Get final HTML
    content = await page.content()
    await page.close()
    # Parse all loaded content
    from scrapy.http import HtmlResponse
    new_response = HtmlResponse(url=response.url, body=content.encode('utf-8'), encoding='utf-8')
    for product in new_response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }
Network Interception
Intercept and modify network requests:
async def parse(self, response):
    page = response.meta['playwright_page']

    # Block images and CSS to speed things up
    async def route_handler(route):
        if route.request.resource_type in ['image', 'stylesheet']:
            await route.abort()
        else:
            await route.continue_()

    await page.route('**/*', route_handler)
    # Continue with the page (routes apply to navigations from here on)
    await page.goto('https://example.com/products')
    # ... rest of scraping
Performance Optimization
Block Unnecessary Resources
# settings.py
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,
    'args': [
        '--blink-settings=imagesEnabled=false',  # Chromium-specific: skip loading images
    ]
}
There is no reliable launch flag for disabling CSS; block it at the network level instead.
A better approach is scrapy-playwright's PLAYWRIGHT_ABORT_REQUEST setting, a predicate that is called for every browser request and aborts the ones you don't need:
# settings.py
def should_abort_request(request):
    # Skip images, stylesheets and fonts entirely
    return request.resource_type in ('image', 'stylesheet', 'font')

PLAYWRIGHT_ABORT_REQUEST = should_abort_request
Reduce Browser Count
# settings.py
PLAYWRIGHT_MAX_CONTEXTS = 2  # Max concurrent browser contexts (default: no limit)
Lower number = less memory but slower.
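If memory is still tight, scrapy-playwright also lets you cap the number of pages per context; as I understand it, the default follows Scrapy's CONCURRENT_REQUESTS:
# settings.py
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4  # Fewer simultaneous tabs per browser context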
Close Contexts Properly
Always close pages when done:
async def parse(self, response):
    page = response.meta.get('playwright_page')
    try:
        # Your scraping logic
        pass
    finally:
        if page:
            await page.close()
Common Patterns
Pattern 1: Login and Then Scrape
async def parse(self, response):
    page = response.meta['playwright_page']
    # Login
    await page.fill('input#username', 'myuser')
    await page.fill('input#password', 'mypass')
    await page.click('button.login')
    await page.wait_for_selector('.dashboard')
    # Now scrape protected content
    await page.goto('https://example.com/protected/data')
    content = await page.content()
    final_url = page.url  # Capture before closing the page
    await page.close()
    # Parse
    new_response = HtmlResponse(url=final_url, body=content.encode('utf-8'), encoding='utf-8')
    for item in new_response.css('.item'):
        yield {'data': item.css('.data::text').get()}
Pattern 2: Handle Popups
async def parse(self, response):
    page = response.meta['playwright_page']
    # Close popup if it appears
    try:
        await page.click('.popup-close', timeout=2000)
    except Exception:
        pass  # No popup appeared within 2 seconds; continue
    # Continue scraping
    # ...
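If you'd rather not swallow every exception, Playwright ships a dedicated timeout error you can catch instead; a sketch:
from playwright.async_api import TimeoutError as PlaywrightTimeoutError

try:
    await page.click('.popup-close', timeout=2000)
except PlaywrightTimeoutError:
    pass  # No popup appeared; any other error still raises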
Pattern 3: Extract from Shadow DOM
async def parse(self, response):
    page = response.meta['playwright_page']
    # Access shadow DOM via JavaScript (works for open shadow roots;
    # Playwright's CSS selectors also pierce open shadow DOM by default)
    shadow_content = await page.evaluate('''
        () => {
            const host = document.querySelector('my-component');
            const shadowRoot = host.shadowRoot;
            return shadowRoot.querySelector('.data').textContent;
        }
    ''')
    yield {'shadow_data': shadow_content}
    await page.close()
Error Handling
Handle Playwright errors gracefully:
async def parse(self, response):
    page = response.meta.get('playwright_page')
    if not page:
        self.logger.error('No Playwright page available')
        return
    content = None
    try:
        # Try to wait for selector
        await page.wait_for_selector('.product', timeout=10000)
        content = await page.content()
    except Exception as e:
        self.logger.error(f'Playwright error: {e}')
        # Take screenshot for debugging
        await page.screenshot(path=f'error_{response.url.split("/")[-1]}.png')
    finally:
        await page.close()
    if content is None:
        return
    # Continue parsing
    # ...
Real-World Example: Scraping SPA
Complete example for Single Page Application:
import scrapy
from scrapy.http import HtmlResponse
from scrapy_playwright.page import PageMethod

class SPASpider(scrapy.Spider):
    name = 'spa'
    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
            'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
        },
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
        'PLAYWRIGHT_BROWSER_TYPE': 'chromium',
        'PLAYWRIGHT_LAUNCH_OPTIONS': {
            'headless': True,
            'timeout': 30000
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_methods': [
                    PageMethod('wait_for_selector', '.product-list'),
                ]
            },
            callback=self.parse,
            errback=self.errback_playwright
        )

    async def parse(self, response):
        page = response.meta['playwright_page']
        try:
            # Wait for products to load
            await page.wait_for_selector('.product', timeout=10000)
            # Scroll to load all products (infinite scroll)
            previous_height = 0
            while True:
                # Get current scroll height
                current_height = await page.evaluate('document.body.scrollHeight')
                # If no change, we've reached the end
                if current_height == previous_height:
                    break
                previous_height = current_height
                # Scroll to bottom
                await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
                # Wait for new content
                await page.wait_for_timeout(2000)
            # Get final HTML
            content = await page.content()
            self.logger.info(f'Loaded all products, page height: {current_height}px')
        except Exception as e:
            self.logger.error(f'Error during page interaction: {e}')
            await page.screenshot(path='error.png')
            return
        finally:
            await page.close()
        # Parse the fully loaded page
        final_response = HtmlResponse(
            url=response.url,
            body=content.encode('utf-8'),
            encoding='utf-8'
        )
        products = final_response.css('.product')
        self.logger.info(f'Found {len(products)} products')
        for product in products:
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'image': product.css('img::attr(src)').get(),
                'url': product.css('a::attr(href)').get()
            }

    async def errback_playwright(self, failure):
        page = failure.request.meta.get('playwright_page')
        if page:
            await page.close()
        self.logger.error(f'Request failed: {failure.value}')
Debugging Tips
Enable Debug Logging
# settings.py
LOG_LEVEL = 'DEBUG'  # scrapy-playwright logs through the standard Scrapy/Python logging system
Take Screenshots at Each Step
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('screenshot', path='step1.png'),
        PageMethod('click', 'button.load-more'),
        PageMethod('wait_for_timeout', 2000),
        PageMethod('screenshot', path='step2.png'),
    ]
}
Run in Headed Mode
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': False,
    'slow_mo': 1000  # Slow each browser operation down by 1 second
}
Common Mistakes
Mistake #1: Forgetting async/await
# BAD: the click silently never happens (un-awaited coroutine)
def parse(self, response):
    page = response.meta['playwright_page']
    page.click('button')  # Missing await!

# GOOD
async def parse(self, response):
    page = response.meta['playwright_page']
    await page.click('button')
Mistake #2: Not Closing Pages
# BAD (leaks a browser page)
async def parse(self, response):
    page = response.meta['playwright_page']
    content = await page.content()
    # Forgot to close!

# GOOD
async def parse(self, response):
    page = response.meta['playwright_page']
    try:
        content = await page.content()
    finally:
        await page.close()
Mistake #3: Using Playwright for Everything
# BAD (unnecessary)
yield scrapy.Request(url, meta={'playwright': True})
# If the site works without JavaScript, don't use Playwright!

# GOOD: use Playwright only when needed
# (needs_javascript is a hypothetical helper; one concrete approach is sketched below)
if self.needs_javascript(url):
    yield scrapy.Request(url, meta={'playwright': True})
else:
    yield scrapy.Request(url)  # Regular Scrapy
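One way to implement that check without a separate helper is to try plain Scrapy first and fall back to Playwright only when the static HTML turns out to be empty. A minimal sketch; the URL and the .product selector are placeholders:
import scrapy

class AdaptiveSpider(scrapy.Spider):
    name = 'adaptive'

    def start_requests(self):
        # Plain request first: far cheaper than launching a browser
        yield scrapy.Request('https://example.com/products', callback=self.parse)

    def parse(self, response):
        products = response.css('.product')
        if products:
            for product in products:
                yield {'name': product.css('h2::text').get()}
        elif not response.meta.get('playwright'):
            # Static HTML was empty: retry the same URL with a real browser
            yield response.request.replace(
                meta={'playwright': True},
                dont_filter=True,  # allow re-requesting the same URL
            )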
When to Use Playwright
Use Playwright when:
- Content loaded by JavaScript
- Need to interact with page (click, scroll, type)
- Infinite scroll
- Single Page Applications (React, Vue, Angular)
- Need screenshots
- Content in Shadow DOM
Don't use Playwright when:
- Content in HTML source (check with Ctrl+U)
- API available (much faster)
- Simple static sites
- Speed is critical
Rule of thumb: Check page source first. If data is there, use regular Scrapy!
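You can also run that check interactively with Scrapy's shell before writing any code (the .product selector stands in for wherever your target data lives):
scrapy shell 'https://example.com/products'
>>> response.css('.product')
[]   # Empty with plain Scrapy? The content is rendered client-side: use Playwright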
Summary
Installation:
pip install scrapy-playwright
playwright install
Enable in settings:
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
Basic usage:
meta={'playwright': True}
Page interactions:
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('wait_for_selector', '.product'),
        PageMethod('click', 'button'),
        PageMethod('screenshot', path='page.png'),
    ]
}
Advanced (page object):
meta={'playwright': True, 'playwright_include_page': True}
# Then use: page = response.meta['playwright_page']
Remember:
- Use only when needed
- Always close pages
- async/await required with page object
- Check page source first
- Headed mode for debugging only
Scrapy-Playwright is powerful but slower than regular Scrapy. Use it wisely!
Happy scraping! 🕷️