DEV Community

Muhammad Ikramullah Khan


Scrapy-Playwright Complete Guide: Scrape JavaScript Sites Like a Pro

I spent a week trying to scrape a React-based e-commerce site with regular Scrapy. The page source was nearly empty. Just a <div id="root"></div> and a bunch of JavaScript files.

I tried everything. Different selectors. XPath. Nothing worked because the content didn't exist until JavaScript ran.

Then I discovered Scrapy-Playwright. Suddenly, scraping JavaScript-heavy sites became easy. Let me show you everything you need to know.


What Is Scrapy-Playwright?

Scrapy-Playwright integrates Playwright (a browser automation tool) with Scrapy.

What Playwright does:

  • Launches real browsers (Chromium, Firefox, WebKit)
  • Executes JavaScript
  • Renders pages fully
  • Handles dynamic content
  • Supports modern web features

Why use it with Scrapy:

  • Scrape JavaScript-heavy sites
  • Handle infinite scroll
  • Interact with pages (click, type, scroll)
  • Take screenshots
  • Bypass simple bot detection

Installation

Step 1: Install Scrapy-Playwright

pip install scrapy-playwright

Step 2: Install Playwright Browsers

playwright install

This downloads Chromium, Firefox, and WebKit (a few hundred MB in total). If you only need one engine, run playwright install chromium to download just Chromium.

Step 3: Enable in Scrapy

# settings.py

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

That's it! You're ready.


Your First Playwright Spider

Basic Example

import scrapy

class PlaywrightSpider(scrapy.Spider):
    name = 'playwright_basic'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com',
            meta={'playwright': True}  # Enable Playwright for this request
        )

    def parse(self, response):
        # JavaScript has executed!
        # Response contains fully rendered HTML

        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

Key point: Add meta={'playwright': True} to enable Playwright for that request.


Choosing a Browser

You can choose which browser to use:

# settings.py

PLAYWRIGHT_BROWSER_TYPE = 'chromium'  # Default
# or 'firefox'
# or 'webkit'

When to use which:

  • Chromium: Best compatibility, most features
  • Firefox: Good for debugging, different fingerprint
  • WebKit: Safari engine, for Mac/iOS specific sites

Headless vs Headed Mode

Headless (Default)

The browser runs without a visible window. Faster, and it uses fewer resources.

# settings.py
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True  # Default
}

Headed (Visible Browser)

Useful for debugging: you can watch what the browser is doing.

PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': False  # See the browser
}

What the docs don't tell you:

  • Headed mode is 2-3x slower
  • Use headed only for debugging
  • Production should always be headless
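If you want headed mode only while debugging one spider, you can override the launch options per spider instead of editing settings.py globally. A minimal sketch (the spider name is made up for illustration):

```python
import scrapy

class DebugSpider(scrapy.Spider):
    name = 'debug_spider'  # hypothetical name for illustration

    # Per-spider override: visible browser, slowed down for observation
    custom_settings = {
        'PLAYWRIGHT_LAUNCH_OPTIONS': {
            'headless': False,
            'slow_mo': 500,  # pause 500ms between browser actions
        }
    }
```

Your other spiders keep running headless from the global settings.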

Waiting for Content

JavaScript takes time to run. Tell Playwright when to consider the page ready.

Wait for Selector

The most common approach: wait until a specific element appears.

from scrapy_playwright.page import PageMethod

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                PageMethod('wait_for_selector', '.product'),  # Wait for products to load
            ]
        }
    )

Wait for Network Idle

Wait until there have been no network requests for a short while:

meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('wait_for_load_state', 'networkidle'),  # No requests for 500ms
    ]
}

Load states:

  • 'load' - Page load event fired
  • 'domcontentloaded' - DOM is ready
  • 'networkidle' - No network requests for at least 500ms

Wait for Timeout

Simple time delay:

meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('wait_for_timeout', 3000),  # Wait 3 seconds
    ]
}

Multiple Waits

Chain multiple wait conditions:

meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('wait_for_selector', '.loading'),  # Wait for loader to appear
        PageMethod('wait_for_selector', '.loading', state='hidden'),  # Wait for it to disappear
        PageMethod('wait_for_selector', '.product'),  # Wait for products
    ]
}

Page Interactions

Click buttons, type text, scroll:

Click Elements

meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('click', 'button.load-more'),  # Click "Load More" button
        PageMethod('wait_for_selector', '.new-products'),  # Wait for new content
    ]
}

Type in Inputs

meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('fill', 'input#search', 'laptop'),  # Type in search box
        PageMethod('click', 'button.search'),  # Click search button
        PageMethod('wait_for_selector', '.results'),  # Wait for results
    ]
}

Scroll Page

meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('evaluate', 'window.scrollTo(0, document.body.scrollHeight)'),  # Scroll to bottom
        PageMethod('wait_for_timeout', 2000),  # Wait for content to load
    ]
}

Select Dropdown

meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('select_option', 'select#category', 'electronics'),
    ]
}

Screenshots

Take screenshots of pages:

meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('screenshot', path='screenshot.png', full_page=True),
    ]
}

Options (note the Python API uses snake_case full_page, not the JavaScript fullPage):

  • full_page=True - Entire page (scrolls automatically)
  • full_page=False - Visible viewport only (the default)
  • path - Where to save the screenshot

Accessing Page Object

For advanced interactions, get access to the page object:

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_include_page': True  # Include page object
        },
        callback=self.parse
    )

async def parse(self, response):
    page = response.meta['playwright_page']

    # Now you can use full Playwright API
    await page.click('button.load-more')
    await page.wait_for_selector('.new-products')

    # Get updated HTML
    content = await page.content()

    # Don't forget to close page!
    await page.close()

    # Parse content
    from scrapy.http import HtmlResponse
    new_response = HtmlResponse(
        url=response.url,
        body=content.encode('utf-8'),
        encoding='utf-8'
    )

    for product in new_response.css('.product'):
        yield {'name': product.css('h2::text').get()}

Important: When using playwright_include_page, your callback MUST be async!


Handling Infinite Scroll

Common pattern for infinite scroll sites:

async def parse(self, response):
    page = response.meta['playwright_page']

    # Scroll multiple times
    for i in range(10):  # Scroll 10 times
        # Scroll to bottom
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')

        # Wait for new content to load
        await page.wait_for_timeout(2000)

    # Get final HTML
    content = await page.content()
    await page.close()

    # Parse all loaded content
    from scrapy.http import HtmlResponse
    new_response = HtmlResponse(url=response.url, body=content.encode('utf-8'), encoding='utf-8')

    for product in new_response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }

Network Interception

Intercept and modify network requests:

async def parse(self, response):
    page = response.meta['playwright_page']

    # Block images and CSS to speed up
    async def route_handler(route):
        if route.request.resource_type in ['image', 'stylesheet']:
            await route.abort()
        else:
            await route.continue_()

    await page.route('**/*', route_handler)

    # Continue with page
    await page.goto('https://example.com/products')

    # ... rest of scraping

Performance Optimization

Block Unnecessary Resources

# settings.py
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,
    'args': [
        '--blink-settings=imagesEnabled=false',  # Chromium: don't load images
    ]
}

Better approach: scrapy-playwright's PLAYWRIGHT_ABORT_REQUEST setting, a callable that decides for each browser request whether to abort it:

# settings.py
def should_abort_request(request):
    # Skip images, stylesheets and fonts to save bandwidth
    return request.resource_type in ('image', 'stylesheet', 'font')

PLAYWRIGHT_ABORT_REQUEST = should_abort_request

Reduce Browser Count

# settings.py
PLAYWRIGHT_MAX_CONTEXTS = 2  # Max concurrent browser contexts (default: no limit)

A lower number means less memory, but a slower crawl.
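Memory and concurrency are governed by a few related settings working together. A sketch of a memory-conscious configuration (the numbers are illustrative, not recommendations):

```python
# settings.py -- a memory-conscious sketch; tune the values for your machine

# Cap how many browser contexts scrapy-playwright keeps open at once
PLAYWRIGHT_MAX_CONTEXTS = 2

# Cap pages per context (each open page holds a rendered DOM in memory)
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4

# Scrapy's own concurrency still applies; keep it in the same ballpark
CONCURRENT_REQUESTS = 8

# Fail fast instead of letting slow pages pile up
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000  # milliseconds
```

If Scrapy schedules far more concurrent requests than Playwright can open pages, requests just queue up waiting for a page, so keep the two limits roughly aligned.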

Close Contexts Properly

Always close pages when done:

async def parse(self, response):
    page = response.meta.get('playwright_page')

    try:
        # Your scraping logic
        pass
    finally:
        if page:
            await page.close()

Common Patterns

Pattern 1: Login and Then Scrape

async def parse(self, response):
    page = response.meta['playwright_page']

    # Login
    await page.fill('input#username', 'myuser')
    await page.fill('input#password', 'mypass')
    await page.click('button.login')
    await page.wait_for_selector('.dashboard')

    # Now scrape protected content
    await page.goto('https://example.com/protected/data')
    content = await page.content()
    await page.close()

    # Parse
    new_response = HtmlResponse(url=response.url, body=content.encode())
    for item in new_response.css('.item'):
        yield {'data': item.css('.data::text').get()}
Enter fullscreen mode Exit fullscreen mode

Pattern 2: Handle Popups

async def parse(self, response):
    page = response.meta['playwright_page']

    # Close popup if it appears
    try:
        await page.click('.popup-close', timeout=2000)
    except:
        pass  # No popup, continue

    # Continue scraping
    # ...

Pattern 3: Extract from Shadow DOM

async def parse(self, response):
    page = response.meta['playwright_page']

    # Access shadow DOM
    shadow_content = await page.evaluate('''
        () => {
            const host = document.querySelector('my-component');
            const shadowRoot = host.shadowRoot;
            return shadowRoot.querySelector('.data').textContent;
        }
    ''')

    yield {'shadow_data': shadow_content}
    await page.close()

Error Handling

Handle Playwright errors gracefully:

async def parse(self, response):
    page = response.meta.get('playwright_page')

    if not page:
        self.logger.error('No Playwright page available')
        return

    content = None
    try:
        # Try to wait for selector
        await page.wait_for_selector('.product', timeout=10000)

        content = await page.content()

    except Exception as e:
        self.logger.error(f'Playwright error: {e}')

        # Take screenshot for debugging
        await page.screenshot(path=f'error_{response.url.split("/")[-1]}.png')

    finally:
        await page.close()

    if content is None:
        return  # Nothing rendered, nothing to parse

    # Continue parsing
    # ...

Real-World Example: Scraping SPA

Complete example for Single Page Application:

import scrapy
from scrapy_playwright.page import PageMethod

class SPASpider(scrapy.Spider):
    name = 'spa'

    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
            'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
        },
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
        'PLAYWRIGHT_BROWSER_TYPE': 'chromium',
        'PLAYWRIGHT_LAUNCH_OPTIONS': {
            'headless': True,
            'timeout': 30000
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_methods': [
                    PageMethod('wait_for_selector', '.product-list'),
                ]
            },
            callback=self.parse,
            errback=self.errback_playwright
        )

    async def parse(self, response):
        page = response.meta['playwright_page']

        try:
            # Wait for products to load
            await page.wait_for_selector('.product', timeout=10000)

            # Scroll to load all products (infinite scroll)
            previous_height = 0
            while True:
                # Get current scroll height
                current_height = await page.evaluate('document.body.scrollHeight')

                # If no change, we've reached the end
                if current_height == previous_height:
                    break

                previous_height = current_height

                # Scroll to bottom
                await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')

                # Wait for new content
                await page.wait_for_timeout(2000)

            # Get final HTML
            content = await page.content()

            self.logger.info(f'Loaded all products, page height: {current_height}px')

        except Exception as e:
            self.logger.error(f'Error during page interaction: {e}')
            await page.screenshot(path='error.png')
            return

        finally:
            await page.close()

        # Parse the fully loaded page
        from scrapy.http import HtmlResponse
        final_response = HtmlResponse(
            url=response.url,
            body=content.encode('utf-8'),
            encoding='utf-8'
        )

        products = final_response.css('.product')
        self.logger.info(f'Found {len(products)} products')

        for product in products:
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'image': product.css('img::attr(src)').get(),
                'url': product.css('a::attr(href)').get()
            }

    async def errback_playwright(self, failure):
        page = failure.request.meta.get('playwright_page')
        if page:
            await page.close()

        self.logger.error(f'Request failed: {failure.value}')

Debugging Tips

Enable Debug Logging

scrapy-playwright has no dedicated logging switch; it logs through Scrapy's standard logging, so raise the log level:

# settings.py
LOG_LEVEL = 'DEBUG'

Take Screenshots at Each Step

meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('screenshot', path='step1.png'),
        PageMethod('click', 'button.load-more'),
        PageMethod('wait_for_timeout', 2000),
        PageMethod('screenshot', path='step2.png'),
    ]
}

Run in Headed Mode

PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': False,
    'slow_mo': 1000  # Slow down by 1 second per action
}

Common Mistakes

Mistake #1: Forgetting async/await

# BAD (will crash)
def parse(self, response):
    page = response.meta['playwright_page']
    page.click('button')  # Missing await!

# GOOD
async def parse(self, response):
    page = response.meta['playwright_page']
    await page.click('button')

Mistake #2: Not Closing Pages

# BAD (memory leak)
async def parse(self, response):
    page = response.meta['playwright_page']
    content = await page.content()
    # Forgot to close!

# GOOD
async def parse(self, response):
    page = response.meta['playwright_page']
    try:
        content = await page.content()
    finally:
        await page.close()

Mistake #3: Using Playwright for Everything

# BAD (unnecessary)
yield scrapy.Request(url, meta={'playwright': True})
# If site works without JavaScript, don't use Playwright!

# GOOD
# Use Playwright only when needed
if self.needs_javascript(url):
    yield scrapy.Request(url, meta={'playwright': True})
else:
    yield scrapy.Request(url)  # Regular Scrapy

When to Use Playwright

Use Playwright when:

  • Content loaded by JavaScript
  • Need to interact with page (click, scroll, type)
  • Infinite scroll
  • Single Page Applications (React, Vue, Angular)
  • Need screenshots
  • Content in Shadow DOM

Don't use Playwright when:

  • Content in HTML source (check with Ctrl+U)
  • API available (much faster)
  • Simple static sites
  • Speed is critical

Rule of thumb: Check page source first. If data is there, use regular Scrapy!
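That rule of thumb can be automated. A minimal sketch of a "do I need Playwright?" check using only the standard library: parse the raw, pre-JavaScript HTML and see whether the marker class you'd scrape is already there. The HTML strings and class name below are made-up examples.

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Collects every class name that appears in the raw (pre-JavaScript) HTML."""
    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())

def needs_javascript(raw_html: str, marker_class: str) -> bool:
    """True if marker_class is absent from the static HTML,
    i.e. the content is probably rendered client-side."""
    finder = ClassFinder()
    finder.feed(raw_html)
    return marker_class not in finder.classes

# An SPA shell: just a root div, no product markup until JS runs
spa_source = '<html><body><div id="root"></div></body></html>'
# A server-rendered page: the product markup is right there
static_source = '<div class="product"><h2>Widget</h2></div>'

print(needs_javascript(spa_source, "product"))     # True -> use Playwright
print(needs_javascript(static_source, "product"))  # False -> plain Scrapy
```

In practice you'd feed it the body of a plain (non-Playwright) Scrapy response for the same URL before deciding which request type to yield.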


Summary

Installation:

pip install scrapy-playwright
playwright install

Enable in settings:

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}

Basic usage:

meta={'playwright': True}

Page interactions:

meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('wait_for_selector', '.product'),
        PageMethod('click', 'button'),
        PageMethod('screenshot', path='page.png')
    ]
}

Advanced (page object):

meta={'playwright': True, 'playwright_include_page': True}
# Then use: page = response.meta['playwright_page']

Remember:

  • Use only when needed
  • Always close pages
  • async/await required with page object
  • Check page source first
  • Headed mode for debugging only

Scrapy-Playwright is powerful but slower than regular Scrapy. Use it wisely!

Happy scraping! 🕷️
