Muhammad Ikramullah Khan

Scrapy with JavaScript & Dynamic Content: When Your Selectors Return Nothing

I'll never forget the first time I tried scraping a modern website. My selectors worked perfectly in the browser's inspector. But when I ran my spider, everything returned None.

I checked my CSS selectors ten times. I tried XPath. Still nothing. I was going crazy.

Then I viewed the page source (Ctrl+U) and realized: the content wasn't there. The HTML was nearly empty. Everything was loaded by JavaScript after the page rendered.

That's when I learned: Scrapy doesn't run JavaScript. It only sees the initial HTML. Let me show you how to handle JavaScript-heavy sites.


The Problem: Scrapy Doesn't Run JavaScript

When you visit a website in a browser:

  1. Browser downloads HTML
  2. Browser runs JavaScript
  3. JavaScript fetches data (AJAX)
  4. JavaScript builds the page
  5. You see the final result

When Scrapy visits the same site:

  1. Scrapy downloads HTML
  2. Scrapy stops here
  3. JavaScript never runs
  4. Dynamic content never loads
  5. Your selectors find nothing

How to Detect JavaScript-Heavy Sites

Test 1: View Page Source vs Inspect Element

In your browser:

  1. Right-click → "Inspect Element"
  2. Find the element you want to scrape
  3. Note its HTML structure

Then:

  1. Press Ctrl+U (or Cmd+Option+U on Mac)
  2. Search for the same content
  3. Is it there?

If NO: the content is loaded by JavaScript.

If YES: the content is already in the HTML, and plain Scrapy will work.

Test 2: Use Scrapy Shell

scrapy shell "https://example.com"
>>> response.css('.product-name::text').get()
None  # Uh oh, JavaScript site!
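
While you're in the shell, two more quick checks help confirm the diagnosis (the .product class and "Product 1" text are placeholders for whatever you saw in the inspector):

>>> 'Product 1' in response.text   # Is the text anywhere in the raw HTML?
False
>>> len(response.css('.product'))  # How many product nodes did Scrapy actually get?
0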

Test 3: Disable JavaScript in Browser

  1. Open Chrome DevTools (F12)
  2. Press Ctrl+Shift+P
  3. Type "disable javascript"
  4. Select "Disable JavaScript"
  5. Refresh the page

If the page is now empty or broken, it's JavaScript-heavy.


Solution 1: Find the API (Best Approach)

JavaScript-heavy sites load data from APIs. Find these APIs and scrape them directly.

How to Find APIs

Step 1: Open Network Tab

  1. Open DevTools (F12)
  2. Click "Network" tab
  3. Filter by "XHR" or "Fetch"
  4. Refresh the page

Step 2: Look for JSON Responses

Watch the network requests. Look for:

  • Requests to /api/
  • Requests returning JSON
  • Requests with product data

Step 3: Click on Interesting Requests

Click a request → "Preview" tab

If you see your data in JSON format, you found it!

Example: Scraping the API Directly

Let's say you find this API:

https://example.com/api/products?page=1&limit=20

Returns:

{
  "products": [
    {"id": 1, "name": "Product 1", "price": 29.99},
    {"id": 2, "name": "Product 2", "price": 39.99}
  ],
  "total": 100
}

Your Spider:

import scrapy
import json

class ApiSpider(scrapy.Spider):
    name = 'api'

    def start_requests(self):
        # Scrape the API directly, not the webpage
        url = 'https://example.com/api/products?page=1&limit=20'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)

        for product in data['products']:
            yield {
                'name': product['name'],
                'price': product['price']
            }

        # Pagination
        current_page = int(response.url.split('page=')[1].split('&')[0])
        total = data['total']
        items_per_page = 20

        if current_page * items_per_page < total:
            next_page = current_page + 1
            next_url = f'https://example.com/api/products?page={next_page}&limit=20'
            yield scrapy.Request(next_url, callback=self.parse)
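
String-splitting the URL works for this example, but it breaks as soon as the parameter order changes. Here's a slightly sturdier helper using only the standard library (a sketch; next_page_url is a name I'm introducing, not part of Scrapy):

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_page_url(url):
    """Return the same URL with its 'page' query parameter incremented."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query['page'] = [str(int(query['page'][0]) + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))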

Benefits:

  • Much faster (no rendering)
  • Clean JSON data (no HTML parsing)
  • More reliable
  • Often has more data than the webpage

What the docs don't tell you:

  • APIs often have rate limiting (go slower)
  • Some APIs require authentication (check the request headers; see the sketch below)
  • APIs might have different pagination than the website
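
When the API wants browser-like headers or a token, you can usually copy them straight from the Network tab onto your request. A minimal sketch (the header values below are placeholders; use whatever the real request sent):

import scrapy

class AuthApiSpider(scrapy.Spider):
    name = 'auth_api'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/api/products?page=1&limit=20',
            headers={
                'Accept': 'application/json',
                'Referer': 'https://example.com/products',
                # Placeholder token; copy the real one from DevTools
                'Authorization': 'Bearer YOUR_TOKEN_HERE',
            },
            callback=self.parse,
        )

    def parse(self, response):
        # response.json() parses the JSON body for you (Scrapy 2.2+)
        for product in response.json()['products']:
            yield {'name': product['name'], 'price': product['price']}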

Solution 2: Scrapy-Playwright (Modern Browser Automation)

When you can't find the API, use Scrapy-Playwright to render JavaScript.

Installation

pip install scrapy-playwright
playwright install

Basic Setup

settings.py:

# Enable the Playwright download handlers
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Required: scrapy-playwright needs the asyncio Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Optional: use Chromium, Firefox, or WebKit
PLAYWRIGHT_BROWSER_TYPE = 'chromium'

# Launch options
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,   # Run without a visible browser window
    'timeout': 60000,   # Browser launch timeout in milliseconds
}

Basic Spider

import scrapy

class PlaywrightSpider(scrapy.Spider):
    name = 'playwright'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com',
            meta={'playwright': True}  # Enable Playwright for this request
        )

    def parse(self, response):
        # JavaScript has run! Content is now in response
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

Wait for Elements to Load

from scrapy_playwright.page import PageMethod

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Wait for a specific element to appear
                PageMethod('wait_for_selector', '.product-name'),
            ]
        }
    )

Wait for Network Idle

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Wait until there are no network requests for 500 ms
                PageMethod('wait_for_load_state', 'networkidle'),
            ]
        }
    )

Scrolling (For Infinite Scroll)

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Scroll down a few times, pausing so new content can load
                PageMethod('evaluate', 'window.scrollBy(0, 1000)'),
                PageMethod('wait_for_timeout', 2000),
                PageMethod('evaluate', 'window.scrollBy(0, 1000)'),
                PageMethod('wait_for_timeout', 2000),
                PageMethod('evaluate', 'window.scrollBy(0, 1000)'),
            ]
        }
    )

Click Buttons

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Click the "Load More" button, then wait for the new items
                PageMethod('click', 'button.load-more'),
                PageMethod('wait_for_selector', '.new-products'),
            ]
        }
    )

Screenshots (Debugging)

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Save a full-page screenshot to see what the browser saw
                PageMethod('screenshot', path='page.png', full_page=True),
            ]
        }
    )

Solution 3: Scrapy-Selenium (Older but Still Works)

Selenium has been around longer and has more examples online.

Installation

pip install scrapy-selenium

Download ChromeDriver from: https://chromedriver.chromium.org/

Setup

settings.py:

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # Run without visible browser

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

Basic Spider

import scrapy
from scrapy_selenium import SeleniumRequest

class SeleniumSpider(scrapy.Spider):
    name = 'selenium'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://example.com',
            callback=self.parse
        )

    def parse(self, response):
        # JavaScript has run!
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

Wait for Elements

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from scrapy_selenium import SeleniumRequest

def start_requests(self):
    yield SeleniumRequest(
        url='https://example.com',
        callback=self.parse,
        wait_time=10,  # Wait up to 10 seconds
        wait_until=EC.presence_of_element_located((By.CLASS_NAME, 'product'))
    )

Execute JavaScript

import time
from scrapy.selector import Selector

def parse(self, response):
    driver = response.meta['driver']

    # Scroll to the bottom to trigger lazy-loaded content
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

    # Give the new content a moment to load
    time.sleep(2)

    # The original response is now stale; re-select against the live DOM
    sel = Selector(text=driver.page_source)
    for product in sel.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }
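
A fixed sleep is fragile: too short and you miss content, too long and you waste time. An explicit wait is usually more reliable. A minimal sketch, assuming the same .product class is what eventually appears:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def parse(self, response):
    driver = response.meta['driver']
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

    # Block until at least one product is present (or raise after 10 s)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.product'))
    )
    # Then re-select from driver.page_source as shown above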

Solution 4: Splash (Lightweight Rendering)

Splash is a lightweight JavaScript rendering service.

Installation

# Run Splash with Docker
docker run -p 8050:8050 scrapinghub/splash

Setup

settings.py:

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Basic Spider

import scrapy
from scrapy_splash import SplashRequest

class SplashSpider(scrapy.Spider):
    name = 'splash'

    def start_requests(self):
        yield SplashRequest(
            url='https://example.com',
            callback=self.parse,
            args={'wait': 2}  # Let JavaScript run for 2 seconds
        )

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
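
If a fixed wait isn't enough, Splash can also run a Lua script that controls the page before returning the HTML, via scrapy-splash's execute endpoint. A minimal sketch:

from scrapy_splash import SplashRequest

lua_script = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(2)  -- let JavaScript finish
    return splash:html()
end
"""

def start_requests(self):
    yield SplashRequest(
        url='https://example.com',
        callback=self.parse,
        endpoint='execute',
        args={'lua_source': lua_script},
    )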

Comparison: Which Solution to Use?

Use API Scraping When:

  • You can find the API endpoints
  • API has all the data you need
  • You want maximum speed
  • You want clean JSON data

Pros: Fast, reliable, clean data
Cons: Requires finding API, might need authentication

Use Scrapy-Playwright When:

  • Modern sites (2020+)
  • Need full browser features
  • Complex JavaScript interactions
  • Want the most actively maintained option

Pros: Modern, fast, feature-rich
Cons: Requires Playwright installation

Use Scrapy-Selenium When:

  • Older sites
  • Need Selenium-specific features
  • More examples available online
  • Already familiar with Selenium

Pros: Mature, lots of examples
Cons: Slower, resource-heavy

Use Splash When:

  • Want lightweight rendering
  • Already using Scrapinghub services
  • Need something between Scrapy and full browsers

Pros: Lightweight, separate service
Cons: Extra infrastructure, learning curve


Real-World Example: Infinite Scroll Site

Many modern sites use infinite scroll. Here's how to handle it:

With Playwright

import scrapy
from scrapy.http import HtmlResponse

class InfiniteScrollSpider(scrapy.Spider):
    name = 'infinite'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            meta={
                'playwright': True,
                'playwright_include_page': True,  # Keep the page object around
            },
            callback=self.parse
        )

    async def parse(self, response):
        page = response.meta['playwright_page']

        # Scroll and wait several times so more items load
        for _ in range(10):
            # Scroll to the bottom
            await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')

            # Wait for new content
            await page.wait_for_timeout(2000)

        # Grab the fully loaded HTML, then release the browser page
        content = await page.content()
        await page.close()

        # Wrap the rendered HTML so Scrapy selectors work on it
        new_response = HtmlResponse(
            url=response.url,
            body=content.encode('utf-8'),
            encoding='utf-8'
        )

        for product in new_response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
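
A fixed number of scrolls either wastes time or stops too early. Here's a drop-in replacement for the loop above that stops once the page height stops growing (a sketch; it belongs inside the same async parse()):

# Instead of the fixed 10-iteration loop:
previous_height = 0
while True:
    height = await page.evaluate('document.body.scrollHeight')
    if height == previous_height:
        break  # No new content appeared; we've reached the bottom
    previous_height = height
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
    await page.wait_for_timeout(2000)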

Performance Considerations

JavaScript Rendering is SLOW

Normal Scrapy:

  • 10-50 pages per second

With Playwright/Selenium:

  • 1-5 pages per second

Rendering JavaScript is 10-50x slower!

Optimization Strategies

1. Only render when necessary

def start_requests(self):
    for url in self.start_urls:
        # needs_javascript() is your own heuristic, e.g. a list of
        # URL patterns you know are JavaScript-heavy
        if self.needs_javascript(url):
            yield scrapy.Request(url, meta={'playwright': True})
        else:
            yield scrapy.Request(url)  # Plain request, no rendering
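
If you can't tell in advance, another option is to try a plain request first and only retry with rendering when the selectors come back empty. A minimal sketch, assuming the same .product markup:

def parse(self, response):
    products = response.css('.product')

    # Plain request came back empty: retry this URL once with rendering
    if not products and not response.meta.get('playwright'):
        yield response.request.replace(
            meta={'playwright': True},
            dont_filter=True,  # Allow re-requesting the same URL
        )
        return

    for product in products:
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }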

2. Use concurrent browsers

# settings.py
PLAYWRIGHT_MAX_CONTEXTS = 4  # Up to 4 browser contexts in parallel

3. Cache rendered pages

HTTPCACHE_ENABLED = True  # Don't re-render during development

4. Prefer APIs

Always try to find APIs first. They're much faster.


Common Mistakes

Mistake #1: Not Waiting Long Enough

# BAD (content might not have loaded yet)
meta={'playwright': True}

# GOOD (wait for the content you need)
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('wait_for_selector', '.products-loaded')
    ]
}

Mistake #2: Forgetting to Enable Playwright

# BAD (won't render JavaScript)
yield scrapy.Request(url)

# GOOD
yield scrapy.Request(url, meta={'playwright': True})

Mistake #3: Using Rendering for Everything

# BAD (slow for no reason)
for url in urls:
    yield scrapy.Request(url, meta={'playwright': True})

# GOOD (only render when needed)
for url in urls:
    if 'product' in url:
        yield scrapy.Request(url, meta={'playwright': True})
    else:
        yield scrapy.Request(url)

Quick Decision Tree

Is content in page source (Ctrl+U)?
├─ Yes → Use normal Scrapy
└─ No → JavaScript content
   │
   Can you find the API?
   ├─ Yes → Scrape API directly (BEST)
   └─ No → Need browser
      │
      Modern site (2020+)?
      ├─ Yes → Use Scrapy-Playwright
      └─ No → Use Scrapy-Selenium

Summary

JavaScript content isn't in initial HTML:

  • Scrapy doesn't run JavaScript
  • Need special tools to render pages

Four solutions:

  1. Find API (best, fastest)
  2. Scrapy-Playwright (modern, recommended)
  3. Scrapy-Selenium (older, still works)
  4. Splash (lightweight alternative)

Always try APIs first:

  • 10-50x faster
  • Cleaner data
  • More reliable

When rendering JavaScript:

  • Wait for content to load
  • Only render when necessary
  • Use concurrent browsers
  • Cache during development

Key insight:

  • View page source (Ctrl+U) to check if content is there
  • Inspect Element shows final result after JavaScript
  • These are different!

Start by checking page source. If content is there, use normal Scrapy. If not, find the API or use Playwright.

Happy scraping! 🕷️