Muhammad Ikramullah Khan

Scrapy with JavaScript & Dynamic Content: When Your Selectors Return Nothing

I'll never forget the first time I tried scraping a modern website. My selectors worked perfectly in the browser's inspector. But when I ran my spider, everything returned None.

I checked my CSS selectors ten times. I tried XPath. Still nothing. I was going crazy.

Then I viewed the page source (Ctrl+U) and realized: the content wasn't there. The HTML was nearly empty. Everything was loaded by JavaScript after the page rendered.

That's when I learned: Scrapy doesn't run JavaScript. It only sees the initial HTML. Let me show you how to handle JavaScript-heavy sites.


The Problem: Scrapy Doesn't Run JavaScript

When you visit a website in a browser:

  1. Browser downloads HTML
  2. Browser runs JavaScript
  3. JavaScript fetches data (AJAX)
  4. JavaScript builds the page
  5. You see the final result

When Scrapy visits the same site:

  1. Scrapy downloads HTML
  2. Scrapy stops here
  3. JavaScript never runs
  4. Dynamic content never loads
  5. Your selectors find nothing

How to Detect JavaScript-Heavy Sites

Test 1: View Page Source vs Inspect Element

In your browser:

  1. Right-click → "Inspect Element"
  2. Find the element you want to scrape
  3. Note its HTML structure

Then:

  1. Press Ctrl+U (or Cmd+Option+U on Mac)
  2. Search for the same content
  3. Is it there?

If NO: the content is loaded by JavaScript.

If YES: the content is already in the HTML, and plain Scrapy will work.

Test 2: Use Scrapy Shell

scrapy shell "https://example.com"
>>> response.css('.product-name::text').get()
None  # Uh oh, JavaScript site!
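
While you're in the shell, two more quick checks help confirm the diagnosis (the .product class and "Product 1" text are placeholders for whatever you saw in the inspector):

>>> 'Product 1' in response.text   # Is the text anywhere in the raw HTML?
False
>>> len(response.css('.product'))  # How many product nodes did Scrapy actually get?
0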

Test 3: Disable JavaScript in Browser

  1. Open Chrome DevTools (F12)
  2. Press Ctrl+Shift+P
  3. Type "disable javascript"
  4. Select "Disable JavaScript"
  5. Refresh the page

If the page is now empty or broken, it's JavaScript-heavy.


Solution 1: Find the API (Best Approach)

JavaScript-heavy sites load data from APIs. Find these APIs and scrape them directly.

How to Find APIs

Step 1: Open Network Tab

  1. Open DevTools (F12)
  2. Click "Network" tab
  3. Filter by "XHR" or "Fetch"
  4. Refresh the page

Step 2: Look for JSON Responses

Watch the network requests. Look for:

  • Requests to /api/
  • Requests returning JSON
  • Requests with product data

Step 3: Click on Interesting Requests

Click a request → "Preview" tab

If you see your data in JSON format, you found it!

Example: Scraping the API Directly

Let's say you find this API:

https://example.com/api/products?page=1&limit=20

Returns:

{
  "products": [
    {"id": 1, "name": "Product 1", "price": 29.99},
    {"id": 2, "name": "Product 2", "price": 39.99}
  ],
  "total": 100
}

Your Spider:

import scrapy
import json

class ApiSpider(scrapy.Spider):
    name = 'api'

    def start_requests(self):
        # Scrape the API directly, not the webpage
        url = 'https://example.com/api/products?page=1&limit=20'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)

        for product in data['products']:
            yield {
                'name': product['name'],
                'price': product['price']
            }

        # Pagination
        current_page = int(response.url.split('page=')[1].split('&')[0])
        total = data['total']
        items_per_page = 20

        if current_page * items_per_page < total:
            next_page = current_page + 1
            next_url = f'https://example.com/api/products?page={next_page}&limit=20'
            yield scrapy.Request(next_url, callback=self.parse)
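
String-splitting the URL works for this example, but it breaks as soon as the parameter order changes. Here's a slightly sturdier helper using only the standard library (a sketch; next_page_url is a name I'm introducing, not part of Scrapy):

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_page_url(url):
    """Return the same URL with its 'page' query parameter incremented."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query['page'] = [str(int(query['page'][0]) + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))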

Benefits:

  • Much faster (no rendering)
  • Clean JSON data (no HTML parsing)
  • More reliable
  • Often has more data than the webpage

What the docs don't tell you:

  • APIs often have rate limiting (go slower)
  • Some APIs require authentication (check the request headers; see the sketch below)
  • APIs might have different pagination than the website
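
When the API wants browser-like headers or a token, you can usually copy them straight from the Network tab onto your request. A minimal sketch (the header values below are placeholders; use whatever the real request sent):

import scrapy

class AuthApiSpider(scrapy.Spider):
    name = 'auth_api'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/api/products?page=1&limit=20',
            headers={
                'Accept': 'application/json',
                'Referer': 'https://example.com/products',
                # Placeholder token; copy the real one from DevTools
                'Authorization': 'Bearer YOUR_TOKEN_HERE',
            },
            callback=self.parse,
        )

    def parse(self, response):
        # response.json() parses the JSON body for you (Scrapy 2.2+)
        for product in response.json()['products']:
            yield {'name': product['name'], 'price': product['price']}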

Solution 2: Scrapy-Playwright (Modern Browser Automation)

When you can't find the API, use Scrapy-Playwright to render JavaScript.

Installation

pip install scrapy-playwright
playwright install

Basic Setup

settings.py:

# Enable the Playwright download handlers
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Required: scrapy-playwright needs the asyncio Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Optional: use Chromium, Firefox, or WebKit
PLAYWRIGHT_BROWSER_TYPE = 'chromium'

# Launch options
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'headless': True,   # Run without a visible browser window
    'timeout': 60000,   # Browser launch timeout in milliseconds
}

Basic Spider

import scrapy

class PlaywrightSpider(scrapy.Spider):
    name = 'playwright'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com',
            meta={'playwright': True}  # Enable Playwright for this request
        )

    def parse(self, response):
        # JavaScript has run! Content is now in response
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

Wait for Elements to Load

from scrapy_playwright.page import PageMethod

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Wait for a specific element to appear
                PageMethod('wait_for_selector', '.product-name'),
            ]
        }
    )

Wait for Network Idle

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Wait until there are no network requests for 500 ms
                PageMethod('wait_for_load_state', 'networkidle'),
            ]
        }
    )

Scrolling (For Infinite Scroll)

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Scroll down a few times, pausing so new content can load
                PageMethod('evaluate', 'window.scrollBy(0, 1000)'),
                PageMethod('wait_for_timeout', 2000),
                PageMethod('evaluate', 'window.scrollBy(0, 1000)'),
                PageMethod('wait_for_timeout', 2000),
                PageMethod('evaluate', 'window.scrollBy(0, 1000)'),
            ]
        }
    )

Click Buttons

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Click the "Load More" button, then wait for the new items
                PageMethod('click', 'button.load-more'),
                PageMethod('wait_for_selector', '.new-products'),
            ]
        }
    )

Screenshots (Debugging)

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                # Save a full-page screenshot to see what the browser saw
                PageMethod('screenshot', path='page.png', full_page=True),
            ]
        }
    )

Solution 3: Scrapy-Selenium (Older but Still Works)

Selenium has been around longer and has more examples online.

Installation

pip install scrapy-selenium

Download ChromeDriver from: https://chromedriver.chromium.org/

Setup

settings.py:

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # Run without visible browser

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

Basic Spider

import scrapy
from scrapy_selenium import SeleniumRequest

class SeleniumSpider(scrapy.Spider):
    name = 'selenium'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://example.com',
            callback=self.parse
        )

    def parse(self, response):
        # JavaScript has run!
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

Wait for Elements

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from scrapy_selenium import SeleniumRequest

def start_requests(self):
    yield SeleniumRequest(
        url='https://example.com',
        callback=self.parse,
        wait_time=10,  # Wait up to 10 seconds
        wait_until=EC.presence_of_element_located((By.CLASS_NAME, 'product'))
    )

Execute JavaScript

import time
from scrapy.selector import Selector

def parse(self, response):
    driver = response.meta['driver']

    # Scroll to the bottom to trigger lazy-loaded content
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

    # Give the new content a moment to load
    time.sleep(2)

    # The original response is now stale; re-select against the live DOM
    sel = Selector(text=driver.page_source)
    for product in sel.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }
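
A fixed sleep is fragile: too short and you miss content, too long and you waste time. An explicit wait is usually more reliable. A minimal sketch, assuming the same .product class is what eventually appears:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def parse(self, response):
    driver = response.meta['driver']
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

    # Block until at least one product is present (or raise after 10 s)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.product'))
    )
    # Then re-select from driver.page_source as shown above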

Solution 4: Splash (Lightweight Rendering)

Splash is a lightweight JavaScript rendering service.

Installation

# Run Splash with Docker
docker run -p 8050:8050 scrapinghub/splash

Setup

settings.py:

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Basic Spider

import scrapy
from scrapy_splash import SplashRequest

class SplashSpider(scrapy.Spider):
    name = 'splash'

    def start_requests(self):
        yield SplashRequest(
            url='https://example.com',
            callback=self.parse,
            args={'wait': 2}  # Let JavaScript run for 2 seconds
        )

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
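
If a fixed wait isn't enough, Splash can also run a Lua script that controls the page before returning the HTML, via scrapy-splash's execute endpoint. A minimal sketch:

from scrapy_splash import SplashRequest

lua_script = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(2)  -- let JavaScript finish
    return splash:html()
end
"""

def start_requests(self):
    yield SplashRequest(
        url='https://example.com',
        callback=self.parse,
        endpoint='execute',
        args={'lua_source': lua_script},
    )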

Comparison: Which Solution to Use?

Use API Scraping When:

  • You can find the API endpoints
  • API has all the data you need
  • You want maximum speed
  • You want clean JSON data

Pros: Fast, reliable, clean data
Cons: Requires finding API, might need authentication

Use Scrapy-Playwright When:

  • Modern sites (2020+)
  • Need full browser features
  • Complex JavaScript interactions
  • Want the most actively maintained option

Pros: Modern, fast, feature-rich
Cons: Requires Playwright installation

Use Scrapy-Selenium When:

  • Older sites
  • Need Selenium-specific features
  • More examples available online
  • Already familiar with Selenium

Pros: Mature, lots of examples
Cons: Slower, resource-heavy

Use Splash When:

  • Want lightweight rendering
  • Already using Scrapinghub services
  • Need something between Scrapy and full browsers

Pros: Lightweight, separate service
Cons: Extra infrastructure, learning curve


Real-World Example: Infinite Scroll Site

Many modern sites use infinite scroll. Here's how to handle it:

With Playwright

import scrapy
from scrapy.http import HtmlResponse

class InfiniteScrollSpider(scrapy.Spider):
    name = 'infinite'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            meta={
                'playwright': True,
                'playwright_include_page': True,  # Keep the page object around
            },
            callback=self.parse
        )

    async def parse(self, response):
        page = response.meta['playwright_page']

        # Scroll and wait several times so more items load
        for _ in range(10):
            # Scroll to the bottom
            await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')

            # Wait for new content
            await page.wait_for_timeout(2000)

        # Grab the fully loaded HTML, then release the browser page
        content = await page.content()
        await page.close()

        # Wrap the rendered HTML so Scrapy selectors work on it
        new_response = HtmlResponse(
            url=response.url,
            body=content.encode('utf-8'),
            encoding='utf-8'
        )

        for product in new_response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
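
A fixed number of scrolls either wastes time or stops too early. Here's a drop-in replacement for the loop above that stops once the page height stops growing (a sketch; it belongs inside the same async parse()):

# Instead of the fixed 10-iteration loop:
previous_height = 0
while True:
    height = await page.evaluate('document.body.scrollHeight')
    if height == previous_height:
        break  # No new content appeared; we've reached the bottom
    previous_height = height
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
    await page.wait_for_timeout(2000)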

Performance Considerations

JavaScript Rendering is SLOW

Normal Scrapy:

  • 10-50 pages per second

With Playwright/Selenium:

  • 1-5 pages per second

Rendering JavaScript is 10-50x slower!

Optimization Strategies

1. Only render when necessary

def start_requests(self):
    for url in self.start_urls:
        # needs_javascript() is your own heuristic, e.g. a list of
        # URL patterns you know are JavaScript-heavy
        if self.needs_javascript(url):
            yield scrapy.Request(url, meta={'playwright': True})
        else:
            yield scrapy.Request(url)  # Plain request, no rendering
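
If you can't tell in advance, another option is to try a plain request first and only retry with rendering when the selectors come back empty. A minimal sketch, assuming the same .product markup:

def parse(self, response):
    products = response.css('.product')

    # Plain request came back empty: retry this URL once with rendering
    if not products and not response.meta.get('playwright'):
        yield response.request.replace(
            meta={'playwright': True},
            dont_filter=True,  # Allow re-requesting the same URL
        )
        return

    for product in products:
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }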

2. Use concurrent browsers

# settings.py
PLAYWRIGHT_MAX_CONTEXTS = 4  # Up to 4 browser contexts in parallel

3. Cache rendered pages

HTTPCACHE_ENABLED = True  # Don't re-render during development

4. Prefer APIs

Always try to find APIs first. They're much faster.


Common Mistakes

Mistake #1: Not Waiting Long Enough

# BAD (content might not have loaded yet)
meta={'playwright': True}

# GOOD (wait for the content you need)
meta={
    'playwright': True,
    'playwright_page_methods': [
        PageMethod('wait_for_selector', '.products-loaded')
    ]
}

Mistake #2: Forgetting to Enable Playwright

# BAD (won't render JavaScript)
yield scrapy.Request(url)

# GOOD
yield scrapy.Request(url, meta={'playwright': True})

Mistake #3: Using Rendering for Everything

# BAD (slow for no reason)
for url in urls:
    yield scrapy.Request(url, meta={'playwright': True})

# GOOD (only render when needed)
for url in urls:
    if 'product' in url:
        yield scrapy.Request(url, meta={'playwright': True})
    else:
        yield scrapy.Request(url)

Quick Decision Tree

Is content in page source (Ctrl+U)?
├─ Yes → Use normal Scrapy
└─ No → JavaScript content
   │
   Can you find the API?
   ├─ Yes → Scrape API directly (BEST)
   └─ No → Need browser
      │
      Modern site (2020+)?
      ├─ Yes → Use Scrapy-Playwright
      └─ No → Use Scrapy-Selenium

Summary

JavaScript content isn't in initial HTML:

  • Scrapy doesn't run JavaScript
  • Need special tools to render pages

Four solutions:

  1. Find API (best, fastest)
  2. Scrapy-Playwright (modern, recommended)
  3. Scrapy-Selenium (older, still works)
  4. Splash (lightweight alternative)

Always try APIs first:

  • 10-50x faster
  • Cleaner data
  • More reliable

When rendering JavaScript:

  • Wait for content to load
  • Only render when necessary
  • Use concurrent browsers
  • Cache during development

Key insight:

  • View page source (Ctrl+U) to check if content is there
  • Inspect Element shows final result after JavaScript
  • These are different!

Start by checking page source. If content is there, use normal Scrapy. If not, find the API or use Playwright.

Happy scraping! 🕷️