DEV Community

John Rooney for Extract by Zyte


How to Handle JavaScript-Rendered Pages Without a Full Browser

The HTML that the requests library downloads is what the server sends before any JavaScript runs. For a large and growing number of sites, that document is nearly empty: a shell with <script> tags that populate the content after the browser executes them. BeautifulSoup finds nothing because there is nothing to find in the source.

Two approaches handle this. The first is to skip the HTML entirely and call the underlying API the JavaScript is already talking to. The second is to use a real browser. The first option is faster, more reliable, and available more often than people expect.


Identifying a JS-rendered page

The test is straightforward: compare what you see in "View Source" against what you see in DevTools' Elements panel. View Source shows the raw server response — exactly what requests receives. The Elements panel shows the live DOM after JavaScript has run.

If your target data appears in the Elements panel but not in View Source, it's JS-rendered.

The other common indicator: requests returns a 200 with a short body. A page that renders 50 product cards in the browser but returns 2KB of HTML to a plain HTTP request is loading its content dynamically.
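
Both indicators can be checked from a script before opening DevTools. The following is a minimal sketch using only the standard library; the helper names (summarize_html, looks_js_rendered), the target class name, and the two sample documents are illustrative assumptions, not part of any library. In practice you would pass in resp.text from a plain requests call.

```python
from html.parser import HTMLParser


class _ShellProbe(HTMLParser):
    """Counts <script> tags, elements with a target class, and visible text."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.script_count = 0
        self.target_count = 0
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_count += 1
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.target_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script>/<style> is not visible to a reader.
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def summarize_html(html, target_class):
    """Return size, script count, target-element count, and visible text length."""
    probe = _ShellProbe(target_class)
    probe.feed(html)
    probe.close()
    return {
        "bytes": len(html.encode("utf-8")),
        "scripts": probe.script_count,
        "targets": probe.target_count,
        "visible_chars": probe.visible_chars,
    }


def looks_js_rendered(html, target_class):
    """True when the raw HTML has scripts but none of the target elements."""
    summary = summarize_html(html, target_class)
    return summary["targets"] == 0 and summary["scripts"] > 0


# Made-up documents standing in for real responses.
shell = '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>'
rendered = '<html><body><div class="quote"><span class="text">Hi</span></div></body></html>'

print(looks_js_rendered(shell, "quote"))     # True
print(looks_js_rendered(rendered, "quote"))  # False
```

This is a heuristic, not a verdict: a server-rendered page that simply has no matching elements would also be flagged, so confirm in DevTools before committing to an approach.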


Option 1: Find the underlying API

When a page loads content via JavaScript, that JavaScript has to get the data from somewhere. Usually it makes an XHR or Fetch request to a JSON API. That same API is available to your scraper — with no browser required.

To find it, open DevTools, go to the Network tab, filter by "Fetch/XHR", then reload the page. Watch for requests that return JSON. Click on one, check the "Preview" tab — if it contains your data, you've found the endpoint.

Here's a concrete example. quotes.toscrape.com/js/ renders its quotes via JavaScript. The raw HTML response contains zero quote elements:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

resp = session.get("https://quotes.toscrape.com/js/", timeout=15)
soup = BeautifulSoup(resp.text, "html.parser")
print(len(soup.find_all("div", class_="quote")))  # 0

But the same quotes are available as JSON from /api/quotes on the same site. That endpoint returns clean JSON and supports pagination via a page parameter:

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "application/json",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_quotes = []
page = 1

while True:
    resp = session.get(
        "https://quotes.toscrape.com/api/quotes",
        params={"page": page},
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()

    all_quotes.extend(data["quotes"])
    print(f"Page {page}: {len(data['quotes'])} quotes")

    if not data["has_next"]:
        break
    page += 1

print(f"\nTotal: {len(all_quotes)} quotes")

Output:

Page 1: 10 quotes
Page 2: 10 quotes
...
Page 10: 10 quotes

Total: 100 quotes

This is faster than any browser-based approach and produces cleaner data. Change the Accept header to application/json when hitting JSON endpoints — some APIs check it.
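
If you reuse the pagination loop across projects, it helps to separate the paging logic from the HTTP call so the logic can be tested without touching the network. The sketch below assumes the response shape used above (a "quotes" list and a "has_next" flag); iter_quotes, make_fetcher, and the max_pages safety cap are hypothetical names introduced here for illustration.

```python
def iter_quotes(fetch_page, max_pages=1000):
    """Yield quotes page by page until the API reports no further pages.

    fetch_page(page_number) must return the decoded JSON body as a dict
    with "quotes" and "has_next" keys.
    """
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        yield from data["quotes"]
        if not data.get("has_next"):
            return
    # Guard against an endpoint that never stops claiming there is more.
    raise RuntimeError(f"stopped after {max_pages} pages without has_next=False")


def make_fetcher(session, url="https://quotes.toscrape.com/api/quotes"):
    """Build a fetch_page function backed by a requests.Session."""
    def fetch_page(page):
        resp = session.get(url, params={"page": page}, timeout=15)
        resp.raise_for_status()
        return resp.json()
    return fetch_page


# Offline demonstration with canned pages instead of real HTTP calls.
fake_pages = {
    1: {"quotes": [{"text": "a"}, {"text": "b"}], "has_next": True},
    2: {"quotes": [{"text": "c"}], "has_next": False},
}
quotes = list(iter_quotes(fake_pages.__getitem__))
print(len(quotes))  # 3
```

Against the live site you would call list(iter_quotes(make_fetcher(session))) with the session configured as in the example above.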

When inspecting Network requests, look for:

  • URLs containing /api/, /graphql, /v1/, /data/, or .json
  • Requests where the Preview tab shows an object or array
  • Query parameters like page, offset, cursor, limit — these tell you the pagination model upfront
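
The checklist above can be turned into a rough filter for a list of URLs copied out of the Network tab. This is a sketch only: the hint lists mirror the bullets, describe_request is a hypothetical helper, and a match means "worth a look", not "definitely an API".

```python
from urllib.parse import urlsplit, parse_qs

API_PATH_HINTS = ("/api/", "/graphql", "/v1/", "/data/")
PAGINATION_PARAMS = ("page", "offset", "cursor", "limit")


def describe_request(url):
    """Flag URLs that look like API calls and name any pagination parameters."""
    parts = urlsplit(url)
    path = parts.path.lower()
    looks_like_api = path.endswith(".json") or any(h in path for h in API_PATH_HINTS)
    params = parse_qs(parts.query)
    pagination = [name for name in PAGINATION_PARAMS if name in params]
    return {"looks_like_api": looks_like_api, "pagination": pagination}


print(describe_request("https://quotes.toscrape.com/api/quotes?page=2"))
# {'looks_like_api': True, 'pagination': ['page']}
print(describe_request("https://quotes.toscrape.com/static/main.css"))
# {'looks_like_api': False, 'pagination': []}
```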

Not every site exposes a clean API. Some assemble their data server-side, some use GraphQL, some obfuscate the endpoints. When the API route isn't viable, use a browser.


Option 2: Playwright

Playwright drives a real browser, so what it sees is identical to what a user sees. It's slower than requests and consumes more memory, but it's the correct tool when JavaScript execution is unavoidable.

Install:

pip install playwright
python -m playwright install chromium

Basic scrape of the same JS-rendered page:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    page.goto("https://quotes.toscrape.com/js/", wait_until="networkidle")

    quotes = page.locator("div.quote").all()
    for q in quotes:
        text   = q.locator("span.text").inner_text()
        author = q.locator("small.author").inner_text()
        print(f"{text[:70]}... — {author}")

    browser.close()

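wait_until="networkidle" tells Playwright to wait until there have been no network connections for at least 500ms, which is enough time for most JS-driven content to load. On pages with long-running background requests (analytics pings, chat widgets) the network may never go quiet, and the navigation can time out. There, waiting for a specific element is more reliable, optionally combined with wait_until="domcontentloaded" so that goto() itself returns early: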

# Wait for a specific element rather than network quiet
page.goto("https://quotes.toscrape.com/js/")
page.wait_for_selector("div.quote")

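page.locator() is preferable to page.query_selector_all(). Locators are lazy: they don't touch the page until you call a method on them, and actions and reads such as inner_text() wait and retry if the element isn't immediately present. This makes them more tolerant of pages that render content in stages. One exception is worth knowing: all() and all_inner_texts() return whatever matches at that moment without waiting, so pair them with an explicit wait.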
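
As a sketch of the locator-based approach, the helper below waits on the locator itself before enumerating matches, because Locator.all() returns immediately with whatever is currently in the page. read_quotes and main are hypothetical names; the selectors are the ones used for quotes.toscrape.com above. Playwright is imported inside main() only so that the helper can be exercised with a stand-in page object.

```python
def read_quotes(page, timeout_ms=10_000):
    """Wait for the first quote to render, then read every quote on the page."""
    quotes = page.locator("div.quote")
    # Locator.all() does not wait, so wait explicitly for the first match.
    quotes.first.wait_for(state="visible", timeout=timeout_ms)
    results = []
    for q in quotes.all():
        results.append({
            "text": q.locator("span.text").inner_text(),
            "author": q.locator("small.author").inner_text(),
        })
    return results


def main():
    # Imported here so read_quotes() can be tested without a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://quotes.toscrape.com/js/")  # default wait is "load"
        for quote in read_quotes(page):
            print(f"{quote['text'][:70]}... by {quote['author']}")
        browser.close()
```

Call main() to run it against the live page. If some quotes render later than the first one, waiting on the first match alone is not enough and you would need a condition on the expected count instead.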


Blocking unnecessary resources

By default, Playwright fetches everything: stylesheets, images, fonts, analytics scripts. For scraping, none of that matters. Blocking it cuts load time noticeably:

from playwright.sync_api import sync_playwright

def block_non_essential(route):
    if route.request.resource_type in ("image", "font", "stylesheet", "media"):
        route.abort()
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.route("**/*", block_non_essential)

    page.goto("https://quotes.toscrape.com/js/", wait_until="networkidle")

    texts = page.locator("div.quote span.text").all_inner_texts()
    print(f"Found {len(texts)} quotes")

    browser.close()

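The page.route() call intercepts every request. For anything whose resource type is image, font, stylesheet, or media, it calls route.abort() instead of letting it through. The data-carrying requests (the HTML document and any XHR calls) continue normally. Scripts are deliberately left alone, since the page needs them to render its content; that also means analytics scripts still load unless you filter them by URL.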
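
Resource type alone cannot separate an analytics script from the script that renders the page, since both are reported as "script". One way to handle that, sketched below, is to add a URL check next to the type check. The host fragments are hypothetical placeholders, and should_block and handle_route are illustrative names; you would fill the list from what you actually see in the Network tab.

```python
BLOCKED_TYPES = {"image", "font", "stylesheet", "media"}
# Hypothetical hosts, listed only to illustrate URL-based blocking.
BLOCKED_HOST_FRAGMENTS = ("analytics.example.com", "ads.example.net")


def should_block(resource_type, url):
    """Decide whether a request is safe to drop for scraping purposes."""
    if resource_type in BLOCKED_TYPES:
        return True
    return any(fragment in url for fragment in BLOCKED_HOST_FRAGMENTS)


def handle_route(route):
    """Route handler: abort non-essential requests, let the rest through."""
    request = route.request
    if should_block(request.resource_type, request.url):
        route.abort()
    else:
        route.continue_()


print(should_block("image", "https://quotes.toscrape.com/logo.png"))   # True
print(should_block("script", "https://analytics.example.com/t.js"))    # True
print(should_block("document", "https://quotes.toscrape.com/js/"))     # False
```

Register it the same way as before, with page.route("**/*", handle_route). Keeping the decision in a plain function makes the blocking rules easy to test without launching a browser.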


Using Playwright to discover hidden APIs

Even when you intend to use Playwright for the actual scraping, it can show you API calls you didn't know existed. Register a response handler before navigation:

from playwright.sync_api import sync_playwright

api_calls = []

def capture_api(response):
    content_type = response.headers.get("content-type", "")
    if "json" in content_type:
        api_calls.append(response.url)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.on("response", capture_api)

    page.goto("https://quotes.toscrape.com/js/", wait_until="networkidle")

    print("JSON responses observed:")
    for url in api_calls:
        print(f"  {url}")

    browser.close()

If any JSON responses appear, check whether they contain your target data. If they do, you can switch from Playwright to a plain requests call against that URL — no browser needed for subsequent runs.
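
To make the switch to requests easier, it can help to record a little more than the URL. The sketch below is a variation on the handler above, not a replacement for it: it keeps only XHR/Fetch responses with a JSON content type and stores the method and status alongside the URL. make_json_recorder and replay are hypothetical helper names, and replay assumes the endpoint needs no request body or extra headers, which will not hold for every site.

```python
def make_json_recorder(store):
    """Return a 'response' handler that appends JSON XHR/Fetch calls to store."""
    def on_response(response):
        request = response.request
        if request.resource_type not in ("xhr", "fetch"):
            return
        if "json" not in response.headers.get("content-type", ""):
            return
        store.append({
            "method": request.method,
            "url": response.url,
            "status": response.status,
        })
    return on_response


def replay(session, record):
    """Repeat a recorded call with a requests.Session (simple cases only)."""
    resp = session.request(
        record["method"],
        record["url"],
        headers={"Accept": "application/json"},
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json()
```

Attach it before navigation with calls = [] and page.on("response", make_json_recorder(calls)), then inspect calls after the page has loaded.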


Choosing between the two approaches

API scraping is worth the investigation time. A requests loop that pages through a JSON API can run 10-50x faster than Playwright, uses a fraction of the memory, and produces structured data without parsing HTML. If the endpoint exists and isn't authenticated in a way you can't replicate, use it.

Playwright is the right answer when: the page builds its content from multiple sources with no single API, authentication involves cookies set by JavaScript challenges, or content only appears after user interactions like scroll events or button clicks.
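
For the scroll case, one common pattern is to keep scrolling until the number of matching elements stops growing. The following is a sketch under assumptions: scroll_until_stable is a hypothetical helper, the fixed pause is a simplification (a slow backend may need a longer pause or a smarter wait), and the 10,000 px wheel delta is an arbitrary "far enough" value.

```python
def scroll_until_stable(page, selector, max_rounds=20, pause_ms=500):
    """Scroll down until the number of matching elements stops growing."""
    items = page.locator(selector)
    previous = -1
    for _ in range(max_rounds):
        current = items.count()
        if current == previous:
            break  # nothing new appeared after the last scroll
        previous = current
        page.mouse.wheel(0, 10_000)      # scroll down by 10,000 px
        page.wait_for_timeout(pause_ms)  # give the page time to fetch more
    return items.count()
```

After it returns, read the elements as usual, for example with page.locator(selector).all_inner_texts(). If the extra items arrive via a JSON request, that request is also a candidate for the API approach.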

The two approaches also compose. Use Playwright to log in and capture session cookies, then hand those cookies to a requests.Session for the actual data collection:

from playwright.sync_api import sync_playwright
import requests

# Step 1: get session cookies via browser
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/login")
    page.fill("#username", "you@example.com")
    page.fill("#password", "your-password")
    page.click("button[type=submit]")
    page.wait_for_url("**/dashboard")
    cookies = page.context.cookies()
    browser.close()

# Step 2: transfer cookies to requests session
session = requests.Session()
for cookie in cookies:
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie["domain"])

# Step 3: scrape with requests
resp = session.get("https://example.com/api/data", timeout=15)
print(resp.json())
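
The transfer step above copies only name, value, and domain. Where cookies are scoped to a path or flagged as secure, it may be worth carrying those attributes over too. The sketch below is one way to do that; cookie_kwargs and copy_cookies are hypothetical helpers, and the sample cookie is made up. It relies on Playwright's cookie dicts having name, value, domain, path, expires, and secure keys, and on session.cookies.set() accepting domain, path, secure, and expires as keyword arguments.

```python
def cookie_kwargs(cookie):
    """Map one Playwright cookie dict to keyword args for session.cookies.set()."""
    kwargs = {
        "domain": cookie["domain"],
        "path": cookie.get("path", "/"),
        "secure": cookie.get("secure", False),
    }
    # Playwright reports session cookies with expires == -1; skip those.
    expires = cookie.get("expires", -1)
    if expires and expires > 0:
        kwargs["expires"] = int(expires)
    return kwargs


def copy_cookies(session, cookies):
    """Copy Playwright cookies into a requests.Session, keeping their scope."""
    for cookie in cookies:
        session.cookies.set(cookie["name"], cookie["value"], **cookie_kwargs(cookie))


# Made-up cookie in the shape returned by page.context.cookies().
sample = {
    "name": "sid",
    "value": "abc",
    "domain": ".example.com",
    "path": "/",
    "expires": -1,
    "secure": True,
}
print(cookie_kwargs(sample))
# {'domain': '.example.com', 'path': '/', 'secure': True}
```

With these helpers, step 2 becomes copy_cookies(session, cookies). Some sites also tie the session to the browser's User-Agent, so sending the same User-Agent header from requests is a sensible precaution.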

Tags: python webscraping playwright tutorial
