Lakshay Nasa for Extract Data by Zyte

n8n Web Scraping || Part 2: Pagination, Infinite Scroll, Network Capture & More

This is Part 2 of our n8n web scraping series with Zyte API. If you’re new here, check out Part 1 first; it covers the basics: fetching pages, extracting data from HTML with the HTML node, cleaning and normalizing results, and exporting CSV/JSON.

Let’s Begin!

In this part, we’ll explore some important scraping practices and nodes, along with a few hands-on tricks that make your web scraping journey smoother.

Everything you learn here will also lay the foundation for our 3rd & final part, where we will build a universal scraper capable of scraping any website with minimal configuration.

Let’s start by taking the same workflow we built in Part 1 and extending it, beginning with pagination and infinite scroll.

N8N Scraping Workflow


Pagination across pages

Websites paginate in multiple ways, and our scraper needs to adapt accordingly.

N8N gives us a default Pagination Mode inside the HTTP Request node under Options, and while it sounds convenient, it didn’t behave reliably in my experience for typical web scraping use cases.

After testing several patterns, the approach below is the one that has worked most consistently in my workflows.

💬 If you’re stuck or want to share your own approach, let’s discuss it in the Extract Data Discord.

Step 1: Page Manager Node

Before calling the HTTP Request node, we introduce a small function called Page Manager, which does exactly what the name suggests: it controls the page number.

Add a Code node (JavaScript) and paste:

// Page Manager (we use this node as both the starter and the incrementer)
const MAX_PAGE = 100;

// n8n provides `items` array. If no items => first run
let current = 0;
if (Array.isArray(items) && items.length > 0) {

  // Check for .json.page (from this node's first run)
  // OR .json.Page (from the Normalizer node's output)
  const p = items[0].json?.page || items[0].json?.Page;

  current = (typeof p !== 'undefined' && p !== null && p !== '') ? Number(p) || 0 : 0;
}

let next;
if (current === 0) {
  next = 1; // first run (still 1)
} else {
  next = current + 1;
}

if (next > MAX_PAGE) return []; // safety stop

return [{ json: { page: next } }];

What this does:

  • On the first run, it starts with page = 1.
  • Every time the loop returns here, it increments to the next page.
  • There’s a built-in MAX_PAGE safety limit so you don’t accidentally loop forever. (Adjust it as needed.)

Scraping Function N8N

Now update the URL in the existing HTTP Request node to use the page variable:

  • URL: https://books.toscrape.com/catalogue/page-{{ $json.page }}.html

Pagination Scraping URL

This makes the node fetch the correct page each time.

The rest of the workflow remains the same up to the second HTML Extract node (where we parsed the book name, URL, price, rating, etc. in Part 1).

Step 2: Modify the Normalizer Function Node to Save Results Across Pages

In Part 1, our Step 7 code simply cleaned and normalized items for one page.

Now we need it to do two things:

  1. Normalize the results (same as before)
  2. Store the results from every page inside n8n’s global static data bucket. Think of it like temporary workflow memory.

Update the node’s code with:

// --- Normalizer (Code node) ---
// Get the global workflow static data bucket
const workflowStaticData = $getWorkflowStaticData('global');

// initialize storage if needed
workflowStaticData.workBooks = workflowStaticData.workBooks || [];

// normalization logic (kept minimal version)
const base = 'https://books.toscrape.com/';
const normalized = items.map(item => {
  const urlRel = item.json.url || '';
  const imgRel = item.json.image || '';
  const ratingClass = item.json.rating || '';
  const ratingParts = ratingClass.split(' ');
  const rating = ratingParts.length > 1 ? ratingParts[ratingParts.length - 1] : '';

  return {
    name: item.json.name || '',
    url: base + urlRel.replace(/^(\.\.\/)+/, ''),
    image: base + imgRel.replace(/^(\.\.\/)+/, ''),
    price: item.json.price || '',
    availability: (item.json.availability || '').toString().trim(),
    rating
  };
});

// append to global storage
workflowStaticData.workBooks.push(...normalized);

// return control info for IF node (not the items)
const currentPage = $('Page Manager').first().json.page || 1;
return [{
  json: {
    itemsFound: normalized.length,
    nextHref: $json.nextHref || null,
    Page: currentPage
  }
}];

Save Data N8N Function

  • We normalize the data exactly like Part 1.
  • Then we push all normalized items into workflowStaticData.workBooks.
  • Instead of returning the items themselves, we return only a small control object.
  • This object is used by the IF node to decide whether we continue scraping or stop.

Step 3: IF Node (Stop Scraping or Continue)

Add an IF node with two conditions, combined using OR:

Condition 1:
{{ $json.itemsFound }} is equal to 0

Meaning → The current page returned no items → we’ve reached the end.

Condition 2:
{{ $json.Page }} is greater than or equal to YOUR_MAX_PAGE

Meaning → Stop when you reach the max page number you set.

If loop node n8n

Together these conditions help the workflow decide:

IF → True
Stop scraping and move to the export step.

IF → False
Go back to the Page Manager, increment the page number, and keep scraping.

This creates a complete and safe pagination loop.

Step 4: Collect All Results and Export

When the IF node returns True, add one more small Code node before the Convert To File node:

// Get the global data
const workflowStaticData = $getWorkflowStaticData('global');

// Get the array of books, or an empty array if it doesn't exist
const allBooks = workflowStaticData.workBooks || [];

// Return all the books as standard n8n items
return allBooks.map(book => ({ json: book }));

Data Scraping N8N

What this one does:

  • Pulls everything we stored in the temporary memory.
  • Returns it as normal n8n items.
  • These go straight into Convert To File → CSV.

And that’s the entire pagination workflow.

![Pagination in N8N](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xjdz86w1ynumt4xj1a3x.png)


Infinite Scroll

This one is much simpler.

Some websites load content as you scroll; there are no traditional page numbers.
The Zyte API supports browser actions, which makes this easy.

Just add one line to our original cURL command:

curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://quotes.toscrape.com/scroll", "browserHtml": true, "actions": [{ "action": "scrollBottom" }]}' \
   https://api.zyte.com/v1/extract

Infinite Scroll in N8N

Why this works

  • Zyte API loads the page in a headful browser session.
  • It scrolls to the bottom, triggering all JavaScript that loads additional items.
  • Then it returns the final, fully loaded browserHtml.
  • You can parse this HTML normally using the same nodes from Part 1.
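
If you prefer configuring the HTTP Request node by hand instead of importing the cURL, the same request body (identical fields to the cURL above) can be pasted after switching the body mode to Using JSON:

{
  "url": "https://quotes.toscrape.com/scroll",
  "browserHtml": true,
  "actions": [{ "action": "scrollBottom" }]
}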

Geolocation

Some websites return different data depending on your region.
Zyte API makes this super simple by allowing you to specify a geolocation.

Use this inside an HTTP Request node:

curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "http://ip-api.com/json", "browserHtml": true, "geolocation": "AU" }' \
   https://api.zyte.com/v1/extract

Geolocation Scraping

  • Setting "geolocation": "AU" makes Zyte perform the browser request from that region, check the list of all available CountryCodes.
  • Many websites use region based content (pricing, currencies, language, product availability), so this is extremely helpful.

Screenshots

If you’d like to grab a screenshot of what the browser rendered, you can do that too.

cURL:

curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://toscrape.com", "screenshot": true }' \
   https://api.zyte.com/v1/extract


It will return the screenshot as Base64 data.

Base64 Scraping

To convert it into a proper image file (PNG, JPEG, etc.), use the Convert To File node in n8n (its Move Base64 String to File operation).

Scraping Screenshot

Important:

n8n often converts boolean values like true into the string "true" when importing via cURL.
Fix it by clicking the gear icon next to the field → Add Expression, and setting the value to {{ true }}.

Field Scraping

Or switch body mode to Using JSON and paste:

{
  "url": "https://toscrape.com",
  "screenshot": true
}

JSON Scraping


Network Capture

Many modern websites load content through background API calls rather than raw HTML.
And you can capture that network activity during rendering.

Example:

curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://quotes.toscrape.com/scroll", "browserHtml": true,  "networkCapture": [
        {
            "filterType": "url",
            "httpResponseBody": true,
            "value": "/api/",
            "matchType": "contains"
        }]}' \
   https://api.zyte.com/v1/extract

This returns a networkCapture array with all responses whose URL contains /api/.

Network Capture Scraping

Understanding the Parameters Above

  • filterType: "url" ⟶ filter network requests by URL
  • value: "/api/" ⟶ look for URLs containing /api/
  • matchType: "contains" ⟶ how the value is matched (here, a substring match)
  • httpResponseBody: true ⟶ include the response body (Base64)

Extracting data from the captured network response

You can decode the Base64 response in two easy ways:

1. Using a Code node (Python)

(You can also use JS if you prefer; a rough equivalent is sketched after this example.)

# Imports for decoding and parsing
import base64
import json

# Get the first captured network response
capture = _input.first().json["networkCapture"][0]

# Decode the Base64 body and parse it as JSON
decoded_data = base64.b64decode(capture["httpResponseBody"]).decode('utf-8')
data = json.loads(decoded_data)

# Return the result as a standard n8n item
return [{
    "json": {
        "quotes": data["quotes"],
        "firstAuthor": data["quotes"][0]["author"]["name"]
    }
}]


Decode Base64

→ This method decodes the Base64-encoded HTTP response, parses it as JSON, and gives you structured data directly: very reliable and readable.
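
As promised, here is a rough JavaScript equivalent for the Code node. It’s a sketch under the same assumptions as the Python version (the incoming item is the Zyte response with a networkCapture array) and relies on Node’s Buffer global being available in the Code node sandbox:

// Rough JS equivalent of the Python example above
// Assumes the incoming item is the Zyte response containing networkCapture
const capture = $input.first().json.networkCapture[0];

// Decode the Base64 body and parse it as JSON (uses Node's Buffer global)
const decoded = Buffer.from(capture.httpResponseBody, 'base64').toString('utf-8');
const data = JSON.parse(decoded);

// Return structured data as a standard n8n item
return [{
  json: {
    quotes: data.quotes,
    firstAuthor: data.quotes[0].author.name
  }
}];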

2. Using Edit Field Node (No code)

In this method, you still decode and parse the data, just via an expression instead of code:

  • Add an Edit Fields node
  • Mode: Add Field
  • Name: decodedData
  • Type: String
  • Value:
{{ $json.networkCapture[0].httpResponseBody.base64Decode().parseJson() }}

Decode Base64 in N8N

→ This takes the Base64 content, decodes it, parses JSON, and puts the result under decodedData automatically.


Cookies, sessions, headers & IP type (quick guide)

When you move from toy sites to real sites, a few extra controls matter a lot: which IP type you use, whether you keep a session, and what cookies or headers you send.

Zyte API exposes all these as request fields, and you can use them the same way we used browserHtml, networkCapture, or actions above (import the cURL into the n8n HTTP Request node → adjust fields as needed → extract).

To keep this guide focused, we won’t dive into code examples for all of them, but here’s one small example for setting a cookie and getting it back (requestCookies), just to show how it integrates.

  • Cookies (via requestCookies / responseCookies) ➜ Useful when a website relies on cookies for preferences, language, or maintaining continuity between requests.
curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{ "url": "https://httpbin.org/cookies", "browserHtml": true,
    "requestCookies": [{ "name": "foo",
            "value": "bar",
            "domain": "httpbin.org"
        }]
}' \
   https://api.zyte.com/v1/extract

Manage Scraping Cookies

⟶ This example uses requestCookies, but responseCookies works the same way: you simply read the cookies returned by one request and pass them into the next.

Learn more on cookies.

Everything else below (sessions, ipType, custom headers) plugs in the same way.

  • Sessions
    ➜ Sessions bundle IP address, cookie jar, and network settings so multiple requests look consistently related. Helpful for multi-step interactions, region-based content, or sites that hate stateless scraping.
    Docs: Sessions

  • Custom Headers
    ➜ Add a User Agent, Referer, or any custom metadata the target site expects: simply define them inside the HTTP Request node headers.
    Docs: Headers

  • IP Type (datacenter vs residential)
    ➜ Some sites vary content based on IP type. Zyte API automatically selects the best option, but you can override it with ipType.
    Docs: IP Types

All of these follow the same pattern we’ve already used above.
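
For instance, overriding the IP type is just one more field in the request body. Here’s a minimal sketch (the ipType field and its datacenter/residential values come from the Zyte docs linked above; check those docs for which request types support each value):

{
  "url": "https://toscrape.com",
  "browserHtml": true,
  "ipType": "residential"
}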

Where This Takes Us Next

And that’s it for Part 2! 🎉

We covered a lot more than just pagination, from infinite scroll & geolocation to screenshots, network capture, and the key request fields you’ll use while scraping sites.

What we learned isn’t a complete workflow on its own, but it builds the foundation you’ll use again and again in your scraping workflows.

In Part 3, we’ll take everything one step further and combine these patterns into a universal scraper: a reusable, configurable template that can adapt to almost any site with minimal changes.

Thanks for following along, and feel free to share your workflow, questions, or improvements in the Extract Data Community.
Happy scraping! 🕸️✨
