What we'll cover
If you’ve ever wished you could automate scraping without setting up a bunch of scripts, proxies, or browser logic, you're in the right place.
We’ll use n8n, the low code automation tool, together with Zyte API to fetch structured data from https://books.toscrape.com/.
By the end, you'll have a workflow that runs on its own, giving you clean JSON or CSV output of all books: their names, prices, ratings, and images. It's also a setup you can easily adapt for other publicly available or test websites with similar layouts.
Let’s get scraping!
The game plan:
- Fetch the page using Zyte API (it handles rendering & manages blocks automatically)
- Extract HTML content inside n8n
- Parse book elements with CSS selectors
- Clean and normalize the data
- Export results as JSON or CSV
First, let’s get n8n ready to roll.
You can set it up for free locally or in the cloud, whichever you prefer.
If you're going local, install it via Docker or npm; it only takes a few commands.
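For example, either of these gets a local instance running (commands as documented by n8n at the time of writing; check the install docs if they've changed):

npx n8n

Or, with Docker:

docker run -it --rm --name n8n -p 5678:5678 -v n8n_data:/home/node/.n8n docker.n8n.io/n8nio/n8n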
Once it's up, the steps below will work exactly the same whether you're self-hosting n8n or using n8n Cloud.
Step 1: Create a new workflow in n8n
After logging in, create a new workflow.
Name it something like "Book Catalog Scraper"; you can tweak the same workflow later for similar pages or categories.
This blank canvas is where all your nodes will live.
Step 2: Add an HTTP Request Node
We’ll use the HTTP Request node to call the Zyte API.
We’ll use cURL to configure this node. Click on Import cURL, then paste the following command and hit Import.
(Don’t forget to replace the API key with your own, and change the URL if you’d like.)
curl \
--user YOUR_ZYTE_API_KEY_GOES_HERE: \
--header 'Content-Type: application/json' \
--data '{"url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html", "browserHtml": true}' \
https://api.zyte.com/v1/extract
Once imported, you’ll see the node fields automatically populated.
Note: When you import via cURL, n8n often converts boolean values like true into the string "true".
To fix this, click the little gear icon → “Add Expression” next to the value and set it to {{true}}.
This is especially important for the browserHtml field: it ensures the Zyte API receives a real boolean, not a string.
Now hit Execute Node, and you should see a JSON response with a big block of HTML inside the "browserHtml" field.
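For reference, a successful response looks roughly like this (trimmed; the exact fields depend on which options you request):

{
  "url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
  "statusCode": 200,
  "browserHtml": "<!DOCTYPE html><html>...</html>"
}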
Step 3: Extract the HTML content
Next, add an Edit Fields node (previously called Set node) to isolate that browserHtml content.
- Mode: Add Field
- Name: data
- Value: {{$json["browserHtml"]}}
This gives us a clean data field containing just the HTML we need.
Step 4: Parse book elements
Add the HTML node (Extract HTML Content).
- Source Data: data
- Key: books
- CSS Selector: article.product_pod
- Return Array: ✅ Enabled
- Return Value: HTML
Run it once and you'll see a new field, books, containing an array where each item represents a single book's HTML block.
We now have one array with multiple products, each ready to be parsed individually in the next step.
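For context, each product card on books.toscrape.com looks roughly like this (abridged), which is where the selectors in Step 6 come from:

<article class="product_pod">
  <div class="image_container">
    <a href="../../../its-only-the-himalayas_981/index.html"><img src="../../../../media/cache/…/thumb.jpg" class="thumbnail"></a>
  </div>
  <p class="star-rating Two">…</p>
  <h3><a href="../../../its-only-the-himalayas_981/index.html" title="It's Only the Himalayas">It's Only the …</a></h3>
  <div class="product_price">
    <p class="price_color">£45.17</p>
    <p class="instock availability">In stock</p>
  </div>
</article>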
Step 5: Split the list into items
Now we’ll process each product individually.
Add the Split Out node:
- Fields To Split Out: books
Now each book becomes its own item for extraction. This makes it easier to handle or filter each record separately later on.
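To picture what the split does, assume books held three HTML strings; you go from one item to three:

// before Split Out: one item
{ "books": ["<article class=\"product_pod\">…</article>", "…", "…"] }

// after Split Out: one item per book
{ "books": "<article class=\"product_pod\">…</article>" }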
(You can skip this step if you only need a quick one-shot export, but keeping it helps if you plan to scale or tweak the workflow later.)
Step 6: Extract product details
Add another HTML node (Extract HTML Content) to grab the details inside each product.
Extraction Values:
Key | CSS Selector | Return Value
---|---|---
name | h3 a | Attribute → title
url | h3 a | Attribute → href
price | .price_color | Text
availability | .instock.availability | Text
rating | p.star-rating | Attribute → class
image | .image_container img | Attribute → src
Hit Execute and you'll get structured JSON for each book.
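Each book should come out looking something like this (values illustrative, from the Travel category; note the relative paths and the rating hidden inside a class name):

{
  "name": "It's Only the Himalayas",
  "url": "../../../its-only-the-himalayas_981/index.html",
  "price": "£45.17",
  "availability": "In stock",
  "rating": "star-rating Two",
  "image": "../../../../media/cache/…/….jpg"
}

Those ../ prefixes and the star-rating class are exactly what the next step cleans up.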
Step 7: Clean and normalize the data
We'll resolve the relative URLs and image paths into full links, and turn the rating class into a readable value.
Add a Code node (Code in JavaScript) and paste:
// Resolve the relative links against the page we scraped, and tidy each field.
const pageUrl = 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html';

return items.map(item => {
  // The rating is encoded in a class attribute, e.g. "star-rating Three" -> "Three".
  const ratingClass = item.json.rating || '';
  const ratingParts = ratingClass.split(' ');
  const rating = ratingParts.length > 1 ? ratingParts[ratingParts.length - 1] : '';

  return {
    json: {
      name: item.json.name || '',
      // new URL() resolves the ../ segments properly against the page URL,
      // so book links keep their /catalogue/ prefix (simply stripping ../ would drop it).
      url: item.json.url ? new URL(item.json.url, pageUrl).href : '',
      image: item.json.image ? new URL(item.json.image, pageUrl).href : '',
      price: item.json.price || '',
      availability: (item.json.availability || '').trim(),
      rating
    }
  };
});
Config
- Mode: Run Once for All Items
- Language: JavaScript
Note:
You can tweak this logic for your own site or data structure; for instance, you might clean extra fields, resolve paths differently, or skip this step entirely if your data is already in the format you want.
Now your output will have clean, structured data, ready to export or feed into your next automation.
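The sample record from Step 6 would now look like this (illustrative):

{
  "name": "It's Only the Himalayas",
  "url": "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html",
  "image": "https://books.toscrape.com/media/cache/…/….jpg",
  "price": "£45.17",
  "availability": "In stock",
  "rating": "Two"
}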
Step 8: Export your data the way you want
Now that your data is clean and structured, let’s turn it into a downloadable file, whether that’s CSV, .txt, or something else.
Finally, drop in the Convert to File node. It takes your structured data and converts it into different file types.
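A typical CSV setup looks like this (field labels may differ slightly between n8n versions):
- Operation: Convert to CSV
- Put Output File in Field: data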
Once done, click Execute Node and you’ll see a binary output with your file ready to download.
Wrapping up
And that's it: we just built a full web scraping workflow in n8n, powered by the Zyte API.
You've automated the complete pipeline, fetching, parsing, cleaning, and exporting, all visually inside n8n.
This same flow can easily be tweaked for other pages: just change the URL, update your selectors, and you're good to go.
In the next part, we’ll take this further and scrape multiple pages automatically by adding pagination logic.
Stay tuned, and thanks for reading! 😄
Top comments (1)
Spin up n8n with Docker (it's the fastest, least finicky path) and fix the cURL boolean snag by using Add Expression. The bare-bones pipeline is HTTP Request → HTML → Code → Convert to File. Pick raw requests for static pages, headless Playwright/Puppeteer for heavy JS, or Zyte API when anti-bot is being spicy. Normalize with new URL, run lightweight schema checks, clean text, and standardize price/currency/availability so downstream tools don't choke.