What we'll cover
If you’ve ever wished you could automate scraping without setting up a bunch of scripts, proxies, or browser logic, you're in the right place.
We’ll use n8n, the low code automation tool, together with Zyte API to fetch structured data from https://books.toscrape.com/.
By the end, you'll have a workflow that runs on its own, giving you clean JSON or CSV output of all books: their names, prices, ratings, and images. It's also a setup you can easily adapt for other publicly available or test websites with similar layouts.
Let’s get scraping!
The game plan:
- Fetch the page using Zyte API (it handles rendering & manages blocks automatically)
- Extract HTML content inside n8n
- Parse book elements with CSS selectors
- Clean and normalize the data
- Export results as JSON or CSV
First, let’s get n8n ready to roll.
You can set it up for free locally or in the cloud, whichever you prefer.
If you're going local, install it via Docker or npm; it only takes a few commands.
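For example, either of these gets a local instance running (commands as documented by n8n at the time of writing; check the install docs if they've changed):

npx n8n

Or, with Docker:

docker run -it --rm --name n8n -p 5678:5678 -v n8n_data:/home/node/.n8n docker.n8n.io/n8nio/n8n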
Once it's up, the steps below will work exactly the same whether you're self-hosting n8n or using n8n Cloud.
Step 1: Create a new workflow in n8n
After logging in, create a new workflow.
Name it something like "Book Catalog Scraper"; you can tweak the same workflow later for similar pages or categories.
This blank canvas is where all your nodes will live.
Step 2: Add an HTTP Request Node
We’ll use the HTTP Request node to call the Zyte API.
We’ll use cURL to configure this node. Click on Import cURL, then paste the following command and hit Import.
(Don’t forget to replace the API key with your own, and change the URL if you’d like.)
curl \
--user YOUR_ZYTE_API_KEY_GOES_HERE: \
--header 'Content-Type: application/json' \
--data '{"url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html", "browserHtml": true}' \
https://api.zyte.com/v1/extract
Once imported, you’ll see the node fields automatically populated.
Note: When you import via cURL, n8n often converts boolean values like true into the string "true".
To fix this, click the little gear icon → “Add Expression” next to the value and set it to {{true}}.
This is especially important for the browserHtml field: it ensures the Zyte API receives a real boolean, not a string.
Now hit Execute Node, and you should see a JSON response with a big block of HTML inside the "browserHtml" field.
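For reference, a successful response looks roughly like this (trimmed; the exact fields depend on which options you request):

{
  "url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
  "statusCode": 200,
  "browserHtml": "<!DOCTYPE html><html>...</html>"
}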
Step 3: Extract the HTML content
Next, add an Edit Fields node (previously called Set node) to isolate that browserHtml content.
- Mode: Add Field
- Name: data
- Value: {{$json["browserHtml"]}}
This gives us a clean data field containing just the HTML we need.
Step 4: Parse book elements
Add the HTML node (Extract HTML Content).
- Source Data: data
- Key: books
- CSS Selector: article.product_pod
- Return Array: ✅ Enabled
- Return Value: HTML
Run it once and you'll see a new field, books, containing an array where each item represents a single book's HTML block.
We now have one array with multiple products, each ready to be parsed individually in the next step.
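For context, each product card on books.toscrape.com looks roughly like this (abridged), which is where the selectors in Step 6 come from:

<article class="product_pod">
  <div class="image_container">
    <a href="../../../its-only-the-himalayas_981/index.html"><img src="../../../../media/cache/…/thumb.jpg" class="thumbnail"></a>
  </div>
  <p class="star-rating Two">…</p>
  <h3><a href="../../../its-only-the-himalayas_981/index.html" title="It's Only the Himalayas">It's Only the …</a></h3>
  <div class="product_price">
    <p class="price_color">£45.17</p>
    <p class="instock availability">In stock</p>
  </div>
</article>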
Step 5: Split the list into items
Now we’ll process each product individually.
Add the Split Out node:
- Fields To Split Out: books
Now each book becomes its own item for extraction. This makes it easier to handle or filter each record separately later on.
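To picture what the split does, assume books held three HTML strings; you go from one item to three:

// before Split Out: one item
{ "books": ["<article class=\"product_pod\">…</article>", "…", "…"] }

// after Split Out: one item per book
{ "books": "<article class=\"product_pod\">…</article>" }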
(You can skip this step if you only need a quick one-shot export, but keeping it helps if you plan to scale or tweak the workflow later.)
Step 6: Extract product details
Add another HTML node (Extract HTML Content) to grab the details inside each product.
Extraction Values:
Key | CSS Selector | Return Value
---|---|---
name | h3 a | Attribute → title
url | h3 a | Attribute → href
price | .price_color | Text
availability | .instock.availability | Text
rating | p.star-rating | Attribute → class
image | .image_container img | Attribute → src
Hit Execute and you'll get structured JSON for each book.
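Each book should come out looking something like this (values illustrative, from the Travel category; note the relative paths and the rating hidden inside a class name):

{
  "name": "It's Only the Himalayas",
  "url": "../../../its-only-the-himalayas_981/index.html",
  "price": "£45.17",
  "availability": "In stock",
  "rating": "star-rating Two",
  "image": "../../../../media/cache/…/….jpg"
}

Those ../ prefixes and the star-rating class are exactly what the next step cleans up.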
Step 7: Clean and normalize the data
We'll resolve the relative URLs and image paths into full links, and turn the rating class into a readable value.
Add a Code node (Code in JavaScript) and paste:
// Resolve the relative links against the page we scraped, and tidy each field.
const pageUrl = 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html';

return items.map(item => {
  // The rating is encoded in a class attribute, e.g. "star-rating Three" -> "Three".
  const ratingClass = item.json.rating || '';
  const ratingParts = ratingClass.split(' ');
  const rating = ratingParts.length > 1 ? ratingParts[ratingParts.length - 1] : '';

  return {
    json: {
      name: item.json.name || '',
      // new URL() resolves the ../ segments properly against the page URL,
      // so book links keep their /catalogue/ prefix (simply stripping ../ would drop it).
      url: item.json.url ? new URL(item.json.url, pageUrl).href : '',
      image: item.json.image ? new URL(item.json.image, pageUrl).href : '',
      price: item.json.price || '',
      availability: (item.json.availability || '').trim(),
      rating
    }
  };
});
Config
- Mode: Run Once for All Items
- Language: JavaScript
Note:
You can tweak this logic for your own site or data structure; for instance, you might clean extra fields, resolve paths differently, or skip this step entirely if your data is already in the format you want.
Now your output will have clean, structured data, ready to export or feed into your next automation.
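The sample record from Step 6 would now look like this (illustrative):

{
  "name": "It's Only the Himalayas",
  "url": "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html",
  "image": "https://books.toscrape.com/media/cache/…/….jpg",
  "price": "£45.17",
  "availability": "In stock",
  "rating": "Two"
}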
Step 8: Export your data the way you want
Now that your data is clean and structured, let’s turn it into a downloadable file, whether that’s CSV, .txt, or something else.
Finally, drop in the Convert to File node. It takes your structured data and converts it into different file types.
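A typical CSV setup looks like this (field labels may differ slightly between n8n versions):
- Operation: Convert to CSV
- Put Output File in Field: data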
Once done, click Execute Node and you’ll see a binary output with your file ready to download.
Wrapping up
And that's it: we just built a full web scraping workflow in n8n, powered by the Zyte API.
You've automated the complete pipeline, fetching, parsing, cleaning, and exporting, all visually inside n8n.
This same flow can easily be tweaked for other pages: just change the URL, update your selectors, and you're good to go.
In the next part, we’ll take this further and scrape multiple pages automatically by adding pagination logic.
Stay tuned, and thanks for reading! 😄
Top comments (1)
Spin up n8n with Docker (it's the fastest, least finicky path) and fix the cURL boolean snag by using Add Expression. The bare-bones pipeline is HTTP Request → HTML → Code → Convert to File. Pick raw requests for static pages, headless Playwright/Puppeteer for heavy JS, or Zyte API when anti-bot is being spicy. Normalize with new URL, run lightweight schema checks, clean text, and standardize price/currency/availability so downstream tools don't choke.