Lakshay Nasa for Extract Data

Web Scraping with n8n | Part 1: Build Your First Web Scraper

What it will cover!

If you’ve ever wished you could automate scraping without setting up a bunch of scripts, proxies, or browser logic, you're in the right place.

We’ll use n8n, the low-code automation tool, together with the Zyte API to fetch structured data from https://books.toscrape.com/.

By the end, you’ll have a workflow that runs on its own and gives you clean JSON or CSV output of all the books: their names, prices, ratings, and images. It’s a setup you can easily adapt for other publicly available or test websites with similar layouts.

Let’s get scraping!

The game plan:

  • Fetch the page using Zyte API (it handles rendering & manages blocks automatically)
  • Extract HTML content inside n8n
  • Parse book elements with CSS selectors
  • Clean and normalize the data
  • Export results as JSON or CSV

First, let’s get n8n ready to roll.
You can set it up for free locally or in the cloud, whichever you prefer.
If you’re going local, install it via Docker or npm; it only takes a few commands, shown below.
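
For reference, here’s the quick start in both flavors (commands current as of writing; check the n8n docs if they’ve changed):

# Option 1: npm (needs Node.js installed)
npm install n8n -g
n8n start

# Option 2: Docker
docker volume create n8n_data
docker run -it --rm --name n8n -p 5678:5678 \
   -v n8n_data:/home/node/.n8n docker.n8n.io/n8nio/n8n

Either way, n8n will be available at http://localhost:5678.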

Once it’s up, the steps below will work exactly the same whether you’re using n8n Desktop or n8n Cloud.

Step 1: Create a new workflow in n8n

After logging in, create a new workflow.
Name it something like "Book Catalog Scraper"; you can tweak the same workflow later for similar pages or categories.
This blank canvas is where all your nodes will live.

N8N Blank Canvas

Step 2: Add an HTTP Request Node

We’ll use the HTTP Request node to call the Zyte API.

We’ll use cURL to configure this node. Click on Import cURL, then paste the following command and hit Import.
(Don’t forget to replace the API key with your own, and change the URL if you’d like.)

curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html", "browserHtml": true}' \
   https://api.zyte.com/v1/extract

Once imported, you’ll see the node fields automatically populated.
Note: When you import via cURL, n8n often converts boolean values like true into the string "true".
To fix this, click the little gear icon → “Add Expression” next to the value and set it to {{true}}.
This is especially important for the browserHtml field: it ensures the Zyte API receives a real boolean, not a string.

Now hit Execute Node, and you should see a JSON response with a big block of HTML inside the "browserHtml" field.
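
The response looks roughly like this (trimmed here; the real browserHtml value holds the full rendered page):

{
  "url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
  "browserHtml": "<!DOCTYPE html><html>…</html>"
}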

HTTP Request Node N8N Zyte API

Step 3: Extract the HTML content

Next, add an Edit Fields node (previously called Set node) to isolate that browserHtml content.

  • Mode: Add Field
  • Name: data
  • Value: {{$json["browserHtml"]}}

Extract HTML N8N

This gives us a clean data field containing just the HTML we need.

Step 4: Parse book elements

Add the HTML node (Extract HTML Content).

  • Source Data: data
  • Key: books
  • CSS Selector: article.product_pod
  • Return Array: ✅ Enabled
  • Return Value: HTML

Run it once and you’ll see a new field, books, containing an array where each item is a single book’s HTML block.

Parse book elements

We have one array, with multiple products, each ready to be parsed individually in the next step.
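
If you’re curious what this node is doing conceptually, here’s a rough plain-JavaScript equivalent using cheerio (illustrative only, you don’t need this in the workflow):

const cheerio = require('cheerio');

// `html` stands in for the data field from the previous step
const $ = cheerio.load(html);
const books = $('article.product_pod')
  .map((_, el) => $.html(el)) // outer HTML of each book card
  .get();                     // plain array: one HTML string per book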

Step 5: Split the list into items

Now we’ll process each product individually.
Add the Split Out node:

  • Fields To Split Out: books

Now each book becomes its own item for extraction. This makes it easier to handle or filter each record separately later on.

(You can skip this step if you only need a quick one-shot export, but keeping it helps if you plan to scale or tweak the workflow later.)
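
To picture the transformation, the item shapes before and after look roughly like this:

// Before Split Out: one item holding the whole array
{ "books": ["<article class=\"product_pod\">…", "<article class=\"product_pod\">…"] }

// After Split Out: one item per book
{ "books": "<article class=\"product_pod\">…" }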

Split Out N8N

Step 6: Extract product details

Add another HTML node (Extract HTML Content) to grab the details inside each product.

Extraction Values:

Key            CSS Selector             Return Value
name           h3 a                     Attribute → title
url            h3 a                     Attribute → href
price          .price_color             Text
availability   .instock.availability    Text
rating         p.star-rating            Attribute → class
image          .image_container img     Attribute → src

Hit Execute and you’ll get structured JSON for each book.
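
Each item will look something like this (values illustrative; note the relative paths and the rating hidden inside a CSS class, which we’ll clean up next):

{
  "name": "A Light in the Attic",
  "url": "../../../a-light-in-the-attic_1000/index.html",
  "price": "£51.77",
  "availability": "In stock",
  "rating": "star-rating Three",
  "image": "../../../../media/cache/…/….jpg"
}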

Extract product details

Step 7: Clean and normalize the data

We’ll make sure URLs and image links are full paths, and rating classes are readable.
Add a Code node (Code in JavaScript) and paste:

return items.map(item => {
  // Relative links on this page resolve against the category URL we fetched.
  // new URL() handles the ../ segments correctly; plain string concatenation
  // against the site root would drop the /catalogue/ part of product links.
  const pageUrl = 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html';
  const toAbsolute = rel => (rel ? new URL(rel, pageUrl).href : '');

  // The rating arrives as a class string like "star-rating Three";
  // keep only the last word.
  const ratingClass = item.json.rating || '';
  const ratingParts = ratingClass.split(' ');
  const rating = ratingParts.length > 1 ? ratingParts[ratingParts.length - 1] : '';

  return {
    json: {
      name: item.json.name || '',
      url: toAbsolute(item.json.url || ''),
      image: toAbsolute(item.json.image || ''),
      price: item.json.price || '',
      availability: (item.json.availability || '').trim(),
      rating
    }
  };
});

Config

  • Mode: Run Once for All Items
  • Language: JavaScript

Note:
You can tweak this logic for your own site or data structure. For instance, you might clean extra fields, adjust paths differently, or skip this step entirely if your data’s already in the format you want.

Now your output will have clean, structured data, ready to export or feed into your next automation.

Step 8: Export your data the way you want

Now that your data is clean and structured, let’s turn it into a downloadable file, whether that’s CSV, .txt, or something else.

Finally, drop in the Convert to File node.

This node takes your structured data and converts it into different file types.

Here’s how to configure it:
Convert Node N8N
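
(In current n8n versions this mostly means choosing the Operation, e.g. Convert to CSV or Convert to JSON; exact option names can vary a little between versions, so adapt as needed.)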

Once done, click Execute Node and you’ll see a binary output with your file ready to download.

Wrapping up

And that’s it: we just built a full web scraping workflow in n8n, powered by the Zyte API.

You’ve just automated a complete workflow: fetching, parsing, cleaning, and exporting, all visually inside n8n.

This same flow can easily be tweaked for other pages: just change the URL, update your selectors, and you’re good to go.

In the next part, we’ll take this further and scrape multiple pages automatically by adding pagination logic.

Stay tuned, thanks for reading!😄

Top comments (1)

OnlineProxy

Spin up n8n with Docker (it’s the fastest, least finicky path) and fix the cURL boolean snag by using Add Expression. The bare-bones pipeline is HTTP Request → HTML → Code → Convert to File, and you’ll pick raw requests for static pages, headless Playwright/Puppeteer for heavy JS, or the Zyte API when anti-bot is being spicy. Normalize with new URL, run lightweight schema checks, clean text, and standardize price/currency/availability so downstream tools don’t choke.