DEV Community

Pirate Prentice


n8n HTML Extract Node: Scrape and Parse HTML from Any Web Page [Free Workflow JSON]


If you've ever needed to pull a price, a headline, a table of links, or any other piece of data from a web page inside an n8n workflow, the HTML Extract node is the tool designed for exactly that job.

This guide covers every option the HTML Extract node supports, the CSS selector syntax you need to know, and three real-world workflow patterns — with free workflow JSON you can import today.


What the n8n HTML Extract Node Does

The HTML Extract node takes an HTML string and extracts data from it using CSS selectors. You point it at the HTML source (from an HTTP Request node, a webhook, or any string expression), define what you want with a selector, and it returns the matched content as structured n8n items.

Common use cases:

  • Scraping product prices or availability from e-commerce pages
  • Extracting article headlines or metadata from news or blog pages
  • Parsing HTML emails for structured data
  • Pulling links, images, or table data from any webpage
  • Monitoring pages for content changes

Node Parameters

Parameter | What it does
Source Data | Where the HTML comes from: JSON (a field on the incoming item) or Binary (a binary property, such as a downloaded file)
JSON Property | Name of the field that holds the HTML string, for example data (shown when Source Data = JSON)
Key | Name of the output field the extracted value is stored under
CSS Selector | Standard CSS selector that targets the elements to extract
Return Value | What to extract: Inner HTML, Text Content, Attribute, or Value
Attribute | Which attribute to read (shown when Return Value = Attribute)
Return Array | If enabled, returns all matches as an array instead of only the first

CSS Selector Quick Reference

You don't need to know all of CSS; the patterns below cover most HTML extraction use cases:

h1                       → <h1> elements (the node returns only the first match unless Return Array is on)
.price                   → elements with class="price"
#main-content            → the element with id="main-content"
div.product-card         → <div> elements with class="product-card"
ul.nav > li              → direct <li> children of <ul class="nav">
a[href]                  → <a> elements that have an href attribute
a[href*="amazon"]        → <a> elements with "amazon" anywhere in their href
table tr td:nth-child(2) → the <td> in the second column of every table row
meta[name="description"] → the meta description tag
[data-price]             → any element with a data-price attribute

Return Value Options

Text Content — the human-readable text inside the element (strips tags):

<h1>Best Running Shoes 2026</h1>  →  "Best Running Shoes 2026"

Inner HTML — the raw HTML inside the element (preserves child tags):

<p><strong>In stock</strong></p>  →  "<strong>In stock</strong>"

Attribute — a specific HTML attribute value:

selector: a, attribute: href  →  "/products/shoe-123"

Value — the value attribute of form inputs, useful for parsing form HTML.


The "Return Array" Toggle

By default, the HTML Extract node returns only the first match for your selector.

Enable Return Array when you need all matches — for example, all <a> links on a page, all <li> items in a list, or all prices in a product table.

Selector: .product-price
Return Array: ON
→ ["$19.99", "$24.99", "$14.99"]

Workflow Pattern 1: Price Monitor

Goal: Check a product page every hour and alert on Slack when the price is at or below a target price.

Flow: Schedule Trigger → HTTP Request (fetch page) → HTML Extract (get price) → IF (price ≤ threshold) → Slack (send alert)

Workflow JSON:

{
  "name": "Price Monitor",
  "nodes": [
    {
      "name": "Schedule Trigger",
      "type": "n8n-nodes-base.scheduleTrigger",
      "parameters": { "rule": { "interval": [{ "field": "hours", "hoursInterval": 1 }] } },
      "position": [240, 300]
    },
    {
      "name": "Fetch Page",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "https://example.com/product/running-shoes",
        "options": { "response": { "response": { "responseFormat": "text" } } }
      },
      "position": [440, 300]
    },
    {
      "name": "Extract Price",
      "type": "n8n-nodes-base.htmlExtract",
      "parameters": {
        "sourceData": "json",
        "dataPropertyName": "data",
        "extractionValues": {
          "values": [
            { "key": "price", "cssSelector": ".price-current", "returnValue": "text", "returnArray": false }
          ]
        }
      },
      "position": [640, 300]
    },
    {
      "name": "Price Drop?",
      "type": "n8n-nodes-base.if",
      "parameters": {
        "conditions": {
          "number": [{ "value1": "={{ parseFloat($json.price.replace(/[^0-9.]/g,'')) }}", "operation": "smallerEqual", "value2": 79.99 }]
        }
      },
      "position": [840, 300]
    },
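    {
      "name": "Slack Alert",
      "type": "n8n-nodes-base.slack",
      "parameters": {
        "channel": "#deals",
        "text": "=Price dropped to {{ $json.price }}! Check it out."
      },
      "position": [1040, 200]
    }
  ],
  "connections": {
    "Schedule Trigger": { "main": [[{ "node": "Fetch Page", "type": "main", "index": 0 }]] },
    "Fetch Page": { "main": [[{ "node": "Extract Price", "type": "main", "index": 0 }]] },
    "Extract Price": { "main": [[{ "node": "Price Drop?", "type": "main", "index": 0 }]] },
    "Price Drop?": { "main": [[{ "node": "Slack Alert", "type": "main", "index": 0 }]] }
  }
}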
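The IF expression in the workflow above calls .replace() directly on $json.price, which throws if the selector matched nothing. Here is a minimal sketch of a more defensive version you could adapt for a Code node placed before the IF node. The helper names parsePrice and isPriceDrop are hypothetical, and the code assumes prices use a dot as the decimal separator (a format like "1.299,00" would need different handling).

```javascript
// Hypothetical helper: turn a scraped price string into a number, or null.
function parsePrice(raw) {
  if (typeof raw !== 'string') return null;
  // Drop currency symbols, spaces, and thousands separators; keep digits and dots.
  const cleaned = raw.replace(/[^0-9.]/g, '');
  if (cleaned === '') return null;
  const value = Number.parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}

// Hypothetical helper: a missing or unparseable price never triggers an alert.
function isPriceDrop(raw, threshold) {
  const price = parsePrice(raw);
  return price !== null && price <= threshold;
}

console.log(parsePrice('$1,299.00'));
console.log(isPriceDrop('$74.50', 79.99));
console.log(isPriceDrop(undefined, 79.99));
```

Returning null instead of NaN makes the "no price found" case explicit, so you can route it to a separate branch (for example, a "selector broke" notification) rather than silently skipping the alert.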

Workflow Pattern 2: News Headline Scraper

Goal: Scrape the top 10 headlines from a news page every morning and save to Google Sheets.

Flow: Schedule Trigger → HTTP Request → HTML Extract (Return Array ON) → Split Out → Google Sheets (append)

Selector: h2.article-title   (or whatever the target site uses)
Return Value: Text Content
Return Array: ON

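The HTML Extract node returns a single item whose output field holds an array of headline strings. Feed it into a Split Out node to create one item per headline, then pipe the items into Google Sheets. To keep only the top 10, add a Limit node after Split Out (or slice the array in a Code node).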
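Scraped headline lists often contain stray whitespace, empty strings, or duplicates (the same story linked twice on a page). The sketch below shows one way to clean the array and cap it at 10 entries in a Code node. The function name topHeadlines and the output field headline are hypothetical; the { json: ... } wrapper follows the item shape used in the Code node examples in this guide.

```javascript
// Hypothetical helper: clean a list of scraped headlines and keep the first `limit`.
function topHeadlines(headlines, limit = 10) {
  const seen = new Set();
  const result = [];
  for (const raw of Array.isArray(headlines) ? headlines : []) {
    if (result.length >= limit) break;
    if (typeof raw !== 'string') continue;
    // Collapse runs of whitespace left over from the page's markup.
    const text = raw.replace(/\s+/g, ' ').trim();
    if (text === '' || seen.has(text)) continue;
    seen.add(text);
    result.push({ json: { headline: text } });
  }
  return result;
}

console.log(topHeadlines(['  Markets   rally ', 'Markets rally', '', 'New phone launched']));
```

In a Code node you would pass in the array from the incoming item and return the result directly.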

Gotcha: Many news sites block simple GET requests. You may need to add browser-like headers:

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Workflow Pattern 3: Link Extractor

Goal: Extract all links from a page and filter for specific domains.

Flow: HTTP Request → HTML Extract → Code node (filter/transform) → output

Key: links
Selector: a
Return Value: Attribute
Attribute: href
Return Array: ON

Then in a Code node:

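// Read the array saved under the "links" key by the HTML Extract node.
const links = $input.first().json.links || [];

// Keep absolute HTTPS URLs only and emit one n8n item per link.
return links
  .filter(url => typeof url === 'string' && url.startsWith('https://'))
  .map(url => ({ json: { url } }));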
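The Code node snippet above keeps only absolute HTTPS links, so relative hrefs like /products/shoe-123 are dropped and no domain filtering happens yet. Here is a rough sketch of how the "filter for specific domains" step might look, using the standard URL class to resolve relative links against the page address. The function name filterLinksByDomain, the base URL, and the domain list are all placeholders for illustration.

```javascript
// Hypothetical helper: resolve hrefs against the page URL and keep allowed hosts.
function filterLinksByDomain(links, baseUrl, allowedDomains) {
  const seen = new Set();
  const items = [];
  for (const href of Array.isArray(links) ? links : []) {
    if (typeof href !== 'string') continue;
    let url;
    try {
      // Resolves relative paths like "/products/shoe-123" against baseUrl.
      url = new URL(href, baseUrl);
    } catch {
      continue; // skip values that cannot be parsed as URLs
    }
    // Ignore mailto:, javascript:, tel:, and similar schemes.
    if (url.protocol !== 'https:' && url.protocol !== 'http:') continue;
    const host = url.hostname;
    // Match the domain itself or any subdomain of it.
    const allowed = allowedDomains.some(d => host === d || host.endsWith('.' + d));
    if (!allowed || seen.has(url.href)) continue;
    seen.add(url.href);
    items.push({ json: { url: url.href } });
  }
  return items;
}

const sample = ['/products/shoe-123', 'https://www.amazon.com/dp/1', 'mailto:x@example.com'];
console.log(filterLinksByDomain(sample, 'https://example.com/shop/', ['example.com', 'amazon.com']));
```

Matching on the parsed hostname rather than on a substring of the href avoids false positives such as https://evil.test/?ref=amazon.com.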

Gotchas and Common Mistakes

1. The page is rendered by JavaScript

HTML Extract only parses the raw HTML response from the HTTP Request — it cannot execute JavaScript. If the content you want is loaded by React, Vue, or Angular after the initial page load, it won't be in the HTML string.

Fix: Look for an underlying API the page calls (check Network tab in browser DevTools), or use a headless browser tool for JS-rendered pages.
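One quick way to tell whether you're hitting this problem: take a piece of text you can see in the browser and check whether it exists in the raw response at all. The sketch below is a rough heuristic, not a full HTML parser. The function name appearsInRawHtml is hypothetical, and the check does not decode HTML entities such as &amp;.

```javascript
// Hypothetical check: does text you can see in the browser exist in the raw response?
function appearsInRawHtml(html, visibleText) {
  const normalize = s => s.replace(/\s+/g, ' ').trim().toLowerCase();
  // Remove script bodies first, then strip tags so text split across
  // inline elements (e.g. <strong>) can still match.
  const textOnly = html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<[^>]+>/g, ' ');
  return normalize(textOnly).includes(normalize(visibleText));
}

const serverRendered = '<p>Price: <strong>$19.99</strong></p>';
const clientRendered = '<div id="root"></div><script>render("$19.99")</script>';
console.log(appearsInRawHtml(serverRendered, 'Price: $19.99'));
console.log(appearsInRawHtml(clientRendered, '$19.99'));
```

If the text is missing from the raw HTML but visible in the browser, the content is most likely injected by JavaScript and no CSS selector will find it.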

2. Selectors break when the site updates its HTML

Hard-coded class names like .price-tag-v3__price are brittle. Prefer selectors based on semantic attributes ([data-price], [itemprop="price"], meta[name="price"]); these tend to survive site redesigns better than styling classes.
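You can also hedge against redesigns by extracting the same value with several selectors and taking the first one that returns something. The sketch below assumes the HTML Extract node is configured with three extraction values under the hypothetical keys priceData, priceItemprop, and priceClass, each using a different selector; firstNonEmpty is an illustrative helper, not part of n8n.

```javascript
// Hypothetical helper: return the first non-empty string among candidate fields.
function firstNonEmpty(json, keys) {
  for (const key of keys) {
    const value = json[key];
    if (typeof value === 'string' && value.trim() !== '') return value.trim();
  }
  return null; // none of the selectors matched
}

// Example item as it might look after extraction (values are made up).
const extracted = { priceData: '', priceItemprop: ' $19.99 ', priceClass: '$18.00' };
console.log(firstNonEmpty(extracted, ['priceData', 'priceItemprop', 'priceClass']));
```

Listing the most stable selector first means the brittle class-based one is only used as a last resort.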

3. The node returns empty

Check:

  • Your HTML source expression is pointing at the right field (the HTTP Request response body is usually at $json.data when using "Response Format: Text")
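  • Your CSS selector actually matches the real HTML (open the page in your browser and run document.querySelectorAll('.your-selector') in the DevTools console; DevTools shows the DOM after JavaScript has run, so also confirm the element exists in View Source)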
  • The page didn't return a 403/Cloudflare challenge instead of actual HTML
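The last check in the list can be automated. Here is a rough sketch of a classifier you might run on the response body before the HTML Extract node. The function name diagnoseHtml and the marker strings are assumptions chosen for illustration; block pages vary by site, so adjust the list for the sites you target, and expect false positives on pages that legitimately mention these words.

```javascript
// Hypothetical helper: classify a fetched body before running selectors on it.
function diagnoseHtml(body) {
  if (typeof body !== 'string' || body.trim() === '') return 'empty';
  const lower = body.toLowerCase();
  // Assumed markers of a block or challenge page; not an exhaustive list.
  const blockMarkers = ['just a moment', 'access denied', 'captcha'];
  if (blockMarkers.some(marker => lower.includes(marker))) return 'blocked';
  // A JSON or plain-text response will not contain these tags.
  if (!lower.includes('<html') && !lower.includes('<body')) return 'not-html';
  return 'ok';
}

console.log(diagnoseHtml('<html><body><h1>Shoes</h1></body></html>'));
console.log(diagnoseHtml('<html><title>Just a moment...</title></html>'));
```

Routing anything other than 'ok' to an error branch makes an empty extraction much easier to debug than a silent blank field.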

4. "Return Array" returns one item, not an array

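When Return Array is ON, the node still outputs a single n8n item; the extracted field on that item holds the array. If only one element matches, the field is an array with one element, which is the correct behavior. Downstream nodes need to handle an array, or you can use Split Out to turn it into one item per element.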
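If a downstream Code node has to work whether Return Array is on or off, it can normalize the field first. A minimal sketch, with toArray and toItems as hypothetical helper names:

```javascript
// Hypothetical helper: accept a single value, an array, or nothing, and always return an array.
function toArray(value) {
  if (Array.isArray(value)) return value;
  if (value === undefined || value === null || value === '') return [];
  return [value];
}

// Hypothetical helper: one n8n-style item per extracted value.
function toItems(value, field) {
  return toArray(value).map(v => ({ json: { [field]: v } }));
}

console.log(toItems(['$19.99', '$24.99'], 'price'));
console.log(toItems('$19.99', 'price'));
console.log(toItems(undefined, 'price'));
```

This does by hand roughly what Split Out does for you, with the added guarantee that a missing field produces zero items instead of an error.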


Combining with Other Nodes

HTTP Request → HTML Extract is the core pairing. The HTTP Request node fetches the raw HTML (set Response Format to "Text"), and the HTML Extract node reads it from the data field.

HTML Extract → Split Out is the standard pattern when Return Array is ON and you need one workflow item per extracted element.

HTML Extract → Set node lets you rename or restructure the extracted fields before passing them downstream.

HTML Extract → IF / Switch is how you implement conditional logic based on scraped content (price threshold, stock status, keyword presence).


Free Workflow JSON

Import this working example to get started immediately. It scrapes a public webpage, extracts the <title>, the meta description, and all <h2> headings, then outputs them as structured data.

{
  "name": "HTML Extract Demo — Title, Meta, Headings",
  "nodes": [
    {
      "name": "Manual Trigger",
      "type": "n8n-nodes-base.manualTrigger",
      "parameters": {},
      "position": [240, 300]
    },
    {
      "name": "Fetch Page",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "https://n8n.io/blog/",
        "options": { "response": { "response": { "responseFormat": "text" } } }
      },
      "position": [440, 300]
    },
    {
      "name": "Extract Page Data",
      "type": "n8n-nodes-base.htmlExtract",
      "parameters": {
        "sourceData": "json",
        "dataPropertyName": "data",
        "extractionValues": {
          "values": [
            { "key": "title", "cssSelector": "title", "returnValue": "text", "returnArray": false },
            { "key": "description", "cssSelector": "meta[name=\"description\"]", "returnValue": "attribute", "attribute": "content", "returnArray": false },
            { "key": "headings", "cssSelector": "h2", "returnValue": "text", "returnArray": true }
          ]
        }
      },
      "position": [640, 300]
    }
  ],
  "connections": {
    "Manual Trigger": { "main": [[{ "node": "Fetch Page", "type": "main", "index": 0 }]] },
    "Fetch Page": { "main": [[{ "node": "Extract Page Data", "type": "main", "index": 0 }]] }
  }
}

Quick Reference

What you want | Selector | Return Value
Page title | title | Text Content
Meta description | meta[name="description"] | Attribute → content
All links | a | Attribute → href (Array ON)
All images | img | Attribute → src (Array ON)
First H1 | h1 | Text Content
Table cell (row 1, col 2) | table tr:first-child td:nth-child(2) | Text Content
Element by data attribute | [data-price] | Text Content
OG image | meta[property="og:image"] | Attribute → content
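If you generate workflows programmatically, the rows above map fairly directly onto the extractionValues structure used in the workflow JSON in this guide. The sketch below builds those entries in JavaScript; the extraction helper and the key names are hypothetical, and the field names (key, cssSelector, returnValue, attribute, returnArray) are copied from the JSON examples rather than from a formal schema.

```javascript
// Hypothetical helper: build one entry for the node's extractionValues.values list.
function extraction(key, cssSelector, { attribute, returnArray = false } = {}) {
  const entry = {
    key,
    cssSelector,
    // Read an attribute when one is named; otherwise take the text content.
    returnValue: attribute ? 'attribute' : 'text',
    returnArray,
  };
  if (attribute) entry.attribute = attribute;
  return entry;
}

const values = [
  extraction('title', 'title'),
  extraction('description', 'meta[name="description"]', { attribute: 'content' }),
  extraction('links', 'a', { attribute: 'href', returnArray: true }),
  extraction('images', 'img', { attribute: 'src', returnArray: true }),
  extraction('ogImage', 'meta[property="og:image"]', { attribute: 'content' }),
];

// Paste the printed object into the HTML Extract node's "parameters" in a workflow JSON.
console.log(JSON.stringify({ extractionValues: { values } }, null, 2));
```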

The HTML Extract node is one of n8n's most useful data-gathering tools — and pairing it with the HTTP Request node unlocks a huge range of web scraping and monitoring use cases without writing a full scraper script.

What are you scraping with n8n? Drop a comment below.


The free workflow JSON above can be imported directly into n8n: open a workflow, then either choose Import from File in the editor menu or copy the JSON and paste it onto the canvas. If you want a complete pre-built web scraping pack with error handling, Cloudflare bypass headers, change-detection logic, and Google Sheets output, check out the n8n Workflow Starter Pack.

Top comments (1)

Pirate Prentice

What sites are you scraping with the HTML Extract node? Price monitoring and headline aggregation are the two I see most — but I'm curious if anyone is using it for something more creative (job boards, real estate listings, sports scores?). Drop your use case below 👇