n8n HTML Extract Node: Scrape and Parse HTML from Any Web Page [Free Workflow JSON]
If you've ever needed to pull a price, a headline, a table of links, or any other piece of data from a web page inside an n8n workflow, the HTML Extract node is the tool designed for exactly that job.
This guide covers every option the HTML Extract node supports, the CSS selector syntax you need to know, and three real-world workflow patterns, with free workflow JSON you can import today.
What the n8n HTML Extract Node Does
The HTML Extract node takes an HTML string and extracts data from it using CSS selectors. You point it at the HTML source (from an HTTP Request node, a webhook, or any string expression), define what you want with a selector, and it returns the matched content as structured n8n items.
Common use cases:
- Scraping product prices or availability from e-commerce pages
- Extracting article headlines or metadata from news or blog pages
- Parsing HTML emails for structured data
- Pulling links, images, or table data from any webpage
- Monitoring pages for content changes
Node Parameters
| Parameter | What it does |
|---|---|
| Source Data | Where the HTML comes from: JSON (a field on the incoming item) or Binary (a binary property on the item) |
| JSON Property | Name of the item field that holds the HTML string, for example data (shown when Source Data = JSON) |
| CSS Selector | Standard CSS selector to target elements |
| Return Value | What to extract: Inner HTML, Text Content, Attribute, or Value |
| Attribute | Which attribute to read (when Return Value = Attribute) |
| Return Array | If enabled, returns all matches as an array instead of just the first |
CSS Selector Quick Reference
You don't need to know all of CSS; these patterns cover 95% of HTML extraction use cases:
h1 → first <h1> element
.price → elements with class="price"
#main-content → element with id="main-content"
div.product-card → <div> elements with class="product-card"
ul.nav > li → direct <li> children of <ul class="nav">
a[href] → all <a> elements that have an href attribute
a[href*="amazon"] → <a> elements with "amazon" in their href
table tr td:nth-child(2) → the second column of every table row
meta[name="description"] → meta description tag
[data-price] → any element with a data-price attribute
Return Value Options
Text Content: the human-readable text inside the element (strips tags):
<h1>Best Running Shoes 2026</h1> → "Best Running Shoes 2026"
Inner HTML: the raw HTML inside the element (preserves child tags):
<p><strong>In stock</strong></p> → "<strong>In stock</strong>"
Attribute: a specific HTML attribute value:
selector: a, attribute: href → "/products/shoe-123"
Value: the value attribute of form inputs, useful for parsing form HTML.
The "Return Array" Toggle
By default, the HTML Extract node returns only the first match for your selector.
Enable Return Array when you need all matches, for example all <a> links on a page, all <li> items in a list, or all prices in a product table.
Selector: .product-price
Return Array: ON
→ ["$19.99", "$24.99", "$14.99"]
Workflow Pattern 1: Price Monitor
Goal: Check a product page every hour and alert on Slack when the price drops.
Flow: Schedule Trigger → HTTP Request (fetch page) → HTML Extract (get price) → IF (price <= threshold) → Slack (send alert)
Workflow JSON:
{
"name": "Price Monitor",
"nodes": [
{
"name": "Schedule Trigger",
"type": "n8n-nodes-base.scheduleTrigger",
"parameters": { "rule": { "interval": [{ "field": "hours", "hoursInterval": 1 }] } },
"position": [240, 300]
},
{
"name": "Fetch Page",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"url": "https://example.com/product/running-shoes",
"options": { "response": { "response": { "responseFormat": "text" } } }
},
"position": [440, 300]
},
{
"name": "Extract Price",
"type": "n8n-nodes-base.htmlExtract",
"parameters": {
"sourceData": "json",
"dataPropertyName": "data",
"extractionValues": {
"values": [
{ "key": "price", "cssSelector": ".price-current", "returnValue": "text", "returnArray": false }
]
}
},
"position": [640, 300]
},
{
"name": "Price Drop?",
"type": "n8n-nodes-base.if",
"parameters": {
"conditions": {
"number": [{ "value1": "={{ parseFloat($json.price.replace(/[^0-9.]/g,'')) }}", "operation": "smallerEqual", "value2": 79.99 }]
}
},
"position": [840, 300]
},
{
"name": "Slack Alert",
"type": "n8n-nodes-base.slack",
"parameters": {
"channel": "#deals",
"text": "=Price dropped to {{ $json.price }}! Check it out."
},
"position": [1040, 200]
}
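],
"connections": {
"Schedule Trigger": { "main": [[{ "node": "Fetch Page", "type": "main", "index": 0 }]] },
"Fetch Page": { "main": [[{ "node": "Extract Price", "type": "main", "index": 0 }]] },
"Extract Price": { "main": [[{ "node": "Price Drop?", "type": "main", "index": 0 }]] },
"Price Drop?": { "main": [[{ "node": "Slack Alert", "type": "main", "index": 0 }]] }
}
}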
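The price comparison in the IF node depends on turning a scraped string such as "$79.99" into a number. Here is a minimal standalone sketch of that step, using the same regular expression as the workflow above. The helper names `parsePrice` and `isPriceDrop` are hypothetical, and the sketch assumes prices use a dot as the decimal separator; formats like "1.299,99 €" would be misread.

```javascript
// Hypothetical helpers that mirror the expression in the "Price Drop?" node.
// Assumption: the decimal separator is a dot and commas only group thousands.
function parsePrice(text) {
  // Drop currency symbols, spaces and thousands separators.
  const cleaned = String(text).replace(/[^0-9.]/g, '');
  const value = parseFloat(cleaned);
  // Return null instead of NaN so callers can test for "no price found".
  return Number.isNaN(value) ? null : value;
}

function isPriceDrop(text, threshold) {
  const price = parsePrice(text);
  return price !== null && price <= threshold;
}
```

Returning null for text like "Sold out" keeps a missing price from being treated as a drop, which a bare parseFloat comparison would not make obvious.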
Workflow Pattern 2: News Headline Scraper
Goal: Scrape the top 10 headlines from a news page every morning and save to Google Sheets.
Flow: Schedule Trigger → HTTP Request → HTML Extract (Return Array ON) → Split Out → Google Sheets (append)
Selector: h2.article-title (or whatever the target site uses)
Return Value: Text Content
Return Array: ON
The HTML Extract node returns an array of headline strings. Feed it into a Split Out node to create one item per headline, then pipe into Google Sheets.
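If you would rather do the splitting in a Code node, the step could look roughly like the sketch below. It assumes the extraction key is named `headlines`, which is a hypothetical name you would set yourself, and `headlinesToItems` is an illustrative helper rather than anything n8n provides. It also applies the top-10 cap from the goal, which the flow above does not enforce on its own.

```javascript
// Hypothetical helper: turn an array of headline strings into n8n-style items.
function headlinesToItems(headlines, limit = 10) {
  const seen = new Set();
  const items = [];
  for (const raw of headlines || []) {
    // Collapse internal whitespace and trim the ends.
    const headline = String(raw).replace(/\s+/g, ' ').trim();
    if (!headline || seen.has(headline)) continue; // skip blanks and repeats
    seen.add(headline);
    items.push({ json: { headline } });
    if (items.length >= limit) break; // keep only the first `limit` headlines
  }
  return items;
}

// Inside an n8n Code node the call would be something like:
// return headlinesToItems($input.first().json.headlines);
```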
Gotcha: Many news sites block simple GET requests. You may need to add browser-like headers:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Workflow Pattern 3: Link Extractor
Goal: Extract all links from a page and filter for specific domains.
Flow: HTTP Request → HTML Extract → Code node (filter/transform) → output
Selector: a
Return Value: Attribute
Attribute: href
Return Array: ON
Then in a Code node:
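// "links" is the Key set for this selector in the HTML Extract node.
const links = $input.first().json.links || [];
// Keep absolute HTTPS URLs only and emit one n8n item per link.
return links
.filter(url => url && url.startsWith('https://'))
.map(url => ({ json: { url } }));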
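The snippet above keeps only absolute HTTPS links. To cover the "specific domains" part of the goal and to handle relative hrefs such as "/about", one possible variation is sketched below. `filterLinksByDomain` and its parameters are hypothetical names; `URL` is the standard WHATWG URL class, which is available in Node.js.

```javascript
// Hypothetical helper: resolve hrefs against the page URL and keep only
// HTTPS links whose hostname matches one of the allowed domains.
function filterLinksByDomain(hrefs, pageUrl, allowedDomains) {
  const results = [];
  for (const href of hrefs || []) {
    if (!href) continue; // <a> tags without an href yield empty values
    let url;
    try {
      url = new URL(href, pageUrl); // resolves relative paths like "/about"
    } catch {
      continue; // skip values that cannot be parsed as URLs
    }
    if (url.protocol !== 'https:') continue; // drops mailto:, http:, etc.
    const host = url.hostname.toLowerCase();
    // Accept the domain itself and any of its subdomains.
    const allowed = allowedDomains.some(
      (domain) => host === domain || host.endsWith('.' + domain)
    );
    if (allowed) results.push({ json: { url: url.href } });
  }
  return results;
}
```

In a Code node this might be called as `return filterLinksByDomain($input.first().json.links, 'https://example.com/page', ['example.com']);`, with the page URL and domain list replaced by your own.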
Gotchas and Common Mistakes
1. The page is rendered by JavaScript
HTML Extract only parses the raw HTML response from the HTTP Request; it cannot execute JavaScript. If the content you want is loaded by React, Vue, or Angular after the initial page load, it won't be in the HTML string.
Fix: Look for an underlying API the page calls (check Network tab in browser DevTools), or use a headless browser tool for JS-rendered pages.
2. Selectors break when the site updates its HTML
Hard-coded class names like .price-tag-v3__price are brittle. Prefer selectors based on semantic attributes ([data-price], [itemprop="price"], meta[name="price"]), which are more stable across site redesigns.
3. The node returns empty
Check:
- Your HTML source expression is pointing at the right field (the HTTP Request response body is usually at $json.data when using "Response Format: Text")
- Your CSS selector actually matches in the real HTML (paste the HTML into browser DevTools and test with document.querySelector('.your-selector'))
- The page didn't return a 403/Cloudflare challenge instead of actual HTML
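The last check can be partly automated before the HTML ever reaches the extract step. The sketch below is a rough heuristic only: it assumes you have both the HTTP status code and the body text available in your workflow, the function name `isSuspectResponse` is hypothetical, and the marker strings are illustrative guesses rather than an official list of challenge-page signatures.

```javascript
// Illustrative marker strings; adjust them for the sites you scrape.
const BLOCK_MARKERS = ['just a moment', 'attention required', 'captcha', 'access denied'];

// Hypothetical helper: flag responses that are unlikely to contain real page HTML.
function isSuspectResponse(statusCode, html) {
  // Status codes commonly returned when a request is refused or throttled.
  if ([401, 403, 429, 503].includes(statusCode)) return true;
  const body = String(html || '').toLowerCase();
  if (body.trim() === '') return true; // an empty body cannot match any selector
  return BLOCK_MARKERS.some((marker) => body.includes(marker));
}
```

A page that legitimately mentions one of the marker words would be flagged too, so treat a positive result as a prompt to inspect the response, not as proof of blocking.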
4. "Return Array" returns one item, not an array
When Return Array is ON but only one element matches, it still returns an array with one item; that's correct behavior. The downstream node needs to handle an array.
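When the same downstream logic has to cope with both settings (a single string with Return Array off, an array with it on), a small normalizing step can help. This is a minimal sketch; `toArray` is a hypothetical helper name.

```javascript
// Hypothetical helper: always hand downstream code an array.
function toArray(value) {
  if (value === undefined || value === null) return []; // nothing matched
  return Array.isArray(value) ? value : [value];
}
```

With this in place, a Code node can loop over `toArray($json.price)` without caring which toggle state produced the field.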
Combining with Other Nodes
HTTP Request → HTML Extract is the core pairing. The HTTP Request fetches the raw HTML (set Response Format to "Text"), and the HTML Extract node reads it from the data field.
HTML Extract → Split Out is the standard pattern when Return Array is ON and you need one workflow item per extracted element.
HTML Extract → Set node lets you rename or restructure the extracted fields before passing them downstream.
HTML Extract → IF / Switch is how you implement conditional logic based on scraped content (price threshold, stock status, keyword presence).
Free Workflow JSON
Import this working example to get started immediately. It scrapes a public webpage, extracts the <title>, the meta description, and all <h2> headings, then outputs them as structured data.
{
"name": "HTML Extract Demo - Title, Meta, Headings",
"nodes": [
{
"name": "Manual Trigger",
"type": "n8n-nodes-base.manualTrigger",
"parameters": {},
"position": [240, 300]
},
{
"name": "Fetch Page",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"url": "https://n8n.io/blog/",
"options": { "response": { "response": { "responseFormat": "text" } } }
},
"position": [440, 300]
},
{
"name": "Extract Page Data",
"type": "n8n-nodes-base.htmlExtract",
"parameters": {
"sourceData": "json",
"dataPropertyName": "data",
"extractionValues": {
"values": [
{ "key": "title", "cssSelector": "title", "returnValue": "text", "returnArray": false },
{ "key": "description", "cssSelector": "meta[name=\"description\"]", "returnValue": "attribute", "attribute": "content", "returnArray": false },
{ "key": "headings", "cssSelector": "h2", "returnValue": "text", "returnArray": true }
]
}
},
"position": [640, 300]
}
],
"connections": {
"Manual Trigger": { "main": [[{ "node": "Fetch Page", "type": "main", "index": 0 }]] },
"Fetch Page": { "main": [[{ "node": "Extract Page Data", "type": "main", "index": 0 }]] }
}
}
Quick Reference
| What you want | Selector | Return Value |
|---|---|---|
| Page title | title | Text Content |
| Meta description | meta[name="description"] | Attribute → content |
| All links | a | Attribute → href (Array ON) |
| All images | img | Attribute → src (Array ON) |
| First H1 | h1 | Text Content |
| Table cell (row 1, col 2) | table tr:first-child td:nth-child(2) | Text Content |
| Element by data attribute | [data-price] | Text Content |
| OG image | meta[property="og:image"] | Attribute → content |
The HTML Extract node is one of n8n's most useful data-gathering tools, and pairing it with the HTTP Request node unlocks a huge range of web scraping and monitoring use cases without writing a full scraper script.
What are you scraping with n8n? Drop a comment below.
The free workflow JSON above can be imported directly into n8n: Settings → Import Workflow → paste the JSON. If you want a complete pre-built web scraping pack with error handling, Cloudflare bypass headers, change-detection logic, and Google Sheets output, check out the n8n Workflow Starter Pack.