What this builds
An n8n workflow that takes a list of prospects from Google Sheets, scrapes each prospect's website (homepage plus up to 3 internal pages), uses AI to extract meaningful business insights from each page, generates a hyper-personalised multi-line icebreaker for each prospect, and writes the icebreaker back to the Sheet — ready for your outreach tool.
Output example:
`"Hey Katie, love how KTL Graphics makes it easy to filter by acreage —
also a fan of your property update email option. Wanted to run
something by you..."`
The system found acreage filtering and email notification features by actually crawling the website — not from the company name or LinkedIn headline.
Workflow JSON download: Available on the blog
Architecture
`Manual Trigger
↓
Google Sheets — get all rows
↓
Filter — only rows with email AND website URL
↓
Loop Over Items (batch size 1)
↓
HTTP Request — scrape homepage (HTML)
↓
Edit Fields — extract html field
↓
Code node — convert to string
↓
HTML Extractor — pull all <a href> links
↓
Edit Fields — keep: first_name, last_name, email, website_url, links
↓
Split Out — one row per link
↓
Filter — links starting with /
↓
Code node — normalise relative/absolute URLs
↓
Remove Duplicates + Limit (max 3 pages)
↓
HTTP Request — fetch each internal page
↓
HTML to Markdown conversion
↓
AI Agent — summarise each page into abstract
↓
Merge all abstracts
↓
AI Agent — generate icebreaker from all abstracts
↓
Google Sheets — write icebreaker back to prospect row`
Step 1 — Google Sheets setup
Export your Apollo.io leads (or any source) to Google Sheets.
Required columns:
`first_name | last_name | email | website_url | icebreaker (empty, filled by workflow)`
The workflow reads from this sheet and writes icebreakers back to the icebreaker column.
Step 2 — Filter node (quality gate)
Add a Filter node after the Sheets Get Rows node.
Two conditions, both must be true:
`Condition 1: website_url exists and is not empty
Condition 2: email exists and is not empty`
Without this filter, the workflow attempts to scrape blank URLs and throws errors that cascade through the rest of the pipeline.
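The workflow uses the built-in Filter node for this gate, but the same logic can be sketched as plain JavaScript for clarity (the `hasEmailAndWebsite` helper below is illustrative, not part of the workflow):

```javascript
// Equivalent of the Filter node's two conditions as a predicate.
// Treats whitespace-only values as empty, same as "is not empty".
function hasEmailAndWebsite(row) {
  const email = (row.email || "").trim();
  const url = (row.website_url || "").trim();
  return email.length > 0 && url.length > 0;
}

// Sample rows shaped like the Google Sheets output (hypothetical data).
const rows = [
  { email: "katie@example.com", website_url: "https://example.com" },
  { email: "", website_url: "https://example.com" },
  { email: "bob@example.com", website_url: "" },
];
const valid = rows.filter(hasEmailAndWebsite);
console.log(valid.length); // 1
```

In an n8n Code node the same check would be `return $input.all().filter(item => hasEmailAndWebsite(item.json));`, but the Filter node keeps the canvas readable.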
Step 3 — Loop Over Items (batch size 1)
Add a Loop Over Items node.
Batch Size: 1
This processes one prospect at a time. Without it, the workflow tries to process all prospects simultaneously — rate limits on external sites cause failures, and the AI responses get mixed across prospects.
Step 4 — Scrape the homepage
Add an HTTP Request node.
`Method: GET
URL: {{ $json.website_url }}
Error handling: Continue on error
Redirects: Follow, max 21`
"Continue on error" is critical. Some websites block scraping. Without this setting, one blocked site kills the entire workflow run.
Step 5 — Extract and normalise HTML
Add an Edit Fields node:
Field: html → string → {{ $json.data }}
Add a Code node:
`// Coerce the response body to a plain string so the
// HTML Extractor in the next step can parse it.
return [{
  json: {
    html: $json.html.toString()
  }
}];`
This converts the raw response body into a usable string for the link extraction in the next step.
Step 6 — Extract all links from the homepage
Add an HTML Extractor node.
`CSS Selector: a
Attribute: href
Return: Array
Options: Trim values + clean text`
This pulls every link from the homepage — navigation, footer, internal pages. The homepage alone contains only part of the story. The About page, Services page, and Blog posts are where the personalisation gold lives.
Step 7 — Normalise URLs (Code node)
After splitting links into individual rows and filtering for links starting with /, add this Code node to normalise both relative and absolute URLs to relative paths:
`const items = $input.all();
const updatedItems = items.map((item) => {
const link = item?.json?.links;
if (typeof link === "string") {
if (link.startsWith("/")) {
item.json.links = link;
}
else if (link.startsWith("http://") || link.startsWith("https://")) {
try {
const url = new URL(link);
let path = url.pathname;
if (path !== "/" && path.endsWith("/")) {
path = path.slice(0, -1);
}
item.json.links = path || "/";
} catch (e) {
item.json.links = link;
}
}
else {
item.json.links = link;
}
}
return item;
});
return updatedItems;`
Why this is necessary: Websites use both relative links (/about) and absolute links (https://example.com/about). You need both normalised to the same format to deduplicate and combine with the base URL correctly in the next step.
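The per-link logic above can be lifted into a standalone function for quick testing outside n8n (same behaviour, just without the item-loop wrapper):

```javascript
// Standalone version of the Step 7 normalisation logic.
function normaliseLink(link) {
  if (typeof link !== "string") return link;
  if (link.startsWith("/")) return link; // already a relative path
  if (link.startsWith("http://") || link.startsWith("https://")) {
    try {
      const url = new URL(link);
      let path = url.pathname;
      // Strip a trailing slash so "/about/" and "/about" deduplicate
      if (path !== "/" && path.endsWith("/")) path = path.slice(0, -1);
      return path || "/";
    } catch (e) {
      return link; // malformed URL: leave untouched
    }
  }
  return link; // mailto:, tel:, #anchor etc. pass through unchanged
}

console.log(normaliseLink("/about"));                     // "/about"
console.log(normaliseLink("https://example.com/about/")); // "/about"
console.log(normaliseLink("mailto:hi@example.com"));      // unchanged
```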
Step 8 — Deduplicate and limit
Add a Remove Duplicates node (deduplicate on links field) followed by a Limit node:
`Max items: 3
Keep: First Items`
Three pages per prospect is the sweet spot. Enough to find specific details, not enough to run up excessive API costs or hit rate limits on the target site.
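In plain JavaScript, the two nodes together amount to this (a sketch of the behaviour, not what n8n runs internally):

```javascript
// Remove Duplicates + Limit, sketched as one function.
// Keeps first occurrences, then caps the list, matching
// the "Keep: First Items" setting.
function dedupeAndLimit(links, max = 3) {
  const seen = new Set();
  const unique = [];
  for (const link of links) {
    if (!seen.has(link)) {
      seen.add(link);
      unique.push(link);
    }
  }
  return unique.slice(0, max);
}

console.log(dedupeAndLimit(["/about", "/services", "/about", "/blog", "/contact"]));
// ["/about", "/services", "/blog"]
```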
Step 9 — Fetch internal pages
Add another HTTP Request node to fetch each filtered internal URL:
`Method: GET
URL: {{ $json.website_url }}{{ $json.links }}
Error handling: Continue on error`
The URL concatenates the base domain from the original lead data with the relative path extracted from the link list.
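One edge case worth knowing: if `website_url` was exported with a trailing slash (`https://example.com/`), the direct concatenation yields a double slash (`https://example.com//about`). Most servers tolerate it, but a defensive join is cheap. Sketched here as a hypothetical `joinUrl` helper, not part of the workflow as built:

```javascript
// Strip any trailing slashes from the base before appending
// the normalised relative path from Step 7.
function joinUrl(base, path) {
  return base.replace(/\/+$/, "") + path;
}

console.log(joinUrl("https://example.com", "/about"));  // https://example.com/about
console.log(joinUrl("https://example.com/", "/about")); // https://example.com/about
```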
Step 10 — HTML to Markdown
Add an HTML to Markdown node:
`Mode: HTML to Markdown
HTML: {{ $json.data ? $json.data : "<div>empty</div>" }}
Destination Key: data`
Markdown is significantly more token-efficient than HTML for AI processing. Stripping HTML tags typically reduces the content you pass to the AI model by 60–80%, which directly cuts API cost and improves the quality of the AI's analysis by removing markup noise.
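To see why, compare raw markup with its text content. The n8n node performs a proper conversion; this naive regex strip is only an illustration of how much of the payload is markup (the sample HTML is made up):

```javascript
// Illustrative only: real HTML-to-Markdown conversion preserves
// structure (headings, links), which a regex strip does not.
const html = `
  <div class="hero container-fluid" id="main" data-track="home">
    <h1 class="title display-4">We sell acreage</h1>
    <p class="lead text-muted">Filter listings by acreage and more.</p>
  </div>`;

const stripped = html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
console.log(stripped); // "We sell acreage Filter listings by acreage and more."
console.log(html.length, stripped.length); // markup is most of the bytes
```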
Step 11 — AI page summariser
Add an AI Agent node with this system prompt:
You are a helpful, intelligent website scraping assistant.
You are provided a Markdown scrape of a website page.
Your task is to provide a two-paragraph abstract of what this page is about.
Return in this JSON format:
{"abstract":"your abstract goes here"}
Rules:
- Your abstract should be comprehensive — similar level of detail as an abstract to a published paper.
- Use a straightforward, factual writing style.
- Focus on what is unique or distinctive about this business or page.
- Note any specific features, products, services, or differentiators that would be useful for personalised outreach.
- Return ONLY the JSON object. No backticks. No explanation.
Model: GPT-4 mini via OpenRouter — approximately $0.0003 per page abstract.
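If you route the agent's output through a Code node before merging, a defensive parse guards against models that ignore the no-backticks rule. The `parseAbstract` helper below is a hypothetical addition, not an n8n built-in:

```javascript
// Defensive parse for the summariser's output. The prompt forbids
// backticks, but models occasionally wrap JSON in a code fence anyway.
function parseAbstract(raw) {
  const cleaned = raw.replace(/`{3}(?:json)?/g, "").trim();
  try {
    const parsed = JSON.parse(cleaned);
    return typeof parsed.abstract === "string" ? parsed.abstract : "";
  } catch (e) {
    return ""; // unparseable output is treated like an empty page
  }
}

console.log(parseAbstract('{"abstract":"A real-estate site with acreage filters."}'));
// "A real-estate site with acreage filters."
```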
Step 12 — Merge all abstracts
After the loop processes all three pages, combine the abstract outputs into a single item using a Merge node set to "Combine All Items."
Pass the merged abstracts to the final AI Agent.
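Flattening the merged items into one block of prompt text can be done in a Code node along these lines (the `combineAbstracts` helper is illustrative, and the field name assumes the summariser's `abstract` JSON key):

```javascript
// Join the per-page abstracts into one labelled text block
// for the icebreaker prompt.
function combineAbstracts(items) {
  return items
    .map((item, i) => `Page ${i + 1}: ${item.json.abstract}`)
    .join("\n\n");
}

// Sample merged items (hypothetical data).
const merged = combineAbstracts([
  { json: { abstract: "Homepage: property search with acreage filters." } },
  { json: { abstract: "About page: family-run brokerage." } },
]);
console.log(merged);
```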
Step 13 — Icebreaker generator
Add a final AI Agent node with this system prompt:
`You are an expert cold email copywriter specializing in personalized outreach.
You will receive multiple website page summaries for a prospect company.
Your task is to write a multi-line, personalized icebreaker for a cold email.
Rules:
- Reference 1-2 SPECIFIC details you found on their website
(features, initiatives, content, language they use about themselves)
- Sound like you actually explored the website — not like you read a summary
- Be conversational, warm, and curious — not salesy
- Keep it to 2-3 sentences maximum
- Address the first_name directly at the start
- End with a natural transition into your pitch ("Wanted to run something by you...")
First name: {{ $('Loop Over Items').first().json.first_name }}
Website summaries: [all abstract outputs concatenated here]
Return ONLY the icebreaker text. No JSON. No explanation.
`
Step 14 — Write back to Google Sheets
Add a Google Sheets node:
`Operation: Update Row
Match on: email = {{ $json.email }}
Update:
icebreaker: {{ $json.icebreaker }}`
The icebreaker column fills in for each row as the workflow processes it. When the run is complete, your Sheet has a personalised icebreaker for every prospect with a valid website URL.
What breaks
HTTP Request returns 403 on most sites: The site is blocking the default n8n user agent. Add a custom header: User-Agent: Mozilla/5.0 (compatible; outreach-research/1.0). This passes most basic anti-scraping checks.
AI Agent returns a JSON error instead of an abstract: The page content was empty (the HTTP request returned an error page). The {{ $json.data ? $json.data : "<div>empty</div>" }} fallback in the HTML to Markdown node keeps the run alive, but gives the AI nothing to summarise — check the HTTP Request output for that prospect to see why the page came back empty.
Loop produces mixed data across prospects: You are not using batch size 1. Set Loop Over Items batch size to exactly 1.
Icebreakers sound generic despite the workflow running: The AI is receiving summaries but not finding distinctive details. Check your Limit node — if it is set to 0 or is missing, you may only be scraping the homepage. Set it to 3 to ensure About and Services pages are included.
Running cost
OpenRouter GPT-4 mini: approximately $0.002 per prospect (3 page abstracts + 1 icebreaker generation).
100 prospects = $0.20 in API costs.
1,000 prospects = $2.00.
Compare this to $5–$50 per manually researched and written personalised opener. The cost case is immediate.
Workflow JSON at elevoras.com.
What niche are you targeting with cold outreach? Drop it in the comments — happy to suggest prompt adjustments for specific industries.