Welcome to the finale of our n8n web scraping series!
- In Part 1, we covered the basics: fetching a single page and parsing it with CSS selectors.
- In Part 2, we tackled the tricky mechanics: pagination loops, infinite scroll, and network capture.
But if you’ve been following along, you know there is still one massive headache in web scraping: "New Site = New Workflow."
Every time you want to scrape a different website, you have to open the browser inspector, hunt for new `<div>` classes, debug why `price_color` isn't working, and rewrite your entire flow. It's exhausting.
Today, that changes.
In this final part, we are going to build an Automated AI Scraper - a single n8n workflow that can scrape almost anything (Online Stores, Article/News Sites, Job Boards, and more) without you changing a single node.
Whether you are a developer looking to save hours of coding, or a non-technical user who just needs the data without the headache, this tool is designed for you.
💡 TL;DR: Want to start scraping immediately? We have packaged this entire workflow into a ready-to-use template. 👉 Automate Data Extraction with Zyte AI (Products, Jobs, Articles & More)
The Concept: "AI-Driven Architecture"
The idea is simple: we make n8n stop looking for specific elements (CSS selectors) and start caring about what we actually want (the data).
We are leveraging Zyte's AI-powered extraction. Instead of saying "Find the text inside `.product_pod h3 a`", we simply send the URL and say `product: true`. The AI analyzes the visual layout of the page and figures it out, even if the website changes its code tomorrow.
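To make that concrete, here is roughly what that request looks like. In the workflow it lives inside an n8n HTTP Request node, but a plain Node.js sketch shows the shape (the target URL is just a placeholder):

```javascript
// Minimal sketch of a Zyte API extraction request (Node.js).
// In n8n, an HTTP Request node sends the same JSON body.
const ZYTE_API_KEY = process.env.ZYTE_API_KEY; // never hardcode the key

const response = await fetch("https://api.zyte.com/v1/extract", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    // Zyte uses HTTP Basic Auth: the API key is the username, empty password.
    Authorization: "Basic " + Buffer.from(`${ZYTE_API_KEY}:`).toString("base64"),
  },
  body: JSON.stringify({
    url: "https://example.com/some-product-page", // placeholder URL
    product: true, // "give me the product": no CSS selectors anywhere
  }),
});

const { product } = await response.json();
console.log(product.name, product.price);
```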
To handle every scenario, we designed three distinct pipelines:
1. The "AI Extraction" Pipeline (Automatic)
This is the core of the workflow. You simply select a Category (e.g., E-commerce, Article, etc.) and a Goal, and the workflow automatically routes you to one of two paths:
- Direct Extraction (Fast): If you just need data from the current page (like a "Single Product" or a "Simple List"), the workflow sends a single smart request. No loops, no waiting.
- The "Two-Phase" Architecture (For AI Crawling): If your goal involves "All Pages" or "Visiting Items," the workflow activates a robust recursive loop:
  - Phase 1 (The Crawler): It maps out the URLs you need (looping through pagination or grabbing item links from a list).
  - Phase 2 (The Scraper): It visits every mapped URL one by one to extract the rich details you asked for.
2. The "SERP" Pipeline (For SEO Data)
Need search rankings? We included a dedicated path for Search Engine Results Pages (SERP). It uses the specific `serp` schema to automatically extract organic results, ads, and knowledge panels without you needing to parse complex HTML.
3. The "Manual" Mode (For Raw Control)
Sometimes you don't need AI. We added a "General" path that gives you raw `browserHtml`, HTTP responses, or screenshots so you can parse specific data yourself.
Let’s get building.
Step 1: The Control Center
In previous parts, we hardcoded URLs into our nodes. For this tool, that won’t work. We need a flexible User Interface.
1. The Main Interface (Form Trigger)
We use an n8n Form Trigger node as the entry point. This turns your workflow into a clean web app that anyone on your team can use.
The Main Form collects three key inputs:
- Target URL (Text): The website you want to scrape.
- Site Category (Dropdown): Options like `Online Store`, `Article/News`, `Job Post`, and `General & More`.
- Zyte API Key (Password): Securely input the key so it isn't hardcoded in the workflow.
2. Smart Routing (The Switch Node)
This is where the magic happens. Immediately after the form, we use a Switch Node ("Route by Category") that directs the traffic into distinct lanes.
This logic is crucial because different categories require different inputs:
- The AI Lane (Store, News, Jobs): If you select a structured category, the workflow routes you to a Secondary Form asking for your "Extraction Goal" (e.g., Scrape this page vs. Crawl ALL pages).
- The SEO Lane: If you select "SERP (Search Engine Results)," it bypasses extra forms and goes straight to the specialized SERP scraper.
- The Manual Lane (General): If you select "General," it routes you to a different Manual Options Form where you can choose specific technical actions (e.g., Take Screenshot, Get Browser HTML, Network Capture).
This architecture ensures you only see options relevant to your goal.
Step 2: Pipeline 1 – AI Extraction
If the user selects E-commerce, News/Blog/Article, or Jobs, they enter the AI pipeline.
1. The "AI Extraction Goal Form" (Refining Scope)
Since "scraping" can mean anything from checking one price to archiving an entire blog, we present a secondary form here to define the scope. You simply tell the workflow what you need: a quick Single Item lookup, a List from the current page, or a full Multi-Page Crawl.

2. The Brain (Config Generator)
We place a Code Node (the "Zyte Config Generator") to translate your form choices into technical instructions.
For instance:
- If you select "Online Store" → it maps to the Zyte schema `product`.
- If you select "Article Site" → it maps to `article`.
- If you choose "Get List" → it targets `productList` (or `articleList`) to extract an array of items.
- If you choose "Crawl All Pages" → it switches the target to `productNavigation` (or `articleNavigation`) to activate the crawler loop.
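Here is a minimal sketch of what that Config Generator Code node could look like. The field names (`siteCategory`, `extractionGoal`, `targetUrl`) are illustrative; match them to your own form fields:

```javascript
// n8n Code node: translate form choices into a Zyte request body.
// Field names below are illustrative, not the template's exact names.
const { siteCategory, extractionGoal, targetUrl } = $input.first().json;

// Map the human-friendly category to Zyte's schema family.
const family = siteCategory === "Online Store" ? "product" : "article";

// Map the extraction goal to the concrete schema key.
const schemaByGoal = {
  "Single Item": family,                    // product / article
  "Get List": `${family}List`,              // productList / articleList
  "Crawl All Pages": `${family}Navigation`, // productNavigation / articleNavigation
};
const schema = schemaByGoal[extractionGoal] ?? family;

// Build the body the downstream HTTP Request node sends to Zyte.
return [{ json: { url: targetUrl, [schema]: true } }];
```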
3. The 5 Strategies
Based on your "Extraction Goal," the workflow automatically routes to one of 5 specific branches:
A. Single Item:
Fast execution. Scrapes details of one URL.
- We send a single request to the Zyte API with our specific target schema (e.g., `product: true`). The AI analyzes the page layout and returns a structured JSON object with the price, name, and details instantly.
B. List (Current Page):
Returns a clean JSON array of items found on the provided URL.
- Similar to the single item strategy, but instead of asking for one object, we request a List schema (like `productList`). The AI identifies the repeating elements on the page and returns them as a clean array.
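For reference, a `productList` response nests the items under a `products` array (field name per Zyte's list schema; verify against your actual response). A tiny Code node can flatten it into one n8n item per product:

```javascript
// n8n Code node: turn one productList response into one item per product.
const list = $input.first().json.productList?.products ?? [];
return list.map((product) => ({ json: product }));
```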
💡 Design Note: You might notice that the nodes for Strategy 1 and 2 look identical. That is because the heavy lifting (choosing between `product` vs `productList`) is actually handled upstream by the Config Generator.

🧑‍💻 Best Practice: In your own production automations, you should usually combine these into a single node to keep your canvas clean. However, for this template, we kept them separate. This makes the logic visually intuitive and allows you to add specific post-processing (like a unique filter) to the List path without accidentally breaking the Single Item path.
C. Details (Current Page):
A hybrid approach. It scans the current list, finds item links, and visits them one by one.
- We use a two-step logic: first, we request a `navigation` schema to identify all item links on the current page. Then, we split that list and use a loop to visit each URL individually to extract the full details.
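A sketch of that first step, assuming a `productNavigation` response where item links live under `items` (per Zyte's navigation schema):

```javascript
// n8n Code node: pull every item URL out of a navigation response,
// producing one n8n item per URL for the loop that follows.
const nav = $input.first().json.productNavigation ?? {};
const links = nav.items ?? [];
return links.map((item) => ({ json: { url: item.url } }));
```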
D. Crawl List (All Pages):
Activates the Crawler (Phase 1) to loop through pagination and build a massive master list.
- This enables the pagination loop. The workflow fetches the current page's list, saves the items to a global "Backpack" (memory), detects the "Next Page" link automatically, and loops back to repeat the process until it reaches the end.
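One way to implement that "Backpack" in a Code node is n8n's workflow static data, which is shared across loop iterations within a run (a sketch; the template's internals may differ):

```javascript
// n8n Code node inside the pagination loop.
// Accumulate items across iterations and surface the next page URL (if any).
const memory = $getWorkflowStaticData("global");
memory.backpack = memory.backpack ?? [];

const nav = $input.first().json.productNavigation ?? {};
memory.backpack.push(...(nav.items ?? []));

// Zyte's navigation schema exposes the next pagination link when one exists.
const nextUrl = nav.nextPage?.url ?? null;

return [{
  json: {
    nextUrl,               // an IF node checks this to continue or stop the loop
    done: nextUrl === null,
    totalCollected: memory.backpack.length,
  },
}];
```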
E. Crawl Details (All Pages):
The ultimate mode. It crawls all pages (Phase 1) AND visits every single item found (Phase 2).
- This uses our robust "Two-Phase" architecture. Phase 1 loops through pagination specifically to map out every item URL. Once the map is complete, Phase 2 takes over to visit every single URL one by one and extract the deep data.
Step 3: Pipeline 2 – SERP (Search Engine Results)
If you select "Search Engine Results" in the main form, the workflow takes a direct path to the SERP Node.
This is a single HTTP Request node configured with the `serp` schema.
- Input: Your target Search URL (e.g., a query on a search engine).
- Output: Structured JSON containing organic results, ad positions, and knowledge panels.
It is the fastest way to get reliable SERP data for rank tracking or brand monitoring, handling complex layouts automatically.
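The request body is as small as the others; only the schema key changes (a sketch, with a placeholder search URL, and the response shape abridged):

```javascript
// Body of the SERP HTTP Request node: same Zyte endpoint, different schema key.
const body = {
  url: "https://www.google.com/search?q=n8n+web+scraping", // placeholder query
  serp: true,
};

// The response nests results under `serp`, roughly:
// {
//   "serp": {
//     "organicResults": [{ "name": "...", "url": "...", "rank": 1 }],
//     ...
//   }
// }
```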
Step 4: Pipeline 3 – Manual / General Mode
Sometimes you need to scrape a unique dashboard or a niche directory, or you just want to debug the raw HTML yourself. That's why we included the "Manual" path.
If you select "General / Other" in the form, you are presented with a secondary form offering 5 raw tools:
- Browser HTML: Returns the full rendered DOM (great for the custom parsing logic we built in Part 1).
- HTTP Response Body: Useful for API endpoints.
- Network Capture: Intercepts background XHR/Fetch requests (as we learned in Part 2).
- Infinite Scroll: Automatically scrolls to the bottom before capturing HTML. (see the infinite scroll guide in Part 2)
- Screenshot: Returns a PNG snapshot of the page. (view the setup steps here).
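Under the hood, each manual tool is just a different flag in the same Zyte request body. Roughly (a sketch; double-check the `networkCapture` filter fields against Zyte's current API reference before relying on them):

```javascript
// The five manual tools map to five small variations of the request body.
const url = "https://example.com/target-page"; // placeholder

const manualBodies = {
  browserHtml: { url, browserHtml: true },        // full rendered DOM
  httpBody: { url, httpResponseBody: true },      // raw response (base64-encoded)
  networkCapture: {
    url,
    browserHtml: true,
    // Capture matching background XHR/Fetch responses alongside the page.
    networkCapture: [
      { filterType: "url", value: "/api/", matchType: "contains", httpResponseBody: true },
    ],
  },
  infiniteScroll: {
    url,
    browserHtml: true,
    actions: [{ action: "scrollBottom" }],        // scroll before capturing
  },
  screenshot: { url, screenshot: true },          // base64 PNG snapshot
};
```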
This ensures your scraping toolkit never hits a dead end, even on the most obscure websites.
The Result & Output
Regardless of which pipeline you choose (AI, SERP, or Manual), all data converges at a final Data Collector node.
We use a Convert to File node to transform that JSON into a clean CSV file (or an image file, for screenshots), ready for download directly in the browser.
Get the Workflow
We have packaged this entire logic (the forms, the smart routing, the crawler loops, and the safety checks) into a single template you can import right now from the n8n community.
👉 Automate Data Extraction with Zyte AI (Products, Jobs, Articles & More)
Wrapping Up
And that’s a wrap on our n8n web scraping series! 🎬
From building your first simple scraper in Part 1, to mastering pagination in Part 2, we have now arrived at the ultimate goal: an Intelligent Scraper that adapts to the web so you don't have to.
You now have a tool that:
- Gets You Data With Ease: Automatically extracts structured fields (like prices, images, and articles) without you needing to hunt for CSS selectors or manage CAPTCHAs.
- Reduces Maintenance: Adapts to layout changes automatically.
- Gives You Control: Lets you switch between AI automation and manual debugging instantly.
This template is ready for you to fork, modify, and deploy.
Thanks for joining us on this journey! If you build something cool, or if you run into a challenge that stumps you, come share it in the Extract Data Community.
Happy scraping! 🚀🕷️