<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lakshay Nasa</title>
    <description>The latest articles on DEV Community by Lakshay Nasa (@lakshay_nasa).</description>
    <link>https://dev.to/lakshay_nasa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3319391%2F36a0d6ed-9695-4831-a90f-c93ce43a9960.png</url>
      <title>DEV Community: Lakshay Nasa</title>
      <link>https://dev.to/lakshay_nasa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lakshay_nasa"/>
    <language>en</language>
    <item>
      <title>The AI Web Scraper: One Workflow to Scrape Anything (n8n Part 3)</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Tue, 16 Dec 2025 14:00:59 +0000</pubDate>
      <link>https://dev.to/extractdata/web-scraping-with-n8n-part-3-the-ai-web-scraper-one-workflow-scrape-anything-3e4n</link>
      <guid>https://dev.to/extractdata/web-scraping-with-n8n-part-3-the-ai-web-scraper-one-workflow-scrape-anything-3e4n</guid>
      <description>&lt;p&gt;Welcome to the finale of our n8n web scraping series!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf"&gt;Part 1&lt;/a&gt;, we covered the basics: fetching a single page and parsing it with CSS selectors.&lt;/li&gt;
&lt;li&gt;In &lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365"&gt;Part 2&lt;/a&gt;, we tackled the tricky mechanics: pagination loops, infinite scroll, and network capture.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;But if you’ve been following along, you know there is still one massive headache in web scraping: &lt;strong&gt;"New Site = New Workflow."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every time you want to scrape a different website, you have to open the browser inspector, hunt for new &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; classes, debug why &lt;code&gt;price_color&lt;/code&gt; isn't working, and rewrite your entire flow. It’s exhausting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Today, we change that.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this final part, we are going to build an &lt;strong&gt;Automated AI Scraper&lt;/strong&gt; - a single n8n workflow that can scrape almost anything (Online Stores, Article/News Sites, Job Boards, and more) without you changing a single node.&lt;/p&gt;

&lt;p&gt;Whether you are a developer looking to save hours of coding, or a non-technical user who just needs the data without the headache, this tool is designed for you.&lt;/p&gt;

&lt;center&gt;
  &lt;h4&gt;Watch the Walkthrough 🎬&lt;/h4&gt;
&lt;/center&gt;

&lt;p&gt;
  &lt;iframe src="https://www.youtube.com/embed/QLuvyOCwYT4"&gt;&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 TL;DR:&lt;/strong&gt; Want to start scraping immediately? We have packaged this entire workflow into a ready-to-use template. 👉 &lt;a href="https://n8n.io/workflows/11637-automate-data-extraction-with-zyte-ai-products-jobs-articles-and-more/" rel="noopener noreferrer"&gt;Automate Data Extraction with Zyte AI (Products, Jobs, Articles &amp;amp; More)&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Concept: "AI-Driven Architecture”
&lt;/h1&gt;

&lt;p&gt;The idea is simple: we make n8n stop looking for specific page elements (CSS selectors) and start declaring the data we actually want.&lt;/p&gt;

&lt;p&gt;We are leveraging &lt;a href="https://docs.zyte.com/zyte-api/usage/extract/index.html?utm_campaign=Discord_n8n_blog_p3_z_docs_auto_extract&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte’s AI-powered extraction&lt;/a&gt;. Instead of saying "Find the text inside &lt;code&gt;.product_pod h3 a&lt;/code&gt;", we simply send the URL and say &lt;code&gt;product: true&lt;/code&gt;. The AI analyzes the visual layout of the page and figures it out, even if the website changes its code tomorrow.&lt;/p&gt;
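
&lt;p&gt;To make that concrete, here is roughly what such a request body looks like, written as an n8n-style JavaScript object. This is a sketch: the URL is a hypothetical placeholder, not a node from the template.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Conceptual Zyte API request body for AI extraction (illustrative URL).
// Instead of CSS selectors, we declare WHAT we want:
const body = {
  url: 'https://example-shop.com/some-product', // hypothetical page
  product: true // ask the AI for a structured product object
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;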

&lt;p&gt;To handle every scenario, we designed &lt;strong&gt;three distinct pipelines:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The "AI Extraction" Pipeline (Automatic)&lt;/strong&gt;&lt;br&gt;
This is the core of the workflow. You simply select a Category (e.g., &lt;code&gt;E-commerce&lt;/code&gt;, &lt;code&gt;Article&lt;/code&gt;, etc.) and a &lt;code&gt;Goal&lt;/code&gt;, and the workflow automatically routes you to one of two paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct Extraction (Fast):&lt;/strong&gt; If you just need data from the current page (like a "Single Product" or a "Simple List"), the workflow sends a single smart request. No loops, no waiting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Two-Phase" Architecture (For AI crawling):&lt;/strong&gt; If your goal involves "All Pages" or "Visiting Items," the workflow activates a robust recursive loop:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1 (The Crawler):&lt;/strong&gt; It maps out the URLs you need (looping through pagination or grabbing item links from a list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2 (The Scraper):&lt;/strong&gt; It visits every mapped URL one by one to extract the rich details you asked for.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The "SERP" Pipeline (For SEO Data)&lt;/strong&gt;&lt;br&gt;
Need search rankings? We included a dedicated path for &lt;strong&gt;Search Engine Results Pages (SERP)&lt;/strong&gt;. It uses the &lt;code&gt;serp&lt;/code&gt; schema to automatically extract organic results, ads, and knowledge panels without you needing to parse complex HTML.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The "Manual" Mode (For Raw Control)&lt;/strong&gt;&lt;br&gt;
Sometimes you don't need AI. We added a "General" path that gives you raw &lt;code&gt;browserHtml&lt;/code&gt;, HTTP responses, or Screenshots so you can parse specific data yourself.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s get building.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: The Control Center
&lt;/h2&gt;

&lt;p&gt;In previous parts, we hardcoded URLs into our nodes. For this tool, that won’t work. We need a flexible User Interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Main Interface (Form Trigger)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh417c7sm4xjtwu6t74sp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh417c7sm4xjtwu6t74sp.png" alt="N8N Form AI Web Scraper Submission"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use an &lt;strong&gt;n8n Form Trigger&lt;/strong&gt; node as the entry point. This turns your workflow into a clean web app that anyone on your team can use.&lt;/p&gt;

&lt;p&gt;The Main Form collects three key inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Target URL&lt;/strong&gt; (Text): The website you want to scrape.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Site Category&lt;/strong&gt; (Dropdown): Options like &lt;code&gt;Online Store&lt;/code&gt;, &lt;code&gt;Article/News&lt;/code&gt;, &lt;code&gt;Job Post&lt;/code&gt;, &lt;code&gt;General &amp;amp; More&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zyte API Key&lt;/strong&gt; (Password): Securely input the key so it isn't hardcoded in the workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Smart Routing (The Switch Node)
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens. Immediately after the form, we use a &lt;strong&gt;Switch Node&lt;/strong&gt; ("Route by Category") that directs the traffic into distinct lanes.&lt;/p&gt;

&lt;p&gt;This logic is crucial because different categories require different inputs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The AI Lane (Store, News, Jobs):&lt;/strong&gt; If you select a structured category, the workflow routes you to a &lt;strong&gt;Secondary Form&lt;/strong&gt; asking for your "Extraction Goal" (e.g., &lt;em&gt;Scrape this page&lt;/em&gt; vs. &lt;em&gt;Crawl ALL pages&lt;/em&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The SEO Lane:&lt;/strong&gt; If you select "SERP (Search Engine Results)," it bypasses extra forms and goes straight to the specialized SERP scraper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Manual Lane (General):&lt;/strong&gt; If you select "General," it routes you to a different &lt;strong&gt;Manual Options Form&lt;/strong&gt; where you can choose specific technical actions (e.g., &lt;em&gt;Take Screenshot&lt;/em&gt;, &lt;em&gt;Get Browser HTML&lt;/em&gt;, &lt;em&gt;Network Capture&lt;/em&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture ensures you only see options relevant to your goal.&lt;/p&gt;
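
&lt;p&gt;For illustration, the Switch node's decision could be expressed as a tiny Code-node sketch like this. The template uses the visual Switch node; this equivalent is only to show the decision logic, and the field and option names here are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative routing logic (equivalent in spirit to the Switch node)
const category = $json['Site Category']; // form field name assumed

let route = 'ai'; // Online Store / Article/News / Job Post
if (category === 'SERP (Search Engine Results)') route = 'serp';
if (category === 'General &amp;amp; More') route = 'manual';

return [{ json: { ...$json, route } }];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;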




&lt;h2&gt;
  
  
  Step 2: Pipeline 1 – AI Extraction
&lt;/h2&gt;

&lt;p&gt;If the user selects Online Store, Article/News, or Job Post, they enter the AI pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The "AI Extraction Goal Form" (Refining Scope)
&lt;/h3&gt;

&lt;p&gt;Since "scraping" can mean anything from checking one price to archiving an entire blog, we present a secondary form here to define the scope. You simply tell the workflow what you need: a quick &lt;strong&gt;Single Item&lt;/strong&gt; lookup, a &lt;strong&gt;List&lt;/strong&gt; from the current page, or a full &lt;strong&gt;Multi-Page Crawl&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j4jy725csctiadkvw6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j4jy725csctiadkvw6y.png" alt="AI Extraction Goal"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Brain (Config Generator)
&lt;/h3&gt;

&lt;p&gt;We place a &lt;strong&gt;Code Node&lt;/strong&gt; (the "Zyte Config Generator") to translate your form choices into technical instructions; a minimal sketch follows the mapping list below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejpt2m8v567dzbovrnou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejpt2m8v567dzbovrnou.png" alt="N8N Code node: Zyte Config Generator"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For instance,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you select &lt;strong&gt;"Online Store"&lt;/strong&gt; → It maps to the Zyte schema &lt;code&gt;product&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;If you select &lt;strong&gt;"Article Site"&lt;/strong&gt; → It maps to &lt;code&gt;article&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;If you choose &lt;strong&gt;"Get List"&lt;/strong&gt; → It targets &lt;code&gt;productList&lt;/code&gt; (or &lt;code&gt;articleList&lt;/code&gt;) to extract an array of items.
&lt;/li&gt;
&lt;li&gt;If you choose &lt;strong&gt;"Crawl All Pages"&lt;/strong&gt; → It switches the target to &lt;code&gt;productNavigation&lt;/code&gt; (or &lt;code&gt;articleNavigation&lt;/code&gt;) to activate the crawler loop.&lt;/li&gt;
&lt;/ul&gt;
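
&lt;p&gt;Here is a minimal sketch of what such a Config Generator could look like as a Code node. It is illustrative only: the form field names and the exact mapping are assumptions, not the template's literal code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sketch of a "Zyte Config Generator" Code node (field names assumed)
const category = $json['Site Category'];
const goal = $json['Extraction Goal'] || '';

const schemaMap = {
  'Online Store': { single: 'product', list: 'productList', crawl: 'productNavigation' },
  'Article/News': { single: 'article', list: 'articleList', crawl: 'articleNavigation' }
};

const schemas = schemaMap[category] || schemaMap['Online Store'];
const key = goal.includes('All Pages') ? 'crawl'
          : goal.includes('List') ? 'list'
          : 'single';

// e.g. { url: "...", productList: true }
return [{ json: { url: $json['Target URL'], [schemas[key]]: true } }];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The downstream HTTP Request node can then send this object as the JSON body of the Zyte API call.&lt;/p&gt;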

&lt;h3&gt;
  
  
  3. The 5 Strategies
&lt;/h3&gt;

&lt;p&gt;Based on your "Extraction Goal," the workflow automatically routes to one of 5 specific branches:&lt;/p&gt;

&lt;h4&gt;
  
  
  A. &lt;strong&gt;Single Item:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Fast execution. Scrapes details of one URL.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We send a single request to the Zyte API with our specific target schema (e.g., &lt;code&gt;product: true&lt;/code&gt;). The AI analyzes the page layout and returns a structured JSON object with the price, name, and details instantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  B. &lt;strong&gt;List (Current Page):&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Returns a clean JSON array of items found on the provided URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8gf3goep2mm652d2x25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8gf3goep2mm652d2x25.png" alt="Scrape Details AI This Page - N8N"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similar to the single item strategy, but instead of asking for one object, we request a List schema (like &lt;code&gt;productList&lt;/code&gt;). The AI identifies the repeating elements on the page and returns them as a clean array.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Design Note:&lt;/strong&gt; You might notice that the nodes for Strategies A and B look identical. That is because the heavy lifting (choosing between &lt;code&gt;product&lt;/code&gt; vs &lt;code&gt;productList&lt;/code&gt;) is actually handled upstream by the &lt;strong&gt;Config Generator&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧑‍💻 Best Practice:&lt;/strong&gt; In your own production automations, you should usually &lt;strong&gt;combine these into a single node&lt;/strong&gt; to keep your canvas clean. However, for this template, we kept them separate. This makes the logic visually intuitive and allows you to add specific post-processing (like a unique filter) to the &lt;em&gt;List&lt;/em&gt; path without accidentally breaking the &lt;em&gt;Single Item&lt;/em&gt; path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  C. &lt;strong&gt;Details (Current Page):&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;A hybrid approach. It scans the current list, finds item links, and visits them one by one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav0mtxeob8ibbw711rzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav0mtxeob8ibbw711rzx.png" alt="Scrape List AI - N8N"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use a two-step logic: first, we request a &lt;code&gt;navigation&lt;/code&gt; schema to identify all item links on the current page. Then, we split that list and use a loop to visit each URL individually to extract the full details.&lt;/li&gt;
&lt;/ul&gt;
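
&lt;p&gt;A rough sketch of that "split" step as a Code node, assuming the navigation response exposes an &lt;code&gt;items&lt;/code&gt; array of links (field names are taken from Zyte's navigation schema, but verify them against your own response):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Turn one navigation result into one n8n item per URL (sketch)
const nav = $json.productNavigation || {}; // or articleNavigation, per category
const links = nav.items || []; // assumed shape: [{ url: '...' }, ...]

return links.map(l =&amp;gt; ({ json: { url: l.url } }));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;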

&lt;h4&gt;
  
  
  D. Crawl List (All Pages):
&lt;/h4&gt;

&lt;p&gt;Activates the &lt;strong&gt;Crawler (Phase 1)&lt;/strong&gt; to loop through pagination and build a massive master list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71sejledynmnmg1ly4oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71sejledynmnmg1ly4oh.png" alt="Scrape List - n8n"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This enables the pagination loop. The workflow fetches the current page's list, saves the items to a global "Backpack" (memory), detects the "Next Page" link automatically, and loops back to repeat the process until it reaches the end.&lt;/li&gt;
&lt;/ul&gt;
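
&lt;p&gt;This is the same "Backpack" pattern we built by hand in Part 2. A condensed sketch of the idea (illustrative; the template's node differs in detail, and the navigation field names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Phase 1 "Backpack" sketch: accumulate items, pass on the next page
const staticData = $getWorkflowStaticData('global');
staticData.allItems = staticData.allItems || [];

const nav = $json.productNavigation || {};
staticData.allItems.push(...(nav.items || []));

// nextPage is assumed from Zyte's navigation schema; null ends the loop
return [{ json: { nextUrl: nav.nextPage ? nav.nextPage.url : null } }];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;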

&lt;h4&gt;
  
  
  E. Crawl Details (All Pages):
&lt;/h4&gt;

&lt;p&gt;The ultimate mode. It crawls all pages (Phase 1) AND visits every single item found (Phase 2).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx24quf6lcfjw47lwev69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx24quf6lcfjw47lwev69.png" alt="scrape details AI - n8n"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This uses our robust &lt;strong&gt;"Two-Phase" architecture&lt;/strong&gt;. Phase 1 loops through pagination specifically to map out every item URL. Once the map is complete, Phase 2 takes over to visit every single URL one by one and extract the deep data.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 3: Pipeline 2 – SERP (Search Engine Results)
&lt;/h2&gt;

&lt;p&gt;If you select "Search Engine Results" in the main form, the workflow takes a direct path to the &lt;strong&gt;SERP Node&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is a single HTTP Request node configured with the &lt;code&gt;serp&lt;/code&gt; schema.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; Your target Search URL (e.g., a query on a search engine).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Structured JSON containing organic results, ad positions, and knowledge panels.&lt;/li&gt;
&lt;/ul&gt;
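
&lt;p&gt;Conceptually, the request body is as small as this sketch (the search URL is a hypothetical placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative SERP request body
const body = {
  url: 'https://www.google.com/search?q=n8n+web+scraping', // placeholder query
  serp: true // returns structured search results instead of raw HTML
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;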

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmrzrt20779qz6v3phuj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmrzrt20779qz6v3phuj.png" alt="SERP Extraction: Scrape with n8n"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is the fastest way to get reliable &lt;strong&gt;SERP data&lt;/strong&gt; for rank tracking or brand monitoring, handling complex layouts automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Pipeline 3 – Manual / General Mode
&lt;/h2&gt;

&lt;p&gt;Sometimes you need to scrape a unique dashboard or a niche directory, or you just want to debug the raw HTML yourself. That’s why we included the &lt;strong&gt;"Manual"&lt;/strong&gt; path.&lt;/p&gt;

&lt;p&gt;If you select "General / Other" in the form, you are presented with a secondary form offering 5 raw tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Browser HTML:&lt;/strong&gt; Returns the full rendered DOM (great for the &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf#:~:text=Step%203%3A%20Extract%20the%20HTML%20content"&gt;custom parsing logic we built in Part 1&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Response Body:&lt;/strong&gt; Useful for API endpoints.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Capture:&lt;/strong&gt; Intercepts background XHR/Fetch requests (&lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365#network-capture"&gt;as we learned in Part 2&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infinite Scroll:&lt;/strong&gt; Automatically scrolls to the bottom before capturing HTML. (&lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365#infinite-scroll"&gt;see the infinite scroll guide in Part 2&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot:&lt;/strong&gt; Returns a PNG snapshot of the page. (&lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365#screenshots"&gt;view the setup steps here&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;
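
&lt;p&gt;Each option corresponds to a Zyte API flag we used in Part 2. A hedged sketch of that mapping as a Code node (the form field names are assumptions; the flags themselves come from the Part 2 examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Manual option -&amp;gt; Zyte API body (sketch; flags as used in Part 2)
const optionToBody = {
  'Browser HTML':       { browserHtml: true },
  'HTTP Response Body': { httpResponseBody: true },
  'Network Capture':    { browserHtml: true, networkCapture: [
    { filterType: 'url', value: '/api/', matchType: 'contains', httpResponseBody: true }
  ] },
  'Infinite Scroll':    { browserHtml: true, actions: [{ action: 'scrollBottom' }] },
  'Screenshot':         { screenshot: true }
};

const choice = $json['Manual Option']; // assumed form field name
return [{ json: { url: $json['Target URL'], ...optionToBody[choice] } }];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;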

&lt;p&gt;This ensures your scraping tool never leaves you stuck, even on the most obscure websites.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Result &amp;amp; Output&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Regardless of which pipeline you choose (AI, SERP, or Manual), all data converges at a final &lt;strong&gt;Data Collector&lt;/strong&gt; node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v2klg4a7j20laopexq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v2klg4a7j20laopexq3.png" alt="AI Scrape, General, Image - n8n output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use a &lt;strong&gt;Convert to File&lt;/strong&gt; node to transform that JSON into a clean &lt;strong&gt;CSV file&lt;/strong&gt; or &lt;strong&gt;Image file&lt;/strong&gt; (for screenshots), ready for download directly in the browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6cabe6x1notavtkejep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6cabe6x1notavtkejep.png" alt="n8n scrape results "&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h3&gt;
  
  
  Get the Workflow
&lt;/h3&gt;

&lt;p&gt;We have packaged this entire logic (the forms, the smart routing, the crawler loops, and the safety checks) into a single template you can import right now from the n8n community.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://n8n.io/workflows/11637-automate-data-extraction-with-zyte-ai-products-jobs-articles-and-more/" rel="noopener noreferrer"&gt;&lt;strong&gt;Automate Data Extraction with Zyte AI (Products, Jobs, Articles &amp;amp; More)&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;And that’s a wrap on our n8n web scraping series! 🎬&lt;/p&gt;

&lt;p&gt;From building your first simple scraper in &lt;strong&gt;Part 1&lt;/strong&gt;, to mastering pagination in &lt;strong&gt;Part 2&lt;/strong&gt;, we have now arrived at the ultimate goal: an &lt;strong&gt;Intelligent Scraper&lt;/strong&gt; that adapts to the web so you don't have to.&lt;/p&gt;

&lt;p&gt;You now have a tool that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gets You Data With Ease:&lt;/strong&gt; Automatically extracts structured fields (like prices, images, and articles) without you needing to hunt for CSS selectors or manage CAPTCHAs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduces Maintenance:&lt;/strong&gt; Adapts to layout changes automatically.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gives You Control:&lt;/strong&gt; Lets you switch between AI automation and manual debugging instantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This template is ready for you to fork, modify, and deploy.&lt;/p&gt;

&lt;p&gt;Thanks for joining us on this journey! If you build something cool, or if you run into a challenge that stumps you, come share it in the &lt;a href="https://discord.com/invite/eN83rMWqAt" rel="noopener noreferrer"&gt;&lt;strong&gt;Extract Data Community&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy scraping! 🚀🕷️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webscraping</category>
      <category>data</category>
      <category>python</category>
    </item>
    <item>
      <title>n8n Web Scraping || Part 2: Pagination, Infinite Scroll, Network Capture &amp; More</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Mon, 17 Nov 2025 18:15:13 +0000</pubDate>
      <link>https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365</link>
      <guid>https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365</guid>
<description>&lt;p&gt;This is Part 2 of our n8n web scraping series with &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;. If you’re new here, check out &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf"&gt;Part 1&lt;/a&gt; first; it covers the basics: &lt;code&gt;fetching pages&lt;/code&gt;, &lt;code&gt;extracting HTML&lt;/code&gt; with the &lt;em&gt;HTML node&lt;/em&gt;, &lt;code&gt;cleaning&lt;/code&gt; + &lt;code&gt;normalizing results&lt;/code&gt;, &amp;amp; exporting &lt;code&gt;CSV/JSON&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Table of Contents
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Pagination&lt;/li&gt;
&lt;li&gt;Infinite Scroll&lt;/li&gt;
&lt;li&gt;Geolocation support&lt;/li&gt;
&lt;li&gt;Screenshots from browser rendering&lt;/li&gt;
&lt;li&gt;Capturing network requests&lt;/li&gt;
&lt;li&gt;Handling cookies, sessions, headers &amp;amp; IP type&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Let’s Begin!
&lt;/h2&gt;

&lt;p&gt;In this part, we’ll explore some important scraping practices and nodes, along with a few hands-on tricks that make your web scraping journey smoother.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Everything you learn here will also lay the foundation for our 3rd &amp;amp; final part, where we will build a universal scraper capable of scraping any website with minimal configuration.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s start by taking the same workflow we built in Part 1 &amp;amp; extending it, beginning with Pagination and Infinite Scroll.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8874utbkb31wxjubie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8874utbkb31wxjubie.png" alt="N8N Scraping Workflow" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Pagination across pages
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A website can navigate in multiple ways &amp;amp; our scraper needs to adapt accordingly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;n8n gives us a default &lt;a href="https://docs.n8n.io/code/cookbook/http-node/pagination/" rel="noopener noreferrer"&gt;Pagination Mode&lt;/a&gt; inside the HTTP Request node under Options, and while it sounds convenient, it didn’t behave reliably in my experience for typical web scraping use cases.&lt;/p&gt;

&lt;p&gt;After testing several patterns, &lt;em&gt;the approach below is the one that has worked most consistently in my workflows.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💬 If you’re stuck or want to share your own approach, let’s discuss it in the &lt;a href="https://discord.com/invite/eN83rMWqAt" rel="noopener noreferrer"&gt;&lt;strong&gt;Extract Data Discord&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 1: Page Manager Node
&lt;/h4&gt;

&lt;p&gt;Before calling the HTTP Request node, we introduce a small Code node called &lt;strong&gt;Page Manager&lt;/strong&gt;, which does exactly what the name suggests: it controls the page number.&lt;/p&gt;

&lt;p&gt;Add a Code node (JavaScript) and paste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Page Manager Function ( We use this node as both starter and incrementer)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MAX_PAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// n8n provides `items` array. If no items =&amp;gt; first run&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="c1"&gt;// Check for .json.page (from this node's first run)&lt;/span&gt;
  &lt;span class="c1"&gt;// OR .json.Page (from the Normalizer node's output)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;undefined&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nc"&gt;Number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// first run (still 1)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_PAGE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="c1"&gt;// safety stop&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On the first run, it starts with &lt;code&gt;page = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Every time the loop returns here, it increments to the next page.&lt;/li&gt;
&lt;li&gt;There’s a built-in safety limit &lt;code&gt;MAX_PAGE&lt;/code&gt; so you don’t accidentally loop forever. (Adjust it accordingly.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F719ovxjuzauj6u0o1d1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F719ovxjuzauj6u0o1d1x.png" alt="Scraping Function N8N" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now update the URL in the old &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf#:~:text=Step%202%3A%20Add%20an%20HTTP%20Request%20Node"&gt;HTTP Request node&lt;/a&gt; to use the page variable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;code&gt;https://books.toscrape.com/catalogue/page-{{ $json.page }}.html&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9j84u81ohzikcb7iiag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9j84u81ohzikcb7iiag.png" alt="Pagination Scraping URL" width="800" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes the node fetch the correct page each time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The rest of the workflow remains the same, up to the second HTML Extract node (where we parsed the book name, URL, price, rating, etc. in Part 1).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 2: Modify the Normalizer Function Node to Save Results Across Pages
&lt;/h4&gt;

&lt;p&gt;In Part 1, our &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf#:~:text=Step%207%3A%20Clean%20and%20normalize%20the%20data"&gt;Step 7&lt;/a&gt; code simply cleaned and normalized items for one page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now we need it to do two things:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Normalize the results (same as before)&lt;/li&gt;
&lt;li&gt;Store the results from every page inside n8n’s global static data bucket. Think of it like temporary workflow memory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Update the node’s code with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// --- Normalizer (Code node) ---&lt;/span&gt;
&lt;span class="c1"&gt;// Get the global workflow static data bucket&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$getWorkflowStaticData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;global&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// initialize storage if needed&lt;/span&gt;
&lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="c1"&gt;// normalization logic (kept minimal version)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://books.toscrape.com/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;urlRel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imgRel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratingClass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ratingClass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;urlRel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(\.\.\/)&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;imgRel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(\.\.\/)&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;availability&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nx"&gt;rating&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// append to global storage&lt;/span&gt;
&lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// return control info for IF node (not the items)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;currentPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Page Manager&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;itemsFound&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;nextHref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nextHref&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentPage&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7fs6bgzf794rja4wy0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7fs6bgzf794rja4wy0t.png" alt="Save Data N8N Function" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We normalize the data exactly like Part 1.&lt;/li&gt;
&lt;li&gt;Then we push all normalized items into &lt;code&gt;workflowStaticData.workBooks&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Instead of returning the items themselves, we return only a small control object.&lt;/li&gt;
&lt;li&gt;This object is used by the &lt;code&gt;IF node&lt;/code&gt; to decide whether we continue scraping or stop.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Step 3: IF Node (Stop Scraping or Continue)
&lt;/h4&gt;

&lt;p&gt;Add an &lt;a href="https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.if/" rel="noopener noreferrer"&gt;&lt;code&gt;IF node&lt;/code&gt;&lt;/a&gt; with two conditions and &lt;code&gt;OR&lt;/code&gt; Type:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Condition 1:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;{{ $json.itemsFound }}&lt;/code&gt; is equal to &lt;code&gt;0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Meaning → The current page returned no items → we’ve reached the end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Condition 2:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;{{ $json.Page }}&lt;/code&gt; is greater than or equal to &lt;code&gt;YOUR_MAX_PAGE&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Meaning → Stop when you reach the max page number you set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5c3w4xg3660vc40djo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5c3w4xg3660vc40djo.png" alt="If loop node n8n" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Together these conditions help the workflow decide:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IF → True&lt;/strong&gt;&lt;br&gt;
Stop scraping and move to the export step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IF → False&lt;/strong&gt;&lt;br&gt;
Go back to the &lt;code&gt;Page Manager&lt;/code&gt;, increment the page number, and keep scraping.&lt;/p&gt;

&lt;p&gt;This creates a complete and safe pagination loop.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4: Collect All Results and Export
&lt;/h4&gt;

&lt;p&gt;When the IF node returns True, add one more small Code node before the Convert To File node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Get the global data&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$getWorkflowStaticData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;global&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Get the array of books, or an empty array if it doesn't exist&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allBooks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="c1"&gt;// Return all the books as standard n8n items&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;allBooks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;book&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynjomyhxecffj4klmpab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynjomyhxecffj4klmpab.png" alt="Data Scraping N8N" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this one does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pulls everything we stored in the temporary memory.&lt;/li&gt;
&lt;li&gt;Returns it as normal n8n items.&lt;/li&gt;
&lt;li&gt;These go straight into Convert To File → CSV.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;And that’s the entire pagination workflow.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjdz86w1ynumt4xj1a3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjdz86w1ynumt4xj1a3x.png" alt="Pagination in N8N" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Infinite Scroll
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;This one is much simpler.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some websites load content as you scroll; there are no traditional page numbers.&lt;br&gt;
The &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; supports browser actions, which makes this easy.&lt;/p&gt;

&lt;p&gt;Just add one line to our original &lt;strong&gt;cURL command&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://quotes.toscrape.com/scroll", "browserHtml": true, "actions": [{ "action": "scrollBottom" }]}' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6028qtzxo3b3klmo3llu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6028qtzxo3b3klmo3llu.png" alt="Infinite Scroll in N8N" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zyte API loads the page in a headful browser session.&lt;/li&gt;
&lt;li&gt;It scrolls to the bottom, triggering all JavaScript that loads additional items.&lt;/li&gt;
&lt;li&gt;Then it returns the final, fully loaded &lt;code&gt;browserHtml&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;You can parse this HTML normally using the same nodes from Part 1.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Geolocation
&lt;/h2&gt;

&lt;p&gt;Some websites return different data depending on your region.&lt;br&gt;
Zyte API makes this super simple by allowing you to specify a geolocation.&lt;/p&gt;

&lt;p&gt;Use this inside an HTTP Request node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "http://ip-api.com/json", "browserHtml": true, "geolocation": "AU" }' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F638etacjkchl2641595w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F638etacjkchl2641595w.png" alt="Geolocation Scraping" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting &lt;code&gt;"geolocation": "AU"&lt;/code&gt; makes Zyte perform the browser request from that region; check the list of all available &lt;a href="https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation/?utm_campaign=Discord_geo&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;CountryCodes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Many websites serve region-based content (pricing, currencies, language, product availability), so this is extremely helpful.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Screenshots
&lt;/h2&gt;

&lt;p&gt;If you’d like to grab a screenshot of what the browser rendered, you can do that too.&lt;/p&gt;

&lt;p&gt;cURL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://toscrape.com", "screenshot": true }' \
   https://api.zyte.com/v1/extract

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;It will return the screenshot as &lt;code&gt;Base64&lt;/code&gt; data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft288ynokh6slmc89w6g8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft288ynokh6slmc89w6g8.png" alt="Base64 Scraping" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To convert it into a proper image (PNG, JPEG, etc.) → Use &lt;strong&gt;Convert To File&lt;/strong&gt; node in n8n.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1ffw7a2ljtl3dd7e4yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1ffw7a2ljtl3dd7e4yu.png" alt="Scraping Screenshot" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;n8n often converts boolean values like &lt;code&gt;true&lt;/code&gt; into &lt;code&gt;"true"&lt;/code&gt; when importing via cURL.&lt;br&gt;
Fix it by clicking the gear icon → &lt;strong&gt;Add Expression&lt;/strong&gt; → &lt;code&gt;{{true}}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3b4gelimpj8u34ydx2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3b4gelimpj8u34ydx2g.png" alt="Field Scraping" width="800" height="1273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or switch body mode to &lt;strong&gt;Using JSON&lt;/strong&gt; and paste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://toscrape.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"screenshot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futluvudf5o2ja7fhfojm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futluvudf5o2ja7fhfojm.png" alt="JSON Scraping" width="800" height="1261"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Network Capture
&lt;/h2&gt;

&lt;p&gt;Many modern websites load content through background API calls rather than raw HTML.&lt;br&gt;
You can capture that network activity while the page renders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://quotes.toscrape.com/scroll", "browserHtml": true,  "networkCapture": [
        {
            "filterType": "url",
            "httpResponseBody": true,
            "value": "/api/",
            "matchType": "contains"
        }]}' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns a &lt;code&gt;networkCapture&lt;/code&gt; array with all responses whose URL contains &lt;code&gt;/api/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5r4gbnbcfgrteaqz9r5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5r4gbnbcfgrteaqz9r5.png" alt="Network Capture Scraping" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the Parameters Above&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;filterType: "url"&lt;/code&gt; ⟶ filter network requests by URL&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;value: "/api/"&lt;/code&gt; ⟶ look for URLs containing &lt;code&gt;/api/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;matchType: "contains"&lt;/code&gt; ⟶ how the value is matched against the URL&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;httpResponseBody: true&lt;/code&gt; ⟶ include the response body (Base64)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Extracting data from the captured network response&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can decode the Base64 response in two easy ways:&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;1. Using a Function node (Python)&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;&lt;em&gt;(You can also use JS if you prefer)&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get the network capture data
&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkCapture&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Decode base64 and parse JSON
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;decoded_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;httpResponseBody&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decoded_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Return the result
&lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;firstAuthor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg36qwxllt7j0rh3r82ez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg36qwxllt7j0rh3r82ez.png" alt="Decode Base64" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ This method decodes the &lt;code&gt;Base64&lt;/code&gt; encoded HTTP response, parses it as JSON, and gives you structured data directly; it's very reliable and readable.&lt;/p&gt;

&lt;h5&gt;
  
  
&lt;strong&gt;2. Using the Edit Fields Node (No Code)&lt;/strong&gt;
&lt;/h5&gt;

&lt;blockquote&gt;
&lt;p&gt;Even in this no-code method, you still need to decode and parse the data - the expression below does both&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Add an &lt;strong&gt;Edit Fields&lt;/strong&gt; node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mode:&lt;/strong&gt; Add Field&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; decodedData&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type:&lt;/strong&gt; String&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ $json.networkCapture[0].httpResponseBody.base64Decode().parseJson() }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54z28rdgbmagj6ex253v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54z28rdgbmagj6ex253v.png" alt="Decode Base64 in N8N" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ This takes the Base64 content, decodes it, parses JSON, and puts the result under &lt;code&gt;decodedData&lt;/code&gt; automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cookies, sessions, headers &amp;amp; IP type (quick guide)
&lt;/h2&gt;

&lt;p&gt;When you move from toy sites to real sites, a few extra controls matter a lot: which IP type you use, whether you keep a session, and what cookies or headers you send.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; exposes all these as request fields and you can use them the same way we used &lt;code&gt;browserHtml&lt;/code&gt;, &lt;code&gt;networkCapture&lt;/code&gt; or &lt;code&gt;actions&lt;/code&gt; above (via &lt;code&gt;curl&lt;/code&gt; → &lt;code&gt;Import in n8n HTTP Request node&lt;/code&gt; → Adjust Fields as needed → Extract).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To keep this guide focused, we won’t dive into code examples for every field here, but here’s one small one, &lt;em&gt;setting a cookie and getting it back&lt;/em&gt; (&lt;code&gt;requestCookies&lt;/code&gt;), just to show how it integrates.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cookies&lt;/strong&gt; (&lt;em&gt;via&lt;/em&gt; &lt;code&gt;requestCookies&lt;/code&gt; / &lt;code&gt;responseCookies&lt;/code&gt;)
➜ Useful when a website relies on cookies for preferences, language, or maintaining continuity between requests.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{ "url": "https://httpbin.org/cookies", "browserHtml": true,
    "requestCookies": [{ "name": "foo",
            "value": "bar",
            "domain": "httpbin.org"
        }]
}' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7x0xeg6y7gvd4n01r6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7x0xeg6y7gvd4n01r6f.png" alt="Manage Scraping Cookies" width="800" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⟶ This example uses &lt;code&gt;requestCookies&lt;/code&gt;, but &lt;code&gt;responseCookies&lt;/code&gt; works the same way: you simply read cookies from one request and pass them into the next.&lt;/p&gt;
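
&lt;p&gt;For instance, to read back the cookies a site sets, you can ask for them with the &lt;code&gt;responseCookies&lt;/code&gt; field - a minimal sketch, assuming the boolean &lt;code&gt;responseCookies&lt;/code&gt; request field (see the docs below for the exact reference):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{ "url": "https://httpbin.org/cookies/set?foo=bar", "browserHtml": true, "responseCookies": true }' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;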

&lt;p&gt;Learn more on &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#cookies/?utm_campaign=Discord_cookies&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;cookies&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Everything else below (sessions, ipType, custom headers) plugs in the same way.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sessions&lt;/strong&gt;&lt;br&gt;
➜ Sessions bundle the IP address, cookie jar, and network settings so multiple requests look consistently related. &lt;em&gt;Helpful for multi-step interactions, region-based content, or sites that hate stateless scraping.&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#sessions/?utm_campaign=Discord_sessions&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Sessions&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Headers&lt;/strong&gt;&lt;br&gt;
➜ Add a User-Agent, Referer, or any custom metadata the target site expects: simply define them inside the &lt;code&gt;HTTP Request node&lt;/code&gt; headers.&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#response-headers/?utm_campaign=Discord_headers&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Headers&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IP Type&lt;/strong&gt; (&lt;code&gt;datacenter&lt;/code&gt; vs &lt;code&gt;residential&lt;/code&gt;)&lt;br&gt;
➜ Some sites vary content based on IP type. Zyte API automatically selects the best option, but you can override it with &lt;code&gt;ipType&lt;/code&gt;, as shown in the sketch below.&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#ip-type/?utm_campaign=Discord_Iptype&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;IP Types&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
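
&lt;p&gt;For example, forcing a residential IP is a one-field change to the same request pattern we used for geolocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{ "url": "http://ip-api.com/json", "browserHtml": true, "ipType": "residential" }' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;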

&lt;blockquote&gt;
&lt;p&gt;All of these follow the same pattern we’ve already used above.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Where This Takes Us Next
&lt;/h2&gt;

&lt;p&gt;And that’s it for Part 2! 🎉&lt;/p&gt;

&lt;p&gt;We covered a lot more than just pagination: from infinite scroll &amp;amp; geolocation to screenshots, network capture, and the key request fields you’ll use while scraping sites.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What we learned isn’t a complete workflow on its own, but it builds the foundation you’ll use again and again in your scraping workflows.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Part 3, we’ll take everything one step further and combine these patterns into a universal scraper: a reusable, configurable template that can adapt to almost any site with minimal changes.&lt;/p&gt;

&lt;p&gt;Thanks for following along, and feel free to share your workflow, questions, or improvements in the &lt;a href="https://discord.com/invite/eN83rMWqAt" rel="noopener noreferrer"&gt;Extract Data Community&lt;/a&gt;. &lt;br&gt;
Happy scraping! 🕸️✨&lt;/p&gt;

</description>
      <category>programming</category>
      <category>webscraping</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Inside Common Crawl: The Dataset Behind AI Models (and Its Real World Limits)</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Thu, 30 Oct 2025 05:19:09 +0000</pubDate>
      <link>https://dev.to/extractdata/inside-common-crawl-the-dataset-behind-ai-models-and-its-real-world-limits-2eo2</link>
      <guid>https://dev.to/extractdata/inside-common-crawl-the-dataset-behind-ai-models-and-its-real-world-limits-2eo2</guid>
      <description>&lt;p&gt;You've probably heard that LLMs are "trained on data from the web." But have you ever wondered how they actually get that data?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did developers write scrapers to crawl the entire internet - &lt;em&gt;building a massive web scraping solution from scratch?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For many, the answer is simple: Common Crawl.&lt;/p&gt;

&lt;p&gt;Let’s explore what it is, how it fuels a significant portion of the AI world, and when to use it instead of building your own scraper.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Common Crawl? 🤔
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://commoncrawl.org/" rel="noopener noreferrer"&gt;Common Crawl&lt;/a&gt; is a non profit organization that has been crawling the web since 2008. Its mission is to provide free, large scale, publicly available archives of web data for researchers, developers, and organizations worldwide.&lt;/p&gt;

&lt;p&gt;Think of it as a massive, open source library of the internet. Every month, its crawler, &lt;code&gt;CCBot&lt;/code&gt;, scans billions of pages and archives them. &lt;/p&gt;

&lt;p&gt;For instance, the August 2025 crawl added 2.42 billion pages, totaling over 419 TiB of data! This data is stored in Amazon S3 buckets and is accessible to anyone for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Common Crawl Matters for AI
&lt;/h2&gt;

&lt;p&gt;A &lt;a href="https://www.mozillafoundation.org/en/blog/Mozilla-Report-How-Common-Crawl-Data-Infrastructure-Shaped-the-Battle-Royale-over-Generative-AI/" rel="noopener noreferrer"&gt;2024 Mozilla report&lt;/a&gt; found that 2/3 of 47 generative LLMs released between 2019-2023 relied on Common Crawl data.&lt;/p&gt;

&lt;p&gt;Today, &lt;a href="https://en.wikipedia.org/wiki/List_of_large_language_models" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt; lists over 80 public LLMs, and aggregators like &lt;a href="https://openrouter.ai/models" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; host 500+ models.&lt;br&gt;
Even if not all disclose their datasets, Common Crawl (and its derivatives like RefinedWeb) are &lt;em&gt;still among the most cited sources.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💬 In short: if you’ve used a modern LLM, you’ve indirectly used data from Common Crawl.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  How Common Crawl Organizes Data
&lt;/h2&gt;

&lt;p&gt;The data is offered in three main formats:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web Archive (WARC) files&lt;/strong&gt; - the rawest form, containing full HTTP responses (headers, HTML, etc.). Perfect if you need images, HTML parsing, or complete page reconstruction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web Archive Transformations (WAT files)&lt;/strong&gt; - they’re like summaries of WARC files. They contain metadata in JSON format, such as all the links on the page, HTTP headers, and response codes. This is useful if you don’t need the full page, but want structured information like which URLs link to which pages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web Extracted Text (WET files)&lt;/strong&gt; - plain text extracted from WARC files (no HTML or media). Ideal for NLP or training text-based models; see the short sketch after this list.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
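
&lt;p&gt;As a quick illustration of the WET format, here’s a minimal sketch that reads the plain-text records from a locally downloaded WET file (the filename is hypothetical; in warcio, WET text records have the &lt;code&gt;conversion&lt;/code&gt; record type):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from warcio.archiveiterator import ArchiveIterator

# Iterate a locally downloaded WET file (hypothetical filename)
with open("example.warc.wet.gz", "rb") as f:
    for record in ArchiveIterator(f):
        if record.rec_type == "conversion":  # WET plain-text records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            print(url, text[:200])
            break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;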
&lt;h2&gt;
  
  
  How to Fetch a Page from Common Crawl
&lt;/h2&gt;

&lt;p&gt;Fetching archived pages from Common Crawl is easier than it sounds.&lt;br&gt;
It’s a simple three-step process, and the logic is the same no matter what website you’re looking at.&lt;/p&gt;

&lt;p&gt;I’ve been researching good graphics cards, and while browsing I found this collection page: &lt;a href="https://computerorbit.com/collections/graphics-cards" rel="noopener noreferrer"&gt;https://computerorbit.com/collections/graphics-cards&lt;/a&gt;. It lists all sorts of GPUs and variants.&lt;/p&gt;

&lt;p&gt;Now I was curious, how does this page look inside Common Crawl’s archives?&lt;/p&gt;

&lt;p&gt;Let’s try to find and fetch its archived version.&lt;/p&gt;

&lt;p&gt;Import these first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;warcio.archiveiterator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ArchiveIterator&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: Finding the Available Common Crawl Indexes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before fetching any page, we first need to know which crawl (index) it belongs to.&lt;br&gt;
Common Crawl organizes its data into periodic crawls, for example, &lt;code&gt;CC-MAIN-2025-33&lt;/code&gt; or &lt;code&gt;CC-MAIN-2025-19&lt;/code&gt;. Each crawl ID encodes a time period: the trailing number is the ISO week of the year, so &lt;code&gt;CC-MAIN-2025-33&lt;/code&gt; is the crawl from week 33 of 2025.&lt;/p&gt;

&lt;p&gt;So first, we’ll fetch a list of available indexes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Step 1: Find a valid Common Crawl index ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_available_indexes&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetches the list of all available Common Crawl index collections.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Fetching list of available Common Crawl indexes...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;collections_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://index.commoncrawl.org/collinfo.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collections_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;cdx_indexes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cdx-api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cdx-api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;cdx_indexes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[+] Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdx_indexes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; available indexes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cdx_indexes&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[!] Error fetching collection info: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what's happening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We make a simple &lt;code&gt;HTTP&lt;/code&gt; request to &lt;code&gt;index.commoncrawl.org&lt;/code&gt; to get the list of all crawl indexes.&lt;/li&gt;
&lt;li&gt;The response includes all the &lt;strong&gt;CDX API URLs&lt;/strong&gt;; those are the entry points to query each crawl.&lt;/li&gt;
&lt;li&gt;We then sort them in reverse (newest first), so we always check the latest crawls first.&lt;/li&gt;
&lt;/ul&gt;
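
&lt;p&gt;Each entry in &lt;code&gt;collinfo.json&lt;/code&gt; looks roughly like this (illustrative; the code above keeps only the &lt;code&gt;cdx-api&lt;/code&gt; field):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "CC-MAIN-2025-33",
  "name": "August 2025 Index",
  "cdx-api": "https://index.commoncrawl.org/CC-MAIN-2025-33-index"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;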

&lt;p&gt;💡 Think of this step as checking the table of contents of a huge web archive library.&lt;/p&gt;

&lt;p&gt;It doesn’t give us the actual page yet - it only tells us which index might contain our target page.&lt;br&gt;
Once we locate that index, we’ll move to &lt;strong&gt;&lt;code&gt;data.commoncrawl.org&lt;/code&gt;&lt;/strong&gt; to download the actual HTML in the next step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Querying the Common Crawl Index&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have the list of indexes, we’ll search for our target URL in one of them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Step 2: Query the index to find the page data for the target URL ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cc_captures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Queries a specific Common Crawl Index for captures of a URL.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Querying index: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filter&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=status:200&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filename,offset,length,timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[-] No captures found for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; in this index.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;captures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[+] Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; captures in this index.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;captures&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[-] Warning: Could not query index &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what we’re doing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We query one specific index to find captures of our target URL.&lt;/li&gt;
&lt;li&gt;The parameters tell Common Crawl what we want:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;url&lt;/code&gt;: The target page.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;output=json&lt;/code&gt;: Return metadata in JSON format (not the page itself).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;filter: "=status:200"&lt;/code&gt;: Only include successful (HTTP 200) responses.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fl&lt;/code&gt;: The specific fields we need (filename, offset, length, timestamp).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The response gives us metadata about where the actual HTML is stored.&lt;/p&gt;

&lt;p&gt;Each result tells us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which WARC file to fetch (&lt;code&gt;filename&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The byte range of our page in that file (&lt;code&gt;offset&lt;/code&gt;, &lt;code&gt;length&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;When it was captured (&lt;code&gt;timestamp&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
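
&lt;p&gt;A single returned record looks roughly like this (illustrative values; note that &lt;code&gt;offset&lt;/code&gt; and &lt;code&gt;length&lt;/code&gt; come back as strings, which is why the code converts them with &lt;code&gt;int()&lt;/code&gt; later):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "filename": "crawl-data/CC-MAIN-2025-33/segments/.../warc/...-00123.warc.gz",
  "offset": "123456789",
  "length": "24567",
  "timestamp": "20250812060102"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;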

&lt;blockquote&gt;
&lt;p&gt;💡 This step was like finding a book in a massive library.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Retrieving the Archived Content&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we know where our page lives, let’s fetch it.&lt;br&gt;
Common Crawl stores everything in large WARC files on Amazon S3, but we don’t want to download a whole archive file just to read one page.&lt;/p&gt;

&lt;p&gt;So instead, we’ll use a byte-range request to fetch just the part that contains our page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Step 3: Download the raw HTML from the archive (No changes needed) ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_html_from_capture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Downloads a specific record from a WARC file and returns its HTML.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;offset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;length&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;s3_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://data.commoncrawl.org/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;range_header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytes=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Fetching data from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; with range &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;range_header&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Range&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;range_header&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;temp_filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp_warc.gz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Saving downloaded chunk to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;temp_filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iter_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[+] Successfully saved data to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;temp_filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Decompressing and parsing WARC record from local file...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;archive_iterator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ArchiveIterator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;archive_iterator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rec_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[+] Successfully parsed WARC record.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content_stream&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[!] An error occurred during the process: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Each capture record tells us which &lt;code&gt;WARC file&lt;/code&gt; our page is stored in (&lt;code&gt;filename&lt;/code&gt;) and where inside that file (&lt;code&gt;offset&lt;/code&gt;, &lt;code&gt;length&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;We build the &lt;code&gt;S3 URL&lt;/code&gt; using those values and add a Range header so we only download that small slice of the file.&lt;/li&gt;
&lt;li&gt;The downloaded chunk is temporarily saved locally as &lt;code&gt;temp_warc.gz&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;We then open it with &lt;code&gt;ArchiveIterator&lt;/code&gt;, which allows us to read the compressed archive and extract the HTML from the response record.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Now we’re finally opening the book and reading the page we were looking for, without carrying the whole library home.&lt;/p&gt;

&lt;p&gt;So, Step 1 and 2 tell us where to look, and Step 3 actually retrieves the HTML of the page, efficiently and precisely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It All Together
&lt;/h2&gt;

&lt;p&gt;Now let’s tie everything together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Main execution block ---
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Example target (replace with your site)
&lt;/span&gt;    &lt;span class="n"&gt;target_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;computerorbit.com/collections/graphics-cards&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="n"&gt;all_indexes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_available_indexes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;all_indexes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[!] Could not retrieve list of Common Crawl indexes. Exiting.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;captures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="c1"&gt;# Try the 5 most recent indexes until a result is found
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index_url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_indexes&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;captures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_cc_captures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Success! Found captures in index &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;latest_capture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Using most recent capture from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latest_capture&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_html_from_capture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latest_capture&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[+] Saved archived HTML as page.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[!] No captures found for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; in recent crawls.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run this, you’ll get the archived HTML of your target page saved locally as &lt;code&gt;page.html&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can open that file, inspect its contents, or later build a parser around it to extract specific data (like product names or article text).&lt;/p&gt;
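
&lt;p&gt;For example, here’s a minimal parsing sketch using BeautifulSoup (assuming &lt;code&gt;beautifulsoup4&lt;/code&gt; is installed; real selectors will depend on the site’s markup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from bs4 import BeautifulSoup

# Quick sanity check on the archived page we just saved
with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

print(soup.title.string if soup.title else "no title found")
print(len(soup.find_all("a")), "links on the page")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;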

&lt;h2&gt;
  
  
  🖼️ Sample Output (Common Crawl)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukq9ig9ivbdq74tmp0xp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukq9ig9ivbdq74tmp0xp.png" alt=" " width="800" height="829"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The output you get here is historical: it reflects how the page looked when Common Crawl last captured it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For live, real time data, we’ll now look at how this compares with a scraper built using &lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte’s API&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Crawl vs. Building Your Own Scrapers: Which Should You Use?
&lt;/h2&gt;

&lt;p&gt;Common Crawl gives you access to web data at scale without worrying about proxies or blocks. Perfect for analysis, research, or benchmarking.&lt;/p&gt;

&lt;p&gt;However, it comes with its own set of challenges.&lt;/p&gt;

&lt;p&gt;The biggest one? Freshness.&lt;/p&gt;

&lt;p&gt;What if you want fresh, real time data - for example, to check which new graphics cards were added &lt;em&gt;this week&lt;/em&gt; or their &lt;em&gt;latest prices&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;your own scraper&lt;/strong&gt; makes all the difference.&lt;/p&gt;

&lt;p&gt;Let’s build a quick scraper for the same page and get the latest structured data instantly. We’ll use the &lt;code&gt;auto extract&lt;/code&gt; feature - learn more about it here: &lt;em&gt;&lt;a href="https://docs.zyte.com/zyte-api/usage/extract/#zyte-api-automatic-extraction" rel="noopener noreferrer"&gt;Zyte API automatic extraction&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ZYTE_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;product_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://computerorbit.com/collections/graphics-cards&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;api_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.zyte.com/v1/extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;productList&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;productList&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;output_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;z_products.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Products saved to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🖼️ Sample Output (Fresh Data Scraper)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngfyy1lu4gxq7our5ooe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngfyy1lu4gxq7our5ooe.png" alt=" " width="800" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, you can clearly see the fresh products and updated prices that weren’t present in Common Crawl’s archive.&lt;/p&gt;

&lt;p&gt;And while freshness is one of the biggest challenges with Common Crawl, it’s not the only one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Challenges with Common Crawl
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Duplicate Data&lt;/strong&gt;&lt;br&gt;
Common Crawl captures the same pages across multiple crawls, sometimes hundreds of times. This means a lot of duplicate data that needs deduplication before use (a minimal sketch follows below).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;“&lt;a href="https://www.deepseek.com/en" rel="noopener noreferrer"&gt;DeepSeek&lt;/a&gt; alone removed nearly 90% of repeated content across 91 Common Crawl dumps, just so it could train on high quality, diverse text.”  &lt;a href="https://arxiv.org/html/2401.02954v1#:~:text=Section%C2%A06.-,2,Deduplication%20ratios%20for%20various%20Common%20Crawl%20dumps.,-In%20the%20filtering" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
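&lt;p&gt;Here’s a minimal sketch of that deduplication step (assuming you’ve already parsed captures into URL + HTML pairs; real pipelines usually normalize URLs and hash cleaned text instead):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

def dedupe(records):
    """Keep only the first occurrence of each (URL, content digest) pair."""
    seen = set()
    unique = []
    for url, html in records:
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if (url, digest) not in seen:
            seen.add((url, digest))
            unique.append((url, html))
    return unique

# The same page captured in two different crawls collapses to one record
records = [("https://example.com/page", "same body"),
           ("https://example.com/page", "same body")]
print(len(dedupe(records)))  # 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;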

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Messy Data&lt;/strong&gt;&lt;br&gt;
WARC files often contain ads, cookie banners, or partial HTML responses. You’ll need heavy preprocessing and filtering to get clean text or structured data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;br&gt;
Common Crawl data is measured in &lt;em&gt;petabytes&lt;/em&gt;. Great for large research labs, but not always practical for smaller projects or individual developers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bias&lt;/strong&gt;&lt;br&gt;
Crawl frequency and seed URLs shape what gets captured.&lt;br&gt;&lt;br&gt;
So, some domains or regions are overrepresented while others barely appear.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Together, these make Common Crawl an incredible but challenging dataset: excellent for research and experimentation, but rarely “&lt;em&gt;plug and play&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;And that’s where &lt;strong&gt;your own scrapers&lt;/strong&gt; really shine. Let’s compare the two approaches side by side.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Use Common Crawl if...&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Use Scraping APIs / Your Own Crawlers if...&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;You need &lt;strong&gt;vast amounts of raw, non specific data&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;You need &lt;strong&gt;fresh, up to date data&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You’re doing &lt;strong&gt;research, LLM pretraining&lt;/strong&gt;, or large scale analysis&lt;/td&gt;
&lt;td&gt;You want &lt;strong&gt;consistent completeness&lt;/strong&gt; (e.g., every product or listing captured)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You want access to &lt;strong&gt;historical web archives&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;You need &lt;strong&gt;structured data outputs&lt;/strong&gt; like JSON or CSV&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You’re exploring &lt;strong&gt;academic or experimental projects&lt;/strong&gt; that don’t require perfection&lt;/td&gt;
&lt;td&gt;You’re &lt;strong&gt;targeting specific sites or datasets&lt;/strong&gt; for production use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Common Crawl is one of the most fascinating resources on the web - a time capsule of billions of pages, freely available for anyone to explore. It’s the foundation of countless research projects, datasets, and even large language models.&lt;/p&gt;

&lt;p&gt;But as we’ve seen, it’s not perfect.&lt;/p&gt;

&lt;p&gt;Its data is archived, not live, and working with it often means dealing with duplicates, noise, and scale. That’s fine if your goal is analysis or experimentation, but not if you need production grade, real time insights.&lt;/p&gt;

&lt;p&gt;That’s where modern scraping solutions like &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte’s API&lt;/a&gt; make all the difference. Instead of wading through terabytes of historical data, you can fetch fresh, structured, ready to use information from the web in seconds.&lt;br&gt;
&lt;em&gt;It’s all about picking the right tool for the job.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you enjoyed this, you’ll fit right in at the &lt;a href="https://discord.com/invite/eN83rMWqAt" rel="noopener noreferrer"&gt;&lt;strong&gt;Extract Data Discord&lt;/strong&gt;&lt;/a&gt; - a 20,000+ strong community of builders, scrapers, and data nerds exploring the web together.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thank you for reading! 😄&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>webscraping</category>
      <category>coding</category>
    </item>
    <item>
      <title>Web Scraping with n8n | Part 1: Build Your First Web Scraper</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Fri, 17 Oct 2025 13:10:41 +0000</pubDate>
      <link>https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf</link>
      <guid>https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf</guid>
      <description>&lt;h2&gt;
  
  
  What it will cover!
&lt;/h2&gt;

&lt;p&gt;If you’ve ever wished you could automate scraping without setting up a bunch of scripts, proxies, or browser logic, you're in the right place.&lt;/p&gt;

&lt;p&gt;We’ll use &lt;a href="https://n8n.io/" rel="noopener noreferrer"&gt;n8n&lt;/a&gt;, the low code automation tool, together with &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; to fetch structured data from &lt;a href="https://books.toscrape.com/" rel="noopener noreferrer"&gt;https://books.toscrape.com/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By the end, you’ll have a workflow that runs on its own, giving you clean JSON or CSV output of all books - their names, prices, ratings, and images. And a setup you can easily adapt for other publicly available or test websites with similar layouts.&lt;/p&gt;

&lt;p&gt;Let’s get scraping!&lt;/p&gt;

&lt;h2&gt;
  
  
  The game plan:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fetch the page using Zyte API (it handles rendering &amp;amp; manages blocks automatically)&lt;/li&gt;
&lt;li&gt;Extract HTML content inside n8n&lt;/li&gt;
&lt;li&gt;Parse book elements with CSS selectors&lt;/li&gt;
&lt;li&gt;Clean and normalize the data&lt;/li&gt;
&lt;li&gt;Export results as JSON or CSV&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, let’s get n8n ready to roll.&lt;br&gt;
You can set it up for free locally or in the cloud, whichever you prefer.&lt;br&gt;
If you’re going local, install it via &lt;a href="https://docs.n8n.io/hosting/installation/docker/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; or &lt;a href="https://docs.n8n.io/hosting/installation/npm/" rel="noopener noreferrer"&gt;npm&lt;/a&gt; - it only takes a few commands.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Once it’s up, the steps below will work exactly the same whether you’re using n8n Desktop or n8n Cloud.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Step 1: Create a new workflow in n8n
&lt;/h2&gt;

&lt;p&gt;After logging in, create a new workflow.&lt;br&gt;
Name it something like "&lt;strong&gt;Book Catalog Scraper&lt;/strong&gt;" - you can always tweak the same workflow later for similar pages or categories.&lt;br&gt;
This blank canvas is where all your nodes will live.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furbbnqgvo5n3xtw2udkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furbbnqgvo5n3xtw2udkp.png" alt="N8N Blank Canvas" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Add an HTTP Request Node
&lt;/h2&gt;

&lt;p&gt;We’ll use the &lt;a href="https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.httprequest/" rel="noopener noreferrer"&gt;HTTP Request node&lt;/a&gt; to call the Zyte API.&lt;/p&gt;

&lt;p&gt;We’ll use cURL to configure this node. Click on Import cURL, then paste the following command and hit Import.&lt;br&gt;
(Don’t forget to replace the API key with your own, and change the URL if you’d like.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html", "browserHtml": true}' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once imported, you’ll see the node fields automatically populated.&lt;br&gt;
&lt;strong&gt;Note:&lt;/strong&gt; When you import via cURL, n8n often converts boolean values like true into the string "true".&lt;br&gt;
To fix this, click the little gear icon → “Add Expression” next to the value and set it to {{true}}.&lt;br&gt;
This is especially required for the browserHtml field: it ensures the Zyte API receives a real boolean, not a string.&lt;/p&gt;
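
&lt;p&gt;After that fix, the JSON body n8n sends to Zyte API should effectively look like this (a real boolean, not a string):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
  "browserHtml": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;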

&lt;p&gt;Now hit Execute Node, and you should see a JSON response with a big block of HTML inside the "browserHtml" field.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hyj0thvoyknx6999let.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hyj0thvoyknx6999let.png" alt="HTTP Request Node N8N Zyte API" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: Extract the HTML content
&lt;/h2&gt;

&lt;p&gt;Next, add an &lt;strong&gt;&lt;a href="https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.set/" rel="noopener noreferrer"&gt;Edit Fields&lt;/a&gt;&lt;/strong&gt; node (previously called Set node) to isolate that browserHtml content.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mode:&lt;/strong&gt; Add Field&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; &lt;code&gt;data&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value:&lt;/strong&gt; &lt;code&gt;{{$json["browserHtml"]}}&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39wjrwumiomitg1yjhet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39wjrwumiomitg1yjhet.png" alt="Extract HTML N8N" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This gives us a clean &lt;code&gt;data&lt;/code&gt; field containing just the HTML we need.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 4: Parse book elements
&lt;/h2&gt;

&lt;p&gt;Add the HTML node ( Extract HTML Content ).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source Data:&lt;/strong&gt; &lt;code&gt;data&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key:&lt;/strong&gt; books&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSS Selector:&lt;/strong&gt; &lt;code&gt;article.product_pod&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return Array:&lt;/strong&gt; ✅ Enabled&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return Value:&lt;/strong&gt; HTML&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run it once, and you’ll see a new field - books - containing an array where each item represents a single book’s HTML block.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah4fooetbqafbmy46o0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah4fooetbqafbmy46o0u.png" alt="Parse book elements" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have one array with multiple products, each ready to be parsed individually in the next step.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 5: Split the list into items
&lt;/h2&gt;

&lt;p&gt;Now we’ll process each product individually.&lt;br&gt;
Add the &lt;a href="https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.splitout/" rel="noopener noreferrer"&gt;Split Out node&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fields To Split Out:&lt;/strong&gt; &lt;code&gt;books&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now each book becomes its own item for extraction. This makes it easier to handle or filter each record separately later on.&lt;/p&gt;

&lt;p&gt;(You can skip this step if you only need a quick one-shot export, but keeping it helps if you plan to scale or tweak the workflow later.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwazieuukqo73ujiwbxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwazieuukqo73ujiwbxx.png" alt="Split Out N8N" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 6: Extract product details
&lt;/h2&gt;

&lt;p&gt;Add another HTML node ( Extract HTML Content ) to grab the details inside each product.&lt;/p&gt;

&lt;p&gt;Extraction Values:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key&lt;/th&gt;
&lt;th&gt;CSS Selector&lt;/th&gt;
&lt;th&gt;Return Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;h3 a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Attribute → title&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;url&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;h3 a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Attribute → &lt;code&gt;href&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;price&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.price_color&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;availability&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.instock.availability&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rating&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;p.star-rating&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Attribute → &lt;code&gt;class&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;image&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.image_container img&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Attribute → &lt;code&gt;src&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Hit Execute, and you’ll get structured JSON for each book.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaeokllxbs8p37qqjm7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaeokllxbs8p37qqjm7e.png" alt="Extract product details" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 7: Clean and normalize the data
&lt;/h2&gt;

&lt;p&gt;We’ll make sure URLs and image links are full paths, and rating classes are readable.&lt;br&gt;
Add a &lt;strong&gt;&lt;a href="https://docs.n8n.io/code/code-node/" rel="noopener noreferrer"&gt;Code node&lt;/a&gt;&lt;/strong&gt; ( Code in JavaScript ) and paste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://books.toscrape.com/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;urlRel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imgRel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratingClass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ratingClass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;urlRel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(\.\.\/)&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;imgRel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(\.\.\/)&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;availability&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="nx"&gt;rating&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Config&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mode:&lt;/strong&gt; Run Once for All Items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language:&lt;/strong&gt; JavaScript&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;br&gt;
You can tweak this logic based on your own site or data structure, for instance, you might want to clean extra fields, adjust paths differently, or skip this step entirely if your data’s already in the format you want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt0bqb5a6f27br5n7n8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt0bqb5a6f27br5n7n8q.png" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now your output will have clean, structured data, ready to export or feed into your next automation.&lt;/p&gt;
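&lt;p&gt;For reference, each item leaving the Code node now looks roughly like this (values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "A Sample Book Title",
  "url": "https://books.toscrape.com/catalogue/a-sample-book_1/index.html",
  "image": "https://books.toscrape.com/media/cache/example.jpg",
  "price": "£45.17",
  "availability": "In stock",
  "rating": "Three"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;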

&lt;h2&gt;
  
  
  Step 8: Export your data the way you want
&lt;/h2&gt;

&lt;p&gt;Now that your data is clean and structured, let’s turn it into a downloadable file, whether that’s CSV, .txt, or something else.&lt;/p&gt;

&lt;p&gt;Finally, drop in the &lt;strong&gt;&lt;a href="https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.converttofile/" rel="noopener noreferrer"&gt;Convert to File node&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This node takes your structured data and converts it into different file types.&lt;/p&gt;

&lt;p&gt;Here’s how to configure it:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuaghqogtcv8t425u8ga.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuaghqogtcv8t425u8ga.png" alt="Convert Node N8N" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once done, click Execute Node and you’ll see a binary output with your file ready to download.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;And that’s it - we just built a full web scraping workflow in n8n, powered by the Zyte API.&lt;/p&gt;

&lt;p&gt;You’ve just automated a complete workflow - fetching, parsing, cleaning, and exporting - all visually inside n8n.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8874utbkb31wxjubie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8874utbkb31wxjubie.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This same flow can be easily tweaked for other pages: just change the URL, update your selectors, and you’re good to go.&lt;/p&gt;

&lt;p&gt;In the next part, we’ll take this further and scrape multiple pages automatically by adding pagination logic.&lt;/p&gt;

&lt;p&gt;Stay tuned, thanks for reading!😄&lt;/p&gt;

</description>
      <category>programming</category>
      <category>webscraping</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Supercharge Your AI Agents with a Custom RAG Pipeline Powered by Live Web Data</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Fri, 19 Sep 2025 18:41:16 +0000</pubDate>
      <link>https://dev.to/extractdata/supercharge-your-ai-agents-with-a-custom-rag-pipeline-powered-by-live-web-data-57fl</link>
      <guid>https://dev.to/extractdata/supercharge-your-ai-agents-with-a-custom-rag-pipeline-powered-by-live-web-data-57fl</guid>
      <description>&lt;p&gt;Just think for a while, what if you could fed any web page data to your AI agent, to just get you the exact info, answer or the summary of the content you're looking?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Actually, you can do that with ease using Scrapy + Zyte API&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Meet Fab 👨‍💻&lt;/strong&gt;&lt;br&gt;
Fab’s a dev with years of experience. Lately, he’s been diving into finance, learning about promising stocks. But here’s the problem: keeping up with daily news and press releases - scrolling through 10 articles and updates every morning - is hectic and manual.&lt;/p&gt;

&lt;p&gt;So Fab decided to build an AI Agent that does it for him - fetching, reading, and summarizing everything in real time.&lt;/p&gt;

&lt;p&gt;That’s basically a custom RAG pipeline, powered by live web data &amp;amp; no longer limited to static PDFs or outdated docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why bother?&lt;/strong&gt;&lt;br&gt;
Because even the smartest AI agent is only as good as the &lt;strong&gt;data it can access&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLMs have knowledge cutoffs&lt;/li&gt;
&lt;li&gt;Real-time, domain-specific data (like finance) is crucial for decision making&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By tapping into live web data, Fab’s agent can keep up with the world as it happens - always relevant, always ready.&lt;/p&gt;

&lt;p&gt;But hold up ✋ - summarizing or answering isn’t the same as taking real actions. That’s where AI Agents and Agentic AI differ.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtmmsd06encoa8wd8ebb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtmmsd06encoa8wd8ebb.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI Agents&lt;/strong&gt; are software systems designed to automate specific, well defined tasks, like chatbots, email sorting tools, or voice assistants, usually based on predefined tools or prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic AI&lt;/strong&gt;, on the other hand, has a broader scope of autonomy.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;What we’ll walk through here is technically an AI Agent, but since both share the same foundation, it could evolve into Agentic AI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Fab's Toolkit 🛠️
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.scrapy.org/" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt;&lt;/strong&gt; → for structured data extraction&lt;br&gt;
&lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_ai_agents&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;&lt;/strong&gt; → to handle dynamic &amp;amp; complex websites&lt;br&gt;
&lt;strong&gt;DuckDuckGo&lt;/strong&gt; + &lt;strong&gt;yfinance&lt;/strong&gt; → for extra search and finance insights&lt;br&gt;
&lt;strong&gt;&lt;a href="https://docs.agno.com/introduction" rel="noopener noreferrer"&gt;Agno&lt;/a&gt;&lt;/strong&gt; → to orchestrate a multi-agent workflow&lt;br&gt;
&lt;strong&gt;&lt;a href="https://console.groq.com/docs/overview" rel="noopener noreferrer"&gt;GroqCloud&lt;/a&gt;&lt;/strong&gt; → lightning fast LLM inference&lt;/p&gt;
&lt;h2&gt;
  
  
  The Architecture 🏗️
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feum27z9gvilxy0chcjdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feum27z9gvilxy0chcjdk.png" alt=" " width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Scrapy + Zyte API?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You could try doing this with just Scrapy and rotating proxies. But anyone who has scraped at scale knows the pain: blocks, captchas, failed requests.&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_ai_agents&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;&lt;/strong&gt; shines. It offloads the heavy lifting, so you don’t have to babysit your scrapers, you just get clean, structured data.&lt;/p&gt;

&lt;p&gt;Think of it like having a dedicated backend team making sure your spiders never get stuck.&lt;/p&gt;
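&lt;p&gt;Concretely, wiring Zyte API into Scrapy is mostly a settings change via the &lt;a href="https://github.com/scrapy-plugins/scrapy-zyte-api" rel="noopener noreferrer"&gt;scrapy-zyte-api&lt;/a&gt; plugin. Here’s a minimal sketch of the typical setup (double check the plugin docs for the current setting names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# settings.py - minimal scrapy-zyte-api setup (sketch, not the full project config)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
}
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
ZYTE_API_KEY = "YOUR_API_KEY"
ZYTE_API_TRANSPARENT_MODE = True  # route every Scrapy request through Zyte API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;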
&lt;h2&gt;
  
  
  Data Collection the Right Way! 📥
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Instead of scraping everything, Fab’s agent first collects URLs only... then fetches only the important data based on a trend score.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To handle this efficiently, Fab designed a Scrapy project with one base spider and four specialized spiders for fetching:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;News&lt;/li&gt;
&lt;li&gt;Press releases&lt;/li&gt;
&lt;li&gt;Transcripts&lt;/li&gt;
&lt;li&gt;Comments&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The base spider takes care of site specific scraping by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetching URLs and metadata&lt;/li&gt;
&lt;li&gt;Cleaning and normalizing dates&lt;/li&gt;
&lt;li&gt;Generating unique IDs from URLs
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urlunparse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseFinanceSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base_finance_spider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance-example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Normalize URLs&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;urlunparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fragment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate unique ID from cleaned URL&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clean_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;convert_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Convert relative dates like &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Today&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Yesterday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to ISO&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Yesterday&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Today&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# For demo, we skip complex parsing
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each specialized spider inherits from the base spider and focuses on &lt;strong&gt;site specific logic&lt;/strong&gt;: navigating pages and extracting the key information for its data type.&lt;/p&gt;
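&lt;p&gt;To make that concrete, here’s a minimal sketch of what one such spider could look like (the start URL and selectors are placeholders matching the demo domain above, not a real site’s markup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class NewsSpider(BaseFinanceSpider):
    """Collects only URLs + metadata for news - full pages are fetched later."""
    name = "news_spider"
    start_urls = ["https://finance-example.com/news"]

    def parse(self, response):
        # Placeholder selectors - adjust to the target site's real markup
        for card in response.css("article.news-card"):
            yield {
                "type": "news",
                "url": response.urljoin(card.css("a::attr(href)").get("")),
                "title": card.css("a::text").get("").strip(),
                "date": self.convert_date(card.css(".date::text").get("")),
            }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;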

&lt;p&gt;At this stage, three of the specialized spiders collect only URLs and metadata, creating a JSON list for each data type. Comments are the exception: we scrape those right away. Think of it as preparing a “to do list” of pages for Fab’s agent to process later, keeping things organized and efficient.&lt;/p&gt;

&lt;p&gt;When items are yielded, Scrapy Pipelines automatically handle the cross cutting tasks: URL normalization and ID assignment, deduplication, anonymization, comment linking, and saving items to JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UrlNormalizationPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clean_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DeduplicationPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seen_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seen_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;DropItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Duplicate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seen_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AnonymizationPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Mask authors, publishers, or usernames
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;JsonFileExportPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Save item to JSON file (with intermediate saves)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What Each Spider Produces →&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frytddvhzq7ac6ff7uujg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frytddvhzq7ac6ff7uujg.png" alt="News JSON Sample" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79tygtsh2zzon3acig93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79tygtsh2zzon3acig93.png" alt="Press Releases JSON Sample" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn7fhz54r8nxnikhyjn1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn7fhz54r8nxnikhyjn1.png" alt="Transcript JSON Sample" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6337z3chbty5uf2c20z4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6337z3chbty5uf2c20z4.png" alt="Comments JSON Sample" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the URLs and metadata are collected, Fab’s agent performs trend analysis, using comments as a central indicator to prioritize which pages to fetch in full.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trend Analysis 📈
&lt;/h2&gt;

&lt;p&gt;Now that we’ve gathered articles ( news, press releases, transcripts ) and comments, the next step is figuring out which topics are actually trending. Collecting raw content is only half the job, what makes it valuable is knowing where attention is going.&lt;/p&gt;

&lt;p&gt;For this, we built a Trend Calculator. Its job is to take all the articles and comments we collected, connect them together, and then assign each article a trend score. The score is based on a few simple but powerful signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comment activity&lt;/strong&gt; – Articles with more comments get higher scores (up to a cap, so one viral post doesn’t skew everything).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mentions inside comments&lt;/strong&gt; – If people are discussing one article inside the comments of another, that’s a sign of influence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Freshness&lt;/strong&gt; – Recent articles get a bonus since trends fade quickly over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross source validation&lt;/strong&gt; – If the same topic shows up across multiple sources (like news and press releases), it’s likely important.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engagement quality&lt;/strong&gt; – Longer, more thoughtful comments add extra weight compared to short ones.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example of scoring logic
&lt;/span&gt;&lt;span class="n"&gt;comment_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comment_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;mention_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comment_mentions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;date_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_date_bonus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;source_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sources&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
&lt;span class="n"&gt;engagement_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;quality_from_comments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;  

&lt;span class="n"&gt;trend_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;comment_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;mention_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;date_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;source_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;engagement_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each factor contributes points that add up to a final &lt;code&gt;trend_score&lt;/code&gt;, showing how much traction an article has.&lt;/p&gt;
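
&lt;p&gt;Of the helpers above, only &lt;code&gt;calculate_date_bonus&lt;/code&gt; involves any date math. As a rough sketch of the freshness idea (the exact decay curve is an assumption for illustration, not the project’s actual code), a bonus that fades to zero over a week could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timezone

def calculate_date_bonus(date_str, max_bonus=3):
    # Freshness bonus: full points today, fading to zero over ~7 days
    try:
        published = datetime.fromisoformat(date_str)
    except (TypeError, ValueError):
        return 0  # missing or non-ISO date: no bonus
    if published.tzinfo is None:
        published = published.replace(tzinfo=timezone.utc)
    age_days = (datetime.now(timezone.utc) - published).days
    return max(0, max_bonus - age_days * max_bonus / 7)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;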

&lt;p&gt;Here’s the flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Link comments to articles&lt;/strong&gt; – Attach every comment to its article.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;article_comments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comment_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;article_comments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Score calculation&lt;/strong&gt; - For every article, the calculator looks at the signals above and assigns points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ranking&lt;/strong&gt; - Articles are sorted by score so we can clearly see which ones are rising in popularity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering&lt;/strong&gt; - We keep only those above a threshold score (say 5 or 10) to cut out noise; see the sketch after this list.&lt;/li&gt;
&lt;/ol&gt;
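
&lt;p&gt;Ranking and filtering can live together in one small method. Here’s a minimal sketch of what &lt;code&gt;get_top_articles&lt;/code&gt; could look like (the &lt;code&gt;self.articles&lt;/code&gt; attribute and exact shape are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def get_top_articles(self, threshold=5.0, limit=100):
    # Sort all scored articles, highest trend_score first
    ranked = sorted(self.articles, key=lambda a: a.get('trend_score', 0), reverse=True)
    # Keep only those above the noise threshold, capped at `limit`
    return [a for a in ranked if a.get('trend_score', 0) &amp;gt;= threshold][:limit]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;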

&lt;p&gt;Finally, the output is saved as JSON for later use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_top_articles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trending_articles.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trending&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, we’re not just storing a list of articles; we’re turning them into insights about what’s gaining traction in real time.&lt;/p&gt;

&lt;p&gt;The output of this step is a &lt;code&gt;trending_articles.json&lt;/code&gt; file: a ranked list of articles with their comment signals attached. Next, we’ll take this list and extract the full article content for deeper processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing the Articles 📑
&lt;/h2&gt;

&lt;p&gt;Alright, time to move past the signals and actually grab the article content. This is where Fab’s agent pulls in the full text so it can finally be read, summarized, and acted on; the real scraping and processing begins here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Smart Extraction with Zyte API&lt;/strong&gt;&lt;br&gt;
Instead of scraping blindly, we run each article URL through Zyte API. It tries multiple strategies under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser rendering&lt;/strong&gt; for rich pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP response fallback&lt;/strong&gt; if the first pass fails.&lt;/li&gt;
&lt;li&gt;And if all else fails → a &lt;strong&gt;graceful fallback object&lt;/strong&gt; that notes the article couldn’t be extracted (paywalls, login walls, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Caching is baked in so we don’t re-download the same article twice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_article_with_zyte_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_cached&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;get_from_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Try browser mode first, fallback to HTTP
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;extract_with_browser_simple&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extract_with_http_response&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;method&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;save_to_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create_fallback_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
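
&lt;p&gt;The cache helpers aren’t shown above, so here’s one minimal way they could work: a local directory with one JSON file per URL, keyed by a hash (the &lt;code&gt;article_cache&lt;/code&gt; path is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json
import os

CACHE_DIR = "article_cache"  # assumed local cache location

def _cache_path(url):
    # One JSON file per URL, named by a stable hash of the URL
    return os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest() + ".json")

def is_cached(url):
    return os.path.exists(_cache_path(url))

def get_from_cache(url):
    with open(_cache_path(url)) as f:
        return json.load(f)

def save_to_cache(url, article):
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(_cache_path(url), "w") as f:
        json.dump(article, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;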



&lt;p&gt;&lt;strong&gt;Step 2: Batch Processing ⚡&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s not good to hammer a site with 50 requests at once, so Fab’s agent scrapes articles in small batches. This keeps things stable, avoids rate limits, and lets us resume midway if anything fails.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scraped_articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_articles_in_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
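
&lt;p&gt;Under the hood, a batching helper like that can be just a loop with a polite pause between batches. A minimal sketch (the &lt;code&gt;pause_seconds&lt;/code&gt; value is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

def process_articles_in_batches(urls, batch_size=3, pause_seconds=5):
    results = []
    for i in range(0, len(urls), batch_size):
        # Scrape one small batch at a time to stay stable and avoid rate limits
        for url in urls[i:i + batch_size]:
            results.append(extract_article_with_zyte_api(url))
        if i + batch_size &amp;lt; len(urls):
            time.sleep(pause_seconds)  # polite pause before the next batch
    return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;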



&lt;p&gt;&lt;strong&gt;Step 3: Comments + Anonymization&lt;/strong&gt;&lt;br&gt;
Once the raw articles are in, we attach their associated comments (collected earlier) and anonymize usernames. That way, Fab can see the discussion signals without worrying about leaking personal data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;matching_trending&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processing_anonymizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anonymize_comments_in_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
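
&lt;p&gt;The anonymizer itself isn’t shown here; a minimal version could replace each username with a stable pseudonym, so repeat commenters stay recognizable without exposing who they are (the field names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

def anonymize_comments_in_article(article):
    # Swap usernames for stable pseudonyms derived from a hash
    for comment in article.get('comments', []):
        user = comment.get('username', '')
        comment['username'] = 'user_' + hashlib.sha1(user.encode()).hexdigest()[:8]
    return article
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;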



&lt;p&gt;&lt;strong&gt;Step 4: Summarization with LLMs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, each article is summarized using Groq + Llama 3.3, with comments included in the context. The prompt ensures Fab gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A clear content-type tag ([Complete article with comments], [Partial article], etc.).&lt;/li&gt;
&lt;li&gt;The main points of the article.&lt;/li&gt;
&lt;li&gt;Highlights from user comments (agreements, debates, sentiment).&lt;/li&gt;
&lt;li&gt;A note if the article looked incomplete or truncated.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summarize_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
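
&lt;p&gt;That one-liner hides the prompt assembly. Roughly, the prompt bundles the article text and its comments together with the formatting rules above; the sketch below is illustrative wording, not the exact prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def build_summary_prompt(article):
    # Give the LLM the article plus its discussion as one context block
    comments = "\n".join(c.get('text', '') for c in article.get('comments', []))
    return (
        "Summarize the following article. Start with a content-type tag such as "
        "[Complete article with comments] or [Partial article]. List the main "
        "points, highlight agreements, debates, and sentiment from the comments, "
        "and note if the article looks incomplete or truncated.\n\n"
        f"ARTICLE:\n{article.get('content', '')}\n\nCOMMENTS:\n{comments}"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;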



&lt;p&gt;At this point, we’ve gone from just links + scores → full articles + anonymized comments + structured summaries.&lt;/p&gt;

&lt;p&gt;This is the real handoff moment: the dataset is now clean, safe, and AI-ready. Time to combine this with other data sources...&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning Raw Summaries into Something Useful
&lt;/h2&gt;

&lt;p&gt;So we’ve got cleaned-up, summarized articles sitting neatly in JSON. That’s cool, but Fab doesn’t just want a folder full of summaries; he wants an agent that can reason over them, combine them with live market data, and give him answers on demand.&lt;/p&gt;

&lt;p&gt;That’s exactly what Agno will be used for. Agno is a framework for building LLM-powered agents where everything revolves around tools. We use some ready-made tools, like yfinance for market data or DuckDuckGo for quick searches, and we’ll create our own custom tool using the scraped and summarized articles we’ve collected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Custom Data as a Tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We wrap our summaries into a &lt;code&gt;CustomDataTools&lt;/code&gt; class. This behaves just like any other tool in Fab’s agent, except instead of calling an external API, it pulls directly from our private dataset of scraped articles.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load summaries from the &lt;code&gt;article_summaries.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Filter them by stock ticker (NVDA in our case).&lt;/li&gt;
&lt;li&gt;Format them into a neat digest with truncation rules so we don’t blow past token limits.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomDataTools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Toolkit&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_custom_financial_summaries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stock_ticker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NVDA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;summaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_scraped_summaries&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format_summaries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summaries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stock_ticker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stock_ticker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
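
&lt;p&gt;The loading and formatting behind those two calls can stay simple. Here’s a sketch of &lt;code&gt;format_summaries&lt;/code&gt; with a crude truncation rule (the character cap and field names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MAX_CHARS = 1200  # assumed per-summary cap to stay inside token limits

def format_summaries(self, summaries, stock_ticker="NVDA"):
    # Keep only summaries that mention the ticker, truncated to a safe length
    relevant = [s for s in summaries if stock_ticker in s.get('summary', '')]
    lines = [f"- {s.get('title', 'Untitled')}: {s['summary'][:MAX_CHARS]}"
             for s in relevant]
    return "\n".join(lines) or f"No scraped summaries found for {stock_ticker}."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;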



&lt;p&gt;&lt;strong&gt;Step 2: Mixing with External Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Of course, Fab doesn’t live on summaries alone. He still needs real-time signals like stock prices, analyst ratings, and fresh search results. That’s where we combine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;yfinance → live stock + fundamentals&lt;/li&gt;
&lt;li&gt;DuckDuckGo → fresh search&lt;/li&gt;
&lt;li&gt;Our custom summaries → curated, domain-specific insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now the agent has both breadth (search + finance APIs) and depth (our private dataset).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Building the Agent 🧑‍💻&lt;/strong&gt;&lt;br&gt;
With Agno, stitching it together is dead simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick a model (Groq’s Llama 3.3 for speed, or Ollama locally if Fab prefers).&lt;/li&gt;
&lt;li&gt;Load the toolset (custom data first, then finance APIs, then search).&lt;/li&gt;
&lt;li&gt;Add guardrails: focus on NVDA, prefer bullet points, flag stale data, cite sources.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1️⃣ Import models and tools
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.models.groq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Groq&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.models.ollama&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Ollama&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Toolkit&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.tools.yfinance&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YFinanceTools&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.tools.duckduckgo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DuckDuckGoTools&lt;/span&gt;

&lt;span class="c1"&gt;# 2️⃣ Define a custom tool for our scraped summaries ( given above ) 
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomDataTools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Toolkit&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# 3️⃣ Configure agent tools
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;CustomDataTools&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;          &lt;span class="c1"&gt;# Priority: private scraped data
&lt;/span&gt;        &lt;span class="nc"&gt;YFinanceTools&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;            &lt;span class="c1"&gt;# Priority: live stock &amp;amp; fundamentals
&lt;/span&gt;        &lt;span class="nc"&gt;DuckDuckGoTools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;           &lt;span class="c1"&gt;# Priority: fresh search results
&lt;/span&gt;    &lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# 4️⃣ Pick a model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Groq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3.3-70b-versatile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Or Ollama if you prefer local
&lt;/span&gt;
&lt;span class="c1"&gt;# 5️⃣ Create the unified agent
&lt;/span&gt;&lt;span class="n"&gt;finance_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fab Finance Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Focus on NVDA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prioritize custom summaries first, then live stock data, then fresh search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Provide actionable insights in bullet points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cite sources and flag outdated info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;show_tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when Fab asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;“What’s the latest chatter around NVDA this week?”&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent first checks our curated summaries, then layers in stock stats and fresh news.&lt;/p&gt;
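
&lt;p&gt;Asking that question is a one-liner. Assuming Agno’s standard &lt;code&gt;print_response&lt;/code&gt; helper, the answer (tool calls included) streams straight to the terminal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;finance_agent.print_response(
    "What's the latest chatter around NVDA this week?",
    stream=True,  # stream tokens as they arrive
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;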

&lt;p&gt;This is where everything comes together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scrapy + Zyte API → fresh, structured raw data&lt;/li&gt;
&lt;li&gt;Processing &amp;amp; scoring → signal + summaries&lt;/li&gt;
&lt;li&gt;Finance Agent (Agno) → fusing custom + external tools into one workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;What Fab ends up with is not just a scraper or a summarizer but a finance co-pilot that stays current, context-aware, and grounded in real web data.&lt;/p&gt;

&lt;p&gt;With this workflow, what started as a manual, time-consuming task has transformed into a seamless, intelligent system, proving just how powerful AI Agents can be when paired with live web data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 A small challenge for you:&lt;/strong&gt;&lt;br&gt;
If you’re feeling adventurous, try taking this project a step further: convert Fab’s AI Agent into a fully Agentic AI that can make decisions for you (of course, only with your approval, or you might risk your investments 😅). Connect it with your stockbroker’s MCP (many of them provide one nowadays) and scale it into something truly powerful, a next-level finance companion!&lt;/p&gt;

&lt;p&gt;If you get stuck or need guidance, don’t worry. Head over to the Extract Data Community, where 21,000+ data enthusiasts are ready to jump in and help you with your questions.&lt;/p&gt;

&lt;p&gt;Dive in, experiment, and show us your next move! 🙂&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Building a Discord Controlled Web Scraper with Scrapy &amp; Zyte API</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Thu, 14 Aug 2025 09:52:23 +0000</pubDate>
      <link>https://dev.to/extractdata/building-a-discord-controlled-web-scraper-with-scrapy-zyte-api-4mad</link>
      <guid>https://dev.to/extractdata/building-a-discord-controlled-web-scraper-with-scrapy-zyte-api-4mad</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It all started with a simple question in our Discord server, the &lt;a href="https://discord.gg/eN83rMWqAt" rel="noopener noreferrer"&gt;Extract Data Community&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Hey, I’m trying to scrape this gaming leaderboard, but I keep getting blocked. Any idea how to get around it?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A familiar problem for anyone in web scraping: modern websites block regular scraping with JavaScript rendering, rate limits, and IP restrictions. What began as a quick fix with &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_sfnd_blog&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; soon grew into a bigger idea.&lt;/p&gt;

&lt;p&gt;After sharing a &lt;a href="https://discord.com/channels/993441606642446397/1369996292100591757" rel="noopener noreferrer"&gt;working demo&lt;/a&gt;, I asked myself:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;h6&gt;
  
  
  What if this could be more than just a script?
&lt;/h6&gt;
&lt;h6&gt;
  
  
  What if it could scrape reliably, filter intelligently, and notify automatically - all while plugging into Discord?
&lt;/h6&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;And that’s how this project came to life!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In this article, I’ll share what I learned while building the system, from scraping and filtering data to sending real-time updates in Discord, and even triggering scrapes directly via a Discord bot. No heavy code walkthroughs, just insights you can apply to your own projects.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🧭 Overview: What We Built
&lt;/h2&gt;

&lt;p&gt;At its core, the project does five simple but powerful things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scrapes&lt;/strong&gt; leaderboard data from a gaming site using Scrapy
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bypasses&lt;/strong&gt; anti-bot protections using &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_sfnd_blog&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte API’s&lt;/a&gt; browser automation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filters&lt;/strong&gt; players based on customizable level thresholds
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notifies&lt;/strong&gt; your Discord channel about new high-level players
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs continuously&lt;/strong&gt; on autopilot with scheduled checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Scraping Goal:&lt;/strong&gt; Build a scraper that scans the game’s leaderboard using custom input filters within a defined page range, then instantly alerts our Discord community when matching high-level players are found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Components&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕷️ &lt;strong&gt;&lt;em&gt;Scrapy&lt;/em&gt;&lt;/strong&gt; handles the scraping.&lt;/li&gt;
&lt;li&gt;🛡️ &lt;strong&gt;&lt;em&gt;Zyte API&lt;/em&gt;&lt;/strong&gt; bypasses tough protections.&lt;/li&gt;
&lt;li&gt;⏱️ &lt;strong&gt;&lt;em&gt;Monitoring&lt;/em&gt;&lt;/strong&gt;: Automated scheduling system&lt;/li&gt;
&lt;li&gt;🤖 &lt;strong&gt;&lt;em&gt;A Discord bot&lt;/em&gt;&lt;/strong&gt; control center for commands/results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;No manual refreshing. No getting blocked. Just clean, filtered data delivered where your community hangs out.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv6bk9rp38x6rngo3f7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv6bk9rp38x6rngo3f7b.png" alt=" " width="800" height="965"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Project Structure&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape_filter_notify/
├── main.py                    # Main CLI entry point
├── discord_bot.py             # Discord bot with all commands
├── continuous_monitor.py      # Automated monitoring scheduler
├── requirements.txt           # Python dependencies
├── .env                       # Environment variables (create this)
├── .gitignore                 # Git ignore rules
│
└── scrape_filter_notify/     # Scrapy project
    ├── scrapy.cfg            # Scrapy configuration ( Default )
    └── scrape_filter_notify/
        ├── settings.py       # Scrapy settings ( Modified )
        ├── items.py          # Scrapy data models ( Modified )
        ├── pipelines.py      # Data processing ( Modified )
        ├── discord_notifier.py  # Discord integration ( New )
        └── spiders/
            └── leaderboard_spider.py  # Main web scraper ( Modified )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ⚙️ Getting Started: Setting Up the Spider ( with Scrapy + Zyte API )
&lt;/h2&gt;

&lt;p&gt;It began with setting up the scraper engine as a Scrapy spider. The gaming site in focus wasn’t friendly: it threw JavaScript, rate limits, and the occasional CAPTCHA at us.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scrapy alone couldn’t get through, so we brought in Zyte API to handle rendering, retries, and anti-bot defenses. That way, the spider could focus on what matters: pulling clean data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧭 New to Scrapy?&lt;/strong&gt;  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you’re just getting started, this &lt;a href="https://docs.zyte.com/web-scraping/tutorials/main/setup.html" rel="noopener noreferrer"&gt;tutorial&lt;/a&gt; will walk you through setting up your first Scrapy project from scratch.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Scraping Process
&lt;/h4&gt;

&lt;p&gt;Here’s the architecture for a smart and robust &lt;code&gt;leaderboard_spider.py&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hhnndegrjpvr0zq36sw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hhnndegrjpvr0zq36sw.png" alt=" " width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See, the &lt;code&gt;Scrapy setup&lt;/code&gt; crawls through paginated leaderboard pages and extracts player info, with Zyte’s smart backend helping it navigate the website’s tricky parts under the hood.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To keep things clean and easy to maintain, I split the logic into three main files - each doing exactly one job:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;leaderboard_spider.py&lt;/strong&gt; - does the crawling and parsing
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;items.py&lt;/strong&gt; - defines the structure for raw data
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pipelines.py&lt;/strong&gt; - filters, saves, and notifies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One important step before starting the spider is configuring Scrapy to use Zyte API as the backend for all requests. This goes into our &lt;code&gt;Scrapy&lt;/code&gt; &lt;code&gt;settings.py&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load Zyte API key securely from environment (recommended)
&lt;/span&gt;&lt;span class="n"&gt;ZYTE_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Enable transparent mode for better debugging and easier dev experience
&lt;/span&gt;&lt;span class="n"&gt;ZYTE_API_TRANSPARENT_MODE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="c1"&gt;# Use Zyte’s download handler and middleware
&lt;/span&gt;&lt;span class="n"&gt;DOWNLOAD_HANDLERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_zyte_api.ScrapyZyteAPIDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_zyte_api.ScrapyZyteAPIDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;DOWNLOADER_MIDDLEWARES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;633&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Tip:&lt;/strong&gt; keep sensitive info like API keys in environment variables, never hardcode credentials directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Defining Raw Data
&lt;/h4&gt;

&lt;p&gt;In Scrapy, &lt;code&gt;items&lt;/code&gt; basically define how we want to shape the raw data we’re scraping. They’re like organized containers that hold everything the spider grabs. Later on, pipelines handle cleaning and validation.&lt;/p&gt;

&lt;p&gt;For instance, here’s the simple &lt;code&gt;RawPlayerItem&lt;/code&gt; class I used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# items.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="c1"&gt;# 🧱 Defines the structure of raw player data scraped from the leaderboard
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RawPlayerItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;player_name_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;kingdom_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;level_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;game_exp_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Tip:&lt;/strong&gt; If your setup includes &lt;strong&gt;pagination&lt;/strong&gt;, it’s helpful to capture the page number as part of your data. &lt;em&gt;For me, this was useful for estimating how long the scraping would take &amp;amp; debugging issues...&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I set the spider up around &lt;strong&gt;three main functions&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Initialize settings&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send requests&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parse the data&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Inside the &lt;code&gt;__init__&lt;/code&gt; method, I just set up some basic configurations (sketched after this list), like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimum player level to consider
&lt;/li&gt;
&lt;li&gt;Number of pages to scrape
&lt;/li&gt;
&lt;li&gt;Output location
&lt;/li&gt;
&lt;li&gt;Whether or not to send a Discord notification
&lt;/li&gt;
&lt;/ul&gt;
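
&lt;p&gt;Here’s a hedged sketch of that &lt;code&gt;__init__&lt;/code&gt; (parameter names are illustrative; note that anything passed via &lt;code&gt;scrapy crawl -a&lt;/code&gt; arrives as a string):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import scrapy

class LeaderboardSpider(scrapy.Spider):
    name = "leaderboard"

    def __init__(self, min_level=75, max_pages=2,
                 output_file="players.json", notify=True, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # CLI arguments (e.g. -a min_level=80) come in as strings, so coerce them
        self.min_level = int(min_level)
        self.max_pages = int(max_pages)
        self.output_file = output_file
        self.notify = str(notify).lower() in ("true", "1", "yes")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;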

&lt;p&gt;When the spider starts &lt;code&gt;sending requests&lt;/code&gt;, it’s not just grabbing plain HTML. Because the site uses a lot of JavaScript, we rely on &lt;strong&gt;Zyte API’s browser automation&lt;/strong&gt; to fully load content before scraping.&lt;/p&gt;

&lt;p&gt;A couple of things to keep in mind while sending requests:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add a little wait time&lt;/strong&gt; using &lt;code&gt;actions&lt;/code&gt; with a timeout, because sometimes the page content takes a few seconds to fully load.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set the &lt;code&gt;geolocation&lt;/code&gt; to &lt;code&gt;US&lt;/code&gt;&lt;/strong&gt; – this was a key discovery. The site sometimes shows incomplete or blocked content depending on the request’s region. Setting it to the US gave consistent, clean data every time.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example request setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;zyte_api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;browserHtml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Get full browser-rendered HTML
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;javascript&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Enable JS execution
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;actions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;                  &lt;span class="c1"&gt;# Wait time before scraping
&lt;/span&gt;            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;waitForTimeout&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;geolocation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;US&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;           &lt;span class="c1"&gt;# Set location to US for consistent data
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
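
&lt;p&gt;That &lt;code&gt;meta&lt;/code&gt; dict then rides along on every paginated request. A sketch of the request loop (the URL is a placeholder, and &lt;code&gt;build_zyte_meta()&lt;/code&gt; is a hypothetical helper returning the dict above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Inside the spider class, continuing the sketch above
def start_requests(self):
    for page in range(1, self.max_pages + 1):
        yield scrapy.Request(
            url=f"https://example-game.com/leaderboard?page={page}",  # placeholder
            callback=self.parse,
            meta=self.build_zyte_meta(page),
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;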



&lt;blockquote&gt;
&lt;p&gt;One of the nice things Scrapy handles for us behind the scenes is retries and error handling.&lt;br&gt;
If you were working with just plain Python + Zyte API, you’d have to write your own retry logic for bans, 520 errors, and other hiccups.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Just add these to your &lt;code&gt;settings.py&lt;/code&gt; to handle retries automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Retry settings
&lt;/span&gt;&lt;span class="n"&gt;RETRY_HTTP_CODES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;504&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;520&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;524&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;RETRY_TIMES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Zyte sends back the fully rendered HTML, the spider’s &lt;code&gt;parse()&lt;/code&gt; method gets to work. It uses &lt;strong&gt;CSS selectors&lt;/strong&gt; to sift through the messy HTML and pick out exactly what we need: player names, kingdoms, levels, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📝Pro Tip:&lt;/strong&gt; I actually use &lt;strong&gt;two CSS selectors as a backup plan&lt;/strong&gt;, because sometimes the page’s HTML is a little different, like text wrapped in &lt;code&gt;&amp;lt;font&amp;gt;&lt;/code&gt; tags on some pages but not others. This helps the spider stay flexible and not break - something you learn while debugging!&lt;/p&gt;

&lt;h4&gt;
  
  
  Pipelines: Cleaning, Filtering &amp;amp; Notifying
&lt;/h4&gt;

&lt;p&gt;Once the spider scrapes raw data, the pipelines take over to clean, validate, save, and notify. I split the pipeline into two main parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;PlayerProcessingPipeline&lt;/code&gt;: This part cleans up the raw data, filters out players below the minimum level, avoids duplicates, and saves the final list (see the sketch after this list).
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DiscordNotificationPipeline&lt;/code&gt;: At the end, this pipeline checks for any new players and shoots a neat summary over to Discord to keep everyone in the loop.&lt;/li&gt;
&lt;/ol&gt;
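
&lt;p&gt;To make the first part concrete, here’s a stripped-down sketch of what &lt;code&gt;PlayerProcessingPipeline&lt;/code&gt; could look like (field names come from &lt;code&gt;RawPlayerItem&lt;/code&gt; above; the rest is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

from scrapy.exceptions import DropItem

class PlayerProcessingPipeline:
    def open_spider(self, spider):
        self.seen = set()
        self.players = []

    def process_item(self, item, spider):
        # Clean the raw fields pulled by the spider
        name = str(item['player_name_raw']).strip()
        level = int(str(item['level_raw']).strip())
        # Filter: drop low-level players and duplicates
        if level &amp;lt; spider.min_level:
            raise DropItem(f"{name} is below the level threshold")
        if name in self.seen:
            raise DropItem(f"duplicate player: {name}")
        self.seen.add(name)
        self.players.append({'name': name, 'level': level})
        return item

    def close_spider(self, spider):
        # Wrap-up moment: persist the final, clean list
        with open(spider.output_file, 'w') as f:
            json.dump(self.players, f, indent=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;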

&lt;p&gt;One cool thing I learned about how pipelines work in Scrapy is that each pipeline &lt;code&gt;class&lt;/code&gt; gets its own “wrap-up” moment when the spider finishes running.&lt;/p&gt;

&lt;p&gt;Scrapy lets every pipeline class define its own finishing method - &lt;code&gt;close_spider()&lt;/code&gt;, and it runs these automatically in the order you set in &lt;code&gt;ITEM_PIPELINES&lt;/code&gt; in &lt;code&gt;settings.py&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Enable pipelines
&lt;/span&gt;&lt;span class="n"&gt;ITEM_PIPELINES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrape_filter_notify.pipelines.PlayerProcessingPipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrape_filter_notify.pipelines.DiscordNotificationPipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s why, in my case, the processing pipeline runs first to clean and save the data, and the Discord pipeline runs right after to send notifications based on that data.&lt;/p&gt;

&lt;p&gt;Remember, this all happens under the hood: once the spider starts crawling, the pipeline quietly takes over in the background. It filters out duplicates, skips players below the level threshold, and stores clean data in a JSON file. &lt;/p&gt;

&lt;p&gt;That wraps up our Scraper Engine. &lt;/p&gt;

&lt;p&gt;But scraping data is only useful if it reaches the people who need it.&lt;br&gt;&lt;br&gt;
Next, I set up the &lt;code&gt;Discord notifier&lt;/code&gt; to deliver the fresh data we just scraped right to the Discord server.&lt;/p&gt;


&lt;h2&gt;
  
  
  Sending Updates with Discord Notifier
&lt;/h2&gt;

&lt;p&gt;Building a Discord bot isn’t hard; libraries like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://discordpy.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Discord.py&lt;/a&gt; - a modern, easy to use, feature-rich, and async ready API wrapper for Discord.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://discord.js.org/" rel="noopener noreferrer"&gt;Discord.js&lt;/a&gt; - a powerful Node.js module that allows you to interact with the Discord API very easily.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…make it pretty straightforward.&lt;/p&gt;

&lt;p&gt;Since our scraper is all in Python, I went with &lt;code&gt;Discord.py&lt;/code&gt;. That way, everything runs in one language with no extra headaches: no child processes, no separate API layer just to talk to the scraper engine. That said, Discord.js has its own perks and can be the better pick if you’re already deep in the Node.js ecosystem. We’ll explore that route another time.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;discord_notifier.py&lt;/code&gt;, the workflow is pretty simple:&lt;/p&gt;

&lt;p&gt;1️⃣ &lt;strong&gt;Load secrets&lt;/strong&gt; (bot token &amp;amp; channel ID) securely via environment variables.&lt;br&gt;&lt;br&gt;
2️⃣ &lt;strong&gt;Log in to Discord&lt;/strong&gt;, find the target channel, and build a polished embed message with the top new players.&lt;br&gt;&lt;br&gt;
3️⃣ &lt;strong&gt;Send the message&lt;/strong&gt;, then log out cleanly.&lt;/p&gt;
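
&lt;p&gt;In &lt;code&gt;discord.py&lt;/code&gt; terms, that whole flow fits in one short coroutine. A sketch (the env var names and embed fields are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import discord

async def send_summary(players):
    client = discord.Client(intents=discord.Intents.default())

    @client.event
    async def on_ready():
        # Find the target channel and post one polished embed
        channel = client.get_channel(int(os.getenv("DISCORD_CHANNEL_ID")))
        embed = discord.Embed(title="🏆 New high-level players",
                              colour=discord.Colour.green())
        for p in players[:10]:
            embed.add_field(name=p['name'], value=f"Level {p['level']}", inline=False)
        await channel.send(embed=embed)
        await client.close()  # log out cleanly once the message is sent

    await client.start(os.getenv("DISCORD_BOT_TOKEN"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;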

&lt;blockquote&gt;
&lt;p&gt;The fun part was dealing with the event loop clash between Scrapy and Discord.py&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s the thing: Scrapy runs asynchronously on top of Twisted, the networking library that provides the asynchronous framework Scrapy uses for its operations. That means Scrapy manages a lot of things (like web requests and processing) concurrently within its own Twisted event loop.&lt;/p&gt;

&lt;p&gt;When the spider finishes scraping, Scrapy begins shutting down. But in my second pipeline class (&lt;code&gt;DiscordNotificationPipeline&lt;/code&gt;), we still need to run the notifier, and at that point we’re still inside Scrapy’s Twisted event loop.&lt;/p&gt;

&lt;p&gt;On the other hand, when we run &lt;code&gt;discord_notifier&lt;/code&gt; using the &lt;code&gt;discord.py&lt;/code&gt; library, it uses asyncio, which runs its own separate event loop. The key problem is that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔥 You cannot start an asyncio loop while another event loop (like Twisted’s) is already running.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Python will raise a &lt;strong&gt;&lt;code&gt;RuntimeError&lt;/code&gt;&lt;/strong&gt;, because you're trying to start one event loop inside another.&lt;/p&gt;

&lt;p&gt;To avoid that, I added a check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If the Scrapy loop is already active&lt;/strong&gt;, the notification runs in a separate thread with its own event loop.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If not&lt;/strong&gt;, it runs normally on the main loop.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Something like this does the trick (&lt;code&gt;send_notification()&lt;/code&gt; here stands in for the notifier coroutine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_running&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Run Discord notifier in a new thread
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Run notifier on the current loop
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That little workaround ensures the scraper finishes, and your Discord server gets a clean summary message every time the job completes, no crashes, no conflicts.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The &lt;code&gt;discord_notifier.py&lt;/code&gt; script we discussed isn’t a full-fledged bot - it just logs in, sends a summary message, and logs out. It’s great for running the scraper on a schedule and pushing updates to Discord automatically. I created a separate Discord bot that gives us full control over the scraping process directly from Discord. This setup keeps the scraper independent and flexible!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Before we move on,
&lt;/h4&gt;

&lt;p&gt;here’s a quick visual that ties everything together, from fetching the rendered HTML to storing filtered data as JSON, sending updates to Discord, and setting up the scheduler in the next step:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb80ya5eiz8yjbwwpx0pl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb80ya5eiz8yjbwwpx0pl.png" alt=" " width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pretty solid, right?&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Now that we’ve got scraping and notifications working, the next question is: what if we want this whole flow to run automatically, without having to trigger it manually every hour or so?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is exactly what &lt;code&gt;continuous_monitor.py&lt;/code&gt; was set up for. It's a smart loop that runs our spider at regular intervals...&lt;/p&gt;


&lt;h2&gt;
  
  
  🔁Autopilot Mode: Let the Spider Run Itself
&lt;/h2&gt;

&lt;p&gt;Here’s what the scheduler I built does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keeps track of run stats:&lt;/strong&gt; started time, last run, next scheduled run, and total runs completed.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles shutdown signals cleanly&lt;/strong&gt;, so we never leave half-finished runs hanging.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Launches the spider as a subprocess&lt;/strong&gt;, waits for it to finish, and then sleeps for the interval you’ve set
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;monitor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ContinuousMonitor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_monitoring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interval_minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Then it runs again… and again… automatically. Something like this captures the core idea:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;monitoring_active&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;run_spider_subprocess&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;report_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# print or send to Discord
&lt;/span&gt;    &lt;span class="nf"&gt;sleep_for_interval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
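&lt;p&gt;The clean-shutdown part is what flips that &lt;code&gt;monitoring_active&lt;/code&gt; flag. Here’s a minimal sketch of the idea (the handler name and details are illustrative, not the exact implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import signal

monitoring_active = True

def handle_shutdown(signum, frame):
    # Flip the flag; the loop finishes the current run, then exits cleanly
    global monitoring_active
    monitoring_active = False

# Catch Ctrl+C and termination so a run is never cut off mid-scrape
signal.signal(signal.SIGINT, handle_shutdown)
signal.signal(signal.SIGTERM, handle_shutdown)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;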


&lt;p&gt;I set it up using asyncio so everything runs smoothly without blocking, even when integrated with Discord notifications. The async loop handles spider runs, reporting, and sleep intervals without the tasks interfering with each other.&lt;/p&gt;
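&lt;p&gt;Roughly, the async version of that loop looks like this - a sketch under my naming assumptions; the spider command and the &lt;code&gt;report_status&lt;/code&gt; helper are placeholders carried over from the pseudocode above, not the exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

monitoring_active = True  # flipped to False by the shutdown handler

async def monitor_loop(interval_minutes=60):
    while monitoring_active:
        # Placeholder command - swap in however you launch your spider
        proc = await asyncio.create_subprocess_exec(
            "scrapy", "crawl", "player_spider"
        )
        await proc.wait()       # spider runs without blocking the event loop
        await report_status()   # placeholder: e.g. push a summary to Discord
        await asyncio.sleep(interval_minutes * 60)  # non-blocking sleep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;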

&lt;p&gt;This script could run in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Standalone:&lt;/strong&gt; Just schedule it with a cron job, or even run it manually. It scrapes, saves JSON data, and optionally sends Discord notifications.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inside a bot:&lt;/strong&gt; Later, we can plug it into a Discord bot to give us full control - start, stop, or check stats directly from Discord.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now everything is wired up - the spider does the scraping, the notifier sends updates, and the monitor keeps things running on a loop.&lt;br&gt;
But let’s be real, running a spider manually every time wasn’t exactly the goal. So we built a Discord bot!&lt;/p&gt;

&lt;p&gt;Here’s a quick look at the full bot lifecycle to visualize how it all works (just make sure the &lt;code&gt;.env&lt;/code&gt; file has the bot token and channel ID set up before running it):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3ulunpgmgf7b9mfzua2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3ulunpgmgf7b9mfzua2.png" alt=" " width="800" height="1089"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This bot isn't just a helper. It’s a full control panel for our scraper, right inside the Discord server. Want to scrape once on-demand? Run &lt;code&gt;/scrape&lt;/code&gt;. Want it to auto-run every 60 minutes? Do &lt;code&gt;/monitor_start interval:60&lt;/code&gt;. Want to stop it? Check status? It’s all there, and the responses look good too (with progress bars, timestamps, and interactive result buttons).&lt;/p&gt;
&lt;h2&gt;
  
  
  🤖 Discord Bot: A Quick Walkthrough
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The bot launches and registers slash commands&lt;/strong&gt; like &lt;code&gt;/scrape&lt;/code&gt;, &lt;code&gt;/monitor_start&lt;/code&gt;, &lt;code&gt;/monitor_status&lt;/code&gt;, etc. (sketched after this list).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We can interact with it via those commands&lt;/strong&gt;; depending on the command, it either:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runs a single scraping job&lt;/strong&gt; using the parameters we give (or defaults),
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starts the monitor&lt;/strong&gt;, which loops and runs jobs periodically,
&lt;/li&gt;
&lt;li&gt;Or just gives helpful info with &lt;code&gt;/help_scrape&lt;/code&gt;, or lets us stop ongoing monitoring with &lt;code&gt;/monitor_stop&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;While the scraping is in progress, &lt;strong&gt;we get live updates&lt;/strong&gt; with visually satisfying progress bars, estimated times, and player counts.
&lt;/li&gt;
&lt;li&gt;Once it's done, it gives back a clean summary with a “View Results” button that opens an embedded, paginated view of the players it found in Discord itself...&lt;/li&gt;
&lt;/ol&gt;
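&lt;p&gt;To make step 1 concrete, here’s a minimal sketch of how a slash command like &lt;code&gt;/scrape&lt;/code&gt; can be registered with discord.py’s app commands. The real bot adds parameters, progress bars, and the results view; &lt;code&gt;run_spider_subprocess&lt;/code&gt; is the placeholder from the monitor sketch, and the token variable name is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import discord
from discord import app_commands

intents = discord.Intents.default()
client = discord.Client(intents=intents)
tree = app_commands.CommandTree(client)

@tree.command(name="scrape", description="Run a single scraping job")
async def scrape(interaction: discord.Interaction):
    # Acknowledge right away; a scrape takes longer than Discord's 3-second window
    await interaction.response.defer()
    await run_spider_subprocess()  # placeholder: run the spider once
    await interaction.followup.send("Scrape finished - results saved!")

@client.event
async def on_ready():
    await tree.sync()  # registers the slash commands with Discord

client.run(os.environ["DISCORD_BOT_TOKEN"])  # token name is illustrative
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;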

&lt;p&gt;So far, I’ve built out all the pieces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A spider that scrapes,
&lt;/li&gt;
&lt;li&gt;A Discord bot that commands it,
&lt;/li&gt;
&lt;li&gt;A monitor that loops it in the background.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I needed one more thing… a way to tie all the components together for easy control.&lt;/p&gt;

&lt;p&gt;That’s why I created &lt;code&gt;main.py&lt;/code&gt;: a single command-line interface that ties the whole project together. Whether we want to run a quick scrape, start the Discord bot, or launch background monitoring, it handles all of it for us.&lt;/p&gt;
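&lt;p&gt;Here’s a rough sketch of how an entry point like that can be wired with argparse. The &lt;code&gt;monitor&lt;/code&gt; and &lt;code&gt;bot&lt;/code&gt; subcommands match the ones shown below; the &lt;code&gt;scrape&lt;/code&gt; name and the dispatch helpers are illustrative placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import argparse

def main():
    parser = argparse.ArgumentParser(description="Scraper control CLI")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("scrape", help="Run a single scraping job")
    sub.add_parser("bot", help="Start the Discord bot")

    monitor_cmd = sub.add_parser("monitor", help="Run the background monitor")
    monitor_cmd.add_argument("--interval", type=int, default=60,
                             help="minutes between runs")

    args = parser.parse_args()

    # Dispatch helpers are placeholders for the real components
    if args.command == "scrape":
        run_single_scrape()
    elif args.command == "bot":
        start_discord_bot()
    elif args.command == "monitor":
        start_monitoring(interval_minutes=args.interval)

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;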

&lt;p&gt;Next up! Let’s see how the results actually look when this thing runs.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;🌀 Output Preview: What Happens When It Runs&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Triggering a Scrape (Terminal Output)&lt;/strong&gt;
Here’s what it looks like when we run a scrape directly from the CLI. It kicks off the spider, runs through the pages, and wraps up by notifying us on our Discord channel.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknni8hcqzy10cxvch4eo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknni8hcqzy10cxvch4eo.png" alt=" " width="800" height="40"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfsw1oweagtlbt6bxtcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfsw1oweagtlbt6bxtcu.png" alt=" " width="800" height="193"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far9k0dx46hqfxw1cay4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far9k0dx46hqfxw1cay4p.png" alt=" " width="800" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Notified on Discord&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6maacrjqkayav964vesh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6maacrjqkayav964vesh.png" alt=" " width="800" height="763"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Running Background Monitoring (Terminal Output)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When we want the spider to keep working in the background, automatically running every X minutes, just trigger:&lt;br&gt;
&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py monitor &lt;span class="nt"&gt;--interval&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll see something like this in our terminal:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qk4b5dmpx335mwe2o22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qk4b5dmpx335mwe2o22.png" alt=" " width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💬 And just like before, it’ll ping us on Discord with updates!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Live Discord Bot in Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we start the Discord bot using the command below:&lt;br&gt;
&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py bot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…it boots up and gets right to work behind the scenes, registering all the slash commands we built. From here on we don’t need to touch the terminal - just head to Discord and start interacting with the bot directly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Available Commands on Discord&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxan1zzmonngkdc4u1dm6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxan1zzmonngkdc4u1dm6.png" alt=" " width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;a. &lt;strong&gt;Run a Scrape Instantly&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
   Just type &lt;code&gt;/scrape&lt;/code&gt;, hit enter, and the bot takes care of the rest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v9n2qy1gcq2nt1jf1xo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v9n2qy1gcq2nt1jf1xo.png" alt=" " width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;b. &lt;strong&gt;Controlling the monitoring loop right from Discord:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuu7cskbto6cef996nyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuu7cskbto6cef996nyc.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s it, we’ve seen it all in action.&lt;/p&gt;

&lt;p&gt;From scraping and filtering to live Discord alerts and full automation via CLI and bot commands, every part of this project works together to keep us and the community updated on the latest leaderboard shifts with minimal effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 &lt;strong&gt;Final Thoughts: Scraping That Talks Back&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This project started with a simple goal: to help someone get past anti-bot walls and grab some game data with &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_sfnd_blog&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;. But along the way, it became something more - a full system that scrapes, filters, and talks back to us in real time via Discord.&lt;/p&gt;

&lt;p&gt;The best part? It’s modular. Want to tweak the filter logic? Modify the pipeline. Want to plug it into another Discord server? Just update the &lt;code&gt;.env&lt;/code&gt;. Need to scrape something entirely different? Swap out the spider logic, and keep the rest.&lt;/p&gt;

&lt;p&gt;Just imagine: a single question turned into a full-fledged project… &lt;/p&gt;

&lt;p&gt;That’s exactly the kind of spark our community runs on. If you're into this kind of stuff, scraping tricky sites, building smarter automations, or just geeking out over ideas, come hang out in the &lt;a href="https://discord.gg/eN83rMWqAt" rel="noopener noreferrer"&gt;&lt;strong&gt;Extract Data Discord&lt;/strong&gt;&lt;/a&gt;. We’re &lt;strong&gt;20,000+&lt;/strong&gt; strong and growing, with data lovers, scraping pros, and creative hackers sharing projects, questions, and solutions every single day.&lt;/p&gt;

&lt;p&gt;And as for this project - I hope walking through it gave you a solid blueprint for going beyond just writing a spider and instead building scraping workflows that feel more interactive, automated, and fun.&lt;/p&gt;

&lt;p&gt;Play around, and let us know what you build next!&lt;/p&gt;

&lt;p&gt;Thanks for reading, 🙂 &lt;br&gt;
Catch you in the Discord!&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>data</category>
      <category>discord</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
