<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lakshay Nasa</title>
    <description>The latest articles on DEV Community by Lakshay Nasa (@lakshay_nasa).</description>
    <link>https://dev.to/lakshay_nasa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3319391%2F36a0d6ed-9695-4831-a90f-c93ce43a9960.png</url>
      <title>DEV Community: Lakshay Nasa</title>
      <link>https://dev.to/lakshay_nasa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lakshay_nasa"/>
    <language>en</language>
    <item>
      <title>The AI Web Scraper: One Workflow to Scrape Anything (n8n Part 3)</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Tue, 16 Dec 2025 14:00:59 +0000</pubDate>
      <link>https://dev.to/extractdata/web-scraping-with-n8n-part-3-the-ai-web-scraper-one-workflow-scrape-anything-3e4n</link>
      <guid>https://dev.to/extractdata/web-scraping-with-n8n-part-3-the-ai-web-scraper-one-workflow-scrape-anything-3e4n</guid>
      <description>&lt;p&gt;Welcome to the finale of our n8n web scraping series!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf"&gt;Part 1&lt;/a&gt;, we covered the basics: fetching a single page and parsing it with CSS selectors.&lt;/li&gt;
&lt;li&gt;In &lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365"&gt;Part 2&lt;/a&gt;, we tackled the tricky mechanics: pagination loops, infinite scroll, and network capture.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;But if you’ve been following along, you know there is still one massive headache in web scraping: &lt;strong&gt;"New Site = New Workflow."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every time you want to scrape a different website, you have to open the browser inspector, hunt for new &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; classes, debug why &lt;code&gt;price_color&lt;/code&gt; isn't working, and rewrite your entire flow. It’s exhausting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Today, we change that.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this final part, we are going to build an &lt;strong&gt;Automated AI Scraper&lt;/strong&gt; - a single n8n workflow that can scrape almost anything (Online Stores, Article/News Sites, Job Boards, and more) without you changing a single node.&lt;/p&gt;

&lt;p&gt;Whether you are a developer looking to save hours of coding, or a non-technical user who just needs the data without the headache, this tool is designed for you.&lt;/p&gt;

&lt;center&gt;
  &lt;h4&gt;Watch the Walkthrough 🎬&lt;/h4&gt;
&lt;/center&gt;

&lt;p&gt;
  &lt;iframe src="https://www.youtube.com/embed/QLuvyOCwYT4"&gt;&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 TL;DR:&lt;/strong&gt; Want to start scraping immediately? We have packaged this entire workflow into a ready-to-use template. 👉 &lt;a href="https://n8n.io/workflows/11637-automate-data-extraction-with-zyte-ai-products-jobs-articles-and-more/" rel="noopener noreferrer"&gt;Automate Data Extraction with Zyte AI (Products, Jobs, Articles &amp;amp; More)&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Concept: "AI-Driven Architecture”
&lt;/h1&gt;

&lt;p&gt;The idea is simple: we make n8n stop looking for specific page elements (CSS selectors) and start declaring the data we actually want.&lt;/p&gt;

&lt;p&gt;We are leveraging &lt;a href="https://docs.zyte.com/zyte-api/usage/extract/index.html?utm_campaign=Discord_n8n_blog_p3_z_docs_auto_extract&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte’s AI-powered extraction&lt;/a&gt;. Instead of saying "Find the text inside &lt;code&gt;.product_pod h3 a&lt;/code&gt;", we simply send the URL and say &lt;code&gt;product: true&lt;/code&gt;. The AI analyzes the visual layout of the page and figures it out, even if the website changes its code tomorrow.&lt;/p&gt;
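
&lt;p&gt;To make that concrete, here is roughly what such a request body looks like, written as an n8n-style JavaScript object. This is a sketch: the URL is a hypothetical placeholder, not a node from the template.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Conceptual Zyte API request body for AI extraction (illustrative URL).
// Instead of CSS selectors, we declare WHAT we want:
const body = {
  url: 'https://example-shop.com/some-product', // hypothetical page
  product: true // ask the AI for a structured product object
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;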

&lt;p&gt;To handle every scenario, we designed &lt;strong&gt;three distinct pipelines:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The "AI Extraction" Pipeline (Automatic)&lt;/strong&gt;&lt;br&gt;
This is the core of the workflow. You simply select a Category (e.g., &lt;code&gt;E-commerce&lt;/code&gt;, &lt;code&gt;Article&lt;/code&gt;, etc.) and a &lt;code&gt;Goal&lt;/code&gt;, and the workflow automatically routes you to one of two paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct Extraction (Fast):&lt;/strong&gt; If you just need data from the current page (like a "Single Product" or a "Simple List"), the workflow sends a single smart request. No loops, no waiting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Two-Phase" Architecture (For AI crawling):&lt;/strong&gt; If your goal involves "All Pages" or "Visiting Items," the workflow activates a robust recursive loop:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1 (The Crawler):&lt;/strong&gt; It maps out the URLs you need (looping through pagination or grabbing item links from a list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2 (The Scraper):&lt;/strong&gt; It visits every mapped URL one by one to extract the rich details you asked for.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The "SERP" Pipeline (For SEO Data)&lt;/strong&gt;&lt;br&gt;
Need search rankings? We included a dedicated path for &lt;strong&gt;Search Engine Results Pages (SERP)&lt;/strong&gt;. It uses the &lt;code&gt;serp&lt;/code&gt; schema to automatically extract organic results, ads, and knowledge panels without you needing to parse complex HTML.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The "Manual" Mode (For Raw Control)&lt;/strong&gt;&lt;br&gt;
Sometimes you don't need AI. We added a "General" path that gives you raw &lt;code&gt;browserHtml&lt;/code&gt;, HTTP responses, or Screenshots so you can parse specific data yourself.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s get building.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: The Control Center
&lt;/h2&gt;

&lt;p&gt;In previous parts, we hardcoded URLs into our nodes. For this tool, that won’t work. We need a flexible User Interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Main Interface (Form Trigger)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh417c7sm4xjtwu6t74sp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh417c7sm4xjtwu6t74sp.png" alt="N8N Form AI Web Scraper Submission"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use an &lt;strong&gt;n8n Form Trigger&lt;/strong&gt; node as the entry point. This turns your workflow into a clean web app that anyone on your team can use.&lt;/p&gt;

&lt;p&gt;The Main Form collects three key inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Target URL&lt;/strong&gt; (Text): The website you want to scrape.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Site Category&lt;/strong&gt; (Dropdown): Options like &lt;code&gt;Online Store&lt;/code&gt;, &lt;code&gt;Article/News&lt;/code&gt;, &lt;code&gt;Job Post&lt;/code&gt;, &lt;code&gt;General &amp;amp; More&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zyte API Key&lt;/strong&gt; (Password): Securely input the key so it isn't hardcoded in the workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Smart Routing (The Switch Node)
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens. Immediately after the form, we use a &lt;strong&gt;Switch Node&lt;/strong&gt; ("Route by Category") that directs the traffic into distinct lanes.&lt;/p&gt;

&lt;p&gt;This logic is crucial because different categories require different inputs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The AI Lane (Store, News, Jobs):&lt;/strong&gt; If you select a structured category, the workflow routes you to a &lt;strong&gt;Secondary Form&lt;/strong&gt; asking for your "Extraction Goal" (e.g., &lt;em&gt;Scrape this page&lt;/em&gt; vs. &lt;em&gt;Crawl ALL pages&lt;/em&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The SEO Lane:&lt;/strong&gt; If you select "SERP (Search Engine Results)," it bypasses extra forms and goes straight to the specialized SERP scraper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Manual Lane (General):&lt;/strong&gt; If you select "General," it routes you to a different &lt;strong&gt;Manual Options Form&lt;/strong&gt; where you can choose specific technical actions (e.g., &lt;em&gt;Take Screenshot&lt;/em&gt;, &lt;em&gt;Get Browser HTML&lt;/em&gt;, &lt;em&gt;Network Capture&lt;/em&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture ensures you only see options relevant to your goal.&lt;/p&gt;
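
&lt;p&gt;For illustration, the Switch node's decision could be expressed as a tiny Code-node sketch like this. The template uses the visual Switch node; this equivalent is only to show the decision logic, and the field and option names here are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative routing logic (equivalent in spirit to the Switch node)
const category = $json['Site Category']; // form field name assumed

let route = 'ai'; // Online Store / Article/News / Job Post
if (category === 'SERP (Search Engine Results)') route = 'serp';
if (category === 'General &amp;amp; More') route = 'manual';

return [{ json: { ...$json, route } }];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;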




&lt;h2&gt;
  
  
  Step 2: Pipeline 1 – AI Extraction
&lt;/h2&gt;

&lt;p&gt;If the user selects Online Store, Article/News, or Job Post, they enter the AI pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The "AI Extraction Goal Form" (Refining Scope)
&lt;/h3&gt;

&lt;p&gt;Since "scraping" can mean anything from checking one price to archiving an entire blog, we present a secondary form here to define the scope. You simply tell the workflow what you need: a quick &lt;strong&gt;Single Item&lt;/strong&gt; lookup, a &lt;strong&gt;List&lt;/strong&gt; from the current page, or a full &lt;strong&gt;Multi-Page Crawl&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j4jy725csctiadkvw6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j4jy725csctiadkvw6y.png" alt="AI Extraction Goal"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Brain (Config Generator)
&lt;/h3&gt;

&lt;p&gt;We place a &lt;strong&gt;Code Node&lt;/strong&gt; (the "Zyte Config Generator") to translate your form choices into technical instructions; a minimal sketch follows the mapping list below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejpt2m8v567dzbovrnou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejpt2m8v567dzbovrnou.png" alt="N8N Code node: Zyte Config Generator"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For instance,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you select &lt;strong&gt;"Online Store"&lt;/strong&gt; → It maps to the Zyte schema &lt;code&gt;product&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;If you select &lt;strong&gt;"Article Site"&lt;/strong&gt; → It maps to &lt;code&gt;article&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;If you choose &lt;strong&gt;"Get List"&lt;/strong&gt; → It targets &lt;code&gt;productList&lt;/code&gt; (or &lt;code&gt;articleList&lt;/code&gt;) to extract an array of items.
&lt;/li&gt;
&lt;li&gt;If you choose &lt;strong&gt;"Crawl All Pages"&lt;/strong&gt; → It switches the target to &lt;code&gt;productNavigation&lt;/code&gt; (or &lt;code&gt;articleNavigation&lt;/code&gt;) to activate the crawler loop.&lt;/li&gt;
&lt;/ul&gt;
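
&lt;p&gt;Here is a minimal sketch of what such a Config Generator could look like as a Code node. It is illustrative only: the form field names and the exact mapping are assumptions, not the template's literal code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sketch of a "Zyte Config Generator" Code node (field names assumed)
const category = $json['Site Category'];
const goal = $json['Extraction Goal'] || '';

const schemaMap = {
  'Online Store': { single: 'product', list: 'productList', crawl: 'productNavigation' },
  'Article/News': { single: 'article', list: 'articleList', crawl: 'articleNavigation' }
};

const schemas = schemaMap[category] || schemaMap['Online Store'];
const key = goal.includes('All Pages') ? 'crawl'
          : goal.includes('List') ? 'list'
          : 'single';

// e.g. { url: "...", productList: true }
return [{ json: { url: $json['Target URL'], [schemas[key]]: true } }];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The downstream HTTP Request node can then send this object as the JSON body of the Zyte API call.&lt;/p&gt;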

&lt;h3&gt;
  
  
  3. The 5 Strategies
&lt;/h3&gt;

&lt;p&gt;Based on your "Extraction Goal," the workflow automatically routes to one of 5 specific branches:&lt;/p&gt;

&lt;h4&gt;
  
  
  A. &lt;strong&gt;Single Item:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Fast execution. Scrapes details of one URL.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We send a single request to the Zyte API with our specific target schema (e.g., &lt;code&gt;product: true&lt;/code&gt;). The AI analyzes the page layout and returns a structured JSON object with the price, name, and details instantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  B. &lt;strong&gt;List (Current Page):&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Returns a clean JSON array of items found on the provided URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8gf3goep2mm652d2x25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8gf3goep2mm652d2x25.png" alt="Scrape Details AI This Page - N8N"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similar to the single item strategy, but instead of asking for one object, we request a List schema (like &lt;code&gt;productList&lt;/code&gt;). The AI identifies the repeating elements on the page and returns them as a clean array.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Design Note:&lt;/strong&gt; You might notice that the nodes for Strategies A and B look identical. That is because the heavy lifting (choosing between &lt;code&gt;product&lt;/code&gt; vs &lt;code&gt;productList&lt;/code&gt;) is actually handled upstream by the &lt;strong&gt;Config Generator&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧑‍💻 Best Practice:&lt;/strong&gt; In your own production automations, you should usually &lt;strong&gt;combine these into a single node&lt;/strong&gt; to keep your canvas clean. However, for this template, we kept them separate. This makes the logic visually intuitive and allows you to add specific post-processing (like a unique filter) to the &lt;em&gt;List&lt;/em&gt; path without accidentally breaking the &lt;em&gt;Single Item&lt;/em&gt; path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  C. &lt;strong&gt;Details (Current Page):&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;A hybrid approach. It scans the current list, finds item links, and visits them one by one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav0mtxeob8ibbw711rzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav0mtxeob8ibbw711rzx.png" alt="Scrape List AI - N8N"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use a two-step logic: first, we request a &lt;code&gt;navigation&lt;/code&gt; schema to identify all item links on the current page. Then, we split that list and use a loop to visit each URL individually to extract the full details.&lt;/li&gt;
&lt;/ul&gt;
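
&lt;p&gt;A rough sketch of that "split" step as a Code node, assuming the navigation response exposes an &lt;code&gt;items&lt;/code&gt; array of links (field names are taken from Zyte's navigation schema, but verify them against your own response):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Turn one navigation result into one n8n item per URL (sketch)
const nav = $json.productNavigation || {}; // or articleNavigation, per category
const links = nav.items || []; // assumed shape: [{ url: '...' }, ...]

return links.map(l =&amp;gt; ({ json: { url: l.url } }));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;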

&lt;h4&gt;
  
  
  D. Crawl List (All Pages):
&lt;/h4&gt;

&lt;p&gt;Activates the &lt;strong&gt;Crawler (Phase 1)&lt;/strong&gt; to loop through pagination and build a massive master list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71sejledynmnmg1ly4oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71sejledynmnmg1ly4oh.png" alt="Scrape List - n8n"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This enables the pagination loop. The workflow fetches the current page's list, saves the items to a global "Backpack" (memory), detects the "Next Page" link automatically, and loops back to repeat the process until it reaches the end.&lt;/li&gt;
&lt;/ul&gt;
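
&lt;p&gt;This is the same "Backpack" pattern we built by hand in Part 2. A condensed sketch of the idea (illustrative; the template's node differs in detail, and the navigation field names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Phase 1 "Backpack" sketch: accumulate items, pass on the next page
const staticData = $getWorkflowStaticData('global');
staticData.allItems = staticData.allItems || [];

const nav = $json.productNavigation || {};
staticData.allItems.push(...(nav.items || []));

// nextPage is assumed from Zyte's navigation schema; null ends the loop
return [{ json: { nextUrl: nav.nextPage ? nav.nextPage.url : null } }];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;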

&lt;h4&gt;
  
  
  E. Crawl Details (All Pages):
&lt;/h4&gt;

&lt;p&gt;The ultimate mode. It crawls all pages (Phase 1) AND visits every single item found (Phase 2).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx24quf6lcfjw47lwev69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx24quf6lcfjw47lwev69.png" alt="scrape details AI - n8n"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This uses our robust &lt;strong&gt;"Two-Phase" architecture&lt;/strong&gt;. Phase 1 loops through pagination specifically to map out every item URL. Once the map is complete, Phase 2 takes over to visit every single URL one by one and extract the deep data.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 3: Pipeline 2 – SERP (Search Engine Results)
&lt;/h2&gt;

&lt;p&gt;If you select "Search Engine Results" in the main form, the workflow takes a direct path to the &lt;strong&gt;SERP Node&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is a single HTTP Request node configured with the &lt;code&gt;serp&lt;/code&gt; schema.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; Your target Search URL (e.g., a query on a search engine).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Structured JSON containing organic results, ad positions, and knowledge panels.&lt;/li&gt;
&lt;/ul&gt;
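
&lt;p&gt;Conceptually, the request body is as small as this sketch (the search URL is a hypothetical placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Illustrative SERP request body
const body = {
  url: 'https://www.google.com/search?q=n8n+web+scraping', // placeholder query
  serp: true // returns structured search results instead of raw HTML
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;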

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmrzrt20779qz6v3phuj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmrzrt20779qz6v3phuj.png" alt="SERP Extraction: Scrape with n8n"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is the fastest way to get reliable &lt;strong&gt;SERP data&lt;/strong&gt; for rank tracking or brand monitoring, handling complex layouts automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Pipeline 3 – Manual / General Mode
&lt;/h2&gt;

&lt;p&gt;Sometimes you need to scrape a unique dashboard or a niche directory, or you just want to debug the raw HTML yourself. That’s why we included the &lt;strong&gt;"Manual"&lt;/strong&gt; path.&lt;/p&gt;

&lt;p&gt;If you select "General / Other" in the form, you are presented with a secondary form offering 5 raw tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Browser HTML:&lt;/strong&gt; Returns the full rendered DOM (great for the &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf#:~:text=Step%203%3A%20Extract%20the%20HTML%20content"&gt;custom parsing logic we built in Part 1&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Response Body:&lt;/strong&gt; Useful for API endpoints.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Capture:&lt;/strong&gt; Intercepts background XHR/Fetch requests (&lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365#network-capture"&gt;as we learned in Part 2&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infinite Scroll:&lt;/strong&gt; Automatically scrolls to the bottom before capturing HTML. (&lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365#infinite-scroll"&gt;see the infinite scroll guide in Part 2&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot:&lt;/strong&gt; Returns a PNG snapshot of the page. (&lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365#screenshots"&gt;view the setup steps here&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;
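
&lt;p&gt;Each option corresponds to a Zyte API flag we used in Part 2. A hedged sketch of that mapping as a Code node (the form field names are assumptions; the flags themselves come from the Part 2 examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Manual option -&amp;gt; Zyte API body (sketch; flags as used in Part 2)
const optionToBody = {
  'Browser HTML':       { browserHtml: true },
  'HTTP Response Body': { httpResponseBody: true },
  'Network Capture':    { browserHtml: true, networkCapture: [
    { filterType: 'url', value: '/api/', matchType: 'contains', httpResponseBody: true }
  ] },
  'Infinite Scroll':    { browserHtml: true, actions: [{ action: 'scrollBottom' }] },
  'Screenshot':         { screenshot: true }
};

const choice = $json['Manual Option']; // assumed form field name
return [{ json: { url: $json['Target URL'], ...optionToBody[choice] } }];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;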

&lt;p&gt;This ensures your scraping tool never leaves you stuck, even on the most obscure websites.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Result &amp;amp; Output&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Regardless of which pipeline you choose (AI, SERP, or Manual), all data converges at a final &lt;strong&gt;Data Collector&lt;/strong&gt; node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v2klg4a7j20laopexq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v2klg4a7j20laopexq3.png" alt="AI Scrape, General, Image - n8n output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use a &lt;strong&gt;Convert to File&lt;/strong&gt; node to transform that JSON into a clean &lt;strong&gt;CSV file&lt;/strong&gt; or &lt;strong&gt;Image file&lt;/strong&gt; (for screenshots), ready for download directly in the browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6cabe6x1notavtkejep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6cabe6x1notavtkejep.png" alt="n8n scrape results "&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h3&gt;
  
  
  Get the Workflow
&lt;/h3&gt;

&lt;p&gt;We have packaged this entire logic (the forms, the smart routing, the crawler loops, and the safety checks) into a single template you can import right now from the n8n community.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://n8n.io/workflows/11637-automate-data-extraction-with-zyte-ai-products-jobs-articles-and-more/" rel="noopener noreferrer"&gt;&lt;strong&gt;Automate Data Extraction with Zyte AI (Products, Jobs, Articles &amp;amp; More)&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;And that’s a wrap on our n8n web scraping series! 🎬&lt;/p&gt;

&lt;p&gt;From building your first simple scraper in &lt;strong&gt;Part 1&lt;/strong&gt;, to mastering pagination in &lt;strong&gt;Part 2&lt;/strong&gt;, we have now arrived at the ultimate goal: an &lt;strong&gt;Intelligent Scraper&lt;/strong&gt; that adapts to the web so you don't have to.&lt;/p&gt;

&lt;p&gt;You now have a tool that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gets You Data With Ease:&lt;/strong&gt; Automatically extracts structured fields (like prices, images, and articles) without you needing to hunt for CSS selectors or manage CAPTCHAs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduces Maintenance:&lt;/strong&gt; Adapts to layout changes automatically.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gives You Control:&lt;/strong&gt; Lets you switch between AI automation and manual debugging instantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This template is ready for you to fork, modify, and deploy.&lt;/p&gt;

&lt;p&gt;Thanks for joining us on this journey! If you build something cool, or if you run into a challenge that stumps you, come share it in the &lt;a href="https://discord.com/invite/eN83rMWqAt" rel="noopener noreferrer"&gt;&lt;strong&gt;Extract Data Community&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy scraping! 🚀🕷️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webscraping</category>
      <category>data</category>
      <category>python</category>
    </item>
    <item>
      <title>n8n Web Scraping || Part 2: Pagination, Infinite Scroll, Network Capture &amp; More</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Mon, 17 Nov 2025 18:15:13 +0000</pubDate>
      <link>https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365</link>
      <guid>https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365</guid>
<description>&lt;p&gt;This is Part 2 of our n8n web scraping series with &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;. If you’re new here, check out &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf"&gt;Part 1&lt;/a&gt; first; it covers the basics: &lt;code&gt;fetching pages&lt;/code&gt;, &lt;code&gt;extracting HTML&lt;/code&gt; with the &lt;em&gt;HTML node&lt;/em&gt;, &lt;code&gt;cleaning&lt;/code&gt; + &lt;code&gt;normalizing results&lt;/code&gt;, &amp;amp; exporting &lt;code&gt;CSV/JSON&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Table of Contents
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Pagination&lt;/li&gt;
&lt;li&gt;Infinite Scroll&lt;/li&gt;
&lt;li&gt;Geolocation support&lt;/li&gt;
&lt;li&gt;Screenshots from browser rendering&lt;/li&gt;
&lt;li&gt;Capturing network requests&lt;/li&gt;
&lt;li&gt;Handling cookies, sessions, headers &amp;amp; IP type&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Let’s Begin!
&lt;/h2&gt;

&lt;p&gt;In this part, we’ll explore some important scraping practices and nodes, along with a few hands-on tricks that make your web scraping journey smoother.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Everything you learn here will also lay the foundation for our 3rd &amp;amp; final part, where we will build a universal scraper capable of scraping any website with minimal configuration.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s start by taking the same workflow we built in Part 1 &amp;amp; extending it, beginning with Pagination and Infinite Scroll.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8874utbkb31wxjubie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8874utbkb31wxjubie.png" alt="N8N Scraping Workflow" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Pagination across pages
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A website can navigate in multiple ways &amp;amp; our scraper needs to adapt accordingly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;n8n gives us a default &lt;a href="https://docs.n8n.io/code/cookbook/http-node/pagination/" rel="noopener noreferrer"&gt;Pagination Mode&lt;/a&gt; inside the HTTP Request node under Options, and while it sounds convenient, it didn’t behave reliably in my experience for typical web scraping use cases.&lt;/p&gt;

&lt;p&gt;After testing several patterns, &lt;em&gt;the approach below is the one that has worked most consistently in my workflows.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💬 If you’re stuck or want to share your own approach, let’s discuss it in the &lt;a href="https://discord.com/invite/eN83rMWqAt" rel="noopener noreferrer"&gt;&lt;strong&gt;Extract Data Discord&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 1: Page Manager Node
&lt;/h4&gt;

&lt;p&gt;Before calling the HTTP Request node, we introduce a small Code node called &lt;strong&gt;Page Manager&lt;/strong&gt;, which does exactly what the name suggests: it controls the page number.&lt;/p&gt;

&lt;p&gt;Add a Code node (JavaScript) and paste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Page Manager Function ( We use this node as both starter and incrementer)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MAX_PAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// n8n provides `items` array. If no items =&amp;gt; first run&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="c1"&gt;// Check for .json.page (from this node's first run)&lt;/span&gt;
  &lt;span class="c1"&gt;// OR .json.Page (from the Normalizer node's output)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;undefined&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nc"&gt;Number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// first run (still 1)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_PAGE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="c1"&gt;// safety stop&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On the first run, it starts with &lt;code&gt;page = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Every time the loop returns here, it increments to the next page.&lt;/li&gt;
&lt;li&gt;There’s a built-in safety limit &lt;code&gt;MAX_PAGE&lt;/code&gt; so you don’t accidentally loop forever. (Adjust it accordingly.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F719ovxjuzauj6u0o1d1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F719ovxjuzauj6u0o1d1x.png" alt="Scraping Function N8N" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now update the URL in the old &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf#:~:text=Step%202%3A%20Add%20an%20HTTP%20Request%20Node"&gt;HTTP Request node&lt;/a&gt; to use the page variable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;code&gt;https://books.toscrape.com/catalogue/page-{{ $json.page }}.html&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9j84u81ohzikcb7iiag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9j84u81ohzikcb7iiag.png" alt="Pagination Scraping URL" width="800" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes the node fetch the correct page each time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The rest of the workflow remains the same, up to the second HTML Extract node (where we parsed the book name, URL, price, rating, etc. in Part 1).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 2: Modify the Normalizer Function Node to Save Results Across Pages
&lt;/h4&gt;

&lt;p&gt;In Part 1, our &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf#:~:text=Step%207%3A%20Clean%20and%20normalize%20the%20data"&gt;Step 7&lt;/a&gt; code simply cleaned and normalized items for one page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now we need it to do two things:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Normalize the results (same as before)&lt;/li&gt;
&lt;li&gt;Store the results from every page inside n8n’s global static data bucket. Think of it like temporary workflow memory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Update the node’s code with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// --- Normalizer (Code node) ---&lt;/span&gt;
&lt;span class="c1"&gt;// Get the global workflow static data bucket&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$getWorkflowStaticData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;global&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// initialize storage if needed&lt;/span&gt;
&lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="c1"&gt;// normalization logic (kept minimal version)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://books.toscrape.com/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;urlRel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imgRel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratingClass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ratingClass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;urlRel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(\.\.\/)&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;imgRel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(\.\.\/)&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;availability&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nx"&gt;rating&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// append to global storage&lt;/span&gt;
&lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// return control info for IF node (not the items)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;currentPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Page Manager&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;itemsFound&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;nextHref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nextHref&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentPage&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7fs6bgzf794rja4wy0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7fs6bgzf794rja4wy0t.png" alt="Save Data N8N Function" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We normalize the data exactly like Part 1.&lt;/li&gt;
&lt;li&gt;Then we push all normalized items into &lt;code&gt;workflowStaticData.workBooks&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Instead of returning the items themselves, we return only a small control object.&lt;/li&gt;
&lt;li&gt;This object is used by the &lt;code&gt;IF node&lt;/code&gt; to decide whether we continue scraping or stop.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Step 3: IF Node (Stop Scraping or Continue)
&lt;/h4&gt;

&lt;p&gt;Add an &lt;a href="https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.if/" rel="noopener noreferrer"&gt;&lt;code&gt;IF node&lt;/code&gt;&lt;/a&gt; with two conditions and &lt;code&gt;OR&lt;/code&gt; Type:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Condition 1:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;{{ $json.itemsFound }}&lt;/code&gt; is equal to &lt;code&gt;0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Meaning → The current page returned no items → we’ve reached the end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Condition 2:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;{{ $json.Page }}&lt;/code&gt; is greater than or equal to &lt;code&gt;YOUR_MAX_PAGE&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Meaning → Stop when you reach the max page number you set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5c3w4xg3660vc40djo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5c3w4xg3660vc40djo.png" alt="If loop node n8n" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Together these conditions help the workflow decide:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IF → True&lt;/strong&gt;&lt;br&gt;
Stop scraping and move to the export step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IF → False&lt;/strong&gt;&lt;br&gt;
Go back to the &lt;code&gt;Page Manager&lt;/code&gt;, increment the page number, and keep scraping.&lt;/p&gt;

&lt;p&gt;This creates a complete and safe pagination loop.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4: Collect All Results and Export
&lt;/h4&gt;

&lt;p&gt;When the IF node returns True, add one more small Code node before the Convert To File node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Get the global data&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$getWorkflowStaticData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;global&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Get the array of books, or an empty array if it doesn't exist&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allBooks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="c1"&gt;// Return all the books as standard n8n items&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;allBooks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;book&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynjomyhxecffj4klmpab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynjomyhxecffj4klmpab.png" alt="Data Scraping N8N" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this one does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pulls everything we stored in the temporary memory.&lt;/li&gt;
&lt;li&gt;Returns it as normal n8n items.&lt;/li&gt;
&lt;li&gt;These go straight into Convert To File → CSV.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;And that’s the entire pagination workflow.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjdz86w1ynumt4xj1a3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjdz86w1ynumt4xj1a3x.png" alt="Pagination in N8N" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Infinite Scroll
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;This one is much simpler.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some websites load content as you scroll; there are no traditional page numbers.&lt;br&gt;
The &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; supports browser actions, which makes this easy.&lt;/p&gt;

&lt;p&gt;Just add one line to our original &lt;strong&gt;cURL command&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://quotes.toscrape.com/scroll", "browserHtml": true, "actions": [{ "action": "scrollBottom" }]}' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6028qtzxo3b3klmo3llu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6028qtzxo3b3klmo3llu.png" alt="Infinite Scroll in N8N" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zyte API loads the page in a headful browser session.&lt;/li&gt;
&lt;li&gt;It scrolls to the bottom, triggering all JavaScript that loads additional items.&lt;/li&gt;
&lt;li&gt;Then it returns the final, fully loaded &lt;code&gt;browserHtml&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;You can parse this HTML normally using the same nodes from Part 1.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Geolocation
&lt;/h2&gt;

&lt;p&gt;Some websites return different data depending on your region.&lt;br&gt;
Zyte API makes this super simple by allowing you to specify a geolocation.&lt;/p&gt;

&lt;p&gt;Use this inside an HTTP Request node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "http://ip-api.com/json", "browserHtml": true, "geolocation": "AU" }' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F638etacjkchl2641595w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F638etacjkchl2641595w.png" alt="Geolocation Scraping" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting &lt;code&gt;"geolocation": "AU"&lt;/code&gt; makes Zyte perform the browser request from that region; check the list of all available &lt;a href="https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation/?utm_campaign=Discord_geo&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;CountryCodes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Many websites serve region-based content (pricing, currencies, language, product availability), so this is extremely helpful.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Screenshots
&lt;/h2&gt;

&lt;p&gt;If you’d like to grab a screenshot of what the browser rendered, you can do that too.&lt;/p&gt;

&lt;p&gt;cURL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://toscrape.com", "screenshot": true }' \
   https://api.zyte.com/v1/extract

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;It will return the screenshot as &lt;code&gt;Base64&lt;/code&gt; data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft288ynokh6slmc89w6g8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft288ynokh6slmc89w6g8.png" alt="Base64 Scraping" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To convert it into a proper image (PNG, JPEG, etc.) → Use &lt;strong&gt;Convert To File&lt;/strong&gt; node in n8n.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1ffw7a2ljtl3dd7e4yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1ffw7a2ljtl3dd7e4yu.png" alt="Scraping Screenshot" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;n8n often converts boolean values like &lt;code&gt;true&lt;/code&gt; into &lt;code&gt;"true"&lt;/code&gt; when importing via cURL.&lt;br&gt;
Fix it by clicking the gear icon → &lt;strong&gt;Add Expression&lt;/strong&gt; → &lt;code&gt;{{true}}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3b4gelimpj8u34ydx2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3b4gelimpj8u34ydx2g.png" alt="Field Scraping" width="800" height="1273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or switch body mode to &lt;strong&gt;Using JSON&lt;/strong&gt; and paste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://toscrape.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"screenshot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futluvudf5o2ja7fhfojm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futluvudf5o2ja7fhfojm.png" alt="JSON Scraping" width="800" height="1261"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Network Capture
&lt;/h2&gt;

&lt;p&gt;Many modern websites load content through background API calls rather than raw HTML.&lt;br&gt;
You can capture that network activity while the page renders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://quotes.toscrape.com/scroll", "browserHtml": true,  "networkCapture": [
        {
            "filterType": "url",
            "httpResponseBody": true,
            "value": "/api/",
            "matchType": "contains"
        }]}' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns a &lt;code&gt;networkCapture&lt;/code&gt; array with all responses whose URL contains &lt;code&gt;/api/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5r4gbnbcfgrteaqz9r5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5r4gbnbcfgrteaqz9r5.png" alt="Network Capture Scraping" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the Parameters Above&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;filterType: "url"&lt;/code&gt; ⟶ filter network requests by URL&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;value: "/api/"&lt;/code&gt; ⟶ look for URLs containing &lt;code&gt;/api/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;matchType: "contains"&lt;/code&gt; ⟶ how the value is matched against the URL&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;httpResponseBody: true&lt;/code&gt; ⟶ include the response body (Base64)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Extracting data from the captured network response&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can decode the Base64 response in two easy ways:&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;1. Using a Function node (Python)&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;&lt;em&gt;(You can also use JS if you prefer)&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get the network capture data
&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkCapture&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Decode base64 and parse JSON
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;decoded_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;httpResponseBody&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decoded_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Return the result
&lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;firstAuthor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg36qwxllt7j0rh3r82ez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg36qwxllt7j0rh3r82ez.png" alt="Decode Base64" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ This method decodes the &lt;code&gt;Base64&lt;/code&gt; encoded HTTP response, parses it as JSON, and gives you structured data directly; it's very reliable and readable.&lt;/p&gt;

&lt;h5&gt;
  
  
&lt;strong&gt;2. Using the Edit Fields Node (No Code)&lt;/strong&gt;
&lt;/h5&gt;

&lt;blockquote&gt;
&lt;p&gt;Even in this no-code method, you still need to decode and parse the data - the expression below does both&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Add an &lt;strong&gt;Edit Fields&lt;/strong&gt; node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mode:&lt;/strong&gt; Add Field&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; decodedData&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type:&lt;/strong&gt; String&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ $json.networkCapture[0].httpResponseBody.base64Decode().parseJson() }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54z28rdgbmagj6ex253v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54z28rdgbmagj6ex253v.png" alt="Decode Base64 in N8N" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ This takes the Base64 content, decodes it, parses JSON, and puts the result under &lt;code&gt;decodedData&lt;/code&gt; automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cookies, sessions, headers &amp;amp; IP type (quick guide)
&lt;/h2&gt;

&lt;p&gt;When you move from toy sites to real sites, a few extra controls matter a lot: which IP type you use, whether you keep a session, and what cookies or headers you send.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; exposes all these as request fields and you can use them the same way we used &lt;code&gt;browserHtml&lt;/code&gt;, &lt;code&gt;networkCapture&lt;/code&gt; or &lt;code&gt;actions&lt;/code&gt; above (via &lt;code&gt;curl&lt;/code&gt; → &lt;code&gt;Import in n8n HTTP Request node&lt;/code&gt; → Adjust Fields as needed → Extract).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To keep this guide focused, we won’t dive into code examples for every field here, but here’s one small one, &lt;em&gt;setting a cookie and getting it back&lt;/em&gt; (&lt;code&gt;requestCookies&lt;/code&gt;), just to show how it integrates.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cookies&lt;/strong&gt; (&lt;em&gt;via&lt;/em&gt; &lt;code&gt;requestCookies&lt;/code&gt; / &lt;code&gt;responseCookies&lt;/code&gt;)
➜ Useful when a website relies on cookies for preferences, language, or maintaining continuity between requests.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{ "url": "https://httpbin.org/cookies", "browserHtml": true,
    "requestCookies": [{ "name": "foo",
            "value": "bar",
            "domain": "httpbin.org"
        }]
}' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7x0xeg6y7gvd4n01r6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7x0xeg6y7gvd4n01r6f.png" alt="Manage Scraping Cookies" width="800" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⟶ This example uses &lt;code&gt;requestCookies&lt;/code&gt;, but &lt;code&gt;responseCookies&lt;/code&gt; works the same way: you simply read cookies from one request and pass them into the next.&lt;/p&gt;
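
&lt;p&gt;For instance, to read back the cookies a site sets, you can ask for them with the &lt;code&gt;responseCookies&lt;/code&gt; field - a minimal sketch, assuming the boolean &lt;code&gt;responseCookies&lt;/code&gt; request field (see the docs below for the exact reference):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{ "url": "https://httpbin.org/cookies/set?foo=bar", "browserHtml": true, "responseCookies": true }' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;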

&lt;p&gt;Learn more on &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#cookies/?utm_campaign=Discord_cookies&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;cookies&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Everything else below (sessions, ipType, custom headers) plugs in the same way.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sessions&lt;/strong&gt;&lt;br&gt;
➜ Sessions bundle the IP address, cookie jar, and network settings so multiple requests look consistently related. &lt;em&gt;Helpful for multi-step interactions, region-based content, or sites that hate stateless scraping.&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#sessions/?utm_campaign=Discord_sessions&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Sessions&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Headers&lt;/strong&gt;&lt;br&gt;
➜ Add a User-Agent, Referer, or any custom metadata the target site expects: simply define them inside the &lt;code&gt;HTTP Request node&lt;/code&gt; headers.&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#response-headers/?utm_campaign=Discord_headers&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Headers&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IP Type&lt;/strong&gt; (&lt;code&gt;datacenter&lt;/code&gt; vs &lt;code&gt;residential&lt;/code&gt;)&lt;br&gt;
➜ Some sites vary content based on IP type. Zyte API automatically selects the best option, but you can override it with &lt;code&gt;ipType&lt;/code&gt;, as shown in the sketch below.&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#ip-type/?utm_campaign=Discord_Iptype&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;IP Types&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
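
&lt;p&gt;For example, forcing a residential IP is a one-field change to the same request pattern we used for geolocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{ "url": "http://ip-api.com/json", "browserHtml": true, "ipType": "residential" }' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;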

&lt;blockquote&gt;
&lt;p&gt;All of these follow the same pattern we’ve already used above.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Where This Takes Us Next
&lt;/h2&gt;

&lt;p&gt;And that’s it for Part 2! 🎉&lt;/p&gt;

&lt;p&gt;We covered a lot more than just pagination: from infinite scroll &amp;amp; geolocation to screenshots, network capture, and the key request fields you’ll use while scraping sites.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What we learned isn’t a complete workflow on its own, but it builds the foundation you’ll use again and again in your scraping workflows.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Part 3, we’ll take everything one step further and combine these patterns into a universal scraper: a reusable, configurable template that can adapt to almost any site with minimal changes.&lt;/p&gt;

&lt;p&gt;Thanks for following along, and feel free to share your workflow, questions, or improvements in the &lt;a href="https://discord.com/invite/eN83rMWqAt" rel="noopener noreferrer"&gt;Extract Data Community&lt;/a&gt;. &lt;br&gt;
Happy scraping! 🕸️✨&lt;/p&gt;

</description>
      <category>programming</category>
      <category>webscraping</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Inside Common Crawl: The Dataset Behind AI Models (and Its Real World Limits)</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Thu, 30 Oct 2025 05:19:09 +0000</pubDate>
      <link>https://dev.to/extractdata/inside-common-crawl-the-dataset-behind-ai-models-and-its-real-world-limits-2eo2</link>
      <guid>https://dev.to/extractdata/inside-common-crawl-the-dataset-behind-ai-models-and-its-real-world-limits-2eo2</guid>
      <description>&lt;p&gt;You've probably heard that LLMs are "trained on data from the web." But have you ever wondered how they actually get that data?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did developers write scrapers to crawl the entire internet - &lt;em&gt;building a massive web scraping solution from scratch?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For many, the answer is simple: Common Crawl.&lt;/p&gt;

&lt;p&gt;Let’s explore what it is, how it fuels a significant portion of the AI world, and when to use it instead of building your own scraper.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Common Crawl? 🤔
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://commoncrawl.org/" rel="noopener noreferrer"&gt;Common Crawl&lt;/a&gt; is a non profit organization that has been crawling the web since 2008. Its mission is to provide free, large scale, publicly available archives of web data for researchers, developers, and organizations worldwide.&lt;/p&gt;

&lt;p&gt;Think of it as a massive, open source library of the internet. Every month, its crawler, &lt;code&gt;CCBot&lt;/code&gt;, scans billions of pages and archives them. &lt;/p&gt;

&lt;p&gt;For instance, the August 2025 crawl added 2.42 billion pages, totaling over 419 TiB of data! This data is stored in Amazon S3 buckets and is accessible to anyone for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Common Crawl Matters for AI
&lt;/h2&gt;

&lt;p&gt;A &lt;a href="https://www.mozillafoundation.org/en/blog/Mozilla-Report-How-Common-Crawl-Data-Infrastructure-Shaped-the-Battle-Royale-over-Generative-AI/" rel="noopener noreferrer"&gt;2024 Mozilla report&lt;/a&gt; found that 2/3 of 47 generative LLMs released between 2019-2023 relied on Common Crawl data.&lt;/p&gt;

&lt;p&gt;Today, &lt;a href="https://en.wikipedia.org/wiki/List_of_large_language_models" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt; lists over 80 public LLMs, and aggregators like &lt;a href="https://openrouter.ai/models" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; host 500+ models.&lt;br&gt;
Even if not all disclose their datasets, Common Crawl (and its derivatives like RefinedWeb) are &lt;em&gt;still among the most cited sources.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💬 In short: if you’ve used a modern LLM, you’ve indirectly used data from Common Crawl.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  How Common Crawl Organizes Data
&lt;/h2&gt;

&lt;p&gt;The data is offered in three main formats:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web Archive (WARC) files&lt;/strong&gt; - the rawest form, containing full HTTP responses (headers, HTML, etc.). Perfect if you need images, HTML parsing, or complete page reconstruction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web Archive Transformations (WAT files)&lt;/strong&gt; - they’re like summaries of WARC files. They contain metadata in JSON format, such as all the links on the page, HTTP headers, and response codes. This is useful if you don’t need the full page, but want structured information like which URLs link to which pages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web Extracted Text (WET files)&lt;/strong&gt; - plain text extracted from WARC files (no HTML or media). Ideal for NLP or training text-based models; see the short sketch after this list.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
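
&lt;p&gt;As a quick illustration of the WET format, here’s a minimal sketch that reads the plain-text records from a locally downloaded WET file (the filename is hypothetical; in warcio, WET text records have the &lt;code&gt;conversion&lt;/code&gt; record type):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from warcio.archiveiterator import ArchiveIterator

# Iterate a locally downloaded WET file (hypothetical filename)
with open("example.warc.wet.gz", "rb") as f:
    for record in ArchiveIterator(f):
        if record.rec_type == "conversion":  # WET plain-text records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            print(url, text[:200])
            break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;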
&lt;h2&gt;
  
  
  How to Fetch a Page from Common Crawl
&lt;/h2&gt;

&lt;p&gt;Fetching archived pages from Common Crawl is easier than it sounds.&lt;br&gt;
It’s a simple three-step process, and the logic is the same no matter what website you’re looking at.&lt;/p&gt;

&lt;p&gt;I’ve been researching good graphics cards, and while browsing I found this collection page: &lt;a href="https://computerorbit.com/collections/graphics-cards" rel="noopener noreferrer"&gt;https://computerorbit.com/collections/graphics-cards&lt;/a&gt;. It lists all sorts of GPUs and variants.&lt;/p&gt;

&lt;p&gt;Now I was curious, how does this page look inside Common Crawl’s archives?&lt;/p&gt;

&lt;p&gt;Let’s try to find and fetch its archived version.&lt;/p&gt;

&lt;p&gt;Import these first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;warcio.archiveiterator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ArchiveIterator&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: Finding the Available Common Crawl Indexes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before fetching any page, we first need to know which crawl (index) it belongs to.&lt;br&gt;
Common Crawl organizes its data into periodic crawls, for example, &lt;code&gt;CC-MAIN-2025-33&lt;/code&gt; or &lt;code&gt;CC-MAIN-2025-19&lt;/code&gt;. Each crawl ID encodes a time period: the trailing number is the ISO week of the year, so &lt;code&gt;CC-MAIN-2025-33&lt;/code&gt; is the crawl from week 33 of 2025.&lt;/p&gt;

&lt;p&gt;So first, we’ll fetch a list of available indexes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Step 1: Find a valid Common Crawl index ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_available_indexes&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetches the list of all available Common Crawl index collections.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Fetching list of available Common Crawl indexes...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;collections_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://index.commoncrawl.org/collinfo.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collections_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;cdx_indexes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cdx-api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cdx-api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;cdx_indexes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[+] Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdx_indexes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; available indexes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cdx_indexes&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[!] Error fetching collection info: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what's happening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We make a simple &lt;code&gt;HTTP&lt;/code&gt; request to &lt;code&gt;index.commoncrawl.org&lt;/code&gt; to get the list of all crawl indexes.&lt;/li&gt;
&lt;li&gt;The response includes all the &lt;strong&gt;CDX API URLs&lt;/strong&gt;; those are the entry points to query each crawl.&lt;/li&gt;
&lt;li&gt;We then sort them in reverse (newest first), so we always check the latest crawls first.&lt;/li&gt;
&lt;/ul&gt;
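
&lt;p&gt;Each entry in &lt;code&gt;collinfo.json&lt;/code&gt; looks roughly like this (illustrative; the code above keeps only the &lt;code&gt;cdx-api&lt;/code&gt; field):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "CC-MAIN-2025-33",
  "name": "August 2025 Index",
  "cdx-api": "https://index.commoncrawl.org/CC-MAIN-2025-33-index"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;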

&lt;p&gt;💡 Think of this step as checking the table of contents of a huge web archive library.&lt;/p&gt;

&lt;p&gt;It doesn’t give us the actual page yet - it only tells us which index might contain our target page.&lt;br&gt;
Once we locate that index, we’ll move to &lt;strong&gt;&lt;code&gt;data.commoncrawl.org&lt;/code&gt;&lt;/strong&gt; to download the actual HTML in the next step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Querying the Common Crawl Index&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have the list of indexes, we’ll search for our target URL in one of them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Step 2: Query the index to find the page data for the target URL ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cc_captures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Queries a specific Common Crawl Index for captures of a URL.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Querying index: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filter&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=status:200&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filename,offset,length,timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[-] No captures found for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; in this index.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;captures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[+] Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; captures in this index.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;captures&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[-] Warning: Could not query index &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what we’re doing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We query one specific index to find captures of our target URL.&lt;/li&gt;
&lt;li&gt;The parameters tell Common Crawl what we want:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;url&lt;/code&gt;: The target page.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;output=json&lt;/code&gt;: Return metadata in JSON format (not the page itself).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;filter: "=status:200"&lt;/code&gt;: Only include successful (HTTP 200) responses.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fl&lt;/code&gt;: The specific fields we need (filename, offset, length, timestamp).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The response gives us metadata about where the actual HTML is stored.&lt;/p&gt;

&lt;p&gt;Each result tells us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which WARC file to fetch (&lt;code&gt;filename&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The byte range of our page in that file (&lt;code&gt;offset&lt;/code&gt;, &lt;code&gt;length&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;When it was captured (&lt;code&gt;timestamp&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
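
&lt;p&gt;A single returned record looks roughly like this (illustrative values; note that &lt;code&gt;offset&lt;/code&gt; and &lt;code&gt;length&lt;/code&gt; come back as strings, which is why the code converts them with &lt;code&gt;int()&lt;/code&gt; later):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "filename": "crawl-data/CC-MAIN-2025-33/segments/.../warc/...-00123.warc.gz",
  "offset": "123456789",
  "length": "24567",
  "timestamp": "20250812060102"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;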

&lt;blockquote&gt;
&lt;p&gt;💡 This step was like finding a book in a massive library.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Retrieving the Archived Content&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we know where our page lives, let’s fetch it.&lt;br&gt;
Common Crawl stores everything in large WARC files on Amazon S3, but we don’t want to download a whole archive file just to read one page.&lt;/p&gt;

&lt;p&gt;So instead, we’ll use a byte-range request to fetch just the part that contains our page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Step 3: Download the raw HTML from the archive (No changes needed) ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_html_from_capture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Downloads a specific record from a WARC file and returns its HTML.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;offset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;length&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;s3_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://data.commoncrawl.org/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;range_header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytes=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Fetching data from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; with range &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;range_header&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Range&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;range_header&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;temp_filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp_warc.gz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Saving downloaded chunk to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;temp_filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iter_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[+] Successfully saved data to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;temp_filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Decompressing and parsing WARC record from local file...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;archive_iterator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ArchiveIterator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;archive_iterator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rec_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[+] Successfully parsed WARC record.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content_stream&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[!] An error occurred during the process: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Each capture record tells us which &lt;code&gt;WARC file&lt;/code&gt; our page is stored in (&lt;code&gt;filename&lt;/code&gt;) and where inside that file (&lt;code&gt;offset&lt;/code&gt;, &lt;code&gt;length&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;We build the &lt;code&gt;S3 URL&lt;/code&gt; using those values and add a Range header so we only download that small slice of the file.&lt;/li&gt;
&lt;li&gt;The downloaded chunk is temporarily saved locally as &lt;code&gt;temp_warc.gz&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;We then open it with &lt;code&gt;ArchiveIterator&lt;/code&gt;, which allows us to read the compressed archive and extract the HTML from the response record.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Now we’re finally opening the book and reading the page we were looking for, without carrying the whole library home.&lt;/p&gt;

&lt;p&gt;So, Step 1 and 2 tell us where to look, and Step 3 actually retrieves the HTML of the page, efficiently and precisely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It All Together
&lt;/h2&gt;

&lt;p&gt;Now let’s tie everything together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Main execution block ---
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Example target (replace with your site)
&lt;/span&gt;    &lt;span class="n"&gt;target_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;computerorbit.com/collections/graphics-cards&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="n"&gt;all_indexes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_available_indexes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;all_indexes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[!] Could not retrieve list of Common Crawl indexes. Exiting.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;captures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="c1"&gt;# Try the 5 most recent indexes until a result is found
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index_url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_indexes&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;captures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_cc_captures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Success! Found captures in index &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;index_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;latest_capture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;captures&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Using most recent capture from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latest_capture&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_html_from_capture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latest_capture&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[+] Saved archived HTML as page.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[!] No captures found for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; in recent crawls.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run this, you’ll get the archived HTML of your target page saved locally as &lt;code&gt;page.html&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can open that file, inspect its contents, or later build a parser around it to extract specific data (like product names or article text).&lt;/p&gt;
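
&lt;p&gt;For example, here’s a minimal parsing sketch using BeautifulSoup (assuming &lt;code&gt;beautifulsoup4&lt;/code&gt; is installed; real selectors will depend on the site’s markup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from bs4 import BeautifulSoup

# Quick sanity check on the archived page we just saved
with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

print(soup.title.string if soup.title else "no title found")
print(len(soup.find_all("a")), "links on the page")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;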

&lt;h2&gt;
  
  
  🖼️ Sample Output (Common Crawl)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukq9ig9ivbdq74tmp0xp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukq9ig9ivbdq74tmp0xp.png" alt=" " width="800" height="829"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The output you get here is historical: it reflects how the page looked when Common Crawl last captured it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For live, real time data, we’ll now look at how this compares with a scraper built using &lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte’s API&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Crawl vs. Building Your Own Scrapers: Which Should You Use?
&lt;/h2&gt;

&lt;p&gt;Common Crawl gives you access to web data at scale without worrying about proxies or blocks. Perfect for analysis, research, or benchmarking.&lt;/p&gt;

&lt;p&gt;However, it comes with its own set of challenges.&lt;/p&gt;

&lt;p&gt;The biggest one? Freshness.&lt;/p&gt;

&lt;p&gt;What if you want fresh, real time data - for example, to check which new graphics cards were added &lt;em&gt;this week&lt;/em&gt; or their &lt;em&gt;latest prices&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;your own scraper&lt;/strong&gt; makes all the difference.&lt;/p&gt;

&lt;p&gt;Let’s build a quick scraper for the same page and get the latest structured data instantly. We’ll use the &lt;code&gt;auto extract&lt;/code&gt; feature - learn more about it here: &lt;em&gt;&lt;a href="https://docs.zyte.com/zyte-api/usage/extract/#zyte-api-automatic-extraction" rel="noopener noreferrer"&gt;Zyte API automatic extraction&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ZYTE_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;product_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://computerorbit.com/collections/graphics-cards&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;api_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.zyte.com/v1/extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;productList&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;productList&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;output_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;z_products.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Products saved to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🖼️ Sample Output (Fresh Data Scraper)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngfyy1lu4gxq7our5ooe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngfyy1lu4gxq7our5ooe.png" alt=" " width="800" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, you can clearly see the fresh products and updated prices that weren’t present in Common Crawl’s archive.&lt;/p&gt;

&lt;p&gt;And while freshness is one of the biggest challenges with Common Crawl, it’s not the only one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Challenges with Common Crawl
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Duplicate Data&lt;/strong&gt;&lt;br&gt;
Common Crawl captures the same pages across multiple crawls, sometimes hundreds of times. This means a lot of duplicate data that needs deduplication before use (a minimal sketch follows below).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;“&lt;a href="https://www.deepseek.com/en" rel="noopener noreferrer"&gt;DeepSeek&lt;/a&gt; alone removed nearly 90% of repeated content across 91 Common Crawl dumps, just so it could train on high quality, diverse text.”  &lt;a href="https://arxiv.org/html/2401.02954v1#:~:text=Section%C2%A06.-,2,Deduplication%20ratios%20for%20various%20Common%20Crawl%20dumps.,-In%20the%20filtering" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
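&lt;p&gt;Here’s a minimal sketch of that deduplication step (assuming you’ve already parsed captures into URL + HTML pairs; real pipelines usually normalize URLs and hash cleaned text instead):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

def dedupe(records):
    """Keep only the first occurrence of each (URL, content digest) pair."""
    seen = set()
    unique = []
    for url, html in records:
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if (url, digest) not in seen:
            seen.add((url, digest))
            unique.append((url, html))
    return unique

# The same page captured in two different crawls collapses to one record
records = [("https://example.com/page", "same body"),
           ("https://example.com/page", "same body")]
print(len(dedupe(records)))  # 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;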

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Messy Data&lt;/strong&gt;&lt;br&gt;
WARC files often contain ads, cookie banners, or partial HTML responses. You’ll need heavy preprocessing and filtering to get clean text or structured data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;br&gt;
Common Crawl data is measured in &lt;em&gt;petabytes&lt;/em&gt;. Great for large research labs, but not always practical for smaller projects or individual developers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bias&lt;/strong&gt;&lt;br&gt;
Crawl frequency and seed URLs shape what gets captured.&lt;br&gt;&lt;br&gt;
So, some domains or regions are overrepresented while others barely appear.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Together, these make Common Crawl an incredible but challenging dataset: excellent for research and experimentation, but rarely “&lt;em&gt;plug and play&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;And that’s where &lt;strong&gt;your own scrapers&lt;/strong&gt; really shine. Let’s compare the two approaches side by side.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Use Common Crawl if...&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Use Scraping APIs / Your Own Crawlers if...&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;You need &lt;strong&gt;vast amounts of raw, non specific data&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;You need &lt;strong&gt;fresh, up to date data&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You’re doing &lt;strong&gt;research, LLM pretraining&lt;/strong&gt;, or large scale analysis&lt;/td&gt;
&lt;td&gt;You want &lt;strong&gt;consistent completeness&lt;/strong&gt; (e.g., every product or listing captured)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You want access to &lt;strong&gt;historical web archives&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;You need &lt;strong&gt;structured data outputs&lt;/strong&gt; like JSON or CSV&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You’re exploring &lt;strong&gt;academic or experimental projects&lt;/strong&gt; that don’t require perfection&lt;/td&gt;
&lt;td&gt;You’re &lt;strong&gt;targeting specific sites or datasets&lt;/strong&gt; for production use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Common Crawl is one of the most fascinating resources on the web - a time capsule of billions of pages, freely available for anyone to explore. It’s the foundation of countless research projects, datasets, and even large language models.&lt;/p&gt;

&lt;p&gt;But as we’ve seen, it’s not perfect.&lt;/p&gt;

&lt;p&gt;Its data is archived, not live, and working with it often means dealing with duplicates, noise, and scale. That’s fine if your goal is analysis or experimentation, but not if you need production grade, real time insights.&lt;/p&gt;

&lt;p&gt;That’s where modern scraping solutions like &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte’s API&lt;/a&gt; make all the difference. Instead of wading through terabytes of historical data, you can fetch fresh, structured, ready to use information from the web in seconds.&lt;br&gt;
&lt;em&gt;It’s all about picking the right tool for the job.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you enjoyed this, you’ll fit right in at the &lt;a href="https://discord.com/invite/eN83rMWqAt" rel="noopener noreferrer"&gt;&lt;strong&gt;Extract Data Discord&lt;/strong&gt;&lt;/a&gt; - a 20,000+ strong community of builders, scrapers, and data nerds exploring the web together.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thank you for reading! 😄&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>webscraping</category>
      <category>coding</category>
    </item>
    <item>
      <title>Web Scraping with n8n | Part 1: Build Your First Web Scraper</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Fri, 17 Oct 2025 13:10:41 +0000</pubDate>
      <link>https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf</link>
      <guid>https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf</guid>
      <description>&lt;h2&gt;
  
  
  What it will cover!
&lt;/h2&gt;

&lt;p&gt;If you’ve ever wished you could automate scraping without setting up a bunch of scripts, proxies, or browser logic, you're in the right place.&lt;/p&gt;

&lt;p&gt;We’ll use &lt;a href="https://n8n.io/" rel="noopener noreferrer"&gt;n8n&lt;/a&gt;, the low code automation tool, together with &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; to fetch structured data from &lt;a href="https://books.toscrape.com/" rel="noopener noreferrer"&gt;https://books.toscrape.com/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By the end, you’ll have a workflow that runs on its own, giving you clean JSON or CSV output of all books - their names, prices, ratings, and images. And a setup you can easily adapt for other publicly available or test websites with similar layouts.&lt;/p&gt;

&lt;p&gt;Let’s get scraping!&lt;/p&gt;

&lt;h2&gt;
  
  
  The game plan:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fetch the page using Zyte API (it handles rendering &amp;amp; manages blocks automatically)&lt;/li&gt;
&lt;li&gt;Extract HTML content inside n8n&lt;/li&gt;
&lt;li&gt;Parse book elements with CSS selectors&lt;/li&gt;
&lt;li&gt;Clean and normalize the data&lt;/li&gt;
&lt;li&gt;Export results as JSON or CSV&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, let’s get n8n ready to roll.&lt;br&gt;
You can set it up for free locally or in the cloud, whichever you prefer.&lt;br&gt;
If you’re going local, install it via &lt;a href="https://docs.n8n.io/hosting/installation/docker/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; or &lt;a href="https://docs.n8n.io/hosting/installation/npm/" rel="noopener noreferrer"&gt;npm&lt;/a&gt; - it only takes a few commands.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Once it’s up, the steps below will work exactly the same whether you’re using n8n Desktop or n8n Cloud.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Step 1: Create a new workflow in n8n
&lt;/h2&gt;

&lt;p&gt;After logging in, create a new workflow.&lt;br&gt;
Name it something like "&lt;strong&gt;Book Catalog Scraper&lt;/strong&gt;" - you can always tweak the same workflow later for similar pages or categories.&lt;br&gt;
This blank canvas is where all your nodes will live.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furbbnqgvo5n3xtw2udkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furbbnqgvo5n3xtw2udkp.png" alt="N8N Blank Canvas" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Add an HTTP Request Node
&lt;/h2&gt;

&lt;p&gt;We’ll use the &lt;a href="https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.httprequest/" rel="noopener noreferrer"&gt;HTTP Request node&lt;/a&gt; to call the Zyte API.&lt;/p&gt;

&lt;p&gt;We’ll use cURL to configure this node. Click on Import cURL, then paste the following command and hit Import.&lt;br&gt;
(Don’t forget to replace the API key with your own, and change the URL if you’d like.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html", "browserHtml": true}' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once imported, you’ll see the node fields automatically populated.&lt;br&gt;
&lt;strong&gt;Note:&lt;/strong&gt; When you import via cURL, n8n often converts boolean values like true into the string "true".&lt;br&gt;
To fix this, click the little gear icon → “Add Expression” next to the value and set it to {{true}}.&lt;br&gt;
This is especially required for the browserHtml field: it ensures the Zyte API receives a real boolean, not a string.&lt;/p&gt;
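
&lt;p&gt;After that fix, the JSON body n8n sends to Zyte API should effectively look like this (a real boolean, not a string):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "url": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
  "browserHtml": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;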

&lt;p&gt;Now hit Execute Node, and you should see a JSON response with a big block of HTML inside the "browserHtml" field.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hyj0thvoyknx6999let.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hyj0thvoyknx6999let.png" alt="HTTP Request Node N8N Zyte API" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: Extract the HTML content
&lt;/h2&gt;

&lt;p&gt;Next, add an &lt;strong&gt;&lt;a href="https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.set/" rel="noopener noreferrer"&gt;Edit Fields&lt;/a&gt;&lt;/strong&gt; node (previously called Set node) to isolate that browserHtml content.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mode:&lt;/strong&gt; Add Field&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; &lt;code&gt;data&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value:&lt;/strong&gt; &lt;code&gt;{{$json["browserHtml"]}}&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39wjrwumiomitg1yjhet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39wjrwumiomitg1yjhet.png" alt="Extract HTML N8N" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This gives us a clean &lt;code&gt;data&lt;/code&gt; field containing just the HTML we need.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 4: Parse book elements
&lt;/h2&gt;

&lt;p&gt;Add the HTML node ( Extract HTML Content ).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source Data:&lt;/strong&gt; &lt;code&gt;data&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key:&lt;/strong&gt; books&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSS Selector:&lt;/strong&gt; &lt;code&gt;article.product_pod&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return Array:&lt;/strong&gt; ✅ Enabled&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return Value:&lt;/strong&gt; HTML&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run it once, and you’ll see a new field - books - containing an array where each item represents a single book’s HTML block.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah4fooetbqafbmy46o0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah4fooetbqafbmy46o0u.png" alt="Parse book elements" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have one array with multiple products, each ready to be parsed individually in the next step.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 5: Split the list into items
&lt;/h2&gt;

&lt;p&gt;Now we’ll process each product individually.&lt;br&gt;
Add the &lt;a href="https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.splitout/" rel="noopener noreferrer"&gt;Split Out node&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fields To Split Out:&lt;/strong&gt; &lt;code&gt;books&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now each book becomes its own item for extraction. This makes it easier to handle or filter each record separately later on.&lt;/p&gt;

&lt;p&gt;(You can skip this step if you only need a quick one-shot export, but keeping it helps if you plan to scale or tweak the workflow later.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwazieuukqo73ujiwbxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnwazieuukqo73ujiwbxx.png" alt="Split Out N8N" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 6: Extract product details
&lt;/h2&gt;

&lt;p&gt;Add another HTML node ( Extract HTML Content ) to grab the details inside each product.&lt;/p&gt;

&lt;p&gt;Extraction Values:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key&lt;/th&gt;
&lt;th&gt;CSS Selector&lt;/th&gt;
&lt;th&gt;Return Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;h3 a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Attribute → title&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;url&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;h3 a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Attribute → &lt;code&gt;href&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;price&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.price_color&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;availability&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.instock.availability&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rating&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;p.star-rating&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Attribute → &lt;code&gt;class&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;image&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.image_container img&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Attribute → &lt;code&gt;src&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Hit Execute, and you’ll get structured JSON for each book.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaeokllxbs8p37qqjm7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaeokllxbs8p37qqjm7e.png" alt="Extract product details" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 7: Clean and normalize the data
&lt;/h2&gt;

&lt;p&gt;We’ll make sure URLs and image links are full paths, and rating classes are readable.&lt;br&gt;
Add a &lt;strong&gt;&lt;a href="https://docs.n8n.io/code/code-node/" rel="noopener noreferrer"&gt;Code node&lt;/a&gt;&lt;/strong&gt; ( Code in JavaScript ) and paste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://books.toscrape.com/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;urlRel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imgRel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratingClass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ratingClass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;urlRel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(\.\.\/)&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;imgRel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(\.\.\/)&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;availability&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="nx"&gt;rating&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Config&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mode:&lt;/strong&gt; Run Once for All Items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language:&lt;/strong&gt; JavaScript&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;br&gt;
You can tweak this logic based on your own site or data structure, for instance, you might want to clean extra fields, adjust paths differently, or skip this step entirely if your data’s already in the format you want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt0bqb5a6f27br5n7n8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt0bqb5a6f27br5n7n8q.png" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now your output will have clean, structured data, ready to export or feed into your next automation.&lt;/p&gt;
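&lt;p&gt;For reference, each item leaving the Code node now looks roughly like this (values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "A Sample Book Title",
  "url": "https://books.toscrape.com/catalogue/a-sample-book_1/index.html",
  "image": "https://books.toscrape.com/media/cache/example.jpg",
  "price": "£45.17",
  "availability": "In stock",
  "rating": "Three"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;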

&lt;h2&gt;
  
  
  Step 8: Export your data the way you want
&lt;/h2&gt;

&lt;p&gt;Now that your data is clean and structured, let’s turn it into a downloadable file, whether that’s CSV, .txt, or something else.&lt;/p&gt;

&lt;p&gt;Finally, drop in the &lt;strong&gt;&lt;a href="https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.converttofile/" rel="noopener noreferrer"&gt;Convert to File node&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This node takes your structured data and converts it into different file types.&lt;/p&gt;

&lt;p&gt;Here’s how to configure it:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuaghqogtcv8t425u8ga.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxuaghqogtcv8t425u8ga.png" alt="Convert Node N8N" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once done, click Execute Node and you’ll see a binary output with your file ready to download.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;And that’s it - we just built a full web scraping workflow in n8n, powered by the Zyte API.&lt;/p&gt;

&lt;p&gt;You’ve just automated a complete workflow - fetching, parsing, cleaning, and exporting - all visually inside n8n.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8874utbkb31wxjubie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8874utbkb31wxjubie.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This same flow can be easily tweaked for other pages: just change the URL, update your selectors, and you’re good to go.&lt;/p&gt;

&lt;p&gt;In the next part, we’ll take this further and scrape multiple pages automatically by adding pagination logic.&lt;/p&gt;

&lt;p&gt;Stay tuned, thanks for reading!😄&lt;/p&gt;

</description>
      <category>programming</category>
      <category>webscraping</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Supercharge Your AI Agents with a Custom RAG Pipeline Powered by Live Web Data</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Fri, 19 Sep 2025 18:41:16 +0000</pubDate>
      <link>https://dev.to/extractdata/supercharge-your-ai-agents-with-a-custom-rag-pipeline-powered-by-live-web-data-57fl</link>
      <guid>https://dev.to/extractdata/supercharge-your-ai-agents-with-a-custom-rag-pipeline-powered-by-live-web-data-57fl</guid>
      <description>&lt;p&gt;Just think for a while, what if you could fed any web page data to your AI agent, to just get you the exact info, answer or the summary of the content you're looking?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Actually, you can do that with ease using Scrapy + Zyte API&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Meet Fab 👨‍💻&lt;/strong&gt;&lt;br&gt;
Fab’s a dev with years of experience. Lately, he’s been diving into finance, learning about promising stocks. But here’s the problem: keeping up with daily news and press releases - scrolling through 10 articles and updates every morning - is hectic and manual.&lt;/p&gt;

&lt;p&gt;So Fab decided to build an AI Agent that does it for him - fetching, reading, and summarizing everything in real time.&lt;/p&gt;

&lt;p&gt;That’s basically a custom RAG pipeline, powered by live web data &amp;amp; no longer limited to static PDFs or outdated docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why bother?&lt;/strong&gt;&lt;br&gt;
Because even the smartest AI agent is only as good as the &lt;strong&gt;data it can access&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLMs have knowledge cutoffs&lt;/li&gt;
&lt;li&gt;Real-time, domain-specific data (like finance) is crucial for decision making&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By tapping into live web data, Fab’s agent can keep up with the world as it happens - always relevant, always ready.&lt;/p&gt;

&lt;p&gt;But hold up ✋ - summarizing or answering isn’t the same as taking real actions. That’s where AI Agents and Agentic AI differ.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtmmsd06encoa8wd8ebb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtmmsd06encoa8wd8ebb.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI Agents&lt;/strong&gt; are software systems designed to automate specific, well defined tasks, like chatbots, email sorting tools, or voice assistants, usually based on predefined tools or prompts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic AI&lt;/strong&gt;, on the other hand, has a broader scope of autonomy.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;What we’ll walk through here is technically an AI Agent, but since both share the same foundation, it could evolve into Agentic AI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Fab's Toolkit 🛠️
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.scrapy.org/" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt;&lt;/strong&gt; → for structured data extraction&lt;br&gt;
&lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_ai_agents&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;&lt;/strong&gt; → to handle dynamic &amp;amp; complex websites&lt;br&gt;
&lt;strong&gt;DuckDuckGo&lt;/strong&gt; + &lt;strong&gt;yfinance&lt;/strong&gt; → for extra search and finance insights&lt;br&gt;
&lt;strong&gt;&lt;a href="https://docs.agno.com/introduction" rel="noopener noreferrer"&gt;Agno&lt;/a&gt;&lt;/strong&gt; → to orchestrate a multi-agent workflow&lt;br&gt;
&lt;strong&gt;&lt;a href="https://console.groq.com/docs/overview" rel="noopener noreferrer"&gt;GroqCloud&lt;/a&gt;&lt;/strong&gt; → lightning fast LLM inference&lt;/p&gt;
&lt;h2&gt;
  
  
  The Architecture 🏗️
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feum27z9gvilxy0chcjdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feum27z9gvilxy0chcjdk.png" alt=" " width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Scrapy + Zyte API?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You could try doing this with just Scrapy and rotating proxies. But anyone who has scraped at scale knows the pain: blocks, captchas, failed requests.&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_ai_agents&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;&lt;/strong&gt; shines. It offloads the heavy lifting, so you don’t have to babysit your scrapers, you just get clean, structured data.&lt;/p&gt;

&lt;p&gt;Think of it like having a dedicated backend team making sure your spiders never get stuck.&lt;/p&gt;
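&lt;p&gt;Concretely, wiring Zyte API into Scrapy is mostly a settings change via the &lt;a href="https://github.com/scrapy-plugins/scrapy-zyte-api" rel="noopener noreferrer"&gt;scrapy-zyte-api&lt;/a&gt; plugin. Here’s a minimal sketch of the typical setup (double check the plugin docs for the current setting names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# settings.py - minimal scrapy-zyte-api setup (sketch, not the full project config)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
}
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
ZYTE_API_KEY = "YOUR_API_KEY"
ZYTE_API_TRANSPARENT_MODE = True  # route every Scrapy request through Zyte API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;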
&lt;h2&gt;
  
  
  Data Collection the Right Way! 📥
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Instead of scraping everything, Fab’s agent first collects URLs only... then fetches only the important data based on a trend score.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To handle this efficiently, Fab designed a Scrapy project with one base spider and four specialized spiders for fetching:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;News&lt;/li&gt;
&lt;li&gt;Press releases&lt;/li&gt;
&lt;li&gt;Transcripts&lt;/li&gt;
&lt;li&gt;Comments&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The base spider takes care of site specific scraping by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetching URLs and metadata&lt;/li&gt;
&lt;li&gt;Cleaning and normalizing dates&lt;/li&gt;
&lt;li&gt;Generating unique IDs from URLs
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urlunparse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseFinanceSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base_finance_spider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finance-example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Normalize URLs&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;urlunparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fragment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate unique ID from cleaned URL&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clean_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;convert_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Convert relative dates like &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Today&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Yesterday&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to ISO&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Yesterday&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Today&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# For demo, we skip complex parsing
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each specialized spider inherits from the base spider and focuses on &lt;strong&gt;site specific logic&lt;/strong&gt;: navigating pages and extracting the key information for its data type.&lt;/p&gt;
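&lt;p&gt;To make that concrete, here’s a minimal sketch of what one such spider could look like (the start URL and selectors are placeholders matching the demo domain above, not a real site’s markup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class NewsSpider(BaseFinanceSpider):
    """Collects only URLs + metadata for news - full pages are fetched later."""
    name = "news_spider"
    start_urls = ["https://finance-example.com/news"]

    def parse(self, response):
        # Placeholder selectors - adjust to the target site's real markup
        for card in response.css("article.news-card"):
            yield {
                "type": "news",
                "url": response.urljoin(card.css("a::attr(href)").get("")),
                "title": card.css("a::text").get("").strip(),
                "date": self.convert_date(card.css(".date::text").get("")),
            }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;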

&lt;p&gt;At this stage, three of the specialized spiders collect only URLs and metadata, creating a JSON list for each data type. Comments are the exception: we scrape those right away. Think of it as preparing a “to do list” of pages for Fab’s agent to process later, keeping things organized and efficient.&lt;/p&gt;

&lt;p&gt;When items are yielded, Scrapy Pipelines automatically handle the cross cutting tasks: URL normalization and ID assignment, deduplication, anonymization, comment linking, and saving items to JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UrlNormalizationPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clean_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DeduplicationPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seen_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seen_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;DropItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Duplicate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seen_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AnonymizationPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Mask authors, publishers, or usernames
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;JsonFileExportPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Save item to JSON file (with intermediate saves)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What Each Spider Produces →&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frytddvhzq7ac6ff7uujg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frytddvhzq7ac6ff7uujg.png" alt="News JSON Sample" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79tygtsh2zzon3acig93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79tygtsh2zzon3acig93.png" alt="Press Releases JSON Sample" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn7fhz54r8nxnikhyjn1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn7fhz54r8nxnikhyjn1.png" alt="Transcript JSON Sample" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6337z3chbty5uf2c20z4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6337z3chbty5uf2c20z4.png" alt="Comments JSON Sample" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the URLs and metadata are collected, Fab’s agent performs trend analysis, using comments as a central indicator to prioritize which pages to fetch in full.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trend Analysis 📈
&lt;/h2&gt;

&lt;p&gt;Now that we’ve gathered articles ( news, press releases, transcripts ) and comments, the next step is figuring out which topics are actually trending. Collecting raw content is only half the job, what makes it valuable is knowing where attention is going.&lt;/p&gt;

&lt;p&gt;For this, we built a Trend Calculator. Its job is to take all the articles and comments we collected, connect them together, and then assign each article a trend score. The score is based on a few simple but powerful signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comment activity&lt;/strong&gt; – Articles with more comments get higher scores (up to a cap, so one viral post doesn’t skew everything).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mentions inside comments&lt;/strong&gt; – If people are discussing one article inside the comments of another, that’s a sign of influence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Freshness&lt;/strong&gt; – Recent articles get a bonus since trends fade quickly over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross source validation&lt;/strong&gt; – If the same topic shows up across multiple sources (like news and press releases), it’s likely important.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engagement quality&lt;/strong&gt; – Longer, more thoughtful comments add extra weight compared to short ones.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example of scoring logic
&lt;/span&gt;&lt;span class="n"&gt;comment_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comment_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;mention_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comment_mentions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;date_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_date_bonus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  
&lt;span class="n"&gt;source_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sources&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  
&lt;span class="n"&gt;engagement_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;quality_from_comments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;  

&lt;span class="n"&gt;trend_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;comment_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;mention_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;date_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;source_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;engagement_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each factor contributes points that add up to a final &lt;code&gt;trend_score&lt;/code&gt;, showing how much traction an article has.&lt;/p&gt;
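
&lt;p&gt;Of the helpers above, only &lt;code&gt;calculate_date_bonus&lt;/code&gt; involves any date math. As a rough sketch of the freshness idea (the exact decay curve is an assumption for illustration, not the project’s actual code), a bonus that fades to zero over a week could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timezone

def calculate_date_bonus(date_str, max_bonus=3):
    # Freshness bonus: full points today, fading to zero over ~7 days
    try:
        published = datetime.fromisoformat(date_str)
    except (TypeError, ValueError):
        return 0  # missing or non-ISO date: no bonus
    if published.tzinfo is None:
        published = published.replace(tzinfo=timezone.utc)
    age_days = (datetime.now(timezone.utc) - published).days
    return max(0, max_bonus - age_days * max_bonus / 7)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;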

&lt;p&gt;Here’s the flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Link comments to articles&lt;/strong&gt; – Attach every comment to its article.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;article_comments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comment_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;article_comments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Score calculation&lt;/strong&gt; - For every article, the calculator looks at the signals above and assigns points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ranking&lt;/strong&gt; - Articles are sorted by score so we can clearly see which ones are rising in popularity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering&lt;/strong&gt; - We keep only those above a threshold score (say 5 or 10) to cut out noise; see the sketch after this list.&lt;/li&gt;
&lt;/ol&gt;
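
&lt;p&gt;Ranking and filtering can live together in one small method. Here’s a minimal sketch of what &lt;code&gt;get_top_articles&lt;/code&gt; could look like (the &lt;code&gt;self.articles&lt;/code&gt; attribute and exact shape are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def get_top_articles(self, threshold=5.0, limit=100):
    # Sort all scored articles, highest trend_score first
    ranked = sorted(self.articles, key=lambda a: a.get('trend_score', 0), reverse=True)
    # Keep only those above the noise threshold, capped at `limit`
    return [a for a in ranked if a.get('trend_score', 0) &amp;gt;= threshold][:limit]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;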

&lt;p&gt;Finally, the output is saved as JSON for later use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_top_articles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trending_articles.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trending&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, we’re not just storing a list of articles; we’re turning them into insights about what’s gaining traction in real time.&lt;/p&gt;

&lt;p&gt;The output of this step is a &lt;code&gt;trending_articles.json&lt;/code&gt; file: a ranked list of articles with their comment signals attached. Next, we’ll take this list and extract the full article content for deeper processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing the Articles 📑
&lt;/h2&gt;

&lt;p&gt;Alright, time to move past the signals and actually grab the article content. This is where Fab’s agent pulls in the full text so it can finally be read, summarized, and acted on; the real scraping and processing begins here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Smart Extraction with Zyte API&lt;/strong&gt;&lt;br&gt;
Instead of scraping blindly, we run each article URL through Zyte API. It tries multiple strategies under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser rendering&lt;/strong&gt; for rich pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP response fallback&lt;/strong&gt; if the first pass fails.&lt;/li&gt;
&lt;li&gt;And if all else fails → a &lt;strong&gt;graceful fallback object&lt;/strong&gt; that notes the article couldn’t be extracted (paywalls, login walls, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Caching is baked in so we don’t re-download the same article twice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_article_with_zyte_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_cached&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;get_from_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Try browser mode first, fallback to HTTP
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;extract_with_browser_simple&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extract_with_http_response&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;method&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;save_to_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create_fallback_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
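
&lt;p&gt;The cache helpers aren’t shown above, so here’s one minimal way they could work: a local directory with one JSON file per URL, keyed by a hash (the &lt;code&gt;article_cache&lt;/code&gt; path is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json
import os

CACHE_DIR = "article_cache"  # assumed local cache location

def _cache_path(url):
    # One JSON file per URL, named by a stable hash of the URL
    return os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest() + ".json")

def is_cached(url):
    return os.path.exists(_cache_path(url))

def get_from_cache(url):
    with open(_cache_path(url)) as f:
        return json.load(f)

def save_to_cache(url, article):
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(_cache_path(url), "w") as f:
        json.dump(article, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;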



&lt;p&gt;&lt;strong&gt;Step 2: Batch Processing ⚡&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s not good to hammer a site with 50 requests at once, so Fab’s agent scrapes articles in small batches. This keeps things stable, avoids rate limits, and lets us resume midway if anything fails.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scraped_articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_articles_in_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
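
&lt;p&gt;Under the hood, a batching helper like that can be just a loop with a polite pause between batches. A minimal sketch (the &lt;code&gt;pause_seconds&lt;/code&gt; value is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

def process_articles_in_batches(urls, batch_size=3, pause_seconds=5):
    results = []
    for i in range(0, len(urls), batch_size):
        # Scrape one small batch at a time to stay stable and avoid rate limits
        for url in urls[i:i + batch_size]:
            results.append(extract_article_with_zyte_api(url))
        if i + batch_size &amp;lt; len(urls):
            time.sleep(pause_seconds)  # polite pause before the next batch
    return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;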



&lt;p&gt;&lt;strong&gt;Step 3: Comments + Anonymization&lt;/strong&gt;&lt;br&gt;
Once the raw articles are in, we attach their associated comments (collected earlier) and anonymize usernames. That way, Fab can see the discussion signals without worrying about leaking personal data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;matching_trending&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processing_anonymizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anonymize_comments_in_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
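
&lt;p&gt;The anonymizer itself isn’t shown here; a minimal version could replace each username with a stable pseudonym, so repeat commenters stay recognizable without exposing who they are (the field names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

def anonymize_comments_in_article(article):
    # Swap usernames for stable pseudonyms derived from a hash
    for comment in article.get('comments', []):
        user = comment.get('username', '')
        comment['username'] = 'user_' + hashlib.sha1(user.encode()).hexdigest()[:8]
    return article
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;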



&lt;p&gt;&lt;strong&gt;Step 4: Summarization with LLMs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, each article is summarized using Groq + Llama 3.3, with comments included in the context. The prompt ensures Fab gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A clear content-type tag ([Complete article with comments], [Partial article], etc.).&lt;/li&gt;
&lt;li&gt;The main points of the article.&lt;/li&gt;
&lt;li&gt;Highlights from user comments (agreements, debates, sentiment).&lt;/li&gt;
&lt;li&gt;A note if the article looked incomplete or truncated.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summarize_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
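
&lt;p&gt;That one-liner hides the prompt assembly. Roughly, the prompt bundles the article text and its comments together with the formatting rules above; the sketch below is illustrative wording, not the exact prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def build_summary_prompt(article):
    # Give the LLM the article plus its discussion as one context block
    comments = "\n".join(c.get('text', '') for c in article.get('comments', []))
    return (
        "Summarize the following article. Start with a content-type tag such as "
        "[Complete article with comments] or [Partial article]. List the main "
        "points, highlight agreements, debates, and sentiment from the comments, "
        "and note if the article looks incomplete or truncated.\n\n"
        f"ARTICLE:\n{article.get('content', '')}\n\nCOMMENTS:\n{comments}"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;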



&lt;p&gt;At this point, we’ve gone from just links + scores → full articles + anonymized comments + structured summaries.&lt;/p&gt;

&lt;p&gt;This is the real handoff moment: the dataset is now clean, safe, and AI-ready. Time to combine this with other data sources...&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning Raw Summaries into Something Useful
&lt;/h2&gt;

&lt;p&gt;So we’ve got cleaned-up, summarized articles sitting neatly in JSON. That’s cool, but Fab doesn’t just want a folder full of summaries; he wants an agent that can reason over them, combine them with live market data, and give him answers on demand.&lt;/p&gt;

&lt;p&gt;That’s exactly what Agno will be used for. Agno is a framework for building LLM-powered agents where everything revolves around tools. We use some ready-made tools, like yfinance for market data or DuckDuckGo for quick searches, and we’ll create our own custom tool using the scraped and summarized articles we’ve collected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Custom Data as a Tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We wrap our summaries into a &lt;code&gt;CustomDataTools&lt;/code&gt; class. This behaves just like any other tool in Fab’s agent, except instead of calling an external API, it pulls directly from our private dataset of scraped articles.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load summaries from the &lt;code&gt;article_summaries.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Filter them by stock ticker (NVDA in our case).&lt;/li&gt;
&lt;li&gt;Format them into a neat digest with truncation rules so we don’t blow past token limits.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomDataTools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Toolkit&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_custom_financial_summaries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stock_ticker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NVDA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;summaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_scraped_summaries&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format_summaries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summaries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stock_ticker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stock_ticker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
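
&lt;p&gt;The loading and formatting behind those two calls can stay simple. Here’s a sketch of &lt;code&gt;format_summaries&lt;/code&gt; with a crude truncation rule (the character cap and field names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MAX_CHARS = 1200  # assumed per-summary cap to stay inside token limits

def format_summaries(self, summaries, stock_ticker="NVDA"):
    # Keep only summaries that mention the ticker, truncated to a safe length
    relevant = [s for s in summaries if stock_ticker in s.get('summary', '')]
    lines = [f"- {s.get('title', 'Untitled')}: {s['summary'][:MAX_CHARS]}"
             for s in relevant]
    return "\n".join(lines) or f"No scraped summaries found for {stock_ticker}."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;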



&lt;p&gt;&lt;strong&gt;Step 2: Mixing with External Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Of course, Fab doesn’t live on summaries alone. He still needs real-time signals like stock prices, analyst ratings, and fresh search results. That’s where we combine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;yfinance → live stock + fundamentals&lt;/li&gt;
&lt;li&gt;DuckDuckGo → fresh search&lt;/li&gt;
&lt;li&gt;Our custom summaries → curated, domain-specific insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now the agent has both breadth (search + finance APIs) and depth (our private dataset).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Building the Agent 🧑‍💻&lt;/strong&gt;&lt;br&gt;
With Agno, stitching it together is dead simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick a model (Groq’s Llama 3.3 for speed, or Ollama locally if Fab prefers).&lt;/li&gt;
&lt;li&gt;Load the toolset (custom data first, then finance APIs, then search).&lt;/li&gt;
&lt;li&gt;Add guardrails: focus on NVDA, prefer bullet points, flag stale data, cite sources.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1️⃣ Import models and tools
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.models.groq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Groq&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.models.ollama&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Ollama&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Toolkit&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.tools.yfinance&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YFinanceTools&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agno.tools.duckduckgo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DuckDuckGoTools&lt;/span&gt;

&lt;span class="c1"&gt;# 2️⃣ Define a custom tool for our scraped summaries ( given above ) 
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomDataTools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Toolkit&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# 3️⃣ Configure agent tools
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;CustomDataTools&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;          &lt;span class="c1"&gt;# Priority: private scraped data
&lt;/span&gt;        &lt;span class="nc"&gt;YFinanceTools&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;            &lt;span class="c1"&gt;# Priority: live stock &amp;amp; fundamentals
&lt;/span&gt;        &lt;span class="nc"&gt;DuckDuckGoTools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;           &lt;span class="c1"&gt;# Priority: fresh search results
&lt;/span&gt;    &lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# 4️⃣ Pick a model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Groq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3.3-70b-versatile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Or Ollama if you prefer local
&lt;/span&gt;
&lt;span class="c1"&gt;# 5️⃣ Create the unified agent
&lt;/span&gt;&lt;span class="n"&gt;finance_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fab Finance Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Focus on NVDA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prioritize custom summaries first, then live stock data, then fresh search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Provide actionable insights in bullet points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cite sources and flag outdated info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;show_tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when Fab asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;“What’s the latest chatter around NVDA this week?”&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent first checks our curated summaries, then layers in stock stats and fresh news.&lt;/p&gt;
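
&lt;p&gt;Asking that question is a one-liner. Assuming Agno’s standard &lt;code&gt;print_response&lt;/code&gt; helper, the answer (tool calls included) streams straight to the terminal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;finance_agent.print_response(
    "What's the latest chatter around NVDA this week?",
    stream=True,  # stream tokens as they arrive
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;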

&lt;p&gt;This is where everything comes together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scrapy + Zyte API → fresh, structured raw data&lt;/li&gt;
&lt;li&gt;Processing &amp;amp; scoring → signal + summaries&lt;/li&gt;
&lt;li&gt;Finance Agent (Agno) → fusing custom + external tools into one workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;What Fab ends up with is not just a scraper or a summarizer but a finance co-pilot that stays current, context-aware, and grounded in real web data.&lt;/p&gt;

&lt;p&gt;With this workflow, what started as a manual, time-consuming task has transformed into a seamless, intelligent system, proving just how powerful AI Agents can be when paired with live web data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 A small challenge for you:&lt;/strong&gt;&lt;br&gt;
If you’re feeling adventurous, try taking this project a step further: convert Fab’s AI Agent into a fully Agentic AI that can make decisions for you (of course, only with your approval, or you might risk your investments 😅). Connect it with your stockbroker’s MCP (many of them provide one nowadays) and scale it into something truly powerful, a next-level finance companion!&lt;/p&gt;

&lt;p&gt;If you get stuck or need guidance, don’t worry. Head over to the Extract Data Community, where 21,000+ data enthusiasts are ready to jump in and help you with your questions.&lt;/p&gt;

&lt;p&gt;Dive in, experiment, and show us your next move! 🙂&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Building a Discord Controlled Web Scraper with Scrapy &amp; Zyte API</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Thu, 14 Aug 2025 09:52:23 +0000</pubDate>
      <link>https://dev.to/extractdata/building-a-discord-controlled-web-scraper-with-scrapy-zyte-api-4mad</link>
      <guid>https://dev.to/extractdata/building-a-discord-controlled-web-scraper-with-scrapy-zyte-api-4mad</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It all started with a simple question in our Discord server, the &lt;a href="https://discord.gg/eN83rMWqAt" rel="noopener noreferrer"&gt;Extract Data Community&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Hey, I’m trying to scrape this gaming leaderboard, but I keep getting blocked. Any idea how to get around it?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A familiar problem for anyone in web scraping: modern websites block regular scraping with JavaScript rendering, rate limits, and IP restrictions. What began as a quick fix with &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_sfnd_blog&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; soon grew into a bigger idea.&lt;/p&gt;

&lt;p&gt;After sharing a &lt;a href="https://discord.com/channels/993441606642446397/1369996292100591757" rel="noopener noreferrer"&gt;working demo&lt;/a&gt;, I asked myself:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;h6&gt;
  
  
  What if this could be more than just a script?
&lt;/h6&gt;
&lt;h6&gt;
  
  
  What if it could scrape reliably, filter intelligently, and notify automatically - all while plugging into Discord?
&lt;/h6&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;And that’s how this project came to life!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In this article, I’ll share what I learned while building the system, from scraping and filtering data to sending real-time updates in Discord, and even triggering scrapes directly via a Discord bot. No heavy code walkthroughs, just insights you can apply to your own projects.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🧭 Overview: What We Built
&lt;/h2&gt;

&lt;p&gt;At its core, the project does five simple but powerful things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scrapes&lt;/strong&gt; leaderboard data from a gaming site using Scrapy
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bypasses&lt;/strong&gt; anti-bot protections using &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_sfnd_blog&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte API’s&lt;/a&gt; browser automation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filters&lt;/strong&gt; players based on customizable level thresholds
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notifies&lt;/strong&gt; your Discord channel about new high-level players
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs continuously&lt;/strong&gt; on autopilot with scheduled checks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🎯 Scraping Goal:&lt;/strong&gt; Build a scraper that scans the game’s leaderboard using custom input filters within a defined page range, then instantly alerts our Discord community when matching high-level players are found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Components&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕷️ &lt;strong&gt;&lt;em&gt;Scrapy&lt;/em&gt;&lt;/strong&gt; handles the scraping.&lt;/li&gt;
&lt;li&gt;🛡️ &lt;strong&gt;&lt;em&gt;Zyte API&lt;/em&gt;&lt;/strong&gt; bypasses tough protections.&lt;/li&gt;
&lt;li&gt;⏱️ &lt;strong&gt;&lt;em&gt;Monitoring&lt;/em&gt;&lt;/strong&gt;: Automated scheduling system&lt;/li&gt;
&lt;li&gt;🤖 &lt;strong&gt;&lt;em&gt;A Discord bot&lt;/em&gt;&lt;/strong&gt; control center for commands/results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;No manual refreshing. No getting blocked. Just clean, filtered data delivered where your community hangs out.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv6bk9rp38x6rngo3f7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv6bk9rp38x6rngo3f7b.png" alt=" " width="800" height="965"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Project Structure&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape_filter_notify/
├── main.py                    # Main CLI entry point
├── discord_bot.py             # Discord bot with all commands
├── continuous_monitor.py      # Automated monitoring scheduler
├── requirements.txt           # Python dependencies
├── .env                       # Environment variables (create this)
├── .gitignore                 # Git ignore rules
│
└── scrape_filter_notify/     # Scrapy project
    ├── scrapy.cfg            # Scrapy configuration ( Default )
    └── scrape_filter_notify/
        ├── settings.py       # Scrapy settings ( Modified )
        ├── items.py          # Scrapy data models ( Modified )
        ├── pipelines.py      # Data processing ( Modified )
        ├── discord_notifier.py  # Discord integration ( New )
        └── spiders/
            └── leaderboard_spider.py  # Main web scraper ( Modified )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ⚙️ Getting Started: Setting Up the Spider ( with Scrapy + Zyte API )
&lt;/h2&gt;

&lt;p&gt;It began with setting up the scraper engine as a Scrapy spider. The gaming site in focus wasn’t friendly: it threw JavaScript, rate limits, and the occasional CAPTCHA at us.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scrapy alone couldn’t get through, so we brought in Zyte API to handle rendering, retries, and anti-bot defenses. That way, the spider could focus on what matters: pulling clean data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧭 New to Scrapy?&lt;/strong&gt;  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you’re just getting started, this &lt;a href="https://docs.zyte.com/web-scraping/tutorials/main/setup.html" rel="noopener noreferrer"&gt;tutorial&lt;/a&gt; will walk you through setting up your first Scrapy project from scratch.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Scraping Process
&lt;/h4&gt;

&lt;p&gt;Here’s the architecture for a smart and robust &lt;code&gt;leaderboard_spider.py&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hhnndegrjpvr0zq36sw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hhnndegrjpvr0zq36sw.png" alt=" " width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See, the &lt;code&gt;Scrapy setup&lt;/code&gt; crawls through paginated leaderboard pages and extracts player info, with Zyte’s smart backend helping it navigate the website’s tricky parts under the hood.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To keep things clean and easy to maintain, I split the logic into three main files - each doing exactly one job:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;leaderboard_spider.py&lt;/strong&gt; - does the crawling and parsing
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;items.py&lt;/strong&gt; - defines the structure for raw data
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pipelines.py&lt;/strong&gt; - filters, saves, and notifies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One important step before starting the spider is configuring Scrapy to use Zyte API as the backend for all requests. This goes into our &lt;code&gt;Scrapy&lt;/code&gt; &lt;code&gt;settings.py&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load Zyte API key securely from environment (recommended)
&lt;/span&gt;&lt;span class="n"&gt;ZYTE_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Enable transparent mode for better debugging and easier dev experience
&lt;/span&gt;&lt;span class="n"&gt;ZYTE_API_TRANSPARENT_MODE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="c1"&gt;# Use Zyte’s download handler and middleware
&lt;/span&gt;&lt;span class="n"&gt;DOWNLOAD_HANDLERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_zyte_api.ScrapyZyteAPIDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_zyte_api.ScrapyZyteAPIDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;DOWNLOADER_MIDDLEWARES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;633&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;📝 Tip:&lt;/strong&gt; keep sensitive info like API keys in environment variables, never hardcode credentials directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Defining Raw Data
&lt;/h4&gt;

&lt;p&gt;In Scrapy, &lt;code&gt;items&lt;/code&gt; basically define how we want to shape the raw data we’re scraping. They’re like organized containers that hold everything the spider grabs. Later on, pipelines handle cleaning and validation.&lt;/p&gt;

&lt;p&gt;For instance, here’s the simple &lt;code&gt;RawPlayerItem&lt;/code&gt; class I used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# items.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="c1"&gt;# 🧱 Defines the structure of raw player data scraped from the leaderboard
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RawPlayerItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;player_name_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;kingdom_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;level_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;game_exp_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Tip:&lt;/strong&gt; If your setup includes &lt;strong&gt;pagination&lt;/strong&gt;, it’s helpful to capture the page number as part of your data. &lt;em&gt;For me, this was useful for estimating how long the scraping would take &amp;amp; debugging issues...&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I set the spider up around &lt;strong&gt;three main functions&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Initialize settings&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send requests&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parse the data&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Inside the &lt;code&gt;__init__&lt;/code&gt; method, I just set up some basic configurations (sketched after this list), like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimum player level to consider
&lt;/li&gt;
&lt;li&gt;Number of pages to scrape
&lt;/li&gt;
&lt;li&gt;Output location
&lt;/li&gt;
&lt;li&gt;Whether or not to send a Discord notification
&lt;/li&gt;
&lt;/ul&gt;
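
&lt;p&gt;Here’s a hedged sketch of that &lt;code&gt;__init__&lt;/code&gt; (parameter names are illustrative; note that anything passed via &lt;code&gt;scrapy crawl -a&lt;/code&gt; arrives as a string):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import scrapy

class LeaderboardSpider(scrapy.Spider):
    name = "leaderboard"

    def __init__(self, min_level=75, max_pages=2,
                 output_file="players.json", notify=True, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # CLI arguments (e.g. -a min_level=80) come in as strings, so coerce them
        self.min_level = int(min_level)
        self.max_pages = int(max_pages)
        self.output_file = output_file
        self.notify = str(notify).lower() in ("true", "1", "yes")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;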

&lt;p&gt;When the spider starts &lt;code&gt;sending requests&lt;/code&gt;, it’s not just grabbing plain HTML. Because the site uses a lot of JavaScript, we rely on &lt;strong&gt;Zyte API’s browser automation&lt;/strong&gt; to fully load content before scraping.&lt;/p&gt;

&lt;p&gt;A couple of things to keep in mind while sending requests:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add a little wait time&lt;/strong&gt; using &lt;code&gt;actions&lt;/code&gt; with a timeout, because sometimes the page content takes a few seconds to fully load.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set the &lt;code&gt;geolocation&lt;/code&gt; to &lt;code&gt;US&lt;/code&gt;&lt;/strong&gt; – this was a key discovery. The site sometimes shows incomplete or blocked content depending on the request’s region. Setting it to the US gave consistent, clean data every time.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example request setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;zyte_api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;browserHtml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Get full browser-rendered HTML
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;javascript&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Enable JS execution
&lt;/span&gt;        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;actions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;                  &lt;span class="c1"&gt;# Wait time before scraping
&lt;/span&gt;            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;waitForTimeout&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;geolocation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;US&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;           &lt;span class="c1"&gt;# Set location to US for consistent data
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
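
&lt;p&gt;That &lt;code&gt;meta&lt;/code&gt; dict then rides along on every paginated request. A sketch of the request loop (the URL is a placeholder, and &lt;code&gt;build_zyte_meta()&lt;/code&gt; is a hypothetical helper returning the dict above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Inside the spider class, continuing the sketch above
def start_requests(self):
    for page in range(1, self.max_pages + 1):
        yield scrapy.Request(
            url=f"https://example-game.com/leaderboard?page={page}",  # placeholder
            callback=self.parse,
            meta=self.build_zyte_meta(page),
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;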



&lt;blockquote&gt;
&lt;p&gt;One of the nice things Scrapy handles for us behind the scenes is retries and error handling.&lt;br&gt;
If you were working with just plain Python + Zyte API, you’d have to write your own retry logic for bans, 520 errors, and other hiccups.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Just add these to your &lt;code&gt;settings.py&lt;/code&gt; to handle retries automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Retry settings
&lt;/span&gt;&lt;span class="n"&gt;RETRY_HTTP_CODES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;504&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;520&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;524&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;RETRY_TIMES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Zyte sends back the fully rendered HTML, the spider’s &lt;code&gt;parse()&lt;/code&gt; method gets to work. It uses &lt;strong&gt;CSS selectors&lt;/strong&gt; to sift through the messy HTML and pick out exactly what we need: player names, kingdoms, levels, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📝Pro Tip:&lt;/strong&gt; I actually use &lt;strong&gt;two CSS selectors as a backup plan&lt;/strong&gt;, because sometimes the page’s HTML is a little different, like text wrapped in &lt;code&gt;&amp;lt;font&amp;gt;&lt;/code&gt; tags on some pages but not others. This helps the spider stay flexible and not break - something you learn while debugging!&lt;/p&gt;

&lt;h4&gt;
  
  
  Pipelines: Cleaning, Filtering &amp;amp; Notifying
&lt;/h4&gt;

&lt;p&gt;Once the spider scrapes raw data, the pipelines take over to clean, validate, save, and notify. I split the pipeline into two main parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;PlayerProcessingPipeline&lt;/code&gt;: This part cleans up the raw data, filters out players below the minimum level, avoids duplicates, and saves the final list (see the sketch after this list).
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DiscordNotificationPipeline&lt;/code&gt;: At the end, this pipeline checks for any new players and shoots a neat summary over to Discord to keep everyone in the loop.&lt;/li&gt;
&lt;/ol&gt;
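
&lt;p&gt;To make the first part concrete, here’s a stripped-down sketch of what &lt;code&gt;PlayerProcessingPipeline&lt;/code&gt; could look like (field names come from &lt;code&gt;RawPlayerItem&lt;/code&gt; above; the rest is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

from scrapy.exceptions import DropItem

class PlayerProcessingPipeline:
    def open_spider(self, spider):
        self.seen = set()
        self.players = []

    def process_item(self, item, spider):
        # Clean the raw fields pulled by the spider
        name = str(item['player_name_raw']).strip()
        level = int(str(item['level_raw']).strip())
        # Filter: drop low-level players and duplicates
        if level &amp;lt; spider.min_level:
            raise DropItem(f"{name} is below the level threshold")
        if name in self.seen:
            raise DropItem(f"duplicate player: {name}")
        self.seen.add(name)
        self.players.append({'name': name, 'level': level})
        return item

    def close_spider(self, spider):
        # Wrap-up moment: persist the final, clean list
        with open(spider.output_file, 'w') as f:
            json.dump(self.players, f, indent=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;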

&lt;p&gt;One cool thing I learned about how pipelines work in Scrapy is that each pipeline &lt;code&gt;class&lt;/code&gt; gets its own “wrap-up” moment when the spider finishes running.&lt;/p&gt;

&lt;p&gt;Scrapy lets every pipeline class define its own finishing method - &lt;code&gt;close_spider()&lt;/code&gt;, and it runs these automatically in the order you set in &lt;code&gt;ITEM_PIPELINES&lt;/code&gt; in &lt;code&gt;settings.py&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Enable pipelines
&lt;/span&gt;&lt;span class="n"&gt;ITEM_PIPELINES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrape_filter_notify.pipelines.PlayerProcessingPipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrape_filter_notify.pipelines.DiscordNotificationPipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s why, in my case, the processing pipeline runs first to clean and save the data, and the Discord pipeline runs right after to send notifications based on that data.&lt;/p&gt;

&lt;p&gt;Remember, this all happens under the hood: once the spider starts crawling, the pipeline quietly takes over in the background. It filters out duplicates, skips players below the level threshold, and stores clean data in a JSON file. &lt;/p&gt;

&lt;p&gt;That wraps up our Scraper Engine. &lt;/p&gt;

&lt;p&gt;But scraping data is only useful if it reaches the people who need it.&lt;br&gt;&lt;br&gt;
Next, I set up the &lt;code&gt;Discord notifier&lt;/code&gt; to deliver the fresh data we just scraped right to the Discord server.&lt;/p&gt;


&lt;h2&gt;
  
  
  Sending Updates with Discord Notifier
&lt;/h2&gt;

&lt;p&gt;Building a Discord bot isn’t hard; libraries like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://discordpy.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Discord.py&lt;/a&gt; - a modern, easy to use, feature-rich, and async ready API wrapper for Discord.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://discord.js.org/" rel="noopener noreferrer"&gt;Discord.js&lt;/a&gt; - a powerful Node.js module that allows you to interact with the Discord API very easily.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…make it pretty straightforward.&lt;/p&gt;

&lt;p&gt;Since our scraper is all in Python, I went with &lt;code&gt;Discord.py&lt;/code&gt;. That way, everything runs in one language with no extra headaches: no child processes, no separate API layer just to talk to the scraper engine. That said, Discord.js has its own perks and can be the better pick if you’re already deep in the Node.js ecosystem. We’ll explore that route another time.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;discord_notifier.py&lt;/code&gt;, the workflow is pretty simple:&lt;/p&gt;

&lt;p&gt;1️⃣ &lt;strong&gt;Load secrets&lt;/strong&gt; (bot token &amp;amp; channel ID) securely via environment variables.&lt;br&gt;&lt;br&gt;
2️⃣ &lt;strong&gt;Log in to Discord&lt;/strong&gt;, find the target channel, and build a polished embed message with the top new players.&lt;br&gt;&lt;br&gt;
3️⃣ &lt;strong&gt;Send the message&lt;/strong&gt;, then log out cleanly.&lt;/p&gt;
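
&lt;p&gt;In &lt;code&gt;discord.py&lt;/code&gt; terms, that whole flow fits in one short coroutine. A sketch (the env var names and embed fields are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import discord

async def send_summary(players):
    client = discord.Client(intents=discord.Intents.default())

    @client.event
    async def on_ready():
        # Find the target channel and post one polished embed
        channel = client.get_channel(int(os.getenv("DISCORD_CHANNEL_ID")))
        embed = discord.Embed(title="🏆 New high-level players",
                              colour=discord.Colour.green())
        for p in players[:10]:
            embed.add_field(name=p['name'], value=f"Level {p['level']}", inline=False)
        await channel.send(embed=embed)
        await client.close()  # log out cleanly once the message is sent

    await client.start(os.getenv("DISCORD_BOT_TOKEN"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;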

&lt;blockquote&gt;
&lt;p&gt;The fun part was dealing with the event loop clash between Scrapy and Discord.py&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s the thing: Scrapy runs asynchronously on top of Twisted, the networking library that provides the asynchronous framework Scrapy uses for its operations. That means Scrapy manages a lot of things (like web requests and processing) concurrently within its own Twisted event loop.&lt;/p&gt;

&lt;p&gt;When the spider finishes scraping, Scrapy begins shutting down. But in my second pipeline class (&lt;code&gt;DiscordNotificationPipeline&lt;/code&gt;), we still need to run the notifier, and at that point we’re still inside Scrapy’s Twisted event loop.&lt;/p&gt;

&lt;p&gt;On the other hand, when we run &lt;code&gt;discord_notifier&lt;/code&gt; using the &lt;code&gt;discord.py&lt;/code&gt; library, it uses asyncio, which runs its own separate event loop. The key problem is that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔥 You cannot start an asyncio loop while another event loop (like Twisted’s) is already running.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Python will raise a &lt;strong&gt;&lt;code&gt;RuntimeError&lt;/code&gt;&lt;/strong&gt;, because you're trying to start one event loop inside another.&lt;/p&gt;

&lt;p&gt;To avoid that, I added a check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If the Scrapy loop is already active&lt;/strong&gt;, the notification runs in a separate thread with its own event loop.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If not&lt;/strong&gt;, it runs normally on the main loop.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Something like this does the trick (&lt;code&gt;send_notification()&lt;/code&gt; here stands in for the notifier coroutine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_running&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Run Discord notifier in a new thread
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Run notifier on the current loop
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That little workaround ensures the scraper finishes, and your Discord server gets a clean summary message every time the job completes, no crashes, no conflicts.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The &lt;code&gt;discord_notifier.py&lt;/code&gt; script we discussed isn’t a full-fledged bot - it just logs in, sends a summary message, and logs out. It’s great for running the scraper on a schedule and pushing updates to Discord automatically. I created a separate Discord bot that gives us full control over the scraping process directly from Discord. This setup keeps the scraper independent and flexible!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Before we move on,
&lt;/h4&gt;

&lt;p&gt;here’s a quick visual that ties everything together, from fetching the rendered HTML to storing filtered data as JSON, sending updates to Discord, and setting up the scheduler in the next step:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb80ya5eiz8yjbwwpx0pl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb80ya5eiz8yjbwwpx0pl.png" alt=" " width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pretty solid, right?&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Now that we’ve got scraping and notifications working, the next question is: what if we want this whole flow to run automatically, without having to trigger it manually every hour or so?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is exactly what &lt;code&gt;continuous_monitor.py&lt;/code&gt; was set up for. It's a smart loop that runs our spider at regular intervals...&lt;/p&gt;


&lt;h2&gt;
  
  
  🔁Autopilot Mode: Let the Spider Run Itself
&lt;/h2&gt;

&lt;p&gt;Here’s what the scheduler I built does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keeps track of run stats:&lt;/strong&gt; started time, last run, next scheduled run, and total runs completed.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles shutdown signals cleanly&lt;/strong&gt;, so we never leave half-finished runs hanging.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Launches the spider as a subprocess&lt;/strong&gt;, waits for it to finish, and then sleeps for the interval you’ve set
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;monitor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ContinuousMonitor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_monitoring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interval_minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Then it runs again… and again… automatically. Something like this captures the core idea:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;monitoring_active&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;run_spider_subprocess&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;report_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# print or send to Discord
&lt;/span&gt;    &lt;span class="nf"&gt;sleep_for_interval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
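&lt;p&gt;The clean-shutdown part is what flips that &lt;code&gt;monitoring_active&lt;/code&gt; flag. Here’s a minimal sketch of the idea (the handler name and details are illustrative, not the exact implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import signal

monitoring_active = True

def handle_shutdown(signum, frame):
    # Flip the flag; the loop finishes the current run, then exits cleanly
    global monitoring_active
    monitoring_active = False

# Catch Ctrl+C and termination so a run is never cut off mid-scrape
signal.signal(signal.SIGINT, handle_shutdown)
signal.signal(signal.SIGTERM, handle_shutdown)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;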


&lt;p&gt;I set it up using asyncio so everything runs smoothly without blocking, even when integrated with Discord notifications. The async loop handles spider runs, reporting, and sleep intervals without the tasks interfering with each other.&lt;/p&gt;
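&lt;p&gt;Roughly, the async version of that loop looks like this - a sketch under my naming assumptions; the spider command and the &lt;code&gt;report_status&lt;/code&gt; helper are placeholders carried over from the pseudocode above, not the exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

monitoring_active = True  # flipped to False by the shutdown handler

async def monitor_loop(interval_minutes=60):
    while monitoring_active:
        # Placeholder command - swap in however you launch your spider
        proc = await asyncio.create_subprocess_exec(
            "scrapy", "crawl", "player_spider"
        )
        await proc.wait()       # spider runs without blocking the event loop
        await report_status()   # placeholder: e.g. push a summary to Discord
        await asyncio.sleep(interval_minutes * 60)  # non-blocking sleep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;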

&lt;p&gt;This script could run in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Standalone:&lt;/strong&gt; Just schedule it with a cron job, or even run it manually. It scrapes, saves JSON data, and optionally sends Discord notifications.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inside a bot:&lt;/strong&gt; Later, we can plug it into a Discord bot to give us full control - start, stop, or check stats directly from Discord.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now everything is wired up - the spider does the scraping, the notifier sends updates, and the monitor keeps things running on a loop.&lt;br&gt;
But let’s be real, running a spider manually every time wasn’t exactly the goal. So we built a Discord bot!&lt;/p&gt;

&lt;p&gt;Here’s a quick look at the full bot lifecycle to visualize how it all works (just make sure the &lt;code&gt;.env&lt;/code&gt; file has the bot token and channel ID set up before running it):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3ulunpgmgf7b9mfzua2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3ulunpgmgf7b9mfzua2.png" alt=" " width="800" height="1089"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This bot isn't just a helper. It’s a full control panel for our scraper, right inside the Discord server. Want to scrape once on-demand? Run &lt;code&gt;/scrape&lt;/code&gt;. Want it to auto-run every 60 minutes? Do &lt;code&gt;/monitor_start interval:60&lt;/code&gt;. Want to stop it? Check status? It’s all there, and the responses look good too (with progress bars, timestamps, and interactive result buttons).&lt;/p&gt;
&lt;h2&gt;
  
  
  🤖 Discord Bot: A Quick Walkthrough
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The bot launches and registers slash commands&lt;/strong&gt; like &lt;code&gt;/scrape&lt;/code&gt;, &lt;code&gt;/monitor_start&lt;/code&gt;, &lt;code&gt;/monitor_status&lt;/code&gt;, etc. (sketched after this list).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We can interact with it via those commands&lt;/strong&gt;; depending on the command, it either:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runs a single scraping job&lt;/strong&gt; using the parameters we give (or defaults),
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starts the monitor&lt;/strong&gt;, which loops and runs jobs periodically,
&lt;/li&gt;
&lt;li&gt;Or just gives helpful info with &lt;code&gt;/help_scrape&lt;/code&gt;, or lets us stop ongoing monitoring with &lt;code&gt;/monitor_stop&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;While the scraping is in progress, &lt;strong&gt;we get live updates&lt;/strong&gt; with visually satisfying progress bars, estimated times, and player counts.
&lt;/li&gt;
&lt;li&gt;Once it's done, it gives back a clean summary with a “View Results” button that opens an embedded, paginated view of the players it found in Discord itself...&lt;/li&gt;
&lt;/ol&gt;
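&lt;p&gt;To make step 1 concrete, here’s a minimal sketch of how a slash command like &lt;code&gt;/scrape&lt;/code&gt; can be registered with discord.py’s app commands. The real bot adds parameters, progress bars, and the results view; &lt;code&gt;run_spider_subprocess&lt;/code&gt; is the placeholder from the monitor sketch, and the token variable name is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import discord
from discord import app_commands

intents = discord.Intents.default()
client = discord.Client(intents=intents)
tree = app_commands.CommandTree(client)

@tree.command(name="scrape", description="Run a single scraping job")
async def scrape(interaction: discord.Interaction):
    # Acknowledge right away; a scrape takes longer than Discord's 3-second window
    await interaction.response.defer()
    await run_spider_subprocess()  # placeholder: run the spider once
    await interaction.followup.send("Scrape finished - results saved!")

@client.event
async def on_ready():
    await tree.sync()  # registers the slash commands with Discord

client.run(os.environ["DISCORD_BOT_TOKEN"])  # token name is illustrative
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;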

&lt;p&gt;So far, I’ve built out all the pieces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A spider that scrapes,
&lt;/li&gt;
&lt;li&gt;A Discord bot that commands it,
&lt;/li&gt;
&lt;li&gt;A monitor that loops it in the background.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I needed one more thing… a way to tie all the components together for easy control.&lt;/p&gt;

&lt;p&gt;That’s why I created &lt;code&gt;main.py&lt;/code&gt;: a single command-line interface that ties the whole project together. Whether we want to run a quick scrape, start the Discord bot, or launch background monitoring, it handles all of it for us.&lt;/p&gt;
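&lt;p&gt;Here’s a rough sketch of how an entry point like that can be wired with argparse. The &lt;code&gt;monitor&lt;/code&gt; and &lt;code&gt;bot&lt;/code&gt; subcommands match the ones shown below; the &lt;code&gt;scrape&lt;/code&gt; name and the dispatch helpers are illustrative placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import argparse

def main():
    parser = argparse.ArgumentParser(description="Scraper control CLI")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("scrape", help="Run a single scraping job")
    sub.add_parser("bot", help="Start the Discord bot")

    monitor_cmd = sub.add_parser("monitor", help="Run the background monitor")
    monitor_cmd.add_argument("--interval", type=int, default=60,
                             help="minutes between runs")

    args = parser.parse_args()

    # Dispatch helpers are placeholders for the real components
    if args.command == "scrape":
        run_single_scrape()
    elif args.command == "bot":
        start_discord_bot()
    elif args.command == "monitor":
        start_monitoring(interval_minutes=args.interval)

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;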

&lt;p&gt;Next up! Let’s see how the results actually look when this thing runs.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;🌀 Output Preview: What Happens When It Runs&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Triggering a Scrape (Terminal Output)&lt;/strong&gt;
Here’s what it looks like when we run a scrape directly from the CLI. It kicks off the spider, runs through the pages, and wraps up by notifying us on our Discord channel.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknni8hcqzy10cxvch4eo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknni8hcqzy10cxvch4eo.png" alt=" " width="800" height="40"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfsw1oweagtlbt6bxtcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfsw1oweagtlbt6bxtcu.png" alt=" " width="800" height="193"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far9k0dx46hqfxw1cay4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far9k0dx46hqfxw1cay4p.png" alt=" " width="800" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Notified on Discord&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6maacrjqkayav964vesh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6maacrjqkayav964vesh.png" alt=" " width="800" height="763"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Running Background Monitoring (Terminal Output)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When we want the spider to keep working in the background, automatically running every X minutes, just trigger:&lt;br&gt;
&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py monitor &lt;span class="nt"&gt;--interval&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll see something like this in our terminal:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qk4b5dmpx335mwe2o22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qk4b5dmpx335mwe2o22.png" alt=" " width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💬 And just like before, it’ll ping us on Discord with updates!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Live Discord Bot in Action&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we start the Discord bot using the command below:&lt;br&gt;
&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py bot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…it boots up and gets right to work behind the scenes, registering all the slash commands we built. From here on we don’t need to touch the terminal - just head to Discord and start interacting with the bot directly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Available Commands on Discord&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxan1zzmonngkdc4u1dm6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxan1zzmonngkdc4u1dm6.png" alt=" " width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;a. &lt;strong&gt;Run a Scrape Instantly&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
   Just type &lt;code&gt;/scrape&lt;/code&gt;, hit enter, and the bot takes care of the rest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v9n2qy1gcq2nt1jf1xo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v9n2qy1gcq2nt1jf1xo.png" alt=" " width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;b. &lt;strong&gt;Controlling the monitoring loop right from Discord:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuu7cskbto6cef996nyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuu7cskbto6cef996nyc.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s it, we’ve seen it all in action.&lt;/p&gt;

&lt;p&gt;From scraping and filtering to live Discord alerts and full automation via CLI and bot commands, every part of this project works together to keep us and the community updated on the latest leaderboard shifts with minimal effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 &lt;strong&gt;Final Thoughts: Scraping That Talks Back&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This project started with a simple goal: to help someone get past anti-bot walls and grab some game data with &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_sfnd_blog&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;. But along the way, it became something more - a full system that scrapes, filters, and talks back to us in real time via Discord.&lt;/p&gt;

&lt;p&gt;The best part? It’s modular. Want to tweak the filter logic? Modify the pipeline. Want to plug it into another Discord server? Just update the &lt;code&gt;.env&lt;/code&gt;. Need to scrape something entirely different? Swap out the spider logic, and keep the rest.&lt;/p&gt;

&lt;p&gt;Just imagine: a single question turned into a full-fledged project… &lt;/p&gt;

&lt;p&gt;That’s exactly the kind of spark our community runs on. If you're into this kind of stuff, scraping tricky sites, building smarter automations, or just geeking out over ideas, come hang out in the &lt;a href="https://discord.gg/eN83rMWqAt" rel="noopener noreferrer"&gt;&lt;strong&gt;Extract Data Discord&lt;/strong&gt;&lt;/a&gt;. We’re &lt;strong&gt;20,000+&lt;/strong&gt; strong and growing, with data lovers, scraping pros, and creative hackers sharing projects, questions, and solutions every single day.&lt;/p&gt;

&lt;p&gt;And as for this project - I hope walking through it gave you a solid blueprint for going beyond just writing a spider and instead building scraping workflows that feel more interactive, automated, and fun.&lt;/p&gt;

&lt;p&gt;Play around, and let us know what you build next!&lt;/p&gt;

&lt;p&gt;Thanks for reading, 🙂 &lt;br&gt;
Catch you in the Discord!&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>data</category>
      <category>discord</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
